Parquet Metadata Deep Dive
I've written about Parquet in several earlier posts; this time I want to go deeper into the details, especially the metadata. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.
File Format
The file metadata records the start location of every column chunk's metadata. A file with N columns split into M row groups is laid out as follows:
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

A Parquet file has a header, a footer, and a FileMetaData structure. The header is the 4-byte magic number PAR1. The footer ends with a 4-byte length of the file metadata followed by the same 4-byte magic number. That makes 12 bytes of fixed-size framing. The FileMetaData itself varies in size and holds information such as the file format version, the schema, and the encoding used for the data.
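To make the framing concrete, here is a minimal sketch in plain Python that checks both magic numbers and works out where the file metadata sits. The function name read_footer_info is hypothetical, and the demo runs on a synthetic byte string rather than a real Parquet file. It assumes the length field is a little-endian 32-bit integer, as the Parquet format specification describes, and it does not decode the Thrift-serialized FileMetaData itself.

```python
import io
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def read_footer_info(f):
    """Return (metadata_offset, metadata_length) for a seekable binary stream.

    Hypothetical helper for illustration; it only inspects the fixed-size
    framing and leaves the FileMetaData bytes undecoded.
    """
    # Header: the first 4 bytes must be the magic number.
    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing header magic number")

    # Footer: the last 8 bytes are <4-byte metadata length><4-byte magic>.
    file_size = f.seek(0, io.SEEK_END)
    if file_size < 12:
        raise ValueError("too short to be a Parquet file")
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing footer magic number")

    # Assumption: the length is stored as a little-endian 32-bit integer.
    (meta_len,) = struct.unpack("<I", tail[:4])

    # The metadata sits immediately before the 8-byte footer tail.
    meta_offset = file_size - 8 - meta_len
    if meta_offset < 4:
        raise ValueError("metadata length is larger than the file")
    return meta_offset, meta_len


# Synthetic stand-in: header, 10 bytes of "data", 5 bytes of "metadata", footer.
fake = io.BytesIO(MAGIC + b"d" * 10 + b"m" * 5 + struct.pack("<I", 5) + MAGIC)
offset, length = read_footer_info(fake)
print(offset, length)
```

With a real file you would pass open(path, "rb") instead of the BytesIO object, then read length bytes starting at offset to get the raw FileMetaData. In practice a library such as PyArrow does this parsing for you.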
Metadata Kinds
There are several kinds of metadata in a Parquet file that are used to store information about the data and its organization. These metadata include: