Parquet Metadata Deep Dive

Park Sehun
5 min read · Jul 29, 2023

I’ve covered Parquet in several earlier posts, but this one goes into the details of the format, especially the metadata. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.

File Format

The file metadata records where each column's metadata starts. The layout below shows a file with N columns split into M row groups, with the file metadata written after the data.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

A Parquet file has a header, a footer and the FileMetaData. The header is the 4-byte magic number PAR1. The footer is the 4-byte length of the file metadata followed by the same 4-byte magic number. That makes 12 bytes of fixed-size framing. The FileMetaData itself varies in size and includes information about the file format version, the schema, and the encoding used for the data.
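To make the framing concrete, here is a minimal sketch in Python that finds the FileMetaData block from the raw bytes of a file. The function name locate_file_metadata is made up for this post, and the sketch assumes a regular, unencrypted file that starts and ends with PAR1. It only locates the metadata bytes; decoding them (they are Thrift-serialized) is left to a real Parquet library.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def locate_file_metadata(data: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the serialized FileMetaData inside `data`.

    Hypothetical helper for illustration; assumes an unencrypted file.
    """
    # 4 (header magic) + 4 (metadata length) + 4 (footer magic) = 12 bytes minimum
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")

    # The 4 bytes just before the trailing magic hold the metadata length,
    # stored as a little-endian unsigned 32-bit integer.
    (meta_len,) = struct.unpack("<I", data[-8:-4])

    # The metadata sits immediately before those 8 trailing bytes.
    meta_start = len(data) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("metadata length is larger than the file allows")
    return meta_start, meta_len
```

With a real file you would pass in open(path, "rb").read(), or seek to the last 8 bytes first so that you do not load the whole file. This is also why a reader can get the schema and chunk locations without scanning the data: it reads the tail, then jumps straight to the metadata.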

Metadata Kinds

There are several kinds of metadata in a Parquet file that are used to store information about the data and its organization. These metadata include:
