
Parquet Metadata Deep Dive

5 min read · Jul 29, 2023

I’ve written about Parquet in several earlier posts. This time I want to go into the details of the format, especially its metadata. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.

File Format

The file metadata records the start location of every column's metadata, so a reader can locate each column chunk from the footer alone.

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

A Parquet file has a header, a footer, and the FileMetaData. The header is the 4-byte magic number PAR1. The footer is a 4-byte length of the file metadata followed by the same 4-byte magic number. That makes 12 bytes of fixed framing in every file. The FileMetaData itself varies in size and includes information about the file format version, the schema, and the encoding used for the data.
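To make the 12 bytes of framing concrete, here is a minimal sketch that locates the FileMetaData using only the Python standard library. The function name `read_footer_framing` is hypothetical, and the sketch assumes a regular file with the plain "PAR1" magic at both ends and a little-endian length field, as the Parquet format specification describes. It only finds the metadata bytes; it does not decode them.

```python
import io
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def read_footer_framing(f):
    """Return (metadata_offset, metadata_length) for a seekable binary file.

    Hypothetical helper for illustration: it checks the header magic,
    the footer magic, and reads the 4-byte metadata length between them.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < 12:  # 4 (header) + 4 (length) + 4 (footer magic)
        raise ValueError("too small to be a Parquet file")

    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing header magic")

    # The last 8 bytes are: metadata length (little-endian uint32) + magic.
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing footer magic")
    (meta_len,) = struct.unpack("<I", tail[:4])

    # The file metadata sits immediately before the 8-byte footer.
    meta_offset = size - 8 - meta_len
    if meta_offset < 4:
        raise ValueError("metadata length exceeds file size")
    return meta_offset, meta_len


# Usage with a synthetic buffer (not a real Parquet file): 10 bytes of
# stand-in column data followed by 5 bytes of stand-in metadata.
fake = io.BytesIO(MAGIC + b"x" * 10 + b"m" * 5 + struct.pack("<I", 5) + MAGIC)
offset, length = read_footer_framing(fake)
```

On a real file you would open it with `open(path, "rb")` and pass the file object in. The bytes at the returned offset are the Thrift-serialized FileMetaData, which needs a Thrift reader or a Parquet library to decode.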

Metadata Kinds

There are several kinds of metadata in a Parquet file that are used to store information about the data and its organization. These metadata include:


Written by Park Sehun

https://www.linkedin.com/in/park-sehun-1097b140
