Apache Parquet

Park Sehun
5 min read · Apr 30, 2023

What is Parquet?

Parquet is a columnar storage file format designed to store and process large volumes of data efficiently. It is an open-source project of the Apache Software Foundation and is widely used in big data processing frameworks such as Apache Hadoop, Apache Spark, and Apache Hive.

In a Parquet file, data is organized into columns rather than rows, which allows for more efficient compression and encoding. As a result, Parquet files can be read and processed more quickly than traditional row-based formats like CSV or JSON, especially for analytical workloads that read only a subset of columns from a large dataset. The trade-off is that Parquet is typically slower to write than a row-based format such as Avro.
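
Here is a minimal sketch of that column-pruning idea in Python, assuming pandas with the pyarrow engine is installed (the file name and column names are made up for illustration):

import pandas as pd

# Write a small DataFrame to a Parquet file (pandas uses pyarrow by default).
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["KR", "US", "DE"],
    "revenue": [10.5, 3.2, 7.8],
})
df.to_parquet("events.parquet")

# Because the data is laid out column by column, the reader can decode only
# the columns it is asked for and skip the rest of the file.
subset = pd.read_parquet("events.parquet", columns=["country", "revenue"])
print(subset)

On a three-row toy file the difference is invisible, but on wide tables with millions of rows, skipping the unneeded columns is where the analytical speedup comes from.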

Parquet files also support nested data structures like arrays and maps, which makes them well-suited for storing and processing complex data types. Additionally, Parquet files can be compressed using a variety of codecs like Snappy, Gzip, and LZO, which allows for further reduction in storage and processing costs.
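
A sketch of both ideas together, using the pyarrow library directly (assuming pyarrow is installed; the schema and file name are invented for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# A table with a nested list column and a map column.
table = pa.table({
    "order_id": [101, 102],
    "items": [["apple", "pear"], ["milk"]],  # list<string>
    "attrs": pa.array(
        [[("vip", "yes")], [("vip", "no")]],
        type=pa.map_(pa.string(), pa.string()),  # map<string, string>
    ),
})

# The codec is chosen at write time; Snappy favors speed, Gzip favors size.
pq.write_table(table, "orders.parquet", compression="gzip")

print(pq.read_table("orders.parquet"))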

Compression in Parquet

Compressing Parquet files can potentially have an impact on performance, but the specific impact will depend on several factors, including the compression…
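
One rough way to see the storage side of that trade-off yourself (an illustrative sketch, not a benchmark from this article) is to write the same table with each codec and compare file sizes and write times:

import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

# Repetitive data compresses well, so the codec differences are visible.
table = pa.table({"value": ["abc", "def", "ghi", "abc"] * 250_000})

# "none" disables compression; Snappy and Gzip are common choices.
for codec in ["none", "snappy", "gzip"]:
    path = f"data_{codec}.parquet"
    start = time.perf_counter()
    pq.write_table(table, path, compression=codec)
    elapsed = time.perf_counter() - start
    size_kb = os.path.getsize(path) / 1024
    print(f"{codec:>6}: {size_kb:8.1f} KB written in {elapsed:.3f}s")

As a general rule, Snappy costs little CPU for a modest size reduction, while Gzip compresses harder but writes and decompresses more slowly; which is the better choice depends on whether storage or CPU is the bottleneck.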
