
Oracle to Parquet (Pyarrow)

Park Sehun
4 min read · Apr 29, 2023

As data lakes and data warehouses become part of everyday data analysis, the techniques and tools have become easier to use without a huge learning curve. Tools like Azure Synapse and Databricks make it simple to analyse data and to generate and visualise results for readers and customers.

Today, I am talking about Parquet, a very important open-source data file format. Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes, with enhanced performance for handling complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads, and it is similar to other columnar storage file formats in the Hadoop ecosystem, namely RCFile and ORC.
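Before getting into the reasons to use it, here is a minimal sketch (not code from this article) of how PyArrow writes and reads a Parquet file; the table contents are made-up sample data, and the file name is an assumption.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table (a columnar layout, like Parquet itself).
table = pa.table({
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [10.5, 20.0, 7.25],
})

# Write it out as a Parquet file; Snappy is PyArrow's default compression.
pq.write_table(table, "sample.parquet", compression="snappy")

# Read it back; the schema and column types are preserved in the file metadata.
restored = pq.read_table("sample.parquet")
print(restored.schema)
```

The same two calls, `pq.write_table` and `pq.read_table`, are the core of most PyArrow-based conversion pipelines; everything else is about how the source data gets into an Arrow table.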

There are several reasons why you might want to convert data to the Parquet format:

  1. Efficiency: Parquet is designed to be highly efficient for columnar storage and processing. It uses a compressed, optimised columnar layout, which reduces the I/O needed to access data and improves query performance (see the sketch after this list). This is especially important for big data applications, where processing large volumes of data is time-consuming and expensive.
  2. Compatibility: Parquet is a widely used open-source format supported by many big data processing systems, including Hadoop, Spark, and Presto. Converting data to Parquet can make sharing and processing…
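As a rough illustration of the efficiency point above (a sketch with an assumed file name and column names, not code from the article), PyArrow can read only the columns a query needs, which is where the columnar layout saves I/O:

```python
import pyarrow.parquet as pq

# Read only the columns we actually need; in a columnar file the other
# columns are never fetched from disk, which is the I/O saving described
# in point 1 above. "sales.parquet" and the column names are assumptions.
needed = pq.read_table("sales.parquet", columns=["customer_id", "amount"])
print(needed.num_rows, needed.column_names)

# Row-group and column statistics stored in the file footer also let
# readers skip whole chunks of the file without decoding them.
print(pq.ParquetFile("sales.parquet").metadata)
```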
