Python: PostgreSQL to Parquet
Jun 24, 2023
Continuing the series on converting data to Parquet, today I will cover PostgreSQL to Parquet. There are many Python libraries for this conversion; after the list below, a short sketch shows how each one selects a compression codec.
- pyarrow: provides a Python API for the functionality of the Arrow C++ libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem. PyArrow uses Snappy compression by default; Brotli, Gzip, ZSTD, LZ4, and uncompressed are also supported.
- fastparquet: a Python implementation of the Parquet format that aims to integrate into Python-based big data workflows. It is used implicitly by Dask, pandas, and intake-parquet. Compression codecs available by default: gzip, snappy, brotli, lz4, and zstandard; lzo is optionally supported.
- pandas: a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Its to_parquet method accepts ‘snappy’ (the default), ‘gzip’, ‘brotli’, or None for compression.
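Here is a minimal sketch of how the compression codec is passed in each library. The DataFrame contents and file names are placeholders of my own, not from the original post:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
# A tiny placeholder DataFrame so the sketch is runnable
df = pd.DataFrame({'id': [1, 2], 'name': ['alice', 'bob']})
# pyarrow: convert to an Arrow table, then write; Snappy is the default codec
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example_pyarrow.parquet', compression='zstd')
# fastparquet: write the DataFrame directly, choosing gzip here
fp.write('example_fastparquet.parquet', df, compression='GZIP')
# pandas: to_parquet defaults to snappy; brotli is passed explicitly here
df.to_parquet('example_pandas.parquet', compression='brotli')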
Source code
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
import pandas as pd
from sqlalchemy import create_engine
# Define PostgreSQL connection parameters
db_username = ''
db_password = ''
db_hostname = ''
db_port = ''
db_name = ''
# Connect to PostgreSQL database
engine = create_engine(
    f'postgresql://{db_username}:{db_password}@{db_hostname}:{db_port}/{db_name}'
)
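The original snippet is cut off at this point, so the continuation below is my own sketch, not the author's code: it reads a source table (the name employees is a hypothetical stand-in) into a DataFrame and writes it out with each of the three libraries.
# Read the source table into a pandas DataFrame (table name is hypothetical)
df = pd.read_sql('SELECT * FROM employees', engine)
# Option 1: pyarrow -- convert to an Arrow table, then write (Snappy by default)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'employees_pyarrow.parquet')
# Option 2: fastparquet -- write the DataFrame directly
fp.write('employees_fastparquet.parquet', df)
# Option 3: pandas -- to_parquet delegates to pyarrow or fastparquet
df.to_parquet('employees_pandas.parquet', engine='pyarrow')
All three outputs contain the same data; the choice mostly comes down to which dependencies are already in the project and which compression codecs are needed.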