Converting Pandas DataFrames to Parquet Format: A Comprehensive Guide

Introduction

Apache Parquet is an open-source columnar file format that originated in the Hadoop ecosystem and is now supported by a wide range of data tools. It is designed for efficient storage and fast retrieval, and it typically compresses better and scans faster than row-based text formats such as CSV or TSV. It also handles complex data types and nested structures well. Pandas, one of the most popular data manipulation libraries in Python, provides an easy-to-use method for converting DataFrames to Parquet.


Why Choose Parquet?

Columnar Storage: Instead of storing data row by row, Parquet stores it column by column. Values within a column tend to be similar, so they compress well and take up less storage.
Schema Evolution: Parquet supports schema evolution, so files written at different times can add new columns or drop existing ones. How differing schemas are reconciled depends on the tool that reads them.
Performance: Queries that need only some columns can skip the rest, which speeds up data retrieval. The format is also optimized for complex nested data structures.
Compatibility: Works well with a variety of data processing tools such as Apache Spark, Apache Hive, Apache Impala, and Apache Arrow.

Installing Required Libraries

Before converting a DataFrame to Parquet, make sure pandas is installed along with either pyarrow or fastparquet. Pandas relies on one of these engines to read and write Parquet files, and by default it tries pyarrow first and falls back to fastparquet:

Example in shell

pip install pandas pyarrow 
# or 
pip install pandas fastparquet
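
To confirm that an engine is actually importable in your environment, a small check like the following may help. The function name `available_parquet_engines` is a hypothetical helper, not part of pandas; it relies only on the standard library.

```python
import importlib.util


def available_parquet_engines():
    """Return the names of installed Parquet engines that pandas can use."""
    candidates = ["pyarrow", "fastparquet"]
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print("Parquet engines found:", ", ".join(engines))
else:
    print("No Parquet engine found; install pyarrow or fastparquet.")
```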

Basic Conversion

Converting a DataFrame to a Parquet file is straightforward. Here is how you can do it:

Example in pandas

import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
    'Age': [28, 34, 29, 42],
    'Address': ['New York', 'Toronto', 'San Francisco', 'Seattle'],
    'Qualification': ['MBA', 'BCA', 'M.Tech', 'MBA'],
}
df = pd.DataFrame(data)

# Convert DataFrame to Parquet 
df.to_parquet('output.parquet')

Reading Parquet Files

You can read a Parquet file back into a DataFrame with pd.read_parquet:

Example in pandas

df = pd.read_parquet('output.parquet') 
print(df)

Specifying Compression

Parquet supports several compression algorithms. Pandas uses snappy by default. With the compression parameter you can choose another codec such as gzip, brotli, lz4, or zstd, or pass None to disable compression:

Example in pandas

df.to_parquet('output.parquet', compression='gzip')

Working with S3 Buckets

You can read and write DataFrames directly from and to an S3 bucket if the required libraries are installed:

Example in pandas

df.to_parquet('s3://mybucket/output.parquet') 
df = pd.read_parquet('s3://mybucket/output.parquet')

Make sure you have the s3fs library installed and your AWS credentials configured (for example through environment variables or the AWS configuration files):

Example in shell

pip install s3fs
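
If you would rather pass credentials explicitly than rely on ambient configuration, pandas accepts a storage_options dictionary that it forwards to the underlying filesystem library (s3fs for s3:// URLs). The sketch below is an illustration under stated assumptions: the helper names, the bucket URL, and the credential values are hypothetical placeholders, and the write call only succeeds against a bucket you can actually access.

```python
import pandas as pd


def s3_storage_options(key, secret, endpoint_url=None):
    """Build a storage_options dict for s3fs (hypothetical helper)."""
    options = {"key": key, "secret": secret}
    if endpoint_url is not None:
        # For S3-compatible services; passed through to the S3 client.
        options["client_kwargs"] = {"endpoint_url": endpoint_url}
    return options


def write_parquet_to_s3(df, url, storage_options):
    """Write df to an s3:// URL (hypothetical helper; needs s3fs and valid credentials)."""
    df.to_parquet(url, storage_options=storage_options)


# Placeholder credentials; replace them with your own or load them from a secrets store.
options = s3_storage_options("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
print(sorted(options))

# Not executed here because it needs a real bucket:
# write_parquet_to_s3(df, "s3://mybucket/output.parquet", options)
```

Avoid hard-coding real credentials in source files; environment variables or an AWS profile are usually safer.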

Partitioning

Parquet datasets can be partitioned by column values. When you pass partition_cols, pandas treats the path as a directory and writes the data into one subdirectory per distinct value of the partition column:

Example in pandas

# The path names a directory here, so it must not clash with an existing file
df.to_parquet('output_dir', partition_cols=['Name'])

Each unique value in the "Name" column gets its own subdirectory (for example Name=Tom) under the output path, holding one or more Parquet files.

Conclusion

Storing your data in Parquet format can lead to significant improvements in both storage space and query performance. With the simple and well-documented pandas interface, converting your data to this efficient format is hassle-free.

The ability to read from and write to various sources like local file systems, distributed file systems, and cloud storage, as well as support for different compression algorithms, makes pandas and Parquet a powerful combination for handling large datasets.