How to Save a PySpark DataFrame to a CSV File: The Ultimate Guide

Published on April 17, 2025


Diving Straight into Saving a PySpark DataFrame to a CSV File

Saving a PySpark DataFrame to a CSV file is a critical task for data engineers using Apache Spark, enabling data export for sharing, storage, or integration with other tools. Whether you're persisting processed data for reporting, archiving results, or preparing datasets for downstream applications, this skill ensures seamless data workflows. This comprehensive guide explores the syntax and steps for saving a DataFrame to a CSV file, with targeted examples covering basic CSV writing, customizing options, handling nested data, and using SQL-based approaches. Each section addresses a specific aspect of CSV export, supported by practical code, error handling, and performance optimization strategies to build robust ETL pipelines. The primary method, write.csv(), is explained with all its parameters. Let’s save those DataFrames! For more on PySpark, see PySpark Fundamentals.


Saving a DataFrame to a CSV File with Default Settings

The primary method for saving a PySpark DataFrame to a CSV file is the write.csv() method, which exports the DataFrame to a directory containing one or more CSV files, depending on the DataFrame’s partitioning. By default, it writes the CSV with headers and uses commas as delimiters, making it ideal for simple ETL export tasks.

Understanding write.csv() Parameters

The write.csv() method is part of the DataFrameWriter interface, accessed via df.write. It accepts the following key parameters:

  • path (str, required): The directory path where the CSV file(s) will be saved (e.g., "output/data.csv"). Spark creates a directory, not a single file, due to distributed writing.
  • mode (str, optional, default="error"): Specifies the behavior if the output path exists:
    • "error": Throws an error if the path exists.
    • "overwrite": Overwrites the existing path.
    • "append": Appends data to the existing path.
    • "ignore": Silently skips writing if the path exists.
  • sep (str, optional, default=","): The delimiter separating fields in the CSV.
  • header (bool, optional, default=False): If True, includes column names as the first row.
  • quote (str, optional, default="\""): The character used to quote fields containing special characters.
  • escape (str, optional, default="\\"): The escape character for special characters.
  • nullValue (str, optional, default=None): The string representation for null values (e.g., "NULL").
  • compression (str, optional, default=None): Compression codec (e.g., "gzip", "bzip2", "none").
  • partitionBy (str or list, optional, default=None): Partitions the output by the specified column(s), creating subdirectories.

Here’s an example saving a DataFrame to a CSV file with default settings:

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("SaveToCSV").getOrCreate()

# Create DataFrame
data = [
    ("E001", "Alice", 25, 75000.0, "HR"),
    ("E002", "Bob", 30, 82000.5, "IT"),
    ("E003", "Cathy", 28, 90000.75, "HR"),
    ("E004", "David", 35, 100000.25, "IT"),
    ("E005", "Eve", 28, 78000.0, "Finance")
]
df = spark.createDataFrame(data, ["employee_id", "name", "age", "salary", "department"])

# Save DataFrame to CSV
df.write.csv("output/employees.csv", header=True)

Output (in output/employees.csv directory, e.g., part-00000-*.csv):

employee_id,name,age,salary,department
E001,Alice,25,75000.0,HR
E002,Bob,30,82000.5,IT
E003,Cathy,28,90000.75,HR
E004,David,35,100000.25,IT
E005,Eve,28,78000.0,Finance

This saves the DataFrame to a CSV file with headers in the output/employees.csv directory. The header=True option includes column names. Validate by reading the CSV back:

read_df = spark.read.csv("output/employees.csv", header=True)
assert read_df.count() == 5, "Incorrect row count in CSV"
assert "Alice" in [row["name"] for row in read_df.select("name").collect()], "Expected data missing"

Error to Watch: Existing path with default mode fails:

try:
    df.write.csv("output/employees.csv")  # Path already exists
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Path output/employees.csv already exists

Fix: Use mode="overwrite":

df.write.mode("overwrite").csv("output/employees.csv", header=True)

Customizing CSV Output with Advanced Options

To tailor the CSV output, use write.csv() options like sep, quote, nullValue, and compression. This is useful for meeting specific formatting requirements or optimizing storage in ETL pipelines, such as using tabs as delimiters or compressing files for efficiency.

# Save DataFrame with custom CSV options
df.write \
  .mode("overwrite") \
  .option("sep", "\t") \
  .option("quote", "'") \
  .option("nullValue", "NULL") \
  .option("compression", "gzip") \
  .option("header", True) \
  .csv("output/employees_custom.csv")

Output (in output/employees_custom.csv, e.g., part-00000-*.csv.gz):

employee_id'name'age'salary'department
E001'Alice'25'75000.0'HR
E002'Bob'30'82000.5'IT
E003'Cathy'28'NULL'HR
E004'David'35'100000.25'IT
E005'Eve'28'78000.0'Finance

This saves the DataFrame with:

  • Tab (\t) as the delimiter (sep="\t").
  • Single quotes (') for quoting (quote="'").
  • "NULL" for null values (nullValue="NULL").
  • Gzip compression (compression="gzip").
  • Headers included (header=True).

Validate:

read_df = spark.read.option("sep", "\t").option("quote", "'").option("header", True).csv("output/employees_custom.csv")
assert read_df.count() == 5, "Incorrect row count in CSV"
assert read_df.filter(col("salary").isNull()).count() == 1, "Incorrect null handling"

Error to Watch: Invalid option fails:

try:
    df.write.option("invalid_option", "value").csv("output/employees.csv")
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Invalid option: invalid_option

Fix: Use valid options:

valid_options = ["sep", "quote", "nullValue", "compression", "header"]
assert all(opt in valid_options for opt in ["sep", "quote"]), "Invalid CSV option"

Saving Nested Data to a CSV File

Nested DataFrames, with structs or arrays, require special handling when saving to CSV, as CSV is a flat format. You can flatten nested fields using select() or explode arrays before saving, ensuring all data is represented in the CSV. This is critical for exporting complex data in ETL pipelines.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("NestedSaveToCSV").getOrCreate()

# Define schema with nested structs
schema = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("department", StringType(), True)
])

# Create DataFrame
data = [
    ("E001", "Alice", (1234567890, "alice@company.com"), "HR"),
    ("E002", "Bob", (None, "bob@company.com"), "IT"),
    ("E003", "Cathy", (5555555555, "cathy@company.com"), "HR"),
    ("E004", "David", (9876543210, "david@company.com"), "IT")
]
df = spark.createDataFrame(data, schema)

# Flatten nested contact struct and save to CSV
flattened_df = df.select(
    col("employee_id"),
    col("name"),
    col("contact.phone").alias("phone"),
    col("contact.email").alias("email"),
    col("department")
)
flattened_df.write.mode("overwrite").csv("output/employees_nested.csv", header=True)

Output (in output/employees_nested.csv, e.g., part-00000-*.csv):

employee_id,name,phone,email,department
E001,Alice,1234567890,alice@company.com,HR
E002,Bob,,bob@company.com,IT
E003,Cathy,5555555555,cathy@company.com,HR
E004,David,9876543210,david@company.com,IT

This flattens the contact struct into phone and email columns before saving. Validate:

read_df = spark.read.csv("output/employees_nested.csv", header=True)
assert read_df.count() == 4, "Incorrect row count in CSV"
assert "alice@company.com" in [row["email"] for row in read_df.select("email").collect()], "Expected email missing"

Error to Watch: Saving nested data without flattening fails to represent structure:

try:
    df.write.csv("output/employees_nested_invalid.csv", header=True)
except Exception as e:
    print(f"Error: {e}")

Output (no error, but nested struct is written as string):

employee_id,name,contact,department
E001,Alice,"{1234567890,alice@company.com}",HR
...

Fix: Flatten nested fields:

assert "contact" not in flattened_df.columns, "Nested column not flattened"

Saving to CSV Using SQL Queries

For SQL-based ETL workflows or teams familiar with database querying, SQL queries via temporary views offer an alternative way to save a DataFrame to CSV. The SELECT statement prepares the data, and write.csv() saves the result.

# Create temporary view
df.createOrReplaceTempView("employees")

# Save to CSV using SQL
sql_df = spark.sql("SELECT employee_id, name, age, salary, department FROM employees")
sql_df.write.mode("overwrite").csv("output/employees_sql.csv", header=True)

Output (in output/employees_sql.csv, e.g., part-00000-*.csv):

employee_id,name,age,salary,department
E001,Alice,25,75000.0,HR
E002,Bob,30,82000.5,IT
E003,Cathy,28,90000.75,HR
E004,David,35,100000.25,IT
E005,Eve,28,78000.0,Finance

Validate:

read_df = spark.read.csv("output/employees_sql.csv", header=True)
assert read_df.count() == 5, "Incorrect row count in CSV"
assert "Cathy" in [row["name"] for row in read_df.select("name").collect()], "Expected name missing"

Error to Watch: Unregistered view fails:

try:
    sql_df = spark.sql("SELECT * FROM nonexistent")
    sql_df.write.csv("output/invalid.csv")
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Table or view not found: nonexistent

Fix: Verify view:

assert "employees" in [v.name for v in spark.catalog.listTables()], "View missing"
df.createOrReplaceTempView("employees")

Optimizing Performance for Saving to CSV

Saving a DataFrame to CSV involves distributed writing across executors, which can be I/O-intensive for large datasets. Optimize performance to ensure efficient export:

  1. Select Relevant Columns: Reduce data written:
df = df.select("employee_id", "name", "salary")
  1. Filter Rows: Limit rows before writing:
df = df.filter(col("salary") > 0)
  1. Repartition Data: Control output file count:
df = df.repartition(1)  # Single output file (use cautiously)
  1. Use Compression: Reduce storage and I/O:
df.write.option("compression", "gzip").csv("output/employees_compressed.csv")

Example optimized save:

optimized_df = df.select("employee_id", "name", "salary").filter(col("salary") > 0)
optimized_df.repartition(1).write.mode("overwrite").option("header", True).csv("output/employees_optimized.csv")

Monitor I/O and executor performance via the Spark UI to identify bottlenecks.

Error to Watch: Excessive partitions create many small files:

# Example with many partitions
df.write.csv("output/employees_many_files.csv")  # Creates multiple part-*.csv files

Fix: Adjust partitions:

assert df.rdd.getNumPartitions() <= 10, "Too many partitions, consider repartition"

Wrapping Up Your CSV Saving Mastery

Saving a PySpark DataFrame to a CSV file is a vital skill for exporting data in ETL pipelines. Whether you’re using write.csv() with default or customized options, flattening nested data for CSV compatibility, or leveraging SQL queries for intuitive exports, Spark provides powerful tools to address diverse data export needs. By mastering these techniques, optimizing performance, and anticipating errors, you can efficiently share and store data, enabling seamless integration with other systems. These methods will enhance your data engineering workflows, empowering you to manage data exports with confidence.

Try these approaches in your next Spark job, and share your experiences, tips, or questions in the comments or on X. Keep exploring with DataFrame Operations to deepen your PySpark expertise!


More Spark Resources to Keep You Going