Merging DataFrames in PySpark: A Comprehensive Guide to Union Operations

Introduction

When working with data, it is often necessary to combine multiple datasets to create a single, unified dataset. PySpark offers several methods to merge DataFrames, with the Union operation being a popular choice. In this blog post, we will discuss the Union operation in PySpark, how it works, and provide examples to help you understand when and how to use it to merge DataFrames in your PySpark applications.

Table of Contents

  1. The Basics of Union Operation

  2. Preparing DataFrames for Union

  3. Performing Union Operations
     3.1 Using Union
     3.2 Using UnionByName

  4. Examples
     4.1 Merging DataFrames with Union
     4.2 Merging DataFrames with UnionByName

  5. Handling Duplicate Rows

  6. Conclusion

The Basics of Union Operation

The Union operation in PySpark merges two DataFrames with the same schema. It appends the rows of the second DataFrame after the rows of the first, concatenating the DataFrames vertically, and returns a new DataFrame containing all the rows from both inputs. Note that, unlike SQL's UNION, PySpark's union does not remove duplicates; it behaves like UNION ALL. It also matches columns by position rather than by name.

Preparing DataFrames for Union

Before performing a Union operation, ensure that the DataFrames you want to merge have the same schema, i.e., the same number and order of columns with matching data types. If the schemas do not match, align them first using select, withColumn, cast, or other DataFrame transformations.

Performing Union Operations

PySpark offers two methods for performing Union operations: Union and UnionByName.

Using Union:

The Union method merges two DataFrames based on their column positions. It requires that the input DataFrames have the same number of columns with compatible data types; column names are ignored, so columns that happen to be in a different order are combined silently and incorrectly.

Using UnionByName:

The UnionByName method merges two DataFrames based on their column names. It requires that the input DataFrames have matching column names and data types but does not require that the columns be in the same order.

Examples

Merging DataFrames with Union:

Suppose we have two DataFrames containing sales data, and we want to merge them into a single DataFrame. We can use the Union method to do this:

from pyspark.sql import SparkSession 
        
spark = SparkSession.builder.appName("Union Example").getOrCreate() 

# Create two DataFrames with sales data 
sales_data1 = [("apple", 3), ("banana", 5), ("orange", 2)] 
sales_data2 = [("apple", 4), ("banana", 3), ("orange", 6)] 

df1 = spark.createDataFrame(sales_data1, ["product", "quantity"]) 
df2 = spark.createDataFrame(sales_data2, ["product", "quantity"]) 

# Merge DataFrames using union 
merged_df = df1.union(df2) 

# Show the result 
merged_df.show()

Merging DataFrames with UnionByName:

If the input DataFrames have matching column names and data types but different column orders, we can use the UnionByName method to merge them; a positional union here would silently pair mismatched columns:

# Create two DataFrames with sales data and different column orders 
sales_data1 = [("apple", 3), ("banana", 5), ("orange", 2)] 
sales_data2 = [(3, "apple"), (5, "banana"), (6, "orange")] 

df1 = spark.createDataFrame(sales_data1, ["product", "quantity"]) 
df2 = spark.createDataFrame(sales_data2, ["quantity", "product"]) 

# Merge DataFrames using unionByName 
merged_df = df1.unionByName(df2) 

# Show the result 
merged_df.show() 


Handling Duplicate Rows

When merging DataFrames using Union or UnionByName, duplicate rows are preserved in the resulting DataFrame. If you need to remove duplicates, use the distinct method, which drops rows that are identical across all columns, or the dropDuplicates method:

# Remove duplicate rows using distinct 
distinct_df = merged_df.distinct() 

# Show the result 
distinct_df.show() 

Alternatively, you can use dropDuplicates with a subset of columns to remove duplicate rows based on specific columns:

# Remove duplicate rows based on the 'product' column 
deduplicated_df = merged_df.dropDuplicates(["product"]) 

# Show the result 
deduplicated_df.show() 


Conclusion

In this blog post, we have explored the Union operation in PySpark and discussed its use cases and functionality. By understanding the differences between Union and UnionByName and how to prepare your DataFrames for merging, you can effectively combine multiple DataFrames in your PySpark applications. Additionally, knowing how to handle duplicate rows in the merged DataFrame can help you maintain data integrity and ensure accurate results in your data processing tasks.