What is the difference between a broadcast join and a map-side join in Spark

Apache Spark provides various types of joins to combine data from two or more data sources. Two commonly used joins are broadcast join and map-side join. Both are used to improve the performance of join operations in Spark. However, there are some differences between these two types of joins.

In this blog, we will discuss the difference between a broadcast join and a map-side join in Spark.

Broadcast Join

A broadcast join is used when one table is significantly smaller than the other table. In a broadcast join, the smaller table is broadcasted to all the worker nodes, which then perform the join operation with the larger table. This ensures that each worker node has all the data it needs to perform the join operation, and reduces the amount of data that needs to be shuffled across the network.

Broadcast join is ideal when the size of the smaller table is less than the memory available on each worker node. This is because, if the size of the smaller table is larger than the available memory, it can cause out-of-memory errors on the worker nodes.

In Spark, we can perform a broadcast join by using the broadcast() function to broadcast a DataFrame or a Dataset. The broadcast() function converts the DataFrame or Dataset into a Broadcast variable, which is then sent to all the worker nodes. Here is an example of how to perform a broadcast join in Spark SQL:

import org.apache.spark.sql.functions.broadcast 
        
val df1 = spark.read.format("csv").load("table1.csv") 
val df2 = spark.read.format("csv").load("table2.csv") 

val smallTable = df1.filter("column1 = 'value'") 
val broadcastTable = broadcast(smallTable) 

val joinedDF = df2.join(broadcastTable, "join_column") 

In this example, df1 and df2 are two DataFrames that we want to join. We first create a new DataFrame smallTable by filtering df1 to only include the rows where column1 equals a certain value. We then broadcast this smaller DataFrame using the broadcast() function, and join it with df2 using the join() function.

Map-Side Join

A map-side join is used when both tables being joined are too large to be broadcasted to all worker nodes. In a map-side join, Spark divides the data in both tables into multiple partitions and then sends each partition to a different worker node. Each worker node then performs the join operation on its partition of the data. Finally, Spark merges the results from all worker nodes to create the final result.

Map-side join can be more efficient than shuffle join in some cases as it avoids the data shuffling across the network. However, map-side join is only possible when the join condition satisfies certain requirements. Specifically, the join condition must be an equijoin, which means that it can only involve equality operators.

In Spark, we can perform a map-side join by using the repartition() function to partition the DataFrames or Datasets that we want to join. Here is an example of how to perform a map-side join in Spark SQL:

val df1 = spark.read.format("csv").load("table1.csv") 
val df2 = spark.read.format("csv").load("table2.csv") 

val repartitionedDF1 = df1.repartition(100, "join_column") 
val repartitionedDF2 = df2.repartition(100, "join_column") 

val joinedDF = repartitionedDF1.join(repartitionedDF2, "join_column") 

In this example, df1 and df2 are two DataFrames that we want to join. We first use the repartition() function to partition both Dataframe

Difference between broadcast join and map-side join

The main difference between broadcast join and map-side join is how they handle data shuffling across the network. In a broadcast join, the smaller table is broadcast to all the worker nodes, and the join is performed locally on each node. This means that there is no data shuffling across the network.

In a map-side join, both tables are partitioned based on the join key, and the join operation is performed locally on each node. This means that there is some data shuffling across the network, but it is much less than other types of join operations.

Broadcast join is suitable when one of the tables involved in the join is small enough to fit in memory. Map-side join is suitable when both tables involved in the join are large.

When to use broadcast join and map-side join

Broadcast join is useful when one of the tables involved in the join is small enough to fit in memory. It can be used in scenarios where one table is a dimension table, and the other table is a fact table. The dimension table is small enough to fit in memory, while the fact table is large.

Map-side join is useful when both tables involved in the join are large. It is suitable for scenarios where the tables have been pre-partitioned based on the join key.