How to Convert Array Column into Multiple Rows in Spark

A Spark DataFrame may contain an array column that holds multiple values per row. Sometimes you need to split that array column into multiple rows so that each row contains a single element of the array. This is useful for operations such as aggregations or joins with other DataFrames.

The explode() Function

One way to accomplish this is with the explode() function. explode() takes an array column and produces one output row for each element of the array, repeating the values of the other columns for every generated row.

Here's an example in Scala:

import org.apache.spark.sql.functions._
// Assumes a SparkSession named `spark` is already in scope (as in spark-shell);
// its implicits are required for calling toDF on a local Seq
import spark.implicits._

// Create a sample DataFrame with an array column
val data = Seq(
  ("Alice", Array("Maths", "Science")),
  ("Bob", Array("English", "Social Science"))
)

val df = data.toDF("Name", "Subjects")

// Use explode() to convert the array column into multiple rows
val explodedDF = df.select(col("Name"), explode(col("Subjects")).as("Subject"))

explodedDF.show()

In this example, we first create a sample DataFrame with a Name column and a Subjects array column. We then use the explode() function to convert the Subjects array column into multiple rows. The resulting DataFrame has one row for each element in the array.

Here's the output of the above example:

+-----+--------------+
| Name|       Subject|
+-----+--------------+
|Alice|         Maths|
|Alice|       Science|
|  Bob|       English|
|  Bob|Social Science|
+-----+--------------+

As you can see, explode() has split the Subjects array column into separate rows: there is now one row per subject, and the Name value is repeated for each of its subjects.
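Once the data is in this shape, the operations mentioned in the introduction become straightforward. The snippet below is a minimal sketch continuing from explodedDF above: it groups by the new Subject column and counts the rows in each group, something that cannot be done directly on the original array column.

// Count how many rows (students) there are per subject
val subjectCounts = explodedDF.groupBy("Subject").count()

subjectCounts.show()

With this small sample each subject appears only once, but the same pattern applies to real data, and the exploded DataFrame can likewise be joined with other DataFrames on the Subject column.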

Conclusion

The explode() function is a simple and powerful way to split an array column into multiple rows in Spark, making it easy to reshape a DataFrame for aggregations, joins, and other row-level operations.