Spark DataFrame Column isin Function: A Deep Dive

Apache Spark is a fast, scalable, and flexible open-source distributed computing system that has made a significant impact in big data analytics. One of the most notable features of Spark is the DataFrame API, which supports operations like filtering data, aggregation, and transformation. In this blog, we'll explore the isin function, a versatile method that comes in handy when you need to filter DataFrame based on multiple values present in a column.

Understanding the isin Function

link to this section

The isin function is part of the DataFrame API and allows us to filter rows in a DataFrame based on whether a column's value is in a specified list. It's akin to the IN SQL operator, which checks if a value exists within a list of values. The isin function is crucial for tasks such as filtering data based on a predefined list of values or comparing data across different datasets.

Basic Usage of isin

link to this section

Let's start with a simple usage of the isin function. Suppose you have a DataFrame, df , with a column "Color", and you want to filter rows where the Color is either "Red", "Blue", or "Green". Here's how you can achieve this:

val colors = List("Red", "Blue", "Green") 
val df_filtered = df.filter(df("Color").isin(colors: _*)) 

In this example, colors: _* is a syntax that tells Scala to pass each element of the collection as its own argument, rather than all of it as a single argument. The resulting df_filtered DataFrame contains only the rows where the Color column's value is either "Red", "Blue", or "Green".

Using isin with Multiple Columns

link to this section

The isin function can be used with multiple columns as well. Let's say you have another column, "Size", and you want to filter rows where the Color is "Red", "Blue", or "Green" and the Size is "Small" or "Medium". Here's how:

val sizes = List("Small", "Medium") 
val df_filtered = df.filter(df("Color").isin(colors: _*) && df("Size").isin(sizes: _*)) 

This will result in a DataFrame that includes only those rows where the Color is in the specified list and the Size is also in its respective list.

isin with Complex Types

link to this section

Apart from working with basic types like String, Integer, and Float, the isin function can be used with complex types such as Arrays or Structs. For instance, if you have a column of type Array and you want to filter rows where the array contains either "Red" or "Blue", you could explode the array and then use the isin function:

import org.apache.spark.sql.functions.explode 
        
val df_exploded = df.withColumn("Color", explode(df("ColorsArray"))) 
val df_filtered = df_exploded.filter(df_exploded("Color").isin(colors: _*)) 

Here, the explode function creates a new row for each element in the "ColorsArray" column. The isin function is then applied to the "Color" column to filter the rows.

isin with Null Values

link to this section

When using the isin function, it's essential to be aware of how it handles null values. In Spark, isin returns false when the column's value is null, regardless of whether null is included in the list. If you want to keep null values in your filtered DataFrame, you'll have to handle it explicitly:

val df_filtered = df.filter(df("Color").isNull || df("Color").isin(colors: _*)) 

In this example, df("Color").isNull || df("Color").isin(colors: _*) will return true if the Color is null or if it's in the specified list, ensuring that null values are included in the resulting DataFrame.

isin in SQL Expressions

link to this section

Another powerful aspect of Spark is its support for SQL-like syntax. The isin function can also be used in SQL expressions for data filtering:

df.createOrReplaceTempView("dfview") 
val df_filtered = spark.sql("SELECT * FROM dfview WHERE Color IN ('Red', 'Blue', 'Green')") 

In this example, we're using SQL syntax to achieve the same result as the isin function in DataFrame API.

Using isin in Join Operations

link to this section

The isin function can also be combined with join operations for efficient filtering. Suppose you have two DataFrames, df1 and df2 , and you want to filter df1 based on the values present in a column of df2 . Here's how:

val df2_list = df2.select("Color").distinct().rdd.flatMap(_.toSeq).collect() 
val df1_filtered = df1.filter(df1("Color").isin(df2_list: _*)) 

In this example, we first extract the distinct colors from df2 into a list df2_list . We then filter df1 where Color is in df2_list using isin.

Conclusion

link to this section

The isin function is an incredibly handy tool in Spark's arsenal, allowing for sophisticated and flexible data filtering. Whether you're dealing with basic or complex types, multiple columns, or even null values, the isin function offers a clean and efficient way to filter data in a DataFrame. Like many Spark functions, understanding and utilizing isin effectively can significantly improve the quality of your data processing pipelines and allow you to extract meaningful insights from your data.

You can checkout PySpark Filter Dataframe Using Isin Tutorial.