PySpark isin Function: Advanced Data Filtering Techniques

Introduction

link to this section

In the big data ecosystem, Apache Spark is renowned for its robust capabilities for distributed computing. PySpark, the Python API for Spark, offers a Pythonic entry into the world of Spark, combining the power of distributed computing with Python's ease of use.

Today, we turn our focus towards a powerful yet often underappreciated tool within PySpark's armory, the isin function. Similar to its Pandas equivalent, PySpark's isin function plays a critical role in filtering data based on specific conditions. This blog post aims to dissect the isin function, exploring its use and various applications in PySpark.

Understanding PySpark's isin Function

link to this section

PySpark's isin function can be invoked on a DataFrame column, taking in either a list or DataFrame as its argument. The function checks if each element in the DataFrame column is contained in the given list or DataFrame.

Consider the following example:

# create a PySpark dataframe 
data = [("John", "USA"), ("Mike", "UK"), ("Sara", "Australia"), ("Lee", "China")] 
df = spark.createDataFrame(data, ["Name", "Country"]) 

# filter rows where Country is either USA or UK 
df_filtered = df.filter(df.Country.isin(["USA", "UK"])) 
df_filtered.show() 

Here, the line df.filter(df.Country.isin(["USA", "UK"])) filters the DataFrame to only include rows where the Country column value is 'USA' or 'UK'. This powerful combination of filter and isin methods provides a concise way to perform this filtering operation.

Applying isin on DataFrame Input

link to this section

While it's straightforward to use isin with a list, it also allows for a DataFrame as an input. Here's how it works:

# create a second PySpark dataframe 
countries = [("USA"), ("UK")] 
df_countries = spark.createDataFrame(countries, ["Country"]) 

# filter rows where Country is in df_countries dataframe 
df_filtered = df.filter(df.Country.isin(df_countries)) 
df_filtered.show() 

In this snippet, the isin function filters the original DataFrame based on the values in the df_countries DataFrame. This feature is incredibly useful when filtering data based on a large set of criteria.

Leveraging isin with a NOT Condition

link to this section

At times, you might want to filter out values contained in a list or DataFrame. This can be achieved by combining the isin function with the ~ (NOT) operator:

# filter rows where Country is not USA or UK 
df_filtered = df.filter(~df.Country.isin(["USA", "UK"])) 
df_filtered.show() 

Here, the ~ operator inverts the condition specified by the isin function. Consequently, the resulting DataFrame includes only rows where the Country is neither 'USA' nor 'UK'.

Broadcasting Large Lists with isin

link to this section

The power of isin can be harnessed for large lists through broadcasting. This is particularly useful when working with large datasets, as it optimizes your code to perform better on such data. Let's see how this is done:

from pyspark.sql.functions import broadcast 
        
# create a large list of countries 
large_list = ["USA", "UK", "India", "China", "Australia", "Canada"] 

# broadcast the list 
broadcasted_list = spark.sparkContext.broadcast(large_list) 

# filter rows where Country is in the broadcasted list 
df_filtered = df.filter(df.Country.isin(broadcasted_list.value)) 
df_filtered.show() 

Here, we use the broadcast function from pyspark.sql.functions to broadcast our large list. The broadcasted list can then be used in isin, improving the overall performance of the operation.

Using isin in Combination with Other PySpark Functions

link to this section

isin function can also be used in combination with other PySpark functions to create more complex queries. For instance, we can combine isin with the when function to create a new column based on conditions:

from pyspark.sql.functions import when 

# create a new column 'Region' based on 'Country' 
df = df.withColumn("Region", when(df.Country.isin(["USA", "UK"]), "Western").otherwise("Eastern")) 
df.show() 

In this example, we're using the when function from pyspark.sql.functions in combination with isin. We create a new column 'Region' and assign it the value 'Western' when the 'Country' column is either 'USA' or 'UK'. For all other cases, we assign it the value 'Eastern'.

Conclusion

link to this section

PySpark has established itself as an indispensable tool in the world of big data processing. Even though it might seem simple, the isin function plays a pivotal role in the PySpark data wrangling toolbox, providing an efficient mechanism to filter data based on list values or another DataFrame.

Whether you're an experienced data engineer or just commencing your big data journey, mastering functions like isin is integral to your data processing arsenal. It's the understanding of these building blocks that paves the way for writing efficient and effective data transformations, letting you get the most out of your data.