How to do String Manipulation with Regular Expressions in Spark and PySpark

String manipulation is a common task in data processing, and Spark provides powerful tools for working with strings in distributed environments. Regular expressions, or regex for short, are a concise way to describe string patterns for matching, extraction, and replacement. In this blog, we will explore the basics of regular expressions in Spark, including common regex functions and examples of how to use them.

Regex Functions in Spark

Spark provides a variety of functions for working with regular expressions, including:

  • regexp_extract(str, pattern, index): Extracts a specific group matched by a regular expression from a string column.
  • regexp_replace(str, pattern, replacement): Replaces all occurrences of a pattern in a string column with a replacement string.
  • regexp_like(str, pattern): Tests whether a string column matches a regular expression pattern.
  • regexp_instr(str, pattern): Returns the position of the first occurrence of a regular expression pattern in a string column.
  • regexp_substr(str, pattern): Returns the first substring of a string column that matches a regular expression pattern, or NULL if there is no match.

Note that regexp_like, regexp_instr, and regexp_substr were added as SQL functions in Spark 3.4; on earlier versions, the rlike column method covers the membership-test case.

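To build intuition for what each function does, the semantics can be approximated locally with Python's re module. This is only a sketch for illustration (the sample string and pattern are made up for this example); Spark applies the equivalent operation to every row of a column across the cluster:

```python
import re

s = "order-12345 shipped"
pattern = r"(\d+)"

m = re.search(pattern, s)

# regexp_extract(str, pattern, idx): pull out a capturing group
extracted = m.group(1) if m else ""           # "12345"

# regexp_replace(str, pattern, replacement): substitute every match
replaced = re.sub(pattern, "#", s)            # "order-# shipped"

# regexp_like(str, pattern): does the string contain a match?
is_match = re.search(pattern, s) is not None  # True

# regexp_instr(str, pattern): 1-based position of the first match (0 if none)
position = (m.start() + 1) if m else 0        # 7

# regexp_substr(str, pattern): the matched substring itself
substring = m.group(0) if m else None         # "12345"
```

One caveat: Spark compiles patterns with Java's java.util.regex engine, so a few constructs (such as possessive quantifiers) differ from Python's re, though everything used in this post behaves the same in both.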
Using Regular Expressions in Spark

Let's look at some examples of how to use regular expressions in Spark.

Example 1: Extracting Phone Numbers

Suppose we have a DataFrame with a column of phone numbers in the format "###-###-####". We want to extract just the area code (the first three digits) and create a new column with the result.

import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._  // needed for .toDF on a local Seq

val data = Seq(
    ("John", "123-456-7890"),
    ("Jane", "234-567-8901"),
    ("Bob", "345-678-9012"))
  .toDF("name", "phone_number")

// Capture the leading three digits; index 1 selects the first capturing group
val areaCode = regexp_extract(data("phone_number"), "^(\\d{3})-", 1).alias("area_code")
val result = data.select(data("name"), data("phone_number"), areaCode)

result.show()

Here, we use the regexp_extract() function to extract the first three digits of the phone number using the regular expression pattern "^(\\d{3})-" (the regex ^(\d{3})-, with the backslash doubled inside a Scala string literal). The ^ symbol anchors the match to the beginning of the string, \d matches any digit, and {3} specifies that we want exactly three digits. The parentheses create a capturing group that we can refer to with the index parameter: passing index 1 extracts the first capturing group (the area code), whereas index 0 would return the entire match, including the trailing hyphen.
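The pattern itself can be sanity-checked outside Spark before running a job. Here is a local approximation with Python's re module of what regexp_extract does to each row (the phone_numbers list simply mirrors the sample data above):

```python
import re

phone_numbers = ["123-456-7890", "234-567-8901", "345-678-9012"]

# Same regex as the Spark example: anchor at the start, capture three digits
pattern = r"^(\d{3})-"

# group(1) corresponds to passing index 1 to regexp_extract
area_codes = [re.match(pattern, p).group(1) for p in phone_numbers]
# ["123", "234", "345"]
```

Testing a pattern on a handful of sample strings like this is much faster than debugging it inside a Spark job.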

Example 2: Replacing Invalid Characters

Suppose we have a DataFrame with a column of email addresses that contains invalid characters (such as spaces) that we want to remove.

import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._  // needed for .toDF on a local Seq

val data = Seq(
    ("John", "john.doe@gmail.com"),
    ("Jane", "jane.doe@gmail.com "),
    ("Bob", "bob.smith @yahoo.com"))
  .toDF("name", "email")

// Replace every run of whitespace with the empty string
val cleanEmail = regexp_replace(data("email"), "\\s+", "").alias("clean_email")
val result = data.select(data("name"), data("email"), cleanEmail)

result.show()

Here, we use the regexp_replace() function to replace every run of one or more whitespace characters (\s+) in the email column with an empty string, removing spaces, tabs, and newlines. To strip other kinds of invalid characters, you would widen the pattern, for example with a character class such as [^A-Za-z0-9.@_-] that matches anything outside the allowed set.
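Again, the substitution can be verified locally with Python's re module before applying it at scale (the emails list mirrors the sample data above, including the stray spaces):

```python
import re

emails = ["john.doe@gmail.com", "jane.doe@gmail.com ", "bob.smith @yahoo.com"]

# \s+ matches one or more whitespace characters; replace each run with ""
clean = [re.sub(r"\s+", "", e) for e in emails]
# ["john.doe@gmail.com", "jane.doe@gmail.com", "bob.smith@yahoo.com"]
```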

Conclusion

Regular expressions are a powerful tool for string manipulation and pattern matching in Spark. By using the functions provided by Spark, we can easily extract specific parts of strings, replace invalid characters, and more. This knowledge will be useful for anyone working with text data in Spark.