How to do String Manipulation in Spark

Apache Spark is a distributed computing framework for processing large datasets in parallel across a cluster. One common task in data processing is string manipulation. In this blog, we will look at four ways to manipulate string columns in a Spark DataFrame: concatenation, substring extraction, trimming, and replacement.

Concatenation

Concatenation is the process of joining two or more strings together. In Spark, we can use the concat function to concatenate two or more string columns. Here's an example:

PySpark

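from pyspark.sql.functions import concat

# `spark` is the active SparkSession (predefined in the pyspark shell and notebooks)
df = spark.createDataFrame(
    [(1, "John", "Doe"), (2, "Jane", "Doe")],
    ["id", "first_name", "last_name"])

# Join the two name columns with no separator between them
df.select(concat(df.first_name, df.last_name).alias("full_name")).show()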

Output:

+---------+
|full_name|
+---------+
|  JohnDoe|
|  JaneDoe|
+---------+

Substring

A substring is a portion of a string. In Spark, we can use the substring function to extract one by passing a start position, which is 1-based, and a length. Here's an example:

PySpark

from pyspark.sql.functions import substring

df = spark.createDataFrame(
    [(1, "John", "Doe"), (2, "Jane", "Doe")],
    ["id", "first_name", "last_name"])

# Take 2 characters starting at position 1 (positions are 1-based)
df.select(substring(df.first_name, 1, 2).alias("initials")).show()

Output:

+--------+
|initials|
+--------+
|      Jo|
|      Ja|
+--------+

Trim

Trimming removes leading and trailing spaces from a string. In Spark, we can use the trim function to do this on a string column. Here's an example:

PySpark

from pyspark.sql.functions import trim

df = spark.createDataFrame(
    [(1, " John ", " Doe "), (2, " Jane", "Doe ")],
    ["id", "first_name", "last_name"])

# Strip leading and trailing spaces from both name columns
df.select(trim(df.first_name).alias("first_name"), trim(df.last_name).alias("last_name")).show()

Output:

+----------+---------+
|first_name|last_name|
+----------+---------+
|      John|      Doe|
|      Jane|      Doe|
+----------+---------+

Replace

Replacing means substituting one substring in a string with another. In Spark, we can use the regexp_replace function for this. Its second argument is a regular expression, so a plain substring such as "Jo" simply matches itself, and every match in the string is replaced. Here's an example:

PySpark

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame(
    [(1, "John", "Doe"), (2, "Jane", "Doe")],
    ["id", "first_name", "last_name"])

# Replace "Jo" with "Ja" in first_name and "o" with "a" in last_name
df.select(regexp_replace(df.first_name, "Jo", "Ja").alias("first_name"),
    regexp_replace(df.last_name, "o", "a").alias("last_name")).show()

Output:

+----------+---------+
|first_name|last_name|
+----------+---------+
|      Jahn|      Dae|
|      Jane|      Dae|
+----------+---------+

Conclusion

String manipulation is a common task in data processing. In this blog, we have discussed some of the ways to perform string manipulation in Spark, including concatenation, substring, trim, and replace. These functions can be used to transform and clean string data in Spark.