How to use the split() function in Spark

Spark offers many functions for manipulating data, including the split() function, which splits a string into an array of substrings based on a delimiter. In this blog, we will discuss the split() function in detail and explore its usage in Spark.

The split() Function

The split() function is a built-in function in Spark that splits a string into an array of substrings based on a delimiter. It takes two arguments: the first is the column containing the string to be split, and the second is the delimiter. The delimiter is interpreted as a Java regular expression pattern, not a literal string.

The syntax of the split() function is as follows:

def split(str: Column, pattern: String): Column 
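
In Spark 3.0 and later there is also a three-argument overload that accepts a limit on the number of splits:

def split(str: Column, pattern: String, limit: Int): Column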

Here, str is the column holding the input string, pattern is the delimiter (a Java regular expression), and limit, when provided, caps the number of substrings in the result. The function returns a column containing an array of substrings. Let's look at an example:

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes an active SparkSession named spark

val data = Seq("John,Doe", "Jane,Smith", "David,Williams").toDF("name")
val splitData = data.select(split(col("name"), ",").as("split"))
splitData.show()

In the above code, we have created a DataFrame called data with a column called name. We have used the split() function to split the name column into an array of substrings on the comma delimiter, aliased the resulting column as split, and assigned the result to a new DataFrame called splitData.

The output of the above code would look like this:

+-----------------+
|            split|
+-----------------+
|      [John, Doe]|
|    [Jane, Smith]|
|[David, Williams]|
+-----------------+

As you can see, the split() function has split the name column into an array of substrings based on the comma delimiter.
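
Keep in mind that the pattern argument is a Java regular expression, so a delimiter that happens to be a regex metacharacter (such as | or .) must be escaped. A minimal sketch, assuming a hypothetical pipe-delimited name column:

// "|" is a regex metacharacter; unescaped, it would split between every character.
// Escaping it as "\\|" splits on the literal pipe instead.
val pipeData = Seq("John|Doe", "Jane|Smith").toDF("name")
val pipeSplit = pipeData.select(split(col("name"), "\\|").as("split"))
pipeSplit.show()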

Using split() with Other Functions

The split() function can be combined with other functions in Spark to transform and manipulate the resulting array of substrings. In this section, we will explore some examples of using split() with other functions.

Using split() with explode()

The explode() function in Spark transforms an array column into one row per element. We can use the split() function in conjunction with explode() to break a string into substrings and produce a separate row for each substring.

Let's look at an example. Suppose we have a DataFrame called data with a column called names that contains a comma-separated string of names:

+-------------------+
|              names|
+-------------------+
|   John,Doe,Jackson|
|Jane,Smith,Williams|
+-------------------+

We can use the split() function to split the names column into an array of substrings using the comma delimiter:

import org.apache.spark.sql.functions._

val splitData = data.select(split(col("names"), ",").as("split"))
splitData.show()

This will give us a DataFrame with a column called split that contains an array of substrings:

+--------------------+
|               split|
+--------------------+
|[John, Doe, Jackson]|
|[Jane, Smith, Wil...|
+--------------------+

We can then use the explode() function to transform the array of substrings into multiple rows:

val explodedData = splitData.select(explode(col("split")).as("exploded"))
explodedData.show()

This will give us a DataFrame with a column called exploded that contains a single substring per row:

+--------+
|exploded|
+--------+
|    John|
|     Doe|
| Jackson|
|    Jane|
|   Smith|
|Williams|
+--------+

As you can see, we have used the split() function in conjunction with explode() to break the names column into substrings and produce one row per substring.
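
The two steps can also be collapsed into a single select. A minimal sketch, reusing the same data DataFrame:

// Split and explode in one expression: each comma-separated name becomes its own row.
val names = data.select(explode(split(col("names"), ",")).as("name"))
names.show()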

Using split() with size()

The size() function in Spark returns the number of elements in an array or map. (The similarly named length() function applies to strings, not arrays.) We can use size() in conjunction with split() to count the number of substrings created by the split() function.

Let's use the same DataFrame data from the previous example:

+-------------------+
|              names|
+-------------------+
|   John,Doe,Jackson|
|Jane,Smith,Williams|
+-------------------+

We can use the split() function to split the names column into an array of substrings using the comma delimiter:

import org.apache.spark.sql.functions._

val splitData = data.select(split(col("names"), ",").as("split"))
splitData.show()

This will give us a DataFrame with a column called split that contains an array of substrings:

+--------------------+
|               split|
+--------------------+
|[John, Doe, Jackson]|
|[Jane, Smith, Wil...|
+--------------------+

We can then use the size() function to count the number of substrings in each row:

val lengthData = splitData.select(size(col("split")).as("length"))
lengthData.show()

This will give us a DataFrame with a column called length that contains the count of substrings in each row:

+------+
|length|
+------+
|     3|
|     3|
+------+

As you can see, we have used the split() function in conjunction with size() to count the number of substrings in the names column.
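
As with explode(), the intermediate DataFrame is not strictly necessary; split() and size() can be nested in a single expression. A minimal sketch, again reusing the data DataFrame:

// Count the comma-separated entries directly, without a separate split step.
val counts = data.select(size(split(col("names"), ",")).as("num_names"))
counts.show()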

Conclusion

The split() function is a useful tool in Spark for splitting a string into an array of substrings based on a delimiter, and it composes well with other Spark functions such as explode() and size(). In this blog, we have discussed the syntax of the split() function and explored some examples of using it together with other functions. We hope this blog has been helpful in understanding how to use the split() function in Spark.