Using the lit() Function in PySpark to Create a New Column with a Constant Value

Introduction

In PySpark, the lit() function creates a column expression that holds a literal (constant) value. Combined with withColumn(), it lets you add a column that has the same value in every row of a DataFrame, which is useful for default values, flags, or labels. In this blog post, we will discuss how to use the lit() function to create a new column with a constant value in a PySpark DataFrame.

Importing Necessary Libraries

First, let's import the necessary libraries for working with PySpark DataFrames and the lit() function.

from pyspark.sql import SparkSession 
from pyspark.sql.functions import lit 

Creating a PySpark DataFrame

Now, let's create a sample DataFrame to demonstrate the usage of the lit() function.

# Create a Spark session
spark = SparkSession.builder \
    .appName("UsingLitFunction") \
    .getOrCreate()

# Create a sample DataFrame with two columns, Name and Age
data = [("Alice", 30), ("Bob", 25), ("Eve", 35), ("John", 40)]
columns = ["Name", "Age"]
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

Using the lit() Function to Create a New Column

To create a new column with a constant value in a DataFrame, pass the result of lit() as the second argument to the DataFrame's withColumn() method. withColumn() returns a new DataFrame and leaves the original one unchanged.

# Add a new column with a constant value 
dataframe_with_constant = dataframe.withColumn("Country", lit("United States")) 
dataframe_with_constant.show() 

Using the lit() Function with Different Data Types

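The lit() function accepts constants of different Python types, such as integers, floats, strings, and booleans. Spark infers the new column's data type from the value you pass, so pass a value of the type you want the column to have. The example below adds a string, an integer, a floating-point, and a boolean column.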

# Add new columns with different constant values and data types 
dataframe_with_constants = dataframe \
    .withColumn("Country", lit("United States")) \
    .withColumn("Score", lit(100)) \
    .withColumn("Rating", lit(4.5)) \
    .withColumn("Active", lit(True))

dataframe_with_constants.show()

Conclusion

The lit() function in PySpark is a simple way to create columns with constant values. Used together with withColumn(), it lets you add default or constant values of different data types to your PySpark DataFrames in a single step.