Spark DataFrame Basic Operators

In this blog post, we'll explore the basic operators of Spark DataFrame, which include transformations and actions. These operators are used to manipulate data in a DataFrame and perform operations like filtering, aggregation, and joining. By understanding these operators, you'll be able to perform basic data manipulations in Spark and build more complex data processing pipelines.

Transformations


Transformations in Spark DataFrame are operations that produce a new DataFrame from an existing one. Transformations are lazy: they only build up a logical plan and do not execute until an action is called, such as writing the output to disk or collecting the data to the driver program. The short sketch below makes this concrete, after which we cover some of the basic transformations in Spark DataFrame.
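Here is a minimal illustration of lazy evaluation, assuming a SparkSession named spark and a small made-up dataset (the names here are hypothetical, not from any particular application):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("basic-operators").getOrCreate()
import spark.implicits._

// A small made-up dataset for illustration.
val df = Seq((1, 5), (12, 7), (20, 3)).toDF("col1", "col2")

// Lazy: this only records the filter in the logical plan; nothing runs yet.
val filtered = df.filter(col("col1") > 10)

// The action triggers execution of the whole plan.
filtered.show()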

  1. Select

The select transformation is used to select one or more columns from a DataFrame. You can pass column names as strings or Column objects built with the "col" function. Here's an example:

df.select("col1", "col2") 
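Beyond plain column names, select also accepts derived columns. A small sketch reusing the hypothetical df from above (the alias col2_doubled is made up for illustration):

// Select with col, including a derived, renamed column.
df.select(col("col1"), (col("col2") * 2).alias("col2_doubled"))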
  2. Filter

The filter transformation is used to filter rows based on a given condition. You can pass a boolean Column expression, a SQL condition string, or a Column built with the "expr" function. Here's an example:

df.filter(col("col1") > 10) 
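The same condition can be written in any of the three styles, again using the hypothetical df (expr lives in org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.expr

// Three equivalent ways to express the same condition.
df.filter(col("col1") > 10)
df.filter("col1 > 10")
df.filter(expr("col1 > 10"))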
  3. WithColumn

The withColumn transformation adds a new column to a DataFrame, or replaces an existing column of the same name, computed from an expression over existing columns. You pass the new column's name and a Column expression as arguments; the expression can also be built with the "expr" function. Here's an example:

df.withColumn("new_col", col("col1") + col("col2")) 
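Expressions can also be conditional. A sketch using when from org.apache.spark.sql.functions (the column names total and is_large are made up for illustration):

import org.apache.spark.sql.functions.when

// A derived numeric column, then a conditional flag column.
df.withColumn("total", col("col1") + col("col2"))
  .withColumn("is_large", when(col("col1") > 10, true).otherwise(false))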
  4. Drop

The drop transformation is used to drop one or more columns from a DataFrame. You can pass the column names as arguments, or pass a Column object built with the "col" function. Here's an example:

df.drop("col1") 
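drop also accepts several names at once, or a Column object. A short sketch against the hypothetical df:

// Drop several columns by name; names that don't exist are silently ignored.
df.drop("col1", "col2")

// A single column can also be dropped via a Column object.
df.drop(col("col1"))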

Actions


Actions in Spark DataFrame trigger execution of the accumulated lazy transformations and either return a result to the driver program or write output to disk. Here are some of the basic actions in Spark DataFrame:

  1. Show

The show action is used to display the first n rows of a DataFrame. You can pass the number of rows to display; the default is 20. Here's an example:

df.show(10) 
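show also takes a truncate flag that controls whether long cell values are shortened; for example:

// Show 10 rows and disable the default truncation of long values.
df.show(10, truncate = false)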
  2. Collect

The collect action retrieves all the rows of a DataFrame to the driver program as an Array[Row]. This action can be expensive (and can exhaust driver memory) for large datasets, so it is generally used for small datasets or debugging purposes. Here's an example:

val data = df.collect() 
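Once collected, fields are read from each Row by position or by name. A small sketch, assuming col1 holds integers as in the earlier made-up dataset:

// Iterate over the collected rows; getInt reads a field by position.
val data = df.collect()
data.foreach(row => println(row.getInt(0)))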
  3. Count

The count action is used to return the number of rows in a DataFrame. Here's an example:

val count = df.count() 
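Because transformations are lazy, they compose naturally with count; for example, counting only the rows that match a filter (reusing the hypothetical df):

// Only rows passing the filter are counted.
val largeRows = df.filter(col("col1") > 10).count()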
  4. Write

The write action is used to write the contents of a DataFrame to disk in various formats, including CSV, Parquet, and JSON. (Strictly speaking, df.write returns a DataFrameWriter, and the final save call is what triggers execution; the path you pass names an output directory, not a single file.) Here's an example:

df.write.format("csv").save("output.csv") 
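A slightly fuller sketch with a save mode and a CSV header option (the path output_dir is made up for illustration):

// Overwrite any existing output and include a header row in the CSV files.
df.write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("output_dir")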

Conclusion


In this blog post, we've explored the basic operators of Spark DataFrame, including transformations and actions. By understanding these operators, you'll be able to manipulate data in a DataFrame and perform basic data processing operations. These operators can be combined to build more complex data processing pipelines and enable scalable and distributed data processing.
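As a parting sketch, here is how the operators above chain into a small pipeline (the column names and output path are hypothetical):

// Nothing executes until save at the end of the chain.
df.select("col1", "col2")
  .filter(col("col1") > 10)
  .withColumn("total", col("col1") + col("col2"))
  .drop("col2")
  .write
  .format("parquet")
  .mode("overwrite")
  .save("pipeline_output")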