Demystifying PySpark: SparkSession vs SparkContext

Introduction: When working with PySpark, it's crucial to understand the core components of Apache Spark, such as SparkSession and SparkContext. This blog aims to provide a comprehensive comparison between these two essential components and guide you on when to use them in your PySpark applications. Let's dive in and explore these key aspects of the Apache Spark framework.

A Brief Overview of Apache Spark

Apache Spark is an open-source, distributed computing system that provides a fast and general-purpose cluster computing platform for big data processing. It supports various programming languages, including Python, Scala, Java, and R, and offers built-in libraries for machine learning, graph processing, and stream processing. In this blog, we will focus on PySpark, the Python API for Apache Spark.

Understanding SparkContext

SparkContext has been the main entry point for Spark Core functionality since Spark's earliest releases. It connects to the cluster manager (such as YARN, Mesos, or Spark's standalone manager) and coordinates resources across the cluster. SparkContext is responsible for:

  • Task scheduling and execution
  • Fault tolerance and recovery
  • Access to storage systems (like HDFS, S3, and more)
  • Accumulators and broadcast variables
  • Configuration and tuning

Only one active SparkContext can exist per application. To create one, you typically configure a SparkConf object first, which holds Spark configuration settings as key-value pairs.

Example:

from pyspark import SparkConf, SparkContext

# Application name and master URL are the minimum configuration.
conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)
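
Beyond creating the context, SparkContext also exposes the accumulator and broadcast facilities listed above. The following is a minimal sketch that reuses the sc created above; the variable names and sample data are invented for illustration:

# A minimal sketch of core SparkContext features, reusing `sc` from above.
lookup = sc.broadcast({"a": 1, "b": 2})   # read-only value shipped to every executor
matches = sc.accumulator(0)               # counter aggregated back on the driver

def note_known(key):
    if key in lookup.value:
        matches.add(1)

sc.parallelize(["a", "b", "c", "a"]).foreach(note_known)
print(matches.value)  # 3 for this sample data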

For more detail on SparkContext, see the dedicated PySpark context guide.

Understanding SparkSession

Introduced in Spark 2.0, SparkSession simplifies interaction with Spark's various functionalities. It is the unified entry point for the DataFrame and Dataset APIs, Structured Streaming, and SQL operations. SparkSession encapsulates SparkContext along with the HiveContext and SQLContext that were used in earlier Spark versions. Key features of SparkSession include:

  • Support for DataFrame and Dataset API
  • Execution of SQL queries
  • Reading and writing data in various formats (JSON, Parquet, Avro, etc.)
  • Support for Hive's built-in functions and User-Defined Functions (UDFs)
  • Metadata management, including table and database operations

Creating a SparkSession is straightforward, as shown in the example below:

from pyspark.sql import SparkSession

# getOrCreate() returns the already-active session if one exists.
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local") \
    .getOrCreate()
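
As a quick illustration of this unified entry point, the sketch below reuses the spark session created above to build a DataFrame from local data, register it as a temporary view, and query it with SQL; the column names and view name are invented for the example:

# Reuses the `spark` session created above.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")   # make the DataFrame queryable from SQL
spark.sql("SELECT name FROM people WHERE age > 40").show()

# Reading and writing data sources goes through the same session,
# e.g. spark.read.json(...) and df.write.parquet(...).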

For a step-by-step guide to creating a PySpark session, see the detailed article on the PySpark session.

Differences between PySpark Context and PySpark Session

The main differences between the PySpark SparkContext and the PySpark SparkSession, feature by feature:

  • Initialization: A SparkContext is created explicitly by instantiating the SparkContext class; a SparkSession is created through the SparkSession.builder pattern, typically ending with getOrCreate().
  • Functionality: SparkContext exposes the low-level Spark Core API (RDDs, accumulators, broadcast variables); SparkSession provides the high-level API and integrates the DataFrame and Dataset abstractions.
  • Spark UI: Both are monitored through the same Spark UI; a SparkSession reuses the UI of its underlying SparkContext rather than starting a separate one.
  • SQL and DataFrames: With a bare SparkContext, SQL and DataFrame operations require a separate SQLContext or HiveContext; a SparkSession supports them directly.
  • Catalog: SparkContext has no built-in catalog; SparkSession includes a catalog (spark.catalog) with methods for managing tables and databases.
  • Hive integration: SparkContext requires an explicitly configured HiveContext for Hive support; SparkSession enables it with a single enableHiveSupport() call on the builder.
  • Data sources: SparkContext reads and writes data through the RDD API; SparkSession offers a unified reader/writer API (spark.read, DataFrame.write) for formats such as JSON, Parquet, and Avro.
  • Interactive shell: The PySpark shell (pyspark) exposes both, sc as the SparkContext and spark as the SparkSession, and both can be used from Jupyter notebooks.
  • Python compatibility: Both follow the Python versions supported by your Spark release; Python 2 support was removed in Spark 3.0.
  • Multiple contexts vs. sessions: Only one active SparkContext is allowed per application; multiple SparkSessions can be created (for example with spark.newSession()), all sharing that single underlying SparkContext.
  • Session state: SparkContext keeps no session-scoped state; SparkSession.builder.getOrCreate() returns the active session if one exists, so configuration and temporary views can be reused across calls.
  • Broadcast variables: Created with SparkContext.broadcast(); from a SparkSession the same method is reached through spark.sparkContext.broadcast().
  • DataFrame and Dataset APIs: Not available on SparkContext itself; SparkSession offers them as its primary interface for working with structured data.
  • Streaming: SparkContext underpins the legacy DStream API through the StreamingContext class; SparkSession supports Structured Streaming through the DataStreamReader and DataStreamWriter APIs (spark.readStream, DataFrame.writeStream).

Together, these points cover the main differences between SparkContext and SparkSession, including session state, broadcast variables, the DataFrame and Dataset APIs, streaming support, and how multiple contexts or sessions are handled within a Spark application.
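
To make the low-level versus high-level contrast concrete, here is a short sketch that computes the same word count twice, once with the RDD API reached through the SparkContext and once with the DataFrame API of the SparkSession. It assumes the spark session created earlier, and the sample data is invented:

from operator import add

words = ["spark", "pyspark", "spark"]

# Low level: RDD transformations via the underlying SparkContext.
rdd_counts = (spark.sparkContext.parallelize(words)
              .map(lambda w: (w, 1))
              .reduceByKey(add)
              .collect())

# High level: DataFrame API via the SparkSession.
df_counts = (spark.createDataFrame([(w,) for w in words], ["word"])
             .groupBy("word")
             .count()
             .collect())

Both produce the same counts, but the DataFrame version is declarative and benefits from Spark's query optimizer.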

When to Use Which?

As we've seen, SparkSession is a more recent and comprehensive entry point for Spark applications, encapsulating SparkContext and other necessary contexts. When deciding whether to use SparkSession or SparkContext, consider the following factors:

  • Spark version: If you're using Spark 2.0 or later, it's recommended to use SparkSession, as it simplifies interaction with the different APIs and offers a unified programming interface.
  • API requirements: If your application heavily relies on DataFrames, Datasets, or SQL operations, SparkSession is the right choice. However, if you're working primarily with Resilient Distributed Datasets (RDDs) and don't need DataFrame or SQL functionality, you can stick to SparkContext.
  • Compatibility: If your application needs to be compatible with earlier Spark versions (before 2.0), you may have to use SparkContext, HiveContext, or SQLContext separately. However, this approach is less recommended due to the advantages of using SparkSession.

Accessing SparkContext from SparkSession

Even when using SparkSession, you may still need to access the underlying SparkContext for specific tasks, like creating RDDs. Fortunately, SparkSession provides an easy way to access the SparkContext, as shown below:

sc = spark.sparkContext 
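
For instance, the underlying context can build an RDD that is then promoted to a DataFrame. This short sketch reuses the sc obtained above and the spark session from the earlier example; the sample data is invented:

rdd = sc.parallelize([("Alice", 34), ("Bob", 45)])
df = spark.createDataFrame(rdd, ["name", "age"])   # RDD promoted to a DataFrame
df.show()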

Conclusion

In this blog, we have discussed the differences between SparkSession and SparkContext, as well as when to use each in PySpark applications. Although SparkContext was the primary entry point in earlier Spark versions, SparkSession has become the preferred choice since Spark 2.0, offering a unified programming interface and simplifying interaction with the different Spark APIs. Still, it's essential to understand your application's specific requirements before choosing between SparkSession and SparkContext.