Apache Spark and PySpark Explained: Big Data Processing, Finally Digested

Your laptop has 16 GB of RAM. Your dataset has 200 GB of data. The query you need to run would take hours on a single machine — if it finished at all. Apache Spark was built to solve exactly this problem: when one machine is not enough, spread the work across many.

The analogy: one cashier vs. a hundred checkout lanes

Imagine a supermarket that has to serve 10,000 customers on a busy Saturday. With one cashier, every customer waits in a single line. The processing is sequential — each customer must finish before the next one begins. The store does not get faster; it just gets longer queues.

Now open 100 checkout lanes. The 10,000 customers split across all lanes simultaneously. Each cashier handles a small slice of the total work. The throughput is not slightly better — it is fundamentally different. The same work that took hours now takes minutes.

Apache Spark is the 100-lane checkout system for data. Instead of asking one machine to process your entire dataset from start to finish, Spark breaks the dataset into smaller chunks, sends each chunk to a different machine, and has all of those machines process their slice at the same time. This is distributed computing, and it is the core idea behind everything Spark does.

Where Spark came from

To understand why Spark was built, you need to understand what existed before it: Apache Hadoop.

In the late 2000s, Hadoop was the dominant framework for processing large datasets. Hadoop could split data across many machines and process each chunk in parallel — the right idea. But it had a painful limitation: after every step of a multi-step job, Hadoop wrote the intermediate results to disk. Disk storage is slow. If you had a pipeline with 10 transformation steps, Hadoop would write to disk and read back from disk between each one. For complex jobs, the constant disk I/O made everything crawl.

In 2012, researchers at UC Berkeley's AMPLab published a paper proposing a different approach: keep intermediate data in memory — in the machine's RAM — between steps instead of writing to disk. Memory is roughly 100 times faster to access than a hard drive. They called the new framework Apache Spark. For iterative jobs like machine learning, the speedup was dramatic: Spark was up to 100 times faster than Hadoop on these workloads.

In 2013, the same researchers founded Databricks to build a managed cloud platform on top of Spark. Spark itself remains free and open-source. Today it runs on every major cloud platform — inside Databricks, Amazon EMR, Google Dataproc, and Azure HDInsight.

The three players: Driver, Cluster Manager, Executors

When you run a Spark job, three distinct roles coordinate the work. Understanding each one makes Spark much easier to reason about.

The Driver is where your code runs. It is the boss. When you write a PySpark script and run it, the Driver reads your code, figures out what needs to happen, and builds an execution plan. The Driver does not process data itself — it plans and coordinates.

The Cluster Manager handles resource allocation. It is the staffing agency. The Driver tells the Cluster Manager how many machines it needs, and the Cluster Manager goes and finds them. Spark supports several cluster managers: YARN (the Hadoop resource manager), Kubernetes, and its own built-in Standalone mode. On Databricks, all of this is managed for you automatically.

The Executors are the workers. Each executor is a process running on a separate machine (or a separate core in local mode). Executors receive a slice of the data — called a partition — and process it. When the job is done, executors report results back to the Driver. Adding more executors means more parallel work; giving each executor more memory means it can handle larger partitions without spilling to disk.

How your data is split: partitions

Spark never loads your full dataset into one place. Instead, it divides the data into partitions — logical chunks, each assigned to one executor. Think of partitions like chapters in a book. If you have 10 editors and need to proofread a 1,000-page book, you give each editor 100 pages. They all work at the same time; the book gets edited 10 times faster.

A 10 GB file might be split into 200 partitions of 50 MB each. With 20 executors, each executor handles 10 partitions. Add 20 more executors and the same data is split across more workers — more parallelism, faster results. Partition count and size are the key levers when tuning a Spark job for performance.

RDDs and DataFrames: two ways to work with data

Spark has two main ways to represent and work with a distributed dataset.

The original abstraction is the RDD (Resilient Distributed Dataset). An RDD is a collection of data elements spread across the cluster. It is immutable (you never modify an RDD; you create a new one from it), fault-tolerant (if a machine fails, Spark can reconstruct the lost partition by replaying the operations that created it), and distributed (each partition lives on a different executor). RDDs give you complete flexibility, but they require verbose, low-level code and Spark cannot automatically optimise them.

The modern abstraction is the DataFrame. A DataFrame is an RDD organised into named, typed columns — much like a database table or a pandas DataFrame in Python. You can filter rows, join tables, group and aggregate — all with a clean API or with plain SQL. More importantly, Spark's query optimiser (the Catalyst Optimizer) analyses the full operation graph and rewrites steps to make the job faster. With RDDs, you get exactly what you asked for. With DataFrames, Spark improves on what you asked for.

Which one to use: DataFrames, almost always. They are faster (thanks to Catalyst), cleaner to read, and support both Python and SQL. RDDs still exist for low-level control — for example, when working with unstructured data like text or images that does not fit the tabular model. But for the vast majority of data engineering work, DataFrames are the right choice.

Lazy evaluation: plan first, execute once

One of the most important things to understand about Spark is that it is lazy. When you write a transformation — filtering rows, selecting columns, joining tables — Spark does not execute it immediately. It records the operation and moves on.

Spark divides all operations into two types:

Transformations are lazy — they describe what you want but trigger no computation. Examples: filter(), select(), join(), groupBy(), withColumn().
Actions trigger actual execution. Examples: count(), collect(), show(), write().

When you chain transformations, Spark builds a DAG — a Directed Acyclic Graph — that describes the full sequence of operations. No data moves. No compute is used. Only when you call an action does Spark look at the entire plan, optimise it, and execute everything in one efficient pass.

Think of it like a GPS that calculates the entire route before you start driving, rather than giving you one turn at a time. By seeing the full journey upfront, it can pick the most efficient path. Spark's Catalyst Optimizer does the same: it sees all your transformations at once and rearranges them for efficiency. If you filter rows early in a chain, Catalyst will push that filter as close to the data source as possible, so it reads the minimum amount of data from the start.

PySpark: Spark with Python syntax

Apache Spark was originally written in Scala and runs on the Java Virtual Machine (JVM). For many data engineers and data scientists, Scala was a barrier — a new language to learn on top of an already complex distributed system.

PySpark is the official Python API for Apache Spark. It gives you full access to Spark's capabilities — DataFrames, SQL, streaming, MLlib — using Python syntax. When you write PySpark code, your Python process communicates with the Spark JVM running underneath. You write Python; Spark executes the work on the cluster.

PySpark is why Spark became dominant in data engineering and data science. Python is already the most widely used language in both fields. PySpark means that engineers who know Python can process billions of rows without learning a new language.

A few things to keep in mind about PySpark specifically:

PySpark DataFrames are not pandas DataFrames. The API looks similar, but a pandas DataFrame lives entirely on one machine. A PySpark DataFrame is distributed across the cluster. Operations on a PySpark DataFrame trigger distributed Spark jobs; pandas operations stay local.
Pandas API on Spark — officially called pyspark.pandas — lets you write pandas-style code that runs on Spark under the hood. Useful for migrating existing pandas code, but the performance characteristics differ from native PySpark DataFrames.
Python UDFs have a performance cost. If you write a custom Python function (a UDF, or User-Defined Function) and apply it row by row in PySpark, each row must cross the boundary between Python and the JVM. On large datasets, this can be slow. Use built-in Spark functions wherever possible — they run entirely inside the JVM.

The Spark ecosystem

Spark is not just a batch processing engine. It ships with higher-level modules that handle different categories of work.

Spark SQL

Spark SQL lets you query DataFrames using standard SQL syntax. You register a DataFrame as a temporary view, then run SQL against it. Under the hood, Spark SQL uses the same Catalyst Optimizer as the DataFrame API — so SQL queries and Python DataFrame code produce equivalent execution plans. For teams that prefer SQL, Spark SQL makes the full power of distributed compute accessible without writing Python.

MLlib

MLlib is Spark's built-in machine learning library. It provides distributed implementations of common algorithms — linear regression, decision trees, clustering, collaborative filtering — that operate directly on Spark DataFrames. Because the data never leaves the cluster, you can train models on datasets far too large to fit in memory on a single machine.

Structured Streaming

Structured Streaming treats a real-time data stream as an infinite table that keeps growing. You write the same DataFrame operations you would use for batch data, and Spark continuously processes new records as they arrive. With Spark 4.1, Structured Streaming added a real-time mode capable of millisecond-level latency — making Spark viable for use cases that previously required a dedicated streaming engine like Apache Flink.

When Spark is the right tool — and when it is overkill

Spark comes with real overhead: spinning up a cluster, partitioning data, coordinating executors. For a 100-row CSV file, a plain Python script is faster. Spark earns its complexity when:

Your dataset is too large to fit in memory on one machine
Your transformations are too slow on a single CPU
You are training ML models on large feature sets that need distributed compute
You need to process real-time streams at scale
You are on Databricks or a managed Spark environment where cluster overhead is handled for you

It is likely overkill when your data fits comfortably in memory (under a few GB), when your team works primarily in SQL and a managed warehouse like BigQuery or Snowflake covers your scale, or when you are early in building a data platform and the complexity of managing clusters outweighs the benefit.

Most growing data teams reach Spark eventually — particularly as datasets grow, ML becomes part of the strategy, or pipelines outgrow what SQL alone can handle. The question is not whether Spark is capable (it is). The question is whether your team and your data are at the scale where its power justifies its complexity.

PySpark on Databricks. If you use Databricks, you are already using PySpark. Every Databricks notebook has a SparkSession available as spark by default. The cluster — Driver, Cluster Manager, Executors — is provisioned automatically. You write PySpark or SQL; Databricks handles the distributed infrastructure.

Apache Spark and PySpark Explained: Big Data Processing, Finally Digested

The analogy: one cashier vs. a hundred checkout lanes

Where Spark came from

The three players: Driver, Cluster Manager, Executors

How your data is split: partitions

RDDs and DataFrames: two ways to work with data

Lazy evaluation: plan first, execute once

PySpark: Spark with Python syntax

The Spark ecosystem

Spark SQL

MLlib

Structured Streaming

When Spark is the right tool — and when it is overkill

Working on something similar?