
PySpark at Scale: Lessons from Building an End-to-End Data Pipeline

PySpark · Big Data · Kafka · HDFS · Data Engineering

For the DSM010 Big Data module of my MSc in Data Science and AI at the University of London, I built a complete end-to-end data pipeline for cryptocurrency market analysis. The project covered everything from raw data ingestion to machine learning predictions, all running on PySpark with Kafka for streaming and HDFS for storage. Here is what I learned.

Pipeline Architecture: 9 Stages

The pipeline processes cryptocurrency market data through nine distinct stages:

  1. Ingestion - Pull raw market data from APIs and land it in HDFS as Parquet files.
  2. Cleaning - Handle missing values, remove duplicates, normalize timestamps across exchanges.
  3. Enrichment - Join market data with external signals (volume, social sentiment proxies).
  4. Feature engineering - Compute technical indicators: SMA, EMA, RSI, MACD, Bollinger Bands.
  5. Aggregation - Roll up to multiple time granularities (1min, 5min, 1hr, daily).
  6. Streaming ingestion - Kafka consumer for real-time price updates.
  7. Stream processing - Windowed aggregations on streaming data.
  8. Model training - Train price movement classifiers on historical features.
  9. Prediction serving - Apply trained models to streaming data for real-time signals.

Why PySpark Over Pandas

The dataset was large enough that pandas on a single machine would either run out of memory or take hours. PySpark distributed the computation across the cluster, and more importantly, the same code that processed batch data could be adapted for streaming with Spark Structured Streaming.

That said, PySpark has a steeper learning curve. Lazy evaluation means data-dependent errors only surface when an action runs, and even schema errors like "Column 'close_price' does not exist" point at the transformation that referenced the column, not the earlier step that dropped or renamed it. Debugging a chain of 15 transformations that way is not fun.

Feature Engineering with Technical Indicators

Computing technical indicators in PySpark requires window functions, which map naturally to the sliding window calculations that indicators like SMA and RSI need:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Define a window partitioned by symbol, ordered by timestamp
window_14 = Window.partitionBy("symbol").orderBy("timestamp").rowsBetween(-13, 0)
window_50 = Window.partitionBy("symbol").orderBy("timestamp").rowsBetween(-49, 0)

# Simple Moving Average
df = df.withColumn("sma_14", F.avg("close").over(window_14))
df = df.withColumn("sma_50", F.avg("close").over(window_50))

# RSI calculation (simple-average variant over the 14-row window)
delta = F.col("close") - F.lag("close", 1).over(
    Window.partitionBy("symbol").orderBy("timestamp")
)
gain = F.when(delta > 0, delta).otherwise(0)
loss = F.when(delta < 0, F.abs(delta)).otherwise(0)

df = df.withColumn("avg_gain", F.avg(gain).over(window_14))
df = df.withColumn("avg_loss", F.avg(loss).over(window_14))
# When avg_loss is 0 the division yields null in Spark; RSI is 100 by
# convention in that case, so guard it explicitly
df = df.withColumn("rsi_14",
    F.when(F.col("avg_loss") == 0, F.lit(100.0)).otherwise(
        100 - (100 / (1 + F.col("avg_gain") / F.col("avg_loss")))
    )
)

The window function approach is elegant and performs well when partitioned correctly. The key is partitioning by symbol so each cryptocurrency's indicators are computed independently.

Kafka Streaming Integration

For the real-time component, I used Spark Structured Streaming to consume from Kafka topics:

# `schema` is the StructType describing the JSON payload, defined earlier
stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "crypto-prices")
    .option("startingOffsets", "latest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

The streaming pipeline applies the same feature engineering transformations as the batch pipeline, which was one of the main reasons for choosing PySpark. Write once, run in both batch and streaming mode.

Lessons Learned

Partitioning strategy matters enormously. My first attempt at computing features across the full dataset without partitioning by symbol resulted in a single-partition shuffle that took 40 minutes. After repartitioning by symbol, the same computation ran in under 3 minutes.

Spark SQL vs DataFrame API is a style choice. I used the DataFrame API for transformations and Spark SQL for ad-hoc analysis. Both generate the same execution plans. Pick whichever your team reads more easily.

Memory management is the real skill. Learning to read Spark UI, identify skewed partitions, and configure executor memory properly taught me more than any textbook. When a stage fails with an OOM error, the fix is usually better partitioning, not more memory.

Cache strategically, not everywhere. I cached the enriched dataset after feature engineering since it was reused by both the aggregation and model training stages. Caching intermediate DataFrames that are only used once just wastes memory.

This project gave me hands-on experience with the full lifecycle of a big data pipeline, from raw ingestion to real-time prediction. The biggest takeaway is that the hard part is not writing the transformations. It is understanding how data moves through the cluster and designing your pipeline so that movement is efficient.