Data Engineering

Big Data Pipeline

End-to-end cryptocurrency analysis with real-time streaming

PySpark · Kafka · HDFS · MLlib · MongoDB · PostgreSQL

Overview

An end-to-end big data pipeline for cryptocurrency market analysis, built as part of MSc coursework at the University of London.

Pipeline Architecture

The pipeline consists of nine interconnected stages:

  1. Data Ingestion: CSV uploads to HDFS, Spark-based cleaning and validation, partitioned Parquet output
  2. Exploratory Analysis: Spark SQL and DataFrame API for summary statistics and cross-coin correlation
  3. Feature Engineering: Technical indicators (SMA, EMA, RSI, MACD, Bollinger Bands)
  4. ML Classification: Logistic Regression, Random Forest, and GBT classifiers using MLlib
  5. ML Clustering: K-Means and Bisecting K-Means for market regime detection
  6. ML Regression: Volatility prediction with Linear Regression and GBT regressors
  7. Recommender System: Content-based similarity + ALS collaborative filtering
  8. Real-time Streaming: Kafka producer/consumer with Spark Structured Streaming
  9. Database Integration: MongoDB for document storage, PostgreSQL for structured results
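The technical indicators in stage 3 (SMA, EMA, RSI) are computed with Spark window functions in the pipeline, but the underlying formulas are standard and can be sketched without Spark. A minimal plain-Python version, with illustrative window sizes:

```python
def sma(prices, window):
    """Simple moving average: mean of the last `window` prices."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]

def ema(prices, window):
    """Exponential moving average with smoothing factor 2 / (window + 1)."""
    alpha = 2 / (window + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def rsi(prices, window=14):
    """Relative Strength Index from simple average gains and losses."""
    gains, losses = [], []
    for prev, cur in zip(prices, prices[1:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:window]) / window
    avg_loss = sum(losses[:window]) / window
    if avg_loss == 0:
        return 100.0  # no losses in the window: maximally overbought
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)
```

In the actual pipeline these rolling computations map naturally onto `pyspark.sql.Window` partitioned by coin and ordered by timestamp.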
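Stage 7 pairs ALS collaborative filtering with content-based similarity. The content-based half can be sketched as cosine similarity between per-coin feature vectors; the coin names and vectors below are illustrative, not taken from the pipeline:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(target, catalogue):
    """Rank coins by feature-vector similarity to `target` (hypothetical helper)."""
    scores = {
        name: cosine_similarity(catalogue[target], vec)
        for name, vec in catalogue.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Because similarity only depends on direction, feature vectors are usually standardised first so no single indicator dominates the ranking.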
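Stage 5 uses MLlib's K-Means at scale; the update rule it iterates can be shown on plain Python lists. A minimal sketch of one assign/update round, with toy two-dimensional points standing in for market-feature vectors:

```python
def assign(points, centroids):
    """Assign each point to its nearest centroid by squared distance."""
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [
        min(range(len(centroids)), key=lambda j: d2(p, centroids[j]))
        for p in points
    ]

def update(points, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids
```

Iterating these two steps until the assignments stop changing is the whole algorithm; Bisecting K-Means instead splits one cluster at a time with the same update rule, which often behaves better on skewed market data.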

Academic Context

MSc Data Science & AI, University of London (DSM010 - Big Data).