Data Engineering
Big Data Pipeline
End-to-end cryptocurrency analysis with real-time streaming
PySpark · Kafka · HDFS · MLlib · MongoDB · PostgreSQL
Overview
An end-to-end big data pipeline for cryptocurrency market analysis, built as part of MSc coursework at the University of London.
Pipeline Architecture
The pipeline consists of 9 interconnected stages:
- Data Ingestion: CSV uploads to HDFS, Spark-based cleaning and validation, partitioned Parquet output
- Exploratory Analysis: Spark SQL and DataFrame API for summary statistics and cross-coin correlation
- Feature Engineering: Technical indicators (SMA, EMA, RSI, MACD, Bollinger Bands)
- ML Classification: Logistic Regression, Random Forest, and GBT classifiers using MLlib
- ML Clustering: K-Means and Bisecting K-Means for market regime detection
- ML Regression: Volatility prediction with Linear Regression and GBT regressors
- Recommender System: Content-based similarity + ALS collaborative filtering
- Real-time Streaming: Kafka producer/consumer with Spark Structured Streaming
- Database Integration: MongoDB for document storage, PostgreSQL for structured results
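The cross-coin correlation in the exploratory stage boils down to Pearson's r (in the pipeline itself this would typically be a Spark `corr` aggregate); a minimal standalone sketch of the underlying formula:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length price/return series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly co-moving series correlate at 1.0, inverse series at -1.0.
```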
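The indicator math behind the feature-engineering stage can be sketched in plain Python (Spark-free for brevity; the window lengths and the simplified RSI averaging here are illustrative assumptions, not the project's exact implementation, which would normally use Spark window functions):

```python
def sma(prices, window):
    """Simple moving average; None until the window fills."""
    return [None if i + 1 < window
            else sum(prices[i + 1 - window:i + 1]) / window
            for i in range(len(prices))]

def rsi(prices, period=14):
    """Relative Strength Index from closing prices.

    Uses a plain average of gains/losses over the first `period`
    changes (Wilder's smoothing omitted for brevity).
    """
    gains, losses = [], []
    for prev, cur in zip(prices, prices[1:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    if avg_loss == 0:
        return 100.0  # all gains, no losses
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A strictly rising series pins RSI at 100; a series alternating equal gains and losses yields 50.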
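The volatility target for the regression stage is commonly the rolling standard deviation of returns; a minimal sketch (the window size and the use of simple rather than log returns are assumptions):

```python
from math import sqrt

def returns(prices):
    """Simple period-over-period returns."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]

def rolling_volatility(prices, window):
    """Population std-dev of returns over a trailing window; None until it fills."""
    r = returns(prices)
    out = []
    for i in range(len(r)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = r[i + 1 - window:i + 1]
            mean = sum(chunk) / window
            out.append(sqrt(sum((x - mean) ** 2 for x in chunk) / window))
    return out
```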
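The content-based half of the recommender reduces to ranking coins by feature-vector similarity; a minimal cosine-similarity sketch (the coin names and feature vectors below are made up for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(target, candidates, k=3):
    """Return the k candidate coins most similar to the target vector."""
    scored = sorted(candidates.items(),
                    key=lambda kv: cosine(target, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]
```

The ALS half, by contrast, learns latent factors from user-coin interactions via MLlib's `ALS` estimator rather than from hand-built features.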
Academic Context
MSc Data Science & AI, University of London (DSM010 - Big Data).