Data Engineering
Big Data Pipeline
End-to-end cryptocurrency analysis with real-time streaming
PySpark · Kafka · HDFS · MLlib · MongoDB · PostgreSQL
Overview
An end-to-end big data pipeline for cryptocurrency market analysis, built as part of MSc coursework at the University of London.
Pipeline Architecture
The pipeline consists of 9 interconnected stages:
- Data Ingestion: CSV uploads to HDFS, Spark-based cleaning and validation, partitioned Parquet output
- Exploratory Analysis: Spark SQL and DataFrame API for summary statistics and cross-coin correlation
- Feature Engineering: Technical indicators (SMA, EMA, RSI, MACD, Bollinger Bands)
- ML Classification: Logistic Regression, Random Forest, and GBT classifiers using MLlib
- ML Clustering: K-Means and Bisecting K-Means for market regime detection
- ML Regression: Volatility prediction with Linear Regression and GBT regressors
- Recommender System: Content-based similarity + ALS collaborative filtering
- Real-time Streaming: Kafka producer/consumer with Spark Structured Streaming
- Database Integration: MongoDB for document storage, PostgreSQL for structured results
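The cross-coin correlation in the exploratory stage boils down to Pearson's r (in the pipeline itself this would typically be a Spark `corr` aggregate); a minimal standalone sketch of the underlying formula:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length price/return series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly co-moving series correlate at 1.0, inverse series at -1.0.
```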
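The indicator math behind the feature-engineering stage can be sketched in plain Python (Spark-free for brevity; the window lengths and the simplified RSI averaging here are illustrative assumptions, not the project's exact implementation, which would normally use Spark window functions):

```python
def sma(prices, window):
    """Simple moving average; None until the window fills."""
    return [None if i + 1 < window
            else sum(prices[i + 1 - window:i + 1]) / window
            for i in range(len(prices))]

def rsi(prices, period=14):
    """Relative Strength Index from closing prices.

    Uses a plain average of gains/losses over the first `period`
    changes (Wilder's smoothing omitted for brevity).
    """
    gains, losses = [], []
    for prev, cur in zip(prices, prices[1:]):
        change = cur - prev
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    if avg_loss == 0:
        return 100.0  # all gains, no losses
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A strictly rising series pins RSI at 100; a series alternating equal gains and losses yields 50.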
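The volatility target for the regression stage is commonly the rolling standard deviation of returns; a minimal sketch (the window size and the use of simple rather than log returns are assumptions):

```python
from math import sqrt

def returns(prices):
    """Simple period-over-period returns."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]

def rolling_volatility(prices, window):
    """Population std-dev of returns over a trailing window; None until it fills."""
    r = returns(prices)
    out = []
    for i in range(len(r)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = r[i + 1 - window:i + 1]
            mean = sum(chunk) / window
            out.append(sqrt(sum((x - mean) ** 2 for x in chunk) / window))
    return out
```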
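The content-based half of the recommender reduces to ranking coins by feature-vector similarity; a minimal cosine-similarity sketch (the coin names and feature vectors below are made up for illustration):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(target, candidates, k=3):
    """Return the k candidate coins most similar to the target vector."""
    scored = sorted(candidates.items(),
                    key=lambda kv: cosine(target, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]
```

The ALS half, by contrast, learns latent factors from user-coin interactions via MLlib's `ALS` estimator rather than from hand-built features.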
Academic Context
MSc Data Science & AI, University of London (DSM010 - Big Data).