MLB Statcast Real-Time Data Pipeline
Washington University in St. Louis, Fall 2025
This project implements a real-time data pipeline for MLB Statcast pitch data. It was developed for CSE 5114: Data Manipulation and Management at Scale at Washington University in St. Louis.
Project Overview
The pipeline ingests, processes, and visualizes MLB pitch-by-pitch data from Baseball Savant’s Statcast system. It demonstrates modern data engineering practices including ETL automation, data warehousing, stream processing, and interactive analytics.
Architecture
The system consists of several integrated components:
- Data Ingestion: Automated fetching of Statcast pitch data from MLB’s Baseball Savant API
- ETL Pipeline: Apache Airflow DAGs for daily and historical backfill processing
- Data Warehouse: Snowflake for scalable storage and analytics
- Stream Processing: Kafka-based streaming that simulates a live pitch feed
- Dashboard: Interactive Streamlit application for pitch analysis and visualization
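One way to picture the stream-simulation component is as a replay of stored pitches in game order, with each pitch serialized as a keyed message. The sketch below is an assumption about how such a replay could be structured, not the project's actual code: the function name `replay_pitches` and the sample rows are hypothetical, and no Kafka client is involved. In a real setup, each (key, value) pair would be handed to a Kafka producer.

```python
import json
from typing import Iterable, Iterator, Tuple


def replay_pitches(pitches: Iterable[dict]) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) message pairs in game order.

    Hypothetical helper: pitches are sorted by game, at-bat, and pitch
    number so that consumers see them in the order they were thrown.
    """
    ordered = sorted(
        pitches,
        key=lambda p: (p["game_pk"], p["at_bat_number"], p["pitch_number"]),
    )
    for pitch in ordered:
        # Keying by game keeps all pitches of one game together,
        # which matters if the topic is partitioned by key.
        key = str(pitch["game_pk"])
        value = json.dumps(pitch, sort_keys=True)
        yield key, value


# Made-up sample rows, deliberately out of order.
sample = [
    {"game_pk": 1, "at_bat_number": 2, "pitch_number": 1, "pitch_type": "SL"},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 2, "pitch_type": "FF"},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 1, "pitch_type": "CH"},
]
messages = list(replay_pitches(sample))
```

Sorting before emitting is the key design choice here: a simulated feed is only useful to downstream consumers if it preserves the order in which pitches occurred.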
Key Features
- Composite Key Design: Unique pitch identification using game_pk, at_bat_number, and pitch_number
- Automated ETL: Airflow DAGs for daily incremental loads and historical backfills
- Real-Time Simulation: Kafka-based streaming for live pitch data simulation
- Interactive Analytics: Streamlit dashboard with pitch visualizations, player statistics, and game analysis
- Scalable Storage: Snowflake data warehouse with optimized schema design
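The composite key is what lets repeated loads stay free of duplicates, for example when a historical backfill overlaps a daily load. The following is a minimal sketch of that idea in plain Python, assuming rows arrive as dictionaries. The helper names `pitch_key` and `dedupe_pitches` and the sample values are hypothetical, and the project may well enforce uniqueness in Snowflake instead.

```python
from typing import Dict, Iterable, List, Tuple

# (game_pk, at_bat_number, pitch_number) identifies one pitch.
PitchKey = Tuple[int, int, int]


def pitch_key(row: dict) -> PitchKey:
    """Build the composite key for a single pitch row."""
    return (row["game_pk"], row["at_bat_number"], row["pitch_number"])


def dedupe_pitches(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per composite key; a later row replaces an earlier one."""
    unique: Dict[PitchKey, dict] = {}
    for row in rows:
        unique[pitch_key(row)] = row
    return list(unique.values())


# Made-up rows: the third repeats the key of the first with a revised value.
rows = [
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 1, "release_speed": 94.1},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 2, "release_speed": 88.3},
    {"game_pk": 1, "at_bat_number": 1, "pitch_number": 1, "release_speed": 94.2},
]
deduped = dedupe_pitches(rows)
```

Letting the later row win mirrors an upsert: if the same pitch is loaded twice, the most recent version is the one that remains.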
Technologies Used
- Python
- Apache Airflow
- Snowflake
- Apache Kafka
- Streamlit
- Pandas
- pybaseball (Statcast API)
