Matthew-Rende.com - Blog - Hockey Stats Pipeline

Hockey Stats Pipeline Blog Post

When I have time, I enjoy watching sports—but I can’t help getting distracted by the stats. The moment a commentator drops a stat, my brain shifts gears. I start wondering how they found that number, what the dataset looks like, how they identified a specific streak, and why they only considered games since a certain date. Eventually, I lose track of the game entirely and go full data-nerd mode.

That curiosity led me to build a simple NHL stats database focused on some of my favorite players. I wanted this project to be approachable, something that folks with a basic SQL background could follow and build on. Ideally, the database could even help someone dominate their fantasy hockey league. I don’t have the time to commit to a fantasy hockey league myself (I stick to fantasy football: one transaction a week, thank you very much), but the data possibilities in hockey are too interesting to ignore.

I built the database in PostgreSQL using a dimensional model, with dimension tables for teams and dates and fact tables for team game logs and individual player scoring logs. Huge shout-out to Hockey-Reference.com for making their data available. The raw data needed some cleaning, which I handled in Python. One challenge was the description field, which initially looked like:
“Goal Scorer: Connor McDavid Assisted by Leon Draisaitl Assisted by Mattias Ekholm.”
To analyze this properly, I needed to break each component into separate fields. (Hockey-Reference has since improved their data format, but at the time, this transformation was on me.) You can check out the parsing and transformation code here:
👉 Notebook: Parsing Player Scoring Descriptions
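
For readers who want a feel for that transformation without opening the notebook, here is a minimal sketch in Python. It is not the code from the notebook: the function name and the output keys (goal_scorer, assist_1, assist_2) are made up for illustration, and the format it assumes is inferred only from the single example string above.

```python
import re

# Assumed format, inferred from the one example in the post:
#   "Goal Scorer: <name> Assisted by <name> Assisted by <name>."
# The real data may contain variants this sketch does not handle.
PREFIX = "Goal Scorer:"
ASSIST_SEPARATOR = re.compile(r"\s+Assisted by\s+")


def parse_scoring_description(description):
    """Split a scoring description into scorer and up to two assists.

    The function name and the returned keys are hypothetical.
    """
    # Remove surrounding whitespace and straight or curly quotes.
    text = description.strip().strip("\"“”").strip()
    # Drop the trailing period. Caveat: this would also remove a
    # period that belongs to a name suffix such as "Jr.".
    text = text.rstrip(".").strip()

    if not text.startswith(PREFIX):
        raise ValueError(f"Unexpected description format: {description!r}")

    body = text[len(PREFIX):].strip()
    parts = [p.strip() for p in ASSIST_SEPARATOR.split(body)]
    scorer, assists = parts[0], [p for p in parts[1:] if p]

    return {
        "goal_scorer": scorer,
        "assist_1": assists[0] if len(assists) > 0 else None,
        "assist_2": assists[1] if len(assists) > 1 else None,
    }


example = (
    "Goal Scorer: Connor McDavid Assisted by Leon Draisaitl "
    "Assisted by Mattias Ekholm."
)
print(parse_scoring_description(example))
```

An unassisted goal (a description with no "Assisted by" part) simply comes back with both assist fields set to None, which maps cleanly onto nullable columns in a fact table.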

Most of the time, I worked locally, but I also ran the pipeline on Azure using Databricks to test portability and performance. That version is available here:
👉 Notebook: Transforming Data on Azure Databricks

To keep things fun and digestible, I created a small notebook to highlight some findings, focusing on Jack Hughes, a standout American player who was still young as of 2023. The notebook answers a few questions with charts and basic analysis. You can check it out here:
👉 Notebook: Jack Hughes Analysis
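
To give a sense of the kind of question a dimensional model makes easy, here is a small, hedged sketch of a goals-and-assists-per-season query. Everything in it is an assumption made for illustration: the table and column names (dim_team, dim_date, fact_player_scoring, and so on) are hypothetical, the rows are placeholders rather than real stats, and Python's built-in sqlite3 stands in for PostgreSQL so the example runs anywhere.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL one.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified star schema: two dimensions, one fact table.
conn.executescript("""
CREATE TABLE dim_team (
    team_id   INTEGER PRIMARY KEY,
    team_name TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_id   INTEGER PRIMARY KEY,
    game_date TEXT NOT NULL,
    season    TEXT NOT NULL
);
CREATE TABLE fact_player_scoring (
    scoring_id  INTEGER PRIMARY KEY,
    date_id     INTEGER NOT NULL REFERENCES dim_date(date_id),
    team_id     INTEGER NOT NULL REFERENCES dim_team(team_id),
    goal_scorer TEXT NOT NULL,
    assist_1    TEXT,
    assist_2    TEXT
);
""")

# Placeholder rows only; these are NOT real players or real results.
conn.execute("INSERT INTO dim_team VALUES (1, 'Team X')")
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2023-01-05", "2022-23"), (2, "2023-10-12", "2023-24")],
)
conn.executemany(
    "INSERT INTO fact_player_scoring VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "Player A", "Player B", None),
        (2, 1, 1, "Player B", "Player A", "Player C"),
        (3, 2, 1, "Player A", None, None),
    ],
)


def points_by_season(connection, player):
    """Return (season, goals, assists) rows for one player."""
    sql = """
        SELECT d.season,
               SUM(CASE WHEN f.goal_scorer = :p THEN 1 ELSE 0 END) AS goals,
               SUM(CASE WHEN f.assist_1 = :p OR f.assist_2 = :p
                        THEN 1 ELSE 0 END) AS assists
        FROM fact_player_scoring AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        WHERE :p IN (f.goal_scorer, f.assist_1, f.assist_2)
        GROUP BY d.season
        ORDER BY d.season
    """
    return connection.execute(sql, {"p": player}).fetchall()


for season, goals, assists in points_by_season(conn, "Player A"):
    print(season, goals, assists)
```

The point of the shape is that the fact table stays narrow (one row per goal) while anything about "when" lives in the date dimension, so grouping by season, month, or weekday is just a different column in the same join.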

This project was a blast, and a few hockey friends even reached out after watching the YouTube videos. Lately, I’ve been watching more baseball—and wow, the data volume is next level. I read that each MLB game generates 7 terabytes of data. That’s mind-blowing. While I’ll continue to get lost in the numbers, I won’t be building a baseball database anytime soon.

Link to the Repo: Here