In March of 2024, I took a break from social media (Twitter/Instagram/TikTok/etc.) because the polarizing algorithms and toxicity were driving me nuts. I took to LinkedIn in search of a community of professional data nerds who develop and learn in their free time. Luckily, I found some great people on there, but my feed was quickly inundated with LinkedInfluencers reposting the same content, whether recycling their own posts, reposting others', or generating them with AI. Then, in October, I heard that data people were all jumping over to Bluesky, so I thought I'd give it a shot. Needless to say, I love the data and developer community here, and the fundamental design and decentralization of the platform make it that much cooler. Then I learned about WhiteWind and some other integrations, and decided this is a great way to post updates and keep myself accountable.
A little about me: I'm a Tulane University graduate (BSM '20, MS '21) and currently a Data (Analytics) Engineer at GM. In my current role I wear many hats! In my 3.5 years at General Motors I've gained experience designing database architecture, building ETL pipelines, being an IT admin, owning products, managing projects, automating tests, and providing training/mentorship for Power BI/Databricks; however, most of my experience is in BI reporting and legacy application upgrades.
So, this past April I decided I wanted to really learn the ins and outs of modern data engineering and data science. I bought a MacBook Air and started coding! From my Master's degree at Tulane, I had experience coding in R, Python, and SQL, as well as the statistical knowledge needed for machine learning. Instead of starting there, I began with zsh/bash scripting (using the Terminal to do everything I could), then learned the basics of git for version control, then jumped into dbt for data modeling, and finally Docker so I could understand containers. Following that, I began Harvard's CS50p course in Python programming. I wanted to understand Python, the programming language, because I had previously only learned Python in Jupyter notebooks rather than through scripts, packages, testing, etc.
In the few months since then, I've worked on a few public projects, which you can view on my GitHub. These projects are pretty varied, but each is narrow in scope.
- DataCamp projects based on simple ETL, analytics, or programming.
- My CS50p lecture notes and problem set submissions.
- A simple Introduction to DuckDB using the NFL Big Data Bowl CSV files (a quick sketch of that kind of query is just after this list).
- A private repository with bash scripts and configuration files for starting up my AWS EC2 instance, connecting via ssh, Docker containers, Quarto, and the GitHub CLI.
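For context, the DuckDB intro project mostly boils down to queries like the one below. This is a minimal sketch rather than the repo's actual code, and `plays.csv` is just a placeholder for whichever Big Data Bowl file you point it at.

```python
import duckdb

# Connect to an in-memory DuckDB database (no server required).
con = duckdb.connect()

# read_csv_auto infers column names and types directly from the file.
# "plays.csv" is a placeholder for one of the Big Data Bowl CSVs.
con.execute("""
    CREATE OR REPLACE TABLE plays AS
    SELECT * FROM read_csv_auto('plays.csv')
""")

# Quick sanity checks: row count and the inferred schema.
print(con.execute("SELECT COUNT(*) FROM plays").fetchone())
print(con.execute("DESCRIBE plays").df())
```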
More recently, I decided to really test my data engineering and science skills by participating in the NFL Big Data Bowl 2025.
Currently, I've made solid progress with my initial exploratory data analysis and project configuration. Here are some quick notes about the setup of my project environment (from IDE to tools/versions).
- Using VS Code with the Jupyter, Jinja, YAML, Quarto (for notes/project submissions), and dbt extensions.
- DuckDB is my primary database tool (for now), with dbt for the data modeling.
- Then, I'm using Python and Jupyter notebooks for the analysis/ML component.
The reason I may switch to PostgreSQL for the primary database is just to gain experience using DuckDB as a DEV environment and Postgres for PROD. Realistically, however, for the scope of this project DuckDB accomplishes everything I need it to.
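Either way, the notebook side of the stack just opens the database that dbt builds and pulls a modeled table into pandas. Here's a minimal sketch assuming a DuckDB target; the file path and table name are placeholders, not the project's actual names.

```python
import duckdb

# Open the DuckDB file that dbt-duckdb writes its models into.
# "nfl.duckdb" and "gold.play_summary" are placeholder names for this sketch.
con = duckdb.connect("nfl.duckdb", read_only=True)

# Pull a gold-layer table straight into a pandas DataFrame for the analysis/ML work.
plays_df = con.execute("SELECT * FROM gold.play_summary").df()

print(plays_df.shape)
print(plays_df.head())
```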
For the foreseeable future, the only side project I'll be working on is this one, so my next few posts will focus on project progress and my thoughts about the Big Data Bowl. Feel free to check out the GitHub repository where I'm saving my work.
Some notes about my current project progress:
- The project folder has a few subdirectories, including nfl_dbt, which is the dbt project folder.
- The raw data came in the form of 13 CSVs from Kaggle, 4 of which are 50 MB or less and 9 of which are ~1 GB each.
- I'm using Databricks' "Medallion Architecture" to guide my data modeling workflow.
- I built the initial dbt models using DuckDB as the DEV target (with 4 threads enabled) and loaded the "bronze" schema, which contains the 13 raw tables.
- I aggregated the data into the "silver" schema, which contains an aggregated play data table (a rough sketch of that kind of aggregation is just after this list).
- I further aggregated the data into the "gold" schema, which provides basic analytic tables.
- Most recently, I completed an initial analysis in an EDA notebook, where I compared a LinearRegression and a KNN model for predicting play outcomes from pre-snap play data (a sketch of the modeling setup is also below).
- I settled on a KNN model, but I'm only seeing about a 61.1% accuracy rate (confusion matrix and explanation below).
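As referenced above, the silver-layer aggregation has roughly the following shape. This is a hedged sketch written as a plain DuckDB query run from Python rather than the actual dbt model, and the table/column names are illustrative stand-ins for the project's real schema.

```python
import duckdb

con = duckdb.connect("nfl.duckdb")  # placeholder path to the dbt-built database

con.execute("CREATE SCHEMA IF NOT EXISTS silver")

# Roughly the shape of the silver-layer model: one row per play, joining a
# bronze plays table to per-play tracking aggregates. Names are illustrative.
con.execute("""
    CREATE OR REPLACE TABLE silver.play_agg AS
    SELECT
        p.gameId,
        p.playId,
        p.possessionTeam,
        p.absoluteYardlineNumber,
        p.offenseFormation,
        p.yardsGained,
        AVG(t.s) AS avg_speed,  -- mean tracked player speed on the play
        MAX(t.s) AS max_speed   -- fastest tracked player speed on the play
    FROM bronze.plays AS p
    LEFT JOIN bronze.tracking AS t
        ON p.gameId = t.gameId AND p.playId = t.playId
    GROUP BY ALL
""")
```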
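And here is roughly what the KNN side of the EDA notebook looks like: a minimal scikit-learn sketch, assuming the pre-snap features have already been modeled into the gold schema. The file, table, and column names ("nfl.duckdb", "gold.play_summary", "playOutcome", etc.) are placeholders rather than the notebook's actual names.

```python
import duckdb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the modeled play data (placeholder file/table names, as above).
plays_df = duckdb.connect("nfl.duckdb", read_only=True).execute(
    "SELECT * FROM gold.play_summary"
).df()

# One-hot encode the categorical feature; numeric columns pass through as-is.
features = pd.get_dummies(plays_df[["absoluteYardlineNumber", "offenseFormation"]])
target = plays_df["playOutcome"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# KNN is distance-based, so scaling the features matters.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```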
So, I'm at a bit of a crossroads, with a few ways forward. It may be simpler (for the initial project/submission) to build a linear regression model that takes pre-snap play data as features and predicts yards gained (or lost) as the output. Conversely, if I stick with the KNN model, I'll need to make some changes. The majority of the outputs are either Gain or Completed, which refer to a positive rushing play and a completed pass, respectively. The issue is that the model overwhelmingly predicts those values, but fails to accurately predict things like Touchdowns, Sacks, or Interceptions.
So, I may need to limit the possible play outcomes, or at least combine some categories (e.g. Turnover for Fumble + Interception); a quick sketch of that grouping is below. Or I could add some more pre-snap data, such as down and distance (I currently only use starting yard line, along with categorical data). If you made it this far, thank you! Below is the confusion matrix output from my current KNN model. I'll add some hashtags at the end as an experiment too, because I'm not sure if that will help with post discoverability and/or integrate with Bluesky feeds.
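For what it's worth, the category consolidation would look something like the sketch below. The outcome labels are the ones mentioned in this post, and the grouping itself is just one possible choice, not a final decision.

```python
import pandas as pd

# Tiny illustrative frame; in the project this column would come from the gold schema.
plays_df = pd.DataFrame(
    {"playOutcome": ["Gain", "Completed", "Fumble", "Interception", "Sack", "Touchdown"]}
)

# Collapse rare outcomes into broader buckets before refitting the model.
# Additional groupings (e.g. for Sacks or Touchdowns) could follow the same pattern.
outcome_map = {
    "Fumble": "Turnover",
    "Interception": "Turnover",
}

# Outcomes not in the map (Gain, Completed, etc.) keep their original label.
plays_df["outcome_grouped"] = (
    plays_df["playOutcome"].map(outcome_map).fillna(plays_df["playOutcome"])
)

print(plays_df["outcome_grouped"].value_counts())
```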
#DataBS #DataScience #DataEngineering #Python #MachineLearning #NFL #Statistics #DuckDB