Building an NFL Over/Under Prediction Model with Machine Learning
The NFL betting market is massive, with billions of dollars wagered annually on everything from game outcomes to point spreads. While professional sportsbooks employ teams of analysts and sophisticated models, the challenge of accurately predicting NFL spreads remains notoriously difficult.
In this project, I set out to build my own machine learning model for predicting NFL over/under outcomes, following along with a Kerry Sports Analyst YouTube tutorial as my starting point. What began as a simple replication exercise quickly evolved into a comprehensive data science project that involved web scraping, model generation, and API development.
The goal wasn’t just to build another prediction model, but to understand the entire pipeline from raw data collection to deployable predictions. Over the course of this project, I scraped 16 seasons of NFL game data (2009-2024), enhanced it with stadium-specific information, and created a simple API to make predictions accessible.
While the model’s accuracy is still being tweaked and adjusted, the process taught me valuable lessons about data science workflows, machine learning, and the challenges of sports prediction.
Data Collection
The first challenge in building any machine learning model is obtaining quality data. For NFL spread prediction, I needed historical game data including scores, point spreads, and other relevant statistics. Following the Kerry Sports Analyst tutorial, I began by scraping NFL scores and spread data from Pro-Football-Reference.com. The site provides detailed game logs for every NFL season going back to 2009, including:
- Game scores
- Offensive and defensive team stats
- Point spreads and over/under totals
Scraping Process
I implemented a two-step approach to ensure data reliability and reproducibility:
- Raw HTML Collection: First, I created a script to systematically download and save the raw HTML from Pro-Football-Reference for each season from 2009-2024. This approach had several advantages:
  - Rate limiting compliance: Avoided overwhelming the server with rapid requests
  - Data preservation: Ensured I had a local copy of the data in case the site structure changed
  - Reproducibility: Could re-parse the data multiple times without re-scraping
- Data Parsing: Using a Jupyter notebook, I then parsed the saved HTML files to extract the relevant game data and generate a clean CSV file. This separation of concerns made the process more manageable and allowed for iterative improvements to the parsing logic.
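The download step can be sketched roughly as follows. This is a minimal illustration, not the script used in the project: the URL pattern, the `raw_html` folder, the five-second delay, and the function names are all assumptions.

```python
import time
from pathlib import Path
from urllib.request import Request, urlopen

# Assumed URL pattern; adjust it to the pages actually being scraped.
URL_TEMPLATE = "https://www.pro-football-reference.com/years/{season}/games.htm"


def fetch_html(url: str) -> str:
    """Download one page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "nfl-research-script"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def download_seasons(seasons, out_dir, fetch=fetch_html, delay_seconds=5.0):
    """Save one HTML file per season, skipping seasons already on disk."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for season in seasons:
        target = out_path / f"{season}.html"
        if target.exists():
            continue  # already cached locally, so no request is needed
        html = fetch(URL_TEMPLATE.format(season=season))
        target.write_text(html, encoding="utf-8")
        saved.append(target)
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return saved
```

Calling `download_seasons(range(2009, 2025), "raw_html")` would fetch the 2009 through 2024 seasons once; later runs skip the cached files, which is what makes re-parsing possible without re-scraping.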
The scraping covered 16 seasons of NFL data (2009-2024), which was the full historical dataset available from Pro-Football-Reference. This resulted in thousands of games with comprehensive statistics, providing sufficient training data while focusing on the modern NFL era with consistent rules and playing styles.
Data Cleaning
Once I had the raw HTML data, the next step was cleaning and preparing the data for machine learning. The initial dataset required several transformations to make it usable:
Field Renaming: After using BeautifulSoup to parse the HTML and loading the data into a pandas DataFrame, several columns had generic names like ‘Unnamed: 0’ that needed to be renamed:
- Unnamed: 5 → win (game outcome)
- Unnamed: 8 → away (away team indicator)
- Similar renaming for other statistical columns to improve readability
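As a small illustration of the renaming step, the sketch below uses a toy DataFrame with made-up values. Only the two column mappings come from the project; the "W" and "@" encodings are assumptions about what the raw columns contain.

```python
import pandas as pd

# Toy frame standing in for a parsed game log; the values are made up.
raw = pd.DataFrame({
    "Week": [1, 2],
    "Unnamed: 5": ["W", "L"],
    "Unnamed: 8": ["@", None],
})

# Give the generic columns descriptive names.
games = raw.rename(columns={"Unnamed: 5": "win", "Unnamed: 8": "away"})

# Assumed encoding: "W" marks a win and "@" marks an away game.
games["win"] = (games["win"] == "W").astype(int)
games["away"] = (games["away"] == "@").astype(int)
```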
Removing Bye Weeks: NFL teams have a bye week during the regular season where they don’t play. These entries were filtered out since they don’t represent actual games and would skew the model training.
Adding Temporal Features: To help the model identify potential patterns and trends, I added several time-based columns that could serve as predictive features:
- Day of Week: Extracted from game dates to identify if games were played on Thursday, Sunday, or Monday
- After Bye Week: Created a boolean column tracking whether each team was coming off a bye week, as teams often perform differently after rest periods
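One way to drop bye rows and derive these columns with pandas is sketched below. The rows are toy data, the "Bye Week" marker in the opponent column is an assumption, and the week-gap logic assumes numeric week values (playoff labels would need separate handling).

```python
import pandas as pd

# Toy rows for one team; week 3 is a bye. Names and values are illustrative.
log = pd.DataFrame({
    "Season": [2023] * 4,
    "Team": ["AAA"] * 4,
    "Week": [1, 2, 3, 4],
    "Opp": ["BBB", "CCC", "Bye Week", "DDD"],
    "Date": ["2023-09-10", "2023-09-17", None, "2023-10-01"],
})

# Drop bye rows (assumed to be marked in the opponent column).
games = log[log["Opp"] != "Bye Week"].copy()

# Day of week as an integer (Monday=0 ... Sunday=6), plus the month.
games["Date"] = pd.to_datetime(games["Date"])
games["DayOfWeek"] = games["Date"].dt.dayofweek
games["Month"] = games["Date"].dt.month

# A team is "after bye" when its previous game was more than one week earlier.
games = games.sort_values(["Season", "Team", "Week"])
week_gap = games.groupby(["Season", "Team"])["Week"].diff()
games["AfterBye"] = (week_gap > 1).astype(int)
```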
Data Merging: One of the more complex aspects of data preparation was merging the game statistics with Vegas betting lines. The initial scraping provided game scores and team stats, but I needed to combine this with historical point spreads and betting data. This required:
- Matching games by date and team names across different data sources
- Ensuring betting line data was available for the corresponding game dates
- Dealing with missing or incomplete betting data for certain games
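A merge along these lines could look like the sketch below. The data is made up, and the choice of a left join keyed on `Date` and `Team` is an assumption about how the two sources line up, not a record of the actual notebook.

```python
import pandas as pd

# Toy game stats and betting lines; one game has no line available.
games = pd.DataFrame({
    "Date": ["2023-09-10", "2023-09-10", "2023-09-17"],
    "Team": ["AAA", "BBB", "AAA"],
    "PtsScored": [24, 17, 10],
})
lines = pd.DataFrame({
    "Date": ["2023-09-10", "2023-09-10"],
    "Team": ["AAA", "BBB"],
    "Spread": [-3.5, 3.5],
    "OU": [44.5, 44.5],
})

# Left join keeps every game; the indicator column shows which ones found a line.
merged = games.merge(
    lines, on=["Date", "Team"], how="left", indicator=True, validate="one_to_one"
)
missing_lines = merged[merged["_merge"] == "left_only"]
complete = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Inspecting `missing_lines` before dropping anything makes it easier to tell a genuinely missing line from a team-name mismatch between sources.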
These cleaning steps transformed the raw scraped data into a structured dataset ready for feature engineering and model training. The process also involved handling missing values, standardizing team names across different seasons, and ensuring consistent data types for all columns. I then saved the data to a clean CSV file with the columns below:
RangeIndex: 8690 entries, 0 to 8689
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Season 8690 non-null int64
1 Team 8690 non-null object
2 Opp 8690 non-null object
3 Away 8690 non-null int64
4 Week 8690 non-null object
5 AfterBye 8690 non-null int64
6 Playoff 8690 non-null int64
7 Day 8690 non-null object
8 DayOfWeek 8690 non-null int64
9 Date 8690 non-null object
10 Month 8690 non-null int64
11 Time 8690 non-null object
12 Win 8690 non-null int64
13 OT 8690 non-null int64
14 Rec 8690 non-null object
15 PtsScored 8690 non-null int64
16 PtsAllowed 8690 non-null int64
17 Off1stD 8690 non-null int64
18 OffTotYd 8690 non-null int64
19 OffPassY 8690 non-null int64
20 OffRushY 8690 non-null int64
21 OffTO 8690 non-null int64
22 Def1stD 8690 non-null int64
23 DefTotYd 8690 non-null int64
24 DefPassY 8690 non-null int64
25 DefRushY 8690 non-null int64
26 DefTO 8690 non-null int64
27 ExpPoints_Offense 8690 non-null float64
28 ExpPoints_Defense 8690 non-null float64
29 G# 8690 non-null int64
30 Spread 8690 non-null float64
31 OU 8690 non-null float64
32 SpreadWin 8690 non-null int64
33 OUWin 8690 non-null object
Initial Model Training
With the cleaned dataset ready, I moved to the model training phase. I loaded the generated CSV file into a Jupyter Lab notebook using pandas and began the process of preparing the data for machine learning.
Feature Selection
I started by selecting only the columns I wanted to train on, focusing on the most relevant features for spread prediction:
selected_columns = ['Season', 'Week', 'AfterBye', 'Date', 'Month', 'DayOfWeek',
'Tm_Name', 'Tm_Pts', 'Away', 'Opp_Name', 'Opp_Pts', 'Spread', 'Total']
I then performed several key data transformations:
Over/Under Classification: I added a column to track the betting outcome:
- Over/Under Result: Indicates whether the combined points scored finished over the total line, under it, or exactly on it (a push)
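The classification rule itself is simple enough to write as a small function. The function name and the "Over", "Under", and "Push" labels are illustrative choices, not necessarily the ones stored in the dataset.

```python
def over_under_result(points_scored: int, points_allowed: int, total_line: float) -> str:
    """Compare the combined score with the over/under line."""
    combined = points_scored + points_allowed
    if combined > total_line:
        return "Over"
    if combined < total_line:
        return "Under"
    return "Push"  # combined score landed exactly on the line


# Three made-up games against a 44.0 line.
results = [over_under_result(s, a, 44.0) for s, a in [(27, 20), (17, 14), (24, 20)]]
```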
Data Sorting: Sorted the data by season and then by week to ensure proper chronological order for time-series analysis.
Training Data Filtering: Applied specific filters to create the training dataset:
- Time Period: Only games from 2018 onwards to focus on recent NFL trends
- Away Games Only: Since each game appears twice in the dataset (once for each team), I filtered to only away games to avoid data duplication
- Feature Set: Trained the model on three key features:
  - Spread: The point spread for the game
  - Total: The over/under total points line
  - AfterBye: Whether the team was coming off a bye week
- Prediction Target: A binary “Under the total” label, so the model learns to identify games where the combined points scored finish under the Vegas over/under line
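Put together, the filters above might look like this in pandas. The rows are toy data and the `OUResult` column name is hypothetical; treating a push as "not under" is also an assumption.

```python
import pandas as pd

# Toy rows; each game appears twice, once from each team's perspective.
games = pd.DataFrame({
    "Season":   [2017, 2017, 2018, 2018, 2019, 2019],
    "Week":     [1, 1, 2, 2, 1, 1],
    "Away":     [1, 0, 1, 0, 1, 0],
    "Spread":   [3.0, -3.0, -6.5, 6.5, 1.5, -1.5],
    "Total":    [44.0, 44.0, 47.5, 47.5, 51.0, 51.0],
    "AfterBye": [0, 0, 1, 0, 0, 0],
    "OUResult": ["Over", "Over", "Under", "Under", "Push", "Push"],
})

# Keep 2018 onwards and only the away-team row so each game is counted once.
recent_away = games[(games["Season"] >= 2018) & (games["Away"] == 1)]
recent_away = recent_away.sort_values(["Season", "Week"])

features = ["Spread", "Total", "AfterBye"]
X = recent_away[features]
y = (recent_away["OUResult"] == "Under").astype(int)  # 1 = under, 0 = over or push
```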
Model Implementation
For the initial model, I used a K-Nearest Neighbors classifier with 7 neighbors:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=7)
Results Validation: Following Kerry’s approach, I generated a confusion matrix to visualize the model’s performance and compared my results with what Kerry achieved in his tutorial. This validation step was crucial to ensure I had correctly implemented the same methodology.
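For readers who want to reproduce the validation step, here is a minimal sketch using scikit-learn. It runs on synthetic random data, so its numbers mean nothing, and the split (fit on earlier seasons, score one held-out season) is an assumption rather than a description of the tutorial's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the sketch runs on its own; not real NFL results.
rng = np.random.default_rng(0)
n = 600
games = pd.DataFrame({
    "Season": rng.integers(2018, 2025, size=n),
    "Spread": rng.integers(-20, 21, size=n) / 2,
    "Total": rng.integers(74, 110, size=n) / 2,
    "AfterBye": rng.integers(0, 2, size=n),
    "Under": rng.integers(0, 2, size=n),
})

features = ["Spread", "Total", "AfterBye"]
test_season = 2024  # assumed split: fit on earlier seasons, score the latest one
train = games[games["Season"] < test_season]
test = games[games["Season"] == test_season]

model = KNeighborsClassifier(n_neighbors=7)
model.fit(train[features], train["Under"])
predicted = model.predict(test[features])

# Rows are actual classes (0 = not under, 1 = under); columns are predictions.
matrix = confusion_matrix(test["Under"], predicted, labels=[0, 1])
accuracy = accuracy_score(test["Under"], predicted)
```

Holding out a whole season keeps future games out of the training set, which matters more for this kind of data than a random shuffle would suggest.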
Historical Comparison: Kerry’s video only covered data up until 2023, so my 2024 predictions represented new territory not covered in the original tutorial. This provided an opportunity to test the model’s performance on completely unseen data and compare the accuracy.
Prediction Accuracy Results: The model’s performance for predicting “Under the total” across different years:
- 2021: 54% accurate
- 2022: 59% accurate
- 2023: 53% accurate
- 2024: 48% accurate
The results show the model performed reasonably well in the 2021-2023 period, with 2022 being the strongest year at 59% accuracy. The 2024 performance dropped to 48%, highlighting the challenges of sports prediction and the importance of continuous model refinement.
Enhanced Data Collection
After validating the initial model, I decided to enhance the dataset with additional environmental and venue-specific information that could potentially improve prediction accuracy. I scraped additional data from NFLWeather for each game including:
- Weather conditions: Temperature, precipitation, wind speed, and other weather factors that could affect game outcomes
- Stadium type: Whether the game was played in a dome (indoor) or open-air stadium
- Grass type: Whether the playing surface was artificial turf or natural grass
- Location: Whether games were played in the US or internationally (London, Mexico, etc.)
Once I had collected this additional information, I combined it with the existing game stats data from the previous dataset. This integration required:
- Matching games: Ensuring the weather and stadium data corresponded to the correct games in the original dataset
- Data validation: Verifying that the additional information was accurate and complete
- Feature engineering: Creating new columns to represent the environmental factors in a format suitable for machine learning
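As a rough sketch of the matching and feature engineering steps, the example below joins a stadium lookup onto game rows by home team. The lookup table, the `HomeTeam` and `IsGrass` columns, and the "Grass"/"Turf" labels are all hypothetical, and neutral-site or international games would need a different key.

```python
import pandas as pd

# Toy game rows (one game seen from both sides) and a made-up stadium lookup.
games = pd.DataFrame({
    "Team": ["AAA", "BBB"],
    "Opp": ["BBB", "AAA"],
    "Away": [1, 0],
})
stadiums = pd.DataFrame({
    "HomeTeam": ["BBB"],
    "Stadium": ["Example Field"],
    "Dome": [1],
    "Grass": ["Turf"],
})

# The venue belongs to the home side: the opponent when Away == 1, else the team itself.
games["HomeTeam"] = games["Opp"].where(games["Away"] == 1, games["Team"])

# Many game rows map to one stadium row; validate guards against duplicate lookups.
enriched = games.merge(stadiums, on="HomeTeam", how="left", validate="many_to_one")

# Encode the surface as a 0/1 feature the model can consume.
enriched["IsGrass"] = (enriched["Grass"] == "Grass").astype(int)
```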
The enhanced dataset now included both the original game statistics and the new environmental variables, providing a more comprehensive foundation for model training and potentially better prediction accuracy. The final dataset contained 4,158 entries with 53 columns, including the original game data plus the new stadium and weather information:
RangeIndex: 4158 entries, 0 to 4157
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Season 4158 non-null int64
1 Team 4158 non-null object
2 Opp 4158 non-null object
3 Away 4158 non-null int64
4 Week 4158 non-null int64
5 AfterBye 4158 non-null int64
6 Playoff 4158 non-null int64
7 Day 4158 non-null object
8 DayOfWeek 4158 non-null int64
9 Date 4158 non-null object
10 Month 4158 non-null int64
11 Time 4158 non-null object
12 Win 4158 non-null int64
13 OT 4158 non-null int64
14 Rec 4158 non-null object
15 PtsScored 4158 non-null int64
16 PtsAllowed 4158 non-null int64
17 Off1stD 4158 non-null int64
18 OffTotYd 4158 non-null int64
19 OffPassY 4158 non-null int64
20 OffRushY 4158 non-null int64
21 OffTO 4158 non-null int64
22 Def1stD 4158 non-null int64
23 DefTotYd 4158 non-null int64
24 DefPassY 4158 non-null int64
25 DefRushY 4158 non-null int64
26 DefTO 4158 non-null int64
27 ExpPoints_Offense 4158 non-null float64
28 ExpPoints_Defense 4158 non-null float64
29 G# 4158 non-null int64
30 Spread 4158 non-null float64
31 OU 4158 non-null float64
32 SpreadWin 4158 non-null int64
33 OUWin 4158 non-null object
34 Stadium 4158 non-null object
35 City 4158 non-null object
36 State 4158 non-null object
37 International 4158 non-null int64
38 Grass 4158 non-null object
39 Dome 4158 non-null int64
40 Capacity 4158 non-null int64
41 Address 4158 non-null object
42 Zipcode 4158 non-null int64
43 kickoffWeatherOverview 4158 non-null object
44 kickoffWeatherTemp 4158 non-null int64
45 kickoffWeatherAirSpeed 4158 non-null int64
46 kickoffWeatherAirGust 4158 non-null int64
47 kickoffWeatherAirDir 4158 non-null object
48 kickoffWeatherPrec 4158 non-null int64
49 kickoffWeatherCloudCover 4158 non-null int64
50 kickoffWeatherHumidity 4158 non-null int64
51 kickoffWeatherDewPoint 4158 non-null int64
52 kickoffWeatherVisability 4158 non-null int64
Model Retraining
With the enhanced dataset now including environmental and venue information, I retrained the model using the same K-Nearest Neighbors technique but with an expanded feature set. The new model incorporated the following features:
- Spread: The point spread for the game
- Total: The over/under total points line
- AfterBye: Whether the team was coming off a bye week
- Dome: Whether the game was played in a dome (indoor) stadium
Enhanced Model Results: The retrained model’s performance for predicting “Under the total” across different years:
- 2021: 56% accurate
- 2022: 57% accurate
- 2023: 50% accurate
- 2024: 47% accurate
The addition of the dome feature produced mixed results. Accuracy rose by 2 percentage points in 2021 (54% to 56%) but fell in every other year: 2022 slipped from 59% to 57%, 2023 from 53% to 50%, and 2024 from 48% to 47%. The 2024 figure is well below the profitable threshold, which is roughly 52.4% at standard -110 odds. This suggests that while environmental factors may provide some predictive value, they are not sufficient to overcome the inherent challenges of NFL prediction, particularly in more recent seasons.
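The profitability threshold can be worked out directly from the odds. The helper below is an illustrative calculation that assumes standard American odds and treats the reported accuracy as the win rate on bets placed, which is a simplification.

```python
def break_even_win_rate(american_odds: int) -> float:
    """Win rate needed to break even when every bet is placed at the given odds."""
    if american_odds < 0:
        risk, profit = -american_odds, 100  # e.g. -110: risk 110 to win 100
    else:
        risk, profit = 100, american_odds   # e.g. +120: risk 100 to win 120
    return risk / (risk + profit)


threshold = break_even_win_rate(-110)  # 110 / 210, about 0.5238
```

At -110 a bettor risks 110 to win 100, so the break-even rate is 110 / (110 + 100), or about 52.4%. Measured against that bar, the 2021 and 2022 results clear it while 2023 and 2024 do not.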
API Development
To make the model easily accessible for generating 2025 predictions, I created a simple Flask API server. This provided a convenient way to run the model without needing to load Jupyter notebooks or manage the data pipeline manually.
The Flask API accepts POST requests to the /games endpoint with an array of game objects. Each game object contains the following parameters:
- Game: The matchup (e.g., “Cowboys @ Eagles”)
- Spread: The point spread for the game
- Total: The over/under total points line
- AfterBye: Whether the team is coming off a bye week (0 or 1)
- Dome: Whether the game is played in a dome stadium (0 or 1)
Request Example:
curl -X POST "http://localhost:7200/games" \
  -H "Content-Type: application/json" \
  -d '{
    "games": [
      {
        "AfterBye": 0,
        "Dome": 0,
        "Game": "Cowboys @ Eagles",
        "Spread": -8.5,
        "Total": 48.5
      }
    ]
  }'
Response Example:
{
  "games": [
    {
      "Game": "Cowboys @ Eagles",
      "Prediction": "Under"
    }
  ]
}
This API design allows for easy integration with other applications or manual testing of specific game scenarios. The JSON format makes it simple to build frontend interfaces or integrate with other systems for automated prediction workflows.
2025 Week 1 Predictions
Below are the games my adjusted ML model predicts will finish under the Vegas total. The betting lines were pulled from FanDuel and represent the current spreads and totals for Week 1 of the 2025 NFL season.
Game | Spread | Total |
---|---|---|
Cowboys @ Eagles | -8.5 | 48.5 |
Chiefs @ Chargers | 3.0 | 46.5 |
Giants @ Commanders | -5.5 | 45.5 |
Cardinals @ Saints | 6.5 | 42.5 |
Bengals @ Browns | 5.5 | 47.5 |
Titans @ Broncos | -8.5 | 42.5 |
Texans @ Rams | -3.0 | 43.5 |
Ravens @ Bills | 1.5 | 50.5 |
Vikings @ Bears | 1.5 | 43.5 |
These predictions represent 9 out of the 16 total Week 1 games. The remaining 7 games did not meet the model’s confidence threshold for making predictions. As with any sports prediction model, these should be used for informational purposes and not as guaranteed betting advice.
Acknowledgment
I’d like to acknowledge Kerry Sports Analyst for his excellent YouTube tutorials on building predictive models. His videos have been incredibly useful for learning the fundamentals of sports data analysis and machine learning implementation. For anyone interested in this field, I highly recommend not only checking out his predictive model videos but also exploring his content on identifying trends and patterns in sports data. His approach to breaking down complex concepts into digestible tutorials made this project much more accessible.