Predicting NFL Offensive Play-Calling With Python


Defensive coordinators are constantly trying to predict the opposing offense’s next move. Meanwhile, offensive play-callers look to maximize their unpredictability without sacrificing efficiency.

It’s like a game of chess. Sometimes, strategists get too cute and end up humiliating themselves. I just saw the Carolina Panthers run a fake reverse toss out of the wildcat formation on fourth and inches instead of giving the ball to their 6’5, 245-pound quarterback. Occasionally these trick plays pan out perfectly. The Eagles and Saints have both won a Super Bowl off of gutsy coaching decisions. Football is unpredictable by nature. It’s not possible to always foresee your opponent’s plays, but you can try to maximize the accuracy of your predictions.

In this project, I’ll train a simple machine learning model to predict offensive play-calls. It will try to forecast whether the upcoming offensive play will be a pass or a run. Let’s get started.

First, we need to load our dataset. I scraped a play-by-play dataset using the nflscrapR package and saved it to a CSV file, which I can load into a data frame after importing the libraries I’ll need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.metrics import accuracy_score

df = pd.read_csv("pbp.csv")

Now we have a data frame (df) with every NFL play since the 2009 season. It contains over 250 different columns and nearly 500,000 rows. That’s far more than we’ll need (or want to deal with), so let’s keep only the columns we’ll actually use for our model. I’ve also found that some rows of this dataset contain blank or null values, so we’ll head off that problem by removing those rows now.

df = df[['game_date', 'posteam','yardline_100','quarter_seconds_remaining','half_seconds_remaining','game_seconds_remaining','qtr','down','goal_to_go','ydstogo','play_type','score_differential','shotgun','no_huddle']]

df = df.dropna()

We’ve reduced our data frame (df) to just fourteen columns. The first two (game_date and posteam) will be used to filter the data frame further, eleven will serve as our model’s features (the independent variables), and play_type will be the label we’re trying to predict.

Now, we need to separate ‘df’ into a training dataset and a testing dataset. The training dataset is the information the model learns from and fits its algorithm to. The testing dataset is the information we feed into the fitted algorithm to measure its accuracy. The two must be kept separate because a model has to be evaluated on data it wasn’t trained on. Both datasets include the features (the eleven independent variables) and the label (the dependent variable, i.e. the play-type). We’ll compare the model’s predicted labels for the testing dataset with the actual labels.

We’ll have to make a few decisions on how we’ll organize our data. First of all, we’ll try to predict play-calling from a certain team. After all, football games are played between two teams. It wouldn’t make sense to try and predict the decision making of the Buffalo Bills by using data from the Seattle Seahawks. I’ll arbitrarily choose the New Orleans Saints to be the team of interest in this exercise. The Saints’ offense has consistently been one of the very best in the league over the past thirteen years, thanks to Hall of Fame quarterback Drew Brees and an offensive mastermind in Sean Payton. I’ll use Saints plays from the 2009-2017 seasons as the training dataset and the 2018 season will serve as our testing dataset. The 2018 season has already occurred, so we’ll be able to compare our model’s projections to the actual results.

Earlier, we reduced our data frame to fourteen columns. Now, we’re reducing the number of data points by using row filters. Here is the code I’ll use to separate our full play-by-play data frame (df) into training and testing datasets:

training_df = df[(~df.game_date.str.contains('2018')) & (df.posteam == 'NO') & (df.down.isin(range(1,5))) & ((df.play_type == 'run') | (df.play_type == 'pass'))]

testing_df = df[(df.game_date.str.contains('2018')) & (df.posteam == 'NO') & (df.down.isin(range(1,5))) & ((df.play_type == 'run') | (df.play_type == 'pass'))]

The testing dataset consists only of rows whose game date contains the string ‘2018’, while the training dataset is composed of rows which do not have ‘2018’ in the game date, because the tilde (~) inverts the Boolean mask. The rest of the filters are the same for each dataset: the team in possession must be the New Orleans Saints, the play-type must be a pass or a run, and it must be 1st, 2nd, 3rd, or 4th down.

We now have all the data we need. Before we go on to training our machine learning model, let’s set some baselines for our accuracy.

Our dependent variable (play-type) is binary, meaning it only has two possible values: pass or run. Random guessing would be right about half the time, so at a minimum we want to beat 50% accuracy.

We can also take a look at the Saints’ relative frequency of play calls in the 2018 season. Let’s graph it.

rel_freq = testing_df['play_type'].value_counts()

plt.pie(rel_freq, labels = rel_freq.index, autopct='%.2f%%')
plt.title("saints' 2018 play-type distribution")
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0)
plt.show()

A model that predicts a pass on every play would achieve an accuracy of 52.92%. That’s already better than 50%, so we’ll use it as our new baseline.
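If you want to verify that number yourself, here’s a minimal sketch that just measures how often the most common play-type shows up in the 2018 testing data we built above.

# sanity check: accuracy of always predicting the most common play-type
majority_class = testing_df['play_type'].value_counts().idxmax()
baseline_accuracy = (testing_df['play_type'] == majority_class).mean()
print("Baseline ({}): {:.2%}".format(majority_class, baseline_accuracy))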

Okay, now we need to get our data ready for the model. We need to explicitly separate the features and the label for each dataset:

training_features = training_df[['yardline_100','quarter_seconds_remaining','half_seconds_remaining','game_seconds_remaining','qtr','down','goal_to_go','ydstogo','score_differential','shotgun','no_huddle']]

training_label = training_df['play_type']

testing_features = testing_df[['yardline_100','quarter_seconds_remaining','half_seconds_remaining','game_seconds_remaining','qtr','down','goal_to_go','ydstogo','score_differential','shotgun','no_huddle']]

testing_label = testing_df['play_type']

Finally, we can train our model and make predictions. We are going to use a gradient boosting machine as the machine learning technique for our algorithm. If you want an in-depth explanation of gradient boosting, check out this great article by Prince Grover. In short, though, boosting is a technique in which many predictors (decision trees) are built sequentially, with each new tree trying to minimize the residual error of the previous trees. Bagging (as in random forests), on the other hand, builds parallel, independent decision trees which collectively vote to generate the final prediction. Anyway, here’s the code for fitting the gradient boosting machine.

gbr = ensemble.GradientBoostingClassifier(n_estimators = 100, learning_rate = 0.02)

gbr.fit(training_features, training_label)

We have successfully trained our algorithm with the training dataset. Let’s apply this algorithm to the independent variables of our testing dataset to determine the accuracy of our model.

prediction = gbr.predict(testing_features)

accuracy = accuracy_score(testing_label, prediction)

print("Accuracy: "+"{:.2%}".format(accuracy))
Accuracy: 75.19%

Our algorithm achieved an accuracy of 75.19%, which is considerably better than the 52.92% baseline. I think this is a satisfactory result.
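Since I contrasted boosting with bagging a moment ago, here’s a quick sketch of what the bagging-based counterpart (a random forest) would look like on the same features and label. Neither model has been tuned, so treat any gap between the two scores as a rough comparison rather than a verdict.

# optional comparison: a bagging-based random forest on the same training data
rf = ensemble.RandomForestClassifier(n_estimators = 100, random_state = 0)
rf.fit(training_features, training_label)
rf_prediction = rf.predict(testing_features)
print("Random forest accuracy: "+"{:.2%}".format(accuracy_score(testing_label, rf_prediction)))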

But why stop now? Our model is accurate, but it isn’t telling us much about how it arrives at its predictions. Let’s try interpreting our algorithm so we can see which variables matter most when predicting the play-calling of the New Orleans Saints (aka our model’s feature importances).

features = ['yl_100', 'q_sec', 'h_sec', 'g_sec', 'qtr', 'down' ,'g2g', 'yd2g', 'sd', 'shot', 'nh'] 

# unabbreviated variable names in the same order:
# 'yardline_100', 'quarter_seconds_remaining', 'half_seconds_remaining', 'game_seconds_remaining', 'qtr', 'down', 'goal_to_go', 'ydstogo', 'score_differential', 'shotgun', 'no_huddle'

feature_importance = gbr.feature_importances_.tolist()

plt.bar(features,feature_importance)
plt.title("gradient boosting classifier: feature importance")
plt.show()
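If you’d rather read exact numbers than eyeball the bars, a short sketch pairs each abbreviated feature name with its importance (this just reuses the features list and the fitted model from above).

# numeric view of the same importances, sorted from largest to smallest
importance_series = pd.Series(gbr.feature_importances_, index = features)
print(importance_series.sort_values(ascending = False))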

Whether or not the Saints are in the shotgun formation is by far the most significant factor in our algorithm. The number of yards needed for a first down comes in at a very distant second. A few variables had zero weight in our model: the number of seconds left in the quarter, the quarter itself, and whether the offense entered the play without huddling. But let’s go back to the shotgun formation. Let’s pull out the Saints’ shotgun plays from 2018 so we can see this tendency for ourselves.

shotgun_df = df[(df.game_date.str.contains('2018')) & (df.posteam == 'NO') & (df.down.isin(range(1,5))) & ((df.play_type == 'run') | (df.play_type == 'pass')) & (df.shotgun == 1)]

relative_frequency = shotgun_df['play_type'].value_counts()

plt.pie(relative_frequency, labels = relative_frequency.index, autopct='%.2f%%')
plt.title("saints' 2018 shotgun formation play-type distribution")
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0)
plt.show()

Can’t argue with that. The Saints passed the ball on 78% of plays while they were in shotgun formation. No wonder it’s the most significant feature in the algorithm.

Before we wrap this up, I want to explore teams other than the New Orleans Saints. Earlier, I said that it wouldn’t make sense to try to predict the decision making of the Buffalo Bills by using data from the Seattle Seahawks. But what if play-calling doesn’t vary between coaches as much as I presumed? Let’s test it by taking the model we trained on Saints data and applying it to the other thirty-one NFL teams. We’ll determine the accuracy of the model for each team.

First, I’ll need an array of all the teams in the full play-by-play dataset, so I’ll take the unique values of the posteam column (if I took all of the values, we’d get an array with a length of nearly 500,000). After taking the unique values, I’ll check the length of the array to make sure we get 32 in return.

df.posteam.unique()
len(df.posteam.unique())
35

The San Diego Chargers and St. Louis Rams both relocated to Los Angeles within this range of time, plus the Jacksonville Jaguars are listed as both ‘JAC’ and ‘JAX’ for some reason. We can quickly clean this up by replacing the problematic values.

df['posteam'] = df['posteam'].replace({'STL': 'LA', 'SD': 'LAC', 'JAC': 'JAX'})

len(df.posteam.unique())
32

There we go. Now, let’s create the for-loop which will iterate through the model’s projections for all 32 teams. We’ll reuse most of the code from earlier in this article, with a few changes to fit the loop (like using the variable ‘tm’ instead of the string ‘NO’ for the Saints). Since we’re applying the model we already trained on the Saints, we only need to build a testing dataset for each team.

results = pd.DataFrame(columns =['accuracy'])

for tm in df.posteam.unique():

    # build the 2018 testing dataset for this team
    testing_df = df[(df.game_date.str.contains('2018')) & (df.posteam == tm) & (df.down.isin(range(1,5))) & ((df.play_type == 'run') | (df.play_type == 'pass'))]

    testing_features = testing_df[['yardline_100','quarter_seconds_remaining','half_seconds_remaining','game_seconds_remaining','qtr','down','goal_to_go','ydstogo','score_differential','shotgun','no_huddle']]
    testing_label = testing_df['play_type']

    # apply the model that was trained on the Saints (no retraining)
    prediction = gbr.predict(testing_features)
    accuracy = accuracy_score(testing_label, prediction)

    results.loc[tm] = [accuracy]

We now have a data frame (results) of thirty-two rows which contains the team abbreviation and the accuracy of the model for each team. Remember, the same algorithm was used for all 32 teams, and the algorithm was trained using data from just the New Orleans Saints. Let’s visualize the results.

plt.bar(results.index,results.accuracy)
plt.xticks(rotation=90)
plt.title("accuracy of predictive model trained on saints for all 32 nfl teams")
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0)
plt.show()
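Before reading too much into the chart, it helps to sort the results data frame so the best and worst teams (and the overall average) are easy to pull out. Here’s a small sketch of that.

# sort the per-team accuracies to find the extremes and the overall average
acc = results['accuracy'].astype(float)
print(acc.sort_values().head(3))   # teams the model predicts worst
print(acc.sort_values().tail(3))   # teams the model predicts best
print("Average accuracy: "+"{:.2%}".format(acc.mean()))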

Surprisingly, the model was extremely accurate across the board with an average accuracy of 71.72%. I suppose our observations on the feature importances actually make this unsurprising, though. The shotgun formation isn’t exclusive to the New Orleans Saints, after all. Anyway, you might notice that the model performed poorly on the Seahawks (56.6%) and Ravens (60.3%). The Seahawks only passed the ball on 52.66% of plays in the shotgun formation. For the Ravens, it’s similarly low at 59.57%. Meanwhile, the accuracy for the Patriots was actually the highest at 78.9% — even better than the team the model was actually trained on. The Patriots passed the ball on 85.89% of plays from the shotgun formation. Noticing a pattern? The model heavily weighs the shotgun formation as an indicator of a passing play, so it comes up short when making predictions for a team that doesn’t follow this trend.
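If you want to check that pattern yourself, here’s a sketch that computes each team’s 2018 pass rate out of the shotgun formation, reusing the filtered data frame (df) from earlier.

# per-team pass rate out of the shotgun formation in 2018
shotgun_2018 = df[(df.game_date.str.contains('2018')) & (df.down.isin(range(1,5))) & (df.play_type.isin(['run', 'pass'])) & (df.shotgun == 1)]
shotgun_pass_rate = shotgun_2018.groupby('posteam')['play_type'].apply(lambda plays: (plays == 'pass').mean())
print(shotgun_pass_rate.sort_values())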

Of course, it would still be possible (and preferable) to use data from the team of interest for training a model. This won’t always be possible, though. The Arizona Cardinals have a new head coach (Kliff Kingsbury) and quarterback (Kyler Murray) this year. If you want to predict their offensive play-calling, it wouldn’t be useful to compile data from when Bruce Arians and Carson Palmer led the offense.

I think the power of machine learning is clear here, though. An accuracy of 75.19% when analyzing a single team is a huge improvement over the baseline, and there are still plenty of refinements that could be made. For example, I would guess that the offense’s personnel would be an important feature for predicting the play-call. A team would be more likely to run if they have more tight ends on the field, while one would probably expect a pass from a team in a five-wide set. The specific players on the field could also be a clue. For example, I imagine that the Saints run the ball more often while Taysom Hill is on the field. We’re only scratching the surface of the possibilities. Machine learning shouldn’t replace coaching, but data science can certainly go along with it.
