Analyzing NBA Officiating With Python

Mark D. Smith – USA Today

In 2015, the NBA began releasing their controversial Last Two Minute Reports, described as “the league’s assessment of officiated events that occurred in the last two minutes of games that were at or within three points during any point in the last two minutes of the fourth quarter (and overtime, where applicable).” These reports are intended to add some transparency to the league’s operations and possibly ease some of the frustration fans often feel towards officiating. The National Basketball Referees Association petitioned to stop the reports, arguing that this transparency “does nothing to change the outcome of the game” and “encourages anger and hostility towards NBA officials.” After all, having the league publicly acknowledge that your favorite team was robbed of a win isn’t going to make you any happier.

Still, these reports have gone on for the past five years, and they have given us a lot of data to work with. This website contains a dataset consisting of every L2M report from March 2015 to March 2020. Let’s start digging into all of this information using Python to see what we can learn.

First, I’ll import the main packages I always have at hand for this type of analysis. I’m extracting the full L2M report dataset from the aforementioned website and finding the dimensions of the new DataFrame along with a list of its columns.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('')

# Dimensions of the DataFrame, followed by its column names
print(df.shape)
print(df.columns)

(45076, 38)
Index(['period', 'time', 'call_type', 'committing', 'disadvantaged',
       'decision', 'comments', 'game_details', 'page', 'file', 'game_date',
       'away_team', 'home_team', 'call', 'type', 'date', 'home', 'away',
       'scrape_time', 'stint', 'home_bkref', 'bkref_id', 'ref_1', 'ref_2',
       'ref_3', 'attendance', 'committing_min', 'committing_team',
       'committing_side', 'disadvantaged_min', 'disadvantaged_team',
       'disadvantaged_side', 'type2', 'time_min', 'time_sec', 'time2',
       'season', 'playoffs'],
      dtype='object')

The DataFrame df contains 45,076 rows and 38 columns listed above. That’s a lot of information! There are many possibilities here, but I think a good place to start our analysis would be judging the accuracy of NBA officiating over the past six seasons.

The decision column can contain one of four grades (CC, IC, CNC, INC) or a missing value (NaN). The abbreviations stand for correct call, incorrect call, correct no-call, and incorrect no-call, while NaN marks events where no grade applies. Knowing whether a call or no-call was correct is obviously essential to measuring officiating accuracy. So I want to start by removing the officiating events in which the decision column is NaN, because there is no way for us to judge the accuracy of officiating in those cases.

df = df[~df.decision.isna()].reset_index(drop=True)

Now, I’m going to create a column that serves as a binary indicator for whether the referees acted correctly or incorrectly on a given play.

def correct_decision(decision):
    # 1 for a correct call or correct no-call, 0 for anything else
    if (decision == 'CC') or (decision == 'CNC'):
        return 1
    return 0

df['correct_decision'] = np.vectorize(correct_decision)(df.decision)
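
The same indicator can also be built without a Python-level function. Here is a minimal sketch of that alternative, assuming a small made-up DataFrame named toy in place of the real L2M data (the rows are hypothetical and only there so the snippet runs on its own):

```python
import pandas as pd

# Hypothetical toy data standing in for the L2M decision column
toy = pd.DataFrame({'decision': ['CC', 'IC', 'CNC', 'INC', 'CNC']})

# isin() marks rows graded as a correct call or correct no-call;
# astype(int) turns True/False into the 1/0 indicator
toy['correct_decision'] = toy['decision'].isin(['CC', 'CNC']).astype(int)

print(toy)
```

On a DataFrame of this size the difference is mostly stylistic, but isin keeps the work inside pandas instead of calling a Python function once per row.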

I can now use the groupby function to split the DataFrame df based on the season column and determine the average of the new correct_decision column for each season. Graphing this will illustrate the change in officiating accuracy in the L2M reports over the past six seasons.

from matplotlib.ticker import PercentFormatter


plt.ylabel('correct decision percentage')
plt.title('accuracy of nba officiating in l2m reports since 2015')
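
The labels above need a per-season accuracy series to plot. As a rough sketch of the groupby step described earlier, here is how that series could be computed. The toy DataFrame and its values are made up for illustration, and accuracy_by_season and overall_accuracy are names I am introducing here, not ones from the original analysis:

```python
import pandas as pd

# Hypothetical toy data: one row per graded play
toy = pd.DataFrame({
    'season': [2015, 2015, 2016, 2016, 2016, 2017],
    'correct_decision': [1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 indicator is the share of correct decisions;
# multiplying by 100 expresses it as a percentage
accuracy_by_season = toy.groupby('season')['correct_decision'].mean() * 100

# The same idea applied to every row gives the overall rate
overall_accuracy = toy['correct_decision'].mean() * 100

print(accuracy_by_season.round(2))
print(round(overall_accuracy, 2))
```

Calling accuracy_by_season.plot() and then applying PercentFormatter to the y-axis would be one way to turn this series into the line chart.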


It seems that NBA officiating steadily improved from 2015 to 2018 and then plateaued between 93 and 94 percent accuracy. So far in 2020, 93.07% of decisions in last two minute reports were categorized as correct calls or no-calls by the NBA. Overall in all reports since 2015, this rate sits at a solid 92.01%. Not bad.

The dataset also lists the three officials for every game. Using the same ridge regression approach that is used to calculate regularized adjusted plus-minus (RAPM) for players, I attempted to quantify the officiating ability of every individual referee. The code for this portion was a bit long, so I’ll just link to it here. Here are the results:

A referee’s ‘value’ is calculated based on the percentage of recorded calls/no-calls by their lineup that are ruled correct in the NBA’s L2M reports. This obviously isn’t a perfect method of quantifying officiating performance for a number of reasons. And I have no idea how to actually evaluate the results — there’s no other public data for me to compare these results to in order to see if they actually match up. Oh well, moving on.
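
To give a feel for the idea without the full code, here is a heavily simplified sketch of a ridge regression over referee crews. It is not the linked implementation: the data is simulated, and the penalty lam, the effect sizes, the crew assignments, and the helper ridge_fit are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_refs, n_plays, crew_size = 6, 500, 3
lam = 10.0  # hypothetical ridge penalty; larger values shrink harder

# Hypothetical "true" referee effects, used only to simulate outcomes
true_effect = np.array([0.05, 0.02, 0.0, 0.0, -0.02, -0.05])

# Design matrix: one row per play, one column per referee,
# 1.0 where that referee was on the crew for the play
X = np.zeros((n_plays, n_refs))
for i in range(n_plays):
    crew = rng.choice(n_refs, size=crew_size, replace=False)
    X[i, crew] = 1.0

# Simulated 0/1 outcome: was the decision graded correct?
p_correct = 0.90 + X @ true_effect
y = (rng.random(n_plays) < p_correct).astype(float)

# Center the target so coefficients read as "relative to average"
y_centered = y - y.mean()


def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


beta = ridge_fit(X, y_centered, lam)
print(np.round(beta, 4))
```

The penalty is what makes this workable. Referees always appear in groups of three, so their individual columns are tangled together, and shrinking the coefficients toward zero keeps the estimates stable for officials with few recorded plays.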

Every sports fan out there thinks that the referees make mistakes that disproportionately disadvantage their favorite team. But which teams have actually gotten the short end of the stick in close games most often?


plt.xlabel('# of plays disadvantaged')
plt.title('most / least disadvantaged nba teams in l2m reports')
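
One plausible way to get the counts behind a chart like this is to keep only the incorrect decisions and tally the disadvantaged_team column. The sketch below assumes that definition of "disadvantaged". The toy rows and team abbreviations are made up, and the names wrong and disadvantaged_counts are mine:

```python
import pandas as pd

# Hypothetical toy data; team abbreviations are placeholders
toy = pd.DataFrame({
    'decision': ['IC', 'INC', 'CC', 'INC', 'CNC', 'IC'],
    'disadvantaged_team': ['TOR', 'TOR', 'LAL', 'GSW', 'TOR', 'TOR'],
})

# Keep only plays the league graded as incorrect
wrong = toy[toy['decision'].isin(['IC', 'INC'])]

# Count how often each team was on the wrong end of a mistake,
# sorted from most to least disadvantaged
disadvantaged_counts = wrong['disadvantaged_team'].value_counts()

print(disadvantaged_counts)
```

Raw counts favor teams that play more close games, so dividing by the number of graded plays per team would be a reasonable refinement.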

Here’s some validation for fans of the Toronto Raptors and all the people who think the league is rigged for the Los Angeles Lakers and Golden State Warriors.

Also, why don’t we break down the specific calls that referees get wrong most and least often?

rows_list = []

# Only look at call types that appear more than 100 times
for n in df.type.value_counts()[df.type.value_counts()>100].index:
    tf = df[df.type == n].reset_index(drop=True)
    cc = tf[tf.decision == 'CC'].shape[0]
    ic = tf[tf.decision == 'IC'].shape[0]
    inc = tf[tf.decision == 'INC'].shape[0]
    # Correct calls as a share of calls made plus incorrect no-calls
    dict1 = {'call':n,'accuracy':cc/(cc+ic+inc)}
    rows_list.append(dict1)

calls = pd.DataFrame(rows_list)
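
For comparison, the same accuracy measure can be written as a groupby and sorted so the best and worst call types land at the ends of the table. This is only a sketch on made-up rows: the type labels are placeholders, and call_accuracy is a helper I am introducing here.

```python
import pandas as pd

# Hypothetical toy data with placeholder call types
toy = pd.DataFrame({
    'type': ['type_a'] * 4 + ['type_b'] * 3 + ['type_c'] * 2,
    'decision': ['CC', 'CC', 'IC', 'CNC',
                 'CC', 'INC', 'INC',
                 'CC', 'CC'],
})


def call_accuracy(decisions):
    """Correct calls / (correct calls + incorrect calls + incorrect no-calls)."""
    counts = decisions.value_counts()
    cc = counts.get('CC', 0)
    ic = counts.get('IC', 0)
    inc = counts.get('INC', 0)
    total = cc + ic + inc
    # Guard against call types that only ever show up as correct no-calls
    return cc / total if total > 0 else float('nan')


calls = (
    toy.groupby('type')['decision']
    .apply(call_accuracy)
    .rename('accuracy')
    .sort_values(ascending=False)
    .reset_index()
)

print(calls)
```

Note that correct no-calls are left out of the denominator, matching the formula in the loop, so the number answers "when this call came up as a call or a miss, how often was it right?"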

Sounds about right. Personal take fouls are intentional fouls — it’s pretty hard for a referee to miss a player committing a blatant foul on purpose. The defensive three second violation, meanwhile, is exactly what you’d expect an official to miss. In fact, it seems like referees aren’t particularly great at counting in general. Three of the four types of calls they blow most often are related to counting time. The other one, traveling, is related to counting the number of steps a player takes. Maybe the NBA needs a dedicated counting official at every game.

Finally, let’s take a look at time. All of these plays occur in the last two minutes of a game, but does officiating accuracy fluctuate in that 120-second span? To be more specific, is the theory that officials abstain from calling fouls at the absolute end of games actually true?

The DataFrame df has separate columns for minutes and seconds remaining. I’d rather combine that into one column, so I’ll do that before plotting the histogram for incorrect no-calls.

df['seconds_remaining'] = (df.time_min * 60) + df.time_sec

plt.hist(df[df.decision == 'INC'].seconds_remaining,bins=24,edgecolor='black')

plt.xlabel('seconds remaining')
plt.title('frequency of incorrect no-calls in the final two minutes')
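
A histogram of raw counts can be skewed if more plays are graded near the final buzzer, so a complementary check is the share of graded plays in each time window that were incorrect no-calls. Here is a small sketch of that calculation. The toy rows, the 30-second bin edges, and the names window and inc_share are my own assumptions:

```python
import pandas as pd

# Hypothetical toy data: time left on the clock and the league's grade
toy = pd.DataFrame({
    'time_min': [1, 1, 0, 0, 0, 0],
    'time_sec': [45.0, 10.0, 50.0, 20.0, 4.0, 2.0],
    'decision': ['CNC', 'INC', 'CC', 'CNC', 'INC', 'INC'],
})

toy['seconds_remaining'] = (toy['time_min'] * 60) + toy['time_sec']

# Split the final two minutes into four 30-second windows
bins = [0, 30, 60, 90, 120]
toy['window'] = pd.cut(toy['seconds_remaining'], bins=bins, include_lowest=True)

# Share of graded plays in each window that were incorrect no-calls
inc_share = (toy['decision'] == 'INC').groupby(toy['window'], observed=False).mean()

print(inc_share)
```

If the swallowed-whistle theory holds, the share in the window closest to zero should stand out even after accounting for how many plays were graded there.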


It is quite clear that referees in the NBA tend to “swallow their whistles” at the end of NBA games. Why? Maybe they’re explicitly told to do so because it’s less exciting to watch free throws decide a game. Or maybe it’s a psychological thing: referees don’t want to be the ones to decide a game (even though they could still decide a game by not blowing their whistle, lack of action may feel less direct than actually taking action).

There are a lot more things to look at in this data, but I’ll leave it at that. Again, I downloaded the dataset used in this article from this website, so go ahead and play with it yourself!
