Measuring Shot Quality in the NBA With Python

Vincent Carchietta – USA TODAY Sports

Last February, I introduced Expected Effective Field Goal Percentage (XeFG%). It is simply defined as the effective field goal percentage (eFG%) a league-average player would be expected to have given a certain shot selection. The average big man will have a high XeFG% because they take more shots around the rim, while the average guard will have a lower XeFG% because they shoot more jumpers. A player’s XeFG% can then be compared to their actual eFG% to see how much better or worse they’re shooting compared to the expectation. The same can be done to evaluate defensive performance through the use of Expected Effective Field Goal Percentage Allowed (DXeFG%). While both metrics are primarily used on an individual player basis, they can also be applied to teams.
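For reference, eFG% counts a made three-pointer as 1.5 made field goals. Here is a minimal sketch of that standard formula. The helper name `efg_pct` and the shot totals are made up for illustration.

```python
def efg_pct(fgm, fg3m, fga):
    """Effective field goal percentage: (FGM + 0.5 * 3PM) / FGA."""
    if fga == 0:
        return float('nan')  # no attempts, so the percentage is undefined
    return (fgm + 0.5 * fg3m) / fga

# Hypothetical player: 40 makes on 100 attempts, 10 of those makes are threes.
print(efg_pct(40, 10, 100))  # 0.45
```

XeFG% uses the same formula, but with the makes a league-average shooter would be expected to produce on the same attempts.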

I calculated all of the past XeFG% data using Microsoft Excel. It took an annoyingly large amount of time, especially because I had to redo the entire process for every season since 2014. It wasn’t automated; most of the work was manual. I eventually reached the point most Excel users reach: I wanted to learn Python in order to decrease the amount of time it takes to perform tasks like this. So, I did. With the NBA regular season about 17% complete, I figured I’d cook up a script to automatically calculate XeFG% and document the process here.

First, let’s import the modules we’ll need for this project. We’ll also define the headers that are required for scraping the website.

import pandas as pd
import numpy as np
import itertools
import requests
import json

headers = {'Host': '','User-Agent': 'Firefox/55.0','Accept': 'application/json, text/plain, */*','Accept-Language': 'en-US,en;q=0.5','Accept-Encoding': 'gzip, deflate','Referer': '','x-nba-stats-origin': 'stats','x-nba-stats-token': 'true','DNT': '1',}

Our goal is to take all of the data we want from the NBA Stats website and convert it to data frames that we can easily manipulate. We’ll first need to navigate to a page on the site which displays our desired data in a table, like here. Next, hit CTRL+Shift+I (or right-click the table and click Inspect) to pull up Google Chrome’s Developer Tools. Make sure you’re on the Network tab at the top right. Then, click the XHR tab below, and hit CTRL+R to refresh. One of the results that pop up should be a long URL that looks something like this:

The data from the table is stored with the JSON format at this link. We can easily adjust the variables in the URL to change the output data. For example, notice where it says “Season=2018-19.” If we changed that to “Season=2019-20,” guess what happens?

To calculate XeFG%, we use two variables: shot distance and degree of shot contest. These correspond to two parameters in the URL: “ShotDistRange” and “CloseDefDistRange.” The link above stores data for all shots where the defender is 0-2 feet away from the shooter. I want data for every possible shot, though: when the defender is 2-4 feet away, 4-6 feet away, 6+ feet away, etc. I can use a for-loop to do this without needing to find the URLs for each combination of variables.
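The parameter values in the URL are percent-encoded, which makes them look stranger than they are. Here is a small sketch using the standard library to show how the readable values map to the encoded ones. The readable strings are my assumption, decoded from the encoded values the script uses.

```python
from urllib.parse import urlencode, unquote_plus

# Human-readable filter values (assumed from the encoded strings in the script).
params = {'CloseDefDistRange': '6+ Feet - Wide Open', 'ShotDistRange': '>=10.0'}

query = urlencode(params)
print(query)
# CloseDefDistRange=6%2B+Feet+-+Wide+Open&ShotDistRange=%3E%3D10.0

# Decoding goes the other way.
print(unquote_plus('0-2+Feet+-+Very+Tight'))  # 0-2 Feet - Very Tight
```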

df_list = []

cvg = ['0-2+Feet+-+Very+Tight','2-4+Feet+-+Tight','4-6+Feet+-+Open','6%2B+Feet+-+Wide+Open']

dist = ['','%3E%3D10.0']

# distinct loop names (c, d) keep the 'cvg' and 'dist' lists from being overwritten
for c,d in itertools.product(cvg,dist):

    url = '' + str(c) + '&College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&DribbleRange=&GameScope=&GameSegment=&GeneralRange=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2018-19&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&ShotDistRange=' + str(d) + '&StarterBench=&TeamID=0&TouchTimeRange=&VsConference=&VsDivision=&Weight='

    # 'resp' avoids shadowing the imported json module
    resp = requests.get(url, headers=headers).json()

    data = resp['resultSets'][0]['rowSet']
    columns = resp['resultSets'][0]['headers']

    df = pd.DataFrame.from_records(data, columns=columns)
    # drop_cols is the list of unneeded column names; it must be defined before the loop
    df.drop(drop_cols, axis=1, inplace=True)
    df.columns = ['player_id','player_name','tm','age','gp','g','fg2m','fg2a','fg3m','fg3a']

    # store the data frame for this combination of contest and distance
    df_list.append(df)

I stored the four link parameters for degree of shot contest under the variable ‘cvg.’ These should be self-explanatory. For ‘dist,’ or shot distance, maybe not. The first value (which is nothing) gives data for every shot regardless of distance. The second value (‘%3E%3D10.0’, the URL-encoded form of ‘>=10.0’) gives data for shots taken at least 10 feet from the basket. With this, we’ll be able to separate two-pointers into two-pointers within ten feet of the basket and two-pointers further than ten feet from the basket. The for-loop spits out a data frame (which I cleaned up a little bit by dropping unneeded columns) called ‘df’ for each combination of ‘cvg’ and ‘dist.’ Each data frame (eight in total) is added to df_list, a list of data frames.
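To see what the loop does with each response without hitting the site, here is a sketch with a made-up payload. It assumes only the shape the script relies on (a ‘resultSets’ list whose first entry has ‘headers’ and ‘rowSet’). The column names and numbers are invented.

```python
import pandas as pd

# Made-up payload with the shape the loop relies on:
# resultSets -> first entry -> 'headers' and 'rowSet'.
payload = {
    'resultSets': [{
        'headers': ['PLAYER_ID', 'PLAYER_NAME', 'FG2M', 'FG2A'],
        'rowSet': [[1, 'Player A', 3, 7], [2, 'Player B', 5, 9]],
    }]
}

result = payload['resultSets'][0]
df = pd.DataFrame.from_records(result['rowSet'], columns=result['headers'])

frames = []
frames.append(df)  # collect each frame so it is still available after the loop
print(df.shape)    # (2, 4)
```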

Now, I want to create a data frame with basic player information. I’ll create a for-loop that takes the player information I want from each of the eight data frames in ‘df_list’ and adds it to a new list of data frames called ‘db_list.’ I’ll then combine all of the data frames in db_list to a single data frame and drop the duplicate rows to get my player database.

db_list = []

for n in range(0,len(df_list)):

    # keep only the player information columns from each data frame
    db_list.append(df_list[n][['player_id','player_name','tm']])

db = pd.concat(db_list).drop_duplicates().reset_index(drop=True)

Simple enough. Here’s what a sample of ‘db’ looks like (displayed by inputting ‘db.head()’ into console):

Cool. In the code used to generate this data frame, I referred to each specific data frame in ‘df_list’ by its index. Python lists are zero-indexed, so df_list[0] is the first data frame, df_list[1] is the second, etc. But which data does each data frame actually represent? Let’s separate the data frames into names that we can understand.

# vt: very tightly contested shots
# t: tightly contested shots
# o: open shots
# wo: wide open shots
# reg: all shots
# tenft: only shots greater than ten feet

vt_reg = df_list[0]
vt_tenft = df_list[1]

t_reg = df_list[2]
t_tenft = df_list[3]

o_reg = df_list[4]
o_tenft = df_list[5]

wo_reg = df_list[6]
wo_tenft = df_list[7]

I used a random naming system that I came up with so that the data frames weren’t called something like ‘shots_very_tightly_contested_and_further_than_ten_feet_from_basket.’
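The index-to-name mapping follows from how itertools.product orders its output. Here is a quick sketch that rebuilds the eight names in the same order, with short labels standing in for the ‘cvg’ and ‘dist’ values.

```python
import itertools

contest = ['vt', 't', 'o', 'wo']  # same order as 'cvg'
distance = ['reg', 'tenft']       # same order as 'dist'

# product() varies the last argument fastest, so the pairs come out as
# (vt, reg), (vt, tenft), (t, reg), ... matching df_list[0], df_list[1], ...
names = [c + '_' + d for c, d in itertools.product(contest, distance)]

for i, name in enumerate(names):
    print(i, name)
```

The first line printed is `0 vt_reg` and the last is `7 wo_tenft`, which lines up with the assignments above.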

I’ve organized all of the data, but now I want to put it all into one data frame. This isn’t actually necessary, but I prefer having all of my data in one place. So, let’s create a copy of our data frame with player information called ‘db’ and merge all of the data we want from the other eight data frames onto this copied data frame.

final = db.copy()

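# prefix each frame's stat columns before merging; recent pandas versions raise a
# MergeError when repeated merges would create duplicate '_x' / '_y' column names
for b,df in zip(['vt','t','o','wo'],[vt_reg,t_reg,o_reg,wo_reg]):

    stats = df[['player_id','fg2m','fg2a','fg3m','fg3a']].set_index('player_id').add_prefix(b + '_').reset_index()
    final = pd.merge(final,stats,on='player_id',how='outer')

for b,df in zip(['vt','t','o','wo'],[vt_tenft,t_tenft,o_tenft,wo_tenft]):

    stats = df[['player_id','fg2m','fg2a']].set_index('player_id').add_prefix(b + '_gr10_').reset_index()
    final = pd.merge(final,stats,on='player_id',how='outer')

This code takes the shot makes and attempts from all eight data frames and merges them onto one data frame called ‘final’ based on the shared column ‘player_id.’ The problem is that many of these columns share the same names (‘fg2m’, ‘fg2a’, etc), so each set gets a prefix before it’s merged. I’ll also spell out the full list of column names so it’s clear what’s in the data frame.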

final.columns = ['player_id','player_name','tm', # player info
                'vt_fg2m','vt_fg2a','vt_fg3m','vt_fg3a', # all very tight 2s / 3s
                't_fg2m','t_fg2a','t_fg3m','t_fg3a', # all tight 2s / 3s
                'o_fg2m','o_fg2a','o_fg3m','o_fg3a', # all open 2s / 3s
                'wo_fg2m','wo_fg2a','wo_fg3m','wo_fg3a', # all wide open 2s / 3s
                'vt_gr10_fg2m','vt_gr10_fg2a', # very tight 2s > 10 ft
                't_gr10_fg2m','t_gr10_fg2a', # tight 2s > 10 ft
                'o_gr10_fg2m','o_gr10_fg2a', # open 2s > 10 ft 
                'wo_gr10_fg2m','wo_gr10_fg2a'] # wide open 2s > 10 ft

That’s a lot of columns. Let’s add some more. I previously mentioned that we can separate total two-pointers into two-pointers less than and greater than ten feet from the hoop. We have the two-pointers greater than ten feet from the basket, so let’s subtract those from the total two-pointers to find the two-pointers less than ten feet from the basket.

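# fill missing values first: a player with no shots in a category has NaN after the
# outer merge, and NaN would carry through the subtraction below
final.fillna(0, inplace=True)

for b in ['vt','t','o','wo']:

    final[b + '_le10_fg2m'] = final[b + '_fg2m'] - final[b + '_gr10_fg2m']
    final[b + '_le10_fg2a'] = final[b + '_fg2a'] - final[b + '_gr10_fg2a']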

I used a for-loop to do this more quickly. There are only a few lines of code, but they create eight columns (made and attempted two-pointers less than ten feet from the basket for every degree of shot contest in ‘b’).
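One thing to watch with this subtraction is the order of operations around missing values. After an outer merge, a player with no shots in a category has NaN there, and NaN carries through subtraction. Here is a toy sketch with invented numbers showing why filling with 0 should happen before subtracting.

```python
import numpy as np
import pandas as pd

# Toy frame: the second player took no very tight twos beyond ten feet,
# so an outer merge would leave NaN in the 'gr10' column.
toy = pd.DataFrame({
    'vt_fg2m': [10.0, 6.0],
    'vt_gr10_fg2m': [4.0, np.nan],
})

# Subtracting first and filling afterwards turns the second player's six makes into 0.
late_fill = (toy.vt_fg2m - toy.vt_gr10_fg2m).fillna(0)

# Filling first keeps them.
filled = toy.fillna(0)
early_fill = filled.vt_fg2m - filled.vt_gr10_fg2m

print(late_fill.tolist())   # [6.0, 0.0]
print(early_fill.tolist())  # [6.0, 6.0]
```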

Now I have all the data we need. I’ll make things a little easier by creating columns that add up some of the stats from other columns. With these totals, I can quickly calculate effective field goal percentage (eFG%) as well.

final.fillna(0, inplace=True)

final['fgm'] = final.vt_fg2m + final.vt_fg3m + final.t_fg2m + final.t_fg3m + final.o_fg2m + final.o_fg3m + final.wo_fg2m + final.wo_fg3m

final['fga'] = final.vt_fg2a + final.vt_fg3a + final.t_fg2a + final.t_fg3a + final.o_fg2a + final.o_fg3a + final.wo_fg2a + final.wo_fg3a

final['fg3m'] = final.vt_fg3m + final.t_fg3m + final.o_fg3m + final.wo_fg3m

final['fg3a'] = final.vt_fg3a + final.t_fg3a + final.o_fg3a + final.wo_fg3a

final['efg'] = (final.fgm + (0.5 * final.fg3m)) / final.fga

There’s another five columns. Next order of business: finding the average field goal percentage (FG%) for every type of shot, meaning every possible combination of degree of shot contest and shot distance. I’ll store these values in three dictionaries.

fg2L = {} # twos < ten ft
fg2G = {} # twos > ten ft
fg3 = {} # threes

for b in ['vt','t','o','wo']:
    fg2L[b] = final[b + '_le10_fg2m'].sum() / final[b + '_le10_fg2a'].sum()
    fg2G[b] = final[b + '_gr10_fg2m'].sum() / final[b + '_gr10_fg2a'].sum()
    fg3[b] = final[b + '_fg3m'].sum() / final[b + '_fg3a'].sum()
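Note that these averages are pooled: total makes divided by total attempts across the league. That is different from averaging each player’s individual percentage, which would give a player with one attempt the same weight as a high-volume shooter. Here is a toy sketch of the difference, with invented numbers.

```python
import pandas as pd

# Invented numbers: one high-volume shooter and one player with a single attempt.
toy = pd.DataFrame({'fg3m': [40, 1], 'fg3a': [100, 1]})

pooled = toy.fg3m.sum() / toy.fg3a.sum()   # total makes / total attempts
unweighted = (toy.fg3m / toy.fg3a).mean()  # average of individual percentages

print(round(pooled, 3))      # 0.406
print(round(unweighted, 3))  # 0.7
```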

These three dictionaries (‘fg2L’, ‘fg2G’, and ‘fg3’) each store four values that represent the league-average FG% for a specific type of shot. If I input ‘fg2L’ into the console, it spits this out:

{'vt': 0.45915873758644,
 't': 0.5625660408030562,
 'o': 0.7013221153846154,
 'wo': 0.8909308829404714}

This means that on two-pointers within ten feet of the basket with the nearest defender less than two feet away, players hit 45.9% of their shots. When the nearest defender is at least six feet away, though, the success rate jumps up to 89.1%, as one may expect. Let’s graph all of the league-average percentages to add some color to this article (I’ll convert the dictionaries to data frame form and then use the Matplotlib module to create the visualization).

import matplotlib.pyplot as plt

avg = pd.DataFrame([fg2L,fg2G,fg3], columns=list(fg2L), index=['2pa < 10\'','2pa > 10\'','3pa'])

avg.columns = ['very tight','tight','open','wide open']

# chart type (grouped bars) is assumed; one color per degree of shot contest
avg.plot(kind='bar', color=['#1f77b4','#ff7f0e','#2ca02c','#d62728'])
plt.legend(title='shot contest')
plt.title('nba avg fg% for different shot types')
plt.show()

It’s not at all surprising to see that the degree of shot contest impacts the FG% of two-pointers within ten feet of the basket more than deeper shots. Interesting note: the league-average FG% on very tightly contested shots within ten feet of the hoop is approximately equal to the average FG% on wide open two-pointers further than ten feet from the hoop. Those are just shots that actually count, too — the probability of drawing free throws is obviously higher on the contested two-pointers closer to the basket and that isn’t factored into these numbers.

Anyway, back to the task at hand. The rest of the process is pretty easy. Now that we know the league-average percentages for every type of shot, we can multiply each percentage by the corresponding number of attempts for each player in order to find out how many shots a league-average player would have hit given each player’s shot selection. With this, we can calculate that league-average player’s effective field goal percentage, aka the initial player’s XeFG%.

final['x2pm'] = (final.vt_le10_fg2a * fg2L['vt']) + (final.vt_gr10_fg2a * fg2G['vt']) + (final.t_le10_fg2a * fg2L['t']) + (final.t_gr10_fg2a * fg2G['t']) + (final.o_le10_fg2a * fg2L['o']) + (final.o_gr10_fg2a * fg2G['o']) + (final.wo_le10_fg2a * fg2L['wo']) + (final.wo_gr10_fg2a * fg2G['wo'])

final['x3pm'] = (final.vt_fg3a * fg3['vt']) + (final.t_fg3a * fg3['t']) + (final.o_fg3a * fg3['o']) + (final.wo_fg3a * fg3['wo'])

final['xefg'] = (final.x2pm + (1.5 * final.x3pm)) / final.fga
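To make the arithmetic concrete, here is a hand-sized version of the same calculation for one hypothetical player who only takes open shots. The league-average values and attempt counts are invented for illustration.

```python
# League-average FG% on open shots, one value per shot type (invented values).
fg2L = {'o': 0.70}  # twos inside ten feet
fg2G = {'o': 0.40}  # twos beyond ten feet
fg3 = {'o': 0.35}   # threes

# Hypothetical player: 10 open twos inside ten feet, 5 beyond, 20 open threes.
att = {'le10': 10, 'gr10': 5, 'fg3': 20}

x2pm = att['le10'] * fg2L['o'] + att['gr10'] * fg2G['o']  # about 9 expected twos
x3pm = att['fg3'] * fg3['o']                              # about 7 expected threes
fga = sum(att.values())                                   # 35 attempts

xefg = (x2pm + 1.5 * x3pm) / fga
print(round(xefg, 3))  # 0.557
```

If this player’s actual eFG% were higher than 0.557, they would be shooting better than a league-average player would on the same shots.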

Now, let’s get a column that represents the difference between eFG% and XeFG%, filter out the ~40 columns we no longer need, and sort the data frame.

final['diff'] = final.efg - final.xefg

xefg = final[['player_name','tm','fga','xefg','efg','diff']]

xefg = xefg.sort_values(by='player_name').reset_index(drop=True)

And now we have a data frame called ‘xefg’ with all of the data we need. Here’s a sample.

That wasn’t too hard. It’s certainly 100x quicker than using Microsoft Excel. In a future article, I’ll post a 2019-2020 quarter-season update for individual and team XeFG% / DXeFG%. It’ll be far easier to calculate now!
