Determining NBA Player Similarity Through Cluster Analysis

Jerome Miron – USA TODAY

In my previous two articles, I used principal component analysis and Gaussian mixture model clustering in order to create a new way of classifying different NBA players. The same data from those articles can be used to assess player similarity through hierarchical agglomerative clustering (HAC).

Hierarchical cluster analysis refers to the method of building a hierarchy of clusters. Hierarchical clustering can either be “bottom-up,” where you start with one cluster for each observation and merge similar clusters at each step of the hierarchy, or “top-down,” where you start with one cluster consisting of every observation and split it into smaller clusters at each step of the hierarchy. The former type of clustering is formally known as hierarchical agglomerative clustering. Here’s an example:

Hierarchical Clustering in Data Mining - GeeksforGeeks

From the plot (which is known as a dendrogram), we can see that points B,C and D,E are very similar to one another. Point A is extremely dissimilar to all the other points — it wasn’t merged until the final step.
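
To make the merge order concrete, here's a minimal sketch that rebuilds a five-point example like the one in the plot. The coordinates are made up for illustration (they aren't taken from the figure), and the average-linkage method is an arbitrary choice for this toy case.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 2-D coordinates, chosen so B/C and D/E sit close together
# and A sits far away from everything else.
labels = ["A", "B", "C", "D", "E"]
points = np.array([
    [10.0, 10.0],  # A
    [1.0, 1.0],    # B
    [1.2, 1.1],    # C
    [4.0, 4.0],    # D
    [4.1, 4.3],    # E
])

# Each row of the linkage matrix records one merge:
# [cluster i, cluster j, merge height, size of the new cluster]
merges = linkage(points, method="average")

def describe_merges(merges, labels):
    """Return the members joined at each step, in merge order."""
    members = {i: [name] for i, name in enumerate(labels)}
    steps = []
    for step, (i, j, _height, _size) in enumerate(merges):
        joined = sorted(members[int(i)] + members[int(j)])
        members[len(labels) + step] = joined  # id of the newly formed cluster
        steps.append(joined)
    return steps

steps = describe_merges(merges, labels)
for joined in steps:
    print(joined)
```

With five points there are four merges, and the point that sits far from the rest only shows up in the last one.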

Now, imagine that every eligible¹ NBA player from the 2019-20 season is represented by a single point. With 195 players in our analysis, we therefore start off with 195 separate clusters using agglomerative clustering. At each following iteration, the closest (most similar) pair of clusters is merged until only one cluster is left.

How do we actually quantify the similarity between clusters in order to determine which combinations should be made at each step? One way is through the use of Ward’s minimum variance method. At each step, this method merges the pair of clusters whose combination leads to the minimum increase in total within-cluster variance.
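
Here's a small sketch of that criterion on its own. It assumes the usual formulation in which "variance" is measured as the within-cluster sum of squared distances to the centroid; the three clusters and their names are hypothetical.

```python
import numpy as np
from itertools import combinations

def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean(axis=0)) ** 2).sum())

def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b merge."""
    return within_ss(np.vstack([a, b])) - within_ss(a) - within_ss(b)

# Three hypothetical clusters of 2-D points.
clusters = {
    "red": np.array([[0.0, 0.0], [0.0, 2.0]]),
    "green": np.array([[3.0, 1.0]]),
    "blue": np.array([[10.0, 1.0], [12.0, 1.0]]),
}

# Cost of every candidate merge; Ward's method picks the cheapest one.
costs = {
    (p, q): ward_cost(clusters[p], clusters[q])
    for p, q in combinations(clusters, 2)
}
best_pair = min(costs, key=costs.get)
print(best_pair, costs[best_pair])
```

The same cost can be written in closed form as n_a * n_b / (n_a + n_b) times the squared distance between the two centroids, which is a handy sanity check on the numbers above.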

After applying Ward’s method, I was able to create the following dendrogram representing the cluster hierarchy of NBA players using the following code. The data collection involved in creating the data frame df is detailed here.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

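# Keep 2019-20 players with more than 23 MPG in more than 15 games
testdf = df[(df.MPG > 23) & (df.GP > 15) & (df.SEASON == '2019-20')].reset_index(drop=True)

# Every column except the identifier columns is treated as a feature
features = [c for c in df.columns if c not in ('PLAYER_NAME', 'POSITION', 'SEASON')]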

x = testdf.loc[:, features].values
y = testdf.loc[:,['PLAYER_NAME']].values

x = StandardScaler().fit_transform(x) # standardize all values

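# Keep enough principal components to explain 99% of the variance
pca = PCA(n_components=0.99)
principalComponents = pca.fit_transform(x)

plt.figure(figsize=(8, 22))
plt.title('2019-20 NBA Hierarchical Clustering Dendrogram')
# Ward linkage on the principal components; orientation='left' puts the root on the left
dend = shc.dendrogram(shc.linkage(principalComponents, method='ward'), labels=list(testdf.PLAYER_NAME), orientation='left')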

plt.yticks(fontsize=8)
plt.xlabel('Height')

plt.tight_layout()

plt.show()

Okay, that’s a lot of information. If you read this from left to right, you’ll notice that this is essentially one huge cluster consisting of 195 players. This cluster consists of three smaller clusters, with each cluster consisting of smaller and smaller clusters until we reach the individual players.
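
If you'd rather have those top-level groups as labels than read them off the plot, one option is to cut the tree with SciPy's fcluster. This is only a sketch: the feature matrix below is synthetic stand-in data (three well-separated groups), not the player data, and asking for three clusters is an assumption taken from the plot.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Synthetic stand-in for the standardized player features: three
# well-separated groups of 20 "players" with 5 features each.
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
features = np.vstack([c + rng.normal(size=(20, 5)) for c in centers])

Z = linkage(features, method="ward")

# Cut the tree so that at most three flat clusters remain.
cluster_ids = fcluster(Z, t=3, criterion="maxclust")

sizes = np.bincount(cluster_ids)[1:]  # fcluster ids start at 1
print(sizes)
```

On the real data you would pass the same linkage matrix that was handed to the dendrogram call, then attach cluster_ids to the player names.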

This form of data visualization allows us to assess the similarity between various players. Here are a few points that stand out:

  • At a quick glance, the cyan cluster seems to contain role players, the red cluster contains ball-dominant guards and forwards, while the green cluster primarily consists of big men. There are some exceptions² and I’m sure there’s a better way to describe the trend, but that’s the first thing that sticks out to me.
  • The most similar player to LeBron James is apparently Luka Doncic. That makes sense — they’re both versatile offensive players who stuff the stat sheets while also being quite ball-dominant. They both possess an uncanny feel for the game that elevates the performance of their teammates. I think LeBron is a step above Luka defensively, but the similarities are there.
  • The comparison between Embiid and Davis makes a lot of sense to me. Both are among the league’s absolute best defensive players. They’re also extremely valuable offensive players but are a bit inconsistent in terms of shooting the ball. They both probably shoot jumpers a bit more than they should, as their skills are best suited for the low post.
  • The most similar player to Zion Williamson is … Montrezl Harrell. Both players are incredible offensive big men despite their short stature compared to other bigs. They also have deficiencies on the defensive side of the ball. It seems like a reasonable comparison.
  • The catch-and-shoot role player cluster! JJ Redick, Davis Bertans, Landry Shamet, and Duncan Robinson are paired with one another inside of a cluster. All of them primarily operate without the ball in their hands, and when they do have the ball, they’re nailing 3-pointers at an efficient rate.
  • The Tatum/Siakam comparison makes sense in more ways than just the data: both are young players who represent the future for two powerhouses of the Eastern Conference. Well, and the present. They’re already stars, of course.
  • I like the comparison between Jrue Holiday and Eric Bledsoe. Both are among the league’s best defensive guards, while their scoring is far more mediocre.
  • See that cluster with Westbrook, Butler, and DeRozan? Let’s call it the “elite guards and forwards who struggle with 3-pointers” cluster.
  • I like the cluster with Paul George, Kawhi Leonard, Khris Middleton, Jayson Tatum, Brandon Ingram, Zach LaVine, and Pascal Siakam. They’re all elite and versatile scorers. There’s also a pretty wide range in age there.
  • There are some results in there that don’t really make sense, but in general, I’m pretty happy with the results. I think it’s a really interesting visualization with a lot of cool insights.
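
For anyone who wants to look up "most similar player" pairs without squinting at the plot, here's one possible way to read them off the linkage matrix: find the first merge that involves a given leaf and return whatever it was merged with. The helper name, the player labels, and the coordinates are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def first_merge_partners(Z, labels, name):
    """Return the labels in the cluster that `name` is first merged with."""
    n = len(labels)
    members = {i: [i] for i in range(n)}  # cluster id -> leaf indices
    target = labels.index(name)
    for step, (i, j, _height, _size) in enumerate(Z):
        i, j = int(i), int(j)
        if i == target:
            return [labels[k] for k in members[j]]
        if j == target:
            return [labels[k] for k in members[i]]
        members[n + step] = members[i] + members[j]
    raise ValueError(f"{name!r} never appears in the linkage matrix")

# Made-up points standing in for player feature vectors.
labels = ["Player 1", "Player 2", "Player 3", "Player 4", "Player 5"]
points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [10.0, 0.0],
    [10.0, 1.5],
    [3.0, 0.0],
])
Z = linkage(points, method="ward")

print(first_merge_partners(Z, labels, "Player 1"))
print(first_merge_partners(Z, labels, "Player 5"))
```

A player who pairs off directly with another leaf gets a single name back; a player who joins an existing group gets the whole group, which is arguably the more honest answer.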

The '2019-20' in the code can be easily replaced with any of the past seven seasons, so let’s go ahead and take a trip back in time by looking at a dendrogram of NBA players from the 2013-14 regular season.
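
One way to make that swap less manual is to wrap the pipeline in a function that takes the season as an argument. This is a sketch rather than the code from the repository: the function name and its default thresholds are assumptions that mirror the filter used earlier, it clusters on the PCA output, and the demo rows are made up purely to exercise it.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as shc
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

ID_COLUMNS = ("PLAYER_NAME", "POSITION", "SEASON")

def season_linkage(df, season, min_mpg=23, min_gp=15):
    """Filter one season and return (linkage matrix, player names)."""
    keep = (df.MPG > min_mpg) & (df.GP > min_gp) & (df.SEASON == season)
    season_df = df[keep].reset_index(drop=True)
    features = [c for c in season_df.columns if c not in ID_COLUMNS]
    x = StandardScaler().fit_transform(season_df[features].values)
    pcs = PCA(n_components=0.99).fit_transform(x)
    return shc.linkage(pcs, method="ward"), list(season_df.PLAYER_NAME)

# Made-up rows purely to exercise the function; not real player data.
rng = np.random.default_rng(1)
rows = []
for season in ("2013-14", "2019-20"):
    for k in range(8):
        rows.append({
            "PLAYER_NAME": f"Player {k} ({season})",
            "POSITION": "G",
            "SEASON": season,
            "MPG": 24.0 + k,
            "GP": 40 + k,
            "PTS": rng.uniform(5, 30),
            "AST": rng.uniform(1, 10),
        })
demo_df = pd.DataFrame(rows)

Z, names = season_linkage(demo_df, "2013-14")
print(Z.shape, len(names))
```

The returned matrix and names can go straight into shc.dendrogram(Z, labels=names, orientation='left') to draw the plot for that season.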

There are only two big clusters in this dendrogram. Interesting. Maybe it’s partially due to the greater prominence of the center position.

The LeBron James – Kevin Durant comparison is funny because of how many people were probably engaging in arguments about the two all-time greats after Durant’s historic MVP season in 2014. Those arguments continued over the past few years during Durant’s stint with the Warriors.

It’s also cool to see that the most similar player to Stephen Curry at this point was James Harden. Curry averaged 24 PPG for the 51-31 Warriors in 2014, while Harden averaged 25 PPG for the 54-28 Rockets. They were both very good players at the time, but nobody could have predicted how good they would become.

Alright, I’m gonna leave it at that. There’s plenty of information left in there, so go observe some more cool stuff on your own. You can check out Part I and Part II from the series in which I previously compiled and utilized this data, along with all of the code from that series, here in this GitHub repository.


1. Players who played at least 23 minutes per game in at least 15 games.
2. Donte DiVincenzo isn’t ball-dominant, Blake Griffin isn’t a role player, Draymond Green isn’t a big man, etc.