### Background

Basketball analysts have searched for years for a viable all-in-one metric for quantifying the on-court impact of basketball players. We can always look at basic statistics like points and rebounds, but what about everything that doesn’t show up in the box score? Skills like setting effective screens, contesting shots, boxing out, and off-ball gravity can all have a huge impact on any given basketball play while not showing up in the box score. What if we could really assess player impact without needing to rely on these numbers?

A traditional way to represent player impact without box score stats is to use base plus-minus. For example, Steph Curry had a +4.0 plus-minus per 100 possessions in the 2021 season. This number means that with Curry on the court, the Warriors outscored their opponents by four points per 100 possessions. You can also compare this figure with the team’s plus-minus when he’s off the floor. Steph had a +8.6 on-off plus-minus, meaning that the Warriors outscored their opponents by 8.6 more points per 100 possessions when Steph was on the floor than when he was not.
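To make the arithmetic concrete, here is a minimal sketch of both calculations. The point differentials and possession counts are hypothetical totals chosen only so the results land on the +4.0 and +8.6 figures quoted above; they are not Curry’s actual season totals.

```python
def per_100(point_diff: float, possessions: int) -> float:
    """Scale a raw point differential to a per-100-possessions rate."""
    return 100.0 * point_diff / possessions


# Hypothetical totals (assumed for illustration, not real data).
on_diff, on_poss = 200, 5000     # team margin and possessions with the player on the floor
off_diff, off_poss = -92, 2000   # team margin and possessions with the player off the floor

on_pm = per_100(on_diff, on_poss)      # plus-minus per 100 while on the floor
off_pm = per_100(off_diff, off_poss)   # plus-minus per 100 while off the floor
on_off = on_pm - off_pm                # on-off plus-minus

print(f"on: {on_pm:+.1f}, off: {off_pm:+.1f}, on-off: {on_off:+.1f}")
```

Note that the on-off number depends just as much on the "off" sample as on the "on" sample, which is why a strong or weak bench can move it so much.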

While easy to understand, the traditional plus-minus metric is very flawed. If an inferior player’s minutes heavily aligned with Curry’s, their plus-minus would look far better than it should just because they get to play with Curry. In other words, base plus-minus does not adjust for the strength of your teammates. Furthermore, while a quick look at on-off plus-minus may let you know that a player is carrying their team, it tells you more about how strong a team’s bench is than anything else. James Harden posted an on-off plus-minus of +9.1 in 2020, which dwarfs his +0.2 mark in 2021 (albeit on a small sample size). Did he get that much worse? Of course not – the Brooklyn Nets are just far more equipped to play at a high level without Harden on the floor than the Rockets were.

So, we need to adjust for the other players on the floor. That’s the idea of Adjusted Plus-Minus (APM) – solving the system of linear equations that relates the players on the court to the associated plus-minus for their time on the floor together.

Suppose that we have a matrix called **A** representing the players on the floor (one row for each stint and one column for each player, with a value of 1 if they’re on the floor for that stint and a value of 0 otherwise) and a vector **b** representing the plus-minus per 100 possessions for each stint. We can then solve **Ax=b** for **x**, a vector with one coefficient per player representing that player’s on-court value.
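As a toy illustration of that setup (not the actual dataset), the sketch below builds a tiny **A** and **b** for three made-up players and four stints, then solves the system by ordinary least squares with NumPy. The player values and stint results are invented so that the system has an exact answer.

```python
import numpy as np

# Rows are stints, columns are players; 1 means the player was on the floor.
# Three hypothetical players and four hypothetical stints.
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)

# Plus-minus per 100 possessions for each stint (made-up values).
b = np.array([6.0, 3.0, -1.0, 4.0])

# Least-squares solution of Ax = b: one coefficient per player.
x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)

for player, value in enumerate(x):
    print(f"player {player}: {value:+.2f}")
```

With real data there are far more stints than players and the results are noisy, so the system has no exact solution and least squares returns the best fit instead.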

That’s the adjusted plus-minus solution. As you might anticipate, it has its own drawbacks, most notably a high degree of variance. This problem can be alleviated with the addition of a filtering term that essentially acts as a penalty on extreme coefficients – it shrinks all values toward zero.

An optimal **lambda** value can be found that yields an approximate solution to the original **Ax=b** problem. This penalized approach, known as ridge regression, is the method used to calculate regularized adjusted plus-minus (RAPM).
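For readers who want to see the penalty in action, here is a small sketch of the standard closed-form ridge solution, which solves `(A.T @ A + lam * I) x = A.T @ b`. It reuses a made-up three-player example; the function name `ridge_solve` and the lambda values are my own choices for illustration and are not taken from the project’s code.

```python
import numpy as np


def ridge_solve(A: np.ndarray, b: np.ndarray, lam: float) -> np.ndarray:
    """Minimize ||Ax - b||^2 + lam * ||x||^2 via the normal equations."""
    n_players = A.shape[1]
    gram = A.T @ A + lam * np.eye(n_players)
    return np.linalg.solve(gram, A.T @ b)


# Hypothetical stints (rows) and players (columns).
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)
b = np.array([6.0, 3.0, -1.0, 4.0])

x_apm = ridge_solve(A, b, lam=0.0)    # lam = 0 reduces to plain least squares (APM)
x_rapm = ridge_solve(A, b, lam=5.0)   # lam > 0 shrinks the coefficients toward zero

print("APM: ", np.round(x_apm, 2))
print("RAPM:", np.round(x_rapm, 2))
```

The same objective is what scikit-learn’s `Ridge(alpha=lam, fit_intercept=False)` minimizes, so on a dataset of realistic size that estimator is the more practical route.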

Notice that at this point, the only stat that has been used is the stint plus-minus for each combination of players on the floor. No assists, rebounds, steals, blocks, etc. Or even points scored by a specific player – just the overall plus-minus in that duration of time.

Improvements upon RAPM like RPM and RAPTOR incorporate Bayesian priors with box score and tracking stats to give a better prior estimate of player value for the model. Instead of regressing to zero, you can regress to this prior value. This leaves us with two versions of RAPM: non-prior informed RAPM (NPI RAPM) which uses nothing but the lineup metrics, or prior informed RAPM (PI RAPM) which improves the accuracy of the model in small samples by incorporating box score numbers.
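One common way to regress toward a prior rather than toward zero is to run the ridge regression on what the prior fails to explain and then add the prior back. The sketch below shows that idea on made-up numbers; `ridge_with_prior` and the prior values are hypothetical and are not how RPM or RAPTOR are actually implemented.

```python
import numpy as np


def ridge_with_prior(A: np.ndarray, b: np.ndarray, prior: np.ndarray, lam: float) -> np.ndarray:
    """Minimize ||Ax - b||^2 + lam * ||x - prior||^2."""
    n_players = A.shape[1]
    residual = b - A @ prior                     # what the prior alone does not explain
    delta = np.linalg.solve(A.T @ A + lam * np.eye(n_players), A.T @ residual)
    return prior + delta                         # shrink the adjustment, not the player value


# Hypothetical stints, results, and prior estimates of player value.
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)
b = np.array([6.0, 3.0, -1.0, 4.0])
prior = np.array([3.0, 0.0, -1.0])

x_npi = ridge_with_prior(A, b, np.zeros(3), lam=5.0)   # zero prior: ordinary (NPI) ridge
x_pi = ridge_with_prior(A, b, prior, lam=5.0)          # prior-informed estimate
x_heavy = ridge_with_prior(A, b, prior, lam=1e9)       # huge lam: collapses onto the prior

print("NPI:", np.round(x_npi, 2))
print("PI: ", np.round(x_pi, 2))
```

With little data (or a very large lambda) the estimate stays close to the prior, and as the evidence accumulates the lineup results take over.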

In most of its use cases, NPI RAPM has noticeable weaknesses. People usually want to see the numbers for a single season, which is a relatively small sample size for a regression without a suitable prior. There’s also a lot of variance in small samples because of factors like the three-point shot. A player shooting higher or lower than their average from three-point range in a small sample may have nothing to do with the lineup or the defense, but have a large impact on the result of a regression. Thus, alternatives like EPM and PIPM incorporate luck adjustments that make them much better suited to smaller samples.

In larger samples, though, you’d expect most of these factors to sort themselves out. Five-year RAPM is often cited and yields more reasonable results than a one-season sample. What if we go even further than that? Say, 25 years?

### 25 Year RAPM (1997-2021)

I began this project by scraping play-by-play data for every regular season and postseason game since 1997. Then I used the ideas in this tutorial (adapting them to Python) to get lineup data for each possession in the play-by-play. I was able to do this for every dataset except for the 1997 regular season, which contained a lot of missing information. The data used in final RAPM calculations is almost entirely complete from the 1997 postseason to Game 6 of the 2021 Finals.

In order to address the greater importance of the postseason, I doubled playoff possessions to increase their weight in calculations. At the end of the data collection process, I had compiled 859,049 stints across 5,972,736 possessions.
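Here is one way that kind of weighting could look in a ridge regression. This is a sketch under my own assumptions: each stint is weighted by its possession count, and playoff stints get that weight doubled. The data and the name `weighted_ridge` are hypothetical.

```python
import numpy as np


def weighted_ridge(A: np.ndarray, b: np.ndarray, weights: np.ndarray, lam: float) -> np.ndarray:
    """Minimize sum_i w_i * (A_i x - b_i)^2 + lam * ||x||^2."""
    Aw = A * weights[:, None]                    # scale each stint row by its weight
    gram = A.T @ Aw + lam * np.eye(A.shape[1])   # A^T W A + lam * I
    return np.linalg.solve(gram, Aw.T @ b)       # right-hand side is A^T W b


# Hypothetical stints: lineup indicators, per-100 results, possessions, playoff flag.
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)
b = np.array([6.0, 3.0, -1.0, 4.0])
possessions = np.array([10.0, 8.0, 12.0, 6.0])
is_playoff = np.array([False, False, True, False])

# Playoff stints count double.
weights = possessions * np.where(is_playoff, 2.0, 1.0)

lam = 5.0
x = weighted_ridge(A, b, weights, lam)
print(np.round(x, 2))
```

Doubling a stint’s weight gives the same answer as listing that stint twice in the data, so either implementation of "doubling" leads to the same coefficients.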

I also utilized a prior to improve the accuracy of the regression. I had initially planned not to do so, but it came to my attention that any long-term RAPM essentially requires some form of prior to account for player aging. Peak Shaquille O’Neal was one of the most unstoppable players in the history of the game. A 38-year-old version of Shaq was far less of an intimidating force. Both versions would be treated the same with NPI RAPM, meaning that 38-year-old Shaq’s teammates would be penalized for playing alongside him because of how great he was in the past. That doesn’t make any sense.

My initial plan was to incorporate a simple prior based on age. The problem is that all players don’t age on the same curve. Prior studies found that a Bayesian prior based on playtime & team strength is more reliable than age, so I went with that approach in this project. Thus, a player’s minutes per game and a team’s net rating were the only two external statistics used in the regression.

All that was left was to find an optimal lambda value and then run the ridge regression. Additional measures could certainly be taken to improve the accuracy of the model, such as a luck-adjustment or a different way to treat “garbage time” possessions. For now, I didn’t want to interfere too much unless it was required (like the aforementioned prior).
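A typical way to pick lambda is k-fold cross-validation: fit on most of the stints, score the predictions on the held-out stints, and keep the lambda with the lowest average error. The sketch below does this on synthetic data; the candidate lambdas, the fold count, and the simulated stints are all assumptions for illustration rather than the values used in this project.

```python
import numpy as np


def ridge_solve(A, b, lam):
    """Closed-form ridge: solve (A^T A + lam * I) x = A^T b."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)


def cv_error(A, b, lam, n_folds=5, seed=0):
    """Mean squared prediction error on held-out stints, averaged over folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(b)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(b)), held_out)
        x = ridge_solve(A[train], b[train], lam)
        errors.append(np.mean((A[held_out] @ x - b[held_out]) ** 2))
    return float(np.mean(errors))


# Synthetic data: 400 stints, 30 players, noisy stint results.
rng = np.random.default_rng(42)
n_stints, n_players = 400, 30
A = (rng.random((n_stints, n_players)) < 0.3).astype(float)
true_x = rng.normal(0.0, 2.0, n_players)
b = A @ true_x + rng.normal(0.0, 12.0, n_stints)

lambdas = [0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: cv_error(A, b, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

for lam, err in scores.items():
    print(f"lambda={lam:>7}: held-out MSE = {err:.1f}")
print("best lambda:", best_lam)
```

scikit-learn’s `RidgeCV` packages the same search, which is likely the easier option when the regression is already being run through sklearn.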

A quick disclaimer: this is not a player ranking, nor is it meant to be. If I was ranking the top ten players of the 21st century, I would undoubtedly include players like Kobe Bryant and Kevin Durant in my list. Anyone would. But the point of this article isn’t to give my personal rankings. It’s to observe the results of an objective regression.

I’ve plotted the results on a scatter plot with both total possessions and RAPM. There is obviously a large range of total possessions – the data includes both Dirk Nowitzki’s entire 20 year career along with LaMelo Ball’s injury-shortened rookie season. RAPM is, after all, a rate metric; volume shouldn’t be ignored. You can check out a full interactive graph here.

If someone with no prior knowledge of basketball was asked to guess who the best player in the NBA was over the past 25 years based on nothing but this data, they would probably pick LeBron James. And they would be right. Arguably the greatest player in the history of the sport, LeBron has the highest RAPM from 1997-2021 on top of playing the most possessions by a *wide* margin. Even if you disregard his four NBA titles, four Finals MVPs, four MVPs, 17 All-NBA selections, and 27/7/7 career averages, he comes out on top. His on-court impact is undeniable no matter how you look at it.

Tim Duncan and Kevin Garnett are two other legends of the game who are appreciated by impact metrics like RAPM. Despite the world of difference in their career achievements (four more rings and three more Finals MVPs for Duncan), one could argue that this has more to do with situation than anything else. Garnett’s often considered one of the more underrated players because of his transcendent on-court impact that isn’t reflected in his box score stats, on top of the fact that he spent much of his prime carrying a middling Timberwolves squad.

Right up there with Garnett and Duncan are two legendary point guards in Chris Paul and Steph Curry. Paul, the “Point God,” is one of the greatest floor generals in the history of the game. More of a traditional point guard, CP3 has called the shots for elite offenses his entire career while also playing great defense for his position. Meanwhile, Curry’s unprecedented shooting ability makes him one of the most impactful offensive players of all-time. The threat of his perimeter shooting along with his off-ball movement draws defenders away from his teammates and makes everyone’s job easier on the Warriors. On top of elite finishing and solid playmaking, Curry’s offensive impact is legendary.

Joel Embiid ranks a whopping *2nd* in RAPM. While that’s higher than I anticipated, Embiid is certainly a special player and one of the most impactful players in the league currently. Unfortunately, his biggest problem is the inability to stay healthy. He just came off of an incredible season averaging 29/11 on 64% TS% while playing great defense – he likely would’ve been the league MVP if he hadn’t gotten hurt. Oh, Embiid also averaged 28/11 on 63% TS% in the playoffs while playing on a torn meniscus. He’s pretty good. Nonetheless, I don’t anticipate he’ll maintain a top-two ranking in the RAPM leaderboards as he logs more playing time. He’s great, but probably not at the level of guys like Steph, Garnett, and Duncan.

Kobe Bryant’s the guy who tends to be rated lower than expected by advanced metrics like RAPM. As one of the more polarizing players in league history, it’s definitely true that many people overrate his all-time standing. But ranking 59th overall from 1997-2021 is still insanity. That’s behind recent players like Dillon Brooks and Otto Porter Jr. I’m not entirely sure why the metric is underrating him so much. Perhaps his post-prime years have a large impact? The prior should account for that but I’ll look into the highest RAPM peaks in another article to see if it provides any answers.

Those are the main results that stand out, but here’s the full data consisting of 2267 players over the past 25 years.

The other players in the top 10 are Nikola Jokic, Manu Ginobili, Draymond Green, and Jayson Tatum. Draymond and Manu are other players frequently given a lot of love by impact metrics, so it’s not surprising to see them here. Jokic has also been a high impact player throughout his entire NBA career thus far, culminating in an extremely impressive MVP season in 2021. Tatum’s a fantastic young player but the top 10 is certainly higher than I expected to see him. I’d say the same thing I said about Embiid – he’s great, but probably not *that* great.

If we limit the rankings to players with at least 150,000 possessions to get rid of the Joel Embiids and Jayson Tatums in there (kudos to both of them, though), we get Dirk Nowitzki, James Harden, Shaquille O’Neal, Kevin Durant, Rasheed Wallace, and Paul Pierce rounding out the top 10 after James, Garnett, Paul, and Duncan. O’Neal is another player who I expected to rank higher. Among all players, his 3.78 RAPM is the 16th highest since 1997. He’s probably a bit hurt by the 1997 cutoff because of how great he was in Orlando.

Other observations:

There are a few other things I’d like to do with this data, such as looking for the best five-year individual peaks during this span. Maybe that would better illustrate the greatness of Shaquille O’Neal and Kobe Bryant. I could also separate the regular season and postseason results and observe the difference in each. For now, I think this 25-year outlook was enough for one article and a good start to the analysis. There were a few “problems” (it’s not really a problem because again, this isn’t meant to be a player ranking) like Kobe Bryant’s low ranking and the high ranking of Joel Embiid, but I’m satisfied with the project overall.

Great work! As someone with a weak stats background (little bit at uni but not much), what’s the approximate confidence interval on these?

I imagine that depending on priors there are different stable solutions?

Great question. Unfortunately, the sklearn ridge regression function doesn’t have variance built-in or any way to extract it. I then tried to do the ridge regression manually with matrix multiplication as seen here, but my computer proved to not have the memory to handle the massive amount of data involved when doing so. It’s an interesting and important question so I’d love to see someone else tackle it.

Just to clarify, the dimension of vector b is N, where N is the number of stints, right? So b represents the overall team plus-minus for each stint? And suppose you have two players, Steph Curry and Klay Thompson on the Warriors, who exclusively play with each other (i.e. Steph Curry is on the court if and only if Klay Thompson is on the court). Then there would be no way to determine with RAPM who is the more impactful player without a prior, correct?

That is correct (the team plus-minus is adjusted for how many possessions were played in that stint, though – it’s per 100 possessions, to be exact). And yes, if two players are always on the court with one another, we would not be able to isolate each individual player’s impact.
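A tiny made-up example shows why. When two players share every stint, their columns in the lineup matrix are identical, the matrix loses rank, and the ridge penalty simply splits the credit evenly between them. The lineups, results, and lambda below are hypothetical.

```python
import numpy as np

# Columns 0 and 1 are identical: those two players are always on the floor together.
A = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
], dtype=float)
b = np.array([8.0, 5.0, -3.0, 8.0])

# Three players but only two independent columns, so Ax = b cannot separate the pair.
rank = np.linalg.matrix_rank(A)

# Ridge still returns a unique answer, but it gives both players the same value.
lam = 1.0
x = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ b)

print("rank of A:", rank)
print("coefficients:", np.round(x, 2))
```

Only a prior that treats the two players differently (or stints where they play apart) can pull their estimates away from each other.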

What does it mean for somebody to have a RAPM of X for their career when you have a non-constant prior? That they average X overall, including the prior? That they peaked at X, and were lower other years when they had lower prior?

It should be interpreted as an average over the entire 25 year timespan while accounting for the prior. Looking at smaller samples (like five year RAPM results) would be more appropriate for estimating a player’s peak value.