Background
Reddit is a popular social media network split into various communities based on different interests. Each community is known as a subreddit and they’re referred to with the prefix ‘r/’ before the community name. There’s a subreddit out there for anything you can think of. For example, you can browse r/sports for general sports discussion or r/soccer for general soccer discussion. Even more specifically, you can browse r/barca for news and discussion specifically pertaining to F.C. Barcelona.
In this article, we’ll focus on the NBA subreddit (r/nba), which is the most popular sport-specific community on the website.
Almost four million users are subscribed to r/nba – no other sports subreddit outside of r/sports has even hit three million. At the time of writing, 23,000 users are on r/nba even though no games are being played. It’s certainly an active community.
After every NBA game, a user posts something called a “Post Game Thread.” This post’s title gives the basic information about a game, including the winning and losing teams, the score, and anything noteworthy that occurred. For example, the post game thread when the Nets defeated the Clippers a month ago was titled “[Post Game Thread] The Brooklyn Nets (14-9) defeat the Los Angeles Clippers (16-6) 124-120 behind 39 points from Kyrie Irving and a 23/11/14 triple double from James Harden.” The post itself contains the game’s box score, and the comments are free to any discussion related to the team. That particular post game thread received 1012 comments.
Reddit is known for its scoring system. Users can ‘upvote’ or ‘downvote’ posts and comments. Posts with a higher score (upvotes minus downvotes) show up on the front page of a subreddit. The post game thread (PGT) for the Nets-Clippers game received a score of 2608 (98% upvote ratio). Meanwhile, the PGT for the Nuggets beating the Thunder last night has a score of just 205 with only 50 comments. Clearly it didn’t gain as much traction. This makes sense, of course. People are gonna care more about a star-studded matchup between two Finals contenders than the Nuggets blowing out the Thunder by 30.
The question I wanted to explore in this article is simple. Using a post’s score to represent its popularity, which teams’ post game threads tend to differ in score the most between wins and losses? Are there any teams whose losses get a lot of traction but no one really cares about their wins? Or vice versa?
Data
I used PRAW (Python Reddit API Wrapper) and PSAW (Python Pushshift.io API Wrapper) to obtain the submission data for every post game thread from the 2019-20 regular season prior to the bubble. I didn’t explore the 2020-21 regular season data because it is currently in progress, and I didn’t include the games played in the bubble because I felt that the abnormal circumstances could influence the data.
In total, 971 NBA games in the 2019-20 regular season were played prior to the suspension of the season on March 11th. I was able to compile the data for 960 of these games. It’s not entirely clear why 11 games are missing from the dataset. Maybe the user who posted the PGT later deleted their account (or their account was terminated). Or a game was so low-profile that literally nobody wanted to go out of their way to create a PGT.[1] In any case, having 98.9% of the data is certainly not too shabby.
Let’s quickly explore the data.
The least popular post game thread of the pre-bubble 2019-20 regular season was a 15-point win for the Cleveland Cavaliers over the Detroit Pistons. The game’s corresponding PGT has a score of 28 and received only 11 comments, four of which simply remarked on the fact that the PGT was posted two hours after the game ended. The most popular post game thread was the Warriors’ shocking upset win over the Houston Rockets on Christmas Day. The post had a score of 18437 and there were a whopping 2101 comments.
The distribution of post game thread scores is also clearly right-skewed.
Interestingly enough, 48.6% of the post game threads in the dataset have a score of less than 500. Of course, this differs from team to team. The lowest score for a Lakers PGT was 526, for example. That’s the point of this article, after all.
Process & Results
The submission data for the 960 post game threads include the post’s title, score, upvote ratio, number of comments, etc. The title can be used to determine which team won the game and which team lost. This process was incredibly tedious because there’s no convention for titling post game threads. The majority of people use the template “team X defeats team Y” along with the score and any other information, which makes it quite easy to split the title string into the text before ‘defeats’ and the text after ‘defeats’ and then parse each substring for team names. The difficulty arises in the 18% of post game threads that use a word other than ‘defeat.’ Words like “beat” were fine and understandable, but going through verbs like “wallop,” “wollop,” “stave off,” “wax,” “eviscerate,” “demolish,” “banish,” “obliterate,” and “annihilate” was rather time-consuming. It really makes you appreciate standardization.
With the winner and loser of every post game thread determined, I could then split the data into post game threads for each team’s wins and losses. Given the right-skewed distribution of PGT scores, I recorded the median score for each team’s PGTs after wins and then after losses. Then I created a new variable to represent the difference in median scores (median PGT score after wins minus median PGT score after losses) for each team.
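As a rough illustration of the title parsing described above, here is a minimal sketch. It is not the code used for this article: the team list is abbreviated, the function names are made up for the example, and the verb pattern only covers the verbs mentioned above plus their simple inflections. Titles with an unrecognized verb return None so they can be reviewed by hand.

```python
import re

# Abbreviated, illustrative team list; a full run would need all 30 teams.
TEAMS = [
    "Brooklyn Nets", "Los Angeles Clippers", "Denver Nuggets",
    "Oklahoma City Thunder", "Cleveland Cavaliers", "Detroit Pistons",
]

# Verbs mentioned in the article, with optional -s/-es/-d/-ed endings.
VERB_RE = re.compile(
    r"\b(?:defeat|beat|wallop|wollop|wax|eviscerate|demolish|banish|"
    r"obliterate|annihilate)(?:e?s|e?d)?\b|\bstaves? off\b",
    re.IGNORECASE,
)


def find_team(text):
    """Return the known team name that appears earliest in text, or None."""
    hits = [(text.find(team), team) for team in TEAMS if team in text]
    return min(hits)[1] if hits else None


def parse_title(title):
    """Return (winner, loser) from a post game thread title, or None."""
    match = VERB_RE.search(title)
    if match is None:
        return None  # verb not in the list: flag for manual review
    winner = find_team(title[:match.start()])
    loser = find_team(title[match.end():])
    if winner is None or loser is None:
        return None
    return winner, loser


title = ("[Post Game Thread] The Brooklyn Nets (14-9) defeat the "
         "Los Angeles Clippers (16-6) 124-120 behind 39 points from "
         "Kyrie Irving and a 23/11/14 triple double from James Harden")
print(parse_title(title))
```

Returning None instead of guessing keeps the oddly worded titles visible, which is where most of the manual effort went.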
A negative median PGT score differential means that losses for that team get more traction than wins. A positive median PGT score differential obviously means the opposite — post game threads for that team’s wins are more popular than those for their losses.
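The median differential can be sketched in a few lines. This is an illustration under assumptions rather than the original code: the function name is made up, each thread is assumed to already be reduced to a (winner, loser, score) tuple, and the toy scores are invented.

```python
from collections import defaultdict
from statistics import median


def median_score_differentials(threads):
    """Map each team to (median score after wins) - (median score after losses).

    threads is an iterable of (winner, loser, score) tuples.
    """
    win_scores = defaultdict(list)
    loss_scores = defaultdict(list)
    for winner, loser, score in threads:
        win_scores[winner].append(score)
        loss_scores[loser].append(score)

    # Only teams with at least one win and one loss have a differential.
    return {
        team: median(win_scores[team]) - median(loss_scores[team])
        for team in win_scores.keys() & loss_scores.keys()
    }


# Invented toy data: team "A" draws far more attention when it loses.
threads = [
    ("A", "B", 300), ("A", "C", 500), ("B", "A", 2000),
    ("C", "A", 2400), ("B", "C", 150), ("C", "B", 250),
]
diffs = median_score_differentials(threads)
print(diffs["A"])  # negative: losses get more traction than wins
```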
One would expect win percentage and median PGT score differential to be negatively correlated, and they are. A loss is obviously going to be more noteworthy for a team that wins 80% of its games than for a team that wins 20% of its games. Thus, I think the most reasonable way to evaluate popularity here is to apply an exponential regression model to the data and rank the teams based on their residuals. An exponential model was chosen because it would make sense (and the data supports this trend) for fans to equally neglect mediocre teams, whether they have a 50% win percentage or even lower. Only above that point would you expect the dropoff.
We can now sort the teams by their residual (actual median PGT score differential minus expected median PGT score differential based on win percentage).
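To make the residual ranking concrete, here is one way it could be sketched. The exact functional form and fitting method are not specified in this article, so this assumes a curve of the form y = a * exp(b * x) + c (the offset c lets the differential be negative) and fits it with a simple grid search over b, a stand-in for a proper nonlinear solver. All names and numbers below are invented for illustration.

```python
import math


def fit_exponential(xs, ys, b_grid):
    """Fit y ~ a * exp(b * x) + c.

    For a fixed b the model is linear in (a, c), so those two are solved by
    ordinary least squares; the b with the smallest squared error is kept.
    """
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for b in b_grid:
        zs = [math.exp(b * x) for x in xs]
        z_mean = sum(zs) / n
        szz = sum((z - z_mean) ** 2 for z in zs)
        if szz == 0:
            continue  # b = 0 makes exp(b * x) constant; nothing to fit
        a = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys)) / szz
        c = y_mean - a * z_mean
        sse = sum((a * z + c - y) ** 2 for z, y in zip(zs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def residuals(teams, xs, ys, params):
    """Residual = actual differential minus the differential the curve expects."""
    a, b, c = params
    return {t: y - (a * math.exp(b * x) + c) for t, x, y in zip(teams, xs, ys)}


# Invented toy data: win percentage and median PGT score differential.
teams = ["A", "B", "C", "D", "E", "F"]
win_pct = [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
score_diffs = [120, 90, 100, 20, -250, -900]

params = fit_exponential(win_pct, score_diffs, [k * 0.5 for k in range(1, 41)])
ranked = sorted(residuals(teams, win_pct, score_diffs, params).items(),
                key=lambda item: item[1])
for team, resid in ranked:
    print(team, round(resid, 1))
```

Teams at the top of the printed list have losses that draw more attention than their win percentage alone would suggest.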
For the most part these results check out. I would definitely agree that the Lakers and Clippers were by far the most disliked teams in the 2019-20 season. I would’ve guessed that the Warriors were the most liked team, but they’re not too far off and the Raptors aren’t a surprising result either. The Warriors also clearly stand out as the only bad team to be considered very liked. The most recent graph shows this as well.
There are some issues with the methodology that are made clear by the results. Are the Bucks and Raptors actually loved by r/nba users? Or do they only seem to be loved because of how hated the Clippers and Lakers are (thus influencing the exponential regression)? Maybe the Bucks and Raptors aren’t loved at all and the LA teams are just hated. Also, what are the Bulls doing there? I have no idea.
While the exponential regression may not be the best way of going about this, some of the trends don’t need to be confirmed with a model. Just take a look at the distribution of post game thread scores after the Lakers win and after they lose.
Almost every score for a post game thread after a Lakers loss would be considered an outlier among the scores for threads after Lakers wins. NBA fans sure love seeing the Lakers lose.
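One way to check a claim like this numerically is the usual box plot rule, where anything above Q3 + 1.5 × IQR counts as an outlier. That rule is an assumption here, since the outlier definition is not spelled out, and the scores below are invented rather than actual Lakers data.

```python
from statistics import quantiles


def upper_fence(values, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR, the common box plot outlier cutoff."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def share_above_fence(reference, candidates):
    """Fraction of candidates that would be outliers among reference."""
    fence = upper_fence(reference)
    return sum(score > fence for score in candidates) / len(candidates)


# Invented scores for threads after wins and after losses.
win_scores = [600, 700, 800, 900, 1000, 1100, 1200, 1300]
loss_scores = [1500, 2500, 3000, 4000]

print(share_above_fence(win_scores, loss_scores))
```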
Future Work
This project stemmed from a very simple question. After completing it, though, I’m not entirely satisfied with the answer. I can’t help but think that a more complex analysis can reveal far more significant results. A model could be trained to predict the score of every post game thread based on variables such as each team’s win percentage, the difference in score, etc. Then the actual PGT scores can be compared to that expected total, not the one produced by a crude exponential regression. I think this idea also wouldn’t be very difficult and I plan on trying it out after the completion of the 2020-21 season. For now, this will suffice and it was a fun dive into analyzing data from a social media site.
[1] Both of these theories are possible because the creation of post game threads is not automated. Thus, it’s the responsibility of individual users to create them. It’s rarely an issue (most games feature multiple users trying to race to be the first to post the PGT as the game ends), but there have been instances in which the thread wasn’t posted until hours after the game because of how uninteresting that game was.