Two years ago, I created a model to predict the 2019 March Madness tournament. It performed pretty well, most notably picking Auburn to advance to the Final Four and correctly predicting that Virginia would win the National Championship. On the whole, the model’s forecasted bracket finished in the 99th percentile of ESPN’s Tournament Challenge.
Back then, I used a massive Excel spreadsheet to train a linear regression model that violated many fundamental rules of statistics. While the result may have been solid, the actual creation of the model was awful and it almost certainly would’ve performed poorly in the long run. Come on.
Of course, I was a beginner at the time, so it’s hard to blame myself too much. Since then, I’ve learned a lot about data science, and this time around I started from scratch and created the model in Python. I conducted exhaustive feature selection and hyperparameter tuning in order to optimize its predictive ability, which was not the case in 2019.
While the model is different, the idea at its core remains the same. I want to holistically assess a team’s chance of winning each game based on a variety of different variables, including their recent performance leading up to the tournament, their performances against opponents representative of a tournament field, playstyle, etc.
Anyway, let’s get into the predictions.
First Four
Wichita State beating Drake is the closest thing to a safe bet here, but even that’s not exactly a sure thing. Oddly enough, the projected losers in all four of these games are actually the odds-on favorites to win. Not by much, of course.
West Region
Top Team: Gonzaga enters the tournament as the clear favorite. Vegas gives them implied odds of around 32% to win the tournament, with Illinois a distant second at just 16%. There’s a very real chance that they will finish their season undefeated, and the model doesn’t think they face any stout competition in their region. Anything short of a Final Four berth would be a surprise.
Potential Disappointment: Neither the No. 2 Iowa Hawkeyes nor the No. 3 Kansas Jayhawks are given a super optimistic shot at making it to the Sweet Sixteen. They have their hands full in the second round with No. 7 Oregon and No. 6 USC respectively.
Sleeper Pick: Along the same lines, USC and Oregon both have relatively solid odds to make it to the Elite Eight. If USC were to upset Kansas and move on to play Iowa in the Sweet Sixteen, they’re projected to have a 48% win probability in that game. And if Oregon moves on to play Kansas in the Sweet Sixteen, the model actually slightly favors them to advance with a 53% win probability. Both teams would still be massive underdogs against Gonzaga in the Elite Eight, but still. Knocking off the No. 2 and No. 3 teams in the region would be a superb run.
Most Probable First-Round Upsets: No. 11 Wichita State over No. 6 USC (40.58 percent chance)
East Region
Top Team: Michigan isn’t considered a juggernaut to the degree of Gonzaga, but they’re certainly still the favorites to come out of the East. None of Michigan’s projected opponents have a win probability of 33% or higher against them.
Potential Disappointment: Alabama’s 57% chance of defeating No. 7 Connecticut in the second round is not as high as you’d expect for a 2v7 matchup. It would be a massive disappointment if Alabama were knocked off that early, although it would be great for the Huskies, who would have a projected 55.28% chance of beating Texas in the Sweet Sixteen.
Sleeper Pick: The team in this region that has the best shot at knocking off Michigan if they played is actually the No. 4 Florida State Seminoles. In this hypothetical matchup (which the model does not think will occur), the Seminoles would have a projected 43% chance of coming out on top, in which case they would advance to the Elite Eight to most likely play Texas or Alabama. The model projects that Florida State would have a 46% chance of winning in either of those matchups, which is not an awful shot at a Final Four berth. Oddly enough, the No. 7 Huskies are actually projected to have a better chance (56%) at beating Florida State in an Elite Eight matchup. That would be an even more exciting run.
Most Probable First-Round Upsets: No. 12 Georgetown over No. 5 Colorado (35.09 percent chance)
South Region
Top Team: Baylor’s projected path is quite similar to Michigan’s in the East. Neither team is expected to dominate like Gonzaga, but they’re also not expected to have a particularly tough time. Unlike Michigan though, there’s no hypothetical matchup within the region that poses a serious threat for an upset. Baylor maintains their high win probability against teams like Purdue, Arkansas, Texas Tech, etc.
Potential Disappointment: The model really doesn’t seem to love Arkansas. For whatever reason, the 14 seed Colgate Raiders are forecasted to have a 37.5 percent chance of defeating No. 3 Arkansas in the first round. Wow. And while Arkansas is still projected to advance to the second round, their run is expected to stop there with a loss to Texas Tech. A second round exit (and potential first round loss!) for a three seed would be a huge disappointment.
Sleeper Pick: The No. 6 Texas Tech Red Raiders are the obvious pick here. The model gives them the edge over No. 3 Arkansas and projects that they have a solid 43% chance at knocking off No. 2 Ohio State. Their chances at advancing to the Elite Eight aren’t bad at all. The bigger sleeper, though, is in the first round. Texas Tech actually only has a 53% chance of advancing to the second round, as the No. 11 Utah State Aggies are projected to have a favorable 47% win probability. That’s almost a coin flip. If they were to win, the Aggies would have a solid 45% chance of topping Arkansas in the second round to advance to the Sweet Sixteen. Not bad.
Most Probable First-Round Upsets: No. 11 Utah State over No. 6 Texas Tech (47.27 percent chance), No. 14 Colgate over No. 3 Arkansas (37.50 percent chance) [1]
Midwest Region
Top Team: The No. 2 Houston Cougars are the only two seed favored by the model to win their region, and it’s not even that close. While Vegas gives the No. 1 Illinois Fighting Illini the second best odds to win the tournament, our model doesn’t even think their chances of beating Houston in the Elite Eight are above 35%. Houston would also be expected to crush teams like No. 7 Clemson, No. 5 Tennessee, No. 4 Oklahoma State, and No. 3 West Virginia if those matchups occurred.
Potential Disappointment: The No. 3 West Virginia Mountaineers are projected to lose in the second round to the No. 6 San Diego State Aztecs, and it’s not exactly a close call. The Mountaineers are only given a 32.6 percent shot at advancing to the Sweet Sixteen, and if they do, don’t expect them to go any further. Houston would have a 76% win probability in such a matchup if it were to occur.
Sleeper Pick: The San Diego State Aztecs are clearly projected to make an impressive Sweet Sixteen run, although the model’s confident that they won’t go any further than that. Liberty and Syracuse are given decent chances to pull off first round upsets, but they would both have bad odds to advance to the Sweet Sixteen after that. Tennessee would have a slightly better chance (64%) at beating Illinois than Oklahoma State would, but not by much. All in all, don’t expect any massive surprises. The model likes the chances of a one seed versus two seed matchup in the Elite Eight.
Most Probable First-Round Upsets: No. 13 Liberty over No. 4 Oklahoma State (36.52 percent chance), No. 11 Syracuse over No. 6 San Diego State (39.35 percent chance), No. 10 Rutgers over No. 7 Clemson (51.94 percent chance)
Final Four
Michigan isn’t expected to have a great shot at defeating Gonzaga in the Final Four, and while the Baylor-Houston matchup is a tighter call, Houston still has a decisive 63% win probability. In the projected National Championship, Gonzaga has their toughest matchup of the tournament against the two seed Cougars, but they’re predicted to come out on top nonetheless. And with that, the Gonzaga Bulldogs are the projected 2021 champions of the college basketball world.
Afterword
It’s funny that this model is much improved since the 2019 version, yet it will almost certainly perform worse. Predicting the tournament is a very delicate thing: New Mexico State was two missed free throws away from making the Auburn Final Four prediction look very stupid. There are a few features of the model that were probably quite obvious to you already. For instance, it’s shy. The one seeds certainly have a greater than 90% chance of winning in the first round, but the model refuses to be so optimistic. There are also plenty of variables that are unaccounted for, such as injuries, the whole pandemic thing impacting the tournament, and other tournament results. [2]
In any case, I’m hopeful that I’ll be able to continue improving the model throughout the years just like I did this time around. Maybe it won’t finish in the 99th percentile of ESPN brackets, though.
If you’d like to see the model’s projections for matchups not listed above, you can search for them in the table below, which consists of 2,278 rows, one for every possible matchup. Just type the names of the two teams separated by a space. For instance, if you want to search for the Gonzaga-Baylor matchup, type “Gonzaga Baylor” into the search bar.
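As a quick sanity check on that row count, here is a minimal Python sketch of how a table with one row per possible matchup could be built and searched. The placeholder team names and the `search` helper are assumptions made for illustration, not the code behind the actual table.

```python
from itertools import combinations

# Assumption: a 68-team field. Four real names from this post are used for
# the demo; the rest are hypothetical placeholders.
teams = ["Gonzaga", "Baylor", "Michigan", "Houston"]
teams += [f"Team{i:02d}" for i in range(5, 69)]

# One row per unordered pair of teams: 68 * 67 / 2 rows in total.
matchups = list(combinations(teams, 2))
print(len(matchups))  # 2278

def search(rows, query):
    """Return the rows that contain every space-separated term in the query."""
    terms = query.lower().split()
    return [
        row for row in rows
        if all(term in " ".join(row).lower() for term in terms)
    ]

print(search(matchups, "Gonzaga Baylor"))  # [('Gonzaga', 'Baylor')]
```

The count works out because 68 teams form 68 × 67 / 2 = 2,278 distinct pairs.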
[1] The model projects a 17.7% chance that both Utah State and Colgate pull off their first round upsets. Both teams would have around 50% odds of winning their matchup against each other in the second round, but more importantly, we’d have a guaranteed 11 or 14 seed in the Sweet Sixteen!
[2] The model projects that No. 1 Gonzaga has an 88.93% (again, shy) chance of beating No. 16 Drexel in a matchup. The only way the two teams could actually play each other is if both teams advanced to the National Championship. However, Drexel’s 11% win probability is calculated based on data from games that have already occurred. It doesn’t take into account the fact that in order for Drexel to play Gonzaga, they would have had to win five games against top teams like No. 1 Illinois.
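The 17.7% figure in the first note can be reproduced from the two first-round upset probabilities listed in the South Region section. The short Python sketch below assumes the two games are independent, so their probabilities simply multiply. That assumption is mine, made for illustration, and the variable names are hypothetical.

```python
# First-round upset probabilities quoted in the South Region section.
p_utah_state = 0.4727  # No. 11 Utah State over Texas Tech
p_colgate = 0.3750     # No. 14 Colgate over Arkansas

# Assumption: the two games are independent, so the joint probability
# is the product of the individual probabilities.
p_both = p_utah_state * p_colgate
print(f"{p_both:.1%}")  # 17.7%
```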
So what are some of the biggest inputs in this model? I typically use FiveThirtyEight to make my picks, so I’m curious as to how this one might be different.
FiveThirtyEight uses composite computer rating systems like Pomeroy and Sagarin as inputs for its model, while this model uses the teams’ statistics directly as inputs. Specifically, it uses the statistics that have been found to be most important over the past ten years, such as the difference in average height between the teams, the difference in win percentage against quality opponents, assist percentage, etc. Unlike this model, though, the FiveThirtyEight model does adjust for injuries and gives bonuses for tournament wins.
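To make the idea of difference-based inputs concrete, here is a rough Python sketch of turning two teams’ statistics into one row of matchup features. The field names, the numbers, and the choice to express every statistic as a simple difference are all assumptions for illustration. They are not the model’s actual feature set or data.

```python
from dataclasses import dataclass

@dataclass
class TeamStats:
    avg_height_in: float    # average player height in inches (made-up values below)
    quality_win_pct: float  # win percentage against quality opponents
    assist_pct: float       # share of made baskets that were assisted

def matchup_features(a: TeamStats, b: TeamStats) -> dict:
    """Express a matchup as team A's statistics minus team B's."""
    return {
        "height_diff": a.avg_height_in - b.avg_height_in,
        "quality_win_pct_diff": a.quality_win_pct - b.quality_win_pct,
        "assist_pct_diff": a.assist_pct - b.assist_pct,
    }

# Hypothetical teams with made-up numbers, purely to show the shape of the data.
team_a = TeamStats(avg_height_in=77.5, quality_win_pct=0.60, assist_pct=0.55)
team_b = TeamStats(avg_height_in=76.0, quality_win_pct=0.45, assist_pct=0.50)

print(matchup_features(team_a, team_b)["height_diff"])  # 1.5
```

One convenient property of difference features is that swapping the two teams just flips the sign of every input, so a single row describes the matchup from either side.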