It’s probably a sign of how long we’ve been without baseball that this week’s column was inspired by football. A couple of weeks ago, the Annual Ultimate Flying Handegg Game of Doom was played, and since I live in Atlanta I was vaguely aware of one of the teams playing in the game. According to conversations that I overheard at the water cooler, the Falcons were up 28-3 at one point, but didn’t win. According to various win probability models of football, at one point the Falcons were considered to have a 99 percent chance to win the game. Of course, that didn’t actually happen. So, can we believe win probability models anymore?

There has also been another well-documented recent case of a win probability model put forth by a prominent alumnus of this site that said the blue team was likely to win, but it turned out that the red team actually won that one.

In football, like in baseball, there’s really only so far that you can go in building a win probability model. There is data from previous games to look at, and I suppose we can look back on some combination of the score, possession of the ball, field position, and the time remaining in the game and figure out how often a team in that situation won. But in this particular case, it was the Super Bowl. The other team, even though down by a bunch of points, is in this game by definition because they are really good at playing football. The teams that were likely to fall behind 28-3 in a random regular season game are probably the bad ones. Are they a good comparison group for the conference champions? Or is this just a case where there might have been a 99 percent chance of winning, but … hey, there was always a chance.

I don’t know enough about football to answer that question, but it got me thinking about baseball win probability models. We have our own models, and in general they are based on the idea of looking back some number of years and figuring out how many times this runners/outs/inning/score combination has happened before, and who won what percentage of the time. But is that enough?

To get enough of a sample size, it’s common practice to include several decades’ worth of data. Baseball is nice in that we get 162 games per year per team and we have excellent, near-complete play-by-play data stretching back into the Truman administration. The problem is that baseball has undergone a few changes since the Truman administration. Even in the past 25 years, the game has gone from teams scoring 4.12 runs a game (in 1992) through the go-go 1990s and peaking in 2000 at 5.14 runs per game, and then descending to a low in 2014 of 4.07.

A one-run lead means something different if no one ever scores any runs than when teams are putting up 11-10 scores on the daily. Does it make sense to normalize our expectations of who’s going to win based on data from games 15-20 years ago when things were very different?

**Warning! Gory Mathematical Details Ahead!**

The first thing that I did was create a data set with all games from 1993-2016, and based on the standard model of win probability, found the visitor and home win probabilities for all game states, based on the inning, runners, outs, and score differential (I put all score differentials of seven and above into one common bucket). I looked at all situations from the perspective of the team currently batting. This is generally what is on your screen when they are talking about win probability.
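For readers who want to see the "classic" lookup in concrete terms, here is a minimal sketch of how such a table could be built. The tuple layout, the function names, and the symmetric cap at seven runs in either direction are my assumptions for illustration, not the actual code behind the numbers in this column.

```python
from collections import defaultdict


def bucket_score_diff(diff, cap=7):
    """Collapse leads or deficits of `cap` runs or more into one bucket.

    Assumption: the cap is applied symmetrically to leads and deficits.
    """
    return max(-cap, min(cap, diff))


def build_wp_table(plays):
    """Return {game state: share of times the batting team went on to win}.

    Each play is a tuple in a hypothetical layout:
    (inning, is_bottom, runners, outs, score_diff, batting_team_won)
    where score_diff is batting-team runs minus fielding-team runs.
    """
    wins = defaultdict(int)
    seen = defaultdict(int)
    for inning, is_bottom, runners, outs, diff, won in plays:
        state = (inning, is_bottom, runners, outs, bucket_score_diff(diff))
        seen[state] += 1
        if won:
            wins[state] += 1
    return {state: wins[state] / seen[state] for state in seen}
```

Looking up a state such as `(9, True, "100", 2, -1)` in the resulting dictionary then gives the historical share of batting teams that won from the bottom of the ninth, runner on first, two outs, down by one.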

Once I had those win probability numbers, I turned them into log-odds (for ease of interpretation). I then asked those win probabilities to predict whether the team currently at bat would win the game. Some of you out there are thinking “isn’t that basically just asking a data set to predict itself?” Yes, it is. And it did a fantastic job. But I gave it a little bit of help. If the historical record is all we need, then no other variables should be able to tell us anything about who’s going to win.
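The conversion between a win probability and its log-odds is short enough to show. This is a sketch with function names of my own choosing; the small `eps` clamp is an assumption added so that game states with a historical win rate of exactly 0 or 1 don't blow up the logarithm.

```python
import math


def to_log_odds(p, eps=1e-6):
    """Convert a probability to log-odds, clamping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def from_log_odds(z):
    """Map log-odds back to a probability (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A coin-flip game state (0.5) sits at log-odds of zero, and the two functions undo each other everywhere inside the clamped range.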

I added in five pieces of information. I added in the batting team’s average runs scored per game for that year, as well as the pitching team’s runs allowed per game. If we know that the batting team likes to score a lot of runs, or if the pitching team enjoys giving them up, then that’s going to be important information. Nothing moves the old win probability meter faster than scoring a couple of runs. Of course, in most situations the teams will switch in 10 minutes and they might do something that will swing the win probability back the other way. So, I added the runs scored/runs allowed for each team for when they got to that side of the ball. Finally, I added in the runs scored per game by all teams across all of baseball for that year to give some idea of the run environment in which they were playing.

Not surprisingly, four of the additional pieces of information were significant predictors in the final regression model. Overall run environment didn’t make it, probably because I was already controlling for the two teams playing and how many runs both of them liked to score and give up. Plus, the entire league isn’t playing this game. The two teams on the field are.

Given all of this information, I asked the program to adjust the estimated win probability as necessary to account for these differing run scoring/allowing tendencies. I looked at how different these revised win probabilities were.
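To make the shape of that adjustment concrete, here is a sketch of what applying such a model could look like. The coefficient values below are placeholders I made up so the example runs; they are not the fitted values from this analysis, and the variable names are hypothetical. The sketch assumes the model works on the log-odds scale, with the classic win probability and the four team-level run rates as inputs.

```python
import math

# HYPOTHETICAL coefficients for illustration only (not fitted values).
COEFS = {
    "intercept": 0.0,
    "classic_log_odds": 1.0,
    "bat_team_rs": 0.15,   # batting team's runs scored per game
    "pit_team_ra": 0.15,   # pitching team's runs allowed per game
    "pit_team_rs": -0.15,  # pitching team's runs scored per game
    "bat_team_ra": -0.15,  # batting team's runs allowed per game
}


def adjusted_wp(classic_wp, bat_team_rs, pit_team_ra,
                pit_team_rs, bat_team_ra, coefs=COEFS):
    """Shift the classic win probability on the log-odds scale."""
    p = min(max(classic_wp, 1e-6), 1.0 - 1e-6)
    z = (coefs["intercept"]
         + coefs["classic_log_odds"] * math.log(p / (1.0 - p))
         + coefs["bat_team_rs"] * bat_team_rs
         + coefs["pit_team_ra"] * pit_team_ra
         + coefs["pit_team_rs"] * pit_team_rs
         + coefs["bat_team_ra"] * bat_team_ra)
    return 1.0 / (1.0 + math.exp(-z))
```

With these placeholder values, two identical teams leave the classic number untouched, while a batting team that scores more and allows fewer runs than its opponent gets nudged upward.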

A note before we go any further. I’m not particularly interested in nailing down exact win probability values for specific situations here. It’s much more interesting to see the order of magnitude of the changes that this additional information produces. If knowing the quality of the teams on the field moves the needle only slightly, then we know that the win probability model as we know it now is pretty good. If numbers are sliding all over the place, we’ll know that win probability must be tempered, if not replaced, with something that specifically accounts for the teams currently playing.

The median amount of adjustment (I took the absolute value) was just shy of three percentage points across all events. That means half the time, the “classic” win probability model and the “adjusted” win probability model agreed to within three percentage points. But that also means that half the time, the adjustment was more than three percentage points. In fact, there were a few cases (these were extreme, but they did occur) where the adjustment was 26 percentage points. The 90th percentile for the level of adjustment was just shy of 10 percentage points (9.47 to be exact). That means that in 10 percent of the cases, the model was moving the needle by roughly 10 percentage points or more. A reminder that the scale only goes up to 100.
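Those summary numbers come from a simple calculation, sketched below under the assumption that we have two parallel lists of probabilities (classic and adjusted) for the same events. The function name and the choice of the "inclusive" percentile method are mine.

```python
from statistics import median, quantiles


def summarize_adjustments(classic, adjusted):
    """Summarize absolute adjustments, in percentage points."""
    diffs = [abs(a - c) * 100 for c, a in zip(classic, adjusted)]
    # quantiles(..., n=10) returns the nine decile cut points;
    # the last one is the 90th percentile.
    deciles = quantiles(diffs, n=10, method="inclusive")
    return {"median": median(diffs), "p90": deciles[8], "max": max(diffs)}
```

Taking the absolute value first matters: without it, upward and downward adjustments would cancel and the median would sit near zero no matter how far the two models disagreed.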

Again, this won’t come as a shock, but a little after-the-fact poking through the data set found that the cases where the model suggested that the “classic” win probability model was getting it wrong were mostly when really bad teams were playing really good teams (as measured by average runs scored/given up). I did some extra checks and found that those differences in team quality tracked the amount of “adjustment” needed in the model almost perfectly.

In practical terms, it just confirms what makes sense intuitively, that the relative quality of the two teams playing matters. But the amount that it matters is what brings us to this point. As a point of comparison, the median event (of any sort) in a major-league game changes the “classic” win probability model by 3.4 percentage points (again, 1993-2016 data). The median single (i.e., one-base hit) moves the dial roughly the same amount. Same with the median strikeout.

If we had a game being played by two perfectly equal teams, the historical record as an estimate of win probability makes sense. Of course, there are likely to be some differences in strength between the two teams that actually show up on a given rainy Tuesday night in May. For the median amount of inequality that we saw in the years under study, that inequality was worth about three percentage points of win probability in the direction of the team that was better. So, roughly, one event ahead.

**It Ain’t Over ‘Til It’s Over (or You’re Playing the ’27 Yankees)**

Using the past record and only the past record to generate a win probability estimate is … not bad. Most of the time, it’s going to give a number that is “close enough” and that probably doesn’t change any strategic decisions or narratives about how the game is going qualitatively. Is there really a difference between a 24 and a 27 percent chance of winning? But when we’re dealing with two teams that are badly mismatched, the amount of “yeah, but …” becomes more important. In particularly uneven games, the magnitude of that effect can be in the high single digits, or sometimes in the double digits. That starts to be the sort of effect that can impact strategic decisions. Or does it?

Here’s the average amount of “adjustment” based on the relative quality of the teams that the model calls for, by inning.

As the game wears on, the amount of adjustment that is needed above and beyond the historical record begins to fall. Using events from the seventh inning on only, the median adjustment needed is 1.2 percentage points, and even the 90th percentile is “only” seven percentage points of adjustment.
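The by-inning breakdown is the same calculation, grouped by inning. Here is a sketch, assuming each row carries the inning along with the two probabilities; the row layout and function name are hypothetical, and I use the median here to match the late-inning figures quoted above.

```python
from collections import defaultdict
from statistics import median


def median_adjustment_by_inning(rows):
    """rows: iterable of (inning, classic_wp, adjusted_wp) tuples.

    Returns {inning: median absolute adjustment in percentage points},
    ordered by inning.
    """
    by_inning = defaultdict(list)
    for inning, classic, adjusted in rows:
        by_inning[inning].append(abs(adjusted - classic) * 100)
    return {inning: median(vals) for inning, vals in sorted(by_inning.items())}
```

Restricting `rows` to innings seven and later before calling the function would give the late-game version of the summary.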

So, by the time we really start paying attention to win probability in a real game (and by the time the manager is making the majority of his tactical substitutions), the relative talents of the two teams start to matter less and the fact that they have reached a point where the weaker team is somehow up by two runs matters more. Is win probability broken? For practical purposes, not really. I think that if it is “technically” broken, it’s broken in a way that roughly matches our intuition of how it’s broken.