Image credit: USA Today Sports

At his website, Bill James recently published a column entitled “Judge and Altuve,” as well as a follow-up column. Therein, James argues that Wins Above Replacement (WAR) is wrong to evaluate Aaron Judge’s run contributions as equivalent in “win” value to those of Jose Altuve, because the Astros won more games than the Yankees.

The backdrop for the criticism is this: wins obviously arise from runs, specifically the difference between the number of runs scored and those that a team allowed. The question is how many runs should be considered equivalent to a win, and whether that value should be static or dynamic.

James’ argument, as I understand it, is that there needs to be a 100 percent equivalency between the games a team actually wins and the runs they actually score or prevent. Thus, his “run-to-win” value would be dynamic and vary by team. WAR(P), by contrast, uses the overall league-average relationship between runs and wins to assign win value. James’ ire was focused on Baseball Reference’s WAR measurements in particular (Altuve 8.3; Judge 8.1), but the criticism generalizes to any system with a similar philosophy, and he does not limit it to MVP evaluation. Rather, it is clear that James sees the MVP situation as a symptom of a larger defect in how WAR operates.

While his argument is interesting, I don’t think it proves as much as James seems to believe. It is fine to point out that WAR—by using the MLB average relationship of runs to wins—can produce some curious results on the margins; but that does not mean those incongruities are automatically meaningful or indicative of a true problem. Back-fitting a team’s actual wins to a player’s value, as James suggests, repackages the same problem in a different form: we still have hitters who get on base but don’t get driven in, or pitchers who keep the ball on the ground but have poor fielders behind them, and we then have to decide how to fairly adjust for those situations.

In fact, James built his career upon observing and skewering such incongruities, so it seems rather strange for him to criticize a more statistically reasonable approach—using the grand mean value of a run to the entire league—as opposed to the noisier estimate of what a run ended up meaning to a particular team (and even then, still only the average value to that particular team).1 The fact that a player’s value is not fully realized does not mean that player has no unrealized value. Put another way, even if Reds hitters struck out to complete every inning in which Joey Votto drew a walk, it seems odd to claim that those walks were that much less worth doing.

Ultimately, player value depends heavily on your assumptions, and particularly on how you decide to measure and compile a player’s supposed contribution. Let’s take James’ apparent position and label it as Position A: the value of a particular player is identified by that player’s production of events which are valued on the league-average values for those events in runs, adjusted for park/environment, and then further adjusted by the average value of runs to wins for the player’s particular team during a given season. Let’s also add a second, very important assumption that James implicitly makes, but does not discuss: that the sole events worthy of consideration are the outcomes that actually occurred.

By contrast, let’s label the traditional WAR approach as Position B: the expected value of a player is identified by that player’s production of events which are valued on the league-average values for those events in runs, adjusted for park/environment. Here, value depends not on your team, but on the overall average relationship between runs and wins in baseball during a given season. Once again, this valuation is judged solely by the outcomes that actually happened, and by assuming, as James does, that the players credited with the play’s results are 100 percent responsible for them.

In my opinion, both Positions A and B, although arguably reasonable, are inferior to what I will call Position C: that the value of a particular player is identified by that player’s production of events which are valued on the league-average values for those events in runs, adjusted for park/environment and other contextual factors, but—critically, because there are multiple individuals involved in every baseball play—must be further adjusted by the run value that each player most likely contributed to an outcome. Position C is what increasingly motivates our new statistics here at Baseball Prospectus, including Deserved Run Average (DRA), Swipe Rate Above Average (SRAA, for stolen bases), and Catcher Framing. I also believe that Position C best reflects the approach generally taken by state-of-the-art front offices in baseball.2

Let’s summarize these positions and their components as follows:

Position   Win Value                Contributions Considered
A          Average team run value   Actual results only
B          Average MLB run value    Actual results only
C          Average MLB run value    Most likely contribution to actual results

The elephant in the room here, as usual, is variance. James’ articles seem to brush off randomness as mere “luck,” but variance cannot be dismissed so cavalierly. Baseball analysts hear and talk a great deal about linear weights, the average run values of the various batting events. On average, over the course of a half-inning, an out is worth about -0.3 runs, a single about +0.7 runs, and a home run about +1.4 runs.

What we ought to hear much more about is the variance among those events. This variation can be estimated: assign the linear weight to every event in baseball for a season (~185,000 in 2017), fit a mixture model to accommodate the multiple modes, and take the weighted mean of the components’ standard deviations. It turns out to be about 0.3 runs. That figure is interesting for several reasons, not least because the standard deviation is essentially the same size as the value of an out itself.
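As a rough sketch of that estimation procedure, one might treat each event type as a mixture component and average the component standard deviations by frequency. The run values and frequencies below are purely hypothetical toy numbers, not the actual ~185,000-event 2017 data:

```python
import math

# Hypothetical run values of each event across a few base-out contexts
# (illustrative numbers only -- NOT the actual 2017 linear-weight data).
event_values = {
    # event: (frequency weight, [run value in context 1, context 2, ...])
    "out":      (0.67, [-0.10, -0.25, -0.35, -0.55]),
    "single":   (0.15, [0.45, 0.60, 0.80, 1.00]),
    "home_run": (0.03, [1.00, 1.40, 1.70, 2.10]),
}

def weighted_mean_component_sd(components):
    """Treat each event type as one mixture component: compute each
    component's standard deviation, then average those standard deviations
    weighted by how often the event occurs."""
    total_w = sum(w for w, _ in components.values())
    acc = 0.0
    for w, values in components.values():
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        acc += w * math.sqrt(var)
    return acc / total_w

print(weighted_mean_component_sd(event_values))
```

With real data the components would be fit rather than enumerated by hand, but the weighted-average-of-spreads logic is the same.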

In other words, any play, regardless of how capable the players are, can end up being an out, even if, on average, it should be something quite different. Thus, an out can actually win a game if the out scores a runner; likewise, a double combined with someone else’s baserunning blunder can guarantee a loss. This is why we watch; it is why we smile stupidly and scream “baseball!” after a particularly improbable sequence; it is why no game is “over” until the final out has actually been made. It is, at bottom, the same variance that gives us the predicament in question.

So, how do we account for this variance? According to Position A, the only thing that matters about Joey Votto’s walks is how the other Reds hitters capitalized on them. Any walks that did not translate into runs scored were, statistically speaking, a waste of everybody’s time. Votto might as well have struck out and spared us the terrific battles between him and so many pitchers. As James points out, this approach has the advantage of ensuring that everyone’s “contributions” retroactively sum to zero; however, it also has the disadvantage of seeming ridiculous—at least to me and others interested in evaluating Votto’s contributions in various contexts.

Virtually all run-scoring events require timely assistance from other teammates. Why should the inherent value of a player depend almost entirely on the contributions of other players, with the sheer randomness of those contributions often amounting to an undeserved out? If Votto’s on-base skills were plopped onto the Astros, under James’ system, his “value” would skyrocket, as the remaining Astros sprayed hits all over the place, uniquely rewarding his on-base skills. While this might true up the ultimate “results” of any team, a player whose value depends heavily on his teammates is not being given his inherent “value” at any time. If this is truly what you prefer, that of course is fine, and it is fine for James to prefer it for his own purposes. But most people, I suspect, would find it highly problematic.

Let’s check in with Position B, the traditional “WAR” position. James disagrees with the usage of average MLB run-to-win values. But WAR does this because it sees the best measure of a player’s value as an inherent, neutralized number: a number which does not penalize the player for the team to which they were assigned or the stadiums in which they were ordered to play. As James notes, this of course results in unexplained variance, the sort that causes the Yankees to win only 91 games instead of the 100 that their statistics suggested they should have won.

But so what? Sticking with Baseball Reference, we can correlate team WAR (batting and pitching combined) with winning percentage, and the correlation is .93,3 which means that bWAR accounts for about 87 percent4 of the variance in team winning percentage. That’s pretty darn good, and not atypical for the various WAR systems. Does that leave 13 percent of what happens on a baseball field unaccounted for? It sure does. But again, so what? WAR doesn’t pretend this variance does not exist; it merely refuses to punish individual players for the inherent volatility we enjoy seeing in the game. And while there are those who enjoy complaining about WAR for this reason, my sense is that many of these people would complain about WAR regardless.5
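The correlation-to-R² step is a few lines of arithmetic. The team totals below are made up for illustration, not Baseball Reference’s actual figures:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (team WAR, winning percentage) pairs -- illustrative only.
teams = [(55.0, 0.617), (43.0, 0.562), (30.0, 0.494),
         (20.0, 0.420), (12.0, 0.377)]
r = pearson_r([w for w, _ in teams], [p for _, p in teams])
print(r, r ** 2)  # r squared = share of win-pct variance tracked by WAR
```

Squaring the correlation is what turns “.93” into the “about 87 percent of the variance” figure in the text.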

That leaves Position C. Whereas Position A declares that variance is always somebody’s fault, and Position B assumes that variance is not worth accounting for, Position C embraces the variance and tries to work within its constraints. This requires attacking the assumption that each play’s outcome is 100 percent caused by the players officially credited for that outcome. Position B—the WAR approach—still relies on the outcomes of each batting event as a true reflection of the credited player’s entire contribution to each play; to get around this unrealistic assumption for other purposes, its adherents typically use “regression to the mean” to try to get a sense of the player’s true “ability” and likely future contributions. This doesn’t affect the credited WAR, but is one way to ensure that the present does not unduly cloud the future.

Position C rejects this duality: instead, it focuses on reasonably apportioning each player’s responsibility for each play at the time it is measured and compiled into win values. When you address the credit issue up front, there is no need to worry about it later, and no need to “regress” any player’s statistics to get to some better place eventually. Instead, you focus on getting it right the first time: use shrinkage and prior information to give credit only when it is most probable, and allow simple variance to take credit for the rest. This is how Deserved Run Average works, and how Baseball Prospectus’ pitcher WARP operates as well.
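A minimal sketch of the shrinkage idea: pull an observed rate toward the league mean, with the pull weakening as the sample grows. The `shrink_rate` helper and all the numbers are hypothetical, and this is far simpler than BP’s actual DRA machinery:

```python
def shrink_rate(successes, opportunities, league_rate, prior_strength):
    """Shrink an observed rate toward the league mean. Equivalent to a
    beta prior worth `prior_strength` pseudo-opportunities at the league
    rate: small samples get pulled hard toward league average, while
    large samples keep most of their observed signal."""
    return (successes + prior_strength * league_rate) / (opportunities + prior_strength)

# A hypothetical catcher with 30 extra strikes in 60 chances, against a
# league rate of 0.45 and a prior worth 200 chances:
print(shrink_rate(30, 60, 0.45, 200))      # small sample: stays near 0.45
print(shrink_rate(3000, 6000, 0.45, 200))  # large sample: stays near 0.50
```

Credit assigned this way is already “regressed,” so no second pass is needed to separate skill from noise after the fact.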

This approach has real added value. As we’ve shown, DRA manages both to substantially describe what has actually occurred and to anticipate player results more consistently from season to season. Since Position C already discounts the way the play has been officially credited, it makes sense to stick with the WAR approach of evaluating wins by the average run-to-win value, rather than any particular team’s value. This makes Position C, in the end, almost the polar opposite of James’ Position A when it comes to player valuation. But it is a position we find much more sensible and reflective of how much a player has most likely contributed, both to his team and to baseball in general.

We also believe that Position C is the future. The field of statistics increasingly seems to be coalescing around the understanding that statistics is about appreciating uncertainty, not precision. By embracing uncertainty, you recognize that the correct approximation of a player’s win value is neither “8.3” nor “8.1” per se, but rather “8.3 plus or minus 1.5 wins” versus “8.1 plus or minus 1.2 wins.”6 The comparison between the two players then is not between two-tenths of a win, but rather the extent to which the uncertainty intervals around those two players actually overlap.

The extent to which they don’t overlap tells you the percentage likelihood that WAR(P) is missing something, and the analyst can then sensibly consider what additional factors—clutchness, tough ballparks, terrific managing, or what have you—can fill in the gap. In doing so, analysts (and columnists alike) can consider additional information with an appreciation of how much, or how little, those additional factors most likely can be said to actually matter.
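Under the simplifying assumption that the two illustrative estimates above are independent and roughly normal (the ranges, per the footnote, are for illustration only), the overlap question reduces to a one-line calculation:

```python
import math

def prob_a_exceeds_b(mean_a, sd_a, mean_b, sd_b):
    """If two WAR estimates are roughly normal and independent, the chance
    that A's true value exceeds B's is the normal CDF of the gap between
    the means divided by the combined uncertainty."""
    z = (mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The article's illustrative numbers: 8.3 +/- 1.5 vs. 8.1 +/- 1.2.
print(round(prob_a_exceeds_b(8.3, 1.5, 8.1, 1.2), 2))  # about 0.54
```

A result so close to a coin flip is exactly the point: the two-tenths gap between the point estimates tells you far less than the width of the intervals around them.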

We readily concede that we have a ways to go in our effort to make Position C a reality. The point of this discussion isn’t to brag about what we’ve done so much as to recognize how much we and many others have left to do. We have started down the path of Position C, but much of it remains unfinished. While we consider variance in generating many of our WARP estimates, that isn’t true for all of them (most notably, for offensive statistics). And even the WARP estimates that embrace variance haven’t yet given you intervals around our point predictions.

Hopefully we can start doing that soon. In time, we believe that effort will be recognized by most to have been well spent, and that Position C will come to be seen as the best way to evaluate player contributions in sports. Indeed, its goals are so distinct that it may ultimately be helpful to find some entirely different term to describe what Position C aims to compile. For now, our addition of the “P” to the end of “WAR” will have to do.

In the meantime, I don’t really care whether you decide to take Position A, B, or C (and this includes Bill James!), as long as you disclose which method you chose. Regardless of your preference, it certainly isn’t worth getting upset about.


Footnotes:

1. It is particularly strange to see this argument coming from the originator of the so-called Pythagorean Theorem of Baseball, which advocates looking at a team’s deserved wins rather than their actual wins, the former being determined by run differential.

2. Admittedly, this may be because front offices usually care little about past performance, and instead focus on ability level, with an eye toward the future. This caveat is important, but I suspect most advanced analysts would favor Position C even if asked solely to grade past performance.

3. Pearson correlation.

4. The square of the Pearson correlation, aka R-squared, assuming a generally linear relationship.

5. Thoughtful WAR criticisms are always welcome, but many of WAR’s critics seem to be frustrated by WAR’s tendency to discourage contrary and more convenient narratives of player contribution.

6. The ranges are for illustration only.

Meir Meir
11/21
What's your basis for saying James is criticizing WAR generally and not just for MVP (or more generally season-value) purposes? He expressly makes a distinction between "[w]hat a player may reasonably be expected to do in the future" (for which he thinks WAR is well-suited) and the player's "value in a season which is in the past" (for which he thinks WAR is problematic).
Adrock
11/21
Excellent column, compellingly argued. Although this is the first time I've ever seen "true" used as a verb, and I'm not sure I liked it.
Jim Maher
11/28
When surveying or trying to line up something exactly, the appropriate verb is "true"
Richard Mueller
11/21
Enjoyed the article and now better understand WAR. Thanks
nberlove
11/21
I think you are missing a lot of what James was trying to say. His primary argument is that WAR does not correspond to wins as the name implies or as many people suggest. And this argument is pretty straightforward and can easily be seen by looking at BP's own stats. Using WARP & PWARP, Yankees players produced about 55 wins above replacement. Assuming a replacement team wins 48 games, the Yankees 'should have' won 103 games. However, they only won 91 or 12 fewer than expected. Contrast that to the Astros. Their expected wins based on WARP was 101, just what they won. Therefore, how can you really say Judge produced the 7.4 wins WARP suggests (or Gardner produced 3.9 or Frazier 1.1) when there is such a big disconnect between the team's WARP expected wins and its actual wins. The difference is so big that if you throw out Judge's 7.4 'wins' the Yankees still under-perform their expected number, almost suggesting his contributions were worthless (which, of course, they were not). The question then is, if the Yankees only really won 43 games above replacement (instead of 55) how do you really value Judge's 7.4. Maybe that 7.4 was really only worth 5.8 actual wins (7.4 x 43/55). Furthermore, James believes the key to working this out lies in evaluating and considering context, which goes against the religion of WAR/WARP. Saying the difference is all a matter of luck (or for the Yankees this year, the lack thereof) just does not cut it for James, nor should it considering how much data is now easily available.
zulu
11/21
I don't think you understood James' argument. His point is not that WAR fails to capture value or assign it properly -- his point is that if you add up all of the WAR for the Yankees, you'll arrive at a total that reflects a 100-win total. In fact, they won 91 games, and we have to account for this /somehow/ when talking about player value. His point is not that this variance is "luck," but rather that this variance, even if it is luck, is real. The MVP award is based off of real results -- the player who most contributed to his team winning real games -- and giving all of the Yankees credit for winning 100 games when they really won 91 ignores that reality. His point, then, is that the MVP award should be awarded based off of what really happened, rather than what would happen if you simulated the season a million times or off of who would provide the most future value or whatever. And so simply looking at WAR, which is based around the average relationship between runs and wins, divorces the MVP award from that reality.
bhalpern
12/07
This changes the reason to disagree with Mr. James but not the disagreement. Isn't the logic you cite pretty much the same reason we complain about 'traditional' MVP voters making decisions based on flawed outcome based stats like RBI? Shouldn't we prefer an MVP who did the most to give his team the opportunity to win more games regardless of the ability of the rest of the team to capitalize on it?
Shaun P.
11/21
Bill James invented Win Shares on the basis of divvying up a team's actual wins among its players (I realize I am oversimplifying a bit). I get the appeal - direct correlation between the W-L columns and what happens on the field - but it never made sense to me. I did not, and do not think, we can measure everything that happens on a baseball field, and accurately assign a value to it. I think Position C comes closest and it really reminds me of the work Colin Wyers was doing here at BP before the Astros got him. Of course, due to a myriad of problems with the stat, no one relies on Win Shares for analysis. It sounds like Bill James wants to get back to having the nice correlation that Win Shares provided, without recognizing the inherent problem (again). Great analysis, Jonathan!
Mac Guyver
12/07
To build on your Joey Votto example. If there are two outs and the Reds are trailing by 3+ runs, then yes, those walks are worth less than a meaningful walk. It is my understanding that James argues against WAR in its usage. WAR is an approximation of player value and his contribution to his team. Too many people don't understand that, which stems from a lack of understanding of the bigger picture. It all makes sense to me - those that don't understand the big picture want to simplify the analysis and WAR does exactly that... in the wrong context, which is what James is pointing at. It is ironic that those with a partial understanding want to argue against someone who has spent their life building that deeper understanding.