As that old pop song goes, “oops, he did it again.” Sports Illustrated’s Jon Heyman is asking questions about WAR:

If this sounds familiar, Heyman asked a similar question about Starling Marte and Bryce Harper roughly two weeks ago. There was a predictable dust-up on Twitter, then and now, and a few articles were written. And while much noise has been made, very little has been said, and that’s sad for a number of reasons:

1. Heyman is actually right on the merits here,

2. These aren’t edge cases or weird outliers but examples of a fundamental problem with the construction of certain WAR metrics and with several of the underlying components,

3. These are problems that have been identified for some time now, and solutions actually exist for them already,

4. And finally, that the response to Heyman is indicative of a larger problem with the sabermetrics movement.

Now, I am not a huge believer in sabermetric evangelism. I’m wholly grateful that there’s a community of people interested in sabermetric analysis—it lets me make a living doing this, if nothing else. But primarily I like preaching to the choir. And why shouldn’t I? The choir shows up every Sunday, they’ve been reading the New Testament and I don’t have to spend a lot of time explaining why sometimes he’s Paul and sometimes he’s Saul, and we can get down to business and move forward into new territory rather than being constantly stuck at the beginning. Now, there are certainly people out there who are interested in sabermetrics but haven’t had a lot of exposure to it, and you don’t want to leave them out. And there are certainly people who have closed themselves off to sabermetrics and are spreading falsehoods about the field as a result, and you want to respond to that (although sometimes we—and when I say we, I certainly include myself—engage in it a little too often).

And sometimes there isn’t much you can do. I wrote what I did about Hawk Harrelson and The Will To Win because at some point, you have to come to the conclusion that someone isn’t worth talking to anymore. Hawk’s problem wasn’t that he was wrong, it was that he was stuck in a frame of mind that starts from conclusions and will, when it cares to, circle back around to find some evidence to support it. And that, in insisting on perfection from sabermetrics but not from his worldview, Hawk was simply not engaging with the questions in an intellectually honest way.

But I get the sense that people are viewing Heyman in much the same light. And I think this is dangerous, for several reasons. One, I think he’s being honest in his questions. I admit this is the weakest point of my argument (because I can’t really know his intent), but it’s also the least relevant, and it’s also the weakest part of the opposing case (because no one else can really know his intent either). Two, I think his question is useful to consider even if it’s not meant in a sense of honest inquiry, and that we do ourselves a disservice if we ignore useful questions just because of where they came from.

And I think we have to beware of the idea of sabermetric tribalism, that there’s an “us” and a “them” and that we are right and they are wrong. “We” are going to end up being wrong some of the time, and “they” are going to end up being right some of the time (and sometimes all of us will be wrong). Treating the search for truth as a set of rooting interests diminishes it; if you want to cheer for something to win, cheer for a baseball team, not a way of studying baseball. At the same time, we shouldn’t shy away from internal disputes in order to avoid handing ammunition to outsiders. Yes, someone who isn’t looking at sabermetrics as a scientific process can misinterpret and misrepresent internal dissent. But avoiding treating sabermetrics as a scientific process as a result is not a defense but a surrender.

So first I’m going to handle the common responses to Heyman and point out where they are wrong, and then I’m going to loop back around to showing why Heyman was right, in a way that is intuitive to people who regularly discuss sabermetrics. The problems with bad arguments in defense of WAR are multifold, listed from least to most harmful:

1. Bad arguments are unlikely to win many converts, because they’re bad arguments. (In fact, this is probably a good question that more sabermetricians should be asking. If you’re talking to an audience that is educated on the subject of baseball but not in sabermetrics and they don’t seem to take your point, is it because they lack the education (or open-mindedness) to understand what you’re saying, or is it simply that they lack the formal knowledge to explain to you why you’re wrong in terms that you understand?)

2. Bad arguments allow us to overlook good questions that would lead us to improve our current metrics.

3. And most severely, bad arguments indicate that many sabermetric proponents have misidentified the key lesson of sabermetrics, and are passing on that misunderstanding to others, devaluing the term sabermetrics in the process.

One of the common sabermetric notions trotted out to defend WAR from Heyman’s line of inquiry is that of sample size. Those of us with long memories (at least, long memories for this field, which isn’t saying much) may remember the rather pithy formulation of Voros’ Law, named after Voros McCracken of DIPS fame:

“Any major league hitter can hit just about anything in 60 at bats.”

The thing is, can is not the same as has.

When sabermetricians talk about sample size, they’re basing their arguments in what’s known as true score theory or classical test theory. (They are not precisely the same thing, but for the purposes of our discussion the differences do not matter.) What’s important to realize is that true score theory talks about how reliably our measurements over a sample period measure the true talent level of the subject in question. It’s a question about ability, or framed another way it’s a question about the repeatability of the performance. And it’s a discussion framed largely around random variation, or if you want to talk about a loaded word, “luck.” This really doesn’t translate well to this discussion, though—the question isn’t whether or not that’s the talent level of the two players being compared (Russell Carleton addressed that question today), but whether or not the performances are really equivalent.

A related argument is to make a similar comparison using a traditional stat, like saying that Yuniesky Betancourt and Chase Utley both have 24 RBI, so why isn’t Heyman asking questions about RBI?

Which, fine, if you want to treat WAR and RBI as equivalents, go forth and prosper. If you want to call yourself a sabermetrician, though, that’s a patently silly argument. At the risk of going all “thing shaped like itself” on you, what RBI purports to measure is the number of runs batted in, given the rulebook definition. We can argue about how well that captures a player’s offensive value (not all that well) or how well it measures total value (even less well). But those are questions about how useful that definition of the term is; it’s not being debated whether or not those players actually have those RBI totals. These debates are sometimes had historically, but those are debates over recordkeeping—not to belittle their importance, but that doesn’t undermine the idea that RBI totals are “facts” any more than someone claiming that 2+2 equals five makes rudimentary addition controversial.

WAR, on the other hand, is an estimate, not a mere recording of fact. This is why two people that ostensibly agree both on what is to be measured and the underlying recordkeeping can come to two differing estimates. This is because sabermetrics isn’t content with just counting things. We attempt to build models that relate what’s being counted to how runs and wins work on a team level. I would argue that’s far more useful, but it also means that sabermetrics deals in estimates while “old-school” stats deal far more in the counting of facts.

Sabermetrics shouldn’t be afraid of estimation—it isn’t a dirty word. But it should admit that’s what it’s doing, instead of trying to treat estimates like facts. Which is why defending WAR by comparing it to RBIs is simply a false equivalency in this case. WAR is not a fact, and RBIs are. That isn’t a value claim, and it is not a point in favor of RBIs, but if sabermetricians will not defend the practice of estimation then we shouldn’t be surprised when others shun our estimates.

Another defense of WAR is that it correlates with team wins, and it has a higher correlation with defensive metrics included than with them excluded. Both claims are factually true, in that there is a significant correlation between team wins and team WAR, and that WAR with defense included correlates better with team wins than WAR without it. But while true, they don’t really contribute anything to WAR’s defense.

Let’s play devil’s advocate for a second. Can we construct a metric with an even higher correlation to team wins than team WAR? We certainly can:

That gets a correlation of 0.95 on 2013 data, higher than the correlation Dave Cameron found between WAR (defense or not) and team wins. So why don’t we use pitcher wins and RBI instead of WAR?

We can draw parallels to the ecological fallacy, which cautions us to be careful about applying inferences drawn at the group level to the individual. Similarly, we know that pitcher wins and runs batted in sum up to team wins and team runs scored respectively, but in practice each has deficiencies in how it goes about attributing those wins and runs at a more granular level. Pitcher wins will, for instance, credit a relief pitcher who happens to be the pitcher of record when the go-ahead run scores, even if the reliever pitched poorly (and maybe even surrendered a run to put the team behind), rather than the starting pitcher who went six scoreless innings to keep the team in a position for the go-ahead run to in fact cause the team to go ahead. Similarly, RBI will reward the player who hits a sacrifice fly to drive in a runner on third, but completely ignore the contribution of the batter whose double allowed the runner to advance from first to third.

So instead of pitcher wins and RBIs we turn to WAR, which may not reconcile quite as well at the team level but, we think, does a better job of describing how each individual player’s performance contributes to wins.

In that light, using the correlation with team wins to defend WAR in general, or any particular implementation detail especially, is hypocritical. Once we subscribe to the notion that correlation with team wins tells us something, we’re led down the path to RBI and wins. And once we turn away from that path, we cannot choose to venture back down it only when it suits us.

Nobody disputes that when measures like UZR and DRS are summed at the team level, there is a correlation with team defense as measured by things like DER. And at the team level, when you add in DER (or some proxy for it, like team UZR or DRS) to a DIPS measure like FIP, you will increase your correlation with team wins. But that tells us absolutely nothing about how DRS or UZR are apportioning that defensive credit to individual players, which is the exact thing being criticized.

What all of these arguments have in common (well, among other things) is that they don’t really get at the heart of the matter and address the issue that Heyman is bringing up. Up until now, they share that with this column. At this point I suppose it’s only fair that I stop slinking around the pole and pouting and actually start to take some clothes off. I started off claiming Heyman was right, so let’s get down to brass tacks and explain why.

Let’s start off with a simple spin on linear weights, based on Estimated Runs Produced:

It’s a simple formula to be sure, entirely linear with a set of constants and a bunch of inputs. Each of those constants, of course, is an estimate. And so associated with each of those constants is a certain amount of measurement error. It’s small enough that normally we don’t have to deal with it (especially since we use the same constants for every batter), but it’s there.

If we want to get batting runs above average, we can take the above formula and do:

ERP is figured using our formula above, PA is the player’s PA, lgR is the total number of runs scored in MLB and lgPA is the total number of PA in MLB.
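Going only by the variable definitions above, one plausible reading of the adjustment is ERP minus the runs a league-average hitter would produce in the same number of plate appearances. The sketch below assumes that reading; the function and parameter names are mine, and ERP is taken as an input rather than recomputed.

```python
def batting_runs_above_average(erp, pa, lg_r, lg_pa):
    """Runs above what a league-average hitter would produce in `pa` PA.

    erp   -- the player's Estimated Runs Produced (the only estimated input)
    pa    -- the player's plate appearances (a recorded count)
    lg_r  -- total runs scored in the league (a recorded count)
    lg_pa -- total plate appearances in the league (a recorded count)
    """
    league_runs_per_pa = lg_r / lg_pa
    return erp - pa * league_runs_per_pa


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
example = batting_runs_above_average(erp=50.0, pa=400, lg_r=20_000, lg_pa=200_000)
print(f"Batting runs above average: {example:+.1f}")
```

Everything except `erp` is a count taken straight from the record, so any error in the result comes from the run estimator itself.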

Now, let’s compare to the fundamental formula behind all of our defensive metrics:

Where “league” refers to all other players at the same position.

It resembles our formula for figuring batting runs above average, yes? Except in our batting runs formula, the only estimated quantity is ERP. The rest is all factual data. For our defensive measure, on the other hand, Plays Made and League Plays Made are our only factual data. Everything else has to be estimated.

Or to put it another way, we have a record of every plate appearance, which batter took it, and what the outcome was. All we have to do is estimate how those outcomes relate to runs. For our defensive metrics, we know the outcome, but we not only have to estimate how it related to runs but we have to estimate which player it was a fielding chance for.
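To make the contrast concrete, here is a rough sketch of a plays-above-average calculation with the same shape as the batting formula. It is an assumed form built from the description above, not the actual DRS or UZR computation; the names and all numbers are hypothetical.

```python
def plays_above_average(plays_made, est_chances, lg_plays_made, est_lg_chances):
    """Plays made beyond what an average fielder at the position would make.

    plays_made and lg_plays_made are recorded counts; both chance totals
    are estimates, because no one records whose chance a batted ball was.
    """
    league_rate = lg_plays_made / est_lg_chances
    return plays_made - est_chances * league_rate


def fielding_runs_above_average(plays_made, est_chances, lg_plays_made,
                                est_lg_chances, est_runs_per_play):
    """Convert plays above average to runs with an estimated run value."""
    plays = plays_above_average(plays_made, est_chances,
                                lg_plays_made, est_lg_chances)
    return plays * est_runs_per_play


# Same fielder, same 80 recorded plays, two different guesses at his chances.
low_guess = plays_above_average(80, 100, 7_500, 10_000)
high_guess = plays_above_average(80, 110, 7_500, 10_000)
print(f"Chances estimated at 100: {low_guess:+.1f} plays")
print(f"Chances estimated at 110: {high_guess:+.1f} plays")
```

With the recorded plays held fixed, moving the estimated chances from 100 to 110 flips the fielder from above average to below it. Nothing in the batting calculation can move that way, because plate appearances are counted rather than estimated.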

All of which is to say that our estimates of a player’s contribution on defense are much less certain than our estimates of a player’s contribution on offense. Having said that, let’s look at the players Heyman cited Saturday night, as well as the Baseball-Reference cards for each player saved on that night:

[Baseball-Reference card: Elliot Johnson]

[Baseball-Reference card: Mark Reynolds]

We see that Reynolds has played more, which is reflected in his four additional runs attributed to the replacement level adjustment. He’s 12 batting runs ahead of Johnson, and behind him two runs on position and one run on baserunning. They’re even on double play avoidance. So that’s 13 runs between them on things mostly measured by the OPS cited by Heyman, their playing time, or their fielding position, in Reynolds’ favor. And yet they’re in a dead heat in WAR, entirely because of a 12-run difference in their estimated fielding prowess in Johnson’s favor.
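The tally in that paragraph can be checked with a few lines of arithmetic. The figures are the run differences quoted in the text (Reynolds minus Johnson); the dictionary keys are my own labels.

```python
# Reynolds minus Johnson, in runs, as stated in the paragraph above.
NON_FIELDING_GAPS = {
    "replacement": 4,
    "batting": 12,
    "position": -2,
    "baserunning": -1,
    "double_plays": 0,
}
FIELDING_GAP = -12  # the fielding estimate favors Johnson

non_fielding_total = sum(NON_FIELDING_GAPS.values())
net_gap = non_fielding_total + FIELDING_GAP

print(f"Non-fielding gap: {non_fielding_total:+d} runs")
print(f"Fielding gap:     {FIELDING_GAP:+d} runs")
print(f"Net gap:          {net_gap:+d} runs")
```

The leftover run is tiny next to either component, which is presumably why the two players read as tied once runs are converted to wins and rounded.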

And the question Heyman is asking is, how confident are we that the 12-run gap on defense reflects what actually happened (not the talent level of each), compared to how confident we are about the 13-run difference composed primarily of offense?

You know, I know, everyone knows the answer to that. We are more confident that the gap between their offensive production (measured by OPS, linear weights, or anything in between) is meaningful—that is to say, reflective of what actually happened—than the difference between them in DRS. If you don’t believe me, you can ask John Dewan, who runs Baseball Info Solutions, the organization that collects the data and runs the calculations behind DRS. At SABR Analytics, he was asked about the reliability of those measures, and said:

"I feel like we're getting about 60 or 70 percent of the picture with current defensive metrics versus 80 or 90 percent on offense," said Dewan. "If I knew how to find the other 40 percent, I'd be doing it!"

So, given what we know about the reliability of offensive metrics versus the reliability of defensive metrics in terms of how well they convey what actually happened in-sample, the logical conclusion is that Heyman is right. It is likely that Reynolds was better than Johnson, or that Harper was better than Marte, over the period Heyman was examining. And WAR would be better if it would show that, instead of treating them as equal. We treat one run in DRS as equivalent to one batting run above average, but there’s absolutely no reason that we should (or that we have to). Heyman has brought up an area where WAR is “wrong” and it would be useful to make improvements.

It turns out that there is in fact a very large body of work about how to add two quantities of varying reliability. It involves estimating the amount of confidence (or uncertainty) you have in each estimate. It has use in things like polling, for example. And it can certainly be applied to measures of fielding; we’ve used it to regress our FRAA, for example, to account for random error in our estimate of fielding opportunities. It’s nontrivial to apply this kind of thinking to metrics like DRS and UZR that aren’t built from the ground up to be used this way; there are a lot of sources of potential systemic bias in those metrics that make it much more difficult to regress them based on the amount of statistical power you have. But there is absolutely a way to combine quantities of differing reliability together such that you’re putting more weight on the more reliable estimates. And if other WAR implementations were to do so, that would fix the problem Heyman is bringing up.
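As a rough illustration of the idea, the sketch below shrinks each component toward zero (league average) in proportion to an assumed reliability before adding them. This is plain reliability-weighted shrinkage, not the method used for FRAA or by any WAR implementation. The reliability values are hypothetical placeholders loosely inspired by Dewan's "60 or 70" and "80 or 90" figures, which are not reliability coefficients. Treating the whole 13-run non-fielding gap as a single estimate is also a simplification.

```python
def shrink(estimate, reliability):
    """Regress a runs-above-average estimate toward zero (league average)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return estimate * reliability


def weighted_total(components):
    """Sum (estimate, reliability) pairs after shrinking each one."""
    return sum(shrink(estimate, reliability) for estimate, reliability in components)


# Reynolds minus Johnson, in runs, from the comparison earlier in the piece.
NON_FIELDING_GAP = 13.0
FIELDING_GAP = -12.0

# Hypothetical reliabilities, for illustration only.
NON_FIELDING_RELIABILITY = 0.85
FIELDING_RELIABILITY = 0.65

unweighted = NON_FIELDING_GAP + FIELDING_GAP
weighted = weighted_total([
    (NON_FIELDING_GAP, NON_FIELDING_RELIABILITY),
    (FIELDING_GAP, FIELDING_RELIABILITY),
])
print(f"Unweighted gap: {unweighted:+.2f} runs")
print(f"Weighted gap:   {weighted:+.2f} runs")
```

Under these made-up weights the near-tie opens into a gap of a few runs in Reynolds' favor. The exact size depends entirely on the reliabilities chosen; the point is only that unequal confidence in the inputs should change the total.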

But instead of taking this as a challenge to improve what we’re doing, it seems like people are mostly taking this as an opportunity to circle the wagons around the way things are now, responding to Heyman’s tweets with dismissive, condescending tweets of their own. If nothing else, if we want to be able to sit around at MVP time and snark about mainstream writers ignoring defense and baserunning when it comes to Mike Trout, it would probably help if we didn’t have people telling them to go back to community college to learn math when they ask questions like these. Because if we want to convince the world that sabermetrics manages to incorporate defense into its evaluations in a useful way, then trumpeting Heyman’s findings as a triumph rather than as a flaw hurts our case. How can we be trusted to get Trout right if we’re not only wrong about things like Marte versus Harper, but we’re unwilling to admit it?

But that’s not the main reason I’m upset here. No, I’m upset because sabermetrics is about asking questions and seeking honest answers, no matter what the conventional wisdom is. The people who have self-selected as sabermetricians have a conventional wisdom that Heyman is questioning, and so reflexively he’s getting attacked. Our sabermetric community is standing in the way of sabermetric progress to the extent that Jon Heyman is being a better sabermetrician than many people who would call themselves one. And that’s a real problem.

This low-rent knockoff Fire Joe Morgan nonsense has to stop. It is well past time for us to stop letting the default mode of communication between baseball researchers and baseball reporters be the one established by sitcom writers. Science involves competing hypotheses, attempts to duplicate each other’s work, and debates that gel into consensus because the weight of evidence eventually becomes too great to ignore. But that kind of healthy debate is increasingly missing from our field.

So the enemy isn’t them. The enemy is us. We are the ones establishing what sabermetrics will be for this generation and the next, and we seem to be increasingly doing it by abandoning our principles in pursuit of popularity. Each day, new questions are piling up about our purported ability to measure fielding, and we show less and less of an interest in answering them. And in the process, we’re committing some of the same sins we’re so quick to point out in others.