As that old pop song goes, “oops, he did it again.” Sports Illustrated’s Jon Heyman is asking questions about WAR:

If this sounds familiar, Heyman asked a similar question about Starling Marte and Bryce Harper roughly two weeks ago. There was a predictable dust-up on Twitter, then and now, and a few articles were written. And while much noise has been made, very little has been said, and that’s sad for a number of reasons:

1. Heyman is actually right on the merits here,

2. These aren’t edge cases or weird outliers but examples of a fundamental problem with the construction of certain WAR metrics and with several of the underlying components,

3. These are problems that have been identified for some time now, and solutions actually exist for them already,

4. And finally, that the response to Heyman is indicative of a larger problem with the sabermetrics movement.

Now, I am not a huge believer in sabermetric evangelism. I’m wholly grateful that there’s a community of people interested in sabermetric analysis—it lets me make a living doing this, if nothing else. But primarily I like preaching to the choir. And why shouldn’t I? The choir shows up every Sunday, they’ve been reading the New Testament and I don’t have to spend a lot of time explaining why sometimes he’s Paul and sometimes he’s Saul, and we can get down to business and move forward into new territory rather than being constantly stuck at the beginning. Now, there are certainly people out there who are interested in sabermetrics but haven’t had a lot of exposure to it, and you don’t want to leave them out. And there are certainly people who have closed themselves off to sabermetrics and are spreading falsehoods about the field as a result, and you want to respond to that (although sometimes we—and when I say we, I certainly include myself—engage in it a little too often).

And sometimes there isn’t much you can do. I wrote what I did about Hawk Harrelson and The Will To Win because at some point, you have to come to the conclusion that someone isn’t worth talking to anymore. Hawk’s problem wasn’t that he was wrong, it was that he was stuck in a frame of mind that starts from conclusions and will, when it cares to, circle back around to find some evidence to support it. And that, in insisting on perfection from sabermetrics but not from his worldview, Hawk was simply not engaging with the questions in an intellectually honest way.

But I get the sense that people are viewing Heyman in much the same light. And I think this is dangerous, for several reasons. One, I think he’s being honest in his questions. I admit this is the weakest point of my argument (because I can’t really know his intent), but it’s also the least relevant, and it’s also the weakest part of the opposing case (because nobody else can really know his intent, either). Two, I think his question is useful to consider even if it’s not meant in a sense of honest inquiry, and that we do ourselves a disservice if we ignore useful questions just because of where they came from.

And I think we have to beware of the idea of sabermetric tribalism, that there’s an “us” and a “them” and that we are right and they are wrong. “We” are going to end up being wrong some of the time, and “they” are going to end up being right some of the time (and sometimes all of us will be wrong). Treating the search for truth as a set of rooting interests diminishes it; if you want to cheer for something to win, cheer for a baseball team, not a way of studying baseball. At the same time, we shouldn’t shy away from internal disputes in order to avoid handing ammunition to outsiders. Yes, someone who isn’t looking at sabermetrics as a scientific process can misinterpret and misrepresent internal dissent. But avoiding treating sabermetrics as a scientific process as a result is not a defense but a surrender.

So first I’m going to handle the common responses to Heyman and point out where they are wrong, and then I’m going to loop back around to showing why Heyman was right, in a way that is intuitive to people who regularly discuss sabermetrics. The problems with bad arguments in defense of WAR are multifold, listed here from least to most harmful:

1. Bad arguments are unlikely to win many converts, because they’re bad arguments. (In fact, this is probably a good question that more sabermetricians should be asking. If you’re talking to an audience that is educated on the subject of baseball but not in sabermetrics and they don’t seem to take your point, is it because they lack the education (or open-mindedness) to understand what you’re saying, or is it simply that they lack the formal knowledge to explain to you why you’re wrong in terms that you understand?)

2. Bad arguments allow us to overlook good questions that would lead us to improve our current metrics.

3. And most severely, bad arguments indicate that many sabermetric proponents have misidentified the key lesson of sabermetrics, and are passing on that misunderstanding to others, devaluing the term sabermetrics in the process.

One of the common sabermetric notions trotted out to defend WAR from Heyman’s line of inquiry is that of sample size. Those of us with long memories (at least, long memories for this field, which isn’t saying much) may remember the rather pithy formulation of Voros’ Law, named after Voros McCracken of DIPS fame:

“Any major league hitter can hit just about anything in 60 at bats.”

The thing is, can is not the same as has.

When sabermetricians talk about sample size, they’re basing their arguments in what’s known as true score theory or classical test theory. (They are not precisely the same thing, but for the purposes of our discussion the differences do not matter.) What’s important to realize is that true score theory talks about how reliably our measurements over a sample period measure the true talent level of the subject in question. It’s a question about ability, or framed another way it’s a question about the repeatability of the performance. And it’s a discussion framed largely around random variation, or if you want to talk about a loaded word, “luck.” This really doesn’t translate well to this discussion, though—the question isn’t whether or not that’s the talent level of the two players being compared (Russell Carleton addressed that question today), but whether or not the performances are really equivalent.

A related argument is to make a similar comparison using a traditional stat, like saying that Yuniesky Betancourt and Chase Utley both have 24 RBI, so why isn’t Heyman asking questions about RBI?

Which, fine, if you want to treat WAR and RBI as equivalents, go forth and prosper. If you want to call yourself a sabermetrician, though, that’s a patently silly argument. At the risk of going all “thing shaped like itself” on you, what RBI purports to measure is the number of runs batted in, given the rulebook definition. We can argue about how well that captures a player’s offensive value (not all that well) or how well it measures total value (even less well). But those are questions about how useful that definition of the term is; it’s not being debated whether or not those players actually have those RBI totals. These debates are sometimes had historically, but those are debates over recordkeeping—not to belittle their importance, but that doesn’t undermine the idea that RBI totals are “facts” any more than someone claiming that 2+2 equals five makes rudimentary addition controversial.

WAR, on the other hand, is an estimate, not a mere recording of fact. This is why two people that ostensibly agree both on what is to be measured and the underlying recordkeeping can come to two differing estimates. This is because sabermetrics isn’t content with just counting things. We attempt to build models that relate what’s being counted to how runs and wins work on a team level. I would argue that’s far more useful, but it also means that sabermetrics deals in estimates while “old-school” stats deal far more in the counting of facts.

Sabermetrics shouldn’t be afraid of estimation—it isn’t a dirty word. But it should admit that’s what it’s doing, instead of trying to treat estimates like facts. Which is why defending WAR by comparing it to RBIs is simply a false equivalency in this case. WAR is not a fact, and RBIs are. That isn’t a value claim, and it is not a point in favor of RBIs, but if sabermetricians will not defend the practice of estimation then we shouldn’t be surprised when others shun our estimates.

Another defense of WAR is that it correlates with team wins, and it has a higher correlation with defensive metrics included than with them excluded. Both claims are factually true, in that there is a significant correlation between team wins and team WAR, and that WAR with defense included correlates better with team wins than WAR without it. But while true, they don’t really contribute anything to WAR’s defense.

Let’s play devil’s advocate for a second. Can we construct a metric with an even higher correlation to team wins than team WAR? We certainly can:

That gets a correlation of 0.95 on 2013 data, higher than the correlation Dave Cameron found between WAR (defense or not) and team wins. So why don’t we use pitcher wins and RBI instead of WAR?

We can draw parallels to the ecological fallacy, which cautions us to be careful about applying inferences drawn at the group level to the individual. Similarly, we know that pitcher wins and runs batted in sum up to team wins and team runs scored respectively, but in practice each has deficiencies in how it goes about attributing those wins and runs at a more granular level. Pitcher wins will, for instance, credit a relief pitcher who happens to be the pitcher of record when the go-ahead run scores, even if the reliever pitched poorly (and maybe even surrendered a run to put the team behind), rather than the starting pitcher who went six scoreless innings to keep the team in a position for the go-ahead run to in fact cause the team to go ahead. Similarly, RBI will reward the player who hits a sacrifice fly to drive in a runner on third, but completely ignore the contribution of the batter whose double allowed the runner to advance from first to third.

So instead of pitcher wins and RBIs we turn to WAR, which may not reconcile quite as well at the team level but, we think, does a better job of describing how each individual player’s performance contributes to wins.

In that light, using the correlation with team wins to defend WAR in general, or any particular implementation detail especially, is hypocritical. Once we subscribe to the notion that correlation with team wins tells us something, we’re led down the path to RBI and wins. And once we turn away from that path, we cannot choose to venture back down it only when it suits us.

Nobody disputes that when measures like UZR and DRS are summed at the team level, there is a correlation with team defense as measured by things like DER. And at the team level, when you add in DER (or some proxy for it, like team UZR or DRS) to a DIPS measure like FIP, you will increase your correlation with team wins. But that tells us absolutely nothing about how DRS or UZR are apportioning that defensive credit to individual players, which is the exact thing being criticized.
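A toy illustration of that last point may help. This is only a sketch with made-up numbers: `actual_credit` and `shuffled_credit` are hypothetical, and neither represents DRS, UZR, or any real team. The second metric hands credit to the wrong teammates while leaving every team total untouched, so any team-level check would score the two identically.

```python
# Hypothetical runs credited to three players on each of two teams.
actual_credit = {"Team A": [10, 0, -4], "Team B": [5, 3, 1]}

# A second, made-up metric that shuffles credit among teammates
# (here by reversing the order) but preserves each team's total.
shuffled_credit = {team: runs[::-1] for team, runs in actual_credit.items()}

# Team totals are identical, so team-level correlations cannot tell them apart.
team_totals_agree = all(
    sum(actual_credit[team]) == sum(shuffled_credit[team])
    for team in actual_credit
)

# At the player level, the second metric can still be badly off.
largest_player_miss = max(
    abs(actual - shuffled)
    for team in actual_credit
    for actual, shuffled in zip(actual_credit[team], shuffled_credit[team])
)

print(team_totals_agree, largest_player_miss)  # True 14
```

The team totals match exactly, yet one player is misjudged by 14 runs, which is the kind of error a team-level correlation has no way to detect.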

What all of these arguments have in common (well, among other things) is that they don’t really get at the heart of the matter and address the issue that Heyman is bringing up. Up until now, they share that with this column. At this point I suppose it’s only fair that I stop slinking around the pole and pouting and actually start to take some clothes off. I started off claiming Heyman was right, so let’s get down to brass tacks and explain why.

Let’s start off with a simple spin on linear weights, based on Estimated Runs Produced:

It’s a simple formula to be sure, entirely linear with a set of constants and a bunch of inputs. Each of those constants, of course, is an estimate. And so associated with each of those constants is a certain amount of measurement error. It’s small enough that normally we don’t have to deal with it (especially since we use the same constants for every batter), but it’s there.

If we want to get batting runs above average, we can take the above formula and do:

ERP is figured using our formula above, PA is the player’s PA, lgR is the total number of runs scored in MLB and lgPA is the total number of PA in MLB.
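As a rough sketch of that calculation, assuming "runs above average" means the player's ERP minus the runs a league-average hitter would be expected to produce in the same number of plate appearances. The function name and the sample inputs are illustrative only, not real player or league data.

```python
def batting_runs_above_average(erp, pa, lg_r, lg_pa):
    """Player's ERP minus league-average runs over the same number of PA.

    erp   -- the player's Estimated Runs Produced (an estimate)
    pa    -- the player's plate appearances (a recorded fact)
    lg_r  -- total runs scored in the league (a recorded fact)
    lg_pa -- total plate appearances in the league (a recorded fact)
    """
    league_runs_per_pa = lg_r / lg_pa
    return erp - pa * league_runs_per_pa


# Hypothetical inputs, chosen only to show the arithmetic.
example = batting_runs_above_average(erp=30.0, pa=200, lg_r=20000, lg_pa=180000)
print(round(example, 2))
```

Of the four inputs, only `erp` is an estimate; the other three are counts taken straight from the record.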

Now, let’s compare to the fundamental formula behind all of our defensive metrics:

Where “league” refers to all other players at the same position.

It resembles our formula for figuring batting runs above average, yes? Except in our batting runs formula, the only estimated quantity is ERP. The rest is all factual data. For our defensive measure, on the other hand, Plays Made and League Plays Made are our only factual data. Everything else has to be estimated.

Or to put it another way, we have a record of every plate appearance, which batter took it, and what the outcome was. All we have to do is estimate how those outcomes relate to runs. For our defensive metrics, we know the outcome, but we not only have to estimate how it related to runs but we have to estimate which player it was a fielding chance for.

All of which is to say that our estimates of a player’s contribution on defense are much less certain than our estimates of a player’s contribution on offense. Having said that, let’s look at the players Heyman cited Saturday night, as well as the Baseball-Reference cards for each player saved on that night:

[Baseball-Reference card: Elliot Johnson]

[Baseball-Reference card: Mark Reynolds]

We see that Reynolds has played more, which is reflected in his four additional runs attributed to the replacement level adjustment. He’s 12 batting runs ahead of Johnson, and behind him two runs on position and one run on baserunning. They’re even on double play avoidance. So that’s 13 runs between them on things mostly measured by the OPS cited by Heyman, their playing time, or their fielding position, in Reynolds’ favor. And yet they’re in a dead heat in WAR, entirely because of a 12-run difference in their estimated fielding prowess in Johnson’s favor.
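The bookkeeping in that paragraph can be laid out explicitly. The sketch below uses only the run differences quoted above (Reynolds minus Johnson); the dictionary keys are my own labels, not Baseball-Reference column names.

```python
# Reynolds minus Johnson, in runs, as quoted in the paragraph above.
reynolds_minus_johnson = {
    "replacement": 4,    # Reynolds has played more
    "batting": 12,
    "position": -2,
    "baserunning": -1,
    "double_plays": 0,
    "fielding": -12,     # the estimated fielding gap, in Johnson's favor
}

# Everything except the fielding estimate.
non_fielding_gap = sum(
    runs for part, runs in reynolds_minus_johnson.items() if part != "fielding"
)

# The full gap once fielding is counted run-for-run against the rest.
total_gap = non_fielding_gap + reynolds_minus_johnson["fielding"]

print(non_fielding_gap, total_gap)  # 13 1
```

A 13-run lead shrinks to a single run, and the fielding estimate alone accounts for that.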

And the question Heyman is asking is, how confident are we that the 12 runs on defense reflect what actually happened (not the talent level of each), compared to how confident we are about the 13-run difference composed primarily of offense?

You know, I know, everyone knows the answer to that. We are more confident that the gap between their offensive production (measured by OPS, linear weights, or anything in between) is meaningful—that is to say, reflective of what actually happened—than the difference between them in DRS. If you don’t believe me, you can ask John Dewan, who runs Baseball Info Solutions, the organization that collects the data and runs the calculations behind DRS. At SABR Analytics, he was asked about the reliability of those measures, and said:

"I feel like we're getting about 60 or 70 percent of the picture with current defensive metrics versus 80 or 90 percent on offense," said Dewan. "If I knew how to find the other 40 percent, I'd be doing it!"

So, given what we know about the reliability of offensive metrics versus the reliability of defensive metrics in terms of how well they convey what actually happened in-sample, the logical conclusion is that Heyman is right. It is likely that Reynolds was better than Johnson, or that Harper was better than Marte, over the period Heyman was examining. And WAR would be better if it would show that, instead of treating them as equal. We treat one run in DRS as equivalent to one batting run above average, but there’s absolutely no reason that we should (or that we have to). Heyman has brought up an area where WAR is “wrong” and it would be useful to make improvements.

It turns out that there is in fact a very large body of work about how to add two quantities of varying reliability. It involves estimating the amount of confidence (or uncertainty) you have in each estimate. It has use in things like polling, for example. And it can certainly be applied to measures of fielding; we’ve used it to regress our FRAA, for example, to account for random error in our estimate of fielding opportunities. It’s nontrivial to apply this kind of thinking to metrics like DRS and UZR that aren’t built from the ground up to be used this way; there are a lot of sources of potential systemic bias in those metrics that make it much more difficult to regress them based on the amount of statistical power you have. But there is absolutely a way to combine quantities of differing reliability together such that you’re putting more weight on the more reliable estimates. And if other WAR implementations were to do so, that would fix the problem Heyman is bringing up.
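To make the idea concrete, here is a minimal sketch of one simple version of it: shrinking each component toward the league-average value of zero in proportion to an assumed reliability before adding the components together. The reliability figures are hypothetical placeholders, not measured values, and this is not how FRAA, DRS, UZR, or any published WAR is actually computed.

```python
def regress(observed_runs, reliability):
    """Shrink an observed runs-above-average figure toward zero (league average).

    reliability is the assumed share of the observed figure that reflects
    what really happened; 1.0 means take it at face value.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability * observed_runs


def combined_gap(non_fielding_gap, fielding_gap,
                 non_fielding_reliability, fielding_reliability):
    """Add two run gaps after weighting each by its assumed reliability."""
    return (regress(non_fielding_gap, non_fielding_reliability)
            + regress(fielding_gap, fielding_reliability))


# Reynolds minus Johnson: +13 runs excluding fielding, -12 runs on fielding.
face_value = combined_gap(13, -12, 1.0, 1.0)

# Hypothetical reliabilities, chosen only for illustration.
weighted = combined_gap(13, -12, 0.85, 0.65)

print(face_value, round(weighted, 2))
```

With any weighting that trusts the fielding estimate less than the rest, the two players stop looking identical and Reynolds comes out ahead; how far ahead depends entirely on the reliabilities you assume.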

But instead of taking this as a challenge to improve what we’re doing, it seems like people are mostly taking this as an opportunity to circle the wagons around the way things are now, responding to Heyman’s tweets with dismissive, condescending tweets of their own. If nothing else, if we want to be able to sit around at MVP time and snark about mainstream writers ignoring defense and baserunning when it comes to Mike Trout, it would probably help if we didn’t have people telling them to go back to community college to learn math when they ask questions like these. Because if we want to convince the world that sabermetrics manages to incorporate defense into its evaluations in a useful way, then trumpeting Heyman’s findings as a triumph rather than as a flaw hurts our case. How can we be trusted to get Trout right if we’re not only wrong about things like Marte versus Harper, but we’re unwilling to admit it?

But that’s not the main reason I’m upset here. No, I’m upset because sabermetrics is about asking questions and seeking honest answers, no matter what the conventional wisdom is. The people who have self-selected as sabermetricians have a conventional wisdom that Heyman is questioning, and so reflexively he’s getting attacked. Our sabermetric community is standing in the way of sabermetric progress to the extent that Jon Heyman is being a better sabermetrician than many people who would call themselves one. And that’s a real problem.

This low-rent knockoff Fire Joe Morgan nonsense has to stop. It is well past time for us to stop letting the default mode of communication between baseball researchers and baseball reporters be the one established by sitcom writers. Science involves competing hypotheses, attempts to duplicate each other’s work, and debates that gel into consensus because the weight of evidence eventually becomes too great to ignore. But that kind of healthy debate is increasingly missing from our field.

So the enemy isn’t them. The enemy is us. We are the ones establishing what sabermetrics will be for this generation and the next, and we seem to be increasingly doing it by abandoning our principles in pursuit of popularity. Each day, new questions are piling up about our purported ability to measure fielding, and we show less and less of an interest in answering them. And in the process, we’re committing some of the same sins we’re so quick to point out in others.

I think this is a good article, but Jon Heyman makes a habit of trolling saber-people every month or two. He's actually reasonably good at it, judging from the Twitter rage that invariably ensues.
So your response is basically, "He started it!"
No, just that I don't think he is actually interested in the answers to the questions he poses. That tends to make a meaningful conversation a touch difficult.
Colin's point is that, even if he is trolling, he has a point, and if we ignore his point because we think he's trolling we're inhibiting ourselves and our pursuits with regards to baseball. Whether we have a meaningful conversation with Heyman or just with ourselves about what he brings up is irrelevant if we're learning something about the stats we use and the problems we take on.
This is all true and agreed, but the reason the reaction to Heyman is so snarky is that he IS a troll. He "asks" these "questions" all the time and gives zero indication that he's actually trying to learn from them. Case in point: the Reynolds-Johnson "mystery" is EXACTLY THE SAME as the Marte-Harper one he was grousing about a few weeks ago. I really don't think trying to engage with him is at all different from trying to engage with Hawk.

And really, I agree wholeheartedly with everything Colin says except... does Heyman's question really speak to anything? Any time two position players have differing offensive stats but similar total WAR (or vice-versa), it's due to some combination of: defense, playing time, position, baserunning. Click over to B-R, scroll to the "2013" line, and see which one(s) it is. This takes about 10 seconds. It's not a mystery, and really anyone who's sabermetrically literate knows this already. We also know that defensive metrics are rather uncertain, and people are working on that. I guess it's good to have that point made explicit now and again, but beyond that, I don't see what value or impetus Heyman's inquiry really brings.
Except that a lot of sabermetric types are asking these questions (and the right way, unlike Heyman, who just goes "I don't get it, so I'll presume it could only be completely wrong") already. We're trying to perfect our estimations of WAR, and the best ways to measure and regress defense.

Heyman's point isn't to promote deeper thinking, it's to throw a wrench into the works so that the machine stops.
In actual fact, many of us *in* the sabermetric community have simply stopped asking these questions, because we get the same barrage of ad hominem, sub-FJM abuse that Heyman gets when we ask them.

Colin wasn't kidding when he said Heyman was being a better sabermetrician than most of his critics. One of the things you have to do in sabermetrics is to ignore, to resolutely ignore, feelings and intentions and motives. These things are not amenable to quantitative analysis.

Heyman has no need whatsoever to shut anyone up (I'm sure he'd like to do without the abuse, but it's par for the course for journalists these days, so I doubt it's a particular concern).
Even if he's trolling, it's a teachable moment. Even if Heyman doesn't take the lesson, there will be others who will read and understand. Even if he's trolling, he brings up a reasonable and important question, and it's one that needs to be addressed, and we need to honestly address it, even if it means that the answer is "We don't know yet and you are correct."
I said this down below in response to another comment, but I think it's important enough to repeat.

If a question isn't worth answering, then that's one thing.

This was deemed by many people a question worth answering, they just answered it wrong.

Heyman's intent doesn't make the wrong answer any less wrong or any less in need of correcting.
There are two separate issues here. One is whether there are legitimate questions to answer about how WAR is constructed. The answer to this is clearly yes, and these should be discussed. The other is whether there is value responding in a manner such as this to someone who is obviously trolling. I'd imagine that Jon Heyman is delighted to have caught such a big fish with his little troll right now. Ignoring him, and looking at issues with WAR at another time, and in a way not related to his trolling, is a much better option.
Forget who asked the question, or why. Just consider the question on its own merits.
Exactly. Colin wrote a great post, but he screwed up one important part - "I think [Heyman]'s being honest in his questions"

He's not. If you try to talk to Heyman about sabermetrics, you will quickly find out that he's not interested in an actual discussion. He doesn't want to learn a thing. He wants to troll.

Colin, the "low rent FJM knockoff" you are referring to are almost certainly people that have tried to reasonably respond to Heyman only to have him crack wise or suggest they still live in their mom's basement. It's understandable that they crack back.
Fantastic article Colin. Could not agree more.
Heyman has to start looking at other forms of WAR/WARP. The Reynolds / Johnson argument does not even come into play if you input a heavily regressed defensive stat like UZR, instead of DRS.
Looking at the data on Fangraphs as of Saturday night (I saved the leaderboard from when Heyman made his comments in case I needed the data for this), the SD of DRS is 1.75, compared to 1.78 for UZR. Don't really think that UZR is any more regressed than DRS is (I don't think either of them are regressed, really).
A beginning note: of the 12 defense runs Johnson is awarded above Reynolds, 2 aren't even because of his performance but rather simply because of where he is standing on the field.

The problem between offense and defensive WAR is not simply one of needing better defensive metrics--we cannot, and probably will not ever be able to, eliminate the subjectivity inherent in evaluating defensive performance. That subjectivity will always be present because 1) defense measures are to some degree dependent on the performance of other players in the defense and 2) there is no objective standard for what "good" defense looks like in the first place (e.g., is range more important or sure hands?).

I'm going to leave it at that because I don't want to start on a rant about the insanity of reifying replacement level performance.

WAR should always be presented in a decomposed format, with offense and defense contribution completely separated. Let the USER of the statistic decide how to weigh the two components rather than jam them together into a single misleading number.
That's how Fangraphs shows it (bottom of every player page). The reader does have the choice of adding up the separate components (or picking and choosing), or letting Fangraphs automatically do it for them.
"we’re not only wrong about things like Marte versus Harper, but we’re unwilling to admit it?"

Are we sure that we ARE wrong about Marte vs Harper? Marte has been outstanding this year, not just in stats, but, as someone who watches a ton of Pirates games, in things like baserunning and defensive play. I'll bet whatever metric we have 50 years from now, if applied to this same period, would come up with Harper and Marte as practically identical.

As for Reynolds vs Johnson, Mark Reynolds is blind, so that'll hurt the ol' warp.

One of the problems I have with the "Saber-people need to stop being jerkballs to Heyman" argument is that I haven't seen many legit, respectable sabermetricians be a jerkball to Heyman about this.

Every reaction that I've seen (which isn't many, tbh - I try to stay away from these bs controversies) from anybody I would even think about reading has been measured, considerate and thoughtful.

I guess there have been some NotGraphs things, but here, DCameron at FG, hell, even Joe Sheehan, the Marty McSorley of the group, pretty much agreed with Heyman. If everyone who believes advanced metrics is going to be judged by what some idiots on twatter reply to Heyman, then I get to judge non-statvangelicals on the worst comment I can find under a Jay Jaffe piece at

Yeah, this article felt like preaching to the choir some more. Then again, I didn't see the Tweets.
Cameron, among others, wrote a rebuttal to Heyman; I think it misrepresents him to say he pretty much agreed. (I don't think Cameron's rebuttal was correct, as I mentioned in the section dealing with the correlation between WAR and wins. But I don't think your characterization of it is reflective of what he said.)

And it isn't just that we need to stop being a jerkball (shoot, I'm a pretty big jerkball myself). It's that Heyman actually has a point, and that sabermetrics needs to spend more time engaging with criticism of defensive metrics because there's a lot of work still to be done on them.
Yeah, I didn't write that clearly enough.

It was Sheehan who essentially agrees with Heyman in that early season DMetrics are unreliable and I find Sheehan to be among the most combative of the saberish writers. I didn't mean to imply that Cameron agreed with Heyman, only that he wasn't disdainful, but rather thoughtful in his response to Heyman.

I think, for the most part, the stats community "establishment" is pretty good about recognizing flaws, limitations, areas for improvement, I.E., change in replacement level at B-R and FG, the switch to the play-by-play defensive data that you and Jay made to JAWS a few years back, the move away from VORP...

But perhaps I just read only the writers who have a tone of intellectual honesty to their writing/research. In the last week alone, Rany and Professor Parks were going out of their way to make fun of their poor predictions on Hosmer or Brendan Harris.

Also, it drives me crazy that a paid writer doesn't even attempt grammar and punctuation in a tweet. How hard is it to capitalize "I"?

This was a really interesting article, btw.
Excellent article, but can you please put your clothes back on now?
I usually don't try to argue about WAR. As Colin said, it is an estimation. I can't defend it with numbers (I don't know the formula well enough) or therefore logic, and it is difficult to get non-saber involved people to understand that.
I usually just talk about stats that are quantifiable and "see-able", and give examples of how other stats are flawed. Usually I can work this into a "stat x should clearly then have a higher correlation with being a better player than stat y" sort of argument.
I can communicate these sorts of things much more easily than WAR. And also WAR is but 1 stat in a field where there have been dozens created. Any point one is attempting to make where WAR supports the argument will almost certainly be supported by many other stats.
If one stat doesn't tell the right story, make another one (stat that is).
Excellent article Colin, it's great to hear someone so mathematically gifted also able to speak so eloquently and persuasively in plain English. Hopefully people out there agree with you.
Reynolds' WAR for six weeks into the season has to be a record considering he is utterly and completely unable to see.
I think "WAR Mystery of the Week" would be a pretty fantastic weekly Unfiltered feature, actually.
A lot of fans don't have the appetite to digest concepts like estimation and confidence levels of components, etc. However, Heyman is a smart person with a prominent role in the industry. Despite the fact that he has received negative tweets that may present bad arguments for WAR, he has undoubtedly also heard the same good arguments you raise for and against WAR. I bet if you pushed him in private, he could articulate much of what you say here. It's hard to infer meaning in tweets, and I respect people's ability to play off that. Consider him a successful troll. But he's not raising new questions here. He's prodding, and maybe the sabermetric community needs prodding. I do think there is a point we can start dismissing tweets that go no further than shallow observations.
You have not tried to have a discussion with Heyman. I can say this without a doubt. Heyman is not a "smart person", at least in regards to this. He may not be dumb, but a conversation with him will reveal that he cannot actually articulate much of what Colin says here, or really do much more than repeat 'doesn't make sense to me, thus advanced metrics must be useless'.

He did this years ago when Miguel Cabrera led the league in WARP. Heyman couldn't possibly comprehend that a team could still be one of the worst in the league even if it had one of the best players. So his conclusion was that Cabrera must not be that great a player. Does that sound like someone who is a "smart person" or who can articulate what Colin says here?
I think in a nutshell Heyman is saying, "let's not talk about WAR when it is not a meaningful number, let's talk about something else"

In May, WAR is more deceptive than helpful vs other metrics.
While I'm pleased you took Heyman's point and ran in an interesting direction, his initial question seems like an Intro to Logic question:

1) This guy is hitting much better than that guy
2) According to Measure X, this guy and that guy are roughly equivalent
3) Therefore, Measure X must include something other than hitting

And what fan, sabermetric or not, has EVER praised Mark Reynolds' DEFENSE? (Caveat: Reynolds has certainly made individual plays that were excellent, but his overall defense has always been considered below-average, even at UVa.)

I mean, I'm not a super-sophisticated fan, but I'm a fair Logician, and the first thing I thought when I saw the stat was, "Boy, Elliot Johnson must be a pretty good defender." Is that really that clever a leap to make?

(As an Indians fan, I already knew that Mark Reynolds is not a "pretty good defender.")
Actually I have heard announcers praise Reynolds' contribution on defense at 1B last year in replacing Chris Davis and the overall improvement to the team defense with Machado replacing Reynolds at 3B. The overall takeaway is that Reynolds is significantly better at 1B than he was at 3B.
Well done, Colin.
Great article.
Kudos to the editors to keep this out in front of the paywall. It needs to be read by many.
I've said this many times: I don't like the idea of combining UZR and offensive metrics (to get WAR). It is combining apples and oranges. It is a little like OPS (a false combination), but a lot worse.

The simple reason is that UZR needs to be regressed a lot more than the offensive metrics (and probably a lot more than UBR or other baserunning metrics).

An unregressed offensive metric tells you basically what happened, but obviously it does not reflect the best estimate of talent without regression. (Big mistake that lots of folks make is equating the two).

Unfortunately, the way that UZR and DRS are constructed, unregressed numbers neither tell us what happened nor reflect our best estimate of talent.

If you want to combine UZR (or DRS) and offensive metrics like lwts, you would need to regress the UZR some amount to reflect an estimate of what really happened (less regression than you would if you wanted to estimate true talent).

Unfortunately that is not done, so you end up with a "hybrid" monster in WAR which includes a pretty much exact measure of what happened on offense (translated into theoretical runs/wins - which really mean nothing, BTW) AND a rough estimate of what happened on defense.

The result is a number, WAR, which represents what happened on offense plus a rough estimate of what happened on defense. And any time the defense number (e.g. UZR) is greater than or less than our estimate of the player's true talent, we should assume that it is too high or too low, respectively, in terms of what happened.

There is just no getting around regressing UZR before combining it with offense. If you don't do that, which no one does, then, yes, you have to take those WAR numbers with a large grain of salt, especially when the UZR component of WAR is far away from our estimate of that player's true talent UZR. Even if UZR is not far away, the error bars around the UZR, in terms of an estimate of what happened, are large - or at the very least, they exist. There are virtually no error bars around the offensive component of WAR, at least with regard to what they represent - exact offensive results translated to theoretical runs.

At the end of the year, we can probably live with the unregressed UZR part of WAR. 1.5 months into the season - nah...
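The regress-then-combine idea in this comment can be sketched roughly as below. This is only an illustration, not how UZR, DRS, or any published WAR is computed: the function names, the regression constant `k`, and every number are assumptions made up for the example.

```python
def regress_toward_mean(observed_runs, chances, k, mean_runs=0.0):
    """Shrink an observed fielding total toward a mean.

    weight = chances / (chances + k), where k is a hypothetical
    regression constant (a larger k means more regression for the
    same number of chances).
    """
    weight = chances / (chances + k)
    return mean_runs + weight * (observed_runs - mean_runs)


def combined_runs(offense_runs, fielding_runs, chances, k):
    """Add unregressed offense to regressed fielding, both in runs."""
    return offense_runs + regress_toward_mean(fielding_runs, chances, k)


# Made-up inputs: +5 batting runs and an observed +6 fielding runs.
# Only the number of fielding chances differs between the two calls.
early_season = combined_runs(5.0, 6.0, chances=100, k=300)
full_season = combined_runs(5.0, 6.0, chances=900, k=300)
```

With these invented inputs, the same observed +6 counts for much less of the total after 100 chances than after 900, which is the "1.5 months into the season" concern in numeric form.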
That's actually an interesting philosophical point -- do you use a player's fielding data from other seasons as part of your regressed estimate of what he actually DID? I can see arguments on both sides (although currently we don't do it like that).
Yes, you do. Of course you do. If a player is zero for his career, and he has a +15 per 150 in half a season, it is likely that we overestimated the difficulty of his chances.

If that player is Brendan Ryan, and he has a +15 half way through the season, it is more likely that that is what happened.
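A rough way to see why the prior matters is a precision-weighted average of the player's prior and the observed number. This is a sketch under a simple normal-normal assumption; the function name, the standard deviations, and the prior values are hypothetical and are not estimates for any real player.

```python
def posterior_mean(prior_mean, prior_sd, observed, observed_sd):
    """Precision-weighted blend of a prior and an observation.

    Assumes both the prior and the measurement error are normal;
    all inputs are in runs per 150 games.
    """
    prior_precision = 1.0 / prior_sd ** 2
    observed_precision = 1.0 / observed_sd ** 2
    weighted_sum = prior_mean * prior_precision + observed * observed_precision
    return weighted_sum / (prior_precision + observed_precision)


# Same observed +15 per 150, two different (made-up) career priors.
average_career = posterior_mean(prior_mean=0.0, prior_sd=5.0,
                                observed=15.0, observed_sd=10.0)
elite_career = posterior_mean(prior_mean=15.0, prior_sd=5.0,
                              observed=15.0, observed_sd=10.0)
```

Under these assumed spreads, the career-average fielder's +15 is pulled most of the way back toward zero, while the fielder with an elite track record keeps essentially all of it.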
MGL is correct. Which is why 99% of the people won't accept altering the quality of a SS's performance in 2013 based on how well he was estimated to play in 2011-2012.

Furthermore, we'd have to constantly revise those estimates as future information comes in. We had no information with say Andrelton Simmons, and suddenly, we've got tons. What looks like a favorable view of the quality of his chances in 2012 is now going to be changed to "fair" and possibly even "too harsh" in 2012, based on his performance in 2013.

MGL got backlash when he updated how UZR was calculated in the off-season. Imagine the backlash as we have to update these things in-season.
Moral of the story: Don't combine two numbers when one is regressed or does not need to be regressed to tell you what you think it tells you, and the other needs to be regressed to tell you what you think it tells you or want it to tell you.

And that warning is even stronger when you have a small sample size!
Great article...

So, is BP working on reliability/error bar/standard deviation components of its estimates?

If the Reynolds/Johnson WAR were presented something like 0.8(+-.5)/ 0.9(+-.6), it may be more complicated, but it would be a more accurate portrayal, leading to less confusion.
Well, would the error bars actually be all that different? Beyond the issue of sample size for Johnson vs. Reynolds, I'd think they'd still have the same level of error. A +10 defender and a +1 defender in the same amount of innings should have equal amounts of uncertainty, for the most part. This isn't nearly as useful as regressing them, as that would shrink the contribution that UZR or DRS has on total WAR.
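The distinction drawn here between error bars and regression can be sketched as follows. The sketch assumes the offensive and defensive errors are independent and combine in quadrature; the function name and all of the win values and standard errors are invented for illustration.

```python
import math


def war_with_error(offense_wins, defense_wins, offense_se, defense_se):
    """Return (total, standard error), assuming independent errors."""
    total = offense_wins + defense_wins
    se = math.sqrt(offense_se ** 2 + defense_se ** 2)
    return total, se


# Two made-up players with the same playing time, so the same assumed
# defensive standard error, but very different defensive values.
big_glove = war_with_error(0.3, 0.5, offense_se=0.0, defense_se=0.5)
small_glove = war_with_error(0.3, 0.05, offense_se=0.0, defense_se=0.5)
```

Both players end up with the same error bar, and the point estimate of the big-glove player is not reduced at all, which is why attaching error bars is not a substitute for regressing the defensive component.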

excellent article
I'm fine with the idea that WAR needs improvement and that the analytical community needs to be more introspective. And, I completely agree that the Fire Joe Morgan knockoff stuff is lame. It's really just a comedic attempt that has completely run its course, and fighting with anti-SABR trolls seems pointless anyway, since statistical analysis has won the battle pretty convincingly by now.

But as others have said, I think you miss the mark by simply assuming A: Jon Heyman is making his criticisms in good faith, and B: that we should listen to him even if he is in bad faith.

Bad faith arguing should never be encouraged. The key to any scientific improvement is an honest give and take, and if Heyman is not actually tweeting with an eye towards that, he has no business being in the conversation. If the community has to stop for every troll tweet, re-evaluate the position, and re-discover everything, nothing will ever get accomplished because you just keep beating dead horses.

Do you really imagine a situation where an improved WAR stops creating strange early season equivalencies? Even a perfect metric will spit out data that is strange at times (namely, because reality itself is very strange). And as long as that's happening, there will be bad faith critiques. Do we have to re-do WAR for every Heyman tweet? Does it make sense for the State Dept. to spend every day at a hearing for Benghazi? Or for the Bush administration to spend every day explaining why 9/11 truther theories are stupid? All are colossal wastes of time because a bad faith arguer won't ever drop it anyway.

TL;DR -- If Heyman's trolling is showing us real WAR deficiencies, WAR probably needs better quality control. But Heyman is still a troll, and engaging with trolls (to debate or to snark) is still a waste of time.
It doesn't really matter whether Heyman is trolling or not, though. It's not like people just ignored him, people actually replied to him, some of them in-depth -- they just gave him the wrong answer. If people were just ignoring him, that'd be one thing, but they were saying wrong things instead.

And if people were giving Heyman the wrong answer BECAUSE he was trolling, that's even worse.
Right, if a bad faith response or argument helps to cause analysis to move forward, then so be it. Sometimes that happens. It is not always legitimate discussion by intelligent, knowledgeable, and reasonable people that moves science in the right direction.
Excellent article. Colin basically makes the point I tried to make in my comment on the Harrelson-Kenny dustup, only much better. Both sides have valid points, neither side is 100% right and neither is 100% wrong, and the debate is best served by making rational arguments rather than those of a straw man or ad-hominem nature. I would like to see all supporters of advanced analysis adopt the tone of Brian Kenny and Colin Wyers as well as their cause.
I think that this argument would be better if you explicitly addressed Cameron and/or other professionals. Sean Forman, for example, talked about the Heyman Tweet on Keith Law's podcast. He specifically broke down the small number of plays that Marte made that Harper didn't make. Going after anonymous amateur sabermetricians is kind of like shooting fish in a barrel. I know that specifically addressing Cameron and Forman may create social awkwardness, but talking directly to them is really the best way to facilitate progress in sabermetrics.

To make a political analogy: is it better to address the argument of a professional pundit's op-ed or is it better to address the Internet commentators on that op-ed?
I spent 500 words on Cameron's response, linked to it and called him by name. I'm not sure what more I could have done, except maybe call him by name in the headline.
Thanks for the interesting article. Looking at the formula for batting runs above replacement, I realized that it basically weighs each outcome with a certain value. It doesn't take into consideration that all outs or hits are not equal. For example, hitting a ground ball to the right side with no outs to move a runner from second to third is way more valuable than striking out. A 13 pitch at bat is worth much more than a first pitch GIDP. And that's not even talking about the different "hit" values. While WAR can be a useful tool, I believe the problem arises when people don't recognize its limitations and think it gives a complete picture and is the only statistic needed to measure a player's performance.
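The fixed-weight structure described in this comment can be sketched like this. The run values are placeholders of roughly the size usually quoted for linear weights, not the values from any particular WAR implementation, and the event names and function name are assumptions for the example.

```python
# Placeholder run values per event; real values vary by season and source.
RUN_VALUES = {"1B": 0.47, "2B": 0.77, "3B": 1.04, "HR": 1.40,
              "BB": 0.31, "OUT": -0.27}

# Every kind of out is collapsed into one category before weighting.
OUT_KINDS = {"strikeout", "productive_groundout", "flyout"}


def batting_runs(events):
    """Sum fixed run values over a list of plate appearance outcomes."""
    total = 0.0
    for event in events:
        category = "OUT" if event in OUT_KINDS else event
        total += RUN_VALUES[category]
    return total


# Two hypothetical lines that differ only in the kind of out made.
line_a = batting_runs(["1B", "BB", "strikeout"])
line_b = batting_runs(["1B", "BB", "productive_groundout"])
```

In this sketch the two lines get identical totals, because the context-neutral weights cannot tell the strikeout from the groundout that moves the runner.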
Thank you thank you thank you for writing this.