From the Twitters yesterday morning:

All the context you really need to know to understand that tweet:

  • He’s referring to the implementation of WAR on Baseball-Reference, and
  • The fielding metric currently in use there, DRS, has Barney’s defense rated at over twice what any other popular fielding metric does.

I leave the matter of Barney’s defensive rating as an exercise for the reader—that equine hasn’t stopped twitching, but I’ll hold off on the beatings for right now. No, I want to discuss something that came out of a long discussion on Twitter in response to that remark: What does it mean when we think WAR (or any other metric) is wrong? Can we still use it? Should we discard it until we’ve worked out all the imperfections?

Or another example – CBS Sports recently ran a piece entitled “Aroldis Chapman broke FIP.” The heart of the piece:

So, that's the short version — and here's the fun outlier, Chapman's FIP for July is -0.99 — that's right, that little mark in front of what would be an impressive FIP means it's silly good. (His xFIP is a slightly less — but still completely certifiable — crazy -0.73.)

I'd seen this somewhere and looked it up myself — it's true. And to try to figure out what that menat [sic], I emailed one of the smartest baseball stat folks I know (and who will return my emails), Dave Cameron of FanGraphs. Here's what he had to say about it.

Basically, he's been so good he broke the formula. Obviously, it's not possible for a pitcher to have an ERA lower than 0, so a negative FIP just means that based on his walk rate, strikeout rate and home run rate, the formula expects him to have given up zero runs this month.

What does the real world think about this theory? Chapman's appeared in 12 games this month, throwing 11 1/3 innings and allowed no runs. So it works!

Unlike the WAR example earlier, we can confirm that FIP is wrong in Chapman’s case, since negative runs are impossible in baseball. So if FIP can be wrong for Chapman, is it possibly wrong in other cases? Does that mean we should stop using FIP?

The answer to those questions, if you’re impatient:

  1. Yes. I would even go so far as to say it’s “wrong” in all cases, or at least the vast majority of them.
  2. Not really.

It’s quite easy to see how FIP “breaks” here: it’s a linear model, and nothing in the formula bounds it at zero the way reality is bounded. If a pitcher’s strikeout rate, relative to his walks and home runs, gets high enough, FIP will go negative. But FIP will bend before it breaks; there will be some above-zero pitchers whose estimates are nonetheless lower than they would be if FIP were realistically bounded at zero at the low end.
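You can see the unboundedness directly in the arithmetic. The standard FIP formula is a linear combination of home runs, walks, and strikeouts per inning, plus a league constant (roughly 3.10, chosen so league-average FIP matches league-average ERA). The event counts below are made up to resemble a dominant relief month; they are not Chapman's actual July line.

```python
def fip(hr: int, bb_hbp: int, k: int, ip: float, constant: float = 3.10) -> float:
    """Fielding Independent Pitching: a linear function of home runs,
    walks plus hit batters, and strikeouts per inning, plus a league
    constant. Nothing here clamps the result at zero."""
    return (13 * hr + 3 * bb_hbp - 2 * k) / ip + constant

# An ordinary month (3 HR, 8 BB+HBP, 28 K in 30 IP): FIP lands in a
# normal ERA-like range.
print(round(fip(hr=3, bb_hbp=8, k=28, ip=30.0), 2))   # 3.33

# A hypothetical dominant relief month (0 HR, 1 BB, 25 K in 11.1 IP):
# the strikeout term overwhelms the constant and FIP dips below zero,
# even though a pitcher can never actually allow negative runs.
print(round(fip(hr=0, bb_hbp=1, k=25, ip=11.33), 2))  # -1.05
```

Because the model is a straight line through the inputs, it keeps extrapolating past the point where reality stops, which is exactly what happened with Chapman.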

To be blunt, FIP is a model. What we can say about models is this:

  • All models are wrong.
  • Many are useful.
  • Some are more useful than others.
  • The utility of a model often depends on the question you are trying to answer.

Let’s take an example from physics. Most everyone is taught Isaac Newton’s theory of gravity in school these days, despite the fact that Newtonian physics has been shown to be at best incomplete, and as working physics has been supplanted by Einstein’s theories of relativity and by quantum mechanics. There are real, observed phenomena (like measurements of the rotation of the galaxy) that point out problems with Newton’s theories. So why do we still teach them and use them?

The answer is simple—because they’re still useful in predicting the behavior of gravity as we can observe it in our everyday lives. The extreme cases where it breaks down, either at the level of large galaxies or of minute quantum particles, are simply not relevant to us. Moreover, learning Newton’s theories can still teach us quite a bit about gravity.

(Some of you may be trying to reconcile this statement with my opinions about, say, batted ball data. I do believe that imperfect models, which is to say all models, can be useful. But that isn't to say that all of them are. When looking at batted ball data, there is evidence that suggests that the model isn't adding anything to our understanding, while adding to our complexity. When it comes to accepting a model, I tend to err on the side of parsimony, which is to say the idea that the simplest theory that explains what we're looking at is best. That doesn't mean that complexity is bad, but for a more complicated model to be embraced it should offer conclusive evidence that it's adding to our understanding.)

The world is not as simple as being right or wrong. As science fiction author Isaac Asimov wrote, "When people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." As Patriot (whose blog is the sabermetric site with the lowest ratio of readers to useful insights out there, and which you owe it to yourself to read more often) put it:

The statistician George Box once wrote that “Essentially, all models are wrong, but some are useful.” Whether the context in which this was written is identical or even particularly close to the sabermetric issues I’m going to touch on isn’t really the point. Perfect models only occur when severe constraints can be imposed. Since you can’t have a perfect model, the designer and user must decide what level of accuracy is acceptable for the purposes for which the model will be used.

So let me tell you a little story.

I was in the kitchen, doing dishes. Suddenly, I hear the sound of a child’s tears from the living room, and as I look over my shoulder, I see my little girl standing in the doorway of the kitchen. In her left hand is the saucer section of the USS Enterprise, registry number NCC-1701-D. In her right hand is the engineering section of the USS Enterprise.

That particular model kit (which I had assembled all the way back in high school) was not intended to detach the two sections of the ship from each other. She looked up at me and said, “I broke your spaceship, daddy.”

Now, I am going to tell you what I told her that day.

Sometimes, these things happen. Sometimes models get old. Sometimes they’re brittle. Sometimes they’re built by teenagers living in their mother’s basement and they’re not built to last. That’s okay. It was put out there to be used, played with, and enjoyed. We’ll see if it’s still usable, or if we can fix it. If not, we’ll throw it away and get a new one.*

This is of course a fine line to be walked. We don’t want to keep models around after they’ve long been supplanted by superior ones. We want to keep progressing, and if we’re too accommodating, we’ll stagnate. (This is, if you think about it, one of the key reasons Bill James was able to have the impact he did—because by and large that’s what the generation of baseball writers before him had done.)

But if we wait around for perfect models, we will be waiting forever. This doesn’t mean we shouldn’t be humble about how we use our models, because we should. And we should be infinitely more cautious when using a model where we haven’t found flaws than using a model where we have found flaws. We should avoid false certainties, instead presenting our uncertainties and our uncertainties about our uncertainties. But we can do that only by playing with our toys more, not less. We learn so much more when we take the toys off the shelves and start to use them.

*In case you were wondering – the Utopia Planitia shipyards were unable to salvage the craft. In life, as in art, it has been replaced by the Sovereign-class USS Enterprise, NCC-1701-E. It lights up and makes cool noises when you press the buttons on it.

Thank you for reading


Lovely piece, finds the balance between humility and hubris when it comes to statistical questing. Also I am glad you have a new Enterprise to play with.
I wonder if "all models are false," is really just another way of saying, "all models are models." A model is inherently an abstraction and necessarily a simplification. Thus labels of "true" or "false," "right" or "wrong," don't make sense. As you nicely put it, models can only be more or less useful for studying a given feature.
Excellent article with a great lesson and an even better ending. You're an excellent writer, Colin - how do we get you to write more articles?
Very enjoyable read. Thank you!
Models can be good as a framework of discussion provided the model outputs something meaningful/relevant. The trick is the model itself should be something stable/consistent. FIP, WAR, etc work in a lot of cases and rarely "break", so they make for a good "set of rules" to begin a discussion.
Direction, then precision.
Really well done.
"•The fielding metric currently in use there, DRS, has Barney’s defense rated at over twice what any other popular fielding metric does."

Btw Darwin Barney's 2012 FRAA is at 8.7...
The DRS model, much like the Newtonian model of physics, does break down at the edges, yes. A great example is Lawrie playing rover in RF as a 3B.

I still don't see how the model broke down, per se, for a second baseman. What gravitational constant went haywire for Barney to have such a high defensive rating?

You can wave your hand and dismiss his rating by stating "models aren't perfect", but I'd find it a lot more interesting if you uncovered how, exactly, Darwin Barney broke the model.
Yeah, I agree with this. Figuring stuff out like that will help with understanding defensive stats, I'd guess. This is really a fantastic article as well. BP has been bringing some heat this summer with several really fine pieces demonstrating the way sabermetrics are maturing: we are getting really good at understanding that there are still a lot of things to learn and a lot of things we still can't solve to certainty. That's the coolest thing about all these stats and studies. As Byrum Saam used to say many years ago: you never do know. And that's a wonderful thing.
It's hard to tell what goes on under the hood with a lot of these defensive metrics. I remember last year looking at Barney's Range Factor and Fielding Percentage and that he only rated better than Weeks and Uggla among NL second basemen. And yeah, RF and PCT are flawed, but just from those "baseball card" stats it is hard to see why other "advanced" metrics give him a lot of credit since it's not clear what actually makes up those metrics.
I second this request to inspect how the metrics measure Barney's year and why they diverge. Is it so important that the models are correct on Darwin Barney? No. But it should be very useful to know why they diverge so much, if only from a curiosity standpoint. That can give us a better way to know what questions the models answer for us. I have wanted the same thing for that year FRAA diverged for Peter Bourjos.
I'll join the chorus on this one; I'd like to know how it happened. Of course, I have little faith in any defensive metric, so DRS "breaking", if it did, doesn't concern me that much, except that it holds other, perhaps more valid metrics (mostly offense) up to ridicule by the unwashed. BP has a defensive metric in which players' defensive value is increased or diminished based on the OFFENSIVE ability of those who play their position, which makes no sense at all. And if Darwin Barney's numbers are off the chart, what does that make the numbers on the Don Mattingly and Keith Hernandez BP cards, which show Mattingly as having negative defensive value in 1985 and '86, and Hernandez being negative in 1987 and 1988? Barney's numbers can't be sillier than those.
I remember how the last FRAA revision wreaked havoc on Jaffe's JAWS scores.
Positional offense isn't a factor in FRAA at all.
Why does FRAA rate Darwin Barney highly?