From the Twitters yesterday morning:

All the context you really need to know to understand that tweet:

  • He’s referring to the implementation of WAR on Baseball-Reference, and
  • The fielding metric currently in use there, DRS, has Barney’s defense rated at over twice what any other popular fielding metric does.

I leave the matter of Barney’s defensive rating as an exercise for the reader—that equine hasn’t stopped twitching, but I’ll hold off on the beatings for right now. No, I want to discuss something that came out of a long discussion on Twitter in response to that remark: What does it mean when we think WAR (or any other metric) is wrong? Can we still use it? Should we discard it until we’ve worked out all the imperfections?

Or another example – CBS Sports recently ran a piece entitled “Aroldis Chapman broke FIP.” The heart of the piece:

So, that's the short version — and here's the fun outlier, Chapman's FIP for July is -0.99 — that's right, that little mark in front of what would be an impressive FIP means it's silly good. (His xFIP is a slightly less — but still completely certifiable — crazy -0.73.)

I'd seen this somewhere and looked it up myself — it's true. And to try to figure out what that menat [sic], I emailed one of the smartest baseball stat folks I know (and who will return my emails), Dave Cameron of Here's what he had to say about it.

Basically, he's been so good he broke the formula. Obviously, it's not possible for a pitcher to have an ERA lower than 0, so a negative FIP just means that based on his walk rate, strikeout rate and home run rate, the formula expects him to have given up zero runs this month.

What does the real world think about this theory? Chapman's appeared in 12 games this month, throwing 11 1/3 innings and allowed no runs. So it works!

Unlike in the WAR example earlier, here we can confirm that FIP is wrong in Chapman’s case, since negative runs are impossible in baseball. So if FIP can be wrong for Chapman, is it possibly wrong in other cases? Does that mean we should stop using FIP?

The answer to those questions, if you’re impatient:

  1. Yes. I would even go so far as to say it’s “wrong” in all cases, or at least the vast majority of them.
  2. Not really.

It’s quite easy to see how FIP “breaks” here: it’s a linear model with no floor, so the estimate drops below zero when the inputs are extreme enough. Unlike reality, FIP is not bounded at zero at the lower end. If a pitcher’s strikeout rate gets very high relative to his walks and home runs, you will see FIP go negative. But FIP will bend before it breaks: there are going to be some above-zero pitchers who nonetheless have lower estimates than they would if FIP were realistically bounded at zero on the lower end.
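To make the linearity concrete, here is a minimal sketch using the commonly published form of FIP, (13×HR + 3×(BB+HBP) − 2×K) / IP plus a constant. The constant changes by season and league, so the 3.10 below is an assumed, illustrative value, and both stat lines are hypothetical rather than anyone's actual numbers.

```python
# Commonly published form of FIP. The additive constant varies by season
# and league; 3.10 is an assumed value for illustration only.
FIP_CONSTANT = 3.10


def fip(hr, bb, hbp, k, ip, constant=FIP_CONSTANT):
    """Return FIP, which is linear in HR, BB + HBP, and K per inning pitched."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical stat lines (NOT real data), each over 11 1/3 innings.
ip = 11 + 1 / 3
ordinary = fip(hr=1, bb=4, hbp=0, k=11, ip=ip)
dominant = fip(hr=0, bb=2, hbp=0, k=30, ip=ip)

print(f"ordinary month: {ordinary:.2f}")
print(f"dominant month: {dominant:.2f}")  # negative: nothing in the formula stops it
```

Nothing in the formula knows that runs can't be negative. Each extra strikeout subtracts the same amount no matter how low the estimate already is, which is all that "breaking" means here.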

To be blunt, FIP is a model. What we can say about models is this:

  • All models are wrong.
  • Many are useful.
  • Some are more useful than others.
  • The utility of a model often depends on the question you are trying to answer.

Let’s take an example from physics. Most everyone is taught Isaac Newton’s theories of gravity in school these days, despite the fact that Newtonian physics has been shown to be at best incomplete, and in terms of working physics has been supplanted by Einstein’s theories of relativity and the notion of quantum mechanics. There are real, observed phenomena (like measurements of the rotation of the galaxy) that point out problems with Newton’s theories. So why do we still teach them and use them?

The answer is simple: they’re still useful in predicting the behavior of gravity as we can observe it in our everyday lives. The extreme cases where they break down, either at the level of large galaxies or of minute quantum particles, are simply not relevant to us. Moreover, learning Newton’s theories can still teach us quite a bit about gravity.

(Some of you may be trying to reconcile this statement with my opinions about, say, batted ball data. I do believe that imperfect models, which is to say all models, can be useful. But that isn't to say that all of them are. When looking at batted ball data, there is evidence that suggests that the model isn't adding anything to our understanding, while adding to our complexity. When it comes to accepting a model, I tend to err on the side of parsimony, which is to say the idea that the simplest theory that explains what we're looking at is best. That doesn't mean that complexity is bad, but for a more complicated model to be embraced it should offer conclusive evidence that it's adding to our understanding.)

The world is not as simple as being right or wrong. As science fiction author Isaac Asimov wrote, "When people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." As Patriot (whose blog is the sabermetric site with the lowest ratio of readers to useful insights out there, and which you owe it to yourself to read more often) put it:

The statistician George Box once wrote that “Essentially, all models are wrong, but some are useful.” Whether the context in which this was written is identical or even particularly close to the sabermetric issues I’m going to touch on isn’t really the point. Perfect models only occur when severe constraints can be imposed. Since you can’t have a perfect model, the designer and user must decide what level of accuracy is acceptable for the purposes for which the model will be used.

So let me tell you a little story.

I was in the kitchen, doing dishes. Suddenly, I hear the sound of a child’s tears from the living room, and as I look over my shoulder, I see my little girl standing in the doorway of the kitchen. In her left hand is the saucer section of the USS Enterprise, registry number NCC-1701-D. In her right hand is the engineering section of the USS Enterprise.

That particular model kit (which I had assembled all the way back in high school) was not intended to detach the two sections of the ship from each other. She looked up at me and said, “I broke your spaceship, daddy.”

Now, I am going to tell you what I told her that day.

Sometimes, these things happen. Sometimes models get old. Sometimes they’re brittle. Sometimes they’re built by teenagers living in their mother’s basement and they’re not built to last. That’s okay. It was put out there to be used, played with, and enjoyed. We’ll see if it’s still usable, or if we can fix it. If not, we’ll throw it away and get a new one.*

This is of course a fine line to be walked. We don’t want to keep models around after they’ve long been supplanted by superior ones. We want to keep progressing, and if we’re too accommodating, we’ll stagnate. (This is, if you think about it, one of the key reasons Bill James was able to have the impact he did—because by and large that’s what the generation of baseball writers before him had done.)

But if we wait around for perfect models, we will be waiting forever. This doesn’t mean we shouldn’t be humble about how we use our models, because we should. And we should be infinitely more cautious when using a model where we haven’t found flaws than using a model where we have found flaws. We should avoid false certainties, instead presenting our uncertainties and our uncertainties about our uncertainties. But we can do that only by playing with our toys more, not less. We learn so much more when we take the toys off the shelves and start to use them.

*In case you were wondering – the Utopia Planitia shipyards were unable to salvage the craft. In life, as in art, it has been replaced by the Sovereign-class USS Enterprise, NCC-1701-E. It lights up and makes cool noises when you press the buttons on it.