August 21, 2013
The Series Ahead
The hardest part of explaining sabermetrics to someone who’s versed in traditional baseball stats is explaining that they’re different not just in degree, but also in kind. The definition of an RBI, for instance, hasn’t changed since it was made an official statistic in 1920. The stats created by sabermetricians are much more prone to revision. Some look at this as a bug, because they view sabermetrics only as potentially better versions of traditional stats.
But sabermetrics isn’t ever a finished product. (This is not, in fact, a bad thing.) So instead of expecting our stats to calcify, we should be expecting them to grow and change as we develop the ideas beneath them. So it is with WARP, which has undergone any number of changes over the years. And now we’re going to be changing WARP again. But we’re going to be throwing open the doors and letting you watch us while we work. So we’re kicking off a series of articles, running each Wednesday, where we’ll take you inside what we’re doing. There’ll be a lot of math, but also a lot of discussion about what WARP is trying to measure and the philosophy behind various choices.
We hope this will do several things. We think that talking openly about what we’re doing will help us build better metrics, because we’ll be getting more feedback earlier in the process. And we think it will help you all to understand the metrics we’re building, because you’ll have more insight not only into what is being done, but why. And this is not a finished product—the goal of this series is to have readers looking over my shoulder as I work. If what you want is a final summation, that will be coming down the road. If you want to watch the development process at work, this is for you.
There are a handful of goals we want to accomplish by doing this, which I’ll outline below. But before we get into that, I’d like to talk a bit about the goal of WARP. Why do we need a total value metric? What are we trying to achieve? What should it be used for, and where should we be cautious about using it?
WARP is an answer. To figure out how to arrive at it, we must pose a question. The act of posing one question entails not asking others, which doesn’t necessarily mean that they aren’t worth asking. Having said that, we are picking this question because we think it’s a useful one. And we’re going to ask a question like you should ask for a wish from a genie: with extreme precision.
What we want to know is: How do we estimate what a player has done to contribute to winning baseball games for an average baseball team? There’s really a lot packed into a small space there, so let’s unpack it.
There’s one other thing we want to consider, and that’s how we split the responsibility for events. Baseball’s scorekeeping system revolves around the concept of double-entry bookkeeping: every hit for the offense is also a hit allowed by the defense. Every run scored is a run allowed. Every win for one team is a loss for some other. This is intrinsic to how the sport is constructed. Others, like our own Russell Carleton, have attempted to build models that don’t follow from this premise. It’s interesting work, and it has its uses. But in a system that attempts to explain wins and losses, a batter’s strikeout doesn’t explain more of his team’s performance than a pitcher’s strikeout explains of his, even if we think that batters have more “skill” in striking out or not than pitchers do in getting them to do so. We leave such questions of skill to another place and time (again, recognizing that to pose some questions means not posing others).
A perfect model, with our objectives in mind, would reconcile flawlessly between what it says on offense and what it says on defense for any given event. We lack a perfect system. But at the very least, we have a goal to bear in mind. Falling short of that goal is unfortunate, but having a goal at all will help point us in the right direction.
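To make the reconciliation goal concrete, here is a minimal sketch of what checking it might look like. Everything in it is an assumption for illustration: the `EventCredit` record, the function names, and especially the run values, which are made-up numbers and not actual WARP inputs. The only idea taken from the discussion above is that, for any given event, what the model credits to the offense and what it charges to the defense should cancel out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventCredit:
    """Hypothetical record of how a model splits one event between the two sides."""
    event: str
    offense_runs: float  # run value credited to the batting side
    defense_runs: float  # run value credited to the fielding side


def reconciliation_gap(credit: EventCredit) -> float:
    """Return how far the two ledgers are from cancelling for this event.

    Under double-entry bookkeeping, a perfectly reconciled model gives 0.0.
    """
    return credit.offense_runs + credit.defense_runs


def unreconciled(credits, tol=1e-9):
    """Return the events whose offensive and defensive values do not cancel."""
    return [c for c in credits if abs(reconciliation_gap(c)) > tol]


# Illustrative, made-up run values (not real linear weights).
credits = [
    EventCredit("single", 0.45, -0.45),
    EventCredit("strikeout", -0.25, 0.25),
    EventCredit("fly out", -0.25, 0.20),  # the two sides disagree here
]

for c in unreconciled(credits):
    print(f"{c.event}: gap of {reconciliation_gap(c):+.3f} runs")
```

In this toy ledger only the fly out fails to reconcile. A real system would have many more event types and would have to decide how to split the defensive side among pitchers and fielders, which this sketch does not attempt.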
So what are we trying to achieve with this WARP revamp? We have a few goals in mind. First, we want to pay special attention to the concept of replacement level. It’s an important and widely used concept in player evaluation. But it seems to be one of the hardest concepts to sell to those who aren’t fully on board with the sabermetric movement. And even among the casual adherents to sabermetrics, it seems to be one of the more misunderstood and debated topics. We want to look at what replacement level does, why it’s important, how it affects our metrics and how it changes over time.
The next thing we want to focus on is the interaction between pitchers and defense. It’s probably the greatest area of controversy between various methods of player evaluation. It’s also the area where our metrics seem to come up with the largest number of counterintuitive conclusions. We want to look at this with a fresh eye and see what comes out of it. We’ll be examining DIPS from a new perspective and seeing how well it holds up. And of course, we’ll look at how to measure fielding.
Lastly, and perhaps most importantly, we’ll be looking at how we assess our work. Sabermetrics is in many senses a scientific endeavor. But there are a lot of pieces to a total value metric, and not all of them have received the same amount of scrutiny. It is not enough for us to propose ways to measure something; we also need to propose ways to measure how well we’ve measured it.
One commonly proposed test of how well a total value system does its work is looking at how well it predicts team results. This is dangerous ground for sabermetricians to stand on; it turns out that old-school stats like RBIs and pitcher wins do a much better job of that than any total-value stat proposed. The entire point of this exercise is that we are willing to sacrifice the best possible accounting of team wins in order to do better at expressing an individual player’s contributions, isolated as best we can from those of his teammates. Using reconciliation to team results causes us to lose sight of that goal and provides very little insight into the quality of our work.
We do want to assess our work, however. So along the way, we’ll propose methods of testing various components of the work we’re doing. This will also give us the opportunity to truly assess the accuracy of our estimates, and to produce error bars for the work we’re doing. One of the steps forward made by PECOTA was that the point forecast was accompanied by an estimate of the full range of performance for that forecast. Hopefully, we can make the same step forward with WARP.
Sabermetrics is a key part of a lot of modern debates about the MVP award and the Cy Young and the Hall of Fame and a host of other things. There is little that we of a sabermetric bent can do to make everyone open to our ideas. That’s going to take time. But there are people out there who, while not complete converts to a sabermetric worldview, are open to treating our ideas and comments with respect. We can do more to advance the discussion by moving away from stridency and certainty and moving toward an embrace of uncertainty. We’re numbers people; we measure things. It’s what we do. We can, and should, measure the tools we measure with as well.
So here’s your chance for feedback, on both our goals for WARP and how you think we should go about getting there. We look forward to hearing from you, and to the chance to try to do something new and exciting together.