The hardest part of explaining sabermetrics to someone who’s versed in traditional baseball stats is explaining that they’re different not just in degree, but also in kind. The definition of an RBI, for instance, hasn’t changed since it was made an official statistic in 1920. The stats created by sabermetricians are much more prone to revision. Some look at this as a bug, because they view sabermetrics only as potentially better versions of traditional stats.
But sabermetrics isn’t ever a finished product. (This is not, in fact, a bad thing.) So instead of expecting our stats to calcify, we should be expecting them to grow and change as we develop the ideas beneath them. So it is with WARP, which has undergone any number of changes over the years. And now we’re going to be changing WARP again. But we’re going to be throwing open the doors and letting you watch us while we work. So we’re kicking off a series of articles, running each Wednesday, where we’ll take you inside what we’re doing. There’ll be a lot of math, but also a lot of discussion about what WARP is trying to measure and the philosophy behind various choices.
We hope this will do several things. We think that talking openly about what we’re doing will help us build better metrics, because we’ll be getting more feedback earlier in the process. And we think it will help you all to understand the metrics we’re building, because you’ll have more insight not only into what is being done, but why. And this is not a finished product—the goal of this series is to have readers looking over my shoulder as I work. If what you want is a final summation, that will be coming down the road. If you want to watch the development process at work, this is for you.
There are a handful of goals we want to accomplish by doing this, which I’ll outline below. But before we get into that, I’d like to talk a bit about the goal of WARP. Why do we need a total value metric? What are we trying to achieve? What should it be used for, and where should we be cautious about using it?
WARP is an answer. To figure out how to arrive at it, we must pose a question. The act of posing one question entails not asking others, which doesn’t necessarily mean that they aren’t worth asking. Having said that, we are picking this question because we think it’s a useful one. And we’re going to ask a question like you should ask for a wish from a genie: with extreme precision.
What we want to know is: How do we estimate what a player has done to contribute to winning baseball games for an average baseball team? There’s really a lot packed into a small space there, so let’s unpack it.
- We want an estimate. We are willing to accept a certain amount of imprecision (although ideally we’d like to know what that imprecision is). Importantly, this means we do not ask if we are right or wrong; we are very often both. Instead we ask how close we are.
- We’re interested in an individual baseball player. It has been said that baseball is an individual sport masquerading as a team sport. It isn’t so. Even the individual records of baseball reflect team accomplishment—how many runs a player has batted in depends substantially on how well batters ahead of him have done at getting on base and advancing themselves or others into scoring position, for instance. We need to be careful at all points to ask how much a player’s teammates have contributed to the raw numbers.
- We want to know what a player has done. To use the technical terms of statistics, we view a player’s performance in a given time period as a population, not a sample. If you replayed that period a thousand times, that player could have done a lot of things. If you look at other periods, it’s very likely that this player has done different things. It doesn’t matter. We aren’t interested in what a player could have done, but what he actually did.
- We want to know about wins. That doesn’t mean that wins are the only thing in baseball worth caring about. However, they are about the only thing that’s worth measuring objectively; most of the rest is either subjective or trivial (it doesn’t take a lot of math to count home runs). (Revenues, profits and the like are the exception.) Wins are also an important element of baseball: winning is the fundamental objective of the game, after all.
- We want to know how a player would have helped an average team. Very likely any particular player we might be interested in will be on a team that is not, in fact, average. But to make comparisons between players, we want to convert everything into a common baseline.
There’s one other thing we want to consider, and that’s how we split the responsibility for events. Baseball’s scorekeeping system revolves around the concept of double-entry bookkeeping: every hit for the offense is also a hit allowed by the defense. Every run scored is a run allowed. Every win for one team is a loss for some other. This is intrinsic to how the sport is constructed. Others, like our own Russell Carleton, have attempted to build models that don’t follow from this premise. It’s interesting work, and it has its uses. But in a system that attempts to explain wins and losses, a strikeout doesn’t explain more of the batter’s team’s performance than it does of the pitcher’s team’s, even if we think that batters have more “skill” in striking out or not than pitchers do in getting them to do so. We leave such questions of skill to another place and time (again, recognizing that to pose some questions means not posing others).
A perfect model, with our objectives in mind, would reconcile flawlessly between what it says on offense and what it says on defense for any given event. We lack a perfect system. But at the very least, we have a goal to bear in mind. Falling short of that goal is unfortunate, but having a goal at all will help point us in the right direction.
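To make the reconciliation idea concrete, here is a minimal sketch of what checking the books might look like. It is only an illustration, not how WARP is computed: the `EventLedger` type, the function names and every run value in it are hypothetical, made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventLedger:
    """One event's run value as booked on each side (all numbers hypothetical)."""
    event: str
    offense_runs: float  # runs credited to the batting side
    defense_runs: float  # runs booked to the pitching/fielding side


def reconciliation_gap(entry: EventLedger) -> float:
    """Sum of both entries; zero means the two sides of the ledger agree."""
    return entry.offense_runs + entry.defense_runs


def unreconciled(ledger, tolerance=1e-9):
    """Return the names of events whose offense and defense entries do not cancel."""
    return [e.event for e in ledger if abs(reconciliation_gap(e)) > tolerance]


# Hypothetical run values, chosen only to show one event that fails to balance.
ledger = [
    EventLedger("single", 0.45, -0.45),
    EventLedger("strikeout", -0.25, 0.25),
    EventLedger("home run", 1.40, -1.30),
]

print(unreconciled(ledger))
```

In a perfectly reconciled system the list printed at the end would be empty. Any event that shows up there is a place where the offensive and defensive halves of the model disagree about what happened.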
So what are we trying to achieve with this WARP revamp? We have a few goals in mind. First, we want to pay special attention to the concept of replacement level. It’s an important and widely used concept in player evaluation. But it seems to be one of the hardest concepts to sell to those who aren’t fully on board with the sabermetric movement. And even among the casual adherents to sabermetrics, it seems to be one of the more misunderstood and debated topics. We want to look at what replacement level does, why it’s important, how it affects our metrics and how it changes over time.
The next thing we want to focus on is the interaction between pitchers and defense. It’s probably the greatest area of controversy between various methods of player evaluation. It’s also the area where our metrics seem to come up with the largest number of counterintuitive conclusions. We want to look at this with a fresh eye and see what comes out of it. We’ll be examining DIPS from a new perspective and seeing how well it holds up. And of course, we’ll look at how to measure fielding.
Lastly, and perhaps most importantly, we’ll be looking at how we assess our work. Sabermetrics is in many senses a scientific endeavor. But there are a lot of pieces to a total value metric, and not all of them have received the same amount of scrutiny. It is not enough for us to propose ways to measure something; we also need to propose ways to measure how well we’ve measured it.
One commonly proposed test of how well a total value system does its work is looking at how well it predicts team results. This is dangerous ground for sabermetricians to stand on; it turns out that old-school stats like RBIs and pitcher wins do a much better job of that than any total-value stat proposed. The entire point of this exercise is that we are willing to sacrifice the best possible accounting of team wins in order to do better at expressing an individual player’s contributions, isolated as best we can from those of his teammates. Using reconciliation to team results causes us to lose sight of that goal and provides very little insight into the quality of our work.
We do want to assess our work, however. So along the way, we’ll propose methods of testing various components of the work we’re doing. This will also give us the opportunity to truly assess the accuracy of our estimates, and to produce error bars for the work we’re doing. One of the steps forward made by PECOTA was that the point forecast has been accompanied by an estimate of the full range of performance for that forecast. Hopefully, we can make the same step forward with WARP.
Sabermetrics is a key part of a lot of modern debates about the MVP award and the Cy Young and the Hall of Fame and a host of other things. There is little that we of a sabermetric bent can do to make everyone open to our ideas. That’s going to take time. But there are people out there who, while not complete converts to a sabermetric worldview, are open to treating our ideas and comments with respect. We can do more to advance the discussion by moving away from stridency and certainty and moving toward an embrace of uncertainty. We’re numbers people; we measure things. It’s what we do. We can, and should, measure the tools we measure with as well.
So here’s your chance for feedback, on both our goals for WARP and how you think we should go about getting there. We look forward to hearing from you, and to the chance to try to do something new and exciting together.