The hardest part of explaining sabermetrics to someone who’s versed in traditional baseball stats is explaining that they’re different not just in degree, but also in kind. The definition of an RBI, for instance, hasn’t changed since it was made an official statistic in 1920. The stats created by sabermetricians are much more prone to revision. Some look at this as a bug, because they view sabermetrics only as potentially better versions of traditional stats.
But sabermetrics isn’t ever a finished product. (This is not, in fact, a bad thing.) So instead of expecting our stats to calcify, we should be expecting them to grow and change as we develop the ideas beneath them. So it is with WARP, which has undergone any number of changes over the years. And now we’re going to be changing WARP again. But we’re going to be throwing open the doors and letting you watch us while we work. So we’re kicking off a series of articles, running each Wednesday, where we’ll take you inside what we’re doing. There’ll be a lot of math, but also a lot of discussion about what WARP is trying to measure and the philosophy behind various choices.
We hope this will do several things. We think that talking openly about what we’re doing will help us build better metrics, because we’ll be getting more feedback earlier in the process. And we think it will help you all to understand the metrics we’re building, because you’ll have more insight not only into what is being done, but why. And this is not a finished product—the goal of this series is to have readers looking over my shoulder as I work. If what you want is a final summation, that will be coming down the road. If you want to watch the development process at work, this is for you.
There are a handful of goals we want to accomplish by doing this, which I’ll outline below. But before we get into that, I’d like to talk a bit about the goal of WARP. Why do we need a total value metric? What are we trying to achieve? What should it be used for, and where should we be cautious about using it?
WARP is an answer. To figure out how to arrive at it, we must pose a question. The act of posing one question entails not asking others, which doesn’t necessarily mean that they aren’t worth asking. Having said that, we are picking this question because we think it’s a useful one. And we’re going to pose our question the way you should phrase a wish to a genie: with extreme precision.
What we want to know is: How do we estimate what a player has done to contribute to winning baseball games for an average baseball team? There’s really a lot packed into a small space there, so let’s unpack it.
- We want an estimate. We are willing to accept a certain amount of imprecision (although ideally we’d like to know what that imprecision is). Importantly, this means we do not ask if we are right or wrong; we are very often both. Instead we ask how close we are.
- We’re interested in an individual baseball player. It has been said that baseball is an individual sport masquerading as a team sport. It isn’t so. Even the individual records of baseball reflect team accomplishment—how many runs a player has batted in depends substantially on how well batters ahead of him have done at getting on base and advancing themselves or others into scoring position, for instance. We need to be careful at all points to ask how much a player’s teammates have contributed to the raw numbers.
- We want to know what a player has done. To use the technical terms of statistics, we view a player’s performance in a given time period as a population, not a sample. If you could rerun that time period a thousand times, the player could have done a lot of different things. If you look at other time periods, it’s very likely that this player has done different things. It doesn’t matter. We aren’t interested in what a player could have done, but in what he actually did.
- We want to know about wins. That doesn’t mean that wins are the only thing in baseball worth caring about. However, they are about the only thing that’s worth measuring objectively—most of the rest is either subjective or trivial (it doesn’t take a lot of math to count home runs). (Revenues, profits and the like are the exception.) It’s also an important element of baseball—winning or losing is the fundamental objective of the game, after all.
- We want to know how a player would have helped an average team. Very likely any particular player we might be interested in will be on a team that is not, in fact, average. But to make comparisons between players, we want to convert everything into a common baseline.
There’s one other thing we want to consider, and that’s how we split the responsibility for events. Baseball’s scorekeeping system revolves around the concept of double-entry bookkeeping—every hit for the offense is also a hit allowed by the defense. Every run scored is a run allowed. Every win for one team is a loss for some other. This is intrinsic to how the sport is constructed. Others, like our own Russell Carleton, have attempted to build models that don’t follow from this premise. It’s interesting work, and it has its uses. But in a system that attempts to explain wins and losses, a batter’s strikeout doesn’t explain more of his team’s performance than the pitcher’s strikeout explains of his, even if we think that batters have more “skill” in striking out or not than pitchers do in getting them to do so. We leave such questions of skill to another place and time (again, recognizing that to pose some questions means not posing others).
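The double-entry premise can be sketched in a few lines of code. The events and team abbreviations below are purely illustrative; the point is only the bookkeeping property that every entry on one side of the ledger has a matching entry on the other.

```python
from collections import defaultdict

# Toy ledger illustrating baseball's double-entry bookkeeping: every run
# credited to an offense is simultaneously charged to a defense.
runs_scored = defaultdict(int)
runs_allowed = defaultdict(int)

def record_runs(batting_team, pitching_team, runs):
    runs_scored[batting_team] += runs    # offense's side of the ledger
    runs_allowed[pitching_team] += runs  # defense's side of the same event

record_runs("NYA", "BOS", 3)
record_runs("BOS", "NYA", 2)
record_runs("DET", "CLE", 5)

# The books always balance: leaguewide runs scored equal runs allowed.
assert sum(runs_scored.values()) == sum(runs_allowed.values())
```

Any model built on this premise inherits that reconciliation constraint: credit handed out on offense must be matched by debits on defense.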
A perfect model, with our objectives in mind, would reconcile flawlessly between what it says on offense and what it says on defense for any given event. We lack a perfect system. But at the very least, we have a goal to bear in mind. Falling short of that goal is unfortunate, but having a goal at all will help point us in the right direction.
So what are we trying to achieve with this WARP revamp? We have a few goals in mind. First, we want to spend some special attention on the concept of replacement level. It’s an important and widely-used concept in player evaluation. But it seems to be one of the hardest concepts to sell to those who aren’t fully on board with the sabermetric movement. And even among the casual adherents to sabermetrics, it seems to be one of the more misunderstood and debated topics. We want to look at what replacement level does, why it’s important, how it affects our metrics and how it changes over time.
The next thing we want to focus on is the interaction between pitchers and defense. It’s probably the greatest area of controversy between various methods of player evaluation. It’s also the area where our metrics seem to come up with the largest number of counterintuitive conclusions. We want to look at this with a fresh eye and see what comes out of it. We’ll be examining DIPS from a new perspective and seeing how well it holds up. And of course, we’ll look at how to measure fielding.
Lastly, and perhaps most importantly, we’ll be looking at how we assess our work. Sabermetrics is in many senses a scientific endeavor. But there are a lot of pieces to a total value metric, and not all of them have received the same amount of scrutiny. It is not enough for us to propose ways to measure something; we also need to propose ways to measure how well we’ve measured it.
One commonly proposed test of how well a total value system does its work is looking at how well it predicts team results. This is dangerous ground for sabermetricians to stand on; it turns out that old-school stats like RBIs and pitcher wins do a much better job of that than any total-value stat yet proposed. The entire point of this exercise is that we are willing to sacrifice the best possible accounting of team wins in order to do better at expressing an individual player’s contributions, isolated as best we can from those of his teammates. Using reconciliation with team results as a test causes us to lose sight of that goal and provides very little insight into the quality of our work.
We do want to assess our work, however. So along the way, we’ll propose methods of testing various components of the work we’re doing. This will also give us the opportunity to truly assess the accuracy of our estimates, and to produce error bars for the work we’re doing. One of the steps forward made by PECOTA was that the point forecast has been accompanied by an estimate of the full range of performance for that forecast. Hopefully, we can make the same step forward with WARP.
Sabermetrics is a key part of a lot of modern debates about the MVP award and the Cy Young and the Hall of Fame and a host of other things. There is little that we of a sabermetric bent can do to make everyone open to our ideas. That’s going to take time. But there are people out there who, while not complete converts to a sabermetric worldview, are open to treating our ideas and comments with respect. We can do more to advance the discussion by moving away from stridency and certainty and toward an embrace of uncertainty. We’re numbers people; we measure things. It’s what we do. We can, and should, measure the tools we measure with as well.
So here’s your chance for feedback, on both our goals for WARP and how you think we should go about getting there. We look forward to hearing from you, and for the chance to try to do something new and exciting together.
Have you or others ever thought about modeling how team WARP ends up as more than a simple sum of the players' WARP in order to get a better tie into predicting team outcome?
http://www.math.smith.edu/~bbaumer/pub/jsm2013_openWAR_slides.pdf
since it seems like you guys have some of the same admirable goals (openness, conservation of runs and error estimates).
I'll use a TV analogy. I loved the season premiere of Breaking Bad because it addressed the elephant: Walt knows. Hank knows that Walt knows. Walt knows that Hank knows that Walt knows. Now the writers can get down to what's really important in the show, and not let plot manipulations dictate the last 7 hours.
Similarly, the folks out there who choose to pooh-pooh the versions of WAR often include in their argument that it can't be exact to a decimal point. But you know that, Sean knows that, Tango knows that, all of your literate readers know that. An error estimate tied to WARP, to me, would address the elephant. WARP knows that it's not perfectly exact. It knows that the--I don't want to say illiterate--less-literate reader knows that it's not exact. Now the less-literate reader knows that WARP knows that they know it's not exact. Now we can move on to more important stuff, like how WARP, rWAR and fWAR differ, why that is, and why those ideas are important.
I'd love love love love to see a +/- column next to WARP in your stats pages that use that value metric.
Also, there are base-running situations where pitchers are credited with an out (despite not doing anything to earn that out) but batters are not charged with making an out. A simple example is getting thrown out at second trying to stretch a single into a double. The pitcher is credited with the out despite giving up the hit, but the batter is credited with a hit despite making an out. As a result there are more pitcher outs than batter outs.
I am very curious as to whether you will attempt to address these asymmetrical situations to create a true double-entry system or accept the imprecision.
At the most basic level, "WAR" has reached the public market for stats, but BP insists on calling their version of the model WARP. I'm very, very interested in this series, I just hope it is additive to the work being done elsewhere -- and consciously so.
I hear (on podcasts & increasingly so on TV/radio) references to fWAR & bWAR ... never can recall a reference to bpWARP.
If you build this hopefully better tree in a walled-off forest, there's not enough people to watch it stand (or fall).
You're not just trying to improve on the measurement of WAR ... but also build awareness and broader usage of whatever you end up with.
fWAR is *an* implementation of WAR
rWAR is *an* implementation of WAR
It's not identical, any more than Oracle's implementation of SQL92 is identical to DB2's version.
I don't know if the fWAR/rWAR split is a feature or not. I don't know whether the fact that they both call it WAR but have different methods of calculation is a feature or not.
If this was a court case, we'd each take sides, and just explain one point of view. I think you can reasonably make a case either way.
In this same vein, will defensive replacement value continue to be at 0? Correct me if I'm wrong, but it seems that most systems use performance vs. replacement level for offense and then performance vs. the average (0 runs) for defense. I have no idea where you would put replacement level defense and maybe it is zero runs, but it would be good to establish this early.
Another tangent in this line would be how much value you want to assign to the three pillars of batting, fielding, and base running. If batting is 60% of the value of a position player, would you consider fielding to be 30% and base running the last 10%? Have I/we missed something here that we're not factoring in? If those percentages don't make sense, then what does, and is it possible to create a regression equation using historical data to derive this information? Would this research also allow for a way to watch the watchmen, so to speak? That's a solid and underrated point you've made about being able to compare your output to reality.
These are just some initial thoughts off the top of my head, and I look forward to hearing those of others as this is sure to be an exciting project. Thank you for making us, the community, feel a part of the process. It's quite refreshing.
(It won't be the first place we start -- it was going to be, but it turns out that to discuss replacement level well, you need at least a few other concepts in your head first, so that's what I've hit upon as the best place to start off.)
In terms of the relative value of hitting, fielding, and baserunning -- it comes down to two things: how many runs it is worth relative to a given baseline, and how well you can measure it. We'll be getting into all of that.
Since you asked: you are wrong.
Fangraphs follows the framework I have, which is that every component is compared against the AVERAGE.
You can see that presentation at the bottom of any of the player pages. You can also see it at BaseballProjection.com, which is the precursor to Baseball-Reference's WAR.
Anyway, there is a difference in how I present it (each component at the league average, then one sweeping replacement level number at the player level).
The practical difference is that by keeping things as I'm presenting them, we don't have to have these conversations about "replacement level offense," "replacement level defense," etc., things that don't in fact exist.
This was the #1 problem with the original WARP, which Clay finally agreed to in the end. And this would have been more obvious if we simply stuck to the presentation I advocate.
It's tiring that we need to constantly correct readers who are not as knee-deep in this, because the damage was done, and continues to be done.
I think everyone can pretty much agree that a typical replacement level player is a below-average hitter for his position but an average fielder. If you want to call that "replacement level offense and defense" or not doesn't really change the results of a value metric any.
That's not the argument I'm making.
The argument I'm making is that you compare each component to the average, because that makes sense.
And replacement level is a concept at the PLAYER level, not at the component level. That's the argument. That's the way I've presented WAR and that's how I've sold WAR.
The seasonal error bars are not going to add up linearly for the career totals (should add up following RMSE). Have you figured out how to explain that for the masses?
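A minimal sketch of that point, with purely illustrative numbers: if seasonal error bars are independent, they combine in quadrature (root-sum-of-squares), so the career error bar comes out meaningfully tighter than the naive linear sum.

```python
import math

# Hypothetical (fielding runs, one-sigma error) pairs for three seasons.
seasons = [(30.0, 15.0), (28.0, 14.0), (31.0, 15.0)]

career_runs = sum(v for v, _ in seasons)               # 89 runs
linear_error = sum(e for _, e in seasons)              # naive linear sum: 44
rss_error = math.sqrt(sum(e * e for _, e in seasons))  # quadrature: ~25.4

# Assuming independence, the quadrature error bar is tighter than
# simply adding the seasonal error bars end to end.
assert rss_error < linear_error
```

The explanation "for the masses" might simply be that independent errors partly cancel, so a career total is known more precisely, in percentage terms, than any single season.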
You can have Andrelton Simmons in 2013 be worth +30 runs on fielding +/- 15 runs.
But, if he continues to pile up +30 run seasons, you'll be able to restate his 2013 season as say +30 +/-10, with the knowledge of future seasons.
Again, most people are going to not like this idea, thinking that all seasons should remain independent. But, the reality is, they are not.
And if you look at it historically, obviously the WWII years were lacking in huge talent.
So, yes, that should be on the table.
Mr. Tango's comment about the WWII years lacking in talent would be exactly the type of thing I'm curious about how "average" takes that into account vs. now for example.
http://www.baseballprojection.com/war/e/erstd001.htm
And that's based on the framework I've described on my blog: every component compared to average, with the replacement level treated as its own component.
http://www.insidethebook.com/ee/index.php/site/comments/everyone_has_their_own_war/
And since FRA is the central component to WARP for pitchers, that means WARP for pitchers is useless to me.
Things like the Trout vs. Cabrera race should be at least temporarily settled, because even with the stats Cabrera puts up being of the mostly nonsabermetric sort, it's still hard to believe that Trout's that much better than everyone else. Trout should be a player of focus in reviewing this batch of WARP, to see if there is some major misstep allowing him to rate this well or if he really is this good.
If Trout is +8 +/-1.5 runs, then that makes his range +6.5 to +9.5. And if Miggy is +7.5 +/-1, then he's at +6.5 to +8.5.
If anything, this kind of thing will reinforce Trout's greatness.
(All numbers for illustration purposes only.)
Mike Trout is an above average hitter, above average fielder, and above average runner. And you can put "way" in front of any or all of those.
All I can say is that everyone should try to develop their own implementation of WAR. The WAR framework is there for everyone to use. The presentation at BaseballProjection.com gets it exactly right.
At this point, just work through it yourself, let's see where you end up, and we can take it from there.
- Please make it easier to find WARP in the Statistics page. Right now I can find WARP, but to find the individual components you need to really search. I would love it to be on the front page as a continuous update similar to B-Ref and Fan Graphs.
- ARM ratings need to be part of FRAA (if they are not already). I know you have said they are part of WARP, but they are still not available on your site.
- Please leverage the work Max Marchi is doing on Catcher fielding (ie: framing / game calling) in FRAA. This will truly separate WARP from fWAR and rWAR.
Thanks
The most obvious example is NL pitchers facing the pitcher every 9th PA compared to AL pitchers facing a DH. It's been common for years to add a half run or so to ERA when trying to make an eyeball comparison between pitchers in opposite leagues.
But even more interesting is controlling for the statistics of players against the expectation of who they faced. Using a starting pitcher as an example; it should be fair to say a guy that gets 24 starts against the top 5 offenses in the league has a more difficult job than a guy that gets 24 starts against the worst 5 offenses (and of course this concept should be reduced down to specific pitchers and batters faced). The former pitcher is arguably a much better player than the latter with the same stats, and who is on the schedule is entirely outside the control of the players and in an ideal world should be controlled for.
Apologies in advance if this is the proverbial dumb question.
1) Wouldn't a full stadium view of the playing field be extremely helpful in judging defense? You could time an outfielder's jump, mark a player's exact position, and with some perspective judge the distance/difficulty of throws.
2) Why can't replacement level be the bottom 5th percentile of major league performers or something? I know this is overly simplistic thinking, but wouldn't something like that catch the essence of what a replacement level player is?
3) Will opponent factor into this at all? For example, if Matt Harvey shuts out the Red Sox at Citi Field, will he get the same amount of credit as if he shut out the Marlins at Citi Field? I don't know if this is already taken into account in WARP or the other total value metrics, but there is a discussion between Joe Sheehan and Brian Kenny about Harvey/Kershaw and one of Sheehan's points is that Matt Harvey has had the easiest schedule of anyone this year. Shouldn't that value into WARP, if it isn't already?
4) I hope the little things like the utility study by Mr. Carleton and the catch framing studies done by Mr. Lindbergh will be accounted for in this new version of WARP.
5) Will outside studies be used to help formulate the new version of WARP?
6) After this is all said and done, could you possibly make it easier to access WARP by maybe putting a leader board on the main page ala. Fangraphs & B-R?
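On the percentile idea in question 2, here's one toy way to operationalize it. The 5th-percentile cutoff, the function name, and the rate stats are all hypothetical; this is a sketch of the concept, not an established definition of replacement level.

```python
def percentile_replacement(rates, pct=0.05):
    """Return the value at the pct-th percentile of a list of rates
    (a simple floor-index percentile; real systems would also weight
    by playing time and adjust by position)."""
    ordered = sorted(rates)
    idx = max(0, int(pct * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical per-player production rates (e.g., runs above zero per 600 PA).
rates = [12, 35, 48, 5, 22, 61, -3, 18, 40, 9,
         27, 55, 2, 31, 44, 15, 50, 7, 38, 20]
replacement = percentile_replacement(rates)  # the bottom of this toy pool
```

The practical difficulty, as the article hints, is choosing the population: all major leaguers? Bench players and call-ups only? The answer changes the cutoff substantially.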
1) Both the play outcome --> run value and run value --> win value legs of any WAR methodology are things that ought to float from year to year (or at least era to era) based on overall environment. A run in 1962 is worth a lot more than a run in 1998. Similarly, the relative value of HR vs BB would be different given that in a lower offensive environment a walk would have a lower expectation of scoring. Is that something already accounted for in WARP? If not, is it something you plan to include? Would you make similar AL/NL adjustments? The presence/absence of the DH would have similar effects on runs->wins and outcomes->runs
2) The discussion above on offensive, defensive, and baserunning WAR above is a bit circular, but gets at an important point- Is "replacement level" going to be defined at the player level or skill-component level? That is to say, is a 0 WARP player someone who is not just a bad hitter, but slow and stonehands as well? Or is 0 WAR going to mean awful with the stick but not a disaster in the field or on the basepaths (or some equivalent skill mix that gets to the same place)?
2b) The component-level vs player-level definition of replacement level gets at the fundamental assumption of talent distribution behind WAR, which is that it's a pyramid (or some kind of highly skewed distribution like Pareto or the truncated edge of a normal curve, etc), and that such a distribution implies that there's a certain production level that is so ubiquitous as to have zero scarcity value. This is pretty demonstrably the case for hitting and pitching (probably pitching, I'm not 100% sold on it), but given that defensive stats are by no means mature, can we assert with any confidence that it is also true in the field? Even if it's true in some metaphysical sense, is it true in the population of players who hit well enough to potentially put on a 25-man roster?
3) I would think avoiding "average" in any mathematical formulation of WARP would be ideal if possible. The whole conception behind WAR is the skewed talent distribution, and a big part of the usefulness of WARP is that it measures that skew. By building in any reference to "average" into the definition you are inherently assuming a particular level of skew, which then creates a circular measuring-what-you-assume problem. A system that defines replacement level based on percentile-of-players is a much better fit with the concept of WAR. That said, going with a percentile-based (or ranking-based) definition presents its own set of difficulties, particularly when you start trying to measure position value.
3b) I put "average" in scare quotes because it's a very slippery word with several possible meanings in a player-value context. First of all, given that we're dealing with a skewed talent distribution, do we really mean "median" rather than average? Even if we are talking about an average rather than a median, what kind of weighting do you use? Averages based on league-wide aggregate stats are weighted by plate appearance, and for obvious reasons better players will have more PAs than worse players. The result is that "average" calculated from leaguewide aggregates will be higher than "average" calculated in a way where each player counts equally.
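The weighting point in 3b is easy to demonstrate with toy numbers (all of them hypothetical): because better players get more plate appearances, a PA-weighted league average sits above a simple per-player mean.

```python
# (PA, on-base rate) for a toy four-player league -- illustrative only.
players = [(600, 0.360), (550, 0.340), (150, 0.300), (100, 0.280)]

# PA-weighted "average," as computed from leaguewide aggregate stats.
pa_weighted = sum(pa * rate for pa, rate in players) / sum(pa for pa, _ in players)

# Per-player mean, where each player counts equally regardless of PA.
per_player = sum(rate for _, rate in players) / len(players)

# The regulars dominate the aggregate, so the PA-weighted figure
# (~0.340) exceeds the per-player mean (~0.320).
assert pa_weighted > per_player
```

Which of these is the right "average" baseline is exactly the kind of choice the series should make explicit.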