The hardest part of explaining sabermetrics to someone who’s versed in traditional baseball stats is explaining that they’re different not just in degree, but also in kind. The definition of an RBI, for instance, hasn’t changed since it was made an official statistic in 1920. The stats created by sabermetricians are much more prone to revision. Some look at this as a bug, because they view sabermetrics only as potentially better versions of traditional stats.
But sabermetrics isn’t ever a finished product. (This is not, in fact, a bad thing.) So instead of expecting our stats to calcify, we should be expecting them to grow and change as we develop the ideas beneath them. So it is with WARP, which has undergone any number of changes over the years. And now we’re going to be changing WARP again. But we’re going to be throwing open the doors and letting you watch us while we work. So we’re kicking off a series of articles, running each Wednesday, where we’ll take you inside what we’re doing. There’ll be a lot of math, but also a lot of discussion about what WARP is trying to measure and the philosophy behind various choices.
We hope this will do several things. We think that talking openly about what we’re doing will help us build better metrics, because we’ll be getting more feedback earlier in the process. And we think it will help you all to understand the metrics we’re building, because you’ll have more insight not only into what is being done, but why. And this is not a finished product—the goal of this series is to have readers looking over my shoulder as I work. If what you want is a final summation, that will be coming down the road. If you want to watch the development process at work, this is for you.
There are a handful of goals we want to accomplish by doing this, which I’ll outline below. But before we get into that, I’d like to talk a bit about the goal of WARP. Why do we need a total value metric? What are we trying to achieve? What should it be used for, and where should we be cautious about using it?
WARP is an answer. To figure out how to arrive at it, we must pose a question. The act of posing one question entails not asking others, which doesn’t necessarily mean that they aren’t worth asking. Having said that, we are picking this question because we think it’s a useful one. And we’re going to pose our question the way you should phrase a wish to a genie: with extreme precision.
What we want to know is: How do we estimate what a player has done to contribute to winning baseball games for an average baseball team? There’s really a lot packed into a small space there, so let’s unpack it.
- We want an estimate. We are willing to accept a certain amount of imprecision (although ideally we’d like to know what that imprecision is). Importantly, this means we do not ask if we are right or wrong; we are very often both. Instead we ask how close we are.
- We’re interested in an individual baseball player. It has been said that baseball is an individual sport masquerading as a team sport. It isn’t so. Even the individual records of baseball reflect team accomplishment—how many runs a player has batted in depends substantially on how well batters ahead of him have done at getting on base and advancing themselves or others into scoring position, for instance. We need to be careful at all points to ask how much a player’s teammates have contributed to the raw numbers.
- We want to know what a player has done. To use the technical terms of statistics, we view a player’s performance in a given time period as a population, not a sample. If you could rerun that time period a thousand times, the player could have done a lot of different things. If you look at other time periods, it’s very likely that this player has done different things. It doesn’t matter. We aren’t interested in what a player could have done, but in what he actually did.
- We want to know about wins. That doesn’t mean that wins are the only thing in baseball worth caring about. However, they are about the only thing that’s worth measuring objectively—most of the rest is either subjective or trivial (it doesn’t take a lot of math to count home runs). (Revenues, profits and the like are the exception.) It’s also an important element of baseball—winning or losing is the fundamental objective of the game, after all.
- We want to know how a player would have helped an average team. Very likely any particular player we might be interested in will be on a team that is not, in fact, average. But to make comparisons between players, we want to convert everything into a common baseline.
There’s one other thing we want to consider, and that’s how we split the responsibility for events. Baseball’s scorekeeping system revolves around the concept of double-entry bookkeeping—every hit for the offense is also a hit allowed by the defense. Every run scored is a run allowed. Every win for one team is a loss for some other. This is intrinsic to how the sport is constructed. Others, like our own Russell Carleton, have attempted to build models that don’t follow from this premise. It’s interesting work, and it has its uses. But in a system that attempts to explain wins and losses, a batter’s strikeout doesn’t explain more of his team’s performance than the pitcher’s strikeout explains of his, even if we think that batters have more “skill” in striking out or not than pitchers do in getting them to do so. We leave such questions of skill to another place and time (again, recognizing that to pose some questions means not posing others).
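The double-entry premise can be sketched in a few lines of code. The events and team abbreviations below are purely illustrative; the point is only the bookkeeping property that every entry on one side of the ledger has a matching entry on the other.

```python
from collections import defaultdict

# Toy ledger illustrating baseball's double-entry bookkeeping: every run
# credited to an offense is simultaneously charged to a defense.
runs_scored = defaultdict(int)
runs_allowed = defaultdict(int)

def record_runs(batting_team, pitching_team, runs):
    runs_scored[batting_team] += runs    # offense's side of the ledger
    runs_allowed[pitching_team] += runs  # defense's side of the same event

record_runs("NYA", "BOS", 3)
record_runs("BOS", "NYA", 2)
record_runs("DET", "CLE", 5)

# The books always balance: leaguewide runs scored equal runs allowed.
assert sum(runs_scored.values()) == sum(runs_allowed.values())
```

Any model built on this premise inherits that reconciliation constraint: credit handed out on offense must be matched by debits on defense.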
A perfect model, with our objectives in mind, would reconcile flawlessly between what it says on offense and what it says on defense for any given event. We lack a perfect system. But at the very least, we have a goal to bear in mind. Falling short of that goal is unfortunate, but having a goal at all will help point us in the right direction.
So what are we trying to achieve with this WARP revamp? We have a few goals in mind. First, we want to spend some special attention on the concept of replacement level. It’s an important and widely-used concept in player evaluation. But it seems to be one of the hardest concepts to sell to those who aren’t fully on board with the sabermetric movement. And even among the casual adherents to sabermetrics, it seems to be one of the more misunderstood and debated topics. We want to look at what replacement level does, why it’s important, how it affects our metrics and how it changes over time.
The next thing we want to focus on is the interaction between pitchers and defense. It’s probably the greatest area of controversy between various methods of player evaluation. It’s also the area where our metrics seem to come up with the largest number of counterintuitive conclusions. We want to look at this with a fresh eye and see what comes out of it. We’ll be examining DIPS from a new perspective and seeing how well it holds up. And of course, we’ll look at how to measure fielding.
Lastly, and perhaps most importantly, we’ll be looking at how we assess our work. Sabermetrics is in many senses a scientific endeavor. But there are a lot of pieces to a total value metric, and not all of them have received the same amount of scrutiny. It is not enough for us to propose ways to measure something; we also need to propose ways to measure how well we’ve measured it.
One commonly proposed test of how well a total value system does its work is looking at how well it predicts team results. This is dangerous ground for sabermetricians to stand on; it turns out that old-school stats like RBIs and pitcher wins do a much better job of that than any total-value stat yet proposed. The entire point of this exercise is that we are willing to sacrifice the best possible accounting of team wins in order to do better at expressing an individual player’s contributions, isolated as best we can from those of his teammates. Using reconciliation with team results as a test causes us to lose sight of that goal and provides very little insight into the quality of our work.
We do want to assess our work, however. So along the way, we’ll propose methods of testing various components of the work we’re doing. This will also give us the opportunity to truly assess the accuracy of our estimates, and to produce error bars for the work we’re doing. One of the steps forward made by PECOTA was that the point forecast has been accompanied by an estimate of the full range of performance for that forecast. Hopefully, we can make the same step forward with WARP.
Sabermetrics is a key part of a lot of modern debates about the MVP award and the Cy Young and the Hall of Fame and a host of other things. There is little that we of a sabermetric bent can do to make everyone open to our ideas. That’s going to take time. But there are people out there who, while not complete converts to a sabermetric worldview, are open to treating our ideas and comments with respect. We can do more to advance the discussion by moving away from stridency and certainty and toward an embrace of uncertainty. We’re numbers people; we measure things. It’s what we do. We can, and should, measure the tools we measure with as well.
So here’s your chance for feedback, on both our goals for WARP and how you think we should go about getting there. We look forward to hearing from you, and for the chance to try to do something new and exciting together.
Have you or others ever thought about modeling how team WARP ends up as more than a simple sum of the players' WARP in order to get a better tie into predicting team outcome?
http://www.math.smith.edu/~bbaumer/pub/jsm2013_openWAR_slides.pdf
since it seems like you guys have some of the same admirable goals (openness, conservation of runs and error estimates).
I'll use a TV analogy. I loved the season premiere of Breaking Bad because it addressed the elephant: Walt knows. Hank knows that Walt knows. Walt knows that Hank knows that Walt knows. Now the writers can get down to what's really important in the show, and not let plot manipulations dictate the last 7 hours.
Similarly, the folks out there who choose to pooh-pooh the versions of WAR often include in their argument that it can't be exact to a decimal point. But you know that, Sean knows that, Tango knows that, all of your literate readers know that. An error estimate tied to WARP, to me, would address the elephant. WARP knows that it's not perfectly exact. It knows that the--I don't want to say illiterate--less-literate reader knows that it's not exact. Now the less-literate reader knows that WARP knows that they know it's not exact. Now we can move on to more important stuff, like how WARP, rWAR and fWAR differ, why that is, and why those ideas are important.
I'd love love love love to see a +/- column next to WARP in your stats pages that use that value metric.
Also, there are base-running situations where pitchers are credited with an out (despite not doing anything to earn that out) but batters are not charged with making an out. A simple example is getting thrown out at second trying to stretch a single into a double. The pitcher is credited with the out despite giving up the hit, but the batter is credited with a hit despite making an out. As a result there are more pitcher outs than batter outs.
I am very curious as to whether you will attempt to address these asymmetrical situations to create a true double-entry system or accept the imprecision.
At the most basic level, "WAR" has reached the public market for stats, but BP insists on calling their version of the model WARP. I'm very, very interested in this series, I just hope it is additive to the work being done elsewhere -- and consciously so.
I hear (on podcasts & increasingly so on TV/radio) references to fWAR & bWAR ... never can recall a reference to bpWARP.
If you build this hopefully better tree in a walled-off forest, there's not enough people to watch it stand (or fall).
You're not just trying to improve on the measurement of WAR ... but also build awareness and broader usage of whatever you end up with.
fWAR is *an* implementation of WAR
rWAR is *an* implementation of WAR
It's not identical, any more than Oracle's implementation of SQL92 is identical to DB2's version.
I don't know if the fWAR/rWAR split is a feature or not. I don't know whether the fact that they both call it WAR but have different methods of calculation is a feature or not.
If this was a court case, we'd each take sides, and just explain one point of view. I think you can reasonably make a case either way.
In this same vein, will defensive replacement value continue to be at 0? Correct me if I'm wrong, but it seems that most systems use performance vs. replacement level for offense and then performance vs. the average (0 runs) for defense. I have no idea where you would put replacement level defense and maybe it is zero runs, but it would be good to establish this early.
Another tangent in this line would be how much value you want to assign to the three pillars of batting, fielding, and base running. If batting is 60% of the value of a position player, would you consider fielding to be 30% and base running the last 10%? Have I/we missed something here that we're not factoring in? If those percentages don't make sense, then what does, and is it possible to create a regression equation using historical data to derive this information? Would this research also allow for a way to watch the watchmen, so to speak? That's a solid and underrated point you've made about being able to compare your output to reality.
These are just some initial thoughts off the top of my head, and I look forward to hearing those of others as this is sure to be an exciting project. Thank you for making us, the community, feel a part of the process. It's quite refreshing.
(It won't be the first place we start -- it was going to be, but it turns out that to discuss replacement level well, you need at least a few other concepts in your head first, so that's what I've hit upon as the best place to start off.)
In terms of the relative value of hitting, fielding, and baserunning -- it comes down to two things: how many runs it is worth relative to a given baseline, and how well you can measure it. We'll be getting into all of that.
Since you asked: you are wrong.
Fangraphs follows the framework I have, which is that every component is compared against the AVERAGE.
You can see that presentation at the bottom of any of the player pages. You can also see it at BaseballProjection.com, which is the precursor to Baseball-Reference's WAR.
Anyway, there is a difference in how I present it (each component at the league average, then one sweeping replacement level number at the player level).
The practical difference is that by keeping things as I'm presenting them, we don't have to have these conversations about "replacement level offense," "replacement level defense," etc., things that don't in fact exist.
This was the #1 problem with the original WARP, which Clay finally agreed to in the end. And this would have been more obvious if we simply stuck to the presentation I advocate.
It's tiring that we need to constantly correct readers who are not as knee-deep in this, because the damage was done, and continues to be done.
I think everyone can pretty much agree that a typical replacement level player is a below-average hitter for his position but an average fielder. If you want to call that "replacement level offense and defense" or not doesn't really change the results of a value metric any.
That's not the argument I'm making.
The argument I'm making is that you compare each component to the average, because that makes sense.
And replacement level is a concept at the PLAYER level, not at the component level. That's the argument. That's the way I've presented WAR and that's how I've sold WAR.
The seasonal error bars are not going to add up linearly for the career totals (should add up following RMSE). Have you figured out how to explain that for the masses?
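A minimal sketch of that point, with purely illustrative numbers: if seasonal error bars are independent, they combine in quadrature (root-sum-of-squares), so the career error bar comes out meaningfully tighter than the naive linear sum.

```python
import math

# Hypothetical (fielding runs, one-sigma error) pairs for three seasons.
seasons = [(30.0, 15.0), (28.0, 14.0), (31.0, 15.0)]

career_runs = sum(v for v, _ in seasons)               # 89 runs
linear_error = sum(e for _, e in seasons)              # naive linear sum: 44
rss_error = math.sqrt(sum(e * e for _, e in seasons))  # quadrature: ~25.4

# Assuming independence, the quadrature error bar is tighter than
# simply adding the seasonal error bars end to end.
assert rss_error < linear_error
```

The explanation "for the masses" might simply be that independent errors partly cancel, so a career total is known more precisely, in percentage terms, than any single season.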
You can have Andrelton Simmons in 2013 be worth +30 runs on fielding +/- 15 runs.
But, if he continues to pile up +30 run seasons, you'll be able to restate his 2013 season as say +30 +/-10, with the knowledge of future seasons.
Again, most people are going to not like this idea, thinking that all seasons should remain independent. But, the reality is, they are not.
And if you look at it historically, obviously the WWII years were lacking in huge talent.
So, yes, that should be on the table.
Mr. Tango's comment about the WWII years lacking in talent would be exactly the type of thing I'm curious about how "average" takes that into account vs. now for example.
http://www.baseballprojection.com/war/e/erstd001.htm
And that's based on the framework I've described on my blog: every component compared to average, with the replacement level treated as its own component.
http://www.insidethebook.com/ee/index.php/site/comments/everyone_has_their_own_war/
And since FRA is the central component to WARP for pitchers, that means WARP for pitchers is useless to me.
Things like the Trout vs. Cabrera race should be at least temporarily settled, because even with the stats Cabrera puts up being of the mostly nonsabermetric sort, it's still hard to believe that Trout's that much better than everyone else. Trout should be a player of focus in reviewing this batch of WARP, to see if there is some major misstep allowing him to rate this well or if he really is this good.
If Trout is +8 +/-1.5 runs, then that makes his range +6.5 to +9.5. And if Miggy is +7.5 +/-1, then he's at +6.5 to +8.5.
If anything, this kind of thing will reinforce Trout's greatness.
(All numbers for illustration purposes only.)
Mike Trout is an above average hitter, above average fielder, and above average runner. And you can put "way" in front of any or all of those.
All I can say is that everyone should try to develop their own implementation of WAR. The WAR framework is there for everyone to use. The presentation at BaseballProjection.com gets it exactly right.
At this point, just work through it yourself, let's see where you end up, and we can take it from there.
- Please make it easier to find WARP in the Statistics page. Right now I can find WARP, but to find the individual components you need to really search. I would love it to be on the front page as a continuous update similar to B-Ref and Fan Graphs.
- ARM ratings need to be part of FRAA (if they are not already). I know you have said they are part of WARP, but they are still not available on your site.
- Please leverage the work Max Marchi is doing on Catcher fielding (ie: framing / game calling) in FRAA. This will truly separate WARP from fWAR and rWAR.
Thanks
The most obvious example is NL pitchers facing the pitcher every 9th PA compared to AL pitchers facing a DH. It's been common for years to add a half run or so to ERA when trying to make an eyeball comparison between pitchers in opposite leagues.
But even more interesting is controlling for the statistics of players against the expectation of who they faced. Using a starting pitcher as an example; it should be fair to say a guy that gets 24 starts against the top 5 offenses in the league has a more difficult job than a guy that gets 24 starts against the worst 5 offenses (and of course this concept should be reduced down to specific pitchers and batters faced). The former pitcher is arguably a much better player than the latter with the same stats, and who is on the schedule is entirely outside the control of the players and in an ideal world should be controlled for.
Apologies in advance if this is the proverbial dumb question.
1) Wouldn't a full stadium view of the playing field be extremely helpful in judging defense? You could time an outfielder's jump, mark a player's exact position, and with some perspective judge the distance/difficulty of throws.
2) Why can't replacement level be the bottom 5th percentile of major league performers or something? I know this is overly simplistic thinking, but wouldn't something like that catch the essence of what a replacement level player is?
3) Will opponent factor into this at all? For example, if Matt Harvey shuts out the Red Sox at Citi Field, will he get the same amount of credit as if he shut out the Marlins at Citi Field? I don't know if this is already taken into account in WARP or the other total value metrics, but there is a discussion between Joe Sheehan and Brian Kenny about Harvey/Kershaw and one of Sheehan's points is that Matt Harvey has had the easiest schedule of anyone this year. Shouldn't that value into WARP, if it isn't already?
4) I hope the little things like the utility study by Mr. Carleton and the catch framing studies done by Mr. Lindbergh will be accounted for in this new version of WARP.
5) Will outside studies be used to help formulate the new version of WARP?
6) After this is all said and done, could you possibly make it easier to access WARP by maybe putting a leader board on the main page ala. Fangraphs & B-R?
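On the percentile idea in question 2, here's one toy way to operationalize it. The 5th-percentile cutoff, the function name, and the rate stats are all hypothetical; this is a sketch of the concept, not an established definition of replacement level.

```python
def percentile_replacement(rates, pct=0.05):
    """Return the value at the pct-th percentile of a list of rates
    (a simple floor-index percentile; real systems would also weight
    by playing time and adjust by position)."""
    ordered = sorted(rates)
    idx = max(0, int(pct * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical per-player production rates (e.g., runs above zero per 600 PA).
rates = [12, 35, 48, 5, 22, 61, -3, 18, 40, 9,
         27, 55, 2, 31, 44, 15, 50, 7, 38, 20]
replacement = percentile_replacement(rates)  # the bottom of this toy pool
```

The practical difficulty, as the article hints, is choosing the population: all major leaguers? Bench players and call-ups only? The answer changes the cutoff substantially.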
1) Both the play outcome --> run value and run value --> win value legs of any WAR methodology are things that ought to float from year to year (or at least era to era) based on overall environment. A run in 1962 is worth a lot more than a run in 1998. Similarly, the relative value of HR vs BB would be different given that in a lower offensive environment a walk would have a lower expectation of scoring. Is that something already accounted for in WARP? If not, is it something you plan to include? Would you make similar AL/NL adjustments? The presence/absence of the DH would have similar effects on runs->wins and outcomes->runs
2) The discussion above on offensive, defensive, and baserunning WAR above is a bit circular, but gets at an important point- Is "replacement level" going to be defined at the player level or skill-component level? That is to say, is a 0 WARP player someone who is not just a bad hitter, but slow and stonehands as well? Or is 0 WAR going to mean awful with the stick but not a disaster in the field or on the basepaths (or some equivalent skill mix that gets to the same place)?
2b) The component-level vs player-level definition of replacement level gets at the fundamental assumption of talent distribution behind WAR, which is that it's a pyramid (or some kind of highly skewed distribution like Pareto or the truncated edge of a normal curve, etc), and that such a distribution implies that there's a certain production level that is so ubiquitous as to have zero scarcity value. This is pretty demonstrably the case for hitting and pitching (probably pitching, I'm not 100% sold on it), but given that defensive stats are by no means mature, can we assert with any confidence that it is also true in the field? Even if it's true in some metaphysical sense, is it true in the population of players who hit well enough to potentially put on a 25-man roster?
3) I would think avoiding "average" in any mathematical formulation of WARP would be ideal if possible. The whole conception behind WAR is the skewed talent distribution, and a big part of the usefulness of WARP is that it measures that skew. By building in any reference to "average" into the definition you are inherently assuming a particular level of skew, which then creates a circular measuring-what-you-assume problem. A system that defines replacement level based on percentile-of-players is a much better fit with the concept of WAR. That said, going with a percentile-based (or ranking-based) definition presents its own set of difficulties, particularly when you start trying to measure position value.
3b) I put "average" in scare quotes because it's a very slippery word with several possible meanings in a player-value context. First of all, given that we're dealing with a skewed talent distribution, do we really mean "median" rather than average? Even if we are talking about an average rather than a median, what kind of weighting do you use? Averages based on league-wide aggregate stats are weighted by plate appearance, and for obvious reasons better players will have more PAs than worse players. The result is that "average" calculated from leaguewide aggregates will be higher than "average" calculated in a way where each player counts equally.
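The weighting point in 3b is easy to demonstrate with toy numbers (all of them hypothetical): because better players get more plate appearances, a PA-weighted league average sits above a simple per-player mean.

```python
# (PA, on-base rate) for a toy four-player league -- illustrative only.
players = [(600, 0.360), (550, 0.340), (150, 0.300), (100, 0.280)]

# PA-weighted "average," as computed from leaguewide aggregate stats.
pa_weighted = sum(pa * rate for pa, rate in players) / sum(pa for pa, _ in players)

# Per-player mean, where each player counts equally regardless of PA.
per_player = sum(rate for _, rate in players) / len(players)

# The regulars dominate the aggregate, so the PA-weighted figure
# (~0.340) exceeds the per-player mean (~0.320).
assert pa_weighted > per_player
```

Which of these is the right "average" baseline is exactly the kind of choice the series should make explicit.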