Caught Looking examines articles from the academic literature relevant to baseball and statistical analysis. This review covers “openWAR: An open source system for evaluating overall player performance in major league baseball,” by Benjamin Baumer, Shane Jensen and Gregory Matthews, published in the June 2015 issue of the Journal of Quantitative Analysis in Sports. The goal, as always, is to expose the academic frontier to a wide audience and seek ways to move the discussion forward.
Most baseball statistics are easy to define—a run scored is a run scored. Sometimes a bit of judgment goes into the definition—sacrifice flies appear in the denominator for on-base percentage but not batting average, for instance—but the definition is at least widely agreed on. Wins Above Replacement (WAR), however, is a statistic that involves much judgment and little agreement. Baseball Prospectus publishes a measure called WARP, and FanGraphs (fWAR) and Baseball-Reference (bWAR) have measures of their own. In a recent paper, Benjamin Baumer, Shane Jensen and Gregory Matthews have declared openWAR on the others.
Their paper, “openWAR: An open source system for evaluating overall player performance in major league baseball,” proposes a new version of WAR that differs from the other measures in some important ways. The authors also emphasize reproducibility: along with their paper, they make available an R software package that allows users to recreate their work. Reproducibility and transparency have become increasingly important topics in academic research in recent years, and meeting very exacting standards for reproducibility is one of the authors’ stated goals. This stands in contrast to existing measures that rely on proprietary methods and opaque calculations. Whether their method outperforms the other measures is, of course, an open question and a difficult one to answer.
It is useful, though, to review the approach of Baumer, Jensen and Matthews to see how it differs from WARP, fWAR and bWAR in philosophy and construction. First and foremost, this paper uses a conservation-of-runs framework based on changes in run expectancy at the plate appearance level. The gory details are in the paper, but essentially after every plate appearance, the change in run expectancy is apportioned to hitters, pitchers, fielders and baserunners. After some manipulation to deal with park and platoon effects, these changes are compared to what would have been expected from a replacement level player. Finally, run expectancy impacts are converted to a wins measure. In the openWAR approach, context matters.
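The bookkeeping behind the conservation-of-runs idea can be illustrated with a short Python sketch. It is not the authors’ code: the run expectancy values below are made-up placeholders, and the function name and state encoding are assumptions chosen for illustration only.

```python
# Hypothetical run expectancy table: (outs, runners) -> expected runs for the
# rest of the inning. Runners are encoded as a string for first/second/third.
# These numbers are illustrative placeholders, not values from the paper.
RUN_EXPECTANCY = {
    (0, "000"): 0.50,
    (0, "100"): 0.90,
    (1, "000"): 0.27,
    (1, "100"): 0.53,
    (2, "000"): 0.10,
    (2, "100"): 0.23,
}


def run_value(before, after, runs_scored):
    """Change in run expectancy produced by one plate appearance.

    `after` is None when the plate appearance ends the inning, in which
    case the remaining run expectancy is zero.
    """
    re_before = RUN_EXPECTANCY[before]
    re_after = 0.0 if after is None else RUN_EXPECTANCY[after]
    return re_after - re_before + runs_scored


# A leadoff single: bases empty, no outs -> runner on first, no outs.
offense = run_value((0, "000"), (0, "100"), runs_scored=0)  # about +0.40
# Conservation of runs: whatever the offense gains, the defense loses.
defense = -offense
```

In this framing, the offensive value would then be divided between the hitter and any baserunners, and the defensive value between the pitcher and the fielders, so that the pieces always sum to zero for each plate appearance.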
This contrasts with other methods that begin with a player’s season-long stat line and assign run values to different outcomes like walks or stolen bases. Context-neutral models like WARP are generally interested in constructing an estimate of a player’s underlying skill level or true talent, while openWAR is attempting to describe what impact a player actually had in a given season. There are reasons to prefer one approach over the other, but the differences in philosophy are important to keep in mind when making comparisons.
The data used in the openWAR formulation comes from scraping MLB Advanced Media’s GameDay feeds. MLBAM data allows the program to identify each change in base-out state and run expectancy throughout each game, and also identifies the location of each batted ball. This data is subject to the same kinds of measurement error that trouble many kinds of defensive metrics, but the authors argue that it provides a higher level of resolution for hit location than other available sources. Importantly, this data source allows the authors to assign defensive responsibility for runs to fielders at the plate appearance level in a reproducible way that is not possible with something like Ultimate Zone Rating.
Another key element of the openWAR model is the careful definition of replacement level players. The authors are critical of the ad hoc approach used by bWAR and fWAR that involves calibrating the model in such a way that exactly 1,000 wins of WAR are distributed among all players, with the rest of the wins representing the replacement level contribution. The openWAR approach instead selects a number of players equal to the number of available roster slots—750—based on those with the most playing time, and assumes that everyone else represents the replacement pool. These fringe players form the baseline for comparison. This approach runs some risk of a star player who spends most of the season on the DL appearing in the replacement pool and may also lead to overrepresentation of fringe relievers relative to fringe starters, but it’s a reasonable approach.
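The playing-time cutoff described above is simple enough to sketch in a few lines of Python. The function names, the toy data and the pooled-rate baseline are assumptions made for illustration; the paper should be consulted for how the authors actually turn the replacement pool into a baseline.

```python
def split_replacement_pool(playing_time, roster_slots=750):
    """Split players into regulars and a replacement pool by playing time.

    `playing_time` maps player -> a playing-time measure (for example,
    plate appearances or batters faced). The `roster_slots` players with
    the most playing time are treated as regulars; everyone else is
    treated as replacement level.
    """
    ranked = sorted(playing_time, key=playing_time.get, reverse=True)
    return set(ranked[:roster_slots]), set(ranked[roster_slots:])


def replacement_rate(runs_above_average, playing_time, pool):
    """Pooled runs above average per unit of playing time for the pool.

    Using a pooled rate as the baseline is an assumption of this sketch.
    """
    total_runs = sum(runs_above_average[p] for p in pool)
    total_time = sum(playing_time[p] for p in pool)
    return total_runs / total_time


# Toy league with two roster slots instead of 750.
pt = {"A": 600, "B": 550, "C": 100, "D": 50}
raa = {"A": 30.0, "B": 5.0, "C": -3.0, "D": -1.5}
regulars, pool = split_replacement_pool(pt, roster_slots=2)
baseline = replacement_rate(raa, pt, pool)  # runs per unit of playing time
```

The sketch also makes the weakness noted above easy to see: a star with very little playing time would land in `pool` and pull the baseline up.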
Another important contribution of this paper is the attempt to quantify the uncertainty around the estimate of how good a particular player was. In particular, the authors address player-outcome variation using a resampling strategy that simulates the player’s season by drawing, with replacement, runs-above-average values for each plate appearance from the player’s actual season. They use this method to create 3,500 simulated seasons for each player and generate a distribution of WAR that reflects the range of results that random variation in plate appearance outcomes could have produced. This is a useful exercise for understanding how a player’s actual WAR might differ from what would be expected, given performance and context, but the resulting distribution should not be interpreted as the variance of an estimate of the player’s true talent or underlying skill.
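The resampling idea is a standard bootstrap and can be sketched with Python’s standard library. This is a minimal illustration, not the authors’ implementation: the function names are hypothetical, and the sketch stops at season totals of runs above average rather than carrying them through to wins.

```python
import random


def simulate_seasons(pa_values, n_sims=3500, seed=0):
    """Bootstrap a player's season from per-plate-appearance run values.

    Each simulated season draws, with replacement, as many plate
    appearances as the player actually had and sums their run values.
    """
    rng = random.Random(seed)
    n = len(pa_values)
    return [sum(rng.choices(pa_values, k=n)) for _ in range(n_sims)]


def percentile_interval(totals, lower=0.025, upper=0.975):
    """Rough percentile interval from the simulated season totals."""
    ordered = sorted(totals)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Toy season of eight plate appearances with made-up run values.
season = [0.4, -0.25, -0.25, 1.0, -0.1, 0.0, -0.25, 0.4]
totals = simulate_seasons(season)
low, high = percentile_interval(totals)
```

The width of the interval reflects only how much the season total could move if the same mix of outcomes were reshuffled, which is why it describes outcome variation rather than uncertainty about underlying skill.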
The authors do not estimate variance caused by model uncertainty, or imprecision in parameter estimates, as this is unlikely to be a significant source of variation relative to player outcome variation. It is probably even less of a concern in the event-based approach to computing WAR relative to a linear weights-based approach that relies more heavily on parameter estimates for valuing a particular outcome. They also do not estimate situational uncertainty that comes about due to different players having different opportunities—players who come to bat more often with men on base will get more impact from their hits than those who don’t. The openWAR approach makes no attempt to strip this sort of context out of the model; rather, the authors view context differences as a key feature of their approach.
There are some noteworthy and curious choices in the openWAR model. First, like other systems, the authors make positional adjustments to reflect differences in defensive responsibility: a shortstop has a bigger fielding impact on the game than a left fielder, and this needs to be reflected somehow. In the same vein, a pitcher in the National League will probably be much worse at the plate than the average replacement player. It is misleading to convert all of these negative batting runs into negative WAR when the realistic replacement for that pitcher would be another pitcher. Some adjustment is reasonable. However, the authors make the positional adjustment by scaling the hitting performance in each plate appearance based on the performance relative to other hitters who play the same position. In other words, a player’s position in the field determines in part how much credit he gets for his batting accomplishments. This is counterintuitive, even if a positional adjustment is necessary. An alternative approach would be to calculate a separate replacement level for each position and adjust accordingly. It’s not immediately clear how this would impact estimated WAR in practice, but it would be more internally consistent.
Also, the authors’ formula apportions credit or blame on the defensive side between fielders and pitchers by making use of a fascinating map of the field. Each park-specific pixel from the MLBAM feed corresponds with a potential hit location and a probability that a batted ball in that location will result in an out. As it stands, however, the weight of an outcome is assigned to pitchers or fielders based solely on the location of the batted ball. The less likely a ball is to be caught, the more responsibility given to the pitcher. This is problematic in the sense that fielders get very little credit for making difficult plays. An alternative solution would be to assign credit or blame based on whether an out was made. A deep drive in the gap would be heavily weighted toward the pitcher if it fell for a double, but heavily weighted toward the fielder if it were caught. A similar strategy of dividing credit for called balls and strikes between pitchers and catchers might be useful for measuring pitch-framing, something not currently captured in the model.
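The difference between the two weighting schemes is easiest to see side by side. The Python sketch below is one possible reading of each, written for illustration: the linear split by out probability is a simplification of what the paper does, the outcome-based rule is only one way to implement the alternative suggested here, and all names and numbers are hypothetical.

```python
def location_based_split(defense_value, p_out):
    """Split a defensive run value using only the batted ball's location.

    `p_out` is the estimated probability that a ball hit to this location
    becomes an out. The fielder's share equals p_out, so hard-to-field
    balls are charged or credited mostly to the pitcher.
    Returns (pitcher_value, fielder_value).
    """
    return defense_value * (1 - p_out), defense_value * p_out


def outcome_based_split(defense_value, p_out, out_made):
    """Alternative split that also looks at whether the out was recorded.

    If the out was made, the fielder's share grows as the play gets
    harder (1 - p_out). If not, the fielder's share is p_out, so a ball
    that is rarely caught is charged mostly to the pitcher.
    Returns (pitcher_value, fielder_value).
    """
    fielder_share = (1 - p_out) if out_made else p_out
    return defense_value * (1 - fielder_share), defense_value * fielder_share


# A deep drive to the gap that is caught only 10 percent of the time.
# Suppose the catch is worth +0.30 runs to the defense (made-up value).
by_location = location_based_split(0.30, p_out=0.10)
by_outcome = outcome_based_split(0.30, p_out=0.10, out_made=True)
```

Under the location-only rule the fielder receives about 0.03 of the 0.30 runs saved on the catch, while the outcome-based rule gives him about 0.27. If the same ball falls for a double, both rules charge most of the damage to the pitcher.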
Finally, the steady improvement of fielding data is likely to provide interesting opportunities for improvement of the model. It’s not hard to envision a model that converts batted ball trajectories and exit velocities into a map of the probability of recording an out on the play. It’s equally easy to see an updated version of openWAR that incorporates this sort of data into improved measures of a player’s defensive contributions.
While openWAR is unlikely to move the sabermetric community toward an agreed-upon measure of WAR, the authors of the model have set an admirable standard for transparency and reproducibility. There will still be those who prefer a measure stripped of all context, and the openWAR approach is perhaps better suited for MVP voting than forecasting. But for those who wish to take issue with some elements of their approach, Baumer, Jensen and Matthews have provided a framework and source code that can be built upon.
Michael Wenz is Associate Professor of Economics at Northeastern Illinois University. Caught Looking reviews mainly recent articles from peer-reviewed academic journals. Please send along suggestions, especially for interesting dissertation or thesis chapters that aren’t always easy to find.