Have you heard about the new Statcast “X” metrics put out by MLB Advanced Media (MLBAM or BAM)? I’ll bet you have. In fact, I’ll bet you read about them all the time, in part because BAM strongly promotes them in its media operations.
Bloggers, columnists, and broadcasters have picked up on this encouragement, using these metrics to highlight pitchers who have been “unlucky” and others who seem poised for a “breakout” season. These takes assume that new metrics like “expected Batting Average” (xBA) and “expected weighted On-Base Average” (xwOBA) offer new and unique insights on true player performance. But is that notion supported by the data?
BAM contends that these metrics add value due to their incorporation of measurements like batted ball exit velocity and launch angle. Take, for example, the description provided by BAM’s glossary for Expected Batting Average (xBA). In the relevant part, it says:
Expected batting average (xBA) is formulated using the Statcast metric Hit Probability, which was introduced before the 2017 season. …
Expected batting average is more indicative of a player’s skill than regular batting average, as xBA removes defense from the equation. Hitters, and likewise pitchers, are able to influence exit velocity and launch angle but have no control over what happens to a batted ball once it is put into play.
The description for xwOBA contains similar language. But it says this also:
Knowing the expected outcomes of each individual batted ball from a particular player over the course of a season—with a player’s real-world data used for factors such as walks, strikeouts and times hit by a pitch—allows for the formation of said player’s xwOBA based on the quality of contact, instead of the actual outcomes.
We are unaware of any public validation of these claims, or any quantification of the actual net benefits offered by xBA and xwOBA. As we are now going on three years of public Statcast data, it is fair to put them to the test, not only to measure them, but to see whether they offer meaningful improvement over what we already had.
Unfortunately, the results are disappointing. After review, we find little evidence that metrics like xBA or xwOBA provide a uniquely better measurement of probable pitcher skill. In fact, these Statcast metrics are little better (and in some cases worse) at isolating skills than metrics like FIP and DRA, which don’t incorporate exit velocity or launch angle at all.
We downloaded complete batted ball summaries for pitchers from Baseball Savant for the 2015, 2016, and 2017 seasons. These downloads include, among other things, expected batting average (xBA), actual batting average (BA), expected weighted on-base average (xwOBA), actual weighted on-base average (wOBA), and the number of “hits” and at-bats (“abs”) confronted by each pitcher.
We then downloaded, for the same seasons, pitcher DRA, FIP, and plate appearances (“PA,” aka batters faced) from our BP tables, grouped by pitcher and season to match the overall format of the Savant downloads.
We performed inner joins to combine the two data sets for each season. We created our first data set (same-year performance comparisons) by binding together all rows from all three seasons. (Using all pitchers, rather than just pitchers who threw in back-to-back seasons, is more favorable to BAM’s expected statistics.) After removing those with missing entries, we were left with 2,226 pitcher seasons.
To create our second data set, looking at next year’s performance, we performed further inner joins on pitchers who pitched back-to-back in the 2015 and 2016 seasons, did the same for pitchers who pitched in the 2016 and 2017 seasons, bound the combined rows from these two seasonal pairs into one final comparison data set, and once again eliminated the very few pitchers who had one or more missing entries after this process was complete. This left us with 1,060 back-to-back pitcher seasons for comparison.
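The join-and-bind steps described above can be sketched in a few lines. The original analysis was done in R and its code is not reproduced here, so the following Python is only an illustration. The field names (`xwoba`, `woba`, `abs`, `dra`, `fip`), the pitcher IDs, and every number in the sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int]                 # (pitcher id, season)
Row = Dict[str, Optional[float]]      # stat name -> value (None = missing)


def inner_join(left: Dict[Key, Row], right: Dict[Key, Row]) -> Dict[Key, Row]:
    """Keep only pitcher-seasons present in both sources and merge their fields."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


def drop_missing(rows: Dict[Key, Row]) -> Dict[Key, Row]:
    """Remove any pitcher-season with at least one missing entry."""
    return {k: r for k, r in rows.items() if all(v is not None for v in r.values())}


def back_to_back(rows: Dict[Key, Row]) -> List[Tuple[Key, Row, Row]]:
    """Pair each pitcher-season with the same pitcher's following season, if any."""
    pairs = []
    for (pitcher, season), row in sorted(rows.items()):
        following = rows.get((pitcher, season + 1))
        if following is not None:
            pairs.append(((pitcher, season), row, following))
    return pairs


# Hypothetical sample rows standing in for the Savant and BP downloads.
savant = {
    ("p1", 2015): {"xwoba": 0.300, "woba": 0.310, "abs": 600.0},
    ("p1", 2016): {"xwoba": 0.305, "woba": 0.295, "abs": 550.0},
    ("p2", 2016): {"xwoba": 0.330, "woba": None, "abs": 120.0},  # missing entry
    ("p3", 2017): {"xwoba": 0.340, "woba": 0.350, "abs": 80.0},  # absent from BP rows
}
bp = {
    ("p1", 2015): {"dra": 3.40, "fip": 3.55},
    ("p1", 2016): {"dra": 3.60, "fip": 3.45},
    ("p2", 2016): {"dra": 4.80, "fip": 4.60},
}

same_year = drop_missing(inner_join(savant, bp))  # first data set (all pitcher-seasons)
next_year = back_to_back(same_year)               # second data set (consecutive seasons)
print(len(same_year), len(next_year))
```

In this toy input, the pitcher with a missing wOBA and the pitcher absent from the second source both fall out, which mirrors why the real data sets shrink after each join.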
We concluded by calculating the harmonic mean for the various opportunity measurements for each two-year combination to use as weights for our year-to-year comparisons. We ended up selecting Savant’s at-bats (“abs”) for all of our weights in both data sets, although the numbers were essentially the same when using plate appearances or what Savant calls “hits” instead.
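As a rough illustration of why a harmonic mean is a natural weight for a two-season pair, consider the sketch below. The at-bat totals are hypothetical, and the function name is ours rather than anything from the original R code.

```python
def harmonic_mean(first: float, second: float) -> float:
    """Harmonic mean of two opportunity counts; zero if either season is empty."""
    if first <= 0 or second <= 0:
        return 0.0
    return 2.0 * first * second / (first + second)


# Hypothetical at-bat totals for (year Y, year Y+1).
pairs = [(600, 600), (600, 200), (600, 20)]
weights = [harmonic_mean(y0, y1) for y0, y1 in pairs]
print(weights)
```

Unlike an arithmetic mean, the harmonic mean stays close to the smaller of the two totals, so a pitcher with one full season and one very short season contributes little weight to a year-to-year comparison.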
We then set up a series of weighted Spearman correlations to look at the descriptive (correlation to past performance), the predictive (correlation to future performance), and the reliability (consistency of measurement) qualities for various combinations of these statistics. Spearman correlations are a slightly more robust form of the Pearson correlation, working off ranks rather than the raw measurements, although the results tend to be similar. Looking for positive correlations is useful because they are easier to understand (scale of 0 to 1, higher is better) and because, unlike raw error measurements, they allow us to compare rate metrics on different scales (such as batting average versus runs allowed).
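A weighted Spearman correlation can be approximated as a weighted Pearson correlation computed on ranks. The sketch below makes a simplifying assumption: ranks are computed without weights (ties get their average rank) and the weights enter only in the Pearson step. The wCorr package listed in the references may treat ranking differently, so this is an illustration of the idea, not a port of that package. All sample values are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def weighted_pearson(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Pearson correlation with observation weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return cov / math.sqrt(vx * vy)


def weighted_spearman(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Simplified weighted Spearman: weighted Pearson on unweighted ranks."""
    return weighted_pearson(average_ranks(x), average_ranks(y), w)


# Hypothetical example: year-Y xwOBA, year-Y+1 wOBA, and at-bat weights.
xwoba_y0 = [0.300, 0.320, 0.280, 0.350]
woba_y1 = [0.310, 0.305, 0.290, 0.400]
at_bats = [600.0, 300.0, 40.0, 500.0]
rho = weighted_spearman(xwoba_y0, woba_y1, at_bats)
print(round(rho, 3))
```

Because only ranks enter the calculation, metrics on different scales (a batting average and a run average, for instance) can be compared on the same footing.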
The correlations were measured over 100,000 bootstrapped re-samples of both pitcher data sets to provide the mean correlations and a probable margin of error around them. We used our previously discussed Bayesian bootstrap concept to generate credible intervals that would be intuitive.
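For readers unfamiliar with the Bayesian bootstrap, the general idea is to re-weight the observations with random Dirichlet weights instead of re-sampling rows, then recompute the statistic on each draw. The sketch below rests on several assumptions of ours: the Dirichlet draw is multiplied by the opportunity weights, a plain weighted Pearson correlation stands in for the weighted Spearman, the number of draws is far smaller than the 100,000 used in the study, and the data are hypothetical.

```python
import math
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def weighted_pearson(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Pearson correlation with observation weights w (stand-in statistic)."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)


def bayesian_bootstrap(
    x: Sequence[float],
    y: Sequence[float],
    base_weights: Sequence[float],
    stat: Callable[[Sequence[float], Sequence[float], Sequence[float]], float],
    draws: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the mean and standard deviation of stat over Dirichlet re-weightings."""
    rng = random.Random(seed)
    results: List[float] = []
    for _ in range(draws):
        # A Dirichlet(1, ..., 1) draw is a set of normalized Exponential(1) variates.
        raw = [rng.expovariate(1.0) for _ in x]
        total = sum(raw)
        # Assumption: combine the random draw with the opportunity weights by product.
        w = [b * r / total for b, r in zip(base_weights, raw)]
        results.append(stat(x, y, w))
    return statistics.fmean(results), statistics.stdev(results)


# Hypothetical data: a year-Y metric, a year-Y+1 outcome, and at-bat weights.
metric = [0.300, 0.320, 0.280, 0.350, 0.310, 0.295]
outcome = [0.310, 0.305, 0.290, 0.340, 0.330, 0.300]
at_bats = [600.0, 300.0, 40.0, 500.0, 450.0, 150.0]
mean_corr, margin = bayesian_bootstrap(metric, outcome, at_bats, weighted_pearson)
print(round(mean_corr, 3), round(margin, 3))
```

The standard deviation of the draws plays the role of the margin of error reported alongside each mean correlation.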
The analysis was completed entirely in the R programming environment. The code will be available in our repository shortly.
Discussion and Results
Sometimes it’s difficult to decide the fairest way to evaluate a metric’s performance. Not so, here: by describing these metrics as “expected” batting average and “expected” wOBA, these metrics seem to have prediction written all over them (even though BAM disagrees). We are not alone in this intuition. Various—if not virtually all—articles written by others about these statistics seem to assume their predictive value, typically claiming that pitchers with large divergences between their “expected” and “actual” value probably will move toward their “expected” value in the future.
We can test a metric’s predictive power by comparing consecutive seasons of pitcher performance. This is a challenging test, as some pitchers will face very different conditions, or experience a change in skill. However, those challenges make life more difficult for all such statistics, so it remains a fair basis for comparison.
To provide context, we’ll also compare xBA and xwOBA to Fielding Independent Pitching (FIP) and Deserved Run Average (DRA), neither of which relies upon exit velocity or launch angle. Both thus arguably should be comfortably outperformed by BAM’s “expected” statistics, which do incorporate these inputs, if in fact these metrics tell you what to “expect.”
We’ll start with batting average and see how xBA’s predictions compare to ordinary batting average, FIP, and DRA at predicting actual batting average allowed for all MLB pitchers in the following year (“Y+1”).
The following table provides both the average weighted correlation and the margin of error, i.e., the standard deviation in each direction around the average correlation. Statistics with different means but overlapping margins of error are less likely to be materially different. Those whose mean correlations differ by more than their respective margins of error, however, are quite likely to be materially different.
| Statistic | Correlation to Y+1 BA | Margin of Error (plus/minus) |
Well, there you have it: the difference between xBA and plain old batting average at predicting future batting average is … within the margin of error. There is a reasonable likelihood that xBA is slightly more predictive, but the difference is modest at best, and possibly not meaningful at all. xBA is probably also more predictive than FIP, although that too is within the margin of error. Notably, DRA, which does not use exit velocity or launch angle or even stringer coordinates, essentially ties with xBA in predicting next-year batting average, particularly when you consider the margin of error.
But, you say, batting average is outdated! Surely xwOBA is a superior choice that will demonstrate the true power of new metrics like these. We’ll run a similar analysis, looking at the correlation to the following year’s wOBA as predicted by these other metrics:
| Statistic | Correlation to Y+1 wOBA | Margin of Error (plus/minus) |
Hmm. The difference between xwOBA and just using last year’s wOBA is once again within the margin of error. It is entirely possible there is little real difference between the two.
It gets more discouraging.
First, xwOBA is no better than FIP—yes, plain old strikeouts, walks, and home runs FIP—at predicting future wOBA. None of us has seen a marketing slogan proclaiming “xwOBA: expect no more than FIP,” but at least in this respect, that appears to be the case. This finding is consistent with comparisons made by Craig Edwards of FanGraphs. Although his method is slightly different than ours, he too had trouble finding any real difference between xwOBA and FIP at predicting future performance.
Second, DRA does as good a job as xwOBA at predicting future wOBA, even though DRA has no idea whether a given ball plugged the gap or bounced off first base. Put simply, launch angle and exit velocity, at least on average, are not providing any additional predictive benefit over what we already have, and have already had for some time.
Finally, you’ll notice xwOBA is no more predictive of future wOBA (.35 versus .32) than xBA is of future batting average (.36 to .32). Think about that for a second: batting average, one of the whipping boys of the sabermetric movement, is predicted slightly better than the more advanced wOBA by these new inputs.
Irony aside, the inability of xwOBA to improve on xBA suggests fundamental problems in how it’s put together. Recall that batting average is unstable in large part because it’s driven by a pitcher’s inherently volatile batting average on balls in play (BABIP). xwOBA, on the other hand, should have the advantage of explicitly and separately incorporating walks and strikeouts, two of the most stable skills in the game. To incorporate those two elements and still be less predictable than batting average takes some doing.
Another useful test to evaluate whether a metric is measuring skill is to see how consistently the same metric evaluates the same player, one year later. Pitchers change, of course, but good pitchers should generally remain good pitchers and below-average pitchers generally remain below-average pitchers. A good pitching metric should be able to consistently put the same guys in the right general category. So, in consecutive years (Y+1), how do our various metrics compare to one another in reliability?
| Statistic | Correlation to Itself Y+1 | Margin of Error (plus/minus) |
This comparison reveals some good news and some bad news. The good news is that xBA and xwOBA are almost certainly more reliable than their non-x counterparts, although xwOBA’s probable inferiority to xBA remains bizarre. xBA and xwOBA also (finally) show signs of being superior to FIP, so they’ve got that going for them.
Except then you get to DRA, which is clearly more reliable than xwOBA (outside the margin of error), and probably also more reliable than xBA (some overlap). In other words, the BAM metrics incorporating launch angle and exit velocity appear to be detecting less useful pitcher skill than DRA. This is a nice vote of confidence for robust statistical methods. Unfortunately, it also suggests, at least in the context of xBA and xwOBA, that exit velocity and launch angle aren’t telling us things we couldn’t already detect just fine from other, more traditional inputs.
At this point, perhaps you are wondering whether xBA and xwOBA are uniquely good for anything at all. There is, as it turns out, one aspect in which both statistics perform “well,” and that is in so-called “descriptive power”—the correlation to measured past performance. Let’s look at our various statistics again, except this time we’ll focus only on comparisons to wOBA, using our same-year data set:
| Statistic | Correlation to same-year wOBA | Margin of Error (plus/minus) |
This is another good news/bad news situation. Since we are trying to match performances we’ve already seen, there’s much less uncertainty all around. And xwOBA finally comes out on top! Except … once again, it barely outpaces FIP, which, as you’re no doubt tired of being reminded, has no Trackman or ChyronHego, but seems quite competitive.
The bigger problem is that there is no real trophy for “best descriptive performance.” In fact, descriptive performance for an aggregate statistic is generally useless, as it seeks only to match a statistic we already have. A more reasonable formulation of the table above would be as follows:
| Statistic | Correlation to same-year wOBA | Margin of Error |
wOBA already exists and therefore correlates perfectly to itself without error. Creating a new statistic that fits wOBA for the most part and doesn’t uniquely predict anything else is not useful. If this is hard to understand, try substituting “not quite wOBA” for “xwOBA” in any sentence where you might otherwise quote xwOBA to evaluate a pitcher. Doesn’t it sound silly? And yet, that is all that descriptive performance is measuring.
DRA is tuned for reliability/skill measurement, and in this comparison (which again selects the single-year data set most favorable to xwOBA), it admittedly shows.
Earlier this week, we reached out to BAM with our findings, asking if they had any comment.
MLBAM Senior Database Architect of Stats Tom Tango promptly responded, asking that we ensure we had the most recent version of the data, due to some recent changes being made. We refreshed our data sets, found some small changes, and retested. The results were the same.
Tango then stressed that the expected metrics were only ever intended to be descriptive, that they were not designed to be predictive, and that if they had been intended to be predictive, they could have been designed differently or other metrics could be used. Having thought about this, we have a few comments.
First, defining “expected” performance entirely in terms of past performance is a tough sell, regardless of what you intended. Grammatically speaking, it seems like a better name for these metrics would be What You Would Have Expected wOBA (wywhewOBA) rather than the more general “expected” modifier, a concept, together with its “x” prefix, that is indelibly—as well as logically—associated with anticipated future performance among fantasy baseball enthusiasts and others.
Furthermore, regardless of what was intended, reader expectations matter also. As indicated by even the rudimentary sampling of articles above, the idea that xwOBA should not be used to calibrate future expectations will come as a bit of a shock to many. These are admittedly complicated issues, and finding clear answers to them is hard. But if the metric was designed for as limited a purpose as Tango has described, BAM’s glossary entries really should have specified their limited intent more carefully.
The second, and equally challenging, issue surrounds whether descriptive performance, despite being useless in general, might have some utility here. Perhaps, the argument might go, it’s useful to know what exit velocities or launch angle combinations drove past performance, for teaching or other purposes. The problem is that a visual chart serves that purpose just fine without the confusion of the “expected” label. Moreover, the implicit suggestion of an “expected” metric is that it’s somehow revealing a fundamental truth of which players “deserve” a good performance and which do not.
With their comparative lack of reliability, xBA and xwOBA at best move us somewhat closer to that goal without necessarily isolating the consistent underlying skill. After all, we already knew that extra-base hits arise from the ball being struck very hard and hit far into the outfield grass, not the infield dirt. When metrics like xwOBA struggle to replicate the skills they claim to uniquely capture, one reasonable inference is that the true skills driving productive contact—indeed, the true skills driving exit velocity and launch angle—remain substantially unquantified and out of reach.
First, I want to stress again that this study looked only at xBA and xwOBA as applied to pitchers. (That said, our preliminary studies show that the descriptive performance of xwOBA for batters and pitchers is basically identical.)
Second, if one is going to criticize, ideally one should also offer some suggestions. Here are mine:
- Before labeling any statistic “x” anything, think long and hard about your true goal, and about how you should be describing it. If your goal is to predict future performance, you need to ensure that the metric in fact does this, and preferably does so better than what’s already out there. Not doing so is unfair to columnists and readers who, as the hyperlinks above show, will otherwise assume a new metric has unique predictive value when in fact it does not. Fitting past performance does not in and of itself reveal pitcher skill.
- Be wary of over-fitting, which almost certainly occurs when you focus solely on finding explanations for past performance. Finding an explanation for the past doesn’t necessarily tell us what truly caused it. Proper regularization will improve out-of-sample performance and should improve predictive power also, revealing more of the actual “skill” at issue.
- Some of what xwOBA and other “x” statistics get credit for “spotting” is probably plain old reversion to the mean. You don’t need Trackman radar to credibly bet that an outlier performance will become more typical in the future if the player remains in MLB. Focus instead on average overall league measurement and the uncertainties around those averages to determine whether you’re probably detecting actual improvements in prediction.
- Finally, consider modeling inputs like exit velocity and launch angle as non-linear, even if that results in a “black box” that makes the effect of Statcast components more difficult to explain. The inability of xwOBA to beat DRA at reliability suggests either a) that launch angle and exit velocity are not useful, or b) more likely, that too much of the signal from these inputs is being lost. Machine learning and other semi- or non-parametric methods can potentially improve these results, perhaps providing improvement in all these metrics.
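The reversion-to-the-mean suggestion in the list above can be illustrated with a short simulation. Every parameter here is hypothetical (a league-average wOBA allowed of .320, a talent spread of .015, and single-season noise of .030). The point is only that pitchers with extreme first-year results drift back toward average in the second year even though their simulated talent never changes.

```python
import random
import statistics
from typing import List, Tuple


def simulate_two_seasons(n: int = 5000, seed: int = 1) -> Tuple[List[float], List[float]]:
    """Simulate observed wOBA allowed in two seasons for n pitchers with fixed talent."""
    rng = random.Random(seed)
    talent = [rng.gauss(0.320, 0.015) for _ in range(n)]     # hypothetical true skill
    year_one = [t + rng.gauss(0.0, 0.030) for t in talent]   # skill plus noise
    year_two = [t + rng.gauss(0.0, 0.030) for t in talent]   # same skill, fresh noise
    return year_one, year_two


def outlier_drift(
    year_one: List[float], year_two: List[float], share: float = 0.10
) -> Tuple[float, float]:
    """Average year-one and year-two results for the worst `share` of year-one pitchers."""
    cutoff = sorted(year_one)[int(len(year_one) * (1 - share))]
    chosen = [i for i, value in enumerate(year_one) if value >= cutoff]
    before = statistics.fmean(year_one[i] for i in chosen)
    after = statistics.fmean(year_two[i] for i in chosen)
    return before, after


year_one, year_two = simulate_two_seasons()
before, after = outlier_drift(year_one, year_two)
league = statistics.fmean(year_two)
print(round(before, 3), round(after, 3), round(league, 3))
```

Any metric that flags these pitchers as due for improvement will look prescient here, which is why comparisons should be made across the whole league with their uncertainties rather than on hand-picked outliers.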
Lastly, the arguable failure of xBA and xwOBA to offer sufficient value does not mean that Statcast itself lacks value. From an entertainment standpoint, tracking and measuring the progress of balls in play is fun and adds definite value to the baseball viewing experience. In terms of adding statistical value, however, public applications of Statcast inputs to “expected” outputs may have a long way to go.
Ahmad Emad & Paul Bailey (2017). wCorr: Weighted Correlations. R package version 1.9.1. https://CRAN.R-project.org/package=wCorr
Baseball Savant. URL https://baseballsavant.mlb.com/
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.