We’re almost two months into the Deserved Runs Created (DRC) era, and we’ve received the first wave of feedback about BP’s new composite batting statistic.
Although the overall performance of DRC+ is very strong, we’ve heard two general concerns. The first is that some players, particularly from extreme ballparks like Coors Field, are not being properly adjusted. The second is that players with “extreme” stats regardless of ballpark (particularly those who hit an unusually high or low rate of singles) are being shrunk too much toward the mean.
We’ve substantiated both of these concerns and made appropriate updates. What follows is a summary of the updates and a discussion of what we’ve learned.
Let’s start with ballparks. There was never anything wrong with the park ratings themselves, just the way those effects were being isolated from batter predictions. The batter predictions are now 100 percent park-isolated.
To prove this issue has been addressed, here is the weighted Pearson correlation between the updated batter DRC+ numbers and our DRC park ratings for each season, for all batters with at least one plate appearance from 2010 through 2018. For comparison, we include raw linear weights, which have no park adjustment at all, and thus reflect the natural bias caused by different parks.
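For readers who want to replicate this kind of check on their own data, here is a minimal sketch of a weighted Pearson correlation in Python, using plate appearances as the weights (the exact weighting used for the published table is an assumption here):

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Pearson correlation with observation weights.

    For a park check like the one above, x would be each batter's
    DRC+, y the DRC park rating of his home park, and w his plate
    appearances (so part-time players count less).
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    sx = np.sqrt(np.average((x - mx) ** 2, weights=w))
    sy = np.sqrt(np.average((y - my) ** 2, weights=w))
    return cov / (sx * sy)
```

A park-isolated statistic should return a value near zero here, while an unadjusted statistic should show a clear positive correlation with park rating.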
| Batting Statistic | Weighted Correlation to DRC Park Rating | Error |
| --- | --- | --- |
The connection between batter DRC+ and parks is now effectively zero (or, as the statisticians would say, “orthogonal”). Your favorite Colorado hitters will now be adjusted a bit downward from where they were before, and your hitters from parks that suppress offense are now being treated more fairly.
That said, DRC+ continues to rate Colorado hitters more highly than other composite batting statistics. To us, that is a feature, not a bug, as certain hitters, such as DJ LeMahieu, feel like they get penalized too much by traditional park factors. Here are some comparative numbers from 2018:
Extreme Batter Values
The more intricate issue concerns how DRC+ deals—or should deal—with hitters who have extremely high or low seasonal rates in different events.
This question goes to the heart of multilevel modeling, which underlies DRC+ as well as many of our other recent statistics. The strength of this approach is its skepticism that large groups of individuals are really all that different from one another, and its tendency to “borrow strength” from different members of each group to rate them relative to each other’s accomplishments. As a practical matter, this has two effects: (1) most players get shrunk toward the mean, at least a little bit, and (2) the models tend to be skeptical of outlier performances.
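The shrinkage effect can be illustrated with a toy partial-pooling calculation. The numbers below are purely illustrative stand-ins, not actual DRC parameters:

```python
def shrink_toward_mean(obs_rate, n, league_rate, k=200):
    """Partial pooling: blend a player's observed rate with the
    league mean, trusting the observation more as sample size n
    grows.

    k stands in for the ratio of within-player to between-player
    variance; 200 here is purely illustrative.
    """
    weight = n / (n + k)
    return league_rate + weight * (obs_rate - league_rate)

# A .350 rate over 40 PA gets pulled hard toward a .260 league mean;
# the same rate over 600 PA keeps most of its distance from the mean.
part_timer = shrink_toward_mean(0.350, 40, 0.260)
regular = shrink_toward_mean(0.350, 600, 0.260)
```

This is the "borrowing strength" at work: small samples are treated as mostly noise, large samples as mostly signal, and every estimate lands somewhere between the raw rate and the group mean.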
This strategy usually works extremely well, and for most position-player hitters, it definitely works well. The challenge is that there are a few batters every season who are deservedly outliers: typically, although not always, superstars. These batters have more extreme rates at hitting singles or drawing walks or hitting home runs because they are some of the best (or at least the most unique) athletes in the game. These players are not outliers randomly selected from an unknown population; rather, they are the confirmed occupiers of the extremes from a defined population. For these players, we want the models to “know” that they should not necessarily be shrunk as much, even if their performances are not accepted entirely at face value.
However, without additional information, the models do not “know” that some outlier performances can be representative of true player ability. So, while the original DRC models could tell that Mike Trout is extremely good, and in most ways better than everyone else, they still were inclined to shrink his performance a fair amount. Making this problem more difficult is the fact that these outliers generally fall only on one side of the spectrum: batters who are terrible typically do not get 500 plate appearances; great batters do. Fitting outliers is one problem; having to accommodate outliers that lie on only one side of the distribution makes it even harder, because the distributions used for multilevel modeling groups tend to be designed around assumptions of symmetry (typically your bell-curve, normal-type distributions).
Complicating things still further is that the existing method, although skeptical of outliers, nonetheless is performing better than all competing batting statistics overall. To be sure, the outliers are important, and we want to accommodate them. But we don’t want to do that at the expense of getting the non-outliers incorrect, as most baseball players are not outliers. And if you’re just going to assume that all outlier performances are deserved (which they aren’t), you might as well just go back to raw batting statistics.
What to do? The solution lies in reminding ourselves what it is about the outliers that we want to know: their expected contribution. If at least some outliers largely deserve their performances, there should be a consistent set of outlier performances across seasons, even if they don’t always involve the same players. So, what we really want to know is which outlier performances are within the range of consistent over-performance.
Put more formally, we have prior knowledge about the expected range of contributions from full-time hitters over multiple seasons; what we want is to have our model predictions be aware of what those multi-season ranges might be.
One of our first approaches was to perform what is known as an empirical Bayesian update on batter predictions, informed by their raw results from surrounding seasons. This works reasonably well with the right combination of surrounding seasons, but it also has drawbacks: (1) the values are constantly subject to change, even after the season has concluded; (2) we are explicitly incorporating results beyond the current season, which we dislike on principle; and (3) the corrections end up being fairly blunt, masking the sudden improvements and declines typical of players at the beginnings and ends of their careers. We wanted something better.
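To make that rejected approach concrete, here is a rough sketch of what such an empirical Bayesian update might look like, using a Beta-binomial form as an assumption; the actual models we experimented with were more involved:

```python
def eb_update(prior_hits, prior_bip, season_hits, season_bip, strength=0.5):
    """Empirical-Bayes style update on a singles rate: treat raw
    results from surrounding seasons as a Beta prior, then update
    with the current season's balls in play.

    'strength' down-weights the surrounding-season evidence; both
    the form and the 0.5 default are illustrative only.
    """
    a = strength * prior_hits
    b = strength * (prior_bip - prior_hits)
    return (a + season_hits) / (a + b + season_bip)

# A .250 season surrounded by .300 seasons lands in between,
# pulled toward the surrounding-season evidence.
blended = eb_update(150, 500, 100, 400)
```

The drawback described above is visible in the signature itself: the estimate depends directly on `prior_hits` and `prior_bip` from other seasons, so it changes whenever those seasons do.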
So, we thought about what we had learned. We knew that, while single-season extremes can be somewhat undeserved, there is a subset of those extremes that reflect a consistent range of performance across baseball for each event. We also knew that the models would benefit from prior knowledge about what these extreme ranges could be. However, we don’t want to put our thumb on the scale directly by treating each player’s surrounding seasonal results as fact. What we want instead is to let the data for each season drive the results, with the models merely better informed about what ranges of performance are plausible.
The solution is to incorporate a semi-informative prior distribution into the batter components of our models. The flagship R package for multilevel modeling, lme4, does not allow you to specify prior distributions for modeled groups: it estimates them instead from the data, which is what we were doing with the initial release of DRC+. There is a surprisingly useful variant of lme4, however, which does allow you to set prior distributions, while retaining the speed and ease of use that has made lme4 a default tool for multilevel modeling. Courtesy of Vince Dorie, that package is blme, and it turns out to be exactly the tool we needed.
We already were using blme for DRC+, but we were only setting prior distributions on what are known as the “fixed effects”: temperature, strike-zone effects, and similar confounders that universally affect players involved in a batting event. The so-called “random effects,” which are the identities of individual batters, pitchers, and stadiums, did not have such prior distributions. That has now changed: for any group with a tendency to have a performance spread wider than an lme4 model would ordinarily estimate, we now have set an Inverse Gamma prior on the standard deviation of that group.
Inverse Gamma priors have a checkered history in Bayesian statistics. For reasons of computational convenience, Inverse Gamma priors were long used as priors on variance components on the pretense of being objective (a/k/a “non-informative”) when in fact they often turned out to be quite the opposite. But when you have strong prior information to apply, and that information suggests that group variance could be larger than otherwise expected, the Inverse Gamma turns out to be a terrific choice, thanks to its combination of concentration near zero and its enormous right tail.
For our purposes, an Inverse Gamma prior consistently allows batter performances to scale to their multi-year ranges, without compromising overall performance. It’s important to get the prior specification right: if you specify it too narrowly or too broadly, the results can be terrible, and it will tank the benchmarks we discuss below. But year-in and year-out, the default IG specification we have chosen seems to provide very good results for baseball outliers. The specification and code for the prior will be discussed in the coming weeks.
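To make the shape of the distribution concrete, here is a minimal standard-library sketch of the Inverse Gamma density. The shape and scale values used below are purely illustrative, not the actual DRC specification:

```python
import math

def invgamma_pdf(x, shape, scale):
    """Density of the Inverse Gamma distribution:
    f(x) = scale^shape / Gamma(shape) * x^-(shape+1) * exp(-scale/x).

    The mode sits at scale / (shape + 1), the density vanishes
    rapidly as x approaches zero, and the right tail decays only
    polynomially -- which is what lets a group's standard deviation
    grow well past the mode when the data call for it.
    """
    if x <= 0:
        return 0.0
    return (scale ** shape / math.gamma(shape)) * x ** (-(shape + 1)) * math.exp(-scale / x)
```

With shape 3 and scale 2, for example, the density peaks at 0.5, is nearly zero just below that, yet still assigns meaningful mass far out to the right.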
How does this affect our estimates? Let’s take an example from 1987, a season when hitters like Tony Gwynn were testing the upper limit of how many singles could result from balls put into play. Among qualified hitters that year, the percentage of balls in play that became singles, rather than outs, ranged from 17 percent (Mike Pagliarulo) to 33 percent (an achievement Gwynn shared that year with John Kruk and Vince Coleman). Is that a trustworthy range of true batter ability? To check, let’s look at qualified hitters for each year between 1985 and 1989. As it turns out, the multi-year range for singles-versus-outs is 19 percent (Pagliarulo) to 32 percent (Wade Boggs)—close, but a bit tighter on each side. That is the narrower range worth targeting.
Originally, deferring entirely to the single-season data, DRC+ saw the probable range of actual singles-hitting to be only 20-30 percent: a solid estimate, but with some definite shrinkage beyond the multi-year range of batter performance. With the addition of an Inverse Gamma prior, however, blme allows the range to go a bit higher: 19-32 percent, which just so happens to be the range of ability suggested by the multi-year raw data.
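The logic of the multi-year check is simple enough to sketch. The season lines below are invented for illustration; the real check used every qualified hitter from 1985 through 1989:

```python
# Hypothetical (singles, balls_in_play) lines per season for two
# players; purely illustrative, not actual 1980s statistics.
lines = {
    "player_a": [(160, 480), (150, 500)],
    "player_b": [(70, 410), (80, 420)],
}

def single_season_rates(player):
    """Singles-per-ball-in-play for each season separately."""
    return [s / bip for s, bip in lines[player]]

def multi_year_rate(player):
    """Pool a player's seasons before dividing: one hot or cold
    season gets diluted, so the min-max range across players
    tightens relative to the single-season extremes."""
    singles = sum(s for s, _ in lines[player])
    bip = sum(b for _, b in lines[player])
    return singles / bip
```

Because pooling drags each player toward his own multi-season level, the spread of `multi_year_rate` across the league is naturally narrower than the spread of the single-season extremes, which is exactly why it makes a good target range.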
Extreme singles hitters thus fare better with these updates. But so do outliers in other categories, as we have extended the Inverse Gamma prior to batter groups for other event models as warranted, to allow those variances to similarly accommodate more extreme performances.
As noted above, while we’re looking to better accommodate outliers, we don’t want to do it at the expense of accurately measuring more typical players. Fortunately, we seem to be able to have it both ways.
Here are the updated benchmarks on our three Contribution Measures, in terms of fit to team run-scoring rates. The original versions of the tables that follow were also in the previous article, but they now include measurements for updated DRC+.
Batting Metric Performance by Contribution Measures (teams, 1980–2018); Robust Pearson Correlation to Team Runs/PA
As it turns out, these adjustments provide a substantial upgrade in our Reliability, trading off with Descriptiveness, while preserving our Predictiveness. As discussed previously, we consider Descriptiveness to be the least valuable of the three Contribution Measures, and in any event DRC+ remains more Descriptive of team run-scoring rates over the last several decades than any competing batter metric. The lead for DRC+ in Reliability is even greater than before, and the comparative advantage for DRC+ in Predictiveness remains far beyond the margin of error.
Team data is nice, but what about individual players who switch teams? This could be better “proof” of how well a player’s contributions are being measured, as switching teams can change both the hitter and ballpark quality around a player.
Here is how the updated DRC+ fares in Reliability and Predictiveness, from season-to-season, including non-park-adjusted stats. Recall that Reliability measures the consistency with which a statistic measures the same players similarly, a key indicator that the statistic is measuring more skill, and less noise:
Reliability of Team-Switchers, Year 1 to Year 2 (2010-2018); Normal Pearson Correlations
| Metric | Reliability | Error | Variance Accounted For |
| --- | --- | --- | --- |
As you can see, only DRC+ is able to account for over half of a team-switcher’s variance in the following season; that is over three times better than competing metrics.
Predictiveness, on the other hand, measures how well the metric predicts run production the following season. For this chart, we used wOBA to measure next season’s production:
Predictiveness of Team-Switchers, Year 1 to Year 2 wOBA (2010-2018); Normal Pearson Correlations
By accommodating outliers, we’ve traded one point of Reliability for one extra point of Predictiveness, a perfectly acceptable trade-off that amounts to a wash.
In sum, we’ve managed to better accommodate the outliers while maintaining what appears to be superior overall performance versus other batting metrics. That feels like a win-win.
Although these improvements are exciting news, there will always be room for further improvement.
The robustness of the outlier adjustment appears to be further challenged as we get further back into the past. For example, Roberto Clemente is still being shrunk a fair amount during some extreme seasons in the 1960s, and Stan Musial tends to jump around a bit from year to year in the 1950s. This may reflect time periods where the differences in talent were more extreme, or other competitive aspects (or a lack of competition) that ultimately made some performances less impressive.
Some eras may require even more focused priors to coax out the expected contributions of their outliers. As always, you should remember that, over the course of a career, a player’s raw stats—even for something like batting average—tend to be much more informative than they are for individual seasons. If a hitter consistently seems to exceed what DRC+ expects for them, then at some point you should feel free to prefer, or at least further account for, the different raw results.
In the meantime, the performance of DRC+ in measuring today’s game speaks for itself. As we walk through the code in the coming weeks, we’ll look forward to further feedback and insights that can continue to improve our new batting metric. To the extent readers see more opportunities for DRC+ to become even more sensitive to context, we hope you will let us know.
For more information about DRC+, our landing page can be found here.
 In re-running these benchmarks, we discovered a few team-seasons that had slipped out of the data set. Those teams have been restored and make little difference to the overall measurements. The correlations are still calculated through a multivariate regression in Stan.
 wOBA isn’t necessarily any better than OPS for objectively evaluating offensive contributions; DRC+ leads regardless of whether you look at Predictiveness through wOBA or OPS.