keyboard_arrow_uptop

This is the second in a series of articles explaining in depth the updated formulation of Deserved Run Average. The overview can be found here, and Part I of the in-depth discussion of the revised approach can be found here.

Call me Jonathan.

For most of this offseason, my (entirely metaphorical) White Whale was baseball’s run expectancy curve; the distribution, if you will, between the minimum and the maximum number of runs yielded by pitchers per nine innings of baseball. Why would something so seemingly arcane be so very important to me? Let’s start with some background on run expectancy.

In 2015, for pitchers with at least 40 innings pitched, their ERAs ranged from .94 (Wade Davis) to 7.97 (Chris Capuano). In more prosperous times, such as the 2000 season, pitcher ERAs at the same threshold ranged from 1.50 (Robb Nen) to 10.64 (Roy Halladay). For something more in the middle, we can turn to 1985, when a starter (!), Dwight Gooden, had the lowest ERA at 1.53, and Jeff Russell topped things off at 7.55.

Here’s what those seasons look like on a weighted density plot, side by side:

The density plot allows us to visualize different seasons side by side (and preferably in contrasting colors). From these curves, you can see two things. First, run environments of course can change a lot over time. The curve from the “Year of the Pitcher” in 1968 is from a different baseball world than the year 2000, when league expansion and (possibly) PEDs were driving historic amounts of run-scoring.

But there is one thing these curves all have in common: They converge and terminate at their minimums before 0. The absolute minimum amount of runs a pitcher can allow is . . . zero. No more and, more importantly for our purposes, no less.

So, back to the problem. My problem was negative run expectancy. How can something that is technically impossible be a problem? Because for pitcher run estimators, negative run expectancy is a problem. In fact, it has traditionally been one of their most irritating flaws: particularly in smaller samples, pitchers end up with predictions of “negative” runs allowed, even though that is not possible.

For example, at the moment I began writing this article, Hector Rondon had a negative Fielding Independent Pitching (FIP) score: -0.14 to be exact. Scott Alexander of the Royals had a SIERA of -1.00. And when it comes to DRA, Dellin Betances spent the first few months of last season in negative territory as he dominated the league.

This problem has two causes. The first is that pitcher run estimators subdivide baseball events into sets of positive and negative run values. To give one example, if a pitcher strikes a lot of guys out (multiply by -2), but issues far fewer walks (multiply by +3), and avoids giving up home runs (multiply by +13), his FIP will stay in negative territory. SIERA is more complex, but because it also has negative components, it appears to be similarly vulnerable. DRA had the same problem last year.

The second cause is that pitcher run estimators tend to assume the run-scoring is on what is known as a normal distribution: the familiar bell curve which swoops down symmetrically from its identical mean, median, and mode. The normal distribution is defined by its location (midpoint) and its scale (how wide it is). FIP, for example, makes the MLB-average ERA the location for its distribution and tries to approximate ERA’s scale by inflating the linear weight coefficients it applies to strikeouts, walks, hit-batsmen, and home runs. Last year, DRA did something similar by making its location the MLB mean of runs allowed per 9 innings (RA9) and its scale the standard deviation of MLB RA9.

What’s wrong with this approach? Let’s take the 2000 actual seasonal distribution for ERA, and compare it to what ERA would look like if it were actually a normal distribution, drawing from the same underlying mean and standard deviation:

As you can see the two distributions are somewhat different. They move closer together if we raise the minimum number of innings considered, but that doesn’t help us if we’re trying to characterize an entire season’s worth of performances, large and small.

That said, by itself this mismatch would be no big deal. There are very few perfect normal distributions, and on its face, this has “close enough” written all over it, at least for computational purposes. In fact, we’re going to take advantage of that “close enough” quality later on in this article.

But the problem comes when we look at the scale of the two distributions. Let’s stretch the x-axis out a bit in each direction, and thin the lines so you can distinguish better on the margins:

And here you can see the problem. Runs per 9 innings, as we discussed above, has an absolute minimum value of 0. Yet, the normally-distributed version of ERA—assumed by FIP, SIERA, and (last year’s) DRA—keeps going below 0. This is wrong, and it is not curable within the existing assumptions made by pitching run estimators. Run scoring is, quite simply, not normally distributed. It is more concentrated at its mean (higher peak), and it tapers off more quickly from that mean when it expands (shorter tails), meaning its values are less extreme in both directions than the normal distribution would expect.

If the normal distribution won’t work, then we need to find another one. And this was the ultimate whale I chased this offseason. There are hundreds of data distributions that we might use, and sadly, I studied a lot of them. But which one is the best distribution to fit seasonal ERA (and thus, RA9)? And once we find it, how do we work with it?

A rate of counts
We begin the inherent characteristics of runs scored. We know that they are always positive. We also know they can only be integers (1, 2, 3, etc.), so there are no decimals or fractions in between those integers. In the language of distributions, runs are considered to be “counts.” Most sports are scored in terms of counts (e.g., “goals,” “baskets,” “runs,” etc.), meaning that they tend to operate on a Poisson or, more commonly, a negative binomial distribution.

Many analyses tend to stop there. But we need to keep going, because what we care about for our purposes isn’t how many runs are scored, but how much time there is between runs being scored, with “time” measured by outs, as collected in innings, with those innings collected in groups of 9 to give us ERA or RA9.

So, what is the distribution that governs the frequency at which runs are being scored? I concluded that the answer is the gamma distribution. In addition to being the canonical distribution for the time between discrete events, the gamma distribution is also eminently suitable for our purposes: It fits both ERA and RA9 quite well, it can only be positive, and, because it represents a continuous rate rather than an integer count, it accommodates the decimals we need to distinguish the scoring rates of pitchers from one another as a practical matter. The gamma is not the only distribution we could use, but it is the simplest one that does the job properly.[i]

Gamma distributions can be parameterized in a few different ways; the method we will use employs a shape parameter and a rate parameter. Through non-linear optimization, using a package such as propagate in R, we can derive these parameters from the league ERA (or RA9) distribution in a given season. So, putting weights to one side for a moment, the gamma distribution for raw ERA in 1968 has a shape parameter of 9 and a rate of about 3. For 2014, it’s about 6.5 and 1.5. And for 2000, the shape is about 7 and 1.5. (As you raise the minimum number of innings for the pitchers involved, the distributions do compress a bit).

Working with the gamma
So, we’ve now got our target distribution, and I’ve become much more cheerful. But, we’re not out of the maelstrom yet, because now we have to calculate run-scoring on the gamma distribution. This, too, requires some creativity, because regressions directly performed on the gamma distribution, while technically possible, are computationally demanding and can be unstable. Even if you figure out a workable direct regression on the gamma, how do we get the 24 DRA models to run in an acceptable amount of time? Time is a particular concern for a sabermetrics website, because we update our data daily during the season. Due to the complexity of our models, cFIP and DRA may not be available each day for breakfast, but they really ought to be available by lunch.

Fortunately, there is an answer to this too. The solution goes back to our original diagnosis of the problem: the fact that ERA and RA9 look an awful lot like a normal distribution. Because ERA and RA9 are substantially distributed in a normal fashion, we can use the normal distribution (which is very speedy to employ) for our calculations, as long as we scale the results to the gamma in the end, something we can do from z scores off the normal distribution. In the R programming environment, I do this as follows:

1. Run the DRA component models assuming a normal distribution for each season.

2. Predict the aggregate number of “deserved” runs divided by the pitcher’s innings for a raw “deserved” runs per 9 innings, and convert the final values to weighted z scores;

3. Determine, through “leave-pitcher-out” bootstrapping, the approximate shape and rate characteristics of the RA9 gamma distribution for each season;

4. Convert the weighted z scores to the gamma distribution derived in step 3, thereby putting it on the proper scale, and creating what we publish as “DRA.”

Points 3 and 4 are where the rubber hits the statistical road. We’ll start with Point 3. For DRA, I chose to use pitchers above the median number of innings pitched for the season, which these days tends to be about 40 innings or so, and fit the distribution for the pitchers who remain. We also stick with last season’s curve until about June 1 or so of the following season, to make sure we have enough data to work with. Either way, we randomly resample, with replacement, the final seasonal statistics for all pitchers who fit these criteria to get a robust estimate of what the “true” distribution of RA9 is most likely to be. By resampling which pitchers get included, we get a sense of the “true” distribution without any particular combination of pitchers exerting undue influence over it. We do this 20 times and then average the shape and rate parameters from each resampling to get our final estimated parameters for a particular season’s RA9. (For 2015, we derived a shape parameter of 16 and a rate parameter of 4).

That brings us to Point 4. To run the scaling step in R, you take advantage of the fact that, through simple math, the gamma distribution can easily translate normally-distributed data to a defined gamma distribution. So, having derived our weighted z scores for each pitcher, and also derived our shape and rate parameters for the year in question, we give R the following command:

pitcher.final$DRA_final <- qgamma(pnorm(pitcher.final$z), shape=shape, rate=rate)

And that is how the metaphorical whale of RA9 was found: through the life buoy of the gamma distribution. As implemented in the updated DRA, it ensures that DRA will never again fall below zero, and will scale faithfully to the RA9 distribution. Similar applications may benefit other estimators as well.

Thanks, as always to the BP Stats Team for their review and feedback.

Bibliography

  Andrej-Nikolai Spiess (2014). propagate: Propagation of 
  Uncertainty. R package version 1.0-4.
  https://CRAN.R-project.org/package=propagate 
 
  Matthew Lundberg (2014). Transform variable to gamma distribution in R. 

R Core Team (2016). R: A language and environment for

statistical computing. Version 3.3.0. R Foundation for Statistical

Computing, Vienna, Austria. URL https://www.R-project.org/.



[i] The one thing the gamma distribution cannot do by itself is model intervals of 0, because there technically is no such thing. So, to include pitchers who only ever deserve to give up exactly 0 runs, the zeroes would have to be modeled separately as a binomial. These steps could be unified with a hurdle-gamma model, if desired.

You need to be logged in to comment. Login or Subscribe
crazy114
5/23
Fantastic work. Great mind
Tarakas
5/23
Interesting. I have no deep thoughts to add, but I wanted to note this was an interesting solution to this problem, and while technical, written up in an understandable way. Accessibility can be a real challenge in a piece like this, and you met the challenge as a writer.
bachlaw
5/23
Thank you; I tend to understand these concepts in a non-mathematical way and, while that has some limitations, I think it does make this easier to explain.
jdeich
5/23
Other metrics used by BP have similar issues with returning nonsensical results given valid inputs. For example, 44 players currently have a negative TAv. This list is "topped" by Hansel Robles (-.430) who had the misfortune of a GIDP in his only plate appearance. Since TAv's "job" is to report what a player's batting average would be if it captured all aspects of performance, it should never be below .000. (You don't need GIDPs for a negative TAv: Tim Adleman is -.010 by virtue of being 0-for-8 with 5 K. Otherwise you could argue that nothing but GIDPs is worse than striking out every time.) It may help to test BP's other metrics for limit behaviors. For example, would a team full of perfectly incompetent hitters (.000/.000/.000, 27 K/9) be projected to win negative games? Conversely, would a team full of Mike Trouts and Clayton Kershaws be projected to win 163+ games? Considering semi-ludicrous extreme cases should reduce your model's errors at the edges of plausible performance.
yibberat
5/23
Thank you for this. Been a long time since I've been in stats so it's good to see the thought process laid out and not just the calcs. Here's another story idea. My intellectual interest is the 'Rockies effect' on leaguewide (and Rockies) data. Their league-comparative data (park adjustment, etc) is always wrong because they are not a random outlier. Year-after-year in a 30 team league, their data WILL BE slightly 2+ standard deviations from mean - because that is the definition of 2 sigma (1/22 frequency * 30 teams = 1.36 with Rockies being the 1 plus some other team/outlier every three years or so). IOW - the rules of normal distribution stats define their compared-to-mean results more than their actual data does. I know the solution is - throw out the outlier and make your calcs on the remaining teams. Which makes the comparative stats more valid for the other teams - but the Rockies are 'thrown out' now. So how do you get the Rockies data back in - even if only for their own comparison to everything else?
mainsr
5/23
Thanks for a clarification/explanation, surpassed only by your 80-grade opening sentence.
mainsr
5/23
*the, not a