January 9, 2017
Passed Balls and Wild Pitches (Again)
About this same time last year, I was in the midst of a trial in West Virginia when I got to thinking about wild pitches, as one does. In doing so, I realized that modeling passed balls and wild pitches as simple binomials—as we had been doing—did not fit the data as well as it should. To address the problem (or so I thought), I tweaked the parameters, recognized that a Poisson distribution seemed to be a better fit, and remodeled them accordingly.
However, in reviewing those revised numbers after this season, Harry Pavlidis and I came to the same conclusion: our predicted numbers were still not quite right. Specifically, they were too low. In raw numbers, catchers tend to be worth as much as plus or minus five runs a season when it comes to blocking, but our models were giving them credit for only about one or two runs above or below average.
Why were our models still underestimating the value of pitch blocking? The answer is that wild pitches follow an even more complex distribution than I had thought. Specifically, what I had decided to be a simple Poisson distribution was in fact a mixture distribution. Mixture distributions, in turn, require a more sophisticated approach.
To understand mixture distributions, we need to start with non-mixture distributions and work our way up. The most famous probability distribution, typically described as the normal distribution, or “bell curve”, looks like this:
Many phenomena in nature approximate a normal distribution. The most distinguishing feature of the bell curve is that it has one peak and gradually slopes off to both sides. This is indicative of an activity being driven by a single process, whether it be samples of a group’s blood pressure, errors in measurement, or something else entirely. Most statistical models (including those on this website) use single-process distributions of some kind. This is because the things we care about tend to accumulate toward a peak, and settle back to rest.
But what if there is more than one process at work? Then, you have to make adjustments. Sometimes, multiple processes are evident from the data. Take, for example, the geyser Old Faithful at Yellowstone Park. Although this landmark is famous for its reliable eruptions, Old Faithful actually operates on two overlapping schedules. Plus or minus 10 minutes, Old Faithful will erupt either 65 minutes or 90 minutes after a previous eruption. This gives Old Faithful a mixture of two distributions:
The resulting plot has two humps rather than one, providing clear evidence of a “mixture” distribution. To be more specific, there are two normal distributions that combine to create a schedule that is even more remarkable than we would expect.
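For readers who want to see the two-hump shape emerge from the arithmetic, here is a minimal Python sketch of a two-component normal mixture. The 65- and 90-minute centers come from the description above; the mixing weights (0.35 and 0.65) and the 5-minute standard deviations are assumptions chosen only for illustration, not fitted values.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a finite normal mixture.

    components is a list of (weight, mean, sd) tuples whose weights sum to 1.
    """
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)

# Centers are from the text; weights and standard deviations are assumed.
OLD_FAITHFUL = [(0.35, 65.0, 5.0), (0.65, 90.0, 5.0)]

# Crude text histogram of the mixture density, one row per 5 minutes.
for minutes in range(50, 106, 5):
    density = mixture_pdf(minutes, OLD_FAITHFUL)
    print(f"{minutes:3d} min  {'#' * round(density * 1000)}")
```

The density is higher at each center than at the midpoint between them, which is exactly the dip that separates the two humps.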
Sometimes, though, the mixtures are more subtle. There may be only one bump on the plot, yet other processes can still be operating in the background and driving the picture you actually see. Today, we’ll talk about what I call “threshold” mixtures: hidden processes that make the events of interest appear less often than expected. These threshold processes are rather sneaky: if you don’t catch them, your models will end up under-counting the events of interest (like, say, a bunch of wild pitches).
So, let’s talk about some threshold mixtures. As a special bonus, we’ll even finally start talking about baseball. We’ll begin with an easy example: Runs allowed per 9 Innings, or RA9. In 2016, the distribution of RA9 looked something like this:
At first glance, this distribution seems pretty straightforward. There is one bell curve-ish hump, starting from around zero (the minimum possible runs allowed) and peaking once before making a gradual descent.
So, how many processes would you guess underlie a pitcher’s RA9? Well, given the graph, I think most of you would say “one.” What would that singular process be? Most obviously, the quality of pitching. Certainly there are other factors that affect scoring—ballparks, weather, opposing batters—but those are confounders, not processes. Runs do not score themselves, even at Coors Field. Rather, the runs-allowed process involves pitchers giving up (or not giving up) run-scoring events, and all other things being equal, better pitchers give up fewer runs per 9 innings than bad pitchers.
Straightforward, right? And yet, by RA9, here were the best team stints by pitchers in baseball last year:
One possible reaction: “Wow! Look at these guys! The best RA9s possible and yet not a single Cy Young vote among them! The BBWAA strikes again!”
But the more sensible reaction would be to request one additional column:
As many of you probably expected, none of these folks pitched more than a few innings. None of them is viewed as a top pitcher in baseball, nor should they be. Rather, the likelihood is that, given a few more opportunities, these pitchers would start giving up runs at a rate more commensurate to their true ability, which itself is reflected by the number of innings these teams were willing to give them in the first place.
The point is that there is a second process underlying RA9 that generates more zeroes (here, scoreless innings) than we would expect given the quality of pitchers who are on the mound. In the language of statistics, there is a “hurdle” that pitchers and batters need to get over before RA9 bears a reasonable resemblance to how well a pitcher is actually throwing. Said hurdle is that scoring typically requires a sequence of events, and thus even a mediocre pitcher can sometimes get away with no damage for a few innings if the runners never make their way around third base. Eventually, though, the hurdle gets crossed for everybody and runs start crossing the plate. RA9, despite its singular bell curve, thus also can be modeled as a threshold mixture distribution.
Wild Pitches: A Zero-Inflation Problem
So, RA9 gives us an example of two distinct processes, one of which is generating excess zeroes. But what if there is just one process at work and you still get excess zeroes? Then we have so-called “zero-inflation.” It so happens that zero-inflation is what was messing up our wild pitch models.
How is the distribution of wild pitches zero-inflated? One piece of evidence I cited last year for treating errant pitches as counts was that their respective means and variances were close to each other, a canonical Poisson trait. What I failed to appreciate was that the variances of both wild pitches and passed balls were actually less than their respective means, which is somewhat unusual. Zero-inflation is not the only possible explanation for underdispersion (which is rare in any event), but with a maximum possibility of one errant pitch at a time, any underdispersion in this context by definition involves extra zeroes.
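The mean-versus-variance comparison is easy to run yourself. For an event that can happen at most once per pitch with probability p, the mean is p and the variance is p(1 - p), so the variance-to-mean ratio lands just under the Poisson benchmark of 1. Here is a minimal Python sketch; the counts (4 wild pitches in 1,000 pitches) are made up for illustration and are not the actual pitch data.

```python
def mean_and_variance(counts):
    """Return the mean and population variance of a list of counts."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n
    return mean, variance

# Synthetic example: 1,000 pitches, 4 of which were wild (made-up numbers).
pitches = [1] * 4 + [0] * 996
mean, variance = mean_and_variance(pitches)
dispersion = variance / mean  # exactly 1.0 would match a Poisson

print(f"mean={mean:.6f} variance={variance:.6f} ratio={dispersion:.4f}")
```

A ratio close to but below 1 is the pattern described above: nearly Poisson at first glance, but slightly underdispersed.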
Statistically speaking, we can test for zero-inflation using what is known as a Vuong test. A Vuong test compares a zero-inflated Poisson model to a plain-old Poisson model with the same parameters. The Vuong test does not find substantial zero-inflation with passed balls, once we control for knuckleballs. In other words, using an ordinary Poisson model, passed balls generally occur with the infrequency we would expect (at least when Tim Wakefield is not pitching).
Wild pitches, though, are a different story. If you take all 2016 wild pitch opportunities, run both our existing wild pitch model and then a zero-inflated model, and then execute the Vuong test, the results are striking:
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishable)
              Vuong z-statistic             H_A    p-value
Raw                   10.506882 model1 > model2 < 2.22e-16
AIC-corrected         10.292709 model1 > model2 < 2.22e-16
BIC-corrected          9.193909 model1 > model2 < 2.22e-16
Model 1 is the zero-inflated option; model 2 is the existing Poisson option. These results show that zero-inflation is almost certainly present.
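To make the comparison less of a black box, here is a rough Python sketch of what the raw Vuong statistic measures. It is not our model and not the pscl code: it fits intercept-only Poisson and zero-inflated Poisson models (no covariates) to simulated counts, with the zero-inflated fit done by a simple expectation-maximization loop. The simulated inflation rate of 0.3, the Poisson rate of 2.0, the seed, and every function name are assumptions for illustration. The AIC and BIC corrections are omitted.

```python
import math
import random

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def zip_logpmf(k, pi, lam):
    """Zero-inflated Poisson: an extra zero occurs with probability pi."""
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log(1.0 - pi) + poisson_logpmf(k, lam)

def fit_poisson(ys):
    return sum(ys) / len(ys)  # the maximum-likelihood rate is the sample mean

def fit_zip(ys, iterations=200):
    """Fit an intercept-only zero-inflated Poisson by EM."""
    n, total = len(ys), sum(ys)
    zeros = sum(1 for y in ys if y == 0)
    pi, lam = 0.5, max(total / n, 1e-6)
    for _ in range(iterations):
        # E-step: share of observed zeroes attributed to the inflation part
        p_extra = pi / (pi + (1.0 - pi) * math.exp(-lam))
        expected_extra = zeros * p_extra
        # M-step: update the mixing weight and the Poisson rate
        pi = expected_extra / n
        lam = total / (n - expected_extra)
    return pi, lam

def vuong_z(loglik1, loglik2):
    """Raw Vuong z-statistic; positive values favor model 1."""
    diffs = [a - b for a, b in zip(loglik1, loglik2)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    return math.sqrt(n) * mean / sd

def sample_poisson(rng, lam):
    """Knuth's multiplication method, adequate for small rates."""
    limit, k, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

# Simulated counts with known zero-inflation (assumed parameters).
rng = random.Random(2016)
TRUE_PI, TRUE_LAM = 0.3, 2.0
data = [0 if rng.random() < TRUE_PI else sample_poisson(rng, TRUE_LAM)
        for _ in range(5000)]

pi_hat, lam_hat = fit_zip(data)
lam_plain = fit_poisson(data)
z = vuong_z([zip_logpmf(y, pi_hat, lam_hat) for y in data],
            [poisson_logpmf(y, lam_plain) for y in data])
print(f"pi={pi_hat:.3f} lambda={lam_hat:.3f} vuong_z={z:.2f}")
```

The statistic is just the average per-observation gap in log-likelihood between the two models, scaled by its standard error. A large positive value means the zero-inflated model explains the data consistently better, which is the pattern in the output above.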
What is the source of this zero-inflation? Recall that zero-inflation means that the same process driving wild pitches in general is also driving the excess zeroes. The process generating wild pitches is, obviously, pitchers throwing to catchers at locations a catcher struggles to stop. Why would this process generate zero-inflation? My guess is that, in the major leagues at least, it's because these players are the best in the world at what they do. Pitchers who cannot locate pitches do not stick around, and catchers who cannot block pitches need to find another position. Pitchers and catchers are still human, which is why wild pitches still happen. However, I suspect that if you compared wild pitch rates at lower levels of baseball competition, and certainly with ordinary people off the street, zero-inflation would be less of a problem.
Modeling Zero-Inflation (skip this if you hate math, statistics, or both)
How do we model zero-inflation? In the R programming environment, the consensus choice is the pscl package, which efficiently tackles zero-inflation and hurdle models for data involving counts. The pscl package also provides the Vuong function to help diagnose when a data set presents a threshold mixture problem.
In terms of the fit itself, there are a few wrinkles. First, pscl does not support multilevel modeling, which as you know is our preference. But, the coefficients we use—and they are the same ones we laid out last year—essentially describe all of the variance that would otherwise be attributable to players, so there is little harm done. Second, pscl is limited to count distributions. If you require something more elaborate (such as a zero-inflated binomial), you’ll need to try something else.
Finally, because players do not have their own intercepts, as is our usual practice, the with-or-without-you (WOWY) calculation has to be adapted. Because our chosen covariates are all extrinsic to catchers, the catcher WOWY calculation is the difference between the probability of the actual event and the predicted event, which is then averaged, and multiplied by a linear weight and the number of errant pitch opportunities to determine the final run value. On the other hand, all of our chosen covariates (wild pitch likelihood and knuckleball) are intrinsic to pitchers. You can't just follow the same calculation as with catchers or you would be "controlling" for all the factors that actually distinguish pitchers from one another when it comes to errant pitches. Thus, the pitcher WOWY calculation becomes the difference between the predicted likelihood and the likelihood for a pitcher with league-average performance on each of those covariates. The results are then averaged, multiplied, and valued as with the catcher tabulations.
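The two WOWY variants can be sketched in a few lines of Python. Treat this strictly as an illustration under stated assumptions: the run weight of 0.27, the coefficient values, the log-link form of the rate, the function names, and the sign convention (positive means runs saved) are all hypothetical, and none of them are taken from our actual models.

```python
import math

RUN_WEIGHT = 0.27  # hypothetical run value of one errant pitch

def catcher_blocking_runs(actual, predicted, run_weight=RUN_WEIGHT):
    """Catcher WOWY: average (predicted - actual), scaled by opportunities.

    actual holds 0/1 outcomes per opportunity; predicted holds the model's
    probability of an errant pitch on that opportunity. Positive values mean
    runs saved (an assumed sign convention).
    """
    n = len(actual)
    avg_diff = sum(p - a for a, p in zip(actual, predicted)) / n
    return avg_diff * n * run_weight

def predicted_rate(coefs, covariates):
    """Errant-pitch rate built coefficient by coefficient (assumed log link)."""
    linear = coefs["intercept"] + sum(
        coefs[name] * value for name, value in covariates.items())
    return math.exp(linear)

def pitcher_errant_runs(coefs, pitches, league_average, run_weight=RUN_WEIGHT):
    """Pitcher WOWY: league-average baseline rate minus the predicted rate."""
    n = len(pitches)
    baseline = predicted_rate(coefs, league_average)
    avg_diff = sum(baseline - predicted_rate(coefs, p) for p in pitches) / n
    return avg_diff * n * run_weight

# Hypothetical coefficients and covariate values, for illustration only.
coefs = {"intercept": -6.0, "wp_likelihood": 1.0, "knuckleball": 0.5}
league_average = {"wp_likelihood": 0.0, "knuckleball": 0.02}
careful_pitcher = [{"wp_likelihood": -0.5, "knuckleball": 0.0}] * 3000

print(catcher_blocking_runs([0, 0, 1, 0], [0.1, 0.2, 0.3, 0.4]))
print(pitcher_errant_runs(coefs, careful_pitcher, league_average))
```

The key design difference is the baseline. The catcher is compared with what the model expected on his actual pitches, while the pitcher is compared with what a league-average pitcher would have been expected to do, since the covariates themselves are the pitcher's contribution.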
The Revised Errant Pitch Numbers
The nice thing about the zero-inflated wild pitch model is that it solves our underestimation problem; because the majority of pitch blocking value arises from wild pitches, solving that problem returns our pitch blocking predictions to the ranges that we actually see over the course of a baseball season.
How do we prove this? By using the same combination of metrics we have been using to validate Deserved Run Average (DRA): (1) correlation of predicted values to raw values, (2) year-to-year consistency of predicted values, and (3) accuracy of next year’s predicted raw values. These tests are objective and can be run by anybody who downloads our posted values for any metric. The tests are particularly useful because you don’t have to understand or recreate our models to see when they are producing good and useful results.
For this proof, we took all catcher seasons from 2010 through 2016, weighted by the chances each catcher had to experience an errant pitch of some kind. In categories (2) and (3), the new models performed basically the same as the old ones. But the zero-inflated wild-pitch model makes a clear difference in category (1), improving the Spearman correlation on our blocking runs from .72 to .81 and the Pearson correlation from .57 to .87. The former means we are now more accurately reflecting the natural order of catchers, and the latter means that we are better approximating the actual range of catcher value, which is ultimately what we care about for catcher (and pitcher) valuation. Because those improvements come at no cost to the other measurements, these changes are a net positive and we will use them going forward.
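Anyone who downloads the posted values can run the category (1) check with a few lines of code. Below is a minimal Python sketch of the two correlations. It is unweighted, whereas our comparison weighted each catcher season by errant pitch chances, and the five data points are made up purely to show how the two measures can diverge.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up values: same ordering, but the relationship is not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
raw = [1.0, 4.0, 9.0, 16.0, 25.0]

print(f"spearman={spearman(predicted, raw):.3f}")
print(f"pearson={pearson(predicted, raw):.3f}")
```

In this toy example the ordering is perfect, so the Spearman correlation is 1, while the Pearson correlation falls short of 1 because the values do not line up on a straight line. That is the distinction drawn above between getting the order of catchers right and getting the spread of their values right.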
Here is the difference between the previous values and some of the updated values, for catchers on both ends of the spectrum this past season:
Suffice it to say that the new values make clear that pitch blocking, while not producing value in the range of pitch framing, is nonetheless an area where runs are being gained and lost. Of particular note is the high rating of Josh Thole. Thole is commonly asked to catch knuckleballers, putting him in a particularly difficult position when it comes to preventing errant pitches. However, by controlling for this challenge, we find that Thole is, on balance, a very good blocking catcher, which explains why he is a strong choice to catch knuckleballers in the first place.
I strongly suspect that the significance of threshold mixtures does not stop with errant pitches. As Statcast and biometric measurements continue to rise in importance, it is likely we will discover further areas where threshold mixtures have the potential to mask the true value of certain baseball plays. The framework we have set up here should make those future challenges much easier to recognize and solve.
Chris Fraley, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
 Indeed, one might say that RA9 must be a mixture distribution because neither of the common distributions that fit it best — the gamma or the log-normal — even allow for zeroes.
 This is bad news for those who have become overly reliant on R’s handy “predict” function to specify all of their model predictions. Correct WOWY calculations involving intrinsic player characteristics require a coefficient by coefficient compilation of the baseline probabilities.