Quick, which player had the greatest change in on-base percentage from 2011 to 2012? Did you say Houston Astros pitcher Aneury Rodriguez? In 2011, Rodriguez went 0-for-9 with two sac bunts. In 2012, Rodriguez appeared in only one major-league game, but he came to the plate once and got a hit. Rodriguez went from a seasonal OBP of .000 to 1.000. It doesn't get bigger than that.
But then you might have already recoiled at the thought of using sample sizes of nine PA and one PA in any sort of analysis. At those sample sizes, OBP isn't stable enough to mean anything at all. Especially, y'know… one PA. Can we say anything about whether his performance changed? How would we know?
There are several statistical problems that get in the way in these sorts of analyses. It's easy enough to take this year's OBP and subtract last year's and get the difference between the two, but what sort of context should we place around it? For one, some statistics naturally vary more from year to year than others do, because of issues of statistical reliability. If a pitcher's BABIP goes from .250 to .350, that is certainly unfortunate, but there's an understanding that because the stat is so unreliable, huge swings are to be expected. If a hitter's OBP swung by a similar amount, it would be cause for investigation.
Then there's the matter of sample size. We instinctively treat Aneury Rodriguez with caution because of the small sample size. But had he jumped from .000 to 1.000 with 500 PA in both seasons, people would probably skip the PED investigation and just investigate Rodriguez's ties to the dark arts.
Finally, there's a question of cut-offs. What "counts" as enough of a jump (or drop) in OBP to warrant further explanation? Is it a certain number of points? (20 points? 30?) Perhaps a certain percentage change from year to year? And of course, what happens to the guy who adds 19 points? He gets tossed into the "didn't change" bin? Where do these numbers even come from, other than just being pulled out of the air?
There needs to be some sort of sensible framework for identifying players who made changes from year to year. Let's see what I can do. As always, if you just want to see the list, skip the next section, although this week, the whole point is the gory stuff.
Warning! Gory Mathematical Details Ahead!
We're going to be using a technique known as the reliable change index. For the statistically initiated, the best way to think about it is marrying a paired-samples t-test with a one-sample z-test. The formula looks a lot like a paired-samples t-test, but we'll be using population estimates of some key parameters, rather than the sample mean/standard deviation/correlation coefficient.
I'll be using OBP as an example, but any statistic can be used. My goal is to show the method, and you can all do this at home with whatever stat you feel is important.
Here's the formula, with the eventual result being a z-score (and yes, it'll map onto the z-distribution).
z = (OBP2012 – OBP2011) / error term.
This is a pretty standard form for an equation, with difference over error term. The numerator is easy enough. It's just the raw difference in OBP (or whatever) from 2011 to 2012. The denominator is going to be our determinant of "enough."
About that error term. The amount of change that we might expect (one might say a standard error of change) is going to be related to two major factors: the general spread (standard deviation) that we might expect in the population for that measure and the reliability of the measure. Both of these are going to be impacted by the sample size that underlies them. So, we're going to need reliability numbers and standard deviations for OBP at sample sizes from one to 740 (Derek Jeter had 740 PA this past year, leading baseball). And when I say that, I mean we'll need estimates for one PA, two PA, three PA, four PA, five PA, etc.
I went up to 800, just to make it a round number, and got Kuder-Richardson formula 21 reliability coefficients for all 800 steps on that ladder. To do so, I used data from 1993-2011, and used all hitters who had logged at least 1600 plate appearances during that time (2 * 800). There were 389 such hitters. For standard deviation, I needed only those who had a minimum of 800 PA (n = 574) within those years. I calculated the standard deviation of OBP after all of those hitters had logged one PA, then two, then three, all the way to 800.
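For anyone trying this at home, here is a minimal Python sketch of a KR-21 calculation at a single PA count. It assumes each plate appearance is treated as a 0/1 "item" (reached base or not) and that the variance is the population variance of the hitters' on-base totals; both are my assumptions, not details given above. The function name and the four-hitter toy input are invented for illustration and are not the 1993-2011 data set.

```python
from statistics import mean, pvariance

def kr21(k, totals):
    """Kuder-Richardson formula 21 for k binary items.

    totals: one number per hitter, the times he reached base in his first k PA.
    Requires k >= 2 and a non-zero variance among the totals.
    """
    m = mean(totals)
    var = pvariance(totals)  # assumption: population variance of the totals
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))

# Toy input (invented): four hitters' times on base after 10 PA each.
toy_totals = [2, 4, 6, 8]
print(round(kr21(10, toy_totals), 4))
```

Repeating the call for each PA count, with totals taken from each hitter's first k PA, would give one reliability estimate per step on the ladder.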
Our standard error is given by this formula.
standard error = SD * sqrt(2) * sqrt(1 – reliability)
In 2011, Brandon Belt came to the plate 209 times and logged an OBP of .306. After 209 PA, OBP has a reliability of .5516 and a standard deviation of .04078. The expected error in this situation is .038614.
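As a quick check on that arithmetic, here is a short Python sketch that plugs Belt's 2011 numbers into the standard error formula. The function name is mine, and the inputs are the rounded values quoted above, so the last digit or two can differ slightly from a calculation done on unrounded figures.

```python
import math

def standard_error_of_change(sd, reliability):
    # SD * sqrt(2) * sqrt(1 - reliability), as in the formula above
    return sd * math.sqrt(2) * math.sqrt(1 - reliability)

# Belt, 2011: 209 PA, reliability .5516, standard deviation .04078
belt_2011 = standard_error_of_change(sd=0.04078, reliability=0.5516)
print(f"{belt_2011:.4f}")
```

Note how the two inputs pull on the result: a wider spread (larger SD) or a less reliable measure (reliability closer to zero) both inflate the expected error.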
Now, we need to pool these error terms, and the standard form (weighted pooling) for doing that is given by the formula:
If you plug in Belt's numbers, you get .03539. That pooled term is the error term that we'll stick in the denominator of our original equation.
For Belt, we get (.360 – .306) / .03539 = 1.53
You can interpret that like a standard z-score with the attendant p-value (in this case, 0.12). In the language of hypothesis testing, we can ask, "Did Brandon Belt's OBP performance change from 2011 to 2012?" If we're using the usual cut-off of p < .05, then we fail to find evidence for the hypothesis that his performance changed. There's still too great a chance that Belt's considerable (54 points!) jump in OBP was simply the result of random variation. Your alpha level—and willingness to use hypothesis testing—may vary.
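If you want to reproduce the last step for your own stat of choice, a rough Python sketch follows. It takes the pooled error term for Belt as given (.03539) rather than deriving it, and it assumes a two-sided test against the standard normal distribution; the function name is hypothetical.

```python
import math

def two_sided_p(z):
    # P(|Z| >= |z|) for a standard normal Z, via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# Belt: OBP of .306 in 2011 and .360 in 2012, pooled error term of .03539
z = (0.360 - 0.306) / 0.03539
p = two_sided_p(z)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these rounded inputs the z-score comes out at about 1.53. Under the same assumptions, a change clears the p < .05 bar only once the absolute z-score passes roughly 1.96.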
To show how important the sample size issues can be in this sort of analysis, I present two cases: Jeff Keppinger and Steve Pearce. Keppinger's 2011 OBP was .300 in 400 PA. In 2012, his OBP rose to .367, while he logged 418 PA. His raw change in OBP was 67 points. Factoring in those sample sizes (400 PA and 418 PA), the z-score associated with this change was 1.98. Steve Pearce, on the other hand, had a similar jump in OBP (.260 to .328, for a total of 68 extra points), but did so in samples of 105 PA in 2011 and 188 PA in 2012. Despite having a slightly higher raw difference in OBPs, his smaller sample sizes increased the uncertainty, and he settled into a z-score of 1.48.
One rather interesting effect was that players who had logged only a few PA in one of the years, and who consequently had extreme scores in that year (e.g., Darin Mastroianni went 0-for-3 in 2011, but had an OBP of .328 in 186 PA in 2012), were likely to show up as having large z-scores. This brings up an important point in interpreting these z-scores. The question that's being answered is not whether I believe (to within a certain amount of error) that Darin Mastroianni really went from having a true OBP talent level of .000 to a true talent level of .328. I'm merely pointing out that his performance (n.b., not the same thing as true talent level) changed in some meaningful way. Whether or not we believe that his true talent level has changed will depend on how much we believe that his performance represents his underlying true talent.
But, to weed out some of these statistically noisy cases, I've limited myself to reporting on players who had a minimum of 100 PA in both 2011 and 2012.
Which player had the most statistically significant improvement from 2011 to 2012? Ladies and gentlemen, Tyler Colvin, who had a .204 OBP in 222 PA in 2011, but in 2012, hit (and walked) his way to a .327 OBP in 452 PA, for a z-score of 3.45. In second place was some guy named Mike Trout (z = 3.41). I finally gave in and admitted that Trout was second to someone, but I don't think I expected it to be to Tyler Colvin.
Now, let's talk about what these findings can and can't do. First, they can't tell you why the player showed such improvement (or regression). But now that there's a good solid statistical basis for identifying which players' performance changed, further research can look at whether there are common factors that might predict future improvements. Second, these findings do not tell you whether the player will maintain his improvement (or regression). There are undoubtedly some guys who just got really, really lucky (that's the whole point of having an alpha level) in this sample, and their uppance shall come. Distinguishing the folks who will hold onto their performance vs. those who will regress is another area for investigation.
But what these findings can do is set up a statistically based framework for identifying which players made big gains over the course of a year, and that's a good place to start.