It happens every May. Someone on your favorite team is having an uncharacteristically good (or bad) year. This year, David Wright got his groove back, while his former teammate Jose Reyes lost his way. Edwin Encarnacion and Carlos Ruiz started hitting home runs for no apparent reason. For a while, Albert Pujols (!) was stuck in a very public home run drought. Early in the season, analysts and fans have learned to (properly) dismiss these runs as small sample size flukes. They’re something to keep an eye on, but… he'll be back to normal soon.
And then, it happens. “Well, you know, we’ve reached the point in the season where sample sizes start to become meaningful. Smith has now amassed 150 PA, so we can really start to believe in his performance…”
What amazes (or perhaps just amuses) me is that the reaction to this magical point in the season is that fans start panicking, whether the player has been doing well or poorly. Fans of a poorly performing Smith go from wondering when he’ll revert to form to panicking over the fact that “he’ll… never… be… the same… again.” Fans of the surprisingly good Jones fall into despair because he's a one-year contract guy who will probably ride this to a three-year deal elsewhere. Why can’t baseball fans ever be happy?
That’s usually the point where I hear my name. “You see, there was this study done about when different stats start to stabilize, and we’ve gotten to the point in the season where we can say that this new strikeout rate that we’ve seen actually is what we can start to expect…" (And for the record, I am guilty of this too!) It’s just that there’s one tiny little problem.
That's not what the study was actually about.
There’s a giant assumption built into the statement “We’ve reached the point of stability so this is what you can expect out of Smith from here to eternity,” or at least for the rest of the year. It assumes that baseball players have fairly static talent levels over the course of a season. That's not an awful assumption. It may not even be a false assumption. But it’s an assumption, and assumptions must be challenged.
I think it would be rather instructive to recap how I think we got to this point, but first…
Warning! Gory mathematical details ahead!
(I'd recommend reading this one. But if you want to skip this part, close your eyes and go to “What it all means.”)
In November 2007, when I was writing the initial study, I was chasing the answer to a very different question. At that time, a favorite trick of mine was to make up a new stat and see what it correlated with. Did my homemade plate discipline metric correlate well with strikeout rate? I would set a minimum number of plate appearances for including a player in my study, mostly because I had been raised on the fact that to qualify for the batting title, you had to have a minimum number of at-bats. It’s very simple logic. We can’t give the batting title to the September call-up who goes 1-for-2 and thus does Ted Williams 94 points better. Two at-bats isn’t a big enough sample size to tell what a player’s true talent is. Baseball has apparently known this for several generations.
At what point should I set that cutoff? DIPS theory had previously shown us that some stats were more reliable than others, so I assumed that the cutoff would vary by stat. Up until that point, I had been relying mostly on the “yeah, that sounds about right” method. But was there a more empirical way to do this? Thankfully, I was a grad student with the words “measurement reliability” still ringing in my ears.
Reliability isn’t all that hard to fathom. The idea is this: if there are two roughly equivalent measures of the same thing, then if someone does well on one, he’ll probably do well on the other. All I needed were two roughly equivalent measures of the same thing. Thankfully, baseball supplies oodles of plate appearances every year, perfect for split-half reliability. If I lined up each player’s plate appearances and put them into separate baskets according to whether each was even- or odd-numbered, I could generate two roughly equivalent samples of plate appearances for each player. Alternating back and forth meant that I would likely get a couple of PA from the same game in each basket. After a while, issues like whether a plate appearance was against a reliever (or the third time around against a starter) would balance themselves out, with some of each type in each basket.
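For readers who like to see the plumbing, here is a minimal sketch of the even/odd split described above. It is not the code from the original study. The function names (`pearson`, `split_half`, `split_half_correlation`) are hypothetical, and the players are simulated with made-up strikeout talents, so the printed number says nothing about real hitters.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def split_half(outcomes):
    """Split a chronological list of 0/1 PA outcomes into two baskets.

    Returns (rate in odd-numbered PA, rate in even-numbered PA).
    """
    odd = outcomes[0::2]   # PA 1, 3, 5, ...
    even = outcomes[1::2]  # PA 2, 4, 6, ...
    return mean(odd), mean(even)

def split_half_correlation(players):
    """Correlate the odd-basket rate with the even-basket rate across players."""
    rates = [split_half(p) for p in players]
    return pearson([r[0] for r in rates], [r[1] for r in rates])

# Hypothetical data: 300 simulated hitters, 120 PA each (60 per basket),
# each with a fixed, made-up strikeout talent between 10% and 30%.
rng = random.Random(0)
players = []
for _ in range(300):
    k_talent = rng.uniform(0.10, 0.30)
    players.append([1 if rng.random() < k_talent else 0 for _ in range(120)])

r = split_half_correlation(players)
print(round(r, 2))
```

Because the two baskets interleave in time, anything that changes from day to day lands in both of them, which is exactly the control the split-half method provides.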
By doing this I could answer the question "If you took Player X and gave him 60 PA, and then gave him another 60 PA in roughly the same circumstances, how well would his performance in each match up?" The rest was a matter of number-crunching to find out where the correlation crossed .70, which is the point where the R-squared reaches roughly 50% (.70 squared is .49). Past that point, the signal outweighs the noise. There will always be random variation driving results in baseball, but at r = .70, the sample size was big enough that we could assume that the ups and downs had evened out. I had my defensible number with which I could set my minimum inclusion criteria!
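The "number-crunching" step amounts to scanning sample sizes until the correlation first reaches the threshold. A rough sketch follows, assuming the correlations have already been computed for each sample size. The function name `first_crossing` is hypothetical and the correlation values in `example` are invented for illustration; they are not results from the study.

```python
def first_crossing(corr_by_pa, threshold=0.70):
    """Return the smallest PA count whose correlation meets the threshold.

    corr_by_pa maps a sample size (PA per basket) to the split-half
    correlation observed at that size. Returns None if nothing qualifies.
    """
    for n_pa in sorted(corr_by_pa):
        if corr_by_pa[n_pa] >= threshold:
            return n_pa
    return None

# Made-up correlations keyed by PA per basket (illustration only).
example = {20: 0.45, 40: 0.62, 60: 0.71, 80: 0.76}

cutoff = first_crossing(example)
r_squared = example[cutoff] ** 2  # share of variance the two baskets have in common
print(cutoff, round(r_squared, 2))
```

The threshold is a parameter rather than a constant because .70 is a convention, not a law of nature; a stricter researcher could pass a higher value and get a larger cutoff.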
(Now would be a good time to point out that at a stability point of .70, we assume that performance over that time period is still 50% noise. That's a lot of noise.)
The careful reader noticed the words “in roughly the same circumstances” in the question above. Those words make a very big difference. Let's take a closer look at the split-half method and what it does. Because some plate appearances from each day end up in both the even and odd baskets, we can assume that the batter’s general health is the same in both baskets. The same goes for any other daily situational variables that might be in there, like whether the batter got enough sleep the previous night. If he woke up one day and decided to do something different with his swing, some of the PA with the new approach would be in the even basket and some in the odd basket. It's a great way to control for confounds.
For example, last week I found that strikeout rate stabilized around 60 PA (using something a little different from, and better than, split-half, but for now it's functionally the same idea). That is, Player X's strikeout rate after 60 PA was (past tense) a reasonable reflection of his true talent during that timeframe. Whether or not the first 60 PA of a season predict anything about the next 500 is an entirely different question, with a somewhat different answer.
To find that answer, I isolated the first 300 PA for all players in 2011 who reached that level (n = 264) and split them into five consecutive samples of 60 PA each. (PA 1 through 60 were the first such sample, 61 through 120 were the second, etc.) If we indeed have a steady-state universe, the correlation between sequential blocks of 60 PA should be roughly .70.
K rate in the first 60 PA correlated with K rate in PA 61-120 at (drum roll, please) .483. If we keep going, block two and block three were at .467, three and four were at .533, and four and five were at .518. Call it a correlation of .50. I took an AR(1) intra-class correlation on all five time periods, and it came out to .50.
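The sequential-block comparison can be sketched in the same spirit. This is an illustration under stated assumptions, not the analysis behind the numbers above: the function names (`block_rates`, `adjacent_block_correlations`) are hypothetical, and the 264 hitters are simulated with a strikeout talent that never changes. In other words, the fake data is a steady-state universe by construction, so it shows the mechanics only and cannot reproduce the drop to .50.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def block_rates(outcomes, block_size=60, n_blocks=5):
    """Rate in each consecutive block: PA 1-60, 61-120, and so on."""
    return [
        mean(outcomes[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]

def adjacent_block_correlations(players, block_size=60, n_blocks=5):
    """Correlate block 1 with block 2, block 2 with block 3, etc., across players."""
    rates = [block_rates(p, block_size, n_blocks) for p in players]
    return [
        pearson([r[i] for r in rates], [r[i + 1] for r in rates])
        for i in range(n_blocks - 1)
    ]

# Hypothetical data: 264 simulated hitters with 300 PA each and a
# fixed, made-up strikeout talent (no day-to-day change at all).
rng = random.Random(1)
players = []
for _ in range(264):
    k_talent = rng.uniform(0.10, 0.30)
    players.append([1 if rng.random() < k_talent else 0 for _ in range(300)])

corrs = adjacent_block_correlations(players)
print([round(c, 2) for c in corrs])
```

The only difference from the even/odd split is how the PA are grouped: consecutive blocks keep each day's circumstances inside a single block instead of spreading them across both samples.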
That may not seem like a big deal at first blush, but let me phrase it this way: the R-squared has been halved simply by removing the controls that the split-half and KR-21 (Kuder-Richardson Formula 21) methods naturally afforded us. In our sequential analysis, we control for the fact that the PA are taken by the same hitter, but we do not control for any sort of daily circumstantial variables. In this case, those circumstantial variables are half (!) the story.
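The arithmetic behind "halved" is short enough to write out. The subscripts are labels chosen here for illustration, and .50 is the rounded sequential correlation reported above:

```latex
\[
r_{\text{split-half}}^{2} = 0.70^{2} = 0.49,
\qquad
r_{\text{sequential}}^{2} = 0.50^{2} = 0.25,
\qquad
\frac{0.25}{0.49} \approx 0.51
\]
```

So the sequential blocks share only about half as much variance as the interleaved baskets do.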
What it all means
The generally accepted "stability numbers" chart is a good chart for researchers who are doing retrospective research. I think it's also a good one to look at for understanding which stats stabilize more quickly than others, which can show us some interesting truths about those stats. However, I would kindly point out that the numbers are not nearly as powerful in predicting future performance as people seem to believe.
I think that the bigger implication is for what I like to call the "steady state" model of baseball players. Baseball fans have a nasty habit of thinking of players as static objects. Consider the ways in which fans speak about players. (A general rule of life: the way in which someone speaks is far more instructive about their thoughts than their actual words.) Consider the phrase "He's a 25-homer guy." Usually, there's an unspoken "per year" at the end of that sentence, because that's how baseball fans reckon time. There are reasons for that, but it comes with a cost. It assumes that our "25-homer guy" is a 25-homer guy in April and still a 25-homer guy in August. Again, this may or may not be true. The problem is that by denominating time per year, we ignore the fact that a baseball player lives a day-to-day life. He has bad days and good days in areas that have nothing to do with baseball but can certainly affect his performance. More than that, he grows and develops over time. In fact, this is why teams have coaches.
Let's go back to our hitter after his first 60 PA. Consider that over those first 60 PA, our hitter has gotten a little older and hopefully wiser. Or maybe just more tired. Or maybe he picked up a bad habit. He may have tinkered with something in his swing and decided to keep it. He may have tinkered with something in his daily routine and decided to keep it. Over the next set of 60 PA, you could be dealing with a very different set of circumstances, even if it's the same man.
There's one other implication that I think is rather important. I've been searching for the point where performance is more reflective of talent than luck/random variation. At the stability point of .70, this means that there is roughly a 50-50 split between the two. Luck can't be controlled, so we will leave it aside. But it looks like "talent" is, in some substantial part, a function of circumstances that surround a player day-to-day. The cost of focusing so much on the year as a unit of time is that it overlooks all of these daily factors.
I have to imagine that some of those circumstantial variables will be out of a player's or a team's control as well (the quality of the other team comes to mind). Some of the other variables that are in play may also be hard to measure (get creative, people!). But some of them might be things that can be both measured and changed. I think that sabermetrics has been operating on the assumption that these variables don't much matter. Circumstances do matter. A lot.
And yes, while we've gotten to "that point in the season," if your favorite player just hasn't been himself lately, don't worry too much. Things can change.