June 28, 2013
The Mystery of the Missing .500 Teams
Sometimes, baseball research happens because you go out looking for something and you find it. Other times, it happens because you go off looking for something else and you trip over something far more interesting. This is the latter. While looking through historic team records for another project I was working on, I came across an interesting puzzle—there were far fewer teams exactly at .500 than I would have expected. I thought maybe it was a wacky feature of the sample set I was using, but I expanded my search to nearly 50 years of Major League Baseball, and the same puzzle was still staring me in the face. So I was left with three questions: Was what I was seeing really there? Why was it happening? And what did it mean?
One of the best parts of working at Baseball Prospectus is the ability to pester the staff email list with really bizarre questions. Some people use this power to ask questions to which they don’t know the answer. Those people are probably much better liked than I am by the other staffers. I, instead, ask questions to which I already know the answer and request that people make wild guesses without doing any research first. I do this because sometimes when I’m looking at data, it helps me to get an unbiased perspective on what someone might expect the data to look like. But to get that, you need to ask people who haven’t seen the data, because once you’ve been staring at the data for too long you expect the data to look like the data.
So here’s the question I posed to the staff list:
Exactly 162 games doesn’t really affect much of anything except that it limits the number of possible responses—you know that everything has to be in increments of roughly .006. Please, before you go any further, think about it yourself—go on gut, don’t look up anything. Okay? Got it? Good, let’s proceed.
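As a quick check on that increment, here is a minimal Python sketch (the helper name `win_pct` is made up for illustration). With exactly 162 decisions, each additional win moves a team's winning percentage by 1/162, a little more than .006, so only 163 distinct records are possible.

```python
GAMES = 162  # every team in the sample played exactly this many games

def win_pct(wins, games=GAMES):
    """Winning percentage rounded to three places, the way standings show it."""
    return round(wins / games, 3)

step = 1 / GAMES                                   # value of one win, about .006
possible = [win_pct(w) for w in range(GAMES + 1)]  # 0-162 through 162-0

print(f"one win is worth {step:.4f}")
print(f"{len(possible)} possible records")
print(f"81-81 is {win_pct(81):.3f}; 79-83 is {win_pct(79):.3f}")
```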
Now, I suspect that if I could ask the question in such a way as to get an immediate response—that is, without giving you time to think that boy, isn’t it odd that he’s asking this question—the most common answer would be .500. It’s a pretty reasonable guess, since .500 is both the mean and the median for team win percentage. Of course, I can’t ask the question without asking the question, and the very fact that I’m asking the question lets on that I’m looking for a somewhat less obvious answer.
Interestingly, the answers I got from the staff at first were uniformly low. Harry Pavlidis guessed .488 right out of the gate, which is actually the fourth-most-common win percentage for teams that have played exactly 162 games. And there’s actually a compelling logic to the idea that it’s easier to be below .500 than above it. It’s even true, to an extent: the worst team record in the sample I tested, .250, is farther from .500 than the best record, .716.
It’s a good theory. But the most common team record for teams with 162 games played is actually above .500:
(I went out to 11 entries because there was a 10th-place tie.)
The three most common win percentages are all above .500, and six of the top 11 are above .500. An even record is only the seventh-most-common record, and it is tied with two other records for that spot.
It’s a bit bizarre, really, or at least it surprised me—in a normal distribution, the mean, median, and mode should all be the same. The distribution for team win percentages certainly doesn’t seem skewed—the mean and median line up—but something seems decidedly abnormal here. It might help to look at a histogram showing all teams from 1962 (the first year of the 162-game schedule) through 2012:
The normal distribution is a pretty good fit, but it isn’t perfect. As anticipated, the tail looks slightly different for bad teams as compared to good. But shockingly, the biggest point of disagreement isn’t the tails but right in the middle—there’s an odd little dip right at .500. What our histogram suggests is what’s known as a bimodal distribution—instead of a bell curve with a single peak, what we seem to have is a combination of two overlapping bell curves with two peaks.
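To make the idea of two overlapping bell curves concrete, here is a small Python sketch. The two centers (.460 and .540) and the shared spread (.035) are made-up numbers chosen only to produce a visible dip; they are not fitted to the 1962-2012 data.

```python
from statistics import NormalDist

# Hypothetical parameters, for illustration only (not estimated from real records).
LOW_GROUP = NormalDist(mu=0.460, sigma=0.035)   # assumed "bad team" curve
HIGH_GROUP = NormalDist(mu=0.540, sigma=0.035)  # assumed "good team" curve

def mixture_density(pct):
    """Density of an even 50/50 blend of the two bell curves at win percentage pct."""
    return 0.5 * LOW_GROUP.pdf(pct) + 0.5 * HIGH_GROUP.pdf(pct)

# Print a crude sideways histogram of the blended curve from .400 to .600.
for thousandths in range(400, 601, 10):
    pct = thousandths / 1000
    bar = "#" * round(mixture_density(pct) * 5)
    print(f"{pct:.3f} {bar}")
```

With equal spreads and equal weights, a blend like this shows two peaks only when the centers sit more than two standard deviations apart. Move them closer and the dip at .500 disappears, which is part of why the eyeball test alone is not enough.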
Histograms, of course, are subject to small sample sizes and the largely arbitrary decision on the number of bins to use (I used 29 bins for that histogram, for no other reason than that the program gretl suggested it). We have 1,338 teams in the sample, which isn’t small but isn’t so large that we can necessarily trust the eyeball test. What we need is a statistical test of the number of modes in a distribution. We can’t necessarily tell how many modes a distribution has, because it can be hard to tell at what point overfitting begins to occur. But we can test whether a distribution is unimodal (a single peak, like the standard bell curve) or not.
I turned to Hartigan’s dip test of unimodality, which as it turns out has nothing at all to do with how much oil is in your car. What the dip test measures is the largest difference between the observed (empirical) distribution of the data and the unimodal distribution that best fits that data (that is, the one that makes that largest difference as small as possible). That distance can be compared to a Monte Carlo simulation that sees how frequently a distance that large would occur at random, given the sample size, if the data really had been sampled from a unimodal distribution (the test uses the uniform distribution as its reference case). The diptest package in GNU R reports a distance of 0.0164, which according to its simulations would occur at random less than two percent of the time (five percent is the generally accepted standard for statistical significance). In other words, the statistical test seems to back up what we see in the histogram.
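The Monte Carlo step can be illustrated with a short Python sketch. To be clear about what is swapped in: this is not Hartigan's dip statistic and not the R diptest package. It uses a simpler stand-in, the largest gap between the sample's empirical distribution and a normal curve fitted to it (essentially a Lilliefors-style test), and the function names and the toy two-humped sample are invented for illustration.

```python
import random
from statistics import NormalDist, fmean, pstdev

def max_cdf_gap(sample):
    """Largest gap between the empirical CDF and a normal CDF fitted by mean and sd."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), pstdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap

def monte_carlo_p_value(sample, trials=200, seed=0):
    """Share of simulated normal samples whose gap is at least the observed gap."""
    rng = random.Random(seed)
    n = len(sample)
    observed = max_cdf_gap(sample)
    hits = 0
    for _ in range(trials):
        # The gap ignores location and scale, so a standard normal is enough here.
        fake = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if max_cdf_gap(fake) >= observed:
            hits += 1
    return observed, hits / trials

# Toy data (an assumption, not real team records): two well-separated clusters.
rng = random.Random(1)
two_humps = ([rng.gauss(-3.0, 1.0) for _ in range(150)]
             + [rng.gauss(3.0, 1.0) for _ in range(150)])
gap, p = monte_carlo_p_value(two_humps)
print(f"largest gap = {gap:.4f}, simulated p-value = {p:.3f}")
```

The logic is the same as in the real test: compute a distance for the observed data, then count how often purely random samples of the same size produce a distance at least that large.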
Can I explain it? Maybe. I can come up with some possible explanations, at least. One explanation is that teams are, rather than being purely random coins, self-aware. Most of the rewards—that is, playoff spots—are for teams above .500, so the incentive is either to finish above .500 and contend, or to play for the long term by selling off veteran parts for younger players.
Another possible explanation is that there are structural imbalances in baseball that lead to the results we see. Because of long-term deals and club control of young players, roster turnover is limited. Unlike basketball or (to a much lesser extent) football, a single high draft pick cannot remake the fortunes of an entire team overnight. And teams that have structural advantages (like the ability to carry a high payroll, or a well-run farm system) tend to carry those advantages over a period of several years, sometimes a decade or more. Similarly, teams that are poorly run or poorly funded tend to be bad over the long run.
Another explanation is that it has to do with the differences between leagues—we know that recently, the AL has been the stronger league, for instance. It’s possible that what shows up so clearly in interleague play has an effect on the overall distribution of win percentages.
Now, it’s possible that one explanation is right, that some combination of the above is right, or that some other explanation (or explanations) I haven’t even considered plays a significant role (either by itself or in concert with one, some, or all of the possible explanations I have offered here). It’ll probably take some more work to figure this out.
But the next question is, what does this mean in a practical sense? What it suggests is that our current view of how teams behave, and how talent is distributed in major league baseball, is flawed. Now, as I have pointed out in the past, flawed models can still be useful. Treating MLB teams as though they come from a normal distribution can still be useful. I want to emphasize this because I’m going to talk about a lot of people’s favorite whipping boy next, and I want to make it clear that one can improve upon something without invalidating it entirely.
Now, then, the whipping boy. Sabermetricians and fellow travelers love to talk about regression to the mean. It’s a somewhat more subtle and nuanced concept than I think most writers (even writers from a sabermetric background) manage to convey. You can overstate the impact of regression to the mean—it’s a probability, not an iron law, and it deals with populations more so than individuals. I like to say that groups will tend to regress to the mean over time, while individuals can do any damned thing they like (with some damned things, of course, being more likely than others). But ignoring overenthusiasm for the concept, you can pretty much divide people who analyze baseball into three camps:
None of what I say here should indicate that I favor the second and third camps over the first; I very much do not. I nearly sold off my Mark Prior jersey for a Chris Sale one when Rick Hahn talked about small sample size.
But the standard model for regression to the mean assumes a normal distribution. If baseball teams aren’t normally distributed, in most cases it will still probably do pretty well. But there are going to be edge cases where it does not work so well. A bimodal distribution implies that most teams above .500 should perhaps not be expected to regress towards .500 quite as much as we would otherwise suspect. But by the same token, some above-.500 teams should be expected to regress to something below .500, and so to perform even worse than the normal model would suggest. And most paradoxically, a .500 team should be expected to regress away from its current record! (The question then becomes, regress towards what?)
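For reference, one common textbook form of that normal-model regression can be sketched in a few lines of Python. The talent spread of .060 is an assumed, illustrative figure rather than an estimate from this article's sample, and the function name `regress_to_mean` is likewise invented here. The idea is to shrink an observed record toward .500 in proportion to how much of the observed spread is real talent rather than binomial luck.

```python
def regress_to_mean(observed_pct, games=162, talent_sd=0.060, league_mean=0.500):
    """Shrink an observed winning percentage toward the league mean.

    Assumes true talent is normally distributed around league_mean with
    standard deviation talent_sd (a hypothetical value), and that luck over
    `games` games behaves like coin-flip (binomial) noise.
    """
    luck_var = league_mean * (1 - league_mean) / games  # binomial noise in win pct
    talent_var = talent_sd ** 2
    reliability = talent_var / (talent_var + luck_var)  # share of spread that is talent
    return league_mean + reliability * (observed_pct - league_mean)

for observed in (0.400, 0.500, 0.600):
    print(f"observed {observed:.3f} -> estimate {regress_to_mean(observed):.3f}")
```

Under this model every estimate moves toward .500 and a .500 team stays put, which is exactly the behavior a two-peaked distribution would call into question.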
It also tells us that there are things about how talent is distributed among MLB teams that we don’t yet understand. Instead of baseball being a bunch of nearly average teams with some good teams and bad teams at the margins, it seems as though baseball may instead be a collection of good teams and bad teams, with something of a gulf between them. That would seem to have implications for evaluating roster construction, trades and free-agent signings, the structure of the amateur draft (and the acquisition of foreign talent) and our expectations of a team’s future performance. The sabermetric study of how individual players relate to things like runs and wins has thus far outpaced the study of how talent is distributed among teams, and it seems as though that’s been an oversight on our part.