When Matt Cain signed his $127.5 million contract extension in April 2012, Colin Wyers memorably tweeted: “If your response to the Matt Cain extension involves xFIP I'll be by later to pour coffee on your keyboard.” I was still relatively new to the field of sabermetrics, and for me that tweet was a watershed moment: it signaled a substantial retreat from our previous collective understanding that DIPS statistics were unambiguously superior to ERA as measurements of pitching ability.
In the two years since Wyers’ excoriation of those who still judged Cain by his peripherals despite his history of outperforming them, I’ve noticed that the best analysts in our midst have continued to trend toward nuance in their discussions of DIPS theory, looking for qualities and characteristics that explain the outliers both in the stats and from personal observations. This ever-increasing blend of sabermetrics and scouting is unambiguously good for bettering our understanding of baseball.
Yet I wonder if, in our collective quest to bridge the gaps between traditional baseball wisdom and sabermetric logic, we are failing to see the forest for the trees. Specifically, I fear we are becoming far too quick to identify outlier pitchers as exceptions to DIPS norms rather than understanding them to be manifestations of typical population variance. Some basic probability can show us that it is a lot harder to demonstrate results incompatible with the DIPS theory than we seem to think.
The Starting Assumptions
In his criminally under-cited article from November 2011, Noah Isaacs used basic probability theory to show that the proportions of MLB pitchers who outperformed their FIPs through different numbers of seasons were very similar to those we would expect to see from random variation alone. For his thought experiment, Isaacs assumed that every pitcher had a 50 percent chance of outperforming his DIPS numbers and a 50 percent chance of underperforming them.
In recognition of the intuitive and empirical logic in different types of pitchers having different relationships with their DIPS numbers, I divide the population of MLB pitchers into three hypothetical groups with the following characteristics and proportions:
· The Normals: Pitchers whose true-talent levels are more or less accurately described by their DIPS numbers. In any given season, each has a 50-percent chance of having an ERA below his DIPS numbers and a 50-percent chance of having an ERA above his DIPS numbers. I’d guess that 60 percent is a conservative estimate for the proportion of MLB pitchers who fall into this category.
· The Overachievers: Pitchers whose true abilities to prevent runs are underestimated by their DIPS numbers. In any given season, I assume that each has a 75-percent chance of having an ERA below his DIPS numbers and a 25-percent chance of having an ERA above his DIPS numbers. My intuitive guess is that about 15 percent of MLB pitchers are Overachievers.
· The Underachievers: Pitchers whose run-prevention skills are overestimated by their DIPS numbers. In any given season, I assume that each has a 25-percent chance of having an ERA below his DIPS numbers and a 75-percent chance of having an ERA above his DIPS numbers. I’d estimate that around 25 percent of MLB pitchers fit this description. (The asymmetry in my theoretical proportions of Over- and Underachievers reflects the fact that, in baseball, it is a lot easier to be significantly below average than it is to be significantly above average.)
Categorizing pitchers this way is an oversimplification, and these numbers are nothing more than educated guesses, but for the purpose of some exploratory calculations I think it’s a fair description of the population of MLB arms.
When Does Overachieving Signify DIPS-Beating Skill?
One of my favorite statistical tools is Bayes’ theorem. Simply stated, Bayes’ theorem provides an elegant mechanism for calculating the probability of event or characteristic A given an observed event or characteristic B as a function of the probability of A, the probability of B, and the probability of B given A. In this instance, we want to calculate the odds that a pitcher has the true ability to outperform his peripherals — and thanks to Bayes, we can plug in my proposed numbers and come up with solutions.
Imagine a young pitcher named Wendy from my beloved hypothetical E Street League. Wendy gets the call for the Opening Day roster and outperforms her DIPS numbers in her first season. What are the odds that she is an Overachiever? The probability of her beating her DIPS stats if she is an Overachiever is 75 percent. Multiply that by the 15-percent chance of her being an Overachiever and divide by the general population’s 47.5-percent chance of beating DIPS, and the odds that she’s an Overachiever are 23.7 percent. So after one year of observed overachieving, the odds of a pitcher being a true Overachiever are less than one in four.
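The arithmetic behind Wendy's first season can be sketched in a few lines of Python. This is only an illustration of the calculation described above; the dictionary names are my own, and the numbers are the guessed group shares and per-season probabilities from the three categories.

```python
# Guessed population shares and per-season chances of beating DIPS,
# copied from the three hypothetical groups defined earlier.
PRIOR = {"Normal": 0.60, "Overachiever": 0.15, "Underachiever": 0.25}
P_BEAT = {"Normal": 0.50, "Overachiever": 0.75, "Underachiever": 0.25}

# Chance that a pitcher drawn at random from the population beats DIPS
# in a single season (law of total probability).
p_beat_overall = sum(PRIOR[g] * P_BEAT[g] for g in PRIOR)

# Bayes' theorem: P(Overachiever | beat DIPS in one season).
p_over_given_beat = (
    P_BEAT["Overachiever"] * PRIOR["Overachiever"] / p_beat_overall
)

print(f"P(beat DIPS)                = {p_beat_overall:.1%}")
print(f"P(Overachiever | beat DIPS) = {p_over_given_beat:.1%}")
```

The two printed values should match the 47.5 percent and 23.7 percent figures quoted in the paragraph above.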
Now say Wendy keeps it up for a second year. What could we infer about her true nature then? Overachievers have a 56.3-percent chance of beating their DIPS numbers two years in a row, compared to 25-percent and 6.3-percent odds for Normals and Underachievers, respectively. Yet because Overachievers are but a minority of the population of MLB pitchers, Wendy’s odds of being one of them are just 33.8 percent, barely over one in three.
So how long would it take before we could confidently describe Wendy as an Overachiever? Here’s a look at how the probability increases over time:
Seasons | Probability
1 | 23.7%
2 | 33.8%
3 | 44.5%
4 | 55.2%
5 | 65.2%
6 | 73.9%
7 | 81.0%
8 | 86.5%
9 | 90.6%
10 | 93.5%
It takes four years of consistent overachieving before we can say that Wendy is more likely to be an Overachiever than not, and even then a 55.2-percent chance is hardly a slam dunk. Even after a decade we could not reject the null hypothesis that she is a Normal at the five-percent level (though that comes in Year 11). That’s a lot longer than most of us seem to think.
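For readers who want to check the season-by-season figures, here is a minimal sketch. It assumes, as the two-year calculation above implicitly does, that each season is an independent draw with a fixed probability for each group. The function name `posterior_after_streak` is hypothetical, not something from the original analysis.

```python
# Guessed population shares and per-season chances of beating DIPS.
PRIOR = {"Normal": 0.60, "Overachiever": 0.15, "Underachiever": 0.25}
P_BEAT = {"Normal": 0.50, "Overachiever": 0.75, "Underachiever": 0.25}


def posterior_after_streak(seasons, beat=True):
    """Posterior share of each group after `seasons` straight seasons of
    beating DIPS (beat=True) or trailing DIPS (beat=False).

    Assumes seasons are independent, so a streak's likelihood is p ** seasons.
    """
    weights = {}
    for group, prior in PRIOR.items():
        p = P_BEAT[group] if beat else 1.0 - P_BEAT[group]
        weights[group] = prior * p ** seasons
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()}


# Print the streak table through Year 11, with the Normal share alongside.
for n in range(1, 12):
    post = posterior_after_streak(n)
    print(f"{n:2d} | Overachiever {post['Overachiever']:.1%}"
          f" | Normal {post['Normal']:.1%}")
```

Under these assumptions the Normal share is still a little above five percent after 10 seasons and falls below it in Year 11, which is one way to read the "five-percent level" remark above.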
When Does Underachieving Signify Lack of Skill?
We can do a similar analysis to estimate the probability that a pitcher is an Underachiever given observed underachieving. Take the case of Janey, another pitcher from the E Street League. In her rookie season, she underperforms her peripheral numbers. It is easier for her to be classified as an Underachiever in this model than it was for Wendy to be categorized as an Overachiever because of the assumption that there are more of the former than of the latter. Even so, the odds of Janey’s being a true Underachiever after one season of observation are just 35.7 percent.
If Janey failed to improve relative to her DIPS numbers the next season, her odds of being an Underachiever would rise to 46.9 percent—still less than the probability that she is a Normal. Only after Year 3 would Janey’s chances of being an Underachiever exceed 50 percent, and it would take 10 seasons to reject the null hypothesis that she is a Normal. Here’s a look at how the odds would change over time:
Seasons | Probability
1 | 35.7%
2 | 46.9%
3 | 57.7%
4 | 67.5%
5 | 75.8%
6 | 82.5%
7 | 87.7%
8 | 91.4%
9 | 94.1%
10 | 96.0%
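The same numbers can be reached by updating one season at a time, with each season's posterior becoming the next season's prior. The sketch below does that for Janey and looks for the first season in which the probability that she is a Normal drops below five percent. Treating "rejecting the null hypothesis" as that posterior crossing 0.05 is my assumption about how the threshold is applied, and the helper names are hypothetical.

```python
# Guessed population shares, and per-season chances of an ERA ABOVE DIPS
# (that is, of underperforming) for each hypothetical group.
PRIOR = {"Normal": 0.60, "Overachiever": 0.15, "Underachiever": 0.25}
P_TRAIL = {"Normal": 0.50, "Overachiever": 0.25, "Underachiever": 0.75}


def update(belief, likelihood):
    """One Bayesian update: weight each group by its likelihood, renormalize."""
    weighted = {g: belief[g] * likelihood[g] for g in belief}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


def first_season_below(belief, likelihood, group, threshold, max_seasons=50):
    """Return (season, belief) for the first season of an unbroken streak in
    which `group`'s posterior falls below `threshold`, or (None, belief) if
    that does not happen within `max_seasons`."""
    for season in range(1, max_seasons + 1):
        belief = update(belief, likelihood)
        if belief[group] < threshold:
            return season, belief
    return None, belief


season, belief = first_season_below(PRIOR, P_TRAIL, "Normal", 0.05)
print("First season with P(Normal) < 5%:", season)
print({g: round(p, 3) for g, p in belief.items()})
```

Passing the chances of beating DIPS as the likelihood instead of `P_TRAIL` runs the same search for a streak of overachieving like Wendy's.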
Complications and Caveats
I’ll be the first to admit that this is an overly simplistic model. I think my estimates of the proportional size of each category of pitchers and the odds of each result for each group are fair, but they are merely educated guesses. Further, to express the comparison of a pitcher’s ERA to his DIPS numbers as a binary leaves something to be desired — there’s a difference between beating your FIP by a couple points and beating it by a full run.
More importantly, there is more to the art of categorizing pitchers than seeing where their ERAs and DIPS numbers end up. For example, if you watch an otherwise-great pitcher and notice that he throws more than his share of mistake pitches, you could expect to see him underperform his DIPS numbers. I suspect that these kinds of observations are sometimes the tail wagging the dog as analysts look to explain the ERA-DIPS disparities they are already observing, but they still matter.
But the imprecision of this model doesn’t undermine the basic point: DIPS theory needn’t be uniform in its empirical manifestations for us to conclude that it is true. Even over several seasons, random variation can still have a substantial impact, so a given pitcher’s apparent nonconformity doesn’t necessarily mean he’s an exception to the rule.
The Bigger Picture
At the SABR Analytics Conference in March, I talked to a friend who observed that the first day’s program—which featured a presentation about the unknown impacts of team chemistry, a panel on the limitations of current knowledge about injuries, and two discussions about advanced statistics from the perspectives of players—felt almost like “an apology for sabermetrics.” Perhaps I’m reading too much into it, but I fear that this is illustrative of a larger inferiority complex that has taken hold among the sabermetricians who are part of the otherwise-fantastic effort to synthesize statistical analysis with traditional scouting and baseball knowledge.
“There are things that are generally publicly held as sabermetric doctrine—in some cases, crucial underlying assumptions—that are demonstrably false,” Russell Carleton wrote upon returning to the public world after consulting stints for MLB teams. These words are humbling. There is a lesson in them to be open-minded about ideas both old and new and to constantly think critically not just about the old traditions we believe to be outdated but also about the studies we perform and the conclusions we draw ourselves. (From my heretofore more-limited experience working on the inside, I would wholeheartedly agree with his advice.)
But it doesn’t mean that we need be deferential by default, especially when the contrary evidence is anecdotal. Some (if not most) of the greatest advances in sabermetric thought have come from looking at the game through a wide-angle lens and not giving a damn about how the happenings of the game felt to the players and coaches and fans. No formula will ever be able to encompass everything that happens in a baseball game (let alone the years of preparation that go into the construction of every roster before a team takes the field), and no sabermetrician should ever think of a model as not needing to be improved or a conclusion as too sacrosanct to test again. But a smart sabermetrician knows not only when to acknowledge that the facts contradict him or her but also when not to be dissuaded by insufficient evidence.
Now think back to Matt Cain: So convinced is the sabermetric community of his exceptional DIPS-beating skill (despite lacking a clear explanation for how he does it, to my knowledge) that one of the most prominent analysts in the field didn’t think his xFIP was even worth considering. Yet according to this model, the odds that Cain was an Overachiever after beating his DIPS numbers five years in a row (as his streak was at the time of Wyers’ tweet) were just 65.2 percent. So while a Cain skeptic might have had coffee on his or her keyboard, these numbers say he or she would have had better than one-in-three odds of being right.
I don’t believe the odds that Cain’s DIPS-beating skill is an illusion are really as high as one in three, and a more complicated model for assessing the probability that a pitcher is an Overachiever would likely bear that out. But even if the odds of his being a Normal are even half of that, then the certitude with which we as a community have asserted that Cain is special is undeserved. And that speaks to the fact that we have far more confidence in the anecdotally empirical and the subjectively perceived than they should warrant in the face of mathematical logic.
This is the last time I’ll be able to discuss baseball publicly for the indefinite future, so after one final plug for my senior thesis (there’s even an abridged version now!) I want to share my favorite quote about the game, from Dick Cramer: “Baseball is a soap opera that lends itself to probabilistic thinking.” A good analyst will never dismiss the value of a scout or assume that a model that describes some aspect of the game cannot be improved upon. But if our quest to explain baseball takes us so far into the rabbit hole that we mistake ordinary random variation for causal trends, that’s not nuance—that’s overfitting to the data.