July 25, 2011
Lost in the SIERA Madre
Recently, there has been a lot of digital ink spilled about ERA estimators—statistics that take a variety of inputs and come up with a pitcher’s expected ERA given those inputs. Swing a cat around a room, and you’ll find yourself with a dozen of the things, as well as a very agitated cat. Among those is SIERA, which has lately migrated from here to Fangraphs.com in a new form, one more complex but not necessarily more accurate. We have offered SIERA for roughly 18 months, but have had a difficult time convincing anyone, be they our readers, other practitioners of sabermetrics, or our own authors, that SIERA was a significant improvement on other ERA estimators.
The logical question was whether or not we were failing to do the job of explaining why SIERA was more useful than other stats, or if we were simply being stubborn in continuing to offer it instead of simpler, more widely adopted stats. The answer depends on knowing what the purpose of an ERA estimator is. When evaluating a pitcher’s performance, there are three questions we can ask that can be addressed by statistics: How well he has pitched, how he accomplished what he’s done, and how he will do in the future. The first can be answered by Fair RA (FRA), the third by rest-of-season PECOTA. The second can be addressed by an ERA estimator like SIERA, but not necessarily SIERA itself, which boasts greater complexity than more established ERA estimators such as FIP but can only claim incremental gains in accuracy.
Some fights are worth fighting. The fight to replace batting average with better measures of offense was worth fighting. The fight to replace FIP with more complicated formulas that add little in the way of quality simply isn’t. FIP is easy to understand and it does the job it’s supposed to as well as anything else proposed. It isn’t perfect, but it does everything a measure like SIERA does without the extra baggage, so FIP is what you will see around here going forward.
Why ERA Estimators?
Why do we want to know a player’s expected ERA? We can measure actual ERA just fine, after all, and we have good reason to believe that ERA is not the best reflection of a pitcher’s value because of the unearned run. Component ERA estimators are frequently used to separate a pitcher’s efforts from those of his defense, but you end up throwing out a lot of other things (like how he pitched with men on base) along with the defensive contributions. Here at BP, we use Fair RA to apportion responsibility between pitching and defense while still giving a pitcher credit for his performance with men on base. Sean Smith, who developed the Wins Above Replacement measure used at Baseball Reference, has his own method of splitting credit that doesn’t rely on component ERA estimators. They simply aren’t necessary to handle the split credit issue.
What about ERA estimators’ utility in projecting future performance? Let’s take a look at how the most popular component ERA works in terms of prediction. Fielding Independent Pitching was developed independently by Tom Tango and Clay Dreslough, and is simply:
I computed FIP for the first 50 games of the season, from 1993 to 2010, and looked at how it predicted rest-of-season ERA, weighted by innings pitched. As expected, the Root Mean Square Deviation is very large, at 5.32. If you repeat the exercise but use previous year’s ERA instead, RMSE drops to 2.78. After 100 games, the difference still persists; ERA has an RMSE of 4.52, compared to 9.64 for FIP. (The reason RMSE goes up the further you are in season is that you are predicting ERA for substantially fewer games.) Large sample sizes are our friend when it comes to projection.
All component ERA estimators share the same substantial weakness in terms of predicting future ERA: they work a lot better on smaller sample sizes. I tried predicting 2010 pitching stats using the aggregate of pitching stats from 2003 through 2009, so I could test how several ERA estimators fared as sample sizes increased dramatically. In addition to FIP, I used xFIP, a FIP derivative developed by Dave Studeman which uses outfield flies to estimate home runs allowed, and Skill-Interactive ERA, a metric originally developed here at BP by Matt Swartz and Eric Seidman.
Here is the RMSE for 2003-2010 at increments of 100 innings:
Looking at all pitchers with at least 100 innings, SIERA outperforms the other estimators (but not by much, as I’ll explain momentarily) as well as ERA. By about 400 innings, though, the gap between the ERA estimators (as well as the gap between them all and ERA) has essentially disappeared. As you add more and more innings, plain old ERA outperforms the estimators, with basic FIP ranking second-best.
If you want a projection, the key thing you want is as large a sample as possible. If you need both a large sample as well as the most recent data, weighted according to their predictive value, you want something like rest-of-season PECOTA. So what is the function of these estimators, if not prediction? The answer, I think, is explanation. These ERA estimators allow us to see how a pitcher came about the results he got, if they came from the three true outcomes, defense, or performance with runners on.
SIERA was doomed to marginalization at the outset. It was difficult to mount a compelling case that its limited gains in predictive power were worth the added complexity. As a result, FIP, easier to compute and understand, has retained popularity. FIP is the teacher who can explain the subject matter and keep you engaged, while SIERA drones on in front of an overhead projector. When FIP provides you an explanation, you don’t need to sit around asking your classmates if they have any idea what he just said. If ERA estimators are about explanation, then the question we need to ask is, do these more complicated ERA estimators explain more of pitching than FIP? The simple answer is they don’t. All of them teach fundamentally the same syllabus with little practical difference.
Fangraphs recently rolled out a revised SIERA; a natural question is whether or not New Coke addresses the concerns I’ve listed above with old Coke. The simplest answer I can come up with is, they’re still both mostly sugar water with a bit of citrus and a few other flavorings, and in blind taste tests you wouldn’t be able to tell the difference.
I took the values for 2010 published at Fangraphs and here, and looked at the root mean square error and mean absolute error, weighted by innings pitched. I got values of .12 and .19 respectively. In other words, roughly 50 percent of the time a player’s new SIERA was within .12 runs per nine of the old SIERA, and 68 percent of the time it was under .20 runs per nine. There is a very modest disagreement between the two formulas.
In order to arrive at these modest differences, SIERA substantially bulks up, adding three new coefficients and an adjustment to baseline it to the league average for each season. I ran a set of weighted ordinary least squares regressions to, in essence, duplicate both versions of SIERA on the same data set (new SIERA uses proprietary BIS data, as opposed to the Gameday-sourced batted ball data we use here at Baseball Prospectus). Comparing the two regression fits, there is no significant difference in measures of goodness of fit such as adjusted r-squared. Using the new SIERA variables gives us .33, while omitting the three new variables gives us .32.
We can go one step further and eliminate the squared terms and the interactive effects altogether, and look at just three per-PA rates (strikeouts, walks and “net ground balls,” or grounders minus fly balls) and the percentage of a pitcher’s innings as a starter. The adjusted r-squared for this simpler version of SIERA? Still .32.
So what does the extra complexity of the six interactive and exponential variables add to SIERA? Mostly, multicollinearity. An ordinary least squares regression takes a set of one or more input variables, called the independent variables, and uses them to best predict another variable, the dependent variable. It does this by looking at how the independent variables are related to the dependent variable. When the independent variables are also related to each other, it’s possible to throw off the coefficients estimated by the regression.
When multicollinearity is present, it doesn’t affect how well the regression predicts the dependent variable, but it does mean that the individual coefficients can incorrectly reflect the true relationship between the variables. This is why you can have one term in old SIERA with a negative coefficient now have a positive coefficient in new SIERA. The fundamental relationship between that variable and a pitcher’s ERA hasn’t changed, players haven’t started trying to score outs without making runs or anything like that.
Adding additional complexity doesn’t help SIERA explain pitching performance in a statistical sense, and it makes it harder for SIERA to explain pitching performance to an actual human being. If you want to know how a pitcher’s strikeouts influence his performance in SIERA, you have no less than four variables which explicitly consider strikeout rate, and several others which are at least somewhat correlated with strikeout rate as well. Interpreting what SIERA says about any one pitcher is essentially like reading a deck of tarot cards—you end up with a bunch of vague symbols to which you can fit a bunch of rationalizations to after the fact.
The world doesn’t need even one SIERA; recent events have conspired to give us two. That’s wholly regrettable, but we’re doing our part to help by eliminating it from our menu of statistical offerings and replacing it with FIP. Fangraphs, which already offers FIP, is welcome to this demonstrably redundant measure.
FIP or xFIP?
The next question is why not use a measure like xFIP instead of FIP? The former is much simpler than SIERA and performs better on standard testing than FIP, while being indistinguishable from SIERA in terms of predictive power. While xFIP is simpler than SIERA, it still drags additional levels of complexity along with it, including stringer-provided batted ball data. The question is whether or not that data adds to our fundamental understanding of how a pitcher allows and prevents runs.
One of the fundamental concepts to understand is the notion of true score theory (otherwise known as classical test theory). In essence, what it says is this: in a sample, every measurement is a product of two things—the “true” score being measured, and measurement error. (We can break measurement error down into random and systemic error, and we can introduce other complexities; this is the simplest version of the idea, but still quite powerful on its own.)
You’re probably familiar with the principle of regression to the mean, the idea that extreme observations tend to become less extreme over time. If you select the top ten players from a given time period and look at their batting average, on-base percentage, earned run average, whatever, you see that the average rate after that time period for all those players will be closer to the mean (or average) than their rates during that time period. That’s simply true score theory in action.
Misunderstanding regression to the mean can lead to a lot of silly ideas in sports analysis. It’s why people still believe in a Madden curse. It’s why people believe in the Verducci Effect. It is why we ended up with a lot of batted-ball data in our supposed skill-based metrics. I’m regularly known as a batted ball data skeptic. So I’m sure it comes as no surprise that I regularly lose sleep over statistics like home runs per fly ball, a component of xFIP.
xFIP uses the league average HR/FB rate times a pitcher’s fly balls allowed in place of a pitcher’s actual home runs allowed. While it’s true that a pitcher’s fly ball rate is predictive of his future home run rate on contact, it’s actually less predictive than his home runs on contact (HR/CON). Looking at split halves from 1989 through 1999, here’s how well a pitcher’s stats in one half predict his home run rate on contact in the other half:
For those who care about the technical method, I used the average number of contact in both halves to do a weighted ordinary least squares regression. The coefficients on the regression including both terms are:
HR_CON_ODD = 0.019 + 0.158*HR_CON_EVEN + 0.032 * FB_RT_EVEN
This result seems to fly in the face of previously published research. Most analysts seem to feel that a pitcher’s fly ball rate is more predictive of future home runs allowed than their home run rates. How to reconcile these seemingly contradictory thoughts? The answer is simply this: estimating home run rates using a pitcher’s fly ball rates leads to a much smaller spread than using observed home run rates. Taking data from the 2010 season, and fitting a normal distribution to actual HR/CON, as well as expected HR/CON given FB/CON, gives us:
When you create an expected home run rate based on batted ball data, what you get is something that's well correlated with HR/CON but has a smaller standard deviation, so in tests where the standard deviation affects the results, like root mean square error, it produces a better-looking result, without adding any predictive value.
Aside from questions of batted ball bias, however, there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that is a home run. This is absolutely incorrect. You can break a pitcher’s fly ball rate down into two components: home runs on contact, and fly balls on balls in play. (This ignores the very small number of home runs that are scored as line drives, which is typically rare enough to not impact this sort of analysis.) Fly balls on balls in play are a much poorer predictor of future home runs than home runs on contact, with an r-squared of only .014.
While regressing the spread of HR/CON is a good idea, using fly ball rates to come up with expected HR/CON does this essentially by accident. And in doing so, they throw out a lot of valuable information that is preserved in HR/CON but is washed out when non-HR fly balls are considered in equal proportion. By being in a rush to discard the bath water, someone’s throwing away the baby.
No Estimating Zone
Proponents of skill-based metrics such as SIERA and xFIP (as distinct from FIP) may claim that results trump theory, that their additional predictive value alone proves their utility. But just as the greater predictive value of fly ball rates is a mirage, so is the greater predictive value of skill-based estimators. What happens when we fit a normal distribution to our ERA estimators, along with ERA? Again looking at 2010 data:
The estimators have a smaller standard deviation than observed ERA. SIERA has a lower spread than FIP and xFIP has a lower spread than both (although SIERA and xFIP are much closer to each other than SIERA is to FIP). Let’s look at how well each of these predicts next-season ERA from 2003 through 2010 (using a weighted root mean square error), and well as the weighted SD of each among the pitchers who pitched in back to back seasons:
The ERA estimators are all closer to the next-season ERA than ERA is, but they have a much smaller spread as well. So what happens if we give all the ERA estimators (and ERA) the same standard deviation? We can do this simply, using z-scores. I gave each metric the same SD as xFIP, which was (just barely; you can’t even tell the difference when you round the values) the leading estimator in the testing above, and voila:
In essence, what we’ve done here is very crudely regress the other stats (except SIERA, which we actually expanded slightly) and re-center them on the league average ERA of that season. Now SIERA slightly beats xFIP, but in a practical sense both of them are in a dead heat with FIP, and ERA isn’t that far behind. As I said, this is an extremely crude way of regressing to the mean. If we wanted to do this for a real analysis, we’d want to look at how much playing time a player has had. In this example we regress a guy with 1 IP as much as a guy with 200 IP, which is sheer lunacy.
In a real sense, that’s what we do whenever we use a skill-based metric like xFIP or SIERA. We are using a proxy for regression to the mean that doesn’t explicitly account for the amount of playing time a pitcher has had. We are, in essence, trusting in the formula to do the right amount of regression for us. And like using fly balls to predict home runs, the regression to the mean we see is a side effect, not anything intentional.
Simply producing a lower standard deviation doesn't make a measure better at predicting future performance in any real sense; it simply makes it less able to measure the distance between good pitching and bad pitching. And having a lower RMSE based upon that lower standard deviation doesn't provide evidence that skill is being measured. In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on.
The years in question are because those are the years for which we have Project Scoresheet batted ball data, and thus represents the largest period of publicly-available batted ball data on record.