February 11, 2010

Introducing SIERA
Part 4

Over the last three days the ERA estimator SIERA has been introduced, complete with explanations of its origins and the derivation of its formula. Now comes one of the most important aspects of building a new metric: making sure it works and testing it against its peers. Any estimator should be both realistic in its modeling and accurate in its predictive ability. These attributes do not always go hand in hand, however: you could have a situation where a regression on sock height and induced foul balls caught by middle-aged men holding babies somehow predicts park-adjusted ERA better than anything else. Sure, it tested well, but that type of modeling is absurd and illogical; those two variables should not be assumed to have any impact whatsoever on run prevention. This regression means nothing in spite of its hypothetical test results, but situations may also arise in which the most soundly modeled statistic tests poorly.

Realistic modeling is based on a combination of truths and assumptions, as we've discussed before: the former being that walk and strikeout rates are stable, the latter being that HR/FB is composed more of luck than skill. Based on our post yesterday, it seems safe to say that our modeling is sound, as the variables used make sense in terms of their perceived impact on what they seek to measure. The question then becomes how we can test the results to determine how SIERA compares to other estimators currently on the market.

For the purposes of this study, we used root mean square error (RMSE) testing, a simple but effective method that measures the typical size of the error between an actual and a predicted array of data. In terms of calculations, RMSE is simple enough to do in Excel: take the difference between each actual and predicted value, square it, average those squared deltas, and take the square root of that average. When comparing predictors, the lower the RMSE the better.
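For readers who would rather script the RMSE calculation than build it in Excel, here is a minimal Python sketch of the recipe described above. The function name `rmse` and the sample ERA values are hypothetical and chosen only for illustration; they are not taken from the study's dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one pair of values")
    # Square each actual-minus-predicted delta.
    squared_deltas = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Average the squared deltas, then take the square root.
    return math.sqrt(sum(squared_deltas) / len(squared_deltas))


# Hypothetical values, for illustration only.
actual_era = [3.00, 4.00, 5.00]
estimated_era = [3.50, 4.00, 4.50]
print(round(rmse(actual_era, estimated_era), 3))  # 0.408
```

Because the deltas are squared before averaging, a few large misses raise the RMSE more than many small ones do.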
With the why and how out of the way, who are we testing? To gauge how SIERA fares as an estimator we must compare it to its colleagues. In this forum, that group would consist of FIP, xFIP, QERA, tRA, raw ERA, and park-adjusted ERA itself. To further ensure data integrity, each of the above statistics was calculated from the same Retrosheet dataset. For FIP, the constant (the standard +3.2) was computed for the league and year in question rather than used as a shell figure, with the same true for xFIP. Additionally, as it pertains to xFIP, we spoke with Dave Studeman of The Hardball Times to confirm that the expected number of home runs substituted into the FIP formula is calculated from home runs per outfield fly, not per the sum of outfield flies and popups. Coding tRA was the final hurdle, but plenty of help from the invaluable Colin Wyers got us over it. One other note on tRA: since it calculates RA/9 instead of ERA, an adjustment needed to be made that essentially resulted in the creation of tERA, which is the normal tRA value discounted by the difference (league and year dependent) between RA and ERA.

Before getting into the results it would be prudent to discuss some assumptions and goals. To be blunt, our goal was to beat everyone at predicting park-adjusted ERA in the following season, regardless of HR/FB treatment, and beat everyone but FIP and tRA in terms of same-year predictive value. Though it may sound counterintuitive to openly seek a third-place finish in something like this, the rationale is that both FIP and tRA treat HR/FB as skill rather than luck, meaning that no adjustment at all is made to the home run variable; if we assume that a fluky high or low HR/FB is the true skill level of the pitcher and not in line for adjustment, then of course those metrics will be better at same-year predictivity (should totally be a word, damn that red squiggly line) than one that does apply an adjustment.
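Returning to the FIP and xFIP setup described above, the sketch below shows one way the two might be coded. It assumes the commonly cited FIP form, (13*HR + 3*(BB+HBP) - 2*SO)/IP plus a constant; some versions omit HBP, and the article does not say which variant was used. Deriving the constant so that league FIP matches league ERA is likewise an assumption about how a league- and year-specific constant could be obtained. All function names and sample numbers are hypothetical.

```python
def fip_core(hr, bb, hbp, so, ip):
    """FIP before the constant is added (commonly cited weights 13, 3, -2)."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip


def league_constant(lg_era, lg_hr, lg_bb, lg_hbp, lg_so, lg_ip):
    """Assumed derivation: the constant that makes league FIP equal league ERA."""
    return lg_era - fip_core(lg_hr, lg_bb, lg_hbp, lg_so, lg_ip)


def fip(hr, bb, hbp, so, ip, constant):
    return fip_core(hr, bb, hbp, so, ip) + constant


def xfip(outfield_flies, lg_hr_per_of_fly, bb, hbp, so, ip, constant):
    """FIP with actual HR replaced by expected HR.

    Expected HR uses the league rate of home runs per outfield fly
    (popups excluded), as described in the article.
    """
    expected_hr = outfield_flies * lg_hr_per_of_fly
    return fip(expected_hr, bb, hbp, so, ip, constant)


# Hypothetical pitcher season, for illustration only.
print(round(fip(hr=20, bb=50, hbp=5, so=180, ip=200.0, constant=3.2), 3))
print(round(xfip(outfield_flies=200, lg_hr_per_of_fly=0.11,
                 bb=50, hbp=5, so=180, ip=200.0, constant=3.2), 3))
```

The only difference between the two calls is the home run input, which is why a pitcher with a fluky HR/FB shows a gap between his FIP and his xFIP.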
As we discussed earlier in the week, the intra-class correlation for HR/FB is very low, both from an overall standpoint and one in which the individuals are isolated from their respective teams. Both tRA and FIP use extra information to retroactively guess ERA, but if we wanted to do that we could use BABIP to more precisely derive an ERA prediction. Looking at tRA, we know that it uses the same information that FIP does, but also takes GB, LD, FB, and PU totals and estimates the expected number of runs and outs that each of these batted balls leads to on average. The assumption is certainly one worth exploring: that pitchers control the rates of each of these batted balls, but that defense and luck determine whether they land in gloves or not. The problem is that this is mostly going to hinge on the assumption that line-drive rates are persistent for pitchers, because line drives become outs far less frequently than other batted balls. Therefore, a pitcher's line-drive rate is going to affect his tRA significantly. However, when we look at the ICC of the pitcher's line-drive rate relative to the rest of his team, we only get .007. In this regard, tRA takes a luck-based stat used in FIP, adds another luck-laden metric in the rate of line drives, and uses that as a main determinant of expected ERA.

The ideas are certainly sound, but assumptions must be tested, which is exactly what we did here with SIERA. If everything plays out the way we hoped, then tRA and FIP will best SIERA in postdicting same-year ERA but will lose at subsequent-year predictive value. The goal isn't so much to lose to both of them in the same-year RMSEs as it is to beat the other competitors that treat HR/FB similarly, namely xFIP and QERA. With that series of disclaimers out of the way, the tables below show the same-year and subsequent-year RMSEs for the seven metrics in a variety of different categories and subsets.
For starters, here is the table of overall results:

Stat      YRSame  YRNext
SIERA     0.957   1.162
tRA       0.755   1.222
FIP       0.773   1.224
xFIP      1.168   1.319
QERA      1.070   1.248
ERAPark           1.430
ERA       0.094   1.434

Our goals came to fruition, as SIERA beat xFIP and QERA in the same-year RMSE test while besting everyone else in terms of predicting park-adjusted ERA in the following year. The latter is very important, as a big purpose of these estimators is to base ERA around repeatable skills that would conceivably lead to better results the next time out. Next, we will break the RMSE test results down into a number of subsets to add a level of granularity to the discussion. These subsets were not chosen at random, either, with each being tested for a specific purpose. Most of these purposes involve specific interactions of skills, thus the name Skill-Interactive Earned Run Average. For starters, here are the pitchers with above-average strikeout rates:

Stat      YRSame  YRNext
SIERA     0.929   1.135
tRA       0.704   1.191
FIP       0.748   1.191
xFIP      1.191   1.275
QERA      1.032   1.191
ERAPark           1.401
ERA       0.084   1.404

When looking at the crop of pitchers with an above-average SO/PA, the standing of SIERA relative to the overall group remains unchanged. Next up, the group with an SO/PA at least one standard deviation above the mean, classified as really high strikeout guys:

Stat      YRSame  YRNext
SIERA     0.866   1.218
tRA       0.689   1.229
FIP       0.722   1.216
xFIP      1.214   1.289
QERA      0.972   1.222
ERAPark           1.430
ERA       0.071   1.432

Here, FIP pulls ever so slightly ahead, but SIERA remains very close to it in predicting park-adjusted ERA the following year. SIERA uses a quadratic term on strikeouts, which makes it particularly good at estimating ERA for medium-high levels of strikeouts but does not add anything particularly helpful for very high levels of strikeouts. Shifting the focus to walks, how do things shake out when looking at below-average walk rates (i.e., pitchers with good control)?
Stat      YRSame  YRNext
SIERA     0.871   1.071
tRA       0.725   1.133
FIP       0.719   1.125
xFIP      1.105   1.168
QERA      0.915   1.073
ERAPark           1.329
ERA       0.085   1.336

Interesting results surface here, as SIERA and QERA are very similar as it pertains to walk rates below the league average. Looking at the pitchers with very low walk rates, xFIP and QERA actually predict next-year ERA better than SIERA, while the latter continues to best both of them at same-year predictions. Moving on to ground-ball rates, both above average and above one standard deviation from the mean:

Stat      YRSame  YRNext
SIERA     1.079   1.153
tRA       0.761   1.205
FIP       0.773   1.202
xFIP      1.071   1.216
QERA      1.088   1.234
ERAPark           1.419
ERA       0.099   1.422

Stat      YRSame  YRNext
SIERA     1.003   1.193
tRA       0.844   1.203
FIP       0.823   1.226
xFIP      1.063   1.217
QERA      1.173   1.272
ERAPark           1.456
ERA       0.091   1.453

Same story, different metrics. Next up is a table looking at interactions between skills. It looks for low-strikeout but high-grounder and high-walk pitchers, the kinds of hurlers we would expect to allow plenty of baserunners and rely on fielded grounders to wipe the slate clean:
>= AVG GB_PA, BB_PA & <= AVG SO_PA

Stat      YRSame  YRNext
SIERA     1.123   1.299
tRA       0.883   1.318
FIP       0.875   1.305
xFIP      1.178   1.408
QERA      1.294   1.477
ERAPark           1.551
ERA       0.121   1.553

>= AVG GB_PA, BB_PA

Stat      YRSame  YRNext
SIERA     0.876   1.064
tRA       0.698   1.141
FIP       0.708   1.127
xFIP      0.982   1.090
QERA      0.912   1.065
ERAPark           1.323
ERA       0.086   1.329

In both of these tables, we see the results we would expect. SIERA, being a Skill-Interactive Earned Run Average, does exactly what it should here: beat other estimators at measuring the skill components of pitcher performance that interact with each other. Moving on to elite pitchers:

<= 3.50 ERAPark

Stat      YRSame  YRNext
SIERA     1.221   1.142
tRA       0.833   1.208
FIP       0.873   1.203
xFIP      1.601   1.235
QERA      1.439   1.180
ERAPark           1.536
ERA       0.063   1.535

With its ability to properly estimate the effects of very strong skill levels, SIERA again beats other estimators that treat HR/FB as luck neutral in predicting same-year park-adjusted ERA and leads all other estimators in predicting next-year park-adjusted ERA.

Looking through all of these different tests, it is apparent not only that SIERA is the best ERA estimator currently available, but specifically that it is exceptionally strong at measuring the skill level of specialized kinds of pitchers. To make this less abstract, tomorrow's fifth and final article in our introductory series will discuss three specific examples of pitchers who are unique in their skill sets, and are particularly troublesome for other estimators. SIERA will perform excellently with all three, which should leave you with a solid understanding of what SIERA does and why it is so important.
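For anyone who wants to try the subset tests above on their own data, the following Python sketch shows one way to restrict an RMSE to pitchers whose rate stat clears a cutoff (average, or some number of standard deviations above it). The record layout, field names, and sample values are hypothetical, and using the population standard deviation is an assumption; the article does not specify how the cutoffs were computed.

```python
import math
from statistics import mean, pstdev


def subset_rmse(rows, rate_key, estimate_key, target_key, sd_above=0.0):
    """RMSE of an estimator over rows whose rate is at or above a cutoff.

    The cutoff is mean(rate) + sd_above * population SD(rate), computed
    over all rows. Returns (rmse, number_of_rows_in_subset).
    """
    rates = [row[rate_key] for row in rows]
    cutoff = mean(rates) + sd_above * pstdev(rates)
    chosen = [row for row in rows if row[rate_key] >= cutoff]
    if not chosen:
        raise ValueError("no rows meet the cutoff")
    squared = [(row[target_key] - row[estimate_key]) ** 2 for row in chosen]
    return math.sqrt(sum(squared) / len(squared)), len(chosen)


# Hypothetical pitcher-seasons, for illustration only.
rows = [
    {"so_pa": 0.10, "estimate": 4.9, "next_era": 5.2},
    {"so_pa": 0.15, "estimate": 4.5, "next_era": 4.4},
    {"so_pa": 0.25, "estimate": 3.8, "next_era": 4.1},
    {"so_pa": 0.30, "estimate": 3.2, "next_era": 3.6},
]

# Above-average strikeout rate, then at least one SD above average.
print(subset_rmse(rows, "so_pa", "estimate", "next_era"))
print(subset_rmse(rows, "so_pa", "estimate", "next_era", sd_above=1.0))
```

Running the same function once per estimator column, and once with the same-year ERA as the target, would produce a table in the YRSame/YRNext layout used above.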
Matt Swartz is an author of Baseball Prospectus. 36 comments have been left for this article.

It seems to be true that if the YRNext column in every table were restricted to just one digit to the right of the decimal point, there would be no difference at all between most of these metrics. In most cases the RMSE would round to identical 1.2 values.
That you can go out to two and three places past the decimal and get "better" numbers is all well and good. But I'm having a hard time putting the improvements of 0.0xx in RMSE in context. What is the meaningful impact of that small level of improvement?
Or perhaps better yet, are any of these improvements outside the margin of error for these various metrics?
If you don't like it being on the other side of the decimal place, multiply everything by 1000. The difference between these is large and very significant. Compare the difference between other metrics and simply using ERA to the difference between SIERA and other metrics, and you'll clearly see it's a BIG step forward.
I should note that if you do these tests separately for EACH YEAR from 2003-09, SIERA is ahead of the same estimators EVERY time. This is a large difference even if we're dealing with ERAs which are necessarily going to require some decimals.