

January 27, 2011

Ahead in the Count

Testing SIERA

When Eric Seidman and I introduced SIERA last winter, we ran a number of tests to determine if our theoretical foundation of run prevention led to a superior estimation of pitchers’ skill levels. While SIERA had a solid advantage at predicting future ERA over some ERA estimators and a small, last-decimal-point lead over xFIP, we ran the tests again after 2010 to ensure that it held a lead going forward. Although the regression formula did not incorporate future ERAs and should not have been biased, it's still important to test the following year to see how well SIERA held up.

The short story is this: SIERA had a good year in 2010, as it extended its lead over other estimators. Below is the root mean square error (RMSE) of the difference between various estimators and park-adjusted ERA (using the Lahman Database’s three-year pitcher park factors) for all pitchers with two consecutive years of at least 40 innings pitched. (To avoid the mix-up of last year, I obtained xFIP, FIP, and tERA directly from FanGraphs’ website.)

Table 1. Estimators' RMSE of difference between Park-Adjusted ERA (IP>=40 both years), Unweighted, 2009-2010
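The RMSE test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the sample numbers, and the names `rmse`, `pitchers`, and `MIN_IP` are hypothetical and are not taken from the dataset behind the article.

```python
import math

def rmse(estimates, actuals):
    """Root mean square error between paired estimates and outcomes."""
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("need two equal-length, non-empty sequences")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical rows, one per pitcher:
# (estimator in year T, IP in year T, park-adjusted ERA in year T+1, IP in year T+1)
pitchers = [
    (3.50, 180.0, 3.80, 200.0),
    (4.20, 35.0, 5.10, 60.0),   # dropped below: fewer than 40 IP in year T
    (4.00, 65.0, 3.00, 45.0),
]

MIN_IP = 40.0  # the article's cutoff: at least 40 IP in both years

# Keep only pitchers who clear the innings cutoff in both seasons.
qualified = [p for p in pitchers if p[1] >= MIN_IP and p[3] >= MIN_IP]

print(rmse([p[0] for p in qualified], [p[2] for p in qualified]))
```

A lower RMSE means the estimator's year-T value sat closer, on average, to the following year's park-adjusted ERA.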
SIERA Tests for 2003-10

This was the typical order of ERA estimation for the previous six years as well:

Table 2. Estimators' RMSE of difference between Park-Adjusted ERA (IP>=40 both years), Unweighted
SIERA has been ahead in six of seven years, with xFIP in second place in each of those six years, and the two switching places in 2009 ERA estimation (using 2008 estimators). FIP typically finished third, tERA typically finished fourth, and previous year’s park-adjusted ERA finished fifth. The exceptions were 2005-06 (when tERA out-estimated FIP), and 2003-04 (when tERA finished behind park-adjusted ERA itself). Putting together the complete 2003-10 dataset of seven pairs of years, we get the following RMSEs:

Table 3. Difference between Park-Adjusted ERA (IP>=40 both years) for 2003-10, Unweighted
SIERA finished modestly ahead of xFIP overall, by .021 points, and both were ahead of the other three estimators. FIP finished a solid third, though to be fair to it, it isn't a park-adjusted statistic and was never intended to estimate park-adjusted ERA. So, I checked to see how FIP did at predicting next year’s unadjusted ERA; it only did slightly better with a 1.216 RMSE. Of course, that included pitchers who changed teams and obviously had different park effects, but using only pitchers who pitched on the same team in two consecutive years, the RMSE actually fell to 1.160, the same RMSE that xFIP had for park-adjusted ERA estimation. Pitchers who did not switch teams threw more innings, and were therefore easier to predict overall, so I ran the same test of other estimators on park-adjusted ERA for only pitchers who did not switch teams, and found that the order remained the same:

Table 4. Pitchers with IP>=40 both years with same team, Unweighted
One criticism of this test is that it treats pitchers who had 41 innings the same as pitchers who threw 241 innings, so I reran the test weighted by next year’s innings pitched. The order remained solidly the same, but unsurprisingly, estimation was better overall for pitchers with more innings:

Table 5. Difference between Estimator and Next Year’s Park-Adjusted ERA (IP>=40 both years), Weighted by next year’s IP
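Weighting by next year's innings changes the calculation only slightly: each squared error is multiplied by the pitcher's innings before averaging. The following is a minimal sketch assuming that straightforward definition of a weighted RMSE; the function name `weighted_rmse` and the sample values are hypothetical.

```python
import math

def weighted_rmse(estimates, actuals, weights):
    """RMSE in which each pitcher's squared error counts in proportion to a weight."""
    if not (len(estimates) == len(actuals) == len(weights)) or not estimates:
        raise ValueError("need three equal-length, non-empty sequences")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_squares = sum(
        w * (e - a) ** 2 for e, a, w in zip(estimates, actuals, weights)
    )
    return math.sqrt(weighted_squares / total_weight)

# Hypothetical pair of pitchers: a 200-inning starter and a 50-inning reliever.
estimates = [3.50, 4.00]       # estimator values in year T
next_year_era = [3.80, 3.00]   # park-adjusted ERA in year T+1
next_year_ip = [200.0, 50.0]   # innings pitched in year T+1, used as weights

print(weighted_rmse(estimates, next_year_era, next_year_ip))
```

With equal weights this reduces to the ordinary RMSE, so the weighted figure differs from the unweighted one only to the extent that high-inning pitchers are easier or harder to estimate.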
While the consensus has emerged that RMSE is the best test to run, some still prefer to see correlations, so I checked those as well, with SIERA similarly ahead:

Table 6. Correlation with Next Year’s Park-Adjusted ERA for Pitchers with IP>=40 both years
Variance in ERA Estimators

Something that I have always found interesting about these ERA estimators is that, since they are not projections, they do not regress to the mean at all. The larger the variance of a statistic, the less it regresses to the mean by definition, which means that its ability to estimate the next year’s ERA is going to be worse, even though it may be picking up on temporary skill levels. For the record, the standard deviations are as follows:

Table 7. Standard Deviation of Estimators among Pitchers with IP>=40
Unsurprisingly, the statistics that include unregressed home runs per fly ball rates (FIP and tERA) have higher standard deviations, but xFIP has a lower standard deviation than SIERA, meaning that it regresses more to the mean than SIERA. Thus, it is a good sign that SIERA is picking up on enough skill level that it still does better at predicting next year’s ERA despite having less natural regression to the mean built in.

How Often is SIERA Closest?

Other than RMSE and correlations, another way to see how well a statistic does is to simply see how often each statistic is closer to next year’s park-adjusted ERA. I did this, and pitted ten different pairs of ERA estimators against each other to see how often each side won. The following are all very statistically significant (with p<.002), as there are over 1800 pitchers who pitched at least 40 innings in two consecutive years between 2003 and 2010:

Table 8. Percentage of Pitchers (IP>=40) For Whom Estimator is Closer to Next Year’s Park-Adjusted ERA
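The head-to-head comparison above can be sketched as a count of which estimator lands closer for each pitcher, followed by a significance check. The article does not say which test produced its p-values; the two-sided sign test below (an exact binomial test against a 50/50 split, with ties dropped) is an assumption, and the function names are hypothetical.

```python
import math

def head_to_head(est_a, est_b, actuals):
    """Count how often each estimator is strictly closer to the actual value."""
    wins_a = wins_b = 0
    for a, b, actual in zip(est_a, est_b, actuals):
        err_a, err_b = abs(a - actual), abs(b - actual)
        if err_a < err_b:
            wins_a += 1
        elif err_b < err_a:
            wins_b += 1
        # exact ties are ignored
    return wins_a, wins_b

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial p-value under the null of a 50/50 win split."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Number of equally likely outcomes at least as lopsided on one side.
    tail = sum(math.comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Hypothetical three-pitcher example; a real test would use the full sample.
wins = head_to_head([3.0, 4.0, 5.0], [3.5, 4.5, 4.0], [3.1, 4.4, 5.0])
print(wins, sign_test_p_value(*wins))
```

With more than 1800 pitchers, even a win rate a few points above 50 percent yields a small p-value under this test, which is consistent with modest margins being reported as highly significant.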
SIERA finished ahead of xFIP, FIP, tERA, and park-adjusted ERA, each by a noticeable and very statistically significant, if not visually damning, amount. Note that xFIP beat the other three by significant margins as well, FIP beat tERA more often, and tERA beat park-adjusted ERA itself most often.

Comparing Estimators to Years Other Than The Following

When SIERA came out last year, Brian Cartwright came up with another way of testing estimators: he looked at the RMSE between ERA estimators and park-adjusted ERAs in years other than just the next one. After all, these are not meant to be projection systems. They are meant to estimate skill level, and next year’s park-adjusted ERA has been considered a pretty good estimate of this year’s skill level. However, so is last year’s park-adjusted ERA, so is the one from two years from now, and so is the one from two years ago. So I checked all of these, along with same-year ERA (which gives the home-run-inclusive estimators FIP and tERA an obvious leg up), three years ahead, and three years behind:

Table 9. RMSEs of Estimators in Year T with Park-Adjusted ERA in Years T-3 through T+3, Unweighted
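Comparing an estimator in year T with park-adjusted ERA in years T-3 through T+3 amounts to shifting the year before pairing the two values. A minimal sketch, assuming both series are stored in dictionaries keyed by a hypothetical `(pitcher_id, year)` tuple and already filtered to seasons that meet the innings cutoff:

```python
import math

def rmse_at_offset(estimator, era, offset):
    """RMSE between an estimator in year T and park-adjusted ERA in year T + offset.

    Both arguments are dicts keyed by (pitcher_id, year). Pitcher-seasons
    with no matching ERA in the shifted year are skipped.
    """
    squared_errors = [
        (value - era[(pid, year + offset)]) ** 2
        for (pid, year), value in estimator.items()
        if (pid, year + offset) in era
    ]
    if not squared_errors:
        return None  # no pitcher has both seasons
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data for a single pitcher "a".
estimator = {("a", 2005): 3.0, ("a", 2006): 4.0}
era = {("a", 2005): 3.0, ("a", 2006): 3.5, ("a", 2007): 4.5}

for offset in range(-3, 4):
    print(offset, rmse_at_offset(estimator, era, offset))
```

An offset of 0 is the same-year comparison, positive offsets look ahead, and negative offsets look back at earlier seasons.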
Unsurprisingly, FIP and tERA do best at same-year ERA, since they give credit or blame to pitchers for their home run rates. I was pleased to see that SIERA was closest in every other comparison. The order stayed the same for the other estimators as well. These are all unweighted observations, meaning that pitchers who throw more innings are treated the same as pitchers who throw fewer. The following table shows the RMSE with each pitcher weighted by their innings pitched in the subsequent year:

Table 10. RMSEs of Estimators in Year T with Park-Adjusted ERA in Years T-3 through T+3, Weighted by IP

Once again, the order of estimators remained the same in each year in question:

Table 11. Correlation of Estimators with Park-Adjusted ERA in Years T-3 through T+3
Correlations seemed to move all over the place, but SIERA did have the highest correlation for every year other than same year. I also tested "win rate," the percentage of how often each statistic was closest to park-adjusted ERA, looking at different years. The first table is same-year ERA, in which FIP unsurprisingly wins the most. SIERA is closer slightly more often than xFIP. This is a statistically significant win percentage, albeit a small one (p=.02). All other comparisons are statistically significant except one: SIERA’s deficit compared to tERA in predicting same-year ERA is not (p=.45), despite the fact that tERA credits a pitcher with the full run cost of the actual number of home runs they surrender.

Table 12. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Same Year
Since we have already looked at predicting next year’s park-adjusted ERA, we jump to "predicting" last year’s park-adjusted ERA.

Table 13. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Previous Year

Again, SIERA continues to do better, followed by xFIP, FIP, tERA, and park-adjusted ERA itself. The following tables look at two years from now, two years ago, three years from now, and three years ago:

Table 14. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Two Years

Table 15. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA Two Years Before

Table 16. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Three Years

Table 17. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA Three Years Before
The results and order of best statistics are similar in each of these tables. Almost all of these are statistically significant, thanks to the large sample size of pitchers. The exceptions include tERA’s victory over park-adjusted ERA, which is not significant in predicting ERA from three years ago, three years ahead, or one year ago. Additionally, SIERA's victory over xFIP at predicting ERA three years prior is not significant, nor is xFIP’s deficit compared to FIP three years prior.

Averaging Multiple Years of ERA Estimators

When I discussed this article with Sky Kalkman, he suggested that I look at whether averaging multiple years of each estimator would show FIP to be better at picking up the elusive skill level in home runs per fly ball. Considering this very plausible, I checked this a few different ways. First, I looked at just averaging the previous three years of ERA estimators to predict the fourth year’s park-adjusted ERA, without doing any weighting. I looked at both RMSE and the correlation. Second, I weighted the estimator by innings pitched in each of those first three years. Again, I looked at the RMSE and the correlation. Third, I used that estimate but checked the RMSE while weighting the pitchers by innings pitched in the fourth year. The results?

Table 18. Three-Year Average of Estimator versus Park-Adjusted ERA in Fourth Year
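The two ways of combining three seasons described above differ only in whether each season counts equally or in proportion to its innings. A minimal sketch with hypothetical numbers and function names:

```python
def simple_average(values):
    """Unweighted mean of a pitcher's estimator values across seasons."""
    return sum(values) / len(values)

def ip_weighted_average(values, innings):
    """Mean of estimator values, each season weighted by its innings pitched."""
    total_ip = sum(innings)
    if total_ip <= 0:
        raise ValueError("innings must sum to a positive number")
    return sum(v * ip for v, ip in zip(values, innings)) / total_ip

# Hypothetical pitcher: estimator values and innings for years 1 through 3.
estimator_by_year = [3.0, 4.0, 5.0]
innings_by_year = [200.0, 100.0, 100.0]

print(simple_average(estimator_by_year))                        # 4.0
print(ip_weighted_average(estimator_by_year, innings_by_year))  # 3.75
```

Either combined value would then be compared with the fourth year's park-adjusted ERA using RMSE or correlation, in the same way as the single-year tests.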
SIERA continued to be the best in each of these five tests, though FIP basically caught up with xFIP. Although tERA did decently well with correlations, it fell short in RMSE.

Mixing and Matching

I also ran some other tests (not included) in which I used weighted averages of xFIP and FIP to see if they did better as a mixture than separately. They did, and the best mixture seemed to be 80%/20% for xFIP/FIP in one-year estimation and 40%/60% for three-year estimation. However, these did not outdo SIERA in any of the cases, except that the three-year mixture did tie it when observations were weighted by IP in the fourth year and the estimator was weighted by IP in the first three years. The rest of the tests had SIERA safely in front.

Squeezing the data every which way, it remains true that 2010 continues to show SIERA to be the best ERA estimator. It is clear that xFIP is almost as good, though if left with one, I would prefer SIERA (perhaps obviously). Interestingly, running a regression of park-adjusted ERA on the previous year’s SIERA and xFIP shows that not only does SIERA do a better job, you should actually lower the expected ERA of a pitcher with a higher xFIP and the same SIERA. The formula given is:

ERA (pk-adj) = 1.60 + .914*SIERA - .277*xFIP

Both coefficients are statistically significant (p=.000, p=.013 for SIERA and xFIP respectively). This means that xFIP is not giving extra information beyond what SIERA does. This peculiar result of a negative coefficient is probably a result of sampling bias, but it is still worth reporting.

Why SIERA Succeeds

The natural question that everyone asked last year when we came out with SIERA was not just if it was the best estimator, but why it is the best estimator. Why is it that a statistic that has fewer years to work with (and therefore does not precisely estimate the run-effect and out-effect of strikeouts, walks, and home runs) does better than statistics like xFIP and FIP, which do precisely estimate those things?
My further research over the last year has helped me understand why. The following are the highlights of this research. The first one listed is the one that we knew already when we published SIERA last year, but it is not the primary reason at all.
My educated guess is that reasons 2) and 4) are the primary reasons for SIERA's superior estimation skill. The stark difference between QERA’s RMSE and SIERA’s RMSE in last year’s testing was primarily due to the negative coefficient on ground-ball rate squared in SIERA. When we ran our initial tests on SIERA, the inclusion of a variable for the square of ground-ball rate often did the most to improve estimation. Further, even though pitchers do have some control over BABIP, the amount that they do control is very similar to the amount that SIERA credits them with through BABIP’s correlation with strikeouts and ground-ball rates.

With this, we're not done with SIERA. We knew when we introduced it that we only had so many years of batted-ball data, and more years will undoubtedly help us better estimate the many coefficients in SIERA (though we have not yet looked at how much 2010 can help). Also, as Colin Wyers has done some work on park-adjusted batted-ball rates for pitchers, this may or may not help SIERA improve its estimation skill. If it does, we will be able to make these changes as well. Furthermore, as run environments change, SIERA will need to adjust accordingly too.

ERA estimation is clearly a complicated task with a lot of moving parts. SIERA is currently the best way to take a snapshot of a pitcher’s skill level, but with a lot of competition out there, we will continue to work on delivering even better ways of understanding pitchers’ skill levels. In the end,
Matt Swartz is an author of Baseball Prospectus.

21 comments have been left for this article.

TangoTiger (57181)
In this article, I didn't actually compute any of the estimators using formulas. In the article, I explained that to avoid last year's mix-up, I just used what was available at BP and FG. I did use old SIERAs from 2003-04 before BP stats have access to the batted ball data. If there was a bbFIP posted somewhere that I could merge with the rest of the data, that would be easier. If I try to compute it from an original set of batted ball data, I can include bbFIP, but it's tricky without since my coding skills are still a work in progress.
Jan 29, 2011 09:04 AM

TangoTiger (57181)
Also, can you test with unadjusted ERA, not the park-adjusted? After all, FIP itself is unadjusted, and you can't take Ubaldo's unadjusted FIP in 2009 and compare it to his park-adjusted ERA in 2010. This is clearly unfair to FIP. Alternatively, park-adjust FIP in 2009 to test to park-adjusted ERA in 2010.
Jan 27, 2011 08:42 AM

Amen to that, Tom, RA9 would be a nice tweak. The official scorer's opinion does not change the score, after all.
Jan 27, 2011 12:14 PM

Yeah, I agree that FIP isn't park-adjusted and that causes it to be shortchanged. In the article I explained that, which is why I tested FIP against unadjusted ERA for pitchers who did not switch teams, and checked SIERA, xFIP, etc. against park-adjusted ERA for the same pitchers who did not switch teams. (That's in table 4).
Jan 29, 2011 09:08 AM

TangoTiger (57181)
Can you also post your dataset used to create all these charts, so the rest of us can do our own testing?
Jan 27, 2011 08:43 AM

JD Sussman (49200)
Fantastic stuff Matt. I wish there was a LOVE button somewhere I could click.
Jan 27, 2011 09:22 AM

IvanGrushenko (45528)
So it's an awesome stat! Can I find it on the player cards or do I have to look at the long of pitchers by year?
Jan 27, 2011 09:55 AM

RedsManRick (23592)
I really like the fact that Sierra recognizes that the ability to miss bats and the ability to induce weak contact come from much of the same underlying skills and thus are strongly correlated.
Jan 27, 2011 11:21 AM

steverynear (30911)
Fangraphs allows you to sort by ERA-FIP. Can BP provide a ERA-SIERA sort? A SIERA - xFIP would be cool to look at as well.
Jan 27, 2011 12:46 PM

WaldoInSC (26415)
BP + Fangraphs + Tangotiger + all other analytically-bent baseball sites = boon for readers.
Jan 27, 2011 15:25 PM

bleaklewis (38301)
Great, interesting work! Any future plans to work with and integrate pitchF/X data into SIERA?
Jan 27, 2011 18:24 PM

I don't think we'll integrate the two in a formulaic way. Mike Fast and I email and tweet to exchange ideas sometimes, and it has definitely helped my understanding of how the two can be integrated to learn about pitching. I suspect that the data won't be used in SIERA but both will be used to inform our understanding of pitchers.
Jan 29, 2011 09:10 AM

Michael Bodell (89)
Great article! I hope we see lots of these type of articles, and not just when the BP stat in question comes out on top. How about one on multi-year PECOTA study too?
Jan 28, 2011 13:44 PM

That could be something we'll work on soon too. I agree that it's important to be forthright about when we do not come out on top. Colin has worked very hard on improving PECOTA in the last several months, and he's definitely playing with tests of various versions of PECOTA at least internally. I do think multi-year looks at projection systems are an important step to take as well. Thanks.
Jan 29, 2011 09:12 AM

great stuff. can't improve without self evaluation.