Ahead in the Count: Testing SIERA

January 27, 2011

When Eric Seidman and I introduced SIERA last winter, we ran a number of tests to determine whether our theoretical foundation of run prevention led to a superior estimate of pitchers’ skill levels. SIERA had a solid advantage over some ERA estimators at predicting future ERA, but only a small lead over xFIP, visible in the last decimal place, so we ran the tests again after 2010 to see whether that lead held going forward. Although the regression formula did not incorporate future ERAs and should not have been biased, it is still important to test on the following year's data to see how well SIERA holds up.

The short story is this: SIERA had a good year in 2010, as it extended its lead over other estimators. Below is the root mean square error (RMSE) of the difference between each estimator and the following year's park-adjusted ERA (using the Lahman Database’s three-year pitcher park factors) for all pitchers with two consecutive years of at least 40 innings pitched. (To avoid the mix-up of last year, I obtained xFIP, FIP, and tERA directly from FanGraphs’ website.)

Table 1. RMSE of Difference between Estimator and Next Year’s Park-Adjusted ERA (IP>=40 both years), Un-weighted, 2009-2010

Estimator	RMSE
SIERA	1.083
xFIP	1.139
FIP	1.180
tERA	1.220
ERA (pk-adj)	1.322
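
For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of the un-weighted RMSE calculation. The function name and the sample numbers are made up for illustration; they are not the data behind Table 1.

```python
import math

def rmse(estimates, actuals):
    """Root mean square error between paired estimates and actual values."""
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values: one estimator in year T and each pitcher's
# park-adjusted ERA in year T+1 (same pitcher at the same list position).
estimator_year_t = [3.50, 4.10, 4.80]
pk_adj_era_next_year = [3.20, 4.60, 4.40]

print(round(rmse(estimator_year_t, pk_adj_era_next_year), 3))
```

Each list position stands for one pitcher who cleared the 40-inning threshold in both years; that filtering is assumed to have happened before the lists are built.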

SIERA Tests for 2003-10

This was the typical order of ERA estimation for the previous six years as well:

Table 2. RMSE of Difference between Estimator and Next Year’s Park-Adjusted ERA (IP>=40 both years), Un-weighted

Estimator	2009-10	2008-09	2007-08	2006-07	2005-06	2004-05	2003-04
SIERA	1.083	1.213	1.097	1.191	1.179	1.131	1.067
xFIP	1.139	1.200	1.136	1.209	1.204	1.138	1.086
FIP	1.180	1.264	1.192	1.255	1.294	1.239	1.124
tERA	1.220	1.312	1.235	1.325	1.265	1.321	1.425
ERA (pk-adj)	1.322	1.478	1.388	1.440	1.492	1.385	1.403

SIERA has been ahead in six of seven years, with xFIP in second place in each of those six years, and the two switching places in 2009 ERA estimation (using 2008 estimators). FIP typically finished third, tERA typically finished fourth, and previous year’s park-adjusted ERA finished fifth. The exceptions were 2005-06 (when tERA out-estimated FIP), and 2003-04 (when tERA finished behind park-adjusted ERA itself).

Putting together the complete 2003-10 dataset of seven pairs of years, we get the following RMSEs:

Table 3. RMSE of Difference between Estimator and Next Year’s Park-Adjusted ERA (IP>=40 both years) for 2003-10, Un-weighted

Estimator	RMSE
SIERA	1.139
xFIP	1.160
FIP	1.223
tERA	1.301
ERA (pk-adj)	1.416

SIERA finished modestly ahead of xFIP overall, by .021 points, and both were ahead of the other three estimators. FIP finished a solid third, though to be fair to it, it isn't a park-adjusted statistic and was never intended to estimate park-adjusted ERA. So, I checked to see how FIP did at predicting next year’s unadjusted ERA; it only did slightly better with a 1.216 RMSE. Of course, that included pitchers who changed teams and obviously had different park effects, but using only pitchers who pitched on the same team in two consecutive years, the RMSE actually fell to 1.160, the same RMSE that xFIP had for park-adjusted ERA estimation.

Pitchers who did not switch teams threw more innings, and were therefore easier to predict overall, so I ran the same test of other estimators on park-adjusted ERA for only pitchers who did not switch teams, and found that the order remained the same:

Table 4. Pitchers with IP>=40 both years with same team, Un-weighted

Estimator and Estimated	RMSE
SIERA with Park-Adjusted ERA	1.093
xFIP with Park-Adjusted ERA	1.127
FIP with ERA	1.160
tERA with Park-Adjusted ERA	1.235

One criticism of this test is that it treats pitchers who had 41 innings the same as pitchers who threw 241 innings, so I re-ran the test weighted by next year’s innings pitched. The order remained solidly the same, but unsurprisingly, estimation was better overall for pitchers with more innings:

Table 5. Difference between Estimator and Next Year’s Park-Adjusted ERA (IP>=40 both years), Weighted by next year’s IP

Estimator	RMSE
SIERA	1.013
xFIP	1.033
FIP	1.094
tERA	1.195
ERA (pk-adj)	1.272
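
A sketch of how an innings-weighted RMSE can be computed, assuming each squared error is weighted by the pitcher's innings pitched in the following year. The function name and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def weighted_rmse(estimates, actuals, weights):
    """RMSE in which each squared error counts in proportion to its weight."""
    if not (len(estimates) == len(actuals) == len(weights)) or not estimates:
        raise ValueError("need three non-empty sequences of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sq = sum(w * (e - a) ** 2
                      for e, a, w in zip(estimates, actuals, weights))
    return math.sqrt(weighted_sq / total_weight)

# Hypothetical pitchers: a 200-inning starter missed by 0.50 runs and a
# 50-inning reliever missed by 2.00 runs.
estimates = [3.5, 4.5]
actuals = [3.0, 6.5]
next_year_ip = [200, 50]

print(weighted_rmse(estimates, actuals, next_year_ip))
```

In this toy example the large miss on the low-innings pitcher counts for less than it would in an un-weighted test, which is the same reason the values in Table 5 come out lower than those in Table 3.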

While a consensus has emerged that RMSE is the best test to run, some still prefer to see correlations, so I checked those as well, with SIERA similarly ahead:

Table 6. Correlation with Next Year’s Park-Adjusted ERA for Pitchers with IP>=40 both years

Estimator	Correl
SIERA	.398
xFIP	.352
FIP	.341
tERA	.347
ERA (pk-adj)	.295
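
For reference, a small Python sketch of the Pearson correlation, written out by hand so that it has no dependencies. The function name and numbers are illustrative only. It also shows one reason RMSE is often preferred: shifting every estimate by a constant leaves the correlation unchanged even though the estimates are now further from the target.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: the second estimator is the first plus one full run.
next_year_era = [3.2, 4.6, 4.4, 5.1]
estimator = [3.5, 4.1, 4.8, 4.9]
shifted_estimator = [e + 1.0 for e in estimator]

print(pearson_r(estimator, next_year_era))
print(pearson_r(shifted_estimator, next_year_era))  # same correlation
```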

Variance in ERA Estimators

Something that I have always found interesting about these ERA estimators is that, since they are not projections, they do not regress to the mean at all. The larger the variance of a statistic, the less it regresses to the mean by definition, which means that its ability to estimate the next year’s ERA is going to be worse, even though it may be picking up on temporary skill levels. For the record, the standard deviations are as follows:

Table 7. Standard Deviation of Estimators among Pitchers with IP>=40

Estimator	Standard Deviation
SIERA	.740
xFIP	.682
FIP	.901
tERA	1.029
ERA (pk-adj)	1.247
ERA	1.157

Unsurprisingly, the statistics that include un-regressed home runs per fly-ball rates (FIP and tERA) have higher standard deviations, but xFIP has a lower standard deviation than SIERA, meaning that it regresses more to the mean than SIERA. Thus, it is a good sign that SIERA still does better at predicting next year’s ERA despite having less natural regression to the mean built in, as this suggests it is picking up on real skill.
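
One way to see the link between an estimator's spread and its RMSE is the standard decomposition of mean squared error. This is a sketch with notation introduced here for illustration, not something taken from the original analysis: e is the estimator in year T, a is next year's park-adjusted ERA, μ and σ are their population means and standard deviations, and ρ is their correlation.

```latex
\mathrm{RMSE}^2 = \mathbb{E}\big[(e - a)^2\big]
               = (\mu_e - \mu_a)^2 + \sigma_e^2 + \sigma_a^2 - 2\rho\,\sigma_e\,\sigma_a
```

Holding ρ and σ_a fixed, the right-hand side is smallest when σ_e = ρσ_a, so an estimator whose spread is wider than that pays an RMSE penalty unless its correlation with next year's ERA is high enough to make up for it.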

How Often is SIERA Closest?

Other than RMSE and correlations, another way to see how well a statistic does is to simply see how often each statistic is closer to next year’s park-adjusted ERA. I did this, and pitted ten different pairs of ERA estimators against each other to see how often each side won. The following are all very statistically significant (with p<.002), as there are over 1800 pitchers who pitched at least 40 innings in two consecutive years between 2003 and 2010:

Table 8. Percentage of Pitchers (IP>=40) For Whom Estimator is Closer to Next Year’s Park-Adjusted ERA

Win-Rate	vs. xFIP	vs. FIP	vs. tERA	vs. ERA (pk-adj)
SIERA	54.6%	54.1%	56.8%	58.9%
xFIP		53.9%	56.0%	58.0%
FIP			55.8%	59.2%
tERA				53.5%

SIERA finished ahead of xFIP, FIP, tERA, and park-adjusted ERA, each by a noticeable and very statistically significant if not visually damning amount. Note that xFIP beat the other three by significant margins as well, FIP beat tERA more often, and tERA beat park-adjusted ERA itself most often.
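
The head-to-head comparison can be sketched in Python as follows. The article does not say which significance test was used, so the sign test with a normal approximation shown here is an assumption, as is the choice to drop exact ties. Function names and data are hypothetical.

```python
import math

def head_to_head(est_a, est_b, actuals):
    """Count how often estimator A is strictly closer than B to the actual value.

    Returns (wins_for_a, decided), where decided excludes exact ties.
    """
    wins_a = wins_b = 0
    for a, b, y in zip(est_a, est_b, actuals):
        err_a, err_b = abs(a - y), abs(b - y)
        if err_a < err_b:
            wins_a += 1
        elif err_b < err_a:
            wins_b += 1
    return wins_a, wins_a + wins_b

def sign_test_p_value(wins, decided):
    """Two-sided p-value for H0: win probability = 0.5 (normal approximation)."""
    if decided <= 0:
        raise ValueError("need at least one decided comparison")
    z = (wins - decided / 2) / math.sqrt(decided / 4)
    return math.erfc(abs(z) / math.sqrt(2))

# Tiny made-up example: A wins once, B wins once, one tie.
wins, decided = head_to_head([3.0, 4.0, 5.0], [3.5, 4.0, 4.0], [3.1, 4.0, 4.2])
print(wins, decided)

# With a sample in the thousands, a win rate in the mid-50s is far from 50%.
print(sign_test_p_value(546, 1000))
```

The normal approximation is reasonable at sample sizes like the 1,800-plus pitcher pairs described above; an exact binomial test would be the safer choice for small samples.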

Comparing Estimators to Years Other Than The Following

When SIERA came out last year, Brian Cartwright came up with another way of testing estimators: he looked at the RMSE between ERA estimators and park-adjusted ERAs in years other than just the next one. After all, these are not meant to be projection systems. They are meant to estimate skill level, and next year’s park-adjusted ERA has been considered a pretty good estimate of this year’s skill level. However, so are last year’s park-adjusted ERA, the ERA two years from now, and the ERA two years ago. So I checked all of these, along with same-year ERA (which gives the home-run-inclusive estimators FIP and tERA an obvious leg up), three years ahead, and three years behind:

Table 9. RMSEs of Estimators in Year T with Park-Adjusted ERA in Years T-3 through T+3, Un-weighted

Comparison Years	T	T+1	T-1	T+2	T-2	T+3	T-3
Estimator	RMSE	RMSE	RMSE	RMSE	RMSE	RMSE	RMSE
SIERA	0.997	1.139	1.138	1.209	1.200	1.234	1.271
xFIP	1.002	1.160	1.165	1.218	1.226	1.252	1.295
FIP	0.843	1.223	1.237	1.301	1.290	1.308	1.342
tERA	0.946	1.301	1.323	1.399	1.364	1.400	1.420
ERA (pk-adj)		1.416	1.416	1.509	1.509	1.516	1.516

Unsurprisingly, FIP and tERA do best at same-year ERA, since they give credit or blame to pitchers for their home run rates. I was pleased to see that SIERA was closest in every other comparison. The order stayed the same for the other estimators as well.
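
A rough Python sketch of how the year-offset comparison can be set up, assuming the data are stored as dictionaries keyed by (pitcher, year) and already limited to qualifying seasons. The data layout, names, and values are all hypothetical.

```python
import math

def rmse_at_offset(estimator, pk_adj_era, offset):
    """RMSE between an estimator in year T and park-adjusted ERA in year T+offset.

    Both arguments are dicts keyed by (pitcher_id, year). Only pitchers with
    an entry in both years are compared.
    """
    squared_errors = []
    for (pitcher_id, year), estimate in estimator.items():
        target = pk_adj_era.get((pitcher_id, year + offset))
        if target is not None:
            squared_errors.append((estimate - target) ** 2)
    if not squared_errors:
        raise ValueError("no pitcher seasons pair up at this offset")
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up seasons for two pitchers.
estimator = {("a", 2008): 4.0, ("a", 2009): 3.5, ("b", 2009): 4.5}
pk_adj_era = {("a", 2008): 4.5, ("a", 2009): 3.0,
              ("a", 2010): 3.5, ("b", 2010): 5.5}

for k in (0, 1, -1):
    print(k, round(rmse_at_offset(estimator, pk_adj_era, k), 3))
```

Positive offsets look forward (T+1, T+2, T+3) and negative offsets look backward; the number of pitchers who pair up shrinks as the offset grows, which is worth keeping in mind when comparing columns.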

These are all un-weighted observations, meaning that pitchers who throw more innings are treated the same as pitchers who throw fewer. The following table shows the RMSE with each pitcher weighted by their innings pitched in the subsequent year:

Table 10. RMSEs of Estimators in Year T with Park-Adjusted ERA in Years T-3 through T+3, Weighted by IP

Comparison Years	T	T+1	T-1	T+2	T-2	T+3	T-3
Estimator	RMSE	RMSE	RMSE	RMSE	RMSE	RMSE	RMSE
SIERA	0.875	1.013	1.019	1.070	1.074	1.088	1.142
xFIP	0.876	1.033	1.042	1.083	1.102	1.115	1.166
FIP	0.736	1.094	1.129	1.158	1.169	1.173	1.235
tERA	0.862	1.195	1.236	1.287	1.263	1.282	1.331
ERA (pk-adj)		1.272	1.314	1.354	1.403	1.364	1.415

Once again, the order of estimators remained the same in each comparison year other than the same year. Correlations for the same comparisons are as follows:

Table 11. Correlation of Estimators with Park-Adjusted ERA in Years T-3 through T+3

Comparison Years	T	T+1	T-1	T+2	T-2	T+3	T-3
Estimator	Correl	Correl	Correl	Correl	Correl	Correl	Correl
SIERA	.600	.398	.357	.343	.309	.305	.244
xFIP	.606	.352	.325	.305	.273	.244	.214
FIP	.738	.341	.303	.286	.269	.270	.235
tERA	.710	.347	.303	.286	.280	.276	.233
ERA (pk-adj)		.295	.295	.232	.232	.237	.237

Correlations seemed to move all over the place, but SIERA did have the highest correlation for every year other than same year.

I also tested "win rate," the percentage of how often each statistic was closest to park-adjusted ERA, looking at different years.

The first table is same-year ERA, in which FIP unsurprisingly wins the most. SIERA is closer slightly more often than xFIP; the margin is small but statistically significant (p=.02). All other comparisons are statistically significant except SIERA’s deficit to tERA in predicting same-year ERA (p=.45), which is not significant despite the fact that tERA credits a pitcher with the full run cost of the actual number of home runs they surrender.

Table 12. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Same Year

Win-Rate (year T)	vs. xFIP	vs. FIP	vs. tERA
SIERA	52.1%	41.2%	49.3%
xFIP		39.4%	48.1%
FIP			58.0%

Since we have already looked at predicting next year’s park-adjusted ERA, we jump to "predicting" last year’s park-adjusted ERA.

Table 13. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Previous Year

Win-Rate (year T-1)	vs. xFIP	vs. FIP	vs. tERA	vs. ERA (pk-adj)
SIERA	54.3%	55.8%	59.3%	57.0%
xFIP		53.4%	58.4%	58.3%
FIP			55.9%	56.9%
tERA				51.2%

Again, SIERA continues to do better, followed by xFIP, FIP, tERA, and park-adjusted ERA itself.

The following tables look at two years from now, two years ago, three years from now, and three years ago:

Table 14. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Two Years

Win-Rate (year T+2)	vs. xFIP	vs. FIP	vs. tERA	vs. ERA (pk-adj)
SIERA	52.8%	53.8%	57.3%	58.6%
xFIP		54.7%	57.5%	59.6%
FIP			55.4%	58.4%
tERA				53.3%

Table 15. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA Two Years Before

Win-Rate (year T-2)	vs. xFIP	vs. FIP	vs. tERA	vs. ERA (pk-adj)
SIERA	53.9%	55.8%	57.5%	59.8%
xFIP		53.7%	58.1%	59.3%
FIP			56.2%	58.1%
tERA				53.7%

Table 16. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA in Three Years

Win-Rate (year T+3)	vs. xFIP	vs. FIP	vs. tERA	vs. ERA (pk-adj)
SIERA	53.3%	55.6%	55.8%	59.1%
xFIP		53.8%	55.8%	57.2%
FIP			54.0%	58.8%
tERA				53.2%

Table 17. Percentage of Pitchers for Whom Estimator Was Closer to Park-Adjusted ERA Three Years Before

Win-Rate (year T-3)	vs. xFIP	vs. FIP	vs. tERA	vs. ERA (pk-adj)
SIERA	52.5%	52.8%	56.8%	54.9%
xFIP		49.9%	54.8%	54.7%
FIP			56.5%	54.3%
tERA				51.6%

The results and the order of the statistics are similar in each of these tables. Almost all of these are statistically significant, thanks to the large sample size of pitchers. The exceptions include tERA’s victory over park-adjusted ERA, which is not significant in predicting ERA from three years ago, three years ahead, or one year ago. Additionally, SIERA's victory over xFIP at predicting ERA three years prior is not significant, nor is xFIP’s deficit compared to FIP three years prior.

Averaging Multiple Years of ERA Estimators

In discussing this article with Sky Kalkman, he suggested that I look at whether multiple years of this estimator averaged out would show FIP to be better at picking up the elusive skill level in home runs per fly ball. Considering this very plausible, I checked this a few different ways.

First, I looked at just averaging the previous three years of ERA estimators to predict the fourth year’s park-adjusted ERA, without doing any weighting. I looked at both RMSE and the correlation. Second, I weighted the estimator by innings pitched in each of those first three years. Again, I looked at the RMSE and the correlation. Third, I used that estimate but checked the RMSE while weighting the pitchers by innings pitched in the fourth year. The results?

Table 18. Three-Year Average of Estimator versus Park-Adjusted ERA in Fourth Year

	Un-weighted	Un-weighted	Estimator weighted by IP, pitchers un-weighted	Estimator weighted by IP, pitchers un-weighted	Estimator weighted by IP, pitchers weighted by fourth-year IP
Estimator over Three Years	RMSE	Correl	RMSE	Correl	RMSE
SIERA	1.104	.433	1.107	.427	.968
xFIP	1.120	.390	1.123	.386	.984
FIP	1.122	.411	1.122	.410	.975
tERA	1.141	.427	1.135	.425	1.016
ERA (pk-adj)	1.177	.397	1.178	.396	1.021

SIERA continued to be the best in each of these five tests, though FIP basically caught up with xFIP. Although tERA did decently well with correlations, it fell short in RMSE.
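
The difference between the simple and the innings-weighted three-year average can be sketched in a few lines of Python. The function name and the values are hypothetical; the point is only to show how a heavy-workload season pulls the weighted average toward itself.

```python
def ip_weighted_average(values, innings):
    """Average of yearly estimator values, weighted by innings pitched."""
    if len(values) != len(innings) or not values:
        raise ValueError("need two non-empty sequences of equal length")
    total_ip = sum(innings)
    if total_ip <= 0:
        raise ValueError("innings must sum to a positive number")
    return sum(v * ip for v, ip in zip(values, innings)) / total_ip

# Made-up estimator values for one pitcher over three years.
yearly_estimator = [4.0, 3.0, 5.0]
yearly_ip = [100, 200, 100]

simple_average = sum(yearly_estimator) / len(yearly_estimator)
weighted_average = ip_weighted_average(yearly_estimator, yearly_ip)
print(simple_average, weighted_average)
```

Either average would then be compared with the fourth year's park-adjusted ERA using RMSE or correlation, as in the columns of Table 18.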

Mixing and Matching

I also ran some other tests (not included) in which I used weighted averages of xFIP and FIP to see if they did better as a mixture than separately. They did, and the best mixture seemed to be 80%/20% for xFIP/FIP in one-year estimation and 40%/60% for three-year estimation. However, these mixtures did not outdo SIERA in any case. The closest was the three-year mixture, which tied SIERA when observations were weighted by IP in the fourth year and the estimator was weighted by IP in the first three years. The rest of the tests had SIERA safely in front.

However the data are squeezed, 2010 continues to show SIERA to be the best ERA estimator. It is clear that xFIP is almost as good, though if left with one, I would prefer SIERA (perhaps obviously). Interestingly, running a regression of park-adjusted ERA on the previous year’s SIERA and xFIP shows that not only does SIERA do a better job, but you should actually lower the expected ERA of a pitcher with a higher xFIP and the same SIERA. The resulting formula is:

ERA (pk-adj) = 1.60 + .914*SIERA – .277*xFIP

Both coefficients are statistically significant (p=.000, p=.013 for SIERA and xFIP respectively). This means that xFIP is not giving extra information beyond what SIERA does. This peculiar result of a negative coefficient is probably a result of sampling bias, but it is still worth reporting.
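
To make the negative xFIP coefficient concrete, here is the formula above applied to two hypothetical pitchers with the same SIERA and different xFIPs. The function name and the input values are made up for illustration; the coefficients are the ones reported in the formula.

```python
def expected_pk_adj_era(siera, xfip):
    """Next year's park-adjusted ERA implied by the regression in the article."""
    return 1.60 + 0.914 * siera - 0.277 * xfip

# Two hypothetical pitchers with identical SIERA and different xFIP.
same_xfip = expected_pk_adj_era(siera=4.00, xfip=4.00)
higher_xfip = expected_pk_adj_era(siera=4.00, xfip=4.50)

print(round(same_xfip, 3), round(higher_xfip, 3))
print(round(higher_xfip - same_xfip, 3))  # negative: higher xFIP lowers the estimate
```

Half a run of extra xFIP lowers the expected ERA by about 0.14 runs under this formula, which is the peculiar result described above.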

Why SIERA Succeeds

The natural question that everyone asked last year when we came out with SIERA was not just if it was the best estimator, but why it is the best estimator. Why is it that a statistic that has fewer years to work with–and therefore does not precisely estimate the run-effect and out-effect of strikeouts, walks, and home runs–does better than statistics like xFIP and FIP, that do precisely estimate those things?

My further research over the last year has helped me understand why. The following are the highlights of this research. The first one listed is the one that we knew already when we published SIERA last year, but it is not the primary reason at all.

1) Ground balls matter more for pitchers who get more walks and fewer strikeouts because they allow more runners to reach first base.
2) Ground-ball pitchers allow fewer hits and fewer extra-base hits on ground balls than non-ground-ball pitchers, and SIERA acknowledges this effect due to its negative coefficient on ground-ball rate squared.
3) Pitchers with higher ground-ball rates (but not too high) allow the highest BABIPs and SIERA picks up on this reversing effect of ground balls on BABIP due to their correlation.
4) Pitchers with higher strikeout rates allow lower BABIPs and lower HR/FB rates, and SIERA picks up on this correlation. This is why the coefficient on strikeout rate in SIERA is so negative–because pitchers with high strikeout rates not only prevent runs by getting outs, but because they also allow fewer hits on balls in play and fewer home runs on fly balls.
5) Pitchers with higher strikeout rates get more ground balls in double-play situations.
6) Pitchers with lower walk rates issue more of their walks strategically, and thus the average damage of a walk from a high walk pitcher is higher, another effect which SIERA picks up.

My educated guess is that reasons 2) and 4) are the primary reasons for SIERA's superior estimation skill. The stark difference between QERA’s RMSE and SIERA’s RMSE in last year’s testing was primarily due to the negative coefficient on ground-ball rate squared in SIERA. When we ran our initial tests on SIERA, the inclusion of a variable for the square of ground-ball rate often did the most to improve estimation. Further, even though pitchers do have some control over BABIP, the amount that they do control is very similar to the amount that SIERA credits them with through BABIP’s correlation with strikeouts and ground-ball rates.

With this, we're not done with SIERA. We knew when we introduced it that we only had so many years of batted-ball data, and more years will undoubtedly help us better estimate the many coefficients in SIERA (though we have not yet looked at how much 2010 can help). Also, as Colin Wyers has done some work on park-adjusted batted-ball rates for pitchers, this may or may not help SIERA improve its estimation skill. If it does, we will be able to make these changes as well. Furthermore, as run environments change, SIERA will need to adjust accordingly too.

In the end, ERA estimation is clearly a complicated task with a lot of moving parts. SIERA is currently the best way to take a snapshot of a pitcher’s skill level, but with a lot of competition out there, we will continue to work on delivering even better ways of understanding pitchers’ skill levels.

Matt Swartz

oneofthem

1/27

great stuff. can't improve without self evaluation.

TangoTiger1

1/27

Good stuff Matt. Can you include Batted Ball FIP in your testing?

swartzm

1/29

In this article, I didn't actually compute any of the estimators using formulas. In the article, I explained that to avoid last year's mix-up, I just used what was available at BP and FG. I did use old SIERAs from 2003-04 before BP stats have access to the batted ball data. If there was a bbFIP posted somewhere that I could merge with the rest of the data, that would be easier. If I try to compute it from an original set of batted ball data, I can include bbFIP, but it's tricky without since my coding skills are still a work in progress.

TangoTiger1

1/27

Also, can you test with unadjusted ERA, not the park-adjusted? After all, FIP itself is unadjusted, and you can't take Ubaldo's unadjusted FIP in 2009 and compare it to his park-adjusted ERA in 2010. This is clearly unfair to FIP. Alternatively, park-adjust FIP in 2009 to test to park-adjusted ERA in 2010.

SIERA positions itself as being park-neutral, so comparing park-neutral SIERA in 2009 will obviously give it an advantage to park-neutral 2010 ERA.

Finally: we really don't care about ERA but RA. The UER is some biased construct, as FB pitchers like Santana will give up far fewer UER than GB pitchers like Brandon Webb.

ckahrl

1/27

Amen to that, Tom, RA9 would be a nice tweak. The official scorer's opinion does not change the score, after all.

TangoTiger1

1/27

Ah, RA9. Perfect name. I keep using ERA and RA as pairs (each denoting a rate stat), and it's never sat well with me, since the term RA is also used as a counting stat. RA9. I like it. Now we just need all the saber-stat sites to use it.

swartzm

1/29

Yeah, I agree that FIP isn't park-adjusted and that causes it to be short-changed. In the article I explained that, which is why I tested FIP against unadjusted ERA for pitchers who did not switch teams, and checked SIERA, xFIP, etc. against park-adjusted ERA for the same pitchers who did not switch teams. (That's in Table 4.)

I agree RA would be ideal instead of ERA, for non-fantasy purposes. At some point, we might derive a SIRA regression, or something like that, so as to sort out what that should look like. I guess that the original goal was just to move closer to estimating something familiar and to justify its use for fantasy purposes if people are interested. It's definitely something on my radar. Thanks.

TangoTiger1

1/31

Good stuff, I didn't note the distinction in that table.

TangoTiger1

1/27

Can you also post your dataset used to create all these charts, so the rest of us can do our own testing?

metty5

1/27

Fantastic stuff Matt. I wish there was a LOVE button somewhere I could click.

IvanGrushenko

1/27

So it's an awesome stat! Can I find it on the player cards or do I have to look at the long of pitchers by year?

swartzm

1/29

My understanding is that is definitely the plan for the player cards this year. Doing it on old cards was apparently very tricky.

TheRedsMan

1/27

I really like the fact that Sierra recognizes that the ability to miss bats and the ability to induce weak contact come from much of the same underlying skills and thus are strongly correlated.

steverynear

1/27

Fangraphs allows you to sort by ERA-FIP. Can BP provide a ERA-SIERA sort? A SIERA - xFIP would be cool to look at as well.

It would be interesting to see an article highlighting where SIERA and xFIP differ. Is there a certain type of pitcher?

swartzm

1/29

Really cool idea. I'm going to have to look into this. Thanks!

WaldoInSC

1/27

BP + Fangraphs + Tangotiger + all other analytically-bent baseball sites = boon for readers.

FIP FIP Fooray!

Moneyball16

1/28

Great article Matt!

jplcarl

1/28

Great, interesting work! Any future plans to work with and integrate pitchF/X data into SIERA?

swartzm

1/29

I don't think we'll integrate the two in a formulaic way. Mike Fast and I email and tweet to exchange ideas sometimes, and it has definitely helped my understanding of how the two can be integrated to learn about pitching. I suspect that the data won't be used in SIERA but both will be used to inform our understanding of pitchers.

mbodell

1/28

Great article! I hope we see lots of these types of articles, and not just when the BP stat in question comes out on top. How about one on a multi-year PECOTA study too?

swartzm

1/29

That could be something we'll work on soon too. I agree that it's important to be forthright about when we do not come out on top. Colin has worked very hard on improving PECOTA in the last several months, and he's definitely playing with tests of various versions of PECOTA at least internally. I do think multi-year looks at projection systems are an important step to take as well. Thanks.
