
Over the last three days the ERA estimator SIERA has been introduced, complete with explanations of its origins and the derivation of its formula. Now comes one of the most important aspects of building a new metric: making sure it works by testing it against its peers. Any estimator should be both realistic in its modeling and accurate in its predictive ability. These attributes do not always travel together, however: you could have a situation where a regression on sock height and induced foul balls caught by middle-aged men holding babies somehow predicts park-adjusted ERA better than anything else. Sure, it tested well, but that type of modeling is absurd and illogical; those two variables should not be assumed to have any impact whatsoever on run prevention. Such a regression means nothing in spite of its hypothetical test results, and situations may also arise in which the most fundamentally sound statistic tests poorly.

Realistic modeling is based on a combination of truths and assumptions, as we’ve discussed before: the former being that walk and strikeout rates are stable, the latter that HR/FB is comprised more of luck than skill. Based on yesterday’s post, it seems safe to say that our modeling is sound, as the variables used make sense in terms of their perceived impact on what they seek to measure. The question then becomes how to test the results to determine how SIERA compares to the other estimators currently on the market. For the purposes of this study, we used root mean square error testing, a simple but effective method that reports the average error between an actual and a predicted array of data. In terms of calculations, RMSE is simple enough to do in Excel: take the difference between each actual and predicted term, square it, average those squared deltas, and take the square root of that average. When comparing predictors, the lower the RMSE the better.
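The Excel recipe above translates directly into a few lines of code; here is a minimal sketch (the `rmse` helper and the sample arrays are hypothetical illustrations, not the study's actual data):

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square each difference between the actual
    and predicted values, average the squares, then take the square root
    of that average. Lower is better when comparing predictors."""
    squared_deltas = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_deltas) / len(squared_deltas))

# Hypothetical park-adjusted ERAs vs. one estimator's predictions
actual    = [3.50, 4.25, 5.10, 2.95]
predicted = [3.80, 4.00, 4.60, 3.40]
print(round(rmse(actual, predicted), 3))  # → 0.389
```

The same four-step recipe works identically in a spreadsheet; the code simply makes the order of operations explicit.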

With the why and how out of the way, who are we testing? To gauge how SIERA fares as an estimator we must compare it to its colleagues. In this forum, that group consists of FIP, xFIP, QERA, tRA, raw ERA, and park-adjusted ERA itself. To further ensure data integrity, each of the above statistics was calculated from the same Retrosheet dataset. For FIP, the standard constant (roughly +3.2) was calculated for the league and year in question rather than treated as a fixed figure, with the same obviously true for the xFIP mark. Additionally, as it pertains to xFIP, we spoke with Dave Studeman of The Hardball Times in order to determine that the expected number of home runs to be substituted into the FIP formula should be calculated from home runs per outfield flies, not per the sum of outfield flies and popups. Coding for tRA was the final hurdle, and the invaluable Colin Wyers provided a great deal of help in that regard. One other note on tRA: since it calculates RA/9 instead of ERA, an adjustment needed to be made that essentially resulted in the creation of tERA, which is the normal tRA value discounted by the difference (league- and year-dependent) between RA and ERA.
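The tERA adjustment described above amounts to a single subtraction; a minimal sketch, with made-up league rates (the `tera` helper is illustrative, not the actual tRA code):

```python
def tera(tra, league_ra, league_era):
    """Convert tRA (an RA/9-scale stat) to an ERA scale by discounting
    the league- and year-specific gap between RA and ERA."""
    return tra - (league_ra - league_era)

# Hypothetical league rates: an RA of 4.80 and an ERA of 4.40 imply
# a 0.40 gap between runs allowed and earned runs for that league-year
print(round(tera(4.10, league_ra=4.80, league_era=4.40), 2))  # → 3.7
```

Because the gap is looked up per league and year rather than hard-coded, the conversion stays honest across run environments.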

Before getting into the results it would be prudent to discuss some assumptions and goals here. To be blunt, our goal was to beat everyone at predicting park-adjusted ERA in the following season, regardless of HR/FB treatment, and beat everyone but FIP and tRA in terms of same-year predictive value. Though it may sound counterintuitive to openly seek a third-place finish in something like this, the rationale is that both FIP and tRA treat HR/FB as skill rather than luck, meaning that no adjustment at all is made to the home run variable; if we assume that a fluky high or low HR/FB is the true skill level of the pitcher and not in line for adjustment, then of course those metrics will be better at same-year predictivity (should totally be a word, damn that red squiggly line) than one that does apply an adjustment. As we discussed earlier in the week, the intra-class correlation for HR/FB is very low, both from an overall standpoint and one in which the individuals are isolated from their respective teams.

Both tRA and FIP use extra information to retroactively guess ERA, but if we wanted to do that we could use BABIP to more precisely derive an ERA prediction. Looking at tRA, we know that it uses the same information that FIP does, but it also takes GB, LD, FB, and PU totals and estimates the expected number of runs and outs that each of these batted balls leads to on average. The assumption is certainly one worth exploring: that pitchers control the rates of each of these batted balls, but that defense and luck determine whether they land in gloves or not. The problem is that this mostly hinges on the assumption that line-drive rates are persistent for pitchers, because line drives become outs far less frequently than other batted balls, so a pitcher’s line-drive rate will affect his tRA significantly. However, when we look at the ICC of a pitcher’s line-drive rate relative to the rest of his team, we only get .007. In this regard, tRA takes a luck-based stat used in FIP and adds another luck-laden metric, the rate of line drives, as a main determinant of expected ERA.
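An intraclass correlation of this kind can be estimated with a simple one-way ANOVA decomposition. The sketch below is a hypothetical illustration of the idea with made-up rates grouped by pitcher, not the team-isolation grouping used in the actual study:

```python
def icc_oneway(groups):
    """One-way intraclass correlation, ICC(1) = (MSB - MSW) / (MSB + (k-1)*MSW),
    where k is the (equal) group size, MSB the between-group mean square,
    and MSW the within-group mean square. Values near zero mean group
    membership explains essentially none of the variance."""
    n, k = len(groups), len(groups[0])
    grand = sum(x for g in groups for x in g) / (n * k)
    means = [sum(g) / k for g in groups]
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Made-up rates for three pitchers (rows), three seasons each. The first
# set has clearly distinct pitcher-to-pitcher profiles, like strikeout
# rate; the second overlaps almost completely, like line-drive rate.
stable = [[0.26, 0.27, 0.28], [0.18, 0.19, 0.20], [0.10, 0.11, 0.12]]
noisy  = [[0.19, 0.21, 0.20], [0.20, 0.19, 0.22], [0.21, 0.20, 0.19]]
print(round(icc_oneway(stable), 3))  # close to 1: a repeatable skill
print(round(icc_oneway(noisy), 3))   # at or below zero: no repeatable signal
```

A rate with an ICC of .007 is, for practical purposes, indistinguishable from the noisy case above.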

The ideas are certainly sound, but assumptions must be tested, which is exactly what we did here with SIERA. If everything plays out the way we hoped, then tRA and FIP will best SIERA in post-dicting same-year ERA but will lose at subsequent-year predictive value. The goal isn’t so much to lose to both of them in the same-year RMSEs as it is to beat the other competitors that treat HR/FB similarly, namely xFIP and QERA. With that series of disclaimers out of the way, the tables below show the same-year and subsequent-year RMSEs for the seven metrics in a variety of different categories and subsets. For starters, here is the table of overall results:

```
Stat    YR-Same YR-Next
SIERA    0.957   1.162
tRA      0.755   1.222
FIP      0.773   1.224
xFIP     1.168   1.319
QERA     1.070   1.248
ERA-Park  ----   1.430
ERA      0.094   1.434
```

Our goals came to fruition, as SIERA beat xFIP and QERA in the same-year RMSE test while besting everyone else in terms of predicting park-adjusted ERA in the following year. The latter is very important as a big purpose of these estimators is to base ERA around repeatable skills that would conceivably lead to better results the next time out. Next, we will break the RMSE test results down into a number of subsets to add a level of granularity to the discussion. These subsets were not chosen at random, either, with each being tested for a specific purpose. Most of these purposes involve specific interactions of skills, thus the name Skill-Interactive Earned Run Average. For starters, here are the pitchers with above average strikeout rates:

>= AVG SO/PA

```
Stat    YR-Same YR-Next
SIERA    0.929   1.135
tRA      0.704   1.191
FIP      0.748   1.191
xFIP     1.191   1.275
QERA     1.032   1.191
ERA-Park  ----   1.401
ERA      0.084   1.404
```

When looking at the crop of pitchers with an above average SO/PA, the standing of SIERA relative to the overall group remains unchanged. Next up, the group with an SO/PA greater than or equal to one standard deviation from the mean, classified as really high strikeout guys:

>= AVG SO_PA + 1 SD

```
Stat    YR-Same YR-Next
SIERA    0.866   1.218
tRA      0.689   1.229
FIP      0.722   1.216
xFIP     1.214   1.289
QERA     0.972   1.222
ERA-Park  ----   1.430
ERA      0.071   1.432
```

Here, FIP pulls ever so slightly ahead, but remains very close to SIERA in predicting park-adjusted ERA the following year. SIERA uses a quadratic term on strikeouts, which makes it particularly good at estimating ERA for medium-high strikeout levels but does not add anything particularly helpful at very high levels. Shifting the focus to walks, how do things shake out when looking at below-average walk rates (i.e., pitchers with good control)?

<= AVG BB_PA

```
Stat    YR-Same YR-Next
SIERA    0.871   1.071
tRA      0.725   1.133
FIP      0.719   1.125
xFIP     1.105   1.168
QERA     0.915   1.073
ERA-Park  ----   1.329
ERA      0.085   1.336
```

Interesting results surface here, as SIERA and QERA are very similar for walk rates below the league average. Looking at the pitchers with very low walk rates, xFIP and QERA actually predict next-year ERA better than SIERA, while the latter continues to best both of them at same-year predictions. Moving on to ground-ball rates, both above average and above one standard deviation from the mean:

>= AVG GB_PA

```
Stat    YR-Same YR-Next
SIERA    1.079   1.153
tRA      0.761   1.205
FIP      0.773   1.202
xFIP     1.071   1.216
QERA     1.088   1.234
ERA-Park  ----   1.419
ERA      0.099   1.422
```

>= AVG GB_PA + 1 SD

```
Stat    YR-Same YR-Next
SIERA    1.003   1.193
tRA      0.844   1.203
FIP      0.823   1.226
xFIP     1.063   1.217
QERA     1.173   1.272
ERA-Park  ----   1.456
ERA      0.091   1.453
```

Same story, different metrics. Next up is a table looking at interactions between skills. It looks for low strikeout but high grounder and high walk pitchers, the kinds of hurlers we would expect to allow plenty of baserunners and rely on fielded grounders to wipe the slate clean:

>= AVG GB_PA, BB_PA & <= AVG SO_PA

```
Stat    YR-Same YR-Next
SIERA    1.123   1.299
tRA      0.883   1.318
FIP      0.875   1.305
xFIP     1.178   1.408
QERA     1.294   1.477
ERA-Park  ----   1.551
ERA      0.121   1.553
```

>= AVG GB_PA, BB_PA

```
Stat    YR-Same YR-Next
SIERA    0.876   1.064
tRA      0.698   1.141
FIP      0.708   1.127
xFIP     0.982   1.090
QERA     0.912   1.065
ERA-Park  ----   1.323
ERA      0.086   1.329
```

In both of these tables, we see the results we would expect. SIERA, being a Skill-Interactive Earned Run Average, does exactly what it should here: beat other estimators at measuring the skill components of pitcher performance that interact with each other. Moving on to elite pitchers:

<= 3.50 ERA-Park

```
Stat    YR-Same YR-Next
SIERA    1.221   1.142
tRA      0.833   1.208
FIP      0.873   1.203
xFIP     1.601   1.235
QERA     1.439   1.180
ERA-Park  ----   1.536
ERA      0.063   1.535
```

With its ability to properly estimate the effects of very strong skill levels, SIERA again beats the other estimators that treat HR/FB as luck in predicting same-year park-adjusted ERA, and it leads all estimators in predicting next-year park-adjusted ERA.

Looking through all of these different tests, it is apparent not only that SIERA is the best ERA estimator currently available, but specifically that it is exceptionally strong at measuring the skill level of specialized kinds of pitchers. To make this less abstract, tomorrow’s fifth and final article in our introductory series will discuss three specific examples of pitchers who are unique in their skill sets, and are particularly troublesome for other estimators. SIERA will perform excellently with all three, which should leave you with a solid understanding of what SIERA does and why it is so important.

This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.

philly604
2/11
It seems true that if the YR-Next column in every table was restricted to just one digit to the right of the decimal point, there would be no difference at all between most of these metrics. In most cases the RMSE would round to identical 1.2 values.

That you can go out to two and three places past the decimal and get "better" numbers is all well and good. But I'm having a hard time putting the improvements of 0.0xx in RMSE in context. What is the meaningful impact of that small level of improvement?

Or perhaps better yet, are any of these improvements outside the margin of error for these various metrics?
swartzm
2/11
If you don't like it being on the other side of the decimal place, multiply everything by 1000. The difference between these is large and very significant. Compare the difference between other metrics and simply using ERA to the difference between SIERA and other metrics, and you'll clearly see it's a BIG step forward.

I should note that if you do these tests separately for EACH YEAR from 2003-09, SIERA is ahead of the same estimators EVERY time. This is a large difference even if we're dealing with ERAs which are necessarily going to require some decimals.
nickojohnson
2/11
Are we missing a table here for pitchers with "very low walk rates," where FIP and QERA outperform SIERA in terms of next-year ERA?
swartzm
2/11
We have below average walk rates. If you mean more than one standard deviation below average, we did that test too, and were better in that as well, but since we didn't have a squared term on BB, we thought it was too much clutter to include it. We certainly were not biased in the tests we reported.
nosybrian
2/11
This is a pretty good vindication of FIP as a same-year ERA estimator, and of SIERA as a next-year ERA estimator. (There must be a typo here: SIERA 0.079.)

I would like to see a correlation matrix of each against all.
swartzm
2/11
Well, FIP can do well with same year because it treats HR/FB as skill rather than luck. Since that's not really the case, it seems like it won't be helpful. If you want to predict same-year ERA using luck-based numbers as skills, I'd go ahead and look at the actual ERA, you know?
nosybrian
2/11
On that SIERA number I asked about, is it really ZERO point zero seven nine? Maybe it's ONE point zero seven nine?
swartzm
2/11
Yes, that's a typo, sorry!
jjaffe
2/12
Matt or Eric, please ask edit to fix this, as it's fairly crucial information.
nosybrian
2/11
To echo philly's last question, I'd like to see how consistently the different metrics work. Not just summed over 2003-8 or 2003-9 -- but for each same-year (single year) 2003-2009, and for each next-year 2003-2004 through 2008-2009.

IOW, if I really want to rely in a particular indicator for 2010, is there any reason, for example, to prefer SIERA to FIP -- or do they each beat the other as often as not?
swartzm
2/11
Sure. If that helps, I'll put it here in the comments--

Next-year ERA RMSE by season pair:

```
        03-04  04-05  05-06  06-07  07-08  08-09
SIERA   1.107  1.141  1.179  1.186  1.107  1.248
QERA    1.237  1.237  1.219  1.277  1.206  1.316
xFIP    1.284  1.403  1.211  1.404  1.287  1.311
FIP     1.120  1.230  1.298  1.236  1.170  1.283
tRA     1.162  1.202  1.273  1.216  1.171  1.307
ERA_pk  1.391  1.388  1.488  1.429  1.390  1.493
```

As you can see, it's ahead every time and offers a solid improvement if you compare the difference between the other estimators and regular ERA_pk to the difference between the other estimators and SIERA.
nosybrian
2/11
Agreed. This test of consistency is important. Thank you.
metty5
2/11
Matt and Eric, this series has been great. Breaking down a new statistic like this should be a must in the future.
NLBB15
2/11
Hey thanks for SIERA, I'm loving the new BP. I expect a few tweaks and an approval from Tango shortly. But I have a few questions

If I want to look at future ERA I would use TRA* instead of TRA as they advise on StatCorner. How does SIERA do against TRA* for predicting next year ERA?

I also understand if you don't want to include any metric that regresses the components. But what about a SIERA*? Would it be better than SIERA at predicting next-year ERA? And what about when SIERA* means an individual regression rather than regressing toward league average? So how does SIERA do at predicting 2009 ERA when you input a Marcel produced from 2006-2008 data instead of just 2008 data?
swartzm
2/11
We didn't have the code for tRA* so we couldn't test against it. We also don't have a SIERA* to test against it. We might as well compare it to PECOTA or another projection system if we're going to start regressing components. The goal is really to answer the question of how a collection of skills leads to keeping ERA down.

The main thing though is that tRA as a metric has a major problem in that it is affected so much by line-drive rate. Since we found that LD/Batted Ball had an ICC of 0.007-- which is pretty much the closest thing to zero I can remember seeing in sabermetrics-- it doesn't make sense to use it. I heard its inventor say once that LD/Ball in Air had a high correlation but that's because Fly Ball/Batted Ball has a high correlation and it's picking that up. If you picked (Pitcher's Birthdate)/(Pitcher's Birthdate + Fly Ball%) and checked its correlation, it would be high as well. It's just that you can't then call birthdate significant any more than you can call line drive rate significant.
hiredgoon1
2/11
According to my database, Kevin Correia led the majors in IFBCBMAMHB, with 11.
makewayhomer
2/11
Sorry if you have discussed this, but is SIERA being used to compute PECOTA ERAs? Or is this something completely different?
swartzm
2/11
It helped check them and may be more involved next year.
nosybrian
2/11

In my opinion, it would be a little difficult to implement everything from SIERA in PECOTA because PECOTA's system relies on matching a given player's baseline performance (last three years) with the set of all pitchers of the same age who played in the majors since 1946. But there are probably some insights to be gained in refiguring the baseline performance ERA (using SIERA instead) for the player of interest. Just my opinion.
makewayhomer
2/11
also, is this something that will be added into the PFM for fantasy value?
dpowell
2/11
Matt, I have a suggestion to revise SIERA while maintaining the same basic idea. Your point seems to be that non-linearities and interactions matter, but you handle them in a very blunt way when you simply multiply different variables together or square them. It also makes the coefficients hard to interpret. I would suggest creating "bins" for different ranges of each variable or interactions of variables. In other words, create a dummy variable for everyone with a K% between X1 and X2. Or, better, one dummy variable for K% between X1 and X2 and BB% between Y1 and Y2. The "trick" will be defining these bins as precisely as possible while still including a useful # of observations in each one. This is a very flexible way of dealing with interactions. Even better, it would make the results very easy to interpret. Anyone could take a pitcher, find his bin or bins, figure out his SIERA, and easily calculate his theoretical SIERA if he had, say, just slightly increased his strikeout rate.
sunpar
2/11
It seems like taking an effective continuous regression and making it non-continuous for no real reason.
cjones06
2/11
Someone likes them some non-parametric estimation!

Anyway, the main problem with this is an increased reliance on sample size; judging by how Matt/Eric were already lamenting the sample size in re: GB/BB interaction, this isn't going to be practical quite yet, even with only 3 dimensions. Although I do agree that it would be handy to take a look at the residuals along each dimension to see how the quadratic assumption is working out. Given that this metric seems to be relatively weakest when extreme skills are demonstrated, it is possible that there could be some room for improvement here.
dpowell
2/11
You're right that sample sizes become an issue. I wasn't suggesting, however, to create 3-dimensional bins. That would be too much. But maybe a series of 1- and 2-dimensional bins.
joenemanick
2/12
I agree that this is an interesting idea to test. You split out the results for comparison of high K, low ERA, etc.; why not just bin these groups up first and derive new constants for your parameters? To deal with sample size issues, you could combine several years. You'd have fewer train-test options, but you might be able to use the relative stability of Krate, BBrate, and GBrate year to year to pull out different interactions between the skills at the extremes. In my experience with these complex systems, the lion's share of the error comes from the extremes. It would be less elegant, but once the methodology is laid out, your stats program does all the extra work.
lynchjm
2/12
Matt and Eric, I'll take your word for it.
JinAZReds
2/12
I might have missed it, but were the years in which you tested the equation the same years that you used to generate the regression coefficients? I'm sure you're aware of this, but doing so will make your equation look exceptionally good...but that won't necessarily carry over into a new dataset.
-j
swartzm
2/12
We did discuss it already, but that's okay. I don't mind reiterating once rather than making everyone read through 200 comments...

We developed the coefficients by regressing on SAME-year ERA. So the tests on NEXT-year ERA are legitimately different. Also, to check, we ran the regression on the 2003-08 ERA for same-year data and then used those coefficients to compute 2009 SIERA stats and then checked same-year ERA and it finished similarly.

Thanks for asking again, actually, because this question should be highlighted rather than obscured and have other people miss it.
JinAZReds
2/13
Thanks, Matt, and sorry I missed it. I really did look. :P

Anyway, that makes me feel a lot better about the data you're reporting. I'm sure you'd still have to tune your regression differently as you move to a different era, but we're mostly interested in modern baseball anyway. :)
-j
JinAZReds
2/13
Also: hurry up and get this in the reports so I can pull it for the power rankings this year. :D
-j
BobbyRoberto
2/12
I'm surprised that xFIP did so poorly in both same-season and next-season. I had it in my mind that xFIP would be better than FIP.
JinAZReds
2/13
xFIP did outperform FIP in Colin's test at THT that ran last year, so I was surprised to see this as well.
-j
Michael
2/12
I had the same thought as BobbyRoberto above. Also, is there a publicly available formula for tRA? I don't remember reading that before.
JinAZReds
2/13
The methods for tRA are posted here: