
Last week Matt Swartz published an updated analysis of ERA estimators. He was kind enough to share his data so I could take a look at the accuracy of ERA estimators as a function of innings pitched. In other words, is there a difference in accuracy between the estimators given 100 historical innings pitched versus 500?  (Hint: yes.)

We can measure a lot of things that happen while a pitcher is on the mound, but it takes a while for the real information to show itself. As we collect more data, the random noise becomes more likely to cancel itself out. For example, during any one season you'll see a lot of .270 BABIPs, but once we look at five-season stretches of careers, .270 BABIPs are few and far between.
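To put a rough number on how quickly that noise shrinks, suppose for illustration that BABIP behaved like a simple binomial with a true rate of .300. The random spread around the true rate falls off with the square root of balls in play:

$$\sigma_{\text{BABIP}} = \sqrt{\frac{p(1-p)}{\text{BIP}}} \approx .020 \text{ at 500 BIP}, \qquad \approx .009 \text{ at 2,500 BIP}$$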
 
Some peripherals include a lot of noise, such as BABIP, HR/FB% and LOB%. Other peripherals start with a low noise-to-information ratio, such as SO/PA, BB/PA, and GB/BIP. As such, we might guess that metrics like SIERA and xFIP, which only use the latter peripherals as inputs, will more accurately reflect true talent in the short run, while things like ERA will be more accurate in the long run, because they can pick up on the former peripherals. Short term we need to reduce noise, but long term we can maximize information.
 
Methodology: Using pitchers who threw at least 40 innings from 2004 through 2010, I binned them by total innings pitched over the previous three years, in 100-IP increments. (Note that the data set goes back only to 2003 and doesn't include any seasons with fewer than 40 innings.) The first bin covers 40-100 IP in years n-3 through n-1, and the last bin covers 600+ IP. For each bin, I calculated the weighted metrics (ERA_adj, FIP, xFIP, SIERA, and tERA) over the preceding three seasons, then found the RMS error against the following season's park-adjusted ERA, weighted by the following season's IP total. A lower number means less disagreement, which is good.
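For readers who want to see the mechanics, here is a minimal sketch of the binning and the IP-weighted RMSE calculation. The DataFrame layout and column names (prior_ip, next_era_adj, next_ip, plus one column per estimator) are stand-ins of my own, not the structure of Matt's actual data:

```python
import numpy as np
import pandas as pd

def weighted_rmse(estimate, actual, weights):
    """Root-mean-square error, weighted by the following season's IP."""
    return np.sqrt(np.average((estimate - actual) ** 2, weights=weights))

def rmse_by_ip_bin(df, estimators=("SIERA", "xFIP", "FIP", "tERA", "ERA_adj")):
    """df has one row per pitcher-season: prior three-year IP ('prior_ip'),
    each estimator computed over those three years, the following season's
    park-adjusted ERA ('next_era_adj'), and its IP total ('next_ip')."""
    edges = [40, 100, 200, 300, 400, 500, 600, np.inf]
    labels = ["40<100", "100<200", "200<300", "300<400",
              "400<500", "500<600", "600+"]
    binned = df.assign(ip_bin=pd.cut(df["prior_ip"], bins=edges,
                                     labels=labels, right=False))
    rows = []
    for label, grp in binned.groupby("ip_bin", observed=True):
        row = {"IP Bin": label, "Count": len(grp)}
        for est in estimators:
            row[est] = weighted_rmse(grp[est], grp["next_era_adj"], grp["next_ip"])
        rows.append(row)
    return pd.DataFrame(rows)
```

The only non-obvious choice is the weighting: errors for pitchers who threw more innings the following season count for more.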
 
Results (full data table at end of post):
[Graph: RMSE against next-season park-adjusted ERA for each estimator, by prior three-year IP bin]

Takeaways:
  • The more historical information available, the better all the metrics predict future ERA. Stunner, I know.
  • Given less than 200 innings pitched — a full season for most starting pitchers, or three years for a relief pitcher — SIERA holds a small advantage over xFIP, which in turn holds a small advantage over FIP. ERA_adj and tERA lag behind.
  • With more than 200 innings pitched, SIERA, xFIP, and FIP converge in effectiveness.
  • By 500 IP, all five estimators are on equal ground.
  • ERA never surpasses the peripheral-based estimators, but maybe we just haven't included enough history to detect ERA's advantage, yet.
(Not that the only job of an ERA estimator is to predict future ERA.  Another important use is to evaluate in-season performance.  Comparison to future ERA remains a handy benchmark, however, because determining in-season true-talent ERA is an important part of projection.)
 
IP Bin        Count   SIERA   xFIP    FIP    tERA   ERA_adj
40 to <100      490    1.14   1.18   1.27   1.44    1.61
100 to <200     609    1.05   1.09   1.15   1.31    1.26
200 to <300     326    1.09   1.08   1.09   1.20    1.24
300 to <400     179    0.92   0.95   0.97   1.04    1.01
400 to <500     141    0.97   0.96   0.95   1.03    1.02
500 to <600     149    0.99   1.01   0.98   0.98    0.97
600+            111    0.74   0.75   0.72   0.80    0.73

ckahrl
2/02
First off, Sky, welcome aboard; it's great to have you here. Second, to follow up on one of Tango's comments last week, I'd be interested to see this sort of evaluation using RA9 instead of ERA. Focus on scoreboard outcomes, and not a scorer's opinion, as it were.
SkyKing162
2/02
Thanks, and I totally agree. Would also be great to include a bunch of other ERA estimators, park- and league-adjust all of them, see if the conclusions hold across time, etc. I hope and think we'll get to doing a lot of that, but for now, this data was readily available.
TangoTiger1
2/03
So given 400 IP or more, FIP wins? Even though you are comparing against park-adjusted ERA instead of actual ERA?

This seems like a big deal, a huge deal, no? This is saying that the batted ball data is worse than just knowing the number of HR allowed.

Am I misinterpreting?
SkyKing162
2/03
I guess if it's worth noting .04 differences, it's worth noting .02 differences. In which case ERA is a co-winner at 500+ IP. The muddling of lines was all I really noticed at that end of the graph. Although I'm very curious whether FIP or ERA would pull ahead with more historical data.

We're aware of irregularities in batted ball data, but I also wonder if FIP (and ERA) are picking up on park effects (and defense) over the longer time period, information that might be removed with adjustments.

Lots of fun questions.

And of course, if we require accuracy, we should find ourselves the nearest decent projection.
TangoTiger1
2/03
Right, all legitimate questions.

The test is the following: given all known information for each pitcher (his career past performance, his recent past performance, his batted ball distribution, his performance with men on base, the fielding talent of his past fielders, his parks, his past teams, his 2011 team, his 2011 fielders, etc), what will be his RA9 in 2011?

Now, FIP is saying: "I don't care about anything, other than his BB, K, HR, HBP numbers. I'll make my estimate based solely on that."

PECOTA, SIERA, et al would say: "My god, I definitely need all that past information. It's critical that I know all that. I'll make my estimate based solely on that."

And when 2011 comes to a close, what's going to happen? I think you can make a decent case that all that extra effort may bring you very little, and perhaps will even be a negative (i.e., over-adjusted).

So, that's the real test. Until then, we're dancing around the entire issue with these various other tests, because they are all going to be biased to some extent toward one metric or another based on however you set up the various other tests.
SkyKing162
2/03
Not to be the new guy who shills for the company's stats, but SIERA doesn't "need all that past information". It's SO, BB, and GB/FB (and PAs instead of IP -- heck, maybe FIP and xFIP would be improved simply by upgrading denominators; IP are influenced by things they're trying to ignore). SIERA's inputs are basic.

*My* real test was to see which metrics did better short term, not long term, although all questions are interesting. Which one do I want to use when Livan Hernandez posts a 2.50 ERA in the first two months of the season? I think it's pretty clear you want SIERA or xFIP. If he does it for 2+ years, eh, it doesn't much matter (although I'd really like to see if a projection system can beat the estimators).

It's like a point I made in Dave's Matt Cain post at Fangraphs yesterday. When the Cain hullabaloo started, it was quite fair to challenge his ERA as somewhat fluky, given his xFIP. But now that we have three more years of him posting 7.0% HR/FB rates, we need to focus on FIP instead of xFIP. Our answer changes because the amount of data changes. Now, if there were evidence that his HR/FB was "real" three years ago, that would be awesome to find.
TangoTiger1
2/03
It's not clear at all "short term".

If Livan has, say, a 2.50 FIP over the first two months, but a 5.50 SIERA over those same two months, and the question you are asking is "How will he do over the next 4 months?", well, my answer is "Use his entire career."

You are suggesting that if you intentionally limit yourself to using only two months of short-term data, discarding the rest of his past data, then SIERA will do better. Well, given that batted ball distributions stabilize faster than HR rates, you are correct.

But, there is no reason to limit yourself to only looking at his first two months of data.

What we have with Livan is a history, and you use that history. And this is exactly what you have shown, that if you look at all pitchers with a minimum of 400 IP, then FIP does a bit better than SIERA. That is, knowing his HR allowed (that's what's in FIP but not in SIERA) is better than knowing his batted ball distribution (that's what's in SIERA but not in FIP).

So, if you want to argue that for guys with less than 200 career IP you prefer SIERA, then fine.

***

Two more points:
1. I'll keep repeating this, but as long as you compare SIERA to park-adjusted future ERA, and you compare FIP to park-adjusted future ERA, you are biasing the results against FIP. You should no longer perform that test ever. If Ubaldo has a high FIP one year because of HR, he'll have a high FIP the next year because of HR, and you can't compare it to park-adjusted ERA (which presumes a flatter HR rate).

2. FIP is not meant to be predictive! FIP merely represents current performance. In no way should one even think that you would regress K rates the same as HR rates. If I wanted a "predictive FIP", I would probably do something like (5*HR + 2*BB - 2*SO)/PA + constant or something.
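Written out as a quick sketch, just to make the shape concrete (the 5/2/-2 weights are the illustrative ones above, and the constant is a placeholder to land on an ERA-like scale; nothing here is fitted):

```python
def predictive_fip(hr, bb, so, pa, constant=3.1):
    """A sketch of a 'predictive FIP': HR gets far less weight than in FIP itself,
    since HR rates need heavier regression than K and BB rates.
    The weights and the constant are placeholders, not fitted values."""
    return (5 * hr + 2 * bb - 2 * so) / pa + constant
```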

I think anyone here can find a stat that predicts future RA9 better than FIP and better than SIERA by focusing only on HR, NIBB+HBP, SO.

There's my next challenge to the community.
SkyKing162
2/03
Yeah, "short-term" was a bad way to put it. No reason to toss out historical information if you have it. I should have said that if you have little information (from the past three years) it appears SIERA and xFIP are the metrics to look at. That's mostly younger starters and relief pitchers. Or anyone in any given season if you're into measuring value that way, I suppose.
TangoTiger1
2/03
Agreed.
markpadden
2/03
Why don't you create and test a "predictive FIP" then?


"2. FIP is not meant to be predictive! FIP merely represents current performance. In no way should one even think that you would regress K rates the same as HR rates. If I wanted a "predictive FIP", I would probably do something like (5*HR + 2*BB - 2*SO)/PA + constant or something."
TangoTiger1
2/04
Way ahead of you. Check my blog for FutureFIP.
markpadden
2/03
Nice job.

I think the next step is to look at which estimator performs best for in-season projections. That is, compare the first-half ERA estimator with the second-half actual ERA (filtering out team changers and guys with not enough IP in both halves), and assess the average errors.




JeffZimmerman
2/03
Congrats on the new gig. With all the time you have, Sky, could you also look into pitchers who changed teams? How do pitchers perform with a new stadium and defense behind them?
SkyKing162
2/03
Thanks, Jeff. Pretty sure I have about as much free time as you do ;)

The data set I have doesn't include team, so I'll add your questions to the queue when someone does a bigger study.
studes
2/03
Congrats, Sky, and welcome to BPro.

I'm also someone who feels that, once you reach a certain threshold, ERA is just as good a predictor as anything. Happy to see you peg the threshold. I'm pretty comfortable saying that, after a couple of years as a starting pitcher, actual performance is as good an indicator as these other metrics.

Technical question: how good are any of them anyway? At 100 innings, it appears that the average RMSE is 1.2 points of ERA. Is RMSE similar to standard deviation? Can you say that 66% of all estimates fall within one RMSE, or something similar to that? So the best any of these estimates might get is 66% within plus or minus 2.4 ERA?

If so, it makes the difference between the RMSE of these measures kind of trivial.
cwyers
2/03
You figure SD and RMSE the same way, mathematically.
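For reference, the only difference is what the deviations are measured from: RMSE is taken around the predictions, SD around the mean. The familiar coverage numbers (roughly 68% within one unit, 95% within two) carry over only if the errors are roughly normal and centered on zero:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad\qquad \mathrm{SD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$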
studes
2/03
Thanks, Colin. So, picking on the 100-200 bin, when you use SIERA to predict future ERA, 66% of the results will be within 2.1 runs of ERA and 95% will be within 4.2 (that's taking the RMSE on both sides).

When you use ERA to predict ERA, 66% of the results will be within 2.5 runs of ERA and 95% will be within 5.0 runs.

Is that right? If so, it's obviously an improvement, but I think somewhere we should acknowledge that ERA is just plain tough to predict, regardless of what measure you use.

Looked at that way, I don't think there's much to choose between SIERA, xFIP and FIP. Just use whatever you're comfortable using.
TangoTiger1
2/03
I wouldn't use the word within, as I think it would imply +/-. In your case, you are saying that the *range* is 2.1 runs (i.e., +/- 1.05).

But, yeah, ERA is notoriously difficult to estimate because of BABIP and sequencing.
cwyers
2/03
Well, assuming the error is symmetrical, Studes is right. If the error isn't symmetrical (and I would suspect it isn't), then it's more of an approximation, but I think it's still a useful way of putting it.
TangoTiger1
2/03
The error definitely can't be symmetrical. But, that's a vagary of using runs per out. If you instead used the square root, you'd get something closer to symmetrical.

That is, if we treat runs as a multiplication of OBP and SLG (for illustration purposes), and if each of those has a symmetrical error, then multiplying the two won't give you a symmetrical error.
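A toy simulation makes the point (the OBP and SLG numbers are made up purely for illustration): give each input a symmetric error, multiply them, and the product comes out right-skewed.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Two inputs with symmetric (normal) errors around made-up "true" values.
obp = rng.normal(loc=0.330, scale=0.020, size=100_000)
slg = rng.normal(loc=0.420, scale=0.040, size=100_000)

runs_proxy = obp * slg  # treat runs as OBP * SLG, per the illustration above

print(round(skew(obp), 3), round(skew(slg), 3))  # each roughly 0: inputs are symmetric
print(round(skew(runs_proxy), 3))                # small but clearly positive: product is right-skewed
```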
BarryR
2/03
Interesting stuff. What I would like to know is whether there are pitchers who consistently under/over perform the various metrics. If so, is there a common thread among them?
Whether these metrics are intended as predictive or not, they will be used that way, whether to analyze trades, signings, or for fantasy purposes. In order to effectively analyze a pitcher's future value, it is necessary to know if there is reason to believe or doubt the numbers in his specific case.
TangoTiger1
2/03
If there was a common thread, it would be identified and included as a parameter in the estimator.
BarryR
2/04
And I feel foolish for optimistically suggesting its existence.
I'd still like to know if there are consistent under/over pitchers though.
TangoTiger1
2/05
SIERA excludes HR. So anyone with a HR skill will under/over. Brett Myers for one.

FIP excludes batted balls. So anyone with a batted ball distribution skill will over/under. Felix maybe is one.

Basically, whatever parameter is being ignored is a candidate for being over/under.

These metrics purposefully ignore parameters because they want to, not because they necessarily think there's no skill there.