Checking the Numbers: Defining Declining Expectations

July 26, 2009

Back in May, I implored fans and analysts to stop treating discrepancies between a pitcher’s ERA and FIP at various points throughout a season as gospel, given that a more granular line of research, one that would not necessarily require an abundance of time, could provide more telling results. Even if the heavier, research-laden conclusion pointed to the same performance trends suggested by the differential, the investigative process itself would be much more accurate. The subject of that particular discussion was Matt Cain, who continues to be pegged as a candidate for a severe second-half regression based on an FIP over a run higher than his earned run average. It is certainly plausible, and frankly downright likely, that Cain or pitchers of that ilk (Jair Jurrjens, for example) will see their ERAs worsen down the stretch, given that many of their current marks fall right in line with those of years past.

However, color me uncomfortable with stopping at Step #2 in a five-step evaluative process, turning a legitimately essential statistical term like regression into the shorthand for “he’s going to suck from here on out,” and especially uncomfortable when the tones in which these claims are dished out seem to suggest that the players are going to perform doubly worse moving forward as a means of evening out the first-half overachievement.

While incorrect, the new shorthand for regression is unlikely to dissipate anytime soon, and I got curious as to what the general range of second-half and year-end performance is for those pitchers considered to be substantially overperforming their “true talent levels” in the first half of a given season. For instance, will pitchers like Cain and Jurrjens really decline to the point that, by season’s end, their current sub-3.00 ERAs will end up matching the 3.60-3.80 FIPs? Or could it be that they will match their FIPs in the second half, keeping a vast differential intact when the postseason rolls around, even though it was minimized a bit? These are but a few of the several interesting questions worth looking into when discussing the second-half performance of pegged overachievers.

To start, I called upon my table with running raw totals and rates through each game, for each player in a given season. In order to keep things relatively recent, only pitchers from 1996-2008 were considered. For each, the totals accrued through the last outing prior to July 1 served as the first half, enabling comparisons by half. Pitchers who did not log 100 innings overall in a particular year were excluded. The final filter selected pitchers who, through the end of June, had an FIP a run or more above their ERA, producing a sample of 194 pitcher-seasons. While a sample of this size may not be large enough off of which to base irrefutable claims, it certainly serves as a great starting point. The tables below bring the aforementioned query to life, displaying the first half, second half, and end of season stats of interest, as well as the deltas for each of the three metrics in the desired time frames.
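
The selection steps above can be sketched in a few lines of Python. This is only an illustration of the filters as described, not the query actually used: the record layout (`PitcherSeason` and its field names) is an assumption, and the sample rows are made-up placeholders rather than real pitcher lines.

```python
from dataclasses import dataclass


@dataclass
class PitcherSeason:
    # Hypothetical record layout; the field names are assumptions for illustration.
    name: str
    year: int
    season_ip: float        # total innings pitched in the season
    first_half_era: float   # ERA through the last outing before July 1
    first_half_fip: float   # FIP through the last outing before July 1


def is_discrepancy_season(ps, min_ip=100.0, min_gap=1.0):
    """Apply the three filters described in the text."""
    in_window = 1996 <= ps.year <= 2008
    enough_innings = ps.season_ip >= min_ip
    # Round to two decimals so a displayed gap of exactly 1.00 survives float error.
    gap = round(ps.first_half_fip - ps.first_half_era, 2)
    return in_window and enough_innings and gap >= min_gap


# Placeholder rows (not real data) to show each filter doing its job.
rows = [
    PitcherSeason("Pitcher A", 2003, 180.0, 3.10, 4.25),  # qualifies
    PitcherSeason("Pitcher B", 2003, 180.0, 3.90, 4.10),  # gap under one run
    PitcherSeason("Pitcher C", 2003, 60.0, 2.50, 4.50),   # under 100 innings
    PitcherSeason("Pitcher D", 1990, 200.0, 2.50, 4.50),  # outside 1996-2008
]
sample = [ps.name for ps in rows if is_discrepancy_season(ps)]
print(sample)
```

Only the first placeholder row passes all three checks; the other three each fail exactly one filter.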


                              First Half           Second Half          Season End
Pitcher            Year    ERA    FIP    LOB    ERA    FIP    LOB    ERA    FIP    LOB
Shawn Estes        1997   2.72   3.83  81.2%   3.66   3.44  69.0%   3.18   3.64  75.0%
Armando Galarraga  2008   3.40   4.42  69.4%   4.01   5.29  81.0%   3.73   4.89  75.6%
Rich Hill          2007   3.49   4.51  80.8%   4.35   3.83  70.0%   3.92   4.17  75.3%
Hideki Irabu       1998   2.47   4.61  90.0%   5.55   5.69  69.0%   4.06   5.17  79.4%
Joe Mays           2001   3.03   4.79  83.3%   3.29   3.90  73.0%   3.16   4.34  78.2%


                          First Half  Second Half   Season End
Pitcher            Year      FIP-ERA      FIP-ERA      FIP-ERA
Shawn Estes        1997         1.11        -0.22         0.46
Armando Galarraga  2008         1.02         1.28         1.16
Rich Hill          2007         1.02        -0.52         0.25
Hideki Irabu       1998         2.14         0.14         1.11
Joe Mays           2001         1.76         0.61         1.18
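
The FIP-ERA column is simple subtraction, and a short Python sketch makes it easy to recheck. The ERA and FIP values below are copied from the first table; the dictionary layout and the helper name `fip_minus_era` are illustrative choices, not part of the original query.

```python
# (ERA, FIP) pairs from the first table: first half, second half, season end.
lines = {
    "Shawn Estes 1997":       [(2.72, 3.83), (3.66, 3.44), (3.18, 3.64)],
    "Armando Galarraga 2008": [(3.40, 4.42), (4.01, 5.29), (3.73, 4.89)],
    "Rich Hill 2007":         [(3.49, 4.51), (4.35, 3.83), (3.92, 4.17)],
    "Hideki Irabu 1998":      [(2.47, 4.61), (5.55, 5.69), (4.06, 5.17)],
    "Joe Mays 2001":          [(3.03, 4.79), (3.29, 3.90), (3.16, 4.34)],
}


def fip_minus_era(era, fip):
    """Positive values mean the pitcher's ERA is better than his FIP."""
    return round(fip - era, 2)


gaps = {
    pitcher: [fip_minus_era(era, fip) for era, fip in spans]
    for pitcher, spans in lines.items()
}

for pitcher, (first, second, season) in gaps.items():
    print(f"{pitcher:<24}{first:>6.2f}{second:>7.2f}{season:>7.2f}")
```

A negative second-half value, as with Estes and Hill, means the pitcher's ERA ran above his FIP over the final three months.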

Estes saw his ERA rise while the FIP simultaneously fell in the second half, turning a discrepancy of 1.11 runs through the month of June into a -0.22 differential from that point on. The second-half performance did not erase his metric discrepancy over the first three months, but it certainly lessened the gap between the two by the end of the year, down from 1.11 to its final 0.46 run resting place. Joe Mays, however, saw the differential shrink by other means; his ERA stayed virtually the same in each half of that 2001 season, with the FIP drastically improving from 4.79 in the first half to 3.90 in the second half. In his situation it seems that the ERA actually served as a better predictor of future results than FIP, a tad odd given that one of the major merits of the latter is its greater accuracy in estimating future earned run marks.

Forgive the sidetrack here, but how does FIP perform in this current setting with regard to predicting future earned run average? Does the FIP in the first half of the season correlate strongly to the ERA in the second half for the same pitcher? Interestingly, the correlation between first-half FIP and second-half ERA surfaced at 0.35, meaningful to some extent, but of moderate strength at best. Here is where the wrench gets thrown in: first-half ERA correlated to second-half ERA at an r of 0.33, essentially no different than the relationship between second-half ERA and the metric of controllable skills used to more accurately estimate future marks. Keeping in mind that these correlation coefficients are derived from the sample prequalified as having a discrepancy of at least one run at the halfway point, a baseline is needed from which comparisons can be drawn, mainly the correlations amongst the larger sample of all pitchers over the same time span with 100+ innings.

When these correlations were run, an r of 0.23 resulted from first-half to second-half ERA, with an r of 0.34 for first-half FIP to second-half ERA. Simply put, based on these results, if a pitcher has an FIP much higher than his ERA at the halfway point of a season, his ERA is as likely to indicate performance moving forward as his run prevention rate via controllable skills, a phenomenon not observed amongst the larger sample of pitchers within the predetermined time frame. These correlations are of moderate strength at best, so neither number should be treated as a firm forecast; still, if a pitcher has an ERA that much lower than his FIP at the halfway point, the mark of controllable skills holds no predictive advantage over his earned run average.
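
For anyone who wants to run the same comparison on their own data, here is a minimal Python sketch of the Pearson r behind these figures. The five rows are the pitchers from the table above and are used only to show the mechanics; five rows will not reproduce the correlations from the 194-season sample, and the variable names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return sxy / sqrt(sxx * syy)


# Estes, Galarraga, Hill, Irabu, Mays (values from the first table).
first_half_era = [2.72, 3.40, 3.49, 2.47, 3.03]
first_half_fip = [3.83, 4.42, 4.51, 4.61, 4.79]
second_half_era = [3.66, 4.01, 4.35, 5.55, 3.29]

r_fip = pearson_r(first_half_fip, second_half_era)
r_era = pearson_r(first_half_era, second_half_era)
print(f"first-half FIP vs second-half ERA: r = {r_fip:.2f}")
print(f"first-half ERA vs second-half ERA: r = {r_era:.2f}")
```

Swapping in the full set of qualifying pitcher-seasons for the three lists is all that is needed to repeat the comparison at scale.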

What do the averages look like in these spans? While the correlations are certainly interesting and the equal predictive relationships to second-half ERA are more than noteworthy from an individual standpoint, the fact remains that the majority of these pitchers do experience some sort of decline in the second half that should be evident in an aggregate setting. Observe the graph below, which houses the first half discrepancy between FIP and ERA on the vertical axis, and the delta between first half ERA and end of season ERA on the horizontal:

Aside from the extreme outlier (in 2000, Doc Gooden had a 6.86 ERA and 8.60 FIP in the first half, and finished with an ERA of 4.71, a delta of 2.15), most of the pitchers with a discrepancy of one or more runs saw their ERA rise from its first-half mark by the end of the season, though the size of the increase can vary anywhere from a mere 0.10-0.15 runs up to two or so runs. This next chart displays the first-half FIP–ERA discrepancy on the vertical axis, with the same gap in the second half on the horizontal axis. The extreme outlier to the left is Mike Maroth’s 2007 campaign, which wasn’t very good:

What all of the data thus far tends to indicate is that, despite equal correlations between first-half ERA and FIP relative to predicting second-half ERA, most of these pitchers are going to see the discrepancy between their two stats shrink over the final three months of the season, considerably minimizing the FIP–ERA gap. In fact, only 13 of the 194 pitcher-seasons in this sample experienced an FIP–ERA differential larger in the second half than in the first three months. When combined with the “overachieving” first half, a better-than-expected season surfaces. Quantifying this hypothesis, here are the average ERAs, FIPs, and LOBs across the aggregates:


                 First Half          Second Half        Season End
               ERA   FIP  LOB%     ERA   FIP  LOB%     ERA   FIP  LOB%
Discrepancy   3.34  4.64  79.4    4.60  4.65  70.0    3.98  4.65  74.7
Rest          4.40  4.34  71.3    4.35  4.30  70.8    4.38  4.32  71.0

Taken at face value, these averages tend to disagree with the aforementioned correlations, as the group sporting the large discrepancy gravitates towards their FIP in the second half, making the first-half FIP seem much more predictive. However, the standard deviation for second-half ERA is quite large, at 1.42, compared to the 0.83 in the first half; the standard deviations for FIP are almost identical in each half, right around the 0.83 for ERA in the first half. This all suggests that, as a whole, the pitchers with vast gaps between both metrics in the first half will pitch to the talent level suggested by their controllable skill set after the All-Star break, but the individuals themselves diverge rather wildly from the mean. This helps make sense of why the first-half ERA has equivalent predictive value to first-half FIP with regard to the ERA produced over the final three months of the season.

Overall, correlation does not necessarily equal causation, instead referring to how the data relates to itself. Despite the fact that the discrepancy group boasts equal correlations to second-half ERA between the ERA and FIP in the first half, the correlations are weak, meaning that neither is particularly adept at making such future projections for this group and that neither holds any advantage over the other. The aggregate averages portend that FIP is a much more stable metric than ERA, which most of us already knew to begin with, but also that pitchers with the large gaps in the first half are very unlikely to pitch terribly in the second half as any sort of an evening-out process.

Instead, at worst, they are more likely to pitch to their FIPs, so someone like Jurrjens can be expected to fall somewhere between the 2.67 ERA and 3.63 FIP he currently sports, finishing the season right around a 3.12 mark; a 3.12 ERA still constitutes a fantastic season, especially with the innings total the young Braves righty is expected to log, as it indicates that he suppressed his own run environment in conjunction with getting a bit lucky. On the other hand, as the correlations suggest, he may very well hover around the 2.67 ERA, finishing up a remarkable season as opposed to a merely very good one. Component ERAs and run estimators based on controllable skills provide utility in the sense that they are more accurate at predicting future ERA, but as this data shows, the rules that apply to the greater whole of pitchers are not totally applicable to pitchers vastly outperforming their assigned talent level for half of a season. The safer bet would be to expect a regression to the FIP or QERA in the second half of the season, but these pitchers should not be automatically written off as flukes guaranteed to substantially worsen down the stretch.
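
The arithmetic behind a season-end estimate like the one for Jurrjens can be sketched quickly. Because ERA is earned runs per nine innings, a full-season ERA is the innings-weighted average of the two halves. The Python below assumes he pitches to his current 3.63 FIP the rest of the way; the innings shares are assumptions chosen to show how the estimate moves with the split, not figures from the data, and both function names are illustrative.

```python
def season_era(era_first, era_second, first_half_share):
    """Innings-weighted blend of two half-season ERAs.

    first_half_share is the fraction of the season's innings thrown
    in the first half (an assumption when projecting forward).
    """
    if not 0.0 <= first_half_share <= 1.0:
        raise ValueError("first_half_share must be between 0 and 1")
    return era_first * first_half_share + era_second * (1.0 - first_half_share)


def implied_first_half_share(era_first, era_second, era_season):
    """Back out the first-half innings share from three known ERAs."""
    if era_first == era_second:
        raise ValueError("share is undefined when both halves have the same ERA")
    return (era_second - era_season) / (era_second - era_first)


# Jurrjens scenario: 2.67 ERA so far, second half pitched at his 3.63 FIP.
even_split = season_era(2.67, 3.63, 0.50)      # assumed 50/50 innings split
front_loaded = season_era(2.67, 3.63, 0.53)    # assumed 53% of innings already thrown
print(f"even split: {even_split:.2f}, front-loaded: {front_loaded:.2f}")

# Estes 1997 from the table: 2.72 first half, 3.66 second half, 3.18 season.
estes_share = implied_first_half_share(2.72, 3.66, 3.18)
print(f"Estes implied first-half innings share: {estes_share:.2f}")
```

An even innings split puts the blend near 3.15, and a first-half share of roughly 53 percent lands near the 3.12 cited above. The Estes line runs the same relationship in reverse, though the answer is only approximate because the table values are rounded to two decimals.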

Eric Seidman

OnTilt

7/26

I think two things are clear for any large sample of pitchers:

- For the same group, the standard deviation of ERA will be substantially higher than the standard deviation of FIP when their respective averages are 4.60 and 4.65
- The standard deviation of a group of pitchers with a 3.34 collective ERA will be lower than that of a group with a 4.60 collective ERA

Am I missing anything? I don't think we can conclude anything by comparing the standard deviations here.

EJSeidman

7/26

The SDs are only a minor part of this. The major takeaway is that, from at least 1996-2008, and I'm in the process of running this back across a whole lot of seasons, when a pitcher has an FIP a lot higher than his ERA at the halfway point, ERA is equally predictive as FIP in terms of second half performance.

EJSeidman

7/26

Plus, per your second point, it's the SAME group that had 3.34 in the first half and 4.60 in the second half. So in the first half they had 3.34 ERA with an SD of 0.83... and in the second half, the same group had a 4.60 ERA with an SD of 1.42, which helps explain why the aggregate average is right in line with FIP, even though the ERA has equal predictive value.

dpowell

7/27

Your main point seems to be that Corr(FIP for 1st half, ERA for 2nd half) is higher than Corr(ERA for 1st half, ERA for 2nd half) for the overall sample, but that the two are the same when you select on a group that has a much higher FIP than ERA in the first half. That doesn't really mean FIP is a bad predictor for that group though - just within that group. You've eliminated a _major_ source of information for that group - the fact that FIP is 1+ runs greater than the ERA.

The inputs to FIP are linear so when there are non-linearities, FIP is going to be wrong, most likely in the outliers. You've basically selected on the outliers, meaning the information provided by FIP is relatively noisy within that group. I'm not saying anything 100% contradicting what you said. I just wouldn't say that a variable isn't a useful predictor when you've selected the sample in a way that should make it a poor predictor.

EJSeidman

7/27

Yeah, we're on the same page, except I never said FIP is a bad predictor for that group... in fact I said sort of the opposite, that the finding is that ERA is actually just as good for those pitchers.

FIP is at 0.34-0.35 for the overachievers and the greater sample, but the ERA correlation is much higher for the overachievers, equivalent to the FIP correlation.

So in no way is FIP a bad predictor for these guys, but we shouldn't automatically write off their ERAs either.

dpowell

7/27

Fair enough. It seems like what we want to see is a regression like (where _# refers to the half):

ERA_2 = a + b1 * ERA_1 + b2 * (FIP_1 - ERA_1) + e

If FIP is "better" than ERA, then b2 > 0.

And you're saying that, really, the effect of (FIP_1 - ERA_1) isn't linear and should potentially have a different effect for FIP_1 - ERA_1 > 1. So, you could just alter the regression to let b2 change at certain cutoffs.

Might be interesting...
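
A minimal Python sketch of that regression, for anyone who wants to try it. It fits ERA_2 = a + b1 * ERA_1 + b2 * (FIP_1 - ERA_1) by ordinary least squares using only the standard library. The five rows are the pitchers from the article's table and are far too few to say anything about b2; they only show the mechanics. The helper names (`solve`, `fit_ols`, `design_row`) are illustrative, and the cutoff variation is left out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def design_row(era_1, fip_1):
    """Columns: intercept, ERA_1, (FIP_1 - ERA_1)."""
    return [1.0, era_1, fip_1 - era_1]


# (ERA_1, FIP_1, ERA_2) for Estes, Galarraga, Hill, Irabu, Mays.
table = [
    (2.72, 3.83, 3.66),
    (3.40, 4.42, 4.01),
    (3.49, 4.51, 4.35),
    (2.47, 4.61, 5.55),
    (3.03, 4.79, 3.29),
]
X = [design_row(era_1, fip_1) for era_1, fip_1, _ in table]
y = [era_2 for _, _, era_2 in table]
a, b1, b2 = fit_ols(X, y)
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Letting b2 change at a cutoff would mean adding one more column to `design_row`, for example the gap multiplied by an indicator for FIP_1 - ERA_1 > 1.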

EJSeidman

7/27

Yeah, time permitting I'll run them through and see what surfaces.

But in case anyone gets to the comments, let me reiterate that the major takeaway here, in case it is confusing in the article, is NOT that FIP is a bad predictor in any way for these guys... but rather that when we see a pitcher with an ERA 1+ run below his FIP through the first half of a season, the data from 1996-2008 indicates that the ERA is equally as effective as FIP at predicting ERA in the second half. This phenomenon is not observed in the larger sample of pitchers excluding these guys.

dpowell

7/27

Right, apologies for causing the confusion. I really meant that (you're implying that) FIP is a bad predictor _conditional_ on ERA for your selected sample, but not for the entire sample. My points are:

1) Intuitively, that makes sense. If FIP is really far away from ERA, then it's probably partially because of "noise" in FIP (non-linearities are important which FIP ignores).

2) This isn't necessarily true given the results you've shown (though I'm guessing it is true anyway). Say you care about predicting Y and you have predictors A and B. And you have Corr(Y, A) = Corr(Y, B). Yes, A and B are equally "effective" in predicting Y. But that doesn't mean you don't want to use both. To understand the importance of a predictor - like FIP - you can't just compare its importance to another predictor. You have to see if each predictor is meaningful _conditional_ on the other. The regression I suggested above would get at that better.

EJSeidman

7/27

Right, I'm with you. But I do want to stress that a major point here is simply not to overlook ERA for these pitchers. Regardless of the conditional importance, which is certainly important, the data here shows that ERA should not be disregarded for pitchers outperforming FIP for "this long."

fireorlime

7/27

So in the group of pitchers with less than 1.0 difference between their ERA and FIP in the first half, is FIP a better predictor of second half performance than ERA?

Also, I wonder how this plays out for players with more than a one run difference between their ERA and FIP at the end of the season in terms of predicting their next season. At what point do we say that's no fluke at all, FIP consistently underestimates X pitcher because it is not accounting for some skill X pitcher has? I wonder how many cases there are of a pitcher with 3+ seasons of greatly outperforming his FIP, and perhaps if there's any commonality amongst those outliers.

EJSeidman

7/27

Yep, as I stated in the article, the correlations for the group without the big discrepancy were 0.23 ERA1-ERA2 and 0.34 FIP1-ERA2. So FIP is a better predictor for "normal" pitchers, but ERA shouldn't be written off for the "abnormal" guys.

As far as the other point, look at Carlos Zambrano's career.

fireorlime

7/27

Thanks for your reply and the article, I learned something new, as usual!

Steve Trachsel looks like another dude who regularly outperformed his FIP.

BobbyRoberto

7/27

This is interesting stuff, but I do wonder if there would be a difference if you used xFIP. After reading articles at The Hardball Times and FanGraphs, I've been convinced that xFIP is a better predictor than FIP.

EJSeidman

7/27

Bobby, while I do agree xFIP has merits over FIP, and that QERA, which uses the better K% and BB% rather than K9 and BB9, potentially has merits over both, the reason for using FIP is based on how people these days tend to look at a player's ERA and FIP and base a conclusion on his season strictly on the heels of his FIP-ERA discrepancy.

sunpar

7/27

RE: Jurrjens and Cain

It's interesting to note that Jurrjens' top PECOTA comp is Greg Maddux and Cain's top comp is Jim Palmer: both guys in their prime (and actually, Palmer throughout his career) were able to consistently outperform their FIP on a year-by-year basis.

I haven't seen Cain as much, but I've watched almost every single one of Jurrjens' MLB starts, and he's got some of Maddux in him (a young late 80's Maddux, not prime mid '90s Maddux). I wonder if the two of them will continue to consistently outperform their FIPs, and if the PECOTA comps will predict as much going forward.

EJSeidman

7/27

Very interesting observation. I'm the opposite, having watched all of Cain's starts but not much Jurrjens, and I can say with certainty that he has gotten out of jams more than he has prevented jams from occurring, which isn't the best way to be successful, but he is doing things differently out of the stretch that appear to be more real than fluky.

DrDave

7/27

#define pedant ON

"While a sample of this size may not be large enough off of which to base irrefutable claims[...]"

Off of which to base? Ouch. I gather Miss Pricknel used to whack your knuckles with a ruler for ending a sentence with a preposition? Shame on her.

If the terminal preposition (which should be "on", by the way, not "off of") doesn't work for you, how about "...may not be large enough to support irrefutable claims"?

#define pedant OFF

ScottBehson

7/28

Perhaps some sort of discriminant analysis could distinguish between "flukes" and "those who have found a new performance level". I think young guys like Cain and Jurrjens can be expected to have actually improved, while older guys outperforming peripherals may just be fluky spikes.
