
Imagine if the entire baseball blogosphere started using the original Runs Created formula (the one Bill James developed circa Off The Wall) as our primary way of valuing a player’s offensive contribution. Forget run environments, linear weights, league adjustments, and all of the other things we’ve learned over the past thirty years; instead, for the sake of efficiency, we went back to (H + BB) * (TB) / (AB + BB). Maybe it’s not perfect, but hell, it’s easy, and it’s not like Willie Bloomquist is going to come out better than Adam Dunn.

Sounds ridiculous, right? But that’s more or less what’s happening right now with defense-independent pitching stats. Quick, who’s been better this year? (Numbers are as of Tuesday afternoon.)


Pitcher          K/9   BB/9   HR/9   GB%    ERA    FIP   SNWP
Jeff Niemann     5.7    3.1    0.8  40.7%  3.62   4.24   .567
Matt Garza       8.0    3.5    1.1  41.7%  3.69   4.23   .569

It looks pretty close. They play in the same ballpark, obviously, so we’ll assume a similar run environment. Niemann and Garza have virtually identical ERAs, FIPs, and Support-Neutral Winning Percentages. Garza has a better K/BB ratio, but Niemann has allowed fewer home runs.

If you follow these things, you know where I’m going with this: HR/9 (or HR/PA, depending on how you want to look at it) was one of Voros McCracken’s original Three True Outcomes way back in 1999, and it’s been treated as such ever since. But that didn’t make sense then, and it doesn’t make sense now: a pitcher’s home run rate isn’t nearly as stable from year to year as his strikeout and walk rates, a fact that Voros himself noted in his early articles. Logically enough, when it’s used as a major component of a defense-independent pitching stat, it makes that metric less stable as well.

This isn’t new ground. There’s been plenty of research done on the subject, and the explanation is pretty clear: a pitcher’s HR/FB rate correlates about as well as his BABIP from year to year, especially after you adjust for park effects. Over the course of several years, there will be statistically significant differences between pitchers. But that simply doesn’t manifest itself every single year, and if you’re trying to evaluate a pitcher on a season-by-season basis, or in the middle of a season, it’s probably better to just leave it out.

Of course, we already have a couple of stats that do just that. If we use Nate Silver’s QuikERA, which uses K%, BB%, and GB%, Garza comes out well ahead, with a 4.22 QERA to Niemann’s 4.95 QERA. Another metric, xFIP, is very similar, in that it normalizes HR/FB, and has Garza at 4.39 compared to Niemann’s 5.03. These numbers tend to be very predictive; the year-to-year r-squared coefficients for QuikERA and xFIP center around 0.45, whereas FIP (which uses HR, K, BB, and HBP) comes in at 0.25, and ERA and RA are around 0.10. (These numbers are based on single-season data from 2004-2008. A different dataset might give you slightly different results, but should always lead to the same conclusion.)
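For readers who want to check this kind of year-to-year stability themselves, here is a minimal sketch of the calculation: pair each pitcher's value of a metric in one season with his value in the next, then square the correlation. The function name and the five sample pairs are made up for illustration; they are not the 2004-2008 data behind the figures above.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov * cov / (var_x * var_y)


# Hypothetical values of one metric for five pitchers in year N and year N+1.
year_n = [3.50, 4.10, 4.60, 3.90, 5.00]
year_n_plus_1 = [3.80, 4.00, 4.90, 4.30, 4.60]
print(round(r_squared(year_n, year_n_plus_1), 2))
```

The closer the result is to 1, the more of next year's spread between pitchers is accounted for by this year's number.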

Yet, despite the obvious logic of using GB% or a normalized HR/FB number as the third true outcome for single-season stats, the uptake in the blogosphere has been oddly slow. FIP has largely become the de facto rate stat for pitchers, and while it’s useful over the long haul, it leans far too heavily on HR/9 to be used for shorter time spans. Graham MacAree’s tRA has also gained some steam, but it’s based on batted-ball data which, for my tastes, is still far too subjective (although that won’t be the case forever; more on this below).

Going back to Niemann and Garza, just under eight percent of Niemann’s fly balls have left the ballpark this year, while Garza’s rate is just over eleven percent. So while their ground-ball percentages are virtually equal, Niemann’s HR/9 is significantly lower. FIP sees those home runs as reflections of each pitcher’s true talent, and rates the two pitchers equally. In contrast, xFIP and QuikERA see HR/9 as a function of the pitchers’ ground-ball rates, prone to tremendous amounts of random variation, and give Garza a huge edge. If those are my only two choices, I’m taking the latter.
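To make the difference concrete, here is a rough sketch of the two calculations. It uses the commonly published form of FIP, and xFIP as the same formula with actual home runs swapped for fly balls times a league HR/FB rate. The additive constant (3.10) and the league HR/FB rate (0.10) are placeholder assumptions that vary by season, and the two stat lines are invented for illustration; they are not Niemann's or Garza's actual totals.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """FIP in its commonly published form; `constant` is an assumed league scaling term."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fb, bb, hbp, k, ip, lg_hr_per_fb=0.10, constant=3.10):
    """Same as FIP, but with home runs replaced by fly balls times an assumed league HR/FB."""
    expected_hr = fb * lg_hr_per_fb
    return fip(expected_hr, bb, hbp, k, ip, constant)


# Two hypothetical pitchers, identical except for home runs allowed on 180 fly balls.
shared = dict(bb=50, hbp=5, k=120, ip=160.0)
pitcher_a_fip = fip(hr=14, **shared)   # roughly 8 percent HR/FB
pitcher_b_fip = fip(hr=20, **shared)   # roughly 11 percent HR/FB
pitcher_a_xfip = xfip(fb=180, **shared)
pitcher_b_xfip = xfip(fb=180, **shared)

print(round(pitcher_a_fip, 2), round(pitcher_b_fip, 2))
print(round(pitcher_a_xfip, 2), round(pitcher_b_xfip, 2))
```

With everything else held equal, the gap in FIP comes entirely from the home run column, while xFIP treats the two pitchers identically.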

That might seem a bit unfulfilling, and I don’t totally disagree with that sentiment, given that pitchers do have some control over their BABIP and HR/FB over long periods of time. For example, from 2005 to 2008, about 9.4 percent of the fly balls hit off of Chien-Ming Wang went for home runs, while Felix Hernandez was closer to 17 percent. The pitchers’ respective park factors actually increase that gap, as the old Yankee Stadium was about league average, while Safeco Field depressed home runs a bit. While that doesn’t absolutely mean that Wang’s fly balls are less likely to turn into home runs than Felix’s, it is certainly a very, very strong hint.

So how do we reconcile this for single-season (or in-season, for that matter) data? The best way would probably be to use the regression method outlined in the appendix of The Book, both for park-adjusted HR/FB and BABIP. The FIP formula could be adjusted to use these new estimates, or perhaps a new regression-driven linear weights approach could be built from scratch (a la wOBA, but with “true talent” estimates instead of actual measured components). Either way, this “new” rate could also be used on a game-to-game basis to figure out each pitcher’s SNWP. I’ll leave this to people who didn’t get a two on their AP calc exam (I cheated off of someone who thanked me after the test for giving him all of his answers), but I’m pretty certain this is the right approach to take.
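As a rough illustration of what regressing HR/FB could look like, here is one common way to shrink an observed rate toward the league average: add a fixed number of league-average fly balls to the pitcher's record before computing the rate. This is a sketch of the general idea, not necessarily the exact procedure in The Book. The `ballast` size and the league rate below are hypothetical placeholders that would have to be estimated from real data.

```python
def regressed_rate(successes, trials, league_rate, ballast):
    """Shrink an observed rate toward `league_rate` by adding `ballast` league-average trials."""
    if trials < 0 or successes < 0 or ballast <= 0:
        raise ValueError("counts must be non-negative and ballast must be positive")
    return (successes + league_rate * ballast) / (trials + ballast)


LEAGUE_HR_PER_FB = 0.10   # assumed league rate, for illustration only
BALLAST_FLY_BALLS = 300   # hypothetical regression amount, not an estimated value

# A hypothetical pitcher who has allowed 8 home runs on 100 fly balls so far.
observed = 8 / 100
estimate = regressed_rate(8, 100, LEAGUE_HR_PER_FB, BALLAST_FLY_BALLS)
print(round(observed, 3), round(estimate, 3))
```

The more fly balls a pitcher has allowed, the less the league-average ballast matters, which is the behavior you want: a multi-year sample like Wang's would keep most of its distance from the league rate, while a half season would be pulled most of the way back.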

Looking ahead, this will all hopefully become a moot point in the not-too-distant future, with HITf/x, and perhaps even GAMEf/x, becoming full-blown realities. In that case, we could have linear weights for every hit ball depending on its speed, angle, and end location. This would probably require some heavy regression as well, especially for game-to-game comparisons, but we would undoubtedly end up with a much clearer picture of each player’s underlying performance.

For the time being, though, let’s stop relying on FIP for single-season stats. If we can build a regression-driven hybrid, great. But otherwise, let’s stick to QuikERA and xFIP.


Statistics courtesy of Baseball Prospectus’s Bil Burke and The Hardball Times.

Ozdoltorps
8/05
Outstanding article. I didn't know all of that about FIP and SNWP, and I normally shy away from stat heavy articles, but this was easy to read and understand.
Arrian
8/05
Good stuff, thanks. I like the CMW reference, especially now that he's probably never going to be the same.
leez34
8/05
Stuff I didn't know. I've never seen you write like this before, Shawn! Glad to know you're so multitalented.
incaficious
8/05
What would be really helpful for readers less statistically-inclined is if you included terms like "r-squared coefficient" in the glossary. I have no idea which is better, 0.1 or 0.45. Thanks.
marioreturns66
8/05
The r-squared coefficient tells you how much of the variation between two data sets can be attributed to the data we're testing. In English, we can say that 45% of the variation in QuikERA from year to year can be predicted using his previous year's QuikERA. The rest is because of aging, change in ability, typhoid fever, whatever. Hope this helps.
dethwurm
8/05
Also, what constitutes a "good" r-squared varies depending on context. In something as complicated as baseball, a correlation of 0.45 is very good. In a high school physics experiment, it's very bad.
dethwurm
8/05
I should say "strong" or "meaningful," not "good."
padresprof
8/06
Pat, Feel free to stick with your original comments. R squared of 0.45 would be laughable, unpublishable, sophomoric, etc. in physics. In several social sciences, it would be "significant". ;)
beeker99
8/05
Great work, Shawn! I read somewhere recently that FIP was flawed, but no explanation was given. Now I know why.
brownsugar
8/05
I would argue that FIP isn't "flawed" necessarily, just that it is more descriptive than prescriptive. Maybe Pitcher X has given up home runs on 3% of his fly balls in 2009, and in 2010 and beyond that won't last, but for the games already played in 2009 those lack of home runs allowed are real and do matter. FIP is a good statistic as long as everyone keeps in mind that it is a record of past pitcher performance rather than a prediction of future pitcher performance.
marioreturns66
8/05
You could say the same about ERA and RA. In the end, the goal is to allow less runs, so those stats are really the end-all-be-all of "descriptive" stats, but we would never use it to evaluate a pitcher's inherent value. If I can get around to it, I'll see which is more predictive over a 3-4 year period, FIP or QuikERA. I'm honestly not sure which it would be.
DrDave
8/06
"You could say the same about ERA and RA. In the end, the goal is to allow less runs, so those stats are really the end-all-be-all of "descriptive" stats, but we would never use it to evaluate a pitcher's inherent value." That's an awfully ironic thing to say as a BP writer, given that BP does not publish any pitcher evaluation stats that are not outcome-dependent in the way ERA and RA are. VORP, SNWP, etc. are all about how many runs actually scored. Where is the stat that is to QERA as batter's VORP is to OPS?
thenamestsam
8/05
But the point is that b/c of the "luck" involved in the percentage of flyballs that turn into homeruns it tells us something that just doesn't get to the heart of pitching quality. Maybe flawed isn't the right term exactly, but the way FIP is treated is as a measure of the true quality of what a pitcher has done that tells us something deeper than Wins or ERA, and below that the problem is that it's not really doing that, which I would call a flaw.
jepson
8/05
No, it's a record of past pitcher results (or more precisely, past team results with the particular pitcher active in the game). Whether or not this can be accurately equated with past performance depends to a large extent on the degree to which the game result in question varies independently of the actions of the player. As Shawn points out in this article, HR/FB is highly prone to random variation - therefore, taking HR/FB results at face value leads to a skewed measurement of performance, either past or future.
dbthewise1313
8/05
I get that pitchers that generate a lot of grounders can succeed with lower strikeout rates, but below a certain GB/FB threshold, is GB% really a good predictor of success? Could it be the success of extreme groundball types skew studies and make us believe that more grounders are always good? Can we say that someone with a GB% of 45% will pitch better than someone with a GB% of 40%, all else being equal? Ok, so it could be that the pitcher who induces more grounders might be less susceptible to extra-base-hits, but is it plausible that they are also more hittable, enough to negate the XBH advantage? It just seems that the use of GB% should be more nuanced. For example, is there any more harmless ball put in play than the infield pop-up (though maybe infield fly ball rates aren't stable over a single season)? Perhaps someone who's done research or can cite research on this can illuminate me.
sunpar
8/06
http://www.baseball-reference.com/leagues/split.cgi?t=p&lg=AL&year=2009#traj Ground balls: .234/.234/.254 (488 OPS) Fly Balls: .220/.215/.587 (801 OPS) I would say that it's always better to get a ground ball than a fly ball, unless it's a situation like 2 outs, man on 2nd or 3rd. If you look at successful fly ball pitchers like Johan Santana, they are able to limit SLG% on fly balls more than most, but even they appear to benefit by allowing ground balls rather than fly balls. http://www.baseball-reference.com/players/split.cgi?n1=santajo02&year=Career&t=p#traj Santana's splits, career: Ground balls: .214/.214/.234 (448 OPS) Fly Balls: .191/.188/.477 (665 OPS)
dbthewise1313
8/07
Thanks for responding. The first link is not quite what I'm looking for as I want to exclude groundball freaks from my sample (this methodology is debatable of course). I examined some trajectory data of a few top pitchers and see that their OPS against is higher for fly balls. This brings up something I hadn't been thinking about which is that groundballs are almost always singles, unless they are hit down the line. So, a groundball, almost by definition would be better than a flyball. In addition, we are omitting line drives and if a pitcher's GB% is correlated with LD%, then it may not be a good thing to induce more grounders. I still hesitate to reach a firm conclusion until it can be shown that GB rate is not correlated with a LD/FB ratio.
tfierst
8/05
AMEN on this article!
Hawkeye
8/06
Where can I find QERA season stats?