July 8, 2008
Once More With the Pitching Scales
In the past few editions of Toolbox, we've been discussing pitching statistics, and trying to answer one of the big questions that I receive frequently in emails: with all of these fancy statistics we've got here at Baseball Prospectus, how do you know what's good, bad, or in between? The answer so far has been to go stat-by-stat, comparing and averaging out performances over the last three years using a methodology borrowed from Kevin Goldstein and William Burke. You can check out the first two articles in the series here and here.
Now that you're caught up, let's get to some reader mail. Last time, when we went over the scales for some of the reliever statistics, I expressed some disappointment that the "average" rating in ARP (a stat that is measured in terms of runs above average) had a positive value. Several readers wrote in about this, all thinking I should keep my chin up. Reader C.B.'s email was representative of the group:
You say that it's disappointing to see the values for average RPs to be above average. I'm actually surprised they aren't more above average. After all, you're looking at the top relievers in terms of innings pitched. If the most used relievers in the game aren't above average, then don't we need to question why managers keep using below average relievers so often?
Well, first off, I'd like to thank everyone for being so forgiving of my selection bias. The cutoff for relievers took players out of the pool who were, on average, pretty darn awful: altogether, the folks who didn't make the innings threshold contributed over 6,000 innings between 2005 and 2007, posting a Reliever's Fair RA (FRA) of 6.06, compared to a league average of 4.62. By and large, this speaks well of baseball as a meritocracy. The relievers who couldn't meet my low playing time thresholds included a handful of guys who were effective but simply weren't called up early enough (think Joba Chamberlain last year) or lost time to injury (Mike Gonzalez), but mostly they were just guys who were given a chance, then were quickly removed from the roster when they couldn't hack it. To that extent, I'm rather impressed with the job that managers did in weeding out the chaff.
As for why managers use below-average relievers so often, it might be belaboring the obvious to point out that there are considerations other than performance when managers dole out playing time. While it's easy to say that so-and-so shouldn't be playing because his stats are bad, managers are often stuck with players who have a significant gap between their perceived talent and their performance level. Money's a factor (an eight-figure salary will get you a lot of second chances to prove you should be on the field), as are roster restrictions: some guys have the right to refuse minor league assignments, so the team is compelled to let them work out their problems at the major league level. Injuries can also turn a normally productive major leaguer into a replacement-level performer.
Reader E.T. had some more basic issues with my methodology:
As I was reading the article, I was wondering why you would separate the groups evenly/arbitrarily and not use something like a mean and standard deviations instead. It seems to me that choosing a number to regard as elite is based more on intuition. For instance, why should we expect 'x' number of elite players at any position? Shouldn't elite be defined by the actual results of individual performance compared to the group as a whole? Regardless, I think a comparison of those methods would be interesting.
For those of you who haven't been following Toolbox from the beginning, here's something I hope won't be a shocking revelation: While I write about statistics in this space, I'm not much of a sabermetrician. In fact, that's probably why I'm good at the job: since I'm not a mathematical type, I have to work pretty hard to understand what the much smarter authors on this site are writing about. By the time I'm finished explaining their work to myself, the explanations are pretty much ready for anyone above a fifth-grade reading level.
Generally, that means that when I do my own sabermetric work, I try to keep it as simple as possible. With the exception of the FRA data I presented last time (which was supplied to me by the incomparable Mr. Burke), all of the data you've seen in the Scales project could be compiled by any BP Premium subscriber with decent spreadsheet skills and some time to kill. This is by design: I purposefully avoid some statistical tools, such as scatter charts, that I feel are likely to confuse or alienate a new reader.
With those caveats out of the way, I'll explain what E.T.'s talking about, in language that will likely not do it justice. Standard deviation is a measure of how spread out a body of data is around its mean (what most of us simply call "the average"). If the data is normally distributed (that is, in a bell curve configuration), approximately 68 percent of the set's data points will fall within one standard deviation of the mean, and about 95 percent will fall within two standard deviations. If I were forced to do the calculations by hand, the results would be so pitiful that my ninth-grade math teacher, Mr. Sikso, would probably weep a little before hurling several choice expletives at me. Nonetheless, with Excel's help I ran the calculations on the FRA data from last week. The mean FRA for the players who qualified was 4.22, and the standard deviation of the sample was 1.58. So if my calculations were right, it would take an FRA of 2.64 on the good side or 5.80 on the bad side to be one standard deviation from the mean, and 1.06 or 7.38 to be two standard deviations away.
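For readers who would rather check that arithmetic in code than in Excel, here is a minimal Python sketch. It assumes only the mean (4.22) and standard deviation (1.58) quoted above; the function names `sd_bands` and `share_within` are made up for illustration, and the 68 and 95 percent shares apply only if the data really does follow a bell curve, which has not been checked here.

```python
import math


def sd_bands(mean, sd, k_values=(1, 2)):
    """Return {k: (low, high)}: the cutoffs k standard deviations from the mean."""
    return {
        k: (round(mean - k * sd, 2), round(mean + k * sd, 2))
        for k in k_values
    }


def share_within(k):
    """Share of a normal distribution that lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


# Mean and standard deviation of FRA for qualifying relievers, as quoted above.
bands = sd_bands(4.22, 1.58)

for k, (low, high) in bands.items():
    # For a runs-allowed stat like FRA, the low cutoff is the good side.
    print(f"{k} SD: good side {low:.2f}, bad side {high:.2f}, "
          f"normal share within: {share_within(k):.1%}")
```

The low end of each band is the "good side" because FRA counts runs allowed, so lower is better.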
Let's put that next to the scale that we set forth for FRA last week:
Fair RA
2nd SD      1.06
Elite       2.38
1st SD      2.64
Good        2.82
Average     4.15
Mean        4.22
1st SD      5.80
Bad         5.92
Superbad    6.61
2nd SD      7.38
So by E.T.'s standards, we've had about nine elite relievers over the last five years, and a baker's dozen of truly awful ones. I still prefer the Goldstein/Burke method we started out with (what can I say, I'm lazy), but if there's enough clamor for the standard deviations approach, I'll climb out on a mathematical limb for you.