“Just when I thought I was out…they pull me back in!”
–Michael Corleone, Godfather Part III

This quote represents the only resonant moment in the apocryphal third Godfather movie, a film ruined by, in no particular order, nepotism, over-acting, a filmmaker in financial difficulty, and a script that went through approximately 8,000 drafts and was compromised by cast defections. Regardless of those flaws, whenever I think I’ve wrapped something up nicely and then find that I can’t move on, it’s Al Pacino’s voice I hear in my head, in full Scent of a Woman scenery-chewing mode, lamenting his inability to move on to a post-Cosa Nostra life. Two installments of the Scales Project ago, I announced that we’d conclude last week, but the previous edition seems to have raised more questions than it answered. Since I’m a question-answering kind of guy, we’ll stick with relievers for another week. I trust you’ll let me know if this starts to look like a quagmire.

The one thing that everyone seems to agree on is that the scales of performance should be based on standard deviations, rather than on the method I’d been using to calibrate performance, which was based on the average performance of those players who had the most playing time. Many of you also had ideas on how to sharpen things up a bit. Reader C.B., whom we heard from in the last installment, checks in again:

While I don’t completely agree with your hand-waving response to my point about your average relief pitchers being above-average, it’s not really a point worth arguing over. Instead, I’m coming to the defense of E.T. You wrote, ‘So by E.T.’s standards, we’ve had about nine elite relievers over the last five years, and a baker’s dozen of truly awful ones.’ E.T. never said how many standard deviations from the mean should be used, and it looks like you assumed that you had to use full standard deviations. From the numbers you gave, it looks like using 1/2 standard deviations would be more useful where -1.5 standard deviations and below would be elite, -1.5 to -0.5 standard deviations would be good, -0.5 to 0.5 standard deviations would be average, 0.5 to 1.5 standard deviations would be bad, and 1.5 standard deviations and above would be superbad. Those ranges are just suggestions, and further tweaking may be necessary.

I’ll come back to the first part of C.B.’s mail, about the above-average relievers, in just a moment. First, let me be clear that the way I ended the last column wasn’t meant to denigrate E.T. or his proposal; it was meant to give an admittedly incomplete accounting of what happens when you apply standard deviations to this purpose. Taking C.B.’s suggestion, let’s look at what our scales might look like, adding a few more gradations:

Cutoff    FRA    % of Sample   2008 Player
-2.0 SD   1.06       1.5       Mariano Rivera (1.11)
-1.5 SD   1.85       6.2       Max Scherzer (1.82)
-1.0 SD   2.64      16.2       Dan Wheeler (2.62)
-0.5 SD   3.43      32.7       Francisco Cordero (3.43)
Mean      4.22      37.2       Brian Tallet (4.21)
+0.5 SD   5.01      29.3       Aaron Heilman (5.01)
+1.0 SD   5.80      17.0       C.J. Wilson (5.81)
+1.5 SD   6.59       7.2       Eric Gagne (6.62)
+2.0 SD   7.38       2.7       Mike Timlin (7.41)
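For anyone who wants to check the arithmetic at home, here is a minimal Python sketch of how the FRA column above can be rebuilt. It assumes a standard deviation of 1.58, which I'm inferring from the spacing of the rows (5.01 minus 4.22 is 0.79, or half an SD) rather than from the underlying data. The names `threshold` and `label` are made up for illustration, and what to do with an FRA sitting exactly on a cutoff is my own choice, not something C.B. specified.

```python
# Mean FRA from the table; the SD is an assumption inferred from the row
# spacing (5.01 - 4.22 = 0.79 is half an SD, so a full SD is about 1.58).
MEAN_FRA = 4.22
SD_FRA = 1.58

def threshold(k):
    """FRA sitting k standard deviations from the mean (negative k = better)."""
    return round(MEAN_FRA + k * SD_FRA, 2)

def label(fra):
    """Bucket an FRA using C.B.'s suggested half-SD ranges (lower is better).

    How exact ties at a cutoff are handled is an arbitrary choice here.
    """
    z = (fra - MEAN_FRA) / SD_FRA
    if z <= -1.5:
        return "elite"
    if z <= -0.5:
        return "good"
    if z < 0.5:
        return "average"
    if z < 1.5:
        return "bad"
    return "superbad"

# Rebuild the FRA column, one row per half-SD step.
for k in (-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"{k:+.1f} SD  {threshold(k):.2f}")
```

Calling `label(2.62)` on Dan Wheeler's FRA, for instance, puts him in C.B.'s "good" range, while Mike Timlin's 7.41 lands in "superbad."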

I’ve appended as examples a few players from the current season operating near each of the standard deviation guideposts to give you a slightly more tactile idea of what a player with a similar FRA “looks like,” and I’m also showing you what percentage of our original 600-player sample fell in each category. It looks like C.B.’s estimates were dead on. If you look at the three numbers in the middle, using half a standard deviation chops our sample roughly into upper, middle and lower thirds, just as the Goldstein/Burke method did previously. The next gradation, at 1.5 standard deviations, raises the bar a bit on what we termed the “elite” and “superbad.” Not that this is a bad thing, but looking at that distribution raises some issues, which reader J.M. brings into relief:

Your description of the standard deviation method is accurate. But I’m not sure you thought through the implications. Assuming a normal distribution, only 2.5 percent of pitchers would achieve Elite status using the +2 standard deviations test. That is a much more stringent requirement than yours, which allows in about nine percent. In fact, using your approach I estimate about half the pitchers that meet the Good standard would also be Elite. That strikes me as too easy. Using two standard deviations it would be only 15 percent. That may be too tough. I suggest splitting the difference, somewhere around 1/3 of the Good being Elite. Same for Bad and Superbad.

The real problem, though, is this distribution is clearly not normal. If it were, the difference between Average and Good (4.15-2.82=1.33) would be similar to the difference between Bad and Average (5.92-4.15=1.77). This distribution is highly skewed. That’s why there are more pitchers two standard deviations worse than the mean than there are two standard deviations better. The standard deviation method doesn’t work well if the distribution is highly skewed.

J.M.’s correct about everything here, even if our numbers are slightly different. To give everybody an idea of what he’s talking about, it’s my honor to present Toolbox’s first-ever chart:

[Chart: Player/Season Distribution]

The shape is more or less right for a bell curve, but if you’re familiar with them, you’ll see that the distribution is weighted a bit toward the front (performances better than the mean), with a slope that’s much too gentle on the back end. I looked at the possibility that my cherry-picking the top 200 of each year for playing time might be the problem, but looking at FRA without the playing-time filter made the distribution even more amorphous. I’ll leave it to wiser minds than mine to tell me if this is close enough to the ideal to be functional. For now, I think it’s as good as we’ll get.
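One rough way to put numbers on J.M.'s point is to compare the share of the sample beyond each cutoff with what a true normal distribution would put there. The sketch below is only that, a sketch: it assumes the "% of Sample" figures for the non-mean rows are cumulative tails (everyone at or beyond that cutoff), which is my reading of the table, and the helper names are invented for the example.

```python
import math

def normal_cdf(z):
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_tail_pct(k):
    """Percent of a normal distribution lying more than k SDs to one side."""
    return 100.0 * (1.0 - normal_cdf(k))

# Copied from the scales table; treating them as one-sided tails is an assumption.
# (cutoff in SDs, % of sample better than -k SD, % of sample worse than +k SD)
observed = [
    (0.5, 32.7, 29.3),
    (1.0, 16.2, 17.0),
    (1.5, 6.2, 7.2),
    (2.0, 1.5, 2.7),
]

for k, better, worse in observed:
    print(f"{k:.1f} SD  normal {normal_tail_pct(k):4.1f}%  "
          f"better {better:4.1f}%  worse {worse:4.1f}%")
```

A normal curve puts about 2.3 percent of its area beyond two standard deviations on either side, so a sample with 1.5 percent on the good end and 2.7 percent on the bad end shows the lopsidedness J.M. describes.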

Getting back to the beginning of C.B.’s letter, just as I didn’t mean to undercut the standard deviation approach, I also didn’t mean to pish-posh C.B.’s concerns about the lack of a stronger selection bias in the original breakdown of ARP. C.B.’s point, I think, was that if managers were using their relievers efficiently, the average player to receive a substantial amount of playing time should be well above the league average, as the league average is weighted down with scrubs who’ll pitch a handful of innings, stink up the joint, and then be sent back to the minors (or worse). The problem I pointed out last time, expressed in a slightly less vague way, is that our knowledge is imperfect, both from the standpoint of the decision-makers who make the roster and dole out playing time, and in terms of how much of a premium we should gain if everyone made the right decisions.

The decision-makers, even if they’re looking at the data and acting logically, have a tough job. Think about a player like Aaron Heilman, who’s noted above: he has a track record of success, but halfway through the season he’s performed poorly, posting a “bad” FRA (by our new standard) and a below-average ARP. Still, his peripherals are strong, and he probably wouldn’t accept a minor league assignment anyway, even if you thought you had a better player to replace him, which isn’t always the case. In a situation like this, the decision-makers are likely to see if he can pitch his way out of the poor performance, rather than trade him at the bottom of his value or waive him.

Beyond that type of situation, there are times that a below-average pitcher is perceived to have more value than his ARP would suggest. I’ve sometimes called this the Todd Jones effect: the idea that while a guy may suck overall, when the chips are down he has the gumption/experience/raw stuff to work his way out of tight spots, so he’s valuable. If you look at Jones, he has a negative ARP (-4.8) representing below-average run prevention, but a positive WXL of 0.663 (that’s Lineup-Adjusted Win Expectation, a cousin of the more famous WXRL, which I use here so that we’re comparing both stats to the league average, rather than the replacement level). The win expectation figure suggests that Jones is doing something different in his high-leverage pitching situations, something that doesn’t show up in his overall numbers. Another closer, Texas’ C.J. Wilson, has an even larger disconnect between his run prevention (-6.4 ARP) and win expectation (1.369 WXL).

If you think Wilson and Jones have special abilities to pitch better in the clutch than they do in low-pressure situations, or if you think their run prevention numbers are the result of bad luck and win expectation reveals their true talent level, then the fact that the average player in our sample had an ARP of only 4.7 might not be much cause for despair. If, however, you don’t believe either of those things, then one way to measure the efficiency with which relievers are being used would be to look at the correlation between ARP and reliever Leverage. So I think I’ll put that out to you as an item for discussion until next time.
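If you'd like a head start on that homework, here is a minimal sketch of the correlation calculation. It uses the ordinary Pearson coefficient, which is my assumption about what "correlation" should mean here. The function name is invented, and the numbers fed to it are made-up placeholders, not real ARP or Leverage figures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up placeholder values, NOT real player data. Replace them with each
# reliever's ARP and Leverage to run the actual comparison.
toy_arp = [10.0, 5.0, 0.0, -5.0]
toy_leverage = [1.6, 1.2, 1.0, 0.8]
print(round(pearson_r(toy_arp, toy_leverage), 3))
```

A result near +1 would say the best run preventers are getting the most important innings, and a result near zero would say Leverage is being handed out with little regard to ARP.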