After a short delay, we’re back with the next phase of the “Scales Project,” this time bringing the methodology we borrowed from William Burke and Kevin Goldstein to bear on relief pitchers. Recapping from last time, the method I used for the starters was to take the top 150 pitchers each year (basically, the equivalent of five starters per team), ranked by the number of innings they threw as starters, and then within that sample to rank the players according to various statistics (ERA and SNLVAR). Those pitchers were then averaged with like-ranked players across the years 2005-2007. So the top pitcher profile per SNLVAR (a composite of Roger Clemens’s 2005, Johan Santana’s 2006, and Jake Peavy’s 2007) posted an average SNLVAR of 9.0, had an average record of 17 wins and 6.7 losses, an average VORP of 79.9, and so forth. We then used these rankings to divide the sample into thirds (Good, Average, and Bad, as ranked by that statistic) and further took the top half of the Good group and the bottom half of the Bad group to build profiles of Elite and Superbad performance in each statistic. For the starters, this method outlined a scale for each statistic, which looked like this:

              ERA      SNLVAR
Elite        3.14        6.43
Good         3.47        5.61
Average      4.40        3.23
Bad          5.50        1.18
Superbad     6.28        0.58
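
For readers who want to try this at home, the rank-average-and-split steps described above can be sketched in a few lines of Python. Treat this as a rough sketch under stated assumptions, not the actual work behind the table: the function names `composite_profiles` and `scale` are invented for illustration, each season is assumed to be already trimmed to the same number of pitchers, and the sample is assumed to be large enough that every tier has at least two members.

```python
from statistics import mean

def composite_profiles(seasons, higher_is_better=True):
    """Average like-ranked pitchers across seasons.

    seasons: one list of stat values per year, all the same length.
    Returns a best-to-worst list of composite (averaged) values.
    """
    # Rank each year on its own, best first, then average rank by rank.
    ranked = [sorted(year, reverse=higher_is_better) for year in seasons]
    return [mean(same_rank) for same_rank in zip(*ranked)]

def scale(profiles):
    """Collapse a best-to-worst list of composites into five tiers."""
    n = len(profiles)
    third = n // 3
    good = profiles[:third]
    average = profiles[third:n - third]
    bad = profiles[n - third:]
    return {
        "Elite": mean(good[:len(good) // 2]),    # top half of Good
        "Good": mean(good),
        "Average": mean(average),
        "Bad": mean(bad),
        "Superbad": mean(bad[len(bad) // 2:]),   # bottom half of Bad
    }

# Toy values only (not real stat lines): three "seasons" of 12 pitchers.
toy_seasons = [list(range(1, 13)), list(range(2, 14)), list(range(3, 15))]
toy_scale = scale(composite_profiles(toy_seasons))
```

For a stat where lower is better, such as ERA, the same sketch would be called with `higher_is_better=False` so that the best composite still lands at the top of the list.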

Naturally, there were several questions and suggestions about both the technique and the results. We’ll start with C.W.:

Since ERA is a true rate stat, might it not make sense to use SNLVAR/GS or something similar to capture true success independent of playing time? You might have to establish a cut-off point for minimum number of starts, but that seems like something you would have to do for ERA as well, no?

We did establish a cut-off point for playing time; it was just based on innings as a starter rather than starts. Rather than having a fixed cutoff (like the ERA title-qualifying sum of approximately 162 IP), we simply took anyone who fell in the top 150. For 2007, the cutoff was 76 1/3 innings as a starter, a rather paltry sum; I’ll leave it to your judgment whether it’s too low to be useful. Reed Baessler had some related thoughts:

Just read your article about translating scales. Your suspicion that the SNLVAR range wasn’t so great because of playing time seems right to me, because ERA is a rate statistic and SNLVAR is a counting statistic. So, playing time is a huge factor and counting statistics don’t scale easily on the low end. For example, Jay Bruce has three home runs, on the low end of a counting scale, but the high end on a rate scale like SLG. Right?

Right, although a player like Bruce probably wouldn’t show up in this sample, since the first cutoff was playing time. Establishing scales for rate stats (like SNLVAR/GS, as suggested by C.W.) is a lot easier than doing so for counting stats (like plain old SNLVAR), and using a rate stat proxy may, as Reed suggests, make for a cleaner scale. The only problem I have with the rate stat approach is that we don’t just use rate stats; counting stats like SNLVAR, WARP, VORP, wins, and saves are referenced on this site and in baseball conversations in general. I’m hoping to produce scales that serve those who use counting stats, as well as rate stats.
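
To make the counting-versus-rate distinction concrete, here is a minimal Python sketch of the kind of per-start rate C.W. suggests. The helper name `per_start_rate`, the `min_starts` floor of 10, and both pitchers are hypothetical, chosen only to illustrate the idea; none of them come from the actual sample.

```python
def per_start_rate(total, starts, min_starts=10):
    """Return a counting stat per start, or None below the playing-time floor.

    min_starts is an assumed cutoff for illustration, not one used in the study.
    """
    if starts < min_starts:
        return None
    return total / starts

# Two made-up pitchers: (season SNLVAR, games started)
workhorse = (4.0, 32)
swingman = (3.0, 12)

workhorse_rate = per_start_rate(*workhorse)
swingman_rate = per_start_rate(*swingman)
```

The workhorse leads on the counting stat while the swingman leads on the rate, which is exactly why the two kinds of scales can tell different stories about the same pitchers.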

Ira Blum had some issues with how the scales relate to the replacement level:

I found it very interesting, your article. But I have a question. In the table at the bottom, your Bad and Superbad had data that seemed contradictory. Under the SNLVAR column, Bad is 1.18 and Superbad is 0.58. But in the VORP column, Bad is -0.1 and Superbad is -6.7.

How can the same pitchers be above replacement by one measure and significantly below replacement by another? Is it that VORP and SNLVAR are measuring something that different? Or is it that the replacement levels have different meanings?

The two stats measure different things, so it’s logical that replacement level in one is not necessarily replacement level in the other. SNLVAR uses a win expectancy framework to indicate a starter’s support-neutral contribution, while VORP is based on a pitcher’s runs allowed average (RA) without the win expectancy context. Also, while both stats are measured against the replacement level (so a true replacement level performance is zero in each stat), the units differ: SNLVAR is measured in wins, while VORP is measured in runs. While the conventional wisdom is that ten runs equals approximately one win, you can’t extrapolate that to say that an SNLVAR of 1.0 is equivalent to a VORP of 10, or vice versa.

There are a number of other factors that could skew the results. SNLVAR only measures a player’s performance as a starter, while VORP would include the relief appearances of a pitcher who also spent time in the bullpen. There might also be a selection bias, since there was a playing time threshold for the sample.
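
A quick bit of arithmetic shows how far off the naive conversion is for the figures in Ira’s letter. The sketch below is illustrative only: `RUNS_PER_WIN` is the ten-runs-per-win rule of thumb, not a precise constant, and `naive_runs` is a made-up helper name.

```python
RUNS_PER_WIN = 10  # rule of thumb only; an assumption, not a precise constant

# (SNLVAR, VORP) pairs as quoted in the reader's letter
tiers = {"Bad": (1.18, -0.1), "Superbad": (0.58, -6.7)}

def naive_runs(snlvar, runs_per_win=RUNS_PER_WIN):
    """Convert wins to runs as if the two stats shared a baseline."""
    return snlvar * runs_per_win

# Gap between the naive runs figure and the quoted VORP, in runs
gaps = {name: round(naive_runs(s) - v, 1) for name, (s, v) in tiers.items()}
```

For these two tiers the naive figure lands roughly a dozen runs above the quoted VORP (11.9 for Bad, 12.5 for Superbad), which is the mismatch Ira noticed.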

Reader G.G. suggested a slightly different system:

I performed a player analysis for a separate project on hitters who started their career after 1994 and had at least 1,000 plate appearances (sorted by career OPS). I think the total number of eligible players came out around 600.

I also wanted to define players by the term of ‘good,’ ‘average,’ and ‘bad.’ I had to manually look up each player’s statistics, so I had time constraints and I didn’t want to look up all 600 players and divide them into thirds like you did. Instead, I looked up the stats for the top 60, the middle 60, and the bottom 60 and used these terms to define good, average and bad. The good group was obviously very good, and the bad group was not completely horrible because they all had at least 1,000 big league PA.

So you may get more accurate definitions if you divide the players into groups. Your definition of good and average is separated by small percentage points right at the 1/3 mark, while mine has a bigger gap between the two groups, so I think there may be less variation in my method… although I wonder if they would come out the same. I’d be curious if you divided your players at the 10-30th percentiles, the 40-60th percentiles, and the 70-90th percentiles, and you came out with the same averages as you did in the groups separated by thirds. Anyways, this is mostly my own curiosity but I think this division of players may be a better method of defining the terms ‘good,’ ‘average,’ and ‘bad.’

For those interested in seeing the difference between the system we had set up (which averages out the sample by thirds) and the system G.G. suggests, I’ve marked the measurements using G.G.’s system “Middle Good,” “Middle Average,” and “Middle Bad”:

              ERA      SNLVAR
Elite        3.14        6.43
Good         3.47        5.61
Middle Good  3.65        5.16
Average      4.40        3.23
Middle Avg.  4.42        3.22
Middle Bad   5.30        1.48
Bad          5.50        1.18
Superbad     6.28        0.58

As G.G. predicted, the scales are a little less spread out using his system. The average is essentially unchanged, but the definitions of “Good” and “Bad” get a little closer together. To get a feel for how that affects the statistic as language, let’s look at lists of pitchers from last year. Under the definition of “Good” I used last time, a good pitcher by SNLVAR would be Mark Buehrle, Tom Glavine, Justin Verlander, or Carlos Zambrano. Under G.G.’s definition of “Good,” the pitchers who fit most closely would be Tom Gorzelanny and Daisuke Matsuzaka. All six of those pitchers were good last season, but which group would exemplify that concept best? Under the existing system, guys like Byung-Hyun Kim and Andy Sonnanstine would have been typical “Bad” pitchers last year. G.G.’s proposal would make the typical bad pitcher Edwin Jackson, Lenny DiNardo, or Claudio Vargas. Which group do you think fits the definition of “bad” best?
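
The difference between the two ways of slicing the sample can be sketched in Python. This is a toy comparison under my own assumptions, not the calculation behind the table: `band_mean`, `thirds_scale`, and `middle_band_scale` are hypothetical names, the input is assumed to be a best-to-worst list of at least ten composite values, and the sample data is an evenly spaced dummy series rather than real pitchers.

```python
from statistics import mean

def band_mean(ranked, lo_pct, hi_pct):
    """Average the slice of a best-to-worst list between two percentiles."""
    n = len(ranked)
    lo = n * lo_pct // 100
    hi = n * hi_pct // 100
    return mean(ranked[lo:hi])

def thirds_scale(ranked):
    """Good/Average/Bad as the top, middle, and bottom thirds."""
    n = len(ranked)
    third = n // 3
    return {
        "Good": mean(ranked[:third]),
        "Average": mean(ranked[third:n - third]),
        "Bad": mean(ranked[n - third:]),
    }

def middle_band_scale(ranked):
    """G.G.'s suggestion: the 10-30th, 40-60th, and 70-90th percentile bands."""
    return {
        "Middle Good": band_mean(ranked, 10, 30),
        "Middle Average": band_mean(ranked, 40, 60),
        "Middle Bad": band_mean(ranked, 70, 90),
    }

# Dummy, evenly spaced values (30 best ... 1 worst), not real stat lines.
toy_ranked = list(range(30, 0, -1))
by_thirds = thirds_scale(toy_ranked)
by_bands = middle_band_scale(toy_ranked)
```

On the dummy series the middle stays put while the Good and Bad values move toward it, the same direction of change as in the table above. Real stat distributions are not evenly spaced, so the size of the shift will differ.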

For the moment, I’ll stick with my modified Burke/Goldstein method as I turn to looking at relievers, this time using a sample of the top 200 relievers over each of the past three years, by relief innings pitched. Relief statistics can be a bit of a hodgepodge: some, like the save, are of dubious value and have a deleterious effect on how the game is managed. Others, like ERA, lose integrity in a situation where a reliever is expected to keep his predecessors’ baserunners from scoring, as well as his own. So rather than tell you what’s a good number of saves for a reliever, I’ll focus on our reliever evaluation tools, which we discussed in this space last year. We’ll start with Adjusted Runs Prevented (ARP):

              ARP    SV     IP     BB      K    VORP
Elite        21.5   13.7   71.6   22.3   66.3   25.2
Good         17.0   11.1   69.3   23.4   62.0   21.7
Average       4.7    3.7   59.9   23.7   46.5   10.1
Bad          -5.9    2.6   50.5   24.5   41.3    1.7
Superbad    -11.1    3.2   53.1   24.1   41.3   -3.2

Adjusted Runs Prevented are measured above and below the league average, so it’s disappointing to see a substantial positive number in the middle of our scale (for those of you wondering, Reader G.G.’s method would give us an average ARP of 4.5 in the 40-60th percentiles). Let’s see how the scale looks for Win Expectation Above Replacement, Lineup-Adjusted (WXRL):

             WXRL     SV     IP     BB      K    VORP
Elite        4.09    21.2   73.0   23.6   70.5   24.6
Good         3.04    13.4   69.8   24.1   63.2   20.7
Average      0.82     2.4   60.9   23.8   47.5    9.7
Bad         -0.25     1.7   53.7   23.7   39.2    3.1
Superbad    -0.84     2.9   54.4   24.9   40.6    0.1

Again, as Ira Blum noted above, there’s some disagreement with regard to the replacement level between the win expectation-based statistic and VORP. Since ERA isn’t of much use for relievers, we’ll use Fair Runs Allowed (FRA). FRA for relievers shows us how many runs the reliever would have allowed if his bequeathed runners scored at a league-average rate; in other words, it neutralizes the effect of poor bullpen support behind the reliever:

           Fair RA   WXRL    IP     BB      K    VORP
Elite       2.38     2.70   72.7   24.3   63.3   21.7
Good        2.82     2.45   70.4   24.1   60.3   20.0
Average     4.15     1.08   62.3   24.1   50.0   11.5
Bad         5.92     0.07   51.7   23.4   39.5    1.9
Superbad    6.61    -0.20   48.5   22.7   37.1   -1.8

Next time, we’ll try other alternative methods, and wrap up our look at pitchers’ stats.