December 2, 2011
Prospectus Hit and Run
Resetting the Standard
This time of year is a busy stretch if you're a Hall of Fame buff, or at least this particular Hall of Fame buff. The 2012 BBWAA ballot was released on Wednesday, adding 13 new candidates to the 14 holdovers from last year's ballot. I'll start digging into the details of those candidacies late next week. Meanwhile, the vote on the Golden Era candidates will take place at the Winter Meetings in Dallas this coming Monday, December 5; alas, I think I'm actually going to be in the air when the results are announced, but I'll weigh in upon arrival. Earlier this week, I had the opportunity to discuss some of the Golden Era candidates on television as part of my debut appearance on MLB Network's new show, "Clubhouse Confidential." It wasn't my first time on TV, but I believe it was my first time discussing JAWS in that medium. Explaining the system concisely AND discussing the merits of a handful of candidates in a four-minute span was certainly a challenge, but host Brian Kenny and his producers seemed quite pleased with the segment, and there's reason to believe that it won't be the last time I appear on the show.
Here is a clip of the appearance, if you didn’t get to see it:
Between trying to boil down the system to its essential talking points and continuing a discussion of the top candidates on the Golden Era ballot, I've been thinking about a few issues pertaining to JAWS and wondering if it isn't time for a tweak or two to the system. First off, it's important to recognize that the system just underwent a seismic shift, in that this year's data set marks the first one using Colin Wyers' formulation of WARP instead of Clay Davenport's. Higher replacement levels and different methods of measuring offensive, defensive, and pitching value have shaken up the standings of some candidates relative to the standards, which have shifted as well—after all, they're averages of individual player values. In general, the WARP values for most players are lower, and in some cases very different from what previous iterations or various competing systems have told us.
The changes themselves were a topic for considerable discussion in the comment thread of my Golden Era piece. One loyal reader voiced concern over the underlying changes to WARP and the way that they rendered previous work—not only on JAWS but with respect to a whole lot of other BP studies—outdated, asking us to rename the new methodology to avoid confusion and to maintain the old version as well. This is impractical. Consider how dismissive certain members of the mainstream sports media are of WAR because golly, there are two popular formulations of it out there. Evan Grant, the Dallas Morning News writer who gave Michael Young his sole first place vote in the 2011 AL MVP race, explained his vote by noting, "When somebody can quickly explain the complexities of the concept and standardize the WAR formula, I’ll spend more time with it. In the meantime, I’ll go with what my eyes told me."
After you're done rolling your eyes at that one, consider that here at BP, we've spent the past year retiring or retooling some overlapping metrics so that, say, VORP and RARP aren't telling us two different things when they not only should be saying the same thing, they should be the same thing. No attempt to expand the audience for sabermetrics is going to convince many people if we have to resort to explaining, "This is the original Port Huron WARP, man, not the watered-down second draft." We might as well retire to our bathrobes and drink White Russians, a huge problem because the mere thought of drinking milk makes me gag.
Most of us do not love change, because it decreases our comfort level, threatens our understanding of the world, and forces us to act in new ways and digest new information. I don't love everything about the new WARP relative to the old, and I don't love change in general, but I do think that the new system's improvements are very worthwhile. Colin has incorporated baserunning into WARP, as well as play-by-play defense going back to 1950. He has attempted to tackle some age-old problems that smart people had with WARP, things like the counterintuitive calculations involved in Equivalent Average/True Average, or the assumption that replacement-level players were both replacement-level hitters and replacement-level fielders, when further study has suggested that such hitters are generally average defensively. I'm paraphrasing a more detailed exchange with Colin on the topic here, but formerly, we were measuring each pitcher against a replacement-level pitcher backed by replacement-level fielders, when instead we should have been measuring them relative to a replacement-level pitcher backed by average fielders. The new WARP also measures starters and relievers against different baselines, based upon the fact that the historical record strongly suggests that replacement relievers are better pitchers than replacement starters.
Furthermore, it's important to remember that the "old WARP" of yesteryear was hardly static. It underwent countless revisions without Clay calling a lot of attention to them; if I wanted to do a JAWS study in July, I'd have to get an updated data set because January's was out of date. Some of the revisions were minor, some were jarring; a given set might show that 19th century first basemen like Roger Connor and Dan Brouthers had climbed the first-base rankings significantly, but they'd fall back down to a more typical ranking with a subsequent batch.
Sabermetrics is the pursuit of objective truth about baseball. If our understanding of the truth changes—"Hey, we've been underestimating where the replacement-level line should be set" or, "Pitchers deserve a bit less credit than we have been giving them for controlling balls in play"—we owe it to ourselves and our audience to review our previously-held assumptions and revise our thinking, without worrying too hard about what information we've superseded. At any given moment, our WARP numbers represent our systematic best estimates of value, but they're still just estimates, not permanent figures carved in stone. We make a grave mistake when we think we've found the answer once and for all. As the PITCHf/x and HITf/x stuff that Mike Fast is doing revises our thinking about the nature of DIPS, you can bet that we'll find a way to incorporate that, and likewise with his brilliant catcher framing study. Sooner or later, someone may come along and solve another defensive quandary—maybe it's the adjacent fielder ballhog effect—that will make our current assumptions look dated and naive.
Enough inside baseball, at least on that front. One thing that several people pointed out with regards to Santo's case is that the standards that ran in last week's piece showed third base to have the second-highest JAWS of any position, behind only center field:
The last column is the gap between the position in question and the overall average; leaving catchers out of the equation for the moment, the spread is about 10 points from center field to shortstop. This is not a new phenomenon, though the identity of the particular outliers is. Here's what the previous set of standards looked like:
Again excluding catchers, the spread here is even wider, almost 14 points between second base and first base, though third base wasn't the outlier at that point. Santo was at 62.4 in the old set, 2.9 points above the third-base standard, so that particular point wasn't germane to the discussion the last time I reviewed his candidacy. Now it is, given a JAWS of 58.2, a mark that's 1.4 points below the third-base standard, but 3.5 points higher than the average hitter.
An aside: I've hidden the career and peak columns for the purposes of this demonstration, but as I argued the other day, the fact that Santo has a peak higher than the standard while being short on career means he still has a strong case. Had he hung around two years and squeezed out 3.0 WARP—enough to put him over the JAWS standard—it wouldn't have mattered much either to his teams or to our notion of his greatness. The same is true for certain other candidates, Minnie Minoso being another one. Careers that ended too early due to injury, or that came up short because of time missed to military service or the color line—those are reasons why peak is an important facet of a Hall of Fame argument in the first place, and why I generally use the career/peak/JAWS triumvirate for a three-dimensional picture of a given candidacy.
Another point to be made about Santo is that there are only 11 third basemen in the Hall, and that's if you count Paul Molitor, as I do. Molitor played 791 of his 2,683 games there, and another 644 elsewhere in the infield, compared to 1,171 at DH; his defense (22 FRAA in our current build) had real value, and since there are no "pure" designated hitters in the Hall with which to group him, I've argued that he belongs there. Meanwhile, there are an average of 20 players at the other defensive positions besides catcher, ranging from 18 at first base and center field to 23 in right. What we're measuring Santo against is a small sample size, even relative to the other small sample sizes.
That's one existing problem with the system. Another is the way that the various position rankings are generally top-heavy; of the 143 Hall of Fame hitters for whom we can do JAWS, just 58 (40.6 percent) clear the standard. Conceptually, this doesn't mean that I advocate the ouster of more than half of the Hall of Fame hitters as unworthy; this isn't Lake Wobegon, where all of the children are above average. JAWS is meant to spotlight above-average candidates as a means of improving the Hall of Fame.
Anyway, the lowest percentage of Hall of Famers above the JAWS standard at their position is in right field, where just eight out of 23 are above the bar:
Not surprisingly, all eight of those players above the standard are BBWAA choices, while only four of the 15 players below the standard—Waner, Gwynn, Heilmann, and Keeler—were elected via that route (and yes, I am surprised that Gwynn has fallen). The rest came via one iteration or another of the Veterans Committee, or, as it was known from 1939 to 1949, the Old-Timers Committee. In the current incarnation of JAWS, the lowest score at each position—here, that of McCarthy—is dropped, and then the rest are averaged.
Would removing that dropped player at each position even things out with regards to the spread of standards between positions? Yes, but only a little; the distance between the center field and shortstop standards drops from 10.1 to 9.3 points. Incidentally, this only bumps three more players above the standard at their given positions, one at third base, one at shortstop, and one at catcher.
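For readers who think better in code, the current standard boils down to a modified mean: drop the single lowest JAWS score at the position, then average what's left. Here's a minimal sketch in Python; the scores below are invented for illustration, not actual JAWS figures.

```python
def jaws_standard(scores):
    """Positional standard as currently computed: drop the single
    lowest JAWS score at the position, average the rest."""
    if len(scores) < 2:
        raise ValueError("need at least two players to drop one")
    trimmed = sorted(scores)[1:]  # discard the lowest-ranked player
    return sum(trimmed) / len(trimmed)

# Hypothetical positional scores, with one low outlier at the bottom
rf_scores = [72.1, 65.4, 58.9, 54.2, 49.7, 31.0]
print(round(jaws_standard(rf_scores), 1))  # 60.1
```

Dropping the outlier keeps one Veterans Committee misfire from dragging down the bar for everyone else at the position.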
Over the years, it has been suggested by multiple readers that I use the median score at each position instead of the modified mean. I wrote about this back in 2007, incidentally, in the context of the case of the player who happens to be the 2012 BBWAA ballot's top newcomer, Bernie Williams. A reader argued that because the scores at each position were not normally distributed, and the populations of each position were small, it was inappropriate to use the mean. In his view, the median presented a better alternative; by definition, half of the players at a given position would be above average.
I considered that suggestion, but ultimately rejected it, because one consequence is that it lowers the standard scores too drastically—10 points, in the case of center fielders in the current data set—to the point that a given BBWAA ballot of 30-some-odd candidates might have well over 10 flagged as above the median at their position. Voters can only list 10 players on their ballots, and aside from one crackpot whose ballot I came across years ago, no voter or credible observer of the process has suggested that at a given time there are too many Hall-worthy players to vote for. When the next two classes (Barry Bonds, Roger Clemens, Mike Piazza, Sammy Sosa, Craig Biggio, and Curt Schilling in 2013, and Frank Thomas, Tom Glavine, Mike Mussina, Jeff Kent, and Jim Edmonds in 2014) arrive, that may be the case anyway, but that's a problem for the coming years, and relaxing the standards so drastically won't help. Switching to the median doesn't do much to decrease the spread between the highest standard and the lowest non-catcher standard, either; it would still be 9.3 points.
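The mean-versus-median tradeoff is easy to see on a top-heavy distribution. The scores below are invented for illustration, but they mimic the shape of a typical positional ranking: a few inner-circle greats pull the mean well above the median.

```python
from statistics import mean, median

# Invented, top-heavy positional distribution
scores = [85.0, 78.0, 62.0, 55.0, 52.0, 50.0, 48.0, 47.0, 46.0]

print(round(mean(scores), 1))    # 58.1 -- only the top three clear it
print(round(median(scores), 1))  # 52.0 -- half the position clears it
```

By definition, half of the players at the position sit above the median, which is exactly why using it as the bar would flood a ballot with "above-standard" candidates.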
Ruminating on this for the past week, I've come up with an idea that doesn't entirely solve the problem but reduces it: regress the standard to the average Hall of Fame hitter. Now, this isn't rigorously scientific, in that I don't have any empirical data to tell me how heavily to weight the average to get to a valid sample size. Plonking around for a couple of hours, I settled on a sample size of 23, the existing maximum at any position, for each non-catcher position, meaning that if there are 20 left fielders in the Hall, the standard would be calculated as 20 left fielders plus 3 average Hall hitters divided by 23, or 11 third basemen plus 12 average Hall hitters divided by 23. If I do that, first restoring that lowest-ranked player at each position to the data set to simplify the process, this is where the standards end up:
GE is the set I unveiled last week for the Golden Era ballot. WS is the new weighted set, with (23 - n) average Hall of Fame hitters thrown in. I did that for the catchers as well, taking 85 percent of the average at the other positions to preserve the existing ratio between catcher JAWS and the overall position player JAWS. Note that I’m not doing anything to pitcher JAWS at this point.
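The regression step itself is simple arithmetic. A sketch, using hypothetical numbers (the real inputs are the full positional score lists, and 54.7 is a stand-in for the overall Hall hitter average, not the actual figure):

```python
TARGET_N = 23  # the largest existing population at any position (right field)

def weighted_standard(position_scores, overall_hof_avg):
    """Regress a position's standard toward the average Hall of Fame
    hitter by padding the position out to TARGET_N players with
    copies of the overall average, then taking the mean."""
    n = len(position_scores)
    padding = (TARGET_N - n) * overall_hof_avg
    return (sum(position_scores) + padding) / TARGET_N

# Hypothetical: 11 third basemen who all score 59.6, overall average 54.7,
# so 12 "average Hall hitters" get mixed in
third_base = [59.6] * 11
print(round(weighted_standard(third_base, 54.7), 1))  # 57.0
```

Thinly populated positions like third base get pulled hardest toward the overall average, while a fully populated position like right field is untouched, which is the point of the exercise.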
The spread between the center fielders and the shortstops is just 7.6 points, down from 10.1; the standard deviation of the various positions falls from 3.3 to 2.3. Again, this only bumps three existing Hall of Famers back above the standards at their position; for some reason, there seems to be a donut hole near the center of the rankings at most positions. Consider the center fielders:
I don't think this perfectly solves the multiple issues I've cited, but it does even out the terrain, setting a more reasonable bar for third base and center field, two of the least represented but highest-JAWS positions. Note that I'm actually doing this process for the career and peak scores underneath the hood; this is where the new standards would land:
Relative to the just-reviewed slate of Golden Era candidates, this lands both Santo (66.1/50.3/58.2) and Minoso (61.4/46.1/53.8), two candidates whose peak scores are significantly above average but whose JAWS scores fall a few hairs short, above the bar. That's where I had already concluded they belonged anyway, after considering the extenuating circumstances surrounding their careers and the odd lumps under the standards rug. The above set doesn't alter my conclusions about any of the other Golden Era candidates, none of whom had JAWS scores above 49.9. It may carry ramifications with regards to holdover candidates on the BBWAA ballot, though not to the same extent as those wrought by the new WARP changes in the first place. Word of warning: not all of our pet candidates fare so well in the move.
This is the first major change I’ve made to JAWS since the 2006 ballot, when I redefined peak as best seven seasons instead of best five consecutive seasons. Before I go ahead and lock this methodology in for the new cycle, I'd like to mull the change over a bit longer, and consider the feedback I get from colleagues and readers. This article, and my attempt to smooth out some of the rougher edges in JAWS, is a response to previous feedback. If you have a strong opinion on the matter, I'd welcome your comments.