Volatility and consistency are labels routinely used to categorize certain players, and they are perceived to be of much greater value than their actual relationship with results would suggest. Over the last two weeks, my focus has been on discussing these traits in conjunction with the free-agent market, contrasting pitchers who personify each archetype in order to help determine which teams they would best fit. Initially, I concluded that volatile pitchers should attract volatile teams, with the analogous pairing holding for consistency. This was not to suggest that interflaky relationships would be suboptimal, but rather that, for example, Joe Blanton is of much greater utility to the Phillies than Joel Pineiro would be, with the inverse holding true for those pesky Mets.
An issue arose in that, while the above generalization is valid, it assumed that past flakiness carries over into the future, which is not necessarily true: for example, Jon Garland’s level of consistency for the Padres in 2010 should not be automatically assumed simply because he was consistent from 2007-09. A further study, delivered in back-of-the-envelope fashion last week, put this assumption under the microscope. The methodology entailed culling all four-year spans between 1974 and 2006 and binning them based on the standard deviation of park-adjusted ERA over the first three seasons. Using the median as the dividing line, the spans with the highest 50 percent of standard deviations were classified as volatile, with the bottom 50 percent falling into the consistency bucket.
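For readers who want to see the median split in concrete terms, here is a minimal sketch. The spans are hypothetical, and both the use of the population standard deviation and the handling of ties at the median are my assumptions rather than details from last week’s study.

```python
from statistics import median, pstdev

def median_split(spans):
    """Label each span 'volatile' or 'consistent' relative to the median SD.

    spans: dict mapping a span label to its three park-adjusted ERAs
    (the first three seasons of a four-year span).
    """
    # Spread of each span's first three seasons (population SD is an assumption).
    sds = {name: pstdev(eras) for name, eras in spans.items()}
    cutoff = median(sds.values())
    # Spans exactly at the median fall into the consistent bucket (an assumption).
    return {name: ("volatile" if sd > cutoff else "consistent")
            for name, sd in sds.items()}

# Hypothetical three-year park-adjusted ERA lines, not real pitchers.
spans = {
    "A": [3.50, 3.55, 3.45],
    "B": [3.00, 4.50, 3.60],
    "C": [4.10, 4.00, 4.20],
    "D": [2.90, 4.80, 3.90],
}
labels = median_split(spans)
print(labels)
```

The fourth season of each span is deliberately left out of the binning step, since it is the season being projected.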
Breaking everything up by the minimum number of starts in each season, I found an inverse relationship between the gap in projectability and the playing time requirements: as the requirements loosened, more pitchers qualified, the talent pool widened, the gap between the two sides grew, and overall performance fell a few notches. Unfortunately, there was an inherent selection bias at work, in that the requirement of four years precluded quick flameouts and those pitchers whose volatility was derived from career-abbreviating injuries. Since volatility, as I’ve been using it, is a broad, all-encompassing term, I should restate that for these purposes we’re concerned with the level of variance among pitchers capable of sticking around for a predetermined span, as opposed to anyone who comes to mind as wildly inconsistent in their statistics or playing time.
The selection bias did not refute the results of last week’s study, but we’re probably all in agreement that the median method needed retooling. The problem with using a 50/50 bucket approach is that most of the sample, regardless of playing time parameters, is bound to hover close to the median: if the median in a group were 1.05, then pitchers with deviations of 1.04 and 1.06 would land in opposite buckets and be compared against one another, while our interest lies in the pitchers we can feel more comfortable claiming to be different from one another. Therefore, breaking the samples into quartiles and quintiles should provide much more telling information. With quartiles, the bins move from two (volatile and consistent) to four: very volatile, volatile, consistent, and very consistent.
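The move from two buckets to four (or five) can be sketched as a rank-based split. The function name, the sample deviations, and the choice to cut by rank into equal-sized bins are all assumptions for illustration; the source does not spell out how the cut points were computed.

```python
def quantile_bins(sds, labels):
    """Assign each span to one of len(labels) equal-sized bins by SD rank.

    sds: dict of span name -> three-year standard deviation.
    labels: bin names ordered from lowest SD to highest SD.
    """
    ranked = sorted(sds, key=sds.get)  # span names, lowest SD first
    n, k = len(ranked), len(labels)
    # Integer arithmetic maps rank i (0..n-1) onto bin index 0..k-1.
    return {name: labels[i * k // n] for i, name in enumerate(ranked)}

QUARTILES = ["very consistent", "consistent", "volatile", "very volatile"]

# Hypothetical deviations, including the 1.04 / 1.06 pair from the text.
sds = {"A": 0.20, "B": 0.35, "C": 1.04, "D": 1.06,
       "E": 1.05, "F": 1.90, "G": 2.40, "H": 0.98}
bins = quantile_bins(sds, QUARTILES)
print(bins)
```

In this toy sample the 1.04 and 1.06 pitchers still sit in adjacent middle bins, but the comparison of interest now pits A and B against F and G. Passing a five-item label list produces the quintile version.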
To test the hypothesis that the middle two quartiles were skewing our data last week, take a look at the quartiles closest to the median in terms of RMSE, standard deviation of fourth-year park-adjusted ERA, and the actual and projected mark in that fourth season:
| Group | Min. GS | Proj. ERA | Actual ERA | RMSE | SD |
|---|---|---|---|---|---|
| Volatile | 20 | 3.86 | 3.86 | 0.779 | 0.871 |
| Consistent | 20 | 3.74 | 3.78 | 0.796 | 0.875 |
| Volatile | 10 | 3.96 | 3.98 | 0.951 | 1.034 |
| Consistent | 10 | 3.83 | 3.93 | 0.942 | 1.005 |
| Volatile | 5 | 4.01 | 4.03 | 1.132 | 1.194 |
| Consistent | 5 | 3.87 | 3.98 | 1.111 | 1.198 |
As you can see, no matter how I sliced it in terms of games-started requirements, these groups are similar to one another, with neither holding much of an advantage over the other in terms of RMSE or the standard deviation of actual results during that fourth season.
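For anyone unfamiliar with the two yardsticks used here, the sketch below computes both for a single group. The ERAs are made up, and the source does not describe how the projections themselves were generated, so the projected values are simply taken as given.

```python
from math import sqrt
from statistics import pstdev

def rmse(projected, actual):
    """Root mean squared error between projected and actual ERA."""
    squared_errors = [(p - a) ** 2 for p, a in zip(projected, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fourth-season park-adjusted ERAs for one bin of pitchers.
projected = [3.80, 4.10, 3.60, 4.40]
actual = [3.50, 4.60, 3.70, 4.30]

group_rmse = rmse(projected, actual)  # how far off the projections were
group_sd = pstdev(actual)             # how spread out the actual results were
print(round(group_rmse, 3), round(group_sd, 3))
```

RMSE measures how hard a group was to project, while the standard deviation measures how widely its actual results ranged; a group can score well on one and poorly on the other.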
What we want to do is compare the “very” groups to one another as a means of figuring out whether or not pitchers deemed very volatile in the past should be expected to be more volatile than their consistent counterparts; moving forward, are they tougher to project? If little difference surfaces, then past results are not an effective basis for decisions centered on volatility and consistency. Here are the more extreme quartiles:
| Group | Min. GS | Proj. ERA | Actual ERA | RMSE | SD |
|---|---|---|---|---|---|
| Very Volatile | 20 | 3.98 | 3.87 | 0.791 | 0.878 |
| Very Consistent | 20 | 3.71 | 3.71 | 0.762 | 0.862 |
| Very Volatile | 10 | 4.17 | 4.08 | 0.976 | 1.053 |
| Very Consistent | 10 | 3.78 | 3.82 | 0.902 | 0.981 |
| Very Volatile | 5 | 4.32 | 4.19 | 1.240 | 1.318 |
| Very Consistent | 5 | 3.80 | 3.88 | 1.093 | 1.172 |
Looking at this table, three points stand out:
- Very Volatile pitchers outperformed their projections in each instance, while the consistent pitchers fell in line with or performed worse than their projections.
- As expected, the standard deviations of each group, and the deltas between the groups, tightened up as the playing time requirements grew more strict.
- The Very Consistent pitchers displayed much more “dominance” in the form of noticeably lower RMSE results in every playing-time group; there was less error in projecting their performance, and the difference was clinically significant.
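The three points above can be checked directly against the extreme-quartile figures. The sketch below copies those figures and reads the first ERA column as the projected mark and the second as the actual one, which is my reading of the header and the one the first bullet implies.

```python
# (min GS, group) -> (projected ERA, actual ERA, RMSE, SD), from the quartile table.
# Column order is an assumption: first ERA column = projected, second = actual.
extremes = {
    (20, "Very Volatile"): (3.98, 3.87, 0.791, 0.878),
    (20, "Very Consistent"): (3.71, 3.71, 0.762, 0.862),
    (10, "Very Volatile"): (4.17, 4.08, 0.976, 1.053),
    (10, "Very Consistent"): (3.78, 3.82, 0.902, 0.981),
    (5, "Very Volatile"): (4.32, 4.19, 1.240, 1.318),
    (5, "Very Consistent"): (3.80, 3.88, 1.093, 1.172),
}

def gaps(min_gs):
    """Return (volatile ERA miss, consistent ERA miss, RMSE gap, SD gap).

    ERA miss is actual minus projected, so a negative value means the group
    beat its projection. RMSE and SD gaps are volatile minus consistent.
    """
    vv = extremes[(min_gs, "Very Volatile")]
    vc = extremes[(min_gs, "Very Consistent")]
    return (round(vv[1] - vv[0], 3), round(vc[1] - vc[0], 3),
            round(vv[2] - vc[2], 3), round(vv[3] - vc[3], 3))

for min_gs in (20, 10, 5):
    print(min_gs, gaps(min_gs))
```

Under that reading, the Very Volatile ERA miss is negative at every cutoff, the Very Consistent miss is zero or positive, and the standard deviation gap shrinks from 0.146 at five starts to 0.016 at 20.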
Before moving on to a conclusion, what happens if we break the samples down even further, this time compiling results for quintiles: Very Volatile, Volatile, Volatile/Consistent, Consistent, and Very Consistent? This should inch us even closer to the extremes of the group, and it’s worth a look:
| Group | Min. GS | Proj. ERA | Actual ERA | RMSE | SD |
|---|---|---|---|---|---|
| Very Volatile | 20 | 3.98 | 3.87 | 0.793 | 0.873 |
| Very Consistent | 20 | 3.70 | 3.73 | 0.756 | 0.875 |
| Very Volatile | 10 | 4.21 | 4.09 | 0.996 | 1.069 |
| Very Consistent | 10 | 3.77 | 3.82 | 0.898 | 0.987 |
| Very Volatile | 5 | 4.39 | 4.25 | 1.196 | 1.282 |
| Very Consistent | 5 | 3.79 | 3.88 | 1.119 | 1.203 |
Inspecting this table, we can see results similar to those in the quartile section. The more volatile pitchers outperform their projections and are more prone to errors between projected and actual park-adjusted ERAs than the consistent hurlers, while the standard deviation of their actual performance is wider in the 10- and 5-start groups and essentially identical at 20 starts (0.873 versus 0.875). The consistent pitchers once again perform worse than their projected marks, but are easier to project and, outside of the 20-start group, boast less rangy distributions. As for why the volatile pitchers perform better than their projections while the opposite is true of the consistent hurlers, commenter TheRealNeal offered up a perfectly rational reason, in that a cause-and-effect explanation is inherent: If pitchers prove volatile in the wrong direction, they no longer start and will therefore miss the cutoff, whereas consistent pitchers who happen to underperform will be given much more slack due to their prior track record. This might not be the entire cause of the phenomenon, but it seems valid.
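The cutoff effect TheRealNeal describes is easy to see with a toy example. Everything below is hypothetical: five volatile pitchers who are all projected at 4.00 and whose actual ERAs average exactly 4.00, with the two worst performers losing their rotation spots.

```python
def mean_gap(pitchers, min_starts):
    """Mean of actual minus projected ERA among pitchers meeting the cutoff.

    This is an unweighted average over pitchers (an assumption), and a
    negative value means the surviving group beat its projections.
    """
    kept = [p for p in pitchers if p["starts"] >= min_starts]
    return sum(p["actual"] - p["projected"] for p in kept) / len(kept)

# Hypothetical volatile pitchers; the projection is unbiased across all five.
volatile = [
    {"projected": 4.00, "actual": 3.20, "starts": 30},
    {"projected": 4.00, "actual": 3.70, "starts": 28},
    {"projected": 4.00, "actual": 4.00, "starts": 25},
    {"projected": 4.00, "actual": 4.30, "starts": 12},  # lost his rotation spot
    {"projected": 4.00, "actual": 4.80, "starts": 4},   # pulled early
]

for cutoff in (0, 10, 20):
    print(cutoff, round(mean_gap(volatile, cutoff), 3))
```

With no cutoff the group matches its projection on average; once a starts minimum removes the pitchers who were volatile in the wrong direction, the survivors appear to outperform. This illustrates the mechanism only and says nothing about how large the effect is in the actual sample.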
What do the results above tell us? When talking about consistency and volatility, it does seem that past results can be a clear indicator of what to expect, but only when dealing with the more extreme circumstances. Someone who could conceivably be interchangeable between both bins, and who does not appear to sway toward one particular direction, should not be expected to fall into a specific category moving forward. However, the very consistent will likely remain that way, and the same goes for the very volatile.
From the get-go, there was little debate as to the merits of a volatile team like the Mets seeking the services of a volatile pitcher like Joel Pineiro. Now, though, I am more comfortable asserting that certain pitchers can be expected to remain volatile or consistent once in their new digs. Again, these studies provide information only for pitchers capable of sticking around for a four-year span, so they may not tell the entire story. But when trying to determine whether the likes of a Garland or Pineiro will stick to their past moving forward, aren’t we more interested in the pitchers included in this sample as the basis of a working comparison than in the flameouts anyway?