September 24, 2010
Ahead in the Count
Predicting Strikeouts with Whiff and Swing Rates
When I wrote about pitchers with major divides between their ERAs and SIERAs two weeks ago, a reader inquired why Clay Buchholz had such a pedestrian strikeout rate while having an above average swinging-strike rate. Buchholz has mustered just 6.2 K/9, nearly a full strikeout below the 7.1 league average, but has induced batters to swing and miss on 9.5 percent of his pitches according to FanGraphs, a full percentage point above the 8.5 percent league average. The question was apparent: Do pitchers who get a lot of whiffs increase their strikeout rates over time?
The question is a logical one to ask. Inducing a swing and a miss could be more indicative of skill than getting an umpire to call a strike on a pitch that a hitter opted to take. The most notorious strikeout kings of all time have always been those who can get hitters to swing and miss. The pitchers with the top 10 swinging-strike rates in 2010, according to FanGraphs, all have high strikeout rates:
All are above average, and the top five strike out more than a hitter per inning. Clearly, swinging strikes are highly correlated with strikeouts—in fact, they have a correlation of .84 among starting pitchers.
Swinging strikes are also heavily correlated year-to-year. Among all pitchers with at least 80 innings as starting pitchers from 2002-09, there was a .79 correlation for swinging-strike rate, slightly above the .77 correlation for strikeout rate (SO/PA) itself.
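As a rough sketch of how a year-to-year correlation like this can be computed, the snippet below pairs each pitcher-season with the same pitcher's following season and takes the Pearson correlation of the two columns. The data layout, the `swstr` key, and the numbers are made up for illustration; they are not the FanGraphs data, and the 80-inning filter is assumed to have been applied beforehand.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def year_to_year_pairs(seasons, stat):
    """Pair each pitcher-season with the same pitcher's next season.

    `seasons` maps (pitcher, year) -> dict of stats. Only consecutive
    years that both appear in the data are paired.
    """
    this_year, next_year = [], []
    for (pitcher, year), stats in seasons.items():
        following = seasons.get((pitcher, year + 1))
        if following is not None:
            this_year.append(stats[stat])
            next_year.append(following[stat])
    return this_year, next_year

# Made-up pitcher-seasons for illustration only (not real data).
seasons = {
    ("A", 2007): {"swstr": 0.110}, ("A", 2008): {"swstr": 0.105},
    ("B", 2007): {"swstr": 0.070}, ("B", 2008): {"swstr": 0.074},
    ("C", 2007): {"swstr": 0.090}, ("C", 2008): {"swstr": 0.085},
    ("D", 2008): {"swstr": 0.060},  # no adjacent season, so it stays unpaired
}
x, y = year_to_year_pairs(seasons, "swstr")
r = pearson(x, y)
print(f"{len(x)} paired seasons, year-to-year r = {r:.2f}")
```

The same function works for strikeout rate by swapping the stat key, which is how the two year-to-year figures can be compared on equal footing.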
I ran a regression of strikeout rate on the previous year’s swinging-strike rate, controlling for age and year, and found that a one percentage point increase in swinging-strike rate correlated with a 1.55 percentage point increase in SO/PA the following year, which was extremely significant. Thus, in the absence of information about the previous year's strikeout rate, knowing that a pitcher had more swinging strikes implies they likely had more strikeouts the following year.
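A regression of this shape can be sketched in plain Python with ordinary least squares. This is not the code behind the article's estimate: the sample is synthetic and built from a known slope (1.5, chosen arbitrarily) so that the fit can be checked, age enters as a single linear term, and the year control is reduced to one hypothetical era dummy.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular (collinear columns?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Assumed "true" coefficients used only to generate the synthetic sample:
# intercept, previous-year swinging-strike rate, age (centered at 27), era dummy.
TRUE_BETA = [0.05, 1.5, -0.001, 0.004]

# (previous-year swinging-strike rate, age, late-era dummy); made-up values.
sample = [
    (0.060, 24, 0), (0.080, 31, 0), (0.095, 27, 1), (0.110, 29, 1),
    (0.072, 35, 0), (0.101, 23, 1), (0.088, 26, 0), (0.067, 30, 1),
]
rows = [[1.0, sw, float(age - 27), float(late)] for sw, age, late in sample]
# Next-year SO/PA generated without noise, so OLS should recover TRUE_BETA.
y = [sum(b * v for b, v in zip(TRUE_BETA, row)) for row in rows]

beta = ols(rows, y)
print(f"slope on previous-year swinging-strike rate: {beta[1]:.3f}")
```

With real data the response would be noisy, and a statistics package would also report standard errors, which is what a significance claim like the one above rests on.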
Interestingly, the swinging SO/PA and called SO/PA do a fairly good job of predicting each other. The correlation between called SO/PA and swinging SO/PA the following year is .26, and the correlation between swinging SO/PA and called SO/PA the following year is .25.
The question remains whether swinging strikes provide any information beyond what strikeouts already do, so I ran a regression (again controlling for age and year) of strikeout rates on the previous year’s strikeout rate and the previous year’s swinging-strike rate, and found an interesting result.
This implies that the extra information that swinging-strike rate provides, once the previous year’s strikeout rate is already known, is not very useful at all. For every one percentage point above average in the previous year’s strikeout rate, the following year’s strikeout rate is likely to be about 0.73 percentage points above average. However, among pitchers with the same strikeout rate the previous year, a pitcher with a swinging-strike rate one percentage point higher will have a strikeout rate only 0.12 percentage points higher, which is not statistically significant. The value added by this information is virtually nil.
(Note that 2005-07 is not included as a coefficient. Those familiar with regression analysis will recall the coefficients for 2002-04 and for 2008 are both measured relative to the 2005-07 effect.)
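To make the two slopes concrete, here is a small worked example using the 0.73 and 0.12 coefficients quoted above. It holds age and year fixed and ignores the intercept, and the function name is hypothetical. Keep in mind that the 0.12 coefficient is not statistically significant, so the second term is mostly noise.

```python
def expected_k_rate_deviation(k_dev_prev, swstr_dev_prev, b_k=0.73, b_swstr=0.12):
    """Expected next-year SO/PA, in percentage points above average.

    k_dev_prev:     previous-year SO/PA, percentage points above average
    swstr_dev_prev: previous-year swinging-strike rate, percentage points
                    above a pitcher with the same SO/PA
    The default slopes are the coefficients quoted in the article.
    """
    return b_k * k_dev_prev + b_swstr * swstr_dev_prev

# Two pitchers who were both 2 points above average in SO/PA last year;
# the second also had a swinging-strike rate 1 point higher.
same_whiffs = expected_k_rate_deviation(2.0, 0.0)
more_whiffs = expected_k_rate_deviation(2.0, 1.0)
print(f"{same_whiffs:.2f} vs {more_whiffs:.2f} points above average")
```

The gap between the two pitchers is just the 0.12 slope, which is small next to the 1.46 points carried over from the strikeout rate itself.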
The R² statistic, which measures how much of the variation in the dependent variable (following year’s strikeout rate) can be explained by the variables used, tells the same story. The R² for the regression above is .6118, just a tiny fraction above the .6110 R² for the same regression run without swinging-strike rate.
In other words, the value added by knowing the swinging-strike rate when the strikeout rate is already known is less than a tenth of a percent of the differences in players’ strikeout rates the following year.
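The arithmetic behind that comparison is short enough to spell out. The sketch below defines R-squared in the usual way (one minus the ratio of residual to total sum of squares) and then takes the difference between the two values reported above; the toy series at the end is made up purely to sanity-check the function.

```python
def r_squared(actual, fitted):
    """Share of variance in `actual` explained by `fitted`: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    return 1.0 - ss_res / ss_tot

# The two R-squared values reported in the article.
R2_WITH_SWSTR = 0.6118
R2_WITHOUT_SWSTR = 0.6110

gain = R2_WITH_SWSTR - R2_WITHOUT_SWSTR
print(f"added share of variance explained: {gain:.4f}")
print(f"as a percentage of total variance: {gain * 100:.2f}%")

# Sanity check on made-up strikeout rates: a perfect fit scores 1,
# and predicting the mean for everyone scores 0.
toy = [0.15, 0.18, 0.21, 0.24]
perfect = r_squared(toy, toy)
baseline = r_squared(toy, [sum(toy) / len(toy)] * len(toy))
```

Adding a variable can never lower R-squared in an ordinary least squares fit, so a gain this small is about as close to "no new information" as the statistic can show.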
Running the same regression on pitchers who were less than 28 years old in the first year actually reduced the coefficient to a statistically insignificant negative number (-.169), suggesting that swinging-strike rate for younger pitchers provides no additional information that the strikeout rate does not already provide.
I decided to check whether getting more of one’s strikeouts as swinging strikes was helpful in predicting which direction a pitcher’s strikeout rate was headed, and found that this was not useful either. I ran a regression of strikeout rate on the previous year’s strikeout rate, dummies to control for year and age, and swinging-strikeout rate and got the following results:
In other words, knowing how much of a pitcher’s SO/PA came on swings versus called strikes was not useful at all.
In fact, running an equivalent regression with called SO/PA and swinging SO/PA as separate variables to see this more clearly shows that the coefficients on called and swinging strikeouts are almost exactly the same:
The information provided by knowing the form of those strikeouts is not all that useful.
Does this mean that none of the pitch information that we find in the “Plate Discipline” section on FanGraphs is useful when we have the regular box scores? Answering this requires using the same approach to look at each of the other variables. The following tables show the regression coefficients in a series of regressions on previous year’s strikeout rate, age and year controls, and an alternating statistic in each column. The p-values are in parentheses underneath the coefficients in each cell. Statistically significant coefficients are bolded, while weakly statistically significant coefficients are bolded and italicized.
Swinging Strike% = Percent of pitches thrown that were swung at and missed
Nearly every bit of information that the pitch data gave us was useless. For each of the regressions above, we had the relevant information we needed by knowing the pitcher’s strikeout rate, age, and what year it was. The coefficients on eight of the nine variables are not even weakly statistically significant, but there is one variable that has a weakly significant (positive) effect on predicting the next year’s strikeout rate: O-Swing%, the rate at which pitchers get hitters to chase pitches (supposedly) out of the zone.
Part of the reason that this statistic’s weak significance was so surprising to me is that I did not expect this information to be useful due to measurement error. The information, which FanGraphs obtains from Baseball Info Solutions, is determined by watching each pitch from the center-field camera. Given the issue of parallax, the center-field camera gives a distorted view and the observer can be fooled. This is an important point—the data appears rather questionable when looking at the yearly averages. League average O-Swing% across the years has moved around and mostly increased:
Perhaps pitchers were gradually getting better at inducing batters to swing at the right pitches? It seems unlikely given the change in league average “Zone%,” which describes the percentage of pitches in the strike zone:
It seems far more likely that pitches were increasingly recorded as out of the strike zone over the years: the share of pitches recorded in the zone fell just as the share of swings at pitches supposedly out of the zone rose. The overall swing rate has only ranged from 45.3 to 46.5 percent over the years, so it is more plausible that, later in the decade, more of those swings were classified as coming on pitches out of the zone than that hitters were chasing more pitches while pitchers threw a roughly offsetting number of additional pitches out of the zone.
I thought that normalizing the data might lead to stronger results, by measuring the O-Swing% relative to league average. This did not work:
The O-Swing% relative to the league average is now useless. Chances are this is because of the reason Baseball Prospectus' Colin Wyers expressed concerns with the data—the year to year fluctuations in league average O-Swing% are probably a result of moving center-field cameras. The average for each park is probably vastly different and using a league-average effect is probably not very useful.
None of the other statistics as measured relative to the league average yielded remotely significant coefficients when included in regressions, either.
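For readers who want to try the league-average adjustment themselves, a minimal sketch follows. It assumes the adjustment is a simple difference from a pitch-weighted league average for the same season; the article does not say whether a difference or a ratio was used. The record layout, the `o_swing` and `pitches` keys, and all values are made up for illustration.

```python
from collections import defaultdict

def league_average_by_year(records, stat, weight):
    """Weighted league average of `stat` for each year, weighted by `weight`."""
    totals = defaultdict(float)
    weights = defaultdict(float)
    for rec in records:
        totals[rec["year"]] += rec[stat] * rec[weight]
        weights[rec["year"]] += rec[weight]
    return {year: totals[year] / weights[year] for year in totals}

def relative_to_league(records, stat, weight):
    """Return copies of the records with `<stat>_rel` = stat minus league average."""
    averages = league_average_by_year(records, stat, weight)
    adjusted = []
    for rec in records:
        new_rec = dict(rec)
        new_rec[stat + "_rel"] = rec[stat] - averages[rec["year"]]
        adjusted.append(new_rec)
    return adjusted

# Made-up pitcher-seasons (not real data), with a league-wide drift upward.
records = [
    {"pitcher": "A", "year": 2004, "o_swing": 0.18, "pitches": 3000},
    {"pitcher": "B", "year": 2004, "o_swing": 0.22, "pitches": 1000},
    {"pitcher": "A", "year": 2009, "o_swing": 0.27, "pitches": 2000},
    {"pitcher": "B", "year": 2009, "o_swing": 0.23, "pitches": 2000},
]
averages = league_average_by_year(records, "o_swing", "pitches")
adjusted = relative_to_league(records, "o_swing", "pitches")
for rec in adjusted:
    print(rec["pitcher"], rec["year"], f"{rec['o_swing_rel']:+.3f}")
```

An adjustment like this removes only a league-wide shift. If the recording bias differs from park to park, as suggested above, a single yearly average cannot correct for it.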
The most promising pitch data is the rate at which pitchers can get hitters to chase pitches out of the strike zone. The pitchers that tend to do so are more likely to see their strikeout rates increase the following year. However, the measurement error in these statistics is currently so large that it is difficult to glean any major insight from them. Chances are that this information could be more useful if measured more scientifically, and this could be one of the areas where pitch data could move our understanding of baseball forward.
However, the most important information to take away from this article is that even more objective statistics like swinging-strike rate, swing rate, and contact rate, as well as called versus swinging strikeout rates, add very little beyond what the pitcher's strikeout rate already tells you.
Of course, strikeout rate for pitchers is one of the quickest to stabilize among all baseball statistics, and so the added value of information beyond knowing historical strikeout rate is least likely to be significant for strikeout rate as compared with any other statistic. Thus, next week I will look at walk rates and attempt to determine whether this type of information can inform our knowledge about walk rates any more than it could have informed us about strikeout rates.