When I wrote about pitchers with major divides between their ERAs and SIERAs two weeks ago, a reader asked why Clay Buchholz had such a pedestrian strikeout rate despite an above-average swinging-strike rate. Buchholz has mustered just 6.2 K/9, nearly a full strikeout below the 7.1 league average, but has induced batters to swing and miss on 9.5 percent of his pitches according to FanGraphs, a full percentage point above the 8.5 percent league average. The question was apparent: Do pitchers who get a lot of whiffs increase their strikeout rates over time?
The question is a logical one to ask. Inducing a swing and a miss could be more indicative of skill than getting an umpire to call a strike on a pitch that a hitter opted to take. The most notorious strikeout kings of all time have always been those who can get hitters to swing and miss. The top 10 swinging-strike rates in 2010, according to FanGraphs, belong to pitchers with high strikeout rates:
| Swinging Strike % | K/9 |
| --- | --- |
| 12.5 | 9.54 |
| 11.9 | 9.29 |
| 11.8 | 9.11 |
| 11.1 | 9.62 |
| 11.0 | 9.62 |
| 10.9 | 7.72 |
| 10.8 | 9.38 |
| 10.8 | 7.37 |
| 10.7 | 8.52 |
| 10.5 | 8.39 |
All are above average, and the top five strike out more than a hitter per inning. Clearly, swinging strikes are highly correlated with strikeouts—in fact, they have a correlation of .84 among starting pitchers.
Swinging strikes are also heavily correlated year to year. Among all pitchers with at least 80 innings as starting pitchers from 2002-09, there was a .79 correlation for swinging-strike rate, slightly above the .77 correlation for strikeout rate (SO/PA) itself.
Further evidence that this information could be useful is that swinging strikeouts per PA have a .77 correlation year to year, while called strikeouts per PA have a .59 correlation.
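For readers who want to reproduce this kind of year-to-year correlation, here is a minimal sketch of the bookkeeping involved. It assumes each pitcher-season is stored under a (pitcher, year) key and pairs only consecutive seasons; the pitcher labels and rates are hypothetical placeholders, not the FanGraphs data used above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def year_to_year_pairs(seasons):
    """Pair each pitcher-season with the same pitcher's next season.

    `seasons` maps (pitcher, year) -> rate; only consecutive years pair up.
    """
    pairs = []
    for (pitcher, year), rate in seasons.items():
        next_rate = seasons.get((pitcher, year + 1))
        if next_rate is not None:
            pairs.append((rate, next_rate))
    return pairs

# Hypothetical swinging-strike rates, NOT real data.
swstr = {
    ("A", 2007): 0.110, ("A", 2008): 0.105, ("A", 2009): 0.112,
    ("B", 2007): 0.070, ("B", 2008): 0.078,
    ("C", 2008): 0.090, ("C", 2009): 0.085,
}
pairs = year_to_year_pairs(swstr)
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
print(f"{len(pairs)} season pairs, year-to-year r = {r:.2f}")
```

The same two helpers would apply unchanged to SO/PA, swinging SO/PA, or called SO/PA; only the input dictionary differs.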
I ran a regression of strikeout rate on the previous year’s swinging-strike rate, controlling for age and year, and found that a one percentage point increase in swinging-strike rate correlated with a 1.55 percentage point increase in SO/PA the following year, which was extremely significant. Thus, in the absence of information about the previous year's strikeout rate, knowing that a pitcher had more swinging strikes implies he likely had more strikeouts the following year.
Interestingly, the swinging SO/PA and called SO/PA do a fairly good job of predicting each other. The correlation between called SO/PA and swinging SO/PA the following year is .26, and the correlation between swinging SO/PA and called SO/PA the following year is .25.
The question remains whether swinging strikes provide any information beyond what strikeouts already do, so I ran a regression (again controlling for age and year) of strikeout rate on the previous year’s strikeout rate and the previous year’s swinging-strike rate, and found an interesting result.
| Variable | Coefficient | P-Stat |
| --- | --- | --- |
| Constant | .0606 | .000 |
| SO/PA | .7294 | .000 |
| Year 2002-04 | .0056 | .026 |
| Year 2008 | .0060 | .074 |
| Age | .0010 | .000 |
| Swinging Strike% | .1236 | .251 |
This implies that the extra information that swinging-strike rate provides, once the previous year’s strikeout rate is already known, is not very useful at all. For every one percentage point above average in the previous year’s strikeout rate, the following year’s strikeout rate is likely to be about 0.73 percentage points above average. However, among pitchers with the same strikeout rate the previous year, a pitcher with a one percentage point higher swinging-strike rate will have only a 0.12 percentage point higher strikeout rate, which is not statistically significant. The value added by this information is virtually nil.
(Note that 2005-07 is not included as a coefficient. Those familiar with regression analysis will recall the coefficients for 2002-04 and for 2008 are both measured relative to the 2005-07 effect.)
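To make the omitted-category point concrete, here is a small sketch of how the year controls could be coded. The function name is hypothetical, and the assumption that first-year seasons run from 2002 through 2008 is mine, inferred from the three year groups mentioned above.

```python
def year_dummies(year):
    """Return (is_2002_04, is_2008) indicator variables for a first-year season.

    2005-07 is the omitted baseline: both indicators are 0 for those years,
    so that period's effect is absorbed into the regression constant.
    Assumes first-year seasons span 2002-2008 (an assumption, not a stated fact).
    """
    if not 2002 <= year <= 2008:
        raise ValueError("first-year seasons are assumed to span 2002-2008")
    return (1 if year <= 2004 else 0, 1 if year == 2008 else 0)

for season in (2003, 2006, 2008):
    print(season, year_dummies(season))
```

Including a third indicator for 2005-07 alongside the constant would make the columns perfectly collinear, which is why one group has to be left out.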
The R² statistic, which measures how much of the variation in the dependent variable (the following year’s strikeout rate) can be explained by the variables used, tells the same story. The R² for the regression above is .6118, just a tiny fraction above the .6110 R² for the same regression run without swinging-strike rate.
In other words, knowing the swinging-strike rate when the strikeout rate is already known explains less than a tenth of a percent of the differences in players’ strikeout rates the following year.
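The comparison of .6118 against .6110 is a nested-model check: fit the regression with and without the extra variable and look at the gain in explained variation. Below is a rough sketch of that procedure using a hand-rolled least-squares fit so it runs without any statistics package. The data rows are hypothetical, the age and year controls are left out for brevity, and the function names are my own.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols_r2(rows, y):
    """Fit y on the predictors in `rows` (intercept added) and return R-squared."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical (previous SO/PA, previous swinging-strike rate, next SO/PA) rows.
data = [
    (0.160, 0.075, 0.165), (0.185, 0.090, 0.180), (0.210, 0.095, 0.205),
    (0.140, 0.070, 0.150), (0.230, 0.115, 0.220), (0.175, 0.100, 0.178),
    (0.195, 0.080, 0.190), (0.250, 0.120, 0.236),
]
y = [d[2] for d in data]
r2_base = ols_r2([(d[0],) for d in data], y)
r2_full = ols_r2([(d[0], d[1]) for d in data], y)
print(f"R-squared without swinging strikes: {r2_base:.4f}")
print(f"R-squared with swinging strikes:    {r2_full:.4f}")
print(f"Incremental R-squared:              {r2_full - r2_base:.4f}")
```

Adding a variable can never lower R² in a least-squares fit, so the interesting quantity is the size of the gain, which in the article's regressions is .0008.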
Running the same regression on pitchers who were younger than 28 in the first year actually reduced the coefficient to a statistically insignificant negative number (-.169), suggesting that swinging-strike rate for younger pitchers provides no additional information that the strikeout rate does not already provide.
I decided to check whether getting more of one’s strikeouts on swinging strikes was helpful in predicting which direction a pitcher’s strikeout rate was headed, and found that this was not useful either. I ran a regression of strikeout rate on the previous year’s strikeout rate, dummies to control for year and age, and swinging-strikeout rate, and got the following results:
| Variable | Coefficient | P-Stat |
| --- | --- | --- |
| Constant | .0633 | .000 |
| SO/PA | .7551 | .000 |
| Year 2002-04 | .0045 | .053 |
| Year 2008 | .0058 | .087 |
| Age | .0010 | .000 |
| Swinging SO/PA | .0266 | .752 |
In other words, knowing how much of a pitcher’s SO/PA came on swings versus called strikes was not useful at all.
In fact, running an equivalent regression with called SO/PA and swinging SO/PA as separate variables makes this clearer, showing that the coefficients on called and swinging strikeouts are almost exactly the same:
| Variable | Coefficient | P-Stat |
| --- | --- | --- |
| Constant | .0633 | .000 |
| Year 2002-04 | .0045 | .053 |
| Year 2008 | .0058 | .087 |
| Age | .0010 | .000 |
| Called SO/PA | .7551 | .000 |
| Swinging SO/PA | .7817 | .000 |
The information provided by knowing the form of those strikeouts is not all that useful.
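The two regressions are the same model written two ways, which is why the constant and the controls do not change. Assuming every strikeout is either called or swinging, so that SO/PA equals called SO/PA plus swinging SO/PA, a coefficient b on SO/PA and c on swinging SO/PA can be rewritten as b on called SO/PA and b + c on swinging SO/PA. A quick check with the coefficients reported above (the function name is hypothetical):

```python
def split_coefficients(total_coef, swinging_extra_coef):
    """Convert (SO/PA, swinging SO/PA) coefficients into (called, swinging).

    Because SO/PA = called SO/PA + swinging SO/PA (assumed),
        b * SO/PA + c * swinging = b * called + (b + c) * swinging.
    """
    return total_coef, total_coef + swinging_extra_coef

called, swinging = split_coefficients(0.7551, 0.0266)
print(f"called: {called:.4f}, swinging: {swinging:.4f}")
# prints "called: 0.7551, swinging: 0.7817"
```

The .0266 gap between the two coefficients is exactly the statistically insignificant swinging SO/PA coefficient from the earlier regression.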
Does this mean that none of the pitch information that we find in the “Plate Discipline” section on FanGraphs is useful when we have the regular box scores? Answering this requires using the same approach to look at each of the other variables. The following table shows the regression coefficients in a series of regressions on the previous year’s strikeout rate, age and year controls, and a different pitch statistic in each column. The P-Stats are in parentheses alongside the coefficients in each cell. Statistically significant coefficients are bolded, while weakly statistically significant coefficients are bolded and italicized.
| Variable | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) | Coef (P-Stat) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Constant | .0606 (.000) | .0526 (.004) | .0808 (.000) | .1007 (.057) | .0748 (.000) | .0037 (.946) | .0536 (.000) | .0470 (.122) | .0449 (.034) |
| SO/PA | .7294 (.000) | .7736 (.000) | .7754 (.000) | .7458 (.000) | .7586 (.000) | .8132 (.000) | .7563 (.000) | .7764 (.000) | .7704 (.000) |
| Year '02-'04 | .0056 (.026) | .0045 (.056) | .0040 (.091) | .0052 (.038) | .0054 (.030) | .0029 (.280) | .0022 (.399) | .0051 (.044) | .0046 (.048) |
| Year '08 | .0060 (.074) | .0058 (.084) | .0054 (.110) | .0060 (.078) | .0067 (.055) | .0058 (.084) | .0041 (.246) | .0062 (.073) | .0058 (.087) |
| Age | .0010 (.000) | .0010 (.000) | .0010 (.000) | .0010 (.000) | .0010 (.000) | .0010 (.000) | .0010 (.000) | .0010 (.000) | .0010 (.000) |
| Swinging Strike% | .1236 (.251) | | | | | | | | |
| F-Strike% | | .0201 (.496) | | | | | | | |
| Zone% | | | .0341 (.327) | | | | | | |
| Contact% | | | | .0399 (.474) | | | | | |
| O-Contact% | | | | | .0158 (.328) | | | | |
| Z-Contact% | | | | | | .0686 (.218) | | | |
| Swing% | | | | | | | .0620 (.060) | | |
| O-Swing% | | | | | | | | .0233 (.575) | |
| Z-Swing% | | | | | | | | | .0422 (.339) |
Definitions:
Swinging Strike% = Percent of pitches thrown that were swung at and missed
F-Strike% = Percent of hitters faced for which the first pitch of the PA was a strike
Zone% = Percent of pitches thrown in the strike zone
Contact% = Percent of hitters’ swings that were fouls or hit into play
O-Contact% = Contact% on pitches out of the strike zone
Z-Contact% = Contact% on pitches in the strike zone
Swing% = Percent of pitches at which hitters swung
O-Swing% = Swing% on pitches out of the strike zone
Z-Swing% = Swing% on pitches in the strike zone
Nearly every bit of information that the pitch data gave us was useless. For each of the regressions above, we had the relevant information we needed by knowing the pitcher’s strikeout rate, age, and what year it was. The coefficients on eight of the nine variables are not even weakly statistically significant, but there is one variable that has a weakly significant (positive) effect on predicting the next year’s strikeout rate: O-Swing%, the rate at which pitchers get hitters to chase pitches (supposedly) out of the zone.
Part of the reason that this statistic’s weak significance was so surprising to me is that I expected measurement error to render this information useless. The information, which FanGraphs obtains from Baseball Info Solutions, is determined by watching each pitch from the center-field camera. Because of parallax, the center-field camera gives a distorted view and the observer can be fooled. This is an important point: the data look rather questionable when examining the yearly averages. League-average O-Swing% has moved around over the years and mostly increased:
2002: 18.1%
2003: 22.2%
2004: 16.6%
2005: 20.3%
2006: 23.5%
2007: 25.0%
2008: 25.4%
2009: 25.1%
2010: 29.3%
Perhaps pitchers were gradually getting better at inducing batters to swing at the right pitches? That seems unlikely given the change in league-average Zone%, the percentage of pitches in the strike zone:
2002: 54.6%
2003: 51.4%
2004: 55.1%
2005: 53.8%
2006: 52.6%
2007: 50.3%
2008: 51.1%
2009: 49.3%
2010: 46.6%
It seems far more likely that pitches were increasingly being recorded as out of the strike zone over the years: the share of pitches recorded in the zone fell just as the swing rate on pitches supposedly out of the zone rose. The overall swing rate has only ranged from 45.3 to 46.5 percent over the years, so it is more plausible that swings were being classified as coming on pitches out of the strike zone more often later in the decade than that hitters were chasing more pitches out of the zone while pitchers threw a roughly offsetting number of additional pitches out of the zone.
I thought that normalizing the data, by measuring O-Swing% relative to the league average, might lead to stronger results. It did not:
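The arithmetic behind this suspicion can be sketched from the identity Swing% = Zone% × Z-Swing% + (1 − Zone%) × O-Swing%, which holds as long as every pitch is classified as either in or out of the zone. Using the league averages listed above, swings at pitches recorded as out of the zone work out to about 8.2 percent of all pitches in 2002 and about 15.6 percent in 2010. The 46.0 percent overall swing rate below is an assumed round figure inside the 45.3 to 46.5 range, not a number from the data, and the function names are hypothetical.

```python
# League averages quoted in the article, in percent.
zone_pct = {2002: 54.6, 2010: 46.6}
o_swing_pct = {2002: 18.1, 2010: 29.3}

# ASSUMPTION: a round overall swing rate within the quoted 45.3-46.5 range.
ASSUMED_SWING = 46.0

def chase_share(zone, o_swing):
    """Out-of-zone swings as a percentage of all pitches."""
    return (100.0 - zone) * o_swing / 100.0

def implied_z_swing(swing, zone, o_swing):
    """Z-Swing% implied by Swing% = Zone% * Z-Swing% + (1 - Zone%) * O-Swing%."""
    return (swing - chase_share(zone, o_swing)) / zone * 100.0

for year in (2002, 2010):
    share = chase_share(zone_pct[year], o_swing_pct[year])
    z_swing = implied_z_swing(ASSUMED_SWING, zone_pct[year], o_swing_pct[year])
    print(f"{year}: out-of-zone swings = {share:.1f}% of pitches, "
          f"implied Z-Swing% = {z_swing:.1f}")
```

With total swings roughly flat, a near doubling of recorded out-of-zone swings has to be offset by fewer recorded in-zone swings, which is the pattern one would expect if the same pitches were simply being reclassified.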
| Variable | Coefficient | P-Stat |
| --- | --- | --- |
| Constant | .0659 | .000 |
| SO/PA | .7654 | .000 |
| Year 2002-04 | .0046 | .049 |
| Year 2008 | .0057 | .090 |
| Age | .0010 | .000 |
| Net O-Swing% | .0340 | .386 |
The O-Swing% relative to the league average is now useless. Chances are this is for the reason Baseball Prospectus' Colin Wyers expressed concerns about the data: the year-to-year fluctuations in league-average O-Swing% are probably a result of moving center-field cameras. The average for each park is probably vastly different, and using a league-average adjustment is probably not very useful.
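For reference, the league-average adjustment can be sketched as a simple subtraction. I am assuming here that "net" means the pitcher's rate minus that season's league average in percentage points; a ratio would be another reasonable reading. The pitcher in the example is hypothetical.

```python
# League-average O-Swing% by season, as listed in the article.
LEAGUE_O_SWING = {
    2002: 18.1, 2003: 22.2, 2004: 16.6, 2005: 20.3, 2006: 23.5,
    2007: 25.0, 2008: 25.4, 2009: 25.1, 2010: 29.3,
}

def net_o_swing(o_swing, year):
    """Pitcher O-Swing% minus that season's league average (percentage points)."""
    return o_swing - LEAGUE_O_SWING[year]

# Hypothetical pitcher: the same raw 24.0% reads very differently by season.
for year in (2004, 2010):
    print(year, round(net_o_swing(24.0, year), 1))
# prints "2004 7.4" then "2010 -5.3"
```

A single league-wide baseline cannot correct for differences between parks, which is the weakness described above.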
None of the other statistics as measured relative to the league average yielded remotely significant coefficients when included in regressions, either.
The most promising pitch data is the rate at which pitchers can get hitters to chase pitches out of the strike zone. The pitchers that tend to do so are more likely to see their strikeout rates increase the following year. However, the measurement error in these statistics is currently so large that it is difficult to glean any major insight from them. Chances are that this information could be more useful if measured more scientifically, and this could be one of the areas where pitch data could move our understanding of baseball forward.
However, the most important information to take away from this article is that even more objective statistics like swinging-strike rate, swing rate, and contact rate, as well as called versus swinging strikeout rates, are all of very little added value beyond what the pitcher's strikeout rate will tell you.
Of course, strikeout rate for pitchers is one of the quickest to stabilize among all baseball statistics, and so the added value of information beyond knowing historical strikeout rate is least likely to be significant for strikeout rate as compared with any other statistic. Thus, next week I will look at walk rates and attempt to determine whether this type of information can inform our knowledge about walk rates any more than it could have informed us about strikeout rates.