October 1, 2010 Ahead in the Count: Pitch Data and Walks
Last week, I looked at Predicting Strikeouts with Swing and Whiff Rates, breaking down pitch-by-pitch data to see whether things like swinging-strike rates could add information when combined with the previous year’s strikeout rate to predict future strikeout rate. The answer was mostly negative, for two main reasons. One was that much of the data on pitch locations is poor, and ensuing discussions highlighted just how poor it is. The other is that strikeout rate is the quickest statistic to stabilize over small samples, so one year of strikeout data already does a very good job of predicting subsequent strikeout data.
This week I will look at walk rate, and attempt to determine whether pitch data is more useful in predicting future walk rates. There is certainly evidence of value added in this case, far more so than with predicting strikeouts.
First, it helps to consider the baseline case. Supposing that we know what year it is, the age of the pitcher, and the pitcher’s unintentional walk rate from the previous year, what would we expect their unintentional walk rate to be the following year? The regression below summarizes this:
We can see that having a walk rate that is 1 percent above average the previous year would imply an expected walk rate that is 0.707 percent above average the following year, controlling for age and time. However, suppose that we also knew some of the Baseball Info Solutions data provided at FanGraphs on swinging-strike rate, contact rate, zone rate, first-strike rate, and swing rate. Recall the following definitions from last week’s article:
Definitions:
Now, consider the following table, for which I ran regressions of walk rate on the baseline variables above and on each of the nine variables defined above (but one at a time for now). I also started with the basic regression above. Statistically significant coefficients are bolded, while weakly significant coefficients are bolded and italicized.
Note that at the bottom, I have included both an R^{2} and an Adjusted R^{2} in the table. The R^{2} statistic tells us how much of the variance in actual walk rates across the league can be explained by the statistical model. The problem with R^{2} is that adding more variables to the regression never lowers it and almost always raises it, because it gives the regression one more variable to fit around the data. So, the Adjusted R^{2} statistic accounts for the number of terms in the model, so that adding a variable increases the Adjusted R^{2} only if it adds more explanatory power than would be expected from random variation. In the above table, the Adjusted R^{2} of the original regression with none of the nine Baseball Info Solutions variables provided at FanGraphs is .4903. Adding swinging-strike rate actually lowers this to .4902, despite raising the R^{2} from .4932 to .4938, meaning that there is no reason to think that the .0006 R^{2} increase is a result of swinging-strike rates actually being useful in predicting walk rates.
The term that seems to be most useful in predicting walk rates is the rate at which pitchers throw first-pitch strikes. What the table above shows us is that pitchers of equal ages and walk rates are more likely to improve their walk rates if they have thrown more first-pitch strikes while generating those walk rates. This makes sense, because pitchers who throw first-pitch strikes but subsequently fall behind hitters can change their approach, while pitchers who cannot even get a fastball over for strike one are bound to struggle longer.
Pitchers who throw more pitches in the zone in general are also more likely to improve their walk rates than pitchers who throw balls out of the zone more often, even if they start with the same walk rate. This makes some sense. What is impressive is not that this is true, but that it’s true despite some very troublesome issues with the data.
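The R^{2} versus Adjusted R^{2} distinction discussed above can be illustrated with a short Python sketch. It uses the standard textbook adjustment, which may or may not be exactly what the regression software behind the table reports. The R^{2} values .4932 and .4938 come from the text, but the sample size of 500 and the predictor counts of 3 and 4 are assumptions made purely for illustration, so the outputs are not expected to reproduce the .4903 and .4902 in the table.

```python
import numpy as np


def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R^2 for each predictor (the intercept is not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# ASSUMPTION: 500 observations and 3 vs. 4 predictors are made-up values,
# chosen only to show the direction of the effect.
N_OBS = 500
base = adjusted_r_squared(0.4932, N_OBS, 3)        # baseline regression
with_extra = adjusted_r_squared(0.4938, N_OBS, 4)  # one extra variable added
print(round(base, 4), round(with_extra, 4))
```

Under these assumed counts, the tiny R^{2} gain from the extra variable is smaller than the penalty for adding it, so the adjusted figure falls, which is the same pattern described for swinging-strike rate.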
Baseball Prospectus' Colin Wyers has long expressed concern with this data, stressing the issue of parallax in determining pitch location using only the center field camera. This week, in the discussion on The Book Blog (linked in the first paragraph), Colin ran a correlation of a team’s hitters’ “Zone%” with the team’s pitchers’ “Zone%.” Theoretically, this may not be exactly zero if measured perfectly, but it should be pretty close to zero. The answer he found was 0.88! In other words, the rate at which Baseball Info Solutions thinks a team is throwing the ball in the strike zone is almost the exact same rate at which Baseball Info Solutions thinks that team’s hitters are seeing pitches thrown in the zone. Undoubtedly, Colin is justified in being concerned with this bias.
However, even with this gigantic flaw in the data, there is a clear improvement in predicting walk rates when looking even at the rough measures that Baseball Info Solutions provides. I am relatively confident that improved measurement, perhaps using a tool such as PITCHf/x, might provide even more improvement in predicting walk rates.
Overall contact and swing rates both seem to be correlated with lowering walk rates as well. If you can induce hitters to swing more, and they do not allow the count to go as deep when they do swing (because they hit the ball more), then you are likely to improve your walk rate as well. Data on swing and contact rates on pitches separately in and out of the zone provided no real useful information, perhaps due to measurement errors complicating the regression.
In the table above, only one BIS variable was considered at a time. However, since multiple BIS variables may be useful and yet may also be correlated with each other, it is productive to run regressions on several BIS variables at once.
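To see why correlated variables call for a joint regression, here is a minimal Python sketch on synthetic data (not the BIS data). Everything in it is an assumption for illustration: `x1` plays the role of a variable that truly drives the outcome, `x2` is a variable that is merely correlated with `x1` at roughly 0.6, and the helper `ols` is a hypothetical name wrapping NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# ASSUMPTION: synthetic, standardized predictors. x1 drives the outcome;
# x2 has no effect of its own but is correlated with x1 (roughly 0.6).
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)
y = -0.5 * x1 + rng.normal(size=n)


def ols(y, *columns):
    """Least-squares coefficients, intercept first, then one per column."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


alone = ols(y, x2)       # [intercept, x2]
joint = ols(y, x1, x2)   # [intercept, x1, x2]

print("x2 coefficient alone:   ", round(alone[1], 3))
print("x2 coefficient with x1: ", round(joint[2], 3))
print("x1 coefficient with x2: ", round(joint[1], 3))
```

Run one at a time, `x2` looks useful because it borrows the effect of `x1`; once both are in the regression, the coefficient on `x2` shrinks toward zero while `x1` keeps its effect.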
I removed the in-zone and out-of-zone contact and swing rates, as well as the swinging-strike rate (which is only the swing rate times one minus the contact rate anyway), and then I ran some regressions using the remaining four BIS variables. The table below looks only at regressions with three or four of the four BIS variables included.
The first-pitch strike rate was extremely statistically significant in each of the regressions above, and the contact rate was at least weakly significant in each of the regressions above as well. Swing rate was only significant when first-pitch strike rate was excluded. These two variables have a 0.63 correlation, indicating that swing rate is probably only relevant in that it is picking up the effect of first-strike rate when it is absent. Moving to regressions where exactly two of the four BIS variables are included, we get the following results:
Once again, the first-pitch strike rate is extremely statistically significant in all regressions. Zone rate is only significant when first-pitch strike rate is absent, as it was in the previous table. These two variables have a .50 correlation, so it appears that zone rate is also picking up first-strike rate’s effect when it is absent. Contact rate is useful even when first-pitch strike rate is included, as in the fourth regression in the table above, indicating that both of these are probably useful.
Looking at all of the results above, the regression with the highest Adjusted R^{2}, by a very small margin, is the one including F-Strike%, Contact%, and Zone%. However, it is only .0003 higher than leaving out zone rate, and .0037 higher than leaving out both zone and contact rate. This would suggest that the best regression is one of four candidates: including no BIS variables, using only F-Strike%, using F-Strike% and Contact%, or using F-Strike%, Contact%, and Zone%.
The problem is that evaluating a regression on the same data set it was fit to can be biased. So I came up with a bizarre but useful solution. I checked and found that approximately half of the data set (365 of 712 pitchers) had an “i” somewhere in their first or last name. Since I wanted a way to randomly split the data set, I reran the four regressions that I wanted to check on each half-data set and then checked the root mean square error of the predicted walk rate against the actual walk rate on the other half of the pitchers. So, for example, I ran a regression of walk rate on previous year’s walk rate, age, and the year for all pitchers without an “i” in their name, and came up with a set of coefficients. Then I used those coefficients to produce predicted walk rates for pitchers with at least one “i” in their name. (It is probably safe to assume that there is no secret correlation between having an “i” in your name and improving your walk rate more than your statistics and age would suggest on their own.)
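The split-sample check just described can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code behind the article: the function names are hypothetical, the names and data are synthetic, and the "i" test here is case-insensitive (the article does not say whether a capital "I" counted).

```python
import numpy as np


def has_i(name):
    """True if the name contains an 'i' (ASSUMPTION: case-insensitive)."""
    return "i" in name.lower()


def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def cross_half_rmse(names, X, y):
    """Fit on one half of the split, score on the other half, both ways."""
    mask = np.array([has_i(n) for n in names])
    halves = (("fit_without_i", ~mask, mask), ("fit_with_i", mask, ~mask))
    out = {}
    for label, train, test in halves:
        coef = fit_ols(X[train], y[train])
        out[label] = rmse(y[test], predict(coef, X[test]))
    return out


# ASSUMPTION: synthetic stand-ins for (previous walk rate, age, year) and
# next-year walk rate; the names are invented so that half contain an "i".
rng = np.random.default_rng(1)
n = 200
names = ["Mike Davis" if k % 2 == 0 else "Tom Brown" for k in range(n)]
X = rng.normal(size=(n, 3))
y = X @ np.array([0.7, 0.1, -0.05]) + rng.normal(scale=0.5, size=n)

scores = cross_half_rmse(names, X, y)
print(scores)
```

To compare candidate regressions, `cross_half_rmse` would be called once per set of predictor columns, and the set with the lower out-of-sample error in both directions would be preferred.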
Then I could compare whether having this extra information was useful. Did the three regressions with BIS info beat the predicted walk rates of the regression without BIS info?
Each of the three regressions with BIS data included showed some improvement beyond using raw walk rate, age, and year. However, the best two both include contact rate and first-pitch strike rate, and whether to use the one that includes zone rate as well is a judgment call. Since first-pitch strike rate and contact rate are not measured with error, and zone rate is, I would be inclined to choose the regression without zone rate, but the above clearly shows that it is a judgment call.
For comparison, I compared the model of predicted strikeout rate from last week’s article with and without O-Swing% and basically saw no real difference. There was a slightly better prediction when running the regression on players without an “i” and testing on players with an “i” when you included O-Swing%, but a slightly worse prediction when you predicted strikeout rates of pitchers without an “i” using coefficients generated by a regression only for pitchers with an “i” in their names. (Regression on data set without an “i” root mean square error: basic 2.926 percent, with O-Swing% 2.931 percent; regression on data set with an “i” root mean square error: basic 2.888 percent, with O-Swing% 2.881 percent.)
This analysis confirms that knowing pitch data is helpful in predicting walk rates, much more than in predicting strikeouts. There was plenty of evidence that pitchers who threw first-pitch strikes more often could improve their walk rates more than pitchers who threw them less, and there was also plenty of evidence that pitchers who allowed more contact improved their walk rates as well. I would also suspect that improving the measurement of whether pitches are in the strike zone would be very helpful on this front. This article should serve as two things. One is a demonstration of the utility of this type of granular information, and the other is a demonstration of the need for better measurements of such granular information.
Matt Swartz is an author of Baseball Prospectus. 4 comments have been left for this article.

Great stuff, Matt. One more check if you can: OK, they improved their walk rate, but did their overall performance get better as well? If they had a lower walk rate because they threw more first-pitch strikes, and those first-pitch strikes are easier to hit, then that might be the reason: it was a trade of fewer walks for more hits.
Can you check into this?
Hmm...I will look into this tonight when I have access to my data again. The first-pitch strikes were from the year before, though (the high-walk year), so the pitchers who had thrown more first-pitch strikes but had higher walk rates anyway improved their walk rates whether or not they threw more first-pitch strikes in the second (lower-BB) year.