October 1, 2010
Ahead in the Count
Pitch Data and Walks
Last week, I looked at Predicting Strikeouts with Swing and Whiff Rates, breaking down pitch-by-pitch data to see if things like swinging-strike rates could provide more enlightenment when combined with the previous year’s strikeout rate to predict future strikeout rate. The answer was mostly negative. This was primarily due to two reasons. One was that much of the data on pitch locations is poor, and ensuing discussions highlighted just how poor it is. The other reason, however, is that strikeout rate is the quickest statistic to stabilize over small samples, so one year of strikeout data does a very good job of predicting subsequent strikeout data already. However, this week I will look at walk rate, and attempt to determine whether this data is more useful in predicting future walk rates. There is certainly evidence of value added in this case, far more so than with predicting strikeouts.
Firstly, it helps to consider the baseline case. Supposing that we know what year it is, the age of the pitcher, and the pitcher’s unintentional walk rate from the previous year, what would we expect their unintentional walk rate to be the following year? The below regression summarizes this:
We can see that having a walk rate that is 1 percent above average the previous year would imply an expected walk rate that is 0.707 percent above average the following year, controlling for age and time.
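As a quick illustration of how to read that coefficient, here is a minimal Python sketch. It uses only the 0.707 figure quoted above and deliberately ignores the age and year terms, whose coefficients are not reproduced here; the function name is made up for this example.

```python
# Coefficient on the previous year's walk rate, as quoted in the text.
PERSISTENCE = 0.707


def expected_walk_rate_deviation(prior_deviation_pct, persistence=PERSISTENCE):
    """Expected next-year deviation from the league-average walk rate.

    Both the input and the result are in percentage points relative to
    average. Age and year are held fixed (simplifying assumption).
    """
    return persistence * prior_deviation_pct


# A pitcher 1 point above average, and one 2 points below average.
print(expected_walk_rate_deviation(1.0))
print(expected_walk_rate_deviation(-2.0))
```

In other words, roughly 70 percent of a pitcher's gap from the league-average walk rate is expected to carry over to the next season, and the rest regresses away.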
However, suppose that we also knew some of the Baseball Info Solutions data provided at FanGraphs on swinging-strike rate, contact rate, zone rate, first strike rate, and swing rate. Recall the following definitions from last week’s article:
Now, consider the following table, in which I ran regressions of walk rate on the baseline variables above plus each of the nine variables defined above, one at a time. The basic regression above is included as well for comparison. Statistically significant coefficients are bolded, while weakly significant coefficients are bolded and italicized.
Note that at the bottom of the table, I have included both an R2 and an Adjusted R2. The R2 statistic tells us how much of the variance in actual walk rates across the league can be explained by the statistical model. The problem with R2 is that adding more variables to the regression never lowers it, and almost always raises it, because each extra variable gives the regression one more way to fit around the data. So, the Adjusted R2 statistic accounts for the number of terms in the model, and makes it so that adding a variable only increases the Adjusted R2 if it adds more explanatory power than would be expected due to random variation.
In the above table, the Adjusted R2 of the original regression with none of the nine Baseball Info Solutions variables provided at FanGraphs is .4903. Adding swinging-strike rates actually lowers this to .4902, despite raising the R2 from .4932 to .4938, meaning that there is no reason to think that the .0006 R2 increase is a result of swinging-strike rates actually being useful in predicting walk rates.
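To make the penalty concrete, here is a short Python sketch of the standard Adjusted R2 formula. The two R2 values are the ones quoted above; the sample size (500) and the predictor counts (3 and 4) are hypothetical placeholders, since the regressions' actual dimensions are not given here, so the printed values will not match the table.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# Hypothetical dimensions: 500 pitcher-seasons, 3 baseline predictors,
# then one extra predictor (swinging-strike rate) added.
baseline = adjusted_r2(0.4932, n_obs=500, n_predictors=3)
with_swstr = adjusted_r2(0.4938, n_obs=500, n_predictors=4)

print(baseline, with_swstr)
print("extra variable helped:", with_swstr > baseline)
```

With these placeholder dimensions the raw R2 goes up while the adjusted figure goes down, which is the same pattern described for swinging-strike rate.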
The term that seems to be most useful in predicting walk rates is the rate at which pitchers throw first-pitch strikes. What the table above shows us is that pitchers of equal ages and walk rates are more likely to improve their walk rates if they have thrown more first-pitch strikes while generating those walk rates. This makes sense, because pitchers who throw first-pitch strikes but subsequently fall behind hitters can change their approach, while pitchers who cannot even get a fastball over for strike one are bound to struggle longer.
Pitchers who throw more pitches in the zone in general are also more likely to improve their walk rates than pitchers who throw balls out of the zone more often, even if they start with the same walk rate. This makes some sense. What is impressive is not that this is true, but that it’s true despite some very troublesome issues with the data.
Baseball Prospectus' Colin Wyers has long expressed concern with this data, stressing the issue of parallax in determining pitch location using only the center field camera. This week, in the discussion on The Book Blog (linked in the first paragraph), Colin ran a correlation of a team’s hitters’ “Zone%” with the team’s pitchers’ “Zone%.” Measured perfectly, this correlation might not be exactly zero, but it should be pretty close to zero. The answer he found was 0.88! In other words, the rate at which Baseball Info Solutions thinks a team is throwing the ball in the strike zone is almost exactly the rate at which Baseball Info Solutions thinks that team’s hitters are seeing pitches thrown in the zone. Undoubtedly, Colin is justified in being concerned with this bias.
However, even with this gigantic flaw in the data, there is a clear improvement in predicting walk rates when looking even at the rough measures that Baseball Info Solutions provides. I am relatively confident that improved measurement, perhaps using a tool such as PITCHf/x, would improve walk-rate predictions even further.
Overall contact and swing rates both seem to be correlated with lowering walk rates as well. If you can induce hitters to swing more, and they do not allow the count to go as deep when they do swing (because they hit the ball more), then you are likely to improve your walk rate as well.
Data on swing and contact rates on pitches separately in and out of the zone provided no real useful information, perhaps due to measurement errors complicating the regression.
In the table above, only one BIS variable was considered at a time. However, since multiple BIS variables may be useful and yet may also be correlated with each other, it is productive to run regressions on several BIS variables at once. I removed the in-zone and out-of-zone contact and swing rates, as well as the swinging-strike rate (which is only the rate of swings times one minus the contact rate anyway) and then I ran some regressions using the remaining four BIS variables.
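The parenthetical identity is why swinging-strike rate adds nothing once swing and contact rates are both in the model. A minimal Python sketch, assuming swing rate is swings per pitch and contact rate is contact per swing; the example numbers are made up:

```python
def swinging_strike_rate(swing_rate, contact_rate):
    """Swings-and-misses per pitch = swings per pitch * misses per swing."""
    return swing_rate * (1.0 - contact_rate)


# Hypothetical pitcher: hitters swing at 45% of pitches and make
# contact on 80% of those swings.
print(swinging_strike_rate(0.45, 0.80))
```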
The table below looks only at regressions with three or four of the four BIS variables included.
The first-pitch strike rate was extremely statistically significant in each of the regressions above, and the contact rate was at least weakly significant in each of the regressions above as well. Swing rate was only significant when first-pitch strike rate was excluded. These variables have a 0.63 correlation, indicating that swing rate is probably only relevant in that it is picking up the effect of first-strike rate when it is absent.
Moving to regressions where exactly two of the four BIS variables are included, we get the following results:
Once again, the first-pitch strike rate is extremely statistically significant in all regressions. Zone rate is only significant when first-pitch strike rate is absent, as it was in the previous table. These two variables have a .50 correlation, so it appears that zone rate also is picking up first-strike rate’s effect when it is absent. Contact rate is useful even when first-pitch strike rate is included in the fourth regression in the table above, indicating that both of these are probably useful.
Looking at all of the results above, the regression with the highest Adjusted R2, by a very small margin, is the one including F-Strike%, Contact%, and Zone%. However, it is only .0003 higher than leaving out Zone rate, and .0037 higher than leaving out both Zone and Contact rate. This would suggest that the best regression is one of four: including no BIS variables, using only F-Strike%, using F-Strike% and Contact%, or using F-Strike%, Contact%, and Zone%.
The problem is that running a regression on a data set that we are also testing on can be biased. So I came up with a bizarre but useful solution.
I checked and found that approximately half of the data set (365 of 712 pitchers) had an “i” somewhere in their first or last name. Since I wanted a way to randomly split the data set, I reran the four regressions that I wanted to check on each half of the data set, and then checked the root mean square error between the predicted walk rate and the actual walk rate on the other half.
So, for example, I ran a regression of walk rate on the previous year’s walk rate, age, and the year for all pitchers without an “i” in their name, and came up with a set of coefficients. Then I used those coefficients to produce predicted walk rates for pitchers with at least one “i” in their name.
(It is probably safe to assume that there is no secret correlation between having an “i” in your name and improving your walk rate more than your statistics and age would suggest on their own.)
Then I could compare whether having this extra information was useful. Did the three regressions with BIS info beat the predicted walk rates of the regression without BIS info?
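The procedure just described can be sketched in a few lines of Python. This is a simplified illustration, not the regressions actually run for this article: to stay dependency-free it fits on the previous year's walk rate alone (no age, year, or BIS variables), it assumes the letter check ignores case, and the pitcher names and numbers are invented.

```python
import math


def has_letter_i(name):
    """True if the name contains an 'i' (case-insensitive, an assumption)."""
    return "i" in name.lower()


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def cross_half_rmse(rows):
    """rows: (name, prior_walk_pct, next_walk_pct). Fit on one half, test on the other."""
    with_i = [r for r in rows if has_letter_i(r[0])]
    without_i = [r for r in rows if not has_letter_i(r[0])]
    results = {}
    for label, train, test in (
        ("fit_without_i_test_with_i", without_i, with_i),
        ("fit_with_i_test_without_i", with_i, without_i),
    ):
        intercept, slope = fit_line([r[1] for r in train], [r[2] for r in train])
        predictions = [intercept + slope * r[1] for r in test]
        results[label] = rmse(predictions, [r[2] for r in test])
    return results


# Invented pitchers and walk rates, purely for illustration.
sample = [
    ("Al Ross", 7.0, 7.4), ("Bo Hunt", 10.0, 9.1), ("Ed Moss", 8.5, 8.8),
    ("Tim King", 6.0, 6.9), ("Vic Diaz", 11.0, 9.8), ("Kip Ruiz", 9.0, 8.6),
]
print(cross_half_rmse(sample))
```

Running the same comparison for each candidate model and seeing which gives the lower out-of-sample error in both directions is the test applied here.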
Each of the three regressions with BIS data included showed some improvement beyond using raw walk rate, age, and year. The best two both include contact rate and first-pitch strike rate; whether to use the one that also includes zone rate is a judgment call. Since first-pitch strike rate and contact rate are not measured with error, and zone rate is, I would be inclined to choose the regression without zone rate.
For comparison, I ran the same test on the model of predicted strikeout rate from last week’s article with and without O-Swing% and saw no real difference. Including O-Swing% gave a slightly better prediction when running the regression on players without an “i” and testing on players with an “i,” but a slightly worse prediction when predicting the strikeout rate of pitchers without an “i” using coefficients generated by a regression only on pitchers with an “i” in their names. (Regression on data set without an “i” root mean square error: basic 2.926 percent, with O-Swing% 2.931 percent; regression data set with an “i” root mean square error: basic 2.888 percent, with O-Swing% 2.881 percent.)
This analysis confirms that knowing pitch data is helpful in predicting walk rates, much more than in predicting strikeouts. There was plenty of evidence that pitchers who threw first-pitch strikes more often could improve their walk rates more than pitchers who threw them less, and there was also plenty of evidence that pitchers who allowed more contact improved their walk rates as well. I would also suspect that improving the measurement of whether pitches are in the strike zone would be very helpful on this front as well. This article should serve as two things. One is a demonstration of the utility of this type of granular information, and the other is a demonstration of the need for better measurements of such granular information.