Ahead in the Count: Pitch Data and Walks

Last week, I looked at Predicting Strikeouts with Swing and Whiff Rates, breaking down pitch-by-pitch data to see whether things like swinging-strike rates add information to the previous year’s strikeout rate when predicting future strikeout rate. The answer was mostly negative, for two reasons. One was that much of the data on pitch locations is poor, and ensuing discussions highlighted just how poor it is. The other is that strikeout rate is the quickest statistic to stabilize over small samples, so one year of strikeout data already does a very good job of predicting subsequent strikeout data. This week I will look at walk rate and attempt to determine whether the same pitch data is more useful in predicting future walk rates. There is certainly evidence of value added in this case, far more so than with predicting strikeouts.

First, it helps to consider the baseline case. Suppose that we know what year it is, the age of the pitcher, and the pitcher’s unintentional walk rate from the previous year. What would we expect his unintentional walk rate to be the following year? The regression below summarizes this:

Variable	Coef.	P-Stat
Constant	.0245	.000
UBB/PA	.7070	.000
Year 2002-04	-.0017	.193
Year 2008	.0017	.354
Age	-.0001	.465

We can see that a walk rate 1 percentage point above average in the previous year implies an expected walk rate 0.707 percentage points above average the following year, controlling for age and time.
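As a rough illustration of how the baseline coefficients turn into a projection, here is a minimal Python sketch. The function name `project_walk_rate` is hypothetical, and the treatment of the two "Year" terms as 0/1 indicators is an assumption, since the article does not spell out how they are coded.

```python
# Coefficients copied from the baseline regression table above.
BASELINE = {
    "constant": 0.0245,
    "ubb_per_pa": 0.7070,
    "year_2002_04": -0.0017,
    "year_2008": 0.0017,
    "age": -0.0001,
}


def project_walk_rate(prev_ubb_per_pa, age, season):
    """Projected next-year unintentional walk rate (UBB/PA).

    Assumption: the two "Year" terms are 0/1 indicators keyed to the
    season supplied here; the article does not state the exact coding.
    """
    rate = BASELINE["constant"]
    rate += BASELINE["ubb_per_pa"] * prev_ubb_per_pa
    rate += BASELINE["age"] * age
    if 2002 <= season <= 2004:
        rate += BASELINE["year_2002_04"]
    elif season == 2008:
        rate += BASELINE["year_2008"]
    return rate


# Two otherwise identical (hypothetical) pitchers whose prior walk rates
# differ by one percentage point are projected 0.707 points apart.
gap = project_walk_rate(0.09, 28, 2006) - project_walk_rate(0.08, 28, 2006)
print(round(gap, 5))  # 0.00707
```

The age and year terms cancel in the difference, which is why only the UBB/PA coefficient shows up in the gap.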

However, suppose that we also knew some of the Baseball Info Solutions data provided at FanGraphs on swinging-strike rate, contact rate, zone rate, first strike rate, and swing rate. Recall the following definitions from last week’s article:

Definitions:
Swinging Strike% = Percent of pitches thrown that were swung at and missed
F-Strike% = Percent of hitters faced for which the first pitch of PA was a strike
Zone% = Percent of pitches thrown in the strike zone
Contact% = Percent of hitters’ swings that were fouls or hit into play
O-Contact% = Contact% on pitches out of the strike zone
Z-Contact% = Contact% on pitches in the strike zone
Swing% = Percent of pitches at which hitters swung
O-Swing% = Swing% on pitches out of the strike zone
Z-Swing% = Swing% on pitches in the strike zone

Now, consider the following table, in which I ran regressions of walk rate on the baseline variables above plus each of the nine variables just defined, one at a time for now. The first column repeats the basic regression above. Statistically significant coefficients are bolded, while weakly significant coefficients are bolded and italicized.

Variable	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)
Constant	.0245 (.000)	.0218 (.000)	.0863 (.000)	.0590 (.000)	.0423 (.002)	.0300 (.000)	.0263 (.146)	.0496 (.003)	.0197 (.004)	.0390 (.026)
UBB/PA	.7070 (.000)	.7059 (.000)	.5985 (.000)	.6612 (.000)	.6988 (.000)	.7016 (.000)	.7065 (.000)	.6732 (.000)	.7161 (.000)	.7023 (.000)
Year ’02-‘04	-.0017 (.193)	-.0019 (.146)	-.0018 (.149)	-.0007 (.593)	-.0020 (.125)	-.0022 (.104)	-.0017 (.207)	-.0015 (.238)	-.0010 (.472)	-.0012 (.390)
Year ‘08	.0017 (.354)	.0017 (.357)	.0020 (.274)	.0011 (.538)	.0017 (.354)	.0022 (.241)	.0017 (.355)	.0019 (.308)	.0012 (.546)	.0014 (.441)
Age	-.0001 (.465)	-.0001 (.537)	-.0001 (.412)	-.0001 (.417)	-.0001 (.546)	-.0001 (.575)	-.0001 (.470)	-.0001 (.299)	-.0001 (.493)	-.0001 (.367)
Swinging Strike%		.0285 (.348)
F-Strike%			-.0900 (.000)
Zone%				-.0586 (.009)
Contact%					-.0217 (.165)
O-Contact%						-.0098 (.173)
Z-Contact%							-.0020 (.919)
Swing%								-.0463 (.120)
O-Swing%									.0174 (.319)
Z-Swing%										-.0200 (.390)
R^2	.4932	.4938	.5051	.4981	.4946	.4945	.4932	.4949	.4939	.4937
Adj. R^2	.4903	.4902	.5016	.4945	.4910	.4909	.4896	.4913	.4903	.4901

Note that at the bottom, I have included both an R² and an Adjusted R² in the table. The R² statistic tells us how much of the variance in actual walk rates across the league can be explained by the statistical model. The problem with R² is that adding more variables to the regression never lowers it, and almost always raises it, because each extra variable gives the regression one more way to fit the data. The Adjusted R² statistic accounts for the number of terms in the model, so that adding a variable increases the Adjusted R² only if it adds more explanatory power than would be expected from random variation.

In the above table, the Adjusted R² of the original regression with none of the nine Baseball Info Solutions variables provided at FanGraphs is .4903. Adding swinging-strike rates actually lowers this to .4902, despite raising the R² from .4932 to .4938, meaning that there is no reason to think that the .0006 R² increase is a result of swinging-strike rates actually being useful in predicting walk rates.
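The adjustment can be checked with the textbook formula for Adjusted R². The sketch below assumes that formula is the one used here and assumes a sample size of 712, the pitcher count quoted later in the article; neither is stated explicitly for these regressions.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 712  # assumption: the pitcher count quoted later in the article

# Baseline: UBB/PA, two year terms, and age (4 predictors).
base = adjusted_r2(0.4932, N_OBS, 4)
# Same model plus Swinging Strike% (5 predictors).
with_swstr = adjusted_r2(0.4938, N_OBS, 5)

print(round(base, 4), round(with_swstr, 4))  # 0.4903 0.4902
```

Under those assumptions the formula reproduces both table values: the extra .0006 of R² is not enough to pay for the additional term.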

The term that seems to be most useful in predicting walk rates is the rate at which pitchers throw first-pitch strikes. What the table above shows us is that pitchers of equal ages and walk rates are more likely to improve their walk rates if they have thrown more first-pitch strikes while generating those walk rates. This makes sense, because pitchers who throw first-pitch strikes but subsequently fall behind hitters can change their approach, while pitchers who cannot even get a fastball over for strike one are bound to struggle longer.

Pitchers who throw more pitches in the zone in general are also more likely to improve their walk rates than pitchers who throw balls out of the zone more often, even if they start with the same walk rate. This makes some sense. What is impressive is not that this is true, but that it’s true despite some very troublesome issues with the data.

Baseball Prospectus' Colin Wyers has long expressed concern with this data, stressing the issue of parallax in determining pitch location using only the center field camera. This week, in the discussion on The Book Blog (linked in the first paragraph), Colin ran a correlation of a team’s hitters’ “Zone%” with the team’s pitchers’ “Zone%.” Theoretically, this may not be exactly zero even if measured perfectly, but it should be pretty close to zero. The answer he found was 0.88! In other words, the rate at which Baseball Info Solutions thinks a team is throwing the ball in the strike zone is almost exactly the rate at which Baseball Info Solutions thinks that team’s hitters are seeing pitches thrown in the zone. Undoubtedly, Colin is justified in being concerned with this bias.
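For readers who want to run this kind of check themselves, here is a minimal sketch of a Pearson correlation between team-level pitcher and hitter Zone% values. The numbers are invented purely for illustration and are not BIS data, and this is not necessarily how Colin computed his figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical team-level Zone% values, invented for illustration only.
pitchers_zone = [0.48, 0.50, 0.52, 0.54, 0.51]
hitters_zone = [0.49, 0.50, 0.53, 0.53, 0.50]

r = pearson(pitchers_zone, hitters_zone)
print(f"correlation: {r:.2f}")
```

If pitch location were measured without a park-specific bias, a team's own pitchers and its own hitters (who face other teams' pitchers) should produce values that are close to unrelated, so a high correlation points at the measurement rather than the players.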

However, even with this gigantic flaw in the data, there is a clear improvement in predicting walk rates when looking even at the rough measures that Baseball Info Solutions provides. I am relatively confident that improved measurement, perhaps using a tool such as PITCHf/x, would provide even more improvement in predicting walk rates.

Overall contact and swing rates both seem to be associated with lower subsequent walk rates as well, though neither coefficient is statistically significant on its own (p-stats of .165 and .120). If you can induce hitters to swing more, and they do not let the count go as deep when they do swing (because they hit the ball more), then you are likely to improve your walk rate as well.

Data on swing and contact rates on pitches separately in and out of the zone provided no real useful information, perhaps due to measurement errors complicating the regression.

In the table above, only one BIS variable was considered at a time. However, since multiple BIS variables may be useful and yet may also be correlated with each other, it is productive to run regressions on several BIS variables at once. I removed the in-zone and out-of-zone contact and swing rates, as well as the swinging-strike rate (which is only the rate of swings times one minus the contact rate anyway) and then I ran some regressions using the remaining four BIS variables.
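The redundancy of swinging-strike rate follows directly from the definitions: it is the share of pitches swung at, times the share of those swings that miss. A minimal sketch, with hypothetical input rates:

```python
def swinging_strike_rate(swing_rate, contact_rate):
    """Swinging Strike% = Swing% * (1 - Contact%), all as fractions of 1."""
    return swing_rate * (1 - contact_rate)


# Hypothetical pitcher: hitters swing at 45% of pitches and make contact
# on 80% of those swings.
print(round(swinging_strike_rate(0.45, 0.80), 3))  # 0.09
```

Because the statistic is fully determined by Swing% and Contact%, including it alongside them would add no independent information to the regression.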

The table below looks only at regressions that include three or four of the four BIS variables.

Variable	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)
Constant	.1474 (.000)	.1347 (.000)	.0981 (.000)	.1093 (.000)	.1448 (.000)
UBB/PA	.5461 (.000)	.5583 (.000)	.5851 (.000)	.6170 (.000)	.5507 (.000)
Year ’02-‘04	-.0020 (.147)	-.0019 (.158)	-.0012 (.350)	-.0013 (.357)	-.0024 (.062)
Year ‘08	.0019 (.316)	.0017 (.346)	.0016 (.392)	.0015 (.421)	.0021 (.242)
Age	-.0001 (.434)	-.0001 (.514)	-.0001 (.434)	-.0001 (.315)	-.0001 (.429)
Zone%	-.0242 (.315)	-.0282 (.227)	-.0369 (.115)	-.0442 (.061)
F-Strike%	-.0893 (.000)	-.0938 (.000)	-.0823 (.001)		-.0949 (.000)
Contact%	-.0408 (.021)	-.0355 (.027)		-.0327 (.064)	-.0449 (.009)
Swing%	-.0256 (.470)		.0086 (.791)	-.0590 (.088)	-.0339 (.325)
R^2	.5106	.5102	.5068	.5011	.5099
Adj. R^2	.5050	.5053	.5019	.4962	.5050

The first-pitch strike rate was extremely statistically significant in each of the regressions above that included it, and the contact rate was at least weakly significant in each of the regressions that included it as well. Swing rate was even weakly significant only when first-pitch strike rate was excluded. Swing rate and first-pitch strike rate have a 0.63 correlation, indicating that swing rate is probably only relevant in that it picks up the effect of first-pitch strike rate when the latter is absent.

Moving to regressions where exactly two of the four BIS variables are included, we get the following results:

Variable	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)	Coef (P-Stat)
Constant	.1007 (.000)	.0732 (.000)	.0716 (.000)	.1264 (.000)	.0861 (.000)	.0998 (.000)
UBB/PA	.5823 (.000)	.6557 (.000)	.6442 (.000)	.5688 (.000)	.5988 (.000)	.6342 (.000)
Year ’02-‘04	-.0012 (.356)	-.0010 (.448)	-.0007 (.597)	-.0024 (.059)	-.0018 (.151)	-.0020 (.121)
Year ‘08	.0016 (.381)	.0012 (.532)	.0013 (.485)	.0020 (.264)	-.0020 (.276)	.0020 (.278)
Age	-.0001 (.390)	-.0001 (.485)	-.0001 (.322)	-.0001 (.544)	-.0001 (.426)	-.0001 (.295)
Zone%	-.0360 (.120)	-.0566 (.012)	-.0533 (.021)
F-Strike%	-.0801 (.000)			-.1025 (.000)	-.0901 (.000)
Contact%		-.0187 (.231)		-.0384 (.015)		-.0396 (.022)
Swing%			-.0290 (.344)		.0006 (.984)	-.0790 (.017)
R^2	.5068	.4991	.4987	.5092	.5051	.4987
Adj. R^2	.5026	.4948	.4944	.5050	.5009	.4944

Once again, the first-pitch strike rate is extremely statistically significant in all regressions. Zone rate is only significant when first-pitch strike rate is absent, as it was in the previous table. These two variables have a .50 correlation, so it appears that zone rate also is picking up first-strike rate’s effect when it is absent. Contact rate is useful even when first-pitch strike rate is included in the fourth regression in the table above, indicating that both of these are probably useful.

Looking at all of the results above, the regression with the highest Adjusted R², by a very small margin, is the one including F-Strike%, Contact%, and Zone%. However, it is only .0003 higher than leaving out Zone%, and .0037 higher than leaving out both Zone% and Contact%. This suggests four candidates for the best regression: including no BIS variables, using only F-Strike%, using F-Strike% and Contact%, or using F-Strike%, Contact%, and Zone%.
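The margins quoted above can be laid side by side with a few lines of Python. The Adjusted R² values are the ones reported in the tables; the dictionary labels are only shorthand for this sketch.

```python
# Adjusted R^2 for the four candidate regressions, as reported in the tables.
ADJ_R2 = {
    "no BIS variables": 0.4903,
    "F-Strike%": 0.5016,
    "F-Strike% + Contact%": 0.5050,
    "F-Strike% + Contact% + Zone%": 0.5053,
}

best = max(ADJ_R2, key=ADJ_R2.get)
for name, value in ADJ_R2.items():
    # Show how far each candidate trails the best in-sample fit.
    print(f"{name}: {value:.4f} (trails best by {ADJ_R2[best] - value:.4f})")
```

Most of the gain over the baseline comes from adding F-Strike% alone; the later additions move the Adjusted R² by only a few ten-thousandths.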

The problem is that judging a regression on the same data set used to fit it can be biased, because the coefficients are tuned to that data. So I came up with a bizarre but useful solution.

I checked and found that approximately half of the data set (365 of 712 pitchers) had an “i” somewhere in their first or last name. Since I wanted an effectively random way to split the data set, I used that: I reran the four regressions that I wanted to check on each half of the data set and then checked the root mean square error between the predicted walk rate and the actual walk rate on the other half.

So, for example, I ran a regression of walk rate on the previous year’s walk rate, age, and the year for all pitchers without an “i” in their name, and came up with a set of coefficients. Then I used those coefficients to produce predicted walk rates for pitchers with at least one “i” in their name.

(It is probably safe to assume that there is no secret correlation between having an “i” in your name and improving your walk rate more than your statistics and age would suggest on their own.)
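The split-and-test procedure described above can be sketched in a few lines of Python. This is a simplified illustration, not the code behind the article: it uses a single predictor (previous walk rate) instead of the full set of regressors, treats the "i" check as case-insensitive, and runs on invented pitchers, all of which are assumptions.

```python
import math


def has_i(name):
    """True if the name contains the letter "i" (case-insensitive, an assumption)."""
    return "i" in name.lower()


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(predicted, actual):
    """Root mean square error between two equal-length lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def cross_check(rows):
    """rows: (name, previous_rate, next_rate). Fit on one half, score on the other."""
    with_i = [r for r in rows if has_i(r[0])]
    without_i = [r for r in rows if not has_i(r[0])]
    scores = {}
    for label, train, test in (
        ("fit on no-i, test on i", without_i, with_i),
        ("fit on i, test on no-i", with_i, without_i),
    ):
        intercept, slope = fit_line([r[1] for r in train], [r[2] for r in train])
        predicted = [intercept + slope * r[1] for r in test]
        scores[label] = rmse(predicted, [r[2] for r in test])
    return scores


# Hypothetical pitchers and walk rates, invented for illustration only.
toy_rows = [
    ("Al Ford", 0.060, 0.065), ("Bo Cruz", 0.085, 0.080), ("Ed Lane", 0.110, 0.095),
    ("Tim Diaz", 0.055, 0.060), ("Vic Ruiz", 0.090, 0.088), ("Kip Hill", 0.120, 0.101),
]
for label, score in cross_check(toy_rows).items():
    print(f"{label}: out-of-sample RMSE {score:.4f}")
```

The key design point is that each half is scored only with coefficients estimated on the other half, so a variable that merely fits noise in one half gets no credit.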

Then I could compare whether having this extra information was useful. Did the three regressions with BIS info beat the predicted walk rates of the regression without BIS info?

Variable	Coef (P-Stat) Name without “i”	Coef (P-Stat) Name with “i”	Coef (P-Stat) Name without “i”	Coef (P-Stat) Name with “i”	Coef (P-Stat) Name without “i”	Coef (P-Stat) Name with “i”	Coef (P-Stat) Name without “i”	Coef (P-Stat) Name with “i”
Constant	.0299 (.000)	.0240 (.000)	.1128 (.001)	.1432 (.000)	.0959 (.003)	.1409 (.000)	.0760 (.001)	.0849 (.000)
UBB/PA	.6731 (.000)	.7152 (.000)	.5508 (.000)	.5657 (.000)	.5774 (.000)	.5681 (.000)	.5922 (.000)	.6097 (.000)
Year ’02-‘04	-.0046 (.012)	.0012 (.506)	-.0037 (.058)	-.0002 (.908)	-.0046 (.012)	-.0004 (.828)	-.0044 (.015)	.0007 (.674)
Year ‘08	.0040 (.125)	-.0003 (.903)	.0034 (.196)	.0002 (.942)	.0040 (.127)	.0003 (.914)	.0039 (.133)	.0003 (.895)
Age	-.0002 (.259)	-.0001 (.639)	-.0002 (.333)	-.0001 (.644)	-.0002 (.396)	-.0001 (.645)	-.0002 (.369)	-.0001 (.446)
Zone%			-.0506 (.112)	-.0094 (.786)
F-Strike%			-.0622 (.070)	-.1020 (.002)	-.0758 (.023)	-.1053 (.001)	-.0696 (.033)	-.0872 (.004)
Contact%			-.01388 (.521)	-.0526 (.027)	-.0189 (.379)	-.0536 (.023)
R^2	.4451		.4578		.4538		.4525
Adj. R^2	.4386		.4466		.4441		.4445
RMSE Out-of-Sample	1.652%	1.597%	1.616%	1.567%	1.610%	1.567%	1.619%	1.574%

Each of the three regressions with BIS data included showed some improvement beyond using raw walk rate, age, and year. The best two both include contact rate and first-pitch strike rate; whether to use the one that includes zone rate as well is a judgment call. Since first-pitch strike rate and contact rate are not measured with error, and zone rate is, I would be inclined to choose the regression without zone rate.

For comparison, I ran the same check on the model of predicted strikeout rate from last week’s article, with and without O-Swing%, and saw basically no real difference. Including O-Swing% gave a slightly better prediction when running the regression on players without an “i” and testing on players with an “i,” but a slightly worse prediction when predicting the strikeout rates of pitchers without an “i” using coefficients generated by a regression only for pitchers with an “i” in their names. (Regression on data set without an “i” root mean square error: basic 2.926 percent, with O-Swing% 2.931 percent; regression data set with an “i” root mean square error: basic 2.888 percent, with O-Swing% 2.881 percent)

This analysis confirms that knowing pitch data is helpful in predicting walk rates, much more than in predicting strikeouts. There was plenty of evidence that pitchers who threw first-pitch strikes more often could improve their walk rates more than pitchers who threw them less, and there was also plenty of evidence that pitchers who allowed more contact improved their walk rates as well. I would also suspect that improving the measurement of whether pitches are in the strike zone would be very helpful on this front as well. This article should serve as two things. One is a demonstration of the utility of this type of granular information, and the other is a demonstration of the need for better measurements of such granular information.

TangoTiger1

10/01

Great stuff Matt. One more check if you can: ok, they improved their walk rate, but did their overall performance get better as well? If the reason they had a lower walk rate is that they threw more first-pitch strikes, and those first-pitch strikes are easier to hit, then that might be the explanation: it was a trade of fewer walks for more hits.

Can you check into this?

swartzm

Hmm... I will look into this tonight when I have access to my data again. The first-pitch strike rate was from the year before, though (the high-walk year), so the pitchers who had thrown more first-pitch strikes but had higher walk rates anyway improved their walk rates whether or not they threw more first-pitch strikes in the second (lower-BB) year.

baseballben

We park-adjust everything else... what about park-adjusting pitch location info? I guess you'd want Home/Road splits to do that, right?

I'm guessing that the park adjustments would vary a lot from pitcher to pitcher, though, because different pitchers throw to different regions, and from different angles, more than others. A guy with a sweeping slider might be particularly tough to evaluate.

On top of that, my home-field advantage studies found that the strike zone was where large home/road differentials existed, with BBs and Ks being big areas of home-field advantage. I would think adjusting pitcher to pitcher could be tricky.

All in all, I think doing it without a scientific measurement like PITCHf/x is going to be problematic. I think it's a great approach to baseball and the results above show it does have some value even with all its problems, but the granularity of noisy information makes it tough to really approach scientifically and learn something about.
