Pitching is a significant part of the success of a baseball team. Additionally, finding good pitchers and keeping them healthy is a goal of any successful organization, even the Brewers and Devil Rays. Research on the impact of pitchers’ workload on health may rely on game-by-game pitch counts to measure the pitcher’s workload. However, pitch counts by start are not readily available for the minor leagues or for major league seasons before 1994. Using Sean Forman’s database of over 30,000 major league starts from the 1994-2000 seasons, we have constructed a model to estimate the number of pitches made in each start.

Using data commonly available in newspaper box scores, e.g., innings pitched, hits, runs allowed, earned runs allowed, walks, and strikeouts, we can derive estimated pitch counts. In addition, we will look at how the designated hitter impacts pitch counts.

The Raw Data

The distribution of actual pitch counts is skewed slightly to the left, as shown by these selected percentiles of the actual distribution of pitch counts for the 1994-2000 seasons:

Actual Distribution of Pitch Counts

Minimum	    1		Maximum	   168
01-%ile	   36		99-%ile	   135
05-%ile	   58		95-%ile	   125
10-%ile	   70		90-%ile	   120
25-%ile	   85		75-%ile	   110
Median	   98		
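A quick way to read the skew off these percentiles is to compare how far matching lower and upper percentiles sit from the median. The sketch below does that with the table's values; the function and variable names are only illustrative.

```python
# Percentiles of actual pitch counts, copied from the table above.
MEDIAN = 98
LOWER = {1: 36, 5: 58, 10: 70, 25: 85}        # 1st, 5th, 10th, 25th percentiles
UPPER = {99: 135, 95: 125, 90: 120, 75: 110}  # matching upper percentiles

def tail_distances():
    """Return (percentile pair, distance below median, distance above median)."""
    rows = []
    for low_pct, low_val in LOWER.items():
        high_pct = 100 - low_pct
        rows.append((f"{low_pct}/{high_pct}",
                     MEDIAN - low_val,
                     UPPER[high_pct] - MEDIAN))
    return rows

for pair, below, above in tail_distances():
    print(f"{pair:>6}: {below} below the median, {above} above")
```

In every pair the lower percentile sits farther from the median than the upper one, which is what a longer left tail looks like.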

Most of the time, a pitcher works deeper into a game, and throws more pitches, because he is having a very good outing. Since the very good outings are far outnumbered by the poor ones, we would expect a greater number of starts with fewer pitches thrown, and that’s what we are seeing in the data. The goal of the model, then, is both to explain the variation in game-by-game pitch counts (i.e., achieve a high R²) and to reproduce this distribution accurately.

A Linear Model

The first model will be a simple linear regression. This will determine how much our dependent variable, number of pitches per start, is impacted by each of the following independent variables:

  • Innings Pitched (IP);

  • Hits (H);

  • Earned Runs Allowed (ER);

  • Unearned Runs Allowed (UER);

  • Walks (BB);

  • Strikeouts (SO);

  • A yes/no variable indicating whether the designated hitter was used in the game (DH).

Number of pitches thrown should increase directly as a function of number of batters faced; therefore, three of the independent variables reflect batters faced. A batter can be put out and recorded as one-third of an IP, get a hit, or be walked. The batter can also reach base via a hit by pitch, catcher’s interference, or an error. Hit by pitch information is not always available, and catcher’s interference is rare, so we’ll equate those with error in our regression equation. Data pertaining to reaching base on errors is not always available, but we may be able to gain insight into those effects in another way, through unearned runs.

The Runs Allowed and Earned Runs Allowed variables are both present in common newspaper box scores. However, these variables are so highly correlated (0.94 in the data set) that we can’t assume they are truly separate independent variables, so we can only use one in our equation. (For you statheads out there, it’s an issue of multi-collinearity.)

But the relationship between the Runs Allowed and Earned Runs Allowed variables does convey useful information. If the pitcher allows many runs, but few earned runs, it may be because the defense made errors in “support” of the pitcher. This could result in the pitcher having to face more batters, without those batters showing up in the pitcher’s line score through the IP, H, or BB variables. Therefore the presence of UER may signal that errors have been made, and could be expected to have a positive effect on pitch counts. Utilizing both the ER and UER variables allows more information to be considered, and the negligible correlation between these variables (-0.07) eliminates multi-collinearity problems.
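For readers who want to run the same kind of check on their own box scores, here is a minimal sketch: derive UER as runs minus earned runs, then compare the two correlations. The eight (R, ER) pairs are made up for illustration and are not drawn from the 1994-2000 data set, and the helper names are our own.

```python
import math

def unearned_runs(runs, earned_runs):
    """UER is simply total runs allowed minus earned runs."""
    return runs - earned_runs

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up (R, ER) pairs for illustration only; not real starts.
starts = [(3, 3), (5, 4), (0, 0), (2, 1), (6, 6), (4, 2), (1, 1), (7, 5)]
runs = [r for r, _ in starts]
earned = [er for _, er in starts]
unearned = [unearned_runs(r, er) for r, er in starts]

print(round(pearson(runs, earned), 2))      # R vs. ER
print(round(pearson(earned, unearned), 2))  # ER vs. UER
```

If the first figure comes out close to 1 and the second close to 0 on real data, that is the pattern described above: R and ER carry nearly the same information, while ER and UER do not.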

The results of the linear model are shown below:

Linear Regression Results

Variable	Coefficient	Std. Error     T-Statistic
Constant	12.35	        0.31		 40.47
IP		 8.37	        0.04		198.72
H		 2.15	        0.03		 69.83
ER		 0.14	        0.04		  3.61
UER		 1.04	        0.08		 13.79
BB		 4.72	        0.04		129.10
SO		 1.89	        0.02		 76.74
DH		 2.50	        0.11		 23.56

This model explains 79% of the variation in pitch counts from game to game (R² = .79; all variables are significant at the 99% level and are in the expected directions). An advantage of the linear regression model is that the coefficients are intuitive and translate directly into number of pitches. One inning pitched takes, all else being equal, about 8.4 pitches. One hit allowed adds about two pitches. One earned run adds virtually no pitches, but one unearned run adds about one pitch. One walk adds almost five pitches, and one strikeout adds almost two. Finally, the use of the DH in a given game contributes about 2.5 pitches per start.
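To make the coefficients concrete, the sketch below plugs a box-score line into the linear equation. The coefficients come straight from the table. The function names, the sample line, and the helper that converts box-score innings notation (where "6.1" means six innings plus one out) are our own additions.

```python
# Coefficients copied from the Linear Regression Results table.
LINEAR_COEFS = {
    "const": 12.35, "IP": 8.37, "H": 2.15, "ER": 0.14,
    "UER": 1.04, "BB": 4.72, "SO": 1.89, "DH": 2.50,
}

def box_score_ip_to_innings(ip_text):
    """Convert box-score notation such as '6.1' (six innings, one out) to 6.333..."""
    whole, _, thirds = ip_text.partition(".")
    return int(whole) + (int(thirds) if thirds else 0) / 3

def estimate_pitches_linear(ip, h, er, uer, bb, so, dh):
    """Estimated pitch count for one start; dh is True when the DH was in use."""
    c = LINEAR_COEFS
    return (c["const"] + c["IP"] * ip + c["H"] * h + c["ER"] * er
            + c["UER"] * uer + c["BB"] * bb + c["SO"] * so
            + c["DH"] * (1 if dh else 0))

# Hypothetical line: 6.1 IP, 6 H, 3 R, 3 ER, 2 BB, 5 SO in a game without the DH.
ip = box_score_ip_to_innings("6.1")
print(round(estimate_pitches_linear(ip, h=6, er=3, uer=0, bb=2, so=5, dh=False)))
```

Innings have to be passed in as true fractions; feeding the box-score figure 6.1 in directly would understate the workload by almost a quarter of an inning.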

One might wonder how anyone but Walter Johnson can record a strikeout with two pitches. (One story about Johnson tells of the time a hitter took two strikes from Johnson and then began to walk away from the plate. The umpire told him, “Hey, you’ve got one strike left!” to which the batter replied, “You keep it, I don’t want it.”) But remember that a strikeout also results in an out, so the other pitches are already accounted for as one-third of the IP. The groundout, then, is not only more democratic than the strikeout, but uses about two fewer pitches. Crash Davis was ahead of his time.
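The arithmetic behind that point is short enough to spell out. This sketch uses only the linear coefficients from the table and leaves the constant aside, so the figures are marginal pitches per event rather than full pitches per plate appearance.

```python
# Linear-model coefficients from the table above.
IP_COEF, H_COEF, BB_COEF, SO_COEF = 8.37, 2.15, 4.72, 1.89

per_out = IP_COEF / 3              # every out is one-third of an inning
per_strikeout = per_out + SO_COEF  # a strikeout is an out plus the SO term

print(f"ordinary out: {per_out:.2f} pitches")
print(f"strikeout:    {per_strikeout:.2f} pitches")
print(f"hit:          {H_COEF:.2f} pitches")
print(f"walk:         {BB_COEF:.2f} pitches")
```

So the model charges a strikeout with roughly 4.7 pitches in total, against roughly 2.8 for any other kind of out.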

To evaluate the accuracy of the model, we can examine certain percentiles of the error terms, or residuals:

Distribution of Linear Model Residuals

Minimum		-34.91		Maximum		37.47
01-%ile		-20.56		99-%ile	        22.64
05-%ile		-14.82		95-%ile	        15.62
10-%ile		-11.70		90-%ile	        12.06
25-%ile		 -6.38		75-%ile	         6.19
Median		 -0.21		

So, for the 90% of starts that fall between the 5th and 95th percentiles, our model will typically predict the count within about 15 pitches, plus or minus. For all starts except the 1% in each tail of the distribution, the model is correct within about 21 to 23 pitches. Below we will examine the model distribution of pitches:
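A residual table like this one can be produced along the following lines. This is only a sketch: the sign convention (actual minus estimated) is an assumption, since the article does not state it, and the eight sample starts are made up purely to show the call.

```python
import statistics

def residual_percentiles(actual, predicted,
                         percentiles=(1, 5, 10, 25, 50, 75, 90, 95, 99)):
    """Percentiles of (actual - predicted) pitch counts, one residual per start."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(residuals, n=100, method="inclusive")
    return {q: cuts[q - 1] for q in percentiles}

# Made-up actual and estimated counts, purely to show the call.
actual = [98, 110, 85, 120, 70, 101, 93, 58]
predicted = [95, 104, 90, 113, 78, 99, 97, 66]
summary = residual_percentiles(actual, predicted)
print(summary[5], summary[50], summary[95])
```

With tens of thousands of starts the choice of interpolation method barely matters; with a handful of starts, as here, it does.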

Distribution of Linear Model Results

Minimum	   15	 	Maximum	   155
01-%ile	   43		99-%ile	   131
05-%ile	   62		95-%ile	   123
10-%ile	   73		90-%ile	   117
25-%ile	   86		75-%ile	   108
Median	   98		

The model doesn’t seem to be picking up the tails of the distribution as much as we might like. Compared to the first table, we can see that the 1st percentile differs from the actual distribution by seven pitches, the 5th percentile by four, and the 10th by three. Similarly, the 99th percentile is off by four pitches, the 90th is off by three, and the 95th and 75th percentiles are off by two. This failure to pick up the tails might suggest a non-linearity of the relationship between the dependent and independent variables. However, we can use the information we’ve learned to establish parameters for a non-linear model.

A Non-Linear Model

Our non-linear model uses the natural log of the number of pitches as the dependent variable, and applies both a linear (beta) and non-linear coefficient (alpha) to the independent variables. The non-linear coefficient is not applied to the constant or the DH yes/no variable. The results of this regression are below:

Non-Linear Regression Results

                      Beta                               Alpha
Variable   Coefficient  Std. Error  T-Stat    Coefficient  Std. Error  T-Stat
Constant      2.08        0.022      95.52
IP            1.00        0.019      51.57       0.32        0.0044     72.49
H             0.26        0.012      21.48       0.28        0.0099     28.76
ER            0.00        0.001       3.12       1.55        0.1484     10.47
UER           0.02        0.002      11.05       0.75        0.0860      8.77
BB            0.07        0.002      41.10       0.84        0.0130     64.29
SO            0.05        0.003      19.39       0.64        0.0185     34.80
DH            0.03        0.001      21.64

This new non-linear model explains three percentage points more of the variation in pitch counts than the original linear model (R² = .82, against .79; all variables are significant at the 99% level and have the expected signs). Due to the non-linearity of the functional form, the parameters are not as intuitive, but the magnitudes are still consistent with the linear model. The variables with higher coefficients in the linear model maintain higher coefficients in the non-linear model.
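The article gives the coefficients but not the equation itself. One reading consistent with the description is ln(pitches) = constant + the sum of beta × x^alpha over the six counting stats, plus beta × DH. The sketch below assumes that form; treat it as a guess, not a confirmed specification. The published coefficients are also rounded to two decimals (the ER beta shows as 0.00), so the output is only approximate.

```python
import math

# (beta, alpha) pairs copied from the Non-Linear Regression Results table.
NONLINEAR_COEFS = {
    "IP": (1.00, 0.32), "H": (0.26, 0.28), "ER": (0.00, 1.55),
    "UER": (0.02, 0.75), "BB": (0.07, 0.84), "SO": (0.05, 0.64),
}
CONSTANT = 2.08
DH_BETA = 0.03  # no alpha is applied to the constant or the DH indicator

def estimate_pitches_nonlinear(ip, h, er, uer, bb, so, dh):
    """Estimate under the ASSUMED form ln(pitches) = c + sum(beta * x**alpha) + beta_DH * DH."""
    line = {"IP": ip, "H": h, "ER": er, "UER": uer, "BB": bb, "SO": so}
    log_pitches = CONSTANT + DH_BETA * (1 if dh else 0)
    for name, (beta, alpha) in NONLINEAR_COEFS.items():
        log_pitches += beta * line[name] ** alpha
    return math.exp(log_pitches)

# Hypothetical line: 6 IP, 6 H, 3 ER, 0 UER, 2 BB, 5 SO, no DH.
print(round(estimate_pitches_nonlinear(6, 6, 3, 0, 2, 5, dh=False)))
```

For this sample line the assumed form gives an estimate in the mid-90s, close to what the linear coefficients produce for the same line, which is some reassurance that the reading is reasonable.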

Let’s take a look at the percentiles of the non-linear model’s residuals:

Distribution of Non-Linear Model Residuals
Minimum	    -36.88	   Maximum	42.92
01-%ile	    -21.20	   99-%ile	23.50
05-%ile	    -14.98	   95-%ile	16.21
10-%ile	    -11.70	   90-%ile	12.72
25-%ile	     -6.12	   75-%ile	 6.75
Median	      0.25		

These are comparable to the Linear Model, in that for the middle 98% of the distribution, the Non-Linear Model is within about 21 to 23 pitches and for the middle 90% it is within about 15 pitches. Let’s see how it models the underlying distribution:

Distribution of Non-Linear Model Results
Minimum	    9		Maximum    170
01-%ile	   40		99-%ile	   135
05-%ile	   60		95-%ile	   124
10-%ile	   72		90-%ile	   118
25-%ile	   85		75-%ile	   108
Median	   97		

This is an improvement on the linear model, as the 1st percentile is now within four pitches of the actual distribution, and the 5th and 10th percentiles are within two. The 99th percentile is now equal to the original distribution, and the 95th and 90th percentiles are only off by one and two pitches, respectively. This model is doing a much better job of picking up the tails of the actual distribution. Therefore, we can conclude that this non-linear model of estimating pitch counts improves on our linear model.
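The percentile-by-percentile comparison can be tabulated directly from the three distribution tables. The summed absolute gap at the end is our own rough summary figure, not a statistic from the article.

```python
PCTS = (1, 5, 10, 25, 50, 75, 90, 95, 99)

# Values copied from the actual, linear-model, and non-linear-model tables.
ACTUAL     = dict(zip(PCTS, (36, 58, 70, 85, 98, 110, 120, 125, 135)))
LINEAR     = dict(zip(PCTS, (43, 62, 73, 86, 98, 108, 117, 123, 131)))
NON_LINEAR = dict(zip(PCTS, (40, 60, 72, 85, 97, 108, 118, 124, 135)))

def gaps(model):
    """Model percentile minus actual percentile, in pitches."""
    return {p: model[p] - ACTUAL[p] for p in PCTS}

def total_abs_gap(model):
    """Sum of absolute gaps across the nine tabulated percentiles."""
    return sum(abs(g) for g in gaps(model).values())

for name, model in (("linear", LINEAR), ("non-linear", NON_LINEAR)):
    print(name, gaps(model), "total:", total_abs_gap(model))
```

The non-linear model is not better everywhere: its median is one pitch low where the linear model matched exactly, and the 75th percentile is unchanged. The gains are concentrated in the tails.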

Finally, we also want to make sure that these relationships are not changing over time. To test the stability of the model, we ran a final regression with dummy variables for each season from 1994 to 2000. While some of the yearly dummy variables were significant at the 95% level, the magnitude of the coefficients was no more than one pitch per start. This tells us that the non-linear relationship shifts by no more than about one pitch per start from one season to another over the course of the data set, a difference small enough to treat as noise for our purposes.

To sum up, we can use this non-linear model to estimate pitch counts for historical major league seasons, as well as minor league seasons, when actual pitch count data may be unavailable. This information could make it easier to study the effects of pitcher workload on pitcher health; any advance that could potentially lower injury rates among pitchers is a step in the right direction.

Ted Kury is an economist with an energy marketing firm in Jacksonville, Florida.
