June 30, 2003
Estimating Pitch Counts
Filling the Gaps in Historical and Minor League Results
Pitching is a significant part of the success of a baseball team. Additionally, finding good pitchers and keeping them healthy is a goal of any successful organization--even the Brewers and Devil Rays. Research on the impact of pitchers' workload on health may rely on game-by-game pitch counts to measure the pitcher's workload. However, pitch counts by start are not readily available for the minor leagues or for major league seasons before 1994. Using Sean Forman's database of over 30,000 major league starts from the 1994-2000 seasons, we have constructed a model to estimate the number of pitches made in each start.
Using data commonly available in newspaper box scores, e.g., innings pitched, hits, runs allowed, earned runs allowed, walks, and strikeouts, we can derive estimated pitch counts. In addition, we will look at how the designated hitter impacts pitch counts.
The Raw Data
The distribution of actual pitch counts is skewed slightly to the left, as shown by these selected percentiles of the actual distribution of pitch counts for the 1994-2000 seasons:
Actual Distribution of Pitch Counts
Percentile  Pitches    Percentile  Pitches
Minimum           1    Maximum         168
1st              36    99th            135
5th              58    95th            125
10th             70    90th            120
25th             85    75th            110
Median           98
Most of the time, the pitcher is pitching deeper into games, and throwing more pitches, because he is having a very good outing. Since the very good outings are far outweighed by the poor ones, we would expect a greater number of starts with fewer pitches thrown, and that's what we are seeing from the data. The goal of the model, then, is both to explain the variations in game-by-game pitch counts (i.e., achieve a high R²) and to model this distribution accurately.
A Linear Model
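The first model will be a simple linear regression. This will determine how much our dependent variable, number of pitches per start, is impacted by each of the following independent variables: innings pitched (IP), hits (H), earned runs (ER), unearned runs (UER), walks (BB), strikeouts (SO), and whether the designated hitter was used in the game (DH).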
Number of pitches thrown should increase directly as a function of number of batters faced; therefore, three of the independent variables reflect batters faced. A batter can be put out and recorded as one-third of an IP, get a hit, or be walked. The batter can also reach base via a hit by pitch, catcher's interference, or an error. Hit by pitch information is not always available, and catcher's interference is rare, so we'll equate those with error in our regression equation. Data pertaining to reaching base on errors is not always available, but we may be able to gain insight into those effects in another way, through unearned runs.
The Runs Allowed and Earned Runs Allowed variables are both present in common newspaper box scores. However, these variables are so highly correlated (0.94 in the data set) that we can't assume they are truly separate independent variables, so we can only use one in our equation. (For you statheads out there, it's an issue of multi-collinearity.)
But the relationship between the Runs Allowed and Earned Runs Allowed variables does convey useful information. If the pitcher allows many runs, but few earned runs, it may be because the defense made errors in "support" of the pitcher. This could result in the pitcher having to face more batters, without those batters showing up in the pitcher's line score through the IP, H, or BB variables. Therefore the presence of UER may signal that errors have been made, and could be expected to have a positive effect on pitch counts. Utilizing both the ER and UER variables allows more information to be considered, and the negligible correlation between these variables (-0.07) eliminates multi-collinearity problems.
The results of the linear model are shown below:
Linear Regression Results
Variable  Coefficient  Std. Error  T-Statistic
Constant        12.35        0.31        40.47
IP               8.37        0.04       198.72
H                2.15        0.03        69.83
ER               0.14        0.04         3.61
UER              1.04        0.08        13.79
BB               4.72        0.04       129.10
SO               1.89        0.02        76.74
DH               2.50        0.11        23.56
This model explains 79% of the variation in pitch counts from game to game (R² = .79; all variables are significant at the 99% level and are in the expected directions). An advantage of the linear regression model is that the coefficients are intuitive and translate directly into number of pitches. One inning pitched takes, all else being equal, about 8.4 pitches. One hit allowed adds about two pitches. One earned run adds virtually no pitches, but one unearned run adds about one pitch. One walk adds almost five pitches, and one strikeout adds almost two. Finally, the use of the DH in a given game contributes about 2.5 pitches per start.
One might wonder how anyone but Walter Johnson can record a strikeout with two pitches. (One story about Johnson tells of the time a hitter took two strikes from Johnson and then began to walk away from the plate. The umpire told him, "Hey, you've got one strike left!" to which the batter replied, "You keep it, I don't want it.") But remember that a strikeout also results in an out, so the other pitches are already accounted for as one-third of the IP. The groundout, then, is not only more democratic than the strikeout, but uses about two fewer pitches. Crash Davis was ahead of his time.
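The linear coefficients reported above can be applied directly to a box-score line. The following is a minimal sketch, not the author's own code: the function name and argument names are illustrative, and it assumes innings pitched are entered as true fractions (6 1/3 innings as 6 + 1/3, not the box-score shorthand 6.1).

```python
# Coefficients copied from the linear regression results table.
LINEAR_COEFFS = {
    "const": 12.35,
    "ip": 8.37,
    "h": 2.15,
    "er": 0.14,
    "uer": 1.04,
    "bb": 4.72,
    "so": 1.89,
    "dh": 2.50,
}


def estimate_pitches_linear(ip, h, er, uer, bb, so, dh):
    """Estimate pitches in one start from the linear model.

    ip is assumed to be in true innings (e.g. 6 + 1/3); dh is True when
    the designated hitter was used in the game.
    """
    c = LINEAR_COEFFS
    return (
        c["const"]
        + c["ip"] * ip
        + c["h"] * h
        + c["er"] * er
        + c["uer"] * uer
        + c["bb"] * bb
        + c["so"] * so
        + c["dh"] * (1 if dh else 0)
    )


# Hypothetical line: 7 IP, 6 H, 2 ER, 0 UER, 2 BB, 5 SO in a DH game.
estimate = estimate_pitches_linear(7, 6, 2, 0, 2, 5, dh=True)
print(f"Estimated pitches: {estimate:.1f}")

# Holding innings fixed, turning one ordinary out into a strikeout
# adds only the SO coefficient, since the out itself is already in IP.
extra = estimate_pitches_linear(7, 6, 2, 0, 2, 6, dh=True) - estimate
print(f"Extra pitches for one more strikeout: {extra:.2f}")
```

The second calculation mirrors the strikeout-versus-groundout point: with the out already counted as one-third of an inning, the strikeout costs only the additional SO coefficient.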
To evaluate the accuracy of the model, we can examine certain percentiles of the error terms, or residuals:
Distribution of Linear Model Residuals
Percentile  Residual    Percentile  Residual
Minimum       -34.91    Maximum        37.47
1st           -20.56    99th           22.64
5th           -14.82    95th           15.62
10th          -11.70    90th           12.06
25th           -6.38    75th            6.19
Median         -0.21
So, for the 90% of starts that fall between the 5th and 95th percentiles, our model will typically predict the count within about 15 pitches, plus or minus. For the middle 98% of starts, that is, all but the 1% at each tail of the distribution, the model is correct within 21 to 23 pitches. Below we will examine the model distribution of pitches:
Distribution of Linear Model Results
Percentile  Pitches    Percentile  Pitches
Minimum          15    Maximum         155
1st              43    99th            131
5th              62    95th            123
10th             73    90th            117
25th             86    75th            108
Median           98
The model doesn't seem to be picking up the tails of the distribution as much as we might like. Compared to the first table, we can see that the 1st percentile differs from the actual distribution by seven pitches, the 5th percentile by four, and the 10th by three. Similarly, the 99th percentile is off by four pitches, the 90th is off by three, and the 95th and 75th percentiles are off by two. This failure to pick up the tails might suggest a non-linearity of the relationship between the dependent and independent variables. However, we can use the information we've learned to establish parameters for a non-linear model.
A Non-Linear Model
Our non-linear model uses the natural log of the number of pitches as the dependent variable, and applies both a linear (beta) and non-linear coefficient (alpha) to the independent variables. The non-linear coefficient is not applied to the constant or the DH yes/no variable. The results of this regression are below:
Non-Linear Regression Results
Variable  Beta Coeff.  Std. Error  T-Stat    Alpha Coeff.  Std. Error  T-Stat
Constant         2.08       0.022   95.52
IP               1.00       0.019   51.57            0.32      0.0044   72.49
H                0.26       0.012   21.48            0.28      0.0099   28.76
ER               0.00       0.001    3.12            1.55      0.1484   10.47
UER              0.02       0.002   11.05            0.75      0.0860    8.77
BB               0.07       0.002   41.10            0.84      0.0130   64.29
SO               0.05       0.003   19.39            0.64      0.0185   34.80
DH               0.03       0.001   21.64
This new non-linear model explains three percentage points more of the variation in pitch counts than the original linear model (R² = .82 versus .79; all variables are significant at the 99% level and have the expected signs). Due to the non-linearity of the functional form, the parameters are not as intuitive, but the magnitudes are still consistent with the linear model. The variables with higher coefficients in the linear model maintain higher coefficients in the non-linear model.
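The article does not spell out the exact functional form, so the sketch below rests on an assumption: that each independent variable enters as beta times the variable raised to the power alpha, with the constant and the DH flag entering linearly, and the sum giving the natural log of the pitch count. The function and argument names are illustrative. Because the published coefficients are rounded (the ER beta appears as 0.00, so earned runs contribute nothing here), the output should be read as approximate.

```python
import math

# Assumed form (not stated explicitly in the article):
#   ln(pitches) = const + sum(beta_i * x_i ** alpha_i) + beta_dh * DH
CONST = 2.08
DH_BETA = 0.03

# (beta, alpha) pairs copied from the non-linear regression results table.
NONLINEAR_COEFFS = {
    "ip": (1.00, 0.32),
    "h": (0.26, 0.28),
    "er": (0.00, 1.55),  # beta is rounded to 0.00 in the published table
    "uer": (0.02, 0.75),
    "bb": (0.07, 0.84),
    "so": (0.05, 0.64),
}


def estimate_pitches_nonlinear(ip, h, er, uer, bb, so, dh):
    """Estimate pitches in one start under the assumed non-linear form.

    Inputs must be non-negative; ip is assumed to be in true innings.
    """
    values = {"ip": ip, "h": h, "er": er, "uer": uer, "bb": bb, "so": so}
    log_pitches = CONST + DH_BETA * (1 if dh else 0)
    for name, (beta, alpha) in NONLINEAR_COEFFS.items():
        log_pitches += beta * float(values[name]) ** alpha
    return math.exp(log_pitches)


# Hypothetical line: 7 IP, 6 H, 2 ER, 0 UER, 2 BB, 5 SO in a DH game.
print(round(estimate_pitches_nonlinear(7, 6, 2, 0, 2, 5, dh=True), 1))
```

Under this assumed form, the DH coefficient of 0.03 acts multiplicatively: it scales the estimate by roughly 3% rather than adding a fixed number of pitches.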
Let's take a look at the percentiles of the non-linear model's residuals:
Distribution of Non-Linear Model Residuals
Percentile  Residual    Percentile  Residual
Minimum       -36.88    Maximum        42.92
1st           -21.20    99th           23.50
5th           -14.98    95th           16.21
10th          -11.70    90th           12.72
25th           -6.12    75th            6.75
Median          0.25
These are comparable to the Linear Model, in that for the middle 98% of the distribution, the Non-Linear Model is within about 21 to 23 pitches and for the middle 90% it is within about 15 pitches. Let's see how it models the underlying distribution:
Distribution of Non-Linear Model Results
Percentile  Pitches    Percentile  Pitches
Minimum           9    Maximum         170
1st              40    99th            135
5th              60    95th            124
10th             72    90th            118
25th             85    75th            108
Median           97
This is an improvement on the linear model, as the 1st percentile is now within four pitches of the actual distribution, and the 5th and 10th percentiles are within two. The 99th percentile is now equal to the original distribution, and the 95th and 90th percentiles are only off by one and two pitches, respectively. This model is doing a much better job of picking up the tails of the actual distribution. Therefore, we can conclude that this non-linear model of estimating pitch counts improves on our linear model.
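The percentile comparison can be checked with a few lines of arithmetic. This sketch copies the percentile values from the three distribution tables in this article; the helper name and the summed absolute gap are illustrative choices, not measures the author reports.

```python
# Percentile -> pitch count, copied from the distribution tables.
ACTUAL = {1: 36, 5: 58, 10: 70, 25: 85, 50: 98, 75: 110, 90: 120, 95: 125, 99: 135}
LINEAR = {1: 43, 5: 62, 10: 73, 25: 86, 50: 98, 75: 108, 90: 117, 95: 123, 99: 131}
NONLINEAR = {1: 40, 5: 60, 10: 72, 25: 85, 50: 97, 75: 108, 90: 118, 95: 124, 99: 135}


def percentile_gaps(model, actual):
    """Return model-minus-actual pitch counts at each listed percentile."""
    return {p: model[p] - actual[p] for p in actual}


linear_gaps = percentile_gaps(LINEAR, ACTUAL)
nonlinear_gaps = percentile_gaps(NONLINEAR, ACTUAL)

for p in ACTUAL:
    print(f"{p:>2}th percentile: linear {linear_gaps[p]:+d}, non-linear {nonlinear_gaps[p]:+d}")

# Summed absolute gap across the nine listed percentiles.
print(sum(abs(g) for g in linear_gaps.values()))     # 26
print(sum(abs(g) for g in nonlinear_gaps.values()))  # 14
```

The positive gaps at the low percentiles and negative gaps at the high ones show both models pulling the tails toward the middle, with the non-linear model doing so less.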
Finally, we also want to make sure that these relationships are not changing over time. To test the stability of the model, we ran a final regression with dummy variables for each season from 1994 to 2000. While some of the yearly dummy variables were significant at the 95% level, the magnitude of the coefficients was no more than one pitch per start. This tells us that the non-linear relationship shifts by no more than about one pitch per start from season to season over the course of the data set. A shift that small is negligible for practical purposes.
So, summing up, we can use this non-linear model to estimate pitch counts for historical major league seasons, as well as minor league seasons, when actual pitch count data may be unavailable. This information could make it easier to study the effects of pitcher workload on pitcher health; any advance that could potentially lower injury rates among pitchers is a step in the right direction.
Ted Kury is an economist with an energy marketing firm in Jacksonville, Florida. You can reach him at firstname.lastname@example.org.