June 5, 2002
Aim For The Head
PAP^3 FAQ
PAP^3 is the name for the new system for measuring pitcher abuse via pitch counts introduced in Baseball Prospectus 2001. Though it shares a similar name and goal with a system previously introduced by Rany Jazayerli, it was developed independently, and replaces the older system. The two articles in BP 2001 explained how PAP^3 (or simply "PAP," since the old system is being retired) helps quantify the relationship between high pitch counts and both short-term performance and long-term injury risk.
Since the publication of the book, I've received many thoughtful questions and comments about the research and the PAP^3 system. Probably the single most complete critique is Sean Forman's write-up at Baseball Primer. The questions in this FAQ aren't necessarily taken verbatim from reader comments, as I've tried to amalgamate the various questions raised.
Q: How are the new PAP reports on the Web site sorted? I didn't see anything that clued me in; not name, not PAP, not innings pitched.
A: This was a goof on my part. The report is sorted by average PAP per start, but I neglected to actually print that column. We'll have it fixed for this year's reports.
Q: Hey, there's a problem with the graphs published in the book!
A: Yes, the graphs do have a mistake, though not, as it turns out, one that unduly affects the conclusions of the article. The problem is essentially that the data points are plotted at the leftmost point of the data range, rather than the midpoint. E.g., for 90-99 pitches, the data point is plotted at 90 instead of 95 (or 94.5, to be more accurate). I've corrected the graphs from the ones that appear in the book so that each multiple of ten appears at the midpoint of its range; the corrected versions are the ones found in the web version of the original articles, and are reproduced below. (Thanks to Sean Forman, who was the first to bring this to my attention.)
[Figure: corrected graphs from the original articles]
Q: The star pitcher on my favorite team just threw 130 pitches! Shouldn't the manager be fired?
A: Not necessarily. Though PAP is measured on a per-start basis, a single long outing is probably not enough to cause significant injury risk. You might expect your pitcher to be a bit less than his usual dominating self for the next few starts, however. For example, if the pitcher in question started 30 games, throwing 130 pitches in one of them and exactly 100 in each of the rest, his Stress level for the season would be 8.9, which is not worth worrying about. It is a pattern of high-stress workloads over the course of a career that is of most concern.
What's worth more scrutiny is the circumstances under which the pitcher was allowed to throw so many pitches. In a close game against a tough opponent, or versus a division rival during a pennant chase, or even if the pitcher is nearing a personal accomplishment such as a no-hitter, shutout, or career high in strikeouts, there may be justification for leaving him in. Letting him burn himself out in a 15-2 blowout is more suspect (unless, perhaps, the game is at Coors Field or Enron Memorial Stadium).
Q: What about age effects? The old PAP system had an adjustment for young pitchers.
A: The new PAP research did not study the role of age in injury susceptibility beyond grouping pitchers of similar age and career workload in studying injury risk. Thus, the new PAP system itself is silent on the question of age: nothing in the study confirms or refutes the notion that young pitchers are at more risk, though some other research (notably Craig Wright's) suggests that it may be the case. Similarly, PAP may not be valid for high school, college, and low-minors pitchers who are significantly younger and in earlier stages of physical development than most of the pitchers in the major-league study. It's probably better than nothing, but the threshold (100 pitches in PAP), in particular, could be quite a bit lower for less developed arms.
Q: Does PAP address whether having additional days of rest can offset the damage of a high-pitch start? For example, Bud Smith was skipped once in the Cardinal rotation following his 134-pitch no-hitter in 2001.
A: No, PAP in its current form captures "typical" usage patterns, which essentially means the five-man rotation that is standard today. No attempt was made to look at how pitching on short or long rest affects performance. One difficulty with studying the effect in cases like Bud Smith's is that it's very hard to tell from historical records whether a pitcher skipped a turn in the rotation as a preventative measure, or because he was already hurting.
Q: Does PAP address pitching on three days of rest versus four days?
A: No. PAP, because it's based on averages across pitchers from the late '80s and 1990s, is very much a product of that particular baseball environment. The five-man rotation has been a fixture throughout that period, and thus PAP can generally be said to be based on the four days of rest common in modern pitching rotations. There was no attempt to distinguish starts based on the amount of rest the pitcher got.
Q: Why does the PAP model start with 100 pitches instead of 130 where the curve really takes off?
A: As mentioned in the article, I based the system on the lowest pitch count at which we see a consistent decline in one of the four components of the performance index. The component that declined first happened to be strikeout rate. Furthermore, referring to the distribution-of-personal-thresholds model described elsewhere in this FAQ, if there are pitchers with low thresholds, there would be a real chance of pushing one of them past his personal threshold with a relatively low number of pitches. A good system would capture this small, but real, risk even at lower pitch-count levels.
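To make the role of the 100-pitch threshold concrete, here is a minimal sketch of the arithmetic. It assumes that PAP^3 for a start is the cube of the pitches thrown beyond 100, and that Stress is total PAP divided by total pitches thrown; those assumptions reproduce the 8.9 figure from the 130-pitch example earlier in this FAQ. The function names are purely illustrative.

```python
def pap3(pitches, threshold=100):
    """Abuse points for one start: cube of the pitches thrown past the threshold.

    Assumed form of PAP^3; starts at or below the threshold score zero.
    """
    return max(0, pitches - threshold) ** 3


def stress(pitch_counts):
    """Assumed definition of Stress: total PAP divided by total pitches thrown."""
    total_pitches = sum(pitch_counts)
    if total_pitches == 0:
        return 0.0
    return sum(pap3(n) for n in pitch_counts) / total_pitches


# The example from this FAQ: 30 starts, one of 130 pitches, the rest exactly 100.
season = [130] + [100] * 29
print(round(stress(season), 1))  # 8.9
```

Because nothing below the threshold scores any points, the 29 starts of exactly 100 pitches add to the pitch total but not to the PAP total.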
Q: There are examples of pitchers throwing over 200 pitches a game. With the amount of abuse PAP would assign to that start, their arms would have fallen off! Does this mean that PAP is fundamentally flawed?
A: PAP, like any tool, should only be used within the ranges where it's been validated. There were very few starts above 140 pitches in the study, and therefore we can't say with any certainty how much worse the results would be as pitch counts rise above that level. Using PAP in its current form for starts much above 140 is not advised. To be conservative, capping PAP at the 140- or 150-pitch level makes some sense. Otherwise, the PAP from a 200-pitch start would probably suggest that the pitcher's arm would spontaneously combust.
As a purely hypothetical example, suppose that once a pitcher starts tiring, his decline continues until he's "fully fatigued," or to use baseball slang, "out of gas." If a typical pitcher's threshold for full fatigue is around 160 pitches, then it's possible that the loss of effectiveness as measured by PAP would flatten out within the next 10-20 pitches above the studied range. In such a case, 200 pitches would not have more impact on a pitcher's short-term performance than 180 pitches, and PAP would need to be modified to reflect that relationship once it's discovered.
Q: Curveballs are supposedly more damaging to a pitcher's arm than fastballs. Does that have an effect on PAP?
A: The type of pitches thrown is not currently a factor in PAP. In fact, the historical records do not include breakdowns for the number of fastballs, curves, changeups, sliders, split-fingers, and the like thrown in a given start, so that kind of detailed analysis will probably have to wait for better record-keeping. It may be possible to do some study involving pitchers who were "known" to be curveball pitchers (Bert Blyleven) or fastball pitchers (Nolan Ryan), but short of watching video of every start in their careers, the estimation of pitches by type is extremely dicey.
Q: Injuries come from pitchers being tired, not an arbitrary number of pitches.
A: Pitch counts, and systems like PAP, are ways to derive a measurable index that is related to injury or ineffectiveness, which are in turn caused by fatigue. However, fatigue cannot be reliably and objectively assessed based on information available to the manager at the time that a decision needs to be made. Self-reporting of fatigue by competitive athletes is notoriously inaccurate. Expert assessment, in the form of coaches' observations of a pitcher's mechanics, may help determine when a given pitcher gets tired, although the quality of those assessments is difficult to evaluate. PAP can be another tool in a manager's arsenal, and a useful one, but it is not a one-stop shop for making these kinds of decisions.
Q: Does PAP predict the type of arm injury that is most likely to occur? Do older pitchers have a higher chance of reinjury? Are certain types of injuries more likely to recur?
A: No, unknown, and unknown. The PAP research to date has not included age, previous injury, or breakdowns of different types of arm injuries, and thus can't be used as a basis for concluding anything about the role of age or type of injury.
Q: Is there any connection between PAP and pitchers trying to come back from injuries?
A: It's possible that there's an effect on recovery. A reasonable person might suspect that high stress has greater impact on an already damaged arm, but this hasn't been explicitly studied. It's a great area for further research, perhaps even more important than focusing on age.
Q: How can throwing one more pitch change a "Moderate" risk to a "Significant" risk just because it happens to be the 122nd pitch of the night? It seems like an arbitrary distinction.
A: Any classification system that organizes a continuum of possible effects into a finite number of discrete groups is going to have break points or discontinuities where the local transition is overstated. In other words, whenever you simplify a complex phenomenon into categories, cases near the borders of those categories will be more similar to each other than you'd expect from looking at the average member of each category.
Quality Starts is another baseball example of this. You can argue about whether "six innings, three earned runs" ought to be a Quality Start or not, but wherever you draw the line, the average QS will be significantly better than the average non-QS.
This isn't limited to baseball. A person who is 18 years and 0 days old is not necessarily more mature or capable than one who is 17 years, 364 days old. Yet the legal system has a clear boundary in which the former has rights and responsibilities that the latter doesn't, based only on that one-day difference.
Q: Can you use the equation presented in the article [ Prob(Injury) = .06 * LN(Stress) ], plug in a pitcher's Stress for 2000, and forecast his likelihood for injury in 2001? For example: Pedro Martinez had a Stress level of 60.13 in 2000. Could this mean that he has a 24.58% chance of injury in 2001?
A: Answering this question now is rather ironic, given what we know happened to Martinez in 2001. The short answer is "no, that's not an appropriate way to use the Stress-to-Injury formula in the book, based on the research to date." What was presented was, at best, an estimate of the lifetime risk of at least one major arm injury (missing 30+ days) given a sustained career Stress level. Even this estimate is dubious, as the study included several pitchers whose careers are not yet complete, and thus we can't say with certainty whether a pitcher who has been healthy to date will suffer a qualifying injury, and thereby change the results of the research. It's a work in progress, and it's far too early to say with any confidence that a Stress load of X means an injury probability of Y even for a career, let alone from one season to the next. I'll emphasize that again: PAP^3 is a rough tool for assessing career injury risk, and any specific probability that comes out of it will have very wide error bars pending much more detailed research.
Q: I am a graduate student in statistics and one comment in the article struck me as a little off. The chance of injury for one group was near 30% and the chance for the other group was 10%, so you state that the risk was about three times as great. In categorical data analysis, this is called the relative risk. The problem is that this is not an appropriate measure of association in this case. Relative risk is good when you have independent binomial sampling as in clinical trials where you set the members of the group beforehand. Your study is more of a case-control study where the injuries are fixed and then you notice which pitchers are in after the fact. It is more of an observational study. Future studies could be more clearly defined, but this is observational in my opinion. When you have case-control studies, the relative risk is a function of the total sample size, not a desirable property since sample sizes vary. This means that as more pitchers are added, the rates could differ. A better measure of association would be the odds ratio. A probability of .3 equates to odds of 0.4286 (p/(1-p)) and .1 is odds of .1111. That means the odds ratio is 3.858, which means the odds of injury are 3.858 times greater for the high-PAP group. I believe this to be a more telling statistic and it is more correct. I know this is a minor detail, but you guys are detail-oriented and I have recently lost points on a take-home exam for the same mistake.
A: I defer to your professor, and appreciate the correction. It's important to get the methodology correct, and I welcome input from those more statistically sophisticated than I to keep me honest. Good luck on your next test.
Q: The effects measured in PAP seem to be rather small, on the order of a few percent. Is it at all significant in practical terms?
A: A single start by itself produces only a small effect. A 5% decline in performance for three weeks amounts to an expected increase in scoring of about six-tenths of a run. But it is a pattern of overuse, say one overlong start every 3-4 weeks, that can potentially keep a starter from his optimal performance level for the entire season. If a manager rides his top 2 starters hard, that can amount to about 10 runs over the course of a season, or about one win. Over the course of a season, the cost of not keeping starting pitchers fresh could be significant.
It's also worth noting that in comparing the pre-start and post-start performance levels, no attempt was made to see whether the pre-start time period included a high-pitch-count start or was affected by an earlier high pitch count. If the effects of high pitch counts are accumulating on top of each other, the magnitude of the effect itself could be understated.
Another secondary impact is that starts subsequent to a high-pitch-count start tend to be shorter (in innings pitched) than those immediately preceding. Throwing fewer innings may have a ripple effect on the bullpen, although this is counterbalanced by the fact that most high-pitch-count outings have a pitcher throwing deep into the game, reducing the pressure on the bullpen.
Q: Pitchers are all different! PAP treats them all the same without considering that Randy Johnson is more durable than Steve Woodard.
A: That's a good point, and we can use it to demonstrate how PAP can be interpreted in different ways. Take a start of 118 pitches. The PAP^3 study finds that a start of that length has a moderate risk of a short-term decline in performance (118 pitches is a high Category III start) when measured across all starters who've thrown that many pitches. In other words, assuming you know nothing else about a pitcher, you'd expect a mild decline.
However, there are other ways to interpret what PAP tells us. It is not necessarily true that all pitchers are expected to decline modestly after such an outing, even if the group as a whole does. One of the apparent assumptions in PAP is that all pitchers have the same physical reaction to a given number of pitches. Thus, any pitcher with a 118-pitch start is assigned the same level of risk, or expected decline. The reality is far more complex, as pitchers differ significantly in their ability to handle a workload, and in how their bodies react to being pushed past a threshold.
Let's introduce the idea of a personal threshold for a pitcher: the level at which a specific pitcher's performance can be expected to decline slightly as a result of marginal overuse. This is comparable to the 100-pitch level in PAP, where abuse points start accumulating.
To illustrate with a hypothetical example: suppose there are 100 pitchers, each of whom has a personal threshold somewhere from 101 to 200 pitches (one pitcher per pitch level, uniformly spread out across the interval). Beyond that point, each of them suffers an immediate 25% decline in performance (and no more, regardless of how far past his personal threshold he goes). In this simplified model, each pitcher is either definitely overworked or definitely not, with no shades of gray. If we let each pitcher throw 120 pitches in a game, 80% of them will suffer no ill effects, while the other 20% will decline by 25%. Measured across the entire population of pitchers, the effect of a 120-pitch outing will appear to be just a 5% decline. But you could also say that there is a 1 in 5 chance of performance dropping by 25%. There's a difference between surely losing a small amount and a small chance of losing a lot (indeed, that's the basis of the insurance business).
This is a roundabout way of saying that at 118 pitches, rather than suggesting a mild decline in every pitcher's performance, PAP could also be interpreted as saying that there is a fraction of pitchers who would decline a lot (as in the example above). This is not necessarily the case; it is just one possibility. That's part of the reason why I described the categories in terms of the amount of risk, rather than the amount of decline.
In reality, pitchers probably do have different thresholds of work they can handle, but the decline for each pitcher isn't constant above that threshold: the further above the threshold you push him, the bigger the decline you can expect, up to a point. Finding a pitcher's personal threshold is a much bigger challenge, and one where the professional discretion of the manager and coaching staff can be given some leeway. I don't think anyone has tackled the problem sabermetrically yet, which is why we have generic tools like PAP that describe the expectation across populations instead of individuals.
The shape of the PAP curve that would result from combining the different profiles of each pitcher depends on how personal thresholds are distributed. In the example above, thresholds were uniformly distributed between 100 and 200 pitches. It's more likely that pitchers fall on some sort of bell curve (though perhaps skewed towards higher-endurance pitchers becoming starters).
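The uniform-threshold example can be checked with a few lines of code. This is only a sketch of the hypothetical model described above, not of the PAP study itself; the function name is illustrative, and it assumes a pitcher counts as affected once his pitch count reaches his personal threshold, which is what yields the 20% figure in the text.

```python
def population_decline(pitches, thresholds, drop=0.25):
    """Return (share of pitchers affected, average decline across everyone).

    Hypothetical all-or-nothing model: a pitcher loses `drop` of his
    performance once his pitch count reaches his personal threshold,
    and loses nothing otherwise.
    """
    affected = sum(1 for t in thresholds if pitches >= t)
    share = affected / len(thresholds)
    return share, share * drop


# 100 pitchers, one at each personal threshold from 101 to 200 pitches.
thresholds = list(range(101, 201))
share, average = population_decline(120, thresholds)
print(share, average)  # 0.2 0.05
```

The same 5% average describes two very different pictures: everyone losing a little, or one pitcher in five losing a quarter of his effectiveness.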
I've done some informal mathematical modeling of whether certain probability distributions could yield a PAP^3 kind of curve, and there's some indication that a lognormally distributed threshold (with a mean around 125 pitches) could produce the results observed. However, keep in mind that PAP was developed by focusing on the higher-endurance pitchers to begin with.
Given a distribution of thresholds, the perceived decline in the overall population of starting pitchers would be smaller than the observed decline in a pitcher actually pushed beyond his limit. In a hypothetical world where this was the only effect of high pitch counts, PAP^3 would be a measure of the likelihood that a random pitcher suffers a large decline from throwing that many pitches, rather than the amount of decline a typical pitcher would suffer.
This does not mean, though, that other sources of knowledge can't inform our assessment of the wisdom in letting someone throw that many pitches. Craig Wright's research suggests that young pitchers are at higher risk of injury from overuse, and biomechanics probably tells us that muscles get torn more easily at cold temperatures. If you take a young pitcher, like Kerry Wood, who has a history of arm problems, there is good reason to think he might be a pitcher whose risks from high pitch counts are higher than the "moderate" risk level assigned by PAP^3 to a generic pitcher. Perhaps even more so if he's starting on a cold night in April. The PAP framework is a baseline, and should be modified based on knowledge of the individual pitcher and game circumstances.
Q: Why did you use total career pitch counts instead of pitches per start in the injury study?
A: Average pitch counts alone are not sufficient, because they do not capture the total amount of work thrown at that pitch load. To illustrate with an extreme case: judged by average pitch count alone, a rookie pitcher with five starts of 120 pitches each would be considered as much "at risk" as a staff ace who has thrown 120 pitches per start for 30 starts in each of the past three years. Since the arm injuries under consideration are repetitive stress injuries, the quantity of work performed has to be considered in conjunction with how that workload was structured.
In terms of the total amount of work extracted from an arm, total pitch count is one way to measure the total energy expended in the act of pitching, or the number of repetitions of the pitching motion required over the course of a career. Thus, grouping pitchers by career pitch counts seemed to be a reasonable basis for the study.
Q: Why do both the career pitch total and the length of individual starts matter?
A: Let's look at an example. Consider two pitchers: A had 10 starts of 110 pitches each, and B had 11 starts of 100 pitches each. They have each thrown the same total number of pitches. We can chart them as follows:
[Figure: charts of the workloads of Pitchers A and B]
The last chart is the most significant. The area under the curve (or in this case, the area of the rectangles for each pitcher) represents the total amount of work done (pitches thrown). Placing one workload over the other, we see that A and B had a large subset of work in common, namely the first 100 pitches of 10 starts. Each pitcher has 100 more pitches not in common with the other. Pitcher A threw 10 more pitches in each of his 10 starts, while Pitcher B had one more start of 100 pitches.
If we determine that Pitcher A is at more risk for injury than Pitcher B, then what this means is that the extra 100 pitches thrown late in the game collectively cause more damage (or create more risk) than the 100 pitches thrown while starting fresh in an extra game. Thus, throwing a larger fraction of pitches late in the game would leave a pitcher more susceptible to injury.
If, on the other hand, you believe that every pitch contributes an equal amount of risk (rather than an increasing amount on the margin), then total pitch counts would be proportional to risk, and the results of the long-term injury risk study showed that they are not.
Looking at average pitch counts is actually conceptually similar to PAP's Stress metric, assuming that you've already grouped pitchers by career workload. To use the examples of Shawn Estes and Steve Woodard (suggested by Sean Forman), we have two pitchers with similar career workloads and vastly different average pitch counts. If you believe that they are not comparable in injury risk, then one possible reason would be that the 16-pitch-per-game difference results in a higher injury risk for Estes. In other words, concentrating the same number of pitches in high-pitch-count outings is worse for injury risk than having more low-pitch-count outings. That's an idea very similar to Stress.
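The Pitcher A and Pitcher B comparison can be put in numbers with a short sketch. As before, it assumes PAP^3 is the cube of pitches beyond 100 and that Stress is total PAP divided by total pitches; the helper names are illustrative only.

```python
def pap3(pitches, threshold=100):
    """Assumed PAP^3 form: cube of the pitches thrown past the threshold."""
    return max(0, pitches - threshold) ** 3


def summarize(starts):
    """Return (total pitches, total PAP, Stress) for a list of pitch counts."""
    total_pitches = sum(starts)
    total_pap = sum(pap3(n) for n in starts)
    return total_pitches, total_pap, total_pap / total_pitches


pitcher_a = [110] * 10  # 10 starts of 110 pitches
pitcher_b = [100] * 11  # 11 starts of 100 pitches

for name, starts in (("A", pitcher_a), ("B", pitcher_b)):
    pitches, pap, stress = summarize(starts)
    print(f"Pitcher {name}: {pitches} pitches, PAP {pap}, Stress {stress:.2f}")
```

Both pitchers total 1,100 pitches, but under these assumptions only Pitcher A accumulates any PAP, since all of his extra work comes after the 100th pitch of a start.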
Now, this average pitch count differs from PAP^3 in two significant ways: it's not a cubic function, and it doesn't have a lower threshold cutoff. But fundamentally, it's employing a similar concept of increasing injury risk with increasing pitch count.
Q: Can you chart the injury risk versus average pitch count the same way you did with Stress in the article?
A: Somewhat ironically, the best and simplest polynomial curve to fit the average pitch count results is a cubic function similar to the one found in PAP^3, though this is coincidental. A better way to compare average pitch count to Stress is to look at how well different models taken from each fit the observed injury data. Here, we'll look at R^2, the percentage of variance in injury attributable to changes in the underlying metric (Stress or average pitch count, transformed by the indicated model).
Accuracy of Curve Fits using Average Pitch Count
Type of Model    R^2
Accuracy of Curve Fits using Stress
Type of Model    R^2
The formula presented in the book, Prob{Injury} = .06*LN(Stress), has an R^2 of 0.621 (in the range where it returns a non-negative number), and while it's not as good a fit as the Stress-squared model, it has two features that make it more attractive:
Q: What about the regimen of pregame pitches to get loose, the warmup pitches to start each inning, throwing between turns in the rotation, spring training, postseason starts, the All-Star Game, and the like? You're not counting every pitch.
A: That's true, and ideally we would count them all. It may be particularly critical that we aren't counting postseason starts and, to a lesser degree, spring training games. Especially with the longer postseason, a starting pitcher could end up making 20% more starts beyond his season total.
However, as long as the other factors either (a) are more or less constant among pitchers, or (b) are roughly proportional to the number of pitches thrown, then leaving them out doesn't have a big impact on PAP's evaluations. Many of those examples are pitches not thrown at full force, and presumably are not as stressful on the arm. The pitches thrown in pregame warmups (especially since the arm is fully rested at that point) and between turns are probably comparable for every major-league starter. Pitchers throwing lots of pitches tend to throw lots of innings, and adding a fixed number of pitches per inning started would add similar amounts to every pitcher with similar PAP totals.
It's even possible that those warmup pitches are part of the reason we see the decline in performance that we do. Assuming eight warmup pitches per inning, a pitcher who totalled 80 pitches in five innings has "thrown" 40 extra (120 pitches), while a pitcher who's made 120 official pitches in eight innings has "thrown" 64 extra (184 pitches). This hidden workload, some of which occurs while the pitcher is getting fatigued and which is larger the deeper into the game you go, could explain part of the tendency for high official pitch counts to yield non-linear declines in performance. This is purely speculative, though.
Q: People who count pitches are akin to someone who notes death and funerals almost always go together and then concludes that funerals cause death.
A: That isn't really an accurate analogy in a couple of ways. At the very least, the temporal ordering of death and funerals is reversed from that of pitching and injuries. A better analogy would be to say that doing 120 mph on a highway is associated with higher death rates, and objecting because the thing that actually causes death is blood loss from the flying piece of metal that used to be the other guy's bumper piercing your jugular vein. The speeding doesn't cause the death, but it increases the risk of circumstances that cause death. In other words, B may cause C, but A leads to a higher risk of B happening, so A similarly leads to a higher risk of C.
Keith Woolner is an author of Baseball Prospectus.
