I knew Jose Fernandez shouldn't have flown so close to the sun.
Even though wounded pitchers get all the attention (and rightfully so, in the case of Jose Fernandez’s Tommy John tragedy), position players get injured too. Just this week, the Rockies lost Nolan Arenado because of a finger fracture, the Yankees were deprived of Mark Teixeira on account of his persistent wrist problems, and the Mariners went without Robinson Cano for a few games. The absence of Jose Abreu, brief as it was, was surely no less painful on a per-game basis than the injuries of Patrick Corbin or Jarrod Parker.
On the other hand, there’s a good reason that pitcher injuries are regarded with so much more dread: they tend to rob us of our favorite players for whole seasons at a time. The Baseball Prospectus injury database informs me that the average pitcher injury causes about twice as many lost days as the average position player injury. Moreover, about five percent of all pitcher injuries take a player out for more than 150 days, or the better part of an entire season. Contrast that with position players, for whom only about one percent of injuries take the same toll.
Yet, while individual pitcher injuries are often more cataclysmic, position player injuries—in aggregate—account for about 40 percent of the total lost days. Some of these position player injuries seem random, like the flu that felled Cano. But others appear in the context of a larger pattern: at the ripe old age of 34, Teixeira’s ongoing struggles with wrist problems don’t necessarily seem like an unforeseeable accident. Similarly, Ryan Zimmerman’s well-documented medical problems appear to have lingered for years.
There’s no doubt that some aspects of injuries are predictable. Focusing on pitchers, Russell Carleton found that the main predictor of future injuries was past injuries. Most recently, Noah Woodward has done yeoman’s work on the signs immediately preceding a pitcher’s demise. Even though injuries are a mysterious menace, there appear to be clues here and there about what factors lead to their occurrence, at least for pitchers.
I want to turn the same critical eye on position player injuries. The easiest place to start such an investigation is with age. As I noted above, older players tend to get injured more often—but how much more often?
I use here the formidable injury database maintained by BP’s own Corey Dawkins (without whose work this research would be impossible).
I’ve plotted here the average number of days lost to injury for players of each age (from the 2013 season alone). On average, each year of age is correlated with a single additional day missed due to injury. That doesn’t seem like much on a year-to-year basis, but the spread in age in the league can be 15 or 20 years, making the effect of age enough to be noticeable.
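To make the "one day per year of age" reading concrete, here is a minimal sketch of how such a slope could be estimated with ordinary least squares. The ages and averages below are made-up illustrative numbers, not values from the BP injury database, and the function name `ols_slope` is my own.

```python
def ols_slope(xs, ys):
    """Return the least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical average days lost at each age (NOT the real data).
ages = [24, 26, 28, 30, 32, 34]
avg_days_lost = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = ols_slope(ages, avg_days_lost)

# With a slope near one day per year, a 15-year age gap between two
# players corresponds to roughly 15 extra expected days lost.
gap_in_days = 15 * slope
```

The point of the last line is the one made above: a small per-year slope adds up once it is multiplied by the 15- or 20-year spread in ages across the league.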
Even so, previous work suggests that the dominant predictor of future injuries is just past injuries. That idea fits with intuition, if you consider that injuries themselves, along with the recovery process, put a lot of stress on the body. In addition, the occurrence of injuries can suggest or provide information about the player’s underlying constitution. If the player constantly suffers injuries—even if those injuries appear unrelated to one another—that pattern may be an indication of poor conditioning, or weak ligaments, or some other possible underlying ailment.
To examine the effects of injury history, I fit a linear regression model*, with injury days missed as the response variable and the number of injury days in the previous three years, as well as age, as predictor variables. In this way, I tried to jointly capture the effects of age and injury history, as well as parse how the previous few years’ injuries predicted a given year’s afflictions. I used data from 2013, along with the previous three years, for this analysis.
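For readers curious what a LASSO fit of this shape involves, the following is a rough sketch using coordinate descent with soft-thresholding. It is not the code behind this analysis. The data rows are invented, no intercept is fitted, and the penalty `lam` is fixed by hand rather than chosen by 10-fold cross-validation. The names `soft_threshold` and `lasso_fit` are my own.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(rows, targets, lam, n_iter=200):
    """Minimize (1/2n) * sum(residual^2) + lam * sum(|coef|) by coordinate descent.

    rows: list of predictor lists; targets: list of responses.
    No intercept is fitted (a simplifying assumption for this sketch).
    """
    n = len(rows)
    p = len(rows[0])
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # column j times the residual that ignores column j
            norm = 0.0  # sum of squares of column j
            for row, y in zip(rows, targets):
                partial = sum(coefs[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (y - partial)
                norm += row[j] ** 2
            if norm == 0.0:
                coefs[j] = 0.0
                continue
            coefs[j] = soft_threshold(rho / n, lam) / (norm / n)
    return coefs


# Invented rows: [days missed last year, two years prior, three years prior, age]
rows = [
    [50.0, 0.0, 10.0, 31.0],
    [0.0, 0.0, 0.0, 25.0],
    [15.0, 60.0, 0.0, 34.0],
    [0.0, 20.0, 0.0, 28.0],
    [90.0, 30.0, 15.0, 36.0],
]
targets = [12.0, 0.0, 9.0, 3.0, 25.0]  # invented days missed this year

coefs = lasso_fit(rows, targets, lam=1.0)
```

The soft-thresholding step is what separates this from plain linear regression: weak predictors are pushed to exactly zero instead of being handed small, noisy coefficients.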
Each bar represents one of the predictor variables, and the height of the bar** indicates that variable’s expected relationship with days lost to injury. The previous year’s injury time is by far the most important variable for predicting the next year’s injury time: a given DL stay in the last year translates to more missed time in the next year at a rate of about 5-to-1***. In other words, if a given player missed 50 days due to injury in the previous year, he would be expected to miss (on average) roughly 10 additional days due to injury this year.
Injuries in previous years were also predictive of more injury time, but with lower coefficients. Two years prior drops to a coefficient of about .1, or a 10-to-1 exchange. The amount of missed time as a result of injury in the third prior year registers as barely significant; the coefficient comes in at a minuscule .02. In short, we see a declining relationship between injury time in the previous years and the predicted injury time this year. A day lost two years ago is roughly half as bad as a day lost last year. By three years prior, past days lost cease to mean much at all for the predictions.
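As a worked example, the full model given in the final footnote can be written as a small function. The coefficients come straight from that footnote; the function and argument names are my own, and the sample player is hypothetical.

```python
def predict_days_missed(last_year, two_years_prior, three_years_prior, age):
    """Expected days missed this year, using the footnoted coefficients."""
    return (
        0.18 * last_year
        + 0.10 * two_years_prior
        + 0.02 * three_years_prior
        + 0.004 * age
    )


# Hypothetical 30-year-old who missed 50 days last year and none before that.
example = predict_days_missed(50, 0, 0, 30)

# Contribution of one lost day at each lag, relative to last year's weight.
two_year_ratio = 0.10 / 0.18    # a bit more than half
three_year_ratio = 0.02 / 0.18  # about one ninth
```

For that player, the exact coefficient of .18 yields about 9 days, in line with the rounded 5-to-1 figure above. Age contributes only about a tenth of a day.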
The relationship between age and predicted injury days lost is a curiously weak one in this model. We observed before that every additional year correlated with a full day missed to injury, and yet here, in the full model, age has barely any weight at all. The most likely explanation for this discrepancy is that the effects of age have mostly been accounted for by examining the previous three years’ injury histories. In other words, if the player was injured with some frequency in the previous years, then that history already carries most of what age would have told us, and any additional knowledge of his calendar age is of little use. Conversely, if the player was healthy, it doesn’t matter too much whether he is young or old; the model presupposes that he is hardy enough to resist injury in the coming year. In any case, these results very much suggest the primacy of injury history in predicting future injuries, above and beyond age.
If you take this simple model and try to test it on some other data, you quickly find out that it isn’t very accurate in an absolute sense. Its typical miss is about 30 days of missed time (this is the RMSE, or root-mean-square error), which would, at first glance, seem to negate the whole purpose of its existence.
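Because RMSE squares each error before averaging, a handful of large misses can dominate it. The sketch below uses made-up predictions and outcomes (not the actual test data) to show how one season-ending surprise inflates RMSE well above the mean absolute error. The helper names are my own.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error: square root of the mean squared miss."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_error(predicted, actual):
    """Mean absolute error: the average size of the miss."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up example: the model predicts 5 days for four players; three stay
# healthy and one is lost for 65 days to an injury nobody saw coming.
predicted = [5.0, 5.0, 5.0, 5.0]
actual = [0.0, 0.0, 0.0, 65.0]

model_rmse = rmse(predicted, actual)
model_mae = mean_abs_error(predicted, actual)
```

In this toy case the RMSE is much larger than the mean absolute error, and nearly all of it comes from the single unexpected injury. That is the sense in which a large RMSE can coexist with reasonable predictions for most players.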
But the utility of this model is asymmetric. It can’t tell you whether a certain player is going to fall on his ankle the wrong way while leaping for a fly ball and miss the whole year, but it can tell you that a player who has suffered nagging issues in the last few years is likely to continue suffering. The model will miss the unexpected, but it ought to guess right more often than not with older players and protracted issues.
Perhaps this weakness is unsurprising. It is questionable whether there will ever be a statistical approach that can predict the broken ribs of a bad slide or the bruised elbow of a hit-by-pitch (although certain players may be more skilled sliders, and others are more likely to be beaned). Maybe the best we can hope for is an accurate accounting of whether the strains and sprains of everyday MLB life are likely to persist or fade away. Even if the best model can only identify the injury-prone players, that is still a step forward. Injuries are complicated, and there’s no doubt that randomness plays a large role in their severity.
One bad thing about injuries is that we don’t know very much about them. By the same token, one good thing is that there is much to learn. There are many angles to approach here, and the above model is far from final. Notably, this model treats all injuries as equal, an assumption we know to be wrong—and an assumption that calls for fixing in the next iteration. With so many variables in play, it makes sense to approach this problem slowly and stepwise, building up to a fuller understanding of when and how players get injured.
*More specifically, I fit a LASSO model, with the lambda parameter determined via 10-fold cross-validation. LASSO is a slightly tweaked version of linear regression with some desirable properties for this problem.
**Confidence intervals are 90 percent estimates, based on re-running the cross-validation 100 different times.
***The full model is:
Days missed this year = .18*(days missed last year) + .1*(days missed two years prior) + .02*(days missed three years prior) + .004*Player’s Age