Injuries, I think we can all agree, are a deplorable scourge on baseball. They remove our favorite players indiscriminately from the field, or ruin their effectiveness. They can take teams that are great on paper and reduce them to smoldering piles of ash (see the Texas Rangers, 2014). Even though injuries appear random, the result of bad luck and stochastic variation, they are (to a limited extent) predictable. Based on prior research, it turns out that commonsense factors can predispose position players to injury.

In particular, my last foray into the subject found that age as well as the number of days missed in the past three seasons could provide a reliable prediction of how many days each position player would miss in the coming year. However, these predictions were only modestly accurate, and they required extensive information from prior years. It would be desirable to make further improvements upon these injury predictions, but in the absence of other possible sources of information, prospects seemed slim.

This is where PITCHf/x data may be of some use. There is precedent for employing this kind of data, albeit on the mound: Josh Kalk and Noah Woodward developed injury prediction tools for pitchers utilizing information like velocity and movement. When pitchers began to exhibit abnormal patterns, Kalk and Woodward could sometimes diagnose injuries. Of course, hitters present a different set of challenges, because PITCHf/x doesn’t directly measure their performance. Instead, as I’ve written before, we have to infer hitter ability from the way pitchers approach them. When that approach changes, we can use that information to diagnose a lurking problem.

Take Prince Fielder, circa 2013. In the midst of an otherwise excellent season in which Fielder sported a .297 TAv, his strike probability was rising.

This pattern is indicative of pitchers gradually being more and more willing to challenge Fielder with strikes. In 2014, we found a possible explanation for why Fielder might have been seeing more strikes: a chronic neck injury which required surgery. It’s possible that Fielder’s neck troubles may have extended into the latter half of 2013, beginning to impair his power enough for opposing pitchers to notice.

Or, more recently, consider Jedd Gyorko last season. He was unexpectedly terrible for the first three months of the season, earning a putrid sub-replacement WARP on a dreadful batting line. The question at the time was whether Jedd Gyorko was suffering from merely a run of bad luck or a genuine change in ability. It turns out it was the latter; Gyorko was beset by plantar fasciitis, sapping his power. Was there any clue in the PITCHf/x data?

The red line bisecting the curve marks when Gyorko went on the disabled list. Prior to that, just as with Fielder, he was seeing a rapid increase in strikes thrown in his direction. After that, Gyorko seems to have come back healthy. With a modicum of his power restored, he worked his way back up to replacement level on the season.

Anecdotally, then, there is precedent for PITCHf/x metrics being useful to predict injuries. The rough logic goes as follows: when a hitter sees more strikes, it may be indicative of a lurking injury. When they see fewer, it may be indicative of a return to health.

Even so, these are just anecdotes, not a full or rigorous accounting. Towards that end, I mined injury information collected by Corey Dawkins from each of the past three seasons. For each season, I tried to predict how many days would be missed in the following year with two models*:

1) A simpler model, using age, the number of days missed in the current season, and the number missed in the last season.

2) A PITCHf/x based model, using all of that information plus the trend in a hitter’s strike probability (the slope of the lines in the above graphs).
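
For readers who want to experiment, the extra feature in model 2 can be sketched as an ordinary least-squares slope. This is only an illustration under stated assumptions: the article does not say how strike probability was estimated or how the trend line was fitted, so the function name `trend_slope`, the input format (one strike probability per pitch or per window, in chronological order), and the sample values are all hypothetical.

```python
from typing import Sequence


def trend_slope(strike_probs: Sequence[float]) -> float:
    """Least-squares slope of strike probability against observation order.

    strike_probs[i] is a hypothetical strike probability for the i-th
    pitch (or rolling window) of a hitter's season, in chronological order.
    """
    n = len(strike_probs)
    if n < 2:
        raise ValueError("need at least two observations to fit a slope")
    mean_x = (n - 1) / 2                      # mean of 0, 1, ..., n - 1
    mean_y = sum(strike_probs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(strike_probs))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Made-up values, not real PITCHf/x data.
rising = [0.44, 0.45, 0.47, 0.48, 0.50]    # pitchers challenging the hitter more
falling = list(reversed(rising))           # pitchers backing off

rising_slope = trend_slope(rising)
falling_slope = trend_slope(falling)
```

Under the article's rough logic, a positive slope would count as a warning sign and a negative one as a sign of returning health.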

To guard against overfitting the models, I trained them on two seasons (2011-2012) and then tested them on the third season (2013).
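
A season-based holdout like the one described above might be set up as follows. This is a minimal sketch, not the code used for the article: the row layout and the field names (`season`, `age`, `days_missed`, `days_missed_next`) are assumptions made for illustration, and the rows are invented.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, float]  # hypothetical record: one hitter-season


def split_by_season(
    rows: Iterable[Row], train_seasons: Iterable[int], test_season: int
) -> Tuple[List[Row], List[Row]]:
    """Hold out one whole season so no test-year rows leak into training."""
    train_set = set(train_seasons)
    if test_season in train_set:
        raise ValueError("test season must not also be a training season")
    rows = list(rows)
    train = [r for r in rows if r["season"] in train_set]
    test = [r for r in rows if r["season"] == test_season]
    return train, test


# Made-up rows: features come from `season`, the target is days missed in season + 1.
rows = [
    {"season": 2011, "age": 29, "days_missed": 0, "days_missed_next": 15},
    {"season": 2012, "age": 30, "days_missed": 15, "days_missed_next": 0},
    {"season": 2013, "age": 31, "days_missed": 0, "days_missed_next": 40},
]
train, test = split_by_season(rows, train_seasons=[2011, 2012], test_season=2013)
```

Splitting by whole season, rather than shuffling rows at random, keeps the test closer to how the forecasts would really be used: fit on the past, then predict a year the model has never seen.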

For 2013 (that is, predicting injuries for 2014 based on 2013 data), the PITCHf/x model substantially outperforms the days missed model alone. Here are some of their biggest differences:

The biggest differences come where PITCHf/x is able to diagnose a hitter with an extensive prior injury history as healthy, based on the fear they instill in the opposition. For example, Miguel Montero had suffered a slew of problems in 2013 and the year before, but when he finally came back from his most recent injury (a back strain), he began again to exhibit power, to a degree that was noticeable in his strike probability. Model 2 was willing to declare him almost cured; Model 1, blind to his performance, could only go on the injury history, and suggested he was due for more trouble still.

On the other side, projecting players more likely to be injured based on PITCHf/x evidence, model 2 performs more poorly. It somehow foresaw Jason Kipnis’ injury, but whiffed seriously on sluggers like Anthony Rizzo and Jhonny Peralta, both of whom enjoyed excellent years. That’s partially a consequence of the nature of injuries: often, they are all or nothing, a month or more lost or no problems. Model 2 hedges its bets, guessing an intermediate number for even healthy players if they are showing a poor strike probability prognosis. More often than not, model 2 is right, but it has its share of miscues, especially in predicting injuries to come (as opposed to recoveries).

In more statistical terms, the root mean squared error (RMSE) of model 1 comes in at 42 days missed, while the RMSE of model 2 is much better at 34 days missed. Another way of looking at the problem is to ask which of the models more correctly ordered the players from most to least injury days (i.e. the rank-order correlation). Again, model 2 works better, explaining about 7.5 percent of the variation in days missed, compared to a paltry 2 percent for model 1. It’s clear that the PITCHf/x data is able to make a significant contribution to injury prediction.
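
Both yardsticks are straightforward to compute. The sketch below shows one way to do it in plain Python. It is illustrative only: the function names are hypothetical, ties are given average ranks (a common convention the article does not specify), and the article does not say exactly how its "percent of variation" figures were derived, though squaring a correlation is one common way to get such a number.

```python
import math
from typing import List, Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the inputs (days missed)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Rank-order correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant (no ordering to compare).
    """
    rp, ra = average_ranks(predicted), average_ranks(actual)
    n = len(rp)
    mp, ma = sum(rp) / n, sum(ra) / n
    cov = sum((x - mp) * (y - ma) for x, y in zip(rp, ra))
    sp = math.sqrt(sum((x - mp) ** 2 for x in rp))
    sa = math.sqrt(sum((y - ma) ** 2 for y in ra))
    return cov / (sp * sa)
```

RMSE punishes large misses heavily, which matters when injuries tend to be all or nothing, while the rank-order correlation only asks whether the model sorted players correctly from most to least time lost.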

Here it is, unleashed upon the 2014 data, making predictions for 2015.

Model 2 calls these guys healthy, despite past afflictions. Based on the prior year’s performance, about seven of these 10 should clear next year without missing significant time. Precisely which seven, we don’t know yet.

Again, the list of players for whom the PITCHf/x model would predict a greater likelihood of injuries is slightly more suspect. Probably, most of these players will be fine, but a few will suffer terrible, season-sinking harm. In particular, the system isn’t sold on Jedd Gyorko, suggesting that he’s still a gamble. But bear in mind, the simple implementation of the trend detection doesn’t see Gyorko’s late-season surge.

There are still significant limitations to model 2. Primarily, it requires a fairly large sample of data from the previous season (I used a cutoff of 1,200 pitches for these models). Not all players see that many pitches, and of course players who are injured in the previous year are less likely to see the requisite number of plate appearances. So there are plenty of players for whom predictions can only be made with model 1.

In the long term, there are many routes to improve injury prediction for position players. This implementation only predicts between years, but as the Jedd Gyorko example shows above, it would be possible to extend the approach to prediction within a single year, more akin to Kalk’s original injury zone idea. Other aspects of the trend detection could be improved dramatically. For now, it’s clear that, as in predicting breakouts, PITCHf/x information can be leveraged to improve injury forecasts.

**Specifically, I used a Random Forest.*