
Injuries, I think we can all agree, are a deplorable scourge on baseball. They remove our favorite players indiscriminately from the field, or ruin their effectiveness. They can take teams that are great on paper and reduce them to smoldering piles of ash (see the Texas Rangers, 2014). Even though injuries appear random, the result of bad luck and stochastic variation, they are (to a limited extent) predictable. Based on prior research, it turns out that commonsense factors can predispose position players to injury.

In particular, my last foray into the subject found that age, together with the number of days missed in the past three seasons, could provide a reliable prediction of how many days each position player would miss in the coming year. However, the accuracy of these predictions was modest, and they required extensive information from prior years. It would be desirable to improve upon these injury predictions, but in the absence of other sources of information, prospects seemed slim.

This is where PITCHf/x data may be of some use. There is precedent for employing this kind of data, albeit on the mound: Josh Kalk and Noah Woodward developed injury prediction tools for pitchers utilizing information like velocity and movement. When pitchers began to exhibit abnormal patterns, Kalk and Woodward could sometimes diagnose injuries. Of course, hitters present a different set of challenges, because PITCHf/x doesn’t directly measure their performance. Instead, as I’ve written before, we have to read out hitter ability in terms of the way pitchers approach them. When that approach changes, we can use the information to diagnose a lurking problem.

Take Prince Fielder, circa 2013. In the midst of an otherwise excellent season in which Fielder sported a .297 TAv, his strike probability was rising.

This pattern is indicative of pitchers gradually being more and more willing to challenge Fielder with strikes. In 2014, we found a possible explanation for why Fielder might have been seeing more strikes: a chronic neck injury which required surgery. It’s possible that Fielder’s neck troubles may have extended into the latter half of 2013, beginning to impair his power enough for opposing pitchers to notice.

Or, more recently, consider Jedd Gyorko last season. He was unexpectedly terrible for the first three months of the season, earning a putrid sub-replacement WARP on a dreadful batting line. The question at the time was whether Gyorko was merely suffering a run of bad luck or had undergone a genuine change in ability. It turns out it was the latter; Gyorko was beset by plantar fasciitis, which sapped his power. Was there any clue in the PITCHf/x data?

The red line bisecting the curve is when Gyorko went on the disabled list. Prior to that, just as with Fielder, he was seeing a rapid increase in strikes thrown in his direction. After that, Gyorko seems to have come back healthy. With a modicum of his power restored, he worked his way back up to replacement level on the season.

Anecdotally, then, there is precedent for PITCHf/x metrics being useful to predict injuries. The rough logic goes as follows: when a hitter sees more strikes, it may be indicative of a lurking injury. When they see fewer, it may be indicative of a return to health.

Even so, these are just anecdotes, not a full or rigorous accounting. Towards that end, I mined injury information collected by Corey Dawkins from each of the past three seasons. For each season, I tried to predict how many days would be missed in the following year with two models*:

1) A simpler model, using age, the number of days missed in the current season, and the number missed in the last season.

2) A PITCHf/x based model, using all of that information plus the trend in a hitter’s strike probability (the slope of the lines in the above graphs).
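The trend that feeds the second model is the slope of a line fitted through a hitter's strike probability over time. The article does not spell out how that line was fitted, so the following is only a minimal sketch that assumes an ordinary least-squares fit; the function name `trend_slope` and the weekly numbers are hypothetical and chosen purely for illustration.

```python
def trend_slope(times, strike_probs):
    """Least-squares slope of strike probability against time.

    Assumption: a plain straight-line fit. The original analysis may have
    smoothed or weighted the data differently.
    """
    n = len(times)
    if n < 2 or n != len(strike_probs):
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_p = sum(strike_probs) / n
    # Spread of the time values; zero means every observation shares one time.
    s_tt = sum((t - mean_t) ** 2 for t in times)
    if s_tt == 0:
        raise ValueError("time values must not all be identical")
    # Co-movement of time and strike probability.
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in zip(times, strike_probs))
    return s_tp / s_tt


# Hypothetical weekly strike probabilities for a single hitter.
weeks = [0, 1, 2, 3, 4, 5]
probs = [0.44, 0.45, 0.45, 0.47, 0.48, 0.49]
slope = trend_slope(weeks, probs)
print(f"strike probability changes by {slope:+.4f} per week")
```

A positive slope corresponds to the pattern in the Fielder and Gyorko examples (pitchers throwing more strikes), while a negative slope corresponds to pitchers becoming more careful again.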

To guard against overfitting the models, I trained them on two seasons (2011-2012) and then tested them on the third season (2013).

For 2013 (that is, predicting injuries for 2014 based on 2013 data), the PITCHf/x model substantially outperforms the days-missed model. Here are some of their biggest differences:

The biggest differences come where PITCHf/x is able to diagnose a hitter with an extensive prior injury history as healthy, based on the fear they instill in the opposition. For example, Miguel Montero had suffered a slew of problems in 2013 and the year before, but when he finally came back from his most recent injury (a back strain), he began again to exhibit power, to a degree that was noticeable in his strike probability. Model 2 was willing to declare him almost cured; Model 1, blind to his performance, could only go on the injury history, and suggested he was due for more trouble still.

On the other side, in projecting which players are more likely to be injured based on PITCHf/x evidence, model 2 performs more poorly. It somehow foresaw Jason Kipnis’ injury, but whiffed seriously on sluggers like Anthony Rizzo and Jhonny Peralta, both of whom enjoyed excellent years. That’s partially a consequence of the nature of injuries: often, they are all or nothing, either a month or more lost or no problems at all. Model 2 hedges its bets, guessing an intermediate number even for healthy players if they are showing a poor strike probability prognosis. More often than not, model 2 is right, but it has its share of miscues, especially in predicting injuries to come (as opposed to recoveries).

In more statistical terms, the root mean squared error (RMSE) of model 1 comes in at 42 days missed, while the RMSE of model 2 is much better at 34 days missed. Another way of looking at the problem is to ask which of the models more correctly ordered the players from most to least injury days (i.e. the rank-order correlation). Again, model 2 works better, explaining about 7.5 percent of the variation in days missed, compared to a paltry 2 percent for model 1. It’s clear that the PITCHf/x data is able to make a significant contribution to injury prediction.
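For readers who want to compute the same two yardsticks on their own projections, here is a rough sketch of RMSE and a rank-order (Spearman) correlation in plain Python. The helper names are hypothetical, and the sketch makes no claim about the software actually used for the numbers above. Squaring the rank correlation gives a share-of-variation figure of the kind quoted, assuming that is how the percentages were derived.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual days missed."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises if either input has no variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for constant input")
    return s_xy / math.sqrt(s_xx * s_yy)


def spearman(xs, ys):
    """Rank-order correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

With a list of predicted days missed and a matching list of actual days missed, `rmse(predicted, actual)` gives the error in days, and `spearman(predicted, actual) ** 2` gives the fraction of rank variation accounted for.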

Here it is, unleashed upon the 2014 data, making predictions for 2015.

Model 2 calls these guys healthy, despite past afflictions. Based on the prior year’s performance, about seven of these 10 should get through next year without missing significant time. Which seven, precisely, we don’t know yet.

Again, the list of players for whom the PITCHf/x model would predict a greater likelihood of injuries is slightly more suspect. Probably, most of these players will be fine, but a few will suffer terrible, season-sinking harm. In particular, the system isn’t sold on Jedd Gyorko, suggesting that he’s still a gamble. But bear in mind, the simple implementation of the trend detection doesn’t see Gyorko’s late-season surge.

There are still significant limitations to model 2. Primarily, it requires a fairly large sample of data from the previous season (I used a cutoff of 1,200 pitches for these models). Not all players reach that threshold, and of course players who were injured in the previous year are less likely to see the requisite number of plate appearances. So there are plenty of players for whom predictions can only be made with model 1.

In the long term, there are many routes to improving injury prediction for position players. This implementation only predicts between years, but as the Jedd Gyorko example above shows, it would be possible to extend the approach to prediction within a single year, more akin to Kalk’s original injury zone idea. Other aspects of the trend detection could be improved dramatically. For now, it’s clear that, as in predicting breakouts, PITCHf/x information can be leveraged to improve injury forecasts.

*Specifically, I used a Random Forest.

jfranco77
2/06
Really interesting stuff. How do you account for age? Meaning "this guy is seeing more strikes because he's old" versus "this guy is seeing more strikes because he's injured" ?
nada012
2/06
It's true that players tend to see increasing strike probability as they age (working on the aging curve now), so that's a valid concern. The way that it's dealt with is basically by letting the model sort it out, and what it ends up doing is taking an age-related decline as the baseline, and looking for changes more dramatic than the average year-to-year decline as symptoms of injury.
therealn0d
2/06
Any ideas of introducing a comparable players approach into this?
nada012
2/06
My concern is that the data is still relatively sparse, since there are only ~5 years of PITCHf/x, and several parameters (age, strike probability, injury days in two seasons). But I will give it a try. It ought to become more feasible as we gather more and more years of data.
therealn0d
2/08
Whoever you are POS, I am going to figure out who you are and I don't care why you follow me around and down vote all my comments. I don't really care about the comment ratings but you obviously want my attention, and you've got it. Congratulations. You may not like how this turns out.
WaiverWireKing
2/08
It makes for a very interesting read, but to me it sounds like it's all based on a perfect model in which something is building up and the injury is waiting to happen. Sure, maybe it could spot something that a player may be overlooking. But most injuries happen because of freak occurrences: the foot comes down in a bad way, the player trips over the base, a player runs into another player because they are paying attention to where the ball is rather than where they are, or even one foot moves faster than the other, which happens even to the average Joe. So the question is whether you can really find the freak in the data, or whether the freak happens in predictable ways, and I don't think it does. I have followed baseball for nearly 35 years, and over those years I have seen many players get injured again and again every year, and many times I have seen them find health and play three or four years without an injury. Logic says that shouldn't happen: once injury prone, you should stay injury prone, and you shouldn't stop being injury prone as you get older, but it does happen. So from a fantasy baseball point of view you might get lucky calling out a player or two, but overall I don't see it being more than that. Not to mention that it is based on strikes, and a .220 hitter is likely to see more of them than a .300 hitter, because the pitcher may think he can't hit it.
floydwicker
2/08
How is this data helpful? Is it statistically significant?
nada012
2/08
The PITCHf/x data is helpful because it can diagnose recoveries (and, to a lesser extent, injuries). So, it can tell when a player who has suffered from a lingering injury is healed, or, sometimes, when a player is suffering from a lingering injury that hasn't been announced yet (like with Jedd Gyorko). The difference between the models is statistically significant. The model which incorporates PITCHf/x drops the prediction error pretty substantially. If you want to get into the nitty-gritty of why it's statistically significant: I did permutations where I randomly resampled the PITCHf/x predictor variables and reran the model many times with these random numbers. This gave me a distribution of prediction improvements which I would expect to see if the PITCHf/x numbers were pure noise. When I compared that distribution to the actual improvement of the model with the real (not permuted) PITCHf/x numbers, the real improvement exceeded the distribution of permutation improvements to a large & significant degree (p<.01). For a more academic background on what I did (which is called a permutation test), you can check out a paper like this: http://perso.uclouvain.be/michel.verleysen/papers/esann06df.pdf
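The permutation test described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the article. A one-variable least-squares line stands in for the Random Forest, "improvement" is measured as the in-sample drop in RMSE relative to predicting the mean, and the predictor is shuffled rather than resampled in any more elaborate way. The names `rmse_gain` and `permutation_p_value` are hypothetical.

```python
import math
import random


def rmse_gain(feature, target):
    """Drop in RMSE when a fitted straight line replaces the plain mean.

    Stand-in model (assumption): the article used a Random Forest, not a line.
    """
    n = len(target)
    mean_x = sum(feature) / n
    mean_y = sum(target) / n
    s_xx = sum((x - mean_x) ** 2 for x in feature)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, target))
    slope = s_xy / s_xx if s_xx else 0.0
    baseline = math.sqrt(sum((y - mean_y) ** 2 for y in target) / n)
    fitted = math.sqrt(
        sum((y - (mean_y + slope * (x - mean_x))) ** 2
            for x, y in zip(feature, target)) / n
    )
    return baseline - fitted


def permutation_p_value(feature, target, gain_fn, n_permutations=1000, seed=0):
    """Share of shuffled-feature gains that match or beat the real gain."""
    rng = random.Random(seed)
    observed = gain_fn(feature, target)
    shuffled = list(feature)
    at_least_as_good = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real link between the feature and the target.
        rng.shuffle(shuffled)
        if gain_fn(shuffled, target) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)
```

Calling `permutation_p_value(trend_slopes, days_missed, rmse_gain)` on two equal-length lists returns the estimated p-value. A small value means the real predictor improves the fit more than shuffled noise usually does.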