
At the team level, injuries are as mysterious as they can be crippling. The Texas Rangers are suffering a whirlwind of pitcher injuries that threatens to break records and has certainly been one of the primary causes of their disappointing season. Meanwhile, teams full of aged veterans like the Yankees and Phillies have somehow managed to evade their fair share, albeit without benefiting very much.

Differences like these suggest asking whether some teams are better at limiting injuries than others. Ben Lindbergh (with the help of Russell Carleton) tried to tackle this issue a few days ago in the context of the Pirates’ remarkable run of injury prevention. They found little detectable signal of any team having an ability to reduce injuries.

However, their study was not without caveats. For one thing, they did not control for the players on each team, and we know that certain players are more likely to suffer injuries than others. Because players turn over from year to year, it stands to reason that failing to control for their unique injury probabilities could confound an analysis of the team’s injury-prevention abilities. (This caveat was noted in the article.)

I decided to undertake a deeper analysis of injury at the team level by controlling for the players involved. I showed in a previous article that the major determinants of days missed are 1) the number of days the player missed in each of the three years prior, and 2) age. The former encompasses the effect of a player’s injury history, the latter the natural buildup of risk due to senescence. With this said, the majority of the predictive power of the model stems from the days missed in the two prior years. Because all four variables (age and injury history in the last three years) are correlated with each other, most of the information about a player’s injury proneness can be captured in the days he has missed in only the last two years.

I built models that predict each player’s days missed in a year from only these two variables, plus a team-specific effect.1 I used the last six full years of data (2008-13) from Corey Dawkins’ extensive database.
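For readers who want to see the shape of such a model, here is a minimal sketch in Python. It is not the code behind this article: the data are entirely synthetic, the function name `fit_team_effects` is hypothetical, and the noise level is kept artificially low so the planted effect is easy to recover. It fits days missed on the two prior years' days missed plus one indicator per team, using ordinary least squares.

```python
import numpy as np

def fit_team_effects(prev1, prev2, team_ids, days_missed, n_teams):
    """OLS fit of days_missed ~ prev1 + prev2 + team effect.

    Returns (slopes, team_effects); team_effects are centered so they
    read as days missed relative to the average team.
    """
    n = len(days_missed)
    # One indicator column per team. With no separate intercept column,
    # each team's coefficient acts as that team's own intercept.
    dummies = np.zeros((n, n_teams))
    dummies[np.arange(n), team_ids] = 1.0
    X = np.column_stack([prev1, prev2, dummies])
    coef, *_ = np.linalg.lstsq(X, days_missed, rcond=None)
    team_intercepts = coef[2:]
    return coef[:2], team_intercepts - team_intercepts.mean()

# Synthetic player-seasons (assumed values, for illustration only).
rng = np.random.default_rng(0)
n_teams, n_players = 30, 3000
team_ids = rng.integers(0, n_teams, size=n_players)
prev1 = rng.exponential(20.0, size=n_players)  # days missed in year N-1
prev2 = rng.exponential(20.0, size=n_players)  # days missed in year N-2

true_team = np.zeros(n_teams)
true_team[0] = -11.0  # hypothetical: team 0 saves 11 days per player

days = (5.0 + 0.3 * prev1 + 0.15 * prev2
        + true_team[team_ids]
        + rng.normal(0.0, 10.0, size=n_players))

slopes, effects = fit_team_effects(prev1, prev2, team_ids, days, n_teams)
print("slopes on prior-year days missed:", slopes.round(3))
print("team with the most negative effect:", int(np.argmin(effects)))
```

With real injury data the noise is far larger than in this toy example, which is exactly why a team effect of this size can be hard to distinguish from chance.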

In the combined dataset, covering all years and both position players and pitchers, I found no significant associations between teams and the total days missed of each player. As expected, days missed in the previous year had the most significant impact.2 Among the team effects, the smallest p-value belonged to the Chicago White Sox. That’s to be expected: The White Sox had a remarkable run of well-being, seemingly producing healthy squads every year. But for all that, and an impressive estimated coefficient (-11 days missed per player), the p-value was a poor .086. It’s not impossible that the White Sox had an impact, but after testing 30 different possible team effects, one would expect to see a p-value like that by chance alone, so no robust conclusions could be drawn.
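The multiple-testing point can be made concrete with a little arithmetic. The sketch below assumes the 30 team tests are independent, which is a simplification (team effects in one model are not strictly independent), and uses the fact that p-values are uniformly distributed when there is no real effect.

```python
n_tests = 30        # one team effect per club
p_observed = 0.086  # the White Sox p-value reported above

# Under the null hypothesis, P(p <= x) = x for each test.
expected_hits = n_tests * p_observed

# Chance that at least one of the 30 tests reaches p <= 0.086,
# assuming independence between tests (a simplification).
prob_at_least_one = 1.0 - (1.0 - p_observed) ** n_tests

print(f"expected number of tests with p <= {p_observed}: {expected_hits:.2f}")
print(f"chance of at least one such test: {prob_at_least_one:.3f}")
```

Under those assumptions, two or three teams would be expected to show a p-value that small even if no team had any real effect, and the chance of seeing at least one is better than 90 percent.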

Perhaps, I thought, teams have more of an effect within a single season than across multiple years. I broke the data down into individual years, and looked for team effects at that scale. No significant results arose. I split it by hitters vs. pitchers, repeating the analysis both across all years and within years. Still nothing.

Fine, you might say, but what if the effects are quite small in a given year, but consistent between years? This pattern would give us some idea that teams have a way to influence injuries. I tested this by building a model in Year 1 that estimated an effect for each team and then contrasted it with the estimated effect in Year 2. I did this iteratively, for each year. This graph is what I found:

Each line here is a team, with the position on the vertical axis indicating the model’s scaled estimate of the team’s injury prevention prowess in a given year. The more these lines interdigitate, the less consistent the pattern is, and, as you can see, they form a thick and confusing webbing. The statistics agree with the graph: There’s no correlation of any significance between years.

I reasoned that perhaps more time was necessary for team injury effects to stabilize and tried a different approach: I split the data set into two halves, one running 2008-10 and one running 2011-13.

The longer term is just as inconclusive. The correlation between effects observed in the first three years and the latter three years shows up as a measly .182, for an equally uninspiring p-value of .35. Which is to say there’s no apparent consistency in terms of how teams impact their players’ injury probabilities, either between individual pairs of years or between groups of years.
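As a rough sanity check on how weak a correlation of .182 is, the sketch below uses Fisher's z transformation to approximate a two-sided p-value. Two assumptions to flag: it takes the sample size to be 30 (one pair of estimates per team), and it is a normal approximation, not necessarily the test used for the figure quoted above.

```python
import math

def fisher_z_p_value(r, n):
    """Approximate two-sided p-value for H0: true correlation is zero."""
    # Fisher's transform: atanh(r) is roughly normal with
    # standard deviation 1 / sqrt(n - 3) when the true correlation is 0.
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided standard normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2.0))

p = fisher_z_p_value(0.182, 30)  # n = 30 teams is an assumption
print(f"approximate p-value: {p:.2f}")
```

The approximation lands in the same neighborhood as the reported value: with only 30 teams, a correlation this small is well within what chance alone produces.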

Negative results can sometimes be illuminating. For example, Russell wrote a piece on the improbability of some playoff clichés earlier this week. In our semi-official capacity as Baseball Snopes, it can be satisfying to bust a good baseball myth now and again. But in other cases, one expects to find something in the statistics and doesn’t.

This injury research belongs in the latter class. We have every reason to believe that teams do exercise an important impact on injury risk. We know that they have policies in place to control how their players rehab and what risks they are exposed to during play (e.g. pitch counts in consecutive starts). Now, maybe they are being superstitious or over-cautious or overconfident in their own abilities to make sabermetric discoveries, but it seems unlikely, given the cost (both in dollars and lost wins) of the policies and technologies they put in place.

What’s more, as David Epstein noted on a recent episode of Effectively Wild, there is abundant evidence from the scientific literature to suggest that training and treatment regimes should impact athletic performance. One of the next frontiers in sports science Epstein identified was in the development of personalized training schedules and methods for each player in order to suit their unique requirements. I can’t imagine this isn’t already underway in many a front office, but still, the effect is invisible.

There are certainly caveats with this approach that could render what should be an otherwise obvious effect unseen. Even though I’ve tried to control for the intrinsic probability each player has of being injured, in reality, I can’t do a very good job of that because I can’t estimate that intrinsic probability very well. Remember that injuries are still very stochastic, and so the best model I’ve come up with can explain only about 5 percent of the variance between players in days missed for a given year, even after accounting for the various contributors to injury proneness.

One thing our models presuppose is that the team effect is consistent across players, and in reality, it might not be. To put it another way, each team gets a single effect assigned to it which applies across all players on the roster, whether young or old, whatever the site (shoulder or leg or elsewhere) of the injury. That assumption would be violated if, for example, the Yankees were great at rehabbing older players but poor at pre-emptively shutting down younger guys. Or perhaps one team has a technology that can pre-diagnose ankle strains, which can be used to provide preventive care and stop them altogether. If so, we’d have to consider modeling bunches of interaction terms (Yankees * Age, Yankees * Age * Injury Type). Given the overall sparseness of injury data, to do so would invite overfitting. We might find something tantalizing, but with each new category of interaction term, we’d also be increasing the possibility that the tantalizing find was nothing more than the result of chance.
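To see how quickly interaction terms multiply, consider the following back-of-the-envelope sketch. The number of age bands and injury sites are hypothetical choices made purely for illustration, as is the conventional .05 significance threshold; none of them come from the analysis above.

```python
n_teams = 30
n_age_bands = 3     # hypothetical: young / prime / veteran
n_injury_sites = 5  # hypothetical: shoulder, elbow, back, leg, other
alpha = 0.05        # conventional threshold, assumed here

# Number of team-related coefficients each model would estimate.
coefficient_counts = {
    "team only": n_teams,
    "team x age": n_teams * n_age_bands,
    "team x age x injury site": n_teams * n_age_bands * n_injury_sites,
}

# If no team had any real effect, each coefficient would still cross
# the threshold with probability alpha.
expected_false_positives = {
    name: count * alpha for name, count in coefficient_counts.items()
}

for name, count in coefficient_counts.items():
    print(f"{name}: {count} coefficients, "
          f"about {expected_false_positives[name]:.1f} false positives expected")
```

Each added layer of interaction spreads the same player-seasons across many more coefficients while raising the number of "significant" results that chance alone would hand us.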

Instead, we may have to rely for a while on anecdotes and rumors. While the lack of concrete answers may be dissatisfying, there may also not be a better solution (yet). If we are lucky, perhaps a team executive here or there might drop a quote or tidbit that could illuminate some of the injury processes and findings teams have made. And of course, the burgeoning host of new technologies on the horizon (Statcast, BIOf/x, etc.) could always provide more dimension, if not necessarily more clarity, to the problem. In the meantime, to whatever extent teams influence injury probability and recovery, we (or at least I) probably cannot see or measure it.


  1. Formally, I fit the model as: Days missed of player X in year N ~ (Days Missed in Year N-1) + (Days Missed in Year N-2) + (Team Effect)

  2. p < 2.2 x 10^-16

pizzacutter
9/19
Just me totally nerding out, but mixed-linear model?
nada012
9/19
I don't have much experience with that, but I'll give it a try.
adrock
9/19
I'm going to hazard a guess: Not the Blue Jays.
Kongos
9/21
With the Blue Jays, you have to factor in the effects of the artificial turf.
tsfunsworth7
9/27
Has that been shown to bear out statistically? It makes sense, but I can't remember seeing any analysis like Robert's that controls for the individual players.
harold
9/19
You touched on the difficulties of modeling every possible combination, but what about just splitting the injuries into two categories: on-field fluky events (concussions, HBP injuries, etc) and general "maintenance" injuries (pulled hamstrings, oblique strains, etc)? This would provide a step toward more granularity while maintaining larger samples. The biggest difficulty may be classifying the data.