How do you know whether your team’s hitting coach or pitching coach is doing a good job? Generally, the answer is “Well, how are his hitters/pitchers doing? Are they getting better?” That seems to be the justification given when he gets fired, after all.

Is that fair? Can a pitching coach really be blamed if his pitchers aren’t performing well? Like a manager, he’s not the one out there throwing the pitches, and the guys who are out there throwing the pitches may not be good. You can’t make a chicken salad out of a sow’s ear. It’s entirely possible that our hitting (or pitching) coach is actively tinkering with the swings of each hitter on his team, that he’s a mad genius, and that he could turn anyone into Babe Ruth. Or he might turn Babe Ruth into Mario Mendoza. Or maybe he’s more of a hands-off guy and just happens to be around when Mario Mendoza turns into Babe Ruth. If it does happen, can we safely proclaim him a coaching genius?

Warning! Gory Mathematical Details Ahead!
This is a tricky question to answer (and the math is going to get very gory this week). First, we ask the question of what a hitting (or pitching) coach’s job is. In theory, it’s to make the hitters (or pitchers) on his team better than they had previously been. Even if he’s not working with the most talented bunch, they should be showing some improvement.

Last year, I tried tackling this question, both with hitters and pitchers, using a very different method (mixed linear modeling, for the initiated). I tried to control, as best as I could, for the talent level that a coach had on hand, and then charged any variation from that talent level to the coach. I figured that some of it was random variance, but the randomness would largely cancel itself out across multiple hitters (or pitchers). At the end of the articles, I put together a “best and worst” section, but found myself asking whether I could actually trust the sample sizes that I had. For example, in the hitting coach article, I sang the praises of Kevin Seitzer, because his Royals hitters had been doing better than my model expected, and because he had what I guessed was a long enough track record. Was it really long enough?

I borrowed some methodology that I used in a previous article on figuring out whether we can trust changes in performance from year to year. The method, known as the reliable change index, looks at the difference between this year and last year in some stat (say, Smith picked up 30 points of OBP by going from .290 to .320) but also controls for what sample sizes produced those two numbers and the fact that those numbers are more reliable when they are produced at bigger sample sizes (for the initiated, I created a pooled standard error of the difference, based on the reliability of the stat at X PA and a good estimate of the population variance that we might expect at X PA). The result is something akin to a z-score. This gives us an indicator of whether a player has actually improved or declined.
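
For the initiated, the arithmetic can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the numbers in this article: the function names are made up, and the population standard deviation (0.035) and reliability (0.60) plugged in for Smith are placeholder values chosen only to show how the pieces fit together.

```python
import math

def standard_error_of_measurement(pop_sd, reliability):
    """Noise in one observed rate: population SD scaled by its unreliability."""
    return pop_sd * math.sqrt(1.0 - reliability)

def reliable_change_index(rate_prev, rate_curr, sem_prev, sem_curr):
    """Year-to-year change divided by the pooled standard error of the difference."""
    se_diff = math.sqrt(sem_prev ** 2 + sem_curr ** 2)
    return (rate_curr - rate_prev) / se_diff

# Smith goes from a .290 OBP to a .320 OBP.
# ASSUMPTION: placeholder SD and reliability values, identical in both years.
sem_last_year = standard_error_of_measurement(pop_sd=0.035, reliability=0.60)
sem_this_year = standard_error_of_measurement(pop_sd=0.035, reliability=0.60)
rci = reliable_change_index(0.290, 0.320, sem_last_year, sem_this_year)
print(f"RCI for Smith: {rci:.2f}")
```

Because reliability rises with plate appearances, the same 30-point jump produces a larger RCI when both seasons come from bigger samples.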

From 1993-2013, I looked at the (raw, unadjusted) strikeout rate posted by all hitters, minimum 100 PA. In the following year (again, minimum 100 PA), I looked to see how much their strikeout rate had changed, by the reliable change index (RCI) method I just referenced. RCI can tell us whether the change in some measure has been positive or negative and how much faith we should put in those changes, in much the same way that a t-test can tell us whether the difference between two means is significant enough.

I began by using the standard cutoff point of 1.96 (on the z-distribution, this marks off a standard alpha level of .05). What surprised me was that there were very few cases in which anyone reached that level of certainty. To get any sort of sample size that had changed, I dropped the requirement down to a z-score of 0.5. Those who were above 0.5 had (sorta, kinda, we think—we’re getting loosey goosey with our inclusion criteria) increased their strikeouts (bad), and those below -0.5 had reduced their strikeouts (good). Anyone between was considered to be just holding steady.
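
To put those two cutoffs side by side, here is a small sketch that converts a z-score into a two-sided tail probability and sorts a hitter into one of the three buckets. The function names and bucket labels are made up for illustration; only the 1.96 and 0.5 cutoffs come from the text above.

```python
import math

def two_sided_p(z):
    """Chance of a z-score at least this extreme (either direction) if nothing changed."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def classify_change(rci, cutoff=0.5):
    """Bucket a hitter's strikeout-rate RCI using a symmetric cutoff."""
    if rci > cutoff:
        return "more strikeouts"
    if rci < -cutoff:
        return "fewer strikeouts"
    return "holding steady"

for z in (1.96, 0.5):
    print(f"cutoff {z}: two-sided p = {two_sided_p(z):.3f}")

print(classify_change(-0.8))  # falls below -0.5
print(classify_change(0.2))   # sits between -0.5 and 0.5
```

At 1.96, pure noise clears the bar about 5 percent of the time. At 0.5, it clears the bar about 62 percent of the time, which is how loosey goosey the relaxed criterion is.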

The important thing to note with the RCI method is that it doesn’t rule on whether Smith really improved his true talent level by 30 points, only that we have some amount of certainty that he had improved by some non-zero amount that we cannot directly observe. Maybe his true talent really did improve by 30 points. Maybe it was just 20 and he got 10 points of luck thrown in for good measure.

I then matched each hitter with his hitting coach, as long as the hitter played on the same team all year and the team had one hitting coach all year. (Yes, I know that many teams are going to dual hitting coaches.) For each hitter under the tutelage of a given hitting coach, I coded (yes/no) whether he showed improvement (what the hitting coach is supposed to be producing), and then created a separate code (again, yes/no) for whether the coach at least didn’t screw up the hitter so much that he actively got worse.

To adjust for the fact that some hitting coaches might have older or younger hitters (and that they might be more or less likely to improve), and also the fact that some coaches might have had hitters who had nowhere else to go but down (or up), I ran a binary logistic regression with last year’s strikeout rate and this year’s age (plus the interaction of the two) predicting change. I saved the values that the model spat out for each case and used those as a control value. It’s no great feat to improve a guy who looked primed for improvement anyway. If the hitter did improve, our hitting coach got credit for the improvement over and above expectation.
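
The "credit over and above expectation" step can be sketched like this. It assumes the logistic model has already been fit; the coefficients below are invented placeholders rather than fitted values, and the function names are hypothetical.

```python
import math

def predicted_improvement_probability(k_rate_prev, age, coefs):
    """Logistic model: intercept, last year's K rate, age, and their interaction."""
    b0, b_k, b_age, b_interaction = coefs
    logit = (b0 + b_k * k_rate_prev + b_age * age
             + b_interaction * k_rate_prev * age)
    return 1.0 / (1.0 + math.exp(-logit))

def coach_credit(improved, expected_prob):
    """Observed outcome (1 or 0) minus the model's expected chance of improving."""
    return (1.0 if improved else 0.0) - expected_prob

# ASSUMPTION: made-up coefficients, only to show the mechanics.
HYPOTHETICAL_COEFS = (-1.0, 6.0, 0.0, -0.05)

expected = predicted_improvement_probability(0.25, 27, HYPOTHETICAL_COEFS)
print(f"expected chance of improving: {expected:.2f}")
print(f"credit if he improves:        {coach_credit(True, expected):+.2f}")
print(f"credit if he does not:        {coach_credit(False, expected):+.2f}")
```

A hitter who was 60 percent likely to improve and did earns his coach +.40. One who was 55 percent likely to improve and didn't costs his coach .55.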

Once I had all of these numbers, I ordered all hitters who played under the hitting coach in chronological order (and within years, in simple alphabetical order). I looked to see whether these outcomes (was there improvement—or at least stability—or not?) would stabilize quickly. For the binary outcomes, I used the Kuder-Richardson formula that I have previously used for other binary outcomes. For the cases in which we are taking prior propensity to improve into account, and where the scores might be +.40 and -.55, I used Cronbach’s alpha.
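
For readers who want to see the two reliability statistics, here is a minimal sketch. It assumes one plausible layout (each row is a coach, each column is the nth hitter in that coach's ordered list) and uses KR-20, one of the Kuder-Richardson formulas; the text above does not say which variant was used. The data are toy values, not real coaches.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha for a rows-by-items table of numeric scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    item_vars = [pvariance([row[j] for row in rows]) for j in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("row totals have no variance")
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

def kr20(rows):
    """Kuder-Richardson 20 for a rows-by-items table of 0/1 outcomes."""
    n, k = len(rows), len(rows[0])
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in rows) / n  # share of rows scoring 1 on item j
        pq_sum += p * (1.0 - p)
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - pq_sum / total_var)

# Toy data: 4 coaches x 3 hitter slots. 1 = improved, 0 = did not.
binary_outcomes = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [1, 1, 1]]
# Toy data: outcome minus expected probability, e.g. +.40 or -.55.
adjusted_scores = [[0.40, 0.35, -0.50], [-0.55, -0.45, -0.60],
                   [0.30, -0.40, 0.45], [0.40, 0.50, 0.35]]

print(f"KR-20 on binary outcomes:   {kr20(binary_outcomes):.2f}")
print(f"alpha on adjusted outcomes: {cronbach_alpha(adjusted_scores):.2f}")
```

On 0/1 data the two formulas give the same number, which is why the binary codes and the expectation-adjusted scores can be judged on the same reliability scale.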

For hitting coaches, the stability numbers were laughably small. Around split sample sizes of 50 or so, reliability numbers were only in the .20s, which is well below the accepted line. I tried again with walk rates and got the same answer. OBP? Yeah, you know me. Same thing. Big sentence: There does not appear to be any reliably measurable talent that hitting coaches have for inducing big improvements in the hitters under their care. There’s going to be a lot of luck involved. Someone out there will re-write that sentence as “hitting coaches don’t matter,” which I think would be a big mistake. We’ll talk about why in a minute.

I re-ran the same analyses for pitching coaches with the same basic parameters and on the same variables (K rate, BB rate, OBP against) and found roughly the same thing. The reliability numbers were minuscule.

Quentin Tarantino, Pitching Coach?
I want to again stop anyone from saying “Hitting (or pitching) coaches don’t matter.” That is one possible explanation, but I doubt it’s the correct one. The fact that a hitter or pitcher makes it to the majors suggests that he’s already done a good amount of development in the minors, and there might not be many more big eureka moments to be had with him. Sure, there’s room around the edges, but it’s not the sort of big-ticket growth that “this year’s surprise” stories (and coaching resumes) are made of.

Maybe the hitting coach’s more important job is simply to maintain what the players have already become. If someone makes a big improvement, all the better, but overall, perhaps most hitting coaches are similar in their abilities to keep hitters on an even keel. It’s not that any idiot can do the job, but that there’s just not a lot of room to go one better than the next guy. All of those are perfectly reasonable hypotheses that explain our findings and that don’t require saying that the hitting coach is useless. He just has a different job, and our test here doesn’t assess that job properly. That’s not a big deal.

The other thing that might be at work here is that the model is looking for those big one-year breakouts over previous years. While of course those will always be welcomed, maybe it’s more important that our hitting coach is good at sustaining gradual improvement over time, something that the model here is not going to do a good job of picking up on. Those kinds of breakouts are the result of several factors coming together, and that kind of confluence is going to have a lot of luck that goes with it. It’s not reasonable to expect that one coach would be gifted with all that luck all the time.

The one thing that we can take from these findings is that we need to be careful about the pitching (or hitting) coach who gets the “genius” tag after a couple of his pitchers (or hitters) have a good year. He might very well be a genius, but these results suggest that, going forward, whether he repeats that work is largely a matter of chance. We need to get away from the auteur model of coaching (and film-making). Pulp Fiction was a wonderful movie, and Quentin Tarantino had a lot to do with it, but to credit only him for the movie would be a mistake. In the same way, when a hitter (or pitcher) does emerge, we can’t just reflexively give all the credit to his hitting (or pitching) coach and assume that past performance is indicative of future results. There was untapped potential there, and perhaps a very specific set of circumstances that allowed the coach, purposefully or not, to unlock it. Thus, the coach’s results are primed for regression.

Unfortunately, that makes it really hard to answer the question, “Who is the best hitting coach?” The answer might just be “the one who happens to be in the right place at the right time.”

This conclusion can be drawn about a lot of saber matters... isolated incidents such as Kevin Long with Curtis Granderson work for me, and Chili Davis seems well represented in Oakland's approach, but conclusions rest largely on slippery slopes.
As always, Russell, nice article that made me think!

You're hinting at this in the final few paragraphs, but did you consider adding more years to the sample? Unfamiliar with RCI (though I read the "clinical significance" wiki!), but something as crude as a CAGR over, say, 3-4 years with a more lenient z-score threshold could reveal something.
Thought about that... maybe we need 2-3 years to see the process unfold fully. My decision to keep it to one year though was based more on the fact that people a) freak out when hitters/pitchers are doing so much better/worse than last year and b) people call for firings as a result.
Good article, I quite enjoyed it and it gave me pause to think. A couple of thoughts though. Why the disclaimer about gory math ahead? People either should learn it or they will skip it to the end looking for the conclusion. Secondly, we always seek to quantify any impact a variable will have on the game. I suspect this is one of the variables you won't be able to quantify. I believe that coaching at this level is more what you have alluded to, a comment here or there that might have an impact. Firings are a byproduct of perceived underachievement. Whether this underachievement is attributable to the coach or the player is something the Manager or the GM will be required to determine.
The gory math warning was just something that I've turned into a personal trademark/gimmick. It was actually based on Dante's Inferno (Abandon all hope ye who here enter!) I started doing it a few years ago actually to specifically allow people to skip the details if they just wanted the conclusion. Some people like hearing about covariance matrices. Some don't.
A bit late to this party, but one thing to check w/r/t "maintenance" might be whether certain coaches' hitters (or pitchers?) are streakier than others'.
God's gift to statisticians. Random error. The body's resistance for a repeated behavior to be consistent. Once learned, one RE can enhance or take away another. The statistician is bewildered. Aha, it must be regression. Not the Coach, I say. Bah, Humbug. But it does provide fodder for another article. Enjoyed.