How do you know whether your team’s hitting coach or pitching coach is doing a good job? Generally, the answer is “Well, how are his hitters/pitchers doing? Are they getting better?” That seems to be the justification given when he gets fired, after all.

Is that fair? Can a pitching coach really be blamed if his pitchers aren’t performing well? Like a manager, he’s not the one out there throwing the pitches, and the guys who are out there throwing the pitches may not be good. You can’t make a chicken salad out of a sow’s ear. It’s entirely possible that our hitting (or pitching) coach is actively tinkering with the swings of each hitter on his team, that he’s a mad genius, and that he could turn anyone into Babe Ruth. Or he might turn Babe Ruth into Mario Mendoza. Or maybe he’s more of a hands-off guy and just happens to be around when Mario Mendoza turns into Babe Ruth. If it does happen, can we safely proclaim him a coaching genius?

Warning! Gory Mathematical Details Ahead!
This is a tricky question to answer (and the math is going to get very gory this week). First, we ask the question of what a hitting (or pitching) coach’s job is. In theory, it’s to make the hitters (or pitchers) on his team better than they had previously been. Even if he’s not working with the most talented bunch, they should be showing some improvement.

Last year, I tried tackling this question, both with hitters and pitchers, using a very different method (mixed linear modeling, for the initiated). I tried to control, as best as I could, for the talent level that a coach had on hand, and then charged any variation from that talent level to the coach. I figured that some of it was random variance, but the randomness would largely cancel itself out across multiple hitters (or pitchers). At the end of the articles, I put together a “best and worst” section, but found myself asking whether I could actually trust the sample sizes that I had. For example, in the hitting coach article, I sang the praises of Kevin Seitzer, because his Royals hitters had been doing better than my model expected, and because he had what I guessed was a long enough track record. Was it really long enough?

I borrowed some methodology that I used in a previous article on figuring out whether we can trust changes in performance from year to year. The method, known as the reliable change indicator, looks at the difference between this year and last year in some stat (say, Smith picked up 30 points of OBP by going from .290 to .320) but also controls for what sample sizes produced those two numbers and the fact that those numbers are more reliable when they are produced at bigger sample sizes (for the initiated, I created a pooled standard error of the difference, based on the reliability of the stat at X PA and a good estimate of the population variance that we might expect at X PA). The result is something akin to a z-score. This gives us an indicator of whether a player has actually improved or declined.
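The RCI computation described above can be sketched roughly like this. The standard deviation and reliability values below are made-up placeholders for illustration, not the article's actual estimates:

```python
import math

def reliable_change_index(stat_then, stat_now, sd_pop, rel_then, rel_now):
    """Reliable change index for a year-to-year change in some rate stat.

    sd_pop       -- estimated population standard deviation of the stat
                    at the relevant sample size
    rel_then/now -- reliability of the stat at each season's sample size
    """
    # Standard error of measurement for each season's observed value.
    se_then = sd_pop * math.sqrt(1.0 - rel_then)
    se_now = sd_pop * math.sqrt(1.0 - rel_now)
    # Pooled standard error of the difference between the two seasons.
    se_diff = math.sqrt(se_then ** 2 + se_now ** 2)
    # The result behaves like a z-score.
    return (stat_now - stat_then) / se_diff

# Smith picks up 30 points of OBP, going from .290 to .320.
z = reliable_change_index(0.290, 0.320, sd_pop=0.035, rel_then=0.70, rel_now=0.70)
```

A z near zero says the change is indistinguishable from measurement noise; the bigger its absolute value, the more confident we can be that some real (if unobservable) change occurred.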

From 1993-2013, I looked at the (raw, unadjusted) strikeout rate posted by all hitters, minimum 100 PA. In the following year (again, minimum 100 PA), I looked to see how much their strikeout rate had changed, by the reliable change index (RCI) method I just referenced. RCI can tell us whether the change in some measure has been positive or negative and how much faith we should put in those changes, in much the same way that a t-test can tell us whether the difference between two means is significant enough.

I began by using the standard cutoff point of 1.96 (on the z-distribution, this marks off a standard alpha level of .05). What surprised me was that there were very few cases in which anyone reached that level of certainty. To get any sort of sample of players who had changed, I dropped the requirement down to a z-score of 0.5. Those who were above 0.5 had (sorta, kinda, we think—we’re getting loosey goosey with our inclusion criteria) increased their strikeouts (bad), and those below -0.5 had reduced their strikeouts (good). Anyone in between was considered to be just holding steady.
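The looser ±0.5 bucketing works out to a simple three-way split (a trivial sketch; remember that for strikeout rate, a positive change is bad):

```python
def classify_change(rci, cutoff=0.5):
    """Bucket an RCI z-score into increased / steady / decreased categories.

    For strikeout rate, a positive RCI means more strikeouts (bad) and a
    negative RCI means fewer strikeouts (good)."""
    if rci > cutoff:
        return "increased"
    if rci < -cutoff:
        return "decreased"
    return "steady"
```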

The important thing to note with the RCI method is that it doesn’t rule on whether Smith really improved his true talent level by 30 points, only that we have some amount of certainty that he had improved by some non-zero amount that we cannot directly observe. Maybe his true talent really did improve by 30 points. Maybe it was just 20 and he got 10 points of luck thrown in for good measure.

I then matched each hitter with his hitting coach, as long as the hitter played on the same team all year and the team had one hitting coach all year. (Yes, I know that many teams are going to dual hitting coaches.) I coded (yes/no) whether each hitter under the tutelage of his hitting coach showed improvement (what the hitting coach is supposed to be doing), and then created a separate code (again, yes/no) for those whose coaches at least didn’t screw them up so badly that they actively got worse.

To adjust for the fact that some hitting coaches might have older or younger hitters (and that they might be more or less likely to improve), and also the fact that some coaches might have had hitters who had nowhere else to go but down (or up), I ran a binary logistic regression with last year’s strikeout rate and this year’s age (plus the interaction of the two) predicting change. I saved the values that the model spat out for each case and used those as a control value. It’s no great feat to improve a guy who looked primed for improvement anyway. If the hitter did improve, our hitting coach got credit for the improvement over and above expectation.
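The control step might look something like the following sketch. The article doesn't specify the software or the exact model specification, and the hitter-season rows here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented hitter-season rows: last year's strikeout rate, this year's age,
# and whether the RCI said the hitter improved (1) or not (0).
k_rate_last = np.array([0.25, 0.12, 0.30, 0.18, 0.22, 0.15, 0.28, 0.10])
age = np.array([24.0, 31.0, 27.0, 22.0, 29.0, 34.0, 25.0, 30.0])
improved = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Predictors: last year's rate, this year's age, and their interaction.
X = np.column_stack([k_rate_last, age, k_rate_last * age])

model = LogisticRegression().fit(X, improved)

# The saved fitted probabilities are the control values: how likely a
# hitter was to improve anyway, given his age and prior strikeout rate.
expected = model.predict_proba(X)[:, 1]
```

Credit (or blame) then accrues to the coach only for outcomes over and above these expected values.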

Once I had all of these numbers, I ordered all hitters who played under each hitting coach in chronological order (and, within years, in simple alphabetical order). I looked to see whether these outcomes (was there improvement—or at least stability—or not?) would stabilize quickly. For the binary outcomes, I used the Kuder-Richardson formula that I have previously used for other binary outcomes. For the cases in which we take prior propensity to improve into account, where the scores might be something like +.40 or -.55, I used Cronbach’s alpha.
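Cronbach's alpha (of which the Kuder-Richardson formula, KR-20, is the special case for yes/no items) can be computed like this. The toy matrix below stands in for coaches (rows) and their hitters' outcome scores in order (columns):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha over a (cases x items) matrix of scores.

    With 0/1 items this reduces to the Kuder-Richardson formula (KR-20)."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
    return (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)

# Perfectly consistent yes/no outcomes give alpha = 1; the article's
# real coach-by-hitter matrices came out far lower.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha = cronbach_alpha(toy)
```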

For hitting coaches, the stability numbers were laughably small. Around split sample sizes of 50 or so, reliability numbers were only in the .20s, which is well below the accepted line. I tried again with walk rates and got the same answer. OBP? Yeah, you know me. Same thing. Big sentence: There does not appear to be any reliably measurable talent that hitting coaches have for inducing big improvements in the hitters under their care. There’s going to be a lot of luck involved. Someone out there will re-write that sentence as “hitting coaches don’t matter,” which I think would be a big mistake. We’ll talk about why in a minute.

I re-ran the same analyses for pitching coaches with the same basic parameters and on the same variables (K rate, BB rate, OBP against) and found roughly the same thing. The reliability numbers were minuscule.

Quentin Tarantino, Pitching Coach?
I want to again stop anyone from saying “Hitting (or pitching) coaches don’t matter.” That is one possible explanation, but I doubt it’s the correct one. The fact that a hitter or pitcher makes it to the majors suggests that he’s already done a good amount of development in the minors, and there might not be many more big eureka moments to be had with him. Sure, there’s room around the edges, but it’s not the sort of big-ticket growth that “this year’s surprise” stories (and coaching resumes) are made of.

Maybe the hitting coach’s more important job is simply to maintain what the players have already become. If someone makes a big improvement, all the better, but overall, perhaps most hitting coaches are similar in their abilities to keep hitters on an even keel. It’s not that any idiot can do the job, but that there’s just not a lot of room to go one better than the next guy. All of those are perfectly reasonable hypotheses that explain our findings and that don’t require saying that the hitting coach is useless. He just has a different job, and our test here doesn’t assess that job properly. That’s not a big deal.

The other thing that might be at work here is that the model is looking for those big one-year breakouts over previous years. While of course those will always be welcomed, maybe it’s more important that our hitting coach is good at sustaining gradual improvement over time, something that the model here is not going to do a good job of picking up on. Those kinds of breakouts are the result of several factors coming together, and that kind of confluence is going to have a lot of luck that goes with it. It’s not reasonable to expect that one coach would be gifted with all that luck all the time.

The one thing that we can take from these findings is that we need to be careful about the pitching (or hitting) coach who gets the “genius” tag after a couple of his pitchers (or hitters) have a good year. He might very well be a genius, but these results suggest that, going forward, whether he repeats that work is largely a matter of chance. We need to get away from the auteur model of coaching (and film-making). Pulp Fiction was a wonderful movie, and Quentin Tarantino had a lot to do with it, but to credit only him for the movie would be a mistake. In the same way, when a hitter (or pitcher) does emerge, we can’t just reflexively give all the credit to his hitting (or pitching) coach and assume that past performance is indicative of future results. There was untapped potential there, and perhaps a very specific set of circumstances that allowed the coach, purposefully or not, to unlock it. Thus, the coach’s results are primed for regression.

Unfortunately, that makes it really hard to answer the question, “Who is the best hitting coach?” The answer might just be “the one who happens to be in the right place at the right time.”