One of the emerging storylines of the postseason so far has been inconsistency in the strike zone. That’s not unique to this postseason, of course; every year sees its share of poor calls, and the effect of those calls is magnified when so much is on the line. Whereas a missed strike may be objectionable in the regular season, it can (at worst) alter the outcome of one game out of 162. Missed calls in the postseason, on the other hand, can end seasons.

As a result, every bad call an umpire makes is scrutinized to a much greater degree. When an umpire’s zone is off—poorly defined, or merely inconsistent—whole legions of fans can flood the internet with vitriol. Generally, an umpire who’s doing a bad job of calling balls and strikes won’t favor the fortunes of one team or the other. But it is frustrating, as a fan, to see a beleaguered slugger’s bat taken out of the game on a borderline call, as happened to Matt Kemp recently.

Kemp’s strikeout (among other questionable decisions in recent umpiring) got me wondering whether an inconsistently called zone affects certain types of players more than others. If this notion is true, even though an umpire’s errors may be random in that they occur to each team equally, one team might be affected by those errors to a greater degree. The most obvious place to start with this question is whether an inconsistent zone favors the hitter or the pitcher more.

Before I transition to the numbers, I want to form some hypotheses as a guide. My initial inclination is that the pitcher ought to be favored more by an inconsistent zone, if only because the pitcher exercises more control in the matchup. The pitcher determines the speed, break, location, and type of each pitch, whereas the hitter only gets to swing or not swing. The pitcher can avoid parts of the zone in which the umpire appears to be hazy, or target them, if advantageous.

To illustrate the scenario more concretely, imagine that the pitcher is up 0-2 in the count against a fairly good hitter. At this point, the pitcher gets up to four chances to strike the hitter out (possibly more, with fouls). If he knows that the umpire seems to be calling the top of the strike zone somewhat inconsistently, he can aim his pitches to that area (or above it), in the hopes that the umpire will incorrectly determine that a ball is a strike. On the other side of the matchup, the hitter can’t do much to counteract that approach. If he swings, he risks putting a bad pitch into play, more than likely resulting in a popup. If he doesn’t swing, he’s relying on the umpire, which might be a bad idea if the ump is doing poorly.

On top of this, there’s an informational asymmetry between batter and pitcher. The pitcher, especially the starting pitcher, sees every pitch thrown and each call that’s made. The batter, on the other hand, observes firsthand only the 10 or 20 pitches he receives in a game, as well as whatever he’s able to glean from watching his teammates at the plate. If there are inconsistent parts of the zone, the pitcher will be better able to observe them than each individual hitter, because he gets as much experience with the zone in a game as all of the hitters put together.

These twin advantages the pitcher has (both in control and in information) suggest to me that the pitcher should be favored over the batter when an umpire’s zone is off. But we can do better than speculate: we can test. First, I need to figure out when the umpires are making errors. Then I can see whether batter performance suffers (relative to expectation) in those games in which umpires are doing an especially poor job.

To investigate this question, I first had to build a model which determined when a pitch should be a ball or a strike. According to the rulebook, there are only four factors which influence that decision: the path of the ball (in three dimensions, so the vertical, horizontal, and depth coordinates of the pitch) as well as the height of the batter. In practice, we know that the strike zone varies considerably in response to lots of other factors as well. We know that it shrinks and expands as the count becomes unbalanced, and that catchers influence the size of the zone via framing skill.

Because I wanted to determine whether the zone seemed inconsistent according to the judgment of the players, I decided it was best to incorporate some of these outside-the-rulebook factors as well. After thousands of innings of organized ball, it seems likely that hitters are aware of the way in which the zone changes according to the count, and we have direct evidence that players are aware of pitch framing. So I incorporated these other variables into the model.

In order to have a flexible model that makes accurate predictions about what the umpires will call, I chose a machine learning approach (specifically, a random forest model). The idea behind this class of model is that its form is not fixed in advance, the way a linear model’s would be. Instead, I feed the algorithm a training set (comprising 30,000 pitches), from which it learns what characteristics make a pitch a ball or a strike. In this way, the algorithm reflects the way a hitter would learn the strike zone: from experience.
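As a rough sketch of this setup, the code below simulates some stand-in pitch data and fits a random forest with scikit-learn. The feature names and the simulated "umpire" are illustrative assumptions, not the actual PITCHf/x fields or the article's real model:

```python
# A minimal sketch of the modeling approach described above, using
# scikit-learn's random forest. The pitch data here are simulated
# stand-ins for the real PITCHf/x feed, and the feature names are
# illustrative, not the actual column names.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 40_000

pitches = pd.DataFrame({
    "plate_x": rng.normal(0.0, 0.8, n),          # horizontal location (ft)
    "plate_z": rng.normal(2.5, 0.7, n),          # vertical location (ft)
    "batter_height": rng.normal(73, 2, n),       # inches
    "balls": rng.integers(0, 4, n),              # count state
    "strikes": rng.integers(0, 3, n),
    "catcher_framing": rng.normal(0.0, 1.0, n),  # framing-skill estimate
})

# Simulated "umpire": a fuzzy rulebook zone, nudged by framing skill
in_zone = (pitches["plate_x"].abs() < 0.83) & pitches["plate_z"].between(1.6, 3.4)
fuzz = rng.normal(0.0, 0.1, n) + 0.05 * pitches["catcher_framing"]
called_strike = (in_zone.astype(float) + fuzz) > 0.5

# Train on 30,000 pitches, then score agreement on the held-out rest
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(pitches.iloc[:30_000], called_strike.iloc[:30_000])
accuracy = model.score(pitches.iloc[30_000:], called_strike.iloc[30_000:])
```

The held-out accuracy here plays the role of the model/umpire agreement rate discussed below; on real data the features would also include the count-dependent zone and a proper framing metric.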

After the model is built, I have the algorithm predict whether each pitch in the remaining data should be a ball or strike, depending on what variables it has decided are important, to what degree, and the interactions between them. Then I can contrast what the model decides is a ball or a strike with what was actually called. If there is disagreement, it suggests that the umpire was calling a pitch in a way that it is not usually called—perhaps that the umpire made a mistake*.

Reassuringly, the model predicts that the umpires get it right the vast majority of the time. For about 91 percent of the ball-strike calls, the model and the umpire agree. Note that this agreement rate is significantly higher than what you’d get from a fixed rulebook zone, reflecting the fact that the model takes into account catcher framing and the expansion and contraction the zone goes through with the count. When the model disagrees with the umpire, it is overwhelmingly in regard to edge cases (catcher perspective):

These edge cases are not all umpire errors, because after all, PITCHf/x is not perfect either—the system has a margin of error as well, and so the umpire and the model are deciding their ball/strike calls based on slightly different data. But probably some of these calls are errors, if only because the edge of the zone is going to be the most difficult portion to call. What’s more, there’s a smattering of pitches in the middle of the zone which should be called strikes (but were apparently called balls), and then outside the zone there’s some of the reverse case.

Assuming you buy into the idea that the model is good at telling a ball from a strike, we can use it to determine how inaccurate umpires have been on a per-game basis. We would expect those per-game accuracies to be centered on their average accuracy (~91%) but with some variation on either side, depending on whether the umpire was doing well or poorly on that particular day.

This histogram shows how often (y-axis) each level of umpire accuracy (x-axis) occurred across 2,000 games in the 2014 regular season. Overall, the umpires do reasonably well, but on bad days, their accuracy can drop to somewhat frightening levels. At the lowest extreme, an umpire can make incorrect calls as frequently as one in every six or so pitches (as opposed to the normal rate of about one in every 11 pitches). The red line illustrates the histogram you would expect if the umpires had a constant probability of error on every pitch of every game. That the actual histogram is a little more dispersed than the red line suggests that umpires sometimes do have bad days, when their calls are systematically off.
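The overdispersion check can be sketched as follows. All the numbers below (pitches per game, the size of the game-to-game drift) are illustrative assumptions, not the article's data:

```python
# A sketch of the overdispersion check: if umpires had a constant
# per-pitch accuracy p in every game (the red line), per-game accuracy
# would follow a binomial distribution. Extra spread beyond that
# suggests genuine bad days. All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p = 0.91       # average agreement rate with the model
n_games = 2000
calls = 150    # rough number of called pitches per game (assumed)

# Null model: the same accuracy on every pitch of every game
null_acc = rng.binomial(calls, p, n_games) / calls

# "Bad day" model: each game's true accuracy drifts a little
game_p = np.clip(rng.normal(p, 0.02, n_games), 0.0, 1.0)
obs_acc = rng.binomial(calls, game_p) / calls

# The drifting version is visibly more dispersed than the null
print(round(null_acc.std(), 4), round(obs_acc.std(), 4))
```

Comparing the standard deviations (or overlaying the two histograms) is the same test the red line performs visually: if the observed spread exceeds the binomial spread, per-game accuracy is not constant.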

On to the question that motivated this article: How do the hitters fare when the umpires are having one of those bad days (say, the 5 percent of games with the lowest accuracies)? The most obvious way a hitter’s performance could suffer is through additional strikeouts, so that’s what I examined. I used the odds ratio method to control for the particular hitters and pitchers involved, that is, to derive an expected strikeout rate for those players. Then I looked at how often the hitters actually struck out in the games in which the umpires did the worst.
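The odds ratio method can be sketched in a few lines; the example rates below are made up for illustration:

```python
# A sketch of the odds ratio method: combine the batter's, pitcher's,
# and league strikeout rates on the odds scale to get an expected K
# rate for the matchup. The example rates below are made up.
def odds(p):
    return p / (1.0 - p)

def expected_k_rate(batter_k, pitcher_k, league_k):
    # Multiply batter and pitcher odds, divide out the league odds,
    # then convert the combined odds back to a probability.
    o = odds(batter_k) * odds(pitcher_k) / odds(league_k)
    return o / (1.0 + o)

# A high-strikeout batter facing a high-strikeout pitcher in a
# league with a 20 percent strikeout rate:
rate = expected_k_rate(0.25, 0.28, 0.20)
```

Summing these expectations over the plate appearances in the worst-called games gives the baseline against which the actual strikeout totals are compared.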

The results show that, as expected, hitters fare worse when the zone is inconsistently defined. The magnitude of the difference isn’t massive, but it’s still considerable: The hitter strikeout rate jumps about 4.4 percent relative to our expectation. In plain English, when the umpires are having a bad day, the batter is 4 to 5 percent more likely to strike out than we would otherwise expect.

It’s not just strikeout rates that increase, though. Just about every positive offensive statistic goes down marginally in the most inconsistent games, and every negative statistic goes up. Even the rate at which balls in play are converted into outs goes up (by ~2 percent), suggesting that hitters might be making slightly weaker contact. That could stem from the inconsistency of the zone as well: Hitters feel pressured to swing when they aren’t sure what the result of a pitch will be. They may be reasoning correctly that a ball hit into play, however weakly, is marginally better than a called strikeout.

These results show that even when the umpire is making random errors, it doesn’t affect all players equally. Because of the asymmetries in the matchup, pitchers prosper when the zone is poorly defined. The good news from this study is that the umpires do get their calls right the majority of the time. Furthermore, other work has shown that the umpires are getting more accurate, so the problem itself is slowly disappearing. But I imagine that does nothing to diminish the sting of seeing your favorite hitter unjustly rung up on a strike that was not a strike late in a playoff game.

*Or that the PITCHf/x data is in error. Or there’s some extenuating factor the model is unaware of, like a usually good framer doing a poor job on some particular pitch.

This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.
Good article, I'd be interested to hear more about this: 'PITCHf/x is not perfect either—the system has a margin of error as well.'

I would think the margin of error would be extremely small once the criteria have been clearly established? I.e. front of the plate, at least 50% of the ball over the plate, clear rules on the height?

I'd like to see the computer call balls and strikes as I expect it would have the added benefit of speeding up the game.
"Good article, I'd be interested to hear more about this: 'PITCHf/x is not perfect either—the system has a margin of error as well.'"

Thanks, and sure. I was referring to the random measurement error inherent in the system, i.e. the difference you would see in the reported location if you somehow threw the exact same pitch twice. Sportvision (makers of PITCHf/x) claim that this is about half an inch to an inch, and that claim has been verified independently by Alan Nathan. So that's pretty good, especially considering how technically challenging the problem is.

On top of this, there's some systematic error game-to-game, which is caused by e.g. miscalibration of the cameras, on the order of ~1 inch. That kind of calibration error should be mostly removed (or at least dramatically diminished) in this analysis, thanks to correction values given to me by resident experts (in all things) Dan Brooks and Harry Pavlidis (thanks guys!).

To your question, my bet is that PITCHf/x is at least as good, if not substantially better, at recognizing pitch locations as a well-trained human. And, hypothetically, if you were to use it for actual pitch calling, you could reduce the error (either with additional cameras or orthogonal data-gathering) to a very low level. Ben Lindbergh has written a nice piece on that very subject.
Each hitter has his own strike zone based upon his height and stance. The strike zone can change up to the last instant before the pitch if he adjusts into a final position.

How does Pitchf/x adapt to each hitter in real time? How certain can it be determined that Pitchf/x accurately duplicates each individual strike zone?
In the case of my model, I use the hitter's height as one of the inputs. This improves accuracy to a small degree, suggesting that the umpires are accounting for the hitter's unique strike zone.

You could build something similar into a machine-called zone by inferring the position of the hitter's shoulders and knees from the video. The cameras run at 60 Hz, so they could adjust for the hitter's zone at the precise moment the pitch was released.

However, to be clear, I am not advocating for a machine-called zone, bhacking is. It's certainly interesting to ponder though, and I think, were pitch-calling-by-computer to be implemented, it might look something like the machine learning algorithm I used for this piece.
Excellent work!
It would seem to have different effects on different types of hitters as well. For the Vlad Guerrero swing-at-everything type, probably not much difference. But for the guys who work the count and know (or used to know) the strike zone, they're going to get caught with the bat on their shoulder a lot more often until they adjust to the expansion.
yep, that's what I think too (and what I will look into next). Maybe the plate discipline guys are generally less consistent as a result, since they have to adjust more to account for game-to-game variability in the called strike zone.
Just as important, and perhaps more unfair, is a consistent umpire whose zone inadvertently favors a particular pitcher's style, or, again inadvertently, penalizes a team due to its lineup construction.

It often seems particular umpires will have a rather consistent variance in their zone - wide on the left or right side, or a higher or lower zone. This will clearly work against a team facing a pitcher whose arsenal may be "enhanced" by that ump's zone, or against a lineup getting a wider zone outside if its hitters skew away from that wide zone.

This isn't bad ump'ing - it's just a good or bad fit for 1 pitcher vs. their opponent.

It's not always bad or inconsistent umpiring that has an effect.
Agreed, although I would note that in the model as I built it, if a particular ump had a different version of the zone, that would go in the category of "bad ump'ing" since he's calling it in a way which is inconsistent with other umpires.

I debated whether to include the umpire ID in the model, since players might be aware of the umpire's particular proclivities, but decided that it was too minor of an effect for them to adjust their strategy. That's something I should look into in more detail, though.
That said, I'm personally not in favor of a mechanistic approach /"solution" to this. The impact seems fairly minimal - maybe a swing of 1 out in a game.

As Emerson said, a foolish consistency is the hobgoblin of small minds. Embrace variability.
One thing that jumped out at me was that you would think umpires would be more accurate in games in which a greater percentage of pitches were clearly inside or outside the strike zone, and would be less accurate where a higher percentage of the pitches were on or near the edges of the zone. I'd also think that hitters would tend to have better results against pitchers who were consistently in or out of the zone than they would have against pitchers who were putting pitches on the edges of the zone (regardless of umpire accuracy).

If the type of pitches that are difficult to hit (on the edges of the zone) are the same types of pitches that are difficult for an umpire to call accurately, seems not unlikely that that could account for the decline in offensive output in these situations.
That's a great point. One of the things I'm planning to do to follow up is to use a continuous measure of accuracy which penalizes umpires less for the edge cases than for the glaringly obvious missed calls. Hopefully that should correct for this potential problem. In the meantime, I'll look and see whether that would explain what's happening in the "least-accurate" games.
I'm not sure that this is even possible, but my guess is that when an umpire is "not accurate" it is much more likely that he is calling a larger zone than he is calling a smaller zone, which is why you are getting more K (and less offense) for "inaccurate" umpiring.

I am not buying that if an umpire's zone is tight (small) but inaccurate, that it would cause an increase in K (and decrease in offense in general).

I track umpires quite a bit, and the ones who have a tight zone are the ones that call an accurate zone. The ones with a bad zone are almost invariably those with a large zone (larger than the actual zone).

I think the problem is in not "zeroing out" accurate and inaccurate. If you are using the actual zone and not the de facto zone to determine accuracy, you will find that the average umpire is "inaccurate" (larger zone).

If you define "accurate" as the de facto zone, such that the average umpire is accurate by definition, then you should find that accuracy, or at least a deviation in accuracy from the de facto zone, has no effect on K rate or any other offensive component, on the average. Deviation on the large side obviously creates more K and deviation on the small side obviously creates less K.

But to say that "any deviation from accuracy, on the average creates more K and less offense in general," is very misleading.
"If you define "accurate" as the de facto zone, such that the average umpire is accurate by definition, then you should find that accuracy or at least a deviation in accuracy from the de facto zone, has no effect on K rate or any other offensive component, on the average."

Yes, to be clear, I am using the de facto zone, or at least trying to do so, accounting for things like the expansion in certain ball-strike counts and pitch framing. So it is the latter case, where I think we should not expect to see a difference in K rate, but we actually do.
You attribute the increased dispersion in your distribution to umpires having bad days. Since the zones are not umpire-specific, couldn't the dispersion be a reflection not of bad days but of bad umpires?
Could be, but when I bring umpire id into the model, the distribution is still over-dispersed, suggesting that it's not just bad umpires screwing things up.
Ho-hum, another great article from Baseball Prospectus. I am one of the trolls who is constantly harping on bad umpiring. With replay it is almost all centered on balls and strikes and I have thought that it might be playing a role in the decline of offense. There was a game this season, Mariners vs. Astros in which no less than 16 pitches below the zone were called strikes. The umpire, Mark Carlson, was consistent but he was awful. The low strike, as it is called now, and the general lowering and widening of the zone have made hitting a baseball even harder than it inherently is and the loss of balance between pitching and hitting is bringing us back towards 1968, which very few fans of the game want.