
Caught Looking examines articles from the academic literature relevant to baseball and statistical analysis. This review covers three articles on the topic of racial discrimination by umpires. The goal, as always, is to expose the academic frontier to a wide audience and seek ways to move the discussion forward.

Sports can sometimes provide an interesting laboratory to examine pressing social questions that are hard to analyze in other places. It’s hard to measure whether two accountants have different salaries because they have different productivity levels or because one might face discrimination in the labor market. With sports, and especially baseball, a wealth of data allows us to measure productivity with a precision not available in most industries.

This past year, two papers in the Journal of Sports Economics investigated the possibility of racial discrimination by umpires in Major League Baseball using datasets that covered millions of pitches. The first paper, Further Examination of Potential Discrimination Among MLB Umpires, by Scott Tainsky, Brian Mills and Jason Winfree, focused on data from 1997-2008, and the second, The Connection Between Race and Called Strikes and Balls, by Jeff Hamrick and John Rasp, drew on data from 1989-2010. Both of these papers examined carefully a 2011 paper that appeared in the American Economic Review, the top journal in the economics literature. Strike Three: Discrimination, Incentives, and Evaluation, by Christopher Parsons, Johan Sulaeman, Michael Yates and Daniel Hamermesh, focused on the period 2004-2008.

Parsons and his co-authors found some evidence of discrimination by umpires against pitchers of a different race, but also found that this effect disappears the more closely the umpire’s decision is scrutinized. While the effect is statistically significant, it is also small, affecting on average something less than one pitch per team per game. In both of the more recent papers, the authors found that any evidence of racial discrimination by umpires depends heavily on the particular modeling assumptions.

Before digging into what causes the differences in results, it’s useful to look first at the goals of the first study. Parsons and his co-authors had a somewhat more ambitious goal than to simply identify whether there was discrimination among umpires. They also wanted to see whether pitchers and batters changed their behavior in response to this. In other words, if a pitcher found himself on the wrong end of a discriminating umpire, he may be more likely to try to miss bats with fastballs than try to sneak a back-door breaking ball on the black. In addition, they also asked whether umps tried to get away with discrimination more often when they weren’t being as closely watched, either by fans or by the newly invented QuesTec machines, and whether pitchers and batters were adjusting accordingly.

To recap, the authors hypothesized that umpires engaged in racial discrimination, that players were able to recognize that this discrimination existed, that players were able to recognize that the extent of discrimination was different under different conditions of monitoring, and that players adjusted their behavior in the face of discrimination. This is a lot to ask of data, even with millions of observations, and it is remarkable that they found some support all the way through this logical chain.

Their paper focused on umpire-pitcher discrimination, rather than umpire-batter discrimination, and began by classifying umpires and pitchers as White, Black, Hispanic or Asian, then breaking down the called-strike percentage of taken pitches by different racial combinations. After examining over 3.5 million pitches, a little over half of which were called balls or strikes by the umpire, they found a slightly higher called-strike percentage (32.0 percent to 31.5 percent) when the umpire and pitcher are racially matched than when they are not. Next, they employed a linear probability model to account for differences in game situations (inning, count, home field) and umpire-, pitcher- and batter-fixed effects (more on this later), to isolate the effect of umpire discrimination on called strike probabilities. There is some evidence that Hispanic umpires may favor Hispanic pitchers relative to White pitchers, and some evidence that White umpires favor White pitchers over Hispanic and Black pitchers, but the magnitude is small. In the full model, the authors do not find that their estimates for discrimination are statistically significant.
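To make the linear probability model a little more concrete, here is a minimal Python sketch. The pitch records are made up for illustration and do not come from any of the papers, and the function names are hypothetical. With a single racial-match dummy and no controls, the least-squares slope is simply the raw gap in called-strike rates between matched and unmatched pairs, the same kind of gap as the 32.0 versus 31.5 percent comparison above. The published models layer game-situation controls and fixed effects on top of this.

```python
from statistics import mean

def called_strike_rate(pitches, matched):
    """Share of called pitches ruled strikes for one match status."""
    calls = [strike for is_match, strike in pitches if is_match == matched]
    return mean(calls)

def lpm_slope(pitches):
    """OLS slope of strike (0/1) on the match dummy (0/1), no controls."""
    x = [float(m) for m, _ in pitches]
    y = [float(s) for _, s in pitches]
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    return cov / var

# Made-up sample: (umpire and pitcher racially matched?, called strike?)
pitches = (
    [(True, 1)] * 8 + [(True, 0)] * 17       # matched: 8 of 25 called strikes
    + [(False, 1)] * 15 + [(False, 0)] * 35  # unmatched: 15 of 50 called strikes
)

gap = called_strike_rate(pitches, True) - called_strike_rate(pitches, False)
slope = lpm_slope(pitches)
print(f"raw gap: {gap:.4f}, LPM slope: {slope:.4f}")
```

The two numbers agree because a regression on one binary variable can only compare two group means. Everything interesting in the papers comes from what else goes into the model.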

However, they press on. Perhaps close monitoring changes the behavior of umpires—if you know you’re being watched more carefully, you discriminate at a different time. Alternatively, perhaps racial bias changes the behavior of pitchers—if you know you’re getting squeezed on the corners, you don’t throw to the corners. And here they find evidence that, in fact, people are changing their behaviors. In the period of study, QuesTec cameras were installed in some, but not all, ballparks to track pitches. Without QuesTec in the park, umpires are 0.59 percentage points more likely to call a strike for a racially matched pitcher, but the presence of QuesTec reduces the probability by 1.07 percentage points, more than offsetting the racial bias found without QuesTec. Both effects are statistically significant. They find similar results for highly attended games and on terminal (two-strike or three-ball) counts where implicit scrutiny is likely to be high. In addition, PITCHf/x data on pitch location from the 2007-2008 seasons shows that racially matched pitchers do in fact throw more often to the edges of the strike zone in non-QuesTec parks, and that racially matched pitchers give up fewer hits in non-QuesTec parks. Together, all of these results suggest that racial differences between pitchers and umpires impact behavior, and that monitoring reduces or eliminates its effects.
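The QuesTec arithmetic can be spelled out in a few lines of Python. This is only a sketch: it assumes the 1.07-point reduction is the coefficient on a match-by-QuesTec interaction, which is how the summary above reads, and the constant and function names are hypothetical.

```python
# Point estimates quoted above, in percentage points.
MATCH_EFFECT = 0.59             # racially matched pitcher, no QuesTec
MATCH_X_QUESTEC_EFFECT = -1.07  # additional shift when QuesTec is present

def match_effect(questec: bool) -> float:
    """Implied change in called-strike probability for a matched pitcher."""
    return MATCH_EFFECT + (MATCH_X_QUESTEC_EFFECT if questec else 0.0)

for questec in (False, True):
    print(f"QuesTec={questec}: {match_effect(questec):+.2f} points")
```

Under that reading, the implied effect of a racial match flips from +0.59 points without QuesTec to about -0.48 points with it, which is what "more than offsetting" means here.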

This is fascinating, and even in datasets with millions of observations like we have with pitches, this is a hard thing to find. However, these results come at the end of a long logical chain and depend on the first result—racial discrimination among umpires—being present. The two more recent papers call this into question.

Hamrick and Rasp use data for 22 seasons, from 1989-2010, covering more than 13 million pitches. They also focus on umpire-batter matching, not just umpire-pitcher matching, and find a mix of contradictory information even in their descriptive statistics. For example, White umpires call more strikes for White pitchers relative to Black and Hispanic pitchers, but also call relatively more strikes against White hitters. The empirical strategy is to use two-way and three-way interaction terms to identify racial discrimination. Three-way interaction terms combine dummy variables for umpire race, pitcher or hitter race, and situation (terminal count, QuesTec, attendance, game score). In the full version of their model, none of the three-way interaction effects are statistically significant, though they suppress the actual coefficient estimates, which would show whether the signs and magnitudes are consistent with what would be expected. From here, they drop the three-way interaction terms and revisit the model with two-way interaction terms.
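For readers who have not built interaction terms before, the following Python sketch shows how the dummies for a single pitch might be constructed. The category lists, variable names and naming scheme are assumptions made for illustration, not the authors' actual specification; a real regression would also drop a reference category to avoid perfect collinearity.

```python
UMPIRE_RACES = ("White", "Black", "Hispanic")
PLAYER_RACES = ("White", "Black", "Hispanic")
SITUATIONS = ("questec", "terminal_count")

def interaction_dummies(umpire_race, pitcher_race, situation_flags):
    """Build 0/1 interaction terms for one pitch.

    situation_flags maps each situation name to True or False.
    Two-way terms: umpire race x situation.
    Three-way terms: umpire race x pitcher race x situation.
    """
    row = {}
    for u in UMPIRE_RACES:
        for s in SITUATIONS:
            in_situation = int(situation_flags[s])
            row[f"ump_{u}:{s}"] = int(umpire_race == u) * in_situation
            for p in PLAYER_RACES:
                pair = int(umpire_race == u and pitcher_race == p)
                row[f"ump_{u}:pit_{p}:{s}"] = pair * in_situation
    return row

# Hypothetical pitch: Hispanic umpire, White pitcher, QuesTec park, non-terminal count.
row = interaction_dummies(
    "Hispanic", "White", {"questec": True, "terminal_count": False}
)
active = sorted(name for name, value in row.items() if value)
print(len(row), active)
```

The two-way term switches on for any pitcher facing that umpire in that situation, while the three-way term switches on only for the specific race pairing. That is why a significant two-way term alongside an insignificant three-way term points to umpire behavior that does not depend on who is pitching.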

In this version of the model, the authors find a number of statistically significant differences in behavior across races that appear in various circumstances—for instance, Latin umpires call fewer strikes when QuesTec is present. But note that the three-way indicator was not significant, implying that this behavior is roughly consistent across races of pitchers and batters. There are a number of other situations where umpire race and called strike probability are statistically significantly related, but again, independent of the race of the pitcher or batter.

Hamrick and Rasp suggest that these racial differences in behavior may lead to spurious findings of racial discrimination in other studies. They don’t refer to the Parsons study directly here, but it would seem to apply.

Tainsky, Mills and Winfree use data from 1997-2008 on umpire and pitcher races, testing both the whole period and the 2004-2008 period used in the Parsons study. This paper uses a much smaller set of control variables: an indicator for home park and an indicator for QuesTec, alongside the variables of interest, which are an indicator for racial match between pitcher and umpire and an interaction between the QuesTec and racial match indicators. In models without fixed effects, the authors find a statistically significant effect of discrimination of a magnitude consistent with the other studies, about half a percentage point. They also find that the presence of QuesTec partially offsets this effect. However, they do not find evidence of statistically significant discrimination in models including fixed effects for year, pitcher and umpire. This stands in contrast to the earlier study, which found discrimination in models with fixed effects.

In an appendix to their paper, Tainsky, Mills and Winfree dig into the differences between the two papers, and there are several. First, racially classifying umpires and players is hard, especially in the case of Black Hispanic individuals. Laz Diaz, for instance, is responsible for about a tenth of pitches called by minority umpires, and the two datasets arrive at different classifications. A number of other corrections and distinctions are identified as well.

Second, and more importantly in terms of the difference in results, the two models treat fixed effects differently. Fixed effects, in a nutshell, are like player-specific or umpire-specific constants. It’s deeper than that, but basically, different umpires have at least slightly different strike zones and called-strike probabilities, but with fixed-effects, the umpire’s unique fingerprint is held constant from game to game. The same is true for pitchers—some pitchers simply have an approach that leads to more called strikes than others. It seems highly appropriate to include fixed effects for pitchers and umpires when modeling for discrimination. However, as the appendix points out, the two papers treat fixed effects differently. Tainsky and his collaborators use a single pitcher-fixed effect and find no evidence of discrimination, while Parsons and co-authors use two fixed effects for each pitcher and each umpire, one in QuesTec parks and one in non-QuesTec parks, and with this specification, they turn up evidence of racial discrimination.
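The difference between the two fixed-effect treatments is easier to see in a toy example. The Python sketch below uses made-up pitches and hypothetical names; it applies the standard within transformation (subtracting each group's mean), which is one common way to implement fixed effects and not necessarily the estimation routine either paper used.

```python
from collections import defaultdict

def group_key(pitch, split_by_questec):
    """One intercept per pitcher, or one per pitcher per park type."""
    if split_by_questec:
        return (pitch["pitcher"], pitch["questec"])
    return pitch["pitcher"]

def demean(pitches, split_by_questec):
    """Within transformation: strike outcome minus its group mean."""
    groups = defaultdict(list)
    for pitch in pitches:
        groups[group_key(pitch, split_by_questec)].append(pitch["strike"])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    residuals = [
        pitch["strike"] - means[group_key(pitch, split_by_questec)]
        for pitch in pitches
    ]
    return residuals, len(means)

# Made-up pitches: both pitchers get the call in QuesTec parks and not elsewhere.
pitches = [
    {"pitcher": "A", "questec": True, "strike": 1},
    {"pitcher": "A", "questec": True, "strike": 1},
    {"pitcher": "A", "questec": False, "strike": 0},
    {"pitcher": "A", "questec": False, "strike": 0},
    {"pitcher": "B", "questec": True, "strike": 1},
    {"pitcher": "B", "questec": False, "strike": 0},
]

single_resid, n_single = demean(pitches, split_by_questec=False)
split_resid, n_split = demean(pitches, split_by_questec=True)
print(n_single, single_resid)
print(n_split, split_resid)
```

With one intercept per pitcher, the park-to-park swing in each pitcher's called-strike rate stays in the residuals, where other variables can pick it up. With separate QuesTec and non-QuesTec intercepts, that swing is absorbed by the fixed effects themselves, so the two specifications leave different variation for the racial-match terms to explain.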

Tainsky, Mills and Winfree clearly prefer one fixed effect per pitcher, but it’s easy to see why Parsons, Sulaeman, Yates and Hamermesh may prefer two. QuesTec was controversial at its adoption, and there appears to have been at least some conscious impact of the system on player and umpire behavior. All of the studies showed an increase in called-strike probability in QuesTec stadiums, so something changed in QuesTec games versus non-QuesTec games, either in the approach of pitchers, umpires, or both. It’s also quite possible that not all players or umpires changed in uniform fashion, rather that some adjusted differently to QuesTec than others. In other words, the fixed effect for an individual might be different from QuesTec parks to non-QuesTec parks.

The Tainsky study also looks at PITCHf/x data to map the actual called strike zone and test whether pitchers are more likely to throw to the edges of the zone in the presence of a racially matched umpire. Plots of strike zones developed from a smoothed Generalized Additive Model (see the paper for details and cool diagrams) in stadiums with and without QuesTec look nearly identical for both matched and non-matched pitcher-umpire pairs. With these maps, they do find evidence that pitchers tease the edges of the zone in two-strike counts, but they find no evidence in support of discrimination-induced changes in pitcher behavior in regression models.

The Parsons study turns up some interesting and seemingly robust results that depend on a complex series of interactions. Umpires first must discriminate, but only in some instances. Players must recognize this and must find it beneficial to change their behavior accordingly, but again, only in some instances. Discrimination, though, is hard to recognize in the data, even when looking at millions of pitches over decades. Perhaps pitchers can recognize early in games when they’re getting squeezed and adjust within the game, though this seems hard to imagine when discrimination seems to affect about one pitch per game per team. Perhaps minority pitchers whose tendency is to challenge hitters in high-leverage situations perform better in the minor leagues and thus are self-selected into the majors, but this wouldn’t seem to affect their behavior in QuesTec versus non-QuesTec parks.

Looking forward, PITCHf/x provides a new set of tools to investigate the issue of discrimination at the umpire level. None of the studies takes up this issue directly, but game-day scouting of a particular umpire’s tendencies remains a plausible path toward the behaviors and outcomes identified in the Parsons study. Unfortunately for researchers, but perhaps fortunately for fairness, the PITCHf/x strike zone maps we can create these days represent the information needed to identify discrimination if it exists, but also the kind of monitoring that would drive it away.

Michael Wenz is Associate Professor of Economics at Northeastern Illinois University. Caught Looking reviews mainly recent articles from peer-reviewed academic journals. Please send along suggestions, especially for interesting dissertation or thesis chapters that aren’t always easy to find.

swarmee
2/05
"Their paper focused on umpire-pitcher discrimination, rather than umpire-batter discrimination, and began by classifying umpires and *batters* as White, Black, Hispanic or Asian, then breaking down the called-strike percentage of taken pitches by different racial combinations." Shouldn't the *batters* be pitchers instead?
lyricalkiller
2/05
Yup, thanks
mgwenz
2/05
Yes, thanks!
markpadden
2/05
Guess how many Latin umpires were used in these studies?
mgwenz
2/05
It's a small number. Small number of Black umpires too. I wanted to talk a bit more about that part of it, but I don't know what I have to add. Go check out the originals! Mike
bmmillsy
2/05
No need to guess. It's not many minority umpires in general--and as we show in the appendices, the effect estimates are strongly affected by changing just one umpire race/ethnicity categorization. Here's a direct passage from our paper: "Also of note is that even though the sample has nearly 8.3 million pitches, the number is reduced greatly when examining some combinations of umpires and pitchers only for pitches subject to judgment by umpires. Indeed, from 2004-2008, there were only around 2,550 pitches thrown by Black pitchers requiring the judgment of Black umpires."
bmmillsy
2/05
Based on my own study of modeling a strike zone, 2,500 pitches tends to not be nearly enough called pitches to have a good idea of what that strike zone looks like even for a single umpire with less variability than an aggregate of multiple umpire zones.
markpadden
2/06
My point is that there is not nearly enough data to test any of the hypotheses -- the original paper should never have been written. The behavior of a handful of individuals -- no matter what that behavior is -- simply cannot be used to infer the behavior and intent of a large group of people.
bmmillsy
2/09
I understood your point, and I was providing further information. Much of what you say here was the entire point of our retort in JSE.
bmmillsy
2/05
I'll also add a bit to what our original paper reported. I fit a number of models with substantial control variables (including location of pitches) to evaluate discrimination that were not included in the published paper or its appendix. I found no statistically significant effects of race matching in the FX data, either (data through either 2010 or 2011, I forget specifically). I even fit these with separate race match variables for each race classification combination. And Michael explains why it might be troublesome to look there (including my own estimates): all pitches are closely monitored after the implementation of F/X. We also make this note in our paper (IIRC, maybe it was removed) that if the interest is the differential effect of monitoring, then it seems strange to use a sample where all pitches are monitored very closely (the data are publicly available) to make this evaluation. There is much more to do here with such great data. I continue to try and work with it and wish I could do it all!
derekdeg
2/05
Where are the tests that show that Joe West, Angel Hernandez, and CB Buckner are just garbage umps? That they may not have any racial bias between them and instead are just really bad at their job. Does it make a difference that they are all of different races and each of them deserves to be replaced?
flyingdutchman
2/05
At the risk of getting too far afield, I could not agree more, Derek. Those three umps are awful, Hernandez especially. On top of being what I believe is the worst ump in the league (subjectively, I admit), he at least once intentionally made the wrong call (the A's non-homer in Cleveland). I don't understand why a player gets kicked out of baseball for life for throwing a game while an umpire gets away with what happened there.
bhacking
2/05
I've said it before and I'll keep saying it until it happens. Let the computer call balls and strikes. I don't go to the ballpark to watch the umpires.
JPinPhilly
2/06
I enjoyed this. I appreciate the initial effort to try and find some evidence of discrimination. I like that someone thought to go looking for that. And I agree with the challenges raised to the initial findings. Good stuff. While it may be tough to pinpoint any real evidence of discrimination, it looks like the data does show that QuesTec (or some other form of monitoring that the umps are aware of) does seem to impact the calls. If the umps know that they're being watched, they try harder to get it right. Feels like it would be obvious but it's still nice to see that demonstrated in the data. I agree with the closing bit about how we now have the kind of technology available to us to not only more accurately measure the variables but also prevent this kind of inconsistency from ever becoming a real problem again. Regardless of whether or not it ever existed in the first place. One pitch per game per team isn't huge but it is statistically significant enough to raise an eyebrow. The bottom line seems to be that we should monitor the umps and let them know it's happening. Then hold them to any glaring issues. Just don't do robot umps. It's too weird.
mgwenz
2/06
I do think a lot of the recent chatter about the transitory nature of large gains to pitch framing is on the money for exactly this reason. Monitoring it will improve the umpiring. Mike
bmmillsy
2/09
Umpires have been monitored for years and it's making a difference on accuracy: http://www.brianmmills.com/uploads/2/3/9/3/23936510/innovations_in_monitoring_and_evaluation_-_the_case_of_mlb_umpires.pdf http://www.brianmmills.com/uploads/2/3/9/3/23936510/externalities_from_umpire_performance__12-22-15_.pdf
BrewersTT
2/07
I would think that if there is preferential behavior, it is most likely strong in some individuals and weak or nonexistent in most. It doesn't seem like the kind of phenomenon that would be present across the board. Also, race matching is not the only form of racial preference. In some other professions, some black individuals have been found to prefer non-blacks, and more than likely there are other groups like this in certain circumstances; I'm just not that widely read in this.