I suppose I ought to say something about the Gold Gloves, huh?
Now, I’m sure everyone knows what I think about Derek Jeter’s defense—it’s probably being overrated even by “advanced” defensive metrics (which aren’t exactly kind to him). So how did he win?
Yesterday, I was listening to ESPN’s Rob Neyer talking to Michael Kay on the radio about the Gold Glove awards. And what stuck out in my mind was when Kay asked Neyer, “If you need a guy to field a ball hit right at him, who do you pick? Is it Jeter, or Elvis Andrus, or Alexei Ramirez?”
Neyer responded that he’d pick Jeter. And I think that goes a long way towards explaining why Jeter won the Gold Glove—he’s good at the elements of fielding that are easy to notice. He fields the balls he gets to very well.
And I think there’s a growing awareness that fielding the balls one gets to only tells a little bit of the story on defense. Getting to balls is also a part of the story—probably a much bigger part of the story. And I think a lot of the frustration people feel over things like Jeter getting the Gold Glove comes from the fact that the Gold Glove voters (and a lot of other people) don’t seem to recognize the importance of that aspect of fielding. The Gold Glove ought to go to the people who are good at fielding more balls, shouldn’t it?
But how do you know who’s good at getting to more balls?
Last week, I talked about the idea that sabermetrics is (in part) the scientific study of baseball. I don’t think that all of sabermetrics is science, of course, but I think that’s a large part of it. And I think that more than anything else, a sabermetrics that is manifestly anti-scientific ceases to be sabermetrics at all.
One of the key requirements for scientific research is reproducibility—the idea that independent observers can come to the same conclusions as the original researchers. The reproducible parts of fielding analysis are things like DER—measures that tabulate plays made and balls in play, objective facts that anyone can derive using scorekeeping methods that Henry Chadwick came up with in the 1870s. Modern advancement in fielding analysis relies on estimates of expected outs from batted ball data.
And that data has proven not to be reproducible on two counts—different data providers cannot reproduce the same estimates of where a ball landed and how it got there, and different analysts using the same batted-ball data cannot reproduce the same estimates of expected outs.
Last week, Tom Tango published his team defensive ratings, based upon the Fan Scouting Report. He compared the results with team totals of two defensive metrics based on Baseball Info Solutions data:
I also included the totals from Dewan (DRS) and MGL (UZR). Dewan’s numbers were dropped by 10 runs per team because the average is +10 per team. I don’t know why.… The three systems agreed on the Nationals, and strongly disagree on the Indians. Correlation of Fans to Dewan is r=.60, and Fans to UZR is also r=.60. UZR to Dewan is r=.57. That’s pretty whacked out when you consider that UZR and DRS use the same data source.
I decided to check the correlation between all of the measures and DER:
Metric      Correl
Fans        0.63
DRS         0.61
UZR         0.66
All Three   0.74
BIS         0.71
(“All Three” is the average of the Fans, DRS and UZR; BIS is the average of DRS and UZR).
What the results suggest is that all of these measures are roughly as well-correlated with DER as they are with each other. And that makes sense, since all of them implicitly include all of the information included in DER in their measurement. Metrics based upon batted-ball data, at the team level, boil down to:
(DER – exDER) * BIP
Which is to say a team’s actual DER, minus what the expected DER would be given the estimated batted-ball distribution, multiplied by the number of balls in play (This is a bit of a simplification—certain categories of plays are excluded, like popups, and in the case of UZR, plays by the catcher and pitcher are ignored. And these measures are also looking at things like outfield throwing arms, double-play rates, etc. But fundamentally, turning batted balls into outs is the most important aspect of team defense).
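As a minimal sketch of that identity, here's what the core calculation looks like in code. The function name and the team figures are hypothetical, purely for illustration:

```python
def extra_outs_above_expected(der, expected_der, balls_in_play):
    """Extra outs converted relative to the expected-DER baseline:
    (actual DER - expected DER) * balls in play."""
    return (der - expected_der) * balls_in_play

# Hypothetical team: .700 actual DER vs .690 expected DER on 4,000 balls in play
print(round(extra_outs_above_expected(0.700, 0.690, 4000)))  # → 40
```

Everything interesting in such a metric, in other words, lives in the estimate of expected DER; the rest is arithmetic on undisputed counts.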
What I want to emphasize here is that there is little room for defensive metrics to differentiate in terms of measuring DER. Because of their focus on measuring individual player contributions, certain categories of plays are excluded—but outside of that, the number of outs recorded and the number of balls in play is not in dispute. Those are facts; everything else is an estimate.
So let’s turn to partial correlation, which tells us the correlation between two variables after removing the influence of a third variable. In this case, we’ll look at the correlation between the various metrics after controlling for the influence of DER. The partial correlation between UZR and DRS is .28 (Between the Fans and UZR it’s .32, and between the Fans and DRS it’s .35).
So only a quarter of the agreement between DRS and UZR is caused by factors outside of plain old DER. That’s very little agreement on estimates of expected outs and elements of defense not included in DER. It’s worth reemphasizing that DRS and UZR both use the same source of batted-ball data, and different data providers can disagree significantly in terms of hit location and batted ball type.
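For readers who want to check the arithmetic, the first-order partial correlation can be computed directly from the three pairwise correlations. This sketch plugs in the rounded r values reported above (UZR–DRS = .57, UZR–DER = .66, DRS–DER = .61):

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# UZR vs. DRS, controlling for DER
print(round(partial_corr(0.57, 0.66, 0.61), 2))  # → 0.28
```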
And so if we want to know why people don’t trust what we’ve come to call “advanced” fielding analysis, it’s really because we haven’t given them a reason to trust it. And that’s because of a fundamental abandonment of what makes sabermetrics compelling—the search for objective truth. For a time we stopped doing science when it comes to fielding analysis, and instead have been doing baseball alchemy—trying to transmute lead into Gold Gloves.
The AL Gold Glove voters made a mistake in giving Jeter the award. But I think we make a much bigger mistake if we castigate the Gold Glove voters for their beliefs without a serious effort to give them something in which they ought to believe.
I think it's about time for the brains behind sabermetrics to have a summit to decide at LEAST a common numerical scale for each major metric type.
So, for the defensive metrics you named above, if they're all based on similar concepts you can say, "We'll make the average score for this type of metric be 100, where 'good' players score above and 'bad' players score below."
The benefit is if all metrics of a same type are reported on a same scale, the scores can be 'averaged' to find a happy medium between all the different sabermetric outlets. Everybody can continue using their own methods, and the combined score actually gives you something closer to the truth than just one metric alone.
It's probably really ambitious and there may be reasons I'm not aware of why it could never work, but we would have a 'standard' set of advanced metrics that can be presented to the public and possibly give a solid data point for gold glove voters to use. These guys don't want to wade through 100 metrics looking for the one that makes sense to them when "errors" are so easy to understand. But if you give them just one that's on a sliding scale that makes sense....well, that would be nice.
I thought about embarking on something similar myself, but I'm probably not the right guy to do it. If the idea has merit, I encourage anybody to run with it.
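The rescaling the commenter proposes is essentially a z-score transformation shifted to a friendlier scale. A minimal sketch, with hypothetical runs-saved figures and an arbitrary choice of 100-centered scale with a spread of 15 points per standard deviation:

```python
from statistics import mean, pstdev

def to_common_scale(values, center=100, spread=15):
    """Rescale a metric so that average is `center` and one standard
    deviation is worth `spread` points (hypothetical convention)."""
    m, s = mean(values), pstdev(values)
    return [center + spread * (v - m) / s for v in values]

# Hypothetical runs-saved figures for five shortstops
print([round(x) for x in to_common_scale([-10, -3, 0, 4, 9])])  # → [77, 93, 100, 109, 121]
```

Once every system reports on the same scale, averaging across systems is a one-liner, which is the "happy medium" the commenter describes.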
Perhaps you can enlighten me on something. Unless I am mistaken, the field is broken up into zones and a defender responsible for that zone is given credits or debits depending on what he does with a ball hit in that zone. My question is this: If, say, a first baseman ranges far to the right to snag a grounder and it's in the second base "zone," does that count against the second baseman? Or say there is a shift on Ortiz and the shortstop gets a grounder while standing where the second baseman usually stands, does that hurt the second baseman too? I've always wondered about this. A secondary question results from the first: do the zones change per batter or remain static throughout all play?
They each, presumably, take the same data and try to produce accurate information for us in response to the same question. But looking at the wide divergence in the conclusions reached, we can only conclude that someone is doing better than someone else. I think we need, once and for all, to know who.
It'll probably never happen, but it's a nice idea, innit?
It would be interesting to know what the UZR/DRS correlation is for expected outs at the player level. One would hope that it is much higher, as the variance at the player level will be much higher.
You and I agree on this point, but this doesn't seem to describe expected DER at the team level as measured by DRS or UZR - UZR especially seems to see a pretty high spread of expected DER by team.
Measuring expected outs for UZR is difficult, because of the adjustments UZR makes to plays made. Fangraphs has stopped listing expected outs on the player cards, I believe - huh, but as luck would have it, that seems to still be on the splits page. I have some of that data; it should be interesting to take a look at that.
This is really eye-opening stuff.
And as of a few months ago, Field F/X was unable to track the batted ball. That's a problem space Sportvision is still tackling, and maybe they've solved it by now, but I don't think we can assume that.
And then you get into the question of whether or not the Field F/X data will be made available. It's hideously expensive data to collect, and who knows what the revenue model will look like. (For instance, Sportvision is collecting Hit F/X data, which gives the speed and angle of the ball off the bat - some teams are buying that, but it's not being released publicly.)
2. Given the lack of Jeter bashing in the article, I'd say it's rather Yankee friendly.
The POINT is that in this case, the Gold Glove award has turned into a joke of a popularity contest. Even BIS agrees that Jeter is an extremely below-average defensive shortstop. I suspect that if somebody else repeatedly won a GG award with similarly horrible numbers, a fuss would be made about that as well.
Remember when Rafael Palmeiro won the Gold Glove in 1999 despite playing only 28 games in the field for the season? Statisticians and reasonable people are annoyed that a fan-favorite who is demonstrably NOT GOOD keeps getting recognized with an award he does not deserve. Baseball fans WANT to believe in stats and awards. Getting this one right is as simple as looking at film and data.
I don't know about UZR, but this feels like apples and oranges to me.
As far as apples and oranges, think of it this way. If DER is an apple, DRS and UZR are fruit salads - they both include DER, plus expected DER, plus outfield arms and double plays. (I don't think catcher ERA is included in the team totals on Fangraphs.) What we're trying to find out is how much the two fruit salads have in common once you remove the apples.
"Correlation doesn't care about units - it's normalized covariance, where you're looking at the magnitude of the common change measured in standard deviations, essentially."
Correlation in essence works off something called a z-score. In order to find a z-score you need two pieces of information: the average (more accurately, the mean) of the population, and the spread of the population, measured in standard deviation.
So let's say you want to find the z-score in terms of True Average. (I am making up some numbers here, purely for illustration purposes.) So let's say that the average TAv is .260, and the standard deviation is .040. So if someone has a .320, that's 1.5 standard deviations above the mean. That's our z-score.
To find a correlation, what you need to do is compute the z-score for every value in both data sets, and then for each entry you need to multiply the z-score from set A times the corresponding z-score from set B. Then, you sum all of those products, and divide by the number of pairs you have.
That's why it's unimportant to worry about the units of measure when talking about correlation; converting everything to z-scores allows us to compare two data sets with radically different units without a problem.
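The procedure described above can be sketched in a few lines. This is an illustrative implementation (using the population standard deviation and dividing by the number of pairs, exactly as described); the sample data are made up:

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation computed the z-score way: convert each value
    to a z-score, multiply corresponding pairs, average the products."""
    zx = [(x - mean(xs)) / pstdev(xs) for x in xs]
    zy = [(y - mean(ys)) / pstdev(ys) for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# A perfectly linear relationship yields r = 1.0, regardless of units
print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 2))  # → 1.0
```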
You seem to want to pin your findings on bad data, or inconsistent misinterpretation of the data, but I think the systems are too different to conclusively say that.
When you say:
"You seem to want to pin your findings on bad data, or inconsistent misinterpretation of the data, but I think the systems are too different to conclusively say that."
Let's key in on this. Why are the systems different?
We know that for the most part, they are trying to measure the same thing. I mean, in principle, they are trying to measure the exact same thing - how a team's defense prevents or allows runs. You keep bringing up how they differ in terms of things like park factors and run values - their methodologies are different.
And, well - isn't that the problem?
If you have two people using the same data (itself of undetermined validity) to measure the same thing and they come up with different answers, well, by definition one of them at minimum is incorrect, right? And the question we want to resolve is which, if any, of the multitude of systems is correct. And there is absolutely no objective standard that has been offered to let people determine which of these systems is using the best data or the best methods.
One major demarcation between science and non-science is the ability to use a model to make a prediction, i.e., that models are falsifiable. Ironically, you appear to have made a case that we should be able to use the fielding values assigned to players to predict a team's defensive efficiency ratio (DER). Consequently, it appears the models are testable and are modestly successful in predicting outcomes. Hence the field appears to have science-like elements. Now let me address the question of whether differences in the internal workings of the models mean that fielding analysis is not following the western scientific process.
The situation in fielding analysis is analogous to the situation in early 20th-century atomic physics. At that time, there were many models for the hydrogen atom, though two eventually dominated: the Bohr and the Schrodinger models. Both models use the same set of input data (mass of an electron, mass of a proton, electrostatic force between the two constituents) and are able to predict values for the wavelengths of electromagnetic radiation emitted from high-temperature hydrogen gas (with far superior precision than the fielding models). However, the internal workings of the models vary significantly, and there is very little correlation between the two models in characterizing the behavior of the electron within the atom. This lack of correlation does not mean that either is bad science, even though both rely upon non-derivable postulates. It is simply part of the process we call science. (And it continues to happen in all scientific fields, continuously.) So why is one of those models used repeatedly in analyzing problems at the sub-atomic level while the other is mentioned only for historical purposes?
The reason the Schrodinger model won out was that it could make more novel predictions and could be applied to a wider set of testable phenomena. The Bohr model has many testable features, but it generates far fewer testable predictions.
So by this analogy, to continue the scientific development of fielding analysis, other possible outcomes need to be predicted. As you correctly point out in your comments (though it was not as clear in the article), determining predictable, defense-influenced "objective" outcomes is actually the problem. I use quotes for objective, since all data is subjective.
But different fruits is a separate issue. If both DRS and UZR include DPs and OF arms, then it seems like you're comparing salads with the same fruits. It's true that to the extent there is a difference, we won't know without digging deeper how much to attribute to each of the 3 elements:
1) expected DER
2) expected DPs (actual DPs must be same)
3) expected base advancement (outs on bases must be same)
But, I would make 2 points: 1) it seems useful to see how well the two system's total expectations or modeled reality compare, given identical data, and 2) realistically, expected DER likely accounts for the lion's share of any overall difference there is.
And "inconsistent misinterpretation?" Sounds like Dubya..... :>)
I don't believe that the differences in the two systems are as straightforward as you and Colin imply. Things like pickoffs, bunts, "home-run saving catches" etc. are all handled differently in the two systems, and those are only the things I know about. Park adjustments, yadda yadda.
This sentence...
"2) realistically, expected DER likely accounts for the lion's share of any overall difference there is."
...is a hypothesis, as far as I know, and I'm skeptical of it.
If your point is that those other components and measurements make the systems more different than people might realize on the surface, then you've raised a good point. I think a lot of folks are already aware of how different the systems are, but it's good to reinforce the point.
But perhaps the bigger reason for my reaction is that this sentence raises my hackles...
"For a time we stopped doing science when it comes to fielding analysis, and instead have been doing baseball alchemy..."
...because it's condescending and I think it diminishes the terrific work that MGL and BIS continue to put into their systems, to better understand the dynamics of fielding and improve the work.