Manufactured Runs: Derek Jeter And The Philosopher’s Stone

November 11, 2010

I suppose I ought to say something about the Gold Gloves, huh?

Now, I’m sure everyone knows what I think about Derek Jeter’s defense—it’s probably being overrated even by “advanced” defensive metrics (which aren’t exactly kind to him). So how did he win?

Yesterday, I was listening to ESPN’s Rob Neyer talking to Michael Kay on the radio about the Gold Glove awards. And what stuck out in my mind was when Kay asked Neyer, “If you need a guy to field a ball hit right at him, who do you pick? Is it Jeter, or Elvis Andrus, or Alexei Ramirez?”

Neyer responded that he’d pick Jeter. And I think that goes a long way towards explaining why Jeter won the Gold Glove—he’s good at the elements of fielding that are easy to notice. He fields the balls he gets to very well.

And I think there’s a growing awareness that fielding the balls one gets to only tells a little bit of the story on defense. Getting to balls is also a part of the story—probably a much bigger part of the story. And I think a lot of the frustration people feel over things like Jeter getting the Gold Glove comes from the fact that the Gold Glove voters (and a lot of other people) don’t seem to recognize the importance of that aspect of fielding. The Gold Glove ought to go to the people who are good at fielding more balls, shouldn’t it?

But how do you know who’s good at getting to more balls?

Last week, I talked about the idea that sabermetrics is (in part) the scientific study of baseball. I don’t think that all of sabermetrics is science, of course, but I think that’s a large part of it. And I think that more than anything else, a sabermetrics that is manifestly anti-scientific ceases to be sabermetrics at all.

One of the key requirements for scientific research is reproducibility—the idea that independent observers can come to the same conclusions as the original researchers. The reproducible parts of fielding analysis are things like DER—measures that tabulate plays made and balls in play, objective facts that anyone can derive using scorekeeping methods that Henry Chadwick came up with in the 1870s. Modern advancement in fielding analysis relies on estimates of expected outs from batted ball data.

And that data has proven not to be reproducible on two counts—different data providers cannot reproduce the same estimates of where a ball landed and how it got there, and different analysts using the same batted-ball data cannot reproduce the same estimates of expected outs.

Last week, Tom Tango published his team defensive ratings, based upon the Fan Scouting Report. He compared the results with team totals of two defensive metrics based on Baseball Info Solutions data:

I also included the totals from Dewan (DRS) and MGL (UZR). Dewan’s numbers were dropped by 10 runs per team because the average is +10 per team. I don’t know why.… The three systems agreed on the Nationals, and strongly disagree on the Indians. Correlation of Fans to Dewan is r=.60, and Fans to UZR is also r=.60. UZR to Dewan is r=.57. That’s pretty whacked out when you consider that UZR and DRS use the same data source.

I decided to check the correlation between all of the measures and DER:

Measure	Correlation with DER
Fans	0.63
DRS	0.61
UZR	0.66
All Three	0.74
BIS	0.71

(“All Three” is the average of the Fans, DRS and UZR; BIS is the average of DRS and UZR).

What the results suggest is that all of these measures are roughly as well-correlated with DER as they are with each other. And that makes sense, since all of them implicitly include all of the information included in DER in their measurement. Metrics based upon batted-ball data, at the team level, boil down to:

(DER – exDER) * BIP

Which is to say a team’s actual DER, minus what the expected DER would be given the estimated batted-ball distribution, multiplied by the number of balls in play. (This is a bit of a simplification: certain categories of plays are excluded, like popups, and in the case of UZR, plays by the catcher and pitcher are ignored. These measures also look at things like outfield throwing arms, double-play rates, etc. But fundamentally, turning batted balls into outs is the most important aspect of team defense.)
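
As a rough illustration of that team-level formula, here is a minimal Python sketch. The function name and the sample numbers are hypothetical, chosen only to show the arithmetic; they do not come from DRS, UZR, or any real team.

```python
def plays_above_expected(der: float, expected_der: float, balls_in_play: int) -> float:
    """Team-level plays made above expectation: (DER - exDER) * BIP."""
    return (der - expected_der) * balls_in_play


# Hypothetical team, illustrative numbers only (an assumption, not real data):
# it converts 70.0% of balls in play into outs where the batted-ball
# estimate expected 69.0%, over 4,400 balls in play.
example = plays_above_expected(der=0.700, expected_der=0.690, balls_in_play=4400)
print(round(example, 1))
```

With these made-up inputs the team comes out about 44 plays above expectation. Note that DER and BIP are counted facts, while the expected DER is the only estimated input, so any disagreement between two systems built this way has to come from that term (or from the extras outside the formula).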

What I want to emphasize here is that there is little room for defensive metrics to differentiate in terms of measuring DER. Because of their focus on measuring individual player contributions, certain categories of plays are excluded—but outside of that, the number of outs recorded and the number of balls in play is not in dispute. Those are facts; everything else is an estimate.

So let’s turn to partial correlation, which tells us the correlation between two variables after removing the influence of a third variable. In this case, we’ll look at the correlation between the various metrics after controlling for the influence of DER. The partial correlation between UZR and DRS is .28 (between the Fans and UZR it’s .32, and between the Fans and DRS it’s .35).
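
For readers who want to check the arithmetic, here is a minimal Python sketch using the standard first-order partial correlation formula. The article does not say exactly how the figures were computed, so treat this as an assumption about the method; the function and variable names are illustrative. The inputs are the rounded correlations quoted above.

```python
from math import sqrt


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y after controlling for z (first-order partial)."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Rounded correlations quoted in the article.
r_with_der = {"Fans": 0.63, "DRS": 0.61, "UZR": 0.66}
r_between = {("UZR", "DRS"): 0.57, ("Fans", "UZR"): 0.60, ("Fans", "DRS"): 0.60}

# Partial correlation of each pair of metrics, controlling for DER.
partials = {
    pair: partial_correlation(r, r_with_der[pair[0]], r_with_der[pair[1]])
    for pair, r in r_between.items()
}

for pair, value in partials.items():
    print(pair, round(value, 2))
```

Even from rounded inputs, this gives .28 for UZR and DRS, .32 for the Fans and UZR, and .35 for the Fans and DRS, matching the figures quoted above.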

So only a quarter of the agreement between DRS and UZR is caused by factors outside of plain old DER. That’s very little agreement on estimates of expected outs and elements of defense not included in DER. It’s worth reemphasizing that DRS and UZR both use the same source of batted-ball data, and different data providers can disagree significantly in terms of hit location and batted ball type.

And so if we want to know why people don’t trust what we’ve come to call “advanced” fielding analysis, it’s really because we haven’t given them a reason to trust it. And that’s because of a fundamental abandonment of what makes sabermetrics compelling—the search for objective truth. For a time we stopped doing science when it comes to fielding analysis, and instead have been doing baseball alchemy—trying to transmute lead into Gold Gloves.

The AL Gold Glove voters made a mistake in giving Jeter the award. But I think we make a much bigger mistake if we castigate the Gold Glove voters for their beliefs without a serious effort to give them something in which they ought to believe.

Colin Wyers

oneofthem

11/11

in the jeter case though, simple incredulity is enough, and well justified.

rskelley9

11/12

They should just give Teixeira and Cano Jeter's Gold Gloves.

xnumberoneson

11/11

I wish that someone with legal access to the videos could post footage of what we're talking about. There has to be a ton of visual evidence of the types of ground balls Andrus, Ramirez and Pennington get to that Jeter does not.

crperry13

11/11

Colin, I had a grandiose idea yesterday and was wondering if it's been tried. In industry, there are international standards that tell us how to define and document things so that we can all speak a common language. So far, there are no real "standards" applied to sabermetrics, but a lot of the concepts are the same from site to site. Many of the defensive metrics are based on the same ideas even if they come from differing data sources and have slightly different calculations. Same with WARP, WAR, and other metrics from around the internet that essentially tell us the same thing in different ways.

I think it's about time for the brains behind sabermetrics to have a summit to decide at LEAST a common numerical scale for each major metric type.

So, for the defensive metrics you named above, if they're all based on similar concepts you can say, "We'll make the average score for this type of metric be 100, where 'good' players score above and 'bad' players score below."

The benefit is that if all metrics of the same type are reported on the same scale, the scores can be 'averaged' to find a happy medium between all the different sabermetric outlets. Everybody can continue using their own methods, and the combined score actually gives you something closer to the truth than just one metric alone.

It's probably really ambitious and there may be reasons I'm not aware of why it could never work, but we would have a 'standard' set of advanced metrics that can be presented to the public and possibly give a solid data point for gold glove voters to use. These guys don't want to wade through 100 metrics looking for the one that makes sense to them when "errors" are so easy to understand. But if you give them just one that's on a sliding scale that makes sense....well, that would be nice.

I thought about embarking on something similar myself, but I'm probably not the right guy to do it. If the idea has merit, I encourage anybody to run with it.

FlagrantFan

11/11

Excellent post that nails why defensive metrics are so scorned these days. We really need a metric we can believe in.

Perhaps you can enlighten me on something. Unless I am mistaken, the field is broken up into zones and a defender responsible for that zone is given credits or debits depending on what he does with a ball hit in that zone. My question is this: If, say, a first baseman ranges far to the right to snag a grounder and it's in the second base "zone," does that count against the second baseman? Or say there is a shift on Ortiz and the shortstop gets a grounder while standing where the second baseman usually stands, does that hurt the second baseman too? I've always wondered about this. A secondary question results from the first: do the zones change per batter or remain static throughout all play?

StarkFist

11/12

I was actually thinking of suggesting something similar; specifically, an effort by sabermetricians from across the industry to work together at taking apart each of their defensive metrics, analyzing them unflinchingly and impartially, and determining which ones are more accurate than which others.

They each, presumably, take the same data and try to produce accurate information for us in response to the same question. But looking at the wide divergence in the conclusions reached, we can only conclude that someone is doing better than someone else. I think we need, once and for all, to know who.

It'll probably never happen, but it's a nice idea, innit?

garethbluejays2

11/11

Read some of the comments on MLB.com and what stood out was the vitriol from Yankees fans at any attempt to say Jeter is not a good fielder, even if the poster said that he was a wonderful player, sure-fire HOFer, etc. And what was even funnier was that they castigated the use of statistics in discussing fielding - and then used fielding percentage and errors to show how good Jeter actually is. To paraphrase Orwell, all statistics are equal but some statistics are more equal than others.

oneofthem

11/11

not all yankees fans are like that.

gmolyneux

11/11

Colin: Am I interpreting you correctly that the correlation of expected DER for DRS and UZR is just .28? That does indeed seem low. On the other hand, expected DER at the team level probably have pretty low variance: most pitching staffs collectively must allow a BIP distribution pretty close to average. So it's hard to know what the correlation "should" be.

It would be interesting to know what the UZR/DRS correlation is for expected outs at the player level. One would hope that it is much higher, as the variance at the player level will be much higher.

cwyers

11/11

"On the other hand, expected DER at the team level probably have pretty low variance: most pitching staffs collectively must allow a BIP distribution pretty close to average. So it's hard to know what the correlation 'should' be."

You and I agree on this point, but this doesn't seem to describe expected DER at the team level as measured by DRS or UZR - UZR especially seems to see a pretty high spread of expected DER by team.

Measuring expected outs for UZR is difficult, because of the adjustments UZR makes to plays made. Fangraphs has stopped listing expected outs on the player cards, I believe - huh, but as luck would have it, that seems to still be on the splits page. I have some of that data; it should be interesting to take a look at that.

beeker99

11/11

Is FIELD F/X and its data the answer, then? Or is it going to have the same problems with, e.g., ball location that its sibling systems do?

This is really eye-opening stuff.

cwyers

11/11

That depends on what you're talking about as Field F/X. The data we've been shown so far (some of us were given a sample of Field F/X data in advance of Sportvision's summit) tracks fielder positioning. That's interesting data, but I'm not convinced it really solves our problem. I think tracking the hit ball is much more important than tracking the fielder; without the fielder data we can't separate out various components of fielding ability, but that doesn't help us as much as I think a lot of people want to believe it will without the ability to measure total fielding ability.

And as of a few months ago, Field F/X was unable to track the batted ball. That's a problem space Sportvision is still tackling, and maybe they've solved it by now, but I don't think we can assume that.

And then you get into the question of whether or not the Field F/X data will be made available. It's hideously expensive data to collect, and who knows what the revenue model will look like. (For instance, Sportvision is collecting Hit F/X data, which gives the speed and angle of the ball off the bat - some teams are buying that, but it's not being released publicly.)

map2history

11/11

So, the crux of the argument here is that, because a Gold Glove was possibly awarded erroneously to Derek Jeter, sabermetricians need to standardize their statistical toolbag? Really? Perhaps Ramirez or Andrus, or someone else, was more deserving, and it is certainly conceivable that Jeter won the award for criteria other than those most indicative of fielding excellence. On the other hand, it's not exactly a hard sell, and furthermore, name an award in sports that doesn't incorporate either Zen philosophy or popularity. The more important question here is who uses UZR or DER (or any other 'metric) and for what purpose? And, furthermore, how satisfied are THEY with the statistical accuracy of these 'metrics? Sabermetrics will NEVER end the "worthiness" debate with regard to any award in sports no matter how many variables the baseball scientists identify and control for. On the other hand, it seems much more conceivable that the more valuable 'metric-of-the-future will be the one that admits and controls for the large degree of Yankee-Hate which seems even to have permeated "baseball science".

oneofthem

11/11

1. the argument is a benign call to improve defense metrics. this is at worst harmless.

2. given the lack of jeter bashing in the article, i'd say it's rather yankee friendly.

crperry13

11/11

Numbers aren't everything. But numbers don't lie or form opinions. Those numbers agree that Jeter is a very poor shortstop when measured by getting TO balls in play compared to an average shortstop. "Yankee-Hate" has nothing to do with this or the price of rice in China. The only thing on this page that comes across as if it has an agenda is your own post. My guess would be that if one took this article and replaced the name "Derek Jeter" with "Marco Scutaro", you would have no issue with it.

map2history

11/12

CRP13 - are you serious? "Numbers don't lie or form opinions"? So, you believe in BCS Math? Really? If so, your faith in statistics is .... admirable. My agenda aside, Wyers justifies his call for metric reform on "the frustration people feel over things like Jeter getting the Gold Glove.." You don't read an agenda in that? You're right, I'd have no issue with Marco Scutaro being criticized for receiving the award, neither would I have a problem with him being praised for it. In fact, if Jason Bartlett won a gold glove based solely on his UZR, with other ratings comparatively lower than other shortstops, I'd not have a problem with that either. But Wyers' call for reforming the impossible is rendered interesting because of the subject - Derek Jeter.

crperry13

11/12

The BCS has nothing to do with the price of rice either. If a system is programmed to do a specific thing like the BCS, UZR, FRAA, or whatever, it can only show what it is programmed to show. So I stand by my statement. Fielding metrics might be as flawed as the BCS, but they're just a number. They don't represent any agenda on their own.

The POINT is that in this case, the Gold Glove award has turned into a joke of a popularity contest. Even the BCS agrees that Jeter is an extremely below-average defensive shortstop. I suspect that if somebody else repeatedly won a GG award with similarly horrible numbers, a fuss would be made about that as well.

Remember when Rafael Palmeiro won the Gold Glove in 1999 despite playing only 28 games in the field for the season? Statisticians and reasonable people are annoyed that a fan-favorite who is demonstrably NOT GOOD keeps getting recognized with an award he does not deserve. Baseball fans WANT to believe in stats and awards. Getting this one right is as simple as looking at film and data.

irablum

11/12

Actually, the issue is not that Jeter got an undeserved award, but why he got one. The issue here is the method. Who votes on the award? Coaches? Who probably should vote on a defensive award? SCOUTS! Not coaches, not baseball writers, not fans. You poll 100 scouts out there, and all 100 will tell you that the best shortstop with the glove is not Jeter, because that's what's true. And scouts see many more players and many more games than coaches. Plus, it's their job to evaluate players. Coaches and managers get paid to teach their team to play, not to evaluate the competition.

studes

11/11

Isn't DRS Defensive Runs Saved? If so, that is very different from DER, isn't it? For example, I think it includes catcher ERA, runners picked off and other things not related to DER. Plus, it's denominated in runs, which would make it different from DER.

I don't know about UZR, but this feels like apples and oranges to me.

cwyers

11/11

Correlation doesn't care about units - it's normalized covariance, where you're looking at the magnitude of the common change measured in standard deviations, essentially.

As far as apples and oranges, think of it this way. If DER is an apple, DRS and UZR are fruit salads - they both include DER, plus expected DER, plus outfield arms and double plays. (I don't think catcher ERA is included in the team totals on Fangraphs.) What we're trying to find out is how much the two fruit salads have in common once you remove the apples.

crperry13

11/11

I'm not contributing anything towards the conversation with this post, but I wanted to say that the following sentence made my head explode:

"Correlation doesn't care about units - it's normalized covariance, where you're looking at the magnitude of the common change measured in standard deviation, essentially."

cwyers

11/11

I'm sorry, that wasn't a very good explanation on my part. Lemme try again.

Correlation in essence works off something called a z-score. In order to find a z-score you need two pieces of information: the average (more accurately, the mean) of the population, and the spread of the population, measured in standard deviation.

So let's say you want to find the z-score in terms of True Average. (I am making up some numbers here, purely for illustration purposes.) So let's say that the average TAv is .260, and the standard deviation is .040. So if someone has a .320, that's 1.5 standard deviations above the mean. That's our z-score.

To find a correlation, what you need to do is compute the z-score for every value in both data sets, and then for each entry you need to multiply the z-score from set A times the corresponding z-score from set B. Then, you sum all of those products, and divide by the number of pairs you have.

That's why it's unimportant to worry about the units of measure when talking about correlation; converting everything to z-scores allows us to compare two data sets with radically different units without a problem.
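
A minimal Python sketch of that recipe follows. The function names and the sample data are made up purely for illustration (assumptions, not real team numbers), and the population standard deviation is used so that dividing by the number of pairs, as described above, gives the usual correlation.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Convert each value to standard deviations above or below the mean."""
    mean = fmean(values)
    sd = pstdev(values)  # population SD, to match dividing by the number of pairs
    return [(v - mean) / sd for v in values]


def correlation(a, b):
    """Average of the products of paired z-scores."""
    za, zb = z_scores(a), z_scores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


# The True Average example from above: a .320 against a mean of .260, SD of .040.
single_z = (0.320 - 0.260) / 0.040

# Hypothetical data in very different units: DER-like rates and run totals.
der_like = [0.690, 0.700, 0.710, 0.685, 0.705]
runs_like = [-10.0, 5.0, 20.0, -25.0, 12.0]

r = correlation(der_like, runs_like)
# Changing the units of one series (say, runs to tenths of runs) leaves r alone.
r_rescaled = correlation(der_like, [x * 10 for x in runs_like])
print(round(single_z, 2), r == r_rescaled or abs(r - r_rescaled) < 1e-9)
```

The rescaled series gives the same correlation as the original, which is the point about units: once everything is a z-score, the original scale has dropped out.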

crperry13

11/11

That helped, I got it now. Thanks.

studes

11/12

I'm not worried about the units. I guess I don't understand your point. The two fruit salads don't have the same fruits.

You seem to want to pin your findings on bad data, or inconsistent misinterpretation of the data, but I think the systems are too different to conclusively say that.

cwyers

11/12

I don't think that's what I'm saying at all. I do mention the possibility of bad data as a separate point, but UZR and DRS use the exact same data - data quality has absolutely nothing to do with the differences between the two.

When you say:

"You seem to want to pin your findings on bad data, or inconsistent misinterpretation of the data, but I think the systems are too different to conclusively say that."

Let's key in on this. Why are the systems different?

We know that for the most part, they are trying to measure the same thing. I mean, in principle, they are trying to measure the exact same thing - how a team's defense prevents or allows runs. You keep bringing up how they differ in terms of things like park factors and run values - their methodologies are different.

And, well - isn't that the problem?

If you have two people using the same data (itself of undetermined validity) to measure the same thing and they come up with different answers, well, by definition one of them at minimum is incorrect, right? And the question we want to resolve is which, if any, of the multitude of systems is correct. And there is absolutely no objective standard that has been offered to let people determine which of these systems is using the best data or the best methods.

padresprof

11/15

I have some sympathy for the arguments in this article. I too have wondered how models can use data from batting performance to make reasonable predictions for future performance in offense, but a player’s fielding performance appears to change radically on an annual basis. However, what caught my attention is the possible misunderstanding of the role of correlation in science and what distinguishes how science is conducted from non-science.

One major demarcation between science and non-science is the ability to use a model to make a prediction, i.e., that models are falsifiable. Ironically, you appear to have made a case that we should be able to use the fielding values assigned to players to predict a team’s defensive efficiency ratio (DER). Consequently, it appears the models are testable and are modestly successful in predicting outcomes. Hence the field appears to have science-like elements. Now let me address the question of whether differences in the internal workings of the models mean that fielding analysis is not following the Western scientific process.

The situation in fielding analysis is analogous to the situation in the early 20th century in atomic physics. At that time, there were many models for the hydrogen atom, though two eventually dominated, the Bohr and the Schrodinger models. Both models use the same set of input data (mass of an electron, mass of a proton, electrostatic force between the two constituents) and are able to predict values for the wavelengths of electromagnetic radiation emitted from high temperature hydrogen gas (with far superior precision than the fielding models). However, the internal workings of the models vary significantly, and there is very little correlation between the two models in characterizing the behavior of the electron within the atom. This lack of correlation does not mean that either is bad science, even though both rely upon non-derivable postulates. It is simply part of the process we call science. (And it continues to happen in all scientific fields, continuously.) So why is one of those models used repeatedly in analyzing problems at the sub-atomic level while the other is mentioned only for historical purposes?

The reason the Schrodinger model won out was that it could make more novel predictions and could be applied to a wider set of testable phenomena. The Bohr model has many features which are testable, but it generates far fewer testable predictions.

So by this analogy, to continue the scientific development of fielding analysis, other possible outcomes need to be predicted. As you correctly point out in your comments, though it was not as clear in the article, determining predictable, defensively influenced "objective" outcomes is actually the problem. I use quotes for objective, since all data is subjective.

rogerb

11/11

fantastic article

drewsylvania

11/12

This nicely encapsulates my feelings about both the Gold Gloves and the current defensive metrics.

gmolyneux

11/12

Studes: you did raise the issue of units above: "Plus, it's denominated in runs, which would make it different from DER." So I think Colin's reply was on point.

But different fruits is a separate issue. If both DRS and UZR include DPs and OF arms, then it seems like you're comparing salads with the same fruits. It's true that to the extent there is a difference, we won't know without digging deeper how much to attribute to each of the 3 elements:
1) expected DER
2) expected DPs (actual DPs must be same)
3) expected base advancement (outs on bases must be same)

But, I would make 2 points: 1) it seems useful to see how well the two systems' total expectations or modeled reality compare, given identical data, and 2) realistically, expected DER likely accounts for the lion's share of any overall difference there is.

And "inconsistent misinterpretation?" Sounds like Dubya..... :>)

studes

11/12

I did mention runs, forgot that. But it is an issue to the extent that the two systems interpret runs differently. DRS includes "home runs saved," for instance, which Dewan gives a lot of weight to (1.4 runs). I don't believe UZR has that.

I don't believe that the differences in the two systems are as straightforward as you and Colin imply. Things like pickoffs, bunts, "home-run saving catches" etc. are all handled differently in the two systems, and those are only the things I know about. Park adjustments, yadda yadda.

This sentence...

"2) realistically, expected DER likely accounts for the lion's share of any overall difference there is."

...is a hypothesis, as far as I know, and I'm skeptical of it.

cwyers

11/12

Studes, I'm not sure I get your objection here. We don't have to speculate on how much the two systems have in common - the reported correlation tells us, for this set of data, what the level of agreement is. What the partial correlation lets us determine is "of the areas of agreement between UZR and DRS, how much is explained by DER?"

studes

11/12

Well, if I'm understanding your point (always a questionable issue due to my lack of powers of comprehension), you're saying that the "disagreement" between UZR and DRS--that left by the partial correlation--is a bad thing. I don't know that that's true, because the two systems have different components and also measure them differently.

If your point is that those other components and measurements make the systems more different than people might realize on the surface, then you've raised a good point. I think a lot of folks are already aware of how different the systems are, but it's good to reinforce the point.

But perhaps the bigger reason for my reaction is that this sentence raises my hackles...

"For a time we stopped doing science when it comes to fielding analysis, and instead have been doing baseball alchemy..."

...because it's condescending and I think it diminishes the terrific work that MGL and BIS continue to put into their systems, to better understand the dynamics of fielding and improve the work.

mikefast

11/12

Dave, whatever the quality of the work of MGL and Dewan and others and however dedicated and commendable their efforts, it's not science if it's not evaluated objectively in a way that other people can reproduce. That should be the standard in getting new approaches validated, and for whatever reason large portions of the baseball analysis community abandoned that standard in the case of fielding metrics. We have to get back to that standard if we want to make real progress in evaluating fielding.

studes

11/12

Okay, I understand that perspective. Thanks.

gmolyneux

11/12

Yes, it is just a hypothesis. But really it doesn't matter: what we want to compare is the expected outs on BIP each "engine" generates. I know Fangraphs posts UZR Range runs, so that's easy. Is "rPM" the comparable for DRS? If those are actually quite close, then we can worry about HRs, pickoffs, arms, etc. (to the extent we care about those). But my guess is they aren't that close. And of course, that still leaves a lot of questions about bias in the underlying data.....
