Occasionally, I get asked—what’s going on with my attempts to make a defensive metric? I started off working on a Loess-based defensive metric, and then efforts just stalled. Because of the stall, it’s a fair question, and one that’s harder to answer than I think the questioners realize, because I’ve been slowly coming to some realizations about defensive metrics in general, and they aren’t encouraging.

The short version: I’m not really sure that we’ve gotten any further than where we were when Zone Rating and Defensive Average were proposed in the '80s. And if we have gotten further, I’m not sure how we would really tell. I’ve discussed some of this recently, first in a rather sprawling discussion at Tom Tango’s blog, and then in a conversation with Kevin Goldstein and Jason Parks on the BP podcast. But now seems like a good time to sit down and compose those thoughts.

Let’s start with first principles, really basic stuff: What is sabermetrics? Bill James proposed a definition: “the search for objective knowledge about baseball.” And that really does say a lot, doesn’t it? It defines sabermetrics as the search, not the result. It tells us we are looking for knowledge. And it tells us we want to be objective about it.

Now the question comes: Are we being objective about fielding analysis? In other words, do we know what we think we know?

The Trouble with Defense

For the most part, those who are inclined to the sabermetric world view have come to a consensus on the evaluation of offense. There are occasional arguments, but over what I would call "little things." There is more agreement than disagreement, by a long shot.

But now imagine for a second that managers no longer got to set the lineup order. Maybe the umpire throws dice to determine who the next batter is. Or he has a spinner, stolen from a game of Chutes & Ladders. And then imagine that nobody is recording how many times a hitter came to the plate, simply how many innings he played and how many hits, walks, etc. he got.

What would our analysis of offense look like then? Probably a lot like range factor, for example—you'd simply have to hope that over time, the number of plate appearances per inning played approached the average. And over time, you may even be right. (Of course, there's no guarantee that a single season is enough time for this to happen; actually, you'd expect it to not even out for a substantial number of players in any one season.)

And that's where we've been for the longest time when it comes to measuring defense. The solution to this has been to use batted-ball data (both an indicator of how the ball was hit—ground ball, line drive, fly ball, popup—and where it was hit) to approximate chances.

What the Data Says

Now, I've spent a lot of time writing about the data that we're using. To be rather indulgent and quote myself:

A baseball fact is, simply put, something where the decision has a direct outcome on the game. Changing a strikeout into a walk has a very large effect, for instance—it provides both a baserunner for the offense and prolongs the inning.

The batted-ball data we have doesn't conform at all to the definition of baseball stats proposed above, so it's very difficult to say how well those measurements are describing the essential reality on the field of play. I have been studying differences in the data and it seems to shed very little light on the subject. What I can say with some certainty:

  • There are definite differences in how different data providers are defining the events that are occurring.
  • We have not yet established which of the data providers are correct, or more appropriately, we haven't established which are more correct.
  • To the extent that the data providers are erring, it seems that some of the errors are systemic—that is to say, they can be counted upon to repeat themselves in a similar fashion over a long period of time.
  • When multiple data providers are in agreement, we can only say that it is due to something in common between them—we cannot necessarily assume that the underlying reality is the only common element. There is a potential for shared bias, so that multiple data providers are wrong in similar fashions over time.

It's the third point that actually provides the biggest problem for us. If the errors were simple, isolated mistakes, then we could simply address them by adding more data. Over time, we would expect the errors to "wash out." But that is not how bias behaves—we cannot assume that bias will wash out, no matter what the sample size is, or how much we regress a sample to the mean.
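To make the distinction concrete, here is a minimal simulation sketch. It assumes each recorded observation is off by a fixed bias plus zero-mean Gaussian noise; the function name `mean_error` and the specific values (a bias of 0.3, a noise standard deviation of 1.0) are illustrative assumptions, not figures from any data provider.

```python
import random


def mean_error(n, bias, noise_sd, seed=0):
    """Average measurement error over n observations.

    Each observation's error is a fixed bias plus zero-mean Gaussian noise.
    Both parameters are made-up values for illustration.
    """
    rng = random.Random(seed)
    total = sum(bias + rng.gauss(0.0, noise_sd) for _ in range(n))
    return total / n


# Compare random-only error with biased error as the sample grows.
for n in (100, 10_000, 100_000):
    random_only = mean_error(n, bias=0.0, noise_sd=1.0)
    biased = mean_error(n, bias=0.3, noise_sd=1.0)
    print(f"n={n:>7,}  random-only={random_only:+.3f}  biased={biased:+.3f}")
```

As the sample grows, the random-only error shrinks toward zero, while the biased error settles near the bias itself. More data only makes the wrong answer more precise.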

And so when we look at repeatability of metrics, we run into a problem that we don't know how much of that repeatability is due to underlying skill, and how much is due to bias.
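A small simulation can illustrate why repeatability alone cannot separate the two. This is a sketch under stated assumptions: every player's measured value is true skill plus a persistent measurement bias plus fresh noise each season, all Gaussian. The names `year_to_year_r`, `skill_sd`, `bias_sd`, and `noise_sd` are hypothetical and the values are made up.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def year_to_year_r(n_players, skill_sd, bias_sd, noise_sd, seed=0):
    """Simulated correlation between two seasons of a measured metric."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_players):
        skill = rng.gauss(0.0, skill_sd)  # true talent, same in both seasons
        bias = rng.gauss(0.0, bias_sd)    # persistent measurement bias (assumed per player)
        year1.append(skill + bias + rng.gauss(0.0, noise_sd))
        year2.append(skill + bias + rng.gauss(0.0, noise_sd))
    return pearson(year1, year2)


# All skill, no bias versus all bias, no skill (made-up parameters).
print("skill only:", round(year_to_year_r(5000, 1.0, 0.0, 1.0), 2))
print("bias only: ", round(year_to_year_r(5000, 0.0, 1.0, 1.0), 2))
```

Under these assumed parameters both scenarios have the same expected year-to-year correlation of 0.5 (persistent variance divided by total variance), even though one of them contains no fielding skill at all.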

I've focused on the potential for bias in the batted-ball classifications, largely due to the availability of the data. But there are certainly other ways the data could potentially be biased. Commenter Guy at Tango's blog notes:

The most likely systematic bias in the data will be exacerbated, not remedied, by regression. That is the bias toward rating plays as “easier” when they become outs, or when fielders get to them quickly. Imagine having people rate the difficulty of 200 GBs into the 3B-SS hole from video. Now, imagine that the fielders are digitally removed, and the video stopped before it’s clear whether the ball reaches the OF, and the plays are scored again. Does anyone doubt that the balls that became hits will on average be rated as easier in the second scoring, while the outs become more difficult?

Or, as I put it on the podcast—imagine a ball hit between the shortstop and third baseman. Or imagine several, some where the shortstop gets to the ball, some where the third baseman does, some where it goes past them for a hit. What are your frames of reference, watching on video?

For example, watch this play by Ryan Braun from the All-Star Game. What do you see when the ball is caught, other than Braun, some grass, and maybe a little bit of the outfield fence? And that's a highlight-reel play, where you're getting multiple angles. What about a routine catch? Another clip from the All-Star Game: Marlon Byrd's throw to get David Ortiz at second. How much of a frame of reference are you getting to determine the location of the ball?

One can suppose a range bias for the location data, where a fielder's ability to get close to the ball (much less field it) influences the scoring of where the ball was on the field. Is there any evidence for this sort of a bias? Perhaps. What I did was take all players with at least 100 innings played in back-to-back seasons, and look at their plays made and balls in zone (BIZ) as defined by Baseball Information Solutions (from the leaderboards). This is based upon the same BIS data that is fed into UZR or the Fielding Bible Plus/Minus stats. The data ran from 2003-09.

So I looked at BIZ and total plays (Plays plus OOZ, or "out of zone" plays, as defined on Fangraphs) per inning, and divided that by the positional average for that season. Then I looked at the correlation between years:

The autocorrelation for how many plays a player makes isn't really that much higher than the autocorrelation for chances, as defined by BIS. This is especially true for outfielders.
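The calculation described above can be sketched as follows. This is a minimal illustration, not the exact procedure used for the article: the row layout (`player`, `season`, `pos`, `innings`, plus a stat column), the function names, and the choice to compute the positional average over qualifying players only are all assumptions.

```python
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def relative_rates(rows, stat, min_innings=100):
    """Per-inning rate of `stat` divided by the positional average that season.

    rows: dicts with keys player, season, pos, innings, and the stat column.
    Assumes one row per player-season; averages use qualifying rows only.
    """
    qualified = [r for r in rows if r["innings"] >= min_innings]
    totals = defaultdict(lambda: [0.0, 0.0])  # (season, pos) -> [stat sum, innings sum]
    for r in qualified:
        bucket = totals[(r["season"], r["pos"])]
        bucket[0] += r[stat]
        bucket[1] += r["innings"]
    rates = {}
    for r in qualified:
        stat_sum, inn_sum = totals[(r["season"], r["pos"])]
        rates[(r["player"], r["season"])] = (r[stat] / r["innings"]) / (stat_sum / inn_sum)
    return rates


def year_to_year_r(rows, stat, min_innings=100):
    """Correlate each player's relative rate with his rate the following season."""
    rates = relative_rates(rows, stat, min_innings)
    pairs = [(v, rates[(p, s + 1)]) for (p, s), v in rates.items() if (p, s + 1) in rates]
    return pearson([a for a, _ in pairs], [b for _, b in pairs])
```

Calling `year_to_year_r(rows, "biz")` would give the autocorrelation for chances. For total plays, the caller would first add a column holding plays plus out-of-zone plays and pass that column name instead.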

So we have questions about the data quality, as yet unresolved. And I wonder—what conclusions can we draw from the data when we don't know these things?

Method Man

Even using the same data, though, you can come up with drastically different results. Fangraphs publishes two defensive metrics, UZR and Defensive Runs Saved. These are both derived from the same BIS batted-ball data, and purport to measure the same thing (a fielder's value above average, compared to his peers at his position). The correlation between the two for 2009 for qualified starters, as reported by Mitchel Lichtman, UZR's creator, is .79.

Compare that to the correlation between the primary offensive rate stat on Fangraphs, wOBA, and a pretty crude bases per plate appearance measure—(TB+BB+HBP)/PA. For qualified starters in 2009, the correlation is .94.
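For reference, the crude measure and the comparison can be written out in a few lines. This is only a sketch: the dictionary keys (`woba`, `tb`, `bb`, `hbp`, `pa`) and the function names are assumptions, and wOBA is taken as an already-computed input rather than rebuilt from its weights.

```python
def bases_per_pa(tb, bb, hbp, pa):
    """Crude offensive rate: (TB + BB + HBP) / PA."""
    return (tb + bb + hbp) / pa


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def correlation_with_woba(players):
    """Correlate published wOBA with bases per PA across a list of player lines.

    players: dicts with keys woba, tb, bb, hbp, pa (a hypothetical layout).
    """
    woba = [p["woba"] for p in players]
    crude = [bases_per_pa(p["tb"], p["bb"], p["hbp"], p["pa"]) for p in players]
    return pearson(woba, crude)
```

Both measures are rearrangements of the same recorded events over the same recorded chances, which is part of why they track each other so closely.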

So you have two methods that seem to disagree quite a bit, at least compared to offensive metrics. And that agreement seems to be driven largely by the underlying data—using the plays and ball-in-zone data from BIS, I constructed a quick-and-dirty runs above/below average metric (similarly to what I did here). That rubric, with almost no adjustments, correlated with DRS at 0.76 and with UZR at 0.65. It seems that simply using the same batted-ball data (and the same set of underlying facts—so-and-so made so many plays and was on the field so often) will get you most of the way to that level of agreement, regardless of method.
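A bare-bones version of such a metric might look like the sketch below. It is not the rubric used for the article: the function name, the league-rate baseline, and especially the `runs_per_play` value of 0.8 are placeholder assumptions chosen only to make the arithmetic concrete, and out-of-zone plays are ignored.

```python
def runs_above_average(plays, biz, league_plays, league_biz, runs_per_play=0.8):
    """Plays made beyond what an average fielder at the position would make
    on the same number of balls in zone, converted to runs.

    runs_per_play is a placeholder assumption, not a value from the article.
    """
    expected_plays = biz * (league_plays / league_biz)
    return (plays - expected_plays) * runs_per_play


# Made-up example: 330 plays on 400 balls in zone, against a league
# conversion rate of 8,000 plays on 10,000 balls in zone.
print(runs_above_average(330, 400, 8000, 10000))
```

In the made-up example the fielder converts 10 more balls than the league rate would predict, which the placeholder run value turns into about 8 runs. Everything in the result flows from the plays and ball-in-zone counts, which is the point: the inputs do most of the work.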

So our metrics don't do a very good job of agreeing. We don't know which methods are "better," only which ones we like more. And our data hasn't been validated against some objective standard.

To me, this opens up a simple question—how good are our defensive metrics? Are they useful? How useful?

And if we go back to the beginning, where we talked about what sabermetrics is about, it doesn’t seem to me to be good or valid sabermetrics to accept these metrics without some sort of evidence, some objective facts that show they measure what we think they measure. And I think the burden of proof is on those who are making claims based upon these metrics to provide that evidence.

Fantastic article. I've never been especially comfortable with defensive metrics so I appreciate the analysis of their overall quality.
Great article. The numbers you provide really put into perspective just how "bad" our defensive metrics are. Also, heard your piece on the BP Podcast, and I had never even thought about the angle of the official scorer coming into play with batted-ball types.

I've been thinking for a while now that perhaps we need to take where the ball lands on the field out of the equation altogether. Whatever happened to Hit/Fx? Is that something of the distant future, or are people working on that now? Maybe, if we classified batted balls as soon as they are hit (by speed off the bat, spin off the bat, vertical angle off the bat, and I guess a directional vector) we could rigorously define a "pop-up" or "line drive" and therefore not need to worry about biases in the input data?

Seeing as we haven't made much progress in 20 years with defensive metrics, I understand you can't pump out these types of articles every week. It seems, however, that defensive metrics are the cutting edge, so to speak, of sabermetrics, and so I would love to hear about any and all progress made. Even why things fail is good to hear.
This is very similar to my own thought: what if we could objectively classify every ball hit (rather than the subjective difference between a line drive, a "flyner," and a fly ball)? Would that help? Or, since you point out the lower correlation between methods using the same data, is this more of a methodology issue? It seems to me that once you get into this type of classification you can cross it with fielder positioning data to truly quantify range. We would know just how far a player had to go to get to a ball, and we'd know how quickly he had to get there. Maybe without the highly detailed batted-ball data we've come as far as our current abilities can take us?
I think if you have the data to do this, you don't NEED to - if you have an expressive set of data that clearly defines the flight path of the ball (even just hang time and some concept of distance) you don't need to dilute the power of that data by putting it in arbitrary buckets.

Of course, the problem is that we don't have that kind of expressive data, and so we make do with these categories. I think we need to be careful, especially if/when we get the data to do better, that we don't let the language we use to describe batted balls overly dictate the sorts of analysis we can do.
I don't think it's fair to say we haven't made much progress in 20 years with fielding metrics. We would seem to have made substantial progress. The methodology outlined by MGL in UZR, for instance, which built on previous work by others, is very thoughtful theory that's far beyond the days of simple range factor.

What we don't have is any quantification of which pieces of that work are most helpful and which don't add anything or perhaps are troublesome in their application.

Our fielding data collection is also much more precise than it was 20 years ago, but we have no idea when it's better, and how much better it is in those cases, or when it's worse.

If we can sort that out, I have hope that we will find that much of the work that's been done by MGL, Dial, Dewan, Tango, Sean Smith, and others can be made more useful, not less.
Building on what MHaywood points out, what we really need is Hit/Fx to tell us exactly where a ball is headed on the field, the trajectory, and how fast it gets there. This is what is needed to objectively describe a batted ball, and I don't see how it could be done any other way.

It would also be relatively easy (technologically speaking) to sew RFID tags or something similar into players' hats or uniforms for tracking purposes. Then we could definitively say that Player X covered 42.6 feet in the 2.2 seconds it took that line drive to reach him in LF to make a catch.

This is all easier said than done of course...
We could even sew those RFID patches into the street clothes of noted night owls to ensure they're home by curfew!

On a serious note, it was my understanding that HitFX would have fielder positioning data included.
Outstanding piece.
Very curious to see how those fielding cameras installed above stadiums are coming along. Are they still on track to be in every ballpark in the next few years, and if so can we even dream of getting useful data we can analyze from them in the next half-decade?
The discussion of the data quality reminds me of the saying they like to use in computer programming: "garbage in, garbage out."

I think evaluating infielders will be harder than outfielders because infield fielding is usually a team exercise. The ability of a 2B, SS, or 3B to turn ground balls into outs depends in part on the capability of the 1B. You need a way to determine how long it took for the fielder to get the ball to first base, regardless of the resulting out/safe call. If the fielder gets the ball to the "first base target area" within 2.9 s and the batter-runner needs 3.1 s to get to first base, then the fielder should be credited with an out, regardless of the (in)competency of the first baseman. You need a standardized "first base target area" so that fielders throwing to 5'8" short-armed first basemen can be compared to fielders throwing to long-armed 6'4" first basemen.
After hearing Colin on the podcast and reading this piece, I can just say to whomever hired Mr. Wyers, thank you.
If we ran correlations of wOBA using only samples of 300 PA, would the correlation be as high? That is, I wonder if defense will naturally be less correlated on a year-to-year basis due to smaller annual samples?

Perhaps it's not just that the measurement of defensive performance is less accurate. Perhaps it's also that defensive performance is intrinsically more variable.

One thing that I think would help immensely is separating aspects of fielding. We have the components of hitting, but only loosely with fielding. A fielding version of Dan Fox's base-running stat would be very helpful.

- Positioning: The ability to minimize the distance a player must travel to make a play on the ball
- Range: The ability to reach a ball hit a given distance from your initial position, in a given direction.
- Hands: The ability to field the ball presuming you've reached it.
- Arm: The ability to convert a fielded ball into an out and/or limit base-runner advancement.

Setting aside the run conversion question, simply knowing the ability of fielders to reach balls X distance away from them given Y time would be a huge help.

I would love to see sabermetrics get into the area of converting what scouts look for into quantifiable figures and trying to aggregate from there. Let's get the skills right and not just focus on performance.
Tango does much of this with the Fans Scouting Report that he gathers every year.

Most of the advanced fielding systems also separate out an arm rating for outfielders.

Unless FIELDf/x or similar data is ever released, it will be tough (though not necessarily impossible) to separate positioning, range, and hands in what we have from the batted ball data.
Funny you should ask; I've done work like that in the past.

What's interesting to ask at this point is - how much of the year to year correlation for a metric like UZR is because of the inherent fielding skill of the player, and how much is due to persistent qualities of either the data or the method that do not correspond to the player's fielding skill?

It's not enough to show the persistence until you can measure and account for that potential bias. You're absolutely right that there is more variability in fielding (just due to the number of chances, if nothing else), but that doesn't necessarily explain all of it.
I had already responded to Colin's point regarding the comparison of correlation for offense and defense. I will reproduce here:

What is the relevance here? You are taking known hits, known extra base hits, and known outs, and you have one system that arranges it one way and another that arranges it another way. The correlation would have to be in the high r=.9x. With fielding systems, you are taking known outs (in some systems), estimated outs (in another), and estimated hits (for all systems), and trying to find the correlation.


Otherwise, I share Colin's general skepticism of subjective data being treated as objective data. He fairly asks legitimate and nuanced questions regarding the advancement level of fielding stats.

However, I am bothered that a reader, after reading Colin's piece, would come to a conclusion like:

"Seeing as we haven't made much progress in 20 years with defensive metrics..."

The only fair conclusion to make is that we don't know how much progress we have made, and not that we haven't made much progress.

We've made "some". Is that a little? A lot? You can't say "not much". This is part of the nuance in Colin's piece that may be glossed over, if said reader is representative of a portion of the readership.
"The short version: I’m not really sure that we’ve gotten any further than where we were when Zone Rating and Defensive Average were proposed in the '80s. And if we have gotten further, I’m not sure how we would really tell."

I didn't come to that conclusion after reading Colin's piece-- seems to me it was in the piece to begin with.
I guess this is part of the nuance. Colin said this:

"The short version: I’m not really sure that we’ve gotten any further than where we were when Zone Rating and Defensive Average were proposed in the '80s. And if we have gotten further, I’m not sure how we would really tell."

So, he's asking two questions:
"Have we gotten any further?"
"How can we tell if we have?"

Rather than specifically making it questions, he's wondering. But, he's not concluding.

His actual conclusion was questions:
"To me, this opens up a simple question—how good are our defensive metrics? Are they useful? How useful?"

And that's where we are. We're in the investigation stage. And in the noted thread, I said this:

"Room for improvement and discussion, as long as you start with the [recorded, not estimated] data. "
Colin said he wasn't sure and wasn't sure how to tell. You seemed to wipe away the uncertainty and conclude that we have made no progress.

There is reason to believe we have made progress, certainly on the theory side. But until we can test, we won't know for sure. That's historically been the standard in sabermetrics.

However, people who have developed fielding metrics will take your criticism very differently if you say, "I don't see how to tell how accurate your metric is" versus "Your metric is worthless." The former is a statement of fact that can be contested and explained, though it may raise some emotions. The latter is a value judgment that comes across as very dismissive and not focused on the examination of facts.
I said we haven't made much progress. That's very different from concluding all defensive metrics are worthless, wiping away any uncertainty, and saying there has been "no" progress. And relative to the advancement of all the offensive metrics we have, and even pitching evaluations, I stand by my statement that we haven't made much progress on the defensive front.

In no way was I trying to bash anyone who has developed a defensive metric... heck, I would have no idea where to start developing one. But as Colin said, how good can the metrics we currently have be if we're still unclear on what a line drive is? My only reason for saying it was to recognize that developing sound, objective defensive stats is a much, much slower process than other aspects of baseball.

I am extremely interested in the investigative process, and so my comment was to ask Colin to keep us as up to date as possible with any advancements, yet I know they would probably be few and far between.
Well that's the idea, isn't it? For offensive metrics, the margin of error for offensive chances is zero, isn't it? (As you go further back and the data recording gets spottier this changes a bit - in 1908 I don't know as I'd say it's zero, but for relatively recent data I'd be comfortable assuming it is.)

So you have estimated fielding chances underpinning all of these defensive metrics. And so the question is, how accurate is that estimate? And to the extent that there is error in those estimates, is it random error or biased error? And that's a question we don't have to answer for offensive metrics. (At least, ones that bother to use plate appearances as their measure of chances.)

And you're right - this shouldn't be a surprising conclusion, if you know about the differences between the constructions of the two types of metrics. But it does neatly illustrate the differences in construction that you mention and give us an idea of what their impact is.

Then, once we recognize those issues, I think it's incumbent upon us as analysts to address those if we want to make use of fielding measurements. And frankly, I don't see a lot of that being done.


I think the null hypothesis has to be that we haven't made progress. That isn't an assertion that the null hypothesis is true - it's simply what we have to test against. The test hypothesis is that some progress has occurred.

We can have a conversation about how confident we are about the occurrence of progress, and the magnitude of progress. I just really hope that conversation involves evidence as well as opinion. And I don't think it's enough to show evidence of effort and say that constitutes evidence of improvement. We know there's been a lot of work sunk into recording batted ball data and then interpreting that data. What we don't know is if it actually adds to what we know about defense, and if it does add anything, what it adds.

"Not much" may not be the right answer - and you're right that we can't conclude that from what we know right now - but it's certainly a POSSIBLE answer, isn't it? I don't know that we can simply dismiss it from our list of possibilities, at least not without doing the work to show that it's incorrect. (And of course "not much" is terribly subjective - once we have done the work and shown how good each of these metrics are, then we can let everyone decide how they want to describe it - if it's too little, more than they expected, etc., but we can have that disagreement in the context of the data.)
I really don't disagree with anything in this post.


By the way, if we look at how much progress we've had over the last 20 years for offensive metrics, the answer will be: much less than fielding metrics. Palmer's Linear Weights and Run Expectancy matrix hold up fantastically well. And where the gains have been made (baserunning), they end up having limited to no impact on most players.
It should also be noted that even if a recorder accurately describes a batted ball, there are errors in the transcription. That is, some 1% or 2% (or higher in some parks) have clearly "impossible" data points. For example, the computer operator will select the wrong position (player) from the drop-down list, but mark the batted ball in the correct location. So, it might look like Jayson Werth ranged into LF to catch a ball.
Excellent article, hard to find anything to disagree with.

As someone who produces a set of defensive stats, and who has studied this for almost 30 years (and likely doesn't have any better answers than Colin) a few thoughts...

I started off way back when reading Bill James' introduction of Defense Efficiency Rating for teams, and thinking of how it might be applied to individuals. James stated that with proper positioning any ball is fieldable, unless it's hit too high off the wall. So let's take the team total balls in play used for DER and assign each and every one to a fielder. There is a matter of opinion on which fielder had the best chance in a split zone, but likely much less so than calling ground balls hits or errors.

If we know that there's uncertainty in the source data, perhaps we should back off in how precisely we want to measure things. Colin gives an example of grading balls hit in the 3B/SS hole as difficult or easy. I might ask, "Why bother?" Just let me know it was in the hole and which fielder presumably had the best chance, as described above. Peter Jensen called his Gameday-derived system "Big Zone Metric" because he didn't attempt to bin the batted balls into zones, only assign them to responsible fielders.

We can be less precise in our reporting of the results. We might be used to seeing runs saved on the season reported to the nearest tenth, but our observational biases and errors probably won't let us see inside 5 runs. I can add the fielding runs to batting and baserunning to calculate a WAR value, but in grading the fielders it might be best to follow the example of Tom Tippett's "Pursue the Pennant" which assigned six grades, Ex,Vg,Gd,Av,Fr,Pr and I'll add Bd to make it three above and three below average.
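A coarse grading like this is easy to sketch. The seven grade labels come from the comment above; the 10-run bucket width and the cutoff values are purely hypothetical, picked so that the "Av" bucket spans roughly the plus-or-minus 5 runs the comment suggests we cannot see inside.

```python
from bisect import bisect_right

# Grade labels from worst to best, three below and three above average.
GRADES = ["Bd", "Pr", "Fr", "Av", "Gd", "Vg", "Ex"]

# Hypothetical cutoffs in runs saved per season; these are assumptions.
CUTOFFS = [-25, -15, -5, 5, 15, 25]


def grade(runs_saved):
    """Map a runs-saved figure to one of seven coarse grades."""
    return GRADES[bisect_right(CUTOFFS, runs_saved)]


for runs in (-30, -7.3, 0, 7.3, 26):
    print(runs, grade(runs))
```

A value that lands exactly on a cutoff is assigned to the higher bucket, which is an arbitrary choice in this sketch.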
I totally agree with this, and I think the Hit/Fx will help. If we classify the batted balls directly off the bat, we don't need to know where the fielders are positioned--we can use the batted-ball data to say which fielder presumably "had the best chance".
So, a rating scale with average, and three ratings above and below average?

Can I recommend a scale from 20-80?
I have often said that I think defense will end up as a graph, not a number. The collection and accuracy is the biggest issue. If we can "only say" that a fielder got to balls here, here, and here, that's something. Of course, there's going to be outliers and we still lack the knowledge of how much the positioning matters. At some point, if we could get accurate positioning, we'll be able to say "Jeter can go 20 feet to his left, 24 to his right, etc" and end up with a "Keeler chart" - a diagram of the field that would specifically show where "they ain't". Positioning could be adjusted based not only on tendencies, but to shrink those "Keeler zones." We'd learn if one rangy player could cover for a less mobile one. It's an interesting time, but I think if we continue to try and do this statistically, we'll end up in much the same place.
That is, we'll end up in much the same place we are now. I don't believe defense will ever adequately be expressed as a number.
FINALLY! Someone has put a mathematical explanation as to why defensive metrics have recently been grossly overvalued.

You can watch many games and "see" players making good plays (or bad), but the statistical evaluation will rank the player far to the opposite end of the spectrum. Derek Jeter's defensive prowess has been vilified for several years, but, gentlemen, there is a reason he continues to be the shortstop for the best team in baseball. He is adequate at the position; in fact, he makes very few mistakes and often makes stellar plays. Is he the Gold Glover (best)? Maybe not, but he is good.
The problem with your assertion is that objective scouts who don't buy into his "aura" will tell you similar things to what UZR or P/M have been.

DJeter is sure handed with a good arm, but his first step quickness is poor, and he has a good deal of trouble ranging to his left. Past a diving Jeter, but boy what a pretty gritty dive it was.

I would say he's average at best, just based on scouting. Look at the more dynamic shortstops in the league and you'll find that Jeter is just not on their level as far as tracking balls down and being able to cannon throws across the diamond. His value comes from being adequate enough at short and being able to rake for his position (at least before this year).

If Jeter couldn't hit and didn't have such a reputation as a winner, would he be thought of as a good defensive shortstop? I bet not.
I wrote about Derek Jeter's defense last year.

I agree with kensai, Jeter is very sure handed (very few infield hits or errors) but lacks range. Each play has a run value, with hits to the outfield costing more than infield hits. One or two extra ground balls a week through the infield isn't something our brains catalog very well, but they can be counted on the scoresheet.
Isn't the ultimate goal to predict how many runs will be allowed by the group of players on the field at any given time, the pitcher and defense? Such a model is testable and in the end, the most important thing we should care about.
I took all fielders with over 200 innings in 2009 and 2008 and did an autocorrelation of their RZR (using FanGraphs BIZ) and came up with 0.62 correlation (only a 0.25 when using UZR/BIZ). Why am I showing such a higher relationship than your Plays / Inning divided by Average Pos Plays / Inning using the same data?
I was under the impression that UZR data came from STATS and that data used to compute DRS came from Baseball Info Solutions, and figured that was a big part of the reason the two did not correlate very well. I've scored for both companies, and they do use a very different methodology in their plotting.

If what you write is true and they do both come from the BIS data, then that is disturbing... but I wonder why STATS has their reporters plot location if no one is using that data.