Occasionally, I get asked—what’s going on with my attempts to make a defensive metric? I started off working on a Loess-based defensive metric, and then efforts just stalled. Because of the stall, it’s a fair question, and one that’s harder to answer than I think the questioners realize, because I’ve been slowly coming to some realizations about defensive metrics in general, and they aren’t encouraging.

The short version: I’m not really sure that we’ve gotten any further than where we were when Zone Rating and Defensive Average were proposed in the '80s. And if we have gotten further, I’m not sure how we would really tell. I’ve discussed some of this recently, first in a rather sprawling discussion at Tom Tango’s blog, and then in a conversation with Kevin Goldstein and Jason Parks on the BP podcast. But now’s a nice time to sit down and compose those thoughts.

Let’s start with first principles, I mean really basic stuff: What is sabermetrics? Bill James proposed a definition: “the search for objective knowledge about baseball.” And that really does say a lot, doesn't it? It defines sabermetrics as the search, not the result. It tells us we are looking for knowledge. And it tells us we want to be objective about it.

Now the question comes: Are we being objective about fielding analysis? In other words, do we know what we think we know?

The Trouble with Defense

For the most part, those who are inclined to the sabermetric world view have come to a consensus on the evaluation of offense. There are occasional arguments, but over what I would call "little things." There is more agreement than disagreement, by a long shot.

But now imagine for a second that managers no longer got to set the lineup order. Maybe the umpire throws dice to determine who the next batter is. Or he has a spinner, stolen from a game of Chutes & Ladders. And then imagine that nobody is recording how many times a hitter came to the plate, simply how many innings he played and how many hits, walks, etc. he got.

What would our analysis of offense look like then? Probably a lot like range factor, for example—you'd simply have to hope that over time, the number of plate appearances per inning played approached the average. And over time, you may even be right. (Of course, there's no guarantee that a single season is enough time for this to happen; actually, you'd expect it to not even out for a substantial number of players in any one season.)
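To get a feel for how slowly that evening-out happens, here is a minimal simulation sketch of the dice-throwing umpire. Everything numeric in it is an assumption made for illustration, not a measured value: nine hitters, 1,300 innings, and roughly 4.3 plate appearances per inning. The function name `simulate_pa_per_inning` is made up for this example.

```python
import random


def simulate_pa_per_inning(n_players=9, innings=1300, pa_per_inning=4.3, seed=0):
    """Hand every plate appearance to a randomly chosen hitter.

    Returns each hitter's plate appearances per inning played. All default
    values are illustrative assumptions, not real league figures.
    """
    rng = random.Random(seed)
    total_pa = round(innings * pa_per_inning)
    counts = [0] * n_players
    for _ in range(total_pa):
        # The "umpire's dice": any hitter is equally likely to bat next.
        counts[rng.randrange(n_players)] += 1
    return [c / innings for c in counts]


if __name__ == "__main__":
    rates = simulate_pa_per_inning()
    average = sum(rates) / len(rates)
    # Every hitter has the same true rate, yet the observed rates differ.
    print("lowest rate relative to average:", min(rates) / average)
    print("highest rate relative to average:", max(rates) / average)
```

Every simulated hitter has an identical expected workload, so any gap between the lowest and highest rates is pure chance. An analyst who only had innings played would have no way to separate that chance from real differences in performance.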

And that's where we've been for the longest time when it comes to measuring defense. The solution to this has been to use batted-ball data (both an indicator of how the ball was hit—ground ball, line drive, fly ball, popup—and where it was hit) to approximate chances.

What the Data Says

Now, I've spent a lot of time writing about the data that we're using. To be rather indulgent and quote myself:

A baseball fact is, simply put, something where the decision has a direct outcome on the game. Changing a strikeout into a walk has a very large effect, for instance—it provides both a baserunner for the offense and prolongs the inning.

The batted-ball data we have doesn't conform at all to the definition of baseball stats proposed above, so it's very difficult to say how well those measurements are describing the essential reality on the field of play. I have been studying differences in the data and it seems to shed very little light on the subject. What I can say with some certainty:

  • There are definite differences in how different data providers are defining the events that are occurring.
  • We have not yet established which of the data providers are correct, or more appropriately, we haven't established which are more correct.
  • To the extent that the data providers are erring, it seems that some of the errors are systemic—that is to say, they can be counted upon to repeat themselves in a similar fashion over a long period of time.
  • When multiple data providers are in agreement, we can only say that it is due to something in common between them—we cannot necessarily assume that the underlying reality is the only common element. There is a potential for shared bias, so that multiple data providers are wrong in similar fashions over time.

It's the third point that actually provides the biggest problem for us. If the errors were simple, isolated mistakes, then we could simply address them by adding more data. Over time, we would expect the errors to "wash out." But that is not how bias behaves—we cannot assume that bias will wash out, no matter what the sample size is, or how much we regress a sample to the mean.
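The difference between noise and bias can be shown with a small simulation. This is only a sketch with invented numbers: each measurement error is modeled as a fixed bias plus random noise, and we look at the average error as the sample grows. The function `mean_error` and its parameters are hypothetical and do not describe any data provider's actual process.

```python
import random


def mean_error(n, bias, noise_sd, seed=0):
    """Average error over n measurements, each equal to bias + random noise.

    bias and noise_sd are made-up illustrative quantities in arbitrary units.
    """
    rng = random.Random(seed)
    errors = [bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return sum(errors) / n


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        unbiased = mean_error(n, bias=0.0, noise_sd=1.0)
        biased = mean_error(n, bias=0.5, noise_sd=1.0)
        # With more data the unbiased average heads toward zero,
        # while the biased average heads toward the bias itself.
        print(n, round(unbiased, 4), round(biased, 4))
```

The random part of the error shrinks as the sample grows, but the average of the biased measurements settles on the bias rather than on zero. More data makes us more certain of the wrong answer.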

And so when we look at the repeatability of metrics, we run into a problem: we don't know how much of that repeatability is due to underlying skill, and how much is due to bias.

I've focused on the potential for bias in the batted-ball classifications, largely due to the availability of the data. But there are certainly other ways the data could potentially be biased. Commenter Guy at Tango's blog notes:

The most likely systematic bias in the data will be exacerbated, not remedied, by regression. That is the bias toward rating plays as “easier” when they become outs, or when fielders get to them quickly. Imagine having people rate the difficulty of 200 GBs into the 3B-SS hole from video. Now, imagine that the fielders are digitally removed, and the video stopped before it’s clear whether the ball reaches the OF, and the plays are scored again. Does anyone doubt that the balls that became hits will on average be rated as easier in the second scoring, while the outs become more difficult.

Or, as I put it on the podcast—imagine a ball hit between the shortstop and third baseman. Or imagine several, some where the shortstop gets to the ball, some where the third baseman does, some where it goes past them for a hit. What are your frames of reference, watching on video?

For example, watch this play by Ryan Braun from the All-Star Game. What do you see when the ball is caught, other than Braun, some grass, and maybe a little bit of the outfield fence? And that's a highlight-reel play, where you're getting multiple angles. What about a routine catch? Another clip from the All-Star Game: Marlon Byrd's throw to get David Ortiz at second. How much of a frame of reference are you getting to determine the location of the ball?

One can suppose a range bias for the location data, where a fielder's ability to get close to the ball (much less field it) influences the scoring of where the ball was on the field. Is there any evidence for this sort of a bias? Perhaps. What I did was take all players with at least 100 innings played in back-to-back seasons, and look at their plays made and balls in zone (BIZ) as defined by Baseball Information Solutions (from the published leaderboards). This is based upon the same BIS data that is fed into UZR or the Fielding Bible Plus/Minus stats. The data ran from 2003-09.

So I looked at BIZ and total plays (Plays plus OOZ, or "out of zone" plays, as defined on Fangraphs) per inning, and divided that by the positional average for that season. Then I looked at the correlation between years:

The autocorrelation for how many plays a player makes isn't really that much higher than the autocorrelation for chances, as defined by BIS. This is especially true for outfielders.
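For anyone who wants to run the same kind of check, the calculation can be sketched roughly as follows. This is not the exact code behind the numbers discussed here. It assumes one row per player-season, already filtered to the innings minimum, and the field names (`player`, `season`, `pos`, `innings`, `biz`, `plays`) are hypothetical. It also assumes the positional average is the innings-weighted rate across all listed players at that position in that season.

```python
from collections import defaultdict
from math import sqrt


def relative_rates(rows, stat):
    """Per-inning rate of `stat`, divided by the positional average that season."""
    totals = defaultdict(lambda: [0.0, 0.0])  # (season, pos) -> [stat, innings]
    for r in rows:
        bucket = totals[(r["season"], r["pos"])]
        bucket[0] += r[stat]
        bucket[1] += r["innings"]
    rel = {}
    for r in rows:
        stat_sum, inn_sum = totals[(r["season"], r["pos"])]
        positional_rate = stat_sum / inn_sum
        rel[(r["player"], r["season"])] = (r[stat] / r["innings"]) / positional_rate
    return rel


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def year_to_year(rel):
    """Correlate each player's value in season N with his value in season N+1."""
    pairs = [(v, rel[(p, s + 1)]) for (p, s), v in rel.items() if (p, s + 1) in rel]
    if len(pairs) < 2:
        raise ValueError("need at least two back-to-back player seasons")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Calling `year_to_year(relative_rates(rows, "biz"))` gives the year-to-year correlation for chances. Passing a column of total plays (plays plus OOZ) instead gives the figure for plays made.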

So we have questions about the data quality, as yet unresolved. And I wonder—what conclusions can we draw from the data when we don't know these things?

Method Man

Even using the same data, though, you can come up with drastically different results. Fangraphs publishes two defensive metrics, UZR and Defensive Runs Saved. These are both derived from the same BIS batted-ball data, and purport to measure the same thing (a fielder's value above average, compared to his peers at his position). The correlation between the two for 2009 for qualified starters, as reported by Mitchel Lichtman, UZR's creator, is .79.

Compare that to the correlation between the primary offensive rate stat on Fangraphs, wOBA, and a pretty crude bases per plate appearance measure—(TB+BB+HBP)/PA. For qualified starters in 2009, the correlation is .94.
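To show just how crude that measure is, here it is as code. This is a minimal sketch. The formula is the one given above, while the function name `bases_per_pa` and the sample stat line are invented for illustration.

```python
def bases_per_pa(tb, bb, hbp, pa):
    """Crude bases per plate appearance: (TB + BB + HBP) / PA."""
    if pa <= 0:
        raise ValueError("plate appearances must be positive")
    return (tb + bb + hbp) / pa


if __name__ == "__main__":
    # A made-up stat line, not any real player's season.
    print(bases_per_pa(tb=300, bb=70, hbp=5, pa=650))
```

There are no run weights here and no adjustments of any kind. A walk counts the same as a single, and each extra base on a hit counts the same as the first.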

So you have two methods that seem to disagree quite a bit, at least compared to offensive metrics. And what agreement there is seems to be driven largely by the underlying data. Using the plays and ball-in-zone data from BIS, I constructed a quick-and-dirty runs above/below average metric (similarly to what I did here). That rubric, with almost no adjustments, correlated with DRS at 0.76 and with UZR at 0.65. It seems that simply using the same batted-ball data (and the same set of underlying facts: so-and-so made so many plays and was on the field so often) will get you most of the way to that level of agreement, regardless of method.
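One way a quick-and-dirty metric of this kind could be built is sketched below. This is an illustration under stated assumptions, not necessarily the calculation behind the correlations above: expected plays are taken to be a fielder's balls in zone times the league conversion rate at his position, and the `runs_per_play` value is an assumed placeholder that the caller must supply. The function name is hypothetical.

```python
def runs_above_average(plays, biz, league_plays, league_biz, runs_per_play):
    """Plays made beyond what an average fielder would convert, scaled to runs.

    league_plays / league_biz is the average conversion rate at the position.
    runs_per_play is an assumed run value for turning a hit into an out.
    """
    if league_biz <= 0:
        raise ValueError("league balls in zone must be positive")
    league_rate = league_plays / league_biz
    expected_plays = biz * league_rate
    return (plays - expected_plays) * runs_per_play


if __name__ == "__main__":
    # Invented inputs for illustration only.
    print(runs_above_average(plays=210, biz=250,
                             league_plays=8000, league_biz=10000,
                             runs_per_play=0.8))
```

Nothing here adjusts for park, pitcher, batter handedness, or how hard the ball was hit. Any agreement between a tally like this and the published metrics has to come from the shared inputs rather than from the method.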

So our metrics don't do a very good job of agreeing. We don't know which methods are "better," only which ones we like more. And our data hasn't been validated against some objective standard.

To me, this opens up a simple question—how good are our defensive metrics? Are they useful? How useful?

And if we go back to the beginning, where we talked about what sabermetrics is about, it doesn’t seem to me to be good or valid sabermetrics to accept these metrics without some sort of evidence, some objective facts that show they measure what we think they measure. And I think the burden of proof is on those who are making claims based upon these metrics to provide that evidence.