Remember a few weeks ago when Alex Gordon was leading the American League in WAR? No one questions that Gordon is having a(nother) really good season and should rightly get some down-ballot MVP votes, but the best player in the American League? People quickly noticed that a good chunk of Gordon’s WAR came from his defensive ratings, where, at the time, he was picking up roughly two wins worth of value in left field. Gordon’s regarded as a good left fielder, but “good left fielder” is also the “great personality” of fielding aficionados.

It’s been known for a long time that a single year’s defensive ratings, particularly for outfielders, aren’t a reliable indicator of a player’s talent level. They might accurately represent what he did in the past year, to the extent that the data sources we have available can do that, but they don’t tell us what he really is. Commonly, I hear “you need three years’ worth of fielding data to get a reliable sample.” That’s fine if we want to know how good a defender someone really is, but when we are trying to figure out questions like “Who had the best 2014 season among these 15 randomly selected players?” it means that a good chunk of that value is based on a stat that could just be a mirage.

I often hear people at this point appeal to technology. Statcast, the much-discussed stalking mechanism, erm … tracking technology that will tell us exactly where everyone was at all times on a baseball field will save us. Earlier in the year, MLBAM put up this lovely teaser video of Jason Heyward making a fantastic catch to end a game from July 2013. Wanna watch?

Currently, even the best “advanced” defensive metrics are based on data sources that have a lot of holes. Stringers manually input where a ball landed. They make judgment calls on whether the ball was a line drive or a fly ball. There’s very little data on how long a ball was in the air or how fast it was hit. No one tracks where a fielder was and how far he had to run. The metrics do the best they can with what data are out there (and it’s a heck of a lot better than fielding percentage), but what if we had better data? Statcast is that data set.

Or is it? As we can see from the Heyward video, we can expect to have information on fly ball hang time, fielder positioning (and distance from the eventual landing place), fielder reaction time and foot speed, and the length of the route that he actually took to get to the ball and how efficient that route was. Surely, more refined data will make for a better fielding metric!

Warning! Gory Mathematical Details Ahead!

Let’s talk about why defensive metrics, particularly outfield ones, are unreliable in the first place by starting with a fantastically oversimplified model. Most of what outfielders do to earn their keep is track down fly balls. We’ll start with a fairly standard fly ball with four seconds of hang time that is currently somewhere in the middle of the outfield of your favorite stadium. It would be really helpful for the defense if the center fielder could go get that.

Now let’s assume that as soon as the ball is hit, the center fielder reacts immediately (not true, but we’ll talk about that in a minute), and that he runs in a straight line toward where the ball will land (also not true, but again, we’ll talk about that). If the ball is going to come down right where he was standing to begin with, any minimally competent MLB outfielder wearing a glove could have converted that into an out. There are also sections of the park such that no matter how much range a fielder has, there is no human being that can run that far in four seconds.

So there are balls that can be caught by even the worst outfielder and balls that can’t be caught by the best. Then there are the ones in the middle that can be caught by the good fielders, but not by the bad. In our oversimplified model, we can assume that our center fielder has an effective range in the shape of a circle. But how big is that circle? We don’t have data explicitly on range for players, but we can make a few reasonable estimates. Let’s use a reasonable proxy for range, which is speed down the line from home to first. An 80 (elite) runner makes that trip of 90 feet from the right-handed batter’s box in four seconds flat (or less). A 50 (average) runner makes that trip in 4.3 seconds. Let’s just assume that those numbers hold in the outfield as well. On our four-second fly ball, the elite runner has an effective range of 90 feet, while someone running a 4.3 pace could cover only 83.7 feet.

This becomes a geometry problem. A circle with a radius of 90 feet has an area of 25,446.9 square feet. One with a radius of 83.7 feet has an area of 22,009 square feet. That means that the difference between an elite-range center fielder and just a guy is 3,437.9 square feet. Since center fielders are generally chosen because they have some wheels, the 80 runner probably makes for an elite center fielder, while the 50 runner is probably considered a poor one (again, taking out reaction time and route running).
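To make the geometry concrete, here’s the model above as a quick Python sketch. The home-to-first speed proxy and the perfect-circle range are the article’s simplifying assumptions, not anything measured.

```python
import math

def effective_range(hang_time, home_to_first):
    """Feet a fielder can cover during the hang time, using his
    home-to-first time over 90 feet as a crude speed proxy."""
    return (90.0 / home_to_first) * hang_time

def coverage_area(radius):
    """Area of the (oversimplified) circular range, in square feet."""
    return math.pi * radius ** 2

elite = effective_range(4.0, 4.0)    # 80 runner: 90.0 feet
average = effective_range(4.0, 4.3)  # 50 runner: ~83.7 feet

# ~3,430 square feet (the 3,437.9 in the text comes from rounding
# the radius to 83.7 before squaring)
gap = coverage_area(elite) - coverage_area(average)
```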

Consider that your basic MLB park has a fair territory area somewhere around 110,000 square feet. If we are generous and say that the outfielders are only responsible for half of that area (55,000 square feet), and that the difference between good and bad outfielders in their effective ranges is similar (call it 3,500 square feet each), then 16 percent of outfield area falls into the category of balls that a really good outfielder would get to while a really bad one would not.

In 2013, center fielders handled an average of four balls per game that were classified either as fly balls or line drives, either ones that they caught on the fly or that they just picked up after they fell to the ground for a hit. That’s roughly 650 per season. If only 16 percent of those are in the area between the ranges of the good and the bad, then we’re talking about 100 or so fly balls. And that’s assuming that all balls hang in the air for a good four seconds. We’re probably talking about double-digit numbers of fly balls where there’s any chance for the good and bad to show who they really are.
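Scaling that band up to a season is simple arithmetic. A minimal sketch, reusing the numbers above (four balls per game, a 16 percent band, and the generous assumption that everything hangs for four seconds):

```python
def discriminating_chances(balls_per_game=4, games=162, band_share=0.16):
    """Fly balls and liners per season that land in the zone a good
    fielder reaches but a bad one doesn't, per the oversimplified model."""
    return balls_per_game * games * band_share

# ~104 balls per season, before docking for shorter hang times
chances = discriminating_chances()
```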

Let’s go back to some of the assumptions we made about range that were silly. Not all players have the same reaction time. There’s a good deal of scouting that looks at a fielder’s “first step.” Some people react more quickly than others. In general, it’s assumed that humans react to a visual stimulus (that ball … it is coming nigh to me!) within about 200 milliseconds, but we also see that there is considerable variation between people on this. We also know that some fielders take better routes than others (and with the new route efficiency stats, we ought to be able to prove it). With Statcast data, we ought to be able to put together some formula for figuring out how quickly, on average, each fielder reacts, how fast he runs, and how well he plans out his route, and therefore figure out an effective range for each fielder. How much ground can Alex Gordon really cover? It’s just a matter of getting some math done.
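One plausible way those Statcast-style inputs could combine into an effective range might look like the sketch below. The formula itself, including treating route efficiency as a simple multiplier, is my assumption for illustration, not any published Statcast calculation.

```python
def statcast_range(hang_time, speed_fps, reaction_s=0.2, route_eff=1.0):
    """Hypothetical effective range, in feet of ground covered before
    the ball lands.

    route_eff is straight-line distance divided by distance actually
    run, so 1.0 is a perfect route. Illustrative formula only.
    """
    return speed_fps * max(hang_time - reaction_s, 0.0) * route_eff

# An elite (22.5 ft/s) runner with an average first step and a
# slightly crooked route, on a four-second fly ball:
reachable = statcast_range(4.0, 22.5, reaction_s=0.2, route_eff=0.95)
```

Note how quickly the components trade off: a tenth of a second of reaction time is worth about as much as a few percentage points of route efficiency.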

Statcast promises to be a fantastic tool for figuring out what Alex Gordon’s or Jason Heyward’s true range is on your average fly ball. The ability to break apart reaction time from route efficiency from foot speed and even look at it directionally (how is Gordon when going to his left? His right?) will be a great boon to outfield coaches and talent evaluators. Unfortunately, it might not tell us what we actually want to know.

There are two issues that could torpedo our quest for a good reliable fielding metric. One is the issue of fielder positioning. If a fielder starts 70 feet from a ball, it’s a lot easier to catch than if it is 90 feet away. Undoubtedly, once these data are released into the wild, we’ll start to see multi-level analyses looking at whether certain fielders always seem to be nearer to fly balls (or grounders, for that matter) and whether those players tend to cluster on certain teams. Suppose that one team seems to position its players better than the others, regardless of whether the fielders then go on to make the catch. Should we credit the players on that team (or on those teams) for the catches they make or should some of that go to whatever system is telling them where to stand? What if Alex Gordon’s catches are all because the Royals are amazing at positioning?

But there’s another threat to a good, reliable fielding metric. Fielding is really rather noisy when you think about it. Consider for a moment that over four seconds of hang time, we estimate the difference between a good fielder and a poor one as a couple of feet of effective range. If for some reason our poor fielder had a few feet of head start, he would begin to look like an elite defender. How easy is it to get a few extra feet? Easier than you imagine, at least once in a while. Next time you are at a game, take a few moments and watch the outfielders in between pitches. Don’t watch the pitches. Watch the outfielders. They move around. Not a lot, mind you, but a lot of them fidget. I can’t say I blame them. As an outfielder, you stand there for minutes on end with nothing to do. And if you take a jump to the left or a step to the right, that’s a couple feet of movement. Remember that the subset of fly balls that distinguishes good fielders from bad is very small. But suppose that on one of those, the bored fidgeting just happened to take our fielder in the right direction. Over a small sample size, it wouldn’t take more than a couple (un)lucky strikes to bend the results one way or the other.

Even setting that aside, reaction time itself is highly variable within a person. We know that in the lab, well-rested people who have to perform a sustained-attention task are bound to have moments where their reaction time doubles (or more). Going from a base reaction time of 200 ms to 400 ms might not seem like much to the naked eye, but remember that 200 ms is the difference between a 50 runner down the line and a 70 runner. When someone is sleep deprived, the chances of a big lapse in reaction time go up further. These lapses happen randomly, and losing two tenths of a second is losing 5 percent of the time available to catch that four-second fly ball. It’s probably only a difference of a couple of feet, but we’ve seen that a couple of feet make a big difference. There’s a reason there’s a stimulant problem in baseball. Even slicing imperceptible amounts of time off reaction times in the field can have a big impact.
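To put that lapse in feet, here’s the back-of-envelope version, again using home-to-first speed as the proxy:

```python
def feet_lost(speed_fps, lapse_s):
    """Ground given up when a reaction-time lapse eats into hang time."""
    return speed_fps * lapse_s

avg_speed = 90.0 / 4.3            # a 50 runner, ~20.9 ft/s
lost = feet_lost(avg_speed, 0.2)  # lapse from 200 ms to 400 ms: ~4.2 ft
time_share = 0.2 / 4.0            # 5 percent of a four-second fly ball
```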

Fielding, especially in the outfield, is a very statistically noisy process when you break it down. No wonder we have trouble coming to a good consensus on how much value an outfielder added. He might be a very different outfielder from play to play and there are only a small handful of plays over the course of a season that will help us to distinguish who is good and who isn’t. That’s a recipe for a very unreliable metric.

Why Won’t Statcast Solve Our Problems?

Statcast will give us plenty of information on players and individual batted balls. We’ll probably be able to build much better models of how much a ball in the air hit with X distance at Y angle is worth in expected value. We’ll probably have a better idea of the overall true talent of individual players. But the question that we want to know, at least as it pertains to WAR, is “What did Alex Gordon do in 2014?” Did he catch those balls or not? That making or missing those catches might have had more to do with random luck isn’t as important.

In theory, something that is very luck-driven can simply count on the law of large numbers to equalize that random variance and bleed it out of the measure. But we’ve seen that over the course of a year, we’re only going to get a double-digit sample of balls that actually matter. Worse, because the difference between the ball being caught and falling for a hit in terms of value is something on the order of three quarters of a run, we can expect big swings in value even based on a couple of lucky catches or unlucky misses. It’s entirely possible that Alex Gordon really is a +20 defender in left field. It’s also possible that he’s really +10 and having a good year in the luck department. Or an average defender having an amazingly lucky year.
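That noise is easy to see in a toy simulation. The numbers below (a 0.50 league-average catch rate on the “discriminating” balls, 0.63 for our good fielder, which works out to about +10 runs of true talent at 0.75 runs per play) are illustrative stand-ins, not measured values.

```python
import random

def season_runs_saved(catch_prob, chances=100, run_value=0.75, seed=None):
    """Runs saved vs. a league-average fielder over one season's worth
    of chances in the good-but-not-automatic band."""
    rng = random.Random(seed)
    catches = sum(rng.random() < catch_prob for _ in range(chances))
    return (catches - 0.50 * chances) * run_value

# 1,000 simulated seasons for the same +10-true-talent fielder:
seasons = [season_runs_saved(0.63, seed=s) for s in range(1000)]
spread = max(seasons) - min(seasons)
```

The seasons average out near +10, but single seasons can land near zero or near +20, which is exactly the Gordon ambiguity above.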

Sometimes the answer isn’t more granular data. The best answer is “Um, it’s really hard to measure this in the allotted time. Sorry.” I wouldn’t suggest giving up, but the reality is that we’re just going to have to live with the uncertainty. We’ve long known that WAR comes with some built-in margin of error. I just don’t want people to get their hopes up that there’s something on the horizon that can save us. Outfield defense is just really hard to pin down.

Well it's a good thing there is no luck bound up in ERA, batting average, called strikes, etc!

Sarcasm aside, Statcast is going to be mind-blowingly helpful. It will be able to measure two things well, and separately: player positioning and how far individual players can get from point A to point B. Both incredibly useful data points.
I think the assumption that a "poor" fielder has 50 speed is a huge oversight. MLB "average" speed isn't the average at all, realistically most players are 30-40 runners.

Anthony Gose, an 80 runner and elite fielder, covers a lot more ground than Melky Cabrera, a poor fielder, and the difference is significantly more than 8 feet on a four-second fly ball.
Most players that are 30-40 grade runners aren't playing center field, they're playing the infield in some capacity
At that particular juncture, I was talking about center fielders. As Dave points out, you rarely see a 30 or 40 runner in CF, nor do you see a true 80 runner in a corner. The point is that even these "huge" gaps are really only worth a few feet of advantage for a good fielder and that there are chance factors that can wipe out those few feet's worth of advantage with relative ease.
Nice article. I think this is exactly the issue in a lot of other places. There isn't a place where a catcher's pop time figures directly into his WARP, for example. I think there's lots of interesting work left to do.
Oh there's plenty more to do. Baseball is such a rich dataset.
Why is there a tendency to try to determine a fielder's value by trying to figure out his talent level? Is not the value of a fielder how many balls he turned into outs? If Gordon did that better in 2014 than any other left fielder, then he is the most valuable defensive left fielder in 2014. Maybe he's not the most talented, but what matters in 2014 is what he DID in 2014. We don't measure the value of batters on how many hits they SHOULD have had (considering their talent), but how many they DID have. Why should we treat defense any differently?
This is the same concept that explains the difference between FanGraphs pitcher WAR (which is FIP-based- how a pitcher generally "should" have done) and the other pitcher WAR(P)s (based on runs allowed- how a pitcher actually did).

There's value in measuring both- how a hitter/pitcher/fielder "should" have done is generally better for predicting future performance while how they actually performed is of course much better at describing their past value, which is how WAR is almost always used.
How many balls a player turned into outs is a good start. But then you have to consider how many balls a player didn't turn into outs, either because he got to the ball and made an error, or because he never got to the ball. And THEN you should consider the fielder's starting position: a slower defender shifted properly might get to more balls than a faster defender.

Measuring pitchers on how many runs they SHOULD have allowed already happens (i.e., FIP), as does measuring batters on how many runs they SHOULD have created (i.e, context-neutral stats like wRC). These stats answer a different question: to me, they're less about evaluating what a player DID and more about what a player SHOULD do going forward. There's value in that, surely.

(Oh, and batters will absolutely be evaluated on what they should have done [DIBS?] once things like batted ball speed and angle off the bat are publicly available.)
But the question the article seems to open with is whether Gordon is MVP worthy, which should SOLELY be a function of how he performed. Things like FIP and DIBS and such are useful for predicting future performance, but what wins games is what actually happens on the field.

If a guy hits .300 by leading the league in beating out a ton of weak dribblers, that's of much more value than the guy who hit .200 while leading the league in line drives. And if a fielder makes 100 outs while never diving for a single ball while another makes 50 outs (of which 20 were Baseball Tonight highlights) - assuming they both played the same amount of innings at the same position - give me the "no flash" outs every time.

Oversimplification? Perhaps. But sometimes an out is an out and a hit is a hit, no matter how much geometry and physics we add to the data.
By this same logic, the current use of average run values on hits to calculate batter WARP is a problem as well. If I hit 10 grand slams in a season, what actually happened is 40 runs scored as a result of those hits. But I don't have control over the base-out state, so I only get credit for ten average home runs. I'm not saying I disagree with this logic, but used another way it's an intelligent way of arguing for RBIs and runs scored as the basis for evaluating past performances. Where the line should be drawn is very much debatable.
My inclination is not to measure a defensive player's value by how many balls he turned into outs, but instead how he compared to a set baseline for his position.

Use a concept such as replacement level for a left fielder. We know how much ground the replacement level player will cover and how frequently he gloves the ball he gets to. Apply that baseline to every chance a player had, and you have a comparison independent of the number of balls hit his way. Of course, you would need to ensure he had enough land outside his range to feel confident you weren't limiting his upside.

Though this covers range, it says nothing about his throws.
In terms of variability among players, isn't 60 or 90 balls per season of difference the equivalent of 100 or 150 points of OBA for players that have 600 plate appearances? It would seem we're in the same range of difference between the best and the worst.
In terms of percentage of chances, sure, but in batting, there's always the chance to do something good (or bad) and it's much more in the batter's control to actually go and do it.

Basically, those 60-90 balls are the only ones not in either the "any idiot can get that" or the "no one could get that" area. Imagine if all hitters were evaluated on the basis of 60-90 plate appearances and that the outcomes of the rest of those plate appearances were basically pre-determined.
I'm particularly excited to see reaction times and route efficiency scores for balls struck directly toward the fielder. These are difficult for a fielder to judge and I'm curious as to the degree it will show up in the new stats.

Route efficiency scoring will be interesting since an 'ideal route' is often not a straight line (despite what we've been told about the shortest distance between two points). For a ball at the margin of your personal fielding range, a straight line is ideal, but for more routine plays, especially with a man on base, your route is designed to optimize the throw rather than the catch. It'll be fun to watch these metrics evolve.
Is StatCast data going to be publicly released? Is there a timetable?
Maybe this is legend, but I have been led to believe what you call fidgeting by the outfielders might actually be adjustments based on the anticipated pitch (shortstops have been known to relay that information) or just based on the count (two strikes, he is more likely to go the other way).
There's probably some of that in there. But you try standing still for 15 minutes without having much to do and tell me you don't wiggle a bit ;)
"There’s a reason there’s a stimulant problem in baseball. Even slicing imperceptible amounts of time off reaction times in the field can have a big impact."

Really off topic, but I will definitely remember this - and the 200ms difference for runners down the first base line, and the other data here - the next time someone mentions that amphetamines aren't performance enhancers.
So, the summary here is that the question “What did Alex Gordon do in 2014?” will not be answered by StatCast. But will it be answered better than it is with the current dataset?
I think the answer will be better, but not to the point where it gets past the fundamental noisiness of what we're trying to measure.