A little more than a week ago, Jon Heyman of CBS sent out a tweet wondering why it was that Starling Marte and Bryce Harper had the same WAR. Heyman was quoting Baseball-Reference's version of WAR, which at that moment in time showed Marte and Harper tied at 1.7 wins. Harper had clearly been the superior hitter, but drilling down, it turned out that the fielding metric used by Baseball-Reference loved Marte's defense enough (and thought Harper's was average enough) to call them equals.

The problem with any sort of number this early in the season is that on many measurements, we're still at a time when players haven't logged enough playing time for the measure to be considered reliable. But of course, some measures are more reliable than others. The more reliable a measure, the sooner we can be more confident that it actually reflects what the player's talent level was during that time. The less reliable it is, the more likely it is that there will be fluky spikes and valleys over short (and sometimes long) periods of time. Fielding metrics are an estimate of how many outs a player saved from Opening Day onward, and what that was worth. However, in the same way that a player who went 3-for-4 on Opening Day is technically a .750 hitter for the moment, it’s not real. A fielding metric might need some time to stabilize as well before we get a good read on what’s going on.

There's been research on how quickly various batting and pitching statistics stabilize, but in general, few people have asked the question of how reliable our fielding metrics are. One reason is that several of the most commonly cited fielding metrics (UZR, DRS) rely on proprietary data not available to the general public. We just don't have the ability to peek under the hood. And so, we're going to have to get a little creative.

Warning! Gory Mathematical Details Ahead!
There is a publicly available data set that has batted ball type and hit location data for Major League Baseball. Retrosheet (put them in the Hall of Fame!) data files from 1993-1999 have the type of ball hit (grounder, fly ball, line drive), as well as zone data on where the ball was hit. This isn't the ideal data set for a few reasons. First, the zones aren't very granular, and they were input by stringers scoring the game from the press box, so the difference between a line drive and a fly ball might be in the eye of the beholder. Also, the youngest of these data are old enough to be enrolling in high school this fall. However, if anyone would like to show me a publicly available data set that is better…

I started by looking at ground balls for infielders. First, I calculated what zones "belonged" to an infielder. For each zone, I looked at which infielder(s) made the play at least 25 percent of the time (when the ball did not scoot through) over all seven years in the data set. When a zone had more than one fielder assigned to it, for example, a ball in the 56 zone (between short and third) might belong to the shortstop or the third baseman, I did not penalize the third baseman for not fielding the ball if the shortstop got there first. It simply went as a "no play" for the third baseman. Conversely, I did not reward the shortstop for somehow making a play in short right field. (What the heck was he doing out there anyway?)
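
As a rough sketch of the zone-assignment step described above (this is not the actual code behind the study; the function name, the data layout, and the sample plays are all made up for illustration), the 25 percent rule might look something like this in Python:

```python
from collections import Counter, defaultdict


def assign_zones(plays, min_share=0.25):
    """Map each zone to the fielders who made at least `min_share` of the plays there.

    `plays` is an iterable of (zone, fielder) pairs, one per ground ball that
    was actually fielded. Balls that scooted through are assumed to have been
    left out already.
    """
    counts = defaultdict(Counter)
    for zone, fielder in plays:
        counts[zone][fielder] += 1

    owners = {}
    for zone, by_fielder in counts.items():
        total = sum(by_fielder.values())
        owners[zone] = {f for f, n in by_fielder.items() if n / total >= min_share}
    return owners


# Hypothetical sample: zone "56" is shared, zone "6" belongs to the shortstop alone.
sample = [("56", "SS")] * 6 + [("56", "3B")] * 4 + [("6", "SS")] * 9 + [("6", "2B")] * 1
zone_owners = assign_zones(sample)
```

With the invented sample, the 56 zone ends up assigned to both the shortstop and the third baseman, while the second baseman's lone play in the 6 zone falls below the cutoff and earns him nothing.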

My criterion for success was whether or not an out was recorded on the ground ball (either by force out, or just good ol’ throwing the ball to first). I played around with whether or not he got to the ball (regardless of whether he finished the play) or whether he fielded and threw cleanly. (If the first baseman dropped the throw, whose fault is that?) It didn't change the results all that much. All events were coded 0/1 (not out/out).

This is a simpler model than is actually used in the major defensive metrics. What I've created here is a basic "outs per ball in zone" metric. The more developed measures control for more factors and adjust for the difficulty of each play, and they are better off for it. But then again, all defensive metrics boil down to "How many balls was he near and how many did he turn into outs?" I'm happy to concede that I'm dealing with a rough approximation and that your mileage may vary if your model is fueled by more granular data. But this ought to give us some order of magnitude to work with.

I used Kuder-Richardson Formula 21 (KR-21) to look at reliability. KR-21 is specifically set up to look at reliability in binary outcomes. I considered the stat stable when KR-21 crossed .70. I looked at sample sizes of up to 600 balls per fielder, meaning that I can see stability numbers to sampling frames of 300 in real life. If a measure failed to reach .70 within the frame available, I used the Spearman-Brown prophecy formula to estimate the point at which it would reach the reliability line in the sand.
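
For readers who want to see the two formulas named above, here is a minimal sketch using their standard textbook forms. How they were applied to the Retrosheet samples isn't spelled out here, so the function names, the use of population variance, and the toy numbers are assumptions for illustration only.

```python
from statistics import mean, pvariance


def kr21(totals, k):
    """KR-21 reliability from each fielder's total outs on k coded chances.

    Textbook form: (k / (k - 1)) * (1 - M * (k - M) / (k * var)), where M is
    the mean total and var is the variance of the totals. Population variance
    is an assumption; some texts use the sample variance instead.
    """
    m = mean(totals)
    var = pvariance(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))


def spearman_brown_length(r, n, target=0.70):
    """Chances needed to reach `target` reliability, given reliability r at n chances."""
    factor = target * (1 - r) / (r * (1 - target))
    return n * factor


# Toy example: five hypothetical fielders, 10 chances each.
toy_reliability = kr21([2, 4, 6, 8, 10], k=10)

# If reliability were 0.50 at 300 chances, how many chances would .70 take?
toy_needed = spearman_brown_length(0.50, 300)
```

In the toy case, reliability of .50 at 300 chances would project to roughly 700 chances before crossing .70, which is the same kind of extrapolation used for any measure that didn't get there within the available frame.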

The results for ground balls to the infielders:

First basemen: We need 290 GB at or near the first baseman before our crude measure of fielding stabilizes.
Second basemen: 540 GB.
Shortstops: 420 GB.
Third basemen: 400 GB.

Next, I looked at fly balls and pop-ups for all seven non-battery positions. I used the same basic logic, except that I assigned each zone to the fielder who made more than 50 percent of the plays in that zone. I excluded fly balls that left the park. This also excludes line drives, since catching those is largely a matter of luck. I coded each fly ball 0/1 based on whether or not the fielder caught the ball.

For infielders, I was only able to go out to a sampling frame of 200 pop-ups (so my top resolution was 100 pop-ups). For outfielders (who get more fly balls), I was able to go to 500 (so my estimates run to 250 fly balls).

First basemen: 48,000 pop-ups.* Really.
Second basemen: 400 pop-ups.
Shortstops: 320 pop-ups.
Third basemen: 3,240 pop-ups.*
Left fielders: 370 fly balls.
Center fielders: 280 fly balls.
Right fielders: 210 fly balls.

*Those corner infielder numbers are mostly the result of the fact that reliability numbers barely budged from zero in the tested sample. We'll talk a bit more about what this means in a minute.
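
To see why a reliability that barely moves off zero produces an estimate like 48,000, the Spearman-Brown formula can be run backward. This is only a sketch: it assumes the extrapolation started from the 100-pop-up mark, which is a guess, and the function name is made up.

```python
def implied_reliability(needed, n, target=0.70):
    """Reliability at n chances that Spearman-Brown would stretch out to `needed` chances.

    Obtained by solving needed = n * target * (1 - r) / (r * (1 - target)) for r.
    """
    return n * target / (needed * (1 - target) + n * target)


# Assumption: the projection was made from a 100-pop-up sample.
first_base = implied_reliability(48_000, 100)
third_base = implied_reliability(3_240, 100)
```

Under that assumption, the first base figure corresponds to a reliability of about half a percent at 100 pop-ups, and the third base figure to something under seven percent. Small wobbles in numbers that close to zero swing the projection wildly, so the exact totals shouldn't be read too literally.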

To give some context around those numbers, the average team in 2012 had to take care of 6.3 ground balls and 4.7 fly balls/pop-ups per game. (Surprised?) That means that even if Starling Marte had played every inning of every game for the Pirates in left field and every single fly ball that the other team hit was hit his way, after 40 games, we would only expect him to have 188 fly balls hit his way, and that's only about halfway to getting a reliable measure of his outfield range. However, after 40 games, there are certain parts of Starling Marte's batting line that can be considered reliable.
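
The back-of-the-envelope arithmetic in that paragraph can be checked in a few lines. The inputs come straight from the text; the variable names are just for illustration.

```python
FLY_BALLS_PER_GAME = 4.7  # average team in 2012, fly balls and pop-ups combined
GAMES = 40
LF_THRESHOLD = 370        # fly balls needed for left fielders, from the list above

# Best case: every fly ball the opposition hits goes to the left fielder.
max_chances = FLY_BALLS_PER_GAME * GAMES

# Share of the stabilization threshold reached after 40 games.
share_of_threshold = max_chances / LF_THRESHOLD

# Games needed to reach the threshold under the same best-case assumption.
games_needed = LF_THRESHOLD / FLY_BALLS_PER_GAME
```

Even in that impossible best case, 40 games covers only about half of the threshold, and it would take nearly 80 games to get the rest of the way.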

(Careful readers will note that I didn't address throwing arm stats, and that was more a matter of sample size than anything. In the past I've found that performances in throwing runners out on the bases aren't very stable year to year, primarily because there just aren't a lot of chances that a player gets to show off his arm.)

What it Means
Here's the dirty little secret about WARP. It's an amalgamation of a bunch of different measures, converted into the same denominator so that they all add up, and sold as a coherent whole. Packaging all of the parts together creates the illusion that they are all on equal footing with one another. They aren't. WARP is what happens when you add offense, defense, and baserunning (and compare them against a position-adjusted baseline). The problem is that we can be a lot more confident a lot more quickly that a player's offensive numbers accurately portray both how he’s performed and what his true talent level has been over the course of a season. Colin Wyers has much more to say on that subject today.

With defensive numbers, that point of reliability just doesn't happen that fast. It takes longer for a player's true colors to shine through on defense. When a guy like Starling Marte has a big number on his defense, it might reflect what he's done to date, but we can't be completely confident that it captures what he did during that time, and there’s even more uncertainty about who he was deep down. And even if the metric isn’t overstating his performance, we can’t be sure whether he’s really the best fielder in the league or just enjoyed a convenient spike of luck. Either way, we need to be careful to drill down a bit to see what is driving a high (or low) value and to frame our understanding accordingly.

I rapidly get out of my depth when the Kuder-Richardson formalism comes up, so the following question may not make sense, but I'm going to ask it anyway. :-) If I understand it correctly, one of the underpinnings of the KR21 formula, at least as applied to test construction for exams in classes, etc., is the assumption that the "test questions" are of broadly equal difficulty. That clearly doesn't apply in a baseball setting. Unless a fielder is just incomprehensibly bad, he'll make all of the "easy" plays. It'll be the "hard" plays that separate the good fielders from the bad ones, and the "incredibly hard" plays that separate the great fielders from the good ones. Why, then, is KR21 an appropriate formalism for this subject? Aren't you asking it to do an analysis that it's not really well suited for?
I am taking some small liberties with KR-21. I'm assuming that a grounder is a grounder is a grounder (and that all are of equal difficulty), mostly because in the data set I'm using, I can't tell the difference as to which grounders were soft two-bouncers right at the fielders and which were screamers headed through the middle. The way that I have the database structured, I lined up the "test questions" in chronological order. So for first basemen, "question" #1 was the first ground ball that he saw from 1993 onward that was hit in his general area. For some guys, that was an easy one, for others a near impossible ball to get. What I'm counting on is that the noise all cancels out in the wash.
Thanks for the followup, and I understand your methodology a little better now, notably the fact that you almost must use KR21 because of the limitations of the data set. However, that's kinda my concern in a nutshell.

Two points. First, the assumption that all grounders or pop flies are of equal difficulty is obviously wrong (nor do you claim it to be otherwise, for sure), and it leads to the inclusion of lots of plays in your data base that really don't contribute much in terms of discriminatory power. Any ground ball hit within 5 feet of a fielder is going to turn into an "accepted chance" for that fielder, to use 60-year-old terminology, unless the guy is immobile on the scale of a late-career Frank Howard. Those chances may shed light on the inadequacy of guys with real hands of stone, or a terminal case of the throwing yips (think Steve Sax or Chuck Knoblauch), but otherwise they don't contribute much except added statistical clutter.

Second, the contention that all that clutter "cancels out in the wash" is dubious, because not all fielders have the same proportion of non-trivial plays attributed to them. There are a number of reasons for that, ranging from the fielders' own reputations to the reputations of teammates to the surfaces they played on to the pitchers they played behind, and so on. In essence, they aren't all taking the same fielding exam -- which again is one of the key points about KR21.

Yes, I understand now that with the limitations of the data set, you probably can't do better. But with the "right" data set, that is, a reduced set that looks only at balls in play that really do have discriminating power, I'd be pretty confident that the numbers required to achieve some degree of stability would be much reduced, although you'd have to use a more powerful algorithm to test that claim.
Oops: when I said "reputations of teammates," I really meant "objective capabilities of teammates." I wish we could edit these comments to fix things like that. Anyway...
I have to admit that the mathematics involved in saber stats go way over my head, so when you launch into the "Gory Mathematical Details" I often skip forward to the conclusion. As a result, I'm not sure if this is the appropriate article for this question, but I'm going to ask it anyway, since it's a discussion of defensive metrics and WAR. And again, the question involves Bryce Harper.

I've heard Harper's rookie year characterized as "the greatest ever by a teenager." Tony Conigliaro sprang immediately to mind, so I checked the numbers. Using basic stats, in his rookie season at age 19 in 1964, Tony C hit .290/.354/.530 with an OPS+ of 137. Harper at age 19 hit .270/.344/.477, OPS+ 120. Tony C's rate stats were better too, and it certainly appears that he had the better teenage season. But then you turn to WAR... Conigliaro's was 1.6, Harper's 5.2, and I'm sure that's where analysts trumpeting Harper as the best teenager ever are basing their statements. That's a big difference, and it appears to be in dWAR. Was Tony C that bad a defender? And given that it was 1964, how do we know? Should we even compare the two using strictly WAR?
Harper (139 G/597 PA) logged more playing time than did Conigliaro (111 G/444 PA). Harper played primarily in CF, while Conigliaro played in LF. It looks like you're quoting the Baseball-Reference version of WAR. You're correct that most of the difference between the two seasons comes down to Conigliaro being rated as 10 runs below average in the field and Harper as 14 above. For Harper (and for everyone from 2003 onward), BBRef uses Defensive Runs Saved from BIS (The Fielding Bible people). For pre-2003 seasons, they use a measure called Total Zone, which was invented by Sean Smith. I once created a similar measure. Total Zone uses roughly (and I mean very roughly) the same ideas I've used here.

The broader point is that Harper's claim to being the best teenager ever rests a good deal on his superior (in the eyes of the metrics) fielding performance. Conigliaro was clearly the more productive hitter. We can put more faith in the reliability of those hitting metrics than the defensive metrics. There's a decent case to be made that Conigliaro deserves a second look and that the case in favor of Harper is not so clear cut.
I still remember the day Colin Wyers went off on Up & In about how defensive metrics are totally broken because line drive/fly ball classification is borked.
I remember listening to that episode while riding on a bus from Chicago to Cleveland coming home from a wedding. It was the first time I'd ever heard Colin's voice.
Hideous, isn't it?
Wouldn't go that far. It's just surprising. Like the way you look. I imagined you as a clean shaven kind of guy.
I have a degree in actuarial math from the University of Michigan. My GPA wasn't great and I never got a job in the field, but the diploma's on the wall, so I understand things like mean, variance, and sample size. I'm not one to be scared by the phrase "gory math details ahead" like most, and even my eyes glaze over when the WARP formula comes out.

Still, I remember an article on this site years ago that quantified the defensive value of a good defensive catcher vs. a good offensive catcher, and the defensive metrics at the time showed a completely negligible difference between the best and worst defensive catchers. So small that if it were that small at the toughest position on the field, there would never be a reason to consider defense when filling a lineup card as long as the player was adequate. Fast forward to last year, where I see Yadier Molina as a top 10 WAR player and Mike Trout rack up more WAR than Miguel Cabrera despite Miggy's significant edge not just in the Triple Crown stats but in OPS*, and I could only conclude that the brand new stats had gone from underrating defense to overrating it.

This article confirms what I've suspected ever since the MVP debate: the formula overstates defense and baserunning. Just because all three are part of the game doesn't mean they are equal, and until a corrective factor is found, supporters of WAR should hold off on calling non-supporters dinosaurs. Either that or start hyping Marte as much as Harper.

*(Before anyone tries to read team-based bias into my arguments, my favorite Tiger is Justin Verlander and I still voted for David Price for Cy Young last year on the strength of his ERA.)
Why must the fact that Mike Trout finished ahead of Miguel Cabrera in WAR mean that defence and baserunning are over-valued? Is it really so hard to believe that very good hitter + elite baserunning + elite defence > elite hitter + poor baserunning + poor defence?

From looking at Trout's performance last year, it appears around 3/4 of his WAR was generated by his bat. It's simply not true to say that hitting, baserunning and defence are equally valued.
I read Colin's article last night and immediately read Russell's "companion" piece. Great stuff.
"Here's the dirty little secret about WARP." Yes some defensive metrics have become obsolete. However, it is easily possible to deconvolute the fielding metrics from the batting and running contributions toward WARP. And BWARP still seems to be a valid metric.

On the other hand, what is possibly more broken is PWARP. It has gotten to the point that one of the BP writers actively dismisses PWARP. It is very easy to locate examples where "worth" as defined by PWARP appears to be contrary to other more traditional, but acceptable, metrics. As a simple example:

Barry Zito
Year  IP   W-L   QS-BQS  ERA
2010  199  9-14  19-1    4.15
2012  186  15-8  17-0    4.15

After a quick glance at the numbers and ignoring W-L, one would suspect the two years are nearly equivalent, but by PWARP, Zito was nearly 2 wins more valuable in 2010 than 2012.

And at 7-0 with 5 QS out of 8 and a 2.44 ERA, Matt Moore has been nearly replaceable this year: PWARP = 0.3.

Hopefully some food for thought.
The thing with PWARP and Moore is that the metric is looking strictly at his defense-independent performance, which isn't as impressive as his traditional stats. He's walked almost five batters per nine innings, and he has the lowest BABIP in the AL, which PWARP is attributing to his defense, not to him. His FIP is 4.50. Of course, certain guys are able to outperform their FIPs by inducing weak contact consistently, and those pitchers would be blind spots for PWARP. But prior to his 50 or so innings this season, we had no real indication that Moore was one of them.
Thank you, Ben, for confirming what I suspected: by its construction, PWARP is fielding "independent." It possesses some value, but there are limitations. And as with all metrics, it is an approximation for measuring the quantities desired.

I have no problem with this interpretation.

So as a metric, what is the error associated with it? It appears to be a systematic error, thus affecting its accuracy. Given sufficient samples, the random error should disappear. Or is it a random error, in that all FIP pitchers eventually regress toward the mean?