A little more than a week ago, Jon Heyman of CBS sent out a tweet wondering why it was that Starling Marte and Bryce Harper had the same WAR. Heyman was quoting Baseball-Reference's version of WAR, which at that moment in time showed Marte and Harper tied at 1.7 wins. Harper had clearly been the superior hitter, but drilling down, it turned out that the fielding metric used by Baseball-Reference loved Marte's defense enough (and thought Harper's was average enough) to call them equals.
The problem with any sort of number this early in the season is that on many measurements, we're still at a time when players haven't logged enough playing time for the measure to be considered reliable. But of course, some measures are more reliable than others. The more reliable a measure, the sooner we can be more confident that it actually reflects what the player's talent level was during that time. The less reliable it is, the more likely it is that there will be fluky spikes and valleys over short (and sometimes long) periods of time. Fielding metrics are an estimate of how many outs a player saved from Opening Day onward, and what that was worth. However, a player who went 3-for-4 on Opening Day is technically a .750 hitter for the moment, and nobody believes that’s real. A fielding metric might need some time to stabilize as well before we get a good read on what’s going on.
There's been research on how quickly various batting and pitching statistics stabilize, but in general, few people have asked the question of how reliable our fielding metrics are. One reason is that several of the most commonly cited fielding metrics (UZR,
Warning! Gory Mathematical Details Ahead!
There is a publicly available data set that has batted ball type and hit location data for Major League Baseball. Retrosheet (put them in the Hall of Fame!) data files from 1993-1999 have the type of ball hit (grounder, fly ball, line drive), as well as zone data on where the ball was hit. This isn't the ideal data set for a few reasons. First, the zones aren't very granular, and they were input by stringers scoring the game from the press box, so the difference between a line drive and a fly ball might be in the eye of the beholder. Also, the youngest of these data are old enough to be enrolling in high school this fall. However, if anyone would like to show me a publicly available data set that is better…
I started by looking at ground balls for infielders. First, I calculated what zones "belonged" to an infielder. For each zone, I looked at which infielder(s) made the play at least 25 percent of the time (when the ball did not scoot through) over all seven years in the data set. When a zone had more than one fielder assigned to it, for example, a ball in the 56 zone (between short and third) might belong to the shortstop or the third baseman, I did not penalize the third baseman for not fielding the ball if the shortstop got there first. It simply went as a "no play" for the third baseman. Conversely, I did not reward the shortstop for somehow making a play in short right field. (What the heck was he doing out there anyway?)
My criterion for success was whether or not an out was recorded on the ground ball (either by force out, or just good ol’ throwing the ball to first). I played around with whether or not he got to the ball (regardless of whether he finished the play) or whether he fielded and threw cleanly. (If the first baseman dropped the throw, whose fault is that?) It didn't change the results all that much. All events were coded 0/1 (not out/out).
This is a simpler model than is actually used in the major defensive metrics. What I've created here is a basic "outs per ball in zone" metric. The more developed measures control for more factors and adjust for the difficulty of each play, and they are better off for it. But then again, all defensive metrics boil down to "How many balls was he near and how many did he turn into outs?" I'm happy to concede that I'm dealing with a rough approximation and that your mileage may vary if your model is fueled by more granular data. But this ought to give us some order of magnitude to work with.
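For readers who want to see the bookkeeping, here is a minimal sketch of the "outs per ball in zone" idea described above. It is not the code used for this article. The play tuples, the zone labels, and the function names zone_owners and outs_per_ball_in_zone are all made up for illustration, and real Retrosheet event files would need parsing first.

```python
from collections import Counter, defaultdict

# Each play is (zone, fielder, out): the zone the ball was hit to, the
# position that fielded it (None if it got through), and 1/0 for out/no out.
# This layout is a hypothetical simplification, not the Retrosheet format.

def zone_owners(plays, min_share=0.25):
    """Assign each zone to every position that fielded at least
    min_share of the balls that were fielded there."""
    fielded = defaultdict(Counter)
    for zone, fielder, _ in plays:
        if fielder is not None:  # ignore balls that scooted through
            fielded[zone][fielder] += 1
    owners = {}
    for zone, counts in fielded.items():
        total = sum(counts.values())
        owners[zone] = {pos for pos, n in counts.items() if n / total >= min_share}
    return owners

def outs_per_ball_in_zone(plays, position, owners):
    """Outs recorded by `position` divided by his chances in his own zones."""
    chances = outs = 0
    for zone, fielder, out in plays:
        if position not in owners.get(zone, set()):
            continue  # not his zone: no credit, no blame
        if fielder is not None and fielder != position:
            continue  # a teammate got there first: a "no play"
        chances += 1
        outs += int(fielder == position and bool(out))
    return outs / chances if chances else None

# Toy data, invented for the example.
sample_plays = [
    ("56", "SS", 1), ("56", "3B", 1), ("56", "3B", 0), ("56", None, 0),
    ("6", "SS", 1), ("6", None, 0),
    ("34", "2B", 1), ("34", "2B", 1), ("34", "2B", 1), ("34", "2B", 1),
    ("34", "SS", 1),
]
owners = zone_owners(sample_plays)
third_base_rate = outs_per_ball_in_zone(sample_plays, "3B", owners)
shortstop_rate = outs_per_ball_in_zone(sample_plays, "SS", owners)
```

In the toy data the shortstop makes one play in the 34 zone but falls below the 25 percent share there, so that play earns him nothing, which matches the "no reward for wandering" rule above.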
I used Kuder-Richardson Formula 21 (KR-21) to look at reliability. KR-21 is specifically set up to look at reliability in binary outcomes. I considered the stat stable when KR-21 crossed .70. I looked at sample sizes of up to 600 balls per fielder, meaning that I can see stability numbers to sampling frames of 300 in real life. If a measure failed to reach .70 within the frame available, I used the Spearman-Brown prophecy formula to estimate the point at which it would reach the reliability line in the sand.
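For the curious, the two formulas named above can be sketched in a few lines. This is the textbook form of each, not necessarily the exact procedure behind the numbers in this article. The function names are my own, and the use of the population variance is an assumption (some treatments use the sample variance instead).

```python
from statistics import mean, pvariance

def kr21(total_scores, k):
    """Textbook KR-21 reliability.

    total_scores: one total per fielder (outs made out of k chances).
    k: number of binary "items" (balls in zone) each fielder was scored on.
    Assumption: population variance of the totals is used.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    m = mean(total_scores)
    var = pvariance(total_scores)
    if var == 0:
        raise ValueError("totals have zero variance; reliability is undefined")
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

def spearman_brown_length_factor(r_observed, r_target=0.70):
    """How many times longer the sample must be to move from
    r_observed to r_target under the Spearman-Brown prophecy formula."""
    if not (0 < r_observed < 1 and 0 < r_target < 1):
        raise ValueError("reliabilities must be strictly between 0 and 1")
    return (r_target * (1 - r_observed)) / (r_observed * (1 - r_target))

# Made-up totals for five fielders, each scored on 10 balls.
example_reliability = kr21([2, 8, 5, 9, 1], k=10)
# If a stat shows reliability .50 at some sample size, this is the
# multiple of that sample size at which it would project to reach .70.
example_factor = spearman_brown_length_factor(0.50, 0.70)
```

The length factor is why a stat whose reliability barely moves off zero projects to an absurdly large sample: as r_observed approaches zero, the factor blows up.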
The results for ground balls to the infielders:
First basemen: We need 290 GB at or near the first baseman before our crude measure of fielding stabilizes.
Second basemen: 540 GB
Shortstops: 420 GB
Third basemen: 400 GB
Next, I looked at fly balls and pop-ups for all seven non-battery positions. I used the same basic logic, except that I assigned each zone to the fielder who made more than 50 percent of the plays in that zone. I excluded fly balls that left the park. Also, this does not include line drives, because catching those is largely a matter of luck. I coded each fly ball 0/1 based on whether or not the fielder caught the ball.
For infielders, I was only able to go out to a sampling frame of 200 pop-ups (so my top resolution was 100 pop-ups). For outfielders (who get more fly balls), I was able to go to 500 (so my estimates run to 250 fly balls).
First basemen: 48,000 pop-ups.* Really.
Second basemen: 400 pop-ups.
Shortstops: 320 pop-ups.
Third basemen: 3,240 pop-ups.*
Left fielders: 370 fly balls.
Center fielders: 280 fly balls.
Right fielders: 210 fly balls.
*Those corner infielder numbers are mostly the result of the fact that reliability numbers barely budged from zero in the tested sample. We'll talk a bit more about what this means in a minute.
To give some context around those numbers, the average team in 2012 had to take care of 6.3 ground balls and 4.7 fly balls/pop-ups per game. (Surprised?) That means that even if Starling Marte had played every inning of every game for the Pirates in left field and every single fly ball that the other team hit was hit his way, after 40 games, we would only expect him to have 188 fly balls hit his way, and that's only about halfway to the 370 fly balls a left fielder needs for a reliable measure of his outfield range. However, after 40 games, there are certain parts of Starling Marte's batting line that can be considered reliable.
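The back-of-the-envelope math above is easy to replay. The sketch below uses the 4.7 fly balls/pop-ups per team game and the 370-fly-ball threshold for left fielders quoted in this article. The function names and the share parameter are mine, and the one-third share in the last line is a purely hypothetical figure for illustration, not a measured one.

```python
import math

FLY_BALLS_PER_TEAM_GAME = 4.7    # 2012 average quoted in the article
LEFT_FIELD_THRESHOLD = 370       # fly balls needed for a left fielder

def max_chances(games, per_game=FLY_BALLS_PER_TEAM_GAME, share=1.0):
    """Expected fly balls hit a fielder's way after `games` games,
    if he sees `share` of all the fly balls his team faces."""
    return games * per_game * share

def games_needed(threshold, per_game=FLY_BALLS_PER_TEAM_GAME, share=1.0):
    """Full games required to reach `threshold` chances."""
    return math.ceil(threshold / (per_game * share))

chances_after_40 = max_chances(40)                       # every fly ball goes to him
games_if_all = games_needed(LEFT_FIELD_THRESHOLD)        # same extreme assumption
games_if_third = games_needed(LEFT_FIELD_THRESHOLD, share=1 / 3)  # hypothetical share
```

Even under the impossible assumption that every fly ball goes to the left fielder, the threshold sits roughly half a season away. Any realistic share pushes it well past a single season.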
(Careful readers will note that I didn't address throwing arm stats, and that was more a matter of sample size than anything. In the past I've found that performances in throwing runners out on the bases aren't very stable year to year, primarily because there just aren't a lot of chances that a player gets to show off his arm.)
What it Means
Here's the dirty little secret about WARP. It's an amalgamation of a bunch of different measures, converted into the same denominator so that they all add up, and sold as a coherent whole. By packaging all of the parts together, it gives them the illusion that they are all on equal footing with one another. They aren't. WARP is what happens when you add offense, defense, and baserunning (and compare them against a position-adjusted baseline). The problem is that we can be a lot more confident a lot more quickly that a player's offensive numbers accurately portray both how he’s performed and what his true talent level has been over the course of a season. Colin Wyers has much more to say on that subject today.
With defensive numbers, that point of reliability just doesn't happen that fast. It takes longer for a player's true colors to shine through on defense. When a guy like Starling Marte has a big number on his defense, it might reflect what he's done to date, but we can't be completely confident that it captures what he did during that time, and there’s even more uncertainty about who he was deep down. And even if the metric isn’t overstating his performance, we can’t be sure whether he’s really the best fielder in the league or just enjoyed a convenient spike of luck. Either way, we need to be careful to drill down a bit to see what is driving a high (or low) value and to frame our understanding accordingly.