We need to talk about defense. Well, not really defense, but how we measure defense. OK, not really how we measure defense, but what we do with it after we measure it. A couple of weeks ago, Matt Winkelman wrote something that questioned how defensive metrics were structured, specifically around the question of how we adjust for a player’s position in putting a value on his performance.
Winkelman begins by recapping the work that’s been done previously concerning positional adjustments and how they are handled in uber-metrics like WAR(P), and then writes a paragraph that should be required reading for anyone who dips their toes in the sabermetrics pool.
The entire sample size of players for the positional adjustment study is limited to players who have played at least two different positions over the course of the study. In the original adjustments, Tom Tango tried to account for the experience difference that might come from the positional change, but what I am more interested in is what kind of players these measurements do and do not come from.
Winkelman then goes on to point out that the players who play two positions (say, shortstops who move to second base) within a season are (mostly) the shortstops who weren’t good enough to really hack it at short, and that’s why they were moved. In other words, he’s arguing that the positional adjustments on which we rely are based on a biased sample.
There’s no real way to deny that his claim is true. As nice as it would be if baseball were to hand out playing time randomly, the way that they would do it in a good clinical randomized controlled trial, they don’t. The question is how much this really matters.
Let’s think about who plays multiple positions:
- Utility infielders. Utility infielders are a rather interesting bunch. They don’t carry a big enough bat to be regular starters, and they aren’t elite enough defenders for a team to swallow the bad bat, hit them eighth, and reap the rewards of the defense. They’re stuck in a weird space where they are good enough at defense to handle shortstop, and their managers feel fine sticking them at second and third when needed. And for the most part, they are native shortstops. There aren’t many native second or third basemen (and certainly no native first basemen) who become utility infielders. Additionally, and this is crucial, someone thought that playing them all over the infield was a good idea. It was probably someone whose entire job is to evaluate baseball talent. Imagine a guy who a team believes can handle short, but for whatever reason, they don’t think he’d handle third all too well. So they never put him there, and he never shows up in the sample of players with time at both spots.
- Fourth outfielders. These guys are much like the utility infielders above. They tend to come in two flavors: can handle center field and … can’t. This is a problem when we try to “convert” performance in center field into performance in a corner spot. Teams tend to carry a backup outfielder of the “can handle center” type, though he might do some fill-in duty in left or right, as needed. In fact, if for some reason it makes sense for a team to have both the starting center fielder and the backup center fielder in the lineup, there’s usually little hesitation on a manager’s part to stick the backup in a corner. However, we don’t normally see the reverse, where it makes sense for a team to have three corner outfield bats in the lineup, and the team just punts and sends one of them out to center. We get to see what “can handle center” can do in left, but rarely do we get to see what “should stick to left field” can do in center. That means the “played CF and LF” bucket that we use to adjust center field to left field is going to be over-run by guys who are capable of handling center field.
- Guys who are getting older. There’s the category of player who is still “good enough” to handle third, if you needed him to, but who is going to eventually become a full-time first baseman … and that process has already started. We don’t normally see what a good third baseman does at first. We see what the ones who are “headed that way” do.
- Guys who aren’t catchers. Catchers basically just catch.
- Hide-a-players. There are some players who probably shouldn’t be on the field to begin with, and a lot of times they’re riding the first base/corner outfield shuttle. It’s rare to see a player who regularly plays both center field and first base. In fact, it’s rare that you find players who split their time between the infield and the outfield. But if we’re trying to figure out the difference between center field and first base, we can use a chain. We can look at guys who played both first base and left field. A lot of them are horrendously bad fielders who are just looking for a place where they can’t do any damage. And then look at guys who played both center and left who are … well, probably pretty good center fielders. Does that methodology make anyone else nervous?
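The chain idea in that last bullet can be written down in a few lines. This is only a sketch: the function name `chain_adjustment`, the dictionary layout, and every run value are hypothetical, made up to show the mechanics rather than taken from any published adjustment.

```python
def chain_adjustment(pairwise, path):
    """Sum pairwise position gaps along a path such as 1B -> LF -> CF.

    pairwise[(a, b)] is the average change in fielding runs seen when the
    same group of players moves from position a to position b. If only the
    reverse pair (b, a) was measured, its sign is flipped.
    """
    total = 0.0
    for a, b in zip(path, path[1:]):
        if (a, b) in pairwise:
            total += pairwise[(a, b)]
        elif (b, a) in pairwise:
            total -= pairwise[(b, a)]
        else:
            raise KeyError(f"no shared sample for {a} and {b}")
    return total


# Made-up gaps, each estimated from a *different* pool of players.
gaps = {
    ("1B", "LF"): -4.0,  # pool: mostly hide-a-players
    ("CF", "LF"): 9.0,   # pool: mostly capable center fielders
}
first_to_center = chain_adjustment(gaps, ["1B", "LF", "CF"])
```

The arithmetic is trivial, which is sort of the point. Nothing in the sum knows that the two links were measured on very different kinds of players.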
A common (though not exclusive) thread in what’s above is the idea that players who are playing multiple positions are mostly shifting down the defensive spectrum, playing “below their abilities.” We can find the average difference in performance among players who played those two positions, but does it really work the same way when someone is sliding down the scale compared to when they are sliding up?
I can’t say that I blame the people who have derived the positional adjustments that are currently in use. Even if we know that the data is biased, it’s the best thing that we have available. Their methodology is entirely defensible (sorry). Looking at players who have played two different positions over the course of a year and then taking some sort of weighted average of those performances makes sense on the surface. If nothing else, you know that a sample composed of Smith at second base vs. Smith at third base holds Smith constant.
Or does it?
Warning! Gory Mathematical Details Ahead!
There’s another hidden assumption in that methodology that we need to discuss beyond the sample bias. Let’s assume the perfect setup for this type of analysis. We have 500 innings’ worth of observations for Smith and a bunch of other people at second base, and 500 innings of that same group at shortstop. The exact number of innings isn’t important. We begin with the assumption that since it’s the same group of players, any difference between how they perform at shortstop and how they perform at second base is a function of how hard the position is.
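For readers who like to see the assumption spelled out, here is a minimal sketch of an "average difference" calculation. The `PositionSplit` record, the runs-per-1,000-innings scale, the weighting by the smaller innings total, and all of the numbers are assumptions made for illustration. The published adjustments may weight things differently.

```python
from dataclasses import dataclass


@dataclass
class PositionSplit:
    """One player's fielding at two positions (all numbers here are made up)."""
    player: str
    ss_runs_per_1000: float  # fielding runs per 1,000 innings at shortstop
    ss_innings: float
    b2_runs_per_1000: float  # fielding runs per 1,000 innings at second base
    b2_innings: float


def average_difference(splits):
    """Weighted mean of (2B rate - SS rate) across players.

    Each player is weighted by the smaller of his two innings totals,
    which is an assumed weighting scheme, not a documented one.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for s in splits:
        weight = min(s.ss_innings, s.b2_innings)
        weighted_sum += weight * (s.b2_runs_per_1000 - s.ss_runs_per_1000)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no player has innings at both positions")
    return weighted_sum / total_weight


sample = [
    PositionSplit("Smith", -2.0, 500, 3.0, 500),
    PositionSplit("Jones", 1.0, 500, 4.0, 250),
]
gap = average_difference(sample)  # runs per 1,000 innings, 2B minus SS
```

Everything that comes out of `average_difference` gets attributed to the position. The code has no way to say "Smith is simply built for second base."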
Is Smith the same man standing to the left of second base as he is standing to the right of second base?
Fielding a ball is a symphony of several skills. The fielder has to react to the fact that a ball has been hit, pick up the trajectory of a ball moving toward him, and move his body over to where the ball is going to be so that he can pick it up. He has to receive the ball into his hands and sometimes throw it. And there’s a time limit, because the batter is motoring down the line.
But let’s break some of that down further. Reaction time can be helped along by experience. If Smith has played a bunch of second base, then he’s had a lot of experience with the angles that the ball can take and how to react to them. Perhaps he’s learned through thousands of repetitions some specific motion patterns that will get him to that ball faster. One thing that I’ve found in the past is that experience at a position matters. (Others have found this as well.) Players who play a position more often have higher success rates fielding the ball when they get there.
On top of that, there are several ways to be a good fielder. One can be a gifted athlete and be quick and nimble, but another can live off of his excellent reaction time, and still another off his soft hands and strong arm. And the various positions call for different skills, just based on the geometry of the game. The “average difference” method assumes that “fielding ability” is a unitary skill and, again, that the differences that appear between Smith at 2B and Smith at SS are due to the difficulty of each position, rather than to the possibility that Smith’s skill set was tailor-made for second, but not so much for short.
This is the great untapped promise of Statcast, at least on the public side. When Statcast was released, we were hoping it would bring us a whole new world of being able to measure defense. And yes, now we have things like catch probability, and that’s nice, but we don’t have much data on players’ average reaction times or ranging abilities, even though we know from MLB broadcasts that they can easily calculate those numbers. Right now, the best public defensive metrics are overall value metrics based on plays made or not made, but they lack the fine details of telling us how the play was made.
For example, did Smith get a good read and jump on the ball? Did he just run really fast? And for the purposes of the topic at hand, does Smith have the same reaction time when he’s playing at short that he does when he’s playing at second? And if so, does that really matter? We don’t know (yet). But we’ll have to make do with what’s public. Fortunately for us, Retrosheet has some pretty good batted ball data for 1993-1999, complete with batted ball locations. They’re zone based and probably rely on some stringer’s interpretation of where the ball went (20 years ago!), but they are free.
Using that data set, I looked into one skill, which is the ability of a player to range over and get to a ground ball hit in his general area. For a shortstop, this might be a ball hit to the ‘6’ zone, the ‘56’ zone, or the ‘6M’ zone. I didn’t look to see whether he fielded the ball cleanly or whether he made the throw in time. I just looked at “range.” I then found players who had experience with at least 100 such ground balls at second base and another 100 at shortstop. Their “success rates” correlated at a mere r = .237. That’s … not what we might expect.
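A rough sketch of that kind of calculation, for anyone who wants to try it on their own data: filter to players with enough chances at both spots, turn each split into a success rate, and take a Pearson correlation. The record layout (`{"SS": (reached, chances), ...}`) and the function names are hypothetical, and the toy numbers in the usage lines are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def range_rate_correlation(records, min_chances=100):
    """Correlate SS and 2B range success rates for qualifying players.

    records maps player -> {"SS": (reached, chances), "2B": (reached, chances)}.
    Players missing a position or short of min_chances at either are dropped.
    """
    ss_rates, b2_rates = [], []
    for splits in records.values():
        ss = splits.get("SS")
        b2 = splits.get("2B")
        if ss is None or b2 is None:
            continue
        if ss[1] < min_chances or b2[1] < min_chances:
            continue
        ss_rates.append(ss[0] / ss[1])
        b2_rates.append(b2[0] / b2[1])
    return pearson_r(ss_rates, b2_rates)


# Invented toy data; "Cameo" falls short of 100 chances at second and is dropped.
toy = {
    "Smith": {"SS": (80, 100), "2B": (70, 100)},
    "Jones": {"SS": (90, 100), "2B": (80, 100)},
    "Cameo": {"SS": (95, 100), "2B": (10, 50)},
}
r = range_rate_correlation(toy)
```

Note that the minimum-chances filter is itself a sampling decision: it is exactly the step that decides who gets to count as a two-position player.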
We’re measuring a single skill for the same group of players within the same time frame. That correlation should be higher. I played around with other combinations of positions, but kept getting the same rather low correlations. It seems that Smith is a different man at shortstop than he is at second base, or at least that knowing what Smith is like at shortstop tells us very little about Smith at second base.
I took the analysis one step further. I found players who, over the seven years in the data set, had at least 500 ground balls at shortstop, and found their success rate on those balls. Then, I looked at what happened when those players were playing second base. I coded each candidate ground ball to second base as either “success” (player got there) or “failure” (nope). Can we use the player’s shortstop success rate to predict whether he would get to that ball now that he was playing second? It turns out that the answer was no. And when I swapped positions out across the infield, the sort of cross-pollination that we might expect never happened.
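One way to set up a test like that is a one-predictor logistic regression, with the fielder’s shortstop success rate as the predictor and the outcome of each ground ball at second base as the thing being predicted. To be clear, the choice of logistic regression, the gradient-ascent fitting loop, and the names below are my assumptions for the sake of a sketch, and the tiny data set is invented.

```python
from math import exp, sqrt


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(ss_rates, outcomes, lr=0.5, steps=2000):
    """Fit P(success at 2B) = sigmoid(a + b * z) by gradient ascent.

    z is the fielder's SS success rate, standardized to mean 0 and SD 1,
    so b is the change in log-odds per standard deviation of SS rate.
    outcomes holds 1 (got to the ball) or 0 (did not), one per ground ball.
    """
    n = len(ss_rates)
    mean = sum(ss_rates) / n
    sd = sqrt(sum((x - mean) ** 2 for x in ss_rates) / n)
    if sd == 0:
        raise ValueError("SS rates are all identical; nothing to fit")
    zs = [(x - mean) / sd for x in ss_rates]
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, outcomes):
            err = y - sigmoid(a + b * z)
            grad_a += err
            grad_b += err * z
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


# Invented data: one row per ground ball at second base.
rates = [0.7, 0.7, 0.7, 0.7, 0.9, 0.9, 0.9, 0.9]  # fielder's SS success rate
got_there = [1, 0, 0, 0, 1, 1, 1, 0]              # result of the 2B chance
intercept, slope = fit_logistic(rates, got_there)
```

A slope near zero is what "the answer was no" would look like in this setup: knowing a fielder’s shortstop rate would not move the odds of him reaching a ball at second.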
I even looked to see whether the effect was different for utility infielders who logged a lot of time at both spots or for guys making cameo appearances at a spot. The answer was consistently “no.” This isn’t airtight methodology. There’s probably still going to be some selective sampling in here, but there’s little evidence that skills (or at least this one skill that we care about in terms of producing outs) carry over from one position to another, even within the same person. So what does that say for a positional adjustment methodology that assumes a player playing two positions is the same player at both, and that any observed differences are the result of the positions varying in difficulty?
The reason that we have positional adjustments in WAR(P) is that we want to give a player credit for playing a difficult position. He might be an average defensive shortstop, but (we assume) if his team needed him to do it, he could be a somewhat above-average second baseman and a crackerjack left fielder. (Hanley Ramirez would like a word with you.) But if the sample that produced those adjustments is significantly biased and the main assumption—that “defensive ability” is portable from place to place—isn’t solid, then is that construct modeling reality or a convenient representation of what we thought reality was?
WAR(P) Might Need to Change
Maybe you could make the case that the positional adjustments are useful in the aggregate, but of course, teams don’t sign aggregate players. They sign Smith. In fact, let’s take two players, Smith and Jones. Both have played primarily second base and both are average defenders there. Smith has only ever played second base in his career, while Jones has played some third base and acquitted himself quite well there. WAR(P) might look at Smith and say, “Well, he’s an average second baseman, so he’d be fine over at third base,” but Jones actually has a track record to back that claim up.
We are probably under-valuing that Jones has actually played third base before (and done well). Smith’s previous performance at second base means little for what he might do at third base. He might be amazing. He might be awful. He might be average. But just going by his performance to date at second, we can’t assume anything. Maybe Smith is a quick learner and could pick up third base rather easily, and maybe league-wide those mental skills fill in the (lack of) correlation that we see between positions. Of course, WAR(P) has no mental skills component. Maybe it needs one.
But even just fixing the selective sampling issue might be a big deal. The idea of positional adjustments is that Smith, as a second baseman, could be replaced (in the sense of a “replacement-level” player) by someone who already has a track record at second base or alternately by, say, a right fielder who is willing to take an X-run hit to his fielding numbers. What if going down the defensive spectrum from second to right isn’t the same thing as going up the slide from right to second? What if that means that Smith is a little more valuable than we might have thought because while a right fielder could theoretically play second, he’s going to take a much bigger hit than just X runs?
In an era where (say it with me now) “a win is worth $10 million!” and it takes roughly 10 runs to buy a win, if our models are one run off, then we’ve made a million-dollar mistake. I love WAR(P) as much as the next sabermetrician, but I do worry that we’ve all become a little too comfortable with it. We can’t treat it as though it were handed down from Mount Wyers, fully formed, never to need any tinkering. Once in a while, it’s good to check your assumptions. If they’re not true, then you need a model that relies on better ones.
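The million-dollar figure is just a unit conversion, sketched below. The $10 million per win comes from the line above. The 10 runs per win is the usual rule of thumb and is treated here as an assumption, and the function name is made up.

```python
DOLLARS_PER_WIN = 10_000_000  # the "a win is worth $10 million" figure
RUNS_PER_WIN = 10.0           # assumed rule-of-thumb conversion


def dollar_cost_of_error(runs_off):
    """Dollar value of a valuation error expressed in runs."""
    return runs_off * DOLLARS_PER_WIN / RUNS_PER_WIN


one_run_mistake = dollar_cost_of_error(1)
```

Plug in a bigger miss on a positional adjustment, say a right fielder who loses several more runs at second base than the model assumed, and the mistake scales right along with it.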