Last week, we talked a bit about measuring the uncertainty in our estimates of offense. I hinted at having a few additional ideas on quantifying the uncertainty involved. Let’s examine two different routes we could take, both of which would offer less uncertainty than what we quantified last week.
When we did our estimates of uncertainty last week, we compared the linear weights value of an event to the actual change in run expectancy, given the base-out states before and after the event. What we can do instead is prepare linear weights values by base-out state and find the standard error of those instead. Looking at official events:
Event         LWTS    STDERR
Out           0.246   0.184
2B            0.724   0.179
1B            0.443   0.164
3B            1.010   0.065
K             0.261   0.021
NIBB          0.296   0.000
(unlabeled)   1.398   0.000
(unlabeled)   0.315   0.000
(unlabeled)   0.174   0.000
The high standard error for the out is somewhat misleading: it includes things like reaching on error, which we separate out in our current linear weights implementation. But here, the source of error comes from the potential differences in baserunner advancement. It makes a certain kind of sense: an Adam Dunn double and a Juan Pierre double present different opportunities for a runner at first to advance, for instance. (A Juan Pierre double is probably closer to an Adam Dunn single, and an Adam Dunn double is probably a good chance at a triple for Juan Pierre.)
So now you’ve reduced your estimated error without changing your run estimates! Congratulations. The downside is that you’re now measuring your error against something that I suspect most people have a hard time understanding. You’re getting pretty far into the weeds of hypothetical runs, rather than measuring against a good proxy for actual runs, like what we did last week.
Another thing we can do is look at the change in run expectancy for each event. This isn’t a particularly new idea (Gary R. Skoog came up with it in 1987, calling it the Value Added approach[i]), although it hasn’t been especially popular because of the play-by-play data needed to compute it. Let’s pull up the same run expectancy chart we used last week:
Bases   0 outs   1 out    2 outs
000     0.489    0.263    0.101
100     0.858    0.512    0.221
020     1.073    0.655    0.319
003     1.308    0.898    0.363
120     1.442    0.904    0.439
103     1.677    1.146    0.484
023     1.893    1.290    0.581
123     2.262    1.538    0.702
To get the Value Added for a plate appearance, you take the runs scored on the play, add the ending run expectancy, and subtract the starting run expectancy. So a bases loaded home run with no outs would have an ending run expectancy of .489, plus runs scored of 4, minus 2.262: a Value Added of 2.227. A home run with the bases empty and two outs has the same run expectancy at the beginning and the end, so you end up with a Value Added of only 1.
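For anyone who wants to follow along at home, the arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and variable names are mine, and treating a finished inning (three outs) as zero run expectancy is an assumption the text doesn't spell out.

```python
# Run expectancy by base state (runners on 1st/2nd/3rd) and outs,
# copied from the chart above. Tuple positions are 0, 1 and 2 outs.
RUN_EXPECTANCY = {
    "000": (0.489, 0.263, 0.101),
    "100": (0.858, 0.512, 0.221),
    "020": (1.073, 0.655, 0.319),
    "003": (1.308, 0.898, 0.363),
    "120": (1.442, 0.904, 0.439),
    "103": (1.677, 1.146, 0.484),
    "023": (1.893, 1.290, 0.581),
    "123": (2.262, 1.538, 0.702),
}


def run_expectancy(bases, outs):
    """Expected runs for the rest of the inning from this base-out state."""
    # Assumption: once the third out is made, nothing more can score.
    if outs >= 3:
        return 0.0
    return RUN_EXPECTANCY[bases][outs]


def value_added(start_bases, start_outs, end_bases, end_outs, runs_scored):
    """Runs scored on the play, plus ending RE, minus starting RE."""
    return (
        runs_scored
        + run_expectancy(end_bases, end_outs)
        - run_expectancy(start_bases, start_outs)
    )


# The two home runs worked through in the text.
grand_slam = value_added("123", 0, "000", 0, runs_scored=4)
solo_two_out = value_added("000", 2, "000", 2, runs_scored=1)

print(round(grand_slam, 3))    # bases loaded, no outs
print(round(solo_two_out, 3))  # bases empty, two outs
```

Summing `value_added` over every plate appearance for a hitter gives his Value Added runs for the season.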
This approach, needless to say, does a much better job of reconciling with actual runs scored than the linear weights approach. It also comes closer to measuring performance “in the clutch,” although it ignores inning and run differential (we’ll talk about that at a later date). So why not use it instead of linear weights? The data needed to power it is readily available now, at least for the modern era, as is the computing power required to accomplish it.
The issue, if you’ll remember back to our goals laid out in the first week, is that we want to avoid overcrediting a player for the accomplishments of his teammates. A player is not directly responsible for the base-out states he comes to bat in; that’s the product of the hitters ahead of him in the lineup. But if you look, there’s a substantial relationship between the average absolute Value Added of the situations a player comes to bat in, and the difference between his linear weights runs and his Value Added runs (per PA).
In other words, a player’s Value Added is driven in part by the quality of opportunities he has, not simply what he does with them. The ability for a player to impact plays is greater in some situations than others, and players who get to bat in those situations more than the typical player will have more of a chance to accrue Value Added.
But we can adjust for this. Using a variation on Leverage Index called base-out Leverage Index, we can adjust the Value Added run values for the mix of base-out situations a player comes up in. We take the average absolute change in run expectancy in each base-out state and compare that to the average absolute change in run expectancy in all situations, so that 1 is an average situation and higher values mean more possible change in run expectancy. Then we divide the Value Added values by the leverage to produce an adjusted set of values that reflects a player’s value in the clutch without penalizing or rewarding a player based on the quality of his opportunities.
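Here is a rough sketch of that adjustment in Python. It is not the production code: the function names and the toy play list are made up for illustration, and I am assuming that the "absolute change in run expectancy" for a plate appearance is the absolute value of its Value Added.

```python
from collections import defaultdict


def base_out_leverage(plays):
    """Map each base-out state to its leverage (1.0 is an average situation).

    `plays` is an iterable of (state, value_added) pairs covering the
    whole league; `state` can be any hashable label, e.g. ("123", 0).
    """
    abs_total = defaultdict(float)
    count = defaultdict(int)
    for state, va in plays:
        abs_total[state] += abs(va)
        count[state] += 1

    # Average absolute change across every plate appearance.
    overall = sum(abs_total.values()) / sum(count.values())

    # Average absolute change in each state, relative to the overall average.
    return {s: (abs_total[s] / count[s]) / overall for s in abs_total}


def adjusted_value_added(player_plays, leverage):
    """Sum a player's Value Added after dividing each play by its leverage."""
    return sum(va / leverage[state] for state, va in player_plays)


# Toy league sample (hypothetical numbers, not real data).
league = [
    (("000", 0), 0.5),
    (("000", 0), -0.25),
    (("123", 0), 2.0),
    (("123", 0), -1.25),
]
leverage = base_out_leverage(league)

# A hypothetical hitter: good with the bases empty, poor with them loaded.
hitter = [(("000", 0), 0.75), (("123", 0), -1.625)]
raw = sum(va for _, va in hitter)
adjusted = adjusted_value_added(hitter, leverage)
```

In the toy sample the bases-loaded state carries more than four times the leverage of the bases-empty state, so the hitter's one bad bases-loaded plate appearance counts for far less after the adjustment than it does in his raw Value Added.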
There is still a substantial relationship between linear weights runs and adjusted Value Added: most players won’t see a drastic change. But some will. Take Robinson Cano, for instance. In 2009, he had a pretty good offensive season by context-neutral stats, batting .320/.352/.520 in 674 plate appearances. That’s good for 25.5 runs above average. But Cano had some pretty pronounced splits that season, hitting .376/.407/.609 with the bases empty but .255/.288/.415 with men on. Cano’s “clutch” performance was so bad, he ended up being worth -1.6 adjusted Value Added runs, worse than the average hitter despite hitting well above average for the season on the whole.
It comes down to what we want to measure: do we want the context-neutral runs, which say Cano was a superb hitter in 2009? Or do we want the clutch-based Value Added runs, which say he was below average? Or should we present both, and let readers decide which they prefer? Here’s your chance to weigh in. We’re not taking a poll; it’s not an election, per se. But we’ll listen to arguments on both sides, and I promise you that this isn’t a trick question.
This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.
When I am playing fantasy, I really do not care how good the player is in theory. If he is assigned high-leverage situations, I don't want that sorted out. After all, I seek out players who are placed in high-leverage situations.
If I am asking who the best ballplayers are, who makes the most of the opportunities, then I want the deleveraged values. Comparing second basemen who bat second on one team and sixth on another really requires deleveraging.
1) How difficult is it to separate a context-neutral offensive value from a context-dependent offensive value? Would pitching and defense be given the same considerations? Should they?
2) How long would it be expected to take for a player's context neutral performance and his clutch performance to stabilize? Can the metrics be developed to account for that?
My own personal preference would be the context-neutral value, for two reasons. One reason is that I think the context-dependent value can be so different from the overall stats (such as the Cano example) that it will be difficult to sell to most fans without extensive explanation.
But my main reasoning is this: When I look at a stat like WARP, my brain translates that into "How good was this player?" And when I think about how good a player is, I tend to think of the aggregate of what he's done. In that perspective, the context and timing of each discrete event is essentially random. In other words, unless there is reason to believe that "context-based hitting" is a skill, my thought is that a context-based pattern is just noise.
An analogy might be that you are going to flip a coin ten times and you get 1 point for each tail in the first five, but 2 points for each in the second five. If your exercise gives you five heads and five tails, I want the probability described as 50%, even if tails actually came up in 4 of the last 5 tosses and thus was actually "worth" more.
Again, this is assuming that player ability doesn't actually change based on context. I guess in a perfect world the stat could combine both factors (especially if we learn that contextual performance really IS a skill and consistent over time). Maybe something like WARP = xx.xx (y), where xx.xx is the value based on the aggregate and (y) is a rating/grade/classification/number that tells us how much his "real" (context-based) value differed from his aggregate value.
You're absolutely right, there's a segment of people who will find it very hard to accept a WARP that isn't context-neutral. Right now, though, we have a segment who finds it very hard to accept a WARP that is. There's a lot of people who insist that what players do in the clutch is important, even if it isn't repeatable. If it was a matter simply of truth, it'd be easy to resolve this dispute. It's a matter of what one values, though, which is much harder to resolve through sheer force of argument.
***
When I poll readers on my blog, there's an equal divide. Some are interested just in what happened at the time it happened, and others are just interested in the outcomes as if they happened in a vacuum.
Basically, we always need to have multiple versions of whatever metric you create, because everyone is coming to the table looking for answers to different questions. And we're not able to find "common" questions enough to allow us to present just one version.
Whatever improvements there are to be made to WARP or RE24/boLI, they are telling us fundamentally different things. Usually they'll broadly agree. Sometimes they won't. When we see them diverge, we learn something interesting and important about the performance of a player.
Why muddle the signals?
Colin treats them as statistical (variation from random sampling) but they're not statistical. They're the result of intentionally deciding on a context neutral stat (linear weights) then changing his mind. In this case what the "error" term says is, "I'm giving you the linear weights value, but if I wanted to give you the RE24 instead, it's likely to make a difference in a range of X runs, although I could have just gone ahead and given you the RE24 value."
If Colin is using samples to estimate a mean and thus estimate a standard deviation from the sample means, he is using standard error correctly.
Standard error (of the mean) is a description of the uncertainty as to what the true linear weights average value is, because we have a finite sample. It would be .456 divided by the square root of the number of doubles in the sample.
And please, context neutral. That's the best way of looking at a sample.