Reworking WARP: The Uncertainty of Offense, Part Two

September 5, 2013

Last week, we talked a bit about measuring the uncertainty in our estimates of offense. I hinted at having a few additional ideas on quantifying the uncertainty involved. Let’s examine two different routes we could take, both of which would offer less uncertainty than what we quantified last week.

When we did our estimates of uncertainty last week, we compared the linear weights value of an event to the actual change in run expectancy, given the base-out states before and after the event. What we can do instead is prepare linear weights values by base-out state and find the standard error of those instead. Looking at official events:

NAME	LWTS	STDERR
Out	-0.246	0.184
2B	0.724	0.179
1B	0.443	0.164
3B	1.010	0.065
K	-0.261	0.021
NIBB	0.296	0.000
HR	1.398	0.000
HBP	0.315	0.000
IBB	0.174	0.000

The high standard error of the out is somewhat misleading: it includes things like reaching on error, which we separate out in our current linear weights implementation. But here, the source of error comes from the potential differences in baserunner advancement. It makes a certain kind of sense. An Adam Dunn double and a Juan Pierre double present different opportunities for a runner at first to advance, for instance. (A Juan Pierre double is probably closer to an Adam Dunn single, and an Adam Dunn double is probably a good chance at a triple for Juan Pierre.)
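One way to read the calculation behind the table is sketched below in Python. This is an interpretation, not the exact procedure used for the table: it assumes the "standard error" is the root-mean-square gap between each play's actual change in run expectancy and the average change for that event in that starting base-out state. The function name `spread_within_state` and the three-field play records are hypothetical, and the sample plays are made up.

```python
from collections import defaultdict
from math import sqrt

def spread_within_state(plays):
    """plays: iterable of (event, state, re_change) tuples.

    Returns {event: RMS deviation of re_change from the mean for that
    event in that starting base-out state}. Assumed reading of the
    article's method, not a confirmed implementation.
    """
    by_cell = defaultdict(list)
    for event, state, change in plays:
        by_cell[(event, state)].append(change)

    squared = defaultdict(float)
    count = defaultdict(int)
    for (event, _state), changes in by_cell.items():
        cell_mean = sum(changes) / len(changes)  # state-specific linear weight
        squared[event] += sum((c - cell_mean) ** 2 for c in changes)
        count[event] += len(changes)
    return {event: sqrt(squared[event] / count[event]) for event in squared}

# Made-up plays: a home run from a given state always ends the same way,
# while a double with a runner on first depends on how far the runner gets.
toy_plays = [
    ("HR", ("000", 0), 1.0),
    ("HR", ("000", 0), 1.0),
    ("2B", ("100", 0), 0.9),
    ("2B", ("100", 0), 1.5),
]
spread = spread_within_state(toy_plays)
print(spread)
```

Under this reading, events with only one possible outcome per base-out state (home runs, walks, hit batsmen) show no spread at all, which matches the zeros in the table.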

So now you’ve reduced your estimated error without changing your run estimates! Congratulations. The downside is that you’re now measuring your error against something that I suspect most people have a hard time understanding. You’re getting pretty far into the weeds of hypothetical runs, rather than measuring against a good proxy for actual runs, like what we did last week.

Another thing we can do is look at the change in run expectancy for each event. This isn’t a particularly new idea (Gary R. Skoog came up with it in 1987, calling it the Value Added approach[i]), although it hasn’t been especially popular because of the play-by-play data needed to compute it. Let’s pull up the same run expectancy chart we used last week:

RUNNERS	0 OUTS	1 OUT	2 OUTS
000	0.489	0.263	0.101
100	0.858	0.512	0.221
020	1.073	0.655	0.319
003	1.308	0.898	0.363
120	1.442	0.904	0.439
103	1.677	1.146	0.484
023	1.893	1.290	0.581
123	2.262	1.538	0.702

To get the Value Added for a plate appearance, you take the runs scored on the play, add the ending run expectancy, and subtract the starting run expectancy. So a bases loaded home run with no outs would have an ending run expectancy of .489, plus runs scored of 4, minus 2.262: a Value Added of 2.227. A home run with the bases empty and two outs has the same run expectancy at the beginning and the end, so you end up with a Value Added of only 1.
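The arithmetic above can be sketched in a few lines of Python. The run expectancy values are copied from the chart; the names `RUN_EXPECTANCY`, `run_expectancy`, and `value_added` are hypothetical, and treating a three-out state as worth zero runs is an assumption added so that inning-ending plays work.

```python
# Run expectancy by runners on base (rows) and outs (0, 1, 2), from the chart.
RUN_EXPECTANCY = {
    "000": (0.489, 0.263, 0.101),
    "100": (0.858, 0.512, 0.221),
    "020": (1.073, 0.655, 0.319),
    "003": (1.308, 0.898, 0.363),
    "120": (1.442, 0.904, 0.439),
    "103": (1.677, 1.146, 0.484),
    "023": (1.893, 1.290, 0.581),
    "123": (2.262, 1.538, 0.702),
}

def run_expectancy(runners, outs):
    """Expected runs for the rest of the inning (assumed 0 after three outs)."""
    if outs >= 3:
        return 0.0
    return RUN_EXPECTANCY[runners][outs]

def value_added(start_runners, start_outs, end_runners, end_outs, runs_scored):
    """Runs scored on the play, plus ending RE, minus starting RE."""
    return (runs_scored
            + run_expectancy(end_runners, end_outs)
            - run_expectancy(start_runners, start_outs))

# The two home runs from the text.
grand_slam = value_added("123", 0, "000", 0, 4)
two_out_solo = value_added("000", 2, "000", 2, 1)
print(round(grand_slam, 3), round(two_out_solo, 3))
```

Summing `value_added` over every plate appearance for a hitter would give his Value Added runs for the season, assuming the play-by-play data supplies the before and after states.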

This approach, needless to say, does a much better job of reconciling with actual runs scored than the linear weights approach. It also comes closer to measuring performance “in the clutch,” although it ignores inning and run differential (we’ll talk about that at a later date). So why not use it instead of linear weights? The data needed to power it is readily available now, at least for the modern era, as is the computing power required to accomplish it.

The issue, if you’ll remember back to our goals laid out in the first week, is that we want to avoid over-crediting a player for the accomplishments of his teammates. A player is not directly responsible for the base-out states he comes to bat in; that’s the product of the hitters ahead of him in the lineup. But there’s a substantial relationship between the average absolute Value Added of the situations a player comes to bat in, and the difference between his linear weights runs and his Value Added runs (per PA).

In other words, a player’s Value Added is driven in part by the quality of opportunities he has, not simply what he does with them. The ability for a player to impact plays is greater in some situations than others, and players who get to bat in those situations more than the typical player will have more of a chance to accrue Value Added.

But we can adjust for this. Using a variation on Leverage Index called base-out Leverage Index, we can adjust the Value Added run values for the mix of base-out situations a player comes up in. We take the average absolute change in run expectancy in each base-out state and compare that to the average absolute change in run expectancy in all situations, so that 1 is an average situation and higher values mean more possible change in run expectancy. Then we divide the Value Added values by the leverage to produce an adjusted set of values that reflects a player’s value in the clutch without penalizing or rewarding a player based on the quality of his opportunities.
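The adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production calculation: `base_out_leverage` and `adjusted_value_added` are hypothetical names, each play is assumed to be a (starting base-out state, Value Added) pair, and the four sample plays are made up.

```python
from collections import defaultdict

def base_out_leverage(plays):
    """plays: iterable of (state, value_added) pairs for the whole league.

    Returns {state: leverage}, where leverage is the average absolute
    Value Added in that state divided by the average across all plays.
    """
    total_abs = defaultdict(float)
    count = defaultdict(int)
    for state, va in plays:
        total_abs[state] += abs(va)
        count[state] += 1
    overall = sum(total_abs.values()) / sum(count.values())
    return {s: (total_abs[s] / count[s]) / overall for s in total_abs}

def adjusted_value_added(player_plays, leverage):
    """Divide each play's Value Added by the leverage of its starting state."""
    return sum(va / leverage[state] for state, va in player_plays)

# Made-up league sample: bases loaded swings run expectancy more than bases empty.
league = [
    (("000", 0), 0.4),
    (("000", 0), -0.2),
    (("123", 0), 1.2),
    (("123", 0), -0.6),
]
li = base_out_leverage(league)
adjusted = adjusted_value_added(league, li)
print(li, adjusted)
```

Because the overall average is taken across all plays, a situation with leverage 1 is the average plate appearance weighted by how often each base-out state actually occurs.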

There is still a substantial relationship between linear weights runs and adjusted Value Added—most players won’t see a drastic change. But some will. Take Robinson Cano, for instance. In 2009, he had a pretty good offensive season by context-neutral stats, batting .320/.352/.520 in 674 plate appearances. That’s good for 25.5 runs above average. But Cano had some pretty pronounced splits that season, hitting .376/.407/.609 with the bases empty but .255/.288/.415 with men on. Cano’s “clutch” performance was so bad, he ended up being worth -1.6 adjusted Value Added runs, worse than the average hitter despite hitting well above average for the season on the whole.

It comes down to what we want to measure—do we want the context-neutral runs, which say Cano was a superb hitter in 2009? Or do we want the clutch-based Value Added runs, which say he was below average? Or should we present both, and let readers decide which they prefer? Here’s your chance to weigh in. We’re not taking a poll—it’s not an election, per se. But we’ll listen to arguments on both sides, and I promise you that this isn’t a trick question.

[i] Some analysts, such as Tom Tango, refer to this as RE24.


Colin Wyers


ErikBFlom

9/05

As usual, the answer is... it depends.

When I am playing fantasy, I really do not care how good the player is in theory. If he is assigned high leverage situations, I don't want that sorted out. After all, I seek out players who are placed in high-leverage situations.

If I am asking who the best ballplayers are, who makes the most of the opportunities, then I want the de-leveraged values. Comparing second basemen who bat second on one team and sixth on another really requires de-leveraging.

Reply to ErikBFlom

Mooser

9/05

So RE24/boLI?

Reply to Mooser

TangoTiger1

9/05

That's correct. I expand further on my blog if you are interested.

Reply to TangoTiger1

BurrRutledge

9/05

Thanks for the opportunity to peek behind the curtain, Colin. I have more questions than opinions at this point.

1) How difficult is it to separate a context-neutral offensive value from a context-dependent offensive value? Would pitching and defense be given the same considerations? Should they?

2) How long would it be expected to take for a player's context neutral performance and his clutch performance to stabilize? Can the metrics be developed to account for that?

Reply to BurrRutledge

pjbenedict

9/05

Regarding linear weights and adjusted Value Added: Which has been a better predictor? I realize that's not the point of WARP, but wouldn't the answer to that question point to which measures a stable attribute more effectively?

Reply to pjbenedict

cwyers

9/05

Linear weights, almost assuredly. As you note, though, that's not the point of WARP.

Reply to cwyers

Mooser

9/05

How many years of Value Added runs (or PAs) do you need in order for Value Added runs to become just as good an indicator as Linear Weights? If it's only like 3 years (rather than, say, 10), I would like to see a 1-year WARP with Linear Weights and a 3-year WARP for Value Added Runs. Anything more than a 3-year WARP is likely useless in terms of determining True Talent, as then you get into aging factors, etc.

Reply to Mooser

cmaczkow

9/05

I think this is a fascinating question, because it really boils down to "What do you want the stat to actually indicate?"

My own personal preference would be the context-neutral value, for two reasons. One reason is that I think the context-dependent value can be so different from the overall stats (such as the Cano example) that it will be difficult to sell to most fans without extensive explanation.

But my main reasoning is this: When I look at a stat like WARP, my brain translates that into "How good was this player?" And when I think about how good a player is, I tend to think of the aggregate of what he's done. In that perspective, the context and timing of each discrete event is essentially random. In other words, unless there is reason to believe that "context-based hitting" is a skill, my thought is that a context-based pattern is just noise.

An analogy might be that you are going to flip a coin ten times and you get 1 point for each tail in the first five, but 2 points for each in the second five. If your exercise gives you five heads and five tails, I want the probability described as 50%, even if tails actually came up in 4 of the last 5 tosses and thus was actually "worth" more.

Again, this is assuming that player ability doesn't actually change based on context. I guess in a perfect world the stat could combine both factors (especially if we learn that contextual performance really IS a skill and consistent over time). Maybe something like WARP = xx.xx (y), where xx.xx is the value based on the aggregate and (y) is a rating/grade/classification/number that tells us how much his "real" (context-based) value differed from his aggregate value.

Reply to cmaczkow

cwyers

9/05

Well, as I am fond of saying (and if you play the Colin Wyers Drinking Game, this is a Babylon 5 quote, so take a shot), Understanding is a three-edged sword -- your side, their side, and the truth.

You're absolutely right, there's a segment of people who will find it very hard to accept a WARP that isn't context-neutral. Right now, though, we have a segment who finds it very hard to accept a WARP that is. There are a lot of people who insist that what players do in the clutch is important, even if it isn't repeatable. If it were simply a matter of truth, it'd be easy to resolve this dispute. It's a matter of what one values, though, which is much harder to resolve through sheer force of argument.

Reply to cwyers

TangoTiger1

9/05

I love your coin analogy.

***

When I poll readers on my blog, there's an equal divide. Some are interested just in what happened at the time it happened, and others are just interested in the outcomes as if they happened in a vacuum.

Basically, we always need to have multiple versions of whatever metric you create, because everyone is coming to the table looking for answers to different questions. And we're not able to find enough "common" questions to allow us to present just one version.

Reply to TangoTiger1

ncsaint

9/05

Why choose between them? It seems to me that the best reason to keep leverage out of WARP is that we already have RE24/boLI.

Whatever improvements there are to be made to WARP or RE24/boLI, they are telling us fundamentally different things. Usually they'll broadly agree. Sometimes they won't. When we see them diverge, we learn something interesting and important about the performance of a player.

Why muddle the signals?

Reply to ncsaint

Mooser

9/05

Because WARP includes defence and baserunning. That being said, if you're going to include non-context-neutral batting runs (RE24/boLI), then don't the defence and baserunning components need to include context as well? I suppose with play-by-play data we have the ability to do that, but I think it needs to be consistent.

Reply to Mooser

cwyers

9/05

As Mooser notes, WARP is much more than just offense -- the direct counterpart to adjusted Value Added isn't WARP, it's Batting Runs Above Average (or linear weights, more broadly speaking).

Reply to cwyers

tbunns

9/05

Does "Out" include double and triple plays?

Reply to tbunns

newsense

9/05

BRAA vs. RE24/boLI is a good (if irresolvable) discussion but the context here is the error bars Colin is attaching to BRAA/WARP.

Colin treats them as statistical (variation from random sampling) but they're not statistical. They're the result of intentionally deciding on a context-neutral stat (linear weights) and then changing his mind. In this case what the "error" term says is, "I'm giving you the linear weights value but if I wanted to give you the RE24 instead, it's likely to make a difference in a range of X runs, although I could have just gone ahead and given you the RE24 value."

Reply to newsense

TangoTiger1

9/05

Newsense is absolutely correct and explains it perfectly.

Reply to TangoTiger1

newsense

9/05

Once again, as I commented on the previous article, Colin is calling "standard errors" what are really "standard deviations".

Reply to newsense

therealn0d

9/06

I'm not certain you are correct here. Standard error and standard deviation aren't very different, unless you mean some other use of the term standard error.
If Colin is using samples to estimate a mean and thus estimate a standard deviation from the sample means, he is using standard error correctly.

Reply to therealn0d

newsense

9/06

Standard deviation is a description of the distribution. If the SD of a double is .456, we can estimate that 68% of doubles produced a change in run expectancy within .723 +/- .456.

Standard error (of the mean) is a description of the uncertainty as to what the true linear weights average value is, because we have a finite sample. It would be .456 divided by the square root of the number of doubles in the sample.
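The distinction can be checked with a few lines of Python. This is an illustrative sketch only: the sample of run expectancy changes is made up, and `sd_and_se` is a hypothetical helper rather than anything used in the article.

```python
from math import sqrt
from statistics import mean, stdev

def sd_and_se(changes):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    sd = stdev(changes)            # spread of individual plays
    se = sd / sqrt(len(changes))   # uncertainty in the estimated average
    return mean(changes), sd, se

# Made-up changes in run expectancy for four doubles.
sample = [0.2, 0.7, 1.2, 0.7]
avg, sd, se = sd_and_se(sample)
print(f"mean={avg:.3f} sd={sd:.3f} se={se:.3f}")
```

The standard deviation stays roughly the same however many doubles are sampled, while the standard error shrinks as the sample grows.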

Reply to newsense

mattidell

9/05

I would appreciate it if you presented any number of different statistics, as long as they answer essentially different questions.

Reply to mattidell

eliyahu

9/08

Late to the conversation here, but I have a separate question: If the Value-Add approach does not stabilize over time to align with the context neutral approach, am I to understand that clutch hitting is an actual skill that is predictive? Over a career, I would expect Cano to balance his poor performance in high leverage situations with outperformance. If, over an entire career, this does not balance out, do we really want to ignore it?

Reply to eliyahu

HPJoker

9/17

Could the extra value added be because of the pitcher's strategy with men on base? I could imagine that a home run would be more likely with the bases loaded than with the bases empty because the pitcher is being forced to throw strikes. I don't know if this is true or not, and I don't think B-R has MLB stats.

And please, context neutral. That's the best way of looking at a sample.

Reply to HPJoker
