
Previous Installments of Reworking WARP
The Series Ahead [8/21]

When I started working on a series about revising WARP, I didn’t expect to have much to say on the subject of offense. Measuring offense is probably the least controversial part of modern sabermetrics. So why start here? I have a few reasons:

  • It’s a good place to start, foundationally. The topic of run estimation covers a lot of tools that are useful in more up-for-debate areas.
  • The goal of this series is to be inquisitive; we shouldn’t just assume anything is right. We ought to test.
  • We tend to take the relatively low amount of measurement error on offense for granted, and so neglect the measurement error we do have.

So, we’ll math. But before we math, let’s talk a bit about how sabermetricians measure offense, as opposed to what I like to call “RBI logic.” Traditional accounting of baseball offense works on two basic principles:

  • If you get on base and eventually score, you are credited with a run scored.
  • If you drive in a runner (including yourself), you are credited with a run batted in.

Ignoring some pretty silly edge cases, this reconciles with team runs scored. The problem is that it’s such a binary model—either a runner scores or he doesn’t. With baseball, though, there are outcomes that can increase the probability of a runner scoring without driving him in immediately:

  • You can advance the runner, which makes him more likely to be driven in during a subsequent at-bat, and
  • You can avoid making an out, which—even if you do not advance the runner in doing so—gives additional batters behind you chances to drive him in.

So RBI logic does a very good job of reconciling to team runs, by sheer force of will, but it’s a poor reflection of the underlying run-scoring process. You end up crediting players for coming up in spots where runners are in scoring position, and ignoring the contributions of players who advance runners over. You also ignore the value of not making outs.

The foundation of most modern sabermetric analysis of run scoring is the run expectancy table. Here’s a sample table, derived from 2012 data:

RUNNERS    0 OUTS    1 OUT     2 OUTS
000        0.489     0.263     0.101
100        0.858     0.512     0.221
020        1.073     0.655     0.319
003        1.308     0.898     0.363
120        1.442     0.904     0.439
103        1.677     1.146     0.484
023        1.893     1.290     0.581
123        2.262     1.538     0.702

Top to bottom, the rows are the base states—each of the three digits stands for first, second, and third base, with a zero meaning that base is empty and the base's number meaning a runner is standing there (so "103" is runners on first and third). Left to right is the number of outs in the inning. (It's not explicitly listed on most run expectancy tables, but the three-out state is a special state in which runs expected goes to zero.) The table lists the average number of runs expected to score in the rest of the inning from that state—the lowest is the bases-empty, two-out state, at 0.101 runs expected, all the way up to the bases loaded with no outs, where 2.262 runs score on average.

What’s interesting isn’t so much the run expectancy itself, but the change in run expectancy between events. So let’s run through an example. Say you have runners on first and third, no outs. That’s a run expectancy of 1.677. Now, suppose the next hitter walks. That moves you to a bases loaded, no outs situation. That walk would be worth 0.585 runs—a pretty important walk. What if the hitter strikes out instead? That moves you into a first and third with one out situation, for a value of -0.531.
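
To make the bookkeeping concrete, here is a minimal sketch in Python of that walk-versus-strikeout example, using the 2012 run expectancy table above. The state encoding and the `re_change` helper are illustrative choices of mine, not actual BP code.

```python
# Run expectancy by base state, indexed by outs (0, 1, 2); 2012 values from
# the table above. Digits in the key mark which bases are occupied.
RUN_EXPECTANCY = {
    "000": (0.489, 0.263, 0.101),
    "100": (0.858, 0.512, 0.221),
    "020": (1.073, 0.655, 0.319),
    "003": (1.308, 0.898, 0.363),
    "120": (1.442, 0.904, 0.439),
    "103": (1.677, 1.146, 0.484),
    "023": (1.893, 1.290, 0.581),
    "123": (2.262, 1.538, 0.702),
}

def re_change(start, end, runs_scored=0):
    """Run value of a play: runs scored plus the change in run expectancy.

    `start` and `end` are (bases, outs) pairs; a three-out end state is
    worth zero expected runs.
    """
    start_re = RUN_EXPECTANCY[start[0]][start[1]]
    end_re = 0.0 if end[1] >= 3 else RUN_EXPECTANCY[end[0]][end[1]]
    return runs_scored + end_re - start_re

# The walk: first and third, nobody out -> bases loaded, nobody out.
print(re_change(("103", 0), ("123", 0)))   # 0.585
# The strikeout: first and third, nobody out -> first and third, one out.
print(re_change(("103", 0), ("103", 1)))   # -0.531
```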

We come up with the value of each event by looking at its average change in run expectancy—that's known as the event's linear weights value. Here's a set of linear weights values for official events in 2012:

Event    LWTS
HR        1.398
3B        1.008
2B        0.723
1B        0.443
HBP       0.314
IBB       0.174
NIBB      0.296
K        -0.261
Out      -0.246

We've separated the intentional walk from other walks. You'll note that a hit-by-pitch is worth more runs than a walk—pitchers tend to issue walks less often with first base occupied (where the free pass is most costly), while hit batters are spread more evenly across base-out situations. Shockingly, a home run is worth more than a triple, a triple is worth more than a double, and so on.
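
For readers who want to see the mechanics, here is a hedged sketch of how values like these could be derived: collect every instance of each event in the play-by-play record and average its change in run expectancy. The record format (dicts with "event" and "re_change" fields) is a hypothetical stand-in, not an actual Retrosheet or BP schema; the same pass also yields the standard deviation that shows up in the next table.

```python
# Assumed input: one dict per plate appearance, carrying the event type and
# the change in run expectancy it produced (runs scored included).
from collections import defaultdict
from statistics import mean, stdev

def linear_weights(plays):
    """Average run-expectancy change (and its spread) per event type."""
    by_event = defaultdict(list)
    for play in plays:
        by_event[play["event"]].append(play["re_change"])
    return {
        event: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for event, values in by_event.items()
    }

# Tiny illustration using the two plays from the earlier example:
sample = [{"event": "NIBB", "re_change": 0.585},
          {"event": "K", "re_change": -0.531}]
print(linear_weights(sample))  # {'NIBB': (0.585, 0.0), 'K': (-0.531, 0.0)}
```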

Now let’s look at the same table, but with one new piece of information—the standard deviation around that average change in run expectancy:

Event    LWTS     STDERR
HR        1.398    0.533
3B        1.008    0.520
2B        0.723    0.456
1B        0.443    0.327
Out      -0.261    0.187
HBP       0.314    0.183
NIBB      0.174    0.170
K        -0.246    0.147
IBB       0.296    0.071

There is a substantial correlation between the average run value of an event and its standard error, which shouldn't be surprising. It also tells us that the actual value of a player's offense is more uncertain the more he relies upon power—the value of a home run is more uncertain than that of a single, after all.

We need to get into a bit of gritty math stuff here before getting to the fun stuff. What you have to remember is that the standard deviation is simply the square root of the variance around the average. In order to combine standard deviations, you have to first square them, then combine them, then take the square root again. (In other words, variances add, not standard deviations.)
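
As a quick illustration of that rule (and assuming the individual error estimates are independent), combining standard errors looks like this; the helper name is mine:

```python
import math

def combined_stderr(*errors):
    """Standard error of a sum or difference of independent estimates."""
    return math.sqrt(sum(e * e for e in errors))  # variances add, then take the root

print(combined_stderr(6.6, 7.1))  # ~9.7, the Trout/Cabrera case below
```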

Now, here's a list of the top 10 players in batting runs above average (derived from linear weights) in 2012, along with the estimated error for each:

NAME                 BRAA    STDERR
Mike Trout           61.7    6.6
Buster Posey         49.7    6.5
Miguel Cabrera       49.2    7.1
Andrew McCutchen     48.7    6.7
Prince Fielder       44.3    6.7
Edwin Encarnacion    44.0    6.5
Robinson Cano        43.6    7.0
Ryan Braun           43.5    6.8
Joey Votto           43.0    5.6
Adrian Beltre        42.7    6.8

So the difference between Mike Trout and Miguel Cabrera in 2012 was 12.5 runs. The combined standard error for the two of them (remember, variances add) is 9.7. How confident are we that Trout was a better hitter (relative to average) than Cabrera in 2012? Divide the difference by the standard error and you get 1.3—that’s what’s known as a z-score. Look up a z-score of 1.3 in a z-chart, and you get .9032—in other words, roughly 90 percent. So there’s a 90 percent chance, given our estimates of runs and our estimates of error, that Trout was the better hitter. Now, we should emphasize that a 90 percent chance that he was means there’s a 10 percent chance that he wasn’t. What if we compare Posey to Beltre? That’s a difference of seven runs, which works out to a confidence level of 77 percent that Posey was the better hitter. What about comparing Braun to Votto? That’s a difference of just half a run between them—our confidence is only about 52 percent, essentially a coin flip between them.
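
Here is a short sketch that reproduces those comparisons from the published numbers: the difference in batting runs divided by the combined standard error gives the z-score, and the normal CDF turns that into a confidence level. The function is mine, written only to mirror the arithmetic above.

```python
import math

def prob_better(braa_a, se_a, braa_b, se_b):
    """Chance player A's true batting runs exceed player B's (normal approximation)."""
    diff = braa_a - braa_b
    combined = math.sqrt(se_a ** 2 + se_b ** 2)         # variances add
    z = diff / combined
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF

print(prob_better(61.7, 6.6, 49.2, 7.1))  # Trout vs. Cabrera, ~0.90
print(prob_better(49.7, 6.5, 42.7, 6.8))  # Posey vs. Beltre, ~0.77
print(prob_better(43.5, 6.8, 43.0, 5.6))  # Braun vs. Votto, ~0.52
```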

So what we have is a way to quantify the error in our measurement of run production, and then to apply a confidence interval to our estimates. For a full-time player (one qualified for the batting title, that is) the average standard error is roughly six runs. If you want to compare bad hitters to good hitters, sure, most of the time the difference between them far outstrips the measurement error. But if you want to compare good hitters to good hitters (which is frankly a lot more interesting, and probably a lot more common), then you'll often find yourself running into cases where the difference between them is close to, if not lower than, the uncertainty of your measurements.

So if we can quantify our measurement uncertainty, the next question we can ask is, is there a way to measure offense that’s subject to less measurement uncertainty? I have a handful of ideas on the subject, which we’ll take a look at next week.

philly604
8/28
I'm really enjoying this series and think it's a real step in the right direction.

I assume you'll get there in the end, but the thing that jumps out to me is that a 6 run uncertainty level on the supposedly easy hitting side of the player value equation has pretty big ramifications for player valuations.

At ~6M/win a 6 run approximation for hitting contributions would suggest a 3.6M level of uncertainty, right?

Seems like if this approach takes hold we may lose a lot of the knee jerk declarations that team X or Y is dumb for every signing (which would be a good thing).
cwyers
8/28
Yeah, 6 runs is roughly 2/3rds of a win, depending on the exact run environment you're in. And that's just for one WARP component. Once we get into stuff like replacement level and positional adjustment, we'll see that margin of error creep up. So yeah, if you start looking at WARP as an estimate rather than a point value, you get into a lot of... well, MAYBE this guy was better, rather than this guy WAS better.
jdeich
8/28
Fantastic and insightful work. It may be hard to communicate to wider audiences (as uncertainty is in science, medicine, economics, etc.), but it would keep analytically-minded analysts in check. Too often you'll hear borderline calls stated as absolutes, "Posey was a better hitter than McCutchen in 2012!", when they're within the intuitive measurement error.

This analysis should readily extend to baserunning. Much like hitting, you have discrete end states with a "responsible party" (assuming defense averages out over large N). Pitching follows similar logic.

Defense... I think we can differentiate the exceptional from the average, and the average from the abysmal, but finer distinctions are likely not yet reliable.

My personal guesstimate is that anything less than a gap of 1.0 to 1.5 WAR over a full season isn't significant. This may be an olive branch to the traditional community when it comes to awards. WAR becomes a tool to establish who makes up the "top tier", and then discussions of more qualitative factors can weigh in. (Such as the somewhat infamous "Cabrera moved to 3B to help his team!")
cwyers
8/28
I agree that uncertainty may be a hard sell in some spots, but I think there may actually be an audience for it outside of the people currently invested in sabermetrics. I think the stridency and certitude of some people who advocate for sabermetrics can be offputting, and I think an analytical approach that's explicit about measurement error and uncertainty could possibly interest, rather than repel, people who are tired of the current approach.
SkyKing162
8/28
So the variance for a player's batting runs is based on the population variance for each event's change in run expectancy, yes?

Why is that? If your actual goal is the exact change in run expectancy, calculate that directly. If you don't want to actually use run expectancy, why do we need to worry about the potential variation around the average linear weights values?

Do you want base-out context or not? Seems like you're taking an odd middle-ground here (or I'm missing something.)

--- ---

Tangential question: why use run expectancy variance, not win-expectancy variance?
cwyers
8/28
So the question is, why not use a Value Added approach, if your goal is to minimize error in predicting actual changes in RE?
SkyKing162
8/28
Yes. And if that's not your goal, why worry about the variance in the linear weights numbers?
TangoTiger1
8/28
Sky is correct.
Mooser
8/28
So are you suggesting just use RE24 instead of Linear Weights?
TangoTiger1
8/28
If the purpose is to track offensive impact by the base-out situation, then yes. That answers that question.

If the purpose is to assume that the event would have typically occurred in a typical situation, then no.

All you have to answer is this question: how much weight do you want to give a bases loaded walk, compared to a bases empty walk, if both occurred with two outs?

If you want to give the same value, then use standard linear weights, and both get around .3 runs.

If you want to give them different values, then go with RE24, where one walk gets exactly 1.0 runs and the other gets around .13 runs.

It's a personal choice. No wrong answer.
TheRedsMan
8/29
If we assume that players generally are able to tailor their production to circumstance, then this would suggest that RE24 will be a more accurate representation of the run value produced, but that Linear Weights provides a better estimate of talent/predictor of future run production. It's analogous to ERA vs. FIP in a way.
eliyahu
8/28
I agree with this, and have never understood why the Value Added approach never got more mainstream support. Shifts in run expectancy is an idea that most non-SABR inclined people can get their heads wrapped around. And why use linear weights when you can get the actual impact of a given play given the specific circumstance?

My instinct is that the lean for linear weights would come from a desire -- justified or not -- not to "punish" someone who comes up in a lower leverage situation. A single, after all, is a single, that thinking would go.

I think that approach oversimplifies what a single is. Circumstances do vary from AB to AB, and while some players may have more "Value Added" opportunity over the course of a season, that's a drawback to the methodology that I'm more comfortable living with. (Not really different, in principle, to a player who happens to play against tougher pitchers over a given year.)
TangoTiger1
8/28
Well, those on the leading edge need to support RE24 more. Google RE24 and you'll get some good articles. But, we need more people spreading the word.
jdouglass
8/28
I don't like the notion that LW 'punishes' a player. What it does is neither reward nor penalize players for circumstances that are beyond their control.
noonan
8/28
Wouldn't it be better to model the error inherent in run estimation based on the distributions we know for batting average, HR/FB, etc?
cwyers
8/28
We'll talk about this next week.
ncsaint
8/28
Doesn't that undercut one of the main functions of WAR -- judging what a player's performance would contribute to an average team?

Without the linear weights, you'd expect an identical line to come out to a higher WAR for a player whose team got runners on base more. So you're rewarding a player for playing on a good team, and, presumably, reducing the year-to-year correlation of WAR.
SkyKing162
8/28
This is a good point. Maybe do RE24 divided by average RE24_leverage for the season? So if you bat in important situations more often because of your team, we reduce your RE24 total? And vice versa?

Of course, lineup position comes into play here, too. If you're a #4 hitter, you get more high leverage PAs, whereas #9 hitters get fewer (I'm assuming). Now, some of that is tied to how good of a hitter you are. Better hitters deserve more important lineup spots and therefore slightly higher leverage situations. How to account for that?
TangoTiger1
8/28
This is handled by "base-out Leverage Index" (boLI), though in retrospect, I should have called it LI24.

Baseball Reference tracks it, and you will see that there is not much deviation.
ericmvan
8/30
What is most frustrating about the neglect of RE24 for hitters in favor of a context-neutral approach is that *we do the precise opposite for pitchers.* RA, once adjusted for inherited runners, is essentially a measure of pitchers' net delta Run Expectancy, with context. We should be using linear weights or Base Runs to calculate every pitcher's context-neutral RA, but nobody does that. Bill James' ERC is an attempt at an estimate, but we can do better.

In a perfect world, we have a context neutral run value, one that includes the base-out situation, and one that includes the inning / score situation by converting WPA back to equivalent changes in RE. For both hitters and pitchers. It's the middle figure that's "real" and not any kind of estimate; the other two numbers attempt to subtract context that we think may lack predictive value, and add further context that we also believe is non-predictive. But having all three measures (including some addenda that measure the contribution of leverage and opportunity alone) handy for everyone will allow us to address many interesting questions.
sahmed
8/30
While I appreciate you highlighting the difference between how we treat hitters and pitchers, there is a fundamental difference: Pitchers pitch to every batter and therefore have a direct hand in the entire base-out situation. Hitters, on the other hand, step into a base-out situation that has been determined for them. So there IS a fundamental difference here, not that I have my mind made up that we should take a context-dependent approach for pitchers and a context-neutral approach for hitters.
ericmvan
8/30
Yes, that explains why we do it the way we do. It doesn't explain the weird lack of interest that the sabermetric community seems to have for a full context-neutral pitching metric. Yes, we know that some of BABIP is luck, but we know that much of it isn't (for instance, on a start-by-start basis, most pitchers have a significant correlation between FIP and BABIP). What I want is that full context-neutral metric (essentially a pitcher's TAv allowed) and the same thing with league-average BABIP substituted, and better yet, with the smartest possible estimate of the pitcher's true BABIP skill.

(I might mention that every metric I've suggested I used to do while with the Red Sox, using my simplified (conceptually) / expanded (number of terms) version of Base Runs, so they are very doable! You do things like substitute league-average rates of runners out on base and passed balls. All very straightforward.)
sahmed
8/31
Good points. I actually always compile wOBA against for pitchers in my own databases—didn't realize it was not commonplace.
newsense
8/28
I think there is some confusion about standard deviation and standard error here.

Standard deviation is a description of the distribution. If the SD of a double is .456, we can estimate that about 68% of doubles produced a change in run expectancy within .723 +/- .456.

Standard error (of the mean) is a description of the uncertainty as to what the true linear weights average value is, because we have a finite sample. It would be .456 divided by the square root of the number of doubles in the sample.
ncsaint
8/28
So what we're talking about here is all deviation, right? Since you have thousands of doubles and home runs, the standard error would be tiny.

Perhaps pretty significant for 3Bs and IBBs, though.

Is that right?
newsense
8/28
I think so, but I think you mean NIBBs not IBBs.

The high variances in 3Bs and NIBBs make sense. On a triple, any men on base (potential runs) are all converted into actual runs, while if no one was on base, particularly with two outs, the certainty of scoring is much lower. In a similar way, the value of a walk is much higher if first base is occupied, and higher still if the bases are loaded; if first base is open with two outs, the value is much less.
ncsaint
8/28
Sorry, I wasn't being clear. I was double-checking that the numbers in the article are all talking about deviation, and then talking about 3Bs and IBBs in terms of error. That is, given the huge sample for everything else on the list, they would have tiny errors, even where the variance is high, but IBBs and 3Bs might have a significant error simply because there are few of them, with 3Bs being obviously much higher because of the higher variance.

Is that about right?
ncsaint
8/28
Stating that there is only a 90% chance Trout was the better hitter strikes me as framing these interesting results wrong. You are returning halfway to what you call 'RBI logic'. That is, you don't have the binary outcome problem -- you get credit for moving runners over and not making outs -- but you are going back to rewarding players for coming up in the right situations.

There's a case for that, obviously. (That's what WPA is for, no?) But if we stick to WAR-logic, I think the way to frame it is not that there is a 90% chance that Trout was better, but rather that Trout *was* better because he performed in a way that would lead to more runs 90% of the time.

Or possibly not quite. I would imagine the variance would still matter a bit even if you stick to WAR logic. I assume that the marginal value of runs in a single game decreases above some fairly low threshold. Scoring the seventh run does less for your chances, on average, than scoring the third. So it *might* be that, because of the lower variance, X runs above average made up mostly of walks and singles would be worth more than X runs above average made up mostly of 2Bs and HRs. (I think?)

So if you were able to take that into account, the values would be different, but the end result, by WAR logic, would still be that player X was better than player Y because his performance would add more runs most of the time, not that player X was probably better than player Y.
ScottBehson
8/28
Yay error variance!!!
a-nathan
8/29
I didn't have time to read all the comments, so please excuse me if this has been discussed already. When comparing BRAA for two players, say Trout and Miggy, you have simply added the variances of the two players, then taken the sqrt to get the standard error in their difference. That would be a correct procedure if the variances of the two players are uncorrelated. But are they uncorrelated? I claim no, since they are derived from the same linear weights table. What say you about that?
TangoTiger1
8/29
Alan: I think this is part of the confusion with what Colin is doing. He's really presuming that the RE24 model is the target model, and he's presuming that he's unaware of Trout and Cabrera's performance in the 24-base-out states, and so, that's what the "error" term is about.

Setting that aside, you are correct that if say all of Trout and Cabrera's hits were singles, and we were in fact doing an error term of the single the correct way (we'd end up with a value of something like .002), then their error terms (which at this point would be extremely tiny, less than 1 run) would move in the exact same direction.

But, that's not what Colin is doing.
a-nathan
8/29
So let's see if I understand: In Colin's table showing LWTS and STDERR, the STDERR is the standard deviation found from the variation (appropriately weighted perhaps?) among the 24 states contributing to the total. If he had used RE24, then there would be no variance (other than the tiny variance in the actual RE24 numbers, which are based on many contributing events and are therefore small). Am I getting it?
TangoTiger1
8/30
Perfect!

***

As for how "small" small is for RE24, in some states (bases empty 0 outs), it's very tiny. In other states, it's larger than you might think. In some league-years, you'll have say the man on 2B state have a HIGHER run value than the man on 3B state.

This is easily corrected by using Markov chains.
NathanAderhold
9/02
I just want to make sure I understand the discussion in the comments about LWTS vs. RE24. I'm still trying to wrap my head around the differences between the two.

As I understand it now, LWTS looks at the change in run expectancy of an event (e.g. 1B, IBB, etc.) for each base/out scenario, then takes the average to come up with a value that is used across the board.

RE24, on the other hand, looks at the change in run expectancy of an event for the specific base/out scenario, which it then plugs in as the value. No averaging involved.

Is this right?
TangoTiger1
9/02
Correct.

Or in other words, RE24 uses a chart similar to this:
http://www.tangotiger.net/lwtsrobo.html

While Linear weights uses a chart similar to what Colin showed above.