Reworking WARP: Why We Need Replacement Level

September 11, 2013

Last week, we talked about different ways to measure offensive performance. This week, let’s talk about baselines. The past few weeks have had a lot of math; this time I want to step back and talk some theory (although we’ll have a fair amount of math as well).

What’s funny is that sabermetrics is regarded as being about math first, when really the heart of the thing is theory. I and my fellow travelers have been accused of “ruining” the game with numbers. But from its earliest days, the spread of baseball was as much about newsprint filled with columns of numbers in agate type as it was about the stories written about the game. Numbers have always had an incredible power to tell us about the game of baseball, and that was as true in 1913 as it is in 2013. Scratch any columnist who talks about how stats are ruining the game and you can find a voluminous knowledge of the history of the game as told in numbers, of the records people hold and the records people haven’t managed to break.

But the relationship baseball fans, writers, and front office personnel have with numbers is changing. Traditional baseball stats are, by and large, an accounting of what happened on a baseball field. At their best, they count the fundamental events on the field: walks, strikeouts, home runs, and so on. At their worst, they are convoluted and frankly biased counts of things: runs batted in, earned or unearned runs, games “saved.” Modern stats, by contrast, attempt to relate what happened to the fundamental building blocks of baseball, runs and wins.[1]

So in general, with traditional baseball stats, you’re evaluated by either having lots of the things that are considered good, or fewer of the things that are considered bad. The statistics that have been inspired by sabermetrics, on the other hand, have largely been about evaluating a player inside of a context. What this means is that sabermetrics tends to evaluate things in marginal or relative terms, rather than in absolute terms.[2]

Now, you could do this on a case-by-case basis. If you want to compare Babe Ruth and Barry Bonds, you could take the stats of each, along with the complete stats for their respective leagues over their careers, and work from there. But you’d end up reinventing the wheel every time, for starters. And it doesn’t scale very well: once you start comparing three players, say, that approach starts to break down. And if you want to include Babe Ruth’s pitching stats as well, you pretty much have to throw up your hands and walk away. It quickly becomes apparent that having a common baseline makes a lot of things easier, and sometimes is what makes them possible at all.

The most notable baseline in use today is replacement level. Originally conceived of by Bill James, it has gained prominence through stats like VORP, WARP and the different WAR implementations at FanGraphs and Baseball-Reference. Though popular, it has its fair share of detractors. So let’s examine some of the alternatives, in hopes of seeing why using replacement level may be desirable and meaningful.

If you’ll indulge me for a second, let’s consider the 2012 Most Valuable Player voting. I know, I know, even I eventually tire of flogging such obviously deceased equines, but I promise to examine the issue narrowly without relitigating the larger point and then move on.

Let’s exclude defense, position, baserunning, any of that, and just focus on hitting, using linear weights values (we can cycle back around to Value Added at a later date). According to Batting Runs Above Average, Mike Trout was responsible for 61.7 runs above average. Miguel Cabrera, on the other hand, was responsible for 49.2 runs above average. That’s a rather significant gap of 12.5 runs between them.

There is the question of the disparity in plate appearances, though. Miguel Cabrera had 697 PAs, while Mike Trout had 639 PAs. In a sense, that’s already accounted for in our batting runs—Trout ostensibly just gets a zero for each of those 58 PA Cabrera has that he doesn’t. There’s a problem with this, though: the assumption that the PA differential between Trout and Cabrera would likely be soaked up by an average player. Let’s take a look at the distribution of True Average in 2012, weighted by plate appearances (excluding pitchers hitting):

It’s a close facsimile of a normal distribution, but it isn’t quite one. For one thing, it has a leftward or negative skewness to it; in other words, the tail to the left of the average is longer than the tail to the right. More significantly, it has a much higher peak (and is a bit narrower) than a normal distribution would predict, which is known as positive excess kurtosis. But still, we see a mean of .264, a median of .266, and a mode of .267. In other words, when you weight by plate appearances, what you end up with is average is average is average. Looking at this picture, you could quite readily believe that Trout’s missing PA were likely to be made up by an average hitter.

Things look quite a bit different when you stop weighting by plate appearances:

Now we see a mean of .237 and a median of .246. (The mode is unchanged, however.) What this means is that the average player is, counterintuitively, below average. This happens because a player’s offensive ability and his playing time are substantially correlated, at 0.49. In other words, because the better players get more playing time, they drag up the league averages. But if you pick players at random, you’re far more likely to find a player who’s below average than one who’s above average.
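To see how that correlation pulls the two averages apart, here is a minimal Python sketch. The seven (TAv, PA) pairs are invented for illustration and are not 2012 data; the only point is that when better hitters get more playing time, the PA-weighted mean sits above the per-player mean, and most players land below it.

```python
# Hypothetical (TAv, PA) pairs, invented purely for illustration; not 2012 data.
PLAYERS = [
    (0.300, 650),
    (0.280, 600),
    (0.265, 550),
    (0.250, 300),
    (0.230, 150),
    (0.210, 60),
    (0.180, 20),
]


def unweighted_mean(rows):
    """Average TAv where every player counts once."""
    return sum(tav for tav, _ in rows) / len(rows)


def pa_weighted_mean(rows):
    """Average TAv where every plate appearance counts once."""
    total_pa = sum(pa for _, pa in rows)
    return sum(tav * pa for tav, pa in rows) / total_pa


per_player = unweighted_mean(PLAYERS)
per_pa = pa_weighted_mean(PLAYERS)

# Count how many players fall short of the PA-weighted ("league") average.
below_league_average = sum(1 for tav, _ in PLAYERS if tav < per_pa)

print(f"per-player mean TAv: {per_player:.3f}")
print(f"per-PA mean TAv:     {per_pa:.3f}")
print(f"players below the per-PA mean: {below_league_average} of {len(PLAYERS)}")
```

In this toy pool the regulars dominate the per-PA figure, so picking a name at random is more likely to turn up a below-average hitter than an above-average one.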

In this particular case, we can simply look at the players who played before Trout was called up to see this illustrated. Before April 28th, the date Trout made his first appearance in 2012, Angels center fielders had 73 plate appearances and put up a True Average of .170, or a BRAA of -6.47. It may not be wholly fair to Trout to hold him accountable for just how bad his teammates were before he was called up, but nor is it fair to Cabrera to treat the PA disparity between them as if it were covered by wholly average hitters.

We can also look at the case of the essentially average hitter. Rickie Weeks had a .261 TAv over 677 PA, getting a BRAA of 0.3. Vinnie Rottino also had a TAv of .261, but only 39 PA, netting him a BRAA of 0.0. It strains credulity to the breaking point to suggest that Weeks was no more valuable than Rottino, but that’s exactly what using average performance as a baseline does.

Now let’s swing ’round to the other extreme and, instead of looking at runs above average, look at runs above zero, so that instead of everything summing to zero at the league level, it sums to the number of runs scored in the league. It certainly does narrow the gap between Trout and Cabrera, giving Trout 136.6 runs to Cabrera’s 130.8. But that method has problems of its own. Michael Young finally had a season so bad that the Rangers would actually consider trading him, putting up a .243 True Average. But because he managed to get to 651 PAs (owing more to his abilities at persuading Ron Washington than to his abilities as a player), he had 65.2 runs above zero, just a hair above the 64.7 runs above zero of Pablo Sandoval, who had an impressive .290 True Average but was limited to 442 plate appearances. If Pablo Sandoval had a .144 True Average over another 209 PA, he would have brought himself to the same True Average as Young; the average pitcher had a TAv of .140 in 2012. Assuming that PA differentials are gobbled up by hitters no better than the typical pitcher (or worse) is no better than assuming they’re taken by players who are league average.
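The .144 figure can be checked with a quick sketch. It assumes TAv blends as a simple PA-weighted average, which is an approximation rather than a description of how TAv is actually computed; `tav_needed` is a hypothetical helper, and the inputs are the figures quoted above.

```python
def tav_needed(current_tav, current_pa, target_tav, target_pa):
    """TAv required over the extra PA so the PA-weighted blend hits target_tav.

    Assumption for this sketch: TAv combines as a plain PA-weighted average.
    """
    extra_pa = target_pa - current_pa
    if extra_pa <= 0:
        raise ValueError("target_pa must exceed current_pa")
    return (target_tav * target_pa - current_tav * current_pa) / extra_pa


# Figures quoted in the article: Sandoval .290 over 442 PA, Young .243 over 651 PA.
extra_pa = 651 - 442
filler_tav = tav_needed(0.290, 442, 0.243, 651)

print(f"extra PA: {extra_pa}")
print(f"TAv needed over those PA: {filler_tav:.3f}")
```

Under that assumption, the hitter implied by the runs-above-zero comparison is roughly as productive as the average 2012 pitcher at the plate.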

If baseball players all had the same amount of playing time, we could ignore these questions altogether. But they don’t, and neither the average nor the absolute runs scale is equipped to properly value playing time: one overrates it and one underrates it. And in figuring out what the correct answer is, we should probably divert some attention to figuring out why this is.
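One way to see the two scales as points on the same line is the sketch below. It treats any baseline as sitting some number of runs per PA below league average, so that a gap of zero gives runs above average and a gap equal to the league-average runs per PA gives runs above zero. The helper names are hypothetical; the league-average runs per PA is inferred from Trout’s two quoted figures rather than stated anywhere in this piece; and the 0.03 gap is a made-up in-between value, not a measured replacement level.

```python
def runs_above_baseline(braa, pa, gap_per_pa):
    """Runs above a baseline sitting gap_per_pa runs per PA below league average."""
    return braa + gap_per_pa * pa


# Figures quoted in the article (2012, hitting only).
TROUT = {"braa": 61.7, "pa": 639, "runs_above_zero": 136.6}
CABRERA = {"braa": 49.2, "pa": 697, "runs_above_zero": 130.8}

# Inferred, not quoted: league-average runs per PA implied by Trout's two figures.
AVG_RUNS_PER_PA = (TROUT["runs_above_zero"] - TROUT["braa"]) / TROUT["pa"]

# Assumption for illustration only; not a measured replacement level.
HYPOTHETICAL_GAP = 0.03


def trout_minus_cabrera(gap_per_pa):
    """Trout's lead over Cabrera, in runs, under a given baseline."""
    trout = runs_above_baseline(TROUT["braa"], TROUT["pa"], gap_per_pa)
    cabrera = runs_above_baseline(CABRERA["braa"], CABRERA["pa"], gap_per_pa)
    return trout - cabrera


for label, gap in [
    ("average baseline", 0.0),
    ("hypothetical in-between baseline", HYPOTHETICAL_GAP),
    ("zero baseline", AVG_RUNS_PER_PA),
]:
    print(f"{label}: Trout leads by {trout_minus_cabrera(gap):.1f} runs")
```

The lower the baseline, the more each of Cabrera’s 58 extra PA is worth, so the choice of baseline is really a choice about how much credit playing time deserves.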

Let’s do some back-of-the-envelope math. According to the 2010 U.S. Census, there were 41,688,289 males aged 20 to 39, which we’ll call our total population pool for major pro sports in America.[3] There were 1,284 people who played in Major League Baseball in 2012. Another 469 played in the National Basketball Association in the 2012-2013 season. And roughly 2,700 played in the National Football League in 2012. That’s something like 4,453 professional athletes in the three major sports playing at the highest level (and thanks to Bo Jackson’s retirement, we don’t have to worry about double-counting). In other words, professional athletes represent about 0.01% of the total talent pool available. The biggest pro sports leagues aren’t going to be able to capture the entire top hundredth of a percent, mind you—there’ll be some losses to the military, to hockey and soccer and the like, the Olympics, and maybe even some stockbrokers and car salesmen in the group. But those leagues invest an awful lot into talent scouting, and the incentives of becoming even a backup player in any of the major pro sports leagues in America are pretty compelling.
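For anyone who wants to check the envelope, the arithmetic is short enough to write out. The inputs are the figures quoted above (the NFL count is the rounded “roughly 2,700”); nothing else is assumed.

```python
# Population and league headcounts as quoted in the article.
MALES_20_TO_39 = 41_688_289
ATHLETES = {"MLB": 1_284, "NBA": 469, "NFL": 2_700}  # NFL figure is approximate

total_athletes = sum(ATHLETES.values())
share = total_athletes / MALES_20_TO_39

print(f"athletes in the three leagues: {total_athletes:,}")
print(f"share of the population pool:  {share:.4%}")
print(f"roughly one in {MALES_20_TO_39 / total_athletes:,.0f}")
```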

So what you see in the majors is the result of 30 teams investing significant money, time, and manpower into finding and training the top 500 or so baseball players out of an available population of over 40 million people. It’s an incredibly elite group—even the “bad” players in MLB are in the top tenth of a percent of baseball ability in the country. It’s hard to notice this sometimes, because everyone major leaguers are playing with and against is also an elite baseball talent, but it’s true. So this has two implications:

1) A truly average major-league hitter is not a commodity; it’s a lot closer to being a one-in-a-million event. The number of average and above major-league players is vastly outstripped by the number of below-average potential major-league players, even if you restrict yourself to the top one percent of baseball talent in the country.

2) Because of this, there are thousands of hitters who are below the major league average in talent but who are better than an automatic out in every plate appearance—there is never any need for a major league team to turn to a batter who has no chance whatsoever of getting on base.

One of the most common criticisms of replacement level is that it’s hypothetical. As a Boston Globe columnist put it:

This “replacement player” who constitutes the very linchpin of the entire premise is mythical. There is nothing measurable or precise about his existence. Yet supposedly intelligent people have signed off on this utterly bogus piece of baseball idiocy.

And no, there is not one single replacement-level player—he is a hypothetical amalgamation of many different players, just like our average player is, or our “absolute zero” all-outs batter is. But unlike the “absolute zero” batter, there really are replacement-level players in the majors. A lot of them, actually, if you know where to look.

Let’s return to our distribution of talent from earlier, but instead of focusing only on players in the majors, let’s look at everyone playing in the affiliated minors as well. To put everyone on a common scale, we’ll use TAv derived from major-league equivalencies, translations of stats that use minor-league production to estimate how a player would perform in the majors. (These are not wholly projections, although the translations used here are the same ones used in PECOTA for projecting the performance of minor leaguers.)

That’s well over one million plate appearances across all levels in 2012. (Pitchers batting are excluded.) Only 184,179 of those PA, or about 17 percent of the total, were in the majors. The average TAv across the whole pool is .219, well below the MLB average of .260 if you’re weighting by PA, but not too far off the MLB average if you don’t weight by PA.

What replacement level represents is the point on the talent distribution where the number of players available exceeds the amount of major league playing time available. Baseball talent is not perfectly fungible—if you need a second baseman, you can probably make do with the best freely-available player if he’s a shortstop, or maybe even if he’s a third baseman. But if he’s a first baseman, he’s no substitute for a second baseman unless you can contort the rest of your roster to fit. And the number of available players is curtailed by roster size limits, rules about transactions, and players under contract to other clubs. So while replacement level may be a much better reflection of how talent is distributed in baseball than the average, the inefficiencies in allocating talent make it trickier to measure.
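As a toy illustration of that definition, the sketch below hands out a fixed amount of major-league playing time to the best available hitters first and reports the talent level of the first player left without a job. It deliberately ignores positions, roster limits, transaction rules, and contracts, which are exactly the frictions described above, so it is not how replacement level is actually measured. The pool and the `replacement_level` helper are hypothetical.

```python
def replacement_level(pool, mlb_pa_available):
    """Talent of the best player left over once playing time runs out.

    pool: list of (talent, pa_capacity) pairs, one per candidate player.
    Returns None if there is more playing time than players to fill it.
    Simplification: any player can fill any plate appearance.
    """
    remaining = mlb_pa_available
    # Give playing time to the most talented players first.
    for talent, capacity in sorted(pool, key=lambda p: p[0], reverse=True):
        if remaining <= 0:
            return talent  # first player shut out of the majors
        remaining -= capacity
    return None


# Hypothetical pool: (TAv-like talent, PA the player could absorb).
POOL = [
    (0.300, 600),
    (0.280, 600),
    (0.260, 600),
    (0.240, 600),
    (0.230, 600),
    (0.225, 600),
    (0.220, 600),
]

print(replacement_level(POOL, 1800))
print(replacement_level(POOL, 2000))
```

Adding playing time pushes the threshold down the talent curve, and adding freely available players pushes it up; the real-world complications mostly determine how cleanly a team can reach the best player below the line.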

But the fact that something is hard to measure doesn’t make it any less important, or make it any less important to try to measure well. What we want to do is quantify the difficulty of measuring replacement level (that is, determine how well we can measure it) and try to do better. (If you’ve noticed a theme to the series by now, this is probably it.) So next week, we’re going to investigate replacement level, and we’re going to look at quantifying how well we’re measuring it.

[1] There is a somewhat more recent movement to a third kind of stats, stats which neither are content to simply count things nor attempt to measure contributions towards winning. Those are efforts to measure, for lack of a better term, player skills—things like plate discipline, swing rates, the rates at which a pitcher throws various pitches, etc. That’s rather beyond the scope of this article.

[2] Even so-called “absolute” linear weights runs don’t behave like pure counting stats, because outs have a negative value, and so at a very low point it’s possible to have negative production. Other efforts at absolute value stats, such as Win Shares, simply use a rather low baseline and treat performance above it as “absolute” value. The fact that the very few measures sabermetrics has that purport to be absolute value are in fact relative helps illustrate how relative value is a fundamental feature of sabermetric measures.

[3] It’s imprecise, as it excludes the ability of pro sports to pull from Central and South America, Canada, Europe and Asia. It also isn’t a perfect representation of the age distribution among pro sports, but it’s close enough. And we could model in the population growth since 2010, but it won’t matter enough to spend time on it. If anything, we’re underestimating the total talent pool available, which won’t undermine the point we’re making.

Thank you for reading

This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.


Colin Wyers


therealn0d

9/11

How did I know the quote from the Boston Globe column was Bob Ryan before clicking the link? Ryan just gets more ridiculous over time.


Johnston

9/12

I must disagree. I think that he has already maxed out his ridiculousness.


Dodger300

9/11

I am curious how it is determined how much of the mythical replacement player consists of offensive contributions vs. defensive contributions.

Wouldn't we expect different franchises to make different choices? For example, an organization might decide since the offensive contribution is going to be weak anyway, it will "punt" hitting and bring up a defensive wiz who normally would not have found playing time. But a different team might choose to balance offense and defense the best it could.

Isn't it possible, even likely, that this would create replacement players of significantly different value, depending on the organization? How are these subjective choices accounted for when establishing what a replacement level player can be expected to produce, on each side of the ball?


cwyers

9/11

I'm doubtful, and here's why. It's not like replacement players are bench players -- bench players are players you're counting upon needing and having, so I can see different teams having different philosophies about what they're looking for on their bench. A replacement player is a guy who you call upon because you need a guy -- there's an emergency, start the fire drill and get all hands on deck. So, maybe you want certain things out of your replacement, but realistically, you're going to take the best available guy.

Now, it's possible that replacement varies from organization to organization. It's actually an interesting notion. But I don't think the profile of replacement player is going to vary by team all that much. Which isn't to say that some replacements won't be all-glove, no-bat guys -- there's a lot of different ways to skin that cat. But I don't think it's a strategic choice on anyone's part.


Dodger300

9/12

Maybe you are correct that teams grab on to any port in a storm when they need a replacement.

But maybe a well run franchise plans ahead during the winter and stocks its AAA and AA teams with the sort of replacement players it prefers if the team should run into trouble during the season.

Regardless, it is clear that the formula for determining replacement level cannot help but be based on subjective assumptions, such as those in your comment above.

That is why those who point to the simple WAR or WARP numbers at the end of the season as eliminating any need for discussion regarding value (a la Trout v. Cabrera) are doing a great disservice to the cause.

Misusing the WAR number to serve as a purely objective measure, when it is not, only makes it more difficult to establish the validity of the measure, since others can often tear apart such "conclusions."


paulcl

9/11

Maybe this is just my eyesight, but the weighted by PA and unweighted graphs look identical. They have the same Chi-squared value too.


bornyank1

9/11

Fixed.


tbunns

9/11

Enjoyed the article. This was my favorite of the series by far.

One question/clarification - I just want to make sure I understand this correctly...in linear weights, there are values for the outcomes for the batter (single, double, etc.) and the value for each ends up being the average value of that outcome and the change of baserunners, outs, and runs on the field. So when we take a batter and try to find their 'value', the assumption is that an event (such as a single) is the same across batters even though some singles with a runner on first lead to first and second, some lead to first and third, and some lead to first and a run. Is that right?

One request - can we please stop saying RBIs, saves, etc. are biased stats...the stats are clearly defined and have no more error than any other baseball counting stat I believe. Perception of what these stats are telling us is biased by folks and I am pretty sure that is what you mean. So say that. For me, it detracts when you add little comments like this in the series.


bline24

9/11

Love this series. Keep them coming. Bravo.


cmaczkow

9/11

Thinking about the value added discussions from last week, do we have the data (and the computing power) to determine an *actual* replacement level for each player in the majors? I realize this is an exercise chock full of assumptions and questions. It is also an exercise that measures something completely beyond each player's control. But to get a true "value-added" stat it seems like it would be necessary.

So, if Cano circa 2009 only did his hitting with the bases empty, he might be the MVP if his primary backup was bad enough. (And if his primary backup is hurt for a month, maybe you have to assume his backup's backup is the replacement. Like I said, complicated and beyond the player's control, but a more accurate indicator of actual "value" if that is the way you are defining the question.)


cwyers

9/11

It gets a bit tricky, for three reasons:

1) If a guy has a really good or bad backup, it doesn't make sense to me to credit or debit him for it. You want to compare players on a common footing.

2) You'd have to identify a guy's backup, which is tough.

3) You don't want to know how well the backup played, but how well the backup WOULD have played if the primary player was out.

So I don't think that's an approach we'll be taking.


gweedoh565

9/12

I love this idea, though I guess this would be more from the perspective of a player's value to a specific team, rather than his context-neutral value.

In the described case, a player could be assigned 30 different values, one for each team, based on that team's current roster, which would be really interesting in terms of evaluating potential trade partners.


ScottBehson

9/11

This is great, and should be required reading for any sports journalist


ncooke

9/11

Maybe a dumb question, but if the problem with absolute runs is difference in playing time, what's the problem with absolute runs per plate appearance?


cwyers

9/11

That's what TAv is, under the hood, and you run into the same problem you run into with runs above average where an average guy in 650 PA is rated the same as an average guy in 25 PA.


NathanAderhold

9/11

Fantastic stuff, Colin. Loving this series.

I always had a general understanding that replacement level was meant to account for survivorship bias, but I never got the specifics of what made it better than average/zero.

Thanks for explaining that so clearly.
