This is something of a culmination of work I’ve been doing over the past few months—taking a menagerie of stats available here at Baseball Prospectus and merging them together under the heading of “Wins Above Replacement Level.” We’ve had WARP for quite a while—and its close sibling, VORP, as well—but it has been rather distinct from the rest of our offerings. That’s coming to an end.

We will, of course, still carry a number of baseball statistics not concerned with directly measuring a player’s value—there are a number of stats both descriptive and predictive that aren’t going away. But in terms of value measures, we’re going to consolidate down to one view.

And that view will not be identical to similar measures of wins above replacement found at other websites. It will obviously bear some similarities (all such measures should agree more often than they disagree, I would think), but agreement is not the goal of the enterprise. A while back, Rob Neyer quoted a letter that sums up a growing point of view:

Instead of competing on WAR, Tango, Forman, BP and anyone else relevant should try to come to some consensus. Maybe you can serve as the summit leader? Otherwise, the sabermetric viewpoint will drown in its own contradictions.

Let me tell you why I disagree—and let’s start off with the definition of sabermetrics proffered by Bill James, who after all coined the word to start with: "the search for objective knowledge about baseball." I think you can restate that definition simply as “baseball science,” or the study of baseball using the scientific method.

And what I’ve been coming to grips with is how little of the field of sabermetric endeavor that definition covers (not all of what follows is necessarily going to be science, although it is definitely informed by what I would term a scientific study of baseball). Once you know things, you still need to interpret them—facts in isolation do not necessarily carry meaning.

So in coming to a reckoning of a player’s value, science is only going to take us so far. And two reasonable people can disagree over the assumptions one starts out from. I’ve talked before about why I find the systemic approach valuable—in short, I find it more useful to start off with assumptions and move to conclusions than vice versa. But that doesn’t mean our assumptions are perfect.

I think it’s tempting for some people to make the perfect the enemy of the good—what cannot be done perfectly is dismissed outright. And when it comes to replacement level metrics, where that’s most tempting is in the definition of replacement itself.

Average Joes

The case against replacement level, I think, can be summed up like so:

  1. It is a hypothetical construct—one perfect embodiment of replacement level doesn’t exist,
  2. It is difficult to define—different people have different conceptions of replacement level, and
  3. It doesn’t convey any additional meaning.

The first two of those I feel are correct, but they don't detract from the usefulness of replacement level—because the third point is clearly wrong.

The most frequently cited alternative to replacement level is average. On the first point, it fails to improve upon replacement level—an “average” baseball player does not exist any more or less than a replacement player. It’s an abstraction, designed to help us visualize a player’s value relative to something else. It does improve on the second point, as “average” (if we take that to signify the arithmetic mean) has a consistent definition; in other words, everyone understands it to mean the same thing.

There is one question that average is deeply unsuited to answer, except circularly—what is the value of an average ballplayer? What it will tell you is simply this—the same as every other average ballplayer. Often this will serve, but just as often, it really won’t.

The biggest point of breakage is in determining the value of playing time. Again—average is equal to average, be it in 150 plate appearances or 650 plate appearances. But realistically, we know that the player who manages to sustain an average level of performance over 650 plate appearances is more valuable than the player who does so for only 150. It is possible that the 150 PA player could, given an additional 500 PA (assuming he was capable of playing an additional 500 PA at all), put up the same level of performance. But that's irrelevant to measuring what has happened.

Or take the example of a player who has produced 20 runs below average. That is, again, very different if it occurs in 150 PA or 650 PA—roughly the difference between a .150 TAv and a .230 TAv.
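As a rough illustration of that conversion, you can back out an implied TAv from runs above or below average under a simple linear assumption (runs above average ≈ (TAv − league TAv) × PA). This is a simplification of my own, not the actual TAv formula—the real conversion is nonlinear and grows cruder the further you get from average:

```python
def implied_tav(runs_above_avg, pa, lg_tav=0.260):
    """Crude linear estimate of TAv implied by a run total.

    Assumes each run above (or below) average per PA maps directly
    onto TAv points above (or below) the league average of .260.
    """
    return lg_tav + runs_above_avg / pa
```

At 650 PA this gives roughly .229, close to the .230 figure; at 150 PA the linear estimate comes out lower than the quoted .150, which is the approximation breaking down far from average rather than a disagreement about the point being made.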

And while above we extolled the virtues of the average player with more playing time, nobody would find a “true” .150 TAv hitter (that is to say, one who will hit .150 TAv regardless of playing time, not a hitter who hits .150 TAv over a cold streak) more valuable the more he plays—there’s an opportunity cost to deploying that hitter, in that he’s taking at-bats from a player who can do more to help his team. He’s actually hurting his team more the more he plays.

And this is why we find replacement level useful—we are trying to find the point at which a player starts to contribute to his team by playing more, as opposed to detracting from his team. And this is something that is difficult to measure, as the critics of replacement level say—but the difficulty doesn’t make it any less important for us to know.

Baseball fans have always known intuitively that there is such a point, of course—they even have a name for it, the Mendoza line. But it may be helpful to know why it exists. Consider the distribution of MLB batting performance, by TAv, in 2010 (with an idealized normal distribution fit to the data, for illustration purposes):

Distribution of True Average by plate appearances in 2010

What you see is the majority of plate appearances at the average of .260, with outliers becoming less common the further you get from the average. The normal distribution seems to be a very good (although not perfect) approximation for the data.

That is, again, in terms of plate appearances. If you consider players, on the other hand, it becomes a very different story. Looking at the area of the graph between 1 and 4 standard deviations above the average, we see that those plate appearances came from 78 players. Looking at 1 to 4 SDs below the average, we instead see 305 players.

What the graph seems to suggest, at first blush, is that below-average players are just as rare as above-average players. But this simply isn’t the case—below-average players are much, much more common than above-average players (and this isn’t even considering the number of those players available in a club’s minor-league system). What’s limited is the number of opportunities for below-average players—baseball teams have a limited amount of playing time available to them, and they strive mightily to make sure the lion’s share of that playing time goes to their better players.
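A toy simulation can illustrate the mechanism. Every number below is invented purely for illustration: if true talent is symmetric around the mean but teams steer playing time toward their better players, a block of below-average plate appearances must be spread across many more players than an equal block of above-average plate appearances:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical true-talent pool: symmetric around a .260 TAv
talent = rng.normal(0.260, 0.030, 2000)
# Assumption: playing time rises steeply with talent, capped at a full season
pa = np.clip((talent - 0.180) * 6000, 25, 700)

above = talent > 0.260
# Player counts on each side of average are roughly equal by construction...
players_above, players_below = above.sum(), (~above).sum()
# ...but PA per player is much higher on the above-average side, so any
# fixed quantity of below-average PA is spread over far more players.
pa_per_player_above = pa[above].mean()
pa_per_player_below = pa[~above].mean()
```

The asymmetry in the article's 78-versus-305 player counts falls out of exactly this kind of allocation, even before considering the reservoir of below-average talent in the minors.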

The break-even point (that is, our platonic “ideal” replacement level) is the point where the number of available players at that level of talent (note—at that level of talent) exceeds the available playing time. Where this gets tricky is that there are transactional costs to acquiring baseball players—making trades, adding players to the 40-man roster, etc.—that impose a limit on how flexible teams can be with their replacements. In other words, the supply of talent isn’t totally liquid. So the practical replacement level may be a bit lower than our platonic “ideal” replacement level—which is fine, as typically we define replacement level as it exists in practice, not in terms of the distribution of talent.

Defining replacement level

So in order to come up with a baseline of what a replacement player is, we need to define a population of replacement players and take the average of that. (This is important to note: 50 percent of our replacements will be above our baseline, and 50 percent will be below it. So in practice, we fully expect to see submarginal performance—that is, some level of performance below replacement.)

I had two main objectives in picking my replacement-level pool. On one hand, I wanted to make sure I was picking my pool of replacements independently of how well they performed—using a player's observed level of performance to set replacement level runs the risk of leading yourself around by the nose, steering you toward a lower replacement level than you really should find.

On the other hand, I wanted to avoid something like Nate Silver’s study of freely available talent, which relies on salary data that goes back only to the mid '80s. And I wanted to ensure that I had a sizable pool of replacements at which to look.

What I came upon was the notion of looking at a team’s opening day roster, and calling anyone who replaces a player on it, well, a replacement. Not finding a ready source of opening day rosters, I did my best to reconstruct estimates of every team’s opening day roster from the play-by-play accounts provided by Retrosheet.

Now, this may differ in theory from other definitions of replacement—these players obviously aren’t “freely-available” in all cases, in that some of them are either players returning from injury or top prospects. In practice, this matters very little.
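The classification itself is simple once the rosters are estimated. Here is a minimal sketch of the idea—the data structures are hypothetical stand-ins, since the actual Retrosheet play-by-play processing is considerably more involved:

```python
from datetime import date

def classify_replacements(first_appearance, opening_day):
    """Split a team-season's players into roster and replacements.

    first_appearance: dict mapping player id -> date of first appearance
    for the team. Anyone whose first appearance comes after opening day
    is counted as a replacement.
    """
    roster = {p for p, d in first_appearance.items() if d <= opening_day}
    replacements = {p for p, d in first_appearance.items() if d > opening_day}
    return roster, replacements
```

For example, a player first appearing on June 1 for a team that opened on April 5 lands in the replacement pool; the replacement baseline is then the (playing-time-weighted) average performance of that pool.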

What it does give us is a far more sensitive indicator of replacement level than we’ve had previously, allowing us to track changes in replacement level over time. Looking at replacement-level TAv+, by year:

Graph of replacement level over time.

These are not smooth values, so you can see the magnitude of the noise in the measurement (the values will be smoothed out when applied to what you see on the site). But you can see the replacement level shifting gradually over time. What this means is that any particular replacement metric’s baseline is set by the period of study; by taking a closer look, we can use a different replacement level for the '50s than the aughts, better suiting our metric to a wide range of time frames.

Combined with our positional adjustments, runs above average derived from linear weights, and park factors, we now have the elements necessary to compute a new VORP, one which meshes with TAv, our preferred rate stat for measuring offense. We can go further and combine that with our estimates of a player’s contributions on defense and baserunning (EqBRR and its component stats will be undergoing slight changes, mostly so that they draw upon the same run expectancy tables used to generate our offensive stats) to come up with something like what Nate used to call SuperVORP. Running those figures through Pythagenpat converts runs to wins, and we have WARP.
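The Pythagenpat step is compact enough to sketch directly. The exponent parameter 0.287 below is one commonly cited value, not necessarily the exact one used in WARP:

```python
def pythagenpat_wpct(runs_scored, runs_allowed, games):
    """Expected winning percentage, with a run-environment-sensitive exponent."""
    x = ((runs_scored + runs_allowed) / games) ** 0.287
    return runs_scored ** x / (runs_scored ** x + runs_allowed ** x)

def runs_to_wins(extra_runs, base_rs=750, base_ra=750, games=162):
    """Marginal wins from adding `extra_runs` of offense to an average team."""
    return (pythagenpat_wpct(base_rs + extra_runs, base_ra, games)
            - pythagenpat_wpct(base_rs, base_ra, games)) * games
```

In a typical modern run environment, `runs_to_wins(10)` comes out close to one win—the familiar ten-runs-per-win rule of thumb.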

What’s next

Readers rightly want to know when they’ll get to see the finished product. We are currently undergoing two big projects—obviously we’re busy at work putting together the latest edition of our annual, but we’re also busy behind the scenes rebuilding the website (well, I say we, but my involvement entirely consists of looking at what others are doing and saying, “Wow, that looks great”). The overhaul of our statistics is going hand-in-hand with those two efforts, and you should see the new stats in concert with the completion of those two projects.

You probably also want to know what we’re doing with WARP for pitching. We’ll get into that next week.

I understand the concept of WARP, but that doesn't mean I have to like it. WARP compares a player to a hypothetical other option, but that isn't how we generally compare players. We'll usually look for someone similar to compare said player to.

I would prefer to compare players to the average player. I understand a team cannot simply replace someone with an average player, but the 2003 Tigers proved you can't even replace a player with a replacement level player.

Of particular frustration are HoF discussions using WARP. In theory, an average player who played in the major leagues for an exceptionally long time could put up HoF-level WARP totals without ever being one of the best players at his position.
I don't really understand your objection here, Scott:

"WARP compares a player to a hypothetical other option, but that isn't how we generally compare players."

So how does this make average better? When a player has to be replaced, a team doesn't say "What do we have relative to the average?" The team says "What do we have?" An estimated replacement level does a better job of simulating real roster-staffing issues than an estimated average level. Colin refers to this specifically in this piece. So how again is using average better?

To your other point, where do you find those HoF discussions using WARP, other than at BP? See, at BP, Jay Jaffe is smart enough to understand that raw WARP totals alone are not a good way to determine who is and isn't a "HoF-quality" player. For this purpose Jay has devised the JAWS system, which works in a couple of different measures of peak value.

Where are they having HoF discussions using raw WARP totals? If you find these discussions, you should probably chime in regarding peak value, and then maybe those discussions will get better. None of this has anything to do with WARP's limitations, so could you be more specific as to what those are?
Average player is just as hypothetical as replacement player. Using replacement player as your baseline allows you to try to account for the fact that playing time is not uniformly distributed between above average players and below average players.

And how is an average player accumulating WARP any different from an average player accumulating hits or RBIs or Wins or whatever? Anybody using counting stats solely with no discussion of rate stats/peak performance is not having a very useful HoF discussion.
First, Colin, excellent work. Thank you.

The trend line seems very sensitive to the first year used in the study. If one takes out the three oldest data points, the slope of the trend line might be remarkably more positive. There's a noticeable dip in replacement level after the 1961 expansion. The talent pool seemingly gets worse right around 1968, the Year of the Pitcher, suggesting that the rule changes to correct pitching dominance might have made it possible for more individuals to compete at an MLB level.

Looking at this I'd be tempted to do one of two things:

1) Try to research further back, so that all post-World War Two seasons are considered; or

2) Start the generation of the trend line at a natural break in the level of competition, probably in an expansion year.

Again, though, thank you for a great article.
First off, I love the name flyingdutchman. Now, as to your points:

If you want to get really specific on how teams replace players you have to look at the marginal value of all options. That includes players they have available in their farm system, players they can trade for, and players they can sign as FA's. Obviously teams have different options depending on when the replacement will occur.

The marginal value of a replacement player is the level of production minus the amount of resources you have to provide to gain his services. I accept the fact that this is beyond the scope of WARP. I also accept the fact that teams will have below average players who are above replacement level due to the fact that about half of the players at any position will be below average and most of those players will be above replacement level.

Regarding HoF players, I'm familiar with JAWS. However, WARP comes up in HoF discussions outside of BP. I can recall it coming up in The Hardball Times. I don't recall peak value being discussed elsewhere, and to be clear, it isn't exactly the easiest thing to look up. I agree it matters, particularly given what we think about when we think of greatness. For the record, yes, I would vote for Bert Blyleven.

You would like me to be clear on what I feel are WARP's biggest limitations? Fine. They are twofold:

1. Crediting players for the difference between mediocrity and the proverbial kid in minors adds very little knowledge to the discussion, save for the acknowledgment that a player can be below average while still being useful and better than the other options.

2. The 2003 Detroit Tigers. I understand random variance in terms of both the performance level of various players and in terms of actual game results. Having said that, if a team that is making an effort to win every game can manage to play collectively below replacement level, it raises some questions I am too civilized to ask.

In any event I do respect what Colin is doing here. If you and others find it useful, I'll simply be happy for you and move on. :)
Regarding #2 and the 2003 Tigers, even though you acknowledge the potential effects of random variation in performance, I'm not sure if you give it enough credit. You've selected them because they were historically bad, but a small number of replacement-level teams might perform that poorly by random variation alone.

Furthermore, I'm not sure about your assumption that they were making an effort to win every game. Yes the players put on the field probably were, but I imagine there might have been young sub-replacement level players who were played to gain experience in the hopes of being above replacement-level in future years (off the top of my head without looking at numbers, Jeremy Bonderman was probably such a player). Finally, there may have been a lot of replacement-level or better players that should have been given playing time but were not due to mistakes by management.

So the fact that on occasion a team can have an *observed* performance a little below replacement-level is not too troubling.
I would disagree with point one - I think it also tells us, at least in some sense, what the value of being average is - or, put another way, the value of playing time for average-or-better players. It's useful in some cases to quantify, I think - like Joe Mauer's MVP candidacy in 2009, where he missed a good deal of playing time. Without something like replacement level, it's hard to put a value on what that missed playing time cost his team.

As for the Tigers... what I want to note is that when we measure replacement level, we measure how replacement players perform on average teams. If you fill out an entire team with replacement players, they are going to perform worse than the model would predict. A replacement-level hitter on an average team still gets an average number of RBI chances (in the aggregate, at least) and when he does manage to get on base, is being driven in by average hitters.

Okay, so he's probably being hidden in the eight-hole of the lineup, and is being driven in by the average hitting pitcher. There's nowhere to hide a replacement hitter on a team full of replacement hitters, though. And the batters in front of and behind him are other replacement level players. In that sense, what you get is a cascade effect - a catastrophic failure, in other words.

But while those teams are dramatic examples, they're not very useful when it comes to evaluating a player's value independent of his context. There just aren't that many really terrible teams out there.
I agree 100% that we need replacement level due to the playing time issue. As far as I'm concerned, the exact level can be set a number of places. I used "bench" as my level for Win Shares Above Bench (and posted my research at Baseball Graphs), and I think Keith Woolner did the same thing in one of the BPro books.

Is there a particular reason to believe that a replacement level equal to the "26th man" is better than another level?
I think you're being a little too hard on 'average' as a benchmark.

The goal of baseball teams is to win. You don't win by being better than replacement level; you win by being better than your opponent. In the aggregate, that means being better than average. A team that is not better than average is not going to win. A team that has a player at a given position who is better than replacement level, but worse than average, is being _hurt_ by that player in their pursuit of winning it all.

Replacement level is an interesting and useful concept for modeling the effects of talent scarcity. But it doesn't follow that being above replacement level has positive value to teams that want to win.

To put it another way: replacement level as a baseline tells you what you are getting from a player that the worst team isn't getting. Average as a baseline tells you what you are getting that your real competition aren't getting. It seems pretty clear to me that the latter is a more relevant way to measure contribution toward winning.
But replacement level tells you as much as average level does, and more. Targeting 2 WAR over a full season may be roughly equal to average--just adjust from there.

On the other hand, you lose something with the average baseline, as Colin explains. It gives no credit to a player who plays an entire season at an average level WHEN COMPARED TO a player who was above average but had only one plate appearance. That player actually helped his team reach the postseason more than the first one did.

If players all played the same amount of time, then I'd agree with you. But they don't.

Great article--crystallizes many points that have never been completely nailed down in my head.

Also love the replacement level graph, and how it shows a gradual increase in replacement level quality over time--apparently reflecting an increase in quality of competition (I think--replacement players are closer to the best players now than before, meaning more depth is available).

Have you looked at AL vs. NL separately since 2005 using this approach? The expectation is that replacement players perform better in the NL than the AL because of the league quality differences. I'd be interested to know if that holds.
I put in my two cents here, which I'll recopy here:

The need for the de facto replacement level

It doesn’t matter if you show a pitcher is a .650 pitcher, and another is a .500 pitcher, or you show them as +.250 and +.100 above some baseline. They are still in the same order, and separated by the same degree of difference.

What is missing is the playing time component. So, suppose you have the .650 pitcher with 81 innings (9 full games), and the .500 pitcher with 202.5 innings (22.5 full games). If you use the .400 baseline as the “zero” level, then these two pitchers have the same value: they will be paid by their teams the same amount. They both have 2.25 WAR (that is, +.250 x 9 = 2.25; +.100 x 22.5 = 2.25). They will both get paid about 9MM$. And you get that in the marketplace, if you think of Ben Sheets and an average pitcher.

We didn’t HAVE to have the .400 baseline as the comparison point. You could have broken it down into two steps: the marginal dollars over average, and what you are paying for average.

The above pitchers look like this:
.650 win%, 9 full games
= +.150 x 9 = 1.35wins over average

.500 win%, 22.5 full games
= 0 wins over average

Say for example that a pitcher is being paid 4MM$ for each win above average. So the first pitcher is getting 5.4MM$ over average, and the second one is zero.

Now, the question is: how much do you pay for average innings? Let’s say that we are paying 0.4MM$ for each complete game. So, the Ben Sheets guy, the 81 innings or 9 games, will get 3.6MM$ if he pitches average. Added to that is his 5.4MM$ for pitching above average, and he’s paid 9MM$.

The second guy is getting 0.4MM$ x 22.5 games = 9MM$. And with zero wins above average, he still gets to 9MM$ total.


All the replacement level does is let you merge the two steps into one.

If you look at JC’s book, he breaks it down into these two steps exactly like I’m doing it. Except he assigns only 1MM$ per marginal win (more or less), and then he has an insanely high value of MM$ per playing time. So, when you try to combine the two steps into one, he ends up with a de facto replacement level that is insanely low.

Therefore, by thinking in terms of replacement level, we are exposing exactly where the zero point is, the point where it doesn’t matter how many PA or IP you give, no team will pay you for that crappy level of talent. Indeed, this is EXACTLY how teams think. “He’s a .300 pitcher? Don’t even talk to me about him being a workhorse… he’s costing us wins.”

So, this is why replacement level is important: it reflects reality. That’s what economic theory is supposed to do, it has to reflect reality. And if it doesn’t, then it has to tell us why reality is wrong.

We don’t need replacement level. But, by having a replacement level, we are making our lives much easier, and we are ensuring we don’t end up with ridiculous results. Until you know exactly what you are doing, use replacement level.
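The two pricing schemes above can be checked directly; they agree whenever the per-game price for average innings equals (average − replacement) × the price per win, as it does with the figures used here (all dollar values are the hypothetical ones from this example, in MM$):

```python
def value_one_step(wpct, full_games, baseline=0.400, per_win=4.0):
    """Wins above the replacement baseline, priced in one step."""
    return (wpct - baseline) * full_games * per_win

def value_two_step(wpct, full_games, per_win=4.0, per_avg_game=0.4):
    """Wins above average, plus a separate price for average playing time."""
    return (wpct - 0.500) * full_games * per_win + full_games * per_avg_game
```

Both functions give 9MM$ for the .650 pitcher over 9 full games and for the .500 pitcher over 22.5 full games, which is the equivalence the example is built on.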
Replacement level is not just an economic concept, and I think that limiting it that way misses the bigger point. If you want to compare players with significantly different amounts of playing times, then you DEFINITELY need a baseline that is different than average.
But that value in terms of wins is analogous to the value in terms of dollars. Having Ben Sheets for 81 innings is equivalent, in impact of wins, or dollars, to an average pitcher with 200 innings.

You don't need to have the replacement concept to show that, since I showed that you can make the comparison without it. The replacement concept makes it clearer, certainly. It's just not a necessity.

As others have mentioned with NBA, do we really need to compare to the "13th man"? Or can we get there in different ways, especially when you consider "chaining"? It's not the 13th man that replaces the starter, but the 6th man. And his spot is taken by the 7th man, and so on.
I should also point out that studes uses "Wins Above Bench", arguing that the bench player is the replacement player (similar again, to the way we'd think of the NBA).

The only issue there is that the bench player is not paid the league minimum, so, that's not the true zero point.

However, given the salary paid to the bench player (say 750,000$), then it would be easy enough to extrapolate that to 400,000$, by reducing his win impact slightly downward.

Again, this goes to the idea of the de facto replacement-level player (the talent level at which a team will not pay for a player).
Tango, you're again insisting that replacement player is only an economic concept. I'm saying it's not. It's an important adjustment to any metric that compares players to average, when those players have significantly different amounts of playing time.
By the way, the average major league player paid the major league minimum actually performs at least as well, if not better than, bench level. We've discussed this before and I don't mean to shout out Colin here.
Dave, I've been looking all over, and the only thing I can find is this:

Is that still the way you do WSAB?
Colin, sorry. I should have posted the link.

The approach is still the same, but I have changed the percentages. In particular, I found that the old criticism of Win Shares is true--it undervalues starting pitchers--so their replacement level is lower. I think I wrote about the updated values in the Annuals and not on the web.

But the basic idea is the same: starters against next level of bench players. I didn't feel beholden to the "freely available talent" approach.
Well, I wouldn't be surprised. Virtually every first or second year player is going to be paid 400,000$ to 500,000$, and that includes all the great rookies and sophs.

In no way did I mean to suggest that MLB min = replacement level. I DID mean to suggest that a free agent paid the MLB min IS replacement level. That's two completely different things.
Yup, that's my point. If we want the replacement model to reflect economics and if we use your parameters, then we can only apply it to players with more than six years of experience, right?
I agree that chaining is an important factor, and one that ought to be discussed. That's why I'm not convinced that the "26th man" approach is best.

However, if you want to compare players with different amounts of playing time with just one number, then you will be limited by using an average baseline. Yes, you can make the comparison without it, if you throw in other stats and do the math. But why make people do that? It's like saying this:

"Hey, here's a stat, but you can't use it the way it is. Here's some other numbers that will help you. Do the math yourself."

Why not make it usable in the first place?
As someone who has done his part in creating wins above replacement (WAR), I'm all for doing the math myself, and then giving out the answer.

I'm trying to give a different angle to those people who don't buy into the replacement concept that we can still meet on Canal Street if they get off at Lafayette instead of Broadway. That two sane people would value (in wins or dollars) Ben Sheets expected to pitch 81 innings the same as an average starter expected to pitch 200 innings, without one guy's framework necessarily being better than the other's.
I've never postulated a baseball stat. No time like the present, as I sit here on my train during the evening commute.

I understand how playing time is the Achilles heel of 'average' as a baseline for stats derived like WAR/WARP/WARL. At the end of the day, they are essentially highfalutin counting stats.

I just don't understand why that should stop us from translating WAR/WARP/WARL into a rate stat. Then, at any given time, we can see how well any player, or group of players, is doing "per unit of time." Be it per game, per inning, etc.

For instance, once you have the rate stat, you can then create an 'average' for any position. Just take the mean performance of all players at any position, over that length of time, and you've now got your 'average' baseline. E.g., if the 30 left fielders who played in the first ten games contributed 5 WAR/WARP/WARL, then an average left fielder contributes 0.0167 WAR/WARP/WARL per game (that is, 5 WAR/300 games from that position).

If you pair the rate stat and the counting stat together as paired slash stats, you can see how well your guy is doing on a per game basis, and also compare to 'average'.

In the above example, the average left fielder is contributing 0.0167/0.167. If I've got a righty left fielder who I've been platooning, and he's contributing 0.02/0.04, then I can see exactly what his contribution has been.

Does this make any sense, or should I have just kept surfing and reading during my train ride tonight?
Realized as I walked home from the train station that it might have helped if I'd actually finished off the postulation with regard to 'average.' ...

If one is adamant that these numbers should be leveled to 'average' play, then you can zero out the 'average' to get a WAA / WAAP / WAAL if you so choose...

If you do that, then our platooning LFer is contributing 0.0033/0.0066 WAAg/WAA.

And since WAA and WAAg both use WAR as their base, they should always correlate.


Okay, so... what am I missing?
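The pairing proposed above can be written down directly. A sketch using the hypothetical numbers from this thread (an average-LF rate of 0.0167 WAR per game; the platoon player's two games are an assumed figure consistent with his 0.02 rate and 0.04 total):

```python
def slash_pair(war, games):
    """Return the (rate, counting) pair: WAR per game alongside total WAR."""
    return war / games, war

def waa(war, games, avg_rate):
    """Wins above average: zero out the positional average rate
    over the player's own games."""
    return war - avg_rate * games
```

So `slash_pair(0.04, 2)` gives the platoon player's 0.02/0.04 line, and zeroing out the average rate puts him slightly above average on the counting view as well.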
Burr, that's what I did/do with Win Shares at THT. Here's an example:

"WSP" refers to Win Shares Percentage, where .500 is average. It's a Win Shares rate stat. Yes, sometimes players are better than 1.000.
Got it, thank you.
You aren't missing anything. Ideally, we always want to express things on two dimensions: .600 win%, 30 decisions, or 18-12, etc. The problem is when we insist on expressing things as a single dimension. Do you show that as +3 wins above average? Do you show that as +6 wins above replacement?

If everyone has 30 decisions, then it won't matter. But, suppose you have someone who is .800 win%, 15 decisions, or 12-3? Do you show that as +4.5 wins above average? Do you show that as +6 wins above replacement?

Basically, in an ordered list, do you want to see 18-12 appear before, after, or tied with the guy who is 12-3?

(Note: I'm using W/L as a proxy for a pitcher's ERA or FIP or your favorite pitching stat.)
Have I ever mentioned how much the addition of a comment section has improved BP? These responses are almost as engaging as the article. Keep it up Colin, TT, and Dave.
It seems to me that replacement level wouldn't move around as much as average and so you get a more consistent point from which to start measuring. In Colin's chart, the replacement value moved from 71 to 77 over a 56-year period. That's a fairly small change over a long time.

I would expect that average level to fluctuate much more. I remember reading some older BP articles in which the writer commented that removing Barry Bonds from the dataset would cause the average to move. Bonds may have been an extreme case, but it illustrates how using average as the yardstick can be problematic - if the yardstick keeps changing size then it's hard to know just how valuable +X above average is. I want the value of a player that is +X above the baseline to be consistent.

The key advantage average does have is that it is more intuitive to understand, even if the average fluctuates. I know that above average is better than below average and that average has value. What I don't have the slightest feel for is how much better +X above replacement is compared to +Y above replacement. All I know is that replacement is not good. If I had a clue as to how much better average is compared to replacement, then I'd be more enlightened.