Here at Baseball Prospectus, we talk a lot about so-called “deserved” runs, hoping to improve upon the raw outcomes traditionally used to evaluate player contributions. We’ve discussed why we think certain metrics do a better job of that—such as Deserved Run Average (DRA) for pitchers or Called Strikes Above Average (CSAA) for catchers—and offered various benchmarks to substantiate those views.

What we haven’t done is pull together, all in one place, our thinking as to what the concept of credit being “deserved” really means, and how one might go about determining it. We have used the terms “descriptive,” “reliability,” and “predictive” to discuss different types of performance, but have not fully elaborated on why those three factors matter to us, or how they compare in importance to each other. Having had a few years to see these concepts in action, it seems like a good time to summarize what we have learned.

Our current views can be condensed into a series of principles. Some may seem obvious, even trivial, but they build upon each other to explain why certain things strike us as very important, while others barely interest us at all. All principles are discussed from the perspective of baseball, but they should also apply to basketball, hockey, and even non-sports contexts.

__Principle 1__: **The fundamental informational unit is one player season**

Teams are collections of players, and players have careers. We can focus on almost any excerpt of player value, from careers to multi-season stints, down to individual months or games. But the fundamental base measurement of a player’s contribution is seasonal. Most of baseball’s prestigious achievements are recognized on a seasonal basis, rather than by dynasty or by month, and most major player awards follow suit.

We raise this point first because it is fundamental, and because it drives our preference to avoid external or multi-seasonal information in assessing how a player performed during a given season. Our charge, as both we and baseball traditionally see it, is to derive the most information we can out of one season’s production, not to imagine what that performance might have been.

We say this because many traditional statistics known to be problematic over a season (like batting average and ERA) are actually quite informative over the course of a long career. A batter who consistently hits well and a pitcher who consistently pitches well should generally end up rating well in both. Our challenge is that we’d rather not wait for a full career to decide if a player is worth watching (or adding). We want to figure it out within a season, and preferably even sooner than that.

Typical leaderboard statistics often struggle to provide this information. Let’s discuss why.

__Principle 2__: **Typical leaderboards can be noisy and misleading**

On its face, your typical leaderboard is straightforward: the statistic(s) of choice is tabulated, typically through an average (for a rate, like batting average) or a sum (for a count, like runs batted in), and each player’s “achievements” can be directly compared to those of other players. But these leaderboards also have problems, some by design, and others caused by math. Our focus on “deserved” runs seeks to correct or at least minimize these problems.

The biggest problem is well known: official baseball scoring, as in many sports, is largely an all-or-nothing affair. A batter got a hit or he did not. The pitcher gave up a home run or he did not. The right fielder caught the ball or he did not. None of those events is properly credited in its entirety to any individual player. The batter managed to get the bat on the ball, but he didn’t control where the fielders were.^{[1]} The pitcher gave up a home run, but the triple-digit heat index played a part. The right fielder got the putout, but the smaller dimensions of the park made it easier.

And yet, that is all a traditional leaderboard can provide: a tabulation of *outcomes*. Not player *contributions*: play *outcomes*. This is misleading, because we implicitly interpret leaderboards as a summary of player *contributions*. People look at leaderboards to see “how good a batter has been this year” or “how well that guy is pitching.” Nobody reads a leaderboard saying, “Wow, that pitcher had great outcomes this year!” Traditional leaderboards thus get misinterpreted as a summary of actual player value, even though actual contributions are rarely presented.

The other problem is mathematical: while statistical averages are theoretically reasonable (the sample mean is, after all, the natural estimate of a distribution’s expectation), they can also be noisy and inaccurate. Perhaps this is obvious to you: if each play has some randomness within it, and virtually all of them do, then any attempt to aggregate a series of plays will also be noisy and somewhat inaccurate. Mathematically, the James-Stein estimator shows that raw averages tend to over-credit contributions unless we shrink them toward the overall group mean. The best answer still typically involves working off the average, but also requires being skeptical as to how different any player is from the overall group to which they belong.
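
The James-Stein idea can be illustrated in a few lines. This is a toy sketch, with made-up batting averages and an assumed sampling variance, using the variant that shrinks each raw average toward the group mean:

```python
import numpy as np

def james_stein_shrink(raw_means, sigma2):
    """Shrink each player's raw average toward the grand mean.

    raw_means: array of per-player raw averages (hypothetical data)
    sigma2: assumed sampling variance of each raw average
    """
    k = len(raw_means)
    grand = raw_means.mean()
    resid_ss = ((raw_means - grand) ** 2).sum()
    # Shrinkage factor; clipped at zero so we never shrink past the mean.
    # (k - 3 rather than k - 2 because we shrink toward an estimated mean.)
    c = max(0.0, 1.0 - (k - 3) * sigma2 / resid_ss)
    return grand + c * (raw_means - grand)

# Hypothetical batting averages for five players
raw = np.array([0.340, 0.310, 0.275, 0.250, 0.220])
shrunk = james_stein_shrink(raw, sigma2=0.001)
```

The shrunk estimates keep the players in the same order, but pull the extremes toward the group: the league leader is credited with somewhat less than his raw .340, the trailer with somewhat more than his raw .220.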

Traditional leaderboards do not report true player contributions: they summarize *outcomes* for which players are at best *mostly* responsible. If we want to discover a player’s actual contribution, we need to look *behind* the raw outcomes, and furthermore be careful how we average those contributions, lest we reintroduce noise we are trying to control.

__Principle 3__: **A player deserves credit for their most likely contribution to the outcomes of plays in which they participated**

Having defined our unit of interest (player seasons) and our problem (noisy, poorly-averaged raw statistics), we’re now able to define our preferred alternative. In short, we look to give players credit only for contributions they made. The problem is that it is impossible to measure with certainty any individual’s contribution on a particular play, as multiple players on different teams are typically involved.

What to do? The answer is to do the best we can and focus on the most likely (average) contribution that a player has made. Going forward, we will just refer to this concept as the **most likely contribution** with the understanding that, on a rate basis, contributions are an average expectation.

As with raw baseball statistics, much of the evidence comes through volume. The question is how best to make use of all those plays. Our strong preference is to put the plays and their various participants into a statistical regression, so that each player’s most likely contribution can be isolated from that of other players, while also isolating external factors outside the player’s control.

But just as important is our insistence that regressions express *skepticism* about the uniqueness of each player’s probable contribution, relative to others like them. There were over 1,000 batters during the 2017 season, but it is unlikely there are 1,000 different levels of batter quality. Most hitters are very good, and a solid plurality of them are basically average. Tools like ridge regression, multilevel modeling (which generally employs ridge penalties), and other forms of regularization are extremely useful if you wish to isolate player contributions. The specifications of these models can vary, depending on the quality of information, but the need to incorporate statistical skepticism should be constant.
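
As a sketch of that idea, here is a plain ridge regression over made-up play data, with indicator columns for each batter and pitcher. (This is an illustration of the technique, not BP’s actual model, which is a more elaborate multilevel specification.)

```python
import numpy as np

# Toy play-by-play data (hypothetical): each play has a batter id,
# a pitcher id, and a run-value outcome.
batters  = np.array([0, 0, 1, 1, 2, 2, 0, 1])
pitchers = np.array([0, 1, 0, 1, 0, 1, 1, 0])
runs     = np.array([0.4, 0.1, 0.0, -0.2, 0.5, 0.3, 0.2, -0.1])

# One indicator column per batter and per pitcher
n_b, n_p = batters.max() + 1, pitchers.max() + 1
X = np.zeros((len(runs), n_b + n_p))
X[np.arange(len(runs)), batters] = 1.0          # batter indicators
X[np.arange(len(runs)), n_b + pitchers] = 1.0   # pitcher indicators

# Ridge solution: (X'X + lam*I)^-1 X'y on centered outcomes.
# A larger lam means more skepticism and more shrinkage toward zero.
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                       X.T @ (runs - runs.mean()))

batter_contrib = beta[:n_b]   # each batter's estimated run contribution per play
```

Because every play credits its batter and pitcher simultaneously, the regression separates the two roles, and the penalty keeps any player with few plays from being credited with an extreme contribution.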

__Principle 4__: **A player’s most likely average contribution should not be confused with so-called “true talent” or projected future performance**

Instead of focusing on a player’s most likely average contribution, some speak of isolating a player’s “true talent.”

We dislike this term. People seem to confuse the term “talent” (which generally describes innate, but undeveloped potential) with “ability” or “skill.” Even then, it is wrong to assume that past player contributions (which we are trying to measure) merely summarize a player’s skill, because a) no play involves the full range of a player’s skills, and b) multiple skills are required to be successful in baseball.

That said, a player’s most likely contribution will certainly *reflect* their true skill, and their most likely contribution should be closer to their skill levels than the raw outcomes most leaderboards feature now. But equating past contributions entirely with player skill is asking for trouble, because they are not necessarily the same.

Similarly, some analyses conflate a player’s past actual contribution with that player’s expected future performance. This is both problematic and unnecessary. Projecting performance is difficult, and it is rare that exactly one season of data will provide the best estimate of how a player will begin the next one. Typically, two or even three seasons of data are needed, and that is before looking at external factors. Too often people treat FIP or DRA, for example, as predictions of the future for pitchers. These statistics were never designed to operate in that way, and assuming otherwise can lead you astray. Let estimates of past performance be just that, and leave the future to projection systems, which have their own ways of incorporating high-quality estimates of past performance.

__Principle 5__: **A player’s most likely contribution cannot be directly measured; thus, it must be inferred from measures evaluating Descriptiveness, Reliability, and Predictiveness**

The tricky thing about player contributions is that there is no agreed-upon way to isolate them. Hitters can be measured by batting average (AVG) or on-base percentage (OBP) or slugging percentage (SLG) or on-base-plus-slugging (OPS) or weighted on-base average (wOBA) or any number of other statistics. We will call these different statistics **metrics**; all these metrics reflect player contributions to some extent, and we need some way to choose which ones do it best.

To make that choice, we need benchmarks that we believe are useful proxies for true contributions. The three we have used most often are the following:

- __Descriptiveness__: the extent to which the player’s estimated past contributions to plays correspond to the outcomes of those plays;
- __Reliability__: the extent to which the player’s estimated past contributions are similarly estimated by the same metric during future plays; and
- __Predictiveness__: the extent to which the player’s estimated past contributions correspond to the outcomes of future plays involving that player.

None of these measures is absolute, but all reflect different assumptions we can reasonably make about a player’s most likely contribution. First, we expect a player’s contribution to correspond at least somewhat with the outcomes of the plays with which he was involved (Descriptiveness). We expect that contribution, if accurately measured, to be somewhat consistently diagnosed by a good metric over time (Reliability). Finally, we expect that contribution to show up in the outcomes of future plays involving that player (Predictiveness).

We will call these tests our **Comparison Measures** for ease of reference, and also to distinguish them from the **metrics** being evaluated. We will also, pursuant to our first principle above, generally look at performance only across full seasons. You can look at portions of seasons or multiple seasons if you wish, but the results should generally be similar, at least relative to each other.
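
A minimal sketch of the three Comparison Measures, scored as plain Pearson correlations over two hypothetical seasons of per-player data (our published comparisons use a robust correlation variant, but the structure is the same):

```python
import numpy as np

def comparison_measures(metric_s1, metric_s2, outcome_s1, outcome_s2):
    """Score a metric by the three Comparison Measures.

    metric_s1, metric_s2: the metric's value for each player in seasons 1 and 2
    outcome_s1, outcome_s2: raw play outcomes (e.g. runs) for those players
    """
    r = lambda a, b: np.corrcoef(a, b)[0, 1]
    return {
        "Descriptiveness": r(metric_s1, outcome_s1),  # metric vs. same-season outcomes
        "Reliability":     r(metric_s1, metric_s2),   # metric vs. itself, next season
        "Predictiveness":  r(metric_s1, outcome_s2),  # metric vs. next-season outcomes
    }
```

A metric that simply copies the season-one outcomes would score a perfect 1.0 on Descriptiveness, which is exactly why, as discussed below, high Descriptiveness alone is not persuasive.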

__Principle 6__: **The best measure of likely contribution is probably Reliability, followed by Predictiveness and then Descriptiveness**

One tricky aspect of our Comparison Measures is that they have conflicting goals. For example, on any decent metric, the descriptive performance (matchup to past outcomes) will virtually never be the same as the predictive performance (matchup to future outcomes). (If they were, we would not be having this conversation.) Thus, it is extremely likely that neither the past nor the future results are perfect measurements of contribution, and that both are almost certainly somewhat wrong.

Reliability, which instead looks at the extent to which a metric corresponds to itself over time, does not face that conflict. It is also the most useful because it focuses directly on the consistency of the metric we are trying to evaluate. However, Reliability has a different challenge: measuring the same thing consistently does not mean one is measuring something useful. For example, a statistic based entirely on a calendar could reliably tell us which games were played on Wednesdays, both now and in the future. But no one would consider that a useful summary of a player’s season, as there are many other days ending in “y” when players make meaningful contributions. Thus, we still want our other Comparison Measures to be reasonable, to ensure our metric remains connected to reality.

So, how do you find the right balance? The answer has multiple components: a) the methods by which we score our Comparison Measures; b) the range of numbers to look for from each metric; and c) deciding when we have a winning combination.

On the first point, we tend to use correlations when evaluating these metrics. You can use something else, such as expected error (mean absolute error, mean squared error, etc.). We prefer correlations, though, because they are easy to compare and understand (0 is bad, 1 is the best possible, and higher is better); because they are mathematically equivalent to error measurements when normalized; and most importantly because they allow us to compare metrics on different scales (which raw error measurement often does not).
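
The equivalence between correlation and normalized error can be checked numerically: for an ordinary least-squares fit, the mean squared error divided by the variance of the target equals 1 - r^2. A quick sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Mean squared error of the best linear fit of y on x
slope, intercept = np.polyfit(x, y, 1)
mse = np.mean((y - (slope * x + intercept)) ** 2)

# Normalizing the error by the variance of y recovers 1 - r^2
nmse = mse / y.var()
```

So a higher correlation and a lower normalized error are two views of the same fit, which is why either scoring method leads to the same rankings once scales are accounted for.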

On the second point, we can only cite our experience to date. However, that experience has been consistent. Descriptiveness usually generates the highest correlations among the Measures, which makes sense given that all metrics are based on past results. But since we know that the raw descriptive measurements are “wrong,” we see descriptive performance as little more than an overall reality check. In fact, the higher the Descriptiveness, the more suspicious we are, because the best descriptive fits arise from overfitting past outcomes—precisely the problem we are trying to avoid.

Reliability usually falls in between Descriptiveness and Predictiveness in terms of the magnitude of the correlation values. Reliability correlations are usually lower than the Descriptive correlations because the metric is no longer being compared to known past results, but rather to the way it would measure future results.

Predictiveness tends to be the lowest in magnitude of the three because the metric now must correlate not only with itself, but with the additional variance of play outcomes the metric did not take into account. Predictive performance is still important because it tests metrics on previously unseen data, which is a good measure of whether your models are finding something real and sustainable (like a player contribution). The use of Predictiveness as a measure may seem to contradict what was said above about avoiding projections, but it does not: like other measures, future performance has value in checking the validity of past results. But it is not our primary goal, because we are trying to derive the player’s most likely *past* contributions, not forecast the future.

Finally, the hard part: once you have these measures for your proposed metric, what do you do with them? The answer, we think, is to find combinations that maximize Reliability without sacrificing much, if any, Predictiveness, while showing some reasonable level of Descriptiveness. If a metric you generate can meet those criteria better than other metrics currently available, preferably while accounting for uncertainty of measurement, then you may have a winner.

This is essentially the hierarchy we have used to benchmark and justify Deserved Run Average (DRA), but we can give you an easier example. Let’s apply these Comparison Measures to some well-known batting statistics:

**Batting Metric Performance by Contribution Measures (teams, 1980–2016)**

| Metric | Descriptive | Desc_Err | Reliability | Rel_Err | Predictive | Pred_Err |
|--------|-------------|----------|-------------|---------|------------|----------|
| OPS    | 0.94        | 0.003    | 0.63        | 0.02    | 0.59       | 0.02     |
| wOBA   | 0.93        | 0.004    | 0.62        | 0.02    | 0.58       | 0.02     |
| OBP    | 0.86        | 0.008    | 0.61        | 0.02    | 0.52       | 0.02     |
| AVG    | 0.79        | 0.012    | 0.56        | 0.02    | 0.45       | 0.02     |

This table shows the Descriptiveness, Reliability, and Predictiveness for the listed metrics, at the team level, for all seasons taken together from 1980 through 2016. (We also used 2017, as needed, to provide the second season of comparison for the 2016 measures of Reliability and Predictiveness.) We have the much-maligned batting average at the bottom, the somewhat-preferred OBP above that, the sabermetric-preferred wOBA listed second, and on top, OPS, another measure of batting success that tends to work very well. The correlations are generated using the Robust Pearson correlation method we described in our recent article comparing wOBA and OPS.

In addition to the correlation scores, we provide a margin of error (“Err”) next to each measure so you can confirm which differences are material and which ones are not. To account for error, add the specified adjacent amount to either side of all these correlations, and if the numbers for two statistics still do not overlap, their differences are “outside the margin of error” and therefore probably meaningful.
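
The overlap rule can be written down directly. Using the Predictive and Reliability values from the table above:

```python
def outside_margin(r1, err1, r2, err2):
    """True when the intervals [r - err, r + err] for two correlation
    scores do not overlap, i.e. the difference is probably meaningful."""
    return (r1 - err1) > (r2 + err2) or (r2 - err2) > (r1 + err1)

# Predictiveness, OPS (0.59 +/- 0.02) vs. OBP (0.52 +/- 0.02):
# intervals [0.57, 0.61] and [0.50, 0.54] do not overlap
ops_vs_obp_predictive = outside_margin(0.59, 0.02, 0.52, 0.02)   # True

# Reliability, OPS (0.63 +/- 0.02) vs. OBP (0.61 +/- 0.02):
# intervals [0.61, 0.65] and [0.59, 0.63] overlap
ops_vs_obp_reliability = outside_margin(0.63, 0.02, 0.61, 0.02)  # False
```

This is the same comparison walked through in the next paragraph: the OPS-versus-OBP gap is meaningful on Predictiveness but within the margin of error on Reliability.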

This table reflects many of the trends we have reported above. First, Descriptiveness generates the highest correlations, as expected, with the Reliability scores coming in somewhat lower, and the Predictiveness scores somewhat lower still. Second, as we recommended above, look for the highest Reliability score, using Predictive scores as a tie-breaker or check of sorts. By Reliability, the metrics OPS, wOBA, and OBP are all decent, with their differences largely within the margin of error. Using Predictive testing as a tiebreaker, the difference between OPS and wOBA on the one hand, and OBP on the other, is outside the margin of error for these statistics, suggesting that OPS and wOBA are better choices than OBP for evaluating batter run contributions. Most analysts would agree with this assessment.

At the bottom of the chart, we see the sabermetric bête noire, batting average, doing poorly by comparison in every measure, with its inferiority outside the margin of error relative to OBP, and well outside the margin of error as to OPS and wOBA. This suggests that batting average, while generally reflective of team batting quality, is inferior to OBP as a measure of team hitting, and decidedly inferior to OPS and wOBA. Again, this reflects the consensus of the analysis community.

__Principle 7__: **If one metric performs better than another in all three Comparison Measures, it is almost certainly the superior metric**

As noted above, our various Comparison Measures operate at cross-purposes: it is impossible to maximize Descriptiveness while simultaneously maximizing Predictiveness, and given the variance that surrounds future performance, there is a probable ceiling on how Reliable any metric can realistically be.

That said, as demonstrated by the chart above, these cross-purposes do not stop superior metrics from demonstrating superior performance across the board. OBP performs much better than batting average in all three Measures, and OPS consistently performs better than OBP.

We can think of only one reason why a metric could perform better than another over multiple conflicting challenges: because said metric is more accurately isolating the player’s most likely contribution, and is therefore just a better metric, period.

Choices between metrics will not always be so clear, particularly as novel metrics approach theoretical ceilings in potential Reliability and Predictiveness. But when a proposed new metric is better than another in *all three* Comparison Measures, and especially when those improvements stretch or exceed the margin of error, the decision is easy: the new metric is superior, and ought to be both preferred and adopted.

__Conclusion__

There are plenty of other wrinkles to evaluating player contributions that we have not covered. There may also be further Comparison Measures, beyond the three discussed here, that ought to be considered. In the meantime, we hope this summary helps others understand where we’ve been coming from, and where we keep trying to go.

*Many thanks to the BP Stats team for peer review and editing.*

^{[1]} Fielder positioning has of course become a much more active issue in recent years.