It’s around the time that projection engines are being tweaked, updated, and improved, in anticipation of the release of new predictions for the coming year. At Baseball Prospectus, Rob McQuown is hard at work ironing out the kinks for this year’s release of PECOTA. Given the present focus on predictions, the time is ripe for a retrospective look at how the projections fared last year.

There’s no better source for a large-scale comparison of projection algorithms than Will Larson’s Baseball Projection Project, which I will use for this article. Larson’s page houses the old predictions of as many different sources as he can get his hands on, including methods as diverse as Steamer, the Fan Projections at FanGraphs, and venerable old Marcel. It’s a rich storehouse of information concerning the ways in which we can fail to foresee baseball.

The most obvious task when confronted with the contrasting predictions of a series of algorithms is to compare them and pick a winner. I am going to refrain from doing so. I am not an unbiased observer. As much as I’d like to make a pretense of objectivity, there are sufficiently many free parameters in any such comparison of projections that I could never guarantee that I wasn’t tilting the competition to favor PECOTA. These types of comparisons are best done by third parties on multi-year samples of projection engines.

Instead I aim, by gathering the predictions of nine separate algorithms*, to get a better sense for the scope and scale of prediction accuracies and errors. I’ll concern myself only with hitters for now, and limit it to just this past year, 2014. As a global metric of offense, I’ll rely upon OPS, not because it’s the best, but because all the various systems make predictions for it.

The first striking finding is that by and large the systems do a decent job at predicting who will be good and who will not. The root mean squared errors (RMSE), a measure of prediction accuracy, all fit in somewhere between .16 and .19 for different projections, implying that a hitter’s OPS can be guessed to within about 200 points. These RMSEs drop to ~.1 if you apply a 200 PA threshold, which eliminates some small-sample abnormalities. That’s not perfect or even close to it, but it shows that good players can be distinguished from bad players with relative certainty.
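The RMSE figures above can be computed with a few lines of code. This is a minimal sketch with made-up projected and actual OPS values, not the article's data:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must be the same length")
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Hypothetical projected vs. actual OPS values for four hitters
projected = [0.750, 0.820, 0.680, 0.710]
actual = [0.780, 0.760, 0.655, 0.805]
print(round(rmse(projected, actual), 3))
```

Run over a full season's worth of hitters, this statistic is what lands in the .16–.19 range (or ~.1 with a 200 PA threshold).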

One reason that the RMSE is so elevated is due to the shifting run environment in the league. Some projections undoubtedly make an effort to account for the most recent trends, but because the algorithms are inherently backward-looking, they’ll tend to damp out whatever larger tendencies are occurring in baseball (such as the recent freefall of offense). This effect can be observed quite clearly in the estimated average OPS of 511 players, relative to the actual OPS of those same players:

Every single prediction puts the average OPS of these players substantially (30 to 70 points) higher than the actual OPS achieved by the players. Some of this effect is due to survivor bias, but even if you apply strict plate appearance cutoffs, the prediction algorithms (as a group) expected hitters to produce 10 or so more points of OPS than they actually achieved.

In times of great change in the league, prediction systems are going to become less accurate in an absolute sense. Everyone knows that the strike zone is growing, or maybe now shrinking, but at any rate changing, and that movement is shaping the run environment to a large degree. There are also innumerable other factors at play, including increasing average fastball velocity and the advent of the modern hyper-specialized bullpen. To whatever degree all of these factors combine to shift the run environment, the prediction algorithms will be greatly confused. Because all of the predictions are essentially historical in nature (utilizing the rich information of prior baseball careers), the rare eras of great upheaval in the run environment will be the most difficult to predict.

**All systems miss on the same players**

I’ll turn now from the general to the specific, looking at individual players. Some players are easy to forecast on account of their consistency. Mike Trout, to date, has put up OPSs of .963, .989, and .938 in his three MVP-caliber years; had you guessed .950 even, you would not have been far off in any of them. Other players, however, are maddeningly difficult to deal with, on account of injuries, youth, or otherwise variable performance.

The greater part of the players in MLB fall into the easy-to-predict camp. Such is the state of the predictions nowadays that about half of all players can be forecast to within 50 points of their actual OPS values. These players are not problems for the prediction algorithms, provided they don’t get injured.

This graphic shows the spectrum of absolute prediction errors in 2014 (minimum of 200 PAs), using the consensus predictions (average of all prediction systems). Most errors fall within a reasonable range of 50 to 100 points of OPS, but there is a long tail of seriously missed projections. Granted, some of these players are aberrations, the recipients of gifts from the BABIP gods, or, on the flip side, snake-bit by injuries severe enough to impact their ability, but not severe enough to remove them from the field.

But there are still some genuine projection errors lurking. Players like Victor Martinez, J.D. Martinez, and Michael Brantley seem to have made true steps forward. These players constitute the tail of the above graph. While there is a hazard to over-interpretation, it looks to me as though there are two kinds of prediction errors at work. Here’s another histogram of the real prediction errors scattered around 0 (a perfect prediction), to show you what I mean.

One is a standard, random process, the luck of BABIP combined with opponent quality variation and whatever else (which could be modeled as a normal distribution with mean 0 and standard deviation of .015 [blue curve]). And then there is a second kind of error, a sort of extreme event where a hitter’s ability changes for one reason or another, which would be completely unforeseeable under the first random process.
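The two-component picture can be made concrete with a toy simulation. The breakout rate and the width of the second component below are illustrative assumptions, not fitted values; only the .015 standard deviation of the narrow component comes from the curve described above:

```python
import random

random.seed(0)

def simulate_error(p_shift=0.05, noise_sd=0.015, shift_sd=0.080):
    """One projection error under a two-component model: usually ordinary
    luck (narrow normal), occasionally a genuine talent change (wide normal).
    The 5% shift rate and .080 shift size are illustrative assumptions."""
    if random.random() < p_shift:
        return random.gauss(0.0, shift_sd)   # breakout or collapse
    return random.gauss(0.0, noise_sd)       # BABIP-style noise

errors = [simulate_error() for _ in range(10_000)]
tail = sum(abs(e) > 0.050 for e in errors) / len(errors)
print(tail)  # the long tail comes almost entirely from the second component
```

Under the narrow component alone, a miss beyond 50 points of OPS would be a three-sigma event; mixing in even a small fraction of true talent changes reproduces a long tail like the one in the histogram.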

Take Brantley, for example. The most bullish take on Brantley was provided by BaseballGuru, who prophesied a decent if unspectacular .751 OPS. In reality, Brantley got to .890, when all was said and done. The story is much the same with the other breakouts: Victor Martinez, predicted for an .800 OPS (by the usually over-optimistic fans), achieved .974, and so on.

The astonishing fact is that not one but all of the systems missed on these players. While the details differ, projections are by and large similar to each other. When a player drastically over- or underperforms his projection, it’s not as though there is often one rogue system which foresaw it, while the others flopped. Accordingly, if you use the consensus projection of all nine prediction algorithms (instead of just a single one), it’s no more accurate (by any measure) than the best individual prediction algorithm. There’s no “wisdom of crowds” effect here which could help us predict the breakouts or breakdowns.
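A consensus projection of the kind described above is just a per-player average across systems. Here's a minimal sketch; the system names, player coverage, and OPS values are all hypothetical:

```python
def consensus(projections):
    """Average per-player projections across systems. `projections` maps
    system name -> {player: projected OPS}; only players covered by every
    system get a consensus number."""
    common = set.intersection(*(set(p) for p in projections.values()))
    return {
        player: sum(p[player] for p in projections.values()) / len(projections)
        for player in sorted(common)
    }

# Hypothetical projections from three systems
systems = {
    "A": {"Brantley": 0.740, "Trout": 0.960},
    "B": {"Brantley": 0.751, "Trout": 0.955},
    "C": {"Brantley": 0.735, "Trout": 0.970},
}
print(consensus(systems))
```

Because the systems' errors are so highly correlated, averaging them this way cancels very little error, which is exactly the "no wisdom of crowds" result described above.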

Perhaps this fact shouldn’t be astonishing, in that all of the various projection systems fundamentally work with the same data (outcomes, as measured on a per-PA basis). According to James Surowiecki (who wrote the book on this), one element required for there to be a “wise crowd” is that there must be diversity of opinion. For the most part, all the current best projection systems are not very diverse. A corollary of this line of reasoning is that a new kind of prediction algorithm (one which relied on a different sort of data, or set up a different way), even if it wasn’t very accurate, could still be helpful if used in conjunction with one of these other sets of predictions. Simply by providing a different, more diverse take on the same players, the overall accuracy could potentially be improved.

I’ll close with one last parcel of information that I find both surprising and encouraging. I was intrigued by a comment from Jared Cross, made at the Saber Seminar this year, about pitching projections. He noted that if you guessed a league average ERA for every pitcher in the league (a flat projection), you’d be right to within .95 runs, on average, while the very best current projection systems get to within .8 runs.

I did the same analysis for hitters. With a league average guess for every hitter who got more than 200 PA last year, you’d get to within 120 points of the true OPS, on average. The projections, in contrast, range from within 80 to within 100 points of OPS, depending on the particular algorithm used. As with pitching, the projections are more accurate than randomness, but not by all that much. We’ll never be able to reduce the error of projections to zero (and that’s probably a good thing), but the ubiquity of breakouts suggests that we have at least some progress to make.
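The flat-baseline comparison works like this: guess one number (the league average) for everyone, then compare the average miss against that of a real projection. A sketch with hypothetical OPS values:

```python
def mean_abs_error(predicted, actual):
    """Average absolute miss, in points of OPS."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical actual OPS values for five qualified hitters
actual = [0.650, 0.700, 0.720, 0.780, 0.850]
league_avg = sum(actual) / len(actual)

flat = [league_avg] * len(actual)                # guess the average for everyone
projected = [0.680, 0.705, 0.735, 0.760, 0.820]  # a toy projection with real signal

print(mean_abs_error(flat, actual))
print(mean_abs_error(projected, actual))
```

In the real data the gap between the two numbers is the "more accurate than randomness, but not by all that much" margin: roughly 120 points for the flat guess versus 80–100 for the projections.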

*Full list: PECOTA, CAIRO, CBS, FanGraphs Fan Projections, Baseball Guru, Marcel, Oliver, Steamer, and ZiPS.

Can you actually test that and report the results?

I have some ideas about how to identify the flat vs. variable players based on PECOTA's percentile projections, and I'll write up the results of that soon.

MGL below says that his research DOES support your claim. I wanted to see those results.

It's one thing to say we'll get smaller RMSE with players who have a lower standard deviation in past 3 or 4 years of wOBA. It's another to say that those guys will have an RMSE of .028 and the variable guys will have an RMSE of .029.

So I was looking to see the size of the difference we're talking about when turning that into the English "easier."

Using RMSE or something like that will yield worse results for these players, but, again, I am not sure that means they are "harder" to project. I am agnostic as far as the different methods for evaluating players go.

Here is an example of what I mean. Say we have 3 groups of players with variable histories. One group is all over the map historically, but their projection is an OPS of .650. IOW, their weighted career average is somewhere around .620 or so (once we regress toward the mean, we get the .650).

The second group is all over the map, and their projection is .750. The third group, .850.

If we test our projections for each group, we indeed get a collective actual of .650, .750, and .850 for each of the 3 groups, so we pretty much nailed it.

But, because they were quite variable historically, as I said, the variance in actual WITHIN each group will be large. Some of the .650 group will be .550 and some will be .750, etc. Same with the other 2 groups.

So using an RMSE or even average error will yield a high number, which will make it look like these were bad projections. But were they? I don't know, you tell me.

The exact same thing will occur with players who have limited histories. We can probably nail the .700, .750, and .800 players, but within each group there will be a lot of variability and the average RMSE errors will be large, just like the players with a lot of inconsistency in the past but with more data in that history.

Compare that to players who are more consistent historically. We will have our .650, .750, and .850 groups (etc.) but within those groups there will be less variability. Does that mean that these projections were "good"? Again, it depends on how you define "good" and "bad" and what method you use to evaluate the projections.
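The group argument above is easy to simulate. In this sketch the projection nails each group's true mean exactly, yet the individual RMSE still looks "bad" because of within-group scatter; the group sizes and the 60-point scatter are assumptions for illustration:

```python
import math
import random

random.seed(1)

def rmse(pred, act):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, act)) / len(act))

# Three groups of historically variable hitters. The projection hits each
# group's true mean exactly, but individuals scatter widely around it
# (assumed sd of 60 points of OPS).
pred, act = [], []
for mu in (0.650, 0.750, 0.850):
    for _ in range(1000):
        pred.append(mu)
        act.append(random.gauss(mu, 0.060))

bias = sum(act) / len(act) - sum(pred) / len(pred)
print(abs(bias))        # near zero: collectively the projections are dead on
print(rmse(pred, act))  # ~0.060: individually they still look "bad"
```

Whether that RMSE means the projections were "bad" is exactly the definitional question being raised: by construction, no projection could have done better here.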

I can say one thing about evaluations. If you are using any system which does not incorporate the number of opportunities in the "actual" it is a bad evaluation system. And that is because the fewer the opps, the more random variability there will be. It is not enough to just say, "I'll use a 200 PA cutoff," or something like that.
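One way to incorporate opportunities, rather than just thresholding on them, is to weight each squared error by plate appearances. This is a sketch of that idea with made-up numbers, not a claim about how any published evaluation works:

```python
import math

def weighted_rmse(pred, act, pa):
    """RMSE with each squared error weighted by plate appearances, so that
    low-PA seasons (which are mostly noise) count for less."""
    num = sum(w * (p - a) ** 2 for p, a, w in zip(pred, act, pa))
    return math.sqrt(num / sum(pa))

pred = [0.750, 0.800, 0.700]
act  = [0.760, 0.650, 0.705]
pa   = [600, 80, 550]   # the big miss came in only 80 PA
print(weighted_rmse(pred, act, pa))
```

Because the 150-point miss came in only 80 PA, it is down-weighted, and the weighted RMSE comes out well below what a flat per-player average would give.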

I had meant it as individual RMSE, but you make a good point in terms of the collective.

Unfortunately, I don't have any great insight into what would be the 'correct' question to ask of the projection systems. It should measure the correlation between the projected rank for each player and the actual rank, but extra weight should be given to getting the tails correct, since it seems like in the end that is what really matters. What I mean is that if Player A is projected on average to be a 40th-percentile player but ends up at the 60th percentile, and Player B is projected as 60th percentile but ends up as 80th percentile, it is more important for a projection system to be right on Player B than on Player A: because the distribution thins in the tails, that 20-percentile move from 60 to 80 will have a bigger impact in terms of baseball production than a 20-percentile move from 40 to 60.
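One concrete (and entirely hypothetical) version of that tail-weighting: scale each percentile miss by how far into the tails the actual performance landed.

```python
def tail_weighted_misses(proj_pct, actual_pct):
    """Percentile misses weighted by how far into the tails the actual
    performance landed: weight 1 at the median, 2 at the extremes.
    The weighting scheme is a hypothetical illustration."""
    return [
        (1.0 + abs(a - 50.0) / 50.0) * abs(p - a)
        for p, a in zip(proj_pct, actual_pct)
    ]

# Player A: projected 40th percentile, finished 60th.
# Player B: projected 60th percentile, finished 80th.
print(tail_weighted_misses([40.0, 60.0], [60.0, 80.0]))
```

Both raw misses are 20 percentile points, but Player B's lands further out in the tail and so is penalized more, matching the intuition in the comment.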

Anyway, these are half-baked thoughts, but it's something that has bothered me about projection systems for a long time. I hate the way systems incorporate reversion to the league mean. I understand why it is done, and why it yields better results, but it ignores the reality that it adds no actual value to the projections, except on a macro level, which in my opinion is useless.

The "Look -- your fancy-pants projection algos 'barely' outperformed a monkey/dart/etc. in absolute accuracy" articles really need to stop. What we need is practical evaluation of *relative* accuracy over several seasons, such as what one site has done with weekly football player rankings already: http://www.fantasypros.com/about/faq/accuracy/#pay

I have no affiliation with that site, nor do I think their methodology is perfect. I do think they have the right *idea*, however, and it should be the general framework used to evaluate any ranking or projection system. That is, in some simulated competition, how would each projection system have fared? And by all means, include the all-players-are-the-same projection set as a control...so we can see just how damaging it would be to use anything like them in real life.

Re: evo34, I certainly wouldn't recommend trying to win your fantasy league without looking up some projections first. That would be foolhardy. It sounds like you are looking for a solid comparison between projection algorithms, which I think is a very worthwhile endeavor, but as I noted, shouldn't be done by me. I'm more focused on whether and how to improve the performance of PECOTA. The point of this article, and the "little better than random" factoid at the end, was to point out that there *is* substantial room for improvement, and the most fruitful place I see to pursue that improvement is in terms of identifying some of the major breakouts that happen every year, as they contribute a disproportionate amount of forecasting error to the total.

Of course, such a subset has a problem in that many large one-year changes are just random changes, so maybe a more restrictive test could be applied to see if the performance in the previous, say, 2 or 3 years was significantly different than the next 2 or 3 years. In other words, you wouldn't use 2014 J.D. Martinez in your training set as a breakout player unless his performance in 2015 is also much better than his pre-2014 level of performance.

Basically, if we reason from the basic conceit of PECOTA (that looking at the career trajectories of previous players whose careers most closely match the player of interest gives us a good prediction of the player of interest's future career), we have to make some choices about how to integrate data that is telling us different things.

To take an example, if I look at Player A, whose top 5 comps hit .250/.325/.425 the following year on average (after translating their production into 2014 league context), I could say that the most probable outcome for Player A is to hit .250/.325/.425 next year, then apply a normal distribution around that (pulling the variance for each statistic from the comps), and then perhaps apply some other modifiers, like league-context adjustments for 2015, or ballpark effects, etc. That gives me a range of possible production values, like maybe 1-stdev above is .275/.340/.450 and 1 below is .225/.310/.400. Another approach would be to generate a 20th-percentile slash line by just taking the line of the worst comp, then the 30th-percentile could be the average of the 2 worst, the 40th percentile the same as the 2nd-worst comp, and so on, so the 80th percentile is the performance of the best of the 5 comps, and if I have even more comps I can generate higher (and lower) percentiles. And similarity scores can be used to weight comps and improve the accuracy (though they also add noise, potentially).
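The rank-based approach in that paragraph can be sketched as follows. This is a simplified reading of the scheme: sorted comp outcomes anchor evenly spaced percentiles via a standard plotting-position rule, which is not a claim about PECOTA's actual implementation, and the comp OPS values are hypothetical.

```python
def comp_percentiles(comp_ops):
    """Map the i-th best of n sorted comp outcomes (1-indexed) to the
    100*i/(n+1) percentile: the worst comp anchors a low percentile,
    the best a high one, as in the rank-based scheme described above."""
    s = sorted(comp_ops)
    n = len(s)
    return {round(100 * (i + 1) / (n + 1)): v for i, v in enumerate(s)}

# Hypothetical next-year OPS for five comps
print(comp_percentiles([0.760, 0.640, 0.810, 0.700, 0.725]))
```

With five comps this yields anchors near the 17th through 83rd percentiles; more comps fill in higher and lower percentiles, and similarity scores could be folded in as weights.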

So that's great, I've used league context to adjust the performance of comps and applied it to my future performance. If I have a normal set of comps, though, there are some duds and some studs in my list, so the difference between a 90th- and 10th- percentile performance is pretty big. Saying "there's less than a 10% chance he ends up with a .290 TAv" is useful, but it means that a whole bunch of folks will miss their average projection by a whole lot. When people ask how someone is going to do next year, they are really asking, "How's he going to do when he's relevant to my interests?" They don't want you to average the 1 time in 4 that he gets -1 WARP with the 2 times in 3 that he gets 1 WARP and the 1 time in 10 that he gets 4 WARP and say that he projects to get 0.8 WARP. The question is really asking about that 3/4 of the time that he's a solid major league player. In those cases, he's a 1.4 WARP player, on average.

Improving the performance of PECOTA *requires* "a solid comparison between projection algorithms." How else are you going to know how good you are?

Thank you for your recommendations regarding other ways PECOTA could be improved. Those are certainly significant and complex issues, each of which would require substantial reworking (and in some cases, for example identifying pitchers who serially under/overperform expected BABIPs, a better understanding of the underlying sabermetrics). I will note that working to improve one aspect of PECOTA does not preclude progress on these other, obviously worthwhile projects.

My point is that you should *not* be concerned about missing a handful of "breakout" players, nor with minimizing absolute error. You should be focused on trying to improve clear shortcomings in your algorithm. A partial list:

Pitchers who are unique enough for long enough that it's clear they are defying BABIP norms and will continue to do so. See: Young, Chris.

Pitchers who perform abnormally well/poorly with men on base -- thus affecting projected ERA.

Pitcher velocity.

Hitter platoon effects. How much of a hitter's past performance was driven by favorable L/R match-ups, vs. the projected mix he will face this season?

Long-term forecasts. This topic deserves a separate post. But suffice it to say that when you find your algo spits out a 6-year TAv projection of .280, .281, .270, .266, .285, .240, you have a serious over-fitting issue...

It's pretty noticeable, following the PECOTA predictions each year, that the system is very conservative. Statistically, that's justified: if you have an outcome with a 1/100 or even 1/10 chance, you are better off assuming that it won't happen than assuming that it will, so your single projection is going to be too conservative or too aggressive on guys who hit those low-probability shots, even though it will be accurate on a grand scale. In other words, getting the mean approximately right doesn't require you to properly estimate the variance.

The question is, though, what do we want a projection system to do? If you were making a projection system for rolling a bunch of dice with different numbers on their faces, would you ask it to tell you the expected value of each outcome, and be satisfied that it was going to miss high & low on some of them? That's what the systems are doing now. Asking them to predict which dice are going to roll their highest values is unreasonable, so you need to judge the success on something more nuanced than just RMSE. The projection needs to give different possible outcomes with a likelihood attached to each (or return some formula for generating the probability distribution of different stats) in order for it to be judged on anything but aggregate accuracy.

So for a normal die, you would want the algorithm to tell you there's a 1/6 chance of each number 1-6 appearing, since that's the best it can possibly do at predicting. Then you need some method of determining whether or not those predictions were accurate when there's no way to repeat the experiment. For example, figuring out how likely the actual result was, given the projections.
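Scoring probabilistic forecasts by "how likely was the actual result, given the projections" is the log-likelihood idea, and the die example makes it concrete. A sketch, with hypothetical rolls and a deliberately overconfident rival forecast:

```python
import math

def log_likelihood(forecast, outcomes):
    """Sum of log probabilities the forecast assigned to what actually
    happened; higher means the probabilistic forecast fit reality better."""
    return sum(math.log(forecast[o]) for o in outcomes)

fair = {k: 1 / 6 for k in range(1, 7)}   # the honest 1/6-each forecast
overconfident = {1: 0.01, 2: 0.01, 3: 0.01, 4: 0.01, 5: 0.01, 6: 0.95}

rolls = [3, 6, 1, 4, 6, 2]   # hypothetical rolls of a fair die
print(log_likelihood(fair, rolls))
print(log_likelihood(overconfident, rolls))
```

The honest uniform forecast scores higher than the overconfident one even though it never "calls" any roll, which is exactly the sense in which a well-calibrated distribution can be judged without repeating the experiment.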

We don't, because they're not.

"Skill might have a normal distribution,"

Not at the professional level.

You make the right observation with outcome probabilities. Namely, what is the probability of something happening that shouldn't have. I like your dice example since it reminds me of Strat-O-Matic.

It IS a very complex subject.