Last year at this time, when we were first unveiling PECOTA, I was besieged with questions about the system’s accuracy. From the very start, the system has always had its believers and its skeptics; all of them wanted to know whether the damn thing worked.
My evasive answers to these questions must surely have seemed like a transparent bit of spin doctoring. One of my readers suggested to me, quite seriously, that I had a future in PR or politics. But I was convinced–and remain convinced–that a forecasting system should not be judged by its results alone. The method, too, is important, and PECOTA’s methodology is sound. It presents information in a way that other systems don’t, explicitly providing an error range for each of its forecasts–which, importantly, can differ for different types of players (rookies, for example, have a larger forecast range than veterans). Its mechanism of using comparable players to generate its predictions is, I think, a highly intuitive way to go about forecasting. Besides, all of the BP guys seemed to appreciate the system, and getting the bunch of us to agree on much of anything is an accomplishment in and of itself.
Now that it has a season under its belt, however, we can do the good and proper thing and compare PECOTA against its competition. (We go into further detail on PECOTA with an essay in Baseball Prospectus 2004.)
For this study, we evaluated the performance of seven sets of forecasts–PECOTA and these six worthy opponents:
- BBHQ: Ron Shandler’s Baseball HQ
- DMB: Diamond Mind Baseball forecasts created by Tom Tippett, available as part of DMB’s 2003 projection disk
- Primer: Baseball Primer/ZiPS Projections, available in a series of articles at www.baseballprimer.com
- RotoTimes: www.rototimes.com
- RotoWire: www.rotowire.com
- Warren: Scoresheet guru Ken Warren’s projections, available here
Four of the seven sets of forecasts (including PECOTA) require their users to pay for them in some form or another; the RotoTimes, Primer, and Warren projections are free. In each case, we used the set of forecasts that coincided most closely with the season’s start date, so as to have the most accurate lists of team and roster changes, and so forth. This meant using the “Web version” of the PECOTAs, rather than the “Book version,” which incorporated some slight refinements to the system’s methodology.
We evaluated only those predictions for players who had a forecast provided by each of the seven projection systems: a total of 285 pitchers and 360 position players. The systems differed in their coverage–DMB and PECOTA, for example, evaluated a larger sample of players. Most of the “extra” players were fringe guys and minor leaguers–players for whom it is somewhat harder to generate a good prediction because of small sample sizes–and so without this emphasis on “common” players, we wouldn’t be comparing apples to apples. Our focus will be on OPS and ERA, two imperfect measures of performance that are nevertheless alluring in their simplicity.
Forecasters, for whatever reason, tend to be a little bit more impassioned about their position player forecasts, so we’ll start with those. First, it’s worthwhile to look at some basic descriptive statistics for the respective models. We’ll also look at the same statistics given actual results for all hitters who achieved the rookie minimum of at least 130 plate appearances last season (a total of 393 players).
(Key: Obs=number of observations, Mean=average OPS prediction for each forecaster, Std Dev=standard deviation on OPS predictions for each forecaster, Min=minimum OPS projection for each forecaster, Max=maximum OPS projection for each forecaster)
Table 1: Descriptive Statistics: OPS Predictions (PA>=130)
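| System | Obs | Mean | Std Dev | Min | Max |
|---|---|---|---|---|---|
| BBHQ | 360 | .762 | .098 | .515 | 1.193 |
| DMB | 360 | .752 | .099 | .538 | 1.313 |
| PECOTA | 360 | .743 | .102 | .515 | 1.295 |
| PRIMER | 360 | .777 | .099 | .550 | 1.214 |
| TIMES | 360 | .759 | .102 | .520 | 1.259 |
| WARREN | 360 | .774 | .095 | .591 | 1.220 |
| WIRE | 360 | .786 | .112 | .492 | 1.333 |
| ACTUAL | 393 | .753 | .115 | .436 | 1.278 |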
Among the seven systems, RotoWire predicted the highest league-wide OPS, PECOTA the lowest. Most of the predictions ended up on the high side. PECOTA, RotoTimes and, especially, RotoWire accounted for a larger degree of variance than did the other systems (as measured by standard deviation).
None of the systems, however, accounted for as much variance as actually occurred. It is only natural that things turned out this way. A good projection should do what it can to remove the influence of ‘luck’ from its forecast lines; nevertheless, we need to acknowledge that luck does occur. Simulating a few seasons in DMB or Strat-o-Matic should give you a good feel for this–I just completed a replay of the 2003 season in which Sammy Sosa, based purely on a few hundred rolls of the virtual dice, hit 58 home runs for my Cubs, rather than his actual total of 40. Even if a system does a perfect job of estimating a player’s ‘true’ level of talent, it will miss on some projections due to this kind of sample variance. PECOTA attempts to account for this problem by means of its percentile forecasts.
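For a rough feel of how much spread the dice alone can produce, here is a minimal sketch that replays many seasons for a hitter whose true home run rate never changes. The 600 plate appearances and the rate of 40 home runs per 600 PA are illustrative assumptions, not Sosa's actual figures, and treating every plate appearance as an independent trial is a deliberate simplification.

```python
import random

def simulate_home_runs(plate_appearances, hr_rate, seasons, seed=0):
    """Replay `seasons` seasons for a hitter with a fixed true home run rate.

    Each plate appearance is an independent trial (a simplifying assumption:
    no parks, opposing pitchers, injuries, or streaks are modeled).
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(seasons):
        home_runs = sum(
            1 for _ in range(plate_appearances) if rng.random() < hr_rate
        )
        totals.append(home_runs)
    return totals

# Hypothetical inputs: 600 PA and a true talent level of 40 HR per 600 PA.
totals = simulate_home_runs(plate_appearances=600, hr_rate=40 / 600, seasons=1000)
print("low:", min(totals), "high:", max(totals),
      "average:", sum(totals) / len(totals))
```

Under these assumptions the season-to-season standard deviation works out to about six home runs, so totals several homers above or below 40 are routine even though the underlying talent is held constant.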
Is higher variance in a projection system a good thing? Frankly, you can come up with a pretty decent projection simply by taking a weighted average of the player’s performance in his previous three seasons, adding in some regression to the mean, and making a simple adjustment for age. Such a ‘naïve’ forecasting system would have a very small variance, while still providing for reasonably accurate projections. However, it wouldn’t be telling you anything that you didn’t already know: a player is going to perform in the future more or less as he has in the past. A higher variance is indicative of a forecasting system that is more discerning, that is predicting sharper changes in a player’s level of performance–what PECOTA calls “breakouts” and “collapses”. I’d venture to say that a system with higher variance has the potential to provide more value to its user, so long as it can maintain a high degree of accuracy while doing so.
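The ‘naïve’ system described above can be sketched in a few lines. Every number here is an assumption chosen for illustration: the 5/4/3 season weights, the 20% regression toward a .750 league OPS, and the linear age adjustment around age 27 are not taken from PECOTA or from any of the systems in this study.

```python
def naive_ops_forecast(past_ops, age, league_ops=0.750,
                       weights=(5, 4, 3), regression=0.2,
                       peak_age=27, age_step=0.003):
    """Naive OPS forecast from up to three past seasons (most recent first).

    All default parameters are hypothetical placeholders.
    """
    used = list(zip(weights, past_ops))
    # Step 1: weighted average, leaning on the most recent season.
    weighted = sum(w * ops for w, ops in used) / sum(w for w, _ in used)
    # Step 2: regress part of the way toward the league average.
    regressed = (1 - regression) * weighted + regression * league_ops
    # Step 3: simple age adjustment, up for young players, down for old ones.
    return regressed + age_step * (peak_age - age)

# A hypothetical 30-year-old with an OPS of .800, .760, and .720 in his
# last three seasons.
forecast = naive_ops_forecast([0.800, 0.760, 0.720], age=30)
print(round(forecast, 3))  # 0.754
```

Because every step pulls the forecast toward the middle, a system like this rarely strays far from what the player has already done, which is why its variance is small.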
Another way that we can look at this is to examine how closely the various projection systems mimic one another. In the table below, I’ve provided the correlation coefficients between the average forecast and those provided by each projection system for our group of 360 hitters.
Table 2: OPS Predictions: Correlation against Average
| System | Correlation |
|---|---|
| Warren | .983 |
| DMB | .979 |
| PECOTA | .968 |
| Primer | .966 |
| Times | .964 |
| BBHQ | .963 |
| Wire | .930 |
With the exception of RotoWire, which has emerged as something of an outlier, the differences between the projection systems are subtle. If it seems that discussions about forecasting quickly devolve into the esoteric, this is why–the “naïve” forecast gets us 90% of where we need to be, and so we each push every edge that we can in order to get a leg up on that last 10%. Not that the last 10% is unimportant: baseball teams routinely commit millions of dollars to a single long-term contract, and so having even a marginal edge in player evaluation can provide for a significantly more efficient return on investment.
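The correlation-against-average comparison can be sketched as follows. The three systems and four players are made up, and the sketch assumes the “average forecast” is the simple mean of every system's projection for a player, including the system being compared; the original calculation may have differed in that detail.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_against_average(forecasts):
    """forecasts maps system name -> projections, players in the same order."""
    systems = list(forecasts)
    n_players = len(forecasts[systems[0]])
    # Consensus forecast: mean of all systems for each player.
    consensus = [
        sum(forecasts[s][i] for s in systems) / len(systems)
        for i in range(n_players)
    ]
    return {s: pearson(forecasts[s], consensus) for s in systems}

# Hypothetical OPS projections for four players from three systems.
forecasts = {
    "A": [0.700, 0.800, 0.900, 0.750],
    "B": [0.720, 0.790, 0.880, 0.760],
    "C": [0.650, 0.850, 0.800, 0.900],  # deliberately the odd one out
}
result = correlation_against_average(forecasts)
for system, r in sorted(result.items(), key=lambda kv: -kv[1]):
    print(system, round(r, 3))
```

A system that tracks the crowd scores close to 1.0, while one making more idiosyncratic calls, as RotoWire does in Table 2, lands lower.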
With that in mind, let’s see how each system fared in the face of actual competition. We’ll evaluate the results for two subsets of position players: first, all players who received a projection from each system and had at least 130 PA (n=314 hitters), and second, all players who received a projection from each system and had at least 300 PA (n=234 hitters). Why not simplify things by settling upon one threshold for playing time? Do we really care about how a bunch of scrubs perform? The trouble with that is that the more we limit our subset of players, the more selection bias we are introducing. Take Jeremy Giambi for example, a player whom PECOTA projected, quite correctly, to flop. Largely because of his poor performance, Giambi received just 156 PA last year, far fewer than he might have if he had hit better. By setting the playing time threshold too high, we wouldn’t be giving a system proper credit for projecting collapse seasons like Giambi’s. That said, the argument that fuller major league seasons provide for better comparison points has some merit, and so we’ve resolved the matter simply by providing things both ways.
We have evaluated each subset of players by comparing the projection systems according to three metrics:
- Correlation Coefficient
- Mean Error
- Root mean square error (“RMSE”)
There is no one “right” answer as to which of these three metrics is most appropriate. RMSE tends to punish a system for errors that are large in magnitude, whereas Mean Error takes a more even-handed approach. RMSE and Mean Error penalize a system for having its league averages miscalibrated, whereas Correlation Coefficient does not. The only thing you really need to know is that the higher the Correlation Coefficient, and the lower the RMSE and Mean Error, the better a system has performed.
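For readers who want to see the arithmetic, here is a minimal sketch of the three metrics. It assumes “Mean Error” means the mean absolute error (the average size of the miss, ignoring direction), which is my reading of the tables rather than something spelled out above; the OPS figures are made up.

```python
from math import sqrt

def evaluate(predicted, actual):
    """Score one system's projections against actual results."""
    n = len(predicted)
    errors = [p - a for p, a in zip(predicted, actual)]
    # Mean Error, taken here as mean absolute error (an assumption).
    mean_error = sum(abs(e) for e in errors) / n
    # RMSE squares each miss first, so large misses count for more.
    rmse = sqrt(sum(e * e for e in errors) / n)
    # Pearson correlation: unaffected by a constant shift in the forecasts.
    mean_p, mean_a = sum(predicted) / n, sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    correlation = cov / sqrt(var_p * var_a)
    return {"correlation": correlation, "mean_error": mean_error, "rmse": rmse}

actual = [0.700, 0.750, 0.800, 0.850]          # hypothetical actual OPS
calibrated = [0.720, 0.740, 0.810, 0.830]      # hypothetical forecasts
inflated = [p + 0.050 for p in calibrated]     # same forecasts, 50 points high

well_calibrated = evaluate(calibrated, actual)
miscalibrated = evaluate(inflated, actual)
print(well_calibrated)
print(miscalibrated)
```

The two sets of forecasts rank the players identically, so their correlations match, but the inflated set pays for its miscalibrated league average in both Mean Error and RMSE.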
Shuffle Up and Deal…
Table 3a: Comparison of Predicted versus Actual OPS, Minimum 130 PA
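| System | Correl | Rank | Mean Error | Rank | RMSE | Rank |
|---|---|---|---|---|---|---|
| BBHQ | .691 | (4) | .068 | (4) | .086 | (4) |
| DMB | .696 | (3) | .066 | (2-T) | .085 | (1-T) |
| PECOTA | .700 | (2) | .065 | (1) | .085 | (1-T) |
| PRIMER | .685 | (5) | .070 | (6) | .090 | (6) |
| TIMES | .674 | (6) | .069 | (5) | .089 | (5) |
| WARREN | .709 | (1) | .066 | (2-T) | .085 | (1-T) |
| WIRE | .649 | (7) | .076 | (7) | .098 | (7) |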
Table 3b: Comparison of Predicted versus Actual OPS, Minimum 300 PA
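| System | Correl | Rank | Mean Error | Rank | RMSE | Rank |
|---|---|---|---|---|---|---|
| BBHQ | .690 | (5) | .067 | (4-T) | .084 | (2-T) |
| DMB | .694 | (3) | .066 | (3) | .084 | (2-T) |
| PECOTA | .711 | (2) | .064 | (1-T) | .085 | (4-T) |
| PRIMER | .692 | (4) | .067 | (4-T) | .085 | (4-T) |
| TIMES | .683 | (6) | .068 | (6) | .086 | (6) |
| WARREN | .715 | (1) | .064 | (1-T) | .081 | (1) |
| WIRE | .672 | (7) | .071 | (7) | .091 | (7) |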
Three systems earned first-place recognition in at least one of the six categories: PECOTA, DMB, and Warren, with Warren winning the MVP based on taking first place outright in three categories, and tying for first in another two. Ken’s been a great resource to Scoresheet Baseball players for a number of years now, and it’s nice to see him fare so well against the big boys. As for those bold RotoWire projections? They’re a bunch of great guys too, but RotoWire’s OPS projections finished last in each of the six departments.
Hitting, of course, is only 48% of the battle, so we’ll also evaluate how the systems did in projecting ERA. The pitching projections have always been near and dear to my heart, as the pilot version of PECOTA projected pitching performances only (hence the first letter in PECOTA’s name). My thought at the time was that the need for pitching projections was more urgent, as there was more room to improve upon the existing systems; we’ll see if that intuition held true.
First, a look at our set of descriptive statistics for the 285 pitchers common to all seven systems.
Table 4: Descriptive Statistics: ERA Predictions (IP>=50)
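| System | Obs | Mean | Std Dev | Min | Max |
|---|---|---|---|---|---|
| BBHQ | 285 | 4.00 | 0.66 | 2.34 | 5.91 |
| DMB | 285 | 4.03 | 0.87 | 1.98 | 6.32 |
| PECOTA | 285 | 4.29 | 0.78 | 2.14 | 7.47 |
| PRIMER | 285 | 4.18 | 0.63 | 2.47 | 6.19 |
| TIMES | 285 | 3.90 | 0.67 | 2.10 | 5.88 |
| WARREN | 285 | 3.89 | 0.76 | 1.66 | 6.15 |
| WIRE | 285 | 4.04 | 0.83 | 1.93 | 7.00 |
| ACTUAL | 297 | 4.27 | 1.27 | 1.12 | 8.16 |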
PECOTA and Primer projected substantially higher league-wide ERAs than the rest of the systems, figures that compared nicely to actual league results. One commonsensical test that you can perform on a projection system is to see what it expects a league-average performance to be. If it is projecting an average ERA to be in the neighborhood of 3.90, a level not observed in more than a decade, then something is very wrong.
All of the systems substantially underestimated the variance in actual ERA, a reflection of the fact that it is an unstable metric that is subject to a tremendous amount of luck. Some people have questioned PECOTA for the wide ranges of ERAs included within its percentile forecasts, but those results are consistent with the character of the statistic.
The pitching projections were also more highly differentiated from one another than the position player forecasts:
Table 5: ERA Predictions: Correlation against Average
| System | Correlation |
|---|---|
| DMB | .935 |
| Warren | .912 |
| BBHQ | .894 |
| Times | .883 |
| Primer | .795 |
| Wire | .780 |
| PECOTA | .775 |
PECOTA, Primer, and RotoWire all turned in relatively distinct pitching projections, while the other four systems were highly similar to one another. One reason for these disparities may be the differential treatment of hits and balls in play, otherwise known as DIPS Theory. The Primer projections apply a “strong” version of DIPS, postulating that a pitcher exerts almost no influence upon his hit rate. PECOTA uses a much weaker version of the theory, effectively assigning some portion of the variance in a pitcher’s hit rate projection to luck, some to skill, and some to the strength of the defense behind him; this is still enough to differentiate it from the more old-fashioned forecasting models that do not account for the unusual influences of hit rate at all.
Let’s move right along to our results, looking first at all pitchers with a minimum of 50 IP (n=220), and second at all pitchers with a minimum of 130 IP (n=115).
Table 6a: Comparison of Predicted versus Actual ERA, Minimum 50 IP
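| System | Correl | Rank | Mean Error | Rank | RMSE | Rank |
|---|---|---|---|---|---|---|
| BBHQ | .430 | (4) | .88 | (2) | 1.14 | (2-T) |
| DMB | .441 | (2) | .89 | (3) | 1.16 | (4-T) |
| PECOTA | .479 | (1) | .85 | (1) | 1.11 | (1) |
| Primer | .395 | (6) | .90 | (4-T) | 1.14 | (2-T) |
| Times | .428 | (5) | .90 | (4-T) | 1.16 | (4-T) |
| Warren | .431 | (3) | .96 | (7) | 1.24 | (6) |
| Wire | .341 | (7) | .90 | (4-T) | 1.18 | (7) |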
Table 6b: Comparison of Predicted versus Actual ERA, Minimum 130 IP
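| System | Correl | Rank | Mean Error | Rank | RMSE | Rank |
|---|---|---|---|---|---|---|
| BBHQ | .425 | (3) | .72 | (4) | .92 | (2-T) |
| DMB | .470 | (2) | .71 | (2-T) | .93 | (4) |
| PECOTA | .486 | (1) | .68 | (1) | .86 | (1) |
| Primer | .389 | (4) | .71 | (2-T) | .92 | (2-T) |
| Times | .372 | (6) | .76 | (6) | .97 | (5) |
| Warren | .388 | (5) | .75 | (5) | .98 | (6) |
| Wire | .327 | (7) | .80 | (7) | 1.03 | (7) |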
PECOTA swept the pitching categories, proving to be substantially more accurate than any of the other forecasts according to all six of our metrics. BBHQ, DMB, and Primer also turned in strong showings, with the other three systems lagging behind. The old adage that pitching is too unpredictable to bother with has a ring of truth to it, hence the name of BP’s only previous attempt at running pitching projections: “WFG”. But PECOTA is closer to solving the riddle than any of the other systems. We’re adding consideration of groundball:flyball ratio into PECOTA this year, something that should assist it further (also described in more detail in the essay in Baseball Prospectus 2004).
We can crown an overall champion for the year by summarizing the results for each of the 12 metrics that we’ve tested–six apiece for pitchers and position players. We’ve assigned seven points for a first-place finish, six for a second-place finish, on down to one point for a last place finish, and summed the results to determine a victor.
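The scoring can be sketched as below. The function name is hypothetical, and the rule that tied systems split the points for the places they jointly occupy is inferred from the half-point entries in the results tables rather than stated outright. The example input is the Mean Error column for hitters with at least 130 PA.

```python
def rank_points(scores, higher_is_better):
    """Award N points for first place down to 1 for last among N systems.

    Tied systems share the average of the points for the places they occupy
    (an assumption inferred from the half-point totals in the tables).
    """
    n = len(scores)
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    points = {}
    i = 0
    while i < n:
        # Extend j to cover every system tied with the one at position i.
        j = i
        while j + 1 < n and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        # Positions i..j are worth n-i, n-i-1, ..., n-j points.
        shared = sum(n - k for k in range(i, j + 1)) / (j - i + 1)
        for k in range(i, j + 1):
            points[ordered[k]] = shared
        i = j + 1
    return points

# Mean Error for hitters, minimum 130 PA (lower is better).
mean_error_130 = {
    "BBHQ": 0.068, "DMB": 0.066, "PECOTA": 0.065, "Primer": 0.070,
    "Times": 0.069, "Warren": 0.066, "Wire": 0.076,
}
points = rank_points(mean_error_130, higher_is_better=False)
print(points)
```

For Correlation Coefficient the same function would be called with `higher_is_better=True`, and a system's overall total is the sum of its points across all 12 categories.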
Table 7a: Results for hitter projections
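| System | Corr (Min 130 PA) | Mean Error (Min 130 PA) | RMSE (Min 130 PA) | Corr (Min 300 PA) | Mean Error (Min 300 PA) | RMSE (Min 300 PA) |
|---|---|---|---|---|---|---|
| PECOTA | 6 | 7 | 6 | 6 | 6.5 | 3.5 |
| DMB | 5 | 5.5 | 6 | 5 | 5 | 5.5 |
| Warren | 7 | 5.5 | 6 | 7 | 6.5 | 7 |
| BBHQ | 4 | 4 | 4 | 3 | 3.5 | 5.5 |
| Primer | 3 | 2 | 2 | 4 | 3.5 | 3.5 |
| Times | 2 | 3 | 3 | 2 | 2 | 2 |
| Wire | 1 | 1 | 1 | 1 | 1 | 1 |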
Table 7b: Results for pitcher projections
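| System | Corr (Min 50 IP) | Mean Error (Min 50 IP) | RMSE (Min 50 IP) | Corr (Min 130 IP) | Mean Error (Min 130 IP) | RMSE (Min 130 IP) |
|---|---|---|---|---|---|---|
| PECOTA | 7 | 7 | 7 | 7 | 7 | 7 |
| DMB | 6 | 5 | 3.5 | 6 | 5.5 | 4 |
| Warren | 5 | 1 | 2 | 3 | 3 | 2 |
| BBHQ | 4 | 6 | 5.5 | 5 | 4 | 5.5 |
| Primer | 2 | 3 | 5.5 | 4 | 5.5 | 5.5 |
| Times | 3 | 3 | 3.5 | 2 | 2 | 3 |
| Wire | 1 | 3 | 1 | 1 | 1 | 1 |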
Table 7c: Overall results (7a plus 7b)
| System | Total |
|---|---|
| PECOTA | 77 |
| DMB | 62 |
| Warren | 55 |
| BBHQ | 54 |
| Primer | 43.5 |
| Times | 30.5 |
| Wire | 14 |
Take these results with a grain of salt; one year’s worth of data may not be enough to establish the superiority of one projection system over another, especially in the hitting categories, where the results were tightly bunched. But for the time being at least, it looks like PECOTA walks the walk.