February 1, 1999
The Prospectus Projections Project
A year's worth of data is in, and here are the results
The Prospectus Projections Project began last February, when we asked about 40 people to make "human" projections of how 100 hitters and 25 pitchers would perform in 1998. We wanted to see how people would do compared to the computer formulas that appear in such publications as Baseball Prospectus and STATS' Major League Handbook. In the end, 27 people - all knowledgeable baseball fans with a good fundamental understanding of baseball and baseball statistics - ended up sending us their projections by mid-March.
We emphasized to all our participants that their projections had to be at least somewhat from the gut. While everybody was allowed to use whatever knowledge, statistics, or books that they had in their possession, the projections were not to come out of a computer or even a formula. A few potential participants begged out because of this clause. But our helpful 27 contributed their projected batting average, slugging average, and on-base percentage for 100 mostly randomly selected hitters (we avoided players on expansion teams, so as to not play "guess the park factor"), and the projected ERA, K rate, BB rate, and IP for 25 pitchers. The pitching data will be analyzed in a future article; we're here to tell you the results of the offensive projections.
It's a bit harder to come up with "results" than you might think. Not all projection systems aim for the same type of success. If you're trying to hit on a long shot at the racetrack, for example, a conservative projection system that sticks closely to what horses have done in the past is going to do you much less good than a more aggressive system that tries to project which long shot is going to break out of the pack. Even if the aggressive system succeeds, it will have far more serious errors on average, but it may very well make you more money because it picked out one winner who pays off huge.
One common way - though again, there is no real "right" way - of analyzing
Bold figures indicate that the value is statistically significantly better than the 27 human forecasts at the 95% confidence level.
Another way to compare the human forecasts to the other systems is to look
STATS' projections were the most accurate forecasts for three of the variables (BA, SLG and ISO), were second to one person's (Steve Moyer's) in OBA by less than one-tenth of a point, were second to one person's (Steve Rubio's) in OPS by half a point, and were 7th in OBA-BA (behind five people's and the mean human forecast). From the two charts we can see that, in general, the objective forecasts were more accurate than almost all human forecasts, except for OBA-BA, where the most accurate human forecasts were competitive with all of the objective systems, to the extent that the mean human forecast was the most accurate.
Another way of analyzing the projections is to test how often they fall within a certain tolerance. A natural tolerance for this test is the mean absolute difference in a statistic between 1997 and 1998, which we can think of as its "natural variability." For example, the mean absolute change in batting average between 1997 and 1998 was 25.9 points (regardless of the sign). Below, Table 3 gives the percentage of predictions that had errors smaller than the mean absolute difference for each statistic.
These tables reaffirm some of the things we know from Tables 1 and 2, but they also offer up a lot of new information. It seems that roughly two-thirds of the projections are correct to within the mean absolute change from year to year, and roughly one-third are right within half that value. The least accurately predicted statistic in the project this year was slugging, with the best performance of the 31 forecasts being only 29% correct within the tighter tolerance. Curiously, the non-BA components of OBP (OBP-BA) and SLG (ISO) were the most accurately predicted statistics, indicating that batting average was the hardest thing to predict for most of the participants. The biggest advantage STATS' projections seem to have is that they make the fewest large errors in their BA predictions (only 34%).
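The tolerance test described above can be sketched in a few lines of Python. This is only an illustration: the function names and the five-player sample are hypothetical, not project data, and the article does not say exactly how ties at the tolerance were handled (a strict "smaller than" is assumed here).

```python
def mean_abs_change(prev, curr):
    """Mean absolute year-to-year change of a statistic across players."""
    return sum(abs(c - p) for p, c in zip(prev, curr)) / len(prev)


def share_within(forecast, actual, tolerance):
    """Fraction of forecasts whose absolute error is smaller than the tolerance."""
    hits = sum(1 for f, a in zip(forecast, actual) if abs(f - a) < tolerance)
    return hits / len(forecast)


# Hypothetical batting averages in points (285 means .285); not project data.
ba_1997 = [285, 260, 301, 247, 275]
ba_1998 = [270, 281, 296, 262, 243]
ba_forecast = [280, 265, 290, 250, 270]

# The tolerance is the statistic's "natural variability" from 1997 to 1998.
tol = mean_abs_change(ba_1997, ba_1998)

full = share_within(ba_forecast, ba_1998, tol)      # within the full tolerance
half = share_within(ba_forecast, ba_1998, tol / 2)  # within half the tolerance

print(f"tolerance = {tol:.1f} points, within = {full:.0%}, within half = {half:.0%}")
```

Running the same two calls for each statistic and each forecaster would give one row per forecaster, which is the shape of the percentages discussed above.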
Two of the objective projection systems, Vlad and Wilton, have more of a "gambling" nature than the STATS and Palmer systems. This shows up in that Vlad and Wilton are likely to be very right or very wrong. Note, for example, that among the computer projections Vlad has the most ISO predictions within one-half of the mean absolute difference (40%) despite having the fewest predictions within the mean absolute difference (56%). STATS, on the other hand, has the most ISO predictions within the mean absolute difference (71%), and the fewest within one-half of the mean absolute difference (34%). The conservative nature of STATS' system thus significantly reduces the number of very big errors, and with it the RMSE, but at the cost of producing fewer very accurate projections than a more aggressive approach. The less conservative Vlad and Wilton systems produce a series of projections that comes closer to the observed variance of performance, but they don't always assign the big changes to the right hitters. It is likely that the Vlad and Wilton systems have more room for future improvement because of that trait.
While this project was not a contest (we were more interested in the 'people vs. tools' angle), we did keep track of each participant's projections and how accurate they were. Below we've listed all the participants in order, alongside the average RMSE of the three main variables (BA, OBP-BA, ISO). Following that is a similar ranking by plain MSE, which "punishes" large errors less than RMSE. The use of the three variables is almost certainly fairer than ranking just by OPS, since you can rank high in an OPS ranking if you are equally wrong in different directions on several of the variables. The three variables are like length, width, and height, while OPS is like volume. Nevertheless, we include an RMSE ranking by OPS as well in the third table below.
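The point about being "equally wrong in different directions" can be made concrete with a small Python sketch. The two stat lines below are hypothetical, not project data, and the sketch assumes the usual definitions ISO = SLG - BA and OPS = OBP + SLG, so that OPS can be rebuilt as 2*BA + (OBP-BA) + ISO.

```python
from math import sqrt


def rmse(forecast, actual):
    """Root mean squared error between paired forecasts and outcomes."""
    return sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))


def ops(ba, obp_minus_ba, iso):
    """OPS = OBP + SLG, rebuilt from components as 2*BA + (OBP-BA) + ISO."""
    return 2 * ba + obp_minus_ba + iso


# Hypothetical (BA, OBP-BA, ISO) lines in points for two hitters; not project data.
actual = [(280, 70, 150), (250, 60, 100)]
forecast = [(260, 80, 180), (265, 50, 80)]

# RMSE of each component separately, then their average (the ranking measure).
component_rmse = [
    rmse([f[i] for f in forecast], [a[i] for a in actual]) for i in range(3)
]
avg_component_rmse = sum(component_rmse) / 3

# RMSE of OPS alone: the component errors cancel exactly in this example.
ops_rmse = rmse([ops(*f) for f in forecast], [ops(*a) for a in actual])

print(f"average component RMSE = {avg_component_rmse:.1f}, OPS RMSE = {ops_rmse:.1f}")
```

Here every component is missed by 10 to 30 points, yet the OPS forecast is perfect for both hitters, which is why a ranking on OPS alone can flatter a forecaster.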
BA, OBP-BA, and ISO by RMSE
 1. STATS                          28.9
 2. Mean of People's Projections   29.3
 3. Palmer                         30.3
 4. Steve Moyer                    30.4
 5. Dave Schoenfeld                30.5
 6. Greg Spira                     30.5
 7. Steve Rubio                    30.7
 8. Wilton                         30.7
 9. Jim Furtado                    31.1
10. Vlad                           31.2
11. John Sickels                   31.3
-------------------------------------
Above are One Standard Deviation better than average human projection
12. Dean Carrano                   31.6
13. TJ Madison                     32.0
14. Jeff Joseph                    32.6
15. David Pease                    32.8
16. Michael Wolverton              32.8
17. Chris Conley                   32.8
18. Mark Jareb                     33.0
19. Greg Bunimovich                33.0
20. Joe Sheehan                    33.1
21. Sean Forman                    33.6
22. Gregg Pearlman                 33.7
23. Jason Gische                   33.9
24. HJ Park                        34.0
25. Dan Szymborski                 34.0
26. Doug Pappas                    34.2
27. Gary Huckabay                  34.2
28. Jeff Hildebrand                34.2
-------------------------------------
Below are One Standard Deviation worse than average human projection
29. Allen Speir                    34.4
30. Ron Johnson                    35.0
31. Daniel Levine                  35.7

BA, OBP-BA, and ISO by MSE
 1. STATS                          23.0
 2. Mean of People's Projections   23.3
-------------------------------------
Above are Two Standard Deviations better than average human projection
 3. Steve Moyer                    23.8
 4. Greg Spira                     24.0
 5. Palmer                         24.1
 6. Steve Rubio                    24.3
 7. Wilton                         24.5
 8. Dean Carrano                   24.6
 9. Jim Furtado                    24.6
-------------------------------------
Above are One Standard Deviation better than average human projection
10. Vlad                           24.7
11. Dave Schoenfeld                24.8
12. TJ Madison                     24.9
13. John Sickels                   24.9
14. Jeff Joseph                    25.2
15. Michael Wolverton              25.3
16. Greg Bunimovich                25.4
17. Joe Sheehan                    25.6
18. David Pease                    25.6
19. Mark Jareb                     25.9
20. Chris Conley                   25.9
21. Sean Forman                    26.0
22. Dan Szymborski                 26.2
23. Gregg Pearlman                 26.2
24. Doug Pappas                    26.6
25. Gary Huckabay                  26.6
26. Jeff Hildebrand                26.9
-------------------------------------
Below are One Standard Deviation worse than average human projection
27. HJ Park                        27.0
28. Ron Johnson                    27.2
29. Jason Gische                   27.3
30. Allen Speir                    27.5
31. Daniel Levine                  27.7

OPS by RMSE
 1. Steve Rubio                    82.8
 2. STATS                          83.3
 3. John Sickels                   83.9
 4. Vlad                           85.1
 5. Mean of People's Projections   85.5
 6. Steve Moyer                    86.4
 7. Gary Huckabay                  86.7
-------------------------------------
Above are One Standard Deviation better than the average human projection
 8. Palmer                         86.9
 9. Jim Furtado                    87.0
10. TJ Madison                     87.3
11. Wilton                         87.7
12. Greg Spira                     88.4
13. Joe Sheehan                    88.9
14. Michael Wolverton              89.7
15. Dave Schoenfeld                90.0
16. Dan Szymborski                 90.1
17. Jeff Joseph                    91.5
18. Sean Forman                    91.7
19. Dean Carrano                   92.3
-------------------------------------
Below are One Standard Deviation worse than average human projection
20. Mark Jareb                     93.2
21. Dave Pease                     94.2
22. HJ Park                        94.6
23. Ron Johnson                    94.8
24. Chris Conley                   95.0
25. Gregg Pearlman                 96.0
26. Doug Pappas                    96.6
27. Greg Bunimovich                98.3
28. Jason Gische                  100.7
29. Jeff Hildebrand               102.1
30. Allen Speir                   104.1
-------------------------------------
Below is Two Standard Deviations worse than average human projection
31. Daniel Levine                 105.6

Now let's look at which players each system did its worst on:
Andres Galarraga was obviously the biggest problem. Every system expected his power to largely disappear out of Coors Field. It didn't. It will be interesting to see how these systems handle projecting Galarraga's 1999. Was it a complete fluke, a real change, or did the systems miss something? Meanwhile, Rich Becker walked a lot less than anyone expected, and no system anticipated quite how horrible John Flaherty would be.
We can reach the not-so-wild conclusion from these worst misses that no system is really immune from blowing it completely when ballplayers do the totally unexpected; the worst projections from each system are probably going to be similar to the worst projections from all the other systems year after year. It seems unlikely that the projection systems, especially the more conservative ones, can improve at all in this area.
Now for the best guesses from the various systems:
There's a lot less repetition here. Only four players (Allensworth, King, Matheny, and E. Young) get mentioned twice. No player was the best pick for more than one measurement by any system, and no player was the best pick in any one measurement for more than one system. There doesn't seem to be any discernible trend in the players who make this list, although a notable number of them do have very little power. Rich Becker is the only player who pops up on both lists, showing up as Pete Palmer's best isolated power projection and his worst walk/HBP rate projection.
Now, on which players did the humans agree the most on their individual projections?
BA - Jose Cruz
OBP - Troy O'Leary
SLG - Mike Bordick
OPS - Kenny Lofton
OBP-BA - Michael Tucker
ISO - Mike Bordick
Overall - Mike Matheny
None of these projections bombed, though everybody vastly underestimated what Mike Bordick's power would be like in 1998.
Meanwhile, the most disagreement among the humans about how players would do in 1998 showed up in these players:
BA - Gary Sheffield
OBP - Doug Glanville
SLG - Gary Sheffield
OPS - Gary Sheffield
OBP-BA - Brian Johnson
ISO - Gary Sheffield
Overall - Gary Sheffield
This doesn't tell us much except that everybody thought differently about which Gary Sheffield would show up in 1998.
There is a small positive correlation between the disagreement and the error of the mean human forecast. The amounts are small enough to mean little, except perhaps for the .36 correlation in the ISO category. That correlation could mean that people have more trouble calculating ISO than other parts of a ballplayer's offense.
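For readers who want to run this kind of check themselves, here is a minimal Python sketch of a correlation between disagreement and error. It rests on assumptions the article does not spell out: disagreement is measured as the standard deviation of the individual forecasts for a player, error as the absolute miss of the mean forecast, and the correlation is Pearson's. The hitters and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ISO forecasts (in points) from four forecasters; not project data.
forecasts = {
    "Hitter A": [150, 155, 148, 152],
    "Hitter B": [200, 160, 240, 180],
    "Hitter C": [90, 100, 95, 85],
}
actual = {"Hitter A": 158, "Hitter B": 230, "Hitter C": 80}

# Disagreement: spread of the individual forecasts for each hitter.
spread = [pstdev(values) for values in forecasts.values()]
# Error: how far the mean forecast landed from the observed value.
error = [abs(mean(forecasts[name]) - actual[name]) for name in forecasts]

r = pearson(spread, error)
print(f"correlation between disagreement and error: {r:.2f}")
```

With only three made-up hitters the correlation comes out far stronger than the small values reported above; the sketch shows the mechanics, not the result.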
Among the statistical projection systems, Brian Johnson, Joe Carter and Rafael Palmeiro were the sources of the most agreement, while Rich Becker was the projection on which the systems disagreed the most. There does seem to be a bit more agreement among the systems on veterans than younger players, but other than that there are no obvious trends that show up. The humans did not show any particularly strong unanimity on any of the players the computer systems most strongly agreed upon, strangely enough. We don't really see all that much of a relationship between the size of the spread of the predictions and the accuracy of the projections except in the one case mentioned above.
All in all there aren't that many conclusions we can reach yet. This is really the start of the research, not its conclusion. We plan to continue this project next season in some form.
What we have learned from this is that a group of knowledgeable baseball fans can, as a group, predict offensive performance similarly to the best computer projection systems, though not any better. At the same time, most knowledgeable baseball fans probably won't be able to do projections themselves that are as good as published computer-based projections. We've always seen that different projection systems can be successful in different ways, but that none really succeed in any way that's remarkable.
Hopefully, various nuggets we've learned along the way will lead to more interesting discoveries in the future. But for now, let us thank everyone who has participated in this study. The 27 people who contributed "human" projections, almost all of whom found that the work was harder to do than expected but persevered anyway, get a big hand. Worth singling out among the 27 is Daniel Levine, who also helped arrange all the data into usable form. We also thank the designers of the computer projection systems: Bill James and STATS (let us note here that we know that the computer projections from STATS, unlike the other computer projections analyzed here, are occasionally adjusted a bit by humans; we just aren't very concerned about this), Pete Palmer, Gary Huckabay, and Clay Davenport. And let us conclude by thanking Harold Brooks, the co-author of this piece, who is uniquely responsible for most of the intelligence found in this study.
If you're interested in participating in whatever form this project continues in next season, please send an e-mail to email@example.com. Note that we definitely are looking to include more computer projection systems next year. And that pitchers and catchers report in six weeks.
Finally, how did the projection systems do in terms of forecasting big changes in player performance? Here are contingency tables that give the results of forecasts and observed big changes of OPS. I've made the problem into a 3x3 problem, with the "Down" category being a fall from 1997 of at least 70 points, the "Up" category being an improvement of at least 70 points, and the "Middle" being everybody else. 70 points is approximately the median absolute change from 1997. The columns give the number of forecast/observed pairs by the observed change and the rows give the number by the forecast change. For example, for the mean human forecast, there were 7 forecasts of a player dropping 70 points or more in 1998 where the player did drop by 70 points or more. There were four cases where that forecast was made and the player was within 70 points of 1997, no cases where the player went up by 70 points or more, and 11 cases where the player was forecast to stay within 70 points and he actually went down by 70 or more points. The last column (row) in each table gives the total number of forecast (observed) changes of 70 points or more. (Note that the observed total row is the same for each table).
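The bookkeeping behind such a table is simple enough to sketch in Python. The function names and the five forecast/observed pairs below are hypothetical; only the 70-point cutoff and the row/column convention come from the description above.

```python
THRESHOLD = 70  # OPS points; the cutoff for a "big" change


def classify(change, threshold=THRESHOLD):
    """Label a change in OPS from 1997 as Down ("D"), Middle ("M") or Up ("U")."""
    if change <= -threshold:
        return "D"
    if change >= threshold:
        return "U"
    return "M"


def contingency(forecast_changes, observed_changes, threshold=THRESHOLD):
    """Count forecast/observed pairs: rows are forecast, columns are observed."""
    labels = ["D", "M", "U"]
    table = {row: {col: 0 for col in labels} for row in labels}
    for f, o in zip(forecast_changes, observed_changes):
        table[classify(f, threshold)][classify(o, threshold)] += 1
    return table


# Hypothetical forecast and observed OPS changes in points; not project data.
forecast = [100, -80, 10, 75, -20]
observed = [-74, -90, 35, 20, -110]

table = contingency(forecast, observed)
for row in ["D", "M", "U"]:
    print(row, [table[row][col] for col in ["D", "M", "U"]])
```

The first hypothetical pair (forecast up 100, observed down 74) lands in the bottom-left corner, the kind of miss the grids are designed to expose.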
So what are you looking at? In reading these grids, keep in mind that it's good to be on the main (top left-bottom right) diagonals or corners, and bad to be off of them. There are only three cases in the bad corners (top right and bottom left) --Vlad had Rich Becker going up by 100 points and he went down by 74, while Wilton had Bernard Gilkey going up by 97 (down by 115) and Steve Finley going up by 77 (down by 84).
The tables can be summarized by a variety of measures. A particularly appropriate one is one known (at least in meteorology) as the Heidke score. It gives the fraction in the right boxes (main diagonal) reduced by the number you would get right by random guessing. The Heidke score is at the bottom right of each table (for the mean human, it's .208). The best (worst) possible score is 1 (-1). A score of zero is associated with random guessing. Vlad wins by this measure, while Wilton does surprisingly badly.
The value in parentheses after each forecast name is the percentage of forecasts that were either big drops or big ups. The observed value is 49%. Vlad has the highest value at 35%, so it makes forecasts that "look" the most like a real distribution of observed values, but it's still a ways from reality.
                Observed Changes
                Down  Middle    Up  Total

Mean (20%)
  D                7       4     0     11
  M               11      38    20     69
  U                0       2     4      6
  Tot             18      44    24   .208

STATS (29%)
  D                9       9     0     18
  M                9      34    18     61
  U                0       1     6      7
  Tot             18      44    24   .246

Vlad (35%)
  D                8       6     0     14
  M                9      32    15     56
  U                1       6     9     16
  Tot             18      44    24   .259

Palmer (15%)
  D                7       3     0     10
  M               11      40    22     73
  U                0       1     2      3
  Tot             18      44    24   .191

Wilton (23%)
  D                5       5     0     10
  M               11      36    19     66
  U                2       3     5     10
  Tot             18      44    24   .155
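As a check on the Heidke scores, here is a short Python sketch using the standard form of the score: the fraction of forecasts on the main diagonal, minus the fraction expected from the row and column totals by chance, divided by one minus that chance fraction. That this is the exact formula used for the tables is an assumption, though it does reproduce the published values for the three tables tried here.

```python
def heidke(table):
    """Heidke skill score for a square table (rows: forecast, columns: observed)."""
    n = sum(sum(row) for row in table)
    # Fraction of forecasts that landed on the main diagonal.
    correct = sum(table[i][i] for i in range(len(table))) / n
    # Fraction expected on the diagonal if forecasts were unrelated to outcomes.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (correct - chance) / (1 - chance)


# Counts copied from the 70-point tables above (rows D, M, U; columns D, M, U).
MEAN = [[7, 4, 0], [11, 38, 20], [0, 2, 4]]
STATS = [[9, 9, 0], [9, 34, 18], [0, 1, 6]]
VLAD = [[8, 6, 0], [9, 32, 15], [1, 6, 9]]

for name, tbl in [("Mean", MEAN), ("STATS", STATS), ("Vlad", VLAD)]:
    print(f"{name}: {heidke(tbl):.3f}")
```

All three systems put 49 of 86 hitters on the diagonal, so the differences in score come entirely from the chance term: a system that forecasts "Middle" for nearly everyone gets less credit for the same number of correct boxes.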
To summarize, it's pretty rare for a system to forecast a big breakout (or bust) and have the opposite happen. It's much more common to miss big changes. The percentages of big changes (either sign) correctly forecast by the systems were:
Mean    26%
STATS   36%
Vlad    41%
Palmer  21%
Wilton  24%
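These percentages, and the parenthetical ones next to each system's name, can be recomputed from the contingency tables. The Python sketch below assumes that "correctly forecast" means the Down/Down and Up/Up cells divided by all observed big changes; the article does not state the definition, but this reading reproduces the published figures for the three tables used here. The function name is hypothetical.

```python
def big_change_rates(table):
    """Return (share of forecasts that were big, share of observed big changes caught).

    The table has rows for forecast D, M, U and columns for observed D, M, U.
    """
    n = sum(sum(row) for row in table)
    # Forecasts in the Down or Up rows, as a share of all forecasts.
    forecast_big = (sum(table[0]) + sum(table[2])) / n
    # Observed big changes: everything in the Down and Up columns.
    observed_big = sum(row[0] + row[2] for row in table)
    # Big changes forecast with the correct sign: the two diagonal corner cells.
    caught = (table[0][0] + table[2][2]) / observed_big
    return forecast_big, caught


# Counts copied from the 70-point contingency tables.
MEAN = [[7, 4, 0], [11, 38, 20], [0, 2, 4]]
STATS = [[9, 9, 0], [9, 34, 18], [0, 1, 6]]
PALMER = [[7, 3, 0], [11, 40, 22], [0, 1, 2]]

for name, tbl in [("Mean", MEAN), ("STATS", STATS), ("Palmer", PALMER)]:
    forecast_big, caught = big_change_rates(tbl)
    print(f"{name}: {forecast_big:.0%} of forecasts were big, {caught:.0%} of big changes caught")
```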
Vlad and STATS are the best at picking up on the big changes by this measure. (This is not contradictory to the description of STATS as conservative--it's just that different measures can give different impressions. If a variety of measures give the same picture, as with Vlad, then you can have more confidence in the result.)
If I pick a lower threshold for big changes (50 points), I get the following:
               Heidke
Mean (35%)       .185
STATS (41%)      .249
Vlad (53%)       .165
Palmer (31%)     .204
Wilton (42%)     .102
Again, the parenthetical value is the percentage of forecasts of big changes, with the observed percentage = 62%. The conservative systems pass up Vlad in large part because of Vlad's struggles with the middle forecasts. The contingency table for Vlad at a 50-point threshold is
                Observed Changes
                Down  Middle    Up  Total
  D               10       7     1     18
  M               14      14    12     40
  U                1      12    15     28
  Tot             25      33    28   .165
Note how "flat" the distribution in the middle row is. It's very peaked at the 70-point threshold, and having only 42% of the observed middle group in the forecast middle hurts Vlad a lot. The lowest value for any of the other systems at the 50-point threshold is 58%.
Now, for the really bad forecasts:
Forecast to go down by more than 50 points and went up by 50 points:
Andres Galarraga (missed by everyone)
Mickey Morandini (missed by STATS, forecast down 69)
John Olerud (missed by Wilton, forecast down 55)

Forecast to go up by more than 50 points and went down by 50 points:
Bernard Gilkey (missed by mean human, Palmer, Wilton)
Rich Becker (missed by Vlad, forecast up 100)
Finally, a few notes from Clay about the version of Wilton used in Harold's study and the final version that was used in BP '99: in the study, the Wilton program mentioned was a prototype of the version which appears in the book. Most importantly, it did not convert statistics from the DT format to the team/league environment, except for a simple Colorado adjustment. The prototype also contained several bugs which had yet to be caught (but were caught before we put Wilton into this year's book). Finally, the prototype had different criteria for choosing "matching" players and weighting their contributions. Overall, the differences should be minor; the judgments of Wilton's performance are still valid, even if the exact assessment varies.