The Prospectus Projections Project began last February, when we asked about
40 people to make "human" projections of how 100 hitters and 25 pitchers
would perform in 1998. We wanted to see how people would do compared to the
computer formulas that appear in such publications as Baseball
Prospectus and STATS’ Major League Handbook. In the end, 27 people – all
knowledgeable baseball fans with a good fundamental understanding of
baseball and baseball statistics – ended up sending us their projections by
We emphasized to all our participants that their projections had to be at
least somewhat from the gut. While everybody was allowed to use whatever
knowledge, statistics, or books that they had in their possession, the
projections were not to come out of a computer or even a formula. A few
potential participants begged out because of this clause. But our helpful
27 contributed their projected batting average, slugging average, and
on-base percentage for 100 mostly randomly selected hitters (we avoided
players on expansion teams, so as to not play "guess the park factor"), and
the projected ERA, K rate, BB rate, and IP for 25 pitchers. The pitching
data will be analyzed in a future article; we’re here to tell you the
results of the offensive projections.
It’s a bit harder to come up with "results" than you might think. Not all
projection systems aim for the same type of success. If you’re trying to
hit on a long shot at the racetrack, for example, a conservative projection
system that sticks closely to what horses have done in the past is going to
do you much less good than a more aggressive system that tries to project
which long shot is going to break out of the pack. Even if the aggressive
system succeeds, it will have far more serious errors on average, but it
may very well make you more money because it picked out one winner who
pays off huge.
One common way – though again, there is no real "right" way – of analyzing
projections is to measure the mean absolute error (MAE) and the root mean
squared error (RMSE), so that’s what we did in this case. Table 1
summarizes the errors for the human and computer forecasts by giving the
minimum human error (Min), the maximum human error (Max) and then the
errors for the mean human forecast and four objective systems, all in terms
of RMSE. The objective systems that we analyzed included Bill James’ system
(as included in the STATS 1998 Major League Handbook and later updates),
the neural net Vlad system (as used in Baseball Prospectus 1998), Pete
Palmer’s projection system (as available in The Spy: Baseball ’98), and
experimental projections from Wilton, the projection system that will be
used in the 1999 edition of the Baseball Prospectus.
Bold figures indicate that the value is statistically significantly better
than the 27 human forecasts at the 95% confidence level.
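For readers who want to see the two error measures side by side, here is a minimal sketch. The batting averages are made-up illustrative numbers (in points, so .300 is 300), not data from the project, and the function names are our own.

```python
import math

def mae(forecast, actual):
    """Mean absolute error: the average size of the misses, ignoring sign."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)

def rmse(forecast, actual):
    """Root mean squared error: squaring weights big misses more heavily."""
    squared = sum((f - a) ** 2 for f, a in zip(forecast, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical batting averages in points; NOT project data.
projected = [280, 265, 310, 250]
observed = [290, 255, 270, 250]

print(mae(projected, observed))             # 15.0
print(round(rmse(projected, observed), 1))  # 21.2
```

The single 40-point miss is what pulls the RMSE well above the MAE, which is why a system that avoids big misses tends to look better by RMSE.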
Another way to compare the human forecasts to the other systems is to look
at the number of people who had more accurate forecasts than the mean or
objective systems for each variable, as in Table 2.
STATS’ projections were the most accurate forecasts for three of the
variables (BA, SLG and ISO), were second to one person’s (Steve Moyer’s) in
OBA by less than one-tenth of a point, were second to one person’s (Steve
Rubio’s) in OPS by half a point, and were 7th in OBA-BA (behind five
people’s and the mean human forecast). From the two charts we can see that,
in general, the objective forecasts were usually more accurate than almost
all human forecasts, except for OBA-BA, where the most accurate forecasts
were competitive with all of the objective systems, to the extent that the
mean human forecast was the most accurate. Another way of analyzing the
projections is to test how often they fall within a certain tolerance. A
natural tolerance to use is the mean absolute difference in a statistic
between 1997 and 1998, its "natural variability." For example, the mean
absolute change
in batting average was 25.9 points between 1997 and 1998 (regardless of the
sign). Below, Table 3 gives the percentage of predictions that had errors
smaller than the mean absolute difference for each statistic.
Table 4 gives the percentage of forecasts with errors smaller than one-half the
mean absolute difference.
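The tolerance test can be sketched as follows. The numbers are hypothetical batting averages in points, and the helper names are ours; the only thing taken from the text is the idea of counting errors smaller than the mean absolute year-to-year change, or half of it.

```python
def mean_abs_change(prev, curr):
    """Average absolute year-to-year change in a statistic."""
    return sum(abs(c - p) for p, c in zip(prev, curr)) / len(prev)

def share_within(forecast, actual, tolerance):
    """Fraction of forecasts whose error is smaller than the tolerance."""
    hits = sum(1 for f, a in zip(forecast, actual) if abs(f - a) < tolerance)
    return hits / len(actual)

# Hypothetical batting averages in points; NOT project data.
ba_1997 = [300, 260, 280, 240]
ba_1998 = [270, 280, 285, 265]
projected = [290, 265, 284, 250]

tolerance = mean_abs_change(ba_1997, ba_1998)
print(tolerance)                                        # 20.0
print(share_within(projected, ba_1998, tolerance))      # 0.75
print(share_within(projected, ba_1998, tolerance / 2))  # 0.25
```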
These tables reaffirm some of the things we know from Tables 1 and 2, but
they also offer up a lot of new information. It seems that
roughly 2/3 of the projections are correct to within the mean absolute
change from year to year, and roughly one-third are right within half that
value. The least accurately predicted statistic in the project this year
was slugging, with the best performance of the 31 forecasts being only 29%
correct within the tighter tolerance. Curiously, the non-BA components of
OBP (OBP-BA) and SLG (ISO) were the most accurately predicted statistics,
indicating that batting average was the hardest thing to predict for most
of the participants. The biggest advantage STATS’ projections seem to have
is that they make the fewest large errors in their BA predictions (only 34%).
Two of the objective projection systems, Vlad and Wilton, have more of a
"gambling" nature than the STATS and Palmer systems. This shows up in that
Vlad and Wilton are likely to be very right or very wrong. Note, for
example, that among the computer projections Vlad has the most ISO
predictions within one-half of the absolute difference (40%) despite having
the fewest predictions within the mean absolute difference (56%). STATS, on
the other hand, has the most ISO predictions within the mean absolute
difference (71%), and the fewest within one-half of the mean absolute
difference (34%). The conservative nature of STATS’ system thus
significantly reduces the number of very big errors and thus its RMSE, but
at the cost of very accurate projections when compared to a more aggressive
approach. The less conservative Vlad and Wilton systems produce a series of
projections that comes closer to the observed variance of performance, but
don’t always assign the big changes to the right hitters. It is likely that
the Vlad and Wilton systems have more room for future improvement because
of that trait.
While this project was not a contest (we were more interested in the
‘people vs. tools’ angle), we did keep track of each participant’s
projections and how accurate they were. Below we’ve listed all the
participants in order of the average RMSE of the three main
variables (BA, OBP-BA, ISO). Following that is a similar ranking by plain
MSE, which "punishes" large errors less than RMSE. The use of the three
variables is almost certainly fairer than ranking just by OPS, since you
can rank high in an OPS ranking if you are equally wrong in different
directions on several of the variables. The three variables are like
length, width, and height, while OPS is like volume. Nevertheless, we
include an RMSE ranking by OPS as well in the third table below.
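The cancellation is easy to see with a little arithmetic. Since OBP = BA + (OBP-BA) and SLG = BA + ISO, an OPS error is twice the BA error plus the errors in the other two components. The sketch below uses invented misses for a single hitter; the function name is ours.

```python
def ops_error(ba_err, obp_minus_ba_err, iso_err):
    """OPS = OBP + SLG = (BA + (OBP-BA)) + (BA + ISO), so BA counts twice."""
    return 2 * ba_err + obp_minus_ba_err + iso_err

# Hypothetical misses in points (projected minus actual); NOT project data.
ba_err, walk_err, iso_err = 10, -5, -15

component_miss = (abs(ba_err) + abs(walk_err) + abs(iso_err)) / 3
print(ops_error(ba_err, walk_err, iso_err))  # 0
print(component_miss)                        # 10.0
```

This projection is wrong on every component yet looks perfect by OPS, which is the reason for ranking on the three components separately.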
BA, OBP-BA, and ISO by RMSE
 1. STATS                          28.9
 2. Mean of People's Projections   29.3
 3. Palmer                         30.3
 4. Steve Moyer                    30.4
 5. Dave Schoenfeld                30.5
 6. Greg Spira                     30.5
 7. Steve Rubio                    30.7
 8. Wilton                         30.7
 9. Jim Furtado                    31.1
10. Vlad                           31.2
11. John Sickels                   31.3
-------------------------------------
Above are One Standard Deviation better than average human projection
12. Dean Carrano                   31.6
13. TJ Madison                     32.0
14. Jeff Joseph                    32.6
15. David Pease                    32.8
16. Michael Wolverton              32.8
17. Chris Conley                   32.8
18. Mark Jareb                     33.0
19. Greg Bunimovich                33.0
20. Joe Sheehan                    33.1
21. Sean Forman                    33.6
22. Gregg Pearlman                 33.7
23. Jason Gische                   33.9
24. HJ Park                        34.0
25. Dan Szymborski                 34.0
26. Doug Pappas                    34.2
27. Gary Huckabay                  34.2
28. Jeff Hildebrand                34.2
-------------------------------------
Below are One Standard Deviation worse than average human projection
29. Allen Speir                    34.4
30. Ron Johnson                    35.0
31. Daniel Levine                  35.7
BA, OBP-BA, and ISO by MSE
 1. STATS                          23.0
 2. Mean of People's Projections   23.3
-------------------------------------
Above are Two Standard Deviations better than average human projection
 3. Steve Moyer                    23.8
 4. Greg Spira                     24.0
 5. Palmer                         24.1
 6. Steve Rubio                    24.3
 7. Wilton                         24.5
 8. Dean Carrano                   24.6
 9. Jim Furtado                    24.6
-------------------------------------
Above are One Standard Deviation better than average human projection
10. Vlad                           24.7
11. Dave Schoenfeld                24.8
12. TJ Madison                     24.9
13. John Sickels                   24.9
14. Jeff Joseph                    25.2
15. Michael Wolverton              25.3
16. Greg Bunimovich                25.4
17. Joe Sheehan                    25.6
18. David Pease                    25.6
19. Mark Jareb                     25.9
20. Chris Conley                   25.9
21. Sean Forman                    26.0
22. Dan Szymborski                 26.2
23. Gregg Pearlman                 26.2
24. Doug Pappas                    26.6
25. Gary Huckabay                  26.6
26. Jeff Hildebrand                26.9
-------------------------------------
Below are One Standard Deviation worse than average human projection
27. HJ Park                        27.0
28. Ron Johnson                    27.2
29. Jason Gische                   27.3
30. Allen Speir                    27.5
31. Daniel Levine                  27.7
OPS by RMSE
 1. Steve Rubio                    82.8
 2. STATS                          83.3
 3. John Sickels                   83.9
 4. Vlad                           85.1
 5. Mean of People's Projections   85.5
 6. Steve Moyer                    86.4
 7. Gary Huckabay                  86.7
-------------------------------------
Above are One Standard Deviation better than average human projection
 8. Palmer                         86.9
 9. Jim Furtado                    87.0
10. TJ Madison                     87.3
11. Wilton                         87.7
12. Greg Spira                     88.4
13. Joe Sheehan                    88.9
14. Michael Wolverton              89.7
15. Dave Schoenfeld                90.0
16. Dan Szymborski                 90.1
17. Jeff Joseph                    91.5
18. Sean Forman                    91.7
19. Dean Carrano                   92.3
-------------------------------------
Below are One Standard Deviation worse than average human projection
20. Mark Jareb                     93.2
21. David Pease                    94.2
22. HJ Park                        94.6
23. Ron Johnson                    94.8
24. Chris Conley                   95.0
25. Gregg Pearlman                 96.0
26. Doug Pappas                    96.6
27. Greg Bunimovich                98.3
28. Jason Gische                  100.7
29. Jeff Hildebrand               102.1
30. Allen Speir                   104.1
-------------------------------------
Below is Two Standard Deviations worse than average human projection
31. Daniel Levine                 105.6
Now let’s look at the players on which each system did its worst:
Andres Galarraga was obviously the biggest problem. Every system expected
his power to largely disappear out of Coors Field. It didn’t. It will be
interesting to see how these systems handle projecting Galarraga’s 1999.
Was it a complete fluke, a real change, or did the systems miss something?
Meanwhile, Rich Becker walked a lot less than anyone expected, and no
system anticipated quite how horrible John Flaherty would be.
We can reach the not-so-wild conclusion from these worst misses that no
system is really immune from blowing it completely when ballplayers do the
totally unexpected; the worst projections from each system are probably
going to be similar to the worst projections from all the other systems
year after year. It seems unlikely that the projection systems, especially
the more conservative ones, can improve at all in this area.
Now for the best guesses from the various systems:
There’s a lot less repetition here. Only four players (Allensworth, King,
Matheny, and E. Young) get mentioned twice. No player was the best pick for
more than one measurement by any system, and no player was the best pick in
any one measurement for more than one system. There doesn’t seem to be any
discernible trend in the players who make this list, although a notable
number of them do have very little power. Rich Becker is the only player
who pops up on both lists, showing up as Pete Palmer’s best isolated power
projection and his worst walk/HBP rate projection.
Now, on which players did the humans agree the most in their individual
projections?
BA      - Jose Cruz
OBP     - Troy O'Leary
SLG     - Mike Bordick
OPS     - Kenny Lofton
OBP-BA  - Michael Tucker
ISO     - Mike Bordick
Overall - Mike Matheny
None of these projections bombed, though everybody vastly underestimated
what Mike Bordick’s power would be like in 1998.
Meanwhile, the most disagreement among the humans about how players would
do in 1998 showed up in these players:
BA      - Gary Sheffield
OBP     - Doug Glanville
SLG     - Gary Sheffield
OPS     - Gary Sheffield
OBP-BA  - Brian Johnson
ISO     - Gary Sheffield
Overall - Gary Sheffield
This doesn’t tell us much except that everybody thought differently about
which Gary Sheffield would show up in 1998.
There is a small positive correlation between the disagreement and the
error of the mean human forecast. The amounts are small enough to mean
little, except perhaps for the .36 correlation in the ISO category. That
correlation could mean that people have more trouble calculating ISO than
other parts of a ballplayer’s offense.
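One way to compute such a correlation is sketched below. We assume here that "disagreement" is measured as the standard deviation of the individual forecasts for a player and that "error" is the absolute miss of the mean forecast; the article does not spell out either definition, and the players and numbers are invented.

```python
import math
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ISO projections (points) from four forecasters; NOT project data.
projections = {
    "Player A": [150, 155, 145, 150],
    "Player B": [200, 120, 180, 140],
    "Player C": [100, 110, 90, 100],
}
actual = {"Player A": 160, "Player B": 210, "Player C": 105}

# Assumed definitions: spread of the forecasts, and miss of the mean forecast.
disagreement = [pstdev(p) for p in projections.values()]
mean_error = [abs(mean(p) - actual[name]) for name, p in projections.items()]

print(round(pearson(disagreement, mean_error), 2))
```

With only three invented players the result means nothing in itself; the point is the shape of the calculation, which would be run over all 100 hitters for each statistic.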
Among the statistical projection systems, Brian Johnson, Joe Carter and
Rafael Palmeiro were the sources of the most agreement, while Rich Becker
was the projection on which the systems disagreed the most. There does seem
to be a bit more agreement among the systems on veterans than younger
players, but other than that there are no obvious trends that show up. The
humans did not show any particularly strong unanimity on any of the players
the computer systems most strongly agreed upon, strangely enough. We don’t
really see all that much of a relationship between the size of the spread
of the predictions and the accuracy of the projections except in the one
case mentioned above.
All in all there aren’t that many conclusions we can reach yet. This is
really the start of the research, not its conclusion. We plan to continue
this project next season in some form.
What we have learned from this is that a group of knowledgeable baseball
fans can, as a group, predict offensive performance similarly to the best
computer projection systems, though not any better. At the same time, most
knowledgeable baseball fans probably won’t be able to do projections
themselves that are as good as published computer-based projections. We’ve
always seen that different projection systems can be successful in
different ways, but that none really succeed in any way that’s remarkable.
Hopefully, various nuggets we’ve learned along the way will lead to more
interesting discoveries in the future. But for now, let us thank everyone
who has participated in this study. The 27 people who contributed "human"
projections, almost all of whom found that the work was harder to do than
expected but persevered anyway, get a big hand. Worth singling out among
the 27 is Daniel Levine, who also helped arrange all the data into usable
form. We also thank the designers of the computer projection systems: Bill
James and STATS (let us note here that we know that the computer
projections from STATS, unlike the other computer projections analyzed
here, are occasionally adjusted a bit by humans; we just aren’t very
concerned about this), Pete Palmer, Gary Huckabay, and Clay Davenport. And
let us conclude by thanking Harold Brooks, the co-author of this piece, who
is uniquely responsible for most of the intelligence found in this study.
If you’re interested in participating in whatever form this project
continues in next season, please send an e-mail to firstname.lastname@example.org.
Note that we definitely are looking to include more computer projection
systems next year. And that pitchers and catchers report in six weeks.
Finally, how did the projection systems do in terms of forecasting big
changes in player performance? Here are contingency tables that give the
results of forecasts and observed big changes of OPS. I’ve made the
problem into a 3×3 problem, with the "Down" category being a fall from 1997
of at least 70 points, the "Up" category being an improvement of at least 70
points, and the "Middle" being everybody else. 70 points is approximately
the median absolute change from 1997. The columns give the number of
forecast/observed pairs by the observed change and the rows give the number
by the forecast change. For example, for the mean human forecast, there
were 7 forecasts of a player dropping 70 points or worse in 1998 that were
associated with players doing 70 points or more worse. There were four
cases where that forecast was made and the player was within 70 points of
1997, no cases where the player went up by 70 points or more, and 11 cases
where the player was forecast to stay within 70 points and he actually went
down by 70 or more points. The last column (row) in each table gives the
total number of forecast (observed) changes of 70 points or more. (Note
that the observed total row is the same for each table).
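Building such a grid is mechanical. The sketch below bins forecast and observed OPS changes with the 70-point threshold described above; rows are forecasts and columns are observations. The first three pairs are the Becker, Gilkey and Finley misses quoted in this article (they come from two different systems and are combined here only to show the binning); the fourth pair is invented.

```python
def category(change, threshold=70):
    """0 = Down, 1 = Middle, 2 = Up, using an 'at least threshold' rule."""
    if change <= -threshold:
        return 0
    if change >= threshold:
        return 2
    return 1

def contingency(forecast_changes, observed_changes, threshold=70):
    """3x3 table of counts: rows are forecast, columns are observed."""
    table = [[0] * 3 for _ in range(3)]
    for f, o in zip(forecast_changes, observed_changes):
        table[category(f, threshold)][category(o, threshold)] += 1
    return table

# Becker (Vlad), Gilkey and Finley (Wilton), plus one hypothetical hitter.
forecast = [100, 97, 77, 10]
observed = [-74, -115, -84, -20]

for row in contingency(forecast, observed):
    print(row)
# [0, 0, 0]
# [0, 1, 0]
# [3, 0, 0]
```

The three real misses all land in the bottom-left "bad corner": forecast Up, observed Down.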
So what are you looking at? In reading these grids, keep in mind that it’s
good to be on the main (top left-bottom right) diagonals or corners, and
bad to be off of them. There are only three cases in the bad corners (top
right and bottom left) –Vlad had Rich Becker going up by 100 points and he
went down by 74, while Wilton had Bernard Gilkey going up by 97 (down by
115) and Steve Finley going up by 77 (down by 84).
The tables can be summarized by a variety of measures. A particularly
appropriate one is one known (at least in meteorology) as the Heidke score.
It gives the fraction in the right boxes (main diagonal) reduced by the
number you would get right by random guessing. The Heidke score is at the
bottom right of each table (for the mean human, it’s .208). The best
(worst) possible score is 1 (-1). A score of zero is associated with
random guessing. Vlad wins by this measure, while Wilton does surprisingly
poorly.
The value in parentheses after each forecast name is the percentage of
forecasts that were either big drops or big ups. The observed value is
49%. Vlad has the highest value at 35%, so that it makes forecasts that
"look" the most like a real distribution of observed values, but it’s still
a ways from reality.
                  Observed Changes
                Down  Middle   Up   Total
Mean (20%)
  D                7       4    0      11
  M               11      38   20      69
  U                0       2    4       6
  Tot             18      44   24    .208
STATS (29%)
  D                9       9    0      18
  M                9      34   18      61
  U                0       1    6       7
  Tot             18      44   24    .246
Vlad (35%)
  D                8       6    0      14
  M                9      32   15      56
  U                1       6    9      16
  Tot             18      44   24    .259
Palmer (15%)
  D                7       3    0      10
  M               11      40   22      73
  U                0       1    2       3
  Tot             18      44   24    .191
Wilton (23%)
  D                5       5    0      10
  M               14      36   19      66
  U                2       3    5      10
  Tot             18      44   24    .155
To summarize, it’s pretty unlikely to forecast a big breakout (or bust) and
have the opposite happen. It’s much more likely to miss big changes. The
percentage of big changes (either sign) correctly forecast by the systems:
Mean    26%
STATS   36%
Vlad    41%
Palmer  21%
Wilton  24%
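These percentages appear to be the share of observed big changes that were forecast in the right direction, that is, the Down/Down and Up/Up cells divided by the observed Down and Up totals. That reading is our assumption; the sketch below applies it to the counts from the Mean and STATS tables.

```python
def big_change_hit_rate(table):
    """Share of observed big changes forecast with the right sign.

    Rows are forecast and columns are observed, each ordered Down, Middle, Up.
    """
    caught = table[0][0] + table[2][2]
    observed_big = sum(row[0] + row[2] for row in table)
    return caught / observed_big

# Counts copied from the 70-point contingency tables.
mean_human = [[7, 4, 0], [11, 38, 20], [0, 2, 4]]
stats = [[9, 9, 0], [9, 34, 18], [0, 1, 6]]

print(round(100 * big_change_hit_rate(mean_human)))  # 26
print(round(100 * big_change_hit_rate(stats)))       # 36
```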
Vlad and STATS are the best at picking up on the big changes by this
measure. (This is not contradictory to the description of STATS as
conservative–it’s just that different measures can give different
impressions. If a variety of measures give the same picture, as with Vlad,
then you can have more confidence in the result.)
If I pick a lower threshold for big changes (50 points), I get the following:
                 Heidke
Mean   (35%)      .185
STATS  (41%)      .249
Vlad   (53%)      .165
Palmer (31%)      .204
Wilton (42%)      .102
Again, the parenthetical value is the percentage of forecasts of big
changes, with the observed percentage = 62%. The conservative systems pass
up Vlad in large part because of Vlad’s struggles with the middle
forecasts. The contingency table for Vlad at a 50-point threshold is
                  Observed Changes
                Down  Middle   Up   Total
  D               10       7    1      18
  M               14      14   12      40
  U                1      12   15      28
  Tot             25      33   28    .165
Note how "flat" the distribution in the middle row is. It’s very peaked
for the 70 point threshold and having only 42% of the observed middle group
being in the forecast middle hurts Vlad a lot. The lowest value for any of
the other 50 point threshold systems is 58%.
Now, for the really bad forecasts:
Forecast to go down by more than 50 points and went up by 50 points:
  Andres Galarraga (missed by everyone)
  Mickey Morandini (missed by STATS, forecast down 69)
  John Olerud (missed by Wilton, forecast down 55)
Forecast to go up by more than 50 points and went down by 50 points:
  Bernard Gilkey (missed by mean human, Palmer, Wilton)
  Rich Becker (missed by Vlad, forecast up 100)
Finally, a few notes from Clay about the version of Wilton used in Harold’s
study and the final version that was used in BP ’99: in the study, the
Wilton program mentioned was a prototype of the version which appears in
the book. Most importantly, it did not convert statistics from the DT
format to the team/league environment, except for a simple Colorado
adjustment. The prototype also contained several bugs which had yet to be
caught (but were caught before we put Wilton into this year’s book).
Finally, the prototype had different criteria for choosing "matching"
players and weighting their contributions. Overall, the differences should
be minor; the judgments of Wilton’s performance are still valid, even if
the exact assessment varies.