
40 people to make "human" projections of how 100 hitters and 25 pitchers
would perform in 1998. We wanted to see how people would do compared to the
computer formulas that appear in such publications as Baseball Prospectus
and STATS’ Major League Handbook. In the end, 27 people – all
knowledgeable baseball fans with a good fundamental understanding of
baseball and baseball statistics – ended up sending us their projections by
mid-March.

We emphasized to all our participants that their projections had to be at
least somewhat from the gut. While everybody was allowed to use whatever
knowledge, statistics, or books that they had in their possession, the
projections were not to come out of a computer or even a formula. A few
potential participants begged out because of this clause. But our helpful
27 contributed their projected batting average, slugging average, and
on-base percentage for 100 mostly randomly selected hitters (we avoided
players on expansion teams, so as to not play "guess the park factor"), and
the projected ERA, K rate, BB rate, and IP for 25 pitchers. The pitching
data will be analyzed in a future article; we’re here to tell you the
results of the offensive projections.

It’s a bit harder to come up with "results" than you might think. Not all
projection systems aim for the same type of success. If you’re trying to
hit on a long shot at the racetrack, for example, a conservative projection
system that sticks closely to what horses have done in the past is going to
do you much less good than a more aggressive system that tries to project
which long shot is going to break out of the pack. Even if the aggressive
system succeeds, it will have far more serious errors on average, but it
may very well make you more money because it picked out one winner who can
pay off huge.

One common way – though again, there is no real "right" way – of analyzing
projections is to measure the mean absolute error (MAE) and the root mean
squared error (RMSE), so that’s what we did in this case. Table 1
summarizes the errors for the human and computer forecasts by giving the
minimum human error (Min), the maximum human error (Max) and then the
errors for the mean human forecast and four objective systems, all in terms
of RMSE. The objective systems that we analyzed included Bill James’ system
(as included in the STATS 1998 Major League Handbook and later updates),
the neural net Vlad system (as used in Baseball Prospectus 1998), Pete
Palmer’s projection system (as available in The Spy: Baseball ’98), and
experimental projections from Wilton, the projection system that will be
used in the 1999 edition of the Baseball Prospectus.

TABLE 1: Root Mean Squared Error

| RMSE | BA | OBP | SLG | OPS | OBP-BA | ISO |
|---|---|---|---|---|---|---|
| Min | 27.7 | 30.4 | 58.3 | 82.8 | 17.3 | 41.6 |
| Max | 35.8 | 39.0 | 76.2 | 105.6 | 22.8 | 53.8 |
| Mean | 28.7 | 31.4 | 60.0 | 85.5 | 17.0 | 42.1 |
| STATS | 27.4 | 30.4 | 58.1 | 83.3 | 18.6 | 40.8 |
| Vlad | 28.9 | 31.2 | 61.5 | 85.1 | 18.6 | 46.0 |
| Palmer | 29.1 | 33.0 | 60.1 | 86.9 | 19.8 | 41.9 |
| Wilton | 29.7 | 32.6 | 62.0 | 87.7 | 19.1 | 43.4 |

Bold figures indicate that the value is statistically significantly better
than the 27 human forecasts at the 95% confidence level.
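For readers who want to see the two error measures side by side, here is a minimal Python sketch. The function names and the four sample batting averages are made up for illustration and are not data from the study; values are in points (.280 is written as 280).

```python
import math


def mae(forecast, actual):
    """Mean absolute error: the average size of the misses."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def rmse(forecast, actual):
    """Root mean squared error: squaring weights big misses more heavily."""
    squared = sum((f - a) ** 2 for f, a in zip(forecast, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical batting averages in points; NOT data from the study.
forecast = [280, 265, 300, 250]
actual = [270, 275, 300, 310]

print(mae(forecast, actual))   # 20.0
print(rmse(forecast, actual))  # about 30.8, pulled up by the single 60-point miss
```

In this made-up sample, one 60-point miss is enough to push RMSE well above MAE, which is why a system that avoids very large errors tends to look better by RMSE.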

Another way to compare the human forecasts to the other systems is to look
at the number of people who had more accurate forecasts than the mean or
objective systems for each variable, as in Table 2.

TABLE 2: Forecast Accuracy

| Num | BA | OBP | SLG | OPS | OBP-BA | ISO |
|---|---|---|---|---|---|---|
| Mean | 2 | 2 | 3 | 2 | 0 | 3 |
| STATS | 0 | 1 | 0 | 1 | 5 | 0 |
| Vlad | 2 | 2 | 4 | 2 | 5 | 8 |
| Palmer | 3 | 7 | 3 | 4 | 10 | 2 |
| Wilton | 6 | 4 | 5 | 6 | 6 | 4 |

STATS’ projections were the most accurate forecasts for three of the
variables (BA, SLG and ISO), were second to one person’s (Steve Moyer’s) in
OBP by less than one-tenth of a point, were second to one person’s (Steve
Rubio’s) in OPS by half a point, and were 7th in OBP-BA (behind five
people’s and the mean human forecast). From the two tables we can see that,
in general, the objective forecasts were usually more accurate than almost
all human forecasts, except for OBP-BA, where the most accurate human
forecasts were competitive with all of the objective systems, to the extent
that the mean human forecast was the most accurate.

Another way of analyzing the
projections is to test how often they fall within a certain tolerance. A
natural measure of appropriate error tolerance that we can use to test the
projections is the mean absolute difference – its "natural variability" –
between a statistic in 1997 and 1998. For example, the mean absolute change
in batting average was 25.9 points between 1997 and 1998 (regardless of the
sign). Below, Table 3 gives the percentage of predictions that had errors
smaller than the mean absolute difference for each statistic.

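TABLE 3: Errors Smaller than Mean Absolute Error

| | BA (25.9) | OBP (27.8) | SLG (54.3) | OPS (76.6) | OBP-BA (14.9) | ISO (39.9) |
|---|---|---|---|---|---|---|
| Min | 44 | 44 | 51 | 50 | 50 | 55 |
| Max | 65 | 64 | 69 | 66 | 70 | 70 |
| Mean | 60 | 59 | 67 | 59 | 72 | 65 |
| STATS | 66 | 62 | 67 | 65 | 62 | 71 |
| Vlad | 64 | 59 | 69 | 70 | 62 | 56 |
| Palmer | 60 | 60 | 65 | 67 | 63 | 64 |
| Wilton | 63 | 51 | 64 | 64 | 65 | 63 |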

Table 4 gives the percentage of forecasts with errors smaller than one-half the
mean absolute difference.

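TABLE 4: Errors Smaller than One Half Mean Absolute Error

| | BA (12.9) | OBP (13.9) | SLG (27.1) | OPS (38.3) | OBP-BA (7.5) | ISO (20.0) |
|---|---|---|---|---|---|---|
| Min | 20 | 17 | 15 | 21 | 21 | 28 |
| Max | 37 | 36 | 29 | 38 | 45 | 40 |
| Mean | 31 | 28 | 23 | 33 | 41 | 37 |
| STATS | 31 | 40 | 17 | 38 | 40 | 34 |
| Vlad | 34 | 26 | 21 | 35 | 30 | 40 |
| Palmer | 31 | 31 | 16 | 38 | 35 | 34 |
| Wilton | 35 | 30 | 17 | 36 | 34 | 34 |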
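The tolerance test behind Tables 3 and 4 can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names and the four-player sample are hypothetical, and only the idea (count the forecasts whose error is smaller than the mean absolute year-to-year change, or half of it) comes from the text.

```python
def mean_abs_change(previous, current):
    """Average absolute year-to-year change, regardless of sign."""
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(current)


def share_within(forecast, actual, tolerance):
    """Percentage of forecasts whose absolute error is smaller than tolerance."""
    hits = sum(1 for f, a in zip(forecast, actual) if abs(f - a) < tolerance)
    return 100.0 * hits / len(actual)


# Hypothetical batting averages in points; NOT data from the study.
ba_1997 = [300, 260, 280, 250]
ba_1998 = [270, 275, 300, 285]
forecast_1998 = [290, 270, 285, 255]

tolerance = mean_abs_change(ba_1997, ba_1998)
print(tolerance)                                               # 25.0
print(share_within(forecast_1998, ba_1998, tolerance))         # 75.0
print(share_within(forecast_1998, ba_1998, tolerance / 2))     # 25.0
```

The first percentage corresponds to a Table 3 entry and the second to a Table 4 entry for the same set of forecasts.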

These tables reaffirm some of the things we know from Tables 1 and 2, but
they also offer up a lot of new information. It seems that
roughly two-thirds of the projections are correct to within the mean absolute
change from year to year, and roughly one-third are right within half that
value. The least accurately predicted statistic in the project this year
was slugging, with the best performance of the 31 forecasts being only 29%
correct within the tighter tolerance. Curiously, the non-BA components of
OBP (OBP-BA) and SLG (ISO) were the most accurately predicted statistics,
indicating that batting average was the hardest thing to predict for most
of the participants. The biggest advantage STATS’ projections seem to have
is that they make the fewest large errors in their BA predictions (only 34%
miss by more than the mean absolute change).

Two of the objective projection systems, Vlad and Wilton, have more of a
"gambling" nature than the STATS and Palmer systems. This shows up in that
Vlad and Wilton are likely to be very right or very wrong. Note, for
example, that among the computer projections Vlad has the most ISO
predictions within one-half of the absolute difference (40%) despite having
the fewest predictions within the mean absolute difference (56%). STATS, on
the other hand, has the most ISO predictions within the mean absolute
difference (71%), and the fewest within one-half of the mean absolute
difference (34%). The conservative nature of STATS’ system thus
significantly reduces the number of very big errors and thus its RMSE, but
at the cost of producing fewer very accurate projections than a more
aggressive approach. The less conservative Vlad and Wilton systems produce a series of
projections that comes closer to the observed variance of performance, but
don’t always assign the big changes to the right hitters. It is likely that
the Vlad and Wilton systems have more room for future improvement because
of that trait.

While this project was not a contest (we were more interested in the
‘people vs. tools’ angle), we did keep track of each participant’s
projections and how accurate they were. Below we’ve listed all the
participants in order, alongside the average RMSE of the three main
variables (BA, OBP-BA, ISO). Following that is a similar ranking by plain
MSE (here the mean absolute error, the MAE mentioned earlier), which
"punishes" large errors less than RMSE. The use of the three
variables is almost certainly fairer than ranking just by OPS, since you
can rank high in an OPS ranking if you are equally wrong in different
directions on several of the variables. The three variables are like
length, width, and height, while OPS is like volume. Nevertheless, we
include an RMSE ranking by OPS as well in the third table below.
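A small worked example shows how component errors can cancel inside OPS. It relies on the identity OPS = OBP + SLG = 2×BA + (OBP-BA) + ISO; the player and all numbers below are hypothetical, in points, and the function name is illustrative.

```python
def ops(ba, obp_minus_ba, iso):
    """OPS in points, rebuilt from the three component variables."""
    obp = ba + obp_minus_ba
    slg = ba + iso
    return obp + slg


# Hypothetical player; NOT data from the study.
actual = {"ba": 280, "obp_minus_ba": 70, "iso": 150}
forecast = {"ba": 300, "obp_minus_ba": 60, "iso": 120}

component_errors = {key: forecast[key] - actual[key] for key in actual}
ops_error = ops(**forecast) - ops(**actual)

print(component_errors)  # {'ba': 20, 'obp_minus_ba': -10, 'iso': -30}
print(ops_error)         # 0
```

Every component is off, by as much as 30 points, yet the OPS forecast is exact. A ranking on OPS alone would score this forecast as perfect, which is the concern raised above.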

BA, OBP-BA, and ISO by RMSE

```
1.STATS                          28.9
2.Mean of People's Projections   29.3
3.Palmer                         30.3
4.Steve Moyer                    30.4
5.Dave Schoenfeld                30.5
6.Greg Spira                     30.5
7.Steve Rubio                    30.7
8.Wilton                         30.7
11.John Sickels                  31.3
-------------------------------------
Above are One Standard Deviation better than average human projection
12.Dean Carrano                  31.6
14.Jeff Joseph                   32.6
15.David Pease                   32.8
16.Michael Wolverton             32.8
17.Chris Conley                  32.8
18.Mark Jareb                    33.0
19.Greg Bunimovich               33.0
20.Joe Sheehan                   33.1
21.Sean Forman                   33.6
22.Gregg Pearlman                33.7
23.Jason Gische                  33.9
24.HJ Park                       34.0
25.Dan Szymborski                34.0
26.Doug Pappas                   34.2
27.Gary Huckabay                 34.2
28.Jeff Hildebrand               34.2
-------------------------------------
Below are One Standard Deviation worse than average human projection
29.Allen Speir                   34.4
30.Ron Johnson                   35.0
31.Daniel Levine                 35.7
```

BA, OBP-BA, and ISO by MSE

```
1  STATS                         23.0
2  Mean of People's Projections  23.3
-------------------------------------
Above are Two Standard Deviations better than average human projection
3  Steve Moyer                   23.8
4  Greg Spira                    24.0
5  Palmer                        24.1
6  Steve Rubio                   24.3
7  Wilton                        24.5
8  Dean Carrano                  24.6
-------------------------------------
Above are One Standard Deviation better than average human projection
11 Dave Schoenfeld               24.8
13 John Sickels                  24.9
14 Jeff Joseph                   25.2
15 Michael Wolverton             25.3
16 Greg Bunimovich               25.4
17 Joe Sheehan                   25.6
18 David Pease                   25.6
19 Mark Jareb                    25.9
20 Chris Conley                  25.9
21 Sean Forman                   26.0
22 Dan Szymborski                26.2
23 Gregg Pearlman                26.2
24 Doug Pappas                   26.6
25 Gary Huckabay                 26.6
26 Jeff Hildebrand               26.9
-------------------------------------
Below are One Standard Deviation worse than average human projection
27 HJ Park                       27.0
28 Ron Johnson                   27.2
29 Jason Gische                  27.3
30 Allen Speir                   27.5
31 Daniel Levine                 27.7
```

OPS by RMSE

```
1  Steve Rubio                   82.8
2  STATS                         83.3
3  John Sickels                  83.9
5  Mean of People's Projections  85.5
6  Steve Moyer                   86.4
7  Gary Huckabay                 86.7
-------------------------------------
Above are One Standard Deviation better than the average human projection
8  Palmer                        86.9
11 Wilton                        87.7
12 Greg Spira                    88.4
13 Joe Sheehan                   88.9
14 Michael Wolverton             89.7
15 Dave Schoenfeld               90.0
16 Dan Szymborski                90.1
17 Jeff Joseph                   91.5
18 Sean Forman                   91.7
19 Dean Carrano                  92.3
-------------------------------------
Below are One Standard Deviation worse than average human projection
20 Mark Jareb                    93.2
21 Dave Pease                    94.2
22 HJ Park                       94.6
23 Ron Johnson                   94.8
24 Chris Conley                  95.0
25 Gregg Pearlman                96.0
26 Doug Pappas                   96.6
27 Greg Bunimovich               98.3
28 Jason Gische                 100.7
29 Jeff Hildebrand              102.1
30 Allen Speir                  104.1
-------------------------------------
Below is Two Standard Deviations worse than average human projection
31 Daniel Levine                105.6
```

Now let’s look at which players each system did their worst at:

| Worst | Mean | STATS | Vlad | Palmer | Wilton |
|---|---|---|---|---|---|
| BA | Becker | Becker | Flaherty | Flaherty | Olerud |
| OBP | C.Johnson | Morandini | Galarraga | McGwire | Galarraga |
| SLG | Galarraga | Sosa | Galarraga | Flaherty | Gilkey |
| OPS | Galarraga | Sosa | Galarraga | Flaherty | Galarraga |
| OBP-BA | Becker | Becker | Jones | Becker | Becker |
| ISO | Galarraga | Galarraga | McGwire | Gilkey | Gilkey |

Andres Galarraga was obviously the biggest problem. Every system expected
his power to largely disappear out of Coors Field. It didn’t. It will be
interesting to see how these systems handle projecting Galarraga’s 1999.
Was it a complete fluke, a real change, or did the systems miss something?
Meanwhile, Rich Becker walked a lot less than anyone expected, and no
system anticipated quite how horrible John Flaherty would be.

We can reach the not-so-wild conclusion from these worst misses that no
system is really immune from blowing it completely when ballplayers do the
totally unexpected; the worst projections from each system are probably
going to be similar to the worst projections from all the other systems
year after year. It seems unlikely that the projection systems, especially
the more conservative ones, can improve at all in this area.

Now for the best guesses from the various systems:

| Best | Mean | STATS | Vlad | Palmer | Wilton |
|---|---|---|---|---|---|
| BA | Mueller | Greer | Carter | R.Davis | L.Gonzalez |
| OBP | Lankford | Klesko | Gant | Vizquel | Grudzielanek |
| SLG | Veres | Matheny | D.Cruz | King | Guillen |
| OPS | T.Martinez | A.Rodriguez | Allensworth | Alfonzo | King |
| OBP-BA | Molitor | Everett | E.Young | Allensworth | Matheny |
| ISO | E.Young | Weiss | Conine | Becker | Hoiles |

There’s a lot less repetition here. Only four players (Allensworth, King,
Matheny, and E. Young) get mentioned twice. No player was the best pick for
more than one measurement by any system, and no player was the best pick in
any one measurement for more than one system. There doesn’t seem to be any
discernible trend in the players who make this list, although a notable
number of them do have very little power. Rich Becker is the only player
who pops up on both lists, showing up as Pete Palmer’s best isolated power
projection and his worst walk/HBP rate projection.

Now, on which players did the humans agree the most on their individual
projections?

```
BA - Jose Cruz
OBP - Troy O'Leary
SLG - Mike Bordick
OPS - Kenny Lofton
OBP-BA - Michael Tucker
ISO - Mike Bordick
Overall - Mike Matheny
```

None of these projections bombed, though everybody vastly underestimated
what Mike Bordick’s power would be like in 1998.

Meanwhile, the most disagreement among the humans about how players would
do in 1998 showed up in these players:

```
BA - Gary Sheffield
OBP - Doug Glanville
SLG - Gary Sheffield
OPS - Gary Sheffield
OBP-BA - Brian Johnson
ISO - Gary Sheffield
Overall - Gary Sheffield
```

This doesn’t tell us much except that everybody thought differently about
which Gary Sheffield would show up in 1998.

There is a small positive correlation between the disagreement and the
error of the mean human forecast. The amounts are small enough to mean
little, except perhaps for the .36 correlation in the ISO category. That
correlation could mean that people have more trouble calculating ISO than
other parts of a ballplayer’s offense.

Among the statistical projection systems, Brian Johnson, Joe Carter and
Rafael Palmeiro were the sources of the most agreement, while Rich Becker
was the projection on which the systems disagreed the most. There does seem
to be a bit more agreement among the systems on veterans than younger
players, but other than that there are no obvious trends that show up. The
humans did not show any particularly strong unanimity on any of the players
the computer systems most strongly agreed upon, strangely enough. We don’t
really see all that much of a relationship between the size of the spread
of the predictions and the accuracy of the projections except in the one
case mentioned above.

All in all there aren’t that many conclusions we can reach yet. This is
really the start of the research, not its conclusion. We plan to continue
this project next season in some form.

What we have learned from this is that a group of knowledgeable baseball
fans can, as a group, predict offensive performance similarly to the best
computer projection systems, though not any better. At the same time, most
knowledgeable baseball fans probably won’t be able to do projections
themselves that are as good as published computer-based projections. We’ve
always seen that different projection systems can be successful in
different ways, but that none really succeed in any way that’s remarkable.

Hopefully, various nuggets we’ve learned along the way will lead to more
interesting discoveries in the future. But for now, let us thank everyone
who has participated in this study. The 27 people who contributed "human"
projections, almost all of whom found that the work was harder to do than
expected but persevered anyway, get a big hand. Worth singling out among
the 27 is Daniel Levine, who also helped arrange all the data into usable
form. We also thank the designers of the computer projection systems: Bill
James and STATS (let us note here that we know that the computer
projections from STATS, unlike the other computer projections analyzed
here, are occasionally adjusted a bit by humans; we just aren’t very
let us conclude by thanking Harold Brooks, the co-author of this piece, who
is uniquely responsible for most of the intelligence found in this study.

If you’re interested in participating in whatever form this project
continues in next season, please send an e-mail to spira@baseballpages.com.
Note that we definitely are looking to include more computer projection
systems next year. And that pitchers and catchers report in six weeks.

Finally, how did the projection systems do in terms of forecasting big
changes in player performance? Here are contingency tables that give the
results of forecasts and observed big changes of OPS. I’ve made the
problem into a 3×3 problem, with the "Down" category being a fall from 1997
of at least 70 points, the "Up" category being an improvement of at least 70
points, and the "Middle" being everybody else. 70 points is approximately
the median absolute change from 1997. The columns give the number of
forecast/observed pairs by the observed change and the rows give the number
by the forecast change. For example, for the mean human forecast, there
were 7 forecasts of a player dropping 70 points or more in 1998 that were
associated with players who actually dropped 70 points or more. There were
four cases where that forecast was made and the player was within 70 points
of 1997, no cases where the player went up by 70 points or more, and 11 cases
where the player was forecast to stay within 70 points and he actually went
down by 70 or more points. The last column (row) in each table gives the
total number of forecast (observed) changes in each category. (Note
that the observed total row is the same for each table.)

So what are you looking at? In reading these grids, keep in mind that it’s
good to be on the main (top left-bottom right) diagonals or corners, and
bad to be off of them. There are only three cases in the bad corners (top
right and bottom left) – Vlad had Rich Becker going up by 100 points and he
went down by 74, while Wilton had Bernard Gilkey going up by 97 (down by
115) and Steve Finley going up by 77 (down by 84).
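The sorting into Down, Middle, and Up can be sketched in a few lines of Python. The three forecast/observed pairs are the bad-corner cases quoted above; the function name, the single-letter labels, and the use of a `Counter` are illustrative choices, not the study's actual code.

```python
from collections import Counter

THRESHOLD = 70  # points of OPS, the cutoff used for a "big" change


def category(change, threshold=THRESHOLD):
    """Label a change from 1997 as Down (D), Middle (M), or Up (U)."""
    if change <= -threshold:
        return "D"
    if change >= threshold:
        return "U"
    return "M"


# (system, player, forecast change, observed change), as quoted in the text.
cases = [
    ("Vlad", "Rich Becker", 100, -74),
    ("Wilton", "Bernard Gilkey", 97, -115),
    ("Wilton", "Steve Finley", 77, -84),
]

# Each cell of the grid is keyed by (forecast category, observed category).
cells = Counter((category(f), category(o)) for _, _, f, o in cases)
print(cells[("U", "D")])  # 3
```

All three pairs land in the forecast-Up, observed-Down cell, which is the bottom-left corner of the grids.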

The tables can be summarized by a variety of measures. A particularly
appropriate one is one known (at least in meteorology) as the Heidke score.
It gives the fraction in the right boxes (main diagonal) reduced by the
number you would get right by random guessing. The Heidke score is at the
bottom right of each table (for the mean human, it’s .208). The best
(worst) possible score is 1 (-1). A score of zero is associated with
random guessing. Vlad wins by this measure, while Wilton does surprisingly
poorly.

The value in parentheses after each forecast name is the percentage of
forecasts that were either big drops or big ups. The observed value is
49%. Vlad has the highest value at 35%, so that it makes forecasts that
"look" the most like a real distribution of observed values, but it’s still
a ways from reality.

```
      Observed Changes
   Down Middle     Up   Total
Mean (20%)
D     7      4      0      11
M    11     38     20      69
U     0      2      4       6
Tot  18     44     24    .208

STATS (29%)
D     9      9      0      18
M     9     34     18      61
U     0      1      6       7
Tot  18     44     24    .246

Vlad (35%)
D     8      6      0      14
M     9     32     15      56
U     1      6      9      16
Tot  18     44     24    .259

Palmer (15%)
D     7      3      0      10
M    11     40     22      73
U     0      1      2       3
Tot  18     44     24    .191

Wilton (23%)
D     5      5      0      10
M    11     36     19      66
U     2      3      5      10
Tot  18     44     24    .155
```
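For readers who want to check the scores, here is a minimal Python sketch of the Heidke score in its usual form: the number of correct forecasts minus the number expected by chance, divided by the total minus the number expected by chance. The function name is illustrative. Applied to the Mean and STATS grids above, it reproduces the printed .208 and .246.

```python
def heidke(table):
    """Heidke score for a square grid: rows are forecasts, columns observed."""
    n = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Number of hits expected if forecasts were assigned at random
    # while keeping the same row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n
    return (correct - expected) / (n - expected)


# Rows D, M, U (forecast); columns Down, Middle, Up (observed).
mean_human = [[7, 4, 0], [11, 38, 20], [0, 2, 4]]
stats = [[9, 9, 0], [9, 34, 18], [0, 1, 6]]

print(round(heidke(mean_human), 3))  # 0.208
print(round(heidke(stats), 3))       # 0.246
```

Both grids have 49 of 86 forecasts on the main diagonal. STATS scores higher because its row totals make fewer hits expected by chance.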

To summarize, it’s pretty unlikely to forecast a big breakout (or bust) and
have the opposite happen. It’s much more likely to miss big changes. The
percentages of big changes (either sign) correctly forecast by the systems
were:

```
Mean   26%
STATS  36%
Vlad   40%
Palmer 21%
Wilton 24%
```

Vlad and STATS are the best at picking up on the big changes by this
measure. (This is not contradictory to the description of STATS as
conservative–it’s just that different measures can give different
impressions. If a variety of measures give the same picture, as with Vlad,
then you can have more confidence in the result.)
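The percentages above appear to be the correctly forecast big changes (Down/Down plus Up/Up) divided by all observed big changes. That reading is an inference from the grids rather than a stated formula, but it reproduces the listed 26%, 36%, and 21%. A minimal Python sketch, with an illustrative function name:

```python
def big_change_hit_rate(table):
    """Percent of observed big changes (Down or Up) forecast in the right direction."""
    hits = table[0][0] + table[2][2]                    # Down/Down plus Up/Up
    observed_big = sum(row[0] + row[2] for row in table)  # Down and Up column totals
    return 100.0 * hits / observed_big


# Rows D, M, U (forecast); columns Down, Middle, Up (observed).
mean_human = [[7, 4, 0], [11, 38, 20], [0, 2, 4]]
stats = [[9, 9, 0], [9, 34, 18], [0, 1, 6]]
palmer = [[7, 3, 0], [11, 40, 22], [0, 1, 2]]

for name, grid in [("Mean", mean_human), ("STATS", stats), ("Palmer", palmer)]:
    print(name, round(big_change_hit_rate(grid)))
# Mean 26
# STATS 36
# Palmer 21
```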

If I pick a lower threshold for big changes (50 points), I get the following:

```
             Heidke
Mean   (35%) .185
STATS  (41%) .249
Vlad   (53%) .165
Palmer (31%) .204
Wilton (42%) .102
```

Again, the parenthetical value is the percentage of forecasts of big
changes, with the observed percentage = 62%. The conservative systems pass
up Vlad in large part because of Vlad’s struggles with the middle
forecasts. The contingency table for Vlad at a 50-point threshold is

```
      Observed Changes
   Down Middle     Up    Total

D    10      7      1       18
M    14     14     12       40
U     1     12     15       28
Tot  25     33     28     .165
```

Note how "flat" the distribution in the middle row is. It’s very peaked
for the 70 point threshold and having only 42% of the observed middle group
being in the forecast middle hurts Vlad a lot. The lowest value for any of
the other 50 point threshold systems is 58%.

Now, for the really bad forecasts:

```
Forecast to go down by more than 50 points and went up by 50 points:
Andres Galarraga (missed by everyone)
Mickey Morandini (missed by STATS-forecast down 69)
John Olerud (missed by Wilton-forecast down 55)

Forecast to go up by more than 50 points and went down by 50 points:
Bernard Gilkey (missed by mean human, Palmer, Wilton)
Rich Becker (missed by Vlad-forecast up 100)
```

Finally, a few notes from Clay about the version of Wilton used in Harold’s
study and the final version that was used in BP ’99: in the study, the
Wilton program mentioned was a prototype of the version which appears in
the book. Most importantly, it did not convert statistics from the DT
format to the team/league environment, except for a simple Colorado
adjustment. The prototype also contained several bugs which had yet to be
caught (but were caught before we put Wilton into this year’s book).
Finally, the prototype had different criteria for choosing "matching"
players and weighting their contributions. Overall, the differences should
be minor; the judgments of Wilton’s performance are still valid, even if
the exact assessment varies.
