February 1, 1999

The Prospectus Projections Project

A year's worth of data is in, and here are the results

by David Cameron and Greg Spira

The Prospectus Projections Project began last February, when we asked about 40 people to make "human" projections of how 100 hitters and 25 pitchers would perform in 1998. We wanted to see how people would do compared to the computer formulas that appear in such publications as Baseball Prospectus and STATS' Major League Handbook. In the end, 27 people - all knowledgeable baseball fans with a good fundamental understanding of baseball and baseball statistics - sent us their projections by mid-March.

We emphasized to all our participants that their projections had to be at least somewhat from the gut. While everybody was allowed to use whatever knowledge, statistics, or books they had in their possession, the projections were not to come out of a computer or even a formula. A few potential participants begged off because of this clause. But our helpful 27 contributed their projected batting average, slugging average, and on-base percentage for 100 mostly randomly selected hitters (we avoided players on expansion teams, so as not to play "guess the park factor"), and the projected ERA, K rate, BB rate, and IP for 25 pitchers. The pitching data will be analyzed in a future article; we're here to tell you the results of the offensive projections.

It's a bit harder to come up with "results" than you might think. Not all projection systems aim for the same type of success. If you're trying to hit on a long shot at the racetrack, for example, a conservative projection system that sticks closely to what horses have done in the past is going to do you much less good than a more aggressive system that tries to project which long shot is going to break out of the pack. Even when the aggressive system succeeds, it will have far more serious errors on average, but it may very well make you more money because it picked out the one winner who pays off huge.

One common way - though again, there is no real "right" way - of analyzing projections is to measure the mean absolute error (MAE) and the root mean squared error (RMSE), so that's what we did in this case. Table 1 summarizes the errors for the human and computer forecasts by giving the minimum human error (Min), the maximum human error (Max), and then the errors for the mean human forecast and four objective systems, all in terms of RMSE. The objective systems that we analyzed included Bill James' system (as included in the STATS 1998 Major League Handbook and later updates), the neural net Vlad system (as used in Baseball Prospectus 1998), Pete Palmer's projection system (as available in The Spy: Baseball '98), and experimental projections from Wilton, the projection system that will be used in the 1999 edition of the Baseball Prospectus.

TABLE 1: Root Mean Squared Error
RMSE     BA    OBP    SLG    OPS   OBP-BA   ISO
Min     27.7   30.4   58.3   82.8   17.3   41.6
Max     35.8   39.0   76.2  105.6   22.8   53.8
Mean    28.7   31.4   60.0   85.5   17.0   42.1
STATS   27.4   30.4   58.1   83.3   18.6   40.8
Vlad    28.9   31.2   61.5   85.1   18.6   46.0
Palmer  29.1   33.0   60.1   86.9   19.8   41.9
Wilton  29.7   32.6   62.0   87.7   19.1   43.4

Bold figures indicate that the value is statistically significantly better than the 27 human forecasts at the 95% confidence level.
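For anyone who wants to replicate the error measures, here is a minimal sketch of the MAE and RMSE calculations (Python, purely for illustration; the function names and toy numbers are ours, not the code actually used for the study):

    from math import sqrt

    def mae(projected, actual):
        """Mean absolute error: the average size of the miss, ignoring sign."""
        return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)

    def rmse(projected, actual):
        """Root mean squared error: like MAE, but big misses are penalized more heavily."""
        return sqrt(sum((p - a) ** 2 for p, a in zip(projected, actual)) / len(actual))

    # Toy example, with batting averages expressed in points (.280 -> 280),
    # the same units used in the tables.
    projected = [280, 310, 265, 295]
    actual = [271, 334, 250, 301]
    print(mae(projected, actual), round(rmse(projected, actual), 1))  # 13.5 15.1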

Another way to compare the human forecasts to the other systems is to look at the number of people who had more accurate forecasts than the mean or objective systems for each variable, as in Table 2.

TABLE 2: Forecast Accuracy (number of humans with more accurate forecasts)
Num      BA   OBP   SLG   OPS   OBP-BA   ISO
Mean      2     2     3     2      0      3
STATS     0     1     0     1      5      0
Vlad      2     2     4     2      5      8
Palmer    3     7     3     4     10      2
Wilton    6     4     5     6      6      4

STATS' projections were the most accurate forecasts for three of the variables (BA, SLG, and ISO), were second to one person's (Steve Moyer's) in OBP by less than one-tenth of a point, were second to one person's (Steve Rubio's) in OPS by half a point, and were 7th in OBP-BA (behind five people's and the mean human forecast). From the two tables we can see that, in general, the objective forecasts were usually more accurate than almost all human forecasts, except for OBP-BA, where the most accurate human forecasts were competitive with all of the objective systems, to the extent that the mean human forecast was the most accurate of all.

Another way of analyzing the projections is to test how often they fall within a certain tolerance. A natural error tolerance to test against is the mean absolute difference - the "natural variability" - of each statistic between 1997 and 1998. For example, the mean absolute change in batting average between 1997 and 1998 was 25.9 points, regardless of the sign of the change. Table 3 gives the percentage of predictions that had errors smaller than the mean absolute difference for each statistic.

TABLE 3: Percentage of Errors Smaller than the Mean Absolute Difference
         BA (25.9)  OBP (27.8)  SLG (54.3)  OPS (76.6)  OBP-BA (14.9)  ISO (39.9)
Min          44         44          51          50           50            55
Max          65         64          69          66           70            70
Mean         60         59          67          59           72            65
STATS        66         62          67          65           62            71
Vlad         64         59          69          70           62            56
Palmer       60         60          65          67           63            64
Wilton       63         51          64          64           65            63
Table 4 gives the percentage of forecasts with errors smaller than one-half the mean absolute difference.
TABLE 4: Percentage of Errors Smaller than One-Half the Mean Absolute Difference
         BA (12.9)  OBP (13.9)  SLG (27.1)  OPS (38.3)  OBP-BA (7.5)  ISO (20.0)
Min          20         17          15          21           21            28
Max          37         36          29          38           45            40
Mean         31         28          23          33           41            37
STATS        31         40          17          38           40            34
Vlad         34         26          21          35           30            40
Palmer       31         31          16          38           35            34
Wilton       35         30          17          36           34            34
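For concreteness, here is a small sketch of how the within-tolerance percentages in Tables 3 and 4 can be computed (again illustrative only; the function name and sample numbers are ours):

    def pct_within(projected, actual, tolerance):
        """Percentage of projections whose error is smaller than the given tolerance
        (e.g. the mean absolute 1997-to-1998 change for that statistic)."""
        hits = sum(1 for p, a in zip(projected, actual) if abs(p - a) < tolerance)
        return 100.0 * hits / len(actual)

    projected = [280, 310, 265, 295]
    actual = [271, 334, 250, 301]
    # 25.9 is the mean absolute change in batting average from 1997 to 1998;
    # Table 4 uses roughly half that value (12.9) as the tighter test.
    print(pct_within(projected, actual, 25.9), pct_within(projected, actual, 12.9))  # 100.0 50.0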

These tables reaffirm some of the things we know from Tables 1 and 2, but they also offer up a lot of new information. Roughly two-thirds of the projections are correct to within the mean absolute change from year to year, and roughly one-third are right within half that value. The least accurately predicted statistic in the project this year was slugging, with the best of the 31 sets of forecasts being only 29% correct within the tighter tolerance. Curiously, the non-BA components of OBP (OBP-BA) and SLG (ISO) were the most accurately predicted statistics, indicating that batting average was the hardest thing to predict for most of the participants. The biggest advantage of STATS' projections seems to be that they make the fewest large errors in their BA predictions (only 34%).

Two of the objective projection systems, Vlad and Wilton, have more of a "gambling" nature than the STATS and Palmer systems. This shows up in the fact that Vlad and Wilton are more likely to be either very right or very wrong. Note, for example, that among the computer projections Vlad has the most ISO predictions within one-half of the mean absolute difference (40%) despite having the fewest predictions within the full mean absolute difference (56%). STATS, on the other hand, has the most ISO predictions within the mean absolute difference (71%) and the fewest within one-half of it (34%). The conservative nature of STATS' system significantly reduces the number of very big errors, and thus its RMSE, but at the cost of fewer very accurate projections when compared to a more aggressive approach. The less conservative Vlad and Wilton systems produce a set of projections that comes closer to the observed variance of performance, but they don't always assign the big changes to the right hitters. It is likely that the Vlad and Wilton systems have more room for future improvement because of that trait.

While this project was not a contest (we were more interested in the "people vs. tools" angle), we did keep track of each participant's projections and how accurate they were. Below we've listed all the participants in order, alongside the average RMSE of the three main variables (BA, OBP-BA, ISO). Following that is a similar ranking by mean absolute error (MAE), which "punishes" large errors less than RMSE does. Ranking by the three variables is almost certainly fairer than ranking by OPS alone, since you can rank high in an OPS ranking if you are equally wrong in different directions on several of the variables. The three variables are like length, width, and height, while OPS is like volume. Nevertheless, we include an RMSE ranking by OPS as well in the third table below.

BA, OBP-BA, and ISO by RMSE

1.STATS                          28.9
2.Mean of People's Projections   29.3
3.Palmer                         30.3
4.Steve Moyer                    30.4
5.Dave Schoenfeld                30.5
6.Greg Spira                     30.5
7.Steve Rubio                    30.7
8.Wilton                         30.7
9.Jim Furtado                    31.1
10.Vlad                          31.2
11.John Sickels                  31.3
-------------------------------------
Above are One Standard Deviation better than average human projection
12.Dean Carrano                  31.6
13.TJ Madison                    32.0
14.Jeff Joseph                   32.6
15.David Pease                   32.8
16.Michael Wolverton             32.8
17.Chris Conley                  32.8
18.Mark Jareb                    33.0
19.Greg Bunimovich               33.0
20.Joe Sheehan                   33.1
21.Sean Forman                   33.6
22.Gregg Pearlman                33.7
23.Jason Gische                  33.9
24.HJ Park                       34.0
25.Dan Szymborski                34.0
26.Doug Pappas                   34.2
27.Gary Huckabay                 34.2
28.Jeff Hildebrand               34.2
-------------------------------------
Below are One Standard Deviation worse than average human projection
29.Allen Speir                   34.4
30.Ron Johnson                   35.0
31.Daniel Levine                 35.7

BA, OBP-BA, and ISO by MAE
1  STATS                         23.0
2  Mean of People's Projections  23.3
-------------------------------------
Above are Two Standard Deviations better than average human projection
3  Steve Moyer                   23.8
4  Greg Spira                    24.0
5  Palmer                        24.1
6  Steve Rubio                   24.3
7  Wilton                        24.5
8  Dean Carrano                  24.6
9  Jim Furtado                   24.6
-------------------------------------
Above are One Standard Deviation better than average human projection
10 Vlad                          24.7
11 Dave Schoenfeld               24.8
12 TJ Madison                    24.9
13 John Sickels                  24.9
14 Jeff Joseph                   25.2
15 Michael Wolverton             25.3
16 Greg Bunimovich               25.4
17 Joe Sheehan                   25.6
18 David Pease                   25.6
19 Mark Jareb                    25.9
20 Chris Conley                  25.9
21 Sean Forman                   26.0
22 Dan Szymborski                26.2
23 Gregg Pearlman                26.2
24 Doug Pappas                   26.6
25 Gary Huckabay                 26.6
26 Jeff Hildebrand               26.9
-------------------------------------
Below are One Standard Deviation worse than average human projection
27 HJ Park                       27.0
28 Ron Johnson                   27.2
29 Jason Gische                  27.3
30 Allen Speir                   27.5
31 Daniel Levine                 27.7

OPS by RMSE
1  Steve Rubio                   82.8
2  STATS                         83.3
3  John Sickels                  83.9
4  Vlad                          85.1
5  Mean of People's Projections  85.5
6  Steve Moyer                   86.4
7  Gary Huckabay                 86.7
-------------------------------------
Above are One Standard Deviation better than the average human projection
8  Palmer                        86.9
9  Jim Furtado                   87.0
10 TJ Madison                    87.3
11 Wilton                        87.7
12 Greg Spira                    88.4
13 Joe Sheehan                   88.9
14 Michael Wolverton             89.7
15 Dave Schoenfeld               90.0
16 Dan Szymborski                90.1
17 Jeff Joseph                   91.5
18 Sean Forman                   91.7
19 Dean Carrano                  92.3
-------------------------------------
Below are One Standard Deviation worse than average human projection
20 Mark Jareb                    93.2
21 David Pease                   94.2
22 HJ Park                       94.6
23 Ron Johnson                   94.8
24 Chris Conley                  95.0
25 Gregg Pearlman                96.0
26 Doug Pappas                   96.6
27 Greg Bunimovich               98.3
28 Jason Gische                 100.7
29 Jeff Hildebrand              102.1
30 Allen Speir                  104.1
-------------------------------------
Below is Two Standard Deviations worse than average human projection
31 Daniel Levine                105.6

Now let's look at the players on which each system did worst:

Worst    Mean        STATS       Vlad         Palmer      Wilton
BA       Becker      Becker      Flaherty     Flaherty    Olerud
OBP      C.Johnson   Morandini   Galarraga    McGwire     Galarraga
SLG      Galarraga   Sosa        Galarraga    Flaherty    Gilkey
OPS      Galarraga   Sosa        Galarraga    Flaherty    Galarraga
OBP-BA   Becker      Becker      Jones        Becker      Becker
ISO      Galarraga   Galarraga   McGwire      Gilkey      Gilkey

Andres Galarraga was obviously the biggest problem. Every system expected his power to largely disappear once he left Coors Field. It didn't. It will be interesting to see how these systems handle projecting Galarraga's 1999. Was his 1998 a complete fluke, a real change, or did the systems miss something? Meanwhile, Rich Becker walked a lot less than anyone expected, and no system anticipated quite how horrible John Flaherty would be.

We can reach the not-so-wild conclusion from these worst misses that no system is really immune from blowing it completely when ballplayers do the totally unexpected; the worst projections from each system are probably going to be similar to the worst projections from all the other systems year after year. It seems unlikely that the projection systems, especially the more conservative ones, can improve at all in this area.

Now for the best guesses from the various systems:

Best     Mean         STATS         Vlad          Palmer        Wilton
BA       Mueller      Greer         Carter        R.Davis       L.Gonzalez
OBP      Lankford     Klesko        Gant          Vizquel       Grudzielanek
SLG      Veres        Matheny       D.Cruz        King          Guillen
OPS      T.Martinez   A.Rodriguez   Allensworth   Alfonzo       King
OBP-BA   Molitor      Everett       E.Young       Allensworth   Matheny
ISO      E.Young      Weiss         Conine        Becker        Hoiles

There's a lot less repetition here. Only four players (Allensworth, King, Matheny, and E. Young) get mentioned twice. No player was the best pick for more than one measurement by any system, and no player was the best pick in any one measurement for more than one system. There doesn't seem to be any discernible trend in the players who make this list, although a notable number of them have very little power. Rich Becker is the only player who pops up on both tables, showing up as Pete Palmer's best isolated power projection and his worst walk/HBP rate projection.

Now, on which players did the humans agree the most on their individual projections?

BA - Jose Cruz
OBP - Troy O'Leary
SLG - Mike Bordick
OPS - Kenny Lofton
OBP-BA - Michael Tucker
ISO - Mike Bordick
Overall - Mike Matheny

None of these projections bombed, though everybody vastly underestimated what Mike Bordick's power would be like in 1998.

Meanwhile, the most disagreement among the humans about how players would do in 1998 showed up in these players:

BA - Gary Sheffield
OBP - Doug Glanville
SLG - Gary Sheffield
OPS - Gary Sheffield
OBP-BA - Brian Johnson
ISO - Gary Sheffield
Overall - Gary Sheffield

This doesn't tell us much except that everybody thought differently about which Gary Sheffield would show up in 1998.

There is a small positive correlation between the disagreement among the humans and the error of the mean human forecast. The correlations are small enough to mean little, except perhaps for the .36 correlation in the ISO category. That correlation could mean that people have more trouble projecting isolated power than other parts of a ballplayer's offense.

Among the statistical projection systems, Brian Johnson, Joe Carter, and Rafael Palmeiro were the sources of the most agreement, while Rich Becker was the projection on which the systems disagreed the most. There does seem to be a bit more agreement among the systems on veterans than on younger players, but other than that no obvious trends show up. Strangely enough, the humans did not show any particularly strong unanimity on any of the players the computer systems most strongly agreed upon. We don't see much of a relationship between the size of the spread of the predictions and the accuracy of the projections except in the one case mentioned above.

All in all there aren't that many conclusions we can reach yet. This is really the start of the research, not its conclusion. We plan to continue this project next season in some form.

What we have learned is that knowledgeable baseball fans can, as a group, predict offensive performance about as well as the best computer projection systems, though not any better. At the same time, most knowledgeable baseball fans probably won't be able to do projections on their own that are as good as published computer-based projections. We've also seen that different projection systems can be successful in different ways, but that none really succeeds in any way that's remarkable.

Hopefully, various nuggets we've learned along the way will lead to more interesting discoveries in the future. But for now, let us thank everyone who has participated in this study. The 27 people who contributed "human" projections, almost all of whom found that the work was harder to do than expected but persevered anyway, get a big hand. Worth singling out among the 27 is Daniel Levine, who also helped arrange all the data into usable form. We also thank the designers of the computer projection systems: Bill James and STATS (let us note here that we know that the computer projections from STATS, unlike the other computer projections analyzed here, are occasionally adjusted a bit by humans; we just aren't very concerned about this), Pete Palmer, Gary Huckabay, and Clay Davenport. And let us conclude by thanking Harold Brooks, the co-author of this piece, who is uniquely responsible for most of the intelligence found in this study.

If you're interested in participating in whatever form this project continues in next season, please send an e-mail to spira@baseballpages.com. Note that we definitely are looking to include more computer projection systems next year. And that pitchers and catchers report in six weeks.

Finally, how did the projection systems do in terms of forecasting big changes in player performance? Here are contingency tables that give the results of forecast and observed big changes in OPS. I've made it a 3x3 problem, with the "Down" category being a fall from 1997 of at least 70 points, the "Up" category being an improvement of at least 70 points, and the "Middle" category being everybody else. 70 points is approximately the median absolute change from 1997.

The columns give the number of forecast/observed pairs by the observed change, and the rows give the number by the forecast change. For example, for the mean human forecast, there were 7 forecasts of a player dropping 70 points or more in 1998 that were associated with players actually doing 70 points or more worse. There were four cases where that forecast was made and the player stayed within 70 points of 1997, no cases where the player went up by 70 points or more, and 11 cases where the player was forecast to stay within 70 points and actually went down by 70 or more. The last column in each table gives the total number of forecasts in each category, and the last row gives the total number of observed changes in each category. (Note that the observed totals are the same for each table.)

So what are you looking at? In reading these grids, keep in mind that it's good to be on the main (top-left to bottom-right) diagonal and bad to be off of it, especially in the corners. There are only three cases in the bad corners (top right and bottom left): Vlad had Rich Becker going up by 100 points and he went down by 74, while Wilton had Bernard Gilkey going up by 97 (he went down by 115) and Steve Finley going up by 77 (down by 84).

The tables can be summarized by a variety of measures. A particularly appropriate one is known (at least in meteorology) as the Heidke score. It gives the fraction of forecasts in the right boxes (the main diagonal), reduced by the fraction you would expect to get right by random guessing. The Heidke score is at the bottom right of each table (for the mean human forecast, it's .208). The best (worst) possible score is 1 (-1), and a score of zero is associated with random guessing. Vlad wins by this measure, while Wilton does surprisingly badly.

The value in parentheses after each forecast name is the percentage of forecasts that were either big drops or big rises. The observed value is 49%. Vlad has the highest value at 35%, so it makes forecasts that "look" the most like the real distribution of observed changes, but it's still a ways from reality.
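As an illustration of how these grids are tallied - the function name and the forecast/observed change lists below are our own assumptions, not part of the original study - a minimal sketch:

    def contingency_table(forecast_changes, observed_changes, threshold=70):
        """Tally a 3x3 grid of forecast vs. observed OPS changes, in points.

        Rows are the forecast category, columns the observed category:
        0 = Down (at least `threshold` points worse), 1 = Middle,
        2 = Up (at least `threshold` points better)."""
        def category(change):
            if change <= -threshold:
                return 0
            if change >= threshold:
                return 2
            return 1

        table = [[0, 0, 0] for _ in range(3)]
        for f, o in zip(forecast_changes, observed_changes):
            table[category(f)][category(o)] += 1
        return table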

      Observed Changes
    Down  Middle   Up     Total
Mean (20%)
D     7      4      0      11
M    11     38     20      69
U     0      2      4       6
Tot  18     44     24    .208

STATS (29%)
D     9      9      0      18
M     9     34     18      61
U     0      1      6       7
Tot  18     44     24    .246

Vlad (35%)
D     8      6      0      14
M     9     32     15      56
U     1      6      9      16
Tot  18     44     24    .259

Palmer (15%)
D     7      3      0      10
M    11     40     22      73
U     0      1      2       3
Tot  18     44     24    .191

Wilton (23%)
D     5      5      0      10
M    14     36     19      66
U     2      3      5      10
Tot  18     44     24    .155
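For the curious, the Heidke calculation is simple enough to sketch: take the fraction of forecasts on the main diagonal and adjust it for the fraction expected from random guessing with the same marginal totals (the function and variable names below are ours). Run on the mean human grid above, it reproduces the .208 figure:

    def heidke_score(table):
        """Heidke skill score for a square contingency table of counts."""
        n = sum(sum(row) for row in table)
        hits = sum(table[i][i] for i in range(len(table))) / n
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        # Fraction expected correct by chance, given the marginal totals.
        expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
        return (hits - expected) / (1.0 - expected)

    # Mean human forecast: rows = forecast D/M/U, columns = observed D/M/U.
    mean_human = [[7, 4, 0],
                  [11, 38, 20],
                  [0, 2, 4]]
    print(round(heidke_score(mean_human), 3))  # 0.208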

To summarize, it's pretty unlikely to forecast a big breakout (or bust) and have the opposite happen. It's much more likely that big changes are simply missed. The percentages of big changes (either sign) correctly forecast by the systems were:

Mean   26%
STATS  36%
Vlad   41%
Palmer 21%
Wilton 24%

Vlad and STATS are the best at picking up on the big changes by this measure. (This is not contradictory to the description of STATS as conservative - it's just that different measures can give different impressions. If a variety of measures give the same picture, as with Vlad, then you can have more confidence in the result.)
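Those percentages come straight from the grids: the correctly called Downs and Ups divided by the total number of observed big changes (18 + 24 = 42). A minimal sketch, again with our own names and the mean human grid repeated so it stands alone:

    def big_change_hit_rate(table):
        """Percentage of observed big changes (Down or Up) forecast in the right direction."""
        col_totals = [sum(col) for col in zip(*table)]
        observed_big = col_totals[0] + col_totals[2]   # observed Downs + observed Ups
        correct_big = table[0][0] + table[2][2]        # correctly called Downs and Ups
        return 100.0 * correct_big / observed_big

    mean_human = [[7, 4, 0],
                  [11, 38, 20],
                  [0, 2, 4]]
    print(round(big_change_hit_rate(mean_human)))  # 26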

If I pick a lower threshold for big changes (50 points), I get the following:

             Heidke
Mean   (35%) .185
STATS  (41%) .249
Vlad   (53%) .165
Palmer (31%) .204
Wilton (42%) .102

Again, the parenthetical value is the percentage of forecasts of big changes, with the observed percentage being 62%. The conservative systems pass Vlad in large part because of Vlad's struggles with the middle forecasts. The contingency table for Vlad at a 50-point threshold is:

      Observed Changes
    Down  Middle   Up     Total

D    10      7      1       18
M    14     14     12       40
U     1     12     15       28
Tot  25     33     28     .165

Note how "flat" the distribution in the middle row is. It's very peaked for the 70 point threshold and having only 42% of the observed middle group being in the forecast middle hurts Vlad a lot. The lowest value for any of the other 50 point threshold systems is 58%.

Now, for the really bad forecasts:

Forecast to go down by more than 50 points and went up by 50 points:
Andres Galarraga (missed by everyone)
Mickey Morandini (missed by STATS, which forecast him down 69)
John Olerud (missed by Wilton, which forecast him down 55)

Forecast to go up by more than 50 points and went down by 50 points:
Bernard Gilkey (missed by the mean human forecast, Palmer, and Wilton)
Rich Becker (missed by Vlad, which forecast him up 100)

Finally, a few notes from Clay about the version of Wilton used in Harold's study versus the final version that appears in BP '99: the Wilton program in the study was a prototype of the version which appears in the book. Most importantly, it did not convert statistics from the DT format to the team/league environment, except for a simple Colorado adjustment. The prototype also contained several bugs which had yet to be caught (but were caught before we put Wilton into this year's book). Finally, the prototype had different criteria for choosing "matching" players and weighting their contributions. Overall, the differences should be minor; the judgments of Wilton's performance are still valid, even if the exact assessment varies.
