Spring training is in full bloom, and besides the furtive fantasy league planning going on in offices throughout the country (should the GDP’s seasonal adjustments take this into account?), it is also the time for predictions to be made about the upcoming season. Most baseball coverage includes player or team projections of some kind, and Baseball Prospectus is no exception.

However, I’ve often wondered just how accurate even the best predictions can be. We’ve written about predicting player performance before, so I am going to focus on predictions at the team level, namely the order of finish within a division.

An interesting aspect of the question is whether even perfect predictions of team quality can result in reliable predictions of standings. Is a 162-game season long enough for competitive teams to differentiate themselves conclusively?

One way to approach this is to use a past season’s actual results to create a “perfect” predictor of a team’s ability to win a game over a span of time, and simulate a season’s worth of games to see if you recreate the actual observed standings. For example, knowing that a team won 55% of its games in a season (89 wins out of 162 games), simulate a season in which they have a 55% chance of winning each game. This is the probability that maximizes the chance of observing 89 wins in the simulated season.
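In code, that setup looks something like the following minimal Python sketch (the function name and seed here are my own illustration, not actual code from the study):

```python
import random

def simulate_season(p_win, games=162, seed=None):
    """Simulate one season: each game is an independent win with probability p_win."""
    rng = random.Random(seed)
    return sum(1 for _ in range(games) if rng.random() < p_win)

# A team that won 55% of its games (89 of 162) gets a 55% chance in each simulated game.
wins = simulate_season(0.55, seed=42)
```

Run enough simulated seasons and the win totals will scatter around 89, which is exactly the randomness we want to measure.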

Repeating this process for each team in turn creates a full season of predictions, each tuned to maximize the chance that the simulated outcome will match the known outcome. This represents the best that a perfect estimate of each team’s ability to win games over the course of a season can do, with randomness over 162 games as the only source of discrepancy in this artificially constrained experiment. And this is exactly the approach I took to the question.

I simulated 1000 seasons. For each team, I used their actual 2004 winning percentage as the probability of winning a game, and used a normal distribution to approximate the distribution of the number of wins for each team over 162 games.

A side note to the statistically minded: though the number of wins in a season really follows a binomial distribution, for a large enough number of trials the binomial is approximately normal. Usually, [# trials] * [probability] and [# trials] * (1 - [probability]) both have to be greater than 5 for the approximation to be “good enough”. In baseball terms, this means that we need to be able to expect at least 5 wins and 5 losses over whatever number of games we’re looking at. With 162 games (or trials) in a season, both conditions are easily met whether we’re talking about the 2001 Mariners or the 2003 Tigers.
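The rule of thumb is easy to check directly. A quick sketch (my own helper function, using the 2001 Mariners’ 116-46 and 2003 Tigers’ 43-119 records):

```python
def normal_ok(n, p):
    """Rule of thumb: the normal approximation to a binomial(n, p)
    is adequate when n*p and n*(1-p) both exceed 5."""
    return n * p > 5 and n * (1 - p) > 5

# Even the extreme teams clear the bar over a 162-game season:
print(normal_ok(162, 116 / 162))  # 2001 Mariners (116-46) -> True
print(normal_ok(162, 43 / 162))   # 2003 Tigers (43-119)   -> True
```

Over a short span, say a 20-game stretch for a .100 team, the check would fail and the approximation would be suspect.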

For example, Atlanta won 96 games and lost 66, for a .593 winning percentage. Using a 59.3% chance of winning each game over 162 games, I approximated the chance of winning a certain number of games by a normal distribution with a mean of 96 wins, and a standard deviation of SQRT(162*59.3%*(1-59.3%)) = 6.25 wins.
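The arithmetic for that example can be reproduced in a couple of lines (a sketch of the same calculation, not code from the study):

```python
import math

games, wins = 162, 96          # Atlanta's 2004 record: 96-66
p = wins / games               # .593 winning percentage
stddev = math.sqrt(games * p * (1 - p))
print(round(p, 3))             # 0.593
print(round(stddev, 2))        # 6.25
```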

This approach does not guarantee that all teams end up with the proper aggregate number of wins and losses, since it treats all teams as independent, but it is still a useful model. Over the course of all the simulated seasons, each team’s win total will fluctuate, sometimes higher than expected, sometimes lower, but should, in the long run, average out to their “true” level of ability, which is fixed at their observed 2004 level in this model. Here are the aggregate results for each team.

ActW = Actual Wins in 2004

AvgW = Average number of wins across 1000 simulated seasons

MinW = Minimum number of wins in any simulated season

MaxW = Maximum number of wins in any simulated season

StdDev = Standard Deviation in the number of wins per season

TEAM  ActW   AvgW  MinW  MaxW  StdDev
MIN     92   92.3    72   111    6.43
CHA     83   82.8    62   104    6.31
CLE     80   79.6    60    98    6.25
DET     72   72.1    51    92    6.26
KCA     58   57.8    38    77    6.13
NYA    101  100.9    80   120    6.20
BOS     98   97.5    80   114    6.04
BAL     78   78.1    56   102    6.24
TBA     70   70.6    43    88    6.19
TOR     67   67.4    43    87    6.23
ANA     92   91.8    71   116    6.45
OAK     91   90.8    70   114    6.26
TEX     89   89.0    71   109    6.09
SEA     63   63.2    44    82    6.07
SLN    105  105.0    84   124    5.96
HOU     92   92.1    69   115    6.28
CHN     89   89.0    69   106    6.44
CIN     76   76.4    55    96    6.52
PIT     72   72.7    51    95    6.23
MIL     67   67.5    45    88    6.44
ATL     96   96.4    75   118    6.49
PHI     86   85.9    65   107    6.35
FLO     83   83.1    63   106    6.12
NYN     71   70.7    50    89    6.22
MON     67   66.9    45    86    6.45
LAN     93   93.4    71   113    6.33
SFN     91   91.2    72   111    6.55
SDN     87   86.6    65   108    6.57
COL     68   68.2    46    91    6.20
ARI     51   51.1    33    73    6.28
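A table like this can be regenerated with a short simulation loop. The sketch below (my own illustration of the method described above, with rounding and clamping choices that are assumptions on my part) draws 1000 win totals per team from the normal approximation:

```python
import math
import random

def simulate_seasons(actual_wins, games=162, n_seasons=1000, seed=0):
    """Draw n_seasons win totals from the normal approximation,
    rounding to whole wins and clamping to the 0..games range."""
    rng = random.Random(seed)
    p = actual_wins / games
    mean, sd = games * p, math.sqrt(games * p * (1 - p))
    return [min(games, max(0, round(rng.gauss(mean, sd)))) for _ in range(n_seasons)]

atl = simulate_seasons(96)        # Atlanta, 96 actual 2004 wins
avg = sum(atl) / len(atl)         # long-run average should sit near 96
spread = (min(atl), max(atl))     # the MinW / MaxW columns above
```

Applying the same function to each team’s actual win total produces the AvgW, MinW, MaxW, and StdDev columns.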

The individual teams seem to be showing the expected amount of variation in simulated wins. The next step is to determine the division standings based on these simulated results. Recall that the model parameters have been chosen so as to be the most likely to reproduce the actual 2004 results. For each simulated season, I looked at the standings in each division to see if the simulation resulted in the actual standings observed in 2004.

There are measures other than “completely accurate prediction of division standings” we could have used. We could have counted the number of teams who were correctly placed, or computed the average error in the number of wins predicted for each team. Each has its merits. But the most basic question one might ask is “did the prediction match the eventual outcome?” Since the order of finish is typically of more interest than the size of the gaps between teams (a first-place team that wins by 20 games gains little or no advantage over one that wins by 2 games), we’ll use just that to evaluate the results.

Match = Percentage simulated standings matched actual 2004 standings

Division     Match
AL East      25.0%
AL Central   27.8%
AL West      20.8%
NL East      23.5%
NL Central   18.0%
NL West      26.9%

Standings in six divisions predicted correctly: 0.1% (1 out of 1000)
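The match-rate computation itself is straightforward to sketch. The code below is my own illustration of the procedure (leaving the simulated win totals unrounded so ties are effectively impossible), applied to the 2004 AL West:

```python
import math
import random

def sim_wins(actual_wins, rng, games=162):
    """One simulated win total under the normal approximation (unrounded)."""
    p = actual_wins / games
    return rng.gauss(games * p, math.sqrt(games * p * (1 - p)))

def match_rate(division, n_seasons=1000, seed=0):
    """Fraction of simulated seasons whose order of finish matches the actual order."""
    rng = random.Random(seed)
    actual = sorted(division, key=division.get, reverse=True)
    hits = 0
    for _ in range(n_seasons):
        sim = {team: sim_wins(w, rng) for team, w in division.items()}
        if sorted(sim, key=sim.get, reverse=True) == actual:
            hits += 1
    return hits / n_seasons

# 2004 AL West final win totals.
rate = match_rate({"ANA": 92, "OAK": 91, "TEX": 89, "SEA": 63})
```

A run of this sketch lands in the same neighborhood as the 20.8% reported above, with the usual sampling noise from only 1000 trials.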

The NL Central had the lowest match rate. This is to be expected, because it has the most teams (six) of any division. The number of possible orders of finish, 6! = 720, is six times larger than in a five-team division, which has just 5! = 120 possible orderings.

Given the combinatorial realities, it is somewhat surprising that the AL West, with just four teams and just 24 possible orderings, did not match more often. However, another factor is at work here. The AL West had three teams finish within 3 games of one another (the Angels with 92 wins, the Athletics with 91 wins, and the Rangers with 89 wins). Teams that are predicted to be close in ability, say, two teams expected to win 85 and 86 games respectively, will obviously be more likely to confound the prediction. It is the combined impact of the number of teams and how well they are separated in ability that determines the effect of randomness over a season on the likelihood of a matching prediction.

The other five-team divisions average out to about a 25% match rate. While this is about 30 times better than a completely random guess at the standings (which has a 1-in-120 chance, or about 0.8%, of being correct), it’s still much lower than the casual fan might have expected.
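The random-guess baseline is pure counting, and worth a two-line check (my own arithmetic sketch):

```python
import math

perms = math.factorial(5)            # 120 possible orderings of a five-team division
random_guess = 1 / perms             # about 0.8%
improvement = 0.25 * perms           # a 25% match rate is 30x the random baseline
print(perms, round(random_guess, 3), improvement)  # 120 0.008 30.0
```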

The moral of the story is that even 162 games is still a fairly small sample when trying to separate competitive teams from one another. Teams usually fall in a narrow band of ability — between 60 and 100 wins, rather than the full theoretical range of 0 to 162 wins. Even assuming perfect assessments of a team’s ability to win games over the course of a season, as we did in this model, the actual differences between teams are often too small to overcome the noise of a single season’s worth of games.

So as we head into the 2005 season, keep in mind that of everyone putting out their projections now, there is surely someone who will be crowing at the end of the year about the accuracy of their predictions (and, yes, we sure hope that it’s Baseball Prospectus doing the crowing). There is a certain amount of analysis that helps inform good predictions, but beyond a certain level, whether the results favor one well-formed prediction or another is largely the luck of the draw. The most useful part of a prediction may not be the numbers themselves, but the quality of the thought process that generates them. Great predictions are byproducts of carefully considered analysis. Read *how* the predictions were generated, and what insights and assumptions they rest on, rather than just *what* the predictions are, and be a better-informed fan because of it.