The most important thing Bill James did for baseball, in my opinion, was to
question the assumptions that were prevalent in the game and determined how
it was played. His statistical methods were tools, and only tools,
developed to help test the validity of those assumptions. Some
analysts–myself included–have occasionally forgotten that point, and
treated James’ results as if they were something grander than a mere tool.
In this study, I am going to put one of James’ premier tools–a tool that I
have used uncritically for many years–to the test. The results have forced
me to revise some of my own methods. Let’s all take a good, hard look at
James’ “Pythagorean theorem”.
The Pythagorean theorem, as James called it, was a formula designed to
relate how many runs a team scored and allowed to its won-lost record. The
most common way to express it is
Winning Pct = WPct = RS^2 / (RS^2 + RA^2)
where RS = runs scored, RA = runs allowed, and ^ means “raised to the
power of”, in this case, 2. It was the “raised to the power of
2” parts that reminded James of geometry’s Pythagorean theorem (a^2 =
b^2 + c^2), hence the bestowment of an unwieldy name.
Aside from the confusion caused by the name, the formula works reasonably
well. Take the 1998 Yankees: they scored 965 runs and allowed 656, for an
estimated winning percentage of (965^2)/(965^2 + 656^2) = 931,225 /
1,361,561 = .684. For a 162-game season that means 110.8 wins. They actually
won 114, of course, so the formula was off by a little more than three
wins. That’s a typical error for the routine.
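As a minimal sketch of the arithmetic above, the formula can be written in a few lines of Python. The function name `pythagorean_wpct` is a hypothetical label chosen for this illustration; the inputs are the 1998 Yankees figures quoted in the text.

```python
def pythagorean_wpct(rs, ra, exponent=2.0):
    """Estimated winning percentage from runs scored (rs) and runs allowed (ra)."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)

# 1998 Yankees, as quoted in the text: 965 runs scored, 656 allowed.
wpct = pythagorean_wpct(965, 656)
expected_wins = 162 * wpct

print(round(wpct, 3))           # estimated winning percentage
print(round(expected_wins, 1))  # estimated wins over 162 games
```

The exponent is left as a parameter so the same function can be reused with values other than 2.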
Just to confuse things, though, James wrote in later works that in order to
make the formula work best, you shouldn’t raise runs to the power of 2. It
should be a little less than that, he said, and named 1.82 as the ideal exponent.
Now, I’m not sure where he got that value, but in the 1990s, anybody can
get a database that shows runs scored and allowed for every team in
history, courtesy of Sean Lahman’s database,
and use Excel (or your favorite non-Microsoft spreadsheet) to get a regression line.
I put together a data set
of all teams that had played at
least 120 games in their season. Here’s how the root mean square error of
the wins varies with selected values of the exponent:
Exponent   Error
2.00       4.126
1.90       4.044
1.89       4.041
1.88       4.040
1.87       4.039
1.86       4.040
1.85       4.041
1.84       4.044
1.83       4.048
1.82       4.052
If your dataset differs–you limit it to 20th-century teams, you set a
threshold of 60 games, or of zero games–you’ll get slightly different
answers. It is pretty obvious, though, that there isn’t much difference
among any of the 1.8x values, and that they are all somewhat, though not
overwhelmingly, better than using 2.00.
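The exponent sweep can be sketched roughly as follows. This is an illustration under assumptions, not the spreadsheet actually used: the function names are hypothetical, each team is assumed to be a `(runs scored, runs allowed, wins, losses)` tuple, and games are taken as wins plus losses. The only team loaded here is the 1998 Yankees from the text, so the output will not reproduce the table; a full team database would have to be supplied for that.

```python
import math

def pythagorean_wins(rs, ra, games, exponent):
    """Expected wins from the Pythagorean formula with a chosen exponent."""
    return games * rs ** exponent / (rs ** exponent + ra ** exponent)

def rmse_of_wins(teams, exponent):
    """Root mean square error of predicted wins over all teams."""
    squared_errors = [
        (w - pythagorean_wins(rs, ra, w + l, exponent)) ** 2
        for rs, ra, w, l in teams
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def best_exponent(teams, candidates):
    """Candidate exponent with the smallest RMSE."""
    return min(candidates, key=lambda x: rmse_of_wins(teams, x))

# Placeholder data set: a single real team (1998 Yankees, from the text).
teams = [(965, 656, 114, 48)]
candidates = [round(1.80 + 0.01 * i, 2) for i in range(21)]  # 1.80 through 2.00

for x in (2.00, 1.87, 1.82):
    print(x, round(rmse_of_wins(teams, x), 3))
print("best:", best_exponent(teams, candidates))
```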
All of these, though, reduce the exponent to one number. Pete Palmer took a
different approach than James. Palmer’s system works off of the difference
between runs scored and allowed, rather than the ratio. In his system,
Runs Per Win = RPW = 10 * sqrt (RPG/9)
where RPG is the average number of runs per game, by both teams, in the
games that team played. Then expected wins becomes
ExpW = (R - RA)/RPW + (W+L)/2
If you apply Palmer’s method to the Pythagorean dataset above, you get an
average error of 4.038 wins – slightly better than the best Pythagorean value.
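For comparison, here is a small sketch of Palmer's calculation as described above. The function name is hypothetical, and the 1998 Yankees line from earlier in the text is reused as the example input.

```python
import math

def palmer_expected_wins(rs, ra, wins, losses):
    """Expected wins from run differential and the run environment."""
    games = wins + losses
    rpg = (rs + ra) / games        # runs per game, both teams combined
    rpw = 10 * math.sqrt(rpg / 9)  # runs per win
    return (rs - ra) / rpw + games / 2

# 1998 Yankees: 965 runs scored, 656 allowed, 114-48 record.
yankees = palmer_expected_wins(965, 656, 114, 48)
print(round(yankees, 1))
```

For this team the result lands within about half a win of the exponent-2 Pythagorean estimate, which is consistent with the small overall gap between the two methods.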
Testing the Theorem–Method One
The reason Palmer’s method tests better is that he makes use of data that
the Pythagorean method ignores: the run environment. James’ method assumes
that scoring twice as many runs as your opponents always results in the
same won/lost record, regardless of whether you win by an average of 4-2 or
20-10. This turns out to be a poor assumption.
This can be confirmed by a close examination of the real data. Using the
same Pythagorean dataset as before, I can calculate a “needed
exponent” for each team, the value that makes the Pythagorean formula
work perfectly for that team. The “needed exponent” can be calculated as
NeedExp = Log (W/L) / Log (R/RA)
It won’t work when runs scored = runs allowed, but that’s not a problem.
Set it to infinity, or some appropriately large number, and continue.
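A minimal sketch of the needed-exponent calculation, with the infinity fallback for teams whose runs scored equal runs allowed. The function name and the `fallback` parameter are hypothetical; the example uses the 1998 Yankees record quoted earlier.

```python
import math

def needed_exponent(rs, ra, wins, losses, fallback=float("inf")):
    """Exponent that makes the Pythagorean formula exact for one team."""
    if rs == ra:
        # log(1) = 0 in the denominator, so fall back as the text suggests.
        return fallback
    return math.log(wins / losses) / math.log(rs / ra)

# 1998 Yankees: 965 runs scored, 656 allowed, 114-48 record.
yankees_exp = needed_exponent(965, 656, 114, 48)
print(round(yankees_exp, 2))
```

The base of the logarithm does not matter here, since it cancels in the ratio.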
If I stratify the data according to RPG (again, that is R + RA per game),
and look at the median values for RPG and the needed exponent, the
following pattern occurs:
Layer    NumTms   Median RPG   Median NeedExp
>12         88      13.02         2.056
10-12      294      10.58         2.018
9-10       460       9.39         1.957
8-9        697       8.51         1.857
7-8        384       7.61         1.784
<7         109       6.77         1.625
The needed exponent of the medians is strongly related to the run-scoring
environment; the data above suggests a logarithmic relationship, with a
regression line of NeedExp = .44 + 1.51 log RPG (that's a base-ten log, not
natural, for the morbidly curious).
If you go back and use the formula above to calculate a
"Pythagorean" exponent for each individual team, the average
error is reduced to 3.991. That is a further gain a little more than half
as large as the one made by changing from a 2.00 exponent to the optimal 1.87.
Testing the Theorem--Method II
A) Modeling runs scored distributions
Just to make sure this wasn't a data fluke, I tried to replicate the
results through a modeled environment. This gives me the advantage of being
able to control the run distribution and make far more tests than I can
with the genuine data.
To build this model, I first needed to know the likelihood of a team
scoring X runs in a single game, given that they averaged Y runs per game.
How often does a team that averages 4.0 runs per game score exactly four
runs? Exactly 11 runs? Get shut out? This turned out to be a difficult
step: no one I knew had a good, simple formula to do it.
I have seen speculation suggesting that the Poisson distribution should be
a good model for run scoring per game. The Poisson distribution is often
used to model the distribution of a number of events (in our case, runs)
over a specified time period (a game) or spatial domain, given the typical
value for that period.
Traffic engineers use the Poisson distribution to model the number of cars
that will pass a spot on a highway at its busiest hour, knowing the hourly
average. Wildlife officials use it to model the number of fish that are in
a lake from knowing how many get caught. If X is the number of events, and
Y is the mean number of events, then the Poisson distribution is defined as:
P(X,Y) = (Y^X) * exp (-Y) / X!
Unfortunately, the data says it's a lousy fit for runs per game:
The standard Poisson distribution is too narrow around the mean; real teams
score zero--which the Poisson distribution does not handle well--and 10
runs considerably more often than M. Poisson's formula suggests, and score
within a run of their average correspondingly less often.
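The plain Poisson formula is easy to evaluate directly. The sketch below (hypothetical function name) prints how many times per 162 games a team averaging 4.0 runs would be expected to score each total from 0 to 10 under that model; it says nothing about real frequencies, which have to come from game data.

```python
import math

def poisson_pmf(x, mean):
    """P(X, Y) = Y^X * exp(-Y) / X! for X events given a mean of Y."""
    return mean ** x * math.exp(-mean) / math.factorial(x)

# Expected games per 162 at each run total, for a team averaging 4.0 RPG.
for runs in range(11):
    print(runs, round(162 * poisson_pmf(runs, 4.0), 1))
```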
Nonetheless, the Poisson distribution is the way to go if you need a little
Think about the way a real team enters play. It becomes clear that a team
that averages four runs per game over a season doesn't take that 4.00
average out there each and every day. They play in better or worse
pitcher's parks, they play in different weather conditions and they don't
always face an average pitcher. I can emulate these effects by using a
series of Poisson functions, rather than just one, to try and replicate the
effects of having different averages on different days.
On the other hand, I don't want to make it too complicated. I wound up
using three Poisson equations to model the team's run distribution. Each
counts 1/3 towards the final total, and all are evaluated for the same
value of X.
The Y value, however, changes. I found the best matches to the median
distribution by using the following values for Y: the actual RPG, and RPG
plus or minus (2 * (RPG/4)^.75). For a team that scores 4.00 RPG, that is
simply 2, 4 and 6 for the Y values; the function allows the difference
about the mean to grow, slowly, as runs increase:
RPG     Y1      Y2      Y3
2.0     2.0     3.19    0.81
3.0     3.0     4.61    1.39
4.0     4.0     6.00    2.00
5.0     5.0     7.36    2.64
6.0     6.0     8.71    3.28
7.0     7.0     10.04   3.96
8.0     8.0     11.36   4.64
10.0    10.0    13.98   6.02
15.0    15.0    20.39   9.61
Let me demonstrate. How often does a team that scores 4.50 RPG score
exactly four runs? The simple Poisson model (evaluated at Y = 4.5, and X =
4) would say 31 times per 162 games. The serial Poisson model is the
average of P(4,4.5), P(4,2.32), and P(4,6.68), which individually yield 31,
19 and 17, for an average of 22.
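The serial model in this demonstration can be sketched as below. The function names are hypothetical; the three Y values and the equal 1/3 weights follow the description in the text.

```python
import math

def poisson_pmf(x, mean):
    """Plain Poisson probability of exactly x events given the mean."""
    return mean ** x * math.exp(-mean) / math.factorial(x)

def serial_means(rpg):
    """The three Y values: RPG, and RPG plus or minus 2 * (RPG/4)^.75."""
    spread = 2 * (rpg / 4) ** 0.75
    return (rpg, rpg + spread, rpg - spread)

def serial_poisson_pmf(x, rpg):
    """Average of the three Poisson terms, each weighted 1/3."""
    return sum(poisson_pmf(x, y) for y in serial_means(rpg)) / 3

# How often does a 4.50 RPG team score exactly four runs, per 162 games?
simple = 162 * poisson_pmf(4, 4.5)
serial = 162 * serial_poisson_pmf(4, 4.5)
print(round(simple), round(serial))
```

Note that this sketch does not guard against very small RPG values, where the lowest of the three means could go negative.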
From 1980 through 1998, there were 40 teams that scored between 4.45 and
4.55 RPG, which I'll use as my "4.5 RPG" sample. One of these
teams, the 1995 Giants, scored exactly four runs 33 times in 162 games
(they actually scored four runs 29 times in a strike-shortened 144-game
season; I've rounded all figures off to 162-game rates), to set the
high-water mark for the group. The 1997 Pirates were at the low end of the
scale, doing it just 14 times. The median value for the group was 22 (for
that matter, the mean and mode for the group were also 22).
The only place where the equation really fails is in the case of a shutout;
the predictions are too low. My--admittedly clumsy--solution is to treat
everything as if it were one run higher than reality. Instead of evaluating
the equation for X=0 and Y=2.0, 4.0 and 6.0, evaluate it at X=1 for Y=3.0,
5.0 and 7.0. I.e., at P(1,Y1+1), P(1,Y2+1), and P(1,Y3+1). It may be a
kludge, but it works much better than the basic model when X=0.
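One way to read the shutout adjustment is sketched below: the one-run shift is applied only when X=0, and every other run total uses the three-term average unchanged. That reading, and the function names, are assumptions made for illustration.

```python
import math

def poisson_pmf(x, mean):
    """Plain Poisson probability of exactly x events given the mean."""
    return mean ** x * math.exp(-mean) / math.factorial(x)

def serial_means(rpg):
    """The three Y values: RPG, and RPG plus or minus 2 * (RPG/4)^.75."""
    spread = 2 * (rpg / 4) ** 0.75
    return (rpg, rpg + spread, rpg - spread)

def run_probability(x, rpg):
    """Serial Poisson estimate, shifted up one run for shutouts only."""
    if x == 0:
        return sum(poisson_pmf(1, y + 1) for y in serial_means(rpg)) / 3
    return sum(poisson_pmf(x, y) for y in serial_means(rpg)) / 3

# Shutouts per 162 games for a 4.00 RPG team, without and with the shift.
unshifted = sum(poisson_pmf(0, y) for y in serial_means(4.0)) / 3
shifted = run_probability(0, 4.0)
print(round(162 * unshifted, 1), round(162 * shifted, 1))
```

With the shift in place the probabilities over all run totals no longer sum to exactly one, which is part of what makes it a kludge.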
To demonstrate, here's how the model compares to the genuine distributions
for teams averaging 3.5, 4.0, 4.5 and 5.0 runs per game. In each case, I
have normalized the frequencies to 162-game seasons, and included teams
within 0.05 runs of the mean. The agreement with the median is excellent in
The data for team distribution of runs can be found
B) Using the model
I set up a spreadsheet to use the model described above. This
spreadsheet allows me to fix the number of runs both teams together will
score (the value in cell A1), and the ratio of one team's runs to another
(the value in cell A5). It uses a random number generator to generate
scores of 1620 games at a time, counts how many times team A outscored team
B, and comes up with the exponent needed to satisfy the Pythagorean theorem
for the 1620-game sample.
What I did was to set the run ratio for the two teams at a given value, say
2.00 (team A averages twice as many runs as team B), and then step through
the total runs values, from a combined average of 2.0 runs per game to a
combined average of 20.0 runs per game (the latter case would represent an
average score of 13.3-6.7). The macro in cell T9 will re-set the random
number generator a dozen times, and will save the RPG value and the needed
exponent for regression analysis later.
I used fairly high ratios of runs scored to runs allowed in order to make
sure that one team both scored and won more than the other team. To do
otherwise would result in outlier values that would distort the analysis. I
don't think this is a major concern; there were no major changes in the
distribution between using a ratio of 1.1 and 2.0.
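A rough Python analogue of this kind of simulation is sketched below. It is not the spreadsheet itself, and several details are assumptions: each team's score is drawn by picking one of the three serial Poisson means at random and then drawing from a plain Poisson distribution, the two teams are drawn independently, ties are simply replayed, the shutout adjustment is left out, and the exponent is computed from the simulated run totals in decided games. All function names are hypothetical.

```python
import math
import random

def serial_means(rpg):
    """The three Y values: RPG, and RPG plus or minus 2 * (RPG/4)^.75."""
    spread = 2 * (rpg / 4) ** 0.75
    return (rpg, rpg + spread, rpg - spread)

def draw_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def draw_runs(rpg, rng):
    """One game's runs: pick a mean with probability 1/3 each, then draw."""
    return draw_poisson(rng.choice(serial_means(rpg)), rng)

def simulated_exponent(total_rpg, ratio, games=1620, seed=0):
    """Needed Pythagorean exponent for one simulated sample of decided games."""
    rng = random.Random(seed)
    rpg_b = total_rpg / (1 + ratio)
    rpg_a = total_rpg - rpg_b
    wins = losses = runs_a = runs_b = 0
    while wins + losses < games:
        a, b = draw_runs(rpg_a, rng), draw_runs(rpg_b, rng)
        if a == b:
            continue  # assumption: tied scores are thrown out and replayed
        wins += a > b
        losses += b > a
        runs_a += a
        runs_b += b
    return math.log(wins / losses) / math.log(runs_a / runs_b)

# Step through a few run environments at a fixed 2.00 run ratio.
for total in (4.0, 9.0, 14.0, 20.0):
    print(total, round(simulated_exponent(total, 2.0), 2))
```

Changing `seed` plays the role of re-setting the random number generator between runs.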
What I found was this:
There is, again, a definite relationship between the frequency of runs and
the "best" Pythagorean exponent. Once again, the shape of the
line is logarithmic, with a line function equal to roughly 1.4*log (RPG) +
.65. That is similar to the results from using the medians, but a little on
the high side. If I limit the regression to those ranges within the
historical range of scoring, say from 6 to 13 RPG, the regression drops to
about 1.55 log(RPG) + .45, very similar to the median test data.
This latter function, when used to predict the Pythagorean exponent, yields
an average error of 4.005. That is notably better than any fixed exponent
solution, and better than Palmer's method, but still short of what I found
using the median value. The model's results, though, are close enough to
corroborate the findings, and assure us that this was not a fluke of the
data: there really is a dependence between the necessary Pythagorean
exponent and the RPG environment, and the dependence is logarithmic.
I conclude from this that the exponent should be set at approximately
Exponent = 1.50 * log (RPG, both teams) + 0.45
In the data set I used for testing the Pythagorean values, this results in
a 3.9911 root-mean-square error for wins. You can fiddle with the formula
to get very slightly better values (1.40 log RPG + .55 gets you 3.9905, as
good as I could find), but I prefer to stick with the higher multiplier
indicated by the median and model studies.
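Applied to a single team, the floating exponent might be computed as in the sketch below. The function names are hypothetical, the log is base ten as in the earlier regression, and the 1998 Yankees line from the start of the article serves as the example.

```python
import math

def floating_exponent(rpg):
    """Exponent = 1.50 * log10(RPG, both teams) + 0.45."""
    return 1.50 * math.log10(rpg) + 0.45

def floating_expected_wins(rs, ra, games):
    """Pythagorean expected wins using an exponent set by the run environment."""
    x = floating_exponent((rs + ra) / games)
    return games * rs ** x / (rs ** x + ra ** x)

# 1998 Yankees: 965 runs scored, 656 allowed, 162 games.
print(round(floating_exponent((965 + 656) / 162), 2))
print(round(floating_expected_wins(965, 656, 162), 1))
```

At roughly 10 combined runs per game the exponent comes out near 1.95, so for a team in a high-scoring environment the estimate differs only modestly from the fixed-exponent versions.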
In practical terms, the implications are fairly small. For most
off-the-cuff calculations of runs and runs allowed into wins, the 1.5% gain
in accuracy isn't worth the trouble of finding a new exponent for every
team; just use 1.85 or thereabouts, and get on with your life.
It really makes a difference, though, to the small group of people who try
to assess the value of a player's performance as precisely as possible.
The most noticeable impact is going to be on the value of good pitchers in
extremely pitching-friendly environments. A pitcher-friendly environment
brings down the exponent; a good pitcher, by his own efforts, decreases the
run environment and the Pythagorean exponent even further.
Perhaps the most extreme case is Bob Gibson's 1968 season. That was the
year he had a 1.12 ERA and went "only" 22-9. Leaguewide scoring
reached its modern nadir that year. To summarize it quickly:
Gibson allowed 49 runs in 304 2/3 innings in a park I rated as a strong
pitcher's park. The average National League pitcher that year would have
allowed 116.3 runs in 304 2/3 innings; in Gibson's park, which scored as a
.934, the average would be 108.6. The Cardinals, however, were an
above-average offensive team; their team scoring rate suggests 121.4 runs
per 304 2/3 innings.
Using that as an estimate for the offensive side of the ledger, and looking
at his 31 decisions, we find that
- With a Pythagorean exponent of 2, you'd expect 26.7 wins. An
uncharitable reviewer might speculate that Gibson "choked" away
nearly five wins.
- With a 1.85 exponent, you'd get 26.1 wins. Slightly better, but still
four wins off.
- Palmer's method would say that Gibson, playing in a 5.03 RPG environment
(that is, 121.4, plus 49, divided by 304 2/3 innings, times 9), would have
a Run-per-Win factor of 7.48. Since I'm estimating him at 72.4 runs better
than his opposition, that works out to 25.2 estimated wins. Better still.
- With a floating exponent, you'd find that the 5.03 RPG environment
suggests that the proper exponent for Gibson was just 1.50. That enables
you to cut another half-win off the estimate, to 24.7. In other words, 40%
of the error in the original estimate is due to the model's failure to
account for the run environment.
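The four Gibson estimates above can be checked with a short script. This is only a sketch of the arithmetic, using the figures quoted in the text (121.4 runs of estimated support, 49 runs allowed, 304 2/3 innings, 31 decisions); the variable and function names are hypothetical.

```python
import math

RUNS_FOR = 121.4        # estimated offensive support, from the text
RUNS_AGAINST = 49       # runs Gibson allowed
INNINGS = 304 + 2 / 3
DECISIONS = 31

def pythagorean_wins(exponent):
    """Expected wins in 31 decisions for a given Pythagorean exponent."""
    rs, ra = RUNS_FOR ** exponent, RUNS_AGAINST ** exponent
    return DECISIONS * rs / (rs + ra)

# Run environment and Palmer's runs-per-win factor.
rpg = (RUNS_FOR + RUNS_AGAINST) / INNINGS * 9
rpw = 10 * math.sqrt(rpg / 9)
palmer_wins = (RUNS_FOR - RUNS_AGAINST) / rpw + DECISIONS / 2

# Floating exponent from the run environment (base-ten log).
floating = 1.50 * math.log10(rpg) + 0.45

print("exponent 2.00:", round(pythagorean_wins(2.00), 1))
print("exponent 1.85:", round(pythagorean_wins(1.85), 1))
print("Palmer:", round(palmer_wins, 1), "RPG", round(rpg, 2), "RPW", round(rpw, 2))
print("floating:", round(pythagorean_wins(floating), 1), "exponent", round(floating, 2))
```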
That's plenty to digest for now. I'll re-examine more of the consequences
later this summer.