The most important thing Bill James did for baseball, in my opinion, was to

question the assumptions that were prevalent in the game and determined how

it was played. His statistical methods were tools, and only tools,

developed to help test the validity of those assumptions. Some

analysts–myself included–have occasionally forgotten that point, and

treated James’ results as if they were something grander than a mere tool.

In this study, I am going to put one of James’ premier tools–a tool that I

have used uncritically for many years–to the test. The results have forced

me to revise some of my own methods. Let’s all take a good, hard look at

James’ “Pythagorean theorem”.

**Review**

The Pythagorean theorem, as James called it, was a formula designed to

relate how many runs a team scored and allowed to its won-lost record. The

most common way to express it is

Winning Pct = WPct = RS^2 / (RS^2 + RA^2)

where RS = runs scored, RA = runs allowed, and ^ means “raised to the

power of”, in this case, 2. It was the “raised to the power of

2” parts that reminded James of geometry’s Pythagorean theorem (a^2 =

b^2 + c^2), hence the bestowment of an unwieldy name.

Aside from the confusion caused by the name, the formula works reasonably

well. Take the 1998 Yankees: they scored 965 runs and allowed 656, for an

estimated winning percentage of (965^2)/(965^2 + 656^2) = 931,225 /

1,361,561 = .684. For a 162-game season that means 110.8 wins. They actually

won 114, of course, so the formula was off by a little more than three

wins. That’s a typical error for the routine.
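
As a quick check on the arithmetic, here is a minimal Python sketch of the formula. The function name `pythagorean_wpct` and the `exponent` parameter are my own labels, not anything James published; the Yankees figures are the ones quoted above.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Estimated winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# 1998 Yankees: 965 runs scored, 656 allowed, 162 games.
yankees_wpct = pythagorean_wpct(965, 656)
yankees_wins = 162 * yankees_wpct
print(f"{yankees_wpct:.3f} -> {yankees_wins:.1f} wins")
```

Run as written, this should reproduce the .684 and 110.8 figures in the text.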

Just to confuse things, though, James wrote in later works that in order to

make the formula work best, you shouldn’t raise runs to the power of 2. It

should be a little less than that, he said, and named 1.82 as the ideal

exponent.

Now, I’m not sure where he got that value, but in the 1990s, anybody can

get a database that shows runs scored and allowed for every team in

history, courtesy of Sean Lahman’s database, and use Excel (or your favorite

non-Microsoft spreadsheet) to get a regression line. I put together a data

set of all teams that had played at

least 120 games in their season. Here’s how the root mean square error of

the wins varies with selected values of the exponent:

| Exponent | Error |
|----------|-------|
| 2.00 | 4.126 |
| 1.90 | 4.044 |
| 1.89 | 4.041 |
| 1.88 | 4.040 |
| 1.87 | 4.039 |
| 1.86 | 4.040 |
| 1.85 | 4.041 |
| 1.84 | 4.044 |
| 1.83 | 4.048 |
| 1.82 | 4.052 |

If your dataset differs–you limit it to 20th-century teams, you set a

threshold of 60 games, or of zero games–you’ll get slightly different

answers. It is pretty obvious, though, that there isn’t much difference

among any of the 1.8x values, and that they are all somewhat, though not

overwhelmingly, better than using 2.00.
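
For anyone who wants to repeat the exponent scan, the sketch below shows one way to compute the root mean square error of wins for a candidate exponent. It is an illustration, not the spreadsheet I used: the function names are hypothetical, and the `teams` list holds only the 1998 Yankees as a placeholder row, so you would need to load the full Lahman data yourself to approach the table's numbers.

```python
import math


def pythagorean_wins(rs, ra, games, exponent):
    """Expected wins over `games` decisions for a given exponent."""
    ratio = (rs / ra) ** exponent
    return games * ratio / (1.0 + ratio)


def rmse_wins(teams, exponent):
    """teams: list of (runs_scored, runs_allowed, wins, losses) tuples."""
    squared = [
        (pythagorean_wins(rs, ra, w + l, exponent) - w) ** 2
        for rs, ra, w, l in teams
    ]
    return math.sqrt(sum(squared) / len(squared))


def best_exponent(teams, candidates):
    """Candidate exponent with the smallest RMSE on this data set."""
    return min(candidates, key=lambda e: rmse_wins(teams, e))


# Placeholder data: one real team only (1998 Yankees, 114-48).
teams = [(965, 656, 114, 48)]
candidates = [round(1.80 + 0.01 * i, 2) for i in range(21)]  # 1.80 .. 2.00

for e in (2.00, 1.87, 1.82):
    print(e, round(rmse_wins(teams, e), 3))
```

With a single team the "best" exponent is meaningless; the scan only becomes informative over the full set of team-seasons.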

All of these, though, reduce the exponent to one number. Pete Palmer took a

different approach than James. Palmer’s system works off of the difference

between runs scored and allowed, rather than the ratio. In his system,

Runs Per Win = RPW = 10 * sqrt (RPG/9)

where RPG is the average number of runs per game, by both teams, in the

games that team played. Then expected wins becomes

ExpW = (R - RA)/RPW + (W+L)/2

If you apply Palmer’s method to the Pythagorean dataset above, you get an

average error of 4.038 wins – slightly better than the best Pythagorean value.
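
Palmer's two formulas translate directly into code. The sketch below is my own rendering (the function names are assumptions), and it takes games played as W+L, which ignores ties. I apply it to the 1998 Yankees numbers from earlier.

```python
import math


def palmer_runs_per_win(rpg):
    """RPW = 10 * sqrt(RPG / 9), with RPG counting both teams' runs."""
    return 10.0 * math.sqrt(rpg / 9.0)


def palmer_expected_wins(runs_scored, runs_allowed, games):
    """ExpW = (R - RA) / RPW + (W + L) / 2, assuming games = W + L."""
    rpg = (runs_scored + runs_allowed) / games
    rpw = palmer_runs_per_win(rpg)
    return (runs_scored - runs_allowed) / rpw + games / 2.0


# 1998 Yankees: 965 scored, 656 allowed, 162 games.
yankees_palmer = palmer_expected_wins(965, 656, 162)
print(round(yankees_palmer, 1))
```

For this team the estimate lands close to the 110.8 wins the fixed exponent of 2 gave.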

**Testing the Theorem–Method One**

The reason Palmer’s method tests better is that he makes use of data that

the Pythagorean method ignores: the run environment. James’ method assumes

that scoring twice as many runs as your opponents always results in the

same won/lost record, regardless of whether you win by an average of 4-2 or

20-10. This turns out to be a poor assumption.

This can be confirmed by a close examination of the real data. Using the

same Pythagorean dataset as before, I can calculate a “needed

exponent” for each team, the value that makes the Pythagorean formula

work perfectly for that team. The “needed exponent” can be

expressed as

NeedExp = Log (W/L) / Log (R/RA)

It won’t work when runs scored = runs allowed, but that’s not a problem.

Set it to infinity, or some appropriately large number, and continue.
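
A minimal sketch of the needed-exponent calculation follows. The function name is mine, and returning infinity for the R = RA case is just one of the two options mentioned above. It also assumes the team has at least one win and one loss, since the logarithm is undefined otherwise.

```python
import math


def needed_exponent(wins, losses, runs_scored, runs_allowed):
    """Exponent that makes the Pythagorean formula exact for one team."""
    if runs_scored == runs_allowed:
        # log(R/RA) is zero here, so the ratio is undefined.
        return math.inf
    return math.log(wins / losses) / math.log(runs_scored / runs_allowed)


# 1998 Yankees: 114-48 with 965 runs scored and 656 allowed.
print(round(needed_exponent(114, 48, 965, 656), 2))
```

The base of the logarithm does not matter in this ratio, so natural logs are fine.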

If I stratify the data according to RPG (again, that is R + RA per game),

and look at the median values for RPG and the needed exponent, the

following pattern occurs:

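| Layer (RPG) | NumTms | Median RPG | Median NeedExp |
|-------------|--------|------------|----------------|
| >12 | 88 | 13.02 | 2.056 |
| 10-12 | 294 | 10.58 | 2.018 |
| 9-10 | 460 | 9.39 | 1.957 |
| 8-9 | 697 | 8.51 | 1.857 |
| 7-8 | 384 | 7.61 | 1.784 |
| <7 | 109 | 6.77 | 1.625 |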

The needed exponent of the medians is strongly related to the run-scoring

environment; the data above suggests a logarithmic relationship, with a

regression line of NeedExp = .44 + 1.51 log RPG (that's a base-ten log, not

natural, for the morbidly curious).

If you go back and use the regression line above to calculate a

"Pythagorean" exponent for each individual team, the average

error is reduced to 3.991, half again as much improvement as there was by

changing from a 2.00 exponent to the optimal 1.87.

**Testing the Theorem–Method Two**

*A) Modeling runs scored distributions*

Just to make sure this wasn't a data fluke, I tried to replicate the

results through a modeled environment. This gives me the advantage of being

able to control the run distribution and make far more tests than I can

with the genuine data.

To build this model, I first needed to know the likelihood of a team

scoring X runs in a single game, given that they averaged Y runs per game.

How often does a team that averages 4.0 runs per game score exactly four

runs? Exactly 11 runs? Get shut out? This turned out to be a difficult

step: no one I knew had a good, simple formula to do it.

I have seen speculation suggesting that the Poisson distribution should be

a good model for run scoring per game. The Poisson distribution is often

used to model the distribution of a number of events (in our case, runs)

over a specified time period (a game) or spatial domain, given the typical

value for that period.

Traffic engineers use the Poisson distribution to model the number of cars

that will pass a spot on a highway at its busiest hour, knowing the hourly

average. Wildlife officials use it to model the number of fish that are in

a lake from knowing how many get caught. If X is the number of events, and

Y is the mean number of events, then the Poisson distribution is defined as:

P(X,Y) = (Y^X) * exp (-Y) / X!
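
In code, the definition is a one-liner. This sketch uses only Python's standard `math` module; the function name `poisson` is my own, and the 4.5 runs per game example is the one worked through later in this section.

```python
import math


def poisson(x, y):
    """P(X, Y): chance of exactly x events when the mean is y."""
    return (y ** x) * math.exp(-y) / math.factorial(x)


# Games per 162 in which a 4.5 RPG team scores exactly four runs,
# according to the simple one-function Poisson model.
per_162 = 162 * poisson(4, 4.5)
print(round(per_162))
```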

Unfortunately, the data says it's a lousy fit for runs per game:

The standard Poisson distribution is too narrow around the mean; real teams

score zero--which the Poisson distribution does not handle well--and 10

runs considerably more often than M. Poisson's formula suggests, and score

within a run of their average correspondingly less often.

Nonetheless, the Poisson distribution is the way to go; it just needs a little

more complexity.

Think about the way a real team enters play. It becomes clear that a team

that averages four runs per game over a season doesn't take that 4.00

average out there each and every day. They play in better or worse

pitcher's parks, they play in different weather conditions and they don't

always face an average pitcher. I can emulate these effects by using a

series of Poisson functions, rather than just one, to try and replicate the

effects of having different averages on different days.

On the other hand, I don't want to make it too complicated. I wound up

using three Poisson equations to model the team's run distribution. Each

counts 1/3 towards the final total, and all are evaluated for the same

value of X.

The Y value, however, changes. I found the best matches to the median

distribution by using the following values for Y: the actual RPG, and RPG

plus or minus (2 * (RPG/4)^.75). For a team that scores 4.00 RPG, that is

simply 2, 4 and 6 for the Y values; the function allows the difference

about the mean to grow, slowly, as runs increase:

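| RPG | Y1 | Y2 | Y3 |
|------|------|-------|------|
| 2.0 | 2.0 | 3.19 | 0.81 |
| 3.0 | 3.0 | 4.61 | 1.39 |
| 4.0 | 4.0 | 6.00 | 2.00 |
| 5.0 | 5.0 | 7.36 | 2.64 |
| 6.0 | 6.0 | 8.71 | 3.28 |
| 7.0 | 7.0 | 10.04 | 3.96 |
| 8.0 | 8.0 | 11.36 | 4.64 |
| 10.0 | 10.0 | 13.98 | 6.02 |
| 15.0 | 15.0 | 20.39 | 9.61 |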

Let me demonstrate. How often does a team that scores 4.50 RPG score

exactly four runs? The simple Poisson model (evaluated at Y = 4.5, and X =

4) would say 31 times per 162 games. The serial Poisson model is the

average of P(4,4.5), P(4,2.32), and P(4,6.68), which individually yield 31,

19 and 17, for an average of 22.
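
The serial model can be sketched as below. The helper names `serial_means` and `serial_poisson` are hypothetical labels for the three Y values and their equally weighted average; the spread formula is the one given above.

```python
import math


def poisson(x, y):
    """Simple Poisson probability of x events given mean y."""
    return (y ** x) * math.exp(-y) / math.factorial(x)


def serial_means(rpg):
    """The three Y values: RPG, and RPG plus or minus 2 * (RPG/4)^0.75."""
    spread = 2.0 * (rpg / 4.0) ** 0.75
    return (rpg, rpg + spread, rpg - spread)


def serial_poisson(x, rpg):
    """Average of three Poisson terms, each counting one third."""
    return sum(poisson(x, y) for y in serial_means(rpg)) / 3.0


# A 4.5 RPG team scoring exactly four runs, per 162 games.
games_with_four = 162 * serial_poisson(4, 4.5)
print([round(y, 2) for y in serial_means(4.5)], round(games_with_four))
```

This version does not yet include any special handling for shutouts.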

From 1980 through 1998, there were 40 teams that scored between 4.45 and

4.55 RPG, which I'll use as my "4.5 RPG" sample. One of these

teams, the 1995 Giants, scored exactly four runs 33 times in 162 games

(they actually scored four runs 29 times in a strike-shortened 144-game

season; I've rounded all figures off to 162-game rates), to set the

high-water mark for the group. The 1997 Pirates were at the low end of the

scale, doing it just 14 times. The median value for the group was 22 (for

that matter, the mean and mode for the group were also 22.)

The only place where the equation really fails is in the case of a shutout;

the predictions are too low. My--admittedly clumsy--solution is to treat

everything as if it were one run higher than reality. Instead of evaluating

the equation for X=0 and Y=2.0, 4.0 and 6.0, evaluate it at X=1 for Y=3.0,

5.0 and 7.0. I.e., at P(1,Y1+1), P(1,Y2+1), and P(1,Y3+1). It may be a

kludge, but it works much better than the basic model when X=0.
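
Here is how the shutout adjustment might look when bolted onto the three-function model. This is a sketch under my own naming (`run_frequency` is hypothetical), and note one consequence I have not corrected for: once the X=0 case is replaced, the frequencies over all X no longer sum to exactly one.

```python
import math


def poisson(x, y):
    """Simple Poisson probability of x events given mean y."""
    return (y ** x) * math.exp(-y) / math.factorial(x)


def serial_means(rpg):
    """RPG, and RPG plus or minus 2 * (RPG/4)^0.75."""
    spread = 2.0 * (rpg / 4.0) ** 0.75
    return (rpg, rpg + spread, rpg - spread)


def run_frequency(x, rpg):
    """Serial Poisson frequency, with the one-run shift applied to shutouts."""
    means = serial_means(rpg)
    if x == 0:
        # Shutout kludge: evaluate at X=1 with every Y raised by one run.
        return sum(poisson(1, y + 1.0) for y in means) / 3.0
    return sum(poisson(x, y) for y in means) / 3.0


basic_shutouts = sum(poisson(0, y) for y in serial_means(4.0)) / 3.0
adjusted_shutouts = run_frequency(0, 4.0)
print(round(162 * basic_shutouts, 1), round(162 * adjusted_shutouts, 1))
```

For a 4.00 RPG team the adjusted figure comes out higher than the basic one, which is the direction the correction needs to go.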

To demonstrate, here's how the model compares to the genuine distributions

for teams averaging 3.5, 4.0, 4.5 and 5.0 runs per game. In each case, I

have normalized the frequencies to 162-game seasons, and included teams

within 0.05 runs of the mean. The agreement with the median is excellent in

all cases.

The data for team distribution of runs can be found here.

*B) Using the model*

I developed this spreadsheet to use the model described above. This

spreadsheet allows me to fix the number of runs both teams together will

score (the value in cell A1), and the ratio of one team's runs to another

(the value in cell A5). It uses a random number generator to generate

scores of 1620 games at a time, counts how many times team A outscored team

B, and comes up with the exponent needed to satisfy the Pythagorean theorem

for the 1620-game sample.

What I did was to set the run ratio for the two teams at a given value, say

2.00 (team A averages twice as many runs as team B), and then step through

the total runs values, from a combined average of 2.0 runs per game to a

combined average of 20.0 runs per game (the latter case would represent an

average score of 13.3-6.7). The macro in cell T9 will re-set the random

number generator a dozen times, and will save the RPG value and the needed

exponent for regression analysis later.
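
For readers without the spreadsheet, the same experiment can be approximated in a few lines of Python. This is a rough stand-in, not a port of the spreadsheet, and it makes several assumptions of its own: each team picks one of its three Y values at random for every game, tied games are thrown out and replayed, the shutout adjustment is ignored, and the needed exponent is computed from the runs actually sampled. All function names are hypothetical.

```python
import math
import random


def serial_means(rpg):
    """RPG, and RPG plus or minus 2 * (RPG/4)^0.75."""
    spread = 2.0 * (rpg / 4.0) ** 0.75
    return (rpg, rpg + spread, rpg - spread)


def draw_poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_runs(rng, rpg):
    """One game's runs: pick one of the three means, then sample it."""
    return draw_poisson(rng, rng.choice(serial_means(rpg)))


def simulated_needed_exponent(total_rpg, ratio, games=1620, seed=0):
    """Needed Pythagorean exponent for one simulated block of games."""
    rng = random.Random(seed)
    b_rpg = total_rpg / (1.0 + ratio)
    a_rpg = total_rpg - b_rpg
    wins = losses = runs_a = runs_b = 0
    while wins + losses < games:
        a, b = draw_runs(rng, a_rpg), draw_runs(rng, b_rpg)
        if a == b:
            continue  # assumption: ties are discarded and replayed
        runs_a += a
        runs_b += b
        if a > b:
            wins += 1
        else:
            losses += 1
    return math.log(wins / losses) / math.log(runs_a / runs_b)


# Team A averages twice as many runs as team B, 9 combined runs per game.
print(round(simulated_needed_exponent(9.0, 2.0, seed=1), 2))
```

Stepping `total_rpg` from 2.0 to 20.0 and repeating with a dozen seeds at each step would give the kind of RPG-versus-exponent sample described above.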

I used fairly high ratios of runs scored to runs allowed in order to make

sure that one team both scored and won more than the other team. To do

otherwise would result in outlier values that would distort the analysis. I

don't think this is a major concern; there were no major changes in the

distribution between using a ratio of 1.1 and 2.0.

What I found was this:

There is, again, a definite relationship between the frequency of runs and

the "best" Pythagorean exponent. Once again, the shape of the

line is logarithmic, with a line function equal to roughly 1.4*log (RPG) +

.65. That is similar to the results from using the medians, but a little on

the high side. If I limit the regression to those ranges within the

historical range of scoring, say from 6 to 13 RPG, the regression drops to

about 1.55 log(RPG) + .45, very similar to the median test data.

This latter function, when used to predict the Pythagorean exponent, yields

an average error of 4.005. That is notably better than any fixed exponent

solution, and better than Palmer's method, but still short of what I found

using the median value. The model's results, though, are close enough to

corroborate the findings, and assure us that this was not a fluke of the

data: there really is a dependence between the necessary Pythagorean

exponent and the RPG environment, and the dependence is a logarithmic

one.

I conclude from this that the exponent should be set at approximately

Exponent = 1.50 * log (RPG, both teams) + 0.45

In the data set I used for testing the Pythagorean values, this results in

a 3.9911 root-mean-square error for wins. You can fiddle with the formula

to get very slightly better values (1.40 log RPG + .55 gets you 3.9905, as

good as I could find), but I prefer to stick with the higher multiplier

indicated by the median and model studies.
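
Put into code, the floating exponent is a small change to the usual calculation. In this sketch the function names and the `slope` and `intercept` parameters are my own labels; the log is base ten, as noted earlier, and games are taken as W+L.

```python
import math


def floating_exponent(rpg, slope=1.50, intercept=0.45):
    """Exponent = 1.50 * log10(RPG, both teams) + 0.45."""
    return slope * math.log10(rpg) + intercept


def floating_pythagorean_wins(runs_scored, runs_allowed, games):
    """Expected wins using an exponent set by the team's run environment."""
    rpg = (runs_scored + runs_allowed) / games
    exponent = floating_exponent(rpg)
    ratio = (runs_scored / runs_allowed) ** exponent
    return games * ratio / (1.0 + ratio)


# 1998 Yankees: roughly a 10 RPG environment.
print(round(floating_exponent(10.0), 2))
print(round(floating_pythagorean_wins(965, 656, 162), 1))
```

Passing `slope=1.40, intercept=0.55` gives the alternative fit mentioned above.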

**Implications**

In practical terms, the implications are fairly small. For most

off-the-cuff calculations of runs and runs allowed into wins, the 1.5% gain

in accuracy isn't worth the trouble of finding a new exponent for every

team; just use 1.85 or thereabouts, and get on with your life.

It really makes a difference, though, to the small group of people who try

to assess the value of a player's performance as precisely as possible.

The most noticeable impact is going to be on the value of good pitchers in

extremely pitching-friendly environments. A pitcher-friendly environment

brings down the exponent; a good pitcher, by his own efforts, decreases the

run environment and the Pythagorean exponent even further.

Perhaps the most extreme case is Bob Gibson's 1968 season. That was the

year he had a 1.12 ERA and went "only" 22-9. Leaguewide scoring

reached its modern nadir that year. To summarize it quickly:

Gibson allowed 49 runs in 304 2/3 innings in a park I rated as a strong

pitcher's park. The average National League pitcher that year would have

allowed 116.3 runs in 304 2/3 innings; in Gibson's park, which scored as a

.934, the average would be 108.6. The Cardinals, however, were an

above-average offensive team; their team scoring rate suggests 121.4 runs

per 304 2/3 innings.

Using that as an estimate for the offensive side of the ledger, and looking

at his 31 decisions, we find that

- With a Pythagorean exponent of 2, you'd expect 26.7 wins. An uncharitable reviewer might speculate that Gibson "choked" away nearly five wins.
- With a 1.85 exponent, you'd get 26.1 wins. Slightly better, but still four wins off.
- Palmer's method would say that Gibson, playing in a 5.03 RPG environment (that is, 121.4, plus 49, divided by 304 2/3 innings, times 9), would have a Run-per-Win factor of 7.48. Since I'm estimating him at 72.4 runs better than his opposition, that works out to 25.2 estimated wins. Better still.
- With a floating exponent, you'd find that the 5.03 RPG environment suggests that the proper exponent for Gibson was just 1.50. That enables you to cut another half-win off the estimate, to 24.7.

In other words, 40% of the error in the original estimate is due to the model's failure to account for the run environment.
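
The four Gibson estimates above can be reproduced with the short script below. It is a sketch using the inputs quoted in the text (121.4 runs of support, 49 runs allowed, 304 2/3 innings, 31 decisions); the variable and function names are mine.

```python
import math

RUNS_FOR = 121.4        # estimated Cardinals scoring over Gibson's innings
RUNS_AGAINST = 49.0     # runs Gibson allowed
INNINGS = 304 + 2 / 3
DECISIONS = 31          # 22 wins + 9 losses


def pythagorean_wins(exponent):
    """Expected wins in 31 decisions for a given exponent."""
    ratio = (RUNS_FOR / RUNS_AGAINST) ** exponent
    return DECISIONS * ratio / (1.0 + ratio)


rpg = (RUNS_FOR + RUNS_AGAINST) / INNINGS * 9           # run environment
runs_per_win = 10.0 * math.sqrt(rpg / 9.0)              # Palmer's RPW
floating_exp = 1.50 * math.log10(rpg) + 0.45            # floating exponent

estimates = {
    "exponent 2.00": pythagorean_wins(2.0),
    "exponent 1.85": pythagorean_wins(1.85),
    "Palmer": (RUNS_FOR - RUNS_AGAINST) / runs_per_win + DECISIONS / 2.0,
    "floating": pythagorean_wins(floating_exp),
}

print(f"RPG {rpg:.2f}, RPW {runs_per_win:.2f}, exponent {floating_exp:.2f}")
for name, wins in estimates.items():
    print(f"{name}: {wins:.1f}")
```

Each method's output should match the figure in the corresponding bullet, to one decimal place.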

That's plenty to digest for now. I'll re-examine more of the consequences

later this summer.