June 30, 1999
Revisiting the Pythagorean Theorem
Putting Bill James' Pythagorean Theorem to the test
In this study, I am going to put one of James' premier tools--a tool that I have used uncritically for many years--to the test. The results have forced me to revise some of my own methods. Let's all take a good, hard look at James' "Pythagorean theorem".
The Pythagorean theorem, as James called it, was a formula designed to relate how many runs a team scored and allowed to its won-lost record. The most common way to express it is
Winning Pct = WPct = RS^2 / (RS^2 + RA^2)
where RS = runs scored, RA = runs allowed, and ^ means "raised to the power of", in this case, 2. It was the "raised to the power of 2" parts that reminded James of geometry's Pythagorean theorem (a^2 = b^2 + c^2), hence the bestowment of an unwieldy name.
Aside from the confusion caused by the name, the formula works reasonably well. Take the 1998 Yankees: they scored 965 runs and allowed 656, for an estimated winning percentage of (965^2)/(965^2 + 656^2) = 931,225 / 1,361,561 = .684. For a 162-game season that means 110.8 wins. They actually won 114, of course, so the formula was off by a little more than three wins. That's a typical error for the routine.
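The Yankees arithmetic above can be sketched in a few lines of Python. The function name and the `exponent` parameter are my own labels for illustration, not anything from James.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Estimated winning percentage from runs scored and allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# 1998 Yankees: 965 runs scored, 656 allowed, 162 games.
wpct = pythagorean_wpct(965, 656)
expected_wins = wpct * 162
print(f"{wpct:.3f} {expected_wins:.1f}")  # 0.684 110.8
```

Passing a different `exponent` (1.82, say) gives the variants discussed next.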
Just to confuse things, though, James wrote in later works that in order to make the formula work best, you shouldn't raise runs to the power of 2. It should be a little less than that, he said, and named 1.82 as the ideal exponent.
Now, I'm not sure where he got that value, but in the 1990s, anybody can get a database that shows runs scored and allowed for every team in history, courtesy of Sean Lahman's database, and use Excel (or your favorite non-Microsoft spreadsheet) to get a regression line. I put together a data set of all teams that had played at least 120 games in their season. Here's how the root mean square error of the wins varies with selected values of the exponent:
Exponent   Error
  2.00     4.126
  1.90     4.044
  1.89     4.041
  1.88     4.040
  1.87     4.039
  1.86     4.040
  1.85     4.041
  1.84     4.044
  1.83     4.048
  1.82     4.052
If your dataset differs--you limit it to 20th-century teams, you set a threshold of 60 games, or of zero games--you'll get slightly different answers. It is pretty obvious, though, that there isn't much difference among any of the 1.8x values, and that they are all somewhat, though not overwhelmingly, better than using 2.00.
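For anyone who would rather script the exponent sweep than build it in a spreadsheet, here is a rough Python sketch of the root-mean-square-error calculation. The function names and the tuple layout are assumptions of mine. The sample list holds only the 1998 Yankees, the one team quoted in this article, so it will not reproduce the table above; you would need to load the full set of team seasons yourself.

```python
import math

def pythagorean_wins(rs, ra, games, exponent):
    """Expected wins over `games` for a given Pythagorean exponent."""
    return games * rs ** exponent / (rs ** exponent + ra ** exponent)

def rmse_wins(teams, exponent):
    """Root mean square error in wins.

    teams: iterable of (runs_scored, runs_allowed, wins, losses).
    """
    squared = [
        (w - pythagorean_wins(rs, ra, w + l, exponent)) ** 2
        for rs, ra, w, l in teams
    ]
    return math.sqrt(sum(squared) / len(squared))

# Placeholder data: one team only. Replace with every team-season
# of at least 120 games to repeat the study.
teams = [(965, 656, 114, 48)]
for exponent in (2.00, 1.87, 1.82):
    print(f"{exponent:.2f}  {rmse_wins(teams, exponent):.3f}")
```

With a single team the "error" is just that team's miss, so the ranking of exponents here means nothing; the point is only the mechanics.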
All of these, though, reduce the exponent to one number. Pete Palmer took a different approach than James. Palmer's system works off of the difference between runs scored and allowed, rather than the ratio. In his system,
Runs Per Win = RPW = 10 * sqrt (RPG/9)
where RPG is the average number of runs per game, by both teams, in the games that team played. Then expected wins becomes
ExpW = (R - RA)/RPW + (W+L)/2
If you apply Palmer's method to the Pythagorean dataset above, you get an average error of 4.038 wins - slightly better than the best Pythagorean value.
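Palmer's two formulas translate directly into code. This is a small sketch with names of my own choosing, again using the 1998 Yankees figures quoted earlier (965 runs, 656 allowed, 114-48).

```python
import math

def palmer_expected_wins(runs, runs_allowed, wins, losses):
    """Expected wins from run differential, scaled by runs per win."""
    games = wins + losses
    rpg = (runs + runs_allowed) / games   # runs per game, both teams
    rpw = 10 * math.sqrt(rpg / 9)         # runs per win
    return (runs - runs_allowed) / rpw + games / 2

exp_w = palmer_expected_wins(965, 656, 114, 48)
print(f"{exp_w:.1f}")  # about 110.3
```

Note that RPG enters the calculation explicitly, which is the piece the fixed-exponent Pythagorean formula has no place for.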
Testing the Theorem--Method One
The reason Palmer's method tests better is that he makes use of data that the Pythagorean method ignores: the run environment. James' method assumes that scoring twice as many runs as your opponents always results in the same won/lost record, regardless of whether you win by an average of 4-2 or 20-10. This turns out to be a poor assumption.
This can be confirmed by a close examination of the real data. Using the same Pythagorean dataset as before, I can calculate a "needed exponent" for each team, the value that makes the Pythagorean formula work perfectly for that team. The "needed exponent" can be expressed as
NeedExp = Log (W/L) / Log (R/RA)
It won't work when runs scored = runs allowed, but that's not a problem. Set it to infinity, or some appropriately large number, and continue.
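As a sketch of that calculation, assuming nothing beyond the formula above (the function name is mine), including the fallback to infinity when runs scored equal runs allowed:

```python
import math

def needed_exponent(wins, losses, runs, runs_allowed):
    """Exponent that makes the Pythagorean formula exact for one team."""
    if runs == runs_allowed:
        # Log (R/RA) is zero here; fall back to infinity as suggested.
        return math.inf
    return math.log(wins / losses) / math.log(runs / runs_allowed)

# 1998 Yankees: 114-48, 965 runs scored, 656 allowed.
yankees = needed_exponent(114, 48, 965, 656)
print(f"{yankees:.2f}")  # about 2.24
```

The base of the logarithm does not matter here, since it cancels in the ratio.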
If I stratify the data according to RPG (again, that is R + RA per game), and look at the median values for RPG and the needed exponent, the following pattern occurs:
                  Median   Median
Layer    NumTms    RPG     NeedExp
>12         88    13.02     2.056
10-12      294    10.58     2.018
9-10       460     9.39     1.957
8-9        697     8.51     1.857
7-8        384     7.61     1.784
<7         109     6.77     1.625
The needed exponent of the medians is strongly related to the run-scoring environment; the data above suggests a logarithmic relationship, with a regression line of NeedExp = .44 + 1.51 log RPG (that's a base-ten log, not natural, for the morbidly curious).
If you go back and use the formula above to calculate a "Pythagorean" exponent for each individual team, the average error is reduced to 3.991, half again as much improvement as there was by changing from a 2.00 exponent to the optimal 1.87.
Testing the Theorem--Method Two
A) Modeling runs scored distributions
Just to make sure this wasn't a data fluke, I tried to replicate the results through a modeled environment. This gives me the advantage of being able to control the run distribution and make far more tests than I can with the genuine data.
To build this model, I first needed to know the likelihood of a team scoring X runs in a single game, given that they averaged Y runs per game. How often does a team that averages 4.0 runs per game score exactly four runs? Exactly 11 runs? Get shut out? This turned out to be a difficult step: no one I knew had a good, simple formula to do it.
I have seen speculation suggesting that the Poisson distribution should be a good model for run scoring per game. The Poisson distribution is often used to model the distribution of a number of events (in our case, runs) over a specified time period (a game) or spatial domain, given the typical value for that period.
Traffic engineers use the Poisson distribution to model the number of cars that will pass a spot on a highway at its busiest hour, knowing the hourly average. Wildlife officials use it to model the number of fish that are in a lake from knowing how many get caught. If X is the number of events, and Y is the mean number of events, then the Poisson distribution is defined as:
P(X,Y) = (Y^X) * exp (-Y) / X!
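In Python the formula is a one-liner. This sketch uses only the standard library, and the function name is my own.

```python
import math

def poisson(x, mean):
    """P(X, Y): probability of exactly x events when the mean is `mean`."""
    return mean ** x * math.exp(-mean) / math.factorial(x)

# A team averaging 4.0 runs per game scoring exactly four runs:
p_four = poisson(4, 4.0)
print(f"{p_four:.4f}")  # about 0.1954
```

Multiplying by 162 turns the probability into an expected number of games per season.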
Unfortunately, the data says it's a lousy fit for runs per game:
The standard Poisson distribution is too narrow around the mean; real teams score zero--which the Poisson distribution does not handle well--and 10 runs considerably more often than M. Poisson's formula suggests, and score within a run of their average correspondingly less often.
Nonetheless, the Poisson distribution is the way to go, if you add a little more complexity.
Think about the way a real team enters play. It becomes clear that a team that averages four runs per game over a season doesn't take that 4.00 average out there each and every day. They play in better or worse pitcher's parks, they play in different weather conditions and they don't always face an average pitcher. I can emulate these effects by using a series of Poisson functions, rather than just one, to try and replicate the effects of having different averages on different days.
On the other hand, I don't want to make it too complicated. I wound up using three Poisson equations to model the team's run distribution. Each counts 1/3 towards the final total, and all are evaluated for the same value of X.
The Y value, however, changes. I found the best matches to the median distribution by using the following values for Y: the actual RPG, and RPG plus or minus (2 * (RPG/4)^.75). For a team that scores 4.00 RPG, that is simply 2, 4 and 6 for the Y values; the function allows the difference about the mean to grow, slowly, as runs increase:
 RPG     Y1      Y2      Y3
 2.0     2.0     3.19    0.81
 3.0     3.0     4.61    1.39
 4.0     4.0     6.00    2.00
 5.0     5.0     7.36    2.64
 6.0     6.0     8.71    3.28
 7.0     7.0    10.04    3.96
 8.0     8.0    11.36    4.64
10.0    10.0    13.98    6.02
15.0    15.0    20.39    9.61
Let me demonstrate. How often does a team that scores 4.50 RPG score exactly four runs? The simple Poisson model (evaluated at Y = 4.5, and X = 4) would say 31 times per 162 games. The serial Poisson model is the average of P(4,4.5), P(4,2.32), and P(4,6.68), which individually yield 31, 19 and 17, for an average of 22.
From 1980 through 1998, there were 40 teams that scored between 4.45 and 4.55 RPG, which I'll use as my "4.5 RPG" sample. One of these teams, the 1995 Giants, scored exactly four runs 33 times in 162 games (they actually scored four runs 29 times in a strike-shortened 144-game season; I've rounded all figures off to 162-game rates), to set the high-water mark for the group. The 1997 Pirates were at the low end of the scale, doing it just 14 times. The median value for the group was 22 (for that matter, the mean and mode for the group were also 22).
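The serial Poisson model is easy to reproduce. The sketch below follows the description above (three equally weighted Poisson terms at RPG and RPG plus or minus 2 * (RPG/4)^.75); the function names are mine, and it does not include any special handling for shutouts.

```python
import math

def poisson(x, mean):
    return mean ** x * math.exp(-mean) / math.factorial(x)

def serial_means(rpg):
    """The three Y values: RPG, RPG plus the spread, RPG minus the spread."""
    spread = 2 * (rpg / 4) ** 0.75
    return (rpg, rpg + spread, rpg - spread)

def serial_poisson(x, rpg):
    """Average of the three Poisson terms, each counting 1/3."""
    return sum(poisson(x, y) for y in serial_means(rpg)) / 3

# A 4.50 RPG team scoring exactly four runs, per 162 games.
simple_162 = 162 * poisson(4, 4.5)
serial_162 = 162 * serial_poisson(4, 4.5)
print(round(simple_162), round(serial_162))  # 31 22
```

The simple model's 31 would be near the very top of the 40-team sample, while the serial model's 22 lands on the median.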
The only place where the equation really fails is in the case of a shutout; the predictions are too low. My--admittedly clumsy--solution is to treat everything as if it were one run higher than reality. Instead of evaluating the equation for X=0 and Y=2.0, 4.0 and 6.0, evaluate it at X=1 for Y=3.0, 5.0 and 7.0. I.e., at P(1,Y1+1), P(1,Y2+1), and P(1,Y3+1). It may be a kludge, but it works much better than the basic model when X=0.
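To see the size of the shutout adjustment, here is a short sketch comparing the unadjusted and shifted calculations for a 4.00 RPG team. The `shifted` flag and function names are assumptions for illustration only.

```python
import math

def poisson(x, mean):
    return mean ** x * math.exp(-mean) / math.factorial(x)

def shutout_prob(rpg, shifted=True):
    """Chance of scoring zero runs under the three-term model."""
    spread = 2 * (rpg / 4) ** 0.75
    means = (rpg, rpg + spread, rpg - spread)
    if shifted:
        # The kludge: evaluate at X=1 with every Y raised by one run.
        return sum(poisson(1, y + 1) for y in means) / 3
    return sum(poisson(0, y) for y in means) / 3

basic = 162 * shutout_prob(4.0, shifted=False)
shifted = 162 * shutout_prob(4.0)
print(f"{basic:.1f} {shifted:.1f}")  # about 8.4 and 10.2 shutouts per 162
```

One side effect worth knowing about: once the X=0 term is replaced this way, the probabilities over all scores no longer sum to exactly one, so anything that samples from the distribution should renormalize.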
To demonstrate, here's how the model compares to the genuine distributions for teams averaging 3.5, 4.0, 4.5 and 5.0 runs per game. In each case, I have normalized the frequencies to 162-game seasons, and included teams within 0.05 runs of the mean. The agreement with the median is excellent in all cases.
The data for team distribution of runs can be found here.
B) Using the model
I developed this spreadsheet to use the model described above. This spreadsheet allows me to fix the number of runs both teams together will score (the value in cell A1), and the ratio of one team's runs to another (the value in cell A5). It uses a random number generator to generate scores of 1620 games at a time, counts how many times team A outscored team B, and comes up with the exponent needed to satisfy the Pythagorean theorem for the 1620-game sample.
What I did was to set the run ratio for the two teams at a given value, say 2.00 (team A averages twice as many runs as team B), and then step through the total runs values, from a combined average of 2.0 runs per game to a combined average of 20.0 runs per game (the latter case would represent an average score of 13.3-6.7). The macro in cell T9 will re-set the random number generator a dozen times, and will save the RPG value and the needed exponent for regression analysis later.
I used fairly high ratios of runs scored to runs allowed in order to make sure that one team both scored and won more than the other team. To do otherwise would result in outlier values that would distort the analysis. I don't think this is a major concern; there were no major changes in the distribution between using a ratio of 1.1 and 2.0.
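The spreadsheet itself isn't reproduced here, but the procedure can be approximated in Python. Treat this strictly as a sketch under my own assumptions: scores are drawn from the three-term Poisson model without the shutout adjustment, tied games are simply dropped from the win/loss count, scores are capped at 40 runs, and all names are hypothetical.

```python
import math
import random

def poisson(x, mean):
    return mean ** x * math.exp(-mean) / math.factorial(x)

def run_weights(rpg, max_runs=40):
    """Relative likelihood of scoring 0..max_runs runs in a game."""
    spread = 2 * (rpg / 4) ** 0.75
    # Clamp the low mean at zero; it only goes negative below 0.25 RPG.
    means = (rpg, rpg + spread, max(rpg - spread, 0.0))
    return [sum(poisson(x, m) for m in means) / 3 for x in range(max_runs + 1)]

def simulate_needed_exponent(total_rpg, ratio, games=1620, rng=None):
    """Simulate `games` games and return the exponent the sample needs."""
    rng = rng or random.Random()
    rpg_b = total_rpg / (1 + ratio)   # team B's average
    rpg_a = total_rpg - rpg_b         # team A's average (ratio times B's)
    weights_a, weights_b = run_weights(rpg_a), run_weights(rpg_b)
    a_runs = rng.choices(range(len(weights_a)), weights=weights_a, k=games)
    b_runs = rng.choices(range(len(weights_b)), weights=weights_b, k=games)
    wins = sum(a > b for a, b in zip(a_runs, b_runs))
    losses = sum(a < b for a, b in zip(a_runs, b_runs))  # ties ignored
    return math.log(wins / losses) / math.log(ratio)

# Step through combined scoring levels at a fixed 2.00 run ratio.
rng = random.Random(1999)
for total in (6.0, 9.0, 13.0):
    print(total, round(simulate_needed_exponent(total, 2.0, rng=rng), 2))
```

Each call is one 1620-game sample, so the output is noisy; repeating each scoring level a dozen times, as the macro does, smooths it out before any regression.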
What I found was this:
There is, again, a definite relationship between the frequency of runs and the "best" Pythagorean exponent. Once again, the shape of the line is logarithmic, with a line function equal to roughly 1.4*log (RPG) + .65. That is similar to the results from using the medians, but a little on the high side. If I limit the regression to those ranges within the historical range of scoring, say from 6 to 13 RPG, the regression drops to about 1.55 log(RPG) + .45, very similar to the median test data.
This latter function, when used to predict the Pythagorean exponent, yields an average error of 4.005. That is notably better than any fixed exponent solution, and better than Palmer's method, but still short of what I found using the median value. The model's results, though, are close enough to corroborate the findings, and assure us that this was not a fluke of the data: there really is a dependence between the necessary Pythagorean exponent and the RPG environment, and that dependence is a logarithmic one.
I conclude from this that the exponent should be set at approximately
Exponent = 1.50 * log (RPG, both teams) + 0.45
In the data set I used for testing the Pythagorean values, this results in a 3.9911 root-mean-square error for wins. You can fiddle with the formula to get very slightly better values (1.40 log RPG + .55 gets you 3.9905, as good as I could find), but I prefer to stick with the higher multiplier indicated by the median and model studies.
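Put together, the variable-exponent version looks like this in Python. It is a sketch with function names of my own, using a base-ten log and the 1.50 and 0.45 coefficients given above, applied to the 1998 Yankees.

```python
import math

def variable_exponent(rpg):
    """Pythagorean exponent from combined runs per game (base-ten log)."""
    return 1.50 * math.log10(rpg) + 0.45

def expected_wpct(runs, runs_allowed, games):
    """Pythagorean winning percentage with the RPG-dependent exponent."""
    x = variable_exponent((runs + runs_allowed) / games)
    return runs ** x / (runs ** x + runs_allowed ** x)

# 1998 Yankees: 965 scored, 656 allowed, 162 games.
x = variable_exponent((965 + 656) / 162)
wins = 162 * expected_wpct(965, 656, 162)
print(f"{x:.2f} {wins:.1f}")  # about 1.95 and 110.1
```

At roughly 10 runs per game the exponent comes out near 1.95, close to the classic 2.00, which is why this particular team barely moves.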
In practical terms, the implications are fairly small. For most off-the-cuff calculations of runs and runs allowed into wins, the 1.5% gain in accuracy isn't worth the trouble of finding a new exponent for every team; just use 1.85 or thereabouts, and get on with your life.
It really makes a difference, though, to the small group of people who try to assess the value of a player's performance as precisely as possible.
The most noticeable impact is going to be on the value of good pitchers in extremely pitching-friendly environments. A pitcher-friendly environment brings down the exponent; a good pitcher, by his own efforts, decreases the run environment and the Pythagorean exponent even further.
Perhaps the most extreme case is Bob Gibson's 1968 season. That was the year he had a 1.12 ERA and went "only" 22-9. Leaguewide scoring reached its modern nadir that year. To summarize it quickly:
Gibson allowed 49 runs in 304 2/3 innings in a park I rated as a strong pitcher's park. The average National League pitcher that year would have allowed 116.3 runs in 304 2/3 innings; in Gibson's park, which scored as a .934, the average would be 108.6. The Cardinals, however, were an above-average offensive team; their team scoring rate suggests 121.4 runs per 304 2/3 innings.
Using that as an estimate for the offensive side of the ledger, and looking at his 31 decisions, we find that
That's plenty to digest for now. I'll re-examine more of the consequences later this summer.