When I rewrote the Playoff Probabilities routine, it is fair to say that I didn’t know what I was getting myself into. I thought I was running a fun little toy we had produced before, one which hardly anyone would notice or take very seriously.

It turns out that I was delving into a feature that was widely read, very popular and taken very seriously. My article on the system’s methodology generated more response than anything else I’ve written in ages.

I received many suggested improvements for the routine. Some have been incorporated into the current system; some should be in there by next season (for everyone who asked that it start earlier in the season, I can’t see any reason NOT to run it from Opening Day next year); and some I considered, looked into and rejected. Some of the changes I’ve made were trivial to the calculation: a better algorithm for figuring out who won and tied for division championships and wild cards, for instance, is good to have, but doesn’t affect the estimates themselves. Among all the suggestions, though, no one noticed what turned out to be the biggest flaw in my original setup.

The most important inputs to this Monte Carlo simulation are the winning percentages assigned to each team; everything follows from those. How I set those percentages has evolved in four distinct stages.

Originally, I was using the third-order winning percentage from the adjusted standings report as my estimator. The “W3” is essentially the Pythagorean won-lost percentage, modified by using the estimated runs scored and allowed instead of actual, and further modified by accounting for the average ability of the team’s opponents. Several people challenged me on the use of that as an estimator, arguing that the actual record is better; implicit in this is the idea that teams have a reason for exceeding their Pythagorean record, that it is not simply luck.

Unfortunately, I can’t easily make a full study of W3 as an estimator; the data is not readily accessible to me. I do have the data available for the regular Pythagorean record (“W1”), and I think, but don’t know for certain, that this serves as a reasonable proxy, so consider yourself warned. With that caveat, I tested whether actual record or Pythagorean record is a better predictor of future record.

I used the Retrosheet game logs to get the records for every team that has played 150 games or more in a season, not quite 1900 teams going into 2004. At intervals of 10 games, I pulled out the team’s record to that point in the season, along with how many runs they had scored and allowed. That allowed me to set up some simple regression tests between current actual record and current Pythagorean record as the predictors, and rest of season record (not final record!) as the predictand.
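The checkpoint extraction described above might look something like the following. This is a minimal sketch, not the code actually used: the game log is assumed to be a plain list of `(runs_scored, runs_allowed)` tuples (Retrosheet's real game log format is not reproduced here), the function names are hypothetical, and the classic Pythagorean exponent of 2 is an assumption.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean winning percentage; the exponent of 2 is an assumption."""
    if runs_scored == 0 and runs_allowed == 0:
        return 0.5  # no information yet, so call it a coin flip
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def win_pct(games):
    """Share of decided games won; tied games are ignored in this sketch."""
    wins = sum(1 for rs, ra in games if rs > ra)
    losses = sum(1 for rs, ra in games if rs < ra)
    if wins + losses == 0:
        return 0.5
    return wins / (wins + losses)


def snapshot(games, n):
    """Return (actual pct to date, Pythagorean pct to date, rest-of-season pct)
    after the first n games of a season's game list."""
    to_date, rest = games[:n], games[n:]
    runs_scored = sum(rs for rs, _ in to_date)
    runs_allowed = sum(ra for _, ra in to_date)
    return (win_pct(to_date),
            pythagorean_pct(runs_scored, runs_allowed),
            win_pct(rest))
```

Calling `snapshot(games, n)` for n = 10, 20, ..., 140 on each team-season would produce the rows that feed the comparisons. Note that the third value is the rest-of-season record, not the final record.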

Here are the results.

  G   Pyth-Act   Actual R2   Actual A   Actual B   Pyth R2   Pyth A   Pyth B
 10   1076-797     .121        .176       .412      .155      .231     .385
 20   1060-823     .195        .293       .353      .227      .355     .322
 30   1027-862     .247        .385       .307      .278      .452     .273
 40   1047-845     .298        .470       .265      .333      .540     .230
 50   1028-863     .338        .550       .225      .369      .613     .193
 60   1028-866     .373        .607       .196      .402      .673     .163
 70   1005-888     .380        .637       .181      .405      .703     .148
 80   1007-886     .374        .673       .164      .403      .743     .128
 90   1020-872     .388        .710       .144      .410      .772     .113
100    980-913     .394        .759       .121      .408      .813     .094
110    988-906     .383        .800       .101      .397      .858     .071
120    971-924     .357        .822       .089      .366      .877     .061
130    965-930     .310        .853       .074      .320      .912     .044
140    931-963     .233        .844       .078      .237      .897     .052

The second column shows how many times the Pythagorean record did better than the actual record at predicting the future record, and how many times it did worse. After 10 games played, the Pythagorean was the better estimator in 1076 cases, and the actual record was better 797 times; a clear win for Pythagoras (who would have loved baseball, by the way). The Pythagorean record turns out to almost always be a better predictor than the actual record, but its advantage steadily declines with every game played, until actual record becomes a better predictor after 140 games. (The totals differ from row to row because cases where the actual and Pythagorean records were identical are left out; these are almost always a .500 team with R=RA.)
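The head-to-head count in that column can be sketched as below. This assumes each case is a tuple of `(actual_pct, pyth_pct, rest_of_season_pct)` and that "better" means a smaller absolute error. The function name is hypothetical, and how equidistant cases were treated in the original tally is not stated, so they are simply counted as ties here.

```python
def tally_better_predictor(cases):
    """Count how often each estimator lands closer to the rest-of-season record.

    cases: iterable of (actual_pct, pyth_pct, rest_pct) tuples.
    Returns (pyth_better, actual_better, ties).
    """
    pyth_better = actual_better = ties = 0
    for actual, pyth, rest in cases:
        err_actual = abs(actual - rest)
        err_pyth = abs(pyth - rest)
        if err_pyth < err_actual:
            pyth_better += 1
        elif err_actual < err_pyth:
            actual_better += 1
        else:
            ties += 1  # includes the identical-record cases noted above
    return pyth_better, actual_better, ties
```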

Before conceding the point, take a look at the regression equations for the actual and Pythagorean records. The Pythagorean record always has a better r-squared value (that’s the R2 column) than the actual record, even at the 140-game mark. As with the straight binary test, the advantage declines with increasing games played. Even so, I think it would be reasonable to conclude that anyone interested in handicapping the playoffs should be using actual record rather than Pythagorean record, and perhaps actual record should also be the choice for the last two weeks or so of the season.
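For readers who want to reproduce this kind of fit, here is a minimal ordinary least-squares sketch in plain Python. It returns the slope as A and the intercept as B, matching the table's column names, along with r-squared. It assumes a single predictor and at least two distinct x and y values; the function name is hypothetical, and this is not necessarily the tool used for the table.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = A*x + B; returns (A, B, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                    # slope
    b = mean_y - a * mean_x          # intercept
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared
```

Here `xs` would be the to-date percentages (actual or Pythagorean) at one checkpoint and `ys` the matching rest-of-season percentages.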

There is another item to take home from the regression listings, and that is in the A and B components (in a regression, y=Ax+B). If record to date were the likely record for the remainder of the season, as I assumed in my initial report, then A would be equal to 1.0 and B would be equal to zero. I totally neglected regression to the mean, and nobody noticed (or at least, no one told me they’d noticed); the most likely rest-of-season record for a team playing .600 ball after 100 games is not .600, but something like .576. As you can see, as games go up, the A component gets closer to 1 and the B component closer to 0; but the A component for the Pythagorean is always higher than that of the actual. The Pythagorean record is a more conservative estimator than actual record; some of the regression to the mean needed for the actual record is built into the Pythagorean.
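As a worked example of the correction, the sketch below applies y = Ax + B using the A and B values from the regression table above. The lookup rule (use the most recent 10-game checkpoint, clamped to the 10 to 140 range) and the names are assumptions made for illustration.

```python
# (A, B) pairs copied from the regression table, keyed by games played.
ACTUAL = {
    10: (.176, .412), 20: (.293, .353), 30: (.385, .307), 40: (.470, .265),
    50: (.550, .225), 60: (.607, .196), 70: (.637, .181), 80: (.673, .164),
    90: (.710, .144), 100: (.759, .121), 110: (.800, .101), 120: (.822, .089),
    130: (.853, .074), 140: (.844, .078),
}
PYTH = {
    10: (.231, .385), 20: (.355, .322), 30: (.452, .273), 40: (.540, .230),
    50: (.613, .193), 60: (.673, .163), 70: (.703, .148), 80: (.743, .128),
    90: (.772, .113), 100: (.813, .094), 110: (.858, .071), 120: (.877, .061),
    130: (.912, .044), 140: (.897, .052),
}


def regress_to_mean(pct, games_played, coeffs):
    """Regressed rest-of-season estimate, y = A*pct + B.

    Uses the latest tabulated checkpoint at or before games_played
    (an assumed lookup rule), clamped to the table's 10..140 range.
    """
    checkpoint = min(140, max(10, (games_played // 10) * 10))
    a, b = coeffs[checkpoint]
    return a * pct + b


# A .600 actual record after 100 games: .759 * .600 + .121 = .5764
print(round(regress_to_mean(0.600, 100, ACTUAL), 3))
```

The same .600 run through the Pythagorean coefficients at 100 games gives .813 * .600 + .094 = .5818, a smaller pull toward .500, which is the sense in which some of the regression is already built into the Pythagorean record.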

I don’t think the difference between the two makes it worthwhile to switch between actual and Pythagorean records during the season; Pythagorean record is clearly superior for most of the season, and is not clearly inferior at any point, so I am going to retain the Pythagorean values as the primary estimator. However, the big change here from the original model is that I’ll use the regression equations to get a regressed-to-mean W3, not W3 itself, as the primary estimate.

A second major change was the realization, which dawned slowly and only after several people tried repeatedly to convince me of it, that even these regressed values were estimates, not a hard fact about the team’s future performance. Of course I knew that, but my initial take was that the Monte Carlo simulation itself would supply sufficient variation around the estimate. After more correspondence with people who actually use Monte Carlo simulations as a regular part of their professional career, and more reading on my part, I no longer believe that. There is a real need to recognize, up front, that while I am calling Boston a .550 team, they may in fact be a .650 team, or a .450 team; they may even be a .999 team that has gotten incredibly unlucky, although the odds against that are staggering. The simulation will work better if instead of using the same estimate for Boston’s winning percentage in every run, I let the estimate vary.

How much it should vary is answered, in part, by returning to those regression equations. The standard errors for the estimates were never zero; they varied between .075 and .130, depending on how many games had been played. As a very crude (but simple and easily programmed) solution, I added a random number between -.100 and +.100 (.100 being roughly the standard error) to the team’s winning percentage on every iteration, and it made a big difference. Everything was pushed a little farther away from the endpoints, zero and one, and a little closer to .5; the certainties were not nearly so certain anymore. But this was, as I said, a crude solution, and after a little more mathematical effort I’ve replaced it with a system that replicates a Gaussian (normal) distribution around the primary estimate, with a standard deviation of .100; getting the SD to vary with games played is next on the to-do list for this routine, but it isn’t there quite yet.
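The two ways of perturbing the estimate can be compared directly. The sketch below uses Python's standard `random` module with a .550 base estimate; the function names, the seed and the sample size are arbitrary choices for illustration, not the routine's actual code.

```python
import random
import statistics


def crude_sample(base, rng, half_width=0.100):
    """Flat perturbation: base plus a uniform draw in [-.100, +.100]."""
    return base + rng.uniform(-half_width, half_width)


def gauss_sample(base, rng, sd=0.100):
    """Normal perturbation: a Gaussian draw centered on base with SD .100."""
    return rng.gauss(base, sd)


rng = random.Random(2004)  # fixed seed so the comparison is repeatable
base = 0.550
crude = [crude_sample(base, rng) for _ in range(100_000)]
gauss = [gauss_sample(base, rng) for _ in range(100_000)]

# Spread of each set of perturbed estimates.
print(statistics.pstdev(crude), statistics.pstdev(gauss))

# Share of Gaussian draws that land within .100 of the base estimate.
within = sum(1 for p in gauss if abs(p - base) <= 0.100) / len(gauss)
print(within)
```

A flat distribution over a range of +/- .100 has a standard deviation of .100 divided by the square root of 3, or roughly .058, so the Gaussian version with an SD of .100 spreads the inputs considerably more than the crude version did.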

So let me summarize: suppose Boston’s W3 after 120 games was .560. In the first version of the playoff odds, I would have used .560 as Boston’s estimated winning percentage for the rest of the season. In the current version, I take the .560 and correct for regression to the mean, and get .549 as their new base estimate. Before every one of the million iterations of the season, I sample a normal distribution around .549 to get their estimated record for that iteration, capping them between .250 and .750. I have, essentially, done a Monte Carlo simulation to replicate the whole range of outcomes from the regression equations to get inputs for the Monte Carlo simulation of the rest of the season.
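For a single team, that summary can be sketched as follows. This is a heavily simplified illustration under stated assumptions: a 162-game schedule (so 42 games remain after 120), every remaining game treated as an independent coin flip at the sampled winning percentage with no adjustment for opponents, and 20,000 iterations instead of a million. The real routine simulates the whole league; the names here are hypothetical.

```python
import random
import statistics


def sample_strength(base, rng, sd=0.100, low=0.250, high=0.750):
    """Draw this iteration's winning percentage from a normal distribution
    around the regressed estimate, capped between .250 and .750."""
    return min(high, max(low, rng.gauss(base, sd)))


def simulate_wins(base, games_left, iterations, rng, vary=True):
    """Simulate rest-of-season win totals for one team.

    vary=False reuses the same base estimate in every iteration;
    vary=True resamples the team's strength before each simulated season.
    """
    totals = []
    for _season in range(iterations):
        pct = sample_strength(base, rng) if vary else base
        wins = sum(1 for _game in range(games_left) if rng.random() < pct)
        totals.append(wins)
    return totals


rng = random.Random(120)  # arbitrary fixed seed
fixed = simulate_wins(0.549, 42, 20_000, rng, vary=False)
varied = simulate_wins(0.549, 42, 20_000, rng, vary=True)

# Resampling the strength widens the spread of simulated win totals.
print(statistics.pstdev(fixed), statistics.pstdev(varied))
```

The wider spread in the second set of totals is what pulls extreme playoff probabilities back toward the middle once the same idea is applied to every team in the league.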

My crude estimator was on the right track, but still underestimated the impact of spreading the initial values, since I’ve replaced a system that has an entire range of +/- .100 with one that has a standard deviation of .100, meaning that only about 68%, not all, of the points will be between +/- .100. I’ll admit that that sounds high, but that is what the data tells us. Let’s go back to July 1, and look at how the different versions of the model would have assessed each team’s chances of making the playoffs:

Playoff Probabilities, July 1, 2004

Team        Original   Regress   Crude   Gauss
Yankees       98.4       97.5     89.2    78.1
Red Sox       50.1       47.4     45.4    43.6
Twins          5.2       11.0     18.9    22.7
White Sox     91.6       84.0     68.3    58.6
Indians        3.0        4.9     13.0    17.8
Tigers         5.4        8.1     16.0    20.7
A's           52.1       52.5     49.4    46.4
Angels        18.1       21.4     28.3    30.1
Rangers       75.1       71.2     62.1    56.3
Braves         9.3       11.0     16.8    19.2
Marlins       28.9       31.6     30.5    31.6
Phillies      52.5       48.9     42.5    38.8
Mets          12.3       13.1     18.5    22.2
Cardinals     91.7       87.4     72.4    61.3
Cubs          57.5       50.6     44.2    39.7
Astros        22.5       22.9     25.6    27.1
Dodgers       17.7       20.1     25.7    29.6
Giants        54.4       55.7     50.4    46.0
Padres        39.6       38.4     40.4    38.7

“Original” is the original version; “regress” adds the regression-to-the-mean correction; “crude” is using the simple +/- .100 addition to winning percentage, in addition to the regression to mean; and “gauss” uses a normal distribution instead of the simple, flat distribution.

You can make many points from this, but I’ll stick to two: the AL Central and NL West. In the original version of the model, the White Sox were near locks for the playoffs at 92%, while the Twins were extreme underdogs at roughly 20:1. Every change we made to the model cut into the certainty of those pronouncements, until in the current version the White Sox are only slightly better than an even-money favorite, while the Twins face much less daunting odds of about 7:2; instead of being 18 times less likely than the Sox to make the playoffs, they are now only 2.5 times less likely. In the NL West, the spread between the top three teams drops from 37 points in the original version to less than half that, 16 points, in the current version. As in the AL Central, the competition was much tighter than the overly certain original version would have had you believe.
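The ratios and spreads quoted in this comparison come straight from the Original and Gauss columns of the July 1 table. A small sketch of the arithmetic, with hypothetical helper names:

```python
# (Original, Gauss) playoff probabilities in percent, from the July 1 table.
JULY_1 = {
    "White Sox": (91.6, 58.6),
    "Twins": (5.2, 22.7),
    "Giants": (54.4, 46.0),
    "Padres": (39.6, 38.7),
    "Dodgers": (17.7, 29.6),
}
ORIGINAL, GAUSS = 0, 1


def likelihood_ratio(favorite, underdog, version):
    """How many times more likely the favorite is to reach the playoffs."""
    return JULY_1[favorite][version] / JULY_1[underdog][version]


def spread(teams, version):
    """Gap in percentage points between the best and worst of the teams."""
    values = [JULY_1[team][version] for team in teams]
    return max(values) - min(values)


nl_west = ["Giants", "Padres", "Dodgers"]
for version in (ORIGINAL, GAUSS):
    print(round(likelihood_ratio("White Sox", "Twins", version), 1),
          round(spread(nl_west, version), 1))
```

The White Sox go from roughly 18 times as likely as the Twins to a little over 2.5 times as likely, and the NL West spread shrinks from about 37 points to about 16.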

There are lessons here, ones that we should always remember, that apply to more than just baseball studies. The results of a model are only as good as the assumptions the model was based upon. Our assumptions are a lot less certain than we usually care to admit. It’s awfully hard to keep your own biases out of the system.

I’m glad to find a wider spread in the results, since the success of this or any game depends a lot on not knowing how it will all turn out. In the end, reality always wins; the models only serve to keep us from being too surprised by the results.