July 11, 2005

# Predictatron Pontification

## A Closer Look At The Picks

The first annual Predictatron contest generated a fair amount of interest, drawing almost 1,100 entries. Although there are plenty of standings prediction contests out there, few offer a purse as sizeable as \$500, or a gift as pleasing to the eye as an autographed picture of everyone's favorite "Bud." As the All-Star break approaches, we thought it might be interesting to take a closer look at the ballots from the inaugural competition.

Before delving deeper, some of you might find it helpful to read up on these statistical terms (thanks to Wikipedia):

Given all the data, there are many different ways to break things down. First, we can look at the overall distribution of win guesses. Putting all the win guesses together, we can see the relative frequency of every win guess, generating a sort of probability density function (PDF). Here, the win guess PDF graph is in blue, with the normal PDF overlaid in pink for reference:

Although there isn't a ton of information to be gleaned from this graph, we can see that it's actually relatively close to normal, an interesting phenomenon. We can also see several peaks in the data, showing higher frequency of guesses around 70, 76, 82, and 88 wins. Kudos to the first one to figure out why the peaks come six wins apart each time. Perhaps these peaks represent different groups of teams? Looking deeper, we can see six or seven groups of teams with similar mean guesses (indicated by the coloring). In this chart, we have the average, standard deviation of win guesses, and then percent of the time the team was picked to make the playoffs, win their division, win the Wild Card, and was chosen as a mortal lock (for more information about the mortal lock, see the Predictatron rules):

We can see that the averages of adjacent groups are usually about four wins apart, though the gaps widen to six or seven wins at the very top and near the bottom:

```
Color         Mean
Pink         94.839
Orange       87.850
Yellow       83.153
Green        79.350
Blue         75.249
Purple       69.299
Grey         65.353

```
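
As a quick check on those gaps, here is a minimal sketch that subtracts each group mean from the one above it. The means are copied from the table above; the variable names are only illustrative.

```python
# Group means copied from the color table above, highest to lowest.
group_means = {
    "Pink": 94.839,
    "Orange": 87.850,
    "Yellow": 83.153,
    "Green": 79.350,
    "Blue": 75.249,
    "Purple": 69.299,
    "Grey": 65.353,
}

names = list(group_means)

# Gap between each group and the next one down the table.
gaps = {
    f"{upper} -> {lower}": round(group_means[upper] - group_means[lower], 3)
    for upper, lower in zip(names, names[1:])
}

for pair, gap in gaps.items():
    print(f"{pair:18s} {gap:.3f}")
```

Four of the six gaps come out between 3.8 and 4.7 wins; the Pink-to-Orange and Blue-to-Purple gaps are the wider ones.
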
Now that we have a good idea of how the data shapes up in general, we'll compare all the teams using box and whisker plots of each team's guesses from all 1,095 ballots. These plots are helpful because they give a good visual representation of both the center (median, in this case) of each team's guesses, and the spread. Each plot has five parts: the top whisker, the top of the box, the middle of the box, the bottom of the box, and the bottom whisker. The top whisker represents the largest guess for that team, the top of the box represents the value of the 75th-percentile guess for that team, the center of the box represents the median guess, the bottom of the box represents the value of the 25th-percentile guess, and the bottom whisker is the value of the lowest guess for that team. Since the middle 50% of a team's guesses fall between the top of the box and the bottom of the box, the overall size of the box is a good indicator of the spread of the guesses for each team--smaller boxes mean that the guesses were closer together, which usually goes along with a relatively lower standard deviation.
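
For readers who want to compute the five parts of a box and whisker plot themselves, here is a minimal sketch using Python's standard library. The sample guesses are hypothetical, not real ballot data, and the percentile convention (`method="inclusive"`) is an assumption; the plots in this article may have used a different one.

```python
import statistics


def five_number_summary(guesses):
    """Return (min, 25th percentile, median, 75th percentile, max)."""
    ordered = sorted(guesses)
    # quantiles(n=4) returns the three cut points between quartiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Hypothetical win guesses for a single team (illustration only).
sample_guesses = [80, 85, 88, 90, 91, 92, 95]

low, q1, median, q3, high = five_number_summary(sample_guesses)
print("bottom whisker:", low)
print("bottom of box: ", q1)
print("middle of box: ", median)
print("top of box:    ", q3)
print("top whisker:   ", high)
print("box height:    ", q3 - q1)
```

The box height printed at the end is the interquartile range, which is the "size of the box" discussed above.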

To help get used to reading a box and whisker plot, we'll go through a few interesting observations from the plot here.

The first thing you should notice is that the teams are arranged from left to right in order of greatest median guess to the smallest median guess. That is, you can tell just by glancing at the box and whisker plots that the highest median guess belongs to Boston, and that the lowest belongs to Kansas City. Since the top whisker for each team represents that team's highest win guess, you can tell that the New York Yankees had the highest overall win guess at 109 wins.

• Philadelphia and Milwaukee have bottom whiskers below nearby teams. That is, the lowest guess for each team is much lower than the lowest guesses for teams with similar median guesses. It seems a few ballots were especially pessimistic about the Phillies' and Brewers' chances this year.

• Teams like Boston and Minnesota have relatively small boxes, so the middle 50% of their guesses falls in a relatively short range. This indicates that most of the ballots seemed to agree on the win predictions for these teams (these two teams have the lowest standard deviations of guesses).

• In addition to having a small box, Minnesota's plot has long whiskers--this indicates that most of the data is very close to the median, but there are a few guesses much further out, as low as 62.

• The overall range (max guess - min guess) of Minnesota's guesses was 45, second only to Tampa Bay's 48-win spread. Tampa Bay's spread was thanks mostly to an absurdly optimistic maximum guess of 97 wins, and a more realistic (but perhaps also extreme) minimum guess of 49 wins.

• Last thing about Minnesota: If you compare the teams with the two lowest standard deviations of win guesses, Boston and Minnesota, you can see just how strange it is that they'd have a range of guesses so large (Cleveland, the third lowest standard deviation of guesses, is included to give better context):
```
Team      MIN     25th      MED     75th      MAX       MEAN       RANGE       StdDev
BOS       81       95       97       98       105       96.51       24          3.14
MIN       62       88       90       91       107       89.58       45          3.22
CLE       65       83       85       87       94        84.52       29          3.43

```

• We should keep in mind while looking at all this data that these are all preseason predictions. Don't be surprised, then, to find that teams like the White Sox, Diamondbacks and Nationals are on the right side of the box and whisker plot, meaning that their collective win predictions were low, even though they've been successful so far this year.
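
Returning to the Boston, Minnesota, and Cleveland table above, the "small box, long whiskers" pattern can be read straight off the numbers. This is a minimal sketch using only the values in that table; the dictionary layout and key names are illustrative.

```python
# Rows copied from the table above: (min, 25th, median, 75th, max).
summaries = {
    "BOS": (81, 95, 97, 98, 105),
    "MIN": (62, 88, 90, 91, 107),
    "CLE": (65, 83, 85, 87, 94),
}

spread = {}
for team, (low, q1, med, q3, high) in summaries.items():
    spread[team] = {
        "range": high - low,       # whisker tip to whisker tip
        "iqr": q3 - q1,            # height of the box
        "below_box": q1 - low,     # length of the bottom whisker
        "above_box": high - q3,    # length of the top whisker
    }

for team, s in spread.items():
    print(team, s)
```

Minnesota's box is only 3 wins tall, the same as Boston's, yet its whiskers reach 26 wins below and 16 wins above the box.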

The overall correlation between each team's mean and median guesses is very high; this plot of each shows a simple linear regression that yields an r-squared of 0.998. This indicates that the data for the win guesses is relatively symmetrical, and not skewed:
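
The full regression needs all 30 teams' medians, which aren't listed in this article. As a rough per-team check of symmetry, though, we can compare the mean and median for the three teams whose summaries appear in the table above. This sketch assumes nothing beyond those published values.

```python
# (mean, median) pairs copied from the BOS / MIN / CLE table above.
center = {
    "BOS": (96.51, 97),
    "MIN": (89.58, 90),
    "CLE": (84.52, 85),
}

# A mean below the median hints at a few low outliers pulling it down.
mean_minus_median = {
    team: round(mean - median, 2) for team, (mean, median) in center.items()
}

for team, gap in mean_minus_median.items():
    print(f"{team}: mean - median = {gap:+.2f} wins")
```

All three gaps are under half a win, consistent with roughly symmetrical guesses, with a slight pull toward the low side.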

Now that we have a good idea of how the teams compare and the shape of the data overall, we can look a little deeper, going division by division. Although it seems like processes like this invariably start with the AL, and go through the divisions from East to West, here we'll go through in order of least to most interesting. First up, National League Central:

```

Team       AvgWins       StdWins      Playoff%      DIV%          WC%         ML%
SLN        93.447         4.035        92.69%      86.03%        6.67%       27.85%
CHN        87.283         3.853        41.92%      12.79%       29.13%        3.20%
HOU        79.844         4.702         3.11%       0.91%        2.19%        0.18%
CIN        75.479         4.602         0.46%       0.09%        0.37%        0.55%
MIL        74.594         4.584         0.27%       0.09%        0.18%        1.46%
PIT        69.809         4.233         0.18%       0.09%        0.09%        9.95%

```
Not that there are any uninteresting divisions, but after the Cardinals ran away with the division last year, and did their best to plug the few holes that opened up over the offseason, they were primed for a repeat. Over 86% of ballots agreed, with the remainder of the division picks going primarily to the Cubs. Many of the people who didn't pick the Cubs for the division picked them for the Wild Card instead: they took the Wild Card on 29% of ballots. The rest of the division didn't look like it would contend, and ballots agreed. The only other division that had one clear favorite amongst Predictatron ballots was the American League Central:
```
Team       AvgWins       StdWins      Playoff%        DIV%          WC%          ML%
MIN        89.576         3.216        87.95%        87.67%        0.27%        10.68%
CLE        84.515         3.430        11.23%        10.05%        1.19%         0.46%
CHA        78.873         4.358         1.92%         1.83%        0.09%         0.37%
DET        75.968         4.216         0.64%         0.46%        0.18%         0.09%
KCA        65.353         4.791         0.00%         0.00%        0.00%        14.43%

```
Nearly everyone--almost 88%--picked the Twins to take the American League Central, more than those who picked the Cardinals in the NL Central. Of course, what makes the AL Central interesting is that the team running away with the division is the White Sox, not the Twins. We might hope, then, that the White Sox would get a significant share of the other 12% of ballot predictions, but they didn't--over 10% of predictatroners picked the Indians to win the AL Central. A measly 2% of ballots anticipated the White Sox's run at the division; although the team might be likely to regress, these few hold bragging rights for now. The other two divisions in the American League amount to two-horse races. First up, the West:
```
Team       AvgWins       StdWins       Playoff%       DIV%          WC%          ML%
ANA        89.598         3.908         55.71%       52.42%        3.29%        3.84%
OAK        88.466         4.035         49.22%       45.57%        3.65%        4.02%
TEX        79.517         4.450          1.74%        1.55%        0.18%        0.46%
SEA        76.810         4.443          0.46%        0.46%        0.00%        0.46%

```
Most of the Predictatron guesses had the Angels and Athletics fighting for the division. Instead, the Angels and Rangers both got off to good starts, while a rash of injuries and terrible early-season hitting left Oakland behind the pack. Now, the Rangers are starting to slide back, the A's are recovering, and the Mariners are left wondering how they should change their pitcher development programs while the Angels run off with the division. Hopefully the second half of the season will see the AL West turn into the close fight we all hoped for, as the A's are primed for one of their signature second-half runs. That leaves only the East in the American League.
```
Team       AvgWins       StdWins       Playoff%        DIV%          WC%           ML%
BOS        96.510         3.137         98.90%        71.05%        27.85%        36.71%
NYA        94.561         4.170         91.51%        28.77%        62.74%        18.54%
BAL        79.167         3.840          0.73%         0.18%         0.55%         0.27%
TOR        73.396         4.454          0.00%         0.00%         0.00%         1.46%
TBA        68.540         4.421          0.00%         0.00%         0.00%         8.22%

```
Of course, this might have been the easiest division to pick going into the season, as the Red Sox and Yankees have been the only viable candidates the past few years. Ballots agreed: Baltimore, Toronto, and Tampa Bay were each picked to reach the playoffs on fewer than 1% of ballots. After Baltimore's early season surge, they are likely to outpace the 79 win average guess, but they probably won't win the division. Boston's recent surge lifted the Sox past the Orioles, and possibly out of reach of the rest of the division. Over 71% of ballots envisioned Boston taking the AL East.

One thing that astute readers or statisticians in the crowd might notice is the remarkably low standard deviation for the win guesses for the Red Sox--over 1 full win lower than the Yankees, and lowest overall. Similar to the Twins, the relatively low perceived turnover in the Red Sox's roster might have lent itself to remarkable agreement amongst the predictatroners.

The other interesting thing about this division is that over 90% of the wild card share is predicted between the Yankees and Red Sox. Predictatron ballots basically agree that either one could win the division, and the other one is likely to take the AL Wild Card. Now, let's move back to the National League, starting in the West:

```
Team       AvgWins       StdWins       Playoff%        DIV%          WC%          ML%
LAN        88.142         4.121         63.84%        61.00%        2.83%        6.39%
SDN        86.688         3.460         37.63%        28.58%        9.04%        1.37%
SFN        82.357         4.628         11.78%         9.77%        2.01%        1.10%
ARI        71.349         5.083          0.55%         0.55%        0.00%        4.11%
COL        68.817         4.629          0.09%         0.09%        0.00%       12.88%

```
This is one of the more interesting predicted divisions, since there are three teams that received notable support to take the division. Most people picked the Dodgers, probably in part based upon a couple of articles from early in the season about how many people were underestimating the craftsmanship of Paul DePodesta. Of course, Barry Bonds' rehab made the Giants' season a huge variable, and ballots undoubtedly weighed the situation differently--some might even have been filled out before the news of his prolonged absence came out.

There isn't much claim to the NL Wild Card here, with only the Padres getting more than a couple percent of the wild card entries in ballots. Another interesting point of note is that this division features three of the seven highest standard deviations in win guesses--that is, there was relatively little agreement about how the Giants, Rockies, and Diamondbacks would finish. Most people thought the Diamondbacks and Rockies didn't have much of a chance--fewer than 1% of the playoff berths went to these teams--but there was disagreement as to just how badly they would end up. Last, but not least, the NL East:

```
Team       AvgWins       StdWins       Playoff%        DIV%          WC%          ML%
PHI        87.237         4.867         52.05%        42.19%        9.86%        5.75%
ATL        87.579         4.191         50.59%        35.53%       15.07%        5.39%
FLO        86.081         4.149         34.70%        17.44%       17.26%        2.19%
NYN        82.588         4.285         10.14%         4.84%        5.30%        0.82%
WAS        67.979         5.442          0.00%         0.00%        0.00%       16.80%

```
Drafting up a predictions competition, this is what you hope for--three teams with sizable chunks of the division picks, two teams with sizable portions of the wild card entries, and two teams that were picked to make the playoffs more often than not. Although the difference is negligible, this is the only division where the team with the highest average win guesses, the Braves, did not also have the highest percentage of ballots where they were picked to win the division.

Of course, the NL East has lived up to expectations, being one of the most competitive divisions thus far. Nobody in the contest picked the Nationals to go to the playoffs, and even though they're unlikely to take the division, if your friend says he saw this coming, you can call his bluff.
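
The claim that the NL East is the only division where the highest average win guess and the most division picks belong to different teams can be checked directly against the six division tables. This is a minimal sketch using the AvgWins and DIV% columns as published; the data layout is just one convenient choice.

```python
# (team, AvgWins, DIV%) copied from the six division tables above.
divisions = {
    "NL Central": [("SLN", 93.447, 86.03), ("CHN", 87.283, 12.79),
                   ("HOU", 79.844, 0.91), ("CIN", 75.479, 0.09),
                   ("MIL", 74.594, 0.09), ("PIT", 69.809, 0.09)],
    "AL Central": [("MIN", 89.576, 87.67), ("CLE", 84.515, 10.05),
                   ("CHA", 78.873, 1.83), ("DET", 75.968, 0.46),
                   ("KCA", 65.353, 0.00)],
    "AL West": [("ANA", 89.598, 52.42), ("OAK", 88.466, 45.57),
                ("TEX", 79.517, 1.55), ("SEA", 76.810, 0.46)],
    "AL East": [("BOS", 96.510, 71.05), ("NYA", 94.561, 28.77),
                ("BAL", 79.167, 0.18), ("TOR", 73.396, 0.00),
                ("TBA", 68.540, 0.00)],
    "NL West": [("LAN", 88.142, 61.00), ("SDN", 86.688, 28.58),
                ("SFN", 82.357, 9.77), ("ARI", 71.349, 0.55),
                ("COL", 68.817, 0.09)],
    "NL East": [("PHI", 87.237, 42.19), ("ATL", 87.579, 35.53),
                ("FLO", 86.081, 17.44), ("NYN", 82.588, 4.84),
                ("WAS", 67.979, 0.00)],
}

mismatches = {}
for name, teams in divisions.items():
    top_by_wins = max(teams, key=lambda t: t[1])[0]
    top_by_picks = max(teams, key=lambda t: t[2])[0]
    if top_by_wins != top_by_picks:
        mismatches[name] = (top_by_wins, top_by_picks)

print(mismatches)
```

Only the NL East shows up, with Atlanta leading in average wins and Philadelphia leading in division picks.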

The last thing we can look at is the set of predictions for the wild cards in each league:

```
American League
Team       AvgWins       WC%
NYA        94.561       62.74%
BOS        96.510       27.85%
OAK        88.466        3.65%
ANA        89.598        3.29%
CLE        84.515        1.19%

```
As we noted before, almost all of the wild card predictions went to the loser of the East, either the Yankees or Red Sox. Of course the Red Sox won the wild card last year, but this year it might be the second place team in the Central--both the Twins and Indians are doing well, so the fight for the AL wild card might be rather exciting this year.
```
National League
Team       AvgWins       WC%
CHN        87.283       29.13%
FLO        86.081       17.26%
ATL        87.579       15.07%
PHI        87.237        9.86%
SDN        86.688        9.04%
SLN        93.447        6.67%
NYN        82.588        5.30%
LAN        88.142        2.83%
HOU        79.844        2.19%
SFN        82.357        2.01%

```
The largest share of the wild card picks went to the Cubs, but after that, the picks went to the runners-up in the East, mostly the Marlins or Braves (remember that the Phillies got 42% of the division picks). We can look deeper and deeper into the data--in some preliminary digging, I've found a modest correlation between higher average win guesses for teams and lower standard deviations of win guesses--the regression yielded an r-squared of 0.447. Additionally, there's a strong correlation between average win guesses for each team and the chance they make the playoffs; hopefully, the reason for that is pretty obvious.
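
For anyone who wants to try the average-versus-standard-deviation regression at home, here is a minimal sketch of an ordinary least-squares fit on the AvgWins and StdWins columns from the six division tables. The helper function is hand-rolled for illustration and is not necessarily how the original figure was computed, but on these rounded values it should land close to the 0.447 quoted above.

```python
# (AvgWins, StdWins) for all 30 teams, copied from the division tables.
teams = {
    "SLN": (93.447, 4.035), "CHN": (87.283, 3.853), "HOU": (79.844, 4.702),
    "CIN": (75.479, 4.602), "MIL": (74.594, 4.584), "PIT": (69.809, 4.233),
    "MIN": (89.576, 3.216), "CLE": (84.515, 3.430), "CHA": (78.873, 4.358),
    "DET": (75.968, 4.216), "KCA": (65.353, 4.791), "ANA": (89.598, 3.908),
    "OAK": (88.466, 4.035), "TEX": (79.517, 4.450), "SEA": (76.810, 4.443),
    "BOS": (96.510, 3.137), "NYA": (94.561, 4.170), "BAL": (79.167, 3.840),
    "TOR": (73.396, 4.454), "TBA": (68.540, 4.421), "LAN": (88.142, 4.121),
    "SDN": (86.688, 3.460), "SFN": (82.357, 4.628), "ARI": (71.349, 5.083),
    "COL": (68.817, 4.629), "PHI": (87.237, 4.867), "ATL": (87.579, 4.191),
    "FLO": (86.081, 4.149), "NYN": (82.588, 4.285), "WAS": (67.979, 5.442),
}


def simple_regression(xs, ys):
    """Least-squares slope, intercept, and r-squared for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


avg_wins = [avg for avg, _ in teams.values()]
std_wins = [std for _, std in teams.values()]
slope, intercept, r_squared = simple_regression(avg_wins, std_wins)

print(f"slope = {slope:.4f}, intercept = {intercept:.3f}, r^2 = {r_squared:.3f}")
```

The slope is negative: teams with higher average win guesses tended to draw more agreement from the ballots.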