The first annual Predictatron contest generated a fair amount of interest, getting almost 1,100 entries. Although there are plenty of standings predictions contests out there, few offer a purse as sizeable as $500, or a gift as pleasing to the eye as an autographed picture of everyone’s favorite “Bud.” As the All Star Break approaches, we thought it might be interesting to take a closer look at the ballots from the inaugural competition.
Before delving deeper, some of you might find it helpful to read up on these statistical terms (thanks to Wikipedia):
- probability density function (PDF)
- probability mass function (PMF)
- normal distribution
- standard deviation
- box and whisker plots
Given all the data, there are many different ways to break things down. First, we can look at the overall distribution of win guesses. Putting all the win guesses together, we can see the relative frequency of every win guess, generating a sort of probability mass function (the guesses are whole numbers of wins). Here, the win guess distribution is in blue, with the normal PDF overlaid in pink for reference:
Although there isn’t a ton of information to be gleaned from this graph, we can see that it’s actually relatively close to normal, an interesting phenomenon. We can also see several peaks in the data, showing higher frequency of guesses around 70, 76, 82, and 88 wins. Kudos to the first one to figure out why the peaks come six wins apart each time. Perhaps these peaks represent different groups of teams?
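For readers who want to build a similar graph from their own ballot data, here is a minimal Python sketch of the idea. The `guesses` list is made up for illustration (it is not contest data), and `relative_frequencies` and `normal_pdf` are hypothetical helper names, not anything used to produce the graph above.

```python
import math
from collections import Counter

def relative_frequencies(guesses):
    """Map each distinct win guess to its share of all guesses (an empirical PMF)."""
    counts = Counter(guesses)
    total = len(guesses)
    return {wins: n / total for wins, n in sorted(counts.items())}

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Made-up guesses for illustration only; not taken from the contest ballots.
guesses = [70, 76, 76, 82, 82, 82, 88, 88, 94]

pmf = relative_frequencies(guesses)
mu = sum(guesses) / len(guesses)
# Population standard deviation of the guesses.
sigma = math.sqrt(sum((g - mu) ** 2 for g in guesses) / len(guesses))

# Print each win total with its observed share and the normal density there.
for wins, share in pmf.items():
    print(wins, round(share, 3), round(normal_pdf(wins, mu, sigma), 3))
```

Because the guesses are one win apart, the relative frequencies and the normal density are on the same scale and can be drawn on one set of axes.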
Looking deeper, we can see six or seven groups of teams with similar mean guesses (indicated by the coloring). In this chart, we have the average, standard deviation of win guesses, and then percent of the time the team was picked to make the playoffs, win their division, win the Wild Card, and was chosen as a mortal lock (for more information about the mortal lock, see the Predictatron rules):
We can see that the averages of neighboring groups sit roughly four to seven wins apart:
Color    Mean
Pink     94.839
Orange   87.850
Yellow   83.153
Green    79.350
Blue     75.249
Purple   69.299
Grey     65.353
Now that we have a good idea of how the data shapes up in general, we’ll compare all the teams using box and whisker plots of each team’s guesses from all 1095 ballots. These plots are helpful because they give a good visual representation of both the center (median, in this case) of each team’s guesses, and the spread. Each plot has five parts: the top whisker, the top of the box, the middle of the box, the bottom of the box, and the bottom whisker. The top whisker represents the largest guess for that team, the top of the box represents the value of the 75th-percentile guess for that team, the center of the box represents the median guess, the bottom of the box represents the value of the 25th-percentile guess, and the bottom whisker is the value of the lowest guess for that team. Since 50% of the overall team guesses fall between the top of the box and the bottom of the box, the overall size of the box is a good indicator of the spread of the guesses for each team–smaller boxes mean that the guesses were closer together, or the standard deviation is relatively lower.
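The five parts of each plot can be computed directly from a team's guesses. The sketch below is one way to do it in Python; the sample guesses are made up for illustration, `five_number_summary` is a hypothetical helper name, and the "inclusive" quartile method is an assumption (different tools compute quartiles slightly differently, and we don't know which convention the plots here follow).

```python
import statistics

def five_number_summary(guesses):
    """Return the five values a box and whisker plot draws for one team."""
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(guesses, n=4, method="inclusive")
    return {
        "min": min(guesses),     # bottom whisker
        "q1": q1,                # bottom of the box (25th percentile)
        "median": median,        # middle of the box
        "q3": q3,                # top of the box (75th percentile)
        "max": max(guesses),     # top whisker
    }

# Made-up guesses for a single team; not taken from the contest ballots.
sample = [62, 88, 89, 90, 90, 91, 92, 107]

summary = five_number_summary(sample)
box_height = summary["q3"] - summary["q1"]        # spread of the middle 50%
overall_range = summary["max"] - summary["min"]   # whisker to whisker

print(summary)
print("box height:", box_height, "range:", overall_range)
```

A small box height with a large overall range is the "small box, long whiskers" shape: most guesses agree, with a few far from the pack.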
To help get used to reading a box and whisker plot, we’ll go through a few interesting observations from the plot here.
The first thing you should notice is that the teams are arranged from left to right in order of greatest median guess to the smallest median guess. That is, you can tell just by glancing at the box and whisker plots that the highest median guess belongs to Boston, and that the lowest belongs to Kansas City. Since the top whisker for each team represents that team’s highest win guess, you can tell that the New York Yankees had the highest overall win guess at 109 wins.
Some other observations about this plot and the data in general:
- Philadelphia and Milwaukee have bottom whiskers below nearby teams. That is, the lowest guess for each team is much lower than the lowest guesses for teams with similar median guesses. It seems a few ballots were especially pessimistic about the Phillies’ and Brewers’ chances this year.
- Teams like Boston and Minnesota have relatively small boxes, so 50% of their data falls in a relatively short range of guesses. This indicates that most of the ballots seemed to agree on the win predictions for these teams (these two teams have the lowest standard deviations of guesses).
- In addition to having a small box, Minnesota’s plot has long whiskers–this indicates that most of the data is very close to the median, but there are a few guesses much further out, as low as 62.
- The overall range (max guess – min guess) of Minnesota’s guesses was 45, second only to Tampa Bay’s 48-win spread. Tampa Bay’s spread was thanks mostly to an absurdly optimistic maximum guess of 97 wins, and a more realistic (but perhaps also extreme) minimum guess of 49 wins.
Last thing about Minnesota: If you compare the teams with the two lowest standard deviations of win guesses, Boston and Minnesota, you can see just how strange it is that they’d have a range of guesses so large (Cleveland, the third lowest standard deviation of guesses, is included to give better context):
Team  MIN  25th  MED  75th  MAX   MEAN  RANGE  StdDev
BOS    81    95   97    98  105  96.51     24    3.14
MIN    62    88   90    91  107  89.58     45    3.22
CLE    65    83   85    87   94  84.52     29    3.43
- We should keep in mind while looking at all this data that these are all preseason predictions. Don’t be surprised, then, to find that teams like the White Sox, Diamondbacks and Nationals are on the right side of the box and whisker plot, meaning that their collective win predictions were low, even though they’ve been successful so far this year.
The overall correlation between each team’s mean and median guesses is very high; this plot of each shows a simple linear regression that yields an r-squared of 0.998. This indicates that the data for the win guesses is relatively symmetrical, and not skewed:
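As a rough illustration of the calculation, the r-squared of a simple linear regression is the square of the correlation between the two variables. The sketch below applies that to the mean and median guesses of the three teams tabulated earlier (Boston, Minnesota, Cleveland). Three points are far too few to reproduce the 0.998 figure, which came from all the teams, and `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """R-squared of a simple linear regression of ys on xs (squared Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Mean and median guesses for BOS, MIN, CLE, from the three-team table above.
means = [96.51, 89.58, 84.52]
medians = [97, 90, 85]

print(round(r_squared(means, medians), 4))
```

When the mean and median track each other this closely, no team's guesses are being dragged far to one side by a lopsided tail.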
Now that we have a good idea of how the teams compare and the shape of the data overall, we can look a little deeper, going division by division. Although it seems like processes like this invariably start with the AL, and go through the divisions from East to West, here we’ll go through in order of least to most interesting. First up, National League Central:
Team  AvgWins  StdWins  Playoff%    DIV%     WC%     ML%
SLN    93.447    4.035    92.69%  86.03%   6.67%  27.85%
CHN    87.283    3.853    41.92%  12.79%  29.13%   3.20%
HOU    79.844    4.702     3.11%   0.91%   2.19%   0.18%
CIN    75.479    4.602     0.46%   0.09%   0.37%   0.55%
MIL    74.594    4.584     0.27%   0.09%   0.18%   1.46%
PIT    69.809    4.233     0.18%   0.09%   0.09%   9.95%
Not that there are any uninteresting divisions, but after the Cardinals ran away with the division last year, and did their best to plug the few holes that opened up over the offseason, they were primed for a repeat. Over 86% of ballots agreed, with the remnants of the division picks going primarily to the Cubs. It would seem that many of the people who didn’t pick the Cubs for the division picked them for the Wild Card instead (29% of ballots). The rest of the division didn’t look like it would contend, and ballots agreed. The only other division that had one clear favorite amongst Predictatron ballots was the American League Central:
Team  AvgWins  StdWins  Playoff%    DIV%     WC%     ML%
MIN    89.576    3.216    87.95%  87.67%   0.27%  10.68%
CLE    84.515    3.430    11.23%  10.05%   1.19%   0.46%
CHA    78.873    4.358     1.92%   1.83%   0.09%   0.37%
DET    75.968    4.216     0.64%   0.46%   0.18%   0.09%
KCA    65.353    4.791     0.00%   0.00%   0.00%  14.43%
Nearly everyone–almost 88%–picked the Twins to take the American League Central, more than those who picked the Cardinals in the NL Central. Of course, what makes the AL Central interesting is that the team running away with the division is the White Sox, not the Twins. We could hope then, that the White Sox would get a significant share of the other 12% of ballot predictions, but they didn’t–over 10% of predictatroners picked the Indians to win the AL Central. A measly 2% of ballots anticipated the White Sox’s run at the division; although the team might be likely to regress, these few hold bragging rights for now. The other two divisions in the American League amount to two horse races. First up, the West:
Team  AvgWins  StdWins  Playoff%    DIV%     WC%     ML%
ANA    89.598    3.908    55.71%  52.42%   3.29%   3.84%
OAK    88.466    4.035    49.22%  45.57%   3.65%   4.02%
TEX    79.517    4.450     1.74%   1.55%   0.18%   0.46%
SEA    76.810    4.443     0.46%   0.46%   0.00%   0.46%
Most of the Predictatron guesses have the Angels and Athletics fighting for the division. Instead, the Angels and Rangers both got off to good starts. In the early going, a rash of injuries and terrible early season hitting left Oakland behind the pack. Now, the Rangers are starting to slide back, the A’s are recovering, and the Mariners are left wondering how they should change their pitcher development programs while the Angels run off with the division. Hopefully the second half of the season will see the AL West turn into the close fight we all hoped for, as the A’s are primed for one of their signature second-half runs. That leaves only the East in the American League.
Team  AvgWins  StdWins  Playoff%    DIV%     WC%     ML%
BOS    96.510    3.137    98.90%  71.05%  27.85%  36.71%
NYA    94.561    4.170    91.51%  28.77%  62.74%  18.54%
BAL    79.167    3.840     0.73%   0.18%   0.55%   0.27%
TOR    73.396    4.454     0.00%   0.00%   0.00%   1.46%
TBA    68.540    4.421     0.00%   0.00%   0.00%   8.22%
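As a quick sanity check on how the columns relate, each team's Playoff% should equal its DIV% plus its WC%, give or take rounding. The Python sketch below checks that for the AL East figures above and adds up the wild card share claimed by Boston and New York; the variable names are our own, and the 0.02 tolerance is an assumption to allow for rounding in the published percentages.

```python
# (Playoff%, DIV%, WC%) for each AL East team, copied from the table above.
al_east = {
    "BOS": (98.90, 71.05, 27.85),
    "NYA": (91.51, 28.77, 62.74),
    "BAL": (0.73, 0.18, 0.55),
    "TOR": (0.00, 0.00, 0.00),
    "TBA": (0.00, 0.00, 0.00),
}

# Playoff% should be DIV% + WC%, within a small rounding tolerance.
consistent = {
    team: abs(playoff - (div + wc)) <= 0.02
    for team, (playoff, div, wc) in al_east.items()
}

# Share of all ballots giving the AL Wild Card to Boston or New York.
bos_nya_wc_share = al_east["BOS"][2] + al_east["NYA"][2]

print(consistent)
print(round(bos_nya_wc_share, 2))
```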
Of course, this might have been the easiest division to pick going into the season, as the Red Sox and Yankees have been the only viable candidates the past few years. Ballots agreed: Baltimore, Toronto, and Tampa Bay combined for less than 1% of playoff picks. After Baltimore’s early season surge, they are likely to outpace the 79 win average guess, but they probably won’t win the division. Boston’s recent surge lifted the Sox past the Orioles, and possibly out of reach of the entire division. Over 71% of ballots envisioned Boston taking the AL East.
One thing that astute readers or statisticians in the crowd might notice is the remarkably low standard deviation for the win guesses for the Red Sox–over 1 full win lower than the Yankees, and lowest overall. Similar to the Twins, the relatively low perceived turnover in the Red Sox’s roster might have lent itself to remarkable agreement amongst the predictatroners.
The other interesting thing about this division is that over 90% of the wild card share is predicted between the Yankees and Red Sox. Predictatron ballots basically agree that either one could win the division, and the other one is likely to take the AL Wild Card. Now, let’s move back to the National League, starting in the West:
Team  AvgWins  StdWins  Playoff%    DIV%     WC%     ML%
LAN    88.142    4.121    63.84%  61.00%   2.83%   6.39%
SDN    86.688    3.460    37.63%  28.58%   9.04%   1.37%
SFN    82.357    4.628    11.78%   9.77%   2.01%   1.10%
ARI    71.349    5.083     0.55%   0.55%   0.00%   4.11%
COL    68.817    4.629     0.09%   0.09%   0.00%  12.88%
This is one of the more interesting predicted divisions, since there are three teams that received notable support to take the division. Most people picked the Dodgers, probably in part based upon a couple of articles from early in the season about how many people were underestimating the craftsmanship of Paul DePodesta. Of course, the course of Barry Bonds’ rehab put a huge variable on the Giants’ season, and ballots undoubtedly weighed the situation differently–some might have even been filled out before the news of his prolonged absence came out.
There isn’t much claim to the NL Wild Card here, with only the Padres getting more than a couple percent of the wild card entries in ballots. Another interesting point of note is that this division features three of the seven highest standard deviations in win guesses–that is, there was relatively little agreement about how the Giants, Rockies, and Diamondbacks would finish. Most people thought the Diamondbacks and Rockies didn’t have much of a chance–fewer than 1% of the playoff berths went to these teams, but there was disagreement as to just how badly they would end up. Last, but not least, the NL East:
Team  AvgWins  StdWins  Playoff%    DIV%     WC%     ML%
PHI    87.237    4.867    52.05%  42.19%   9.86%   5.75%
ATL    87.579    4.191    50.59%  35.53%  15.07%   5.39%
FLO    86.081    4.149    34.70%  17.44%  17.26%   2.19%
NYN    82.588    4.285    10.14%   4.84%   5.30%   0.82%
WAS    67.979    5.442     0.00%   0.00%   0.00%  16.80%
Drafting up a predictions competition, this is what you hope for–three teams with sizable chunks of the division picks, two teams with sizable portions of the wild card entries, and two teams that were picked to make the playoffs more often than not. Although the difference is negligible, this is the only division where the team with the highest average win guesses, the Braves, did not also have the highest percentage of ballots where they were picked to win the division.
Of course, the NL East has lived up to expectations, being one of the most competitive divisions thus far. Nobody in the contest picked the Nationals to go to the playoffs, and even though they’re unlikely to take the division, if your friend says he saw this coming, you can call his bluff.
The last thing we can look at is the set of predictions for the wild cards in each league:
American League
Team  AvgWins     WC%
NYA    94.561  62.74%
BOS    96.510  27.85%
OAK    88.466   3.65%
ANA    89.598   3.29%
CLE    84.515   1.19%
As we noted before, almost all of the wild card predictions went to the loser of the East, either the Yankees or Red Sox. Of course the Red Sox won the wild card last year, but this year it might be the second place team in the Central–both the Twins and Indians are doing well, so the fight for the AL wild card might be rather exciting this year.
National League
Team  AvgWins     WC%
CHN    87.283  29.13%
FLO    86.081  17.26%
ATL    87.579  15.07%
PHI    87.237   9.86%
SDN    86.688   9.04%
SLN    93.447   6.67%
NYN    82.588   5.30%
LAN    88.142   2.83%
HOU    79.844   2.19%
SFN    82.357   2.01%
The largest share of the wild card picks went to the Cubs, but after that, the wild card pick tended to go to a runner-up in the East, with most of those predictions going to the Marlins or Braves (remember that the Phillies got 42% of the division picks). We can look deeper and deeper into the data–in some preliminary digging, I’ve found a modest correlation between higher average win guesses for teams and lower standard deviations of win guesses–the regression yielded an r-squared of 0.447. Additionally, there’s a strong correlation between average win guesses for each team and the chance they make the playoffs; hopefully, the reason for that is pretty obvious.
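For anyone who wants to poke at the relationship between average guess and agreement, the sketch below computes the correlation between each team's AvgWins and StdWins using the figures from the division tables in this article. The `pearson_r` helper is our own name, and since we haven't spelled out exactly how the regression above was run, the result may not match the 0.447 figure to the last digit.

```python
import math

# (team, average win guess, standard deviation of win guesses),
# copied from the six division tables in this article.
teams = [
    ("SLN", 93.447, 4.035), ("CHN", 87.283, 3.853), ("HOU", 79.844, 4.702),
    ("CIN", 75.479, 4.602), ("MIL", 74.594, 4.584), ("PIT", 69.809, 4.233),
    ("MIN", 89.576, 3.216), ("CLE", 84.515, 3.430), ("CHA", 78.873, 4.358),
    ("DET", 75.968, 4.216), ("KCA", 65.353, 4.791),
    ("ANA", 89.598, 3.908), ("OAK", 88.466, 4.035), ("TEX", 79.517, 4.450),
    ("SEA", 76.810, 4.443),
    ("BOS", 96.510, 3.137), ("NYA", 94.561, 4.170), ("BAL", 79.167, 3.840),
    ("TOR", 73.396, 4.454), ("TBA", 68.540, 4.421),
    ("LAN", 88.142, 4.121), ("SDN", 86.688, 3.460), ("SFN", 82.357, 4.628),
    ("ARI", 71.349, 5.083), ("COL", 68.817, 4.629),
    ("PHI", 87.237, 4.867), ("ATL", 87.579, 4.191), ("FLO", 86.081, 4.149),
    ("NYN", 82.588, 4.285), ("WAS", 67.979, 5.442),
]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

avg_wins = [t[1] for t in teams]
std_wins = [t[2] for t in teams]

r = pearson_r(avg_wins, std_wins)
r_squared = r * r  # equals the R-squared of a simple linear regression

print(f"r = {r:.3f}, r-squared = {r_squared:.3f}")
```

A negative `r` here means that teams with higher average guesses tended to have tighter agreement among ballots.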