June 28, 2006
Lies, Damned Lies
We are Elo?
I’m one of those people who like to feign knowledge about pretty much any topic that might come up in a conversation. A lot of the time, this requires nothing more than an Ozzie Guillen-like shamelessness for talking myself into and out of trouble. But a little bit of research goes a long way. In this spirit, I stumbled across this website, which applies an Elo Rating system to football… err, soccer teams. This is some pretty cool stuff. Not only is every national team in the world rated, but you can find its rating for any date in the past century. If you wanted to know, say, whether Qatar or Bahrain would have been favored in a match played in August 1971, you could find that out (answer: Qatar, but only by a hair).
Elo Ratings (apparently pronounced E-L-O, as in the band that has inspired so many hapless karaoke renditions over the past thirty years) will be familiar to those of you who have played chess, backgammon, or Scrabble competitively. Their goal is very simple: to provide an objective and reliable measure by which the strength of two opponents can be compared. A backgammon player with a 1900 Elo Rating, for example, would be expected to beat a backgammon player with a 1700 rating around 76% of the time.
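For the curious, here is a minimal sketch of the win-probability curve behind that 76% figure. It assumes the standard logistic Elo curve with a 400-point scale, which is not spelled out above but does reproduce both the 76% figure for a 200-point gap and the 64% figure for a 100-point gap quoted later on. The function name is my own.

```python
def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Chance that A beats B, assuming the standard 400-point logistic Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# A 200-point edge (the backgammon example) and a 100-point edge.
print(round(expected_win_prob(1900, 1700), 2))  # 0.76
print(round(expected_win_prob(1600, 1500), 2))  # 0.64
```

Only the difference between the two ratings matters, so a 1600 team facing a 1400 team is favored exactly as much as the 1900 player facing the 1700 player.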
Naturally, if an Elo Rating system can be applied to soccer, then it can be applied to baseball as well. Just as naturally, you might ask why we need another way to rate baseball teams when we already have cool features like the Adjusted Standings Report. For one thing, Elo Ratings are extremely elegant. They are able to incorporate a lot of information--things like strength of schedule and run differential--and boil it down to one simple number. More substantively, however, Elo Ratings strive to answer a different question than something like Pythagorean records. Pythagorean records and their brethren give us an assessment of how well a team has played over a given time frame (usually the span of one particular baseball season). Elo Ratings, on the other hand, are designed to give an assessment of how strong a team is today, right now, at this very moment--or at any other moment in time throughout history.
This is a distinction that comes more or less naturally to soccer fans. National soccer teams don’t really have seasons; they play in various tournaments like the World Cup and the Olympics, but otherwise it’s a big hairy mess of "friendlies" (one-off matches), qualifiers, and various minor regional tournaments. It’s assumed that a team’s form changes organically over time, as various players age, retire, get injured, or are brought up from the junior ranks.
Baseball fans, meanwhile, are rather hung up on the concept of the season. But the season is not sacred. Roster composition changes over the course of the year as players are injured, acquired in trade, promoted from the minor leagues, and so forth. A young team like the Marlins can reasonably be expected to be stronger in the latter part of the year than the earlier, while the opposite is true for a veteran club. And, perhaps, there are some karmic factors at work. The Orioles played .457 ball last year--but would you really stake them at those odds in a game played last September? Can we acknowledge, perhaps, that the 2002 A’s really were playing better baseball during their 17-game winning streak in August and September than they were in April? Team strength is more dynamic than it’s usually made out to be.
You can find the formula for the soccer Elo Ratings system here. Although the math behind the Elo Ratings system is not complicated in itself, it quickly became apparent that some of the parameters included in the soccer version of the ratings would need to be tweaked for baseball. In particular, the following questions need to be resolved:
Velocity of the rankings
This is far and away the most important methodological question. National soccer teams play relatively few matches; the U.S. national team, for example, has played in 66 official contests since the conclusion of the 2002 World Cup. The New York Yankees, meanwhile, have played something like 660 games during that time frame. If we apply the same velocity that is used in the soccer version of the ratings to major league baseball clubs, we wind up with something like this:
By convention, a 1500 Elo Rating is taken to be exactly average. A 1600 rating is very strong for a baseball team: it corresponds to a team that would win 64% of its games against average competition (or about a 104-58 record). So this chart would posit, for example, that last year’s Indians were a 100-loss team in May, but one of the strongest teams in baseball history by August. That conclusion is a bit absurd. On the other hand, an extremely low velocity would do little to distinguish baseball teams from one another at all:
We can optimize the velocity parameter by fitting it to historical results. Elo implies that a 100-point ratings difference should correspond to the stronger team winning a game on neutral turf about 64% of the time. If we found out that the higher-rated club was winning this sort of game 78% of the time, or 52% of the time, that would suggest that our ratings weren’t calibrated very well. In other words, we specify the velocity parameter to maximize the predictive power of the model. This turns out to correspond to a weighting of about four. By comparison, the soccer version of the ratings uses a weighting of 20 for an international friendly, and 40 for a World Cup match. Therefore, we wind up with something more sensible like this:
Margin of victory
The soccer version of the Elo Ratings weights the result of a contest not only by its ‘importance’ (e.g. the World Cup is given more weight than a friendly), but also by the goal differential: a three-goal victory triggers a bigger shift in the rankings than a 2-1 nailbiter. The most straightforward way to apply this to baseball is to take the margin of victory to an exponent, in order to come up with the appropriate weight. In particular, the optimal exponent, in terms of maximizing the predictive power of the model, turns out to be about 1/3 (one-third). That is, a one-run victory is assigned a weight of exactly 1.00 (1^(1/3)), whereas a five-run victory is assigned a weight of 1.71 (5^(1/3)). This strikes a nice balance between regular W-L records, which don’t account for margin of victory at all, and Pythagorean records, which can be overly responsive to a blowout result.
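To make the margin-of-victory weighting concrete, here is a rough sketch of how a single game might move two ratings. The velocity of four and the one-third exponent come from the discussion above. The way they are combined (velocity times margin weight times the gap between the actual and expected result) is my assumption about the shape of the formula, and the 400-point curve and function names are likewise illustrative. Home field is left out to keep things short.

```python
K = 4.0                   # velocity ("weighting of about four")
MOV_EXPONENT = 1.0 / 3.0  # margin-of-victory exponent


def margin_weight(run_margin: int) -> float:
    """Weight for a game decided by run_margin runs."""
    return abs(run_margin) ** MOV_EXPONENT


def update(winner: float, loser: float, run_margin: int) -> tuple[float, float]:
    """Return (new_winner, new_loser); assumed form K * weight * (1 - expected)."""
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    shift = K * margin_weight(run_margin) * (1.0 - expected)
    return winner + shift, loser - shift


print(round(margin_weight(5), 2))  # 1.71
print(update(1500, 1500, 1))       # (1502.0, 1498.0)
```

Under these assumptions a one-run win between two average teams moves each rating by two points, and a five-run win moves them by about 3.4. Whatever the winner gains, the loser gives up.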
Home field advantage
Home field advantage is much stronger in international soccer than it is in major league baseball. Therefore, instead of spotting the home team a 100-point ratings boost, as in the soccer ratings, we assign the home team a 25-point advantage. This 25-point advantage corresponds to the home team winning about 53.5% of the time, which is very close to the actual home field advantage observed in baseball over the past couple of decades.
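A quick sketch of the home field adjustment: the bonus is simply added to the home team's rating before the win probability is computed. This assumes the standard 400-point logistic curve, and the names are mine. On that curve, 25 points works out to roughly 53.6%, in line with the "about 53.5%" quoted above.

```python
HOME_FIELD_BASEBALL = 25.0  # points spotted to the home team in the baseball version
HOME_FIELD_SOCCER = 100.0   # points spotted to the home team in the soccer version


def home_win_prob(home: float, away: float, bonus: float) -> float:
    """Chance the home team wins, with the bonus added to its rating."""
    return 1.0 / (1.0 + 10 ** ((away - (home + bonus)) / 400.0))


# Two evenly matched teams, baseball bonus versus soccer bonus.
print(round(home_win_prob(1500, 1500, HOME_FIELD_BASEBALL), 3))  # 0.536
print(round(home_win_prob(1500, 1500, HOME_FIELD_SOCCER), 3))    # 0.64
```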
Treatment of new seasons
If we wanted, we could treat all of baseball history as one long, continuous season. If last year’s Tigers ended last season with a 1480 rating, that’s what their rating would be on Opening Day, 2006. This has certain advantages; if the Yankees get off to an 0-2 start, it would be naïve to claim that they are a below-average club, when they have so much recent history to the contrary. However, while I have cautioned that we should not be overly sentimental toward the concept of the season, there is little doubt that baseball teams do undergo more change during the span of a six-month off-season than they do during a typical day in the regular season.
We can resolve this problem just as we have resolved the other ones, by making the choice that maximizes the model’s predictive power. As it happens, we do best neither by resetting things entirely when a new season begins, nor by keeping things entirely continuous, but by splitting the difference. Therefore, if the Tigers finish 2005 with a 1480 rating (20 points below average), we start them off at 1490 (ten points below average) for 2006. Or, if the Indians end 2005 with a 1550 rating (50 points above average), we’ll have them begin 2006 with a 1525 rating (25 points above average).
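The off-season adjustment is simple enough to sketch in a few lines. The one-half carryover is inferred from the Tigers and Indians examples above; the constant and function names are my own.

```python
AVERAGE = 1500.0
CARRYOVER = 0.5  # inferred from the examples: keep half the distance from average


def opening_day_rating(year_end_rating: float) -> float:
    """Pull a year-end rating halfway back toward the 1500 baseline."""
    return AVERAGE + CARRYOVER * (year_end_rating - AVERAGE)


print(opening_day_rating(1480))  # 1490.0 (Tigers)
print(opening_day_rating(1550))  # 1525.0 (Indians)
```

A carryover of 1.0 would treat history as one continuous season, and 0.0 would reset every club to 1500 each spring.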
By design, the Elo Ratings are self-correcting. If your rating is low, then it’s easier to gain points, and harder to lose them, and vice versa if your rating is high. Thus, most of the ‘bias’ introduced by the season-starting rating has been squeezed out of the system by the All-Star break or so, and almost all of it by the time the next season has been completed. However, teams do carry some ‘momentum’ with them from season to season. One interesting application of this is that, if we tried to use the Elo Ratings to rank the best baseball teams of all time, a dynasty that played well year after year would receive a somewhat higher rating than a one-year wonder like the 2001 Mariners.
Treatment of the postseason
One pet peeve of mine is that statheads tend to take regular season results as gospel, while the postseason is assumed to be some sort of big dice game. But the postseason itself provides information about team quality. We can knock last year’s White Sox for being "only" a 91-71 Pythagorean club during the regular season, but the White Sox also played quality clubs for three weeks in October, and absolutely destroyed them.
Fortunately, the Elo Ratings are well equipped to handle postseason results. The postseason participants trade Elo points amongst themselves, but they cannot take points away from other clubs that are busy playing golf. It does not necessarily follow from this that we should rate postseason games more heavily, but I have made a somewhat subjective decision to assign postseason games a 50% ‘bonus’ weighting. (This makes little difference one way or the other from a predictive standpoint.) Here, for example, is how last year’s postseason shook things up:
Team         End Regular Season   Postseason   Year-end Rating
Cardinals    1569                 +9           1578
Yankees      1567                 -6           1561
Red Sox      1562                 -14          1548
Angels       1559                 -10          1549
Astros       1549                 -10          1539
White Sox    1540                 +41          1581
Braves       1531                 -2           1529
Padres       1487                 -9           1478
The White Sox’ 41-point gain is enormous, equivalent to a team picking up ten games in its Pythagorean record. It’s also much larger than the amount that other recent champions have earned. The ’04 Red Sox gained 31 points during their playoff run, the 2000 Yankees 29 points, the ’02 Angels 24 points, the ’01 Diamondbacks 23 points, and the ’03 Marlins just 15 points.
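Two properties of the postseason treatment can be checked with a short sketch: the 50% bonus scales the size of each rating shift, and the points are traded only among the participants. The shift formula below is my assumed form (velocity times weight times the gap between result and expectation, on a 400-point curve, with home field ignored), and the sample matchup is hypothetical. The postseason changes are copied from the table above.

```python
K = 4.0                 # velocity from the article
POSTSEASON_BONUS = 1.5  # the 50% postseason bonus


def rating_shift(winner, loser, run_margin, postseason=False):
    """Points the winner takes from the loser (assumed form, home field ignored)."""
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    weight = abs(run_margin) ** (1.0 / 3.0)
    if postseason:
        weight *= POSTSEASON_BONUS
    return K * weight * (1.0 - expected)


# Hypothetical three-run win by a 1540 team over a 1562 team.
regular = rating_shift(1540, 1562, 3)
october = rating_shift(1540, 1562, 3, postseason=True)
print(round(october / regular, 2))  # 1.5

# 2005 postseason changes from the table above.
postseason_changes = {
    "Cardinals": 9, "Yankees": -6, "Red Sox": -14, "Angels": -10,
    "Astros": -10, "White Sox": 41, "Braves": -2, "Padres": -9,
}
print(sum(postseason_changes.values()))  # -1
```

The published changes sum to -1 rather than exactly zero, which is presumably just the rounding of each team's rating to a whole number.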
Current and recent ratings
Listed immediately below are current Elo Ratings through Monday evening’s games. I have also provided a corresponding ‘Eloport’ rating, which works backward to translate a team’s Elo rating into a W-L record.
Team          Elo    Eloport
White Sox     1573   98-64
Red Sox       1552   93-69
Mets          1549   92-70
Tigers        1548   92-70
Yankees       1547   92-70
A’s           1531   88-74
Cardinals     1531   88-74
Twins         1521   86-76
Blue Jays     1517   85-77
Rangers       1513   84-78
Indians       1504   82-80
Dodgers       1502   81-81
Angels        1501   81-81
Marlins       1500   81-81
Phillies      1499   81-81
Mariners      1498   81-81
Reds          1497   80-82
Padres        1495   80-82
Astros        1492   79-83
Rockies       1489   79-83
Giants        1488   78-84
Brewers       1483   77-85
Nationals     1480   76-86
Braves        1477   76-86
D’backs       1475   75-87
Orioles       1470   74-88
Devil Rays    1460   72-90
Cubs          1453   70-92
Pirates       1437   66-96
Royals        1419   63-99
Keep in mind that the Elo Ratings are designed to measure a team’s current strength, and not necessarily its strength over the course of the entire season. Thus, the Marlins have already worked their way back to a 1500 rating, while the Reds and Cardinals have been punished for their recent poor play (the Cardinals had a 1552 rating just a week ago). It’s good to see the White Sox so far ahead of the pack, after all the grief that systems like PECOTA have given them. You may also notice that the list is heavily slanted toward American League clubs; only three of the 16 National League teams have a rating above the 1500 baseline. This is an organically generated result based on interleague play (including recent postseason results).
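The Eloport figures in the table above look consistent with a simple conversion: take the team's expected winning percentage against an exactly average (1500) opponent on neutral turf and scale it to a 162-game schedule. That reading is my inference rather than a stated formula, and it assumes the standard 400-point logistic curve. Because the published ratings are rounded to whole numbers, a team sitting near a rounding boundary can come out a game off.

```python
GAMES = 162
AVERAGE = 1500.0


def eloport(rating: float) -> tuple[int, int]:
    """Approximate W-L record implied by a rating (assumed conversion)."""
    win_pct = 1.0 / (1.0 + 10 ** ((AVERAGE - rating) / 400.0))
    wins = round(GAMES * win_pct)
    return wins, GAMES - wins


print(eloport(1573))  # (98, 64)  White Sox
print(eloport(1552))  # (93, 69)  Red Sox
print(eloport(1500))  # (81, 81)  Marlins
```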
The Elo Ratings can also be used to do some storytelling. For example, you can probably surmise the identity of this club:
Or this one:
Other progressions are a bit less dramatic:
The highest Elo Rating during the decade of the 2000s is 1624 (109-53 Eloport), belonging to the Oakland A’s after they beat the Yankees in the second game of the LDS on October 11, 2001. The lowest is 1335 (45-117 Eloport), "achieved" by the Tigers on September 23, 2003. The highest year-ending rating in the decade for a World Series champion is 1609 (106-56 Eloport), belonging to the 2004 Red Sox.
Next week, we’ll turn back the clock and look at the best teams of the past half-century. In the meantime, I’ve got a football match to watch.