Lies, Damned Lies: We are Elo?

I’m one of those people who like to feign knowledge about pretty much any topic that might come up in a conversation. A lot of the time, this requires nothing more than an Ozzie Guillen-like shamelessness for talking myself into and out of trouble. But a little bit of research goes a long way. In this spirit, I stumbled across this website, which applies an Elo Rating system to football… err, soccer teams. This is some pretty cool stuff. Not only is every national team in the world rated, but you can find its rating for any date in the past century. If you wanted to know, say, whether Qatar or Bahrain would have been favored in a match played in August 1971, you could find that out (answer: Qatar, but only by a hair).

Elo Ratings (apparently pronounced E-L-O, as in the band that has inspired so many hapless karaoke renditions over the past thirty years) will be familiar to those of you who have played chess, backgammon, or Scrabble competitively. Their goal is very simple: to provide an objective and reliable measure by which the strength of two opponents can be compared. A backgammon player with a 1900 Elo Rating, for example, would be expected to beat a backgammon player with a 1700 rating around 76% of the time.
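The 76% figure falls out of the standard Elo expectation curve. Here is a minimal sketch, assuming the usual logistic form with a 400-point scale; the function name is mine, not something from the ratings site.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that A beats B on the standard 400-point Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


print(round(expected_score(1900, 1700), 2))  # 0.76
print(round(expected_score(1600, 1500), 2))  # 0.64
```

Only the difference between the two ratings matters, so a 1700 player facing a 1500 player is also a 76% favorite.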

Naturally, if an Elo Rating system can be applied to soccer, then it can be applied to baseball as well. Just as naturally, you might ask why we need another way to rate baseball teams when we already have cool features like the Adjusted Standings Report. For one thing, Elo Ratings are extremely elegant. They are able to incorporate a lot of information–things like strength of schedule and run differential–and boil it down to one simple number. More substantively, however, Elo Ratings strive to answer a different question than something like Pythagorean records. Pythagorean records and their brethren give us an assessment of how well a team has played over a given time frame (usually the span of one particular baseball season). Elo Ratings, on the other hand, are designed to give an assessment of how strong a team is today, right now, at this very moment–or at any other moment in time throughout history.

This is a distinction that comes more or less naturally to soccer fans. National soccer teams don’t really have seasons; they play in various tournaments like the World Cup and the Olympics, but otherwise it’s a big hairy mess of "friendlies" (one-off matches), qualifiers, and various minor regional tournaments. It’s assumed that a team’s form changes organically over time, as various players age, retire, get injured, or are brought up from the junior ranks.

Baseball fans, meanwhile, are rather hung up on the concept of the season. But the season is not sacred. Roster composition changes over the course of the year as players are injured, acquired in trade, promoted from the minor leagues, and so forth. A young team like the Marlins can reasonably be expected to be stronger in the latter part of the year than the earlier, while the opposite is true for a veteran club. And, perhaps, there are some karmic factors at work. The Orioles played .457 ball last year–but would you really stake them at those odds in a game played last September? Can we acknowledge, perhaps, that the 2002 A’s really were playing better baseball during their 17-game winning streak in August and September than they were in April? Team strength is more dynamic than it’s usually made out to be.

You can find the formula for the soccer Elo Ratings system here. Although the math behind the Elo Ratings system is not complicated in itself, it quickly became apparent that some of the parameters included in the soccer version of the ratings would need to be tweaked for baseball. In particular, the following questions need to be resolved:

What should the ‘velocity’ of the rankings be? That is, how fast should they change in response to one particular result?
How much should we account for different margins of victory?
How do we account for home field advantage?
What do we do once a particular season has been completed? Start over from scratch, or allow things to be continuous?
Should any particular weight be given to post-season games?

Velocity of the rankings

This is far and away the most important methodological question. National soccer teams play relatively few matches; the U.S. national team, for example, has played in 66 official contests since the conclusion of the 2002 World Cup. The New York Yankees, meanwhile, have played something like 660 games during that time frame. If we apply the same velocity that is used in the soccer version of the ratings to major league baseball clubs, we wind up with something like this:

By convention, a 1500 Elo Rating is taken to be exactly average. A 1600 rating is very strong for a baseball team–that corresponds to a team that would win 64% of its games against average competition (or about a 104-58 record). So this chart would posit, for example, that last year’s Indians were a 100-loss team in May of last season, but one of the strongest teams in baseball history by August. That conclusion is a bit absurd. On the other hand, an extremely low velocity would do little to distinguish baseball teams from one another at all:

We can optimize the velocity parameter by fitting it to historical results. Elo implies that a 100-point ratings difference should correspond to the stronger team winning a game on neutral turf about 64% of the time. If we found out that the higher-rated club was winning this sort of game 78% of the time, or 52% of the time, that would suggest that our ratings weren’t calibrated very well. In other words, we specify the velocity parameter to maximize the predictive power of the model. This turns out to correspond to a weighting of about four. By comparison, the soccer version of the ratings uses a weighting of 20 for an international friendly, and 40 for a World Cup match. Therefore, we wind up with something more sensible like this:

Margin of victory

The soccer version of the Elo Ratings weights the result of a contest not only by its ‘importance’ (e.g. the World Cup is given more weight than a friendly), but also by the goal differential: a three-goal victory triggers a bigger shift in the rankings than a 2-1 nailbiter. The most straightforward way to apply this to baseball is to raise the margin of victory to an exponent, in order to come up with the appropriate weight. In particular, the optimal exponent, in terms of maximizing the predictive power of the model, turns out to be about 1/3 (one-third). That is, a one-run victory is assigned a weight of exactly 1.00 (1^(1/3)), whereas a five-run victory is assigned a weight of 1.71 (5^(1/3)). This strikes a nice balance between regular W-L records, which don’t account for margin of victory at all, and Pythagorean records, which can be overly responsive to a blowout result.
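To make the velocity and the margin weight concrete, here is a rough sketch of a single-game update. It assumes the two multiply together the way the importance and goal-difference terms do in the soccer formula; the article does not spell out the exact baseball formula, so the function names and that combination are my assumptions. Home field is left out to keep the sketch short.

```python
K = 4.0  # velocity the article settles on for baseball


def expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that A beats B on the standard 400-point Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def margin_weight(run_margin: int) -> float:
    """Cube root of the run margin: 1 run -> 1.00, 5 runs -> about 1.71."""
    return abs(run_margin) ** (1.0 / 3.0)


def update(winner: float, loser: float, run_margin: int, k: float = K):
    """Return (new_winner, new_loser) after one game.

    Assumption: velocity and margin weight multiply, as in the soccer formula.
    """
    shift = k * margin_weight(run_margin) * (1.0 - expected_score(winner, loser))
    return winner + shift, loser - shift


print(round(margin_weight(5), 2))  # 1.71
print(update(1500.0, 1500.0, 1))   # (1502.0, 1498.0)
```

Because the shift scales with how surprising the result was, an underdog gains more for a win than a favorite does for the same score.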

Home field advantage

Home field advantage is much stronger in international soccer than it is in major league baseball. Therefore, instead of spotting the home team a 100-point ratings boost, as in the soccer ratings, we instead assign the home team a 25-point advantage. This 25-point advantage corresponds to the home team winning about 53.5% of the time, which is very close to the actual home field advantage observed in baseball over the past couple of decades.
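As a quick check on those numbers, the home edge can be treated as a temporary bump to the home team's rating before the win probability is computed. This is a sketch under the standard 400-point Elo curve; the names are mine.

```python
HOME_EDGE = 25.0  # points spotted to the home team in the baseball version


def home_win_probability(home: float, away: float,
                         home_edge: float = HOME_EDGE) -> float:
    """Win probability for the home team after adding the home-field bump."""
    return 1.0 / (1.0 + 10 ** ((away - (home + home_edge)) / 400.0))


print(round(home_win_probability(1500, 1500), 3))                   # 0.536
print(round(home_win_probability(1500, 1500, home_edge=100.0), 3))  # 0.64
```

Two evenly matched clubs come out at roughly 53.6% for the home side with a 25-point edge, against about 64% with the 100-point edge used for soccer.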

Treatment of new seasons

If we wanted, we could treat all of baseball history as one long, continuous season. If last year’s Tigers ended last season with a 1480 rating, that’s what their rating would be on Opening Day 2006. This has certain advantages; if the Yankees get off to an 0-2 start, it would be naïve to claim that they are a below-average club, when they have so much recent history to the contrary. However, while I have cautioned that we should not be overly sentimental toward the concept of the season, there is little doubt that baseball teams do undergo more change during the span of a six-month off-season than they do during a typical day in the regular season.

We can resolve this problem just as we have resolved the other ones, by making the choice that maximizes the model’s predictive power. As it happens, we do best by neither resetting things entirely when a new season begins, nor by keeping things entirely continuous, but by splitting the difference. Therefore, if the Tigers finish 2005 with a 1480 rating (20 points below average), we start them off at 1490 (ten points below average) for 2006. Or, if the Indians end 2005 with a 1550 rating (50 points above average), we’ll have them begin 2006 with a 1525 rating (25 points above average).
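Splitting the difference amounts to pulling each team halfway back to the 1500 baseline. A minimal sketch, with constant and function names of my own choosing:

```python
BASELINE = 1500.0
CARRYOVER = 0.5  # fraction of the gap from average kept over the winter


def opening_day_rating(year_end: float, carryover: float = CARRYOVER) -> float:
    """Regress a year-end rating partway toward the league baseline."""
    return BASELINE + carryover * (year_end - BASELINE)


print(opening_day_rating(1480))  # 1490.0
print(opening_day_rating(1550))  # 1525.0
```

Setting carryover to 0 would start every season from scratch, and setting it to 1 would treat history as one continuous season.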

By design, the Elo Ratings are self-correcting. If your rating is low, then it’s easier to gain points, and harder to lose them, and vice versa if your rating is high. Thus, most of the ‘bias’ introduced by the season-starting rating has been squeezed out of the system by the All-Star break or so, and almost all of it by the time the next season has been completed. However, teams do carry some ‘momentum’ with them from season to season. One interesting application of this is that, if we tried to use the Elo Ratings to rank the best baseball teams of all time, a dynasty that played well year after year would receive a somewhat higher rating than a one-year wonder like the 2001 Mariners.

Treatment of the postseason

One pet peeve of mine is that statheads tend to take regular season results as gospel, while the post-season is assumed to be some sort of big dice game. But the postseason itself provides information about team quality. We can knock last year’s White Sox for being "only" a 91-71 Pythagorean club during the regular season, but the White Sox also played quality clubs for three weeks in October, and absolutely destroyed them.

Fortunately, the Elo Ratings are well equipped to handle postseason results. The postseason participants trade Elo points amongst themselves, but they cannot take points away from other clubs that are busy playing golf. It does not necessarily follow from this that we should rate postseason games more heavily, but I have made a somewhat subjective decision to assign postseason games a 50% ‘bonus’ weighting. (This makes little difference one way or the other from a predictive standpoint.) Here, for example, is how last year’s postseason shook things up:

Team        End Regular Season  Postseason  Year-end Rating
Cardinals    1569                +9         1578
Yankees      1567                -6         1561
Red Sox      1562               -14         1548
Angels       1559               -10         1549
Astros       1549               -10         1539
White Sox    1540               +41         1581
Braves       1531                -2         1529
Padres       1487                -9         1478

The White Sox’ 41-point gain is enormous, equivalent to a team picking up ten games in its Pythagorean record. It’s also much larger than the amount that other recent champions have earned. The ’04 Red Sox gained 31 points during their playoff run, the ’00 Yankees 29 points, the ’02 Angels 24 points, the ’01 Diamondbacks 23 points, and the ’03 Marlins just 15 points.
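The point-trading idea can be sketched as follows. The 1.5 multiplier is the 50% postseason bonus and the velocity of four comes from earlier in the piece; how the pieces combine, the one-run game, and the "Idle club" entry are my assumptions for illustration, and home field is ignored.

```python
def postseason_update(winner: float, loser: float, run_margin: int,
                      k: float = 4.0, bonus: float = 1.5):
    """Return (new_winner, new_loser); the loser gives up what the winner gains."""
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    shift = k * bonus * abs(run_margin) ** (1.0 / 3.0) * (1.0 - expected)
    return winner + shift, loser - shift


# Regular-season-ending ratings from the table, plus a hypothetical idle club.
ratings = {"White Sox": 1540.0, "Red Sox": 1562.0, "Idle club": 1500.0}
total_before = sum(ratings.values())

# A hypothetical one-run White Sox win over the Red Sox.
ratings["White Sox"], ratings["Red Sox"] = postseason_update(
    ratings["White Sox"], ratings["Red Sox"], run_margin=1
)

print(ratings)
print(abs(sum(ratings.values()) - total_before) < 1e-9)  # True
```

The idle club's rating never moves, and the league-wide total is unchanged, which is the sense in which October participants can only take points from each other.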

Current and recent ratings

Listed immediately below are current Elo Ratings through Monday evening’s games. I have also provided a corresponding ‘Eloport’ rating, which works backward to translate a team’s Elo rating into a W-L record.

Team         Elo      Eloport
White Sox    1573     98-64
Red Sox      1552     93-69
Mets         1549     92-70
Tigers       1548     92-70
Yankees      1547     92-70
A’s          1531     88-74
Cardinals    1531     88-74
Twins        1521     86-76
Blue Jays    1517     85-77
Rangers      1513     84-78
Indians      1504     82-80
Dodgers      1502     81-81
Angels       1501     81-81
Marlins      1500     81-81
Phillies     1499     81-81
Mariners     1498     81-81
Reds         1497     80-82
Padres       1495     80-82
Astros       1492     79-83
Rockies      1489     79-83
Giants       1488     78-84
Brewers      1483     77-85
Nationals    1480     76-86
Braves       1477     76-86
D’backs      1475     75-87
Orioles      1470     74-88
Devil Rays   1460     72-90
Cubs         1453     70-92
Pirates      1437     66-96
Royals       1419     63-99
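The Eloport column looks consistent with taking a team's expected winning percentage against a 1500-rated opponent on neutral turf and multiplying by 162. That is an inference on my part rather than a stated method; the sketch below matches the rows I spot-checked, though rounding of the underlying ratings could move a row by a game.

```python
def eloport(rating: float, games: int = 162, baseline: float = 1500.0):
    """Translate an Elo rating into a (wins, losses) record.

    Assumption: the record is the expected win rate against an average
    (1500) team over a full schedule, rounded to whole games.
    """
    win_pct = 1.0 / (1.0 + 10 ** ((baseline - rating) / 400.0))
    wins = round(win_pct * games)
    return wins, games - wins


print(eloport(1573))  # (98, 64)
print(eloport(1500))  # (81, 81)
```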

Keep in mind that the Elo Ratings are designed to measure a team’s current strength, and not necessarily its strength over the course of the entire season. Thus, the Marlins have already worked their way back to a 1500 rating, while the Reds and Cardinals have been punished for their recent poor play (the Cardinals had a 1552 rating just a week ago). It’s good to see the White Sox so far ahead of the pack, after all the grief that systems like PECOTA have given them. You may also notice that the list is heavily slanted toward American League clubs; only three of the 16 National League teams have a rating above the 1500 baseline. This is an organically generated result based on interleague play (including recent postseason results).

The Elo Ratings can also be used to do some storytelling. For example, you can probably surmise the identity of this club:

Or this one:

Other progressions are a bit less dramatic:

The highest Elo Rating during the decade of the 2000s is 1624 (109-53 Eloport), belonging to the Oakland A’s after they beat the Yankees in the second game of the LDS on October 11, 2001. The lowest is 1335 (45-117 Eloport), "achieved" by the Tigers on September 23, 2003. The highest year-end rating in the decade for a World Series champion is 1609 (106-56 Eloport), belonging to the 2004 Red Sox.

Next week, we’ll turn back the clock and look at the best teams of the past half-century. In the meantime, I’ve got a football match to watch.

stubert

9/13

There's not much of a spread in the numbers. Normalizing them illustrates the homogenization effect of this rating system, in my opinion. The worst teams just don't seem far enough away from the best teams.

nosybrian

4/20

The best teams generally only win about 60% of the time against the league. In that sense, the leagues are fairly even from top to bottom.

Consider the language of baseball vs. football. In football, if a bad team beats the best team even in a single game, it's an "upset." In baseball, if a bad team beats the best team, it's a "long season."

Nate Silver