Joe Sheehan asked me, prior to his appearance to discuss Barry Bonds’ future on ESPN’s Outside the Lines last night, if I had any way to estimate the chance that Albert Pujols and Alex Rodriguez will break Hank Aaron’s home run record. In other words, he wanted a version of Bill James’ Favorite Toy, one no doubt to be inspired in large part by PECOTA.

It’s worth mentioning something before we proceed further. Though the Favorite Toy is one of James’ more popular and accessible inventions, it has not to my knowledge been validated empirically. That is, while it produces some answers that look about right and can spark some lively barroom discussions, we have no way of knowing whether it is accurate. My guess, actually, is that the Favorite Toy tends to overestimate the chance that a certain record will be surpassed, mostly because it doesn’t account for the way in which problematic events in a player’s career path tend to snowball. For example, the Favorite Toy might estimate that, say, Ivan Rodriguez has a break-even chance of reaching 3,000 hits, based on an assumption that he will play about seven more seasons and average 140 hits per year (which would give him 3,031). The problem is that, if Rodriguez only gets, say, 90 hits in 2007, that likely indicates that something has gone seriously wrong with him (probably an injury), and would radically reduce his projection for future seasons. But if Rodriguez had a good year in 2007 and had, say, 170 hits, it would probably not substantially increase our estimate of his productivity in the years beyond that, as he’d still be on the wrong side of the aging curve.

A potentially more accurate way to go about estimating a player’s chances of breaking a certain record is to examine comparable players, which is exactly what PECOTA does. Of course it will require some care to do this properly, but intuitively it seems reasonable that, if we can identify a certain number of similar players, and a certain number of those players ended their careers favorably, then the player in question has about that likelihood of ending his career favorably.

PECOTA uses as many as 100 comparable players in order to form its estimates. For purposes of this exercise, we will restrict things to the top 20 comparables, as listed on Rodriguez’s and Pujols’ PECOTA cards. Here, for example, are A-Rod’s best 20 comparables:

1.  Dale Murphy
2.  Mike Schmidt
3.  Tony Perez
4.  Sal Bando
5.  Johnny Bench
6.  Frank Robinson
7.  Dave Winfield
8.  Ken Boyer
9.  Bobby Bonds
10. Gil Hodges
11. Pedro Guerrero
12. Eddie Murray
13. Doug DeCinces
14. Chet Lemon
15. Vern Stephens
16. Eddie Mathews
17. Dick Allen
18. Reggie Smith
19. Richie Sexson
20. Rocky Colavito

Intuitively you will recognize right away that comparables like Mike Schmidt and Dave Winfield–players who were productive well into their 30s–are favorable, while others like Pedro Guerrero and Dale Murphy are unfavorable. You may also have some questions about some of the comparables further down the list, and it’s worth noting that the PECOTA comparables are motivated mostly by performance during a three-season period, and not over a player’s entire career. If the goal of PECOTA were to produce career forecasts, rather than single-season forecasts, I might have done things a bit differently, and that is a limiting factor here.

Let’s use Winfield as our example and think about how we might use his career to make some inferences about how Rodriguez is going to perform in the remainder of his. Winfield hit a raw total of 311 homers from age 29 onward. If we combine that number with the 381 that Rodriguez hit through age 28, we come up with 692, a fair bit short of Aaron. But this math underestimates the production implied for Rodriguez for a couple of reasons:

  1. Rodriguez is playing in an era in which home runs are quite a bit more common.

  2. Rodriguez was a more prolific home-run hitter than Winfield up through age 28, even after accounting for this difference in eras.

The way to get around these problems is not to compare Winfield’s production to Rodriguez’s directly, but rather to compare Winfield’s production to Winfield’s. So for example we might estimate that, based on his career path to date and the run-scoring context that he was performing in, Winfield would hit 32 home runs at age 30. In fact he hit 37, or about 16% more than our prediction for him. The implication is that, if Winfield’s career path tells us something about Rodriguez’s, then Rodriguez would also hit about 16% more home runs than our estimate for Rodriguez’s productivity at age 30.

So if we predicted, based on his previous performance, that Rodriguez would hit 40 home runs at age 30, his implied productivity from Winfield’s career is 16% more than 40, or about 46 homers. Notice that this resolves some of the problem from having somewhat weaker players like Doug DeCinces listed as comparables for Rodriguez. PECOTA is not really saying that Rodriguez is going to perform like Doug DeCinces going forward. Rather, it’s saying that, if DeCinces tended to outperform a reasonable baseline prediction of DeCinces’ production, then he’s just similar enough to Rodriguez that we can infer that Rodriguez is also going to outperform his baseline. This is a subtle distinction, but an important one, and one of the backbones of PECOTA.
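For readers who like to see the arithmetic spelled out, here is a minimal sketch of that scaling step. The function name `implied_production` and its parameter names are my own illustrative choices, not anything taken from PECOTA itself; the numbers are the Winfield and Rodriguez figures quoted above.

```python
def implied_production(comp_actual, comp_baseline, target_baseline):
    """Scale the target player's baseline by how much the comparable
    beat (or missed) his own baseline in the same age season.

    All three names are hypothetical labels for the quantities in the text.
    """
    if comp_baseline <= 0:
        raise ValueError("comparable baseline must be positive")
    ratio = comp_actual / comp_baseline  # e.g. 37 / 32, about 1.16
    return target_baseline * ratio


# Winfield at age 30: baseline prediction of 32 HR, actual 37 HR.
# Rodriguez's baseline at age 30 is taken as 40 HR, as in the example.
implied = implied_production(comp_actual=37, comp_baseline=32, target_baseline=40)
print(round(implied))  # 37/32 * 40 = 46.25, which rounds to 46
```

Note that the comparable's raw home run total never touches Rodriguez's number directly; only the ratio of actual to expected does, which is why a lesser hitter can still be an informative comparable.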

Back to Winfield. The following is the number of home runs that Winfield actually hit in each season from age 29 onward, as well as the number of implied Rodriguez home runs based on the method explained above.

Year    Age             Actual HR   Implied A-Rod HR
1981    29              13          37
1982    30              37          55
1983    31              32          52
1984    32              19          30
1985    33              26          38
1986    34              24          33
1987    35              27          32
1988    36              25          37
1989    37              0           0
1990    38              21          31
1991    39              28          39
1992    40              26          39
1993    41              21          31
1994    42              10          18
1995    43              2           4
Subtotal                311         476
A-Rod Career thru '04               382
Total                               858

From Winfield’s career path, we infer that Rodriguez will hit 476 more home runs, giving him 858 total, smashing Aaron’s record. Of course, Winfield’s late career went very favorably, even compared to other great players. By the way, if the numbers for 1981 look funny, it’s for good reason. We’re giving Rodriguez credit for the 13 home runs he’s hit on the season to date, which is somewhat higher than the rate that we’d otherwise estimated for him. Also, PECOTA adjusts everything upward to a 162-game schedule, so Winfield (and Rodriguez) are not punished for the strike year in 1981. Of course, this may be a dubious assumption, at least until a new CBA is signed.
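The bookkeeping behind that total is simple enough to check by hand, but a short sketch may help. The variable names here are assumptions of mine; the season-by-season values and the career-to-date figure are copied from the Winfield table as printed.

```python
# Implied Rodriguez home runs by age, copied from the Winfield table.
implied_by_age = {
    29: 37, 30: 55, 31: 52, 32: 30, 33: 38,
    34: 33, 35: 32, 36: 37, 37: 0,  38: 31,
    39: 39, 40: 39, 41: 31, 42: 18, 43: 4,
}

career_to_date = 382   # "A-Rod Career thru '04" as listed in the table
AARON_RECORD = 755     # Aaron's career home run mark

remaining = sum(implied_by_age.values())       # home runs still to come
projected_total = career_to_date + remaining   # implied career total

print(remaining, projected_total, projected_total > AARON_RECORD)
```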

Here is what we get when we perform the same analysis for each of Rodriguez’s top 20 comparables, excluding Richie Sexson, who is obviously pretty far from finishing his career.

Player           Implied A-Rod Career HR
Mike Schmidt     905
Dave Winfield    862
Frank Robinson   831
Tony Perez       776
Eddie Murray     769
Doug DeCinces    671
Ken Boyer        631
Reggie Smith     630
Sal Bando        621
Gil Hodges       613
Johnny Bench     600
Dale Murphy      588
Eddie Mathews    587
Rocky Colavito   571
Dick Allen       570
Bobby Bonds      565
Chet Lemon       536
Pedro Guerrero   531
Vern Stephens    492

Five of the 19 comparables, or 26%, have an implied HR total that surpasses Aaron’s mark of 755. So we estimate that Rodriguez has about a one-in-four chance of eventually topping Aaron, or perhaps a bit higher if we want to give extra credit to the players who rank higher on A-Rod’s comparables list, as PECOTA does in producing its forecasts.
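As a rough check on that count, the unweighted share can be computed straight from the table. This is only a sketch of the simple version: it treats every comparable equally and does not attempt the extra credit for higher-ranked comparables, since PECOTA's actual weights are not given here. The variable names are my own.

```python
AARON_RECORD = 755

# Implied A-Rod career home run totals, copied from the table above.
implied_totals = {
    "Mike Schmidt": 905, "Dave Winfield": 862, "Frank Robinson": 831,
    "Tony Perez": 776, "Eddie Murray": 769, "Doug DeCinces": 671,
    "Ken Boyer": 631, "Reggie Smith": 630, "Sal Bando": 621,
    "Gil Hodges": 613, "Johnny Bench": 600, "Dale Murphy": 588,
    "Eddie Mathews": 587, "Rocky Colavito": 571, "Dick Allen": 570,
    "Bobby Bonds": 565, "Chet Lemon": 536, "Pedro Guerrero": 531,
    "Vern Stephens": 492,
}

# Comparables whose implied total clears Aaron's mark.
over = [name for name, hr in implied_totals.items() if hr > AARON_RECORD]
share = len(over) / len(implied_totals)

print(len(over), len(implied_totals), f"{share:.0%}")  # 5 19 26%
```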

We can use a similar method to arrive at a “weighted mean” estimate for Rodriguez’s HR production in each remaining season of his career (for these purposes, we can put Sexson back in the mix, accounting for his production at age 29 but excluding him from the average thereafter). This won’t tell us anything about probability, but it does give us a good over/under number to shoot for:

A-Rod Weighted Mean Projection
Year   Age    Implied HR
2005   29     40
2006   30     39
2007   31     42
2008   32     33
2009   33     29
2010   34     24
2011   35     17
2012   36     15
2013   37     11
2014   38     8
2015   39     5
2016   40     3
2017   41     1
Subtotal      266
Career Total  647

So a good over/under looks to be about 650 home runs for Rodriguez. For what it’s worth, I’d take the over, especially considering that players are lasting longer nowadays than they used to.
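The weighted mean behind that projection can be sketched as follows. Everything in the example data is hypothetical: the weights and the season-by-season paths are made-up numbers for illustration, not PECOTA output, and the function name `weighted_mean_by_age` is mine. The one piece of logic taken from the description above is that a still-active comparable, like Sexson, counts only at the ages for which he has data. I am also assuming that a retired comparable carries explicit zeros for the seasons after he stopped playing, so that he pulls the average down rather than dropping out.

```python
def weighted_mean_by_age(comps):
    """comps is a list of (weight, {age: implied_hr}) pairs.

    A comparable contributes to an age only if that age appears in his
    path, so a still-active player drops out of later averages.
    """
    ages = sorted({age for _, path in comps for age in path})
    projection = {}
    for age in ages:
        # Weighted sum and total weight over comparables with data at this age.
        num = sum(w * path[age] for w, path in comps if age in path)
        den = sum(w for w, path in comps if age in path)
        projection[age] = num / den
    return projection


# Hypothetical weights and implied HR paths, for illustration only.
comps = [
    (3.0, {29: 40, 30: 50, 31: 45}),  # finished career
    (2.0, {29: 35, 30: 30, 31: 20}),  # finished career
    (1.0, {29: 45}),                  # still active: only age 29 is known
]

projection = weighted_mean_by_age(comps)
for age, hr in projection.items():
    print(age, round(hr, 1))
```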

Now, running the same calculations for Pujols:

Player           Implied Pujols Career HR
Hank Aaron       907
Frank Robinson   783
Eddie Murray     761
Carlton Fisk     733
Mark McGwire     657
Jack Clark       557
Jose Canseco     544
John Olerud      515
Rico Carty       508
Rocky Colavito   499
Johnny Bench     466
John Mayberry    464
Will Clark       460
Jeff Burroughs   351
Bob Robertson    263
Ron Blomberg     218

Three of the 16 comparables, or about 19%, pass the 755 threshold, while one other comes very close (again, we’re excluding players like Manny Ramirez who have a lot of their careers left to play). Note that the variance on these estimates is a little higher, which is natural since Pujols is at an earlier stage of his career; it is easy to forget just how good Jeff Burroughs was in the early going.
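One rough way to check the claim about variance is to compare the spread of the two lists of implied career totals. This is only a sketch using the sample standard deviation from Python's standard library; it is my own choice of yardstick, not a PECOTA calculation, and the two lists are copied from the tables in this article.

```python
from statistics import mean, stdev

# Implied career HR totals from the Rodriguez comparables table.
arod = [905, 862, 831, 776, 769, 671, 631, 630, 621, 613,
        600, 588, 587, 571, 570, 565, 536, 531, 492]

# Implied career HR totals from the Pujols comparables table.
pujols = [907, 783, 761, 733, 657, 557, 544, 515,
          508, 499, 466, 464, 460, 351, 263, 218]

for name, totals in (("Rodriguez", arod), ("Pujols", pujols)):
    # Mean and sample standard deviation of the implied career totals.
    print(name, len(totals), round(mean(totals)), round(stdev(totals)))

print("Pujols spread is wider:", stdev(pujols) > stdev(arod))
```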

Pujols Weighted Mean Projection
Year   Age    Implied HR
2005   25     38
2006   26     39
2007   27     33
2008   28     33
2009   29     30
2010   30     29
2011   31     29
2012   32     27
2013   33     24
2014   34     26
2015   35     22
2016   36     17
2017   37     17
2018   38     9
2019   39     8
2020   40     6
2021   41     3
2022   42     4
2023   43     2
Subtotal      395
Career Total  555

This is a career path at least something in the spirit of Aaron’s, with excellent consistency down the line, which has been a hallmark of Pujols’ career to date. But there are also enough comparables whose careers ended unfavorably that Pujols’ estimate begins to tail off quite a bit after about age 34, especially as Pujols does not run well and is not an especially good “athlete.” This point is underscored by looking at the chart below, which compares Aaron’s actual production to the weighted-mean projections for Rodriguez and Pujols.

It is a bit incomplete to say that Aaron’s consistency was the reason for his success. More specifically, it was Aaron’s consistency late in his career that allowed him to break Ruth’s record. Rodriguez has a substantial head start on Aaron at present, and we project that he’ll remain ahead of Aaron’s pace up through about age 37. However, a sampling of his comparables reveals that A-Rod has a good chance to hit the wall at right about that point, just as Mark McGwire did. Being a productive player at age 37 is in many ways much harder than being a superstar at age 27; that’s why Aaron was so remarkable.