Checking the Numbers: Simulating the Triple Crown

September 1, 2010

At the end of last week I wrote that a Triple Crown is not a far-fetched feat this season. Miguel Cabrera is very unlikely to supplant Josh Hamilton atop the American League batting leaderboard, but in the National League, sluggers Joey Votto and Albert Pujols rank first or second in all three categories. To make matters more interesting, the two are within striking distance of each other in every category, meaning that over the next month we might witness a race almost as noteworthy as the one for post-season berths. The main reason I argued that a Triple Crown could be achieved this year is that the number of specialists had declined; that is, there didn’t seem to be anyone running away with the batting title who didn’t hit home runs or knock runners in, and Ryan Howard was not going to mash 45-plus homers this season.

In the time between that article and this one, the picture has changed a bit. A few players have emerged as potential spoilers in each of the three categories, which makes the jobs of Votto and Pujols that much more difficult. To gauge where the two primary candidates now stand, and to satisfy the requests of some commenters on the prior article, I ran every player who could conceivably win one of the three Triple Crown legs through a simulation. The simulation, described here, models the game as accurately as I could manage: it takes each hitter’s projected rates over the rest of the season, compares them to the allowed rates of randomly selected pitcher archetypes, and uses the odds ratio to determine the likelihood that specific events occur. Essentially, the simulation is a spreadsheet version of the Strat-o-Matic game. Ironically, when I wrote the piece describing my simulation last year, the topic was Pujols and the Triple Crown.

Under the Hood

To give a bit of background on the simulation for those who don’t want to navigate away from this page, the first step is determining the rest-of-season projections, which I did by aggregating several available ones on the web. After that, the raw numbers are converted to per-plate appearance percentages. If a player projects to bat 133 more times and 83 of those will be outs, his out rate is 62.41 percent. The same is done for all other events. Next, there are rows for all 133 of the remaining PAs in the spreadsheet, with binary outcomes in each of the fields. In other words, if a home run occurs, every other field will show a zero while a one appears in that home run cell. Determining what event results is accomplished by utilizing the RAND function in Excel as well as the odds ratio.
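As a rough illustration of the conversion step, here is a minimal Python sketch rather than the spreadsheet itself. Only the 133 PAs and 83 outs come from the example above; the split of the remaining 50 PAs, and the names `per_pa_rates` and `projection`, are invented for illustration.

```python
def per_pa_rates(projected_counts):
    """Convert projected event counts into per-plate-appearance rates."""
    total_pa = sum(projected_counts.values())
    return {event: count / total_pa for event, count in projected_counts.items()}

# Hypothetical rest-of-season projection summing to 133 PAs.
# Only the 83 outs are taken from the text; the other counts are made up.
projection = {
    "out": 83,
    "single": 21,
    "double": 7,
    "triple": 1,
    "home_run": 8,
    "walk": 13,
}

rates = per_pa_rates(projection)
print(f"out rate: {rates['out']:.2%}")  # out rate: 62.41%
```

Because every rate shares the same denominator, the rates sum to one, which is what lets a single random draw select exactly one event per PA.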

For each PA, I set the simulation up to randomly assign a pitcher archetype (not a specific pitcher, but a type of pitcher with its associated rates) to “face” the batter. This allows for the possibility that someone will face an inordinate number of easy or tough pitchers down the stretch, while also giving everyone the chance to face a blend of all types. Over 25,000 simulations, we would expect the archetypes to wash out, but there are definitely runs wherein Pujols faces 50 percent Kyle Kendrick clones to just 10 percent for Votto. Once the pitcher is assigned to the PA, the odds ratio combines the rate at which the pitcher allows an event with the rate at which the hitter produces that same event. If our hypothetical hitter has a 62.41 percent shot of making an out, but he is facing a pitcher who records outs 67.20 percent of the time, the odds ratio says that the rate of an out occurring in that PA is actually 64.65 percent.
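For readers who want to see the matchup step in code, the sketch below uses the common form of the odds ratio method, which also needs a league-average baseline. That baseline is not given in the article, so `LEAGUE_OUT_RATE` is an assumed value, and the archetype names and their rates (other than the 67.20 percent example pitcher) are hypothetical. The result will therefore land near, but not necessarily on, the 64.65 percent quoted above.

```python
import random

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

def odds_ratio_rate(batter_rate, pitcher_rate, league_rate):
    """Combine batter and pitcher rates for one event via the odds ratio."""
    combined = odds(batter_rate) * odds(pitcher_rate) / odds(league_rate)
    return combined / (1.0 + combined)

# Hypothetical archetypes and the out rates they allow (invented values,
# except 0.6720, which is the example pitcher from the text).
ARCHETYPE_OUT_RATES = {
    "strikeout_ace": 0.7100,
    "example_pitcher": 0.6720,
    "soft_tosser": 0.6400,
}
LEAGUE_OUT_RATE = 0.65  # assumption: the league baseline is not stated in the article

rng = random.Random(2010)
archetype = rng.choice(list(ARCHETYPE_OUT_RATES))  # random archetype for this PA
matchup = odds_ratio_rate(0.6241, ARCHETYPE_OUT_RATES[archetype], LEAGUE_OUT_RATE)
print(f"{archetype}: out rate in this PA = {matchup:.2%}")
```

One property worth noting: when the pitcher's rate equals the league rate, the formula returns the batter's own rate unchanged, so an average pitcher leaves the hitter's projection alone.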

The RAND function then comes into play as a comparison to the calculated rates. RAND produces a random number between zero and one. If the rate of making an out is 64.65 percent, and the first PA row produces a RAND of 38.20 percent, then an out occurs on the play. If the RAND is 74.19 percent, which is above the rate of making outs, the simulation then looks for the next event. In this hypothetical, let’s say singles occur 15.79 percent of the time. The binary field for singles in that PA row will therefore record a “one,” indicating that a single occurred, if the RAND number is above 64.65 percent but below 80.44 percent (64.65+15.79).
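The comparison against RAND amounts to stacking the event rates into cumulative bands and finding the band that contains the random draw. A minimal Python sketch, using the adjusted 64.65 percent out rate and the 15.79 percent singles rate from the example; the `"other"` bucket simply lumps every remaining event together and is an assumption made to keep the example short.

```python
import random

def pick_event(event_rates, draw):
    """Walk the events in order and return the one whose band contains draw."""
    upper = 0.0
    for event, rate in event_rates:
        upper += rate
        if draw < upper:
            return event
    return event_rates[-1][0]  # floating-point safety net for draws near 1.0

# Ordered bands: outs cover [0, 0.6465), singles cover [0.6465, 0.8044),
# and the lumped "other" bucket covers the rest.
EVENT_RATES = [("out", 0.6465), ("single", 0.1579), ("other", 0.1956)]

print(pick_event(EVENT_RATES, 0.3820))  # out
print(pick_event(EVENT_RATES, 0.7419))  # single
print(pick_event(EVENT_RATES, random.random()))  # one simulated PA
```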

The incredibly nerdy but very cool process is then automated over every PA for every player for a predetermined number of runs; it can run through those same 133 PAs thousands of times. For RBI, I used a combination of the per-PA percentage as well as OBI percentage, with a bit of a dependency on home runs; after all, we can’t have someone with more homers than steaks. It might not be perfect, but it gets the job done; I’m always open to suggestions if you have any.
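The automation step can be sketched as a loop that replays the same 133 PAs many times and stores a season total from each run. This simplified Python version tracks only home runs, skips the pitcher adjustment and the RBI logic entirely, and reuses an invented event split (only the 133 PAs and 83 outs come from the article).

```python
import random

EVENTS = ["out", "single", "double", "triple", "home_run", "walk"]
WEIGHTS = [83, 21, 7, 1, 8, 13]  # hypothetical counts per 133 PAs

def simulate_home_runs(runs, pa=133, seed=0):
    """Replay the remaining PAs `runs` times and return the HR total of each run."""
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        season = rng.choices(EVENTS, weights=WEIGHTS, k=pa)  # one outcome per PA
        totals.append(season.count("home_run"))
    return totals

totals = simulate_home_runs(2000)
print(f"mean HR: {sum(totals) / len(totals):.2f}, max HR: {max(totals)}")
```

The spread of those totals, not just the mean, is what matters for a leaderboard race: a player projected for eight more homers will hit twelve or more in some share of the runs.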

Worthy of Discussion

Before getting into the results, there are a few things worth discussing. First, the reason a simulation is the best tool for a calculation like this is that the Triple Crown is not composed of independent components. A player with a higher batting average can conceivably knock more runners in, just as we should expect a player with a high RBI tally to amass a good number of home runs. These aren’t concrete rules, but expectations, and there is enough evidence that the numbers are intertwined to suggest that straight multiplication of individual probabilities won’t work. That method only works when the events are independent.
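A toy Python experiment shows why multiplying the individual probabilities falls short. Two identical hypothetical hitters each play out 133 PAs; because every home run is also a hit, leading in hits and leading in homers tend to happen together, so the joint frequency exceeds the product of the two separate frequencies. The rates are invented, and hits stand in for batting average to keep the sketch short.

```python
import random

EVENTS = ["out", "single", "extra_base", "home_run", "walk"]
WEIGHTS = [0.62, 0.16, 0.06, 0.06, 0.10]  # hypothetical per-PA rates
HITS = {"single", "extra_base", "home_run"}

def season(rng, pa=133):
    """Simulate one hitter's remaining PAs; return (hits, home runs)."""
    results = rng.choices(EVENTS, weights=WEIGHTS, k=pa)
    hits = sum(1 for r in results if r in HITS)
    return hits, results.count("home_run")

def lead_frequencies(runs=4000, seed=1):
    """How often hitter A leads B in hits, in homers, and in both at once."""
    rng = random.Random(seed)
    hit_leads = hr_leads = both = 0
    for _ in range(runs):
        a_hits, a_hr = season(rng)
        b_hits, b_hr = season(rng)
        leads_hits = a_hits > b_hits
        leads_hr = a_hr > b_hr
        hit_leads += leads_hits
        hr_leads += leads_hr
        both += leads_hits and leads_hr
    return hit_leads / runs, hr_leads / runs, both / runs

p_hits, p_hr, p_both = lead_frequencies()
print(f"product of separate odds: {p_hits * p_hr:.3f}, simulated joint: {p_both:.3f}")
```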

Second, this simulation was not run for every single player in the league, but rather for everyone who could conceivably win one of the legs. For instance, I consider Adam LaRoche to have a shot at winning the RBI title. Sure, it’s an outside shot, but it is still possible, and so excluding him based on a belief that he is unlikely to lead the league could skew results. On the other hand, including someone with 62 RBI at this juncture would only add time to the simulation without altering any of the results; it isn’t important to me if a player like that leads in one leg of the crown one time out of 25,000 simulations. The players I ran the simulation for are as follows: Albert Pujols (BA-HR-RBI), Joey Votto (BA-HR-RBI), Carlos Gonzalez (BA-HR-RBI), Martin Prado (BA), Omar Infante (BA), Placido Polanco (BA), Adam Dunn (HR-RBI), Dan Uggla (HR-RBI), Mark Reynolds (HR-RBI), Prince Fielder (HR-RBI), Ryan Howard (RBI), Adam LaRoche (RBI), Casey McGehee (RBI), and David Wright (RBI).

Lastly, perhaps the most interesting discussion point of all is the idea that Infante, whose selection to the All-Star team produced more disbelief than Livan Hernandez’s sub-4.00 ERA this year, might screw everything up. Infante is hitting .341/.374/.454 this season in 360 plate appearances. To qualify for the batting title, he will need at least 502 PAs by the end of the season. Thus, over the next 32 team games he needs to come up with 142, an average of 4.4 per game. Assuming he continues to play every day, this should be attainable, meaning that his numbers would need to plummet for him to not get in the way of the NL Central candidates.
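The qualification arithmetic is simple enough to check in a few lines of Python. The 502, 360, and 32 are the figures quoted above; the function name is just for illustration.

```python
REQUIRED_PA = 502  # plate appearances needed to qualify for the batting title

def pa_pace_needed(current_pa, team_games_left, required_pa=REQUIRED_PA):
    """Return (PAs still needed, PAs needed per remaining team game)."""
    needed = max(0, required_pa - current_pa)
    return needed, needed / team_games_left

needed, per_game = pa_pace_needed(current_pa=360, team_games_left=32)
print(f"{needed} more PA, {per_game:.2f} per team game")  # 142 more PA, 4.44 per team game
```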

After inputting everyone’s current numbers as well as a combination of various rest-of-season projections, I set the simulation into motion, running through the remainder of the season 10,000 times for each of the aforementioned players.

Simulation Results

First things first: Out of the 10,000 simulations, a Triple Crown was achieved 1,203 times, or a little over 12 percent. The achiever was overwhelmingly Pujols, who led in all three legs 1,095 of those 1,203 times. Perhaps this has a lot to do with his track record and the fact that most in-season projection systems serve as a very gradually moving needle. Pujols would have entered the year with a better projection and so his expected performance from here on out is thought to be higher than Votto's. Then again, we don’t know much about Votto, as he hasn’t been playing for all that long, so it is possible that the systems utilize sound methodology while underrating the Reds' first baseman. How did these guys only achieve the Triple Crown in 12 percent of the runs?

Well, the most common reason is that pesky Infante, who won the batting title in over 6,000 of the runs. The number would likely be higher were it not for his falling short of qualifying in a good number of the sims. Interestingly, Votto actually led the league in hitting more often than Pujols, even though the latter achieved the Triple Crown more frequently. Pujols also won the home run title in 8,538 of the 10,000 runs, with Votto and Adam Dunn chiming in at around 500-600 leads apiece. Suffice to say, the systems are really confident that Pujols will be the king of clout this season. In the RBI department, Pujols wins over 7,000 times, with Votto at just about 2,200 league leads.

If I adjust Votto’s rest-of-season projection to assume that he will perform somewhat similarly to his current rate, the numbers shift, but not by much. A Triple Crown is achieved in 15 percent of the runs, with Pujols still dwarfing Votto. Regardless of anyone else, the big key to this entire thing is Infante, who seems to have trouble falling completely off in the simulations, suggesting that if he can tally up the requisite number of plate appearances, it is very unlikely the feat will occur for the first time since 1967. Either way, it seems safe to say that if a Triple Crown is to happen, it will be Pujols over Votto, but that the ball is no longer completely in their court due to that meddling Infante!

Eric Seidman

acarlisle

9/01

If Infante doesn't qualify for the batting title, I believe you can fill in the PAs he's short with hitless ABs and see if he still wins. Like Tony Gwynn in 1996.

Does your simulation still have him spoiling if he's a couple of PA short, but would win anyway?

EJSeidman

9/01

No, it is set up so that if his total falls short of 502, his results are not stored. As far as I know you need the 502 PA to qualify. If the hitless-AB thing is an aspect of the rule that I am unaware of, that is interesting to program in.

dreaming

9/01

It is a rule. If you fall short the requisite amount of gutlessness ABS are added

dreaming

9/01

Oops. Hitless at-bats. Stupid auto-correct!

EJSeidman

9/01

Okay thanks. Good to know. Isn't this game great? I'll see if I can program it in.

EJSeidman

9/01

After re-running with the Infante Clause in place, the numbers really don't change that drastically. Infante wins the title about 7,200 times instead of 6,100, which is certainly significant, and the Triple Crown percentage drops from 12.03 (or ~15 with the more optimistic Votto projection) to around 7-8 percent, but the same proportions are intact. It still occurs a meaningful share of the time, with Pujols accounting for upwards of 90 percent of the Triple Crowns that are achieved.
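The hitless at-bat adjustment discussed in this thread can be sketched in Python as follows. This is an illustration of the rule as the commenters describe it, not the code behind the re-run above, and the season line in the example is invented.

```python
def title_average(hits, at_bats, plate_appearances, required_pa=502):
    """Batting average for the title race, padding any PA shortfall with hitless at-bats."""
    shortfall = max(0, required_pa - plate_appearances)
    return hits / (at_bats + shortfall)

# Hypothetical season line, 8 PAs short of qualifying: the 8 missing PAs
# are charged as hitless at-bats before comparing against other hitters.
print(f"{title_average(hits=160, at_bats=460, plate_appearances=494):.3f}")
```

A hitter who still leads the league after this penalty wins the title despite falling short of 502 PAs.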

bravejason

9/01

"...after all, we can’t have someone with more homers than steaks." Can I get an explanation as to just what this means?

BigEasy

9/01

Steak = RBI

bravejason

9/01

Since when did folks start calling RBIs steaks? I've never heard of that.

EJSeidman

9/01

For a while, certainly longer than I've been alive (24 yrs). It comes from RBI sounding like ribeye, which is a steak.

rowenbell

9/01

Interesting how different passionate baseball fans can end up with different gaps in their knowledge base. I too had never heard of RBIs being called steaks (but Eric's explanation makes perfect sense); on the flip side, Eric apparently hadn't heard of the "phantom AB" rule for determining batting champions, which I would have mentally filed in the "everybody knows that" file.

TheRealNeal

9/01

"Steaks" has got to be a regional thing. I've lived in a few and never heard it.

TheRealNeal

9/01

Any batting titles in there for Starlin Castro? If my memory of the WGN broadcast is accurate, he'd be the first 20 year old to win an NL batting title and maybe the youngest ever in either league, something even more rare than the Triple Crown.

If a chart isn't too big, any chance you could add it to the article?

Pronk4848

9/01

Great name to consider; I didn't even consider him at first either! That's probably due to the Cubs' on-field performance this year. He only needs 121 ABs to qualify; that's an average of 4.03 ABs per game the rest of the season. However, at .315, it would take Infante pulling a near o-for for Starlin to have the numbers to be at the top. At that point, it would be strictly a two-man show: Votto and Pujols.

Pronk4848

9/01

Great article, Eric. I am just curious as to why you did not incorporate BABIP into the simulation. Is it because it has a lot of statistical noise within the year being observed? With Infante having a fantastic year in BABIP (.379 in 2010, .314 for career), you would think that it would have a very slight tendency to regress toward the mean this final month of the season; so slight that it could possibly cost him the batting title if he qualifies?

andland

9/01

I think the rest of season projections do just that. They look at what the player was projected at before the year started and update it based on how they have done so far this year. Infante's projection is probably regressed to the mean a bit, but the season is coming to a close so a low projection doesn't have a huge effect.

EJSeidman

9/01

Mike, yeah the BABIP would be addressed in the ROS projections.

Richie

9/01

Good stuff, thank you!

stevekohlhagen

9/01

this is an excellent article, eric. thanks for the work---the methodology is a great read as well. swk

newsense

9/01

As an alternative method for estimating RBI, how about using linear weights values for each event in the simulation? It's not perfect but it could serve as a check on the method you're using.

andland

9/01

For modeling RBI, you can find the percentage of times a batter will knock in 0,1,2,3,4 runners in each outcome of a plate appearance.
For an out, it may be 0 - 98%, 1 - 2%, 2 - 0%, 3 - 0%, 4 - 0%.
For a HR, it may be 0 - 0%, 1 - 60%, 2 - 20%, 3 - 15%, 4 - 5%.

I think this would make the simulation more realistic.

To get more sophisticated, you could use batting order and quality of teammates to adjust the %'s, so the RBI %'s would be different for each player. You could even separate strikeouts, ground outs, and fly outs to get different probabilities of scoring a run on an out.
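The per-outcome RBI idea in this comment could be sketched in Python like this. The two distributions are the illustrative percentages from the comment, not measured values, and the function names are hypothetical.

```python
import random

# Probability of driving in 0, 1, 2, 3, or 4 runs, by PA outcome.
# These are the illustrative figures from the comment above.
RBI_DISTRIBUTIONS = {
    "out": [0.98, 0.02, 0.00, 0.00, 0.00],
    "home_run": [0.00, 0.60, 0.20, 0.15, 0.05],
}

def sample_rbi(event, rng):
    """Draw an RBI count for one plate appearance with the given outcome."""
    return rng.choices(range(5), weights=RBI_DISTRIBUTIONS[event], k=1)[0]

def expected_rbi(event):
    """Average RBI per occurrence of the given outcome."""
    return sum(n * p for n, p in enumerate(RBI_DISTRIBUTIONS[event]))

rng = random.Random(7)
print(sample_rbi("home_run", rng), round(expected_rbi("home_run"), 2))
```

Tying RBI to the outcome this way guarantees at least one RBI per home run, which addresses the "more homers than steaks" concern raised in the article.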

Meurso

9/03

I don't suppose we could convince Eric to update this every few days via the comments section?

If he stays hot, I'm interested to see if/when Carlos Gonzalez emerges as a threat to win the Crown himself, rather than just being a potential spoiler.

EJSeidman

9/03

Meurso -- I was thinking the same exact thing. I'm not sure how I would go about doing it -- maybe like John does the playoff odds?

Meurso

9/03

That would be very cool, Eric. But really whatever you have time for would be great, even if it's just a quick summary of any interesting developments/changes.
