September 1, 2010
Checking the Numbers
Simulating the Triple Crown
At the end of last week I wrote about the idea that a Triple Crown is not a far-fetched feat this season. Miguel Cabrera is very unlikely to supplant Josh Hamilton atop the American League batting leaderboard, but in the National League, sluggers Joey Votto and Albert Pujols find themselves ranked first or second in all three categories. To make matters more interesting, the two are within striking distance of each other in every category, meaning that over the next month we might witness a race almost as noteworthy as the one for post-season berths. The main reason I argued that a Triple Crown could be achieved this year is that the number of specialists had declined; that is, there didn’t seem to be anyone running away with the batting title who didn’t also hit home runs or knock runners in, and Ryan Howard was not going to mash 45-plus homers this season.
In the time between that article and this one, the picture has changed a bit. A few players have emerged as potential spoilers in each of the three categories, which makes the jobs of Votto and Pujols that much more difficult. In order to gauge where the two primary candidates now stand, as well as to satisfy the requests of some commenters on the prior article, I ran every player who could conceivably win one of the three Triple Crown legs through a simulation. The simulation, described here, models the game as accurately as I could manage: it takes each hitter’s projected rates over the rest of the season, compares them to the rates allowed by randomly selected pitcher archetypes, and uses the odds ratio to determine the likelihood that specific events occur. Essentially, the simulation is a spreadsheet version of the Strat-O-Matic game. Coincidentally, when I wrote the piece describing my simulation last year, the topic was Pujols and the Triple Crown.
Under the Hood
To give a bit of background on the simulation for those who don’t want to navigate away from this page, the first step is determining the rest-of-season projections, which I did by aggregating several that are available on the web. After that, the raw numbers are converted to per-plate-appearance percentages. If a player projects to bat 133 more times and 83 of those plate appearances will be outs, his out rate is 62.41 percent. The same is done for all other events. Next, the spreadsheet has a row for each of the 133 remaining PAs, with a binary field for each possible outcome. In other words, if a home run occurs, a one appears in the home run cell and every other field shows a zero. Which event results is determined using the RAND function in Excel along with the odds ratio.
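As a rough sketch of that conversion step, here is the same arithmetic in Python rather than Excel. Only the 133 plate appearances and 83 outs come from the example above; the split of the other 50 plate appearances is invented for illustration, and the names projected_counts and per_pa_rates are hypothetical.

```python
# Hypothetical rest-of-season projection. Only the 133 PA total and the
# 83 outs come from the article; the other counts are made up.
projected_counts = {
    "out": 83,
    "single": 21,
    "double": 6,
    "triple": 1,
    "home_run": 8,
    "walk": 14,
}


def per_pa_rates(counts):
    """Convert projected event counts into per-plate-appearance rates."""
    total_pa = sum(counts.values())
    return {event: n / total_pa for event, n in counts.items()}


rates = per_pa_rates(projected_counts)
print(f"out rate: {rates['out']:.2%}")  # out rate: 62.41%
```

Because every plate appearance ends in exactly one event, the rates sum to one, which is what lets a single random number pick the outcome later on.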
For each PA, I set the simulation up to randomly assign a pitcher archetype (not a specific pitcher, but a specific type of pitcher and his associated rates) to “face” the batter. This allows for the possibility that someone will face an inordinate number of easy or tough pitchers down the stretch, while also giving everyone the opportunity to face a blend of the whole league. Over 25,000 simulations, we would expect the archetypes to wash out, but there are definitely runs wherein Pujols faces 50 percent Kyle Kendrick clones to just 10 percent for Votto. Once the pitcher is assigned to the PA, the odds ratio combines the rate at which the pitcher allows an event with the rate at which the hitter produces that same event. If our hypothetical hitter has a 62.41 percent shot of making an out, but he is facing a pitcher who records outs 67.20 percent of the time, the odds ratio says that the probability of an out occurring in that PA is actually 64.65 percent.
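For readers who want to see the odds ratio in action, here is a minimal sketch of the standard form of the method, which scales the batter's odds by the pitcher's odds relative to a league baseline. The league out rate of 65 percent is my assumption for illustration; the text above does not state the baseline used, so the result lands near the 64.65 percent quoted rather than exactly on it.

```python
def odds(p):
    """Convert a probability into odds (p against 1 - p)."""
    return p / (1.0 - p)


def odds_ratio_rate(batter_rate, pitcher_rate, league_rate):
    """Combine batter and pitcher rates for one event via the odds ratio."""
    combined = odds(batter_rate) * odds(pitcher_rate) / odds(league_rate)
    return combined / (1.0 + combined)


# Batter and pitcher rates are from the example; league_rate is assumed.
matchup_out_rate = odds_ratio_rate(0.6241, 0.6720, league_rate=0.65)
print(f"matchup out rate: {matchup_out_rate:.2%}")
```

A useful sanity check on the formula: when the pitcher is exactly league average, the matchup rate collapses back to the batter's own rate.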
The RAND function then comes into play as a comparison to the calculated rates. RAND produces a random number between zero and one. If the rate of making an out is 64.65 percent, and the first PA row produces a RAND of 38.20 percent, then an out occurs on the play. If the RAND is 74.19 percent, which is above the rate of making outs, the simulation then looks to the next event. In this hypothetical, let’s say singles occur 15.79 percent of the time. Therefore, the binary field for singles in that PA row will record a “one,” indicating that a single occurred, if the RAND number is above 64.65 percent but below 80.44 percent (64.65 + 15.79).
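The lookup amounts to stacking the event rates end to end and seeing which band the random number falls into. Here is a small sketch in Python, with random.random() standing in for Excel's RAND. Only the 64.65 percent out rate and the 15.79 percent singles rate come from the example; the remaining rates are invented so that the total is 100 percent.

```python
import random

# Matchup rates in the order the simulation checks them. Only the out and
# single rates come from the article; the rest are hypothetical filler.
matchup_rates = [
    ("out", 0.6465),
    ("single", 0.1579),
    ("walk", 0.1000),
    ("double", 0.0450),
    ("home_run", 0.0450),
    ("triple", 0.0056),
]


def event_for_draw(draw, rates):
    """Return the event whose cumulative band contains the draw."""
    cumulative = 0.0
    for event, rate in rates:
        cumulative += rate
        if draw < cumulative:
            return event
    return rates[-1][0]  # guard against floating-point shortfall near 1.0


print(event_for_draw(0.3820, matchup_rates))  # out
print(event_for_draw(0.7419, matchup_rates))  # single
print(event_for_draw(random.random(), matchup_rates))  # varies per call
```

Singles occupy the 15.79 points directly above the out threshold, so a draw of 0.7419 clears the out band and lands on a single.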
The incredibly nerdy but very cool process is then automated over every PA for every player for a predetermined number of runs; it can run through those same 133 PAs thousands of times. For RBI, I used a combination of the per-PA percentage as well as OBI percentage, with a bit of a dependency on home runs; after all, we can’t have someone with more homers than steaks. It might not be perfect, but it gets the job done; I’m always open to suggestions if you have any.
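To make the automation concrete, here is a stripped-down sketch of the outer loop for a single hitter. It is a simplification under stated assumptions: the per-PA rates are hypothetical and held fixed, so the pitcher archetypes, the odds ratio adjustment, and the RBI logic described above are all left out. The function names are mine, not the spreadsheet's.

```python
import random

# Hypothetical fixed per-PA rates for one hitter (they sum to 1).
RATES = {
    "out": 0.6241,
    "single": 0.1579,
    "walk": 0.1053,
    "home_run": 0.0602,
    "double": 0.0451,
    "triple": 0.0074,
}


def simulate_run(rates, remaining_pa, rng):
    """Play out the remaining PAs once and tally each event."""
    events = list(rates)
    weights = [rates[event] for event in events]
    tally = dict.fromkeys(events, 0)
    for event in rng.choices(events, weights=weights, k=remaining_pa):
        tally[event] += 1
    return tally


def simulate_many(rates, remaining_pa, n_runs, seed=2010):
    """Repeat the rest of the season n_runs times with a seeded generator."""
    rng = random.Random(seed)
    return [simulate_run(rates, remaining_pa, rng) for _ in range(n_runs)]


runs = simulate_many(RATES, remaining_pa=133, n_runs=2000)
avg_hr = sum(run["home_run"] for run in runs) / len(runs)
print(f"average simulated home runs: {avg_hr:.1f}")
```

Each run is one possible rest of the season, and the leaderboard questions are answered by counting how often a player finishes on top across all of the runs.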
Worthy of Discussion
Before getting into the results, there are a few things worth discussing. First, the reason a simulation is the best tool for a calculation like this is that the three legs of the Triple Crown are not independent of one another. A player with a higher batting average can conceivably knock more runners in, just as we should expect a player with a high RBI tally to amass a good number of home runs. These aren’t concrete rules, but expectations, and there is enough evidence that the numbers are intertwined to suggest that straight multiplication of the individual probabilities won’t work. That method is only valid when the events are independent.
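A toy example shows how far off straight multiplication can be. In the sketch below, every home run also counts as an RBI, so the two totals move together. The per-PA rates and the thresholds of 10 homers and 24 RBI are arbitrary stand-ins for "leading the league," not figures from my spreadsheet.

```python
import random


def binomial(n, p, rng):
    """Count successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n))


def joint_vs_product(n_runs=5000, remaining_pa=133, seed=1):
    """Estimate P(HR goal), P(RBI goal), and P(both) for a toy hitter."""
    rng = random.Random(seed)
    hit_hr = hit_rbi = hit_both = 0
    for _ in range(n_runs):
        homers = binomial(remaining_pa, 0.06, rng)     # assumed HR rate
        other_rbi = binomial(remaining_pa, 0.10, rng)  # assumed non-HR RBI rate
        rbi = homers + other_rbi                       # every homer drives in a run
        reached_hr = homers >= 10
        reached_rbi = rbi >= 24
        hit_hr += reached_hr
        hit_rbi += reached_rbi
        hit_both += reached_hr and reached_rbi
    return hit_hr / n_runs, hit_rbi / n_runs, hit_both / n_runs


p_hr, p_rbi, p_both = joint_vs_product()
print(f"P(HR) * P(RBI) = {p_hr * p_rbi:.3f}")
print(f"P(HR and RBI)  = {p_both:.3f}")
```

Because the two legs are positively related, the simulated chance of reaching both goals comes out well above the product of the individual chances, which is why multiplying the legs together would understate the odds.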
Second, this simulation was not run for every single player in the league, but rather for everyone who could conceivably win one of the legs. For instance, I consider Adam LaRoche to have a shot at winning the RBI title. Sure, it’s an outside shot, but it is still possible, and so excluding him based on a belief that he is unlikely to lead the league could skew results. On the other hand, including someone with 62 RBI at this juncture would only add time to the simulation without altering any of the results; it isn’t important to me if a player like that leads in one leg of the crown one time out of 25,000 simulations. The players I ran the simulation for are as follows: Albert Pujols (BA-HR-RBI), Joey Votto (BA-HR-RBI), Carlos Gonzalez (BA-HR-RBI), Martin Prado (BA), Omar Infante (BA), Placido Polanco (BA), Adam Dunn (HR-RBI), Dan Uggla (HR-RBI), Mark Reynolds (HR-RBI), Prince Fielder (HR-RBI), Ryan Howard (RBI), Adam LaRoche (RBI), Casey McGehee (RBI), and David Wright (RBI).
Lastly, perhaps the most interesting discussion point of all is the idea that Infante, whose selection to the All-Star team produced more disbelief than Livan Hernandez’s sub-4.00 ERA this year, might screw everything up. Infante is hitting .341/.374/.454 this season in 360 plate appearances. To qualify for the batting title, he will need at least 502 PAs by the end of the season. Thus, over the next 32 team games he needs to come up with 142, an average of 4.4 per game. Assuming he continues to play every day, this should be attainable, meaning that his numbers would need to plummet for him to not get in the way of the NL Central candidates.
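The qualification arithmetic is quick to check. The 502 figure follows from the rule of 3.1 plate appearances per scheduled team game over a 162-game season; the other inputs are Infante's numbers quoted above.

```python
TEAM_GAMES = 162
PA_PER_TEAM_GAME = 3.1  # qualification standard for the batting title

required_pa = round(TEAM_GAMES * PA_PER_TEAM_GAME)  # 502
current_pa = 360
games_left = 32

needed_pa = required_pa - current_pa
needed_per_game = needed_pa / games_left
print(f"needs {needed_pa} more PA, or {needed_per_game:.1f} per game")
# needs 142 more PA, or 4.4 per game
```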
After inputting everyone’s current numbers as well as a combination of various rest-of-season projections, I set the simulation into motion, running through the remainder of the season 10,000 times for each of the aforementioned players.
First things first: Out of the 10,000 simulations, a Triple Crown was achieved 1,203 times, or a little over 12 percent. The achiever was overwhelmingly Pujols, who led in all three legs 1,095 of those 1,203 times. Perhaps this has a lot to do with his track record and the fact that most in-season projection systems serve as a very gradually moving needle. Pujols would have entered the year with a better projection and so his expected performance from here on out is thought to be higher than Votto's. Then again, we don’t know much about Votto, as he hasn’t been playing for all that long, so it is possible that the systems utilize sound methodology while underrating the Reds' first baseman. How did these guys only achieve the Triple Crown in 12 percent of the runs?
Well, the most common reason is that pesky Infante, who won the batting title in over 6,000 of the runs. The number would likely be higher were it not for his falling short of qualifying in a good number of the sims. Interestingly, Votto actually led the league in hitting more often than Pujols, even though the latter achieved the Triple Crown more frequently. Pujols also won the home run title in 8,538 of the 10,000 runs, with Votto and Adam Dunn checking in at around 500-600 leads apiece. Suffice it to say, the systems are really confident that Pujols will be the king of clout this season. In the RBI department, Pujols wins over 7,000 times, with Votto at just about 2,200 league leads.
If I adjust Votto’s rest-of-season projection to assume that he will perform somewhat similarly to his current rate, the numbers shift, but not by much. A Triple Crown is achieved in 15 percent of the runs, with Pujols still dwarfing Votto. Regardless of anyone else, the big key to this entire thing is Infante, who seems to have trouble falling completely off in the simulations, suggesting that if he can tally up the requisite number of plate appearances, it is very unlikely the feat will occur for the first time since 1967. Either way, it seems safe to say that if a Triple Crown is to happen, it will be Pujols over Votto, but that the ball is no longer completely in their court due to that meddling Infante!