September 7, 2009
Checking the Numbers
Pujols and the Simulation Gauntlet
While scoping out the season of the one and only Albert Pujols a couple of weeks ago, I attempted to quantify his chances of attaining the Triple Crown. At the time, Pujols led his league in dingers, stood deadlocked in the RBI race with Prince Fielder, and trailed Hanley Ramirez in batting average by a rather large margin. The methodology implemented in that piece was back-of-the-envelope at best, as the dependency of the inherent variables should have precluded the multiplication of separate probabilities. Since home runs automatically correlate to runs batted in as well as batting average, and because a higher batting average would, in theory, lead to more steaks, the three legs of the race are not independent of one another and therefore cannot be multiplied together to determine the Triple Crown likelihood. Though a more accurate process is unlikely to yield drastically different results than the 0.74 percent I found initially, the perfectionist in me felt it necessary to re-run the numbers through a more complex and accurate simulation in order to determine Pujols' chances.
Simulations are essentially methods designed to mimic real-life systems. Not all simulations are intensely quantitative, but this particular study qualifies as logical or mathematical at root, while dynamic and stochastic in nature. The latter two terms refer to the fact that the simulation operates under a set of assumptions-think of the true talent level of players as well as their rest-of-season projections-and works with random input, as will be described a bit later on. I have put off investigating how to model simulations and learn programming languages like Visual Basic and C# for far too long, and so instead of continuing to watch How I Met Your Mother re-runs it seemed like the right time to sit down and quench my thirst for knowledge. Many thanks go out to reader Andrew Landgraf, who provided an assist in getting my simulation feet drenched. (Or were they simulated feet?)
To start, the ultimate goal needs to be outlined, which in this case deals with the probability of a certain player winning the Triple Crown. Broken down further, the goal involves determining who will lead the league in batting average, home runs, and RBI at the conclusion of the season, and whether or not the same player tops all three leaderboards. With a small sample of games and plate appearances remaining, this isn't as simple as combining stats to date with rest of season projections and ranking the players in all three categories, as bizarre and unexpected results may surface; while Hanley Ramirez is unlikely to see his batting average fall below .340, he may face stiff competition or run into a dry spell and finish the year at .328. Instead, we want to run through a simulation of the rest of the season, add together the simulated raw totals in each event, and use those numbers in conjunction with the statistics posted up through that point, while repeating the process many times. The more iterations of a season run, the more likely we can get a sense of who should win what as well as the probability that certain players win two or three of the categories.
After outlining the goal, data to date as well as the rest-of-season projections were spidered into a spreadsheet, with the latter being converted into probabilities. If Ryan Braun is projected to make 133 more trips to the dish, 83 of which will end as an out, there is a 62.4 percent chance that he will be retired in any plate appearance from here on out. The simulation spreadsheet would consist of the rest-of-season event probabilities on top, a line consisting of accrued data to date, and a blank line beneath both of those, reserved for what will actually happen from here on out as dictated by the simulation. Those fields will be automatically calculated by the event sums after an iteration of the remaining plate appearances is run.
The sample table below diagrams what the top of the simulation spreadsheet looks like, prior to running through any iterations:
The next step involves actually populating that blank yellow row, which calls for the bulk of the simulation itself to be created. Referring back to Braun, who had 133 projected plate appearances, a row is created for each of the 133 PAs, with the same headings as the diagram above. Next to the sacrifice flies, however, a column titled RAND must be added. In Excel, the function =RAND() automatically generates a random number in between 0 and 1. With each stroke of the F9 key, or each addition and subtraction to any other portion of the spreadsheet, the number regenerates. The RAND function is then added to each row of the plate appearances portion of the simulation. Now, in its most basic form, the simulation may start to make some more sense:
If Braun has a 0.6241 chance of making an out in a given PA, and the first row has a RAND value of 0.3820, since the latter is less than the former, he would have made an out in that at-bat, were we to stop here and not incorporate anything else. Of course, we are going to incorporate another factor or two, but this comparison of the RAND to the event probabilities of the player is the real backbone of this type of simulation. If he has a 0.6241 shot of making an out and a 0.1579 shot of hitting a single, and the RAND value comes out as 0.7419, the PA results in a single since the RAND value exceeded the probability of making an out but remained lower than the probabilistic sum of making an out and singling.
The only real issue at the start involved modeling runs batted in, but after a ton of tinkering I set them to be a random integer based on the OBI% of the hitter in question, the average OBPs of preceding hitters, and home run totals. If we were only projecting five PAs for Braun, based on the table above with the RAND values, our initial table would be updated:
Setting the RAND filters in the PA columns can be done by utilizing the IF and AND functions, and once the formulas are dragged all the way down to the 133rd PA, voila, an entire recreation of the rest of the season for Ryan Braun has been simulated. The yellow row, which adds together all of the events in each of the simulated PA columns, works with the important constraint that the final plate appearances total cannot be larger than the number of projected plate appearances remaining. For Braun, this would call for the event summing to cease activity after PA number 133. When running simulations for multiple players, the number of PA rows should be set to the maximum total amongst the sampled players, but by adding in this specific constraint we can avoid summing together results in 150 PAs, inflating or deflating his data, if he only projects to come up 133 more times.
From here, we have two options: continue hitting the F9 key to run through his season several more times, taking note of the results, or write a Visual Basic script to automate the process. Either is fine, although the latter becomes much more efficient in just about every fashion when running through thousands upon thousands of iterations. A modifiable version of this Monte Carlo code can be found at XLSsports.com, and there are several others around the web capable of being fitted for the specific needs of the user, but the gist involves running through the season a predetermined amount of times and automatically reporting specified fields from each seasonal run into a separate worksheet within the file.
In this case, we may choose to set the number of simulations to 25,000, reporting season-ending batting average, OBP, SLG, HR, and RBI into a tab called "results," prosaically enough. Create a button on the sheet and assign to it this Monte Carlo macro, and with the click of a button you can now run through Braun's season 25,000 times and, upon finishing, click on the results tab to run an analysis counting the number of seasons in which he hits over 40 home runs, knocks in more than 130 and/or hits above .325.
To determine the chances of Pujols winning the Triple Crown, however, the simulation needs to be run for either every player in the league, or every player within realistic reach of one of the legs. For instance, Jimmy Rollins and his .245 batting average stands no chance of winning the batting title, making his simulation rather extraneous. With that in mind, I created a drop-down menu in the initial simulation page with VLOOKUP functions in the event-probability row at the top, incorporating the 15-20 or so players who could conceivably win the batting-average, home-run, or RBI titles. When someone other than Braun is clicked on, the new numbers automatically calculate for that player in each of the columns, and a short Visual Basic code beneath the Monte Carlo script calls for the running of a predetermined amount of simulations, automatically re-running everything for subsequent players in the menu.
Again, if we so chose, nothing more technically needs to be done to determine his chances. In fact, through one simulation of 10,000 seasons for all of these players, I found 54 instances in which Pujols won the Triple Crown, a likelihood of 0.54 percent that doesn't stray too dramatically from the 0.74 reported in the less-accurate probability method. I remained unsatisfied, however, and decided to introduce quality of opposing pitchers into the mix. This particular process began similarly to hitters, but instead of generating the exact data for specific pitchers I chose to design archetypes; as in, the name in the tab may say Tim Lincecum, but the data really just equates to what can be expected from a dynamite pitcher, with the inverse true of Adam Eaton.
Preceding the names and probabilities is a number and they are ordered by ascending quality. Lincecum would be number one, Carpenter number two, and Eaton and, say, Livan Hernandez at the bottom of the list. We can bicker all day about whether or not a certain pitcher's name belongs next to the number you may see in the table a paragraph or two below, but that would defy its purpose. The names are merely that, with their respective event probabilities being more a function of the type of pitcher they are.
After repeating the event probabilities for around 30-35 pitchers, a RANDBETWEEN column was added to the main spreadsheet. RANDBETWEEN works similarly to RAND except the user specifies the parameters and it returns integers only. In this case, =RANDBETWEEN(1,35), with a series of adjacent columns designated to look up the names and event probabilities of pitchers corresponding to those numbers. If this random-number generator adds a "four" to a column, it automatically adds in Cole Hamels to the Name column as well as Cole's event probabilities. The diagram below shows a sample of what the final simulation sheet might look like, with pitcher data hidden from view:
These pitcher columns are dragged down to correspond with each plate-appearance row, essentially selecting a random pitcher archetype to face a hitter in each of his trips to the dish. The question then becomes how to incorporate actual pitcher data into the equation so that their inclusion is more than just aesthetically pleasing and audibly neat upon explanation. If a hitter makes an out at a specific rate and a pitcher generates outs at a specific rate, how can we reconcile the two sets of skills?
Luckily, the Odds Ratio exists and functions to solve such conundrums. The Odds Ratio works similarly to Bill James's log5 formula, except it is straightforward in its calculation on any dataset, regardless if the mean hovers around 0.500. In essence, though, think of it as a modified log5, as it would work to determine the likelihood of, say, Braun and his 0.6241 shot of making an out actually being retired against a pitcher who gets outs against 0.6720 of his batters faced. The ratio goes through several steps as outlined below, however the league averages in each category must first be known:
Odds(Hitter) = 0.6241 / (1-0.6241) = 1.66
If Braun makes outs at the aforementioned rate and faces a pitcher who retires 67.20 percent of the batters he faces, in a dataset with a mean of 65 percent, he can be expected to make an out 64.65 percent of the time, a greater percentage than his own rate of making outs. Repeat this process for all relevant events to generate the new rates in each category, and adjust the event columns in the PA rows to compare the RAND values to these adjusted numbers and a simulation surfaces in which a random pitcher surfaces for each plate appearance of a hitter, whose own skills are then subsequently adjusted based on the skill level of the opposing hurler.
In other words, this process trains Excel to play Strat-o-Matic. In a sample with tons of games and plate appearances remaining, modeling in the opposing pitcher quality is not as important, since things will tend to even out. Over a short time span such as this, in which hitters are only going to bat 110-140 more times, getting an unlucky "roll of the dice" could be the difference between winning and losing a category. There are even more aspects of the batter/pitcher matchup to be added in, including platoon splits, but these additions, while fun to model, will only increase accuracy minimally, which is Seidmanese for "I'm exhausted now, and this sim works without it, but I will add more in at a date to be named later."
With all of the gory details out of the way, I could swear that we started this article out with a goal… oh, that's right, Pujols and winning the Triple Crown. Running through five different simulations of 10,000 seasons for every player within realistic reach of one of the categories, Pujols achieved the feat anywhere from 11 to 62 times. His specific numbers in the crown-winning seasons were particularly interesting in that very, very few involved Pujols soaring above a .345 batting average. Instead, about 85 percent of the seasons in which he won all three categories involved Hanley Ramirez plummeting more than Pujols himself picked up the pace, really hammering home how unlikely such a feat is to occur given that the player in question actually does need to worry about the performance of others.
Relative to exactitudes, Pujols averaged a batting average of approximately .321 in all of these simulations, .333 when he won the batting title, and .335 when winning all three elements of the Triple Crown in the same season. Hanley averaged a .343 batting average across all seasons, .344 when winning the batting title, and .331 when someone else won the batting title, which happens in a very small percentage of the iterations. Suffice to say, although Pujols is having a fantastic season, one that will likely end as the most balanced of any player in the senior circuit, he is not particularly likely to win the batting average and RBI titles, and may very well lose out in the home-run race as well. In fact, anywhere from 55-63 percent of the simulated seasons I ran concluded with Pujols failing to lead any of the three categories, suggesting that it is actually very likely that he loses all three. Granted, these numbers were run before he terrorized the Pirates this past weekend, but it would be tough to imagine the overall results would change much. However, I'll be running this just about every day, Playoff Odds-style, and if anything interesting surfaces you'll be the first to know.
That Albert Pujols is not going to win the Triple Crown and may very well miss out on leading all three components does not detract from his accomplishments, serving rather as a reminder of the difficulty level inherent when vying for this particular honor. Pujols will probably have to settle for taking home a third MVP award this season, but the stars would have to align perfectly, providing some extremely entertaining baseball in the process, for Phat Albert to achieve the incredibly rare feat of winning the Triple Crown.