March 14, 2010
Circling The Bases
Cleaning Up the (Run-Scoring) Environment with EPA
Ugh! What is that vile stench? That’s right, it’s Sidney Ponson starts and the Mariners' offense polluting our pristine Run Expectancy and Win Expectancy Matrices. Before we write our congressman to apply for stimulus money for the cleanup, let's ask: How did it get this way?
The main tools that a sabermetrician uses to analyze baseball strategy are the Run Expectancy Matrix and the Win Expectancy Matrix. Readers not familiar with these powerful tools can check out my article from the Basics series from the BP Idol competition last year. Simply put, these matrices are the basis for improved performance measurement (WXRL, etc.) and strategy analysis (stolen bases and sacrifice hits).
Before blindly using these tools, the sabermetrician needs to consider their limitations. To calculate the Run Expectancy Matrix, all play-by-play situations with the same baserunner/out state are aggregated, and the average number of runs subsequently scored in that inning from that point on is determined empirically. The base calculations of the Run Expectancy Matrix assume that the run environment is exactly the same regardless of pitcher, offense, or ballpark, and that the number of runs scored by each side in a game will likely be between 4.5 and 5.0, typically the season average. The problem is that we are grouping together situations where Chris Carpenter is pitching against the Astros (a low run-scoring expectation) with those where Fausto Carmona is pitching against the Yankees (a high run-scoring expectation). In the former situation, we hypothesize that it is much more important to manufacture whatever runs one can, whereas in the latter, giving away outs will be detrimental to winning.
So how do we clean up our Run Expectancy Matrix? Let’s bring in the EPA! No, not the Environmental Protection Agency, but the Environment Prediction Algorithm. What we’re going to do is estimate the true run environment, then use it to create Run Expectancy Matrices for the different environments.
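For readers who want to see the aggregation step spelled out, here is a minimal sketch of how a Run Expectancy Matrix can be built from play-by-play records. The record layout (a base-state string, the number of outs, and the runs scored in the remainder of the inning) and the function name `run_expectancy` are assumptions made for illustration, and the toy records are made up rather than taken from real games.

```python
from collections import defaultdict


def run_expectancy(plays):
    """Average runs scored from each (bases, outs) state to the end of the inning.

    `plays` is an iterable of (bases, outs, runs_rest_of_inning) tuples, where
    `bases` is a string such as "1--" (runner on first) or "---" (bases empty).
    This record layout is an assumption for illustration only.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bases, outs, runs_rest in plays:
        totals[(bases, outs)] += runs_rest
        counts[(bases, outs)] += 1
    # One matrix cell per observed state: total runs divided by occurrences.
    return {state: totals[state] / counts[state] for state in totals}


# Made-up toy records, not real play-by-play data.
toy_plays = [
    ("1--", 0, 2),
    ("1--", 0, 0),
    ("1--", 0, 1),
    ("---", 1, 0),
    ("---", 1, 1),
]

matrix = run_expectancy(toy_plays)
print(matrix[("1--", 0)])  # 1.0
print(matrix[("---", 1)])  # 0.5
```

Building separate matrices for different environments would amount to running the same aggregation on the subset of plays that fall in each predicted run-environment bucket.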
As an exercise, we will see how our understanding of various strategies changes based on the environment.
The EPA Approach
The EPA considers four main factors in estimating the likely runs scored per game:
In theory, we could create a model that considers whether it is a day or night game, and cross-effects such as when CC Sabathia faces a lineup of four left-handers and five right-handers, but as Dr. Leo Marvin writes: Baby Steps. We don’t want our algorithm to become all tied up like Bob Wiley. The algorithm I used was a sequential heuristic that builds a linear model estimating the number of runs per game from the factors above. As an example, here are some of the factors for the 2009 season:
* minimum 25 starts
So let’s take a hypothetical example to see how this works. The Yankees are at home against the Royals with Luke Hochevar on the mound for Kansas City. Taking all the factors into account, the Yankees are the home team (4.73 runs) going against a righty (+1.10) who is Hochevar (+1.39), so our expectation is that the Yankees would average 7.22 runs per game in this situation, before we take the ballpark into account. Typically, this methodology predicts a run environment somewhere between two and eight runs per game. For the 2009 season, the following table shows the distribution of the number of games, along with a sample game from each bucket. Keep in mind that the total is twice the number of games played, as there is one record for the visiting team’s scoring and one for the home team’s scoring.
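The Hochevar arithmetic above can be written as a tiny additive model. This is only a sketch: the function name `predicted_runs` and the dictionary keys are hypothetical, the three numbers are the ones quoted in the example, and the ballpark term is left out because the text does not say how it is applied.

```python
def predicted_runs(base_runs, adjustments):
    """Sum a team's baseline runs per game and a set of additive adjustments."""
    return base_runs + sum(adjustments.values())


# Values quoted in the example: Yankees at home, facing a righty, who is Hochevar.
# The ballpark factor is deliberately omitted (not specified in the text).
hochevar_example = predicted_runs(
    4.73,
    {"vs_righty": 1.10, "hochevar": 1.39},
)
print(round(hochevar_example, 2))  # 7.22
```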
Analyzing Some Strategies
Here are the different Run Expectancy Matrices:
First, let’s analyze the stolen base when there is just a runner on first base. Typically, the rule of thumb is that the expected success rate should be around 70-72 percent. The table below shows how the break-even success rate changes based on the run-scoring environment. The numbers in parentheses are modified break-even success rates when we consider that, over the last five years, 5.1 percent of attempted steals of second have involved an error that let the runner advance to third.
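As a rough sketch of where a break-even rate comes from, the attempt is worth making when the weighted run expectancy of the outcomes matches the run expectancy of staying put. The function names and the run expectancy values below are placeholders chosen for round arithmetic, not numbers from the article's tables, and the way the 5.1 percent error rate is folded in (as a share of all attempts that ends with the runner on third) is one plausible reading rather than the author's confirmed method.

```python
def break_even_rate(re_start, re_success, re_caught):
    """Success rate p at which p*re_success + (1-p)*re_caught equals re_start."""
    return (re_start - re_caught) / (re_success - re_caught)


def break_even_rate_with_error(re_start, re_second, re_third, re_caught, error_rate):
    """Break-even rate when `error_rate` of all attempts puts the runner on third.

    Assumed model: (p - e)*re_second + e*re_third + (1 - p)*re_caught = re_start,
    solved for p, where e is the error rate.
    """
    bonus = error_rate * (re_third - re_second)
    return (re_start - re_caught - bonus) / (re_second - re_caught)


# Placeholder run expectancies (hypothetical), all for a steal with no outs:
RE_FIRST_0_OUT = 0.9   # runner on first, no outs (before the attempt)
RE_SECOND_0_OUT = 1.1  # runner on second, no outs (successful steal)
RE_THIRD_0_OUT = 1.4   # runner on third, no outs (steal plus error)
RE_EMPTY_1_OUT = 0.3   # bases empty, one out (caught stealing)

plain = break_even_rate(RE_FIRST_0_OUT, RE_SECOND_0_OUT, RE_EMPTY_1_OUT)
adjusted = break_even_rate_with_error(
    RE_FIRST_0_OUT, RE_SECOND_0_OUT, RE_THIRD_0_OUT, RE_EMPTY_1_OUT, 0.051
)
print(round(plain, 3))     # 0.75
print(round(adjusted, 3))  # 0.731
```

Swapping in the run expectancies from a low-scoring or high-scoring matrix is what moves the break-even rate down or up.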
As we would expect, in a low-scoring environment, manufacturing runs is so important that the break-even rate drops significantly compared to what we would have considered before. On the flip side, in a high run-scoring environment the bar for success is raised.
The other strategy that is usually hotly debated is the sacrifice bunt. There was a specific instance that got me thinking about this concept of quantifying strategies in low run-scoring environments: the June 25 game between the Mets and Cardinals that I watched for the BP Idol competition. Twice in that game (with Johan Santana pitching for the Mets, a 2.68 runs/game environment based on the EPA prediction), Cardinals manager Tony La Russa called for the sacrifice bunt with a runner on first and no outs.
Below is the table of three likely bunting situations with no outs: runner on first, runner on second, and runners on first and second. The values are the change in expected runs from employing the strategy: a negative number means the strategy hurts the offense and a positive number means the strategy helps the offense. The first numbers show the change given a successful sacrifice. The numbers in parentheses reflect adjustments based on the following historical trends over the last five years when a sacrifice is attempted: 70 percent successful attempt (batter is out, runners advance one base), 23 percent unsuccessful attempt with either the lead runner thrown out or a strikeout, 3.5 percent throwing error by the defense that allows runners to advance more than one base, 2 percent double play, and 1.5 percent fielder’s choice that results in all runners being safe.
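The adjustment in parentheses can be sketched as a probability-weighted average over the five outcomes listed above, for the runner-on-first, no-outs case. The outcome probabilities are the ones quoted in the text; everything else is an assumption for illustration: the function name `expected_run_change`, the placeholder run expectancy values, and the choice of first and third with no outs as the state after a throwing error.

```python
def expected_run_change(start_re, outcomes):
    """Expected change in run expectancy from attempting a sacrifice.

    `outcomes` is a list of (probability, run_expectancy_after) pairs whose
    probabilities must sum to 1.
    """
    total_prob = sum(p for p, _ in outcomes)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * re_after for p, re_after in outcomes) - start_re


# Hypothetical run expectancy with a runner on first and no outs.
START_RE = 0.90

# Probabilities are from the text; run expectancies are placeholders.
bunt_outcomes = [
    (0.70, 0.65),   # successful sacrifice: runner on second, one out
    (0.23, 0.55),   # lead runner out or strikeout: runner on first, one out
    (0.035, 1.80),  # throwing error: assumed first and third, no outs
    (0.02, 0.10),   # double play: bases empty, two outs
    (0.015, 1.50),  # fielder's choice, all safe: first and second, no outs
]

success_only = 0.65 - START_RE
adjusted = expected_run_change(START_RE, bunt_outcomes)
print(round(success_only, 3))  # -0.25
print(round(adjusted, 3))      # -0.231
```

With these placeholder values the errors and fielder's choices claw back a little of the cost of the bunt, which is the same direction of adjustment the parenthesized numbers are meant to capture.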
So, in the situation with La Russa above, my initial analysis was that it was a bad decision, likely costing his team 0.2 runs. In reality, given that it was likely going to be a low-scoring game, this decision was essentially break-even. Given that our parameters for errors and unsuccessful attempts are averages over all attempts, if a better-than-average bunter and/or a poor-fielding catcher were in the game, these may flip to a positive benefit. Once again, the last column indicates that in high-scoring environments, attempting to manufacture runs early is a poor strategy. Woe is the manager who sacrifices with Bruce Chen on the mound.
Next Steps
As some of you may realize, there could be a lot more to this EPA process than simply analyzing strategies. If we have an accurate way to predict the likely run-scoring environment for both the visitors and the home team, it’s just a quick stroll in the mathematical woods to determine a win percentage for the home team. Hmmmm…. if only there were a place where one could go to wager on games like this armed with this information. So, next time, we are going to have some fun by seeing how EPA stacked up against the historical closing lines in Las Vegas throughout an entire season.
Tim Kniker is an author of Baseball Prospectus.

Marge Simpson: "Eepa." What does that mean?
Comic Book Guy: I believe it was the sound Green Lantern made when Sinestro dropped him in a vat of acid. "Eeeeeepaaaa!"