Extra Innings: More Baseball Between the Numbers, edited by Steven Goldman, is the sequel to Baseball Prospectus’s 2006 landmark Baseball Between the Numbers, a book that gave many their first taste of state-of-the-art sabermetric thinking in the years after Bill James and Moneyball. BP now returns with a sequel that delves into new areas of the game, such as how to evaluate managers and general managers, the true effects of performance-enhancing drugs, how prospects are recruited and developed in Latin America, and more. The book is now available for ordering and immediate shipping from Amazon and Barnes & Noble ahead of its official release date of April 3, 2012. Today, we present the second of two excerpts from the book.

“Alcohol is like love,” he said. “The first kiss is magic, the second is intimate, the third is routine. After that you take the girl’s clothes off.” —Raymond Chandler, The Long Goodbye

At first, it truly was magical. Our first kiss was Dennis Eckersley back in 1987. Eckersley, viewed as a washed-up starter in the twilight of his career, ended up with the one manager creative enough to bring his career back to life, Tony La Russa. La Russa, no doubt with an assist from pitching coach Dave Duncan, tailored a role just for Eckersley, one that we now recognize as that of the modern closer: the pitcher would appear only at the end of the game, only to protect a lead, and almost never for more than an inning at a time. Eckersley credited the move with revitalizing his career: 

“It was a hell of an idea, and I was the lucky recipient,” says The Eck. “I was 32. Starting was getting to be difficult. I couldn’t go six or seven innings, wade through all those left-handers anymore. But just pitching one inning, my fastball came back. I was throwing like I was 25 again. One inning suited me very well. I never would have lasted if I had to pitch two or three innings all the time. Plus, I would have had my head knocked off.”

The results were bewitching. Eckersley ended up with a Cy Young and MVP award in the same season, and every team in baseball decided they had to have a pitcher just like him. Like a virus, the fever spread, the limited role designed for Eckersley evolving to include other pitchers.

Now it’s routine. La Russa, not content to simply have a designated ninth-inning guy, added pitchers devoted solely to retiring left-handed hitters late in the game. Most managers have followed La Russa in form if not in creativity; managers enter a game with a set plan for how they want to use their bullpen, and some are unwilling to deviate from it. Yankees manager Joe Girardi is notable for his inflexibility. After an eighth-inning meltdown by Rafael Soriano, Girardi told the press why the reliever was allowed to pitch long enough to squander a four-run lead: “Soriano’s our eighth-inning guy,” Girardi said. “And by no means is four runs a game in the bag, as we just saw.” Baseball Prospectus’s Steven Goldman responded:

[Girardi’s] is a nice thought, except that a manager can’t go through life worrying about protecting four-run leads; in 2010 and 2011, when the home team carried a four-run lead into the top of the eighth, it won roughly 98 percent of the time. Girardi also argued that he had to use Soriano there because he would have been second-guessed if he hadn’t. “If a guy gets on or a couple guys get on, and I have to get Soriano up, then I’m asked the question, ‘Why didn’t you just have him to start the inning?’” This seems to suggest that only your eighth-inning guy can pitch the eighth inning, all 162 of them, because the consequences of using a non-eighth-inning guy in the eighth-inning spot are too frightening to contemplate. Someone might yell at you. Fans. Owners. Mom.

Similarly, Girardi had to use his eighth-inning guy because had he not, he might have had to use his closer: “If we get through the eighth without giving up a run, then I don’t have to get up [Mariano Rivera,] my 41-year-old closer who, I think, is quite important to us in the course of the year.” Again, by this reasoning, no lead is so safe that you don’t have to take all possible precautions to ensure that your closer does not ever have to pitch.

Yet, even had the Yankees given up a run in that eighth inning, the game wouldn’t truly have been in jeopardy, it just would have been in jeopardy according to the saves rule, which is a different matter. The manager of the Yankees does not dictate when to use Mariano Rivera, but the arbitrarily defined “save situation” does. He is powerless before it. Even had he deemed it wiser to skip Rivera that day so that he might be available for some future clash with the Red Sox, he would have had to use him.

The closer, once an invention based upon creativity, is now either an excuse or a mandate to avoid creativity in how a manager applies his relief pitchers in the search for team wins. How did we end up where we are now? And is it truly benefiting the game of baseball?

Relief pitchers were so rare in the early days of baseball that it didn’t occur to those keeping the records to track pitchers’ performance both as starters and relievers. We can, however, estimate the split of a pitcher’s playing time in each role, which gives us a useful starting point for analysis. While the historical record is unclear for individual pitchers, it is relatively simple to estimate the number of innings pitched by all starters for a season, as well as the number of relief appearances per game. Using these league-wide estimates, as well as a pitcher’s games and games started, lets us estimate how much of a pitcher’s work came from starting versus relief pitching.

For the modern era we have play-by-play data and can calculate these things precisely, but I have chosen to continue presenting the estimates so as to make it easier to identify trends without worrying if a shift is caused by changes in what happened versus changes in how the numbers are tabulated. See Figure 3-2.1 for the percentage of innings thrown by starting pitchers.

It is not until about 1908 that we see a decline, settling at 90 percent for a few years and then drifting almost inexorably downward, so that in the modern game less than 70 percent of all innings are thrown by starting pitchers.

What is fascinating about the downward slope is that it is unimpeded by almost anything that would affect pitcher usage. The second deadball era doesn’t seem to arrest the decline at all; the 1960s actually saw a slight drop in starter innings, while shaving nearly a full run per game in comparison to the 1950s. As a whole, the correlation between the percentage of innings thrown by starting pitchers and the runs scored per game is only 0.4. If we eliminate the evolving game of the late 1800s and early 1900s and start with 1920 (the first year starters threw less than 90 percent of all innings) we get a correlation of only 0.16.

The arrow of correlation is counterintuitive in that it suggests the more runs scored per game, the more innings thrown by starting pitchers. At the very least this causes us to reconsider the commonly held belief that pitchers don’t throw as deep into games because they have to face tougher lineups than pitchers of old used to. It has often been said that replacing slap-hitting shortstops with Cal Ripken types means fewer spots in the lineup to pitch around, but even replacing the pitcher with the designated hitter—in essence a second first baseman—doesn’t seem to affect the magnitude of the downward slope.

What accounts for the change in pitcher usage? We can neatly divide the outcomes of a plate appearance into two groups—balls in play, which require action by the defense, and the so-called three true outcomes of walks, strikeouts, and home runs. Figure 3-2.2 gives us a look at the rate of balls in play as a percentage of plate appearances over time.

The two graphs are strikingly similar. The BIP rate and starter IP rate have a correlation of .88. What this suggests is that pitching has gotten harder over the years because more and more of the burden has shifted to the pitcher alone, with less and less reliance on the defense. This has created an increased need for relief pitchers. It took some time, however, for this to lead to the rise of dedicated relief specialists. Figure 3-2.3 shows the percentage of relief innings thrown by pitchers who never started a game over time.

We actually see that baseball started with “dedicated” relievers, but that can be misleading—there weren’t many relief innings to go around, and so teams pressed position players into pitching on the rare occasions where a starter couldn’t finish a game. Using pitchers as relief pitchers seems to start in the 1890s, and by 1910 or so teams relied on pitchers nearly exclusively for relief pitching appearances. Teams still hadn’t moved to using pitchers whose primary job was to pitch in relief, however; the majority of relief appearances went to starting pitchers, or at the very least, swingmen who could be counted upon to work both roles. This began to change around 1936, when teams began a gradual transition toward pitchers who specialize in relief.

Once you have dedicated relief pitchers, you’re going to notice that some of them are better than others. And you’re going to try and use your better pitchers in tight games at the expense of your lesser pitchers. This is where we see the first manifestations of what we’d now call a “closer,” but which at the time were often called “firemen,” relief pitchers who are supposed to come in with the game on the line and finish it off. By the end of the 1960s, we see most (if not all) teams having a relief ace. If we define a team’s closer as the pitcher with the most saves for his team that year, we can look for historical trends in closer usage. We’ll look at two measures: how many IP a closer pitches per appearance, and how many appearances a closer makes per team game (see Figure 3-2.4).

From 1920 through to 1960, the percentage of games where a closer makes an appearance rises dramatically from 8 percent to 33 percent. After, we see a much subtler rise up to an average of 38 percent for the past decade. The frequency with which teams used their relief ace has been relatively stable since 1960 or so. But right around 1988 we see a dramatic change in how many innings a team’s relief ace pitches each appearance. Up to that point you have a pretty stable equilibrium around 1 and 1.2 innings per outing. After a five-year decline, though, you hit a new equilibrium at just over an inning pitched per game, one that’s even more stable than the old equilibrium.

We don’t just see this change among relief aces. Looking at the percentage of innings thrown by relievers with at least one, one and a half (half-innings are the result of averaging, whereas in actuality pitchers only throw in multiples of one-third), and two innings per outing over time shows us how drastically total bullpen usage has changed (see Figure 3-2.5).

We see the same late-1980s, early-’90s inflection point that marked the dramatic change in closer utilization. Before that point, nearly every relief pitcher threw at least an inning per outing; as of 2010 only half of all relief innings were thrown by pitchers who averaged an inning or more per outing. Pitchers who average at least an inning and a half of work per outing have gone from representing between 40 and 60 percent of innings pitched to representing less than 10 percent. True long relievers—pitchers who threw two or more innings per outing—experienced an extinction-level event akin to that which met the dinosaurs.

We can quibble a bit about the exact moment that comet struck—maybe it was 1988, maybe 1989—but it came soon after Eckersley’s first season in Oakland. In terms of impact on the game, the creation of the modern closer by La Russa seems as influential as Babe Ruth’s home run prowess ending the primacy of the bunt and stolen base.

Having identified where the change began, it falls to us to assess if the change itself has been a positive development. We can’t answer that question directly, unfortunately; baseball is a zero-sum game, and if all teams change strategies, then in the end the average team doesn’t benefit at all from the shift. Still, a change in strategy of this magnitude should have one noticeable impact: by putting a team’s best pitchers in late to finish close games, we should expect all teams to be better at holding leads in such situations. After all, there is no strategy out there that has allowed managers to get their best hitters to face the other team’s closer a disproportionate amount of the time.

To find this evidence, let’s focus on situations resembling the archetypal save, with one team leading by one to three runs at the start of the ninth inning or later. (These won’t all be save situations; sometimes a pitcher other than the closer will be called upon to start the inning, but most of them will be.)

In the 1950s, a team in such a situation would win its game 90 percent of the time; in the 2000s, a team would win such a game 91 percent of the time. Assuming 44 such chances a season (the average for the past decade), that means modern teams will win an additional game every two to three seasons due to changes in relief pitcher usage. There is a slight countervailing impact from increased run scoring, but with a correlation of just –0.28 between runs per game and these win rates, such an effect shouldn’t be expected to significantly alter these conclusions. In short, baseball has contorted its roster and raised a small class of pitchers up to be multi-millionaires for a very small benefit.
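The arithmetic behind that estimate is easy to check. The sketch below uses only the figures quoted above (the 90 and 91 percent win rates and 44 chances a season); the function and variable names are illustrative labels, and because the win rates are rounded, the result should be read as a rough estimate rather than a precise one.

```python
# Figures quoted in the text; names are illustrative, not from any data set.
WIN_RATE_1950S = 0.90      # win rate with a 1-3 run lead entering the ninth
WIN_RATE_2000S = 0.91      # same situation, modern bullpen usage
CHANCES_PER_SEASON = 44    # average number of such chances per team-season


def extra_wins_per_season(old_rate, new_rate, chances):
    """Additional wins per season implied by a change in win rate."""
    return (new_rate - old_rate) * chances


extra = extra_wins_per_season(WIN_RATE_1950S, WIN_RATE_2000S, CHANCES_PER_SEASON)
seasons_per_extra_win = 1 / extra

print(f"{extra:.2f} extra wins per season")
print(f"one extra win every {seasons_per_extra_win:.1f} seasons")
```

With these rounded inputs the gain works out to a bit under half a win per season, which is where the "every two to three seasons" figure comes from.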

In an additional bit of irony, the rise of pitchers designed to pitch in these sorts of situations has coincided with a decline in these sorts of chances. The primary driver seems to be the rise in offense, not the change in pitcher usage. There is a .81 correlation between the rate of potential ninth-inning saves and the seasonal average for runs per game. From 1950 through 2011, 29.1 percent of games resulted in a potential ninth-inning save chance, while from 1988 through 2011, only 27.9 percent did. That decline in possible save chances, at the least, provides a countervailing effect to the ability of ace relievers to come in and close a game.

The paucity of ninth-inning save chances points out another flaw in saving your best reliever for that inning. If you only have 44 ninth inning save chances a season, but your best reliever can pitch 60 or 70 innings in one-inning stints, you end up having more than a few wasted innings from your closer. For the moment, let’s define a close game as one where the fielding team leads by two or less, is tied, or trails by one run. From 1988 through 2011, at the point when the closer first enters the game, he finds himself in a close game only 59 percent of the time. Twenty-one percent of all games pitched by a team’s closer happen when the run differential is four runs or greater. This is because managers have to “find work” for their team’s supposedly most valuable reliever, and thus must resort to putting him into a game that’s essentially already decided just so he can get his innings in. (In fairness, earlier managers were little better, with 60 percent of appearances in close games and 17 percent in blowouts of four or more runs.)

Fans in the stands would be surprised to hear this, of course; if the closer wasn’t achieving something special, would he need a special entrance song? Would he send chills up our spine when he delivered his first pitch? The cold, raw numbers feel inadequate to explain how it feels to watch a dominant closer. You can hear the familiar refrain already: “Get your head out of your spreadsheets and watch a ballgame sometime.” Yet, as it turns out, spreadsheets are in fact capable of recognizing the heightened excitement that occurs when a closer enters the game. In order to capture this feeling, sabermetricians have often turned to what Dave Studeman has called “the story stat,” win probability added (WPA). I’ll let him explain:

Here’s the basic idea. An average team, at any point in a game, has a certain likelihood of winning the game. For instance, if you’re leading by two runs in the ninth inning, your chances of winning the game are much greater than if you’re leading by three runs in the first inning. With each change in the score, inning, number of outs, base situation or even pitch, there is a change in the average team’s probability of winning the game.

. . . Bottom of the ninth, score tied, runner on first, no one out. The home team has a 71% chance of winning according to the Win Expectancy Finder (in this situation, the home team won 1,878 of 2,631 games between 1979 and 1990). Let’s say the batter bunts the runner to second. Good idea, right? Well, after a successful bunt, with a runner on second and one out, the Win Probability actually decreases slightly to 70% (home team won 1,300 of 1,848 games), according to the WE Finder. The bunter hasn’t really helped or hurt his team; his bunt was a neutral event.

. . . To really have fun with this system, you can take it one step further and track something [called] “Win Probability Added” (WPA). Once again, the concept is simple. Let’s say our batter in the bottom of the ninth hits a single to put runners on first and third with no outs. This increases the Win Probability from 71% to 87%, for a gain of 16%. So, in a WPA system you credit the batter +.16 and debit the pitcher/fielder –.16. If you add up every positive and negative event from the beginning to the end of a game, you wind up with a total for the winning team of .5, and a total for the losing team of –.5. And the player with the most points will have contributed the most to his team’s win.
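Studeman's bookkeeping can be sketched in a few lines. The example below is a minimal illustration, not the Win Expectancy Finder itself: it recomputes win expectancy from the game counts he quotes and then applies the credit-and-debit rule he describes. The function names are hypothetical, and all probabilities are from the batting (home) team's point of view.

```python
def win_expectancy(wins, games):
    """Empirical win expectancy: share of games won from a given situation."""
    return wins / games


def wpa_credit(we_before, we_after):
    """Return (batter_credit, pitcher_credit) for a single event.

    The batter is credited with the change in his team's win expectancy
    and the pitcher/fielder is debited the same amount.
    """
    delta = we_after - we_before
    return delta, -delta


# Bunt example: runner on first, no outs -> runner on second, one out.
before_bunt = win_expectancy(1878, 2631)
after_bunt = win_expectancy(1300, 1848)
bunt_batter, bunt_pitcher = wpa_credit(before_bunt, after_bunt)

# Single example: win probability moves from 71% to 87%.
single_batter, single_pitcher = wpa_credit(0.71, 0.87)

print(f"bunt:   batter {bunt_batter:+.3f}, pitcher {bunt_pitcher:+.3f}")
print(f"single: batter {single_batter:+.2f}, pitcher {single_pitcher:+.2f}")
```

Because every credit has an equal and opposite debit, the two sides of any event always sum to zero, which is why the game totals end up at .5 and –.5.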

Related to win expectancy is the concept of “leverage,” which is simply a measure of the possible change in win expectancy given the context. For our purposes, we will fix the leverage index of the average event at one, so that a situation with a leverage index of two would have twice the change in win expectancy of the average plate appearance.

Examining all events from 1950 through 2011, we find the average plate appearance in the ninth inning and later has a leverage index of 1.33, compared to .96 for the first eight innings. In a model based upon win expectancy and leverage index, those late-game situations are worth 37 percent more than events earlier in the game.

Contrast this to a more traditional model of how events contribute to team wins and losses—the Pythagorean theorem, which has been revised countless times but takes the basic form of

RS^2 / (RS^2 + RA^2)

where RS is runs scored and RA is runs allowed, and the result is an estimated win percentage. The Pythagorean model doesn’t care about the order of events. It doesn’t matter if a run is allowed in the first inning or the ninth; the formula treats them exactly the same.
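As a quick illustration of that order-blindness, here is a minimal sketch of the basic form of the formula. The exponent of 2 is the basic version given above (revised versions use other exponents), and the run totals in the example are made-up round numbers for a hypothetical team, not real data.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Estimated win percentage from season run totals alone."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical team: 750 runs scored, 700 allowed over a 162-game season.
pct = pythagorean_win_pct(750, 700)
expected_wins = pct * 162

print(f"estimated win percentage: {pct:.3f}")
print(f"estimated wins over 162 games: {expected_wins:.1f}")
```

The function takes only two totals, so there is nowhere for the timing of a run to enter: a run allowed in the first inning and a run allowed in the ninth add to `runs_allowed` in exactly the same way.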

How can we tell if the leverage model of pitcher evaluation is better than our Pythagorean model? What we can do is come up with a prediction based upon the ideas behind the leverage model, and test them at the team level. One thing we find, if we do a little digging, is that relief pitchers tend to pitch in slightly higher leverage spots than starting pitchers. The greatest concentration of leverage occurs in the ninth inning or later, with the average ninth-inning leverage from 1988 to 2011 at 1.33. Extra innings have even more leverage. (We’ll look at the reasons for this in a little bit.) In the language of leverage, what this means is that each batter faced by a pitcher in the ninth inning is more important in deciding the outcome of a ballgame than each batter faced by a starting pitcher.

If true, this suggests that we could beat the Pythagorean theorem at estimating team wins by putting a greater emphasis on a team’s pitching performance in the ninth inning. To see if this is true, we can break Pythagorean wins down into two components: a team’s expected win percentage given only the performance of its pitchers through the first eight innings, compared to its record after. We can use these two variables to predict both a team’s Pythagorean and actual win percentages. We can then compare them to see how close the two models are, and if the Pythagorean method is underweighting a team’s pitching performance in the ninth inning.

What we see instead is incredible consistency between the two models; the difference between the weight for relief pitching in the Pythagorean model and the observed wins model is only .03. In other words, there is little practical difference in the amount of emphasis on relief pitcher performance when predicting actual wins versus Pythagorean wins—the Pythagorean model is a much more realistic model of the impacts of pitching performance than the leveraged model. A .03 change means that for a team with a ninth-inning-and-later performance of half the league run average, you would expect it to win roughly one more game than predicted by the Pythagorean model per season. (Teams pitching that well occur less than one percent of the time.) In a more realistic scenario, a team that has an RA in the ninth and later that’s 75 percent of league average (teams pitching that well or better occur about 16 percent of the time) wins one more game than predicted by the Pythagorean model every two seasons.

What the win expectancy model is truly capturing is not how much a play contributes to team wins, but how well an event predicts the outcome of the game itself. There is, of course, going to be some substantial overlap between the two, as things that lead to wins also tend to be good predictors of wins. What complicates things is that at the end of the game, the music stops and everyone has to find a chair—the winning team is at one and the losing team at zero. This is what’s known as an “absorbing state”; once you enter it, it’s impossible to leave. Late-game events are more predictive in terms of win expectancy due to their proximity to the end of the game.

To this end, WPA is truly the story stat. It captures very well how exciting a game is close-and-late. A blown save is tremendously upsetting emotionally, because it takes what was very nearly a sure win and turns it into a sure loss. WPA captures this change very well. But what it does not capture nearly as well is the fact that, indeed, the closer enters the game when it is already very nearly a sure win.

Consider the toughest save spot a closer would see to start the ninth inning—the pitcher comes in with three outs left in the game and a one-run lead. In order for his team to win, all he has to do is pitch one scoreless inning. The reality is most innings in MLB are scoreless; from 1988 through 2011, 72 percent of all innings had zero runs scored. Because we’re already dealing with a high probability of success, it’s difficult to improve on this rate; the average pitcher coming into a ninth-inning save chance allowed no runs only 75 percent of the time.

Emotionally, the final inning is an absorbing state as well; the pitcher on the mound when a team wins or loses the game tends to bask in the reflected glory of the triumph or wallow in the agony of the defeat. However, in reality all that matters is the final score. If the starter pitches a scoreless fifth, that’s just as meaningful to deciding the outcome of a one-run game as it is if the closer pitches a scoreless ninth. Win expectancy may tell a better story than the Pythagorean analysis, but it tells us less about the relative contributions of closers versus starting pitchers to team wins and losses.

If the change in reliever usage hasn’t altered how effective teams are pitching late in games, it has changed how managers handle their tactical choices, and by doing so has affected the way we watch the game. The shift to relievers pitching fewer innings per appearance did nothing to arrest the decline of innings pitched by starting pitchers. The result of this change has been more pitching changes per game; in the 1980s there were 3.4 relief appearances per game, while in the 2000s there were 5.6 relief appearances per game. This has meant less space on the roster for position players (see Figure 3-2.6).

After a sharp jump up to the levels of the late 1890s, we see a gradual rise until the late 1980s, where again we see a dramatic increase. Teams are increasingly using more pitchers to fill their roster spots.

A manager’s chief strategic weapon is no longer the position player, but the relief pitcher. While specialization came naturally to position players, it had to be created for relievers. A manager can probably tell which player is his pinch-hitter and which is his pinch-runner just by looking at him, but can’t tell which pitcher is which without some sort of guidance. Thus, managers have created increasingly narrow pitching roles to help them make those decisions: one pitcher is your closer, one your setup man, one your seventh-inning guy, one guy goes after tough lefties.

This increasing parade of relievers may not make it any easier to hold leads late in games, but they do in fact make the “late” in games more accurate. Looking at all seasons from 1950 through 2011, each reliever used per game adds an additional 10 minutes to the length of the game. This holds even after you control for increased run scoring (which is not a significant predictor of game length once you control for the number of relievers used). And changes in reliever usage account for over 70 percent of the variance in game length over that time period.

What this means is that, from 1950 through the present, we’ve added more than half an hour to the length of a ballgame. If this addition meant more play, it might be worth it. But for the most part, it’s an addition of seeing managers coming out of the dugout with an arm in the air, warm-up pitches from the mound, and catchers jawing with their starters to give the fresh arm in the pen a little more time to loosen up. Seeking ephemeral advantages, managers have instead colluded to add 30 minutes of tedium to our national pastime.
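To get a feel for the scale, the ten-minutes-per-reliever estimate can be applied to the relief-appearance figures quoted earlier for the 1980s and 2000s. This is a rough sketch that assumes the relationship is linear and that "relief appearances per game" and "relievers used per game" count the same thing; the names are illustrative, and the figures for 1950 itself are not given in the text, so only the 1980s-to-2000s portion is computed.

```python
MINUTES_PER_RELIEVER = 10          # estimate quoted in the text
RELIEF_APPEARANCES_1980S = 3.4     # relief appearances per game, 1980s
RELIEF_APPEARANCES_2000S = 5.6     # relief appearances per game, 2000s


def added_minutes(before, after, minutes_per_reliever=MINUTES_PER_RELIEVER):
    """Extra game length implied by a change in relievers used per game."""
    return (after - before) * minutes_per_reliever


def relievers_for_minutes(minutes, minutes_per_reliever=MINUTES_PER_RELIEVER):
    """Additional relievers per game needed to add a given number of minutes."""
    return minutes / minutes_per_reliever


since_1980s = added_minutes(RELIEF_APPEARANCES_1980S, RELIEF_APPEARANCES_2000S)
needed_for_half_hour = relievers_for_minutes(30)

print(f"1980s to 2000s: about {since_1980s:.0f} extra minutes per game")
print(f"30 extra minutes implies {needed_for_half_hour:.0f} more relievers per game")
```

Under these assumptions, the shift from the 1980s to the 2000s alone accounts for roughly two-thirds of the half hour, with the rest coming from the slower rise in reliever usage before then.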

If history teaches us anything, it’s that nothing lasts forever. Someday, some enterprising manager will decide to eschew the staid traditions of the closer for something new—after all, this is how the notion of the closer got its start, and so it’s how it will meet its inevitable end. Just don’t expect it to happen anytime soon.