For the past several years, the perception that closers perform poorly in non-save situations has grown. These relief aces supposedly fail to look sharp unless they’re under pressure and have the game’s fate in their hands. Our own experiences have helped fuel this idea; we’ve all witnessed an otherwise untouchable pitcher enter a game with a 3-0 deficit and allow a few more runs to score in an ineffective inning. Unfortunately, with the memory of these negative events in mind, a confirmation bias emerges, where every new example only provides further evidence of closer ineptitude when the game is not on the line. Is this strictly a confirmation bias, or are the discrepancies in the data between save situations and non-save situations real and significant?

Last year I conducted a study on closers, pooling together all seasons with at least 15 saves from 1980-2007. The query offered 696 pitcher-years and 220 unique game-savers, but the analysis of their stats in and out of save situations was a solid first step at best. The results, which were deemed viable via a paired samples t-test that compares the means of two different variables, showed that closers did post somewhat improved rates in their save opportunities. The discrepancies in the rate stats measured (ERA, K/9, BB/9), though they were significant, differed only by plus or minus 0.25 units per nine innings.
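For readers unfamiliar with the test, here is a minimal sketch of how a paired-samples t statistic is computed: take each pitcher-season's difference between the two situations, then compare the mean difference to its standard error. The function name and the ERA values are made up for illustration; they are not the figures from the study.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for a paired-samples t-test."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # One difference per pair (for example, per pitcher-season).
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical ERA values for five pitcher-seasons, NOT real data.
save_era = [2.10, 3.05, 2.80, 3.60, 2.45]
non_save_era = [2.50, 3.00, 3.20, 3.90, 2.60]

t_stat, dof = paired_t_statistic(save_era, non_save_era)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The statistic is then compared against a t distribution with the given degrees of freedom; a library routine such as `scipy.stats.ttest_rel` performs the same calculation and also reports a p-value.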

The study failed to incorporate a few very important factors which, when controlled for, may produce vastly different results. For starters, the most obvious is the rust factor: a large chunk of non-save appearances are examples of the hurler merely getting his work in. If the closer has not entered a game in four days, it makes perfect sense that he may not be at his best. His control may be solid but not pinpoint, or it might take several in-game pitches before he reaches his target velocity. The next important factor involves who makes the bulk of the save and non-save appearances. A very talented team is likely to win a good number of games, providing its closer with plenty of chances to record saves, and he may make only a handful of non-save appearances. Conversely, closers on poor teams are more prone to appearing in non-save situations, because the opportunities to save games are not as abundant. In a small enough sample, these opportunity factors could drastically skew the data; a closer on a bad team may be dynamite when it counts, but merely average, for random reasons, when he’s doing nothing more than getting in his work.

Additionally, the strength of opponents must somehow be factored into the equation. A large enough sample of games played is required before the average winning team will appear to be better than the average losing team. For instance, I currently participate in a superstar Strat-o-Matic league, and one of my favorite tools is the lineup evaluator, which runs simulations pitting certain lineup configurations against another team. In a set of 25 simulations, my squad can do pretty well against what is considered to be the best team in the league, but over 15,000 simulations, the true talent levels of each team become much clearer. Combine these mitigating factors with the results discussed previously and it becomes evident that the results merit further control and adjustment before they can provide any true insight.
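The sample-size point can be illustrated with a short sketch. It assumes a hypothetical true win probability of 0.45 for the weaker squad and treats every game as an independent coin flip, which is a simplification and not how the Strat-o-Matic evaluator works. The function names and the seed are likewise illustrative.

```python
import math
import random


def standard_error(win_prob, n_games):
    """Standard error of an observed win rate over n independent games."""
    return math.sqrt(win_prob * (1 - win_prob) / n_games)


def simulated_win_rate(win_prob, n_games, rng):
    """Play n_games coin-flip games and return the observed win rate."""
    wins = sum(1 for _ in range(n_games) if rng.random() < win_prob)
    return wins / n_games


TRUE_WIN_PROB = 0.45  # hypothetical talent gap, chosen only for illustration
rng = random.Random(2008)  # fixed seed so repeated runs match

for n_games in (25, 15_000):
    observed = simulated_win_rate(TRUE_WIN_PROB, n_games, rng)
    spread = standard_error(TRUE_WIN_PROB, n_games)
    print(f"{n_games:>6} games: observed {observed:.3f}, standard error {spread:.3f}")
```

Under these assumptions the standard error over 25 games is about ten percentage points, large enough for the weaker team to finish above .500 fairly often, while over 15,000 games it shrinks to well under one point.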

What about Pitch F/X data? Do closers have different pitch data in and out of save situations? This aspect of performance would be largely immune to the factors detailed above. After all, a closer is unlikely to make a conscious effort to throw with less velocity or decreased movement just because the opponent has been on a seven-game losing streak. To begin, I queried one of my databases for all pitchers with at least 10 saves last season. (Which reminds me: with only one season of data in our sample, this analysis is more of a preview of a movie being released five years down the road than it is the feature itself.) The query produced 37 pitchers, most of whom were full-time closers for the entire season, along with some who had lost or gained the role halfway through the season. Four more who just missed the benchmark were added because they were closers at some point during the season, providing a grand total of 41 pitchers.

Coding a database for save opportunities is a very labor-intensive task, a fact that Sean Forman of Baseball Reference will undoubtedly vouch for, so instead I did the coding manually and entered the results into a game-logs table. Then the data was joined to the Pitch F/X information, and averages in and out of save situations were calculated. Fastballs were of primary concern; closers throw them practically three-quarters of the time. Here are the overall results for the group as a whole:
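The join-and-average step can be sketched roughly as follows. The field names, the `"FF"` fastball code, and the toy rows are all assumptions made for illustration; the actual tables and schema are not described here, and this version pools every pitch rather than averaging per pitcher first.

```python
from collections import defaultdict

FIELDS = ("velo", "pfx_x", "pfx_z")

# Hypothetical hand-coded game logs: (pitcher, game_id) -> save situation?
game_logs = {
    ("pitcher_a", "g1"): True,
    ("pitcher_a", "g2"): False,
}

# Hypothetical pitch-level rows, NOT real measurements.
pitches = [
    {"pitcher": "pitcher_a", "game_id": "g1", "pitch_type": "FF",
     "velo": 95.0, "pfx_x": 5.0, "pfx_z": 10.0},
    {"pitcher": "pitcher_a", "game_id": "g1", "pitch_type": "FF",
     "velo": 94.0, "pfx_x": 6.0, "pfx_z": 9.0},
    {"pitcher": "pitcher_a", "game_id": "g1", "pitch_type": "SL",
     "velo": 86.0, "pfx_x": -2.0, "pfx_z": 1.0},
    {"pitcher": "pitcher_a", "game_id": "g2", "pitch_type": "FF",
     "velo": 93.0, "pfx_x": 5.5, "pfx_z": 9.5},
]


def average_fastballs_by_situation(game_logs, pitches):
    """Join pitches to game logs and average fastball metrics per situation."""
    totals = defaultdict(lambda: {"n": 0, **{field: 0.0 for field in FIELDS}})
    for pitch in pitches:
        if pitch["pitch_type"] != "FF":
            continue  # fastballs only
        key = (pitch["pitcher"], pitch["game_id"])
        if key not in game_logs:
            continue  # inner join: skip pitches from games that were not coded
        bucket = totals["save" if game_logs[key] else "non"]
        bucket["n"] += 1
        for field in FIELDS:
            bucket[field] += pitch[field]
    return {
        label: {field: bucket[field] / bucket["n"] for field in FIELDS}
        for label, bucket in totals.items()
    }


averages = average_fastballs_by_situation(game_logs, pitches)
for label in sorted(averages):
    print(label, averages[label])
```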

Overall   Velo (mph)   PFX (in)   PFZ (in)
Save        92.99        5.41       8.95
Non         92.48        5.53       9.04

And here are the results partitioned by average fastball velocity:

Type          Velo (mph)   PFX (in)   PFZ (in)
Fast-Save       95.13        5.23      10.04
Fast-Non        94.93        5.32      10.22
Medium-Save     92.46        5.66       9.17
Medium-Non      92.06        5.79       8.99
Slow-Save       89.12        5.27       6.61
Slow-Non        88.46        5.43       6.98
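To make the gaps easier to read, the save-minus-non-save differences can be computed directly from the two tables above. This is only arithmetic on the published averages; the dictionary layout and function name are illustrative.

```python
# Fastball averages copied from the two tables above:
# (velocity, horizontal movement PFX, vertical movement PFZ).
AVERAGES = {
    "Overall": {"save": (92.99, 5.41, 8.95), "non": (92.48, 5.53, 9.04)},
    "Fast": {"save": (95.13, 5.23, 10.04), "non": (94.93, 5.32, 10.22)},
    "Medium": {"save": (92.46, 5.66, 9.17), "non": (92.06, 5.79, 8.99)},
    "Slow": {"save": (89.12, 5.27, 6.61), "non": (88.46, 5.43, 6.98)},
}
METRICS = ("velo", "pfx", "pfz")


def save_minus_non(averages):
    """Return the save-minus-non-save gap for each group and metric."""
    gaps = {}
    for group, rows in averages.items():
        gaps[group] = {
            metric: round(save_value - non_value, 2)
            for metric, save_value, non_value in zip(METRICS, rows["save"], rows["non"])
        }
    return gaps


GAPS = save_minus_non(AVERAGES)
for group, gap in GAPS.items():
    print(f"{group:<8} velo {gap['velo']:+.2f}  pfx {gap['pfx']:+.2f}  pfz {gap['pfz']:+.2f}")
```

The velocity gap is positive and the horizontal-movement gap negative in every group, while the vertical-movement gap changes sign in the medium-velocity group.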

Across the entire group, there was roughly a half-mile per hour increase on the heater in save situations, albeit with slightly less movement. Breaking the group down diminishes the significance of an already questionable sample, but it does provide a window into what may become apparent with more pitcher-years added into the fold. Running paired samples t-tests shows that the only statistically significant difference involved overall horizontal movement, which is itself likely a Type I error (a false positive) and, even if real, is underwhelming in practical application. This segues nicely into a discussion about statistical significance versus clinical significance. I am by no means a clinical psychologist, but I have hung around with one for over a year now, and some of it occasionally rubs off. The difference between the two forms can be summed up by the following question: a half-mile per hour difference on the fastball in save situations may be statistically significant in a larger sample, but does that result mean anything to the pitcher or to overall pitcher evaluations?

Despite the sample not yet being large enough to produce useful results, I fully expect to see similar findings when rerunning the study in future years. The question then arises of whether these rather minimal discrepancies even matter. A half-mile per hour? I’ve yet to read any studies proving that pitchers generally average slower velocities in their ineffective games, and the movement components differ by only about one-tenth of an inch for the group as a whole, which is not very likely to make or break a game or a season given the standard deviations we’ve previously looked at and discussed.

What this all boils down to is that in pondering whether closers perform as well out of save situations as they do in them, we’re asking the wrong question. What should be of interest is whether or not they pitch differently, and these cursory results would seem to suggest that they do. Then again, this would make intuitive sense from a clinical standpoint; a pitcher is likely to exhibit slightly different tendencies when protecting a one-run lead than when entering a game with an eight-run lead. In the latter game, a closer may be less likely to induce swings out of the zone, merely attempting to pitch to contact, and if a run scores in the process, who cares? This lax approach would obviously not work in save situations.

A pitcher should pitch differently in non-save situations, especially those with a hefty lead, because his approach involves aspects of pitching with a higher probability of surrendering runs. In save situations or with a one-run lead, the pitcher is much more careful to not give up the tying run, but in trying to be too fine he may actually give up a few. These different approaches may occasionally produce worse rates in non-save situations, but a straight-up comparison of performance-based stats is inaccurate because the closers are implementing two different strategies. Such a comparison would be akin to comparing Tiger Woods’ performances in and out of major tournaments; he’s more likely to buckle down in the US Open than he is in the Perrotto/Jaffe Invitational. This may lead to worse rates, but in order to really prove that a closer was worse in non-save situations, he would need to give it his all in those appearances, and from a human evaluative standpoint, this is very unlikely to occur. Do they pitch worse in non-save situations? It’s not worth answering, because it’s the wrong question. Do they pitch differently? Yes… and they should.