Mitchel Lichtman (aka MGL), sabermetrician and co-author of *The Book*, recently had a post on his blog that I liked a lot for two reasons.

First, he presented some research, came to a conclusion, published, then did some more work, refined his techniques, came to a *different* conclusion, and published that as well. I’m over-simplifying here, but the point is that in a world in which medical researchers bury results they don’t like and the Oxford English Dictionary Word of the Year (*post-truth*, if you don’t want to click the link) reflects how “objective facts are less influential in shaping public opinion than appeals to emotion and personal belief,” it’s reassuring that our little corner still values the scientific method and full disclosure.

Second, his findings are interesting! His article’s title lays out his research: “What does it mean when a pitcher has a few really bad starts that mess with his ERA?”

We’re all familiar with this kind of thinking. Last July 3, Jon Lester started for the Cubs at Citi Field against the Mets. He allowed a home run to Curtis Granderson in the first inning, but things really fell apart for him in the second frame: home run, strikeout, double, home run, walk, double, single, single, wild pitch, single, single. He was pulled, having allowed eight runs, all earned, in one-and-a-third innings.

Lester, of course, had a fine season in 2016: 2.44 ERA (second in the National League), 3.45 FIP (seventh), 3.10 DRA (fifth), 5.3 PWARP (fifth). Relevant to MGL’s title, if you remove *just that one start* from Lester’s season, his seasonal ERA drops all the way to 2.10, which would have enabled him to edge out teammate Kyle Hendricks’ 2.13 to lead the league.
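For anyone who wants to check that arithmetic, here is a minimal sketch. The season totals (55 earned runs in 202.2 innings) are not stated in this article; they are assumed here because they reproduce the 2.44 figure. The helper names are made up for illustration.

```python
def innings_to_outs(ip: str) -> int:
    """Convert baseball innings notation ("202.2" = 202 and 2/3) to outs."""
    whole, _, frac = ip.partition(".")
    return int(whole) * 3 + int(frac or 0)


def era(earned_runs: int, outs: int) -> float:
    """Earned runs allowed per nine innings (27 outs)."""
    return earned_runs * 27 / outs


# Assumed season totals (not given in the article).
season_er, season_outs = 55, innings_to_outs("202.2")
# The July 3 start as described above: 8 earned runs in 1 1/3 innings.
start_er, start_outs = 8, innings_to_outs("1.1")

full = era(season_er, season_outs)
without = era(season_er - start_er, season_outs - start_outs)
print(f"{full:.2f} -> {without:.2f}")  # 2.44 -> 2.10
```

Working in outs rather than decimal innings sidesteps the ".1 means one-third" notation problem.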

Of course, you can’t just ignore one bad game, just as you can’t ignore Mike Trout’s August 7 game in Seattle, when he wore a Golden Sombrero at the hands of James Paxton. A season is a combination of good and bad games, aggregated together.

MGL identified every pitcher from 1977 to 2016 with at least 100 innings who had at least four starts in which they pitched five or fewer innings and gave up six or more runs. He then compared all such pitchers who gave up an average of five or more runs per nine innings (RA9) with all pitchers who pitched at least 100 innings with an RA9 of at least 5.00 who didn’t have four or more such starts.

In other words, he was looking at a lot of starters who had disappointing seasons (i.e., RA9 of 5.00 or more), some of whom had four or more terrible starts, some of whom didn’t. Last year, pitchers with 100 innings and an RA9 of at least 5.00 ranged from Josh Tomlin (5.02), Taijuan Walker (5.02), and Michael Pineda (5.03) to Shelby Miller (6.42), Adam Morgan (6.45), and Tyler Duffey (6.97).

This gave him two buckets—pitchers with an RA9 of 5.00 who had at least four starts in which they were truly terrible, and pitchers with an RA9 of 5.00 who didn’t have big blowups. For each bucket, he compared the pitchers’ performance the following year with their projected performance, to see whether pitchers whose season was sabotaged by a few bad starts would outperform their more consistently unimpressive peers.
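To make the two-bucket setup concrete, here is a rough sketch in Python. It is not MGL's code: the `Season` record, the function names, and the three sample pitchers are invented for illustration, the 100-inning filter is assumed to have been applied already, and the unweighted average is a simplification (the original study may well have weighted by innings).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Season:
    pitcher: str
    ra9: float                     # year-1 runs allowed per nine innings
    starts: List[Tuple[int, int]]  # (runs allowed, outs recorded) per start
    projected_ra9: float           # year-2 projection
    actual_ra9: float              # year-2 result


def is_blowup(runs: int, outs: int, min_runs: int = 6, max_outs: int = 15) -> bool:
    """Six or more runs in five or fewer innings (15 outs)."""
    return runs >= min_runs and outs <= max_outs


def bucket(s: Season, min_blowups: int = 4) -> str:
    n = sum(is_blowup(runs, outs) for runs, outs in s.starts)
    return "blowup" if n >= min_blowups else "steady"


def mean_miss(seasons: List[Season]) -> Dict[str, float]:
    """Average (actual - projected) year-2 RA9 per bucket; negative = beat projection."""
    misses: Dict[str, List[float]] = {}
    for s in seasons:
        if s.ra9 >= 5.0:  # only disappointing seasons are compared
            misses.setdefault(bucket(s), []).append(s.actual_ra9 - s.projected_ra9)
    return {k: mean(v) for k, v in misses.items()}


# Invented sample data, purely to exercise the functions.
sample = [
    Season("A", 5.4, [(7, 12), (6, 15), (8, 9), (6, 14), (2, 21)], 4.9, 4.9),
    Season("B", 5.4, [(4, 18), (3, 18), (5, 15), (6, 18)], 4.9, 4.7),
    Season("C", 4.2, [(6, 12), (7, 9), (6, 15), (8, 6)], 4.3, 4.3),
]
result = mean_miss(sample)
print(result)
```

Pitcher C has four blowups but is dropped because his RA9 is under 5.00, which mirrors how the comparison is restricted to disappointing seasons.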

Since the projection system he used was based on season stats, it wouldn’t know that Jimmy Nelson (5.42 RA9 in 2016) had eight games in which he allowed six or more runs in five or fewer innings while Robbie Ray (also 5.42 RA9) didn’t. Or, in other words, that Nelson’s RA9 and ERA were 3.41 and 3.10, respectively, in 24 starts but 13.89 and 11.01 in the other eight.

The somewhat surprising result (to me at least) was that while the pitchers who were victimized by a handful of awful games performed more or less in line with expectations, those who were just consistently subpar did slightly *better* than expected. So, in the example above, Nelson’s blowup-fueled 5.42 RA9 was a *more accurate* indicator of his true talent than Ray’s more consistent 5.42 RA9. As MGL initially concluded:

The next time you read that, “So-and-so pitcher has bad numbers but this was only because of a few really bad outings,” remember that there is *no evidence* that an ERA or RA which includes a “few bad outings” should be treated any differently than a similar ERA or RA without that qualification, at least as far as projections are concerned.

As they say in infomercials, though, wait, there’s more.

MGL subsequently considered whether his definition of a blowup was too liberal. He re-ran his experiment, changing his definition of a bad start from six or more runs in five or fewer innings to eight or more runs in five or fewer innings. That’s a *really* bad start. Last year, there were 481 starts in which the starter allowed six or more runs in five or fewer innings, just under 10 percent of all starts. There were only 99 games in which the starter allowed eight or more runs in five or fewer innings, only two percent of starts. That’s really bad.
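Those percentages are easy to sanity-check. The sketch below assumes a full schedule of 30 teams times 162 games, or 4,860 starts; the actual 2016 total is not given here and may differ by a handful, so treat the denominator as an approximation.

```python
TEAMS, GAMES_PER_TEAM = 30, 162
# Assumed denominator: every scheduled game has one starter per team.
total_starts = TEAMS * GAMES_PER_TEAM

counts = {"6+ runs in 5 or fewer IP": 481, "8+ runs in 5 or fewer IP": 99}
shares = {label: n / total_starts for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```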

The starters in 2016 who met the revised criteria of two or more starts of five or fewer innings and eight or more runs allowed, minimum 100 innings, are Edinson Volquez with four such starts; James Shields and Josh Tomlin with three each; and Jorge De La Rosa, Jon Gray, Zack Greinke, Jason Hammel, Jeff Locke, Jimmy Nelson, Aaron Nola, Drew Smyly, and Steven Wright with two.

Skipping to the conclusion, MGL found “for starters whose runs allowed are inflated due to two or three really bad starts, if we simply use overall season RA or ERA for our projections *we will understate their subsequent season’s RA or ERA by maybe 0.2 or 0.3 runs per nine innings*.” [Italics mine]

Put another way, when looking at pitchers whose seasonal averages get trashed by two or three truly terrible starts, we probably should consider those starts to be partial outliers, and the pitcher’s true talent to be somewhat better.

Now, to this point, I’ve largely given you a book review of MGL’s post. Let me make it a little more interesting by noting its relevance to the 2017 season. Here is an alphabetical list of pitchers whose 2016 figures were hurt by two or three starts of five or fewer innings and eight or more runs allowed. I added their PECOTA projections for ERA and WHIP in 2017:

Per MGL’s research, there’s a reasonable likelihood that these pitchers will modestly beat their projections in 2017 in aggregate. The key words in that sentence are *modestly* and *in aggregate*. It would be foolhardy to wait until the middle rounds of your fantasy draft and then snatch up De La Rosa, Locke, and Shields. (Important caveat: Ignore the last sentence if you’re an owner in any of my leagues. Go for them!) PECOTA is likely *a little* too pessimistic about these pitchers *as a group*, because their 2016 numbers were hurt so severely by two or three disaster starts.

Jon Gray’s seasonal ERA rose by more than two full runs on May 19, when he allowed nine runs in 3.1 innings in St. Louis. MGL’s research suggests that while that one game put a hurt on his 2016 numbers, it may not be as relevant in the upcoming season as the 2.61 ERA he compiled over his next 13 starts.

The converse would be true for better-than-average pitchers.

To the extent that variability persists year after year, I wonder whether WARP would be made more accurate by considering it.

Here is what I said at the end of my piece:

"Our certainty of this conclusion, especially with regard to the size of the effect – if it exists at all – is pretty weak, given the magnitude of the differences we found and the sample sizes we had to work with. However, as I said before, it would be a mistake to ignore any inference – even a weak one – that is not contradicted by some Bayesian prior (or common sense).

Basically I found a .28 run difference between the projected and actual RA9 NEXT SEASON for starters with exactly 2 or 3 terrible outings in one season. For all other similar starters, I found only a .03 difference in the same direction (actual was BETTER than projected).

.25 or .28 runs is a little less than 2 SD for the number of IP (3,500) in the experimental group (the ones with the terrible outings).

Given that I don't really have any Bayesian priors (which is rare in analyses like this), a 2 SD difference is, well, you decide. And remember, when we find a 1 or 2 (or 3 or whatever) SD effect, it's not a binary thing, like we accept or reject that difference or the hypothesis that there IS a true difference. We're trying to find out what the magnitude of the true difference or size of the effect IS, if there is any at all (and what does NO effect even mean - if the true effect is .01 runs per 9, is that NO effect? What about .02? .05?).

If we find a 2 SD effect or difference (from a control group or the null hypothesis), well that means that an effect of that size or more is unlikely to occur by chance. But, what if the true effect were small but it did exist? Well, now that 2 SD difference is more likely to have occurred by chance. If the true effect were half that size, then we have a 1 SD random effect, which occurs quite frequently (16% at one tail).
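The tail probabilities behind that reasoning can be sketched in a few lines. This assumes the sampling error is roughly normal, which is a simplification for illustration and not something established above; `upper_tail` is a made-up helper name.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


observed_sd = 2.0  # the observed difference, in SD units
for true_effect_sd in (0.0, 1.0):
    luck_needed = observed_sd - true_effect_sd
    p = upper_tail(luck_needed)
    print(f"true effect {true_effect_sd:.0f} SD: needs {luck_needed:.0f} SD of luck, p = {p:.3f}")
```

With no true effect, 2 SD of luck has a one-tail probability of about 0.023; if half the observed difference is real, only 1 SD of luck is needed, which happens about 16% of the time.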

So be careful when drawing conclusions from empirical tests like these. That's all I'm saying.

It might make sense that when we have a group of pitchers who have had a few disastrous starts, SOME of these pitchers had something really wrong with them for those starts to SOME extent, as compared to a group of pitchers who have not had any disastrous starts. If that were the case, we would expect the first group to outperform their projection, assuming that they are not somehow constitutionally inclined to continue to have a few outings where something is really wrong with them.

I'll try to follow in your footsteps with relievers!