I think that we've really misunderstood pitcher BABIP over the years.

One of the main tenets of what's become known as DIPS Theory is that there are three "true" outcomes of a plate appearance from a pitcher's perspective, and that what happens when the ball is in play is mostly luck. It's one of those assumptions that's been around so long that it's baked into a lot of what sabermetricians hold dear. We have component ERAs that assume that a pitcher should have a league-average BABIP. We confidently state that a pitcher will regress to the league mean as if it were a matter of course. We predict doom for pitchers who have a .260 BABIP and salvation delayed for pitchers who have an "unlucky" .350 mark. "Danger Will Robinson! (Insert name of pitcher) has been running on luck and will collapse any moment now!" makes for an easy article. I know, I've written plenty of them. In fact, as recently as last week, I predicted that the Orioles would relapse into mediocrity because four of their main relievers from last year had BABIPs in the .260 range and thus, their success was a vast mirage. Because once the ball leaves the bat, it's all random chance, right?

At this point, there's a pretty good consensus that the real answer to the question is "Yeah, but… hang on a minute, there's more to it." There are a bunch of logical factors that can influence BABIP.

  • For one, line drives that don't leave the park tend to fall for hits about 70 percent of the time (70.9 percent in 2012), while ground balls find a hole around a quarter of the time (23.8 percent in 2012), and fly balls drop in a place that is not over the wall 13 percent of the time (13.1 percent in 2012). The rate at which pitchers give up these various types of batted balls is fairly stable across time. BABIP could simply be a function of what sorts of batted balls a pitcher likes to yield. More than that, my former BP (and Statistically Speaking) colleague Matt Swartz also found that pitchers who yielded a lot of ground balls tended to have lower BABIPs on ground balls than would be otherwise expected. Not all grounders are created equal!
  • My former BP (and Statistically Speaking) colleague Mike Fast found that BABIP can depend a great deal on whether batters tend to hit pulled or opposite field air balls off a pitcher. Opposite field hits are more likely to be line drives. Liners are more likely to fall for hits. Pulled balls are more likely to be fly balls and to fly over the fence (which takes them out of the BABIP discussion).
  • Mike Fast also found that pitchers have some amount of control over how hard a ball is hit, and that harder-hit balls tend to go for hits more often.
  • Also, another researcher going by the improbable pseudonym "Pizza Cutter" found that ball-strike count made a difference. Balls put in play in pitcher's counts were less likely to go for hits than those in hitter's counts. Getting to an advantageous count was an outcome that appeared to have some stability for pitchers.
  • The kitchen utensil guy also found that while BABIP might take around 3,800 balls in play to show enough statistical reliability to be considered stable, it does eventually stabilize.
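
The arithmetic behind the first bullet can be sketched in a few lines. This is only an illustration, assuming the 2012 hit rates quoted above; the function name and the two batted-ball mixes are hypothetical, not real pitchers.

```python
# Hit rates on balls in play by batted-ball type (2012 figures quoted above).
HIT_RATE_2012 = {"line_drive": 0.709, "ground_ball": 0.238, "fly_ball": 0.131}


def expected_babip(mix, hit_rates=HIT_RATE_2012):
    """Weighted average of per-type hit rates.

    `mix` maps batted-ball type to its share of a pitcher's balls in play;
    the shares must sum to 1.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("batted-ball shares must sum to 1")
    return sum(share * hit_rates[kind] for kind, share in mix.items())


# Two hypothetical pitchers (illustrative shares, not real data).
groundballer = {"line_drive": 0.18, "ground_ball": 0.55, "fly_ball": 0.27}
flyballer = {"line_drive": 0.20, "ground_ball": 0.35, "fly_ball": 0.45}

print(round(expected_babip(groundballer), 3))  # 0.294
print(round(expected_babip(flyballer), 3))     # 0.284
```

Even with identical per-type hit rates, the two mixes land about ten points of BABIP apart, which is the sense in which batted-ball profile alone can move the number.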

Still, most of the common ERA estimators (and a fair number of writers) continue to assume that BABIP is something that is out of the pitcher's control, and that, over time, it will return to league average (or at least a small window around that average).

Maybe we've been wrong all along. What if BABIP isn't a random event? What if we've just massively misunderstood the concept?

Warning! Gory Mathematical Details Ahead!
(This one is very dense and very math-heavy, but I promise it's worth fighting through.)

Proof no. 1: If BABIP is random, then why can I find a nice easy predictor of what's coming on the next ball in play?

Let's start with a fairly obvious question. In addition to groundball/flyball/pull/opposite field tendencies, wouldn't BABIP vary by how well the pitcher in question was throwing on that day? It's well known that pitchers vary in how much stuff they have from start to start. Any given pitcher might also have some minor injury that he pitches through (don't we all?) that still affects him over two or three starts. So, I decided to look at whether recent BABIP performance might predict the outcome of a single plate appearance.

To do this, I pulled a new trick out of the bag. For the years 1993-2012, I isolated all balls in play and coded whether they fell for a hit or not. I found the league average for that year. The way that BABIP is currently conceptualized, this should be the only number that we need. I converted the league BABIP into natural log of the odds ratio. In addition to the league number, I calculated what had happened to the 10 previous balls in play for this pitcher within this season. I did this as a moving average, so each ball in play had as a predictor the average fate of the 10 balls immediately before it. Again, I converted the BABIP to a logged odds ratio.
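
For readers who want to see the two transformations concretely, here is a minimal sketch of a logged odds conversion and a trailing 10-BIP average. The function names and the toy outcome sequence are assumptions made for illustration; this is not the code used in the study.

```python
import math
from collections import deque


def log_odds(p):
    """Natural log of the odds p / (1 - p); undefined at p = 0 or p = 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("log odds is undefined for p <= 0 or p >= 1")
    return math.log(p / (1.0 - p))


def trailing_babip(outcomes, window=10):
    """For each ball in play, the hit rate over the `window` BIP before it.

    `outcomes` is one pitcher's within-season sequence of 1 (hit) / 0 (out).
    Entries are None until `window` earlier balls in play exist.
    """
    recent = deque(maxlen=window)
    result = []
    for hit in outcomes:
        result.append(sum(recent) / window if len(recent) == window else None)
        recent.append(hit)
    return result


# Toy sequence of 12 balls in play (hypothetical).
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
rates = trailing_babip(outcomes, window=10)
print(rates[10])                      # 0.3 (three hits in the first ten BIP)
print(round(log_odds(rates[10]), 3))  # -0.847
```

Note that the current ball in play is never part of its own predictor: the average is recorded before the outcome is added to the window.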

At first, I ran a logistic regression using only the previous 10 BIP as a predictor, controlling for the league BABIP for that year. And I got…nothing. There was no significant association between recent performance and what happened on the next ball in play. It looked like each ball in play, once it left the bat, was as likely to fall in as any other. Or at least like recent performance wasn't going to help me.

But then I changed to the previous 20 BIP as my sampling frame, and a funny thing happened. Significance. Pulling in more data from the pitcher's recent past made the predictor better. I went to 30 BIP and got significance again, and somewhat stronger significance at that. I went to 40, and then 50, and it kept getting better. There's a way to tell whether a predictor in a binary logistic regression is better or worse than another. It's a model fit statistic called -2 log likelihood. All you need is a consistent set of cases. Run a series of predictors on the same set of cases and the one that gives you the greatest amount of change in the -2 log likelihood is your best bet. You can also compare the -2 log likelihood contributions of different variables in the model. I isolated cases where I could calculate a running mean from 10 BIP to 250 BIP (in 10-BIP increments) within a season (thus, the pitcher needed to have at least 251 BIP for that season and only plate appearances from the 251st ball in play onward were used). To handle streaks where a pitcher had an 0-for-10 groove going (you can't take a logarithm of zero), I excluded those cases from all analyses.
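
To make the fit statistic concrete, here is a small sketch of how -2 log likelihood is computed for a set of 0/1 outcomes. It scores fixed, made-up probabilities rather than fitting a logistic regression (in a real regression the coefficients would be estimated by maximum likelihood), and every number in it is hypothetical.

```python
import math


def neg2_log_likelihood(outcomes, probs):
    """-2 * log likelihood of 0/1 outcomes under predicted hit probabilities.

    Each probability must lie strictly between 0 and 1.
    """
    ll = 0.0
    for y, p in zip(outcomes, probs):
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * ll


# Hypothetical balls in play: 1 = hit, 0 = out.
outcomes = [1, 0, 0, 1, 0, 1, 0, 0]

# Baseline: every ball in play gets the same league-average probability.
baseline = [0.30] * len(outcomes)

# Candidate: made-up per-case probabilities from some other predictor.
candidate = [0.40, 0.25, 0.25, 0.40, 0.25, 0.40, 0.25, 0.25]

drop = neg2_log_likelihood(outcomes, baseline) - neg2_log_likelihood(outcomes, candidate)
print(drop > 0)  # True: the candidate fits these same cases better
```

Lower values mean a better fit, so the predictor that produces the largest drop on the same set of cases is the stronger one. That is why the comparison needs a consistent set of cases.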

Looking at the comparisons of how the moving averages fared against the league-average BABIP was a revelation. At 10 BIP, the league BABIP had a 4-to-1 edge in predictive power, consistent with what we've been taught about BABIP all these years. But as the sampling frame crept up, the pitcher's recent results on balls in play started to become a relatively stronger predictor. By the 100-BIP sampling frame, a pitcher's recent performance was the stronger of the two predictors. Around 150 BIP, it was about a 60/40 split in favor of the pitcher's recent results, and it stayed around that ratio up to 250 BIP.

It's hard to argue that a pitcher's recent performance is unrelated to some sort of underlying skill that he has, and the sampling frame needed to show that is much shorter than we would have imagined. (We'll talk about that "skill" in more detail in a minute.) If BABIP is simply a matter of luck and pitchers are tethered to the league average, why is this skill-related predictor doing a better job than league average of predicting the results of the next ball in play?

Proof no. 2: It's not defense…

One obvious critique of the above is that I may simply be picking up on the effects of the defense behind a pitcher. A groundball pitcher with four vacuum-cleaner infielders behind him will look amazing when it comes to BABIP. We need a way to separate what the pitcher is doing from how much his defense picks him up. Another less-obvious critique is that a pitcher's BABIP might depend more on the quality of the batter whom he faces.

Fortunately, from 1993-1999, Retrosheet data contain an indicator of what sort of ball the batter hit (ground ball? line drive? fly ball?) and where on the field the ball was hit based on a grid system. Now, these data have to be treated with some caution. Stringers classifying batted balls have biases. A line drive vs. a fly ball is something of a judgment call. And so is location. There is likely a tendency to place a ball that gets through the infield as being hit to the '56' zone (between the 3B and SS), but the same ball that another shortstop manages to get to as being in the '6' zone (right at the SS). Some of these data points are also 20 years old, and we have no data on how hard the ball was hit. It's not perfect, but it will do for now.

For each ground ball (excluding bunts), I looked at what zone the ball was recorded as entering. For each zone, I calculated the league-wide expected BABIP for a ball hit to that area. By doing this, I was able to get both the pitcher's and batter's overall expected BABIP on grounders, based solely on the location of where the balls were hit. If the pitcher was steering grounders to areas where his fielders should have gotten them, and the fielders were simply subpar or he was facing batters who were good at "hitting it where they ain't," this method should account for that.
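
The zone-based calculation can be sketched as follows. This is a simplified illustration under stated assumptions: the record layout, the function names, and the six toy grounders are all hypothetical, and the zone labels simply reuse the '6' and '56' examples mentioned above.

```python
from collections import defaultdict


def league_babip_by_zone(grounders):
    """Hit rate for each zone across all grounders.

    `grounders` is a list of (pitcher, zone, is_hit) tuples, is_hit in {0, 1}.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for _pitcher, zone, is_hit in grounders:
        hits[zone] += is_hit
        total[zone] += 1
    return {zone: hits[zone] / total[zone] for zone in total}


def expected_babip_by_pitcher(grounders, zone_rates):
    """Average league zone rate over the zones each pitcher's grounders entered."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pitcher, zone, _is_hit in grounders:
        sums[pitcher] += zone_rates[zone]
        counts[pitcher] += 1
    return {p: sums[p] / counts[p] for p in counts}


# Toy data: pitcher A keeps the ball at the shortstop, pitcher B finds the hole.
grounders = [
    ("A", "6", 0), ("A", "6", 0), ("A", "56", 1),
    ("B", "56", 1), ("B", "56", 0), ("B", "6", 0),
]
zone_rates = league_babip_by_zone(grounders)
expected = expected_babip_by_pitcher(grounders, zone_rates)
print({p: round(v, 3) for p, v in expected.items()})  # {'A': 0.222, 'B': 0.444}
```

The pitcher's expected figure depends only on where his grounders went, not on whether his own fielders converted them. The same grouping keyed on batter instead of pitcher would give the batter's expected BABIP.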

I also calculated the BABIP for the pitcher's team on ground balls over the course of the season in question, excluding those that happened with the current pitcher on the mound. This will give us a rough estimate of the team's defensive quality overall. Finally, I calculated the league BABIP on grounders. I converted all of the above to logged odds ratios again. I created a logistic regression for all ground balls in the data set coded for whether they went for hits or not. I entered each of the four indicators above as predictors, including only plate appearances where both the batter and pitcher had 100 grounders or more during the year.

After that was done, I went back and did the same for line drives, and then for flyballs/pop ups. (For line drives, I dropped the inclusion criteria to 50 or more.)

I found the -2 log likelihood contributions for each of the predictors, similar to how I apportioned blame/credit in this article. Below is a table showing how well each of the predictors performed relative to each other for each type of batted ball.

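[Table: relative predictive power of the four predictors (the last column is League Mean) for each batted ball type: Ground Ball, Fly Ball/Pop Up, Line Drive. The cell values are not present in this copy.]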

We see that the batter's tendency to hit the ball where they generally ain't holds the greatest amount of sway over whether the ball will go for a hit. This squares with what we know about batter BABIP being a much more stable stat than pitcher BABIP. But the pitcher's tendencies to direct ground balls and fly balls to where the defense can generally get to them checks in as more important than the defense's general ability to turn batted balls into outs (the spread is closer for fly balls). And the league mean is present, but not a very strong predictor.

Far from being tethered to the league average, pitcher BABIP has a perfectly rational set of factors that influence it, and a good chunk of it belongs to the pitcher. Sure, the pitcher doesn't have full control over about 70 percent of the equation, but his contribution is generally twice as strong as that of the league average being used as a predictor.

Proof no. 3: An outcome and a skill are not the same thing.

Let's start this one with the language that surrounds the idea of DIPS and BABIP. (Note: Always study the language that someone uses. Always. Language always betrays hidden assumptions.) In Voros McCracken's original BABIP study, there were four types of outcomes of a plate appearance: a strikeout, a walk, a home run, or a ball in play. Everything was kept in its own separate box, as if these were completely separate things, but within the box the assumption was that they were completely unified skill sets.

The three true/one false outcomes model of a plate appearance assumes that we should classify events based on whether they are discrete outcomes on the scoreboard, rather than whether they reflect some underlying skill of the pitcher. Because we equated outcomes with skills, we saw that while strikeouts, walks, and home runs (somewhat less so) were repeatable from year to year, BABIP wasn't. The consensus on BABIP was "no skill involved." Maybe it should have been "poorly designed construct." Maybe the problem with BABIP isn't that it's all luck, but that getting outs on balls in play encompasses different skills in different situations, some skills which are more influenced by factors outside the pitcher's control—whether luck or defense or the batter— than others. Maybe getting outs on grounders is a different skill than getting outs on fly balls that don't leave the park.

Statistically, it's hard to create a meaningful single number that represents the sum of a wide range of only mildly related (both in terms of covariance and conceptually) components. Those who are familiar with the statistical technique of factor analysis will be familiar with this idea. For those who aren't, a quick example: Suppose that I wanted to create an index of how sad and depressed someone is. I might ask questions like how often the person feels hopeless about the future or how often the person has uncontrollable crying spells or how often the person feels that even things that used to be fun just aren't anymore. As the answer to one of these questions goes up, the answer to the others will probably also go up as well. (For the initiated, they will have high factor loadings.)

Now, let's say that I tried to add in a question about how often the person had intrusive and obsessive thoughts. Obsessive thoughts are certainly a problem and may happen along with depression, but one can have depression and no obsessive thoughts or have obsessive thoughts but no depression. If I tried to shove this extra question into my measure, it will make the measure less stable.

Maybe we've been trying to put too many unrelated skills under the umbrella of BABIP. And for some reason, we've been surprised when it doesn't work. I'd argue that instead of a component ERA, maybe the first step is a component BABIP (like an xBABIP, which BP's Derek Carty has shown to be a good indicator of future performance).

Enough of this theoretical musing. The gory math awaits!

For the year pairs 2003-2004 to 2011-2012, I found all pitchers who had at least 250 balls in play in each year. Among these pairs, the year-to-year BABIP correlation was .193, which is the sort of lowly correlation that got this whole DIPS thing started. (Note: yes, I know I'm violating assumptions about the independence of data points. For just 2011-2012, it's .205. Happy?)
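
For anyone who wants to reproduce this kind of year-to-year check, the statistic is an ordinary Pearson correlation over paired seasons. A minimal sketch, with hypothetical BABIP pairs standing in for real pitcher-seasons:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (year 1, year 2) BABIP values for five pitchers.
year1 = [0.262, 0.301, 0.315, 0.288, 0.294]
year2 = [0.295, 0.290, 0.308, 0.301, 0.284]
print(round(pearson_r(year1, year2), 3))
```

Each pitcher-pair contributes one point, so a pitcher who appears in several consecutive year pairs is counted more than once, which is the independence issue flagged in the note above.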

I ran a regression predicting the following year's BABIP using outcomes from the previous year that everyone assumes are "true": strikeouts per PA (year-to-year correlation of .77), walks per PA (.66), HR per PA (.30), GB% (.81), and FB% (.79), as well as BABIP.

The following equation produces a prediction that correlates with the next year's BABIP at a multiple-R of .305. That's not huge, but it's a) better than .193 and b) the same number as the year-to-year correlation for home run rate.

The equation: .291 + .143 * BABIP * GB_rate – .057 * K_per_PA – .630 * BB_per_PA + 1.765 * BABIP * BB_per_PA.
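
The equation is easy to evaluate directly. The sketch below assumes every rate is expressed as a fraction (a .450 ground-ball rate rather than 45), which the text does not state explicitly, and the input lines are hypothetical.

```python
def predict_next_babip(babip, gb_rate, k_per_pa, bb_per_pa):
    """Next-year BABIP from the regression equation quoted above.

    All inputs are assumed to be fractions (e.g. 0.45 for a 45% ground-ball rate).
    """
    return (
        0.291
        + 0.143 * babip * gb_rate
        - 0.057 * k_per_pa
        - 0.630 * bb_per_pa
        + 1.765 * babip * bb_per_pa
    )


# Hypothetical pitchers with the same peripherals but different BABIPs.
print(round(predict_next_babip(0.260, 0.45, 0.20, 0.08), 3))  # 0.283
print(round(predict_next_babip(0.300, 0.45, 0.20, 0.08), 3))  # 0.291
```

With these inputs, a .260 pitcher projects to roughly .283 rather than all the way back to the .291 projected for the otherwise identical .300 pitcher. BABIP enters the equation only through its interactions with ground-ball rate and walk rate.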

When we try a very simple component-level prediction for next year's BABIP, our predictive power goes up. Suddenly, this doesn't all look so cut-and-dried. The point is that when you take a more component-based view of BABIP, the skills—plural—and the interactions between those skills tend to come out. Maybe there is no difference between major-league pitchers in their ability to prevent hits on balls in play. But there certainly are differences in the abilities that go into preventing hits.

Well then… why does BABIP always seem to regress to .300?
My hope in writing this article is that we can finally put to bed the idea that every time a batter hits the ball, but not over the fence, the pitcher is some luckless (or lucky) dolt in the matter. There most certainly is skill in preventing hits on balls in play. We've just been conceptualizing the problem (and thus, measuring it) in the wrong way.

But the cynic will point out that despite all this, while BABIP may not be a unitary skill, it is an outcome that makes a large amount of difference in what happens on the scoreboard. And it does not correlate well from year to year. And yes, most guys who have .260 BABIP one season follow it up with a .300 season the next year and show a resulting decrease in their headline stats.

I still hold to the idea that BABIP is (multi-)skill-based, and have no trouble reconciling these two facts in my head. I offer the following three thoughts:

1) There will always be random variation in any measurement from year to year, and the smaller the sample size, the more likely that random variation creeps in. There probably are seasons where a pitcher had a good BABIP that really was just good luck, and we'll expect him to revert back to form in the following year. But if we took a more component-based look at BABIP, we'd probably be able to tell which inputs are more or less given to randomness. If a pitcher got lucky on an indicator that we know really is luck-based, we might predict regression. But if it was on a skill that we know to be stable, we might predict that the magic will continue. Being able to discern who got lucky vs. who might sustain that performance would be a massively interesting talent, now wouldn't it? I think a component-based view of BABIP gets us closer to that.

2) If BABIP really does consist of several skills acting in concert, a "lucky" season is likely to be the result of a pitcher who has put it together on several different skills over the course of a year. The problem might be that a loss of one of those skills might be enough to tilt him back toward the mean, and while maintaining good form on one skill is hard enough, what if it's four or five different skills? That's four or five things on which the pitcher might mess up, and the result is that he would become simply ordinary again.

3) I think there's one other measurement error that we tend to make in sabermetrics. We assume that a player is his yearly average throughout the course of the season. This makes about as much sense as noting that the average high temperature in the city of Chicago is around 50 degrees, and packing for crisp, autumn weather—in January. Sure, the overall average is 50, but as seasons change, the climate changes too, and you have to adjust your expectations. We wouldn't make that mistake in packing for a trip, yet we do it all the time in sabermetrics.

In proof no. 1, we saw that a moving-average approach to predicting BABIP was quite effective in predicting what happened next, and at that, we needed to look back at only 100 BIP before it overtook the league average as a good predictor. This leaves open the possibility that whatever the skill or skills are that are involved in BABIP, it or they may fluctuate over time. These fluctuations may not represent random variation around a mean, as is often assumed. They might be real changes in true-talent level.

There's probably a natural floor (and ceiling) to how good a pitcher can be in preventing hits on balls in play. Major-league hitters will eventually square up on even the toughest pitcher. But maybe the untapped concept that differentiates the regresser from the maintainer is the ability to hold on to a good true-talent level over a long period of time. Maybe that's a talent unto itself. Maybe studying those variations from month to month and seeing who is steady across time vs. who fluctuates wildly from week to week will shed some light on the subject.

More than anything, I hope that what we've learned is that saying "He got lucky!" isn't enough anymore. I worry that for too long, we didn't question the DIPS hypothesis strongly enough. I believe that the preponderance of evidence points to there being real differences between pitchers in their abilities to prevent hits on balls in play and that the assumption that the league-average BABIP is the best baseline going forward is false. Balls in play are not completely within the pitcher's control, but the pitcher's contribution is not trivial. We should build our assessments of pitcher quality with that knowledge in mind going forward.

Long time BP reader, very infrequent commenter, but I felt compelled in this instance. This is exactly the sort of stuff that keeps me coming back; best baseball article I've read in some while.
*clips for arbitration hearing file*
It's articles like these that make me feel that I'm not paying enough for my BP subscription.
Looks like a really nice job, Russell, but I'll need several days to digest it to be sure. ;)
Call me if you need a tutor.
What's your charge?

The only part I didn't really follow was Proof #2. I think you created an xBABIP based on where the ball was hit and compared that to actual BABIP, or something like that. But did you try to account for a skill in which a pitcher induces *where* the ball was hit. I guess I just don't understand how that analysis was pulled together.

Also, what's a logged odds ratio? I understand logs and I understand odds ratios, but what are you doing when you put them together? Perhaps there's an article somewhere where you explain it?

BTW, fantastic job. That said, whenever I read one of these studies that debunk DIPS theory, I'm still always struck that pitcher BABIP is dang hard to predict. Which, to me, is all that DIPS theory is.
In #2, the idea is that I created indicators of xBABIP based on where balls off the pitcher were hit (and how many outs a league average team should have recorded based on that), so that the pitcher wasn't penalized for having a bad defense (or credited for a good one). Then, I took the defense's BABIP for when that pitcher wasn't around.

(Natural) log of the odds ratio is just a statistical trick that I used because I used a lot of logit regression. It has to do with raw percentages not being normally distributed, and using LOR corrects for that. Also, when logit does its actual modeling, it spits out a function that gives you the LOR of the probability that you want to model.

In #2, the idea was to see how well these predictors performed relative to each other from the point of view of variance explained (as much as logit lets you do that.) Was it the pitcher's general talent in steering the ball toward a fielder? Was it the sparkling defense? Was it the batter steering the ball himself?
Fascinating. As a non-mathematician, I have the following comments/questions. Is anything in baseball truly random in the same sense as rolling dice is random? Human will abounds through all of this. And just as pitchers aren't the same through the season, they also aren't the same year-to-year, plus hitters and pitchers are always talking about "making adjustments" so it can even change within a game. Your basic point that "it's more complicated than we originally thought" seems sound and certainly worth pursuing. I know everyone is looking for the simple, elegant declarative statement but ...
Great article. Being a former pro pitcher I know that balls put in play on 2-0, 3-1 are generally hit harder than balls put in play on 0-2, 1-2 counts. There are probably other factors that can skew these numbers. I think NL starters would have a lower BABIP because they pitch to other pitchers in close to 85% of their games. NL late inning relievers rarely ever face other pitchers unless game is out of hand or starter is throwing a gem. Pinch hitters and double switches cancel that out for late-inning relievers in the NL. I would be interested in seeing research done on field conditions such as synthetic turf, teams that keep infield grass high as opposed to those teams that keep the infield grass at a very low level. Ground balls tend to get through more on turf and balls will get swallowed up by the grass at Wrigley Field. I think that balls tend to fall in more in big ballpark outfields like Coors and Safeco because outfielders have to play deeper in those parks. Infield defense I'm sure plays a part as well. I'm sure the Rays infield of Longoria, Escobar, 2b, and Loney get to more balls than an average ML infield defense, especially on grass. They also play 90+ games on turf with home games and the Rogers Centre. I wouldn't be surprised if this research has already been done.

Enjoying my premium subscription because of content like this. Keep it up.
It is interesting that we might be able to bake factors such as turf into xBABIP. My question then becomes would these seemingly small corrections in expectations just become awash in the randomness that we know is at least partially associated in DIPS.
Did you really just ask a question that had the answer inside the question "If BABIP is random, then why can I find a nice easy predictor of what's coming on the next ball in play?"? Answer: because BABIP is random.
By definition, you can't predict randomness. If BABIP is random, why am I able to find a predictor?
Excellent article. This kind of article is why I subscribed.
Thanks for the article. I've wondered about BABIP, trying to think about the similarities in simple terms for my simple brain. I find it hard to believe that a ball hit fair off Mariano Rivera in his prime is just as likely to be a basehit as a ball hit today off Heath Bell, assuming both stay in the park. Wouldn't a ball hit off Rivera be more likely to be a jam shot or squibbler while a ball hit off Bell more likely to be a rocket?
BABIP is a blunt instrument passing itself off as a scalpel. This piece of work, Russell, is more incisive than BABIP, the stat, ever was.

I'd like to add to the other compliments; the kind of article that makes this site worth paying for, instead of replacing with more free content.
Great historical modeling. This is the type of original research I love BP for.

I'd love to see you run a simulation with existing season-data to see how well your component-BABIP-predictor does against the "dumb" TTO/regress to league avg predictor. Would also be really cool to regress to career BABIP in addition.

Cool research, but I would love to see a simulated test of the prediction.

Then I can use it to my advantage in fantasy ;). After all, that's what really matters.
The thing that I didn't quite understand.

You say:
"In addition to groundball/flyball/pull/opposite field tendencies, wouldn't BABIP vary by how well the pitcher in question was throwing on that day?"

Then you say:
"At first, I ran a logistic regression using only the previous 10 BIP as a predictor, controlling for the league BABIP for that year. And I got...nothing. There was no significant association between recent performance and what happened on the next ball in play. It looked like each ball in play, once it left the bat, was equally as likely to fall in as any other. Or at least like recent performance wasn't going to help me."

10 BIP is probably about 2-3 innings of work. I'm not sure the average number of BIP in a game but I'd think it'd be around 30-40. So, basically, 10 BIP is not predictive of BABIP, though 20 has some significance, though not as strong as 30, not as strong as 40, etc...

To return to the original quote, is this saying BABIP does not do a good job at predicting performance on that one specific day?

Is it also saying that BABIP does a better job at predicting performance over multiple starts (and might be pretty much useless for relievers over the course of two months).

Then, thirdly, if BABIP is far from the league norm, which matters more.. the length of time the BABIP is measured at or the variance from the league norm (and the amount of "snap back"/regression to the mean)?

Good point. My original thought was that the recency of 10 BIP would have a strong amount of predictive power, but it seems that it's just too small/noisy a sample size to really get a read on what the pitcher is up to. But it turns out that when you pull back a little bit, the sample size is big enough to provide at least some clarity. Predicting the next ball in play may very well be a function of how the pitcher is feeling that day, but the previous 10 BIP just doesn't give us enough of a read on that to tell how he's doing.

On the third point, if BABIP is far from the league norm over the last 100 BIP (say it's .240), then from a variance explained point of view, the recent personal history of the pitcher is more important than the league average. However, understand that the recent personal history of the pitcher is not a static number.
Regarding 100 BIP, isn't that the equivalent of about a season and a half of innings from a starting pitcher? Which is generally used as a gauge to see what's a trend?

Maybe it might be interesting to take pitchers who have extremely low or extremely high BABIP over 300 innings last year and see how much that predicts their performance for this year?
Figure a starter faces 25 or so batters per start, and strikes out/walks 7 or 8, then 100 BIP would be roughly 6 starts.
Love it when I fail my own math.
This is wonderful. Thanks, Russell.

I'd love to see a follow-up about which pitchers seem to possess which skill and how that component positively/negatively affects their BABIP.

Matt Cain and Zack Greinke come to mind.
Proof no. 1 seems to imply that a pitcher's last 50 to 85 innings approximately (figuring roughly H - HR = K + DP + CS + outfield assists) is a better indicator of his current ability than his last 250 innings.
Russell, I don't understand this:

On the third point, if BABIP is far from the league norm over the last 100 BIP (say it's .240), then from a variance explained point of view, the recent personal history of the pitcher is more important than the league average.

In my uneducated (from a statistics perspective) brain, if the recent history is MORE IMPORTANT than the league average, that implies that you would regress a pitcher's last 100 BIP BABIP less than 50% toward the league mean in order to predict the BABIP of his next few BIP. Clearly, that is not the case. Give me a pitcher who is .240 through his last 100 BIP and I will show you a pitcher who is .2997 (or whatever) through his next 10 or 20 or 100 BIP, where league average is .300 (for pitchers with similar profiles, like GB rate), after adjusting for the opposing team, his defense, and the park. So I don't understand what you mean by "more important" or even what those relative percentages in the chart mean.
We're not talking about how far to regress here. That's a different set of analyses. We have two variables fighting it out to see who is better at predicting the outcome of the next ball in play, which is the true measure of how good a predictor is. These analyses tell you that recent history does a better job modeling the outcome of the next BIP than does league average. That right there suggests that the standard DIPS assumption that everyone is league average deep down should be treated with suspicion.

So at that moment, he is better described as a .240 BABIP rather than a .300 BABIP.
Negative no explanation dingers be damned.

I'm trying to imagine the implications of this. So, you're saying it is way too much of a leap to say a pitcher's tendency to have a more consistent BABiP or one more consistently above or below average happens in swings of 150-250 balls put in play does not imply that his overall effectiveness would be more likely to swing over the course of the same interval? Would that be worth looking into next?

Secondly, I'm trying to speculate how this happens. You mentioned the change of seasons as a metaphor, but taking it more directly I thought the notion of cold weather pitchers vs. warm weather pitchers is overstated if existent at all. . . the same for early season pitchers vs. late season pitchers (or is that another issue to be studied?) Are pitchers streaky - do they have grooves then get sloppy after so many appearances, then take the same number of outings to get back on track? Again, I'm getting ahead of the scope of your study, but I'm looking for ways to apply it.

One technical question: do you know if a pitcher's BABIP overall improves at the same rate as it degrades?
I think you're right that it's worth looking into how many cases there are with an "extreme" value (say, .240) and how well the model performs in those cases. That's testable.

As to your second point, I'd love to know how this works too! If the fundamental message of what I'm trying to say here holds, then it opens up a lot of different avenues of investigation!

On your third question, I don't know that one yet.
But aren't you saying then that the best prediction at that point would be a BABIP of .270 or less? That's about the same thing as saying you should regress less than 50% towards the league average. That also seems rather dubious: you have examples of pitchers with such an extreme BABIP as .240, and they more often than not were below .270 BABIP going forward?
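For readers following the arithmetic, here is a minimal sketch of what "regressing X% toward the league mean" means. The function name `regress_to_mean` and the ballast parameter `k` are illustrative assumptions, not anything from the article, and the second value of `k` simply borrows the 3,800-BIP figure quoted elsewhere in this thread as a placeholder.

```python
def regress_to_mean(observed, league_avg, n_bip, k):
    """Shrink an observed BABIP toward the league average.

    k is a hypothetical "ballast" of league-average balls in play;
    the weight kept on the observed rate is n_bip / (n_bip + k).
    """
    weight = n_bip / (n_bip + k)
    return league_avg + weight * (observed - league_avg)


# Regressing exactly 50% (k equal to the sample size) turns .240 into .270.
print(round(regress_to_mean(0.240, 0.300, n_bip=100, k=100), 4))   # 0.27

# With a much larger, assumed ballast the estimate sits close to league average.
print(round(regress_to_mean(0.240, 0.300, n_bip=100, k=3800), 4))  # 0.2985
```

How much to regress is a separate question from which predictor models the next ball in play better, so this sketch only illustrates the terminology used in the question.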
In the interests of investigating whether this theory works at the extremes, I ran some new analyses.

Using the same basic framework as I did in the original, I took the league average and the past 100 BIP and let them fight it out in the same logistic regression.

I only took cases where the last 100 BIP yielded a prediction of .280 or lower, then .275 or lower, then .270 or lower, and so on. There does come a point where league BABIP is a better predictor, and it seems to happen somewhere between .270 and .265. However, it should be noted that the past 100 BIP still holds some significant sway, even as you descend further.

Perhaps .240 is too lucky to believe, but .270 is not.
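The thresholded comparison described above could be sketched roughly as follows. This is not the logistic regression used in the analysis; it swaps in a simpler head-to-head comparison of mean log loss, and the function names, record layout, and toy data are all made up for illustration.

```python
import math


def mean_log_loss(probs, outcomes):
    """Average negative log-likelihood of 0/1 outcomes given predicted hit probabilities."""
    total = 0.0
    for p, hit in zip(probs, outcomes):
        total += -math.log(p if hit else 1.0 - p)
    return total / len(outcomes)


def compare_below_cutoff(records, league_avg, cutoff):
    """Compare two predictors on cases whose recent BABIP is at or below cutoff.

    records: list of (recent_babip, next_bip_was_hit) pairs (hypothetical layout).
    Returns (loss_using_recent, loss_using_league, n_cases); lower loss is better.
    """
    kept = [(r, hit) for r, hit in records if r <= cutoff]
    if not kept:
        raise ValueError("no cases at or below the cutoff")
    recent = [r for r, _ in kept]
    outcomes = [hit for _, hit in kept]
    loss_recent = mean_log_loss(recent, outcomes)
    loss_league = mean_log_loss([league_avg] * len(kept), outcomes)
    return loss_recent, loss_league, len(kept)


# Toy data: four pitchers at .250 over their last 100 BIP, one of whom allows a hit
# on the next ball in play, plus one .310 pitcher who is filtered out by the cutoff.
toy = [(0.250, 1), (0.250, 0), (0.250, 0), (0.250, 0), (0.310, 1)]
loss_recent, loss_league, n = compare_below_cutoff(toy, league_avg=0.300, cutoff=0.280)
print(n, loss_recent < loss_league)  # 4 True
```

The toy data is constructed so that the recent rate matches the outcomes exactly, so it demonstrates the mechanics only and says nothing about where the real crossover falls.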
"There does come a point where league BABIP is a better predictor, and it seems to happen somewhere between .270 and .265."

I'm surprised, but .265 does seem more likely than .240. This is a very interesting finding.

1) Do you get better/worse results when you separate relievers from starters?
2) Is there any better correlation within an appearance?
3) If 50 is where past BABIP becomes more predictive than league average, what's the next crossover point where league average is again better?
Pitchers vary over years/careers in all aspects of their skills. If this variation occurs over shorter time periods than it takes for the noise in BABIP to even out, then it's likely that the data will look random even when the pitcher is not. I fear I'm restating a conclusion of the article, but if we looked at abnormally consistent (k/9, bb/9, velo, etc.) pitchers, would we see abnormally consistent BABIP?
One obvious clue that .300 BABIP is not one of the physical constants of the universe is the fact that not a few pitchers with long careers have BABIPs rather different. Just off the top of my head are Guillermo Mota (14 seasons) and Ramon Ramirez (7 seasons) who, by coincidence, each have a career .276. As Willy Ley once remarked, theories tend to have delicate skins while facts tend to have sharp corners.
My personal favorite example is Troy Percival who had BABIPs usually in the .270 range year after year.
Pizza: hate to burst your memories but, has his career BABIP at .232 (and Fangraphs at .230), which is even better than you thought!

In addition, he was not "stable" inside any range. His performance followed a random distribution around that .232 mean.
From your article: "The kitchen utensil guy also found that while BABIP might take around 3,800 balls in play to show enough statistical reliability to be considered stable, it does eventually stabilize."

That's what I find important about BABIP: it can be random from one season to the next. It isn't necessarily totally random in a large enough sample size but baseball players tend to be judged with arbitrary end-points so it's important to consider what role BABIP is playing in a player's success over the course of 10, 50, or 162 games.
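To put rough numbers on how noisy BABIP is in small samples, here is a back-of-the-envelope sketch using the binomial standard error. It assumes each ball in play is an independent trial with the same hit probability, which is exactly the simplification the article questions, and the function name is made up for illustration.

```python
import math


def babip_standard_error(true_babip, n_bip):
    """Binomial standard error of an observed BABIP over n_bip balls in play."""
    return math.sqrt(true_babip * (1.0 - true_babip) / n_bip)


for n_bip in (100, 500, 3800):
    se = babip_standard_error(0.300, n_bip)
    print(f"{n_bip:>5} BIP: +/- {se:.3f}")
```

Under that assumption, a .300 pitcher's observed BABIP over 100 BIP has a standard error of about .046, so a .240 stretch is only a little more than one standard error from the mean.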
Yeah, but you can't trust the kitchen utensil guy.
#2 and 3 at the end are very important and a huge shortcoming in online baseball statistical analysis. It may even explain a lot of the discord in the mainstream vs. sabr nonsense.

Players are not the same day-to-day, much less year-to-year. Here's a very basic example from my own experience.

I tinkered with my mechanics way too often, daily really. I would find certain "feels" that would produce immediate, good results and for a week I would mash at the plate or throw a filthy hammer curve. But concentrating on that feel would cause me to over correct and produce new, less optimal mechanics.

Obviously, I'm not and wasn't a professional. 99.9% of major leaguers are probably much better at maintaining consistent mechanics. But those little mechanical variations throughout a season mean that a player's "true talent level" is constantly in flux around some moving average. Even the most mechanically consistent player will still fight this battle due to minor injuries.
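One crude way to watch a pitcher's results drift over time is a trailing-window BABIP, similar in spirit to the "past 100 BIP" predictor discussed in this thread. The sketch below is only an illustration: the function name and input layout are hypothetical, and a trailing average of outcomes is a noisy proxy, not a measurement of true talent.

```python
def trailing_babip(outcomes, window):
    """BABIP over the most recent `window` balls in play, computed before each new BIP.

    outcomes: list of 0/1 flags (1 = hit) in chronological order.
    Returns a list with None until a full window of history is available.
    """
    rates = []
    hits_in_window = 0
    for i, hit in enumerate(outcomes):
        if i >= window:
            rates.append(hits_in_window / window)
            hits_in_window -= outcomes[i - window]  # drop the oldest BIP
        else:
            rates.append(None)
        hits_in_window += hit  # add the current BIP for the next step
    return rates


print(trailing_babip([1, 0, 0, 1, 0, 0, 0, 1], window=4))
# [None, None, None, None, 0.5, 0.25, 0.25, 0.25]
```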

That said, simple constructs like DIPS are helpful for forming a null hypothesis when analyzing a player or set of players. Given the current state of the art, it's impossible to evaluate players perfectly. But that shouldn't prevent us from trying. There are a lot of people asking us to try!

Thanks for this article.
Russell, I am pretty sure that this does not exist, but is there any way to find out if a high BABIP produces a higher BABIP? What I am trying to ask is whether a couple of "lucky" hits could boost the confidence of a hitter, leading to more frequent high-BABIP outcomes such as line drives.

I suspect that this is likely confirmation bias, or a placebo effect. The question just got me thinking. Thank you for your work!
That makes sense on first glance. I found that very small (10 BIP) samples don't do well as predictors, maybe because it's just too small a sampling frame to get a good read on what's going on. Maybe if we conceptualized it in a different way we'd find an effect. It's an open question at this point.
Great article. It struck a lot of great chords on many different levels. Statistically (I do use statistics in my day job) I was comfortable with DIPS and the randomness of BABIP (primarily because of the low to non-existent correlation in year-to-year BABIPs for pitchers). I was troubled by the practical aspect of the concept, given that we accept that some pitchers are ground ball pitchers v. fly ball pitchers (I guess line drive pitchers don't get mentioned) and we know that line drives are hits more often than fly balls and ground balls. Thanks for reconciling the world again.
I also like the "count" analysis. We've always heard about batters "waiting for their pitch" (I suppose so they can "square it up" and hit it harder) v. "protecting the plate" when behind in the count. Your analysis is intuitively congruent.
I wonder, do pitchers with a higher "nasty factor" (i.e., more movement) tend to have lower BABIPs? Does a pitcher's fastball velocity impact BABIP? How about "break" on other pitches? So much potential analysis, so little time.
Thanks again.