The following is an edited transcript of an in-house discussion among the Baseball Prospectus team about BABIP.

Ken Funck:
I’ve seen a few different Expected BABIP calculators based on batted-ball data. Is there a particular one we should use?

Matt Swartz:
I have an expected BABIP calculator that projects future BABIP, and I’m definitely going to be working on it in some future articles. Batted-ball rates and speed tell only a fraction of the story: line-drive rates for hitters are persistent, but not very much so. Knowing a hitter’s BABIP skill is more about power and infield pop-up rate than speed, unless you’re an extreme ground-ball hitter. If that’s the case, it’s more of a factor than the speed itself. If we want to have one that’s supposed to actually represent some sort of luck-neutral BABIP rather than projecting it forward, I would definitely be able to do that.

Colin Wyers:
I really don’t know if I think that BABIP is something to consider for hitters. I think, for instance, PECOTA looks at H/(AB-SO). Clay can correct me if I’m wrong. I like this for several reasons:

  1. You can call this stat BACON if you like, for Batting Average on CONtact. And, well, because I like it.

  2. For hitters, there’s less of a reason to split purely defense-independent versus defense-dependent skills.

The other interesting thing to note is that for pitchers, in the course of a season, they usually pitch in front of roughly the same defensive unit all the time, so there’s a lot more opportunity for one egregiously good or bad defense to sway the numbers. Hitters (at least ones that play enough to matter) will never see the same defense as often as a pitcher will.

That’s just me, though. I understand the appeal of the symmetry of using the same measure for each.

There’s definitely a lot of value in looking at hitter BABIP. The real reason is that, unlike HR/AB, BABIP for hitters has low persistence. It has more than pitcher BABIP has, but there are very specific luck factors that play into it.

For those interested, the math of this can be seen from the formula for the variance of a binomial proportion: p(1-p)/n, where p is probability and n is observations (at-bats or balls in play). The variance of a hitter’s yearly BABIP result around his true BABIP skill level is going to be much higher than the variance of his realized HR/AB result around his true HR/AB skill level, because BABIP is closer to .5 than HR/AB is (making p*(1-p) higher). Not only that, but variance among hitter HR/AB skill levels is larger than variance among BABIP skill levels, so the end result is that a far larger portion of BABIP variance from hitter to hitter is made up of the luck factor than is the case for HR/AB.
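As a rough illustration of the binomial point, the sketch below compares the luck-only spread of the two rates. The inputs (a .300 BABIP over 450 balls in play, a .040 HR/AB over 550 at-bats) and the function name are illustrative assumptions, not figures from the discussion.

```python
import math


def binomial_rate_sd(p: float, n: int) -> float:
    """Luck-only standard deviation of an observed rate: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical hitter-season inputs, chosen only to show the scale of the effect.
babip_sd = binomial_rate_sd(0.300, 450)  # true BABIP .300 over 450 balls in play
hr_sd = binomial_rate_sd(0.040, 550)     # true HR/AB .040 over 550 at-bats

print(f"BABIP luck SD: {babip_sd:.4f}")
print(f"HR/AB luck SD: {hr_sd:.4f}")
```

Under these assumed inputs the random spread in BABIP is more than twice the random spread in HR/AB, which is the direction the argument describes.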

After knowing what the math tells us, then, the thing to do is really separate out BABIP skill from BABIP luck by looking at hitters in the aggregate and trying to determine each hitter’s true BABIP skill. This is no small issue, considering three-quarters of all PA end in a ball in play. We know that the hitter did something bad on a K and good on a HR, but there are plenty of bloop doubles and line-drive outs in the gap that make looking at BABIP tough without looking at what skills the hitter has.

As far as the baseball aspects of what is going on, and how to separate skill from luck using BABIP, there are a few things to look for. BABIP on line drives is a particular area where luck can really sway a hitter. Those with power do tend to have a better BABIP on line drives, but the persistence is very low beyond that. Predicting future BABIP on line drives is better done by simply looking at power rather than looking at a historical line-drive BABIP at all. BABIP on ground balls is made up of infield hits and outfield hits; the former has more persistence, but there is a clear tendency for some hitters to hit their ground balls to the outfield more often. Guys like Derek Jeter, Ichiro, and Joe Mauer are all consistently underestimated by most projection systems. As I understand it, there isn’t a direct acknowledgement of BABIP on ground balls in those systems. They just hit the ball hard and to all fields on the ground. The general reason to look at batter BABIP is that its persistence is so much lower, and actually exploring its components by batted-ball types gets you a lot of information as to which mean to regress it to.

The benefit of looking at something like BACON is that you are basically answering the question of how often the hitter hit them where they ain’t, but it doesn’t necessarily separate out what’s persistent from what’s not. The HR/(AB-SO) is going to be a lot more reliable than the (SI+DO+TR)/(AB-SO) component of it. There’s definitely value in hitter BABIP in separating out luck, but it’s even valuable in determining what’s a skill and what’s not.
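A minimal sketch of that split, assuming a made-up batting line: it breaks H/(AB-SO) into the HR/(AB-SO) piece and the (SI+DO+TR)/(AB-SO) piece. The function name and the sample numbers are hypothetical.

```python
def bacon_components(singles, doubles, triples, home_runs, at_bats, strikeouts):
    """Split batting average on contact into its HR and non-HR parts."""
    contact = at_bats - strikeouts  # AB - SO
    if contact <= 0:
        raise ValueError("need at least one at-bat that did not end in a strikeout")
    hr_part = home_runs / contact                        # HR/(AB-SO)
    non_hr_part = (singles + doubles + triples) / contact  # (SI+DO+TR)/(AB-SO)
    return {"bacon": hr_part + non_hr_part, "hr": hr_part, "non_hr": non_hr_part}


# Hypothetical line: 550 AB, 100 SO, 100 singles, 30 doubles, 3 triples, 20 HR.
line = bacon_components(100, 30, 3, 20, 550, 100)
print({name: round(value, 3) for name, value in line.items()})
```

Keeping the two pieces separate lets each one be regressed by a different amount, which is the point being made about their differing reliability.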

If we want to handle it binomially (and I agree that’s a very good way to do it) then I don’t think you want HR/AB. If you regress HR/AB and SO/AB (or what have you) separately, then you get really wacky results sometimes. Let me see if I can remember how to do the independent binomial components for a standard batting line off the top of my head:

$XB = (2B+3B)/(H-HR)
$T = 3B/(2B+3B)

That should preserve the binomial property for every component and keep everything “independent” of each other for the purposes of regressing. (And yes, $H is functionally equal to BABIP, depending on how you want to do it. If you want to add in things like HBP, SH, SF, or ROE, you can do it, but let’s leave that off to the side for now.)
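The two components listed above can be sketched as follows. This reads the denominator of $XB as H-HR (non-home-run hits), which is an assumption about the intended formula; the function name and the sample line are hypothetical.

```python
def extra_base_components(hits, doubles, triples, home_runs):
    """Compute the $XB and $T rates from a batting line.

    $XB = (2B+3B)/(H-HR): share of non-HR hits that go for extra bases.
    $T  = 3B/(2B+3B):     share of those extra-base hits that are triples.
    Returns None for a rate whose denominator is zero.
    """
    non_hr_hits = hits - home_runs
    doubles_and_triples = doubles + triples
    xb = doubles_and_triples / non_hr_hits if non_hr_hits > 0 else None
    t = triples / doubles_and_triples if doubles_and_triples > 0 else None
    return xb, t


# Hypothetical line: 153 hits, 30 doubles, 3 triples, 20 home runs.
xb_rate, t_rate = extra_base_components(153, 30, 3, 20)
print(f"$XB = {xb_rate:.3f}, $T = {t_rate:.3f}")
```

Each rate is a success count over its own pool of opportunities, so each can be treated as a separate binomial when regressing.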

What you want to do at that point is use p(1-p)/n to figure the random variation and then figure the observed variation, then take the difference to get the “true” variation. (Actually it’s a little more involved than that, since the longer formula to figure the true variation includes true variation as one of its terms, so you have to iterate it a bit.)
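A minimal sketch of the first pass of that calculation, assuming a small made-up sample of player rates: it subtracts the average binomial variance from the observed variance across players. It does not attempt the iteration mentioned above, and the function name, the use of an unweighted variance, and the data are all assumptions.

```python
def variance_breakdown(rates, trials):
    """Return (observed, random, true) variance for a set of player rates.

    observed: unweighted variance of the rates around the pooled mean
    random:   average of p(1-p)/n, using the pooled mean as p
    true:     observed minus random, floored at zero
    """
    pooled_mean = sum(r * n for r, n in zip(rates, trials)) / sum(trials)
    observed = sum((r - pooled_mean) ** 2 for r in rates) / len(rates)
    random_part = sum(pooled_mean * (1 - pooled_mean) / n for n in trials) / len(trials)
    return observed, random_part, max(observed - random_part, 0.0)


# Hypothetical rates for five hitters, each over 400 opportunities.
sample_rates = [0.250, 0.280, 0.300, 0.320, 0.350]
sample_trials = [400] * 5
observed_var, random_var, true_var = variance_breakdown(sample_rates, sample_trials)
print(f"observed={observed_var:.6f} random={random_var:.6f} true={true_var:.6f}")
```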

I’m uncomfortable with any analysis involving batted-ball data, especially line drives. I don’t know if you saw the study I did on park effects for line drives with press box heights, but it’s a troubling issue to me.

Clay Davenport:
That is more or less the approach I use, other than inverting the order of a couple of things (I break them out as SO, BB, H, HR, TP, DB).

I strongly second Colin’s discomfort [with line-drive data]. The wakeup call for me was when I ran regressions of various park effects against each other. I found the PF for line drives and the PF for fly balls correlated at -.86. Any stat that relies on fly balls but not line drives (or vice versa) is going to run into trouble with teams at the extremes. I know the Cardinals were one (I can’t remember off the top of my head which direction it was), but it made more than a 30 percent difference to the question of “How many fly balls did Joel Pineiro allow?”

I definitely think park-adjusting these things can probably remove some of the bias, but I still think using noisy stats is not useless. No one has park factors that perfectly estimate the effect of something so we’re left making some (pardon the pun) ballpark adjustments regardless of which data we use.

I think the correlation between line-drive park factor and fly-ball park factor being so negative basically means that if you want to answer the question, “How much of the difference in LD and FB park factors across parks is due to scorer bias?” then the answer is “almost all of it.” Colin’s THT article about press box heights answers the question “Where is that scorer bias coming from?” with “The angle he has.” But I think the answer to the question of “How badly does this bias hurt us in approximating people’s skill levels?” is basically “not very much.”

Also, if we’re really, really going to post all this discussion online, I’d like to give a shout-out to my lovely wife.

It doesn’t surprise me that Saint Louis is one; I had some talks with people after that article came out and realized that I had measured the height of the wrong press box (I used the TV one, since I got a lot of my data from scouting reports TBS published for their own use in broadcasts, while the Gameday stringer sits in a different press box altogether). They have one of the highest press boxes in the game. Off the top of my head, Pittsburgh is probably another park that’s an outlier, and Houston is probably one the other way.

As for the components, yeah, I don’t know that any one order is better than any other, and frankly it seems like more work than it would ever be worth to check.

I really liked your article on the press box heights. That’s really good stuff and important to remember. Ideally, if someone could do some park-adjustments to this type of stuff, that might at least correct some of this. But keep in mind that subjective scoring can only take you off track so far.

When I’m trying to predict ground-ball BABIP, I use (ROE+IFH) as a term rather than just IFH itself, but it’s valid to use AVG/OBP/SLG even if the scorer’s calls on ROE vs. IFH are questionable and possibly biased. Clay said the park factor on LD vs. FB once got to be at most 20 percent or something like that for maybe St. Louis or somewhere, but it was less than that in other places. I think there’s still a lot of value in looking at it, even if it’s imperfectly estimated, just like looking at HR rates is valuable even if they are better looked at after adjusting for park.

I think the true variation thing is pretty easy to compute. I used the sum of variances = variance of sums formula, and basically determined that batter BABIP skill level among major leaguers has a standard deviation of about .020, while realized BABIP has a standard deviation of about .030. I forget what it was for HR, but what you’re suggesting is basically what I did.
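Working through the arithmetic implied by those two standard deviations: if variances add, the luck component and the share of observed variance that is skill follow directly. The .020 and .030 figures are from the discussion; the .340 observed BABIP and .300 league mean in the example are hypothetical, as is the `regress` helper.

```python
import math

true_sd = 0.020      # spread of BABIP skill, as stated in the discussion
observed_sd = 0.030  # spread of realized BABIP, as stated in the discussion

# Variance of sums = sum of variances, so luck variance is the difference.
luck_sd = math.sqrt(observed_sd ** 2 - true_sd ** 2)
skill_share = true_sd ** 2 / observed_sd ** 2  # fraction of observed variance that is skill


def regress(observed, league_mean, share):
    """Shrink an observed rate toward the league mean by the skill share."""
    return league_mean + share * (observed - league_mean)


# Hypothetical hitter: .340 realized BABIP against an assumed .300 league mean.
estimate = regress(0.340, 0.300, skill_share)
print(f"luck SD={luck_sd:.4f} skill share={skill_share:.3f} estimate={estimate:.3f}")
```

On these numbers a bit less than half of the observed spread is skill, so an observed deviation from the mean gets cut by a little more than half. That holds only for whatever typical sample size underlies the .030 figure.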

Christina Kahrl:
Going back to something I was working out with Eric a couple of weeks ago, I’d stress a couple of what should be obvious cautions as far as BABIP:

  1. Different types of players can and do wind up with different baselines. Even using something as crude as single-season stolen-base data to describe “speed guys,” you wind up with a group of players whose BABIPs are noticeably higher than the MLB average. Are they “lucky”? Of course not, they just bring a different set of skills to the plate (literally).

  2. Players with 1000 career PAs or more wind up with higher BABIPs than the MLB average; that’s a normal enough survivor effect, but it’s something to keep in mind when people start talking about “luck” when there’s simply a higher average for people who are successful enough in the majors to stick around.

    Or, to pick on a favorite craptastic briefly big-league ballplayer, Sammy Khalifa wasn’t lucky or unlucky relative to a worthwhile major-league average in 1985 when he produced a .278 BABIP; he was Sammy Khalifa, and, shortly thereafter, he was Sammy Khalifa someplace else besides the major leagues.

  3. As ever, I’d argue strongly against the inclusion of pitcher hitting data in any meaningful baseline describing what’s average for a major-league hitter. All you’re doing there is creating a slightly lower average that makes it easier for a few more people to appear average, at which point you get a little further from sorting out who’s lucky or unlucky or good or bad, or just plain.

Russell Carleton:
A few relevant numbers, from data I’ve yet to publish. (If people want it, feel free to steal it… it was for a project that I never quite got off the ground.) I’m a fan of using split-half reliability to determine the persistence of a stat. They are all from the batter’s perspective.

Split-half reliability of:
FB/BIP at 400 BIP: .789
GB/BIP at 400 BIP: .884
LD/BIP at 400 BIP: .565
PU/BIP at 400 BIP: .769

I’m using Retrosheet codes there and assuming that there’s no observer bias. (Please don’t hurt me, Colin.) Colin has shown previously that this is a problem and will need to be corrected.

FB and GB rates are very persistent. LD’s less so. That .50s range is maddening because it means an R-squared around 30 percent.
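The R-squared remark is just the square of the split-half correlation. A short sketch using the four batted-ball reliabilities quoted above (the dictionary layout is an illustrative choice):

```python
# Split-half reliabilities at 400 BIP, as quoted in the discussion.
reliability = {"FB/BIP": 0.789, "GB/BIP": 0.884, "LD/BIP": 0.565, "PU/BIP": 0.769}

# Squaring a correlation gives the share of variance one half explains in the other.
r_squared = {stat: r ** 2 for stat, r in reliability.items()}

for stat, value in r_squared.items():
    print(f"{stat}: R-squared = {value:.2f}")
```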

Singles per FB at 150 FB: .450
2B/3B per FB at 150 FB: .565
HR per FB at 150 FB: .790

Singles per GB at 150 GB: .244
2B/3B per GB at 150 GB: .128

Singles per LD at 100 LD: .035
2B/3B per LD at 100 LD: .562

What happens on GB isn’t very consistent. I have also found evidence that infield and outfield grounders should be considered separately though. Line-drive singles seem to be more a function of whether a fielder got in the way.

Interesting. A few questions:

  1. What years does this cover? I wager you’d get different results looking at, say, ’05 to ’09 than ’89-’99 (’05 and on is Gameday data, ’89 to ’99 is Project Scoresheet).

  2. How are you splitting the halves? Midseason? Even/odds?

Of course, again, if you regress based upon this, you should get some odd results: a player’s GB rate puts a constraint on their LD/FB/PU rate, for instance.

There are other biases as well: the same batted ball caught by an infielder versus an outfielder may be scored differently (if an infielder catches it, it would probably be scored PU, where if an outfielder catches it, it could be a FB).

You also have potential caught/not-caught bias, where if a ball is caught, maybe it’s scored a FB, where if the fielder can’t get to it, it may be scored as a liner.

For these numbers, I used Retrosheet from 2003-2008. I combined ’03-’04, ’05-’06, and ’07-’08, and split it in half based on evens/odds.
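A minimal sketch of an evens/odds split-half calculation, assuming each batter’s batted balls are stored in order as 1 (the outcome of interest, say a ground ball) or 0. The data layout, function names, and the tiny sample are hypothetical and are not the Retrosheet processing actually used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_half_reliability(outcomes_by_batter):
    """Correlate each batter's rate on even-numbered events with the rate on odd ones."""
    even_rates, odd_rates = [], []
    for outcomes in outcomes_by_batter.values():
        even_half, odd_half = outcomes[0::2], outcomes[1::2]
        even_rates.append(sum(even_half) / len(even_half))
        odd_rates.append(sum(odd_half) / len(odd_half))
    return pearson(even_rates, odd_rates)


# Hypothetical toy data: eight batted balls per batter, 1 = ground ball.
toy = {
    "A": [1, 1, 1, 0, 1, 1, 0, 1],
    "B": [0, 0, 1, 0, 0, 1, 0, 0],
    "C": [1, 0, 0, 1, 1, 0, 1, 0],
}
toy_reliability = split_half_reliability(toy)
print(f"split-half reliability: {toy_reliability:.3f}")
```

In practice each batter would be truncated to a fixed number of events (400 BIP, 150 FB, and so on) before splitting, so that every batter contributes halves of equal size.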

You could regress the batted-ball types to league mean and then pro-rate the answers out to 100 percent. Since batted-ball type numbers don’t regress overly much, it probably wouldn’t look too weird. My initial framework was to generate a projected batted-ball profile. Once I knew that Smith was projected to have 100 GB, I would figure out his skill at getting singles/XBH on those and regress to mean as necessary, then prorate them back to 100 percent.
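The regress-then-prorate step can be sketched as below. Using the split-half reliabilities quoted earlier as the regression weights is an assumption made for illustration, and the league-average and observed profiles are invented numbers.

```python
def regress_and_prorate(observed, league, weight):
    """Regress each batted-ball rate toward league mean, then rescale to sum to 1."""
    regressed = {
        kind: league[kind] + weight[kind] * (observed[kind] - league[kind])
        for kind in observed
    }
    total = sum(regressed.values())
    return {kind: rate / total for kind, rate in regressed.items()}


# Hypothetical hitter and league profiles (shares of balls in play).
observed_profile = {"GB": 0.50, "FB": 0.25, "LD": 0.20, "PU": 0.05}
league_profile = {"GB": 0.44, "FB": 0.30, "LD": 0.19, "PU": 0.07}
# Assumption: the quoted 400-BIP reliabilities stand in for regression weights.
weights = {"GB": 0.884, "FB": 0.789, "LD": 0.565, "PU": 0.769}

projected = regress_and_prorate(observed_profile, league_profile, weights)
print({kind: round(rate, 3) for kind, rate in projected.items()})
```

Because each type is regressed by a different amount, the regressed shares no longer sum to exactly 100 percent, which is why the final rescaling is needed.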