March 23, 2010
Ahead in the Count: Predicting BABIP, Part 1
If you don’t put your bat on the ball, you’re not going to get a hit, and if you don’t hit the ball over the wall, someone might catch it. This series begins with what happens the rest of the time, as I develop a model to predict a hitter’s Batting Average on Balls in Play (BABIP). In Part 2, I will explain some of the current BABIP superstars, and Part 3 will cover some of the players on whom my system differs from PECOTA.

Hitters vary wildly in their ability to put the bat on the ball. The best are guys like Placido Polanco and Dustin Pedroia, who strike out in about seven percent of their plate appearances, while the worst are guys like Mark Reynolds and Jack Cust, who strike out about 30-35 percent of the time. Hitters vary wildly in their ability to hit the ball over the wall, too: Prince Fielder and Ryan Howard hit home runs in about eight percent of their at-bats, while Jason Kendall and Luis Castillo are well under one percent.

It’s very clear who is good at making contact and who is good at hitting home runs, but it is harder to know who is good at getting hits on balls in play. That’s because the difference between the best and worst hitters at BABIP is much smaller, which explains why the year-to-year correlation of a batter’s HR/AB is about .74 and that of his K/PA is about .84, while the year-to-year correlation of a hitter’s BABIP is only about .37. Guys like Derek Jeter and Ichiro Suzuki can reliably get hits on 35 percent of their balls in play, while guys like Edwin Encarnacion and Rod Barajas can only muster hits on 26 percent of their balls in play. With such a small gap in BABIP skill between hitters, it can be tricky to separate skill from noise. However, separating the two is very important if you want to assess a hitter’s skills, because 69 percent of all PA result in a ball being hit into the field of play.
BABIP came into popular usage when Voros McCracken first observed how little control pitchers have over it. Recently, I explained that pitchers seem to control about 12 percent of their BABIP skill, but I have also found that hitters control about three times that much in their at-bats (36 percent). Furthermore, unlike pitchers, whose BABIP skills are correlated with and explained by their strikeout, walk, and home-run skills, a hitter’s BABIP skills are often uncorrelated with his Three True Outcomes skills.

The actual standard deviation of hitter BABIP among players with at least 300 PA in a given season is about .031. After removing the fraction that can be attributed to luck, using the same methods as the article I just referenced, true hitter skill in BABIP should have a standard deviation of about .019. My model’s projections have a standard deviation of about .018, indicating that I have isolated the most important aspects of BABIP.

As I explained when I wrote my BP Idol article "You Can Beat PECOTA without a Computer Model," PECOTA uses nearly a century of data to predict hitters’ statistics with accuracy, although not every statistic has been recorded throughout history. Thus, PECOTA cannot use information like rates for ground balls, line drives, popups, and fly balls, nor can it examine BABIP on each of those batted-ball types. My model can see some things about BABIP that PECOTA can’t, and it has therefore projected BABIP more accurately than PECOTA and other systems.

My projected BABIP numbers, published at the now-defunct StatSpeak.net blog last spring, correlated with actual 2009 BABIP at a higher rate than PECOTA’s and CHONE’s, and had a lower RMSE. They were also closer to actual BABIP than PECOTA was 57 percent of the time, and closer to actual BABIP than CHONE was 60 percent of the time. That might not seem like much, but those fractions are significant at the 95-percent and 99-percent levels, respectively.
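One way to check a "closer more often than not" claim like this is a sign test, which treats each head-to-head comparison as a coin flip under the hypothesis that the two systems are equally good. The sketch below is only an illustration of that idea, not necessarily the test used for the figures above. The article does not report how many hitters were compared, so the sample size of 200 is a purely hypothetical value chosen for the example.

```python
from math import comb


def sign_test_p_value(wins: int, trials: int) -> float:
    """One-sided exact binomial p-value: P(X >= wins) when every
    head-to-head comparison is a fair coin flip (p = 0.5)."""
    tail = sum(comb(trials, k) for k in range(wins, trials + 1))
    return tail / 2 ** trials


# HYPOTHETICAL sample size: the article does not say how many hitters
# were compared, so 200 is an illustrative assumption only.
n_hitters = 200

# Share of hitters on whom the BABIP model was closer, from the article.
p_vs_pecota = sign_test_p_value(round(0.57 * n_hitters), n_hitters)
p_vs_chone = sign_test_p_value(round(0.60 * n_hitters), n_hitters)

print(f"vs PECOTA: one-sided p = {p_vs_pecota:.4f}")
print(f"vs CHONE:  one-sided p = {p_vs_chone:.4f}")
```

The p-values depend heavily on the assumed sample size, so the printed numbers should not be read as the article's actual test results.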
In other words, there is less than a five-percent chance that there would have been that large a difference between my BABIP model and PECOTA if they were equally good, and less than a one-percent chance that the model would have beaten CHONE as badly as it did just by chance.

There is a large difference in BABIP on different batted balls. Line drives have a BABIP of about .730, while ground balls have a BABIP of about .240. Outfield fly balls have a BABIP of about .170, while infield popups have a BABIP of only .020. Different hitters have different swings that generate these batted balls at very different rates. The year-to-year correlation is .78 for ground-ball rate, .72 for outfield fly-ball rate, .37 for line-drive rate, and .68 for popup rate. You can get pretty far already just by knowing the rate at which a hitter has hit these batted-ball types in the past.

Hitters also show some difference in their BABIP on each batted-ball type, with year-to-year correlations of .30 for ground balls, .22 for outfield fly balls, just .12 for line drives, and .17 for popups. Therefore, if we know that a hitter had a high BABIP because he had a high BABIP on line drives, then we should expect him to regress back toward the mean, while if a hitter had a high BABIP because he had a high BABIP on ground balls, then he will be more likely to maintain it.

Other statistics help guide a prediction of BABIP even after accounting for batted-ball rates and BABIP on each batted-ball type. While hitters seem to have more control over their BABIP on ground balls, this is primarily because faster players can reach on ground balls in the infield at a much higher rate than slower players.
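To make the batted-ball arithmetic above concrete, here is a minimal sketch that combines a hitter's batted-ball mix with the league BABIP on each type, then shrinks an observed BABIP toward the league rate. The league rates and correlations are the ones quoted in the article. The two hitter mixes are hypothetical, and using the year-to-year correlation directly as the shrinkage factor is a simplifying assumption, not the regression model developed later in this article.

```python
# League-wide BABIP by batted-ball type, as quoted in the article.
LEAGUE_BABIP = {"line_drive": 0.730, "ground_ball": 0.240,
                "outfield_fly": 0.170, "popup": 0.020}

# Year-to-year correlation of BABIP on each type, as quoted in the article.
BABIP_PERSISTENCE = {"line_drive": 0.12, "ground_ball": 0.30,
                     "outfield_fly": 0.22, "popup": 0.17}


def babip_from_mix(mix: dict[str, float]) -> float:
    """BABIP implied by a batted-ball mix at league-average BABIP per type."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("batted-ball shares must sum to 1")
    return sum(share * LEAGUE_BABIP[kind] for kind, share in mix.items())


def regress_to_mean(observed: float, kind: str) -> float:
    """ASSUMPTION: shrink an observed BABIP toward the league rate, using
    the year-to-year correlation as the shrinkage factor."""
    league = LEAGUE_BABIP[kind]
    return league + BABIP_PERSISTENCE[kind] * (observed - league)


# HYPOTHETICAL batted-ball mixes (shares of balls in play).
line_drive_hitter = {"line_drive": 0.20, "ground_ball": 0.45,
                     "outfield_fly": 0.28, "popup": 0.07}
popup_prone_hitter = {"line_drive": 0.18, "ground_ball": 0.40,
                      "outfield_fly": 0.30, "popup": 0.12}

print(f"{babip_from_mix(line_drive_hitter):.3f}")
print(f"{babip_from_mix(popup_prone_hitter):.3f}")

# A .100 edge on line drives mostly disappears; a .050 edge on grounders less so.
print(f"{regress_to_mean(0.830, 'line_drive'):.3f}")
print(f"{regress_to_mean(0.290, 'ground_ball'):.3f}")
```

Even with identical BABIP on every batted-ball type, the two hypothetical mixes produce noticeably different overall BABIP, which is why the batted-ball rates carry so much of the predictive weight.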
In the past, I have found that using ground-ball errors and infield hits together is better than looking at infield hits alone, so I have incorporated a variable called "Infield Reach Percentage," which is the percentage of ground balls staying in the infield on which a hitter reaches first base safely (excluding fielder’s choices). Infield Reach Percentage has a .55 year-to-year correlation, while "Outfield Ground-Ball Rate," my statistic representing the percentage of ground balls that reach the outfield, has only a .25 year-to-year correlation; it certainly represents a skill, but not one where there is much difference between major-league hitters.

The batted-ball type with the least persistent BABIP was line drives, with only a .12 year-to-year correlation. Most of hitting line drives away from fielders’ gloves is a matter of luck, but hitters who hit the ball harder are better at getting them to fall in. In fact, home-run rate (or, actually, the natural log of home-run rate) correlates more highly with next year’s line-drive BABIP than this year’s line-drive BABIP does. In other words, you can most likely expect Andre Ethier (.618 LDBABIP, 31 homers in 2009) to have a better BABIP on line drives than Erick Aybar (.861 LDBABIP, five homers in 2009) this year.

Another statistic that correlates highly with many relevant aspects of BABIP is contact rate (as shown on FanGraphs), defined as the percentage of pitches a hitter swings at that he makes contact with, generating either a foul or a fair ball (although I use the natural log of contact rate). Being able to make contact when he swings is a good proxy for a hitter’s ability to square up the ball, and hitters are more likely to improve their BABIP if they make more contact in general. Triples per at-bat were also used in some regressions as an additional proxy for speed.
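The speed, power, and contact predictors described above can be computed from raw season totals roughly as follows. This is a sketch under stated assumptions: the `SeasonCounts` field names and the sample season are hypothetical, not drawn from any real data feed, and the article does not say how a hitter with zero home runs (where the natural log is undefined) was handled.

```python
from dataclasses import dataclass
from math import log


@dataclass
class SeasonCounts:
    """Raw season totals. Field names are hypothetical."""
    at_bats: int
    home_runs: int
    triples: int
    swings: int
    contacts: int                # swings producing a foul or fair ball
    infield_ground_balls: int    # stayed in the infield, fielder's choices excluded
    infield_hits: int
    infield_errors_reached: int  # reached on an error on an infield ground ball


def infield_reach_pct(s: SeasonCounts) -> float:
    """Share of infield ground balls on which the hitter reached first safely."""
    return (s.infield_hits + s.infield_errors_reached) / s.infield_ground_balls


def predictors(s: SeasonCounts) -> dict[str, float]:
    """Predictors named in the article. math.log raises ValueError when the
    home-run total is zero; that case is left unhandled here."""
    return {
        "inf_reach_pct": infield_reach_pct(s),
        "ln_hr_per_ab": log(s.home_runs / s.at_bats),
        "ln_contact_pct": log(s.contacts / s.swings),
        "triples_per_ab": s.triples / s.at_bats,
    }


# HYPOTHETICAL season line, for illustration only.
season = SeasonCounts(at_bats=550, home_runs=22, triples=5, swings=1000,
                      contacts=810, infield_ground_balls=180,
                      infield_hits=18, infield_errors_reached=4)
for name, value in predictors(season).items():
    print(f"{name}: {value:.4f}")
```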
Following my methods in previous articles, I developed simple OLS regressions to check BABIP on each batted-ball type (except popups) before developing an overall BABIP model. I did make one important change for the sake of accuracy: I used weighted averages of each statistic over multiple seasons. Including only seasons with at least 300 PA, I created a weighted average of the three previous seasons by weighting three years ago by three, two years ago by four, and one year ago by five, and averaging those. For instance, the weighted average of ground-ball BABIP over the last three years was equal to:

(3*(Groundball Hits in 2007) + 4*(Groundball Hits in 2008) + 5*(Groundball Hits in 2009)) / (3*(Ground Balls in 2007) + 4*(Ground Balls in 2008) + 5*(Ground Balls in 2009))

Similarly, when I only had two previous seasons with at least 300 PA, the years were weighted by just four and five.

Using only data from 2003-09, and only hitters with at least 300 PA for four straight years, I used regression analysis to predict ground-ball BABIP (GBBABIP) in the fourth year using data from the previous three years. The regression had an R^2 of .20; here it is:

Variable Coeff. PStat
GBBABIP 0.412 .000
INF Reach% 0.179 .016
LN(Contact%) 0.082 .008
LN(HR/AB) 0.010 .015
TR/AB 1.248 .004
Constant 0.162 .000

Translated, this table says that:

Expected GBBABIP = .412*(GBBABIP) + .179*(INF REACH%) + .082*(LN(Contact)) + .010*(LN(HR/AB)) + 1.25*(3B/AB) + .162

Keep in mind that all of these statistics are weighted averages of the previous three years as described above.

Predicting outfield fly-ball BABIP in the fourth year as a function of the weighted average of OFBABIP, popup rate, and line-drive rate in the previous three years would look as follows:
Variable Coeff. PStat
OFBABIP .307 .000
PU% .265 .002
LD% .240 .014
Constant .186 .000
This equation had an R^2 of .09. Projecting line-drive BABIP in the fourth year was not even helped by looking at previous years’ line-drive BABIP, but instead by looking at previous years’ line-drive rate and the natural log of home runs per at-bat. This only had an R^2 of .04.

Variable Coeff. PStat
LN(HR/AB) .017 .000
LD% .205 .087
Constant .741 .000

Using these results, I confirmed that using the same statistics would help me develop my BABIP model. Again, using all players with 300 PA in four straight seasons, I developed the following regression equation for BABIP in the fourth season as a function of weighted averages from the three previous seasons. This regression had an R^2 of .31:

Variable Coeff. PStat
Line Drives/Balls in Play .277 .000
Ground Balls/Balls in Play .091 .010
Popups/Balls in Play .378 .000
Groundball BABIP .177 .003
IF Reaches/IF Reach Ground Balls .109 .040
Outfield Flyball BABIP .181 .000
Ln(Home Runs/At-Bats) .011 .000
Ln(Contact Made/Pitches Swung At) .054 .028
Constant .200 .000

Using players with only three straight years of 300 PA, I developed the following regression equation for predicting BABIP in the third year as a function of weighted averages from the two previous seasons; this had an R^2 of .26:

Variable Coeff. PStat
Line Drives/Balls in Play .226 .000
Ground Balls/Balls in Play .040 .127
Popups/Balls in Play .387 .000
Groundball BABIP .138 .001
INF Reaches/INF Reach Ground Balls .104 .005
Outfield Flyball BABIP .124 .000
Ln(Home Runs/At-Bats) .007 .002
Ln(Contact Made/Pitches Swung At) .023 .180
Constant .232 .000

Using the same statistics to project BABIP one year ahead from only the previous year’s data, I got a lot of insignificant variables and needed to change the variables a little. With only one year of data, there is a lot of noise, and the best strategy is to use some other statistics to add information that would not have been valuable with more years to work with.
I adjusted the popup variable to popups per fly ball overall, which provided a more accurate assessment of how frequently the hitter makes a bad swing. I eliminated overall GBBABIP, which had too much noise in it, and used only infield reach rate. I also used triples to add in some information about speed not contained in one year of infield reach rate. Finally, I used outfield fly-ball hits per all balls in play rather than just per outfield fly balls in play, since that provided a slightly better prediction. This regression had an R^2 of .21.

Variable Coeff. PStat
Line Drives/Balls in Play .236 .000
Ground Balls/Balls in Play .077 .000
Popups/(Popups + Fly Balls) .116 .000
IF Reaches/IF Reach Ground Balls .132 .000
Ln(Home Runs/At-Bats) .003 .019
OF Flyball Hits/All Balls in Play .218 .000
Triples/At-Bats .605 .000
Constant .223 .000

It’s always better to use extra data, so to develop my model of BABIP, which I call Expected BABIP (or EBABIP), I used the first of these three regressions for all hitters with 300 PA or more in each of the previous three years, the second regression if the hitter had 300 PA in each of the previous two years but not three years ago, and the last of the three equations only for hitters who had 300 PA just one year in a row.

Actually incorporating this model into a projection system would be a tricky endeavor, and with all the improvements going into PECOTA in 2010, this was not included in this year’s projections. However, it can be used by general managers and fantasy managers alike to better assess who is likely to outperform their projections. As other projection systems use similar amounts of data, this model serves as another approach to evaluate a few aspects of hitting skill.
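The mechanics described above (the 3/4/5 season weighting, the choice of regression by number of qualifying seasons, and the ground-ball equation as printed) can be sketched as follows. Several things here are assumptions: the function names are made up for the example, the weighted average divides weighted hits by identically weighted ground balls in play, and the sample hitter is hypothetical. Only the ground-ball equation is coded because it is the one the article spells out term by term.

```python
from math import log

SEASON_WEIGHTS = (5, 4, 3)  # one year ago, two years ago, three years ago
MIN_PA = 300                # qualifying threshold used throughout the article


def weighted_rate(numerators, denominators):
    """Weighted multi-season rate; inputs are ordered most recent season first.
    ASSUMPTION: weighted numerator over identically weighted denominator."""
    if not numerators:
        raise ValueError("need at least one qualifying season")
    weights = SEASON_WEIGHTS[:len(numerators)]
    top = sum(w * n for w, n in zip(weights, numerators))
    bottom = sum(w * d for w, d in zip(weights, denominators))
    return top / bottom


def qualifying_seasons(pa_by_season):
    """Consecutive seasons with at least 300 PA, counting back from the most
    recent one and capped at three. This count picks the regression to use."""
    count = 0
    for pa in pa_by_season[:3]:
        if pa < MIN_PA:
            break
        count += 1
    return count


def expected_gb_babip(gb_babip, inf_reach_pct, contact_pct, hr_per_ab,
                      triples_per_ab):
    """Ground-ball equation as printed in the article; every input is a
    weighted average over the previous seasons."""
    return (0.412 * gb_babip + 0.179 * inf_reach_pct
            + 0.082 * log(contact_pct) + 0.010 * log(hr_per_ab)
            + 1.25 * triples_per_ab + 0.162)


# HYPOTHETICAL hitter, seasons listed most recent first (2009, 2008, 2007).
gb_hits = (50, 45, 48)
gb_in_play = (200, 190, 210)
seasons = qualifying_seasons([612, 540, 487])
gb_babip = weighted_rate(gb_hits[:seasons], gb_in_play[:seasons])
projection = expected_gb_babip(gb_babip, inf_reach_pct=0.10, contact_pct=0.80,
                               hr_per_ab=0.03, triples_per_ab=0.005)
print(f"{seasons} qualifying seasons, expected GBBABIP {projection:.3f}")
```

Slicing the weights means a hitter with two qualifying seasons is automatically weighted five and four, and a hitter with one qualifying season gets his plain single-season rate.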
Part 2 of this series will highlight the current BABIP Superstars, as I’ve called them before, showing the 10 highest BABIP projections for 2010 and the reasons why they are so high, along with the five lowest BABIP projections for 2010 and the reasons why they are so low. Part 3 will show the hitters for whom EBABIP and PECOTA’s BABIP projections differ the most, and the circumstances in which you should lean on each.
Matt Swartz is an author of Baseball Prospectus.
 
Excellent article. Thanks, Matt. I've been looking forward to EBABIP since Idol last year.
"Actually incorporating this model into a projection system would be a tricky endeavor..." but worth the effort. This could be the "next big step" in baseball analytics and projections. $$$.
Kudos.