In mid-August I was reading an article on a friend’s blog, in which he was reflecting on his preseason MLB award predictions and lamenting his pick of Chris B. Young for NL Rookie of the Year. Since I had hyped Young to him prior to the season, I felt it necessary to reply and defend my prognostication. Young was hitting just .235 at the time, but had 25 or so home runs, and had already built a reputation as a premier baserunner. In defense of my pick, I hastily replied, “He’s had bad luck. If you gave him even a league-average batting average on balls in play, he’d be hitting .270 and have All-Star overall numbers.”

Something about my statement, however, did not seem intuitively correct. I knew that multiplying a player’s balls in play total by a league-average BABIP and adding in the home runs was, at best, a brute force measure of what the player “should” be hitting, only slightly less variable than a player’s batting average itself. Still, I thought that with Young’s combination of power and speed, he should ideally be hitting with at least average luck on balls in play.

Thanks to Marc Normandin’s player profiles, most BP readers will be familiar with the concept that a player’s batting average is highly variable from year to year, compared with other important rate stats like isolated patience and isolated power.

chart 1

The correlation reflected above supports the idea that hitting for average is subject to great variance due to luck, whereas walk rate and hitting for power are less prone to variance unexplained by the player’s skillset.

Readers will also be familiar with the fact that certain types of batted balls are more likely than others to fall for hits. To determine a player’s expected batting average, then, we could simply multiply each batted ball type by its individual probability of falling for a hit, and then factor in the player’s strikeout rate.
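As a minimal sketch of that naive calculation, the Python below weights each batted-ball count by a hit probability and divides by at-bats, so that strikeouts count as automatic outs. The function name, the batted-ball categories, the hit probabilities, and the sample counts are all placeholders assumed for illustration; none of them are figures from this article or measured league rates.

```python
# Hypothetical hit probabilities by batted-ball type (placeholders, not league rates).
HIT_PROB = {"line_drive": 0.70, "ground_ball": 0.25, "fly_ball": 0.20, "popup": 0.02}

def naive_expected_avg(batted_balls, strikeouts):
    """Expected batting average from batted-ball counts plus strikeouts."""
    expected_hits = sum(HIT_PROB[kind] * count for kind, count in batted_balls.items())
    at_bats = sum(batted_balls.values()) + strikeouts  # every strikeout is an out
    return expected_hits / at_bats

# Invented batted-ball counts for a single hitter.
example = {"line_drive": 80, "ground_ball": 180, "fly_ball": 120, "popup": 20}
print(round(naive_expected_avg(example, strikeouts=100), 3))
```

Swapping in measured hit rates for each batted-ball type would turn this into the simple estimate described above.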

Unfortunately, this procedure is not as practical as it might seem. There are other factors besides batted ball type that determine the rate at which balls in play are converted into hits. In an effort to investigate whether my defense of Young’s “bad luck” was justified, I created a model that accounts for batted ball types as well as three other demonstrable skills that affect a player’s average: frequency of contact, strength of contact, and how he can use his legs to create (or quash) potential hits on ground balls.

To measure the footspeed component, I used Bill James’s original “speed score” formula, the average of five different calculations that measure speed: stolen base percentage, stolen base attempts, triples, runs per time on base, and GIDPs. For a measure of strength of contact, I elected to use the player’s year-to-date rate of extra-base hits per at-bat. I did not discriminate between doubles, triples, and home runs, since I decided to treat these as similar events that say roughly the same thing about power ability. Finally, I included the player’s strikeouts per at-bat to measure the frequency of the ball being put into play.
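The two rate inputs are simple to compute. Here is a short sketch, with hypothetical function names, that leaves the speed score as a number computed elsewhere (its five component formulas are not reproduced here). The 137 strikeouts in 584 at-bats are totals quoted later in this article; the extra-base-hit counts are invented.

```python
def xbh_rate(doubles, triples, home_runs, at_bats):
    """Extra-base hits per at-bat; all three hit types are weighted equally."""
    return (doubles + triples + home_runs) / at_bats

def k_rate(strikeouts, at_bats):
    """Strikeouts per at-bat, the measure of how often the ball is NOT put in play."""
    return strikeouts / at_bats

# Strikeout totals as quoted in the article; extra-base-hit counts are invented.
print(round(k_rate(137, 584), 3))
print(round(xbh_rate(doubles=30, triples=5, home_runs=20, at_bats=550), 3))
```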

Taking every player-season since 2004 with at least 300 plate appearances (1097 total), I used multiple regression to measure the relative effect of each factor on batting average. The resulting model predicts a player’s “component” batting average (cAVG) based on the aforementioned rates and speed score. The model has a correlation coefficient of .75, with an R-squared value of .55.
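For readers who want to see the mechanics, here is a rough sketch of a multiple regression fit by ordinary least squares in plain Python. It is not the model used for this article: the four predictor columns, the “true” coefficients, and every player-season below are synthetic and assumed purely for illustration, and the function names are hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)  # normal equations
    pred = [sum(c * v for c, v in zip(coefs, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coefs, 1.0 - ss_res / ss_tot

# Synthetic player-seasons: line-drive rate, XBH/AB, K/AB, speed score (all invented).
random.seed(0)
TRUE_COEFS = [0.15, 0.40, 0.60, -0.35, 0.004]  # intercept first; made-up values
rows, avgs = [], []
for _ in range(1097):
    row = [random.gauss(0.19, 0.03), random.gauss(0.09, 0.02),
           random.gauss(0.18, 0.05), random.gauss(5.0, 1.5)]
    signal = TRUE_COEFS[0] + sum(c * v for c, v in zip(TRUE_COEFS[1:], row))
    rows.append(row)
    avgs.append(signal + random.gauss(0, 0.01))  # random noise stands in for luck

coefs, r_squared = fit_ols(rows, avgs)
print([round(c, 3) for c in coefs], round(r_squared, 2))
```

With real data, each row would hold one player-season’s rates and speed score, and the fitted coefficients would give the relative weight of each factor.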

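Below are the 10 largest discrepancies in each direction between cAVG and 2007 batting average, according to my model: first the players whose averages fell furthest short of their cAVG, then those whose averages most exceeded it. Diff. is cAVG minus AVG.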

Name               AVG   cAVG  Diff.
Frank Catalanotto .260   .314  .054
Lyle Overbay      .240   .293  .053
Richie Sexson     .205   .251  .046
Bobby Crosby      .226   .269  .043
Marcus Giles      .229   .270  .041
Julio Lugo        .237   .274  .037
Adam Kennedy      .219   .256  .037
Paul Lo Duca      .273   .310  .037
Kevin Mench       .267   .302  .035
Jason Kendall     .242   .276  .034

Name             AVG   cAVG   Diff.
Matt Kemp       .342   .271  -.071
Willy Taveras   .320   .253  -.067
Matt Diaz       .338   .275  -.063
Ichiro Suzuki   .351   .292  -.059
Edgar Renteria  .332   .283  -.049
Norris Hopper   .329   .286  -.043
Magglio Ordonez .363   .320  -.043
Mike Lowell     .324   .285  -.039
Moises Alou     .341   .302  -.039
Cliff Floyd     .284   .245  -.039

I should note that this list contains both Ichiro Suzuki and Willy Taveras, two of the fastest players in the game. It makes sense that their extreme speed would inflate their batting average more than this model predicts, since the cAVG calculation aggregates years’ worth of data from hundreds of players. However, the equation provides an intriguing look at how these players would perform if their speed were slightly closer to the mean level.

A similar list of over- and underperformers calculated using the normal BABIP method, ((AB - HR - K) × League BABIP + HR) / AB, shares only nine of the 20 names, demonstrating a sizeable difference between my model and that method. What, then, is the biggest source of difference between the predictions of the two equations?
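Written out as code, the straight BABIP estimate looks like the sketch below. The function name is hypothetical, and the home run total and league BABIP in the example call are assumed values, not figures from the data set; only the 584 at-bats and 137 strikeouts appear in this article.

```python
def bip_avg(at_bats, home_runs, strikeouts, league_babip):
    """Batting average implied by a league-average rate of hits on balls in play."""
    balls_in_play = at_bats - home_runs - strikeouts
    return (balls_in_play * league_babip + home_runs) / at_bats

# 584 AB and 137 K are quoted in the article; 20 HR and a .300 BABIP are assumptions.
print(round(bip_avg(at_bats=584, home_runs=20, strikeouts=137, league_babip=0.300), 3))
```

Note that the only player-specific inputs are strikeouts and home runs, which is why this method cannot distinguish between hitters with different batted-ball profiles or speed.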

First, as a basis of comparison, here are the largest discrepancies between the cAVG model and the straight BABIP calculation:

Name               AVG   cAVG  BIP AVG  abs(BIP-cAVG)
Chone Figgins     .330   .305   .254       .051
Curtis Granderson .302   .310   .261       .049
Carl Crawford     .315   .307   .260       .047
Juan Uribe        .234   .220   .266       .046
Michael Young     .315   .309   .264       .045
David Ross        .203   .208   .253       .045
Ryan Church       .272   .297   .258       .039
Jason Giambi      .236   .227   .264       .037
Barry Bonds       .276   .277   .314       .037
Bobby Abreu       .283   .302   .265       .037
Yunel Escobar     .326   .310   .274       .036
Derek Jeter       .322   .305   .270       .035
Hunter Pence      .322   .302   .267       .035
Chase Utley       .332   .317   .283       .034
Nook Logan        .265   .258   .224       .034
Matt Holliday     .340   .317   .284       .033
Jack Cust         .256   .257   .224       .033
Travis Buck       .288   .284   .251       .033
Jeremy Hermida    .296   .292   .260       .032
Dmitri Young      .320   .308   .276       .032

Reviewing the data, it is apparent that the cAVG model rewards players who demonstrate above-average skills (line-drive rate and extra-base power) that correlate most strongly with batting average, the combination of which often serves to counteract a high strikeout rate. Though the notional average player who strikes out 137 times in 584 at-bats could be expected to hit only around .260, Curtis Granderson’s notable speed, power, and line-drive rate indicate that his .300 average this year was completely in line with his actual hitting profile, and not simply a fluke of luck. Considering that he is entering the prime of his career, at least part of this gain likely reflects an improvement in skill.

Even more interesting is the case of Jack Cust. The Oakland DH struck out in 40 percent of his at-bats in 2007, but his cAVG suggests that his true average has not been inflated by luck, mostly due to his prodigious line-drive and extra-base hit rates. The reverse is true for a player like Cliff Floyd, who has an acceptable strikeout rate and power, but lacks a batted ball profile that befits his actual batting average.

For one final comparison of the methods, I looked at how well the BABIP method predicted actual batting average, and came up with a .22 R-squared value, compared with the aforementioned .55 for cAVG.
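One way to run this kind of comparison is to square the Pearson correlation between each method’s estimate and the actual average. The sketch below does that for the first 10 rows of the table above, which is an assumption about the calculation rather than a reproduction of it: those players were selected because the two methods disagree most about them, so the results will not match the full-sample .22 and .55. The function and variable names are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# (name, AVG, cAVG, BIP AVG) copied from the first 10 rows of the comparison table.
TABLE_ROWS = [
    ("Chone Figgins", .330, .305, .254), ("Curtis Granderson", .302, .310, .261),
    ("Carl Crawford", .315, .307, .260), ("Juan Uribe", .234, .220, .266),
    ("Michael Young", .315, .309, .264), ("David Ross", .203, .208, .253),
    ("Ryan Church", .272, .297, .258), ("Jason Giambi", .236, .227, .264),
    ("Barry Bonds", .276, .277, .314), ("Bobby Abreu", .283, .302, .265),
]
actual = [row[1] for row in TABLE_ROWS]
r2_cavg = pearson_r([row[2] for row in TABLE_ROWS], actual) ** 2
r2_bip = pearson_r([row[3] for row in TABLE_ROWS], actual) ** 2
print(f"cAVG R-squared: {r2_cavg:.2f}, BIP AVG R-squared: {r2_bip:.2f}")
```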

Where, you ask, is Chris Young? He sits at 66th out of 281 players for absolute difference between the models, with a .237 AVG, a .248 cAVG, and .269 BIP average. Though his high speed score and extra-base-hit power are big pluses, his poor line-drive percentage and high popup rate suggest that his average was depressed only slightly by chance, not nearly the injustice of luck I originally made it out to be.

The idea that power and speed help a player to hit for average is not a revolutionary concept by any means. However, by parsing the components of hitting for average and quantifying their respective levels of influence, we can more easily identify the ideal outcome of the underlying events. It is inevitable that luck will factor into everything that takes place on a baseball field, and batting average is no exception. That said, the more luck we are able to eliminate from an evaluation of a player’s performance, the closer we are to accurately understanding and valuing that player’s true skillset. This model brings us a small step closer to that ideal evaluation.

Jason Paré is a contributor to Baseball Prospectus.
