As regular readers may have gathered by now, I spend a lot of time thinking about the validity of the data that’s collected about baseball. The bee in my bonnet these days is really batted-ball data.

We can refer to one of two things when we talk about batted-ball data—trajectory data and location data. Trajectory data describes how the ball travels—typically subdivided into grounders, fly balls, line drives, and popups (also called infield fly balls). Location data typically describes where the ball went—distance and vector, basically.

For now, let’s focus on trajectory data. It’s easier to get our hands on, and easier to condense into a single quantity for study. Simply put—how accurate is the trajectory data we have? And how might the data be biasing our conclusions—about pitchers, hitters, and fielders?

The Story So Far

This isn’t the first time I’ve studied potential problems with batted-ball scoring. A few months ago, I looked at data from Retrosheet, which is produced by the Gameday stringers working for MLB Advanced Media. Those scorers sit in the press box and chart the games on a computer. What I found was a modest correlation between the height of the press box and the line-drive rate reported.

It’s an interesting finding, and it raises some questions about the validity of metrics based on batted-ball data. But it’s also a very tentative finding. So, in the past few months, I’ve been racking my brain trying to come up with another way to study the issue.

Well, it struck me. Here at Baseball Prospectus, we publish batted-ball rates for teams, both for batting and pitching. Fangraphs also reports similar figures on offense and defense.

Methodologically, these reports are practically identical—the only difference is that Fangraphs includes infield flies in fly balls (along with breaking them out as a separate category), while BP does not. But the results are strikingly different. Why?

Because here, we use the Gameday/MLBAM data. At Fangraphs, they use data provided by Baseball Information Solutions. BIS, rather than placing a scorer in the press box, uses video feeds to chart batted balls.

So we have two sources for ostensibly the same data, collected by two distinct data providers using two distinct methods. What can we learn from comparing the two data sets?

Running the Averages

The first thing I did was to subtract infield flies from fly balls in the BIS data, so that we’re comparing apples to apples. What jumps out immediately is that the two data sources regularly disagree even on the league-average rates:

Year    GB(r)   GB(b)   FB(r)   FB(b)   LD(r)   LD(b)   IFFB(r)  IFFB(b)
2003    43.3%   46.1%   22.6%   27.4%   22.5%   18.5%   11.6%    8.0%
2004    44.2%   45.6%   25.3%   27.8%   18.9%   18.5%   11.6%    8.2%
2005    44.2%   45.6%   23.8%   27.9%   20.9%   18.3%   11.1%    8.2%
2006    43.7%   45.2%   26.5%   28.3%   19.6%   18.7%   10.2%    7.9%
2007    43.5%   45.1%   28.3%   28.0%   18.6%   19.1%   9.6%     7.9%
2008    43.9%   45.1%   26.0%   28.1%   20.2%   19.0%   9.9%     7.7%
2009    43.3%   44.9%   28.1%   28.6%   18.9%   18.8%   9.7%     7.7%
Total   43.7%   45.4%   25.8%   28.0%   19.9%   18.7%   10.5%    7.9%

In this case, "b" denotes data from BIS, and "r" denotes data from Retrosheet.
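For concreteness, the apples-to-apples step is just total fly balls minus infield flies. A minimal sketch (the rates here are hypothetical, not taken from the table):

```python
# BIS folds infield flies into its fly-ball category; MLBAM does not.
# To compare the two sources, strip the infield flies back out of the
# BIS fly-ball rate. All numbers below are illustrative.
def comparable_fb_rate(fb_rate_with_iffb, iffb_rate):
    """Outfield-only fly-ball rate: total FB minus infield flies."""
    return fb_rate_with_iffb - iffb_rate

bis_fb_total = 0.359  # hypothetical BIS FB% including popups
bis_iffb = 0.079      # hypothetical BIS infield-fly rate
bis_fb_outfield = comparable_fb_rate(bis_fb_total, bis_iffb)
```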

What this tells us is that someone using BIS data and someone using Retrosheet data do not mean precisely the same thing when they refer to a "fly ball" or a "line drive." It’s a subtle difference, to be sure, but one worth noting.

Looking for Park Effects

The next step was to see whether a team’s home park had an effect on the disagreement between the two sources. This is akin to figuring out park factors without the benefit of home/road splits—something like teaching an elephant to play the piano. Let’s accept from the outset that our elephant isn’t going to be able to play Rachmaninoff. But let’s see if we can at least get Chopsticks, shall we?

So let’s compare the data on a team-by-team basis, both on offense and defense. I took the rate for each team (using both the BIS and MLBAM data) and subtracted the league average for that season to produce "normalized" rates. In other words, looking at the 2009 BIS data, instead of saying that a team had a line-drive rate of 23.8 percent, I would say its line-drive rate was 5 percentage points above average.

Then I subtracted the normalized rate for MLBAM data from the normalized rate for BIS data to produce what we can call "errors" between the two sets—or "residuals," if you prefer a less loaded term. So if you have a team with a normalized rate of 5 percent using BIS data and 2 percent using MLBAM data, the "error" was 3 percent.
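The normalization and residual steps can be sketched in a few lines (the team names and rates here are hypothetical):

```python
# Step 1: "normalize" each team's rate by subtracting the league
# average for that season. Step 2: difference the two sources'
# normalized rates to get the residual. All figures are made up.

def normalize(rates):
    """Map of team -> rate into team -> rate above league average."""
    league_avg = sum(rates.values()) / len(rates)
    return {team: r - league_avg for team, r in rates.items()}

def residuals(bis_rates, mlbam_rates):
    """BIS normalized rate minus MLBAM normalized rate, per team."""
    bis_n, mlbam_n = normalize(bis_rates), normalize(mlbam_rates)
    return {team: bis_n[team] - mlbam_n[team] for team in bis_n}

bis = {"HOU": 0.238, "SEA": 0.178, "NYA": 0.190}    # hypothetical LD%
mlbam = {"HOU": 0.208, "SEA": 0.195, "NYA": 0.203}  # hypothetical LD%
resid = residuals(bis, mlbam)  # e.g. for HOU, BIS "sees" more liners
```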

In order to see if the error was due to a consistent park effect, I compared the data for each park to the same data for the next season. The year-to-year correlations were:

LD: 0.503
FB: 0.383
GB: 0.584
IFFB: 0.215
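The year-to-year check is a plain Pearson correlation between each park's residual in year one and the same park's residual in year two. A self-contained sketch, with toy numbers rather than the actual residuals:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy park residuals: a high correlation means the BIS/MLBAM gap
# for a given park persists from one season to the next.
year1 = [0.030, -0.017, -0.013, 0.021, -0.008]
year2 = [0.024, -0.011, -0.016, 0.015, -0.002]
r = pearson(year1, year2)
```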

If you prefer a more visual representation, we can look at scatter plots. The x-axis represents the difference between normalized rates in year one; the y-axis represents the difference between normalized rates in year two.

[Scatter plots not reproduced here: one each for ground balls, fly balls, line drives, and infield flies.]

What this strongly suggests is that there are persistent biases in how batted-ball trajectories are scored from park to park. (If so, we should expect this study to actually understate the park-to-park bias, since a team’s home and road stats are lumped together.)
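The parenthetical deserves a number. In a toy model (all figures hypothetical), a scoring bias that applies only in a team's home park shows up at roughly half strength in team-level rates, because the unbiased road half of the sample dilutes it:

```python
# A +2-point home-park scoring bias, diluted by unbiased road games.
park_bias = 0.020   # hypothetical bias in the home scorer's LD rate
true_rate = 0.190   # hypothetical true line-drive rate
home_share = 0.5    # about half of a team's balls in play come at home

observed = home_share * (true_rate + park_bias) + (1 - home_share) * true_rate
apparent_bias = observed - true_rate  # half the actual park bias
```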

What This Means

Assuming there is bias represented in this data, there are three potential explanations:

  1. That the MLBAM data is correct and the BIS data is biased
  2. That the BIS data is correct and the MLBAM data is biased
  3. That both data sets are biased to some extent, and do not share the same set of biases

Which is true? The truth is, we don’t really know. And I’m not even sure, given the data available, that we can know. (It is possible that this could be better resolved with a more granular look at the data, but I can’t say for sure right now.)

But we can take a look at how the difference in batted-ball metrics affects current metrics. Let’s consider tRA for a minute. tRA is a component run estimator for pitchers (somewhat akin to SIERA, but with a focus on describing what happened, rather than predicting future performance) which takes into account batted-ball data. It’s an interesting case study since it’s displayed on two sites—StatCorner, which uses the MLBAM data, and Fangraphs, which uses the BIS data.

[Fangraphs, it should be noted, scales tRA to ERA instead of RA by multiplying by .92; all figures presented below from Fangraphs divide by .92 to place them on the same scale as the StatCorner figures.]
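The rescaling applied to the Fangraphs figures below is just a division by the same constant:

```python
# Fangraphs multiplies tRA by 0.92 to put it on the ERA scale;
# dividing by 0.92 puts it back on the RA scale StatCorner uses.
def to_ra_scale(fangraphs_tra):
    return fangraphs_tra / 0.92

def to_era_scale(ra_scale_tra):
    return ra_scale_tra * 0.92
```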

I chose to look at Wandy Rodriguez as an example because Houston's Minute Maid Park has been one of the more extreme parks over the period studied, with an average difference of .014 in normalized line-drive rates. In other words, BIS has reported a higher line-drive rate on average than MLBAM for Houston players. Here’s how the two sites, using the same metric, view Rodriguez:

Year   Fangraphs   StatCorner
2007   4.53        4.26
2008   4.48        3.97
2009   3.91        3.41

Over a period of several years, the Fangraphs figure is consistently higher than the StatCorner figure—that is, tRA built on BIS data consistently rates Rodriguez as the worse pitcher.

Let’s look at another team with an extreme difference—the Mariners, with an average difference of -0.007. In other words, a park where BIS tends to see fewer line drives than MLBAM. Consider Felix Hernandez:

Year   Fangraphs   StatCorner
2007   3.83        4.19
2008   4.23        4.45
2009   3.32        3.27

Not as dramatic a difference as with Wandy, but Fangraphs shows King Felix as the better pitcher in 2007 and 2008, with the two sites nearly even in 2009.

What I want to point out is that this is not a difference of opinion between the two on how to evaluate pitchers—Graham MacAree of StatCorner collaborated with Dave Appelman of Fangraphs to implement tRA there. As Appelman points out:

There are a couple things which are different between the StatCorner version of tRA and the version implemented on FanGraphs. The main difference is we’re using Baseball Info Solutions batted-ball stats instead of Gameday batted-ball stats. The other difference, though probably not as major, is we’re using different park factors.

The thing is, the differences between BIS and MLBAM data are influenced by park biases that are persistent over time—no matter how much data you have, they will not wash out of the sample.

And what about fielding? Again, let’s consider a rather famous Mariner—Ichiro Suzuki. Tom Tango compared Ichiro’s UZR as computed from STATS data (which, like the MLBAM data, is collected from the press box) with his UZR as computed from BIS data. Using BIS data, Ichiro grades out as a very good defensive outfielder; using STATS data, he looks much less stellar.

The difference isn’t method—UZR was used in both cases. The difference is the underlying data. And the difference between the data sets is persistent and biased based upon a player’s home park.

And we simply don’t know who is right and who is wrong.

So what do we do about this? Stop taking the data at face value. Try to figure out what the errors in the data are and how they affect specific players over a period of years. And then adjust for those errors as best we can.

What We Can Do

The long-term answer is to get better data. Let’s face it—no matter how much we massage the data, there simply is not a way to objectively define the difference between a fly ball and a line drive. It is inherently a subjective and somewhat arbitrary distinction.

There’s a lot of work being done right now in precision batted-ball tracking, both with cameras and radar. Someday that may percolate down—but it won’t tell us anything about players from before the introduction of those technologies. Failing that, a simple stopwatch could provide more accurate, quantifiable data than what we’re getting right now. And it is possible, to some extent, to review video of past games and get those measurements for players and seasons already past.
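As a sketch of why even a stopwatch helps: under a no-drag, no-spin approximation with the ball landing at launch height, hang time alone pins down the vertical launch velocity, and hang time plus landing distance pin down the launch angle. Real batted balls see substantial drag and lift, so treat this strictly as an illustration of the idea, not a usable classifier:

```python
from math import atan2, degrees

G = 9.8  # gravitational acceleration, m/s^2

def launch_angle_deg(hang_time_s, distance_m):
    """Estimate launch angle from hang time and landing distance,
    ignoring drag and spin and assuming the ball lands at launch
    height. Purely illustrative physics."""
    v_vertical = G * hang_time_s / 2         # symmetric up-and-down flight
    v_horizontal = distance_m / hang_time_s  # constant without drag
    return degrees(atan2(v_vertical, v_horizontal))

angle_fly = launch_angle_deg(5.0, 100.0)   # a long, high fly ball
angle_liner = launch_angle_deg(1.0, 30.0)  # a hard, low line drive
```

The point is only that two objectively measurable quantities, time and landing spot, already separate the high fly from the low liner without anyone's eyeball judgment.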

In the meantime, consider this my sabermetric crisis of faith. It’s not that I don’t believe in the objective study of baseball. I’m just not convinced, at this point, that anything dealing with batted-ball data is a wholly objective study. And where does this leave us with existing metrics that use batted-ball data? Again, I’m not sure. I can tell you I’m a lot less comfortable accepting their conclusions—even over a large number of seasons—than I was in the past.

Comments
WOW. Great article, important, too, given the number of analyses on this site dealing with batted ball data (fielding, line drive rates for hitters).

Is the solution doing what the TV industry does with ratings: choosing one set of biases to play the game under? Or is there a way to screen out the biases?
Won't the data get better (perfect?) once HitFX is available? It seems like the solution is already in the works. Still, this is very eye opening!
I actually have a sample of Hit F/X data, provided by Sportvision in advance of their Summit last season.

What I can tell you is that, given that sample of data, you can't fully derive the flight of the ball from the parameters they give.

Hit F/X (at least in the form I was provided) contains three of the four variables needed:

* Speed
* Horizontal angle
* Vertical angle

The fourth parameter is spin of the ball, which cannot be derived from the raw ball tracking data Sportvision has (essentially left over from the collection of Pitch F/X data).

What you need is full field tracking of the ball through to the landing point. There are currently two major efforts to do this - Sportvision is working on their Field F/X system, and Trackman has their radar-based system which can track the flight of the batted ball. (We saw this functionality in use during Fox playoff telecasts.)

The question then becomes when those technologies will be widely adopted, and if/when that data becomes available to the public.
This article should be an early favorite for one of next year's Sabermetric Writing Awards. Bravo.
Can't you figure out which data set is better by examining which is more useful in prediction?
Colin - great article - but as an aside, was that a deliberate reference to The Seven Year Itch?
(as in Rachmaninoff/Chopsticks)....
GPS in the core of the ball is the solution. Great article.
Actually, what seems most puzzling to me is the discrepancy in GB rate. Are 1 in 70 ground balls really being accidentally classified as line drives?

But, yeah, I agree with the above about HitFX. Once that's available, it can be directly compared to both systems to see which one is more accurate for historical data.

Is there any time frame for when it will be all publicly available?
This should give pause to all that use batted ball data to pass judgment on players. One case in particular that comes to mind is Bobby Abreu, whose UZR figures absolutely sank his chances at a multiyear deal two years ago. Clubs cited his terrible fielding numbers in contract talks, and he eventually had to settle for $5M with the Angels. Then he went out, posted his usual offensive production and ended up in right field anyway. I'm not saying he's a good fielder, but the bias created by UZR certainly hurt him in the wallet.
Is there any evidence that UZR was the culprit in Abreu's difficulties finding a job? I know his defense was discussed a lot as a roadblock, but many decidedly un-saber-friendly observers of the 2008 Yankees took note of Abreu's issues in right field, including the Yankees announcers.
Great work. The best baseball stat analysis I've read in years.
Yes, wow! Terrific column, Colin.

This will definitely change the way I look at the data.

Just to comment publicly on a colleague's work: this was a great piece, and it drives to the foundation of a basic, Kuhnian dilemma in the data, especially for those who try to use it to make flat declarations about player value. If basic research indicates that the "facts" themselves have a transitive nature, then this is no small problem. Speaking as a skeptic of so much of the merely suggestive nature of many recent metrics, folks should take this column to heart.
This has got to be a common problem in science, right? Consider the recent brouhaha over global warming and the methods used to collect that data in the US.

I wouldn't throw the data out or decide we can't prove anything. The nature of science is to constantly refine and discover better answers.

Consider the differences between Newtonian and relativistic physics. Newtonian is still good enough for most cases - not that we don't strive to do better. I'm not a science geek, but my reading of media articles on the giant collider is that they are trying to resolve conflicts between the theory of relativity and string theory.

I applaud the raising of the problem and the need to find a rough fix and then a permanent fix. But really none of us should be surprised - this happens all the time.

Perhaps the issue is that too many readily interpret baseball statistics as fact when they need to be treated more as working theories.
The basic problem has been identified, and that leads to ideas about the basic solution. First, there's a need for a definition, a rigorous one, about what constitutes a line drive. Without that, the chaos will continue.

That definition should be based on physics: trajectory, velocity, wind. Once that's done the next step would be to get real time data on just that. There may be a strong enough relationship between the time a ball takes to get to a certain part of the outfield and those 3 elements to use that as a surrogate.

The visual element is too subjective, that's obvious. The answer is, as Colin points out, better data. Given the instrumentation that's finding its way into the ballpark, that may be coming.

Excellent piece. Thank you Colin.
I love this kind of data detective work. Bravo.