I spend an awful lot of time talking about baseball data—what data we have, how we can tell if data is good or bad, what data we need to answer certain questions.

Here at BP we use a lot of baseball data, most of it either seasonal accounts (now from the Palmer database) or play-by-play data compiled by the fine, fine folks at Retrosheet. Up until now, we’ve had only scattered usage of one of the most exciting sources of data to come about in recent years—the PITCHf/x data collected by Sportvision for MLB Advanced Media’s Gameday product.

But PITCHf/x is here to stay, and we’ve got a lot of things that we want to start doing with it, like Mike Fast’s investigation of catcher framing. We’re not there yet, but we’re taking a first step toward incorporating more PITCHf/x data into our offerings.

Our first PITCHf/x-based report for our sortables is pitcher and batter plate discipline. These, in concept, are the same metrics published on FanGraphs and quoted throughout the Internet—Zone%, O-Swing%, and so forth. The definitions of the most commonly used figures:

  • O-Swing%: The percentage of pitches outside the strike zone that a batter swings at.
  • Z-Swing%: The percentage of pitches inside the strike zone that a batter swings at.
  • Swing%: The overall percentage of pitches a batter swings at.
  • Zone%: The overall percentage of pitches a batter sees inside the strike zone.
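To make the denominators in those definitions concrete, here is a minimal sketch in Python. The `Pitch` record and its `in_zone` and `swung` fields are hypothetical names invented for illustration, not the layout of any actual PITCHf/x or BIS data set.

```python
from dataclasses import dataclass


@dataclass
class Pitch:
    in_zone: bool  # True if the pitch crossed the plate inside the strike zone
    swung: bool    # True if the batter swung


def pct(numerator, denominator):
    """Return a percentage, or None when there are no qualifying pitches."""
    return 100.0 * numerator / denominator if denominator else None


def plate_discipline(pitches):
    """Compute the four rates from a list of Pitch records."""
    in_zone = [p for p in pitches if p.in_zone]
    out_zone = [p for p in pitches if not p.in_zone]
    return {
        # Swings at pitches outside the zone, per pitch outside the zone.
        "O-Swing%": pct(sum(p.swung for p in out_zone), len(out_zone)),
        # Swings at pitches inside the zone, per pitch inside the zone.
        "Z-Swing%": pct(sum(p.swung for p in in_zone), len(in_zone)),
        # All swings, per pitch seen.
        "Swing%": pct(sum(p.swung for p in pitches), len(pitches)),
        # Pitches inside the zone, per pitch seen.
        "Zone%": pct(len(in_zone), len(pitches)),
    }


sample = [
    Pitch(in_zone=True, swung=True),
    Pitch(in_zone=True, swung=False),
    Pitch(in_zone=False, swung=True),
    Pitch(in_zone=False, swung=False),
    Pitch(in_zone=False, swung=False),
]
stats = plate_discipline(sample)
```

Every one of these numbers depends on the `in_zone` call for each pitch, which is why the source of the location data matters so much.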

As I said, these are the same metrics presented at FanGraphs; what’s not the same is the results. The reason is the data we’re using. FanGraphs uses stringer-collected data from Baseball Info Solutions, recorded by “video scouts” off the same broadcasts the rest of us get on cable. As alluded to above, we’re using PITCHf/x data. But even though the data sources differ, they still measure the same thing, right? More to the point, when they disagree, how do we determine which is better?

I’ve discussed some of the problems with charting balls and strikes from commercial video before:

There are a lot of things conspiring against you being able to judge balls and strikes off of video. You can sum it up broadly like this: your brain is a magnificent thing, and it takes the two-dimensional images you’re seeing on your television and reconstructs them so that you think you’re seeing them in three dimensions. It’s a marvelous process, and if you stop to think about it, it’s pretty amazing.

What it is not, however, is perfect.

In order to present the view that you see, the camera is positioned in the outfield at an offset, and then zoomed in to magnify the picture. This is, in essence, an act of deception—you are made to feel like you’re watching a little ways from behind the pitcher’s mound, when in reality you’re watching from the outfield bleachers.

What the offset does is distort your view of the strike zone; this is the phenomenon of parallax. You can observe this yourself if you go out to your car and check the gas gauge from the passenger’s seat and then from the driver’s seat:

[Figure: Illustration of parallax using a car’s gas gauge.]
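The same effect can be put in rough numbers for a center-field camera. This is only a geometric sketch: the camera distance, the camera offset, and the depth at which the ball is seen are made-up values chosen for illustration, not measurements from any ballpark.

```python
def apparent_shift_inches(camera_offset_ft, camera_distance_ft,
                          ball_depth_ft, ball_x_ft=0.0):
    """Horizontal error from reading a ball against the plate from an offset camera.

    Coordinates: x runs across the plate (0 = center) and y runs from the
    plate (y = 0) out toward the camera. The camera sits at
    (camera_offset_ft, camera_distance_ft) and the ball at
    (ball_x_ft, ball_depth_ft). The line of sight through the ball is
    extended to the plate plane, and the difference between where it lands
    and the ball's true x position is returned in inches.
    """
    scale = camera_distance_ft / (camera_distance_ft - ball_depth_ft)
    apparent_x = camera_offset_ft + (ball_x_ft - camera_offset_ft) * scale
    return (apparent_x - ball_x_ft) * 12.0


# Assumed values: camera 400 ft from the plate and 15 ft off the center line,
# ball seen 5 ft in front of the plate, travelling down the middle.
offset_camera = apparent_shift_inches(15.0, 400.0, 5.0)

# A camera directly on the center line produces no shift for the same ball.
centered_camera = apparent_shift_inches(0.0, 400.0, 5.0)
```

Under these assumed numbers the ball appears a bit more than two inches away from where it really is, on the side opposite the camera. That is the same order of size as the edge calls that decide whether a pitch counts as in or out of the zone.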

You also have problems with depth perception—essentially your brain is “guessing” the depth based upon visual cues in the image. This is difficult enough under the best of circumstances—there are really, really good reasons human beings have two eyes instead of one. Cyclops would be a terrible baseball player. You can get some idea of how this works just by covering one eye and trying to judge distance, then doing it with both eyes open.

The other issue with using commercial video feeds is the frame rate. NTSC video, used in all North American broadcasting, runs at roughly thirty frames per second. (Because of possible interference between the chroma and audio carrier signals, NTSC video has been 29.97 frames per second since the introduction of color.) So each frame works out to a little over three-hundredths of a second.

That sounds like an awfully brief period of time, but a pitched ball spends even less time over the plate. The time from release (defined as 50 feet from the back edge of home plate) to the back edge of home plate is, for an average pitch, a shade over four-tenths of a second. Home plate itself is just a little more than 1.4 feet in length. So the time it takes for a pitched ball to cross home plate is only about a third of a frame.
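The arithmetic behind that comparison is easy to check. The sketch below uses only the figures quoted above and makes one simplifying assumption that the text does not: the ball travels at its average speed the whole way, when in reality it slows a little before reaching the plate.

```python
NTSC_FPS = 30000 / 1001        # the 29.97 frames per second used since color
FLIGHT_DISTANCE_FT = 50.0      # release point to the back edge of the plate
FLIGHT_TIME_S = 0.4            # "a shade over four-tenths", rounded down
PLATE_LENGTH_FT = 17.0 / 12.0  # home plate is 17 inches front to back

# Duration of one video frame, in seconds.
frame_time_s = 1.0 / NTSC_FPS

# Average speed over the flight, assumed constant (a simplification).
average_speed_ft_per_s = FLIGHT_DISTANCE_FT / FLIGHT_TIME_S

# Time the ball spends above the plate, in seconds and in frames.
plate_crossing_s = PLATE_LENGTH_FT / average_speed_ft_per_s
frames_over_plate = plate_crossing_s / frame_time_s

print(f"frame time:        {frame_time_s:.4f} s")
print(f"time over plate:   {plate_crossing_s:.4f} s")
print(f"frames over plate: {frames_over_plate:.2f}")
```

With these inputs the ball is over the plate for a little more than a hundredth of a second, which is roughly a third of a single frame.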

And typically a camera won’t be recording for the entirety of a frame. Modern CCD-based cameras don’t use physical shutters, but all of them will only be recording video for a fraction of the frame time. Consider a particular model of camera that has been used by Fox Sports for their MLB playoff coverage, the Sony HDC-1500. Its slowest shutter speed is 1/60 of a second, so the longest exposure it will take is roughly half a frame in length. At its fastest shutter speed, the camera may only be recording for a shade over one-hundredth of a frame.

What this means, in practical terms, is that there is no guarantee that the video camera will record an image during the moment when a pitched ball passes over the plate. The reason you feel as though it does is an optical illusion caused by the brain’s inability to distinguish between real and apparent movement. (In the past, this has been mistakenly referred to as “persistence of vision,” a real but unrelated optical phenomenon. The preferred nomenclature among the academic community appears to be “apparent motion.”)
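How likely is it that the camera misses the ball over the plate altogether? Here is a back-of-the-envelope sketch. It rests on a simplified model that is my assumption, not a description of any real broadcast chain: one exposure window per frame, and a pitch that arrives at a random point in the frame cycle. The plate-crossing time reuses the figures quoted earlier, at constant average speed.

```python
NTSC_FPS = 30000 / 1001
FRAME_TIME_S = 1.0 / NTSC_FPS


def exposure_fraction(shutter_s):
    """Fraction of one frame period during which the sensor is exposed."""
    return shutter_s / FRAME_TIME_S


def chance_of_overlap(shutter_s, event_s):
    """Chance that an event lasting event_s overlaps the exposure window.

    Simplified model: a single exposure of length shutter_s per frame, and
    an event whose start time is uniformly random within the frame cycle.
    """
    return min(1.0, (shutter_s + event_s) / FRAME_TIME_S)


# Time the ball spends over the plate: plate length / average speed.
plate_crossing_s = (17 / 12) / (50 / 0.4)

# Slowest shutter quoted in the text: 1/60 s, about half a frame.
slowest_fraction = exposure_fraction(1 / 60)
p_slowest = chance_of_overlap(1 / 60, plate_crossing_s)

# Fastest shutter, taken as one-hundredth of a frame per the text.
p_fastest = chance_of_overlap(0.01 * FRAME_TIME_S, plate_crossing_s)
```

Under this model the chance of catching any part of the ball's trip over the plate falls well short of certainty, and it drops sharply as the shutter gets faster. Even when the exposure does overlap, it need not line up with the front of the plate.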

To reemphasize: your perception of the ball as it crosses the plate is the product of a number of optical illusions (chiefly apparent motion and parallax error). Your eyes give unreliable testimony as to the location of the pitched ball in relation to other objects.

The obvious question is, how do we know these effects are meaningfully impacting the data BIS is collecting? In the past I’ve studied the BIS-based plate discipline statistics and found some disturbing irregularities. To sum up:

  • The size of the strike zone is inconsistent from season to season,
  • There are large and unexplained anomalies in the data set (particularly the ’07 Los Angeles teams), and
  • There are measurable park effects even after the obvious outliers have been excluded from the analysis.

It is true that the quality of the data appears to have improved since PITCHf/x was introduced, most likely because BIS now has the PITCHf/x data itself to draw on. Quoting BIS founder John Dewan:

One of the questions that has come up is: How can the video scouts who track pitch location data at Baseball Info Solutions (BIS) be as good as Sportvision's very cool PITCHf/x technology that tracks pitch location using hi-tech camera angles. In short, how can a human being be as good as the technology?

The answer is that, at BIS, it's not simply human vs. technology. The equation at BIS is that technology PLUS human review is much better than technology alone. Let me explain. PITCHf/x technology is a huge step forward in baseball analytics and the pitch location data it provides is excellent. But not perfect. At BIS, they take it a step further. Thanks to the fact that PITCHf/x data is publicly available, when BIS video scouts review video to determine pitch location, they also have information about how PITCHf/x plotted the location. The video scout reviews both the actual video of the pitch and the PITCHf/x location to determine where the pitch is located. In essence, pitch location charting at BIS enhances the charting done by PITCHf/x to come up with what BIS believes to be the best data possible, a kind of Enhanced PITCHf/x.

We’ve discussed the problems with human observation already; how does PITCHf/x avoid those problems? Sportvision, the company that collects PITCHf/x, is allowed to install its own cameras directly in the ballpark. It has the ability to choose lenses and measure the optical effects directly. Because it has more than one camera tracking every pitch, it is able (in essence) to take advantage of parallax. And by fitting a trajectory to the entire flight of the pitched ball rather than focusing on one point in the sequence, it is able to avoid the problem of not having an image of the ball at the exact moment it crosses the plate. Under controlled circumstances, Sportvision’s engineers have been able to establish the accuracy of the PITCHf/x systems to within an inch, or a third of a baseball. Given these massive advantages, a combined approach incorporating both stringer data and precision PITCHf/x data is most likely to degrade, not improve, the quality of the data.
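The trajectory-fitting idea is worth a small illustration. The sketch below is not Sportvision’s actual algorithm. It assumes a constant-acceleration model (each coordinate is a quadratic in time), fits it by ordinary least squares, and then solves for the moment the fitted path reaches the front of the plate. The sample positions are synthetic and noise-free, and all names are hypothetical.

```python
import math


def fit_quadratic(ts, values):
    """Least-squares fit of values ~ c0 + c1*t + c2*t**2 (normal equations)."""
    s = [sum(t ** k for t in ts) for k in range(5)]
    b = [sum(v * t ** k for t, v in zip(ts, values)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [b[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(3):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def evaluate(coeffs, t):
    c0, c1, c2 = coeffs
    return c0 + c1 * t + c2 * t * t


def crossing_time(y_coeffs, plate_y):
    """Earliest positive time at which the fitted y(t) equals plate_y."""
    c0, c1, c2 = y_coeffs
    c0 -= plate_y
    if abs(c2) < 1e-12:
        return -c0 / c1
    disc = math.sqrt(c1 * c1 - 4 * c2 * c0)
    roots = [(-c1 - disc) / (2 * c2), (-c1 + disc) / (2 * c2)]
    return min(r for r in roots if r > 0)


# Synthetic flight (feet, seconds): y is distance from the back of the plate.
true_x, true_y, true_z = (-2.0, 5.0, -4.0), (50.0, -130.0, 10.0), (6.0, -4.0, -12.0)
ts = [0.02 * i for i in range(19)]  # samples stop about 3 ft short of the plate
x_fit = fit_quadratic(ts, [evaluate(true_x, t) for t in ts])
y_fit = fit_quadratic(ts, [evaluate(true_y, t) for t in ts])
z_fit = fit_quadratic(ts, [evaluate(true_z, t) for t in ts])

PLATE_FRONT_FT = 17.0 / 12.0
t_plate = crossing_time(y_fit, PLATE_FRONT_FT)
px, pz = evaluate(x_fit, t_plate), evaluate(z_fit, t_plate)
```

None of the samples is taken at the plate, yet the fit still yields a plate location, because every observation along the flight contributes to the estimate. With real, noisy measurements the same least-squares step averages the noise across the whole flight instead of relying on a single image.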

BIS’s response to these concerns is not particularly reassuring:

As a way to test this, BIS conducted an impartial study. They selected the 100 pitches from their database of the 2010 season that represented the biggest discrepancies in pitch location between BIS data and raw PITCHf/x data. They then meticulously reviewed video once again on all these pitches. The video scouts reviewed the pitch location and selected the data source, either BIS or PITCHf/x, that they believed best represented the true location.

These impartial video reviewers chose BIS plotted pitch location data 55 percent more often than the raw PITCHf/x location as the correct location. The details: 59 choices for BIS pitch location (Enhanced PITCHf/x), 38 choices for the raw PITCHf/x location, 2 pitches that Pitch FX has since corrected, and one pitch where neither location was close.
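It is worth being clear about what “55 percent more often” means here: it is the ratio of the two counts, not a share of the 100 pitches reviewed. A quick check using only the numbers quoted above:

```python
bis_choices = 59      # reviewers preferred the BIS location
pitchfx_choices = 38  # reviewers preferred the raw PITCHf/x location
corrected = 2         # pitches since corrected in the PITCHf/x data
neither = 1           # neither location was close

total = bis_choices + pitchfx_choices + corrected + neither

# How much more often BIS was chosen, relative to the PITCHf/x count.
relative_excess_pct = 100.0 * (bis_choices - pitchfx_choices) / pitchfx_choices

# BIS's share of all pitches reviewed.
bis_share_pct = 100.0 * bis_choices / total

print(f"pitches reviewed:            {total}")
print(f"BIS chosen more often by:    {relative_excess_pct:.0f}%")
print(f"BIS share of all 100 pitches: {bis_share_pct:.0f}%")
```

So the reviewers sided with BIS on 59 percent of the pitches, and that sample consists of the 100 largest disagreements rather than a random draw of pitches.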

Let us take BIS at their word that the reviewers involved in this study were, in fact, impartial. The video feeds the reviewers were watching, however, were not impartial—they were the same video feeds the original video scouts reviewed. In other words, if there was a bias caused by the video source (parallax error originating from the placement of the center field camera, for instance) the reviewers would be more likely to agree with the video scouts than the PITCHf/x data, even though both of them would be less able to tell the location of the pitched ball than the precision tracking data.

This isn’t to suggest that the PITCHf/x data is perfect, however—the accuracy under controlled conditions is likely to be higher than accuracy in the field, where the weight of the people sitting in the stadium is great enough in aggregate to actually move the stadium itself and thus the placement of the cameras.

So, based on work by Mike Fast, we’ve incorporated a series of calibration adjustments to “correct” the plate location data to give a better picture of where a pitch really was when it crossed the plate. And we’re maintaining a consistent definition of the strike zone, which means that a batter or pitcher’s numbers can be directly compared between seasons without fear that an apparent change is really the product of how the numbers are being crunched.
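To show how a calibration adjustment and a fixed zone definition fit together, here is a deliberately simple sketch. It is not the adjustment method or the zone we actually use: the additive offsets, the vertical limits, and every name below are placeholders invented for illustration.

```python
from dataclasses import dataclass

BALL_RADIUS_FT = 1.45 / 12.0  # a baseball is roughly 2.9 inches across
HALF_PLATE_FT = 8.5 / 12.0    # home plate is 17 inches wide


@dataclass(frozen=True)
class Zone:
    """A fixed strike zone in feet. The vertical limits are placeholders."""
    bottom: float = 1.5
    top: float = 3.5
    # A pitch counts if any part of the ball clips the edge of the plate.
    half_width: float = HALF_PLATE_FT + BALL_RADIUS_FT


def calibrate(px, pz, dx=0.0, dz=0.0):
    """Shift a measured plate location by a hypothetical per-park offset."""
    return px + dx, pz + dz


def in_zone(px, pz, zone=Zone()):
    """Classify a plate location against one zone definition for all seasons."""
    return abs(px) <= zone.half_width and zone.bottom <= pz <= zone.top


# A pitch measured 0.90 ft off center at a height of 2.5 ft.
raw_call = in_zone(0.90, 2.5)

# The same pitch after an assumed 0.10 ft horizontal correction.
adjusted_call = in_zone(*calibrate(0.90, 2.5, dx=-0.10))
```

In this example the raw location is a ball and the corrected location is a strike. That is how a small systematic offset in one park can move Zone% and O-Swing%, and it is why the zone itself is held fixed while the locations are corrected.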

This is only our first foray into PITCHf/x—rest assured, we’re not done yet. Even when it comes to the subject of plate discipline, we’re always considering new approaches and will work at incorporating the best analysis possible. So consider this an appetizer course.

(And we have a few non-PITCHf/x-related announcements still up our sleeves in the weeks to come. So we’ve expanded our Big September promotion into the first half of October as well.)