I spend an awful lot of time talking about baseball data—what data we have, how we can tell if data is good or bad, what data we need to answer certain questions.
Here at BP we use a lot of baseball data, most of it either seasonal accounts (now from the Palmer database) or play-by-play data compiled by the fine, fine folks at Retrosheet. Up until now, we’ve had only scattered usage of one of the most exciting sources of data to come about in recent years—the PITCHf/x data collected by Sportvision for MLB Advanced Media’s Gameday product.
But PITCHf/x is here to stay, and we’ve got a lot of things that we want to start doing with it, like Mike Fast’s investigation of catcher framing. We’re not there yet, but we’re taking a first step toward incorporating more PITCHf/x data into our offerings.
Our first PITCHf/x-based report for our sortables is pitcher and batter plate discipline. These, in concept, are the same metrics published on FanGraphs and quoted throughout the Internet—Zone%, O-Swing%, and so forth. The definitions of the most commonly used figures:
- O-Swing%: The percentage of pitches outside the strike zone that a batter swings at.
- Z-Swing%: The percentage of pitches inside the strike zone that a batter swings at.
- Swing%: The overall percentage of pitches a batter swings at.
- Zone%: The overall percentage of pitches a batter sees inside the strike zone.
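To make the denominators explicit, here is a minimal sketch of how the four rates above could be computed from a list of pitches. The `Pitch` record and its `in_zone` and `swung` flags are hypothetical names chosen for illustration; they are not the fields of our database or of the PITCHf/x feed.

```python
from dataclasses import dataclass


@dataclass
class Pitch:
    in_zone: bool  # hypothetical flag: pitch crossed inside the strike zone
    swung: bool    # hypothetical flag: batter offered at the pitch


def rate(numerator, denominator):
    """Return a proportion, or NaN when there are no pitches to divide by."""
    return numerator / denominator if denominator else float("nan")


def plate_discipline(pitches):
    """Compute the four rates as proportions between 0 and 1."""
    in_zone = [p for p in pitches if p.in_zone]
    out_of_zone = [p for p in pitches if not p.in_zone]
    return {
        # Swings at pitches outside the zone / pitches outside the zone
        "O-Swing%": rate(sum(p.swung for p in out_of_zone), len(out_of_zone)),
        # Swings at pitches inside the zone / pitches inside the zone
        "Z-Swing%": rate(sum(p.swung for p in in_zone), len(in_zone)),
        # All swings / all pitches
        "Swing%": rate(sum(p.swung for p in pitches), len(pitches)),
        # Pitches inside the zone / all pitches
        "Zone%": rate(len(in_zone), len(pitches)),
    }


# Five made-up pitches: two in the zone (one swing), three outside (one swing).
sample = [
    Pitch(in_zone=True, swung=True),
    Pitch(in_zone=True, swung=False),
    Pitch(in_zone=False, swung=True),
    Pitch(in_zone=False, swung=False),
    Pitch(in_zone=False, swung=False),
]
stats = plate_discipline(sample)
```

Note that every one of these rates depends on the `in_zone` call, which is why the source of the pitch location data matters so much.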
As I said, these are the same metrics presented at FanGraphs; what’s not the same is the results. And the reason for this is the data we’re using. FanGraphs uses stringer-collected data from Baseball Info Solutions, recorded by “video scouts” off the same broadcasts the rest of us get on cable and MLB.tv. As alluded to above, we’re using PITCHf/x data. But despite the data source being different, they still measure the same thing, right? More to the point, when they disagree, how do we determine which is better?
I’ve discussed some of the problems with charting balls and strikes from commercial video before:
There are a lot of things conspiring against you being able to judge balls and strikes off of video. You can sum it up broadly like this—your brain is a magnificent thing, and it takes the two-dimensional images you’re seeing on your television and reconstructs it so that you think you’re seeing it in three dimensions. It’s a marvelous process, and if you stop to think about it, it’s pretty amazing.
What it is not, however, is perfect.
In order to present the view that you see, the camera is positioned in the outfield at an offset, and then zoomed in to magnify the picture. This is, in essence, an act of deception—you are made to feel like you’re watching a little ways from behind the pitcher’s mound, when in reality you’re watching from the outfield bleachers.
The offset distorts your view of the strike zone; this is the phenomenon of parallax. You can observe it yourself if you go out to your car and check the gas gauge from the passenger’s seat and then from the driver’s seat.
You also have problems with depth perception—essentially your brain is “guessing” the depth based upon visual cues in the image. This is difficult enough under the best of circumstances—there are really, really good reasons human beings have two eyes instead of one. Cyclops would be a terrible baseball player. You can get some idea of how this works just by covering one eye and trying to judge distance, then doing it with both eyes open.
The other issue with using commercial video feeds is the frame rates. NTSC video, used in all North American broadcasting, has video at roughly thirty frames per second. (Because of possible interference between the chroma and audio carrier signals, NTSC video has been 29.97 frames per second since the introduction of color.) So each frame works out to a little over three-hundredths of a second.
That sounds like an awfully brief period of time, but it’s not as brief as the flight of a baseball pitched by a major leaguer. The time from release (defined as 50 feet from the back edge of home plate) to the back edge of home plate is, for an average pitch, a shade over four-tenths of a second. Home plate itself is just a little more than 1.4 feet in length. So the time it takes for a pitched ball to cross home plate is only about a third of a frame.
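The arithmetic behind that comparison can be sketched in a few lines. This is a rough check, not a physical model: it assumes a flight time of exactly 0.4 seconds and treats the ball as moving at its average speed the whole way, when in reality it is slowing down as it reaches the plate.

```python
# NTSC frame rate is 30000/1001 frames per second (about 29.97).
NTSC_FPS = 30000 / 1001
frame_time_s = 1 / NTSC_FPS  # a little over three-hundredths of a second

# Assumed inputs taken from the figures in the text.
flight_distance_ft = 50.0   # release point to back edge of the plate
flight_time_s = 0.4         # assumption: rounded down from "a shade over"
plate_length_ft = 17 / 12   # home plate is 17 inches front to back

# Simplification: use the average speed over the whole flight.
avg_speed_ft_per_s = flight_distance_ft / flight_time_s
crossing_time_s = plate_length_ft / avg_speed_ft_per_s

# Fraction of one video frame that the ball spends over the plate.
frames_to_cross = crossing_time_s / frame_time_s

print(f"frame time:    {frame_time_s:.4f} s")
print(f"crossing time: {crossing_time_s:.4f} s")
print(f"frames:        {frames_to_cross:.2f}")
```

Under these assumptions the ball spends roughly a hundredth of a second over the plate, which is about a third of a single frame.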
And typically a camera won’t be recording for the entirety of a frame. Modern CCD-based cameras don’t use physical shutters, but each still records for only a fraction of the frame time. Consider a particular model of camera that has been used by Fox Sports for their MLB playoff coverage, the Sony HDC-1500. Its slowest shutter speed is 1/60 of a second, so the longest exposure it will take is roughly half a frame in length. At its fastest shutter speed, the camera may only be recording for a shade over one-hundredth of a frame.
What this means, in practical terms, is that there is no guarantee that the video camera will record an image during the moment when a pitched ball passes over the plate. The reason you feel as though it does is an optical illusion caused by the brain’s inability to distinguish between real and apparent movement. (In the past, this has been mistakenly referred to as “persistence of vision,” a real but unrelated optical phenomenon. The preferred nomenclature among the academic community appears to be “apparent motion.”)
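To put a rough number on "no guarantee," here is a sketch that estimates how often an exposure would overlap the ball's time over the plate. It assumes the pitch timing is uniformly random relative to the frame clock, reuses the simplified 0.4-second flight time, and takes 1/2000 of a second as the fastest shutter speed. That last figure is an assumption chosen to match "a shade over one-hundredth of a frame," not a confirmed camera specification.

```python
FRAME_TIME_S = 1001 / 30000                # one NTSC frame
CROSSING_TIME_S = (17 / 12) / (50.0 / 0.4)  # plate length / average speed

SLOWEST_SHUTTER_S = 1 / 60    # from the text: roughly half a frame
FASTEST_SHUTTER_S = 1 / 2000  # assumption, see the paragraph above


def exposure_fraction(exposure_s, frame_s=FRAME_TIME_S):
    """Share of each frame period during which the sensor is recording."""
    return exposure_s / frame_s


def overlap_probability(exposure_s, crossing_s=CROSSING_TIME_S,
                        frame_s=FRAME_TIME_S):
    """Chance that some part of an exposure coincides with the ball being
    over the plate, for a pitch arriving at a uniformly random phase of
    the frame clock. The two intervals overlap whenever the arrival falls
    in a window of length exposure + crossing out of each frame period."""
    return min(1.0, (exposure_s + crossing_s) / frame_s)


slow_fraction = exposure_fraction(SLOWEST_SHUTTER_S)
fast_fraction = exposure_fraction(FASTEST_SHUTTER_S)
slow_overlap = overlap_probability(SLOWEST_SHUTTER_S)
fast_overlap = overlap_probability(FASTEST_SHUTTER_S)

print(f"slowest shutter: {slow_fraction:.3f} of a frame, "
      f"overlap chance {slow_overlap:.2f}")
print(f"fastest shutter: {fast_fraction:.3f} of a frame, "
      f"overlap chance {fast_overlap:.2f}")
```

Even an overlap only means the ball was somewhere over the plate during part of the exposure; at the slow shutter speed it would be a blur rather than a crisp location.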
To reemphasize—your perception of the ball as it crosses the plate is the product of a number of optical illusions (chiefly apparent motion and parallax error). Your eyes give an unreliable testimony as to the location of the pitched ball in relation to other objects.
The obvious question is, how do we know these effects are meaningfully impacting the data? There are several signs:
- The size of the strike zone is inconsistent from season to season,
- There are large and unexplained anomalies in the data set (particularly the ’07 Los Angeles teams), and
- There are measurable park effects even after the obvious outliers have been excluded from the analysis.
It is true that the quality of the data appears to have improved since the introduction of PITCHf/x, most likely because of PITCHf/x itself. Quoting from BIS founder John Dewan:
One of the questions that has come up is: How can the video scouts who track pitch location data at Baseball Info Solutions (BIS) be as good as Sportvision's very cool PITCHf/x technology that tracks pitch location using hi-tech camera angles. In short, how can a human being be as good as the technology?
The answer is that, at BIS, it's not simply human vs. technology. The equation at BIS is that technology PLUS human review is much better than technology alone. Let me explain. PITCHf/x technology is a huge step forward in baseball analytics and the pitch location data it provides is excellent. But not perfect. At BIS, they take it a step further. Thanks to the fact that PITCHf/x data is publicly available, when BIS video scouts review video to determine pitch location, they also have information about how PITCHf/x plotted the location. The video scout reviews both the actual video of the pitch and the PITCHf/x location to determine where the pitch is located. In essence, pitch location charting at BIS enhances the charting done by PITCHf/x to come up with what BIS believes to be the best data possible, a kind of Enhanced PITCHf/x.
We’ve discussed the problems with human observation already; how does PITCHf/x avoid those problems? Sportvision, the company that collects PITCHf/x, is allowed to install their own cameras directly in the ballpark. They have the ability to choose lenses and measure the optical effects directly. Because they have more than one camera tracking every pitch, they are able to (in essence) take advantage of parallax rather than be misled by it. And by fitting a trajectory to the entire flight of the pitched ball rather than focusing on one point of the entire sequence, they are able to avoid the problem of not having an image of the ball at the exact moment it crosses the plate. Under controlled circumstances, Sportvision’s engineers have been able to establish the accuracy of the PITCHf/x systems to within an inch, or a third of a baseball. Given these massive advantages, a combined approach incorporating both stringer data and precision PITCHf/x data is most likely to degrade, not improve, the quality of the data.
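The trajectory-fitting idea can be illustrated with a small sketch. This is not Sportvision's implementation: it assumes a constant-acceleration model for each coordinate, a hypothetical 60-samples-per-second tracker, made-up noise-free positions, and the front edge of the plate as the reference plane. The point is only that a fitted curve yields a plate location even when no sample was taken at the plate.

```python
import math

PLATE_Y_FT = 17 / 12  # assumed reference plane: front edge of home plate


def fit_quadratic(ts, vals):
    """Least-squares fit of vals ~ c0 + c1*t + c2*t**2 (normal equations)."""
    s = [sum(t ** k for t in ts) for k in range(5)]
    b = [sum(v * t ** k for t, v in zip(ts, vals)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [b[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def evaluate(c, t):
    """Evaluate the fitted quadratic at time t."""
    return c[0] + c[1] * t + c[2] * t * t


def time_at_plane(cy, y_target):
    """Earliest positive time at which the fitted y(t) reaches y_target."""
    a2, a1, a0 = cy[2], cy[1], cy[0] - y_target
    if abs(a2) < 1e-12:
        return -a0 / a1
    disc = math.sqrt(a1 * a1 - 4 * a2 * a0)
    return min(r for r in ((-a1 + disc) / (2 * a2),
                           (-a1 - disc) / (2 * a2)) if r > 0)


def true_position(t):
    """Made-up pitch: x is horizontal, y is distance from the plate,
    z is height, all in feet."""
    return (-2.0 + 5.0 * t - 4.0 * t * t,
            50.0 - 130.0 * t + 12.5 * t * t,
            6.0 - 4.0 * t - 10.0 * t * t)


# Hypothetical tracker samples that stop well short of the plate.
ts = [i / 60 for i in range(19)]
xs, ys, zs = zip(*(true_position(t) for t in ts))

cx, cy, cz = (fit_quadratic(ts, v) for v in (xs, ys, zs))
t_plate = time_at_plane(cy, PLATE_Y_FT)
plate_x, plate_z = evaluate(cx, t_plate), evaluate(cz, t_plate)
print(f"plate crossing: x={plate_x:.3f} ft, z={plate_z:.3f} ft")
```

In this toy example the last sample is taken about 12 feet in front of the plate, yet the fit still recovers the crossing point. Real measurements carry noise, so the recovered location would come with some error instead of matching exactly.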
BIS’s response to these concerns is not particularly reassuring:
As a way to test this, BIS conducted an impartial study. They selected the 100 pitches from their database of the 2010 season that represented the biggest discrepancies in pitch location between BIS data and raw PITCHf/x data. They then meticulously reviewed video once again on all these pitches. The video scouts reviewed the pitch location and selected the data source, either BIS or PITCHf/x, that they believed best represented the true location.
These impartial video reviewers chose BIS plotted pitch location data 55 percent more often than the raw PITCHf/x location as the correct location. The details: 59 choices for BIS pitch location (Enhanced PITCHf/x), 38 choices for the raw PITCHf/x location, 2 pitches that Pitch FX has since corrected, and one pitch where neither location was close.
Let us take BIS at their word that the reviewers involved in this study were, in fact, impartial. The video feeds the reviewers were watching, however, were not impartial—they were the same video feeds the original video scouts reviewed. In other words, if there was a bias caused by the video source (parallax error originating from the placement of the center field camera, for instance) the reviewers would be more likely to agree with the video scouts than the PITCHf/x data, even though both of them would be less able to tell the location of the pitched ball than the precision tracking data.
This isn’t to suggest that the PITCHf/x data is perfect, however—the accuracy under controlled conditions is likely to be higher than accuracy in the field, where the weight of the people sitting in the stadium is great enough in aggregate to actually move the stadium itself and thus the placement of the cameras.
So, based on work by Mike Fast, we’ve incorporated a series of calibration adjustments to “correct” the plate location data to give a better picture of where a pitch really was when it crossed the plate. And we’re maintaining a consistent definition of the strike zone, which means that a batter or pitcher’s numbers can be directly compared between seasons without fear that an apparent change is really the product of how the numbers are being crunched.
This is only our first foray into PITCHf/x—rest assured, we’re not done yet. Even when it comes to the subject of plate discipline, we’re always considering new approaches and will work at incorporating the best analysis possible. So consider this an appetizer course.
(And we have a few non-PITCHf/x related announcements still up our sleeves in the weeks to come. So we’ve expanded our Big September promotion into the first half of October as well.)
I'm a Cardinals fan, and in the slate of new articles today, we have exactly nothing about that series yet again.
This isn't your fault Colin. But BP, during an exciting postseason, can we find time to give the dry statistical research articles a rest and actually pay attention to the baseball games being played? Even if the PITCHf/x stats announced here are the greatest thing ever, is the postseason the time to launch them? Might baseball fans be focused on something else?
Albert Pujols is a good hitter.
Roy Halladay is a good pitcher.
Really, is there more that hasn't already been said a few dozen times?
And I'm asking for analysis of the games, not the players. There is a difference.
This article was a pretty good read. Hijacking the comments thread to complain about something unrelated just irritated me. But you made your point, and the powers that be responded, so now we can all go out for a beer.
I'm not sure if commenting on something's placement as a lead story is hijacking, but if it is, I apologize. If Colin was in any way insulted, I particularly apologize.
A long time subscriber, I've tried posting about my frustration with the site in other comment areas of the site, and I have sent a couple emails. I thought I'd express my frustration with the site one last time before I shuffled on.
Peace.
I agree with much of the criticism here. This has purely been my error. I'm currently preoccupied with deadlines for our next book, Extra Innings: More Baseball Between the Numbers (we call it BBTN II around here) so while I did think about assignments for division series previews, I didn't think through continuing those beats into the series themselves.
Moreover, and I think this is the real lesson of what has happened here, I am the first BP editor to have more than one BBWAA member on staff, giving us the possibility of having the kind of on-site coverage Jay has been giving us. Normally, BP writers are self-directed and our coverage hasn't been so systematic, but Jay's dispatches have been so well done that they pointed up our lack of analytic coverage of the other series. That will change immediately.
Finally, both Jay and I will have home park credentials should the Yankees or Phillies make it to the next round (and if either makes it to the World Series as well) and so we will definitely continue to have detailed on-site coverage throughout the end of the postseason. As always, I appreciate your feedback and I hope that you continue to throw both bouquets and brickbats as our work merits them.
--Steve, Editor-in-Chief
Nothing to parse, and everything taken at face value. If only politicians were more like Steve has shown himself to be here.
Actually, I kind of agree that while the effort is appreciated the timing is odd to say the least.
Now, we mainly get the statistics. It's like buying a Reese's Peanut Butter Cup that contains ONLY peanut butter. I like peanut butter, but I would still miss the chocolate. Noting that I want some chocolate back in my peanut butter cup in no way disparages peanut butter or those who make it.
Given their large slate of writers, I would hope someone at BP could be assigned to watch a NL postseason series and write about it in an entertaining way that uses the insight that Baseball Prospectus provides.
If they don't have anyone who can do that, they could hire a special guest to cover postseason baseball. There has to be someone out there they can get.
As it is now, I have the disturbing impression that no one at BP is even watching the NL postseason series.
http://www.baseballprospectus.com/article.php?articleid=2412
I know Colin was not hired to write this sort of piece. But can't someone be?
To those upset at the timing of this post, you must not be on Twitter during these playoff games. Holy crap, it seems like 90% of the discussion is fans whining and bitching about the strike zones. Heck, even LaRussa and Girardi have channeled their inner Phil Jackson in attempts to get the umps to change their zones mid-game/mid-series.
I say that for two reasons. One, I looked at where a few umpires stand, and it had no obvious effect on their zone:
Home Plate Umpire Positioning
Two, if you look at the difference in zone between RHB and LHB, which I think is what bugs a lot of people, the zone for LHB is not actually wider, it's just shifted toward the outside (on both the inside and outside edges). This makes sense if it's due to the catcher target, which is shifted outside by 2-3 inches for LHB. But if it's because the umpire is in the slot and can't get a good view of the outside edge, why wouldn't he call the inside edge the same for RHB and LHB?
Which is a pain.
The four decimal places, and the format in general, are a little tough on the eyes though. I think it would be helpful to overhaul the presentation format, and not just for this stat report.
As an example, when Jeff Euston's compensation data was added (another great addition!), once the ticker at the top of the page disappeared, it is a bear to try to find the information now.
I hate to use another website as an example (but Colin mentioned FanGraphs in the article so I think it's OK), but a large part of the reason FanGraphs data is popular is the ease of navigation. Making custom reports is great, but I'd wager $1 that having easily navigable standard reports would result in more use.
With that out of the way: The lack of playoff coverage continues to amaze/dismay me.
A few suggestions:
1) Difficulty accessing definitions. Definitions are not displayed on the stats page and abbreviations are not always clear. One can access the glossary through the headers, but it requires loading a separate page -- either in a separate window or by leaving the stats page. Could these be hover tool-tips instead? Or at minimum, could the glossary search bar be placed on the report page as well?
2) Sorting. Having multi-layered sorting is nice, but it's of secondary importance/value to simple, quick sorts. Being able to sort quickly by clicking on the field headers would be a welcome addition. Perhaps you could add a neutral sort icon (e.g. "--") like the up and down arrows that would all be clickable and would rotate through asc/desc/neutral.
3) Filtering. Again, having multi-layered filtering is nice, but I'd love to be able to filter on more than just Team/League/Pos/PA. Often I'm trying to obtain a list of batters that meet some threshold of a stat. Perhaps this would cause performance issues, but it would be very nice to get all batters with OBP > .330, for example, without having to export a full list into Excel. Just adding one variable filter field would be a great addition.
4) Significant digits. A contact rate of 0.7906 is difficult to read. I assume you'd have to give up speed to display it as 79.1%, but it's a trade off I'd personally take. Is the hundredths place meaningful? I'm guessing not. Heck, is the tenths? If I want a raw data export, the detail is helpful. But the current display can make it more difficult to interpret the data.
Just a few thoughts. Keep up the great work, Colin.
A Zone of Their Own
(after the fourth paragraph)
I'm worried a little about the impact on metrics like O-Swing and Z-Swing that could arise from the fact that players adjust to the individual umpire's strike zone in effect during a particular game. Therefore they are often pitching to, and selecting pitches based on, a strike zone that is different from the denominators of the metrics, no matter how carefully the metrics are adjusted. I guess as long as we apply the metrics to seasons and not individual games, it will probably even out enough.
I agree we're a long way from being finished with our understanding of the strike zone, batter plate discipline, and how to measure them accurately, consistently, and in ways that have useful baseball meaning.