October 26, 2006
The Information Revolution
"Failure is a part of this game. You can't escape it. If you try, it's going to find you, more than your share sometimes. And I've had mine. Going through that makes all of this more rewarding in a lot of ways."
-Tigers pitcher Kenny Rogers, discussing his success in the 2006 postseason
"The number one benefit of information technology is that it empowers people to do what they want to do. It lets people be creative. It lets people be productive. It lets people learn things they didn't think they could learn before, and so in a sense it is all about potential."
-Microsoft CEO Steve Ballmer
Setting aside the controversy a Tigers-fan friend of mine dubbed "Gamblergate," Kenny Rogers has been simply exceptional this postseason. His masterpiece in Game Two on Sunday night in Detroit ran his scoreless streak this postseason to 23 innings; he gave up just two hits in eight innings, striking out five and walking three. That is the second-longest scoreless streak in a single postseason, behind only Christy Mathewson and the 27 shutout innings he twirled in three complete games for the New York Giants in the 1905 World Series. Oh, and Rogers became the first starter age 40 or older to win a World Series game. Not bad for a guy who came into this postseason with a career 8.85 postseason ERA and who in recent years was better known for throwing cameras and cameramen and punching water coolers than for his pitching.
But as great as all of this is for Tigers fans like my friend, this column is not really about Rogers. It's actually all about the information.
The March of Progress
Information has always been integral to the game of baseball. As Alan Schwarz has so well documented in his excellent book The Numbers Game, from the very beginning and Henry Chadwick himself, the game on the field has been not only described, but shaped by the numbers used to record its events. Just as certainly, those numbers have shaped the way we fans communicate about, interpret, and ultimately understand the game. As Bill James has often said, the statistics of baseball have acquired "the power of language," since in them we can visualize the skills and attributes of the players.
The information revolution powered by Moore's Law has not only increased the availability of that information but also allowed for its enhanced collection and dissemination. After all, it was Macmillan's The Baseball Encyclopedia, published in 1969, that had the distinction of being the first book entirely typeset on a computer. Over the next four decades James' Baseball Abstracts, the Elias Sports Bureau's Baseball Analysts, Total Baseball and more only served to stoke the fires of fan interest in the game and its numbers.
That ever-increasing processing power has been coupled with an ever-decreasing cost per MIPS (million instructions per second), as well as exponential growth of the Internet. These factors have placed powerful computers in the hands of fans, enabling them to electronically access the growing amount of information. Not only can that information be accessed through web sites like ours and many others, but individuals can also download and analyze the data for themselves, as evidenced by the popularity of Sean Lahman's Baseball Database, which provides seasonal data, and Retrosheet's collection and publication of play-by-play data going back to 1957.
That brings us to the book published earlier this year by Joseph Adler titled Baseball Hacks: Tips and Tools for Analyzing and Winning with Statistics. The book is unique in that it doesn't simply provide an analysis of baseball's statistics; it is instead a cookbook that details the process of collecting and analyzing information from a variety of sources using (mostly) tools that are freely available. In short, the book consists of seven chapters that include 75 "hacks," or step-by-step instructions for either accessing or analyzing information.
The primary tools that Adler discusses include MySQL (a free database application), Perl (a free scripting tool used to manipulate data), Microsoft Access, Microsoft Excel, and R (a language and integrated environment for statistical computing and graphics). Using these tools, Adler delves into everything from loading Retrosheet data into MySQL to creating hexagonal bin plots to visualize the spray charts of various hitters. While the book is geared toward readers who are comfortable with computers (so-called "power users"), the writing is very accessible, and the steps are easy to follow.
Visualizing Kenny Rogers
As some readers are probably aware, MLB.com is offering an enhanced version of its Gameday application during the postseason. As shown below, the enhanced version includes not only the familiar play-by-play data but also pitch tracking, which provides the velocity of each pitch both at the release point and as it crosses the plate, along with the pitch trajectory and the pitcher's release point. The application is driven by three cameras installed at each ballpark that triangulate on the pitch, and three computers in the truck outside that perform the calculations on video shot at 30 frames per second. Ultimately, as shared by Cory Schwartz of MLB Advanced Media (MLBAM), they'll use the data to determine the pitch type in real time, and of course offer the entire package of information to the myriad outlets where it is in use today.
Although not discussed by Schwartz, one could imagine how an enhanced version of the system could also be used to provide trajectory information on batted balls, which would be a boon to the burgeoning field of defensive analysis.
For now, as described in Baseball Hacks, the data for the Gameday application is also available online in XML documents, making it possible to analyze some of this information. For example, after downloading the data for Game Two, I wrote a simple program to chart pitches, and then ran it against Rogers' performance by looking at the pitch location, pitch outcome, and release velocity. Using the following key, I charted each pitch outcome using a different shape, and each range of pitch speeds in a different color.
In addition to those definitions, swinging strikes are decorated with an additional red line around the outside of the square.
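For readers who want to try something similar, here is a minimal Python sketch of the charting approach described above, not the program actually used for this column. The XML layout and attribute names (`atbat`, `pitch`, `des`, `x`, `y`, `start_speed`) are assumptions, the sample feed is placeholder data, and the shapes and the colors for the slowest and fastest buckets are hypothetical. Only the blue (83-89 mph), navy blue (89-93 mph), and black (no velocity recorded) assignments, and the red outline on swinging strikes, come from the charts discussed in this column.

```python
import xml.etree.ElementTree as ET

# Placeholder feed, not real game data. The element and attribute names
# (atbat, pitch, des, x, y, start_speed) are assumptions for illustration.
SAMPLE_XML = """
<inning>
  <atbat batter="Batter A">
    <pitch des="Called Strike" x="95.0" y="140.0" start_speed="87.7"/>
    <pitch des="Swinging Strike" x="120.0" y="155.0" start_speed="79.0"/>
    <pitch des="Ball" x="135.0" y="170.0" start_speed="90.5"/>
    <pitch des="In play, out(s)" x="100.0" y="150.0"/>
  </atbat>
</inning>
"""


def load_pitches(xml_text):
    """Flatten every <pitch> element into a dict, keeping the batter's name."""
    root = ET.fromstring(xml_text.strip())
    pitches = []
    for atbat in root.iter("atbat"):
        for pitch in atbat.iter("pitch"):
            raw_speed = pitch.get("start_speed")
            pitches.append({
                "batter": atbat.get("batter"),
                "des": pitch.get("des", ""),
                "x": float(pitch.get("x")),
                "y": float(pitch.get("y")),
                # A missing velocity stays None and is drawn in black.
                "speed": float(raw_speed) if raw_speed else None,
            })
    return pitches


def speed_color(speed):
    """Map a release velocity (mph) to a color bucket."""
    if speed is None:
        return "black"
    if speed < 83.0:
        return "green"   # hypothetical color for the slowest bucket
    if speed < 89.0:
        return "blue"
    if speed < 93.0:
        return "navy"
    return "orange"      # hypothetical color for the fastest bucket


def marker_for(description):
    """Return (shape, outline color) for a pitch outcome; shapes are hypothetical."""
    d = description.lower()
    if "swinging" in d:
        return ("square", "red")   # swinging strikes get the red outline
    if "strike" in d:
        return ("square", None)
    if "foul" in d:
        return ("diamond", None)
    if "in play" in d:
        return ("triangle", None)
    if "ball" in d:
        return ("circle", None)
    return ("cross", None)


for p in load_pitches(SAMPLE_XML):
    print(p["batter"], p["x"], p["y"], speed_color(p["speed"]), marker_for(p["des"]))
```

From there, each tuple can be handed to whatever plotting tool you prefer; the bucket boundaries are easy to adjust for a harder thrower.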
Using the data, let's reconstruct Rogers' first inning on Sunday, the only time he was in any trouble all night.
As you can see, Rogers got a first-pitch called strike on a fastball traveling at 87.7 mph (and hence blue), and then Eckstein offered at the second pitch at 84.4 mph (which was a bit low) and grounded it to short.
The next hitter was Scott Spiezio.
Rogers started Spiezio off with an off-speed pitch at 79.0 mph that was just tempting enough to offer at; he swung through it. Rogers came right back with a second off-speed pitch at 80.7 mph, which Spiezio laid off, before getting him to offer at a third straight pitch on the outer half at 81.2 mph that was just out of the strike zone, making the count 1-2. After another sinker or changeup low and away at 80.7 mph, he came inside with a fastball at 88.7 mph to catch Spiezio looking for a called third strike, the first of his five strikeouts in the game.
Incidentally, in looking at the at-bats for the entire game, you can see how Rogers became more confident in his fastball as the game wore on and his velocity increased. The best fastball of the night was a 92.2 mph pitch thrown for a called strike on Jim Edmonds with one out in the seventh inning.
Albert Pujols then walked on four pitches that were low and on the outer half. Although the first and last pitches of the at-bat were fastballs, neither topped 85.0 mph, and it appeared that Rogers understandably did not find the prospect of giving Pujols anything good to hit appealing.
Pitch number two of this sequence also points out an issue with the data and the program I wrote to display it. The pitch was called a ball although it appears here in the strike zone. Of course umpires sometimes simply miss pitches but it should be pointed out that in the data there are values that indicate the top and bottom of the strike zone, and that changes for each at-bat for each player. However, the values provided don't seem to be in the same coordinate system used for the pitch tracking (which is roughly a 200 by 200 grid), so I've not adjusted the strike zone for each at-bat. Instead I'm using an approximation based on a "best guess."
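To make the "best guess" approach concrete, here is a small Python sketch of a fixed strike-zone rectangle on a roughly 200 by 200 grid, along with a check that flags pitches whose call disagrees with that rectangle. The zone boundaries, the assumption that y grows downward, and the sample pitches are all placeholders chosen for illustration, not values taken from the Gameday data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A fixed rectangular strike zone in chart coordinates (y grows downward)."""
    left: float
    right: float
    top: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


# Hypothetical "best guess" rectangle on a 200 by 200 grid.
BEST_GUESS_ZONE = Zone(left=75.0, right=125.0, top=110.0, bottom=170.0)


def disagreements(pitches, zone=BEST_GUESS_ZONE):
    """Return the taken pitches whose call conflicts with the fixed zone."""
    flagged = []
    for p in pitches:
        inside = zone.contains(p["x"], p["y"])
        if p["call"] == "ball" and inside:
            flagged.append(p)       # called a ball, but charted inside the zone
        elif p["call"] == "called strike" and not inside:
            flagged.append(p)       # called a strike, but charted outside the zone
    return flagged


# Placeholder pitches, not real data.
sample = [
    {"x": 100.0, "y": 165.0, "call": "ball"},
    {"x": 140.0, "y": 150.0, "call": "ball"},
    {"x": 80.0, "y": 120.0, "call": "called strike"},
]
for p in disagreements(sample):
    print("check this one:", p)
```

A flagged pitch is not necessarily a missed call; with a fixed zone it may simply mean the rectangle is wrong for that hitter.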
Scott Rolen, looking much more comfortable at the plate since Game Seven of the NLCS, then came up with a runner on first and two outs.
Rogers started him off with an 84.3 mph pitch called for a strike near the outside corner, and again it was down. Rolen then took a 79.9 mph ball low and away before grounding a 79.6 mph pitch to Brandon Inge at third, legging out a single to put runners on first and second.
As Rogers did the entire game, he continued to work the right-handed hitters low and away. Juan Encarnacion was quickly behind 0-2 on a called fastball strike (85.1 mph) and swinging at a breaking ball (79.6 mph). He fouled off a third pitch (81.0 mph) that was again low before grounding the final pitch of the inning (80.4 mph) right back to Rogers to retire the side.
All told, that's 18 pitches: seven in the 83-89 mph range and 11 coming in below 83 mph, with three swinging strikes, four called strikes, one foul ball, seven balls, and three balls put into play. As you can see, the inning begins to reveal Rogers' pitch pattern: he consistently worked down and away from right-handed hitters. This can also be shown by creating a composite graph of all the pitches thrown against left-handed and right-handed hitters, as shown below. (This is similar to the information you can glean from ESPN's Inside Edge, where the strike zone is split into a grid with percentages for different pitch outcomes.)
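The outcome tally for the inning can be checked with a few lines of Python. The pitch sequences below are transcribed from the at-bat descriptions above; the one inference is that Spiezio's fourth pitch, described only as low and away, was taken for a ball. The dictionary layout and function name are just illustrative choices.

```python
from collections import Counter

# First-inning outcomes, pitch by pitch, as narrated in this column.
# Spiezio's fourth pitch is assumed to be a ball (it was low and away).
FIRST_INNING = {
    "Eckstein": ["called strike", "in play"],
    "Spiezio": ["swinging strike", "ball", "swinging strike", "ball", "called strike"],
    "Pujols": ["ball", "ball", "ball", "ball"],
    "Rolen": ["called strike", "ball", "in play"],
    "Encarnacion": ["called strike", "swinging strike", "foul", "in play"],
}


def tally(at_bats):
    """Count every pitch outcome across all at-bats."""
    counts = Counter()
    for outcomes in at_bats.values():
        counts.update(outcomes)
    return counts


counts = tally(FIRST_INNING)
print("total pitches:", sum(counts.values()))
for outcome, n in counts.most_common():
    print(outcome, n)
```

Run as written, this totals 18 pitches, with the outcomes matching the breakdown given above.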
Clearly, Rogers did most of his work against right-handed hitters in the lower outside quadrant of the strike zone. The graph also reveals that when he did come inside he did so with hard stuff, since the majority of the pitches on the inner half are navy blue, signifying that they were thrown between 89 and 93 mph, which for Rogers represents a good fastball. Against left-handers (in this case just Jim Edmonds, since he was the only lefty in the lineup), Rogers generally stayed away, only throwing his good fastball on the outer half. Edmonds struck out, walked, and flew out to center.
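One way to put numbers on a reading like this is to split the strike zone into a grid and compute the share of pitches landing in each cell, in the spirit of the Inside Edge grids mentioned earlier. The following Python sketch assumes a hypothetical zone rectangle and uses made-up placeholder coordinates; it does not reproduce Rogers' actual pitch locations, and which columns count as "inside" or "outside" depends on the viewpoint of the chart.

```python
from collections import Counter

# Hypothetical zone as (left, right, top, bottom) in chart coordinates.
ZONE = (70.0, 130.0, 100.0, 190.0)


def grid_cell(x, y, zone=ZONE, rows=3, cols=3):
    """Return the (row, col) cell containing the pitch, or None if outside the zone."""
    left, right, top, bottom = zone
    if not (left <= x < right and top <= y < bottom):
        return None
    col = int((x - left) / (right - left) * cols)
    row = int((y - top) / (bottom - top) * rows)
    return (row, col)


def cell_percentages(points, zone=ZONE, rows=3, cols=3):
    """Percentage of pitches in each cell; the None key collects pitches outside the zone."""
    counts = Counter(grid_cell(x, y, zone, rows, cols) for x, y in points)
    total = len(points)
    return {cell: 100.0 * n / total for cell, n in counts.items()}


# Placeholder locations, not real data.
points = [(75.0, 175.0), (85.0, 180.0), (125.0, 105.0), (140.0, 150.0)]
for cell, pct in cell_percentages(points).items():
    print(cell, f"{pct:.1f}%")
```

Splitting the counts further by pitch outcome or by speed bucket would bring the table closer to the composite charts.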
In order to show what a hard thrower looks like with this tool, I also ran the at-bats against Joel Zumaya in Game Three. He faced six batters, walking two and striking out one. It turns out they were all right-handed, so a single composite diagram is shown below.
Zumaya's color scheme differs markedly from Rogers'. Of the 24 pitches Zumaya threw, only six were recorded below 94 mph (the final two pitches to Rolen to end the 7th were not recorded with pitch velocities, and hence are black). On the jaw-dropping side of things, his fourth pitch to Preston Wilson was recorded at 100.0 mph, the first pitch to Pujols at 100.7 mph, and the pitch Pujols hit, which Zumaya fielded and threw wildly to third, came in at 99.5 mph.
I can hear the detractors already. Something along the lines of, "Yes, the march of progress allows for more people to access more information more quickly, but is that necessarily a good thing? Aren't we in danger of information overload and aren't we detracting from and ultimately disrespecting the game by trying to measure it so thoroughly?"
While I have some sympathy for the sentiment, to me the answer to the problem of overload is in how the information is analyzed, and then how it is presented. Information for information's sake is not the goal. Information analyzed in such a way as to inform the decision-making process on the field and in the front office is the end game. Indeed, more is not always better. More does provide the opportunity for better analysis, though, by broadening the kinds of questions that can be asked, and also the kinds of answers that we'll find.
Many thanks to Joe Sheehan (no, not our Joe Sheehan) for the idea for this column.