

October 11, 2007
Schrodinger's Bat
On Atmosphere, Probability, and Prediction
"All of us could take a lesson from the weather. It pays no attention to criticism."

This summer we explored a variety of topics related to the wealth of new information available through PITCHf/x. Those topics ranged from profiling pitchers, to evaluating umpires, to the underlying physics involved, and even to dissecting plate discipline. Today we'll revisit the topic of atmospheric effects that we discussed in late May, only this time we're armed with roughly eight times as much data as we had back then. I'll also have a few words on the differences between models and reality, and between probabilities and predictions.

Slow Ride

Back in May, when we had about 40,000 pitches from ten ballparks courtesy of the PITCHf/x system to analyze, we examined just how much a pitched ball slows down on its way to the plate. Now, with over 325,000 pitches and data from 28 ballparks, we can revisit that discussion with the additional data points.

You'll recall that in summarizing Robert K. Adair's The Physics of Baseball we noted that the drag force on a moving baseball is proportional to the cross-sectional area of the ball, to the square of the velocity of the ball, to the density of the air, and to the drag coefficient, a dimensionless number that varies with the velocity and "roughness" of the ball. Perhaps counterintuitively, Adair shows that the drag coefficient for a baseball actually drops from around 0.40 at 75 mph to less than 0.30 at 100 mph. This is because of both the typical velocities at which a ball travels in a major league game and the nonuniform texture of the ball due to the stitches, which creates a turbulent flow. Even so, taken as a whole, the faster a ball is thrown, the more drag is created. That is, faster pitches should lose a greater percentage of their velocity on the way to the plate.
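As a rough illustration of the proportionality described above, the sketch below uses the standard drag equation, F = 0.5 * rho * Cd * A * v^2. The factor of one half, the air density of 1.2 kg/m^3, and the ball radius of about 0.0366 m are assumptions brought in for the example rather than figures from Adair or from this article; the two drag coefficients are the approximate values quoted above, with 0.30 taken as an upper bound at 100 mph.

```python
import math

MPH_TO_MPS = 0.44704  # exact miles-per-hour to metres-per-second conversion


def drag_force(speed_mph, drag_coefficient, air_density=1.2, ball_radius_m=0.0366):
    """Drag force in newtons from the standard drag equation.

    air_density (kg/m^3) and ball_radius_m are assumed, approximate values.
    """
    area = math.pi * ball_radius_m ** 2   # cross-sectional area of the ball
    speed = speed_mph * MPH_TO_MPS        # convert to SI units
    return 0.5 * air_density * drag_coefficient * area * speed ** 2


# Drag coefficients as quoted in the text (0.30 is an upper bound at 100 mph).
drag_75 = drag_force(75, 0.40)
drag_100 = drag_force(100, 0.30)

print(f"Drag at  75 mph: {drag_75:.2f} N")
print(f"Drag at 100 mph: {drag_100:.2f} N")
print(f"Ratio: {drag_100 / drag_75:.2f}")
```

With those inputs the 100 mph pitch still feels about a third more drag than the 75 mph pitch, since the squared velocity term outweighs the drop in the drag coefficient.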
In the previous article, I showed a graph that plotted the percentage difference between starting and ending velocities for pitches with little break thrown at 86 to 100 miles per hour; in other words, mostly fastballs and changeups, as indicated by a low break length value (the maximum deviation from the straight-line path). That plot showed a steady and somewhat parabolic increase in percentage difference as velocity increased, ranging from just over 10 percent at 86 miles per hour to just over 12 percent at 100. With the additional data and, just as importantly, the additional 18 ballparks, we can now rerun the plot with a wider range stretching from 77 to 100 miles per hour:
The plot now includes both the percentage difference between starting and ending velocity, shown in pink against the y-axis on the left, and the average actual velocity difference, shown in blue against the axis on the right. As in the previous graph, this one shows the same nonlinear increase in percentage difference and, to a lesser extent, in actual difference as the speed of the pitch increases. This again confirms the models discussed by Adair, and tells us that for each additional mile per hour a pitcher gets on his fastball there are diminishing returns, since the pitch loses a greater percentage of its velocity. Further, that difference increases the faster a pitcher throws. So a fastball thrown at 90 miles per hour will cross the plate at 82.3 mph, while one thrown six miles per hour harder will cross at 86.7 mph, a gain of 4.4 mph. Even with the diminishing returns, pitchers still benefit from a greater initial velocity, since it also compresses the already small window of time a hitter has to react to the pitch.

There is an interesting difference between this graph and the one shown in the previous article, however. You'll notice that the percentage loss over the 86 to 100 mile per hour range mentioned above is lower here, running from 8.2 to 11.6 percent. In the previous analysis, I didn't select a uniform distance from the plate at which to measure. Throughout the season MLBAM changed the distance at which it recorded the starting velocity; it was 55 feet initially, went down to 40 and 45 feet, and ultimately settled on 50 feet (this was after the publication of the first article in late May). For that reason, the plot here uses only pitches recorded at the new standard of 50 feet. In addition, a couple of the 18 additional parks, as will be discussed below, seem to have a smaller impact in terms of slowing pitches down.
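The 90 versus 96 mph comparison above is simple enough to check by hand, but a short sketch makes the diminishing returns explicit. The velocities are the ones quoted in the text; the function and variable names are only for illustration.

```python
def percent_lost(start_mph, end_mph):
    """Percentage of starting velocity lost by the time the pitch reaches the plate."""
    return 100.0 * (start_mph - end_mph) / start_mph


# (starting velocity, velocity at the plate) as quoted in the text
slower = (90.0, 82.3)
faster = (96.0, 86.7)

loss_slower = percent_lost(*slower)
loss_faster = percent_lost(*faster)

extra_at_start = faster[0] - slower[0]   # how much harder the second pitch is thrown
extra_at_plate = faster[1] - slower[1]   # how much of that survives to the plate
share_kept = extra_at_plate / extra_at_start

print(f"90 mph pitch loses {loss_slower:.1f}% of its velocity")
print(f"96 mph pitch loses {loss_faster:.1f}% of its velocity")
print(f"{extra_at_start:.1f} mph extra at the start becomes "
      f"{extra_at_plate:.1f} mph at the plate ({share_kept:.0%} kept)")
```

The harder pitch loses roughly 9.7 percent of its speed against roughly 8.6 percent for the slower one, so only about 73 percent of the extra six miles per hour survives to the plate.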
In any case, when weighted by starting velocity, the average major league pitch thrown with little break loses an average of 8.8 percent of its velocity from the time it is 50 feet from the plate until it reaches the front of the plate.

Moving on, we can break down the numbers above to show the differences between the deceleration of the ball in afternoon as opposed to evening games. The graph below defines afternoon games as those that start before 5 p.m. (shown by the red line), while evening games are shown by the blue line.
As in the previous discussion of this topic, the deceleration is greater in the evening, as would be expected, since in the evening temperature and humidity both decrease, which combine to increase the density of the air, causing more friction on the ball. The difference is statistically significant at the 99 percent confidence level, and so clearly the PITCHf/x system does pick up the differences in air density. On average, that difference equates to less than a quarter of a mile per hour as the ball reaches the plate; although real, the difference is likely not perceptible at the practical level, which fits with our experience (for example, neither fans nor players, that I'm aware of anyway, are able to differentiate between pitches thrown during the day and those thrown at night).

Another way to look at the effect of air density is to slice the data by the temperature recorded at the start of the game. While not perfect (as in the case of game three of the NLDS at Coors Field, where the game time temperature was 73 degrees but had dropped into the mid 50s by the second inning), performing an aggregate calculation by temperature should give us an idea of how temperature affects the deceleration of the ball. The following graph records the average percent difference in starting and ending velocity on pitches with little break, for games played at temperatures ranging from 31 to 92 degrees:
The graph shows quite a bit of variation because of the small sample sizes at some temperatures, but the overall trend, as expressed through the dotted trend line (a linear regression with a correlation coefficient of r=0.61, p<0.01), clearly indicates a relationship between temperature and the amount a pitched ball slows down on its way to the plate. The effect is equivalent to about half a percent for every 10 degrees, or about a half mile per hour.

Having shown that pitches do indeed decelerate as expected, and do so differently by time of day and temperature, the next step is to examine the differences in ballparks. However, since there are 28 parks, a graphical representation doesn't work particularly well; the following table lists the park, the number of pitches analyzed (using the same definition of starting location and break length used in the graphs above), the average game time temperature, and the average percentage difference in starting and ending velocities for pitches between 81 and 97 miles per hour*.

Ballpark                       Pitches  Temp  PctDiff
Safeco Field                      6721    68    11.2%
Rogers Centre                     3411    70    11.0%
Petco Park                        6868    71    10.9%
U.S. Cellular Field               7896    76    10.4%
Busch Stadium                     6182    84     9.8%
Dodger Stadium                    7046    76     9.6%
Minute Maid Park                  4796    73     9.5%
Fenway Park                       4170    74     9.4%
Great American Ball Park          4917    84     9.2%
PNC Park                           933    73     9.2%
Chase Field                       5003    81     9.0%
McAfee Coliseum                   6035    67     8.8%
Shea Stadium                      2361    74     8.8%
Miller Park                       4961    75     8.8%
Hubert H. Humphrey Metrodome      5422    69     8.3%
Angel Stadium of Anaheim          5908    80     8.3%
Jacobs Field                      2235    73     8.3%
Tropicana Field                   2177    72     8.1%
Kauffman Stadium                  4477    81     8.0%
Yankee Stadium                    2679    74     8.0%
Coors Field                       4060    78     7.8%
Turner Field                      5844    84     7.4%
Rangers Ballpark in Arlington     7019    88     7.4%
Wrigley Field                     6065    75     7.3%
AT&T Park                         5919    65     7.2%
Citizens Bank Park                4080    79     7.1%
Dolphin Stadium                   1536    86     5.5%
Comerica Park                     5374    74     5.2%

This indicates that Safeco Field slows the ball the most, at over 11 percent, while Comerica Park slows it the least, at just over five percent. There doesn't seem to be much correlation here with elevation or with the tendency of the park to play as a hitter's or pitcher's park, although Citizens Bank Park, Rangers Ballpark in Arlington, and Coors Field are all near the bottom. That said, there is a small negative correlation with temperature (r=-0.32, p<0.10). The fact that the two lowest values, for Comerica and Dolphin Stadium, are so much lower than the rest is troubling, and suggests that something is going on at those parks, perhaps in the way the system is calibrated, to yield such low values. As a result, I wouldn't necessarily take these numbers at face value.

Finally on this topic, in a previous column we also took a stab at determining how the break of the ball differs at different parks using the pFX value recorded by the PITCHf/x system. This value, reported in inches, is a combination of the vertical and horizontal movement of the pitch relative to the straight line drawn between the starting and ending locations of the pitch.
The value is defined as the hypotenuse of the right triangle formed by those two components, with the effects of gravity removed from the vertical component. The result is that these components reflect the movement of the pitch due to the Magnus force generated on the spinning baseball. While perhaps not the most intuitive measure (since it leads to large positive vertical movement values for fastballs that don't drop as much as a pitch thrown without spin), it should give us a pretty good way to assess whether atmospheric effects impact the flight of the baseball.

In order to look at this question, I created two data sets: one for fastballs and one for breaking balls (most curveballs and some sliders). Then I filtered the data such that only pitchers who threw 25 or more fastballs or breaking balls at a particular park and at all other parks were included. Finally, I computed the ratio of the average pFX value for each pitcher at the specific park compared to his pitches at all other parks, and then derived a weighted average of those ratios across all pitchers. The result is a value relative to 1.00, which can be thought of as a park effect for pitches**. And, for good measure, I included a weighted average of the difference in pFX over all the comparisons in order to get a feel for the magnitude of the difference in movement.

For breaking balls, this procedure yielded 204 individual pitcher ratios from a total of over 29,000 pitches; for fastballs it was 886 ratios and approximately 81,000 pitches. The results of all this can be summarized in the table below, first sorted by the fastball ratio:

                             Fastballs pFX       Breaking Balls pFX
Ballpark                  Pitches  Ratio  Diff   Pitches  Ratio  Diff
Coors Field                  6271   0.79  2.73       899   0.86  1.34
PNC Park                     1360   0.90  1.14         0    n/a   n/a
Comerica Park               12727   0.92  1.01      1024   1.23  1.41
AT&T Park                   10845   0.92  1.01       918   1.16  0.92
Minute Maid Park             8446   0.92  0.96      1396   1.22  1.91
Fenway Park                  9119   0.93  0.90      1184   1.01  0.03
Turner Field                 9092   0.93  0.86       321   1.09  0.59
Jacobs Field                 6826   0.95  0.63       653   1.00  0.02
Yankee Stadium               4687   0.96  0.43       356   1.05  0.51
McAfee Coliseum             14761   0.97  0.40      2961   1.02  0.13
Wrigley Field               10711   0.97  0.35      1713   0.90  0.98
Angel Stadium of Anaheim    11975   0.98  0.25      2089   1.04  0.33
Rangers Ballpark            16197   0.98  0.22       956   0.87  1.21
Miller Park                  9521   0.99  0.18      1151   0.98  0.30
Great American Ball Park     8801   0.99  0.15       771   1.09  0.64
Dolphin Stadium              2867   1.00  0.05       146   0.94  0.25
Busch Stadium               13830   1.00  0.03      1913   0.90  0.88
Citizens Bank Park           6463   1.01  0.08       749   1.22  1.49
Kauffman Stadium            10972   1.02  0.20       636   0.92  0.75
Hubert H. Humphrey          11627   1.02  0.20      1312   1.09  0.58
Rogers Centre                7606   1.05  0.50      1439   1.10  0.68
U.S. Cellular Field         14176   1.07  0.80      1207   0.83  1.45
Tropicana Field              6502   1.07  0.83       994   1.23  1.55
Dodger Stadium              10669   1.08  0.92      1446   0.86  1.35
Safeco Field                16878   1.08  0.98       919   0.89  0.97
Shea Stadium                 4399   1.11  1.40       693   1.04  0.28
Chase Field                 11574   1.13  1.47      1401   0.94  0.37
Petco Park                  12298   1.13  1.57      1192   0.92  0.66

At the low end, Coors Field seems to have the biggest effect, with fastballs getting only 79 percent as much movement (equating to 2.73 inches less) compared to fastballs thrown by the same pitchers at other parks. Keep in mind that this includes both vertical and horizontal components and, when broken down, indicates that fastballs at Coors drop more (roughly two inches, presumably because the backspin on the fastball doesn't counteract gravity as well in air that is less dense) and also don't tail as much (about 2.3 inches less).
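The park ratio itself is straightforward to compute. The sketch below applies the procedure described before the table to the seven Coors Field curveball rows given in the second note at the end of this section, weighting each pitcher by his combined pitch count at Coors and elsewhere (the weighting that matches the 899-pitch total in the table). The data layout and function name are assumptions made for the example, not the code actually used for the article.

```python
# (name, pitches at park, avg pFX at park, pitches elsewhere, avg pFX elsewhere)
COORS_CURVEBALLS = [
    ("Matt Morris",       26,  7.74, 198, 11.18),
    ("Matt Herges",       38, 10.90,  29, 10.12),
    ("Jeremy Affeldt",    28,  4.86,  48,  7.02),
    ("Taylor Buchholz",   56,  8.29,  61,  8.21),
    ("Jeff Francis",      49,  5.54, 101,  7.75),
    ("Ubaldo Jimenez",   103,  9.92,  61,  8.05),
    ("Franklin Morales",  27,  6.32,  74,  9.34),
]

MIN_PITCHES = 25  # a pitcher must reach this count both at the park and elsewhere


def park_effect(rows, min_pitches=MIN_PITCHES):
    """Return (weighted pFX ratio, weighted pFX difference in inches) for one park.

    The difference is "elsewhere minus park", so a positive value means
    less movement at the park.
    """
    total_weight = 0
    ratio_sum = 0.0
    diff_sum = 0.0
    for _name, n_park, pfx_park, n_other, pfx_other in rows:
        if n_park < min_pitches or n_other < min_pitches:
            continue  # not enough pitches for a fair comparison
        weight = n_park + n_other  # weight by all pitches in the comparison
        total_weight += weight
        ratio_sum += weight * (pfx_park / pfx_other)
        diff_sum += weight * (pfx_other - pfx_park)
    return ratio_sum / total_weight, diff_sum / total_weight


ratio, diff = park_effect(COORS_CURVEBALLS)
print(f"Coors breaking-ball ratio: {ratio:.2f}, movement lost: {diff:.2f} in.")
```

Run as is, this gives a ratio of 0.86 and an average of 1.34 inches less movement, matching the Coors Field breaking-ball entries in the table.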
On the other end of the spectrum, Petco Park seems to enhance fastball movement by 13 percent, keeping the ball up (by one and a half inches) and allowing it to tail roughly a half inch more. Overall, this list seems to accord pretty well with our expectations, with Coors being the outlier and places with denser air, like Petco and Dodger Stadium, on the other end.

One caution here is that in comparing the average values for pitchers who throw 25 pitches both at their home park and at away parks, you necessarily run up against some bias in their away parks, since they'll likely throw more pitches within their division. This can be illustrated by Rockies pitchers also pitching at Petco Park, and vice versa, thereby possibly magnifying (or perhaps canceling out in this case?) the results for those parks.

And now we can sort the same list by ratio for breaking balls (PNC Park did not have enough data):

                             Fastballs pFX       Breaking Balls pFX
Ballpark                  Pitches  Ratio  Diff   Pitches  Ratio  Diff
U.S. Cellular Field         14176   1.07  0.80      1207   0.83  1.45
Dodger Stadium              10669   1.08  0.92      1446   0.86  1.35
Coors Field                  6271   0.79  2.73       899   0.86  1.34
Rangers Ballpark            16197   0.98  0.22       956   0.87  1.21
Safeco Field                16878   1.08  0.98       919   0.89  0.97
Busch Stadium               13830   1.00  0.03      1913   0.90  0.88
Wrigley Field               10711   0.97  0.35      1713   0.90  0.98
Petco Park                  12298   1.13  1.57      1192   0.92  0.66
Kauffman Stadium            10972   1.02  0.20       636   0.92  0.75
Dolphin Stadium              2867   1.00  0.05       146   0.94  0.25
Chase Field                 11574   1.13  1.47      1401   0.94  0.37
Miller Park                  9521   0.99  0.18      1151   0.98  0.30
Jacobs Field                 6826   0.95  0.63       653   1.00  0.02
Fenway Park                  9119   0.93  0.90      1184   1.01  0.03
McAfee Coliseum             14761   0.97  0.40      2961   1.02  0.13
Shea Stadium                 4399   1.11  1.40       693   1.04  0.28
Angel Stadium of Anaheim    11975   0.98  0.25      2089   1.04  0.33
Yankee Stadium               4687   0.96  0.43       356   1.05  0.51
Hubert H. Humphrey          11627   1.02  0.20      1312   1.09  0.58
Turner Field                 9092   0.93  0.86       321   1.09  0.59
Great American Ball Park     8801   0.99  0.15       771   1.09  0.64
Rogers Centre                7606   1.05  0.50      1439   1.10  0.68
AT&T Park                   10845   0.92  1.01       918   1.16  0.92
Citizens Bank Park           6463   1.01  0.08       749   1.22  1.49
Minute Maid Park             8446   0.92  0.96      1396   1.22  1.91
Tropicana Field              6502   1.07  0.83       994   1.23  1.55
Comerica Park               12727   0.92  1.01      1024   1.23  1.41
PNC Park                     1360   0.90  1.14         0    n/a   n/a

Here the results are somewhat mixed, although Coors Field still ranks third with a ratio of 0.86, equating to movement of 1.34 inches less (a horizontal movement of 1.6 inches less with a vertical movement essentially equivalent). This time, however, Petco Park has a ratio of less than 1.00, indicating that, despite the assumption of denser air, breaking balls actually break slightly less there than on the road.
Although it's difficult to say exactly why this would be the case (barring a systemic, data, or calculation problem), it's possible that since a breaking ball isn't thrown with complete overspin, but rather with a combination of overspin and sidespin, the Magnus force generated by the increased friction causes the ball to break more horizontally but keeps it elevated, as we've seen with the fastball. In fact, for breaking balls at Petco the average horizontal movement is about a quarter of an inch more, but the vertical movement is just over an inch less. In any case, it seems that comparing pFX as applied to breaking balls (perhaps because of the smaller sample sizes, the variability, the complexity of the movement, or a combination of the three) gives us less information than doing so with fastballs.

Here you can also see that Comerica Park is rated as the park that most affects breaking balls, and when looking more closely, the difference is almost entirely (over two inches) in the vertical component. Given the possibility that there is a systemic problem with the data at Comerica based on the analysis of deceleration above, I would be hesitant to take that measurement at face value.

Notes:

* This value was calculated by first averaging the percentage difference for each mile per hour between 81 and 97 mph (since there was data for all 28 parks in that range) and then averaging those averages. This procedure ensures that a pitching staff that throws harder on average isn't biasing the results by including more pitches in the upper velocity ranges.

** For example, when looking at curveballs, I built the following table of pitchers who threw 25 or more pitches both at Coors Field and at other parks:

                        Coors Field       Other Parks
Name               T   Pitches    pFX   Pitches    pFX   Ratio
Matt Morris        R        26   7.74       198  11.18    0.69
Matt Herges        R        38  10.90        29  10.12    1.08
Jeremy Affeldt     L        28   4.86        48   7.02    0.69
Taylor Buchholz    R        56   8.29        61   8.21    1.01
Jeff Francis       L        49   5.54       101   7.75    0.72
Ubaldo Jimenez     R       103   9.92        61   8.05    1.23
Franklin Morales   L        27   6.32        74   9.34    0.68
Total                      327              572
Weighted Averages                                         0.86

As you can see, the ratio of pFX at Coors to pFX at other parks is calculated for each pitcher. This value is weighted by the total number of pitches thrown to produce the value of 0.86.

Probability Is Not Prediction

Before closing this week, I wanted to pen a few words about the nature of probability, prediction, and modeling related to sabermetrics. While this may be the quintessential example of preaching to the choir, at least I'll feel better after venting a little.

This topic first caught my attention because of the recent articles in the mainstream media reporting on the Diamondbacks' run differential and how, in 2007, it ended up being a very poor predictor of their season record. By scoring 712 runs and giving up 732, the Diamondbacks would have been expected to win approximately 79 games. However, at season's end they had won 11 more games than that (actually 10.6) and a division title, a feat which places them ninth on the list of teams since 1901 in bettering their Pythagenpat in terms of games won (it places them 12th in terms of winning percentage over what would be expected), as shown in the table below:

              Actual                         Pythagenpat
Year  Team    W    L   WPct    RS    RA    WPct      W     +W
1905  DET    79   74   .516   512   602    .430   66.2   12.8
2004  NYA   101   61   .623   897   808    .550   89.1   11.9
1984  NYN    90   72   .556   652   676    .484   78.4   11.6
1970  CIN   102   60   .630   775   681    .558   90.4   11.6
2005  ARI    77   85   .475   696   856    .404   65.5   11.5
1954  BRO    92   62   .597   778   740    .523   80.6   11.4
1972  NYN    83   73   .532   528   578    .461   71.9   11.1
1924  BRO    92   62   .597   717   675    .528   81.3   10.7
2007  ARI    90   72   .556   712   732    .490   79.4   10.6
1955  KC1    63   91   .409   638   911    .339   52.6   10.4
1961  CIN    93   61   .604   710   653    .538   82.8   10.2
1932  PIT    86   68   .558   701   711    .493   76.0   10.0
1997  SFN    90   72   .556   784   793    .495   80.1    9.9
2004  CIN    76   86   .469   750   907    .409   66.3    9.7
1977  BAL    97   64   .602   719   653    .543   87.5    9.5
1943  BSN    68   85   .444   465   612    .383   58.5    9.5

While great for fans of the Snakes, the unfortunate aspect of all of this is that you end up with stories that seem to throw the baby out with the bath water by producing quotes like this one from D'backs outfielder Eric Byrnes:

"I laugh. I just laugh. Because it doesn't really apply to what this team is. It doesn't apply to winning baseball. I mean, I don't blame the number-crunchers, the computer geeks, for not being able to come up with a formula for how we got here. But there's a lot more that goes into sports than numbers."

Well. Whatever else may be said, the reality is certainly not that "computer geeks" and "number-crunchers" (not to worry, no offense taken) are frustrated by their supposed inability to produce a perfect formula, as implied in the quote. At its core, any quantitative formula or methodology is an attempt to model things that happen in the physical world. Being a model, and therefore necessarily limited, it cannot (as all researchers in any field understand) take into account every factor that may influence the outcome.
That formulas like Pythagenpat have had such overwhelming success (95 percent of teams since 1901 come within eight games of their estimated wins) is a testament to the fact that its model of wins and losses, built on the underlying relationship of runs scored, runs allowed, and run environment, accounts for a significant percentage of the final outcome. But rather than rending garments and putting on sackcloth and ashes, cases like the 2007 Diamondbacks and those in the table above should, and actually do, serve as an impetus for improving, extending, and testing the limits of the model. And that is just what good analysts and thoughtful reporters are doing as they discuss the roster construction and strategic decisions (for example, the deployment of the bullpen) made by Arizona and its staff. So Mr. Byrnes can rest easy that analysts are not overwhelmed or dismayed, but rather embrace his team's ability to put wins on the board.

Interestingly, in the same article as the Eric Byrnes quote above, general manager Josh Byrnes had this to say on his team's success:

"In spring training we actually put forth a lot of information, just internally, that this roster composition can win. And internally, we believed in it the whole way. And as the season went along, it proved that as the games got more important, a lot of these players just kept getting better."

While a bit cryptic, it could certainly be the case that Byrnes and his staff did some analysis whereby "beating Pythagoras" was not all that unexpected. If so, that's great news, since ideas have a way of spreading, and so those additional factors will become part of the positive feedback loop that drives these kinds of efforts.

A subtler but related point in this vein is that some seem to think the models used to discuss events are necessarily predictions and therefore take a "told you so" approach when the end result seems improbable according to the model.
But probabilities are not predictions, and so, in addition to the fact that the models used to generate the probabilities are incomplete, even events that are unlikely do in fact happen. Only if you could replay the event hundreds or thousands of times could you say with confidence that the model is not useful.

So now go relax and enjoy the postseason and all the variability and randomness that it entails.
