“With me, being a hard thrower … no matter what, they’re defending that heater, man. So the more confidence I have to throw that [changeup] in any count, I’m going to throw it. I’m just going to. I don’t care anymore. It’s going to help me and I realize that.”
A.J. Burnett on his pitch selection. PITCHf/x has confidence in his fastball as well.

We’re now a week and a half into the new season and, if you’re anything like me, you’re basking in the wall-to-wall baseball that only this time of year offers. From the flip-flop in the AL Central with the Royals and Brian Bannister looking good and the Tigers…well, not…to the surprising Orioles, the not-so-surprising Giants, and the hot starts of the Brewers and Cardinals, there are plenty of story lines to keep us occupied.

But while MLB.TV may be a more or less constant companion, our attention turns to other matters as well, and so this week we’ll close the book (for now anyway) on measuring historical infield defense with Simple Fielding Runs (SFR) and open the book on PITCHf/x for 2008.

SFR in the Infield, One More Time
Before moving on to PITCHf/x in 2008, let’s first revisit a topic from the previous couple of weeks related to infield defense and SFR.

I mentioned last week that in making changes to the algorithm I also took the time to include pitcher handedness in the context that SFR uses to create its baseline matrix. I speculated that it probably wouldn’t make much difference in the overall results but I wanted to be sure, and so this week I re-ran the numbers for 2003 through 2006 (the period for which we also have UZR data with which to compare). As before, I ran two simple correlations for SFR vs. UZR with the results shown in Tables 1 and 2. For comparison, you can refer back to a previous column where I also ran correlations for both the seasonal and aggregate numbers.

Table 1. Correlation Coefficients for SFR vs. UZR, Seasonal 2003-2006 for >=50 Games Played

Pos  Seasons    r
All     549   0.79
1B      143   0.65
2B      132   0.78
SS      141   0.81
3B      133   0.82

Table 2. Correlation Coefficients for SFR vs. UZR, Aggregated 2003-2006 for >=162 Games

Pos  Players   r
All     156   0.86
1B       38   0.77
2B       40   0.88
SS       41   0.86
3B       37   0.89

Compared with the previous correlations, the changes are hardly noticeable; in fact, the correlations are just slightly lower at first base while remaining the same everywhere else.
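
For readers who want to run the same kind of check on their own fielding numbers, here is a minimal sketch of a by-position correlation with a games-played cutoff. It assumes a plain Pearson correlation; the field names (`pos`, `games`, `sfr`, `uzr`) and the sample rows are hypothetical and are not the data behind Tables 1 and 2.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)


def correlations_by_position(rows, min_games=50):
    """Return {position: (sample size, r)}, plus an 'All' entry, for rows
    that meet the games-played threshold."""
    kept = [row for row in rows if row["games"] >= min_games]
    groups = {"All": list(kept)}
    for row in kept:
        groups.setdefault(row["pos"], []).append(row)
    return {
        pos: (len(grp), pearson_r([g["sfr"] for g in grp], [g["uzr"] for g in grp]))
        for pos, grp in groups.items()
        if len(grp) >= 2
    }


# Hypothetical player-seasons, invented purely for illustration.
SAMPLE = [
    {"pos": "SS", "games": 150, "sfr": 12.0, "uzr": 10.0},
    {"pos": "SS", "games": 140, "sfr": -5.0, "uzr": -3.0},
    {"pos": "SS", "games": 95, "sfr": 3.0, "uzr": 6.0},
    {"pos": "3B", "games": 120, "sfr": 8.0, "uzr": 9.0},
    {"pos": "3B", "games": 110, "sfr": -10.0, "uzr": -7.0},
    {"pos": "3B", "games": 30, "sfr": 4.0, "uzr": -2.0},  # below threshold, dropped
]

for pos, (n, r) in correlations_by_position(SAMPLE).items():
    print(f"{pos:<4}{n:>4}{r:>7.2f}")
```

Passing `min_games=162` against multi-year totals would mirror the aggregated cutoff used for Table 2.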

My assumption remains that the essential information about hit distribution (which one would think pitcher handedness would most affect) was already captured in the combination of hit type and batter handedness, and so adding pitcher handedness didn’t really add any new information. Although that’s plausible, there could also be an offsetting effect in play. The additional context might indeed have produced better results in theory, but creating smaller buckets (essentially splitting in half the existing buckets used for comparison in the baseline) at the same time introduces more variation and less reliable results, which cancels out the benefit of adding pitcher handedness.
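
To put a rough number on the cost of smaller buckets, here is a small sketch of how the noise in a bucket’s out rate grows when the bucket is cut in half. It assumes each ball in a bucket is an independent out-or-not trial (a simple binomial model, which is my simplification and not part of SFR itself), and the bucket size and out rate are hypothetical.

```python
from math import sqrt


def out_rate_standard_error(out_rate, balls):
    """Standard error of an out rate estimated from `balls` independent trials."""
    if balls <= 0 or not 0.0 <= out_rate <= 1.0:
        raise ValueError("need balls > 0 and an out rate between 0 and 1")
    return sqrt(out_rate * (1.0 - out_rate) / balls)


# Hypothetical baseline bucket: 400 balls, 75% converted into outs.
full_bucket = out_rate_standard_error(0.75, 400)
# The same bucket split in half by pitcher handedness.
half_bucket = out_rate_standard_error(0.75, 200)

print(f"full bucket:  +/- {full_bucket:.4f}")
print(f"half bucket:  +/- {half_bucket:.4f}")
print(f"noise ratio:  {half_bucket / full_bucket:.3f}")
```

Under this model halving a bucket inflates the noise by a factor of the square root of two, or roughly 41 percent, so the extra context has to buy at least that much signal to come out ahead.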

On a second note, in looking at the data again for 1986-1987 and 2000-2002 I determined that it would be worth running the framework as is, since both periods contain the vast majority of fielder identifications, although they are deficient with respect to hit types. Hit type coverage is especially low for 2000-2002 but, of course, the changes made last week are designed to account for this and so should compensate to some degree.

The upshot of all of this is that you can now download a spreadsheet that contains SFR data for all major league infielders for the time periods that include 1957-1983, 1986-1998, and 2000-2007. Keep in mind that the results provided are based on a single algorithm that takes into account when there is missing hit type information and fills it in accordingly. This means that the results for 1988-1998 and 2003-2007 should be considered more accurate since they are based on essentially complete data (minus actual zone information of course) with 1957-1983, 1986-1987 and 2000-2002 in descending order of precision, and with 1984 and 1985 still out of the picture.

Finally, to finish this out, let’s take a look at the new overall leaders in Rate at each of the four infield positions when all seasons for which we have data are included. You’ll notice we’ve also upped the ante and are looking only at players who were assigned 2,000 or more balls in their virtual area of responsibility.

Table 3. Top and Bottom Shortstops by Rate, >= 2,000 Balls 1957-2007 (almost)

Name                Span         Balls     SFR    Rate
Adam Everett       2001-2007      2558    86.0    1.21
Bob Lillis         1958-1967      2053    55.8    1.20
Ernie Banks        1957-1961      2916    87.4    1.19
Rey Sanchez        1991-2005      3341    88.2    1.17
Mark Belanger      1965-1982      8468   198.3    1.15
Frank Taveras      1972-1982      4903  -106.7    0.89
Ruben Amaro        1958-1969      2909   -66.3    0.88
Ricky Gutierrez    1993-2004      3079   -82.2    0.88
Andujar Cedeno     1990-1996      2426   -74.6    0.87
Kurt Stillwell     1986-1996      2727   -75.8    0.87

Table 4. Top and Bottom Second Basemen by Rate, >= 2,000 Balls 1957-2007 (almost)

Name                Span         Balls     SFR    Rate
Dick Green         1963-1974      4281   102.0    1.18
Mark Ellis         2002-2007      2680    67.7    1.18
Mike Gallego       1986-1997      2117    50.5    1.16
Mark Lemke         1988-1998      3602    90.2    1.15
Jose Oquendo       1986-1995      2473    45.2    1.11
Bobby Richardson   1957-1966      4976  -109.9    0.87
Luis Rivas         2000-2007      2013   -46.9    0.87
Tony Taylor        1958-1976      5735  -152.6    0.86
Cookie Rojas       1962-1977      5594  -157.3    0.85
Jorge Orta         1972-1979      2567   -79.4    0.83

Table 5. Top and Bottom Third Basemen by Rate, >= 2,000 Balls 1957-2007 (almost)

Name                Span         Balls     SFR    Rate
Brooks Robinson    1957-1977      9686   293.0    1.26
Jim Davenport      1958-1970      2917    68.3    1.19
Eric Chavez        1998-2007      3488    73.9    1.16
Aurelio Rodriguez  1967-1983      6266   133.7    1.16
Scott Rolen        1996-2007      4423    95.9    1.16
Jim Presley        1986-1991      2111   -48.4    0.87
Howard Johnson     1982-1995      2009   -52.7    0.86
Bill Madlock       1973-1987      3520   -90.4    0.86
Harmon Killebrew   1957-1971      2256   -74.9    0.82
Dick Allen         1964-1972      2252  -118.4    0.74

Table 6. Top and Bottom First Basemen by Rate, >= 2,000 Balls 1957-2007 (almost)

Name                Span         Balls     SFR    Rate
Todd Helton        1997-2007      3202    55.1    1.18
John Olerud        1989-2005      4172    68.9    1.16
Pete O'Brien       1982-1993      2410    33.5    1.15
Wes Parker         1964-1972      2237    23.1    1.14
George Scott       1966-1979      4431    51.1    1.13
John Mayberry      1968-1982      3008   -23.8    0.93
Donn Clendenon     1962-1972      2584   -23.1    0.93
Mo Vaughn          1991-2003      2472   -27.7    0.91
Willie Montanez    1970-1982      2539   -29.7    0.90
Dick Stuart        1958-1969      2104   -59.4    0.79
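
If you download the spreadsheet and want to build leaderboards like Tables 3 through 6 yourself, the selection step might look something like the sketch below. The `FielderLine` type and `top_and_bottom` function are hypothetical names, Rate is taken as given (computing it requires the baseline matrix, which is not reproduced here), and the rows are a handful of lines copied from Table 3.

```python
from typing import NamedTuple


class FielderLine(NamedTuple):
    name: str
    balls: int
    sfr: float
    rate: float


def top_and_bottom(lines, min_balls=2000, n=5):
    """Return (best n, worst n) by Rate among fielders with at least min_balls."""
    qualified = [line for line in lines if line.balls >= min_balls]
    # sorted() is stable, so fielders tied on Rate keep their input order.
    ranked = sorted(qualified, key=lambda line: line.rate, reverse=True)
    return ranked[:n], ranked[-n:]


# A few shortstop lines copied from Table 3.
SHORTSTOPS = [
    FielderLine("Adam Everett", 2558, 86.0, 1.21),
    FielderLine("Bob Lillis", 2053, 55.8, 1.20),
    FielderLine("Ernie Banks", 2916, 87.4, 1.19),
    FielderLine("Frank Taveras", 4903, -106.7, 0.89),
    FielderLine("Ruben Amaro", 2909, -66.3, 0.88),
    FielderLine("Kurt Stillwell", 2727, -75.8, 0.87),
]

best, worst = top_and_bottom(SHORTSTOPS, n=2)
print([line.name for line in best], [line.name for line in worst])
```

Raising `min_balls` tightens the pool; at 2,500 balls, for instance, Bob Lillis would no longer qualify.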

PITCHf/x for 2008
With the new season upon us, we want to continue exploring the most recent data set made available to analysts–Sportvision’s PITCHf/x system. The advantage this time around is that we’ll have data for all 30 parks, beginning with the opening game in Washington’s new ballpark (sans the first two games at the Tokyo Dome) and hopefully taking us through the playoffs and World Series.

Besides the larger and therefore more reliable samples that come with more data (last year the system collected 332,851 pitches representing about half the schedule, while this year we should have more than 650,000), analysts everywhere are hopeful that some of the kinks in the system have been worked out, necessitating fewer adjustments. For example, while last year the point at which the pitch was being tracked was adjusted in mid-season, it appears that this season all pitches are being tracked starting at a point 50 feet from the plate, making adjustment for velocity unnecessary.

In addition, Gameday operators (like yours truly) now have the ability to override pitch data that comes in to the system. Unfortunately, it appears that in the XML data set these overrides simply appear as pitches without any data other than the operator’s x and y coordinates, making it impossible to know whether a pitch was simply missed by the system (i.e., the system wasn’t turned on or had some other problem) or whether it was truly overridden. As you might expect, early-season configuration and operator issues likely cause more of the former, and so hopefully the number of untracked pitches will be more a reflection of actual overrides as the season progresses. At this point (through games of April 6), 5% of the pitches have gone untracked.
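
Counting untracked pitches is simple enough to sketch. The example below treats any pitch element that carries none of a few tracking attributes as untracked, which lumps system misses and operator overrides together for exactly the reason described above. The XML fragment is hypothetical, and the element and attribute names (`pitch`, `start_speed`, `pfx_x`, `pfx_z`) are assumptions about the feed layout that you should check against the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a pitch-by-pitch feed. Two of the four
# pitches carry only the operator's x and y coordinates.
SAMPLE_XML = """
<inning>
  <atbat>
    <pitch x="101.2" y="140.5" start_speed="92.1" pfx_x="-6.3" pfx_z="8.8"/>
    <pitch x="88.4" y="151.0"/>
    <pitch x="95.0" y="133.7" start_speed="84.0" pfx_x="2.1" pfx_z="3.0"/>
    <pitch x="110.9" y="160.2"/>
  </atbat>
</inning>
"""

# Assumed names of attributes that only a tracked pitch would have.
TRACKING_ATTRS = ("start_speed", "pfx_x", "pfx_z")


def untracked_share(xml_text):
    """Return (untracked, total, share) for the pitch elements in xml_text."""
    root = ET.fromstring(xml_text)
    pitches = list(root.iter("pitch"))
    if not pitches:
        return 0, 0, 0.0
    # A pitch is untracked when every tracking attribute is missing or empty.
    untracked = sum(
        1 for pitch in pitches
        if not any(pitch.get(attr) for attr in TRACKING_ATTRS)
    )
    return untracked, len(pitches), untracked / len(pitches)


untracked, total, share = untracked_share(SAMPLE_XML)
print(f"{untracked} of {total} pitches untracked ({share:.0%})")
```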

What is probably most interesting to analysts, and what most fans have no doubt noticed, is that PITCHf/x is taking a stab at pitch type categorization, and displays it in the client, as shown in Figure 1.

Figure 1. Classifying Pitch Types

Several analysts, most notably Josh Kalk and John Walsh, have developed procedures for pitch identification, while I’ve been working off individual pitcher profiles when the need has arisen, and occasionally creating larger buckets for fastballs, changeups and breaking balls. Although the details of the algorithm used by Sportvision are not public at this point, they are classifying pitches into many categories, including changeups, curveballs, fastballs, four-seam fastballs, cut fastballs, split-fingered fastballs, knuckleballs, pitchouts, intentional balls, sinkers, sliders and unknown. Interestingly, they are also providing a confidence level for each pitch classified that ranges from 0 to 1 and is apparently a percentage expressed as a decimal.

To get a feel for the frequency, characteristics, and confidence with which their algorithms classify pitches, the number of pitches for each pitch type by pitcher hand along with velocity, movement, and confidence is shown in Tables 7 and 8 for games through April 7:

Table 7. PITCHf/x Pitch Classification for Left-Handed Pitchers in 2008

Pitch        Throws    Conf     Vel   Horiz    Vert   Count     Pct
Change       L        0.681    78.8     7.0     6.2     668     13%
Curve        L        0.811    74.6    -4.7    -4.6     623     12%
Cutter       L        0.687    84.3    -4.3     6.0     195      4%
Fastball     L        0.848    89.1     6.6     9.3    3113     60%
Four-seamer  L          n/a     n/a     n/a     n/a       0     n/a
Intent Ball  L        1.000    68.6     2.5     8.7       9      0%
Knuckleball  L          n/a     n/a     n/a     n/a       0     n/a
Pitch out    L        1.000    83.0     5.7     9.6       2      0%
Sinker       L          n/a     n/a     n/a     n/a       0     n/a
Slider       L        0.662    82.0    -0.9     2.9     517     10%
Splitter     L        0.513    83.5     4.8     5.9      90      2%
Unknown      L        0.000    54.8    -0.9     6.2       1      0%

Table 8. PITCHf/x Pitch Classification for Right-Handed Pitchers in 2008

Pitch        Throws    Conf     Vel   Horiz    Vert   Count     Pct
Change       R        0.665    82.8    -7.7     6.2    1858     11%
Curve        R        0.713    76.2     5.2    -4.7    1459      9%
Cutter       R        0.531    88.4     0.3     8.6     507      3%
Fastball     R        0.795    91.0    -6.6     8.9    8909     52%
Four-seamer  R        0.583    91.9   -10.9    11.4     142      1%
Intent Ball  R        1.000    71.2    -5.8     8.1      85      0%
Knuckleball  R        0.844    67.9     2.7     1.3      85      0%
Pitch out    R        1.000    81.7    -6.7     9.3      21      0%
Sinker       R        0.548    90.0   -11.9     6.6     667      4%
Slider       R        0.710    83.6     2.2     3.6    2812     16%
Splitter     R        0.493    83.8    -9.3     3.2     512      3%
Unknown      R        0.000    52.7    -4.8    12.2       1      0%

Note that these tables exclude some pitches, since my master list of players is not complete, and so for those pitchers handedness is not recorded.
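
A summary like Tables 7 and 8 boils down to grouping pitches by pitcher hand and pitch type and then averaging. Here is a minimal sketch of that step; the record layout (`throws`, `pitch_type`, `conf`, `vel`, `horiz`, `vert`) and the sample pitches are hypothetical, and pitches without a known pitcher hand are skipped in the same spirit as the note above.

```python
from collections import defaultdict

# Numeric fields to average within each (hand, pitch type) group.
FIELDS = ("conf", "vel", "horiz", "vert")


def summarize_by_type(pitches):
    """Return {(hand, pitch_type): averages, count, and share of that hand's pitches}."""
    groups = defaultdict(list)
    hand_totals = defaultdict(int)
    for pitch in pitches:
        hand = pitch.get("throws")
        if hand not in ("L", "R"):
            continue  # pitcher handedness unknown, so leave the pitch out
        groups[(hand, pitch["pitch_type"])].append(pitch)
        hand_totals[hand] += 1
    summary = {}
    for (hand, pitch_type), grp in groups.items():
        row = {field: sum(p[field] for p in grp) / len(grp) for field in FIELDS}
        row["count"] = len(grp)
        row["pct"] = len(grp) / hand_totals[hand]
        summary[(hand, pitch_type)] = row
    return summary


# Hypothetical pitches, invented purely for illustration.
SAMPLE = [
    {"throws": "L", "pitch_type": "Fastball", "conf": 0.90, "vel": 90.0, "horiz": 6.0, "vert": 9.0},
    {"throws": "L", "pitch_type": "Fastball", "conf": 0.80, "vel": 88.0, "horiz": 7.0, "vert": 10.0},
    {"throws": "L", "pitch_type": "Curve", "conf": 0.85, "vel": 75.0, "horiz": -5.0, "vert": -4.0},
    {"throws": None, "pitch_type": "Slider", "conf": 0.70, "vel": 83.0, "horiz": 1.0, "vert": 3.0},
]

for (hand, pitch_type), row in sorted(summarize_by_type(SAMPLE).items()):
    print(f"{pitch_type:<10}{hand}  {row['conf']:.3f}  {row['vel']:.1f}  {row['count']:>4}  {row['pct']:.0%}")
```

Adding the pitcher’s name to the grouping key and keeping only groups with 20 or more pitches would produce a confidence ranking by pitcher and pitch type.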

The algorithm thus far classifies almost all fastballs as simply “fastballs” and doesn’t use the four-seamer or cut-fastball designations nearly as often as those pitches are surely thrown. The same argument probably also applies to sinkers and perhaps splitters as well. However, the distribution of changeups, curves, and sliders appears to be closer to what one might expect.

Also, the algorithm appears to be customized to some degree for each pitcher and at the very least incorporates velocity. When movement is plotted by pitch type (relative to a non-spinning pitch, as shown in Figure 2 for left-handed pitchers from the perspective of the batter), the groups for the various pitches overlap fairly significantly, so movement alone could not separate them.

Figure 2. Pitch Type Groupings for Southpaws, 2008


And just for kicks, let’s take a look at the pitchers and pitch types that PITCHf/x is most and least confident about in the small sample we have from the first week.

Table 9. Most and Least Confident by Pitch Type, 20 or more pitches

Pitcher             Pitch       Count    Conf
A.J. Burnett        Fastball       63   0.965
Jonathan Albaladejo Fastball       31   0.962
Philip Hughes       Fastball       51   0.948
Hong-Chih Kuo       Fastball       42   0.942
Richard Hill        Curve          30   0.940
Kent Mercker        Fastball       20   0.940
Jeremy Affeldt      Curve          28   0.940
Erick Threets       Fastball       37   0.939
Manuel Parra        Fastball       53   0.938
Damaso Marte        Fastball       27   0.925
Micah Owings        Cutter         24   0.501
Jason Bergmann      Fastball       21   0.498
Justin Verlander    Sinker         40   0.486
Brad Thompson       Splitter       25   0.481
Justin Verlander    Fastball       35   0.460
Joe Saunders        Change         20   0.460
James Shields       Splitter       24   0.438
Aaron Cook          Sinker         29   0.420
Jason Bergmann      Sinker         22   0.330
Mike Mussina        Splitter       22   0.313

Finally, to close out this week’s musings, let’s compare the pitch profile I created for Felix Hernandez last year with the pitch classification in use by PITCHf/x this season. So far PITCHf/x has recorded two Hernandez starts–in total, 194 pitches on April 1 versus Texas and April 6 at Baltimore. First, let’s take a look at his pitch frequency using my classification scheme from last season.

Table 10. King Felix in 2008 Pitch Classification

Pitch       Count  Start  Horiz   Vert  Conf   Other System?
Unknown         1   91.3   -1.8   -1.9   n/a   Slider
Changeup       32   86.2   -7.3    2.9   n/a   26 classified as splitter
Curve          29   82.1    5.1   -5.4   n/a   7 sliders, 21 curves, 1 sinker
Four-Seamer    40   96.3   -7.6    7.6   n/a   All fastballs
Two-Seamer     69   93.8   -7.2    6.9   n/a   1 split-fingered, 1 sinker
Slider         23   88.5    1.2   -1.0   n/a   All sliders

Table 11. King Felix in 2008 PITCHf/x Pitch Classification

Pitch       Count  Start  Horiz   Vert   Conf  Other System?
Unknown         0    n/a    n/a    n/a    n/a
Changeup        4   86.4   -8.6    6.4  0.530  all changeups
Curve          21   80.5    6.0   -6.6  0.785  all curves
Four-Seamer     0    n/a    n/a    n/a    n/a
Fastball      109   94.7   -7.3    7.2  0.854  2 changes, 40 four-seamers, 67 two-seamers
Split-finger   27   86.1   -7.1    2.2  0.558  26 changes, 1 two-seamer
Sinker          2   86.7    0.7   -1.9  0.502  1 curve, 1 two-seamer
Slider         31   88.1    1.3   -1.2  0.650  7 curves, 23 sliders, 1 unknown

Overall, the PITCHf/x algorithm does a pretty good job, as both systems largely agree on curves, sliders, and fastballs. But you’ll notice that PITCHf/x sees 26 of the 32 pitches my algorithm identifies as changeups as split-fingered fastballs. Since Hernandez does not, to my knowledge, throw a splitter, the conclusion is that while the algorithm considers some information on other pitches thrown by the pitcher, the set of pitch types it chooses from is not restricted to those the pitcher actually throws. As a result, there will be cases where the classification is incorrect, although of course Hernandez is probably one of the more difficult pitchers to classify since he throws at a higher average velocity than most pitchers.
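
The cross-references in the last column of Tables 10 and 11 amount to a cross-tabulation of the two systems, and from that an overall agreement rate falls out easily. The sketch below transcribes those counts; the mapping of which labels count as “the same pitch” (both of my fastball types matching a PITCHf/x fastball, and a sinker not matching a two-seamer) is my own assumption, and a different mapping would give a different number.

```python
# (label in last year's profile, PITCHf/x label) -> pitches,
# transcribed from the "Other System?" columns of Tables 10 and 11.
CROSSTAB = {
    ("Unknown", "Slider"): 1,
    ("Changeup", "Changeup"): 4,
    ("Changeup", "Fastball"): 2,
    ("Changeup", "Split-finger"): 26,
    ("Curve", "Curve"): 21,
    ("Curve", "Slider"): 7,
    ("Curve", "Sinker"): 1,
    ("Four-Seamer", "Fastball"): 40,
    ("Two-Seamer", "Fastball"): 67,
    ("Two-Seamer", "Split-finger"): 1,
    ("Two-Seamer", "Sinker"): 1,
    ("Slider", "Slider"): 23,
}

# Assumption: PITCHf/x labels treated as agreeing with each profile label.
EQUIVALENT = {
    "Changeup": {"Changeup"},
    "Curve": {"Curve"},
    "Four-Seamer": {"Fastball"},
    "Two-Seamer": {"Fastball"},
    "Slider": {"Slider"},
}


def agreement_rate(crosstab, equivalent):
    """Share of pitches on which the two systems assign equivalent labels."""
    total = sum(crosstab.values())
    agree = sum(
        count for (mine, theirs), count in crosstab.items()
        if theirs in equivalent.get(mine, set())
    )
    return agree / total


strict = agreement_rate(CROSSTAB, EQUIVALENT)

# Same calculation if a PITCHf/x "split-finger" is accepted as a changeup.
lenient_map = dict(EQUIVALENT, Changeup={"Changeup", "Split-finger"})
lenient = agreement_rate(CROSSTAB, lenient_map)

print(f"strict agreement:  {strict:.1%}")
print(f"lenient agreement: {lenient:.1%}")
```

The gap between the two figures is the changeup-versus-splitter disagreement; everything else is spread across a handful of pitches.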

Be that as it may, the addition of pitch type confidence is a welcome one and should allow for a greater breadth of analyses while providing the benefit of being standardized.

Dan Fox

