January 24, 2008
Simple Fielding Runs Version 1.0
"Let him hit it, you've got fielders behind you."
When it comes to fielding analysis, there really is no such thing as simple. Be that as it may, in this space during the last month and a half we've been exploring a nascent fielding system I developed based on play-by-play data: "Simple Fielding Runs," or SFR, for those who aren't totally nauseated by yet another TLA (three-letter acronym).
Simply put, the main advantage of developing such a system is that it claims the middle ground between systems that are based on traditional fielding statistics (such as the Davenport Translations, or the subsequently-developed fielding component of Bill James' Win Shares), and those based on zone data tracked for every ball put in play, notably Mitchel Lichtman's Ultimate Zone Rating [UZR] or John Dewan's Plus/Minus system (as described in the initial 2006 edition of The Fielding Bible). As we'll see, each approach has its advantages, and we'll exploit one of those where SFR is concerned in the final section of today's column.
While the idea of creating a system based on play-by-play records turned out not to be very original, the implementation of SFR is unique, and it has been interesting to see how the two systems fare when compared to those based on more granular data. To that end, and in the formulaic threefold manner that defines so many of these columns, this week we'll briefly recount the refinements that have been made in SFR over the last month, then make a few comparisons between SFR, UZR, and the Plus/Minus system, and wrap up by applying SFR to the 2007 minor leagues to take a look at who is flashing the leather down on the farm.
Evolution Not Revolution
First, a baseline for each position is created that specifies how frequently and at what cost balls of different types (line drives, groundballs, popups) hit "near" a fielder were turned into outs. The second step is to make the same calculations for each fielder and compare their results to the matrix to determine how many balls above or below average the fielder either turned into outs or let past him (assuming all other things being equal). The final step is to convert the difference into a run value, which we did using values derived from linear weights. The rub in all of this is in determining just what a ball hit "near" a fielder actually means, since we don't have hit location data broken down into zones. In the first iteration very simple partitioning rules were used to come up with the "virtual area of responsibility" for each position; and by "simple" I mean static, as in "for shortstops, all ground balls fielded by the shortstop and left fielder are considered, as well as half the groundballs fielded by the center fielder."
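The three steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual SFR code. The function names, the input layout (a list of ball-type and out/no-out pairs already assigned to a fielder's area), and the run values per play are all assumptions made for the example. The baseline here also tracks only out rates, not the cost of the hits allowed.

```python
from collections import defaultdict

# Hypothetical run values per play made above average, by ball type.
# These are placeholders, not the linear-weights values used in SFR.
RUNS_PER_PLAY = {"ground": 0.75, "line": 0.80, "popup": 0.75}


def out_rates(plays):
    """Step 1: baseline out rate per ball type from (ball_type, was_out) pairs."""
    seen = defaultdict(int)
    made = defaultdict(int)
    for ball_type, was_out in plays:
        seen[ball_type] += 1
        made[ball_type] += int(was_out)
    return {t: made[t] / seen[t] for t in seen}


def plays_above_average(fielder_plays, baseline):
    """Step 2: outs made minus outs expected from the baseline, per ball type."""
    diff = defaultdict(float)
    for ball_type, was_out in fielder_plays:
        diff[ball_type] += int(was_out) - baseline[ball_type]
    return dict(diff)


def simple_fielding_runs(fielder_plays, baseline, run_values=RUNS_PER_PLAY):
    """Step 3: convert plays above or below average into runs."""
    diff = plays_above_average(fielder_plays, baseline)
    return sum(run_values[t] * d for t, d in diff.items())


# Made-up data: the league converts 70 of 100 grounders, the fielder 8 of 10.
league = [("ground", True)] * 70 + [("ground", False)] * 30
fielder = [("ground", True)] * 8 + [("ground", False)] * 2
print(round(simple_fielding_runs(fielder, out_rates(league)), 2))
```

With these made-up inputs the fielder is one play better than the baseline on grounders, which the placeholder run value turns into about +0.75 runs.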
We then compared SFR to UZR for 2005 and 2006 data and computed a correlation coefficient of 0.75, which was high enough to justify continued development of the system.
Our next step was taken a week later and primarily incorporated two refinements that most readers likely thought obvious: batter handedness and bunting. In taking the first into consideration, we're controlling for fielders facing a disproportionate number of batters from one side that may in turn skew the overall difficulty of turning batted balls into outs. In the latter case, we're affecting corner infielders who usually end up fielding bunts, since treating bunts like normal ground balls is clearly not sufficient. In addition, this second attempt eliminated from consideration for middle infielders all line drives that resulted in extra-base hits, since the odds of a line drive catchable by an infielder resulting in a double or triple are negligible. One final refinement in the second beta version of the system involved introducing more sophisticated partitioning rules. These rules were based on the proportion of batted balls that we find "in the wild" for balls we know were fielded by the positions participating in the split (i.e. between third and short, and second and first) with the split between short and second handled as a 50/50 division. The new partitioning rules also put an end to the double-counting of batted balls where, for example, the same grounder to the outfield was assigned to both the third baseman and shortstop. Instead, each ball was assigned using the proportion calculated in the partitioning rules.
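As a rough illustration of the proportional split, the sketch below turns the counts of balls actually fielded by two neighboring positions into shares, then divides the balls that got through to the outfield according to those shares so that each ball is counted exactly once. The counts and function names are hypothetical, chosen only for the example.

```python
def split_shares(fielded_counts):
    """Convert counts of balls fielded by each position into fractional shares."""
    total = sum(fielded_counts.values())
    return {pos: n / total for pos, n in fielded_counts.items()}


def partition_balls(n_balls, shares):
    """Divide balls that reached the outfield among positions by share."""
    return {pos: n_balls * share for pos, share in shares.items()}


# Hypothetical counts of grounders fielded on the left side of the infield.
shares = split_shares({"3B": 400, "SS": 600})
charged = partition_balls(50, shares)
print(charged)
```

Because the shares sum to one, the 50 grounders to left field in this example are divided between the two fielders rather than charged in full to both.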
After applying these changes we once again ran our comparison with UZR and found that our correlation coefficient rose slightly, from 0.75 to 0.78, in addition to bringing SFR in line with UZR in terms of range and standard deviation. It should be noted that while I'm effectively using UZR as a baseline for comparison, that doesn't mean that UZR is a perfect system. My use of it is based on the principle that all other things being equal, a system like UZR based on more detailed data should, in theory, give us results that are closer to reality.
While those changes were certainly beneficial, I quickly discovered (and have not yet discussed in this space) a few more tweaks that could possibly make a substantive difference.
First, I considered the additional context for all infielders depending on whether first base was occupied, and added this to the baseline used for comparison. It turns out that for first basemen, on groundballs hit by lefties, the percentage of runners who reach via hit or error goes from 16 percent to 24 percent when first is occupied, and from 20 percent to 29 percent for right-handers. For second basemen in the same scenarios it goes (respectively) from 27 percent to 29, and from 30 percent to 34. For shortstops, interestingly, the trend is the opposite, as it declines from 34 percent to 30, and 32 percent to 31. Finally, for third basemen it's pretty steady--an unchanged 27 percent on lefties, and from 24 percent to 26 against right-handers. Bunts make up a far smaller percentage of fielded balls, but for them the differences when first is occupied are significant, since the vast majority of such attempts are sacrifice bunts rather than attempts at a bunt hit, which have a higher success rate.
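To make the effect of this extra context concrete, here is a small sketch that stores the groundball percentages quoted above in a lookup table and uses them to compute how many batters an average fielder would be expected to let reach. The table layout and function name are assumptions for the example; only the percentages come from the text.

```python
# Share of groundballs on which the batter reached via hit or error,
# keyed by (position, batter hand, first base occupied).
REACH_RATE = {
    ("1B", "L", False): 0.16, ("1B", "L", True): 0.24,
    ("1B", "R", False): 0.20, ("1B", "R", True): 0.29,
    ("2B", "L", False): 0.27, ("2B", "L", True): 0.29,
    ("2B", "R", False): 0.30, ("2B", "R", True): 0.34,
    ("SS", "L", False): 0.34, ("SS", "L", True): 0.30,
    ("SS", "R", False): 0.32, ("SS", "R", True): 0.31,
    ("3B", "L", False): 0.27, ("3B", "L", True): 0.27,
    ("3B", "R", False): 0.24, ("3B", "R", True): 0.26,
}


def expected_reached(position, chances):
    """Expected batters reaching, given (batter_hand, first_occupied, n_groundballs) rows."""
    return sum(REACH_RATE[(position, hand, occupied)] * n
               for hand, occupied, n in chances)


# Hypothetical workload: 100 lefty grounders with first empty, 100 with first occupied.
print(expected_reached("1B", [("L", False, 100), ("L", True, 100)]))
```

A first baseman with that hypothetical workload would be expected to allow about 40 batters to reach, compared with 32 if the bases-empty rate were applied to all 200 grounders, which shows why ignoring the base state could penalize him unfairly.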
Secondly, the partitioning rules for determining which balls are assigned to first and third basemen were once again altered. In the previous attempts, all balls in the shared areas of responsibility--except as noted above for middle infielders when a line drive resulted in an extra-base hit--were partitioned according to the split percentages. This was changed to exclude bunts from the calculation of the partitioning percentage and, more importantly, all extra-base hits on groundballs to left field are now assigned to the third baseman, and all extra-base hits on grounders to right are assigned to the first baseman. This is done based on the assumption that these are hits down the line that naturally would fall within the corner infielder's area of responsibility.
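The revised corner rule can be sketched as a single function. This is a simplified illustration under stated assumptions: the function name and arguments are hypothetical, and the ordinary split share passed in is assumed to have been computed with bunts already excluded.

```python
def assign_outfield_grounder(field, extra_base_hit, corner_share):
    """Return {position: fraction} for a grounder that reached left or right field.

    corner_share is the corner infielder's ordinary split share
    (assumed here to be calculated without bunts).
    """
    if field == "LF":
        corner, middle = "3B", "SS"
    elif field == "RF":
        corner, middle = "1B", "2B"
    else:
        raise ValueError("field must be 'LF' or 'RF'")
    if extra_base_hit:
        # Treated as a ball down the line: charged fully to the corner infielder.
        return {corner: 1.0, middle: 0.0}
    return {corner: corner_share, middle: 1.0 - corner_share}


print(assign_outfield_grounder("LF", True, 0.4))
print(assign_outfield_grounder("RF", False, 0.25))
```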
Based on some correlation results shown in the second article, we also no longer partition any popups, fly balls, or line drives to the outfield, since by doing so we didn't seem to be adding much information to what we already knew. Fielders still get credit, however, for balls of these other types that they field. And as noted above, we were using a 50/50 split on groundballs up the middle. Shortly after publication, this was changed to partition these balls based on the percentages of grounders actually fielded by the shortstop and second baseman. Finally, in just the last few days, a small error in attributing popups, line drives, and fly balls for shortstops was corrected.
The end result of all of these changes and refinements is that we're ready to--in the words of my field of software development--ship the code and stamp the current version for infielders as version 1.0 of SFR. What that means is that you can now download a spreadsheet containing the 2005 through 2007 data for all infielders. Enjoy.
Like Two Peas in a Pod?
In order to provide just a little more context for this first version, let's run a few more comparisons to UZR and the Plus/Minus system. As we did before, we can compute correlation coefficients for SFR and UZR and break it down by position. This time, though, we'll use all seasonal data for 2003 through 2006 for players who played in 50 or more games at their position. The results are shown in Table 1:
Table 1. Correlation Coefficients for SFR vs. UZR, Seasonal 2003-2006 for >=50 Games Played

Pos   Seasons     r
All       549  0.80
1B        143  0.68
2B        132  0.81
SS        141  0.81
3B        133  0.82
From this it appears that our latest set of changes bought us a slight rise in correlation coefficient, increasing the similarity slightly for first basemen, second basemen, and shortstops. It's still interesting to note that first basemen lag behind the other positions, although I still have no solid reason why this should be the case. It certainly could be that the style of play varies more for first basemen in terms of positioning and allowing or not allowing a second baseman to field certain types of balls that a more granular system can take into account. It should also be noted that in this version of SFR the context of whether first base is occupied is taken into account, although the state of second base is not. This may have a larger effect than one might think. This whole question is obviously one ripe for further research.
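For readers who want to run this kind of comparison on their own data, here is a minimal sketch. It assumes the coefficient reported is the ordinary Pearson r, which the column does not state explicitly. The tuple layout and function names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlate_seasons(seasons, min_games=50):
    """Correlate SFR with UZR over (games, sfr, uzr) rows meeting the games cutoff."""
    kept = [(sfr, uzr) for games, sfr, uzr in seasons if games >= min_games]
    return pearson_r([s for s, _ in kept], [u for _, u in kept])


# Made-up player seasons; the 30-game season is dropped by the filter.
seasons = [(120, 10, 12), (60, -5, -3), (30, 50, -50), (150, 0, 2)]
print(correlate_seasons(seasons))
```

The games cutoff matters because short seasons produce noisy totals in both systems, and a handful of them can pull the coefficient around.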
In addition to running correlations based on seasonal data, we can also see how our metric performs across a span of a few seasons for players with significant playing time. Sean Smith reported his findings for TotalZone+ (the analogous system referenced in the introduction) recently, so we'll take a similar tack here. In Table 2 you'll find the correlations by position for all players who played in 162 or more games (Sean filters by 500 or more chances) at their positions from 2003 through 2006:
Table 2. Correlation Coefficients for SFR vs. UZR, Aggregated 2003-2006 for >=162 Games

Pos   Players     r
All       156  0.87
1B         38  0.78
2B         40  0.88
SS         41  0.86
3B         37  0.89
To give you a feel for what this looks like graphically, Figure 1 shows the scatter plot colored by position (and yes, that blue dot in the upper right-hand corner of the graph is new Twins shortstop Adam Everett, whom SFR has at +81 runs in 481 games, and whom UZR has at +104):
In addition to comparing SFR to UZR we can also compare it, at least in the aggregate, to the Plus/Minus system developed by Baseball Info Solutions. I used the caveat "at least" since full data for 2006 and 2007 is not available, although summary data as well as leaders and trailers have been published in the Hardball Times' annual and The Bill James Handbook 2008.
In THT's published work, we find the aggregated team totals for 2007 broken down by middle and corner infielders, which we can compare to the SFR totals for those positions, as shown in Table 3 ordered by total SFR:
Table 3. Comparison of SFR to Plus/Minus, 2007 Team Totals

          Middle      Corner       Total
Team   SFR   +/-   SFR   +/-   SFR   +/-
TOR     42    56     5    25    47    81
COL     35    46     9   -16    44    30
SDN     30     0     8    -6    38    -6
SFN     19     1    18    46    37    47
BOS     15     3    15    14    30    17
OAK     21    14     8     8    30    22
CHN     11     3     3    13    14    16
SLN    -15   -17    28    58    13    41
BAL     17     3    -5    -5    12    -2
ATL     11    -5     1     4    12    -1
NYN     15    15    -3    11    11    26
KCA      6    25     5    24    11    49
PHI     17    25    -6    -2    11    23
MIN     -6    14    11    16     5    30
ARI      2    19     3   -18     5     1
ANA      0     8     1    27     1    35
TEX      4    -9    -3   -14     0   -23
NYA     -8   -20     3    17    -6    -3
DET    -26     0    16    29   -10    29
WAS    -11   -33     0    -8   -11   -41
LAN     -5    -8    -8   -20   -13   -28
CLE    -10     5    -5    -9   -16    -4
SEA     -7   -11   -10   -14   -17   -25
HOU     -3    -5   -15    17   -18    12
PIT    -23    -4     2    18   -22    14
CIN    -13     7   -13   -29   -26   -22
CHA    -19   -38    -9   -15   -28   -53
TBA    -39   -51    -4   -17   -43   -68
MIL    -16    -5   -33   -41   -49   -46
FLO    -43   -54   -20   -44   -64   -98
Keep in mind that while SFR is denominated in runs, Plus/Minus simply counts the difference from expected in the number of balls fielded. (Plus/Minus does have a concept termed "Enhanced +/-" that gives added weight to balls hit down the line for corner infielders, but I can find no evidence that the numbers presented here are "Enhanced.") This means that the Plus/Minus numbers will have a larger magnitude. A quick translation would multiply the Plus/Minus number by something like 0.44 for middle infielders and 0.50 for corner infielders to convert them to runs.
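The quick translation amounts to one multiplication per group. A minimal sketch, using the approximate factors mentioned above (the function and dictionary names are just for illustration):

```python
# Approximate runs per play, as suggested in the text.
RUNS_PER_PLAY = {"middle": 0.44, "corner": 0.50}


def plus_minus_to_runs(plus_minus, group):
    """Convert a Plus/Minus play count to approximate runs for 'middle' or 'corner'."""
    return plus_minus * RUNS_PER_PLAY[group]


# Toronto's 2007 Plus/Minus totals from Table 3: +56 middle, +25 corner.
toronto_runs = plus_minus_to_runs(56, "middle") + plus_minus_to_runs(25, "corner")
print(round(toronto_runs, 1))
```

On this rough scale, Toronto's +81 plays works out to something like 37 runs, which is much closer in magnitude to its SFR total of +47.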
The lists agree substantially, with 13 of the 16 teams that SFR pegs as above-average also rating as such in Plus/Minus. Overall, a regression between the totals results in a correlation coefficient of 0.79 with no discernible difference between middle infielders (at 0.78) and corner infielders (at 0.78). Once again, to give you a visual, the following graph depicts how the two systems see the teams overall in terms of their infield defense:
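The agreement count can be checked directly from the Total columns of Table 3. The sketch below treats "above average" as a total strictly greater than zero, which is an assumption about how the count was made. The variable and function names are hypothetical.

```python
# (total SFR, total Plus/Minus) by team, copied from Table 3.
TOTALS = {
    "TOR": (47, 81), "COL": (44, 30), "SDN": (38, -6), "SFN": (37, 47),
    "BOS": (30, 17), "OAK": (30, 22), "CHN": (14, 16), "SLN": (13, 41),
    "BAL": (12, -2), "ATL": (12, -1), "NYN": (11, 26), "KCA": (11, 49),
    "PHI": (11, 23), "MIN": (5, 30), "ARI": (5, 1), "ANA": (1, 35),
    "TEX": (0, -23), "NYA": (-6, -3), "DET": (-10, 29), "WAS": (-11, -41),
    "LAN": (-13, -28), "CLE": (-16, -4), "SEA": (-17, -25), "HOU": (-18, 12),
    "PIT": (-22, 14), "CIN": (-26, -22), "CHA": (-28, -53), "TBA": (-43, -68),
    "MIL": (-49, -46), "FLO": (-64, -98),
}


def agreement(totals):
    """Count teams SFR rates above average, and how many Plus/Minus also does."""
    above_sfr = [team for team, (sfr, _) in totals.items() if sfr > 0]
    both = [team for team in above_sfr if totals[team][1] > 0]
    return len(above_sfr), len(both)


def split_decisions(totals):
    """Teams SFR has below average that Plus/Minus has above average."""
    return sorted(team for team, (sfr, pm) in totals.items() if sfr < 0 < pm)


print(agreement(TOTALS))
print(split_decisions(TOTALS))
```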
Figure 2. SFR vs. Plus/Minus for 2007 Teams
There are clearly differences between the systems--for example, the cluster of Detroit, Pittsburgh, and Houston, which SFR sees as below average, while Plus/Minus has them in the black--but the similarities and the fairly tight correlation with both UZR and Plus/Minus should give us some confidence that the system is indeed measuring fielding prowess to a substantial degree.
We can't perform the same kinds of correlations for Plus/Minus using player seasons as we can for UZR, since the data is not available, but we can sample the leaders and trailers for 2005 through 2007 as published in the Handbook. What follows is a series of tables that shows the top and bottom five in Plus/Minus, along with the SFR value computed for the player. I've also added a few notable players who may have done well in one metric but not the other (blank values for Plus/Minus indicate that the player did not appear in the leaders or trailers lists).
Table 4. Leaders and Trailers in Plus/Minus Compared to SFR, 2005-2007

First Base
Player                SFR  +/-
Albert Pujols          22   72
Casey Kotchman          7   31
Doug Mientkiewicz       8   31
Lyle Overbay           14   24
Kevin Youkilis         10   19
------------------------------
Mike Jacobs            -7  -25
Richie Sexson          -4  -25
Carlos Delgado        -11  -26
Adam LaRoche           -7  -28
Prince Fielder        -21  -33
Notables
Mark Teixeira          16   15
Nick Johnson            7    6
Justin Morneau          5   14
Olmedo Saenz           -8
Jason Giambi          -11

Second Base
Player                SFR  +/-
Chase Utley            29   64
Orlando Hudson         19   53
Aaron Hill             36   48
Mark Ellis             42   43
Mark Grudzielanek      27   36
------------------------------
Jorge Cantu           -29  -29
Jose Vidro            -13  -31
Craig Biggio           -3  -33
Jeff Kent             -13  -36
Rickie Weeks          -45  -41
Notables
Jamey Carroll          32   17
Placido Polanco        10   28
Brian Roberts           8   25
Jose Castillo         -22
Brandon Phillips      -24
Dan Uggla             -30

Shortstop
Player                SFR  +/-
Adam Everett           64   92
Jason Bartlett         22   45
Clint Barmes           17   43
Jimmy Rollins          22   42
Jack Wilson            11   41
------------------------------
Marco Scutaro         -15  -33
Felipe Lopez          -20  -34
Hanley Ramirez        -23  -43
Michael Young         -26  -64
Derek Jeter           -37  -90
Notables
Omar Vizquel           45   31
Jose Reyes             30
John McDonald          20   33
Rafael Furcal          18   36
Troy Tulowitzki        10   30
Angel M. Berroa       -27  -33
Carlos Guillen        -21

Third Base
Player                SFR  +/-
Pedro Feliz            30   64
Brandon Inge           33   61
Scott Rolen            27   50
Joe Crede              24   44
Adrian Beltre           7   42
------------------------------
Garrett Atkins          6  -21
Edwin Encarnacion     -26  -25
Hank Blalock           -8  -28
Mark Teahen           -22  -30
Miguel Cabrera        -16  -37
Notables
Nick Punto             19   28
Eric Chavez            30   27
Ryan Zimmerman          9   24
Morgan Ensberg         16   16
Ryan Braun            -28
Alex Rodriguez        -19
Defense in a Minor Key
Because of the confidence we've gained through the comparisons to UZR and Plus/Minus, we can now begin to take the next step and apply the SFR methodology to data sets where we do not have numbers generated from a more granular system. To wrap up today, let's review the leaders and trailers for all of the minor leagues (except the Mexican League) for 2007 by position.
First, let's tackle the first basemen:
Table 5. Minor League SFR Leaders and Trailers at First Base, 2007

Player                League  Team  SFR
Brandon Snyder        SAL     DEL     9
Daric Barton          PCL     SRC     9
Yurendell De Caster   INT     IND     8
Larry Broadway        INT     COH     8
Todd Self             TXS     COR     7
---------------------------------------
Lars Anderson         SAL     CAP    -6
Jeffrey Cunningham    PIO     CAS    -8
Logan Morrison        SAL     GBO    -9
Our top slots go to Orioles farmhand and converted catcher Brandon Snyder playing in A-ball and Daric Barton playing at Triple-A for the A's organization. Snyder just missed Kevin Goldstein's list of top Orioles prospects, while Barton (another converted catcher), after being named the top prospect before the 2007 season, didn't disappoint with the bat and remains at the top of the Oakland stack.
On the flip side, 20-year-old left-handed slugger Logan Morrison, toiling in A-ball for the Marlins, had a good year with the bat by smacking 24 home runs despite struggling against southpaws. However, he did not turn in such a good season with the glove. Another big slugger in the Rockies system, Jeffrey Cunningham, got his first taste of professional baseball in Casper and rated at -8 runs in just 59 games.
Now it's on to the second basemen:
Table 6. Minor League SFR Leaders and Trailers at Second Base, 2007

Player                League  Team  SFR
Adam Davis            SAL     LCO    17
Jayson Nix            PCL     CSP    16
Jose Vallejo          MDW     CLI    15
Luis Valbuena         SOU     WTD    15
Miguel Abreu          SAL     DEL    14
---------------------------------------
Chih-Hsien Chiang     SAL     CAP   -14
Brooks Conrad         PCL     RRE   -14
Chase Fontaine        SAL     ROM   -15
Grabbing the top spot at +17 runs is Adam Davis playing in Low-A for Cleveland. Not exactly a prospect at 22 years old, he split time between second (104 games) and third base (25 games) where he also rated +2. At runner-up we find the Rockies' 2001 first-round pick, Jayson Nix, shining afield at Colorado Springs. After delivering his best season with the bat since 2003, he's in the running to win the second base job, where he'll be competing with Marcus Giles, Ian Stewart, Clint Barmes, and Omar Quintanilla, among others. He will have a leg up defensively.
Chase Fontaine was promoted from the Sally League to the Carolina League, and in both stops was tried at short, third, second, and the outfield. While I haven't run the outfield numbers, it's clear that the Braves are trying to find him a position where he can do the least damage. When you add it all up, in his two stops his infield "contribution" totaled -27 runs.
While never touted for his glove work, Brooks Conrad's 2007 season was a disaster on both sides of the ball. As profiled by Marc Normandin, his .218/.305/.420 performance at Round Rock for the Astros will be an impediment now that he's a minor league free agent. His -14 SFR at second base to go along with a -2 at third base in just 13 games won't help either.
On to the shortstops...
Table 7. Minor League SFR Leaders and Trailers at Shortstop, 2007

Player                League  Team  SFR
Hainley Statia        CLF     RCQ    21
Ramon Santiago        INT     TOL    21
Clint Barmes          PCL     CSP    19
Jonathan Herrera      TXS     TUL    16
Juan Sanchez          DSL     DTW    15
---------------------------------------
Jeffrey Dominguez     CLF     HDM   -16
Dylan Johnston        NWN     BOI   -16
Neil Walton           FSL     VBD   -17
At shortstop, our leader is Hainley Statia playing his first full season at High-A. According to Baseball America, Statia's "an instinctual middle infielder with plus defensive skills." SFR agrees, as he was +14 at two stops in 2006 and +16 at two stops in 2005. In second and third places, we find a couple of Triple-A veterans, Detroit's Ramon Santiago and Colorado's Clint Barmes, both of whom will be competing for utility roles on their respective clubs in 2008.
Prior to the 2006 season, Neil Walton was ranked by Baseball America as the top defensive middle infielder and the player with the best infield arm in the Rays' system. His 2006 SFR of +5 seems to support that, but in 2007 he committed 25 errors in 88 games and showed decreased range on his way to a -17 SFR. Dylan Johnston was drafted by the Cubs in 2005, has played nothing but shortstop in his three seasons, and reportedly has average range with a quick release. In 2007 he was promoted from Boise to Peoria, but before moving east he committed 28 errors in 56 games leading to an SFR of -16. He wasn't done, however, and after moving to the Midwest League he totaled -8 by committing 18 more miscues for a 2007 SFR total of -24.
Finally, at the hot corners of the minors, the best and worst...
Table 8. Minor League SFR Leaders and Trailers at Third Base, 2007

Player                League  Team  SFR
Ryan Rohlinger        SAL     AUG    35
Mario Lisson          CRL     WIL    16
John Contreras        DSL     DCU    15
Andrew Davis          NWN     SKV    15
Mike Hessman          INT     TOL    14
---------------------------------------
Michael Grace         SAL     KAN   -11
Matthew Sweeney       MDW     CED   -17
Mat Gamel             FSL     BRE   -24
As a 24-year-old who hit .235 in Low-A, Ryan Rohlinger is not exactly high on most people's radar, but his SFR of +35 ranked the highest of any player. Although he did play second and short in college, one wonders whether there is a data problem (although in looking at the other infielders at Augusta, nothing jumps out) or if this is just something of a fluke in the system. We'll give Ryan the benefit of the doubt for now, and crown him with the title of best infield defender in the minors in 2007.
In the runner-up slot we find Royals third sacker Mario Lisson, who played at High-A Wilmington as a 23-year-old. Lisson is acknowledged as a good defender--as you might expect of a converted shortstop--and he's on the 40-man roster, since the Royals opted to protect him in the Rule 5 draft.
On the bottom we find Brewers farmhand Mat Gamel, who purportedly has plus arm strength but poor footwork that leads to poor throws. He made an astonishing 53 errors in 113 games in 2007, and since 2005 has committed 91 errors in 229 professional games at third base, good for -34 runs on his career. Can you spell DH?
Finally, we have 19-year-old Angels prospect Matthew Sweeney, who in his first full season in Low-A committed 28 errors in 85 games at third leading to his -17 SFR. Although he has decent arm strength, a move to first base seems to be in his future.