“The subject may appear an insignificant one, but we shall see that it possesses some interest, and the maxim ‘de minimis lex non curat’–the law is not concerned with trifles–does not apply to science.”
–Charles Darwin, from the preface of his book The Formation of Vegetable Mould, Through the Action of Worms, With Observations on Their Habits (1881)

Darwin penned the words above in what would be his final book, published less than a year before his death. In that book he argues that small effects–earthworms, in his case–summed over a great number of actors and a long period of time, can have large consequences. In other words–unlike for his lawyer friends–in many disciplines trifles matter, not only because resolving small but knotty puzzles is intellectually satisfying, but because the apparently inconsequential is often just that: apparently inconsequential.

If one can say anything about the obsessive baseball fan, it surely is that, like a good scientist, he is concerned with what to others may look like minutiae. And so it is in that vein that, based on reader feedback, this week we’ll dig back into the methodology behind the calculation of SFR–introduced last week–and address one or two points that, like Darwin’s earthworms, may appear insignificant but have a measurable effect on the final outcome. I’m also delaying my original promise: we’ll once again focus our attention on infielders, and outfielders will have to wait for another time.

The Trifles

Before we dig in, I’d like to thank everyone who offered opinions and asked questions on last week’s column. I assumed the topic might generate some interest, but however many responses arrive, I am always appreciative of the insightful suggestions I find in my inbox. I should also point out that while last week the focus was on how efficiently and quickly a credible system could be built, this time around it’s all about refinements. So for those who were worried this was only a quick and dirty exercise, I hope you’ll be encouraged by my desire to improve upon it.

Based on those reader suggestions, and piggybacking on a couple of improvements I hinted at last week, the adjustments can be discussed under two broad categories.

1. More Context. One of the obvious weaknesses of the first draft of the system was that the baseline matrix broke down the balls in the “area of responsibility” defined for each position only by hit type. Two key pieces of context missing from the table were batter handedness and bunting. The former could be meaningful if a player fielded disproportionately more balls struck by left- or right-handed hitters, since we might guess that our expectation for how many runners reach base would differ. As for the latter, bunts are not fielded in the same way as normal ground balls, so we would expect this added context to change SFR totals in general, and especially for corner infielders in the National League.

To account for this, we can simply add two attributes to our matrix. This has the effect of morphing the six-row table shown last week into a 20-row table:

Pos    Hit  B Bunt     Balls Runners      TB   Out% TB/Runner
Short   G   L    F      5186    1963    2071   .622    1.05
Short   G   L    T        11      11      11   .000    1.00
Short   G   R    F     13530    4164    4566   .692    1.10
Short   G   R    T         3       3       3   .000    1.00
Short   L   L    F      2284    1947    1951   .148    1.00
Short   L   R    F      3452    2937    2938   .149    1.00
Short   P   L    F      1295      53      56   .959    1.06
Short   P   L    T         2       2       2   .000    1.00
Short   P   R    F      1282      36      37   .972    1.03
Short   P   R    T         1       0       0  1.000    0.00
Third   G   L    F      1679     454     494   .729    1.09
Third   G   L    T       286     102     106   .643    1.04
Third   G   R    F     10231    2485    2845   .757    1.14
Third   G   R    T       301     111     113   .631    1.02
Third   L   L    F       991     790     998   .203    1.26
Third   L   R    F      2816    2290    3050   .187    1.33
Third   P   L    F      1212      21      26   .982    1.23
Third   P   L    T        17       1       1   .941    1.00
Third   P   R    F       645       6       7   .991    1.24
Third   P   R    T        12       3       3   .750    1.00

You can see that when a ground ball is hit into the third baseman’s area of responsibility, it is turned into an out more often when struck by a right-handed hitter (.757) than by a left-handed one (.729). The tables are turned when the ball is a bunt: bunts are more successful for the offense than ordinary grounders from either side of the plate, and it is the right-hander’s bunts that are converted slightly less often (.631 versus .643). The same general rule for grounders applies to shortstops, who convert them into outs seven percentage points more often when they are hit by a right-handed hitter (.692 versus .622). The effect is present at both positions, presumably because the runner coming from the right-handed batter’s box has an extra step to make on his way to first, and because a grounder pulled to the left side of the infield by a right-handed hitter will typically be one he “rolled over on,” making it less likely to be hit hard. It may also be the case that considering batter handedness subtly accounts for a variation in where balls are fielded when hit by left- and right-handed hitters, although I have no solid evidence to back up that bit of intuition.

Looking more closely, it would also appear that line drives by right-handers are more difficult for third basemen (but not shortstops) to turn into outs, and are more costly when the runner does reach. Although not shown in the table, we find the exact opposite results for grounders in the areas of responsibility for second and first basemen, where right-handed hitters reach more frequently.

Based on a suggestion from a reader, I eliminated from consideration for middle infielders all line drives that resulted in doubles and triples, since it is likely that any line drive within reach of a shortstop or second baseman would, if not caught, result in a single. This is not necessarily the case for corner infielders, however, and so no adjustment has been made.
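For readers who want to experiment, the filter described above might be sketched in Python roughly as follows. The position codes, the "L" hit-type code, and the result labels are invented for illustration, since the column does not describe a data format, and this is not the code behind SFR.

```python
# Hypothetical labels: the column does not specify how positions or results are coded.
MIDDLE_INFIELD = {"SS", "2B"}
EXTRA_BASE_HITS = {"double", "triple"}


def counts_toward_baseline(position: str, hit_type: str, result: str) -> bool:
    """Return False for the line drives excluded under the rule above.

    Line drives ("L") that went for a double or triple are dropped for
    middle infielders only; corner infielders keep every ball.
    """
    if position in MIDDLE_INFIELD and hit_type == "L" and result in EXTRA_BASE_HITS:
        return False
    return True


# A line-drive double is dropped for a shortstop but kept for a third baseman.
print(counts_toward_baseline("SS", "L", "double"))
print(counts_toward_baseline("3B", "L", "double"))
```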

Finally, don’t be alarmed by the small samples you see for bunts to shortstop and pop-up bunts to third. Because these percentages are used only when a ball is credited to a fielder, they are applied at the same low frequency you see in the table. In other words, while the percentages may not be ideal because of the small samples, their impact on the overall values the system produces is minimal.

In any case, this additional context also allows us to break down a corner infielder’s SFR into grounder and bunt components, as we’ll see shortly.
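As a rough illustration of how the two rate columns in the matrix relate to the counts, here is a minimal Python sketch. It assumes Out% is one minus Runners divided by Balls and that TB/Runner is TB divided by Runners, which reproduces the printed rows used here; other rows can differ in the last digit, presumably because the printed ball counts are rounded. The function name and the dictionary layout are my own inventions, and returning 0.0 when no runner reached simply mirrors the 0.00 printed for the one-ball row.

```python
def baseline_rates(balls: float, runners: float, total_bases: float) -> tuple:
    """Return (Out%, TB/Runner), rounded the way the table prints them."""
    out_pct = 1.0 - runners / balls
    # With no runners there is nothing to divide by; the table prints 0.00.
    tb_per_runner = total_bases / runners if runners else 0.0
    return round(out_pct, 3), round(tb_per_runner, 2)


# Keys are (position, hit type, batter hand, bunt flag).
# Values are (Balls, Runners, TB), copied from rows of the table above.
counts = {
    ("Short", "G", "R", False): (13530, 4164, 4566),
    ("Short", "P", "R", True): (1, 0, 0),
    ("Third", "G", "L", True): (286, 102, 106),
    ("Third", "G", "R", False): (10231, 2485, 2845),
    ("Third", "G", "R", True): (301, 111, 113),
}

baseline = {key: baseline_rates(*row) for key, row in counts.items()}

for key, (out_pct, tb_per_runner) in baseline.items():
    print(key, out_pct, tb_per_runner)
```

Keying the matrix on all four attributes is what lets a bunt and an ordinary grounder to the same fielder be judged against different expectations.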

2. A Question of Partitioning. When I discussed the concept of virtual areas of responsibility in last week’s column, I noted that grounders, line drives, and popups not fielded by the player in question had to be partitioned with fielders adjacent to that fielder. That is, batted balls fielded by the left fielder are partitioned between third base and shortstop, balls hit to the center fielder are partitioned between short and second, and balls to right field are split between second base and first base.

Because our focus was more on the quick and dirty effort last week, very little thought was put into how that partitioning should be done. Thanks to several readers, I reconsidered the problem this week and came up with a simple and tentative way of allocating batted balls in these shared areas of responsibility. Simply put, the system now partitions batted balls (again, those not fielded by the position we’re analyzing) in the proportion that we find “in the wild” for balls we know were fielded by the positions participating in the split. So for our three areas of partitioning, the splits for ground balls in 2007 look like this:

diamond chart 1

diamond chart 2

Here we see that for ground balls hit by right-handed hitters that make it into left field, we assign 45 percent of them to the third baseman, and 55 percent to the shortstop. Since we have no way of knowing on which side of second base the center fielder was when he fielded the ball, we’re content to make that one a 50/50 split (although one could argue that the distribution up the middle on such balls might follow the same distribution of balls in general, so for a right-handed hitter we would split it something like 80/20 and for a left-hander more like 70/30 to the pull side). For ground balls hit by right-handers and fielded by the right fielder, we then assign 79 percent of them to the second baseman and 21 percent to the first baseman. The chart for left-handed hitters is similar, although it’s clear that lefties spray the ball a bit more, a fact confirmed by looking at BIPChart.
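The idea of splitting in the proportion found “in the wild” can be sketched in a few lines of Python. The counts below are hypothetical, chosen only so that the result lands on the 45/55 split quoted above; the actual counts behind the 2007 splits are not given in this column, and the even-split fallback for an empty sample is my own assumption.

```python
def split_shares(fielded_a: int, fielded_b: int) -> tuple:
    """Share of a contested ball given to each of two adjacent infielders.

    The shares follow the proportion of balls each position is known
    to have fielded itself.
    """
    total = fielded_a + fielded_b
    if total == 0:
        # Assumption: with no observations, fall back to an even split.
        return 0.5, 0.5
    share_a = fielded_a / total
    return share_a, 1.0 - share_a


# Hypothetical counts for grounders by right-handed hitters that the
# third baseman and the shortstop fielded themselves.
third_share, short_share = split_shares(450, 550)
print(round(third_share, 2), round(short_share, 2))
```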

We create similar distributions for line drives and popups, all of which are shown below:

Area          Hit B    Split
Third/Short   G   L    28/72
Short/Second  G   L    50/50
Second/First  G   L    60/40
Third/Short   G   R    45/55
Short/Second  G   R    50/50
Second/First  G   R    79/21
Third/Short   L   L    37/63
Short/Second  L   L    50/50
Second/First  L   L    55/45
Third/Short   L   R    50/50
Short/Second  L   R    50/50
Second/First  L   R    73/27
Third/Short   P   L    48/52
Short/Second  P   L    50/50
Second/First  P   L    74/26
Third/Short   P   R    33/66
Short/Second  P   R    50/50
Second/First  P   R    53/47

Using these distributions we can now assign a more accurate portion of the balls in play to each position, which will adjust the number of balls assigned to each player, and thus the final SFR, accordingly. One final note on this adjustment: in the previous column I mentioned that all ground balls fielded by the left fielder were assigned to both the shortstop and the third baseman. After thinking about it a bit more, I decided to use the partitioning rules discussed here, which never double-count a ball but instead, like Solomon, divide it and assign partial responsibility to each fielder. This has two effects. First, you’ll notice that the number of balls assigned to most fielders has gone down from what was shown last week. For example, Troy Tulowitzki went from 1,179 balls last week to 974 balls after making these adjustments. Second, the Out% shown in the first table above is higher when aggregated than for the matrix shown last week, since all of the balls that were removed were hits.
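To make the no-double-counting point concrete, here is a small Python sketch that applies the ground-ball splits from the table above. It relies on the earlier description that left field is shared by third and short, center by short and second, and right by second and first. The position codes, the function name, and the three sample balls are hypothetical, and the sketch stands in for the idea rather than for the actual implementation.

```python
from collections import defaultdict

# Ground-ball splits copied from the table above, keyed by
# (outfielder who fielded the ball, batter hand).
GROUNDER_SPLITS = {
    ("LF", "L"): {"3B": 0.28, "SS": 0.72},
    ("CF", "L"): {"SS": 0.50, "2B": 0.50},
    ("RF", "L"): {"2B": 0.60, "1B": 0.40},
    ("LF", "R"): {"3B": 0.45, "SS": 0.55},
    ("CF", "R"): {"SS": 0.50, "2B": 0.50},
    ("RF", "R"): {"2B": 0.79, "1B": 0.21},
}


def assign_balls(outfield_grounders):
    """Divide each grounder that reached the outfield between two infielders.

    outfield_grounders is an iterable of (outfielder, batter hand) pairs.
    Returns a dict mapping each infield position to its fractional ball count.
    """
    totals = defaultdict(float)
    for outfielder, bat_hand in outfield_grounders:
        for position, share in GROUNDER_SPLITS[(outfielder, bat_hand)].items():
            totals[position] += share
    return dict(totals)


# Three hypothetical grounders: two by right-handers into left field,
# one by a left-hander into right field.
sample = [("LF", "R"), ("LF", "R"), ("RF", "L")]
assigned = assign_balls(sample)
print(assigned)
print(sum(assigned.values()))  # each ball is counted once in total
```

Because every pair of shares sums to one, the fractional totals across positions add back up to the number of balls, which is why the ball counts fall relative to last week.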

As an aside, most of the development effort this week was spent on creating the distributions and applying them while not inflating the code base. It remains, at present, at about 1,000 lines of code.

Taking a Test Drive

With those adjustments made, I then re-ran the numbers for 2005 through 2007 and re-ran the regressions against UZR for all infielders except catchers and pitchers. The resulting correlation coefficients for the 988 player seasons are shown in the following table:

Position    r
All        .781
1B         .649
2B         .780
3B         .776
SS         .818

Overall, we’ve gone from an r of .75 to .78, and you can see that the correlations are pretty good for all positions except first base. I’ll admit that I don’t understand exactly why that might be the case. In addition to producing a higher correlation, these adjustments also bring SFR in line with UZR in terms of standard deviation and range in this particular data set. Previously, the spread for SFR was over a run higher, while now it’s less than half a run lower, and although SFR recorded a +38 for Adam Everett in 2006 while UZR has him at +48, the other extreme high and low values match up rather nicely. Although it is difficult to discern, the scatter plot below does indeed look just a little tighter than in the previous column:


At the suggestion of readers, I also ran the numbers once excluding line drives, and once excluding both popups and line drives; the results were very interesting:

With Line Drives Removed
Position     r
All        .817
1B         .658
2B         .800
3B         .815
SS         .863

With Line Drives and Popups Removed
Position     r
All        .821
1B         .666
2B         .810
3B         .827
SS         .860

While these correlations are indeed higher, in both cases the standard deviation drops to a run less than UZR, while the overall range constricts slightly. What this probably indicates is that line drives fielded by outfielders have little value, and therefore we should be considering only those line drives fielded by the infielder. The current algorithm would have to be adjusted to do so, but this will certainly be on the list of adjustments to try on the next pass. Popups also seem to have more relevance for shortstops than the other positions (at least in terms of correlating to UZR).

Finally, here are the recalculated 2007 leaders and trailers using the adjustments discussed in this article; as in the last article, they are shown alongside the Davenport FRAA:

First Basemen 2007
Name               AdjG  Balls  FRAA   SFR
Albert Pujols     149.6   464    22     12
Todd Helton       148.2   419    10     11
Justin Morneau    142.0   379    14     11
Kevin Youkilis    123.2   329    17      8
Casey Kotchman    116.6   303     9      7
Lyle Overbay      108.8   298     5      6
Adam LaRoche      145.8   381    -1      5
Mark Teixeira     123.5   336    -7      5
Darin Erstad       19.5    62     1      3
Craig Wilson       15.7    45    -1      3
Jeff Conine        56.8   137     2     -5
Ryan Garko        118.2   303    -1     -7
Dmitri Young       99.1   240    -7     -7
Mike Jacobs       101.4   259   -11     -8
Prince Fielder    150.1   392   -15     -9

Albert Pujols still comes out on top, but the adjustments to the system severely limit the number of balls in his–and all first basemen’s–area of responsibility, and thus the range of SFR is now lower. As noted above, first base has the lowest correlation, so there is still likely a factor missing from the current algorithm. The above table includes bunts, but with bunts now factored into the equation we can show the leaders and trailers when fielding bunts as well:

First Basemen on Bunts 2007
Name               AdjG  Balls SFR
Adrian Gonzalez   160.6    26    2
Todd Helton       148.2    19    2
Mike Jacobs       101.4    13    1
Ryan Garko        118.2    15   -2
Ryan Klesko        89.7    25   -2
Adam LaRoche      145.8    21   -2

Certainly, Todd Helton has a good reputation in this regard, while Ryan Klesko does not.

Second Basemen 2007
Name               AdjG  Balls  FRAA   SFR
Aaron Hill        157.7   900     0     27
Mark Ellis        147.9   867    27     27
Kazuo Matsui       95.6   527    14     23
Dustin Pedroia    128.6   660     2     12
Brian Roberts     149.7   799     1     11
Robinson Cano     157.3   871    26     10
Geoff Blum         54.3   288    -2      9
Marcus Giles      104.1   602     6      9
Ian Kinsler       128.8   758     3      9
Alex Cora          33.5   162     1      7
Brandon Phillips  153.3   840    15    -12
Craig Biggio      103.6   500   -17    -13
Freddy Sanchez    142.4   688    -7    -17
Rickie Weeks      110.4   559   -13    -21
Dan Uggla         155.3   814    14    -31

Aaron Hill remains somewhat of a mystery between the two systems, while Mark Ellis is spot-on. Dan Uggla and Rickie Weeks remain on the bottom, although Brandon Phillips looks far worse in SFR than in FRAA.

Shortstops 2007
Name               AdjG  Balls  FRAA   SFR
Omar Vizquel      135.9   774    10     35
Troy Tulowitzki   152.3   974    24     21
Khalil Greene     153.3   832    -8     17
Jason Bartlett    134.7   764     8     14
Jimmy Rollins     160.1   875     8     11
Jose Reyes        159.8   817     4     10
John McDonald      89.4   504    12     10
Tony Pena         143.6   804    12      8
Adam Everett       59.3   339     4      7
Ryan Theriot       96.2   495    -7      6
Josh Wilson        51.3   265   -11    -11
Carlos Guillen    120.3   691   -12    -14
Derek Jeter       147.4   773    -6    -16
Hanley Ramirez    146.1   826    -8    -18
Brendan Harris     85.2   462   -13    -19

Khalil Greene gets knocked down a peg from last week (although the systems still disagree) while both Omar Vizquel and Troy Tulowitzki lose a few runs as well. Brendan Harris now takes over the bottom spot while Hanley Ramirez looks six runs better and Derek Jeter treads water.

Third Basemen 2007
Name               AdjG  Balls  FRAA   SFR
Pedro Feliz       136.1   521    14     25
Aramis Ramirez    122.2   486    17     13
Scott Rolen       105.6   421    16     13
Ryan Zimmerman    160.3   655    21     11
Mike Lowell       149.2   513    14     10
Nick Punto         93.4   333     1      9
Joe Crede          43.7   176     8      8
Mark DeRosa        32.2   119     2      7
Chipper Jones     120.3   414     1      7
Abraham Nunez      66.0   300     8      7
Wilson Betemit     45.7   156    -5     -6
Edwin Encarnacion 130.6   486   -11     -6
Alex Gordon       128.0   491     0     -7
Miguel Cabrera    147.2   543     5     -9
Ryan Braun        106.1   378   -25    -31

The top third basemen stay relatively constant, although Aramis Ramirez moves up a few spots since David Wright falls off the chart: he goes from +14 last week to +6 this week, in line with his FRAA of +5. Under these new rules Ryan Braun picks up nine runs but still ends up at -31, while Garrett Atkins goes from -12 to -5 and is no longer among the trailers.

Third Basemen on Bunts 2007
Name               AdjG  Balls SFR
Pedro Feliz       136.1    20    4
Miguel Cabrera    147.2    26    3
Chipper Jones     120.3    19    3
Troy Glaus        103.8     7   -3
Jose Castillo      28.4     9   -3

In looking at bunts only, it’s interesting to see Miguel Cabrera place second when his overall number (-9) is so low. An ability to field bunts appears to be what catapulted Chipper Jones into the overall leaders as well.

Before closing, let’s make one more comparison. Since the correlation between SFR and FRAA is fairly low–as documented last week–I thought it would be appropriate to get a feel for how SFR looks when compared to UZR by showing the top and bottom fifteen SFR scores from 2005 and 2006, and their UZR equivalents.

Name                 Year   Pos    Balls    UZR     SFR
Adam Everett          2006   6      763      48      38
Rafael Furcal         2005   6      849      18      25
Jack Wilson           2005   6      864      15      24
Craig Counsell        2005   4      773      26      22
Jamey Carroll         2006   4      615      25      22
Mark Grudzielanek     2005   4      678      22      22
Brandon Inge          2006   5      701      20      21
Mark Ellis            2005   4      568      17      20
Chase Utley           2005   4      701      21      20
Jose Valentin         2006   4      480      17      19
Mike Lowell           2006   5      615      20      18
Eric Chavez           2005   5      546      10      18
Joe Crede             2006   5      602      18      17
Juan Uribe            2005   6      780      16      17
Bobby Crosby          2005   6      428       7      17
Edgar Renteria        2005   6      773     -14     -14
Carlos Delgado        2005   3      381     -12     -14
Rickie Weeks          2005   4      489     -21     -14
Jorge Cantu           2005   5      194     -20     -15
Jorge Cantu           2005   4      382     -10     -15
Russ Adams            2005   6      653     -22     -15
Angel Berroa          2006   6      742     -14     -16
Felipe Lopez          2005   6      740      -5     -16
Jose Lopez            2006   4      818       6     -16
Mark Teahen           2005   5      518     -22     -18
Alfonso Soriano       2005   4      839     -15     -20
Angel Berroa          2005   6      889     -16     -22
Jose Castillo         2006   4      703     -11     -22
Jorge Cantu           2006   4      526     -25     -27
Michael Young         2005   6      872     -30     -33

Here you can see the strong correlation, as both systems peg the same players as defenders who excel and those who… well, don’t. Jorge Cantu has the distinction of making the trailers list three times in two seasons thanks to being awful at both second and third in 2005.
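For anyone who wants to check the agreement on this list for themselves, the sketch below computes a Pearson correlation coefficient from the thirty (UZR, SFR) pairs in the table above, using the standard textbook formula and nothing beyond the Python standard library. Because the list was selected for extreme SFR values, whatever r it yields should not be compared directly with the .78 reported for all 988 player seasons.

```python
from math import sqrt

# (UZR, SFR) pairs copied from the table above: top fifteen, then bottom fifteen.
PAIRS = [
    (48, 38), (18, 25), (15, 24), (26, 22), (25, 22),
    (22, 22), (20, 21), (17, 20), (21, 20), (17, 19),
    (20, 18), (10, 18), (18, 17), (16, 17), (7, 17),
    (-14, -14), (-12, -14), (-21, -14), (-20, -15), (-10, -15),
    (-22, -15), (-14, -16), (-5, -16), (6, -16), (-22, -18),
    (-15, -20), (-16, -22), (-11, -22), (-25, -27), (-30, -33),
]


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


r = pearson_r(PAIRS)
print(round(r, 3))
```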

Baby Steps

Like Darwin’s earthworms, the wheels of progress sometimes move slowly. But as Darwin showed, small steps can have a powerful cumulative effect in the long run. Thanks again for marking out a path for some of those steps.