

July 16, 2012

Baseball Therapy

It's a Small Sample Size After All

Who said sabermetrics hasn't gone mainstream? We've now reached the point where even mainstream analysts are yelling "small sample size!" at one another. There's always been some understanding that a player who goes 4-for-5 in a game is not really an .800 hitter, but now, people are being more explicit in talking about sample size. I consider that a victory. Hooray for sabermetrics!

How big does a sample size need to be before it stops being... small? As I understand it, the most commonly cited study on the topic was written by a man codenamed "Pizza Cutter" almost five years ago at a blog that no longer exists. Mr. Cutter's idea was that he'd look at something called split-half reliability. (BP's Derek Carty did something similar a while ago, but picked plate appearances at random.) Mr. Cutter took two equal samples of the same number of PA for a bunch of players and checked to see how well they correlated with one another. The idea was that as the sample grows, a statistic becomes more and more "stable," meaning that it becomes a better indicator of a player's true talent level over that time frame.

After reading the original Pizza Cutter article, I am amazed that anyone pays attention to this study given its many methodological flaws. Among them:
Let's see if we can make this better. I propose to replicate Mr. Cutter's study with much better methodology. As always, if the numbers scare you, you can close your eyes for the next part and skip to "The Results."
Warning! Gory Mathematical Details Ahead!

The data were Retrosheet play-by-play logs from 2003-2011. Pitchers batting were eliminated, as were all intentional walks (I counted them as never happening). Only batters who had at least 2000 PA in that time frame were selected. There were 311 such batters. All batter PAs were lined up chronologically and numbered in order, and I took the first 2000 PAs for each batter. Because each sample gets split into two halves, this means that I was able to get reliability coefficients on samples up to 1000 PA. For stats that had other denominators, such as batting average (ABs) or grounders (balls in play), I note the inclusion criteria in the chart below.

This time, instead of splitting up the sample into evens-and-odds as Mr. Cutter did, I used a much better methodology, the Kuder-Richardson reliability formula. (For the initiated, I used KR-21, a derivative of Cronbach's alpha. There were a couple of cases where the outcome was not binary, SLG and ISO, and for those I used Cronbach.) The baseball statistics in which we are most interested are binary outcomes (strikeout rate is a yes/no question of whether the batter struck out over a series of PA), and Kuder-Richardson specifically assesses measure reliability in binary outcomes.

The formula is available elsewhere online, but the basic idea is this. Imagine that you had a sample of six PAs for a bunch of hitters. Now imagine if, instead of splitting them 1-3-5 vs. 2-4-6 (i.e., evens and odds), you could split them into every single possible combination available and correlate those two halves. So, you could see what the correlation between 1-2-3 and 4-5-6 would be, or the correlation between 1-2-4 and 3-5-6. Then, let's say that you could take the average of all of those correlations. Mathematically, that's what Kuder-Richardson (and Cronbach) does. So, if I have a sample of 500 PAs for a list of batters, this method will tell me what happens when you split that into a pair of 250 PA samples in every possible way.
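For readers who want to see the arithmetic, here is a minimal Python sketch of the standard textbook forms of Cronbach's alpha and KR-21. It is not the SPSS procedure used for this article. The function names and the toy data are made up for illustration, and it assumes every batter's PAs are coded as a list of the same length (1 if the outcome happened, 0 if not).

```python
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha. `rows` holds one list per batter, one value per PA.

    Works for any numeric coding (e.g., total bases per AB); with 0/1
    outcomes and population variances it is the same number as KR-20.
    """
    k = len(rows[0])  # number of PAs per batter
    # Variance of each PA "slot" across batters.
    item_vars = [pvariance([row[j] for row in rows]) for j in range(k)]
    # Variance of each batter's total across batters.
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def kr21(rows):
    """KR-21 for 0/1 outcomes: needs only the mean and variance of the totals.

    It treats every PA slot as equally "difficult", so it can never exceed
    KR-20 and matches it only when that assumption holds exactly.
    """
    k = len(rows[0])
    totals = [sum(row) for row in rows]
    m = mean(totals)
    var = pvariance(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))


# Hypothetical toy data: 4 batters, 4 PAs each, 1 = struck out.
toy = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

print(round(cronbach_alpha(toy), 3), round(kr21(toy), 3))
```

In this toy data every PA slot has the same strikeout rate, so the two coefficients agree. With real play-by-play data, each row would be one batter's first N qualifying PAs in chronological order.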
Pooling every possible split this way gives a much better estimate of how reliable an indicator of a player's true talent level a statistic is over 250 PA. Of course, we know some stats reach higher levels of reliability at lower levels of PA, but it's interesting to note which ones are which and what that says about player evaluation as the season goes along.

I looked for the place where reliability passed .70, which is about the only thing that Mr. Pizza Cutter got right. At .70, the share of signal relative to noise crosses the halfway point (.707 * .707 ≈ 50%). Of course, with any sort of bright line, there's always the objection that it's a black/white contrast where 50 shades of grey are called for. I don't know what else to say other than "Yeah, I know."

The Results
Hopefully, Colin Wyers won't kill me for using Retrosheet batted ball classifications.

* In some cases, the magic .70 mark was not reached within the constraints of the data set, so I used the Spearman-Brown prophecy formula to estimate at what point .70 was most likely to occur.
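As a rough illustration of that extrapolation, here is a short Python sketch of the standard Spearman-Brown prophecy formula and its inverse. The function names are made up for this example, and the reliability of .50 at 1000 PA is a hypothetical input, not a figure from the data set.

```python
def spearman_brown(r, factor):
    """Predicted reliability if the sample were `factor` times as long."""
    return factor * r / (1 + (factor - 1) * r)


def length_factor_for(r, target=0.70):
    """How many times longer the sample must be for reliability to hit `target`."""
    return target * (1 - r) / (r * (1 - target))


def pa_needed(r, n_pa, target=0.70):
    """Estimated PA at which reliability reaches `target`, given r at n_pa."""
    return n_pa * length_factor_for(r, target)


# Hypothetical example: a stat whose reliability is only .50 at 1000 PA.
print(round(pa_needed(0.50, 1000)))  # about 2333 PA
```

The formula assumes that the extra plate appearances behave like the ones already observed, so estimates far beyond the range of the data deserve extra caution.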
What it means

Perhaps we need to talk about the five factual outcomes for hitters? I realize that TTO (three true outcomes) is meant to describe a hitter like Adam Dunn or Jack Cust who has a style of play that emphasizes those three outcomes. However, between 2007 and 2010, when the two of them were duking it out for the title of TTO king, Cust began to see his rate of singles rise (while his HR fell), while Dunn hit comparatively fewer singles and kept his HRs (freakishly) consistent.

Rates of doubles and triples were an odd duck. There's been a certain sabermetric (should I use the word fetish here?) over the past few years for guys who have high doubles numbers, but whom the market overlooks because they don't have sexy HR totals. Those doubles numbers may be illusions. The home run numbers are more likely to be real. Caveat emptor. Or amator.

Ground balls and fly balls stabilize at roughly the same time (and quickly!). Skill in producing line drives is subject to much more noise. Again, Colin Wyers has written over and over that it's hard to trust a classification of a line drive because it's a subjective judgment. But even trusting that Retrosheet is 100 percent correct, a player's line drive rate will likely vary a lot, while his GB/FB ratio will be quick to stabilize. Some players are GB hitters, some are FB hitters, but line drives occasionally happen and it's hard to know why.

Overall, these numbers aren't vastly different from the original article by Pizza Cutter, but the methodological improvements that I've made take away some of the concerns that could be raised about the originals. The techniques are a little more obscure, but after five years, it's time for an update. If I see some other older works that might benefit from some methodological sprucing up, especially from this Pizza Cutter guy, I might look into doing just that. (If there's a stat that you wish I had done, leave it in the comments, and I will do my best to get around to it. Let's stick to hitters for now.)
Next time, we'll talk about how these numbers are often misused and what they can and can't be used to show.
Russell A. Carleton is an author of Baseball Prospectus. Follow @pizzacutter4
16 comments have been left for this article.

piraino (59490)
Coming at this from another angle... If you assume every plate appearance is an independent random trial, you can actually compute confidence intervals around (e.g.) a batting average. Roughly, that line of reasoning leads you to the conclusion that 100 at bats gives you a confidence interval of about ±.100. 400 at bats gives you a confidence interval of ±.050, 1600 ABs: ±.025. For Wade Boggs' entire career of about 9000ish at bats, the confidence interval is still ±.010 around his .328 average. And I expect that these confidence intervals are, if anything, too narrow. A plate appearance is not an independent random trial because players run hot-and-cold, and their abilities improve and decline over time, so there is also unaccounted-for time-series variation. Very similar logic would apply to something like a walk rate or a home run rate, which we normally think of as more reliable (and which stabilizes faster according to your methodology). The first conclusion I'd draw is that we are kidding ourselves when we report the 3rd digit of detail on anyone's batting average. We should just say Robinson Cano is hitting 32% (plus or minus 7%). It'd be nice to see somebody tackle the issue of how useful statistical analysis can be in baseball given that we're typically drawing inferences from such imprecise statistics.
Jul 16, 2012 06:24 AM

Ian in Chicago (57950)
Very nice article. I think this article may help answer a related question: "How quickly should we forget the past?"
Jul 16, 2012 08:16 AM

joseconsuervo (59340)
Is there some sort of publicly available Retrosheet parser? I find it very cool that you can specify data set modifiers like completely ignoring intentional walks. Is this an in-house creation or some modification of publicly available parsers? Either way, this was a great article.
Jul 16, 2012 08:21 AM

I use the statistical program SPSS for my work, which can be publicly bought (although it is expensive). The good news is that any spreadsheet-style program can handle Retrosheet data.
Jul 16, 2012 08:50 AM

Ian in Chicago (57950)
SPSS is fine of course, but for the budget-conscious, this can all be done in R (which is free!). Here's a nice intro from Jim Albert, who co-wrote "Curve Ball: Baseball, Statistics, and the Role of Chance in the Game":
Jul 16, 2012 09:49 AM

joepeta (35285)
Russell, Great update to a classic column. I had fun comparing this one to the "525,600 minute" version that is at FG. Picking up on Ian in Chicago, how do you weigh projections vs. 2012 performance? Now that most regular starters have 300 PA, how do you merge a HR projection with actual? Let's look at Pujols: He's hitting HRs at a 3.9% per PA clip (I didn't remove IBB for sake of speed) compared to the final preseason PECOTA of 5.4%. How do you weight each going forward? Am I correct in assuming at the 170 PA mark (when r = .7) you'd take 50% of each? Thanks for the work.
Jul 16, 2012 11:39 AM

BurrRutledge (18981)
So, hey, that pizza cutter guy was on to something. Hope he's still got gainful employment somewhere if that baseball statistics hobby/career path doesn't pan out.
Jul 16, 2012 13:27 PM

kuma32478 (30525)
Pardon my ignorance, but what exactly do we mean when we say that "GB rates stabilize..."?
Jul 16, 2012 14:27 PM

I'll talk about that in next week's article. The short version is that you can be fairly certain that a player's GB rate over those 80 PA was reflective of his true talent _over that period of time_. But true talent unto itself can change and does so more rapidly than we'd like to think.
Jul 16, 2012 19:16 PM

Great article.
How about some swing diagnostic metrics?
O-Swing%
Z-Swing%
Swing% (swings/pitches)
O-Contact%
Z-Contact%
Contact% (ball in play + foul / swings)
Zone%
F-Strike%
SwStr%
I had dreams of doing these initially... I'll see if I can fire up my PFX database later.