

February 10, 2010

Introducing SIERA, Part 3

Earlier this week we introduced the run estimator SIERA, providing a general summary of its purpose as well as the evolution of its development. Today, in Part 3, our focus will shift to the quantitative side of the metric, offering a detailed look at the data used to derive the formula as well as specifics pertaining to the regression analysis techniques used. The transparency should provide a better understanding of the integrity of such a process as well as a few insights into the SIERA-laden approach towards pitcher valuations.

The Data

All data used throughout this process, be it for the calculation of SIERA or of the various other comparative estimators, came from Retrosheet, a monumental achievement in the world of data without which several advancements in the field would not exist. The first step involved extracting seasonal tallies from the main events table, with statistics grouped by pitcher, team, and year. This way, a pitcher with stints on various clubs in the same season carries a separate entry for each; Cliff Lee, for example, has one entry as an Indians pitcher and another as a Phillies pitcher last season.

Next, using the Lahman database, the pitching park factor (PPF) was added to each row in the table. Park-adjusted ERA was then calculated, though only half of the park factor was applied to the individual pitchers given that only half of a team’s games are played at the home stadium. If a pitcher ended up with a PPF of 105, instead of taking 95 percent of his ERA, 97.5 percent was taken, equating to one-half of the difference between the actual park he called home and one considered to be neutral.

With the adjustment applied to raw ERA, the next issue to address involved batted ball reliability. While Retrosheet provides a fantastic wealth of information, batted ball data is realistically only usable from 2003 to the present. The major reason is that the way balls put in play were scored has not been consistent.
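The half-weight park adjustment described in The Data can be sketched in a few lines of Python. This is only an illustration: the function name, the `home_share` parameter, and the choice to take the article's wording literally (scaling ERA by 97.5 percent for a PPF of 105, rather than dividing by 1.025) are assumptions, not the authors' actual code.

```python
def park_adjusted_era(era, ppf, home_share=0.5):
    """Scale raw ERA by a fraction of the pitching park factor (PPF).

    Assumptions for illustration: PPF is indexed to 100 (neutral park),
    and the adjustment is subtractive, as the article's 105 -> 97.5 percent
    example suggests. home_share=0.5 reflects half of the games being at home.
    """
    park_effect = (ppf - 100.0) / 100.0          # e.g. 105 -> 0.05
    return era * (1.0 - home_share * park_effect)


# A 4.00 ERA in a park with a PPF of 105 keeps 97.5 percent of its value.
print(park_adjusted_era(4.00, 105))
```

With `home_share=1.0` the same function would apply the full park factor, which is the case the article contrasts against.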
Before 2003, batted ball types were only recorded on outs, meaning that a ground ball single through the third base hole was recorded simply as a single while a ground out to the second baseman went down as a grounder. Both are ground balls, but this rather vast issue precludes the usage of batted ball data prior to that season. Given this restriction, only data from 2003-09 moved on to the next round. With that table in place, the QERA formula was unfoiled (multiplied out) and the nine emerging terms were calculated for each row in the database table. The data was then ready for further processing and rigorous study.

The Results

SIERA was first estimated with 10 parameters: an intercept and the nine aforementioned terms that surface once QERA is unfoiled, which involved regressing park-adjusted ERA on all nine terms. The results can be seen below:

VARIABLE                     COEF.   T-STAT   P-STAT
Constant                     6.368    16.97    0.000
SO/PA                       18.341     7.10    0.000
BB/PA                        9.471     2.00    0.046
(GB-FB-PU)/PA                1.807     1.60    0.110
(SO/PA)^2                   10.254     1.98    0.048
(BB/PA)^2                    6.833     0.33    0.742
((GB-FB-PU)/PA)^2            7.063     3.93    0.000
((GB-FB-PU)/PA)*(SO/PA)      9.661     2.38    0.017
((GB-FB-PU)/PA)*(BB/PA)      3.208     0.44    0.661
(BB/PA)*(SO/PA)              2.828     0.18    0.857

Before getting into what the data originally said, a description of the columns is in order. The first column lists the variable in question, and the second lists the coefficient estimated by the regression. The t-statistic describes how many standard errors from zero the coefficient strayed, and the p-statistic gives the probability that a coefficient that far from zero would surface if the effect of the variable on park-adjusted ERA were actually zero. It is commonly accepted that coefficients with p-stats below .05 or .10 are probably different from zero. Unfortunately, reliable data for balls in play only exists from 2003-09, which leaves too small a sample for several coefficients that make intuitive sense to reach significance.
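To make the column descriptions concrete, here is a minimal Python sketch of two pieces of the process: building the nine unfoiled terms for a single pitcher-team-season row, and converting a t-statistic into a two-sided p-statistic. The function names and the sample counts are hypothetical, and the p-value uses a large-sample normal approximation, which is an assumption; the article does not say which distribution its regression software used.

```python
import math


def unfoiled_terms(so, bb, gb, fb, pu, pa):
    """Return the nine non-constant regressors for one row of seasonal tallies."""
    k = so / pa                # SO/PA
    w = bb / pa                # BB/PA
    g = (gb - fb - pu) / pa    # (GB-FB-PU)/PA
    return {
        "SO/PA": k,
        "BB/PA": w,
        "(GB-FB-PU)/PA": g,
        "(SO/PA)^2": k * k,
        "(BB/PA)^2": w * w,
        "((GB-FB-PU)/PA)^2": g * g,
        "((GB-FB-PU)/PA)*(SO/PA)": g * k,
        "((GB-FB-PU)/PA)*(BB/PA)": g * w,
        "(BB/PA)*(SO/PA)": w * k,
    }


def two_sided_p(t_stat):
    """Two-sided p-value for a t-statistic, using the normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# Hypothetical tallies, purely for illustration.
row = unfoiled_terms(so=200, bb=50, gb=300, fb=200, pu=20, pa=800)
print(len(row))                      # nine terms, plus the intercept makes ten
print(round(two_sided_p(2.00), 3))   # compare with the BB/PA row of the table
```

Under this approximation, a t-statistic of 2.00 maps to a p-statistic of roughly .046 and one of 1.60 to roughly .110, which lines up with the BB/PA and (GB-FB-PU)/PA rows above.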
Our intuition helped to build this model, with the understanding that as pitchers get back on the mound and throw more games, even more accurate results can be had. Note that the above table does not show the final formula for SIERA, but rather the original estimation, in which park-adjusted ERA was regressed on the entire unfoiled QERA formula. Also note that our original tests used only 2003-08 data; 2009 was held out so that the regression could eventually be tested honestly on outside data. The table above, however, includes 2009 data so that it can be contrasted with the table below.

What immediately stands out is that the quadratic term for walks is not significant. The .74 p-stat indicates that there is a 74 percent chance of getting a value further from zero than 6.833 if the true quadratic effect of walks on ERA were zero. The conclusion: the effect of walks on ERA is linear, though perhaps with interactions with strikeouts or ground balls. It is also evident that the interaction between strikeouts and walks is insignificant as well. This seems plausible, seeing as there is no reason to assume walks increase ERA more for high-strikeout pitchers than for those with low whiff totals.

Two quadratic terms are significant, as is an interaction term. The interaction between walks and ground balls could have been dropped, but intuition chimed in and kept it afloat, because the significance of the interaction between strikeouts and ground balls forces honesty and requires the presence of the walk interaction as well. The reason the strikeout and ground ball interaction is believed to be clinically significant is that pitchers who strike more batters out allow fewer singles and need fewer double plays. The same logic holds for walks. Removing the other two insignificant terms moves the walk and ground ball interaction term closer to significance, but still far from it.
It is our belief that including this interaction gives a more accurate prediction of a pitcher’s skill level and that the coefficient is insignificant only because the sample size is too small. Some of the other effects are even crisper when the regression is analyzed with the two insignificant terms removed:

VARIABLE                     COEF.   T-STAT   P-STAT
Constant                     6.262    28.07    0.000
SO/PA                       18.055     8.39    0.000
BB/PA                       11.292    12.81    0.000
(GB-FB-PU)/PA                1.721     1.57    0.116
(SO/PA)^2                   10.169     1.97    0.049
((GB-FB-PU)/PA)^2            7.069     3.94    0.000
((GB-FB-PU)/PA)*(SO/PA)      9.561     2.38    0.017
((GB-FB-PU)/PA)*(BB/PA)      4.027     0.58    0.563

Four terms are worthy of further explanation, as they are significant or close enough to significant, as in the case of the linear term in (GB-FB-PU)/PA, whose square proved to be significant. Each will be explained separately:
Thus, these four points have shown us that strikeouts have a diminishing return as you accrue more of them, ground balls have an increasing return the higher your tally, and ground balls are more beneficial to pitchers who allow more walks or balls in play, especially because fly balls are more detrimental to pitchers who allow more runners on base. How beneficial are these results? In Part 4 of our introductory series on SIERA, the estimator will be put to the test both at predicting same-year ERA better than other estimators that use similar statistics and at predicting future-year ERA better than any other estimator out there.
Matt Swartz is an author of Baseball Prospectus.

35 comments have been left for this article.

Clonod (35609)
I don't think the rollover glossary text for SIERA is long enough, guys.
Feb 10, 2010 10:48 AM

You're right... we should add in more paragraphs.
Feb 10, 2010 12:58 PM

Dan W. (42065)
To nitpick on the above point, perhaps it's my smallish laptop, but the rollovers for SIERA (and QERA, and I think EQA, too) are all too long for one page, and can't be scrolled through. Maybe your rollover popup should be wider? Or am I the only one who hasn't figured out a workaround for this?
Feb 10, 2010 15:55 PM

Nathan J. Miller (10465)
Perhaps a stupid question on Park Factors... But if you have the Retrosheet game data and you have the individual Park Factors, why wouldn't you Park-Adjust on the per-game level before aggregating ERA rather than applying it after the aggregation and just assuming that half the games pitched were at home? Or is this just because it's much easier to do that on the inputs than on the projections?
Feb 10, 2010 10:53 AM

Yeah, it was just easier to approximate that half the games were at home. I can't imagine that there would be enough noise there to affect the results. I guess also since we were running a regression with park-adjusted ERA as the dependent variable, it was less important to be precise with park-adjusted ERA because the coefficients would be unbiased even if the park adjustment was noisy. Noisy independent variables would bias the coefficients towards zero, though, so I think that would have been a bigger issue.
Feb 10, 2010 11:58 AM

Dr. Dave (1652)
Question: are you folding HBP into BB, or ignoring them? With some starters in the 15+ HBP per season category, it could make a clinical difference.
Feb 10, 2010 11:06 AM

We didn't include HBP. I just did a little rechecking, and I remembered correctly that it doesn't seem like it would have changed all that much. It might be a small improvement, though, and we might look into it in the future as we get more data, but it was probably too small of a factor to consider. This is a good point, though, and worth checking as we get more years of data, especially if HBP are very persistent, which I suspect they are at least somewhat.
Feb 10, 2010 12:05 PM

Rowen Bell (5629)
Interesting question as to whether HBP is persistent among pitchers. The Strat-O-Matic world view is that HBP is entirely a hitter's skill. Of course, a lot has been learned about baseball since the SOM engine was developed many years ago. Dave Stieb was my boyhood idol, and he certainly had consistently high HBP rates.
Feb 10, 2010 16:33 PM

Dr. Dave (1652)
Somewhere, Ron Hunt and Craig Biggio are systematically shredding a copy of Strat-O-Matic between them...
Feb 10, 2010 22:14 PM

Juris (1283)
Why does the little box "definition" of QERA say that the "formula was described most verbosely by Nate Silver...."? Was this little popup intended as a criticism?
Feb 10, 2010 11:20 AM

The Iron_Throne (4630)
I hope you keep an eye on the BB*GB factor going forward. That p-stat is pretty bad. I know it makes intuitive sense to keep it, but if we're trying to verify our intuition with data, we have to let the data speak.
Feb 10, 2010 11:28 AM

We definitely will be keeping an eye on the BB*GB term. The problem is really that we suspect this term has an effect but that even a perfect term that accurately captures the effect probably would not be statistically significant because we only have 7 years of data.
Feb 10, 2010 12:09 PM

Dr. Dave (1652)
I understand the intuition, but it's not like your p-value was .15 or something; it's huge. The regression is yelling "even with this little data, I can tell this term is totally irrelevant". It's not impossible that this is just bad luck in the sample, but it's really unlikely.
Feb 10, 2010 22:18 PM

I see your point, but it really wouldn't change a single SIERA by .10 and it's a matter of preference. The reason I don't agree is that I think that the effect is real but close to 4.0. So the type II error of rejecting anything less than 15 is very, very high. It's a matter of intuition in this case. Especially given that the variance in GB*SO is high enough that the regression said it was positive.
Feb 10, 2010 22:27 PM

Juris (1283)
This may be an "irrelevant variable" from a statistical standpoint. It's theoretically (or logically) appropriate. But the coefficient isn't statistically significant. However, its inclusion doesn't distort the estimated effects of the other regression coefficients. In that case, it's often reasonable to do exactly what you're doing. I recall a nice old discussion of this in an econometrics book by Rao and Miller.
Feb 11, 2010 07:49 AM

Dr. Dave (1652)
It's not that hard to check for direct correlation among your predictor variables in the model. What does the variance/covariance matrix of the independent variables look like? Many stats packages will provide that as an optional output. It doesn't spot variables that are linear combinations of more than one other variable, but it spots direct correlation of 2 independent variables.
Feb 11, 2010 17:54 PM

Juris (1283)
It may be worth a brief mention that QERA was applied by Nate without any park adjustments. That's perhaps one reason why he regarded it as a shorthand toy.
Feb 10, 2010 11:31 AM

nblascak (40534)
As someone who has some knowledge about statistics, but not a lot, the added logic and explanations in the article are very welcome. Keep up the great and interesting work!
Feb 10, 2010 13:44 PM

Brian Cartwright (4519)
I agree with nahtnJM. As you have the play-by-play, it is not very difficult to do park factors as a weighted mean, counting how many batters each pitcher faced in each park.
Feb 10, 2010 16:40 PM

Eric M. Van (31218)
First of all, and this is a big we-need-to-start-over-and-run-the-numbers-again mistake (although I think ultimately we're only talking about tweaking the coefficients): I am fairly certain that the (Pete Palmer, Total Baseball designed) Park Factors in the Lahman database do not have to be cut in half, because they are already designed to be straight multipliers. Compare them to the straight Run Indexes published each year in the Bill James Annual, or plow your way through the technical explanation at
Feb 10, 2010 17:06 PM

Eric and I can check into the Lahman database park factor issue. I'm not sure about this yet.
Feb 10, 2010 17:33 PM

Eric M. Van (31218)
40 IP is actually lower than I think might be safe; that's about 170 BFP and the Y2Y correlations seem to start falling off more steeply below 200. But probably no big deal*.
Feb 10, 2010 22:27 PM

That actually does bump the GB*BB term to 10 and up to weak significance (p=.07), but why would you take out the linear GB term? It's effectively equivalent to limiting the minimum effect of GB% to exactly where GB=FB+PU.
Feb 10, 2010 22:34 PM

Eric M. Van (31218)
I don't follow this logic at all. As far as I can tell, the minimum effect of GB happens when GB = FB + PU in either form of the metric. When that happens, the linear term, the squared term, and the interaction terms all become zero. When GB > FB + PU or GB < FB + PU the term becomes nonzero and you start to see GB loading on the metric. Your final equation does reflect the reality of the situation (GB rate is minimized at an unknown value) but your constant e is just an unknown portion of the overall constant a + b + e.
Feb 12, 2010 13:08 PM

Tommy Fastball (19193)
You say missing a bat is a good indicator, but a popup isn't. Would not throwing a pitch that the batter swings under be a "super popup"? By your logic in the first article, a pitch that the batter swings under would be a very bad indicator.
Feb 10, 2010 22:07 PM

There probably is, but it's probably canceled out by the grounders and super-grounders, another benefit of the (GB-FB-PU) term, where GB includes balls chopped into the ground in front of the plate as well as balls that one-hop between the SS and 3B. But if the popups happen more often than the choppers, it's an indicator that the pitcher is throwing the ball on a trajectory that generates upwards spin, and therefore is home run prone. The key is that popups/batted ball is correlated with fly ball rate, which is correlated with home run rate.
Feb 10, 2010 22:21 PM

TomLongwell (10234)
Not sure if I have this all correct, but it looks like you are treating the data as a cross-section, correct? It's pretty obvious you have a panel data structure here, so wouldn't it make sense to at least include year and league fixed effects?
Feb 10, 2010 22:40 PM

Hmm... I would really like that approach, only it doesn't do a good job of projecting next-year ERA.
Feb 10, 2010 23:07 PM

TomLongwell (10234)
Glad to help. I guess the panel approach strips out too much of the variation in the data, then.
Feb 11, 2010 02:25 AM

Cool stuff. Just a thought I had: have you tried using (BB-IBB)/PA as your walk rate variable (and maybe adding another variable, IBB/PA)? Presumably all intentional walks are done to reduce the number of runs allowed, while in your regression all walks will increase ERA equally. Then again, I assume the scarcity of intentional walks will make this addition insignificant.
Thanks. We did play around with IBB a little bit, but part of the problem is that it is difficult to differentiate between an IBB where the pitcher gives up after getting into a 2-0 or 3-1 count and a direct IBB from the first pitch, and then to separate even further the difference between those IBB and just pitching around people.
There certainly was some indication that IBB led to fewer runs, particularly with respect to the ground ball term, but at this sample size we figured it was probably best not to do something that could be construed as data mining. We also felt that the gains from distinguishing between BB & IBB seemed negligible anyway. That is a good point, though. Thanks for highlighting it.
The BB and IBB discussion is one Matt and I had for a long, long time, but we ultimately felt that the difficulty in differentiating the types of IBBs muddied the waters and for now just felt more comfortable using the term in its current state. But it is definitely something we were conscious of throughout this process.