February 8, 2012
The Weighting is the Hardest Part
Madame Sosostris, famous clairvoyante,
—T.S. Eliot, The Waste Land
BP’s projection system, at its core, follows the same basic principles as it has before. We begin with our baseline projections, which start with a weighted average of past performance, with decreasing emphasis placed on seasons further removed from the season being projected. Then that performance is regressed to the mean. After that, we use the baseline forecast to find comparable players (while also taking into account things like position and body type) and use those to account for the effects of aging on performance.
Every season we put PECOTA under the knife, looking for things we can improve to make sure we’re coming up with the best forecasts possible. Sometimes what we come up with is a minor tweak. At other times, though, what we unearth is not only more significant, but an interesting baseball insight in its own right, even aside from its inclusion in PECOTA.
This season, we’ve made some rather radical changes to how we handle the weighted averages for the PECOTA baselines—we still deemphasize past seasons, but nowhere near as much as we used to. With such a dramatic and counterintuitive change, we thought it best to give our users an explanation of what was changed and why so that they could correctly use and interpret the PECOTA forecasts.
Last year, I was asked to appear on a Chicago sports talk station to discuss the town’s two teams, in particular how PECOTA saw them faring. I said many things, most of which don’t bear repeating (or for that matter remembering) this far past, but there was one thing I remember saying, and it probably does bear repeating—I expected Adam Dunn to be the best hitter on the White Sox in 2011. Suffice it to say, this statement does not represent my finest hour as a baseball analyst. Consequently, I’ve spent a bit of time thinking about Adam Dunn and whether there was anything in 2010 or earlier that hinted he might be capable of a season like 2011. In other words, is there anything that I know now about forecasting in general that would allow me to predict what happened using only what I could reasonably have known about Adam Dunn before the start of the season? The conclusion I’ve come to is that no, there really wasn’t. What happened to Dunn was, in essence, unforeseeable given what we knew heading into last season. That’s the bane of forecasting—no matter what you do, reality in all its many variations is always going to be able to surprise you.
Now it’s time to predict 2012’s stats, and PECOTA has learned from its mistake. No longer does it declare Dunn the best hitter on the White Sox. It has been humbled, dropping Dunn… all the way to second place, behind Paul Konerko. This is partly due to the fact that the White Sox are not a very good hitting team as currently constituted, having traded away Carlos Quentin during the offseason, but part of it is because PECOTA sees a far greater chance of the Adam Dunn that mashed baseballs for the better part of a decade showing up next year than the putrid Adam Dunn the White Sox saw in his first season on the South Side.
Naturally, some of you are going to look at PECOTA’s forecast for Dunn, think back to his abysmal season, and say, “I’ll take the under, thanks.” But PECOTA knows about his terrible performance just as we do; at its core, PECOTA takes past baseball statistics and applies a set of rules to them to come up with an estimate of what a player’s future statistics will be. If PECOTA is too optimistic about Adam Dunn, the culprit can be found in the rules governing the amount of emphasis to be placed on recent performance.
Of course, in tying myself so explicitly to Dunn, I run the risk that—to be blunt about it—he sucks again. I’m reminded of an article Ron Shandler wrote prior to the 2005 season, where he said:
Shandler probably should have left well enough alone; Pujols hit 41 home runs in 2005, and he’s never hit 50 or more home runs in a season. But it all comes down to the same set of questions: How much emphasis should we put on Dunn’s utter collapse, or on a young Pujols’ second-half power index? We don’t just have our eyeballs to rely on—we have decades of past baseball stats we can use to come up with an idea of how to weight baseball stats in relation to one another.
So, let’s build ourselves a forecasting model and see how various changes to the backweighting affect the forecasts, as well as try to determine the correct way to derive the backweights. For the sake of illustration, we’re going to use a much, much simpler model than PECOTA (it will remind many of you of the Marcels done by Tom Tango). To predict future TAv (from here on out, TAv_OBS), we will use three years of past TAv, where TAv_1 means one season prior to TAv_OBS, TAv_2 is two seasons prior, and TAv_3 is three seasons prior. The simplest model we can come up with is:
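Written out under the assumption that each season is weighted by its plate appearances (PA_1 through PA_3 are labels introduced here for the PA totals of the matching seasons), a model of this kind would look like:

```latex
\mathrm{TAv}_{\mathrm{OBS}} \approx
  \frac{\mathrm{TAv}_1 \cdot \mathrm{PA}_1 + \mathrm{TAv}_2 \cdot \mathrm{PA}_2 + \mathrm{TAv}_3 \cdot \mathrm{PA}_3}
       {\mathrm{PA}_1 + \mathrm{PA}_2 + \mathrm{PA}_3}
```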
What we have here is a weighted average of a player’s TAv for the past three seasons. But let’s suppose that we want to downweight less recent seasons based on our intuition that more recent seasons are more reflective of a player’s current ability level. We would modify the formula as such:
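One way to write such a formula, assuming each season is weighted by its plate appearances PA_n and by a yearly weight W_n (both labels are introduced here for illustration):

```latex
\mathrm{TAv}_{\mathrm{OBS}} \approx
  \frac{W_1 \cdot \mathrm{TAv}_1 \cdot \mathrm{PA}_1 + W_2 \cdot \mathrm{TAv}_2 \cdot \mathrm{PA}_2 + W_3 \cdot \mathrm{TAv}_3 \cdot \mathrm{PA}_3}
       {W_1 \cdot \mathrm{PA}_1 + W_2 \cdot \mathrm{PA}_2 + W_3 \cdot \mathrm{PA}_3}
```

Setting every W_n to 1 turns this back into a plain PA-weighted average; choosing W_1 > W_2 > W_3 shifts emphasis toward the most recent season.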
So how do we come up with our yearly weights? What we can do (and what many other forecasters have done) is use an ordinary least squares regression to come up with weights for each prior season. The simplest way to do this is to use TAv_1 through TAv_3 to predict TAv_OBS in our regression. If we do so, we get:
According to this model, the most recent season is nearly 1.5 times as predictive as the second-most recent season and over 2.5 times as predictive as the third-most recent season. Recasting the coefficients so that the first season is equal to one, I get 1/.6/.4. (This is similar to, but not an exact match for, the weights used in the Marcels, which work out to 1/.8/.6.)
[I’ve set the intercept to zero, because our weighted average formula lacks an intercept and this makes it a slightly more representative model, although the effect on the relative (rather than absolute) value of the weights is rather modest. If you include an intercept, it will essentially behave as the regression to the mean component of the forecast, which we’ll address separately in a moment.]
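For readers who want to try the unweighted approach themselves, here is a minimal sketch in plain Python of a least-squares fit with the intercept fixed at zero, followed by the recasting step. The helper names `fit_no_intercept` and `recast` and the four sample rows are made up for illustration, and the targets are built so the fit returns coefficients of 0.5/0.3/0.2. It is not the regression that was run on real data.

```python
from typing import List, Sequence


def fit_no_intercept(rows: Sequence[Sequence[float]],
                     targets: Sequence[float]) -> List[float]:
    """Least-squares coefficients for targets ~ rows, intercept fixed at zero.

    Solves the normal equations (X'X) b = X'y with Gauss-Jordan elimination.
    """
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Clear this column from every other row.
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def recast(coefs: Sequence[float]) -> List[float]:
    """Rescale coefficients so the most recent season's weight equals 1."""
    return [c / coefs[0] for c in coefs]


# Hypothetical (TAv_1, TAv_2, TAv_3) rows, one per player-season.
rows = [
    (0.300, 0.280, 0.270),
    (0.250, 0.265, 0.240),
    (0.270, 0.255, 0.290),
    (0.230, 0.260, 0.245),
]
# Targets constructed, for illustration only, from weights of 0.5/0.3/0.2.
targets = [0.5 * a + 0.3 * b + 0.2 * c for a, b, c in rows]

coefs = fit_no_intercept(rows, targets)
weights = recast(coefs)
print([round(w, 3) for w in weights])
```

With real data, each row would hold a player's three prior TAv values and the target would be his TAv_OBS.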
The trouble is that this kind of regression doesn’t truly model how the weights will be used in practice. From now on, we’ll call it our unweighted model. With a little bit of algebra, we can redistribute the formula like so:
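Assuming the starting point is a PA-weighted average of the three seasons (with PA_n, a label introduced here, standing for plate appearances n seasons back), the redistribution amounts to splitting the fraction into one term per season:

```latex
\mathrm{TAv}_{\mathrm{OBS}} \approx
    \mathrm{TAv}_1 \cdot \frac{\mathrm{PA}_1}{\mathrm{PA}_1 + \mathrm{PA}_2 + \mathrm{PA}_3}
  + \mathrm{TAv}_2 \cdot \frac{\mathrm{PA}_2}{\mathrm{PA}_1 + \mathrm{PA}_2 + \mathrm{PA}_3}
  + \mathrm{TAv}_3 \cdot \frac{\mathrm{PA}_3}{\mathrm{PA}_1 + \mathrm{PA}_2 + \mathrm{PA}_3}
```

Each fraction is that season's share of the player's three-year PA total, so the three shares sum to one.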
If there were no need for downweighting of past data, this would provide the proper weighted average we need for our forecasting model. For the sake of brevity, we will refer to TAv_1 multiplied by that season’s share of the player’s three-year PA total as TAv_1_W (for weighted), and so on. If we plug those into our regression model, we get some radically different weights:
These values are on a very different scale: because neither regression has an intercept, the coefficients have to sum to one in the first regression and to three in the second. But they’re also very different in a more meaningful sense; recasting the first year to 1 (which is practically already done for us), we get weights of 1/.92/.90.
In this second method, we get a result that seems contrary to our intuition—the most recent season is only slightly more predictive than older seasons. How can we assure ourselves that the less intuitive model is still more correct? We can look to the regressions themselves for one piece of evidence. The r-squared of the first regression is .27, compared to .38 for the second regression. It’s also more consistent with the way the weights will actually be used in practice.
What’s interesting is that by themselves, the PA weights have no meaningful predictive value—by definition, they have to sum to one for every player, and including them in the regression as separate variables doesn’t do anything to increase the predictive power of the regression. It’s not the distribution of past playing time that’s affecting the model, but rather what that distribution tells us about the TAv values themselves.
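To make the weighted predictors concrete, here is a small sketch that builds them for one player. The function name `pa_share_terms` and the sample numbers are hypothetical, and it assumes each weighted term is a season's TAv multiplied by that season's share of the three-year PA total.

```python
from typing import List, Sequence


def pa_share_terms(tavs: Sequence[float], pas: Sequence[float]) -> List[float]:
    """Return TAv_n times that season's share of the player's total PA."""
    total_pa = sum(pas)
    return [tav * pa / total_pa for tav, pa in zip(tavs, pas)]


# Hypothetical player: most recent season first.
tavs = [0.300, 0.280, 0.260]
pas = [600, 400, 200]

terms = pa_share_terms(tavs, pas)
shares = [pa / sum(pas) for pa in pas]

# The PA shares always sum to one for a given player.
print(round(sum(shares), 6))
# With coefficients of 1/1/1, the forecast is just the PA-weighted average.
print(round(sum(terms), 4))
```

Because the shares sum to one for every player, they carry no information on their own; they matter only through how much of each season's TAv they let into the forecast.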
Ideally, we’d compare both methods with known good values for what the seasonal weights ought to be and determine the correct method by whichever provides the more accurate results. But we don’t have known good values—if we had, we could’ve used those instead without messing around with any of this in the first place.
While we can’t get known good values for real data, though, we can get known good values for fake data—in other words, a simulation. In this case, a simulation is startlingly simple to do; we assume that a player’s TAv_OBS is his true talent level and that all past seasons are equally predictive if PA are held constant. Then we simply take a player’s PA in each of the three preceding seasons and use a random number to come up with TAv values for each preceding season that reflect a combination of a player’s true talent and random variance. (For those who care about the technical details: we generate a random number between 0 and 1, convert that from a percentage to a z-score, multiply by the expected random variance, assuming TAv is a binomial, and add that to TAv_OBS.)
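The data-generating step described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the article: the function names are made up, and the size of the random variance is taken to be the binomial standard deviation sqrt(p(1 - p) / PA) with p set to TAv_OBS.

```python
import math
import random
from statistics import NormalDist
from typing import List, Sequence


def simulate_past_tav(tav_obs: float, pa: float, u: float) -> float:
    """Turn a uniform draw u in (0, 1) into a noisy past-season TAv."""
    z = NormalDist().inv_cdf(u)  # percentile -> z-score
    # Assumption: treat TAv as a binomial rate, so noise shrinks with more PA.
    sd = math.sqrt(tav_obs * (1.0 - tav_obs) / pa)
    return tav_obs + z * sd


def simulate_player(tav_obs: float, pas: Sequence[float],
                    rng: random.Random) -> List[float]:
    """Simulate TAv_1..TAv_3 for one player with a stable true talent."""
    seasons = []
    for pa in pas:
        u = rng.random()
        while u == 0.0:  # inv_cdf is undefined at exactly 0
            u = rng.random()
        seasons.append(simulate_past_tav(tav_obs, pa, u))
    return seasons


rng = random.Random(2012)  # fixed seed so the run is repeatable
print(simulate_player(0.260, [650, 500, 300], rng))
```

A draw of u = 0.5 maps to a z-score of zero and returns TAv_OBS unchanged; draws further from 0.5 move the simulated season further from true talent, and by more when PA is small.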
Running regressions on our simulated data, we get weights of 1/.8/.3 for our unweighted model compared to 1/1/1 for the weighted model. We constructed our simulation to behave as though player talent was absolutely stable from season to season, so we can confirm that the second set of weightings is correct here, which we couldn’t do with the first set of regressions that featured real-world data. The unweighted method, in this case, still downweights past seasons, which shouldn’t be the case.
There are three important practical takeaways from this finding. The first and most obvious one is that projection systems that dramatically emphasize a player’s most recent performance will be biased against players with poor recent results and toward players with good recent results. Players are more likely to bounce back from poor seasons or revert back to type after exceptional seasons than those sorts of models would predict.
It also suggests that three years is not enough data for a forecasting model to use. If you assume the Marcel weights are accurate, then it makes sense that older seasons wouldn’t add much value to your forecasting models. However, if the decline in value of older seasons is much more subtle than that, you can make good use of five or even seven years of data, if not more.
The third, and perhaps most important, takeaway has to do with regression to the mean. We can add a simplistic version of regression to the mean to our forecasting model by adding a TAv_REG of .260 (the league average) with a PA_REG of 1200. (The PA_REG comes from the Marcels; it’s included here mostly for the purposes of illustration. The regression component in PECOTA is a more rigorous model based on random binomial variance; again, the purpose here is only to illustrate the concepts.)
Consider a player with 650 PA in each of three straight seasons, or 1950 total PA. Using the Marcel weighting of 1/.8/.6, that comes out to 1560 effective PA; in other words, 20 percent of the player’s PA during that period are thrown out. With a PA_REG of 1200, that means roughly 56 percent of the player’s forecast (1560 of 2760) comes from his own performance, and 44 percent comes from the regression to the mean component. Using weights of 1/.92/.90 yields 1833 effective PA, throwing out only about six percent. Using the same regression component, that’s 60 percent of a player’s forecast (1833 of 3033) coming from his own production and only 40 percent coming from regression to the mean. (And if you follow from the conclusions above and start using more years to forecast a player as well, even less regression to the mean is necessary.)
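The arithmetic in this example is easy to reproduce. The sketch below uses hypothetical function names and assumes the regression component enters as PA_REG extra plate appearances of league-average (TAv_REG) performance.

```python
from typing import Sequence


def effective_pa(pas: Sequence[float], weights: Sequence[float]) -> float:
    """Plate appearances left after applying the yearly weights."""
    return sum(w * pa for w, pa in zip(weights, pas))


def own_share(pas: Sequence[float], weights: Sequence[float],
              pa_reg: float = 1200) -> float:
    """Fraction of the forecast that comes from the player's own performance."""
    eff = effective_pa(pas, weights)
    return eff / (eff + pa_reg)


def regressed_forecast(tavs: Sequence[float], pas: Sequence[float],
                       weights: Sequence[float], tav_reg: float = 0.260,
                       pa_reg: float = 1200) -> float:
    """Weighted average of past TAv, blended with pa_reg PA of tav_reg."""
    weighted_sum = sum(w * pa * tav for w, pa, tav in zip(weights, pas, tavs))
    return (weighted_sum + pa_reg * tav_reg) / (effective_pa(pas, weights) + pa_reg)


pas = [650, 650, 650]
marcel_weights = [1.0, 0.8, 0.6]
flatter_weights = [1.0, 0.92, 0.90]

for label, weights in (("1/.8/.6", marcel_weights), ("1/.92/.90", flatter_weights)):
    eff = effective_pa(pas, weights)
    print(label, round(eff), round(own_share(pas, weights), 3))
```

The flatter weights keep more of the player's own plate appearances in the denominator, which is why the same PA_REG pulls the forecast toward the league average less strongly.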
Regression to the mean is a valuable concept to keep in mind when forecasting, but increasing statistical power (in other words, the amount of data used to make a forecast) is a far better solution whenever possible. Discarding data (or in this case, downweighting it) in favor of regression to the mean is only advisable when there is conclusive evidence that the data being discarded or downweighted is less predictive.
As a result of its revamped weighting, PECOTA is going to be more bullish on players coming off a bad year and more bearish on players coming off a great year than many other forecasting systems. We’re okay with that. We believe that a full accounting of the historical data supports what we’re doing with PECOTA, and we think a forecasting system with a uniquely accurate outlook is more valuable than one that conforms.
UPDATED: Coming soon, we'll have a more in-depth look at how the new PECOTA stacks up, including RMSEs against Marcels. Some quick examples beforehand: the recent poster boy for “New PECOTA” would probably be Francisco Liriano, whose 3.60 PECOTA forecast for 2010 was almost identical to his real-life 3.62 ERA, while Marcels weighted his recent past more heavily (2009 was horrific, and many observers wondered if he'd ever return from his injury woes) and forecast a 4.88 ERA. The ERAs cited here were derived using a third-party version of Marcels (we don't want anyone thinking we cooked the books), compared against the “New PECOTA” system applied retroactively.
Some hits are obviously due to differences between the systems, such as Aaron Harang moving to PETCO (4.01 PECOTA, 3.64 real, 4.74 for Marcel, which doesn’t account for park effects). With other pitchers, it's just a matter of missing the least, such as when Mike Scott unveiled his nuclear splitter for 1986 (4.55 PECOTA, 2.22 actual, 3.79 Marcels). Usually, pitchers don't leap like Mike; their dramatic improvements are quirky statistical samplings which need to be included, but should be weighted little more than earlier seasons. A more recent example is Tim Redding, who posted ERAs of 5.72, 10.57 (in just 30 innings), DNP, 3.64, and 4.95, the last in 2008. PECOTA wasn't impressed with his recent exploits, and projected a 5.30 ERA, compared to 4.51 for Marcels. His actual 2009 ERA was 5.10, and his latest pitching exploits involved a combined 6.24 ERA for two Triple-A teams.—Rob McQuown