To access the 2012 PECOTA spreadsheet, click here.
Madame Sosostris, famous clairvoyante,
Had a bad cold, nevertheless
Is known to be the wisest woman in Europe,
With a wicked pack of cards. Here, said she,
Is your card, the drowned Phoenician Sailor,
(Those are pearls that were his eyes. Look!)
Here is Belladonna, The Lady of the Rocks,
The lady of situations.
Here is the man with three staves, and here the Wheel,
And here is the one-eyed merchant, and this card,
Which is blank, is something he carries on his back,
Which I am forbidden to see. I do not find
The Hanged Man. Fear death by water.
—T.S. Eliot, The Waste Land
PECOTA has arrived.
BP’s projection system, at its core, follows the same basic principles as it has before. We begin with our baseline projections, which start with a weighted average of past performance, with decreasing emphasis placed on seasons further removed from the season being projected. Then that performance is regressed to the mean. After that, we use the baseline forecast to find comparable players (while also taking into account things like position and body type) and use those to account for the effects of aging on performance.
Every season we put PECOTA under the knife, looking for things we can improve to make sure we’re coming up with the best forecasts possible. Sometimes what we come up with is a minor tweak. At other times, though, what we unearth is not only more significant, but an interesting baseball insight in its own right, even aside from its inclusion in PECOTA.
This season, we’ve made some rather radical changes to how we handle the weighted averages for the PECOTA baselines—we still deemphasize past seasons, but nowhere near as much as we used to. With such a dramatic and counterintuitive change, we thought it best to give our users an explanation of what was changed and why so that they could correctly use and interpret the PECOTA forecasts.
Last year, I was asked to appear on a Chicago sports talk station to discuss the town’s two teams, in particular how PECOTA saw them faring. I said many things, most of which don’t bear repeating (or for that matter remembering) this far past, but there was one thing I remember saying, and it probably does bear repeating—I expected Adam Dunn to be the best hitter on the White Sox in 2011. Suffice it to say, this statement does not represent my finest hour as a baseball analyst. Consequently, I’ve spent a bit of time thinking about Adam Dunn and whether there was anything in 2010 or earlier that hinted he might be capable of a season like 2011. In other words, is there anything that I know now about forecasting in general that would allow me to predict what happened using only what I could reasonably have known about Adam Dunn before the start of the season? The conclusion I’ve come to is that no, there really wasn’t. What happened to Dunn was, in essence, unforeseeable given what we knew heading into last season. That’s the bane of forecasting—no matter what you do, reality in all its many variations is always going to be able to surprise you.
Now it’s time to predict 2012’s stats, and PECOTA has learned from its mistake. No longer does it declare Dunn the best hitter on the White Sox. It has been humbled, dropping Dunn… all the way to second place, behind Paul Konerko. This is partly due to the fact that the White Sox are not a very good hitting team as currently constituted, having traded away Carlos Quentin during the offseason, but part of it is because PECOTA sees a far greater chance of the Adam Dunn that mashed baseballs for the better part of a decade showing up next year than the putrid Adam Dunn the White Sox saw in his first season on the South Side.
Naturally, some of you are going to look at PECOTA’s forecast for Dunn, think back to his abysmal season, and say, “I’ll take the under, thanks.” But PECOTA knows about his terrible performance just as we do; at its core, PECOTA takes past baseball statistics and applies a set of rules to them to come up with an estimate of what a player’s future statistics will be. If PECOTA is too optimistic about Adam Dunn, the culprit can be found in the rules governing the amount of emphasis to be placed on recent performance.
Of course, in tying myself so explicitly to Dunn, I run the risk that—to be blunt about it—he sucks again. I’m reminded of an article Ron Shandler wrote prior to the 2005 season, where he said:
As an example, let's look at Pujols. After hitting 37, 34, 43, and 46 HRs, his baseline projection called for 42, which represented a normal regression to the mean. However, our flags pointed out consistent upward trends in contact rate, fly ball ratio, batting eye and a second half surge in his power index. Add in his alleged age (25) and a reliability rating of 94, and all signs pointed north for his power trend to continue. Our projection now calls for 50 HRs.
Why 50? I believe it is reasonable to expect Pujols to maintain his second half PX level for a full six months, given the trends in his skills. For some people, it might take a moment to accept 50, but the more you look at it, the more it passes the eyeball test. This is a player with no true comparables in history. All we have is our eyeballs and a general idea of what makes sense. Fifty makes sense to me.
Shandler probably should have left well enough alone; Pujols hit 41 home runs in 2005, and he’s never hit 50 or more home runs in a season. But it all comes down to the same set of questions: How much emphasis should we put on Dunn’s utter collapse, or on a young Pujols’ second-half power index? We don’t just have our eyeballs to rely on—we have decades of past baseball stats we can use to come up with an idea of how to weight baseball stats in relation to one another.
So, let’s build ourselves a forecasting model and see how various changes to the backweighting affect the forecasts, as well as try to determine the correct way to derive the backweights. For the sake of illustration, we’re going to use a much, much simpler model than PECOTA (it will remind many of you of the Marcels done by Tom Tango). To predict future TAv (from here on out, TAv_OBS), we will use three years of past TAv, where TAv_1 means one season prior to TAv_OBS, TAv_2 is two seasons prior, and TAv_3 is three seasons prior. The simplest model we can come up with is:
TAv_OBS = (TAv_1 * PA_1 + TAv_2 * PA_2 + TAv_3 * PA_3)/(PA_1+PA_2+PA_3)
What we have here is a weighted average of a player’s TAv for the past three seasons. But let’s suppose that we want to downweight less recent seasons based on our intuition that more recent seasons are more reflective of a player’s current ability level. We would modify the formula as such:
TAv_OBS = (TAv_1 * PA_1 * a + TAv_2 * PA_2 * b + TAv_3 * PA_3 * c)/(PA_1 * a+PA_2 * b+PA_3 * c)
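For readers who would rather see the two formulas as code, here is a minimal sketch. The function name, the sample player, and the 1/.8/.6 year weights are illustrative assumptions, not anything taken from PECOTA itself; with all year weights left at one, the function reduces to the plain PA-weighted average.

```python
def weighted_tav(tavs, pas, year_weights=(1.0, 1.0, 1.0)):
    """PA-weighted average of past TAv, most recent season first.

    year_weights plays the role of a, b, c in the formula above;
    leaving it at all ones reproduces the plain PA-weighted average.
    """
    num = sum(t * pa * w for t, pa, w in zip(tavs, pas, year_weights))
    den = sum(pa * w for pa, w in zip(pas, year_weights))
    if den == 0:
        raise ValueError("no weighted plate appearances to average over")
    return num / den


# Hypothetical player (made-up numbers): a down year after two good ones.
tavs = (0.240, 0.300, 0.290)   # TAv_1, TAv_2, TAv_3
pas = (500, 650, 600)          # PA_1, PA_2, PA_3

plain = weighted_tav(tavs, pas)
recency_heavy = weighted_tav(tavs, pas, year_weights=(1.0, 0.8, 0.6))
print(round(plain, 4), round(recency_heavy, 4))
```

Because the made-up player's worst season is his most recent one, the recency-heavy version comes out lower than the plain average. That gap is exactly what the choice of a, b, and c controls.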
So how do we come up with our yearly weights? What we can do (and what many other forecasters have done) is use an ordinary least squares regression to come up with weights for each prior season. The simplest way to do this is to use TAv_1 through TAv_3 to predict TAv_OBS in our regression. If we do so, we get:
TAv_OBS = 0.47 * TAv_1 + 0.32 * TAv_2 + 0.18* TAv_3
According to this model, the most recent season is nearly 1.5 times as predictive as the second-most recent season and over 2.5 times as predictive as the third-most recent season. Recasting the coefficients so that the first season is equal to one, I get roughly 1/.7/.4. (This is similar but not an exact match to the weights used in the Marcels, which work out to 1/.8/.6.)
[I’ve set the intercept to zero, because our weighted average formula lacks an intercept and this makes it a slightly more representative model, although the effect on the relative (rather than absolute) value of the weights is rather modest. If you include an intercept, it will essentially behave as the regression to the mean component of the forecast, which we’ll address separately in a moment.]
The trouble is that this kind of regression doesn’t truly model how the weights will be used in practice. From now on, we’ll call it our unweighted model. With a little bit of algebra, we can redistribute the formula like so:
TAv_OBS = TAv_1 * PA_1 /(PA_1+PA_2+PA_3) + TAv_2 * PA_2 /(PA_1+PA_2+PA_3) + TAv_3 * PA_3 /(PA_1+PA_2+PA_3)
If there were no need for downweighting of past data, this would provide the proper weighted average we need for our forecasting model. For the sake of brevity, we will refer to
TAv_1 * PA_1 /(PA_1+PA_2+PA_3)
as TAv_1_W (for weighted), and so on. If we plug those into our regression model, we get some radically different weights:
TAv_OBS = 1.03 * TAv_1_W + .95 * TAv_2_W + 0.93 * TAv_3_W
These values are on a very different scale, since due to the lack of an intercept the values have to sum to one for the first regression and to three for the second regression, but they’re also very different in a more meaningful sense; recasting the first year to 1 (which is practically already done for us), we get weights of 1/.92/.90.
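The recasting step is just a division by the first coefficient. A small sketch, working only from the rounded coefficients printed in the two regression equations (the function name is ours, for illustration):

```python
def relative_weights(coefs):
    """Rescale regression coefficients so the most recent season equals 1."""
    first = coefs[0]
    return [c / first for c in coefs]


unweighted = relative_weights([0.47, 0.32, 0.18])
weighted = relative_weights([1.03, 0.95, 0.93])
print([round(w, 2) for w in unweighted])  # [1.0, 0.68, 0.38]
print([round(w, 2) for w in weighted])    # [1.0, 0.92, 0.9]
```

Since both sets are put on the same 1-based scale, the difference in how quickly older seasons lose value is easy to see, even though the raw coefficients sum to roughly one in the first regression and roughly three in the second.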
In this second method, we get a result that seems contrary to our intuition—the most recent season is only slightly more predictive than older seasons. How can we assure ourselves that the less intuitive model is still more correct? We can look to the regressions themselves for one piece of evidence. The r-squared of the first regression is .27, compared to .38 for the second regression. It’s also more consistent with the way the weights will actually be used in practice.
What’s interesting is that by themselves, the PA weights have no meaningful predictive value—by definition, they have to sum to one for every player, and including them in the regression as separate variables doesn’t do anything to increase the predictive power of the regression. It’s not the distribution of past playing time that’s affecting the model, but rather what that distribution tells us about the TAv values themselves.
Ideally, we’d compare both methods with known good values for what the seasonal weights ought to be and determine the correct method by whichever provides the more accurate results. But we don’t have known good values—if we had, we could’ve used those instead without messing around with any of this in the first place.
While we can’t get known good values for real data, though, we can get known good values for fake data—in other words, a simulation. In this case, a simulation is startlingly simple to do; we assume that a player’s TAv_OBS is his true talent level and that all past seasons are equally predictive if PA are held constant. Then we simply take a player’s PA in each of the three preceding seasons and use a random number to come up with TAv values for each preceding season that reflect a combination of a player’s true talent and random variance. (For those who care about the technical details: we generate a random number between 0 and 1, convert that from a percentage to a z-score, multiply by the expected random variance, assuming TAv is a binomial, and add that to TAv_OBS.)
Running regressions on our simulated data, we get weights of 1/.8/.3 for our unweighted model compared to 1/1/1 for the weighted model. We constructed our simulation to behave as though player talent was absolutely stable from season to season, so we can confirm that the second set of weightings is correct here, which we couldn’t do with the first set of regressions that featured real-world data. The unweighted method, in this case, still downweights past seasons, which shouldn’t be the case.
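A rough sketch of this kind of simulation follows, using only the Python standard library. Everything specific in it is an assumption made for illustration: the playing-time ranges (invented so that older seasons tend to have fewer PA, rather than taken from real players), the spread of true talent, the sample size, and the seed. It also draws the z-score directly from a normal distribution, which is equivalent to converting a uniform random number to a z-score. It will not reproduce the exact weights quoted above, only the pattern.

```python
import math
import random


def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b by Gaussian elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def ols_no_intercept(rows, ys):
    """Least-squares fit of y = b1*x1 + b2*x2 + b3*x3 with no intercept."""
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve3(xtx, xty)


rng = random.Random(2012)
# Assumed playing-time ranges for seasons 1, 2, 3 (most recent first).
pa_ranges = [(400, 700), (250, 700), (100, 700)]

raw_rows, weighted_rows, targets = [], [], []
for _ in range(20000):
    talent = rng.gauss(0.260, 0.025)  # stands in for TAv_OBS (true talent)
    pas = [rng.randint(lo, hi) for lo, hi in pa_ranges]
    total_pa = sum(pas)
    # Binomial-style noise: standard deviation sqrt(p * (1 - p) / PA).
    tavs = [talent + rng.gauss(0, 1) * math.sqrt(talent * (1 - talent) / pa)
            for pa in pas]
    raw_rows.append(tavs)                                              # TAv_1..3
    weighted_rows.append([t * pa / total_pa for t, pa in zip(tavs, pas)])  # TAv_1_W..3_W
    targets.append(talent)

unweighted_fit = ols_no_intercept(raw_rows, targets)
weighted_fit = ols_no_intercept(weighted_rows, targets)
print("unweighted:", [round(b, 2) for b in unweighted_fit])
print("weighted:  ", [round(b, 2) for b in weighted_fit])
```

Under these assumptions every simulated season is equally informative per plate appearance, so the weighted coefficients should land close to 1/1/1. The unweighted coefficients should still decline with the age of the season, purely because the older seasons carry fewer PA and therefore more noise.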
There are three important practical takeaways from this finding. The first and most obvious one is that projection systems that dramatically emphasize a player’s most recent performance will be biased against players with poor recent results and toward players with good recent results. Players are more likely to bounce back from poor seasons or revert back to type after exceptional seasons than those sorts of models would predict.
It also suggests that three years is not enough data for a forecasting model to use. If you assume the Marcel weights are accurate, then it makes sense that older seasons wouldn’t add much value to your forecasting models. However, if the decline in value of older seasons is much more subtle than that, you can make good use of five or even seven years of data, if not more.
The third, and perhaps most important, takeaway has to do with regression to the mean. We can add a simplistic version of regression to the mean to our forecasting model by adding a TAv_REG of .260 (the league average) with a PA_REG of 1200. (The PA_REG comes from the Marcels; it’s included here mostly for the purposes of illustration. The regression component in PECOTA is a more rigorous model based on random binomial variance; again, the purpose here is only to illustrate the concepts.)
Consider a player with 650 PAs in three straight seasons, or 1950 total PA. Using the Marcel weighting of 1/.8/.6, that comes out to 1560 effective PA; in other words, we are throwing out 20 percent of a player’s PAs during that time period. That means about 57 percent of a player’s forecast comes from his own performance, and about 43 percent comes from the regression to the mean component. Using weights of 1/.92/.90 yields 1833 effective PA, throwing out only about six percent. Using the same regression component, that’s 60 percent of a player’s forecast coming from his own production and only 40 percent coming from regression to the mean. (And if you follow from the conclusions above and start using more years to forecast a player as well, even less regression to the mean is necessary.)
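The arithmetic in this example can be checked with a few lines of code. This is a sketch of the simplified illustration only, with TAv_REG of .260 and PA_REG of 1200 as stated above; the function names and the .300 hitter at the end are made up for the example, and none of this is PECOTA's actual regression component.

```python
def effective_pa(pas, year_weights):
    """Plate appearances left after applying the per-season weights."""
    return sum(pa * w for pa, w in zip(pas, year_weights))


def own_share(pas, year_weights, pa_reg=1200):
    """Fraction of the forecast driven by the player's own performance."""
    eff = effective_pa(pas, year_weights)
    return eff / (eff + pa_reg)


def regressed_tav(tavs, pas, year_weights, tav_reg=0.260, pa_reg=1200):
    """Weighted average of past TAv plus pa_reg plate appearances of tav_reg."""
    num = sum(t * pa * w for t, pa, w in zip(tavs, pas, year_weights))
    return (num + tav_reg * pa_reg) / (effective_pa(pas, year_weights) + pa_reg)


pas = (650, 650, 650)
marcel = (1.0, 0.8, 0.6)
flatter = (1.0, 0.92, 0.90)

for label, weights in (("1/.8/.6", marcel), ("1/.92/.90", flatter)):
    eff = effective_pa(pas, weights)
    print(f"{label}: {eff:.0f} effective PA, {own_share(pas, weights):.1%} own performance")

# Hypothetical .300 TAv hitter in all three seasons.
steady = (0.300, 0.300, 0.300)
print(round(regressed_tav(steady, pas, marcel), 4),
      round(regressed_tav(steady, pas, flatter), 4))
```

The flatter weights keep more effective PA, so the same above-average hitter is pulled a little less toward .260.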
Regression to the mean is a valuable concept to keep in mind when forecasting, but increasing statistical power (in other words, the amount of data used to make a forecast) is a far better solution whenever possible. Discarding data (or in this case, downweighting it) in favor of regression to the mean is only advisable when there is conclusive evidence that the data being discarded or downweighted is less predictive.
As a result of its revamped weighting, PECOTA is going to be more bullish on players coming off a bad year and more bearish on players coming off a great year than many other forecasting systems. We’re okay with that. We believe that a full accounting of the historical data supports what we’re doing with PECOTA, and we think a forecasting system with a uniquely accurate outlook is more valuable than one that conforms.
UPDATED: Coming soon, we'll have a more in-depth look at how the new PECOTA stacks up, including RMSEs against Marcels. Some quick examples beforehand: the recent poster boy for “New PECOTA” would probably be Francisco Liriano, whose 3.60 PECOTA forecast for 2010 was almost identical to his real-life 3.62 ERA, while Marcels weighted his recent past (2009 was horrific, and many observers wondered if he'd ever return from his injury woes) and forecast a 4.88 ERA. The ERAs cited here were derived using a 3rd-party version of Marcels (don't want anyone thinking we cooked the books), against the “New PECOTA” system applied retroactively.
Some hits are obviously due to differences between the systems, such as Aaron Harang moving to PETCO (4.01 PECOTA, 3.64 real, 4.74 for Marcel, which doesn’t account for park effects). With other pitchers, it's just a matter of missing the least, such as when Mike Scott unveiled his nuclear splitter for 1986 (4.55 PECOTA, 2.22 actual, 3.79 Marcels). Usually, pitchers don't leap like Mike; their dramatic improvements are quirky statistical samplings which need to be included, but should be weighted little more than earlier seasons. A more recent example is Tim Redding, who posted ERAs of 5.72, 10.57 (in just 30 innings), DNP, 3.64, and 4.95, the last in 2008. PECOTA wasn't impressed with his recent exploits, and projected a 5.30 ERA, compared to 4.51 for Marcels. His actual 2009 ERA was 5.10, and his latest pitching exploits involved a combined 6.24 ERA for two Triple-A teams.
—Rob McQuown
IIRC, Jorge de la Rosa screws it up though. Best to add about 4 blank columns to the right of the names so there is space for the de, the la, and the Rosa.
Several 'de's to sort out, and the whole process is silly. I can't see anyone wanting to sort 1000s of baseball players by first name.
What I do then, which may be a little obsessive, is, in the blank column to the right, create a lastname/firstname column, using the formula =[lastname field]&", "&[firstname field] on the top row; copy that to all rows below it; then, highlight the entire column, copy it, and paste special using "values" over the same column. That will give you a lastname, firstname column that you can move over the first name lastname column.
Another head scratcher; Kershaw 30th in WHIP among SP behind guys like Lilly & Peavy. We'll see.
And this past year, we've dramatically improved the ability to run the system against historical data, and it does amazingly well by any measure.
That said, I don't recall Kershaw coming out that badly in the preview PECOTA runs, I'll look into that now, thanks for the feedback.
Also, friendly feedback: team depth charts were much easier to read when they were organized by position rather than by batting order. These are pretty impossible to consume, and I'm not sure what value it adds to know if Adrian Gonzalez will bat 3rd or 4th.
Now I have to draft him in all leagues.
1. The obvious problems last year were the Kila/Bowker problem, the minor league comps problem, and the long-term aging curves looking ridiculous.
2. Some efforts have been made. Still, Bryce Harper's number one comp is Wayne Causey. I'd categorize that as a failure. It indicates an inattentiveness to the obvious errors of effectively dropping minor league seasons out of the comps and comparing minor leaguers to major leaguers only (and then, if your answer is Wayne Causey, doing it wrong.)
3. These are updated PECOTA's; I expect that the books are going to have different ones. I was part of the Beta Testing, and I think my comments thereto are part of my public comment log.
4. The claims that this PECOTA is (roughly) better than everything in history have been made the last two years. For any comparisons (and one to ZiPS seems warranted), transparency will be vital.
5. For pity's sake, implying that these are iterative changes to the already-Deadly Accurate PECOTA's is worrying. Authors cited the projections for Kila and Bowker repeatedly and BP never said, "These were clearly wrong."
6. There are some weird projections. Chipper PECOTA: 281/375/450. Chipper ZiPS: 260/348/438. We'll see if this is right. But I doubt it: Weightings for the very young or very old should trend more strongly to more recent seasons. (See: Silver, Nate, Rearranging PECOTA, in the 2006 annual.)
7. Speaking of 2006 conclusions:
A. Level adjustment for minor league players by age. Since comps are not selected from minor leaguers, I doubt this is done. Also, you should not comp more than one level away by age, according to Silver.
B. Starter/Reliever adjustment. Chris Sale will lose velocity and K rate as a starter.
Crap. Went long. Gotta run.
Speaking of which, was the order of past seasons considered? That is, if a player had TAvs of .250, .280, .310 in his last three seasons, his next season projection would look roughly the same as someone who had gone .320, .280, .250 -- assuming I am reading the explanation correctly. I would think that for a player of any age, a clear trend (and you probably need more than 3 years to establish a trend) would merit some level of inclusion in the predictive model.
It's both a counterintuitive result and not significant (either statistically or practically) in the testing I've done so far, so it's not currently being incorporated into PECOTA. (With some further corroborative testing, it might be.) But to the larger point, I haven't seen anything to suggest that a player with a .320/.280/.250 three-year run of TAvs should be expected to continue trending upward, above and beyond what his forecast would otherwise suggest.
Also, why not let the season sequence info. contribute to the comparable player-finding? Right now, you generate a single number (baseline forecast) using certain weights, and then try to find historical players with similar stats and other attributes. [Correct me if this is inaccurate]. I would suggest looking for historical players who have experienced season sequences similar to the player in question (say, .250,.290,.230) over the last x years, and see how they performed. x would certainly need to be greater than 3. It may not even be the sequencing that provides extra info.; it may be the y2y volatility or lack thereof. And clearly the criteria for "similar" season sequences (a.k.a., "career paths") would need to be pretty loose to create a sufficient sample and not overfit.
Is this type of approach (using more than 3 past seasons and using loosely-bucketed career paths to find comps) under consideration for future versions of PECOTA?
In terms of looking at trajectory on the whole, and matching comps based on that - I've done some examination of that. Where you run into the most problems is comparing players who are currently mid-career to guys who have finished their entire career.
They do seem to be roughly adjusted for reasonable expectations about injuries, so it seems like they are somewhat current, and they match the numbers from the pecota spreadsheet.
But isn't the spreadsheet not supposed to be playing time adjusted?
It looks like a couple people have gotten to them. Thanks.
So he is still getting worse right now (age 23), but when he turns 30, he'll finally turn the corner and start to improve... I think the erratic K rate projections are mostly to blame here.
Or was I just dumb and used last year's?
(I kid, of course, Derek! Pecota release day always gets me all fired up.)
I just saw that there was a PFM up. I had copied that and the PECOTA and the stats were the same. But the player values seemed wayyyyy off.
I'm very curious about the comps. If you sort the batters by age, you'll notice that all 19-year-olds have the same 3 comps: Wayne Causey, Ed Kranepool, Robin Yount. Sometimes the order differs, but it's always the same players. This seems to back up what jrmayne says, that minor league seasons aren't being used as comps, and suggests that all 19-year-olds get compared to those 3 players because those are the only 3 available to use as comps. If that is true, I find it troubling. It wasn't long ago that BP ran Rany's articles about the importance of age in assessing a player's potential. So, BP is on record having published that youth is a critical factor, yet if minor league seasons aren't used in comps, then the pool of potential comps shrinks more and more (meaning that the quality of the comps must decline) as the importance of the comps should be increasing to the model. All other things equal, a 19-year-old should ultimately improve more than a 20-year-old, who should improve more than a 21-year-old. If my understanding is correct, PECOTA sets its expectations for improvement or growth based on the comps, so the time period where players should see the most growth is also the time period with the worst quality comps.
From my perspective this troubles me because the projections I care most about are the prospects. I get that whether Pujols hits 40 or 50 homers is probably due to chance. I don't particularly care if you say Votto will hit .300, .310 or .290 because that prediction probably says more about the biases of your projection system than it does about the reality of how good Votto actually is. The fact is that no current projection system is going to offer any real insight into an established player like Votto.
But, where a system I think can offer insight is in the younger players, making sense of what the minor league stats mean, assessing how that player might develop and grow over time based on what similar players have done. This is what I thought PECOTA was doing, but it sounds like that is not the case. If PECOTA's projections for all 19 year olds are based primarily on what Ed Kranepool and Wayne Causey did, I don't see how it can offer any insight into how those players will develop.
Please note that we are still putting finishing touches on PFM and depth charts so, for the time being, we've disabled those features. If you ran PFM earlier today, please disregard those results.
Thank you for your patience and your feedback.
There also seem to be quite a few batters with 250 PA. Was this an attempt to truncate high PA for players not expected to be full time players?
Liriano's highest comp on his player card is Oliver Perez's 2010 who had a RA of 7.19. Maybe PECOTA is more tuned to be correct for the short term and Marcel is more correct for the long term (when Liriano's 2011 ERA went up to 5.09 and FIP went up to 4.58)?
In particular: FAIR RA, BREAKOUT, IMPROVE, COLLAPSE, ATTRITION, ML_PCT, DC_FL
Then the article shifts to talking about Liriano and pitchers, but it's not clear that 1/.92/.90 and 7 or so past seasons are also used for pitchers. Are pitchers baselined using the same past weighting scheme as hitters? (And if so I'm questioning if that is correct...)
Newer to BP, any chance people are looking for a keeper league owner? Looking to join a dynasty league. Thanks.
tell him Matt sent you. This company runs 40+ dynasty leagues and they're always looking for new owners.
Yet despite equally bad projections each of the last two years, the problems are not publicly acknowledged, and PECOTA is once again advertised as "deadly accurate" despite achieving worse results than Marcel in recent years.
I want BP to be successful, but I have to say that I'm disappointed. I'm not sure why you released the spreadsheet in its present state, but I wish you hadn't.
Also does BB include IBB? What about HBP?
And finally, are all the numbers assuming player plays half his games in his home park? For minor league players, do the numbers assume he plays half his games in his minor league park as well?
B1 = ((SLG*(B2+B3+HR))-(((B2*2)+(B3*3)+(HR*4))*AVG))/(AVG-SLG)
After that, you should be able to get AB pretty easy:
AB = (B1+B2+B3+HR)/AVG
You can double-check your calculations for B1 and AB by using them to calculate SLG, (B1+(B2*2)+(B3*3)+(HR*4))/AB. Your result should match the given value on the spreadsheet.
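Here is the same calculation as a short sketch, in case it's easier to paste into a script than a spreadsheet. The function name and the sample stat line are made up for illustration. Note that with AVG and SLG rounded to three decimals, as on the spreadsheet, the results will only be approximate, so expect to round B1 and AB to whole numbers.

```python
def singles_and_ab(avg, slg, doubles, triples, hr):
    """Back out singles (B1) and at-bats (AB) from AVG, SLG and extra-base hits."""
    if avg == slg:
        # No extra bases at all: the formula divides by zero and B1 is not determined.
        raise ValueError("AVG equals SLG, so singles cannot be recovered")
    extra_hits = doubles + triples + hr
    extra_bases = 2 * doubles + 3 * triples + 4 * hr
    b1 = (slg * extra_hits - extra_bases * avg) / (avg - slg)
    ab = (b1 + extra_hits) / avg
    return b1, ab


# Made-up line: 100 singles, 30 doubles, 5 triples, 25 homers in 550 AB.
avg = 160 / 550
slg = (100 + 2 * 30 + 3 * 5 + 4 * 25) / 550
b1, ab = singles_and_ab(avg, slg, doubles=30, triples=5, hr=25)

# Double-check by rebuilding SLG from the recovered values.
slg_check = (b1 + 2 * 30 + 3 * 5 + 4 * 25) / ab
print(round(b1), round(ab), round(slg_check, 3))
```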
Hope that helps!
Marcel adds 1200 PA of regression only if you use the 5/4/3 scheme. If you use the 1/.8/.6 scheme, then you would add 1200/5 = 240 PA of regression.
In the example in the article, a player with 650 PA each year for 3 years will regress 11% toward the mean, not 40%.
Marcel shows exactly how much each player was regressed, if you download the Marcel files. There's a column called "r", that shows the percentage of his performance stats that were used for the forecast. When you see r=.89, that means that regression was 11% toward the mean.
To the "Breakout, Improve, Collapse, Attrition" data -- I took the spreadsheet, and narrowed to the top 350 players or so, i.e., the ones that are likely to get the majority of the at bats in 2012. I then sorted by "Improve", and found that 70 had scores of 51 or higher, 7 had scores of 50, and the other 270 or so players had scores of 49 or less.
I am sure there is a reasonable explanation for that outcome, but it does not feel intuitively correct to me, unless the opposite was true for pitchers (i.e., the ascendance of pitching over hitting over the past several seasons is expected to continue, so pitchers' performances are expected to "improve" as hitters' production continues to deteriorate).
Can you guys shed some light?
However, the total 2012 WAR for all batters drops 36.1 YoY, implying a drop off of nearly 350 runs. (I'm comparing PECOTA WAR with the WARs on the BP Statistics Sorts for Team Value.) On the pitching side, the 2012 WARs rise a whopping 67.8. (Obviously implying a much, much lower run environment in 2012 than 2011.) Something seems off here. Has there been a change to replacement level, especially at the pitching level? If so, why don't the prior years' numbers change? I've always suspected prior year pitching WARs are too low based on some change made between the 2011 Annual and after the season started. (For instance look at Halladay's huge drop from book to PECOTA card.)
Finally, the shift in WARs is most pronounced by league. AL starting pitchers, per PECOTA, are expected to increase WAR by 57.1 (of the total 67.8 I refer to above) YoY. That just doesn't seem plausible and Yu Darvish isn't even in the database yet.
Any thoughts from BP?
It does not appear as if it does. For most players, you can't tell from looking at the numbers, since most players have few IBB. But look at Fielder and Pujols. Fielder is projected with 92 BB+IBB in 687 PA. He typically gets IBB'd 25 times per season. That leaves 67 NIBB. That can't be right. Same thing with Pujols.
Also, you did not answer my question about singles and AB.
More on PECOTA's WAR calculation: Is it the same as BP's calculation?
Here's why I ask: in 2011, per BP, only 23 pitchers had a WAR of 3 or better. On the surface that seems way too low, and I've thought that since changes were made to your calculations in 2010/2011. PECOTA has 42 pitchers with a projected WAR of 3 or better for 2012. Projection systems are always more conservative vs. actual outliers so again, it wouldn't seem possible the WAR calculation is consistent.
Can you shed some light on this?
BP Customer Service
I didn't even realize there was a generic BP Customer Service login though I guess it makes sense.