
October 1, 2010
Reintroducing PECOTA: The Seven Percent Solution
by Colin Wyers
Let's talk percentiles. It's probably the most famous thing about PECOTA: the fact that we provide a range of forecasts instead of just a single point estimate. Earlier this week, I talked about the accuracy of the weighted mean forecasts. But what about the percentiles?
First, some notes about the percentiles. They are derived based upon the overall unit of production (TAv for hitters, ERA for pitchers), not the underlying components. This is important, because a hitter who hits more home runs than we expect (I hesitate to call it luck; he may have been underestimated, or he may have found a way to improve his talent) isn't necessarily going to improve his rate of hitting singles by the same amount, or at all. What this means is that you can't look at a single stat (say, hits or strikeouts) and think that's the range of expectations PECOTA has for that skill. The percentiles are supposed to reflect what we know about the distribution of a player's skill, but they are in essence the average batting line we should expect from that player if he puts up that level of performance in that season. There are a lot of different shapes that performance could take, however, and that means there's more variance in any single component than is reflected in the percentiles. So the correct test of the percentiles is the overall level of performance, not the underlying components.
The other thing to note is that the observed performance of any individual player is a function of his playing time: the less playing time a player has, the more variance we expect in his overall performance. Things have a tendency to even out over time (although a tendency is not the same thing as a guarantee), and so the spread of observed performance goes down as playing time goes up.
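To make the playing-time point concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each plate appearance is an independent success-or-failure trial with a fixed true rate. TAv is not actually a binomial rate, and the function name `rate_sd` and the .270 figure are hypothetical, not part of PECOTA.

```python
import math


def rate_sd(true_rate: float, opportunities: int) -> float:
    """Standard deviation of an observed rate under a simple binomial model.

    Assumption: every opportunity is an independent trial with the same
    true success rate. Real batting performance is messier than this.
    """
    return math.sqrt(true_rate * (1.0 - true_rate) / opportunities)


if __name__ == "__main__":
    # Hypothetical hitter with a true .270 success rate per plate appearance.
    for pa in (100, 300, 600):
        print(f"{pa} PA: spread of observed rate ~ {rate_sd(0.270, pa):.4f}")
```

Under this simplified model the spread shrinks with the square root of playing time, so quadrupling the plate appearances only halves it. That is one way to see why percentiles sized for a full season would look too tight for a much shorter one.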
If a player is projected for a full season's worth of playing time, and only ends up playing 50 games or so, the percentiles are going to be too tight. That's not a bug; it's impossible to make one set of percentiles that functions across any amount of playing time.
Let's start off with the hitters. Looking at only players with at least 300 PA, here's how the distribution of players looks:
Going from left to right: DIFF20 refers to the percentage of players between their 40th and 60th percentiles, through to DIFF80, which represents the percentage of players between their 10th and 90th percentiles. The second row represents those players above the 50th percentile; the third row represents players below the 50th percentile. Adding up plus down gives you the overall percentage. What we should want to see is DIFF20 equal to 20 percent, etc. We don't quite see it, though. It may be a bit more helpful to look at a histogram:
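As a rough illustration of how figures like DIFF20 through DIFF80, and the bins of such a histogram, could be tallied, here is a minimal Python sketch. It assumes each player's actual result has already been converted to a percentile (0 to 100) of his own forecast distribution; the function names and the sample values are hypothetical and are not the article's data or PECOTA's actual code.

```python
from collections import Counter


def diff_bands(percentiles):
    """Share of players landing inside each symmetric band around the 50th percentile.

    `percentiles` holds one number per player: where his actual result fell
    within his own forecast distribution (an assumed input, computed elsewhere).
    """
    n = len(percentiles)
    bands = {}
    for width in (20, 40, 60, 80):
        half = width / 2
        up = sum(1 for p in percentiles if 50 <= p < 50 + half)
        down = sum(1 for p in percentiles if 50 - half <= p < 50)
        bands[f"DIFF{width}"] = {
            "up": up / n,
            "down": down / n,
            "total": (up + down) / n,
        }
    return bands


def decile_counts(percentiles):
    """Number of players in each 10-point-wide bin (0-10, 10-20, ..., 90-100)."""
    counts = Counter(min(int(p // 10), 9) for p in percentiles)
    return [counts.get(i, 0) for i in range(10)]


if __name__ == "__main__":
    # Made-up percentiles for eight imaginary players, for illustration only.
    sample = [3, 12, 41, 48, 52, 57, 66, 95]
    for name, shares in diff_bands(sample).items():
        print(name, shares)
    print(decile_counts(sample))
```

If the percentiles were perfectly calibrated and no players were filtered out by playing time, each DIFF total would match its nominal width and every decile bin would hold about the same number of players.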
The first thing that sticks out should be the fact that most players are in the 50th to 60th percentiles, by a large margin. Why? Fundamentally, players who perform above their expectations are more likely to get playing time than players who perform below their expectations. This isn't something that should surprise us; this is why we have the weighted means forecasts for PECOTA, which explicitly take this fact into account. (This is also probably the explanation for why DIFF20 exceeds 20 percent.) But there's also more variation in observed performance than what the percentiles expect.
Let's consider the reasons we see variation from what our projections expect. The first point I want to make is that forecasting is not mathamancy; there's no such thing as a perfect forecast, except in hindsight. PECOTA utilizes a two-stage process:
Both of those estimates are subject to a measure of uncertainty. The third source of variation is simply randomness. We use the observed variation of the performance of the comps to model this variance. Not all forecasts have the same expected variance, though; it seems as though some players have more variance in their baseline forecasts than their comparables do. This is a relatively simple fix: the uncertainty in a forecast is largely a function of the amount of data you have on a player. (It's also something of a function of a player's skill set, among other things.) When we build a player's baseline forecast, we can compare the uncertainty in the forecast to the uncertainty of the comps' forecasts and figure out how much additional variance we need to add to the percentiles. We've also been treating the uncertainty of a forecast as symmetrical; apparently there's more uncertainty on the downside than the upside. This is something we can build into our model as well.
Now let's take a look at our pitchers, minimum of 70 IP:
I should clarify "down" and "up" in this context: up is an ERA below the forecast, down is an ERA above the forecast. What we see is something similar to the hitters, but much more pronounced. Let's examine it from a slightly different angle, and look at FIP as a stand-in for ERA:
That's a lot closer to what we saw with the hitters (and of course, everything I said about those applies equally here). What it comes down to, I suppose, is how you define performance for a pitcher. There are three elements to preventing or allowing runs:
I've talked in the past about how those figure into a player's value. Suffice it to say that the range on the PECOTA percentiles is largely focused on the first element (the one where most of the variation in pitcher skill occurs, and thus the area most relevant to forecasting).
So, lemme ask: what do you find the most useful to you in using the percentiles? Would you rather they reflect the extent to which we know pitchers have skill in preventing runs? Or would you rather the percentiles reflect the rather considerable noise in measuring a pitcher's performance (really, the performance of a pitcher and his teammates at preventing runs)? Drop me a line in the comments and let me know. Or you could talk to me about that, or anything else related to PECOTA or baseball stats in general, in a few hours, when I chat live starting at 1 ET, as the finale of PECOTA week.
And again, this is the beginning, not the end, of a long conversation about PECOTA. Thanks for being a part of it.
Colin Wyers is an author of Baseball Prospectus.
88 comments have been left for this article.
Mountainhawk (37208) That histogram doesn't look too horrible to me. If you subject it to a chi-squared test for uniformity, does it pass?
BillJohnson (2635) The thing is, if the prediction system is optimal, the histogram shouldn't be "uniform," it should take the shape of a Gaussian. Actually, apart from the bottom two bins, it fits that shape quite well, although it's hard to be sure because of the small number of bins. (You might repeat this analysis with twice as many bins that are half as wide, i.e., 5-percentile-wide bins. A Gaussian shape would be more obvious from that.) The system's main problem is inability to deal with the Cardinals^h^h^h players that have unexpected, complete meltdowns. It'll be interesting to see whether the incorporation of health into PECOTA addresses that. I am dubious (including past health doesn't necessarily have predictive value for future health-related collapse), but I'm looking forward to seeing it tried.
TangoTiger (57181) Bill: no, it has to be uniform. If PECOTA is saying that something is going to be between the 70th and 80th percentile, then we'd expect 10% of those somethings to occur between the 70th and 80th percentiles.
Mountainhawk (37208) I hadn't noticed they left off 0-10 and 90-100; the histogram below does look much worse. Judging by the graph below, PECOTA just doesn't have enough variability in their model.
TangoTiger (57181) Right, but not totally. If you compare relievers to starters, you will see that the ranges are similar. And, the variance around the true expectation of starters should be much smaller than that of relievers. You see that in some cases (Felix, CC), but most of the time, that's not the case.
Mountainhawk (37208) I was focusing on the hitter stats, but absolutely correct on the relievers vs starters.
Both process variance (since the IP you are projecting the ERA for is lower) and parameter variance (since the amount of data you have to make the projection is generally less) ought to be significantly higher for relievers.
BillJohnson (2635) You're misunderstanding what this distribution is *of*. It isn't what percentile *accuracy* the projections achieve (you are correct that that, by definition, must be uniform across the bins), it's how the projections relate to what the *players* do, and there is no requirement for that to be uniform.
TangoTiger (57181) I don't think I am following you. Suppose you have this player:
BillJohnson (2635) That is correct, and let us extend things further, to a "league" of five players, whose projections and performances are as follows:
BillJohnson (2635) Sorry, "under*predict* how these guys hit" in that last paragraph, not "underestimate". 3 would have been underpredicted, 2 overpredicted. The total extent of the underprediction would have been greater than the extent of the overprediction. This is entirely normal and acceptable in the real world, btw; PECOTA can't predict what the umpires' (random?...) strike zones are going to be like in the coming year, whether the ball (or players) will be more juiced than expected, and so on.
TangoTiger (57181) "HOWEVER: In terms of how well **PECOTA** performed"
BillJohnson (2635) That's correct.
TangoTiger (57181) "And that is what is interesting about the histogram. If PECOTA works right, the results *should* cluster around 50th-percentile predictions, and indeed, they do."
mickeyg13 (46429) What Tango said.
BillJohnson (2635) Now we're getting somewhere, TT.
Mountainhawk (37208) No. The bell-shaped curve you are thinking of would come from calculating (Actual TAv - Expected TAv) / (SD of TAv) for each player, then plotting those results in a histogram.
TangoTiger (57181) Let there be no doubt that Colin is now taking PECOTA by the blls.
After several years of me shouting from the rooftops and my padded room that the PECOTA percentile forecasts are highly suspect, and providing probability proof, we now have empirical confirmation. Colin shows us how often players who have at least 300 PA had their TAv land in each of the percentile ranges (10-20, 20-30, ... 80-90). In a perfect world, you'd have 10% of all players in each 10% group. In a real world, we'd expect say 8-12% across the board. But, this is not at all what we get. While Colin showed the numbers for the 10-90 group, he did not show the 0-10 and the 90-100. We do know that the total of these two groups is 36.5% (Colin reports that the 10-90 group is 63.5%). So, here is what that chart looks like if we just split the 36.5% evenly in the two extreme groups:
flirgendorf (30950) TangoTiger,
Mountainhawk (37208) If you are going to call them percentiles, you should have 10% of the players fall into 10-point-wide ranges of percentiles.
TangoTiger (57181) Right. To put it another way: under what conditions should we see 10% of the players exceed their 90th percentile forecasts?
flirgendorf (30950) You should expect that if every player received a full season's worth of plate appearances, 10% of all players would fall into each decile. However, because players who underperform their projection to the extent that they are below replacement level may get replaced, they will not reach 300 plate appearances and as a result will not be counted in the study. If the study examined every player who was in his team's opening day starting lineup, your argument for uniform distribution would be valid. However, the study selected for players whose teams decided to play them for 300 plate appearances, and therefore a bias was introduced and one should no longer expect players to be uniformly distributed in the deciles.
Mountainhawk (37208) Fine, but unless you think teams are horrible at putting the best players on the field, that should result in an upward-sloping graph (fewest players in 0-10% and most in 90-100%) and not the U shape in TTango's graph or the downward slope in the graph in the article.
Mr. Cthulhu (47348) "Fine, but unless you think teams are horrible at putting the best players on the field, that should result in an upward-sloping graph (fewest players in 0-10% and most in 90-100%) and not the U shape in TTango's graph or the downward slope in the graph in the article."
TangoTiger (57181) Then you would get a right-skew. Instead of it being:
CRP13 (46873) I just noticed that if I accidentally click "Submit comment" without any content, I get an assertive error saying "Your message appears to be blank. Stop that."
CRP13 (46873) This is a general comment on these past few articles.
mickeyg13 (46429) It's not about "who is better than who." Forgive me for putting words in his mouth, but I'd bet that Tango, for instance, would LOVE for PECOTA to absolutely destroy Marcel and for the percentiles to work as we expect them to. He's not worried about some chest-thumping competition to prove the superiority of "his system." He pretty much tells you that "his system" is the worst possible acceptable system and he pleads for you to do better if possible.
jrmayne (1468) Yeah. This.
CRP13 (46873) Then maybe it's a phrasing thing, because it doesn't always sound constructive to me. (This is not just directed at TT, either.) There's a huge difference between working to correct flaws (which I approve of) and mocking issues that haven't been addressed yet or aggressively criticizing the people behind the work.
TangoTiger (57181) "But there's enough finger pointing, accusing, and comparisons going on to make it really annoying to a casual reader."
CRP13 (46873) I'm not making this up, but I really don't want to post quotes, though I could. From this thread.
We all appreciate your work and the work of the people who manage BP. Let's not get nasty. :)
TangoTiger (57181) "From this thread. "
CRP13 (46873) Check your email. :)
TangoTiger (57181) Can you forward to tom~tangotiger~net (replacing ~ as appropriate). I can't check my Yahoo account from the office.
Mr. Cthulhu (47348) Tom, I love your work and appreciate your comments here (generally pointing out things that I didn't think of, but can pretend I did when talking with friends), but you can come off as being arrogant. I hate to say it, because I don't want you to change your comments or style (they are useful to the discussion here). But for people who just see you as a competitor (or even just a random guy) trying to nitpick, I can understand why they misunderstand you. I think the issue is you constantly prodding for proof or clarification, which is absolutely a necessary thing, and people don't understand that. People see it as being annoying. I appreciate it (even if I don't always agree) and I hope you continue to annoy people here, to better advance our knowledge.
CRP13 (46873) Just to clarify, I learned that Tango's comments above were direct copy-paste from his own blog. In the context of his own blog, I don't think there's anything arrogant or elitist. Thing is, nobody here knew it was copy/paste, so it probably sounded a little stronger than something he would have written here from scratch. To clarify further, my initial comments weren't directed at Tango specifically.
leites (17240) Prefer "the percentiles reflect the rather considerable noise in measuring a pitcher's performance" . . . in other words, the real-world projection.
jlefty (39531) So what's the (seven percent) solution? Cocaine. It's from Sherlock Holmes. Oct 01, 2010 08:01 AM
MHaywood1025 (46036) I first thought of the chapter out of the biography of one Richard Feynman. Pretty sure it had the same name as well.
Rowen Bell (5629) I believe you're right. As I recall, one of the anecdotes in that Feynman chapter involved new data that was different by 7%, and they were trying to work out whether that 7% improved or worsened the experimental fit to a theory he was developing. But, of course, Feynman's choice of title for that chapter was an allusion to Holmes.
dbiester (25354) I am a casually interested fan of sabermetrics, most interested in using the projections to get a better handle on likely outcomes in the upcoming baseball season so I don't get crushed in my fantasy leagues. This year I got crushed in my fantasy leagues, and I blame society.
dalbano (11458) I have to say getting crushed in fantasy isn't any fun, as I was crushed this year after finishing in the money all 7 years of my league. After having great keepers, and what I thought was a great draft, I was easily an early favorite to start the year. A good chunk of my team wound up having a HORRIBLE first 4 months of the season. I was in last place out of 12 teams the entire time. If the season were extended, to say...January, I think I would have a good shot of winning it. Instead, I will be right in the middle after having an extraordinary final two months.
flirgendorf (30950) I would prefer that the percentiles for pitchers reflect the noise; couldn't you just include SIERA in the forecasts for those curious about the "true ability"?
bmarinko (12618) Maybe I missed something in the article, but are these stats/charts just for 2010, or for some larger time span? If it's just for 2010, do other seasons have distributions that look similar?
ScottyB (23917) Forgive me if this is naive, but isn't the trimodal distribution we see in Tango's chart above (and implied in Colin's in the article) perfectly explainable?
ScottyB (23917) Put more simply, PECOTA seems to miss very high, very low, or totally nail player performance.
Can't this be mostly attributed to unexpectedly low or high amounts of playing time (due to injury or promotion/demotion)?
TangoTiger (57181) "Those in the first hump (players who vastly underperformed their projections) are those who missed significant playing time due to unanticipated injury or got sent to the minors."
bwilhoite (55891) Maybe I'm misunderstanding how this is supposed to work, but I don't think the projections ARE saying this is a level of production 10% of the players should reach, but IS saying this is a level that this particular player is projected as having a 10% chance of being worse than and 90% chance of being better than.
Mountainhawk (37208) No! That's not what percentiles mean. If PECOTA had 100% of players between the 45th and 55th percentiles (or the 49th and 51st percentiles) or whatever, then their prediction system is just as broken as it is when 40% of players are outside the 10-90% confidence interval.
jlefty (39531) Maybe I'm misunderstanding how this is supposed to work, but I don't think the projections ARE saying this is a level of production 10% of the players should reach, but IS saying this is a level that this particular player is projected as having a 10% chance of being worse than and 90% chance of being better than.
batpig (1607) "but IS saying this is a level that this particular player is projected as having a 10% chance of being worse than and 90% chance of being better than."
Rob_in_CT (25572) Apologies if this doesn't make sense, but...
Luke in MN (42774) "There are a lot of different shapes that performance could take, however, and that means there's more variance in any single component than is reflected in the percentiles. So the correct test of the percentiles is the overall level of performance, not the underlying components."
batpig (1607) thanks for posting this Colin, and for finally providing some transparency for PECOTA.
I posted the question about the accuracy of the percentile bands in one of the previous articles, and I'm glad to see it looked at in some detail.
TangoTiger (57181) On my blog, someone asked me this:
BillJohnson (2635) This is a fair criticism, IMO, and I'd like to see an explanation. Consider the following question that your analysis implies: is the difference between a player's 40th and 60th percentile predictions (based on TAv) less than, equal to, or greater than the difference between his 70th and 90th percentile predictions? A sort-of-random check of a reasonable number of players (will give details if interested) reveals that in only about 5% of all cases is the 40-60 gap smaller than the 70-90 gap. In all the others the difference is equal or larger for the 40-60 than for the 70-90, just as it is for A-Rod. Whatever's going on here, it's systematic rather than an exception that you happened to pick.
Dr. Dave (1652) Probability theory makes it very clear that the (true) width of the percentile bands MUST get wider as you move away from the mean. Any set of percentile forecasts that doesn't obey this is (forgive the term) nonsense.
TangoTiger (57181)
kantsipr (1382) Is that statement true in general, or only for a subset of distributions, including the normal distribution? A Poisson distribution demonstrates the behavior you describe for the high end but not necessarily on the low end. And I can certainly define a probability distribution that does not behave as you describe.
TangoTiger (57181) When would the range of the 0th to 10th percentile ever be smaller than the 40th to 50th percentile as an estimate of the true mean?
Dr. Dave (1652) It's true for any distribution with a single central peak near the mean/median. By definition, the quantiles are closer together where the density (or pdf) is highest, and farther apart where it is lower.
TangoTiger (57181) The last two links on my site give you the best estimate of run scoring distribution.
Keith used this in the BPro annual from a few years ago.
batpig (1607) that explanation not only explains the clustering in the middle quintile, it also explains the clustering in the outer deciles (<10% and > 90%).
TangoTiger (57181) This is Felix Hernandez:
bmarinko (12618) TangoTiger, forgive me if you have answered this question somewhere else, I'm not that familiar with your work:
TangoTiger (57181) The *idea* behind having uncertainty of your estimated forecast is good. Indeed, we devote several pages in The Book to not only the need, but the method, to calculate the uncertainties. When I publish the Marcels, I include a "reliability" figure, which acts in a similar way.
batpig (1607) I do think publishing the percentile lines, baseball card style, is a useful tool for the most common end user: the guy playing fantasy baseball who wants an easy "snapshot" of the range of expected outcomes for Player X, in a readily understandable format.
TangoTiger (57181) But the range will be virtually the same for all starting regulars, and all starting pitchers. It's not going to differ by any amount that would be of any help to anyone.
sensij (41659)
BillJohnson (2635) It doesn't really need to be "virtually the same for all starting regulars," because the players and their comparables don't have "virtually the same" skills and limitations. If there is significant scatter in the way the comparables did in the year following the comparison, for reasons that are apropos to the comparison, then it seems reasonable for the range to be significantly greater than if all the comparables turned out about the same. (Suppose a starting pitcher's best comp is a young Bret Saberhagen, with his weird alternating-years-of-greatness-and-mediocrity thing, for example.) If you want to say that "most" starting regulars should have fairly similar ranges, I'd agree. But not "all" or anything close to it.
TangoTiger (57181) "the ranges are nice to see, if they work."
penski (286) Along similar lines, I would like to see PECOTA replace the percentiles with a single weighted forecast, followed by three reliability numbers, perhaps on a scale of 1-10 for simplicity of presentation. The numbers would represent the confidence of the key elements of the projection.
sensij (41659)
tbwhite (361) I think part of the problem is that two distinct measures keep getting conflated: a player's talent level or ability and playing time.
Randy Brown (189) Just wanted to say that I have really enjoyed the Reintroducing PECOTA series of articles on several levels, including the clarity about the system that the articles have provided, the discussion that the articles have spawned, and (not least) the amount of effort that has gone into revamping the system to make it better. I think most would agree that 2009 and 2010 were not exactly banner years for the system; these articles have restored my confidence that the 2011 PECOTAs will be what I've come to expect and enjoy. Great work Colin.
BurrRutledge (18981) This is amazing. And I'm very happy that BP is publishing it for our consumption and analysis. The results are truly astounding.
BurrRutledge (18981) correction... "But, at the end of the day, only 10% of the overall population should fall into *each of* those categories if they are really percentiles at all (and if this sample is representative)."
BurrRutledge (18981) Okay, I came back for a second read through the article. Something I rarely do, but here I am. What was bugging me was my recollection of the 2010 Player Card 10-year forecasts that several readers (including me) called into question. For almost all the players, but particularly for young prospects, the 10th percentile just seemed way too low, and if I remember correctly, there had been a programming update that introduced feedback into the algorithm. Players/prospects were out of baseball very quickly at their 10th (and even higher) percentile of performance.
 

Thank you very much for soliciting your subscribers' input. To answer one of your questions, I prefer that the pitching percentiles reflect the hurler's skill in preventing runs.
Additionally, as this exchange with us continues and develops, could you keep us aware of the schedule you are working with? I hope and anticipate that your projections (and all PECOTA-related data) will be available much earlier next year.
Thanks again.