Last night's depth chart/PFM/PECOTA update was more than just a simple update.

Yes, it did cover the changes from the last few days, like Elijah Dukes' release and Armando Galarraga being sent down. More importantly, though, it includes what I believe are the final substantive changes to the PECOTA process for this season. This is the first run of the PECOTA system that includes the full, finished percentile- and ten-year runs for every hitter and pitcher with a record in professional baseball. 

It has been a long time coming, and for that I am truly sorry. The retrospective solution would have been to switch from the day job at least 60 days earlier than I did, but the task–in layout–did not seem as daunting as the execution proved to be. I use this data myself–I've had two drafts already, and will have three more in the next two weeks–so I do understand the hardship that comes from these delays. I want to thank Dave Pease for the thankless task he took on, playing the front man for the process to allow me to concentrate on the work itself–but any complaints about PECOTA should properly be directed at me, not to Dave, and not to BP in general. 

The process started with a PECOTA version that only produced data for the book: one year, one forecast. We had an initial set of modifications that expanded those projections from one year to ten, but in a manner that led to a wide divergence of possible outcomes, probably too broad, and that was very slow to run–it would take about a week to run every card. Towards the end of February we pushed a major upgrade for the hitters, which streamlined this process, removed an unnecessary program (which will make it much easier to transfer from one machine to another), and reduced the processing time for all hitters from about four days to 18 hours, while at the same time improving the accuracy of the system when run on past years' data.

This update represents that same step for pitchers. The future casts, which we are scrambling to get into the cards, are based on the 10-year performance of the current cast of comparables, not on generating a new set of comps each year. As with hitters, the unnecessary calls to R have been removed and replaced with inline statistical calculations, and the processing time to cover all pitchers is reduced from 72 hours to 13. Once again, we do see an improvement in the tested accuracy when compared to previous seasons:

System                Hits     ER    HR     BB      K    Sum
2009 PECOTA          15.19  12.96  4.56  10.05  15.94  58.70
BP2010 PECOTA        14.51  12.86  4.61   9.96  16.33  58.27
February             14.55  12.88  4.70   9.95  16.15  58.23
Now                  14.53  12.88  4.72   9.91  16.16  58.20
Now (Weighted Mean)  14.40  12.72  4.69   9.79  16.02  57.62

These are the root mean square errors for a set of 300 pitchers from 2009, with all forecasts pro-rated to the pitcher's actual innings total; "Sum" is the total of the five category errors. The last two rows represent the current run: "Now" is the 50% projection, and "Now (Weighted Mean)" is the weighted mean projection. For pitchers, unlike hitters, the use of the weighted mean does result in a noticeable improvement in performance. That is why the depth charts are now using the weighted mean projection for pitchers instead of the 50% projection, and it is the main reason the numbers have changed as much as they have from the previous run.
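For readers who want to see the arithmetic, here is a minimal sketch of that kind of error measure in Python. It is not the PECOTA test code: the function names and the three sample pitchers are made up for illustration, and it assumes the pro-rating is a simple linear scaling of each projected counting stat to the innings the pitcher actually threw.

```python
import math


def prorate(projected_stat, projected_ip, actual_ip):
    """Scale a projected counting stat to the pitcher's actual innings.

    Assumes simple linear scaling; the real test may differ.
    """
    if projected_ip <= 0:
        raise ValueError("projected innings must be positive")
    return projected_stat * actual_ip / projected_ip


def rmse(pairs):
    """Root mean square error over (forecast, actual) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    return math.sqrt(sum((f - a) ** 2 for f, a in pairs) / len(pairs))


# Hypothetical pitchers: (projected K, projected IP, actual K, actual IP)
sample = [
    (180.0, 200.0, 150.0, 180.0),
    (60.0, 70.0, 55.0, 60.0),
    (100.0, 120.0, 130.0, 150.0),
]

# Pro-rate each strikeout forecast to actual innings, then score it.
pairs = [(prorate(pk, pip, aip), ak) for pk, pip, ak, aip in sample]
k_rmse = rmse(pairs)
print(f"Strikeout RMSE: {k_rmse:.2f}")
```

Running the same calculation for each category and adding the five results would give a figure comparable to the Sum column.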

The weighted mean projection, for those unfamiliar with the phrase, is a weighted average (by innings) of the different percentile projections for a pitcher's performance. Nate described it well here.
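As a rough illustration of an innings-weighted average, the sketch below combines three made-up percentile lines for one pitcher. The function name, the choice of ERA as the rate, and the sample numbers are all hypothetical, and it assumes each percentile line counts equally apart from its innings; the actual PECOTA weighting may include more than this.

```python
def weighted_mean_rate(percentile_lines):
    """Innings-weighted average of a rate stat across percentile lines.

    percentile_lines: list of (projected_innings, rate) tuples, one per
    percentile projection. Each line is weighted by its innings only
    (an assumption made for this sketch).
    """
    total_ip = sum(ip for ip, _ in percentile_lines)
    if total_ip <= 0:
        raise ValueError("total innings must be positive")
    return sum(ip * rate for ip, rate in percentile_lines) / total_ip


# Hypothetical 10th, 50th, and 90th percentile lines: (innings, ERA)
lines = [
    (120.0, 5.20),
    (160.0, 4.10),
    (200.0, 3.20),
]

wm_era = weighted_mean_rate(lines)
print(f"50% projection ERA: {lines[1][1]:.2f}")
print(f"Weighted mean ERA:  {wm_era:.2f}")
```

Because the better outcomes in this example come with more innings, they pull the weighted mean away from the 50% line, which is one way the two projections can differ.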

I will, of course, continue looking for places to improve the program, but I highly doubt that I will be able to both find and implement anything prior to Opening Day. So I'm declaring this version closed, except for any bugs that turn up and require a fix. For now, I'm looking forward to actually using the programs and not just building them…some of which I'll be doing this week.

Thank you Clay for being forthright. It's a monumental task. Maybe BP (in general) botched the timing of the switch-over, but as one who has had to automate many manual processes over the years (my Operations Research background), I know it's not easy. Perhaps it would have been best to run the old and new ways in parallel for a year prior to the switch-over.

Regardless, I can only imagine the gigabytes of data that needed to be wrangled.
Thanks, Clay!
Thanks Clay - very helpful. Does this mean that there will NOT be WMs for hitters in the spreadsheets?
I think based on previous posts, the numbers they're using tested out better on predicting 2009 data (without using any 2009 data) than the weighted means, so from what I recall, they're not going back to using the weighted means.
Thanks a lot, Clay. Now you need a massively parallel computer system that will allow you to run your R code in a few minutes rather than hours. We're setting one up that has 24 PC processors in a "cloud computing" network (sorry not available to outsiders). You may be able to get something that doesn't rely on just one box.
Could someone tell me if the 1 digit Peavy value is correct? As an example, he is listed below the likes of M Byrd (outfielder, Chicago) in relative value, and below the likes of a shoulder-damaged Ted Lilly in direct starting pitcher comparisons.

Anybody care to weigh in? Friends and I were having a discussion: if you have the 5th pick in a 6x6 rotisserie league with OPS as the extra offensive category, who do you pick, assuming 1-4 are Pujols, Hanley, A-Rod, and Braun? I argued for Longo, others said Prince, but the consensus was Utley, which I think is an overrated pick. Thoughts?
Go with Longo and nab Votto and Weeks later in the draft.

This is finally the year Rickie Weeks arrives.
Until he gets injured again...which is inevitable. Love me some Weeks, but let's be realistic!
You can go wrong in that situation, but as long as you stay flexible and draft according to what the rest of the league is leaving for you in your next few picks you should be fine.

If you don't take Prince, you can get Adrian or Votto in the 2nd or 3rd round, respectively. Other options like Berkman and Pena are available many rounds later should somebody reach for either Adrian or Votto.

If you don't take Utley, you can get Kinsler or Pedroia in the 2nd or 3rd round, respectively. Or Roberts, Weeks, or Uggla many rounds later.

If you don't take Longoria, you can get Wright or Zimmerman in the 2nd or 3rd round, respectively. Young, Beckham, and Chipper are your later-round options here.

So, which of these options is the best for you:

1a) Prince, Kinsler, and Zimmerman
1b) Prince, Wright, and Pedroia
2a) Utley, Adrian, and Zimmerman
2b) Utley, Wright, and Pedroia
3a) Longoria, Adrian, and Pedroia
3b) Longoria, Kinsler, and Votto
or, another alternative:
4) Prince, Adrian and Pedroia/Zimmerman.

2b should be Utley, Kinsler, Votto
er... Utley, Wright, and Votto.
Thanks Clay, for all your hard work!!!

If I may offer some advice for next year: since you've gone through the hardship of automating this process, why not run your first set of projections immediately after the 2010 season ends? I'm unsure as to how future projections are affected by park factors, but it seems to me like you've been doing them all in a neutral park and league and then translating the results. If so, it makes sense to simply run a set of projections, lock in the data, and then spend the whole winter simply translating based on transactions. Then, if someone says, "why did so-and-so's numbers change?" you can simply say, "because he got traded to the Mets, and he'll be in an easier league but a harder park," or "because he's slated to start the year in AA."

Thanks for all the hard work....