September 28, 2010
Whatever Happened to the Man of Tomorrow?
Yesterday, Dave Pease walked through how the PECOTA forecasts have been generated up to now. Today, I'm here to talk about how PECOTA has fared, and where PECOTA is headed.
First, let’s look at the hitter forecasts for this season. We have four test candidates:
There are a lot of other forecasting systems in the wild; I chose to look at Marcel and CHONE in comparison because they’ve fared well historically and they have sound underpinnings.
Let’s look at how well each system forecasted the overall offensive level of the league as a whole. We’ll use OPS, since it’s a “good enough” offensive estimate for the sort of study we’re doing, and nearly every forecaster produces it, which makes it a very transparent way to compare systems. Looking only at players in common between the four projection sets, here is the weighted average of OBP, SLG, and OPS for each system:
This was a down year for offense on the whole; the most recent PECOTAs and CHONE forecasts were a shade closer on projecting the league offensive environment, but even then there were .023 points of OPS between them and the observed results.
So I adjusted each set of forecasts to line up with the lower offensive environment, then looked at the root mean square error of each forecast from the observed result, weighted by the number of plate appearances each player had. (Root mean square error acts like a standard error—assuming roughly normal errors, you should expect about 68 percent of outcomes to fall within that distance of the forecast.) I now present the most boring chart I have ever had the honor of showing you:
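For readers who want the mechanics, here is a small sketch of that adjustment and error measure in Python. The data and function names are invented for illustration; this isn't the actual code behind the study.

```python
import math

def pa_weighted_mean(values, pa):
    """Average a stat, weighted by each player's plate appearances."""
    return sum(v * w for v, w in zip(values, pa)) / sum(pa)

def adjusted_weighted_rmse(projected, observed, pa):
    """PA-weighted RMSE after rescaling projections to the observed league level."""
    # Remove the systemic run-environment miss: scale every projection so the
    # projected league mean matches the observed league mean.
    scale = pa_weighted_mean(observed, pa) / pa_weighted_mean(projected, pa)
    sq_err = [(p * scale - o) ** 2 for p, o in zip(projected, observed)]
    return math.sqrt(pa_weighted_mean(sq_err, pa))

# Toy example: two hitters' projected vs. observed OPS and their playing time.
rmse = adjusted_weighted_rmse([0.800, 0.700], [0.780, 0.690], [600, 400])
```

The rescaling step is what keeps a bad guess about the league-wide run environment from swamping the player-by-player comparison between systems.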
That’s, uh, not a lot of difference. Strictly speaking, it’s no difference at all. Now let’s take a look at your standard rotisserie categories in fantasy baseball:
We see a bit more separation here, but not much. PECOTA and CHONE were tops at predicting batting average, the Marcels were best at predicting home runs, and PECOTA was tops at predicting runs scored, RBI, and stolen bases. (PECOTA looks better at these projections because we use our Depth Charts to model a player’s specific role and playing time—that has little to no impact on his OPS, but has a big effect on his counting stats and thus his fantasy value.)
Now, let’s look at pitchers, using ERA as our measurement. First, the predicted versus observed ERAs, as a group:
Again, the run environment this year was lower than any of these forecasts expected it to be. After adjusting the forecasts to account for the difference between expected and actual run environment, a look at how each forecasting system did:
We see a little more separation in the pitcher projections than we did in the hitter projections, but not a lot. Looking at the other roto categories for pitchers (CHONE doesn’t project saves, so it received no score in that category):
These are not especially surprising results. This has been the state of forecasting for a while now—you have a pretty tight bunching of the advanced forecasting systems. (That’s what Nate Silver found back in ’07, for instance; even in ’04, hitting forecasts were pretty tightly bunched.) It’s been a while since PECOTA’s main competitors were a handful of fantasy touts and the Favorite Toy, and these results reflect that.
Now, of course, PECOTA has always done more than simply project a player’s basic stat line—we have a lot of other things going on, like the 10-year forecasts, the percentiles, and the upside/downside ratings. That extra depth is one of PECOTA’s main attractions, but it shouldn’t become a major downside as well.
One of the drawbacks of PECOTA’s additional complexity is simply how long it takes to produce forecasts. But that’s also a consequence of using Excel to generate them. We cut that dependency a while ago, and we’re continuing to work on integrating PECOTA more tightly with our other statistical offerings. That’s important to you because it means you get your forecasts sooner—because the word “fore” is, of course, a major component of forecasting.
But it’s also important for the accuracy of any individual forecast. I can take one hitter’s forecast and substitute any number of outlandish findings for him, and that on its own won’t move the needle on those RMSE figures I showed you—it takes a systemic problem affecting a lot of forecasts to show up in that sort of test.
And PECOTA is a computer program—essentially, a list of instructions. It will follow those instructions unerringly, regardless of whether those instructions are correct. It takes a human to write instructions for the computer to follow, and as we all know, humans make mistakes now and then.
Some of you may remember the PECOTA forecast for one Matt Wieters’ debut season. It struck a lot of people as being outlandish—I was certainly one of them. In this case, PECOTA was a victim of its own complexity—by taking so long to produce forecasts, there wasn’t enough time to properly proof the PECOTAs.
Forecasts for minor league players depend heavily on methods of translating their stats into terms of expected major league performance; in this case, that means the Davenport Translations. The foundation of this process is a set of league difficulty ratings that establish how each league compares to the majors.
What seems to have happened is that, when spitting out translations for the two leagues Wieters played in (and only those two leagues, mind you), the league difficulty factors came out significantly inflated from what they should have been. The Eastern League was rated not only higher than the other two Double-A leagues, but above both Triple-A leagues as well—and the High-A Carolina League placed above those two Double-A leagues, too.
For most players, that wasn’t going to make a noticeable impact—very few players who are expected to be anywhere close to the majors have only one year of stats split between the Eastern and Carolina Leagues. But one is enough to produce the Wieters forecast.
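To make the mechanics concrete, here is a toy version of how a league difficulty factor feeds into a translated stat line. The factors and the formula are invented for illustration; the actual Davenport Translations are considerably more involved than a single multiplier.

```python
def translate_ops(minor_ops, difficulty):
    """Toy translation: discount a minor-league OPS by the league's
    difficulty rating relative to the majors (1.0 = major-league quality)."""
    return minor_ops * difficulty

# The same .850 Double-A season under a plausible factor vs. an inflated one:
plausible = translate_ops(0.850, 0.82)  # a reasonable Double-A discount
inflated = translate_ops(0.850, 0.95)   # an erroneously high rating
```

Because the translated line is the baseline every downstream PECOTA calculation works from, an inflated factor in even one league shifts the whole forecast upward—which is roughly what happened with Wieters.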
For last year’s book, we had a fairly involved proofing process for the PECOTAs. (Notably, we neglected to do that for the first run of the Depth Charts. There’s a lesson to be learned there, and we’ve learned it—proof everything before publishing.) That’s good, but we want to do better. So in addition to having humans proof the PECOTAs, we’re building a set of unit tests to run alongside them, checking each element to make sure it’s functioning properly. That means the output of the PECOTAs is going to be tested at several steps along the way, to ensure everything is working correctly.
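For a sense of what such checks can look like, here is a minimal sketch using Python’s unittest module. The field names, thresholds, and sample data are all invented; PECOTA’s actual test suite is internal to our system.

```python
import unittest

# Stand-in for a batch of generated forecasts.
FORECASTS = [
    {"player": "Example Hitter A", "pa": 600, "obp": 0.350, "slg": 0.450},
    {"player": "Example Hitter B", "pa": 400, "obp": 0.320, "slg": 0.410},
]

class ForecastSanityTests(unittest.TestCase):
    def test_rates_are_plausible(self):
        # Catch individual lines that are wildly out of range (a Wieters-style miss).
        for row in FORECASTS:
            self.assertTrue(0.200 <= row["obp"] <= 0.500, row["player"])
            self.assertTrue(0.250 <= row["slg"] <= 0.800, row["player"])

    def test_league_average_in_range(self):
        # Catch systemic misses, like an inflated league difficulty factor
        # pushing the whole projected run environment out of line.
        total_pa = sum(r["pa"] for r in FORECASTS)
        mean_obp = sum(r["obp"] * r["pa"] for r in FORECASTS) / total_pa
        self.assertTrue(0.300 <= mean_obp <= 0.360)
```

The point isn’t any single check—it’s that tests like these run automatically every time the forecasts are regenerated, so a bad batch gets flagged before anyone publishes it.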
We’re also using these tests to make sure that when changes are made to the PECOTAs, they actually improve the underlying accuracy of the product. And we will notify subscribers when the methods change between PECOTA updates.
Of course, PECOTA has had some infamously mistaken forecasts that wouldn’t have been caught regardless of the amount of proofing. Most of them have been of Ichiro Suzuki. Tomorrow, we’ll go ahead and address how PECOTA missed the boat on Ichiro, and what we’ve learned from those mistakes.