September 27, 2010
What A Long, Strange Trip It Has Been
Welcome to PECOTA week here at Baseball Prospectus. All week, we'll be running content on the state of our projection system, covering where we're at and where we're going. To kick things off, let's pull back the curtain and have a look at the history of PECOTA production, which should answer a lot of questions readers have asked.
The original PECOTA process, as long-time readers know, was designed by Nate Silver and first offered at Baseball Prospectus when we went to a subscription model in 2003. The basis of the system was groundbreaking for the time: similarity modeling of all possible comparables in baseball history to determine likely future performance, separated into percentile bands. We heard from people who just loved that they could take guys who "they had a feeling about" and kick them up or down the percentile ranges a little, while taking the weighted mean for players they didn't have a feel for. We had some neat graphs on the PECOTA cards which helped break out of the sea-of-tables presentation that was common for player cards of the day. BP staffers and subscribers alike had fun using PECOTA projections to do mean things to their fantasy leagues and predicting, quite seriously, that Nate had a future in PR or politics.
Behind the scenes, the PECOTA process has always been like Von Hayes: large, complex, and full of creaky interactions and pinch points. It started with Clay Davenport and Keith Woolner delivering Nate the data he needed, which took some time to compile after every season. Once Nate got the data, he picked up a case of Red Bull. Then he took that data and built constants and preprocessed the dataset with STATA, which took days for each iteration. He took that output and loaded the results into the heart of PECOTA—a Rube Goldberg contraption of an Excel spreadsheet. He'd finish making his changes to the PECOTA methodology for the year to the spreadsheet, kick off the macros, and generate the output on a player-by-player basis on his laptop. Just the Excel portion of the processing took days, barring a memory leak or computer crash that might necessitate starting the entire Excel process over. The output would be a monster CSV and a bunch of image files that I'd run a Perl script on to build the PECOTA cards. The output seemed to change in small ways every year, so the script had to change to accommodate those. For my part, I wrote the card-building script in late 2002, and just thinking of looking at the code today makes me cringe.
Every year, there would be errors and omissions, which really isn't surprising for a system of this level of complexity. With the system constructed as it was, though, we were especially ill-prepared to fix them because of the turnaround time, and because Nate couldn't use his computer while it was steaming through its Excel gyrations. Nate didn't own another computer, and he was writing for and managing the operations of Baseball Prospectus during most of the PECOTA generation time, so even if he didn't have any other interests or hobbies online, this was a problem. One obvious avenue of relief would have been to put the process on dedicated, non-laptop-form-factor hardware, but there are people in this world who think nothing of configuring, maintaining, and using multiple computers in their homes, and Nate Silver, who has often led off discussions about the PECOTA process with "now, I'm not a programmer," is not one of these people.
The number crunching for PECOTA ended up taking weeks upon weeks every year, making for a frustrating delay for both authors of the Baseball Prospectus annual and fantasy baseball players nationwide. Bottlenecks where an individual was working furiously on one part of the process while everyone else was stuck waiting for them were not uncommon. To make matters worse, we were dealing with multiple sets of numbers. The 'official' Baseball Prospectus statistics lived on our database server by the middle of the decade, in permutations and schema originally designed by Keith Woolner. The Davenport Translations and many of the eventual inputs to PECOTA came from Clay Davenport, who has his own statistics, processes, and player identification scheme. Like a Bizarro world subway system where texting while drunk is mandatory for on-duty drivers, there were many possible points of derailment, and diagnosing problems across a set of busy people in different time zones often took longer than it should have. But we plowed along with the system with few changes despite its obvious drawbacks; Nate knew the ins and outs of it, in the end it produced results, and rebuilding the thing sensibly would be a huge undertaking. We knew that we weren't adequately prepared in the event that Nate got hit by a bus, but such is the plight of the small partnership.
Nate didn't get hit by a bus, but he did get crazy famous—you might have heard—and that was close to the same thing as far as a predictable and orderly PECOTA generation process went.
The 2009 season was a tough year for us PECOTA-wise. There was the infamous Matt Wieters projection, but we’ll have more on that later in the week. From a process standpoint, we continued to use the original spreadsheet, but it took even longer than usual to get the projections run considering Nate's other obligations. We got the code running on a dedicated machine, but the lack of organizational expertise in the PECOTA generation process gave away the processing time advantage and then some. We ended up giving Fantasy subscribers a free upgrade to Premium for the delay.
As the season progressed, we had some of our top men—not in the Raiders of the Lost Ark meaning of the term—look at the spreadsheet to see how we could wring the intellectual property out of it and chuck what was left. But in addition to the copious lack of documentation, the measurables from the latest version of the spreadsheet I've got include nice round numbers like 26 worksheets, 532 variables, and a 103MB file size. The file takes two and a half minutes to open on this computer, a fairly modern laptop. The file takes 30 seconds to close on this computer. There's some color coding, and a few notes, but you're not going to sit down with a nice cup of tea and pick this thing up in an afternoon. More than one of the big brains on the team threw up their hands while saying uncle. Finally, Clay Davenport stepped up and, essentially by himself, produced PECOTAs based on the logic from the original spreadsheet, and Baseball Prospectus 2010 was saved. We thought we’d reached the promised land.
Then January and February rolled around, and we still didn't have PECOTA cards. The complexity of generating the multi-year projections and producing the expanded output of the player cards, versus just the book projections, was proving to be a much more difficult problem to solve, and the well-documented issues we were having re-rolling the depth charts processes from scratch were just screwing things up further. Clay works in Fortran, and those on staff with Fortran experience didn't want to admit how long ago we last used it because we've gotten self-conscious about sounding old, so collaborative problem-solving wasn’t going to happen. Clay was still working with his own data on his own systems, and the linkages between our database server and the PECOTA data were as shifty and error-prone as ever. Worst of all, even if everything was working tip-top, all we'd done is switch victims in the Murphy's Law BP-staffer-getting-hit-by-a-bus scenario.
We eventually produced a release of our standard Fantasy package, and there were some tantalizing big-picture advantages to the new PECOTAs versus the old: the more automated process and better integration with Clay's raw stats meant we could run PECOTA projections and cards for over twice as many players as we did with the Excel process, for example. Still, it was late, there was understandable uncertainty about the product, and we ended up giving Fantasy subscribers the free Premium upgrade and extending Premium subscriptions by a month for the trouble... and it was such a hectic time I don't think we actually announced this. Enjoy the free baseball coverage, folks; when we screw up, we try to make things right, and the suits can't stop us from doing it because we are the suits.
We’ve continued to push out PECOTA updates throughout the 2010 season, but we haven’t been happy with their presentation or documentation, and it’s become clear to everyone that it’s time to fix the problem once and for all. The year 2003 seems like an eternity ago; we’ve undergone a huge amount of change since then, and so has the competitive marketplace for baseball analysis. We want PECOTA to be hands-down the best baseball performance projection system in the world, and over the next few days we’re going to break down what we’re going to do (and what we’ve already done) to get there. Stay tuned.
Dave Pease is an author of Baseball Prospectus.