
Projecting playing time is hard. When news hit that Adam Wainwright was almost certainly out for the year, forecasters went into fits. There’s no way to predict Tommy John surgery* for a pitcher coming off back-to-back 230-inning seasons. So you can say there’s a five percent chance he gets hurt and dismiss that possibility as too unlikely to weigh into your projection, or you can drag your forecast down by five percent. Regardless of which course of action is better, the only recourse after the fact is to “cheat” by manually updating the number of projected innings when such news comes out.

*Wait, he had an inverted W? Well, in that case…

PECOTA, Dan Szymborski’s ZiPS, and Brian Cartwright’s Oliver all use manually adjusted depth charts. PECOTA also uses a simple average of past seasons for individual projections. Victor Wang tried his hand at projecting playing time using some more advanced techniques. But for now, I’d like to keep it simple.

Marcel the Monkey, the world’s most baseliniest projection system, projects Wainwright at a 2.98 ERA in 198 innings. Tom Tango developed Marcel to serve as a replacement-level forecasting system, and its methodology is entirely open source. Marcel is quite sophisticated in some ways: it understands regression to the mean, handles the weighting of prior seasons, and includes an aging curve. While I wouldn’t trust myself to out-guess a projection system on a random pitcher’s ERA without using a computer of my own, I’m positive that I can do better in terms of innings pitched. I’m also positive that I can come up with an algorithm just as basic as the one Marcel uses that better predicts playing time on average.

Marcel starts by giving all batters 200 plate appearances. It then adds half of a batter’s plate appearances from the previous year and 0.1 times his total from two years prior. For pitchers, it uses the same weights, but with 60 innings for starters and 25 for relievers as the constants. I decided to use a similar dataset and see how easily I could beat the monkey with a couple of simple rules of my own.
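As a minimal sketch, the playing-time rule just described can be written out in Python. The function names and the `starter` flag are my own labels for illustration, not part of Marcel itself, and the Wainwright inputs are rounded to the 230 innings mentioned above.

```python
def marcel_pa(pa_prev, pa_prev2):
    """Plate appearances: 200 + half of last year + a tenth of two years ago."""
    return 200 + 0.5 * pa_prev + 0.1 * pa_prev2


def marcel_ip(ip_prev, ip_prev2, starter=True):
    """Innings: same weights, with a constant of 60 (starter) or 25 (reliever)."""
    base = 60 if starter else 25
    return base + 0.5 * ip_prev + 0.1 * ip_prev2


# Wainwright, rounded to back-to-back 230-inning seasons (an approximation).
wainwright_ip = marcel_ip(230, 230)
print(round(wainwright_ip))
```

With those rounded inputs the result lands at about 198 innings, in line with the Marcel figure quoted earlier. Note that a batter with no plate appearances in either season still comes out at 200, because of the constant.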

My sample consisted of all regular-season plate appearances in year N since 1980 for everyone who played in the Majors in either year N-1 or year N-2 and played pro ball in some year thereafter.

Marcel’s equation explains 63 percent of the variance in a batter’s plate appearances in a given year. When I best-fit the data, the regression equation took 75 percent of the PAs in year N-1 in addition to 10 percent in year N-2. But we can do better than the best-fit line with one simple separation of the data. When projecting playing time in season N for players who played more in season N-1 than in season N-2, the playing time in season N-2 becomes irrelevant. For batters, that means that you project with 80 percent of the previous year’s plate appearances, and for pitchers it’s closer to 75 percent of the previous year’s innings. Otherwise, the equation is 60 percent of the year N-1 total plus 20 percent of the year N-2 total for batters, and 60 percent plus 15 percent for pitchers.
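The split rule above can be sketched in a few lines of Python. The function name and the `pitcher` flag are hypothetical labels of mine, the percentages are the rounded ones from the text, and I am assuming that a player with equal totals in both years falls into the "otherwise" branch.

```python
def split_rule_projection(prev, prev2, pitcher=False):
    """Project year N playing time from year N-1 (prev) and year N-2 (prev2).

    No constant term: zero playing time in both years projects to zero.
    """
    if prev > prev2:
        # Played more last year than the year before: ignore year N-2.
        return (0.75 if pitcher else 0.80) * prev
    # Otherwise blend both years, weighting last year more heavily.
    weight_prev2 = 0.15 if pitcher else 0.20
    return 0.60 * prev + weight_prev2 * prev2


print(split_rule_projection(600, 400))                # batter on the rise
print(split_rule_projection(400, 600))                # batter who played less
print(split_rule_projection(230, 230, pitcher=True))  # steady workhorse starter
```

Unlike a rule with a constant, this one gives a player with no recent Major League time a projection of zero.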

The r-squared of the split version, however, was identical to that of the single best-fit line, at 0.64. What this means is that there are multiple ways to skin this cat. (Are there actually multiple ways to skin a cat? Cat-skinners, get at me.)

So which way is more correct? Below I plot the best-fit lines for batters and pitchers and the games played per games projected indexes.

I submit that it is more sensible to force the intercept to zero, since if someone hasn’t played in the Majors in either of the past two years, he probably won’t play the following year. Indeed, for players projected at fewer than 300 PAs by Marcel, Marcel generally overshoots by 180 PAs, while the just-as-simple best fit (hereinafter referred to as Marcello, Marcel’s Italian twin*) is on average within 10 PAs. A quarter of these players don’t play at all in the projected season, yet Marcel is projecting 95 percent of them for a career high in plate appearances. When Marcel projects for over 300 PAs, it misses by an average of 100 PAs, compared to an average of two PAs for Marcello. Marcello projects 20 fewer innings for a workhorse like Wainwright. There are only two hitters and one pitcher whom Marcello projects for more playing time than Marcel: Rickie Weeks, Austin Jackson, and C.J. Wilson.

*I’m not sure why their parents named them that way, or how they can be of different nationalities.
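The comparisons above are average misses, where overshoots and undershoots can cancel out. Here is a rough sketch of how such a signed average could be computed alongside the average absolute error and RMSE. The helper names are my own, and the numbers in the example are made up purely to show the mechanics; they are not from the sample.

```python
import math


def mean_error(projected, actual):
    """Signed average miss; positive means the projection overshoots."""
    return sum(p - a for p, a in zip(projected, actual)) / len(actual)


def mean_abs_error(projected, actual):
    """Average size of the miss, ignoring direction."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)


def rmse(projected, actual):
    """Root mean squared error; penalizes large misses more heavily."""
    squared = [(p - a) ** 2 for p, a in zip(projected, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up plate appearance totals for four hypothetical batters.
projected = [300, 500, 250, 600]
actual = [100, 600, 0, 650]

print(mean_error(projected, actual))
print(mean_abs_error(projected, actual))
print(rmse(projected, actual))
```

A system can look nearly unbiased on the signed average while still missing badly on individual players, which is why the absolute measures are worth checking as well.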

This goes back to the original uncertainty surrounding the question that a projection system is attempting to answer. Yes, it’s more likely that someone like Wainwright will pitch 200 innings than 180. For pitchers who throw 200 innings in back-to-back years, there’s a better than 50-50 chance that they have a third season at 200 innings-plus. However, the mean innings pitched for those players turns out to be way less than 200.
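A toy example shows how both statements can hold at once. The ten innings totals below are invented for illustration only, not drawn from the sample: most of the group clears 200, but a few lost seasons pull the mean well below it.

```python
from statistics import mean, mode

# Hypothetical third-season innings for ten pitchers coming off
# back-to-back 200-inning years (invented numbers, not real data).
innings = [215, 210, 205, 205, 205, 200, 150, 90, 40, 0]

share_at_200 = sum(ip >= 200 for ip in innings) / len(innings)
average_ip = mean(innings)
most_common_ip = mode(innings)

print(share_at_200)    # fraction reaching 200 innings
print(average_ip)      # mean, dragged down by the injured seasons
print(most_common_ip)  # most frequent single outcome
```

In this made-up group, six of ten reach 200 innings and the most common outcome is 205, yet the mean is only 152. A forecast aimed at the most likely outcome and a forecast aimed at the mean are answering different questions.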

The Bill James forecasts are notoriously optimistic because they shoot for the mode as opposed to a mean. At the other end is Marcel, which regresses everything towards league average to such an extent that it is resistant to outliers. Marcel is trying to provide a true talent estimate, and therefore trying to minimize the error around projected production level. I think that similarly minimizing the error around projected playing time for a system like Marcel makes more sense than projecting playing time based on a different set of circumstances in which the player has likely outperformed his projected production. Of course, when it comes to projecting playing time, it’s unlikely that even the best algorithm supplied with the best data could outperform old-fashioned flesh, blood, and brainpower.

levidavis
2/24
I've never understood why an inverted W is not just called an M.
adamsternum
2/24
I've never understood why the inverted W is not just called the W. How do you know where I'm standing?
jgreenhouse
2/24
Clearly you both have a lot to learn about pitching mechanics.
mikefast
2/24
Good stuff, Jeremy.
TangoTiger1
2/24
Nice work Jeremy. I posted a reply on my blog. Thanks for giving me more to think about.
mikefast
2/24
Jeremy, you addressed the median and the mode in the text, but as I stare at the graphs, I find myself wondering if there is a simple way to draw a line that goes through both the dark cluster around the origin and the dark cluster of full playing time in the upper right. A mathematical technique, I mean. The "best fit" line is having to account for all the points along the x-axis.
jgreenhouse
2/24
Mike, right, the mode method would go through both of those clusters. I believe that would entail simply using the exact same number of opportunities from one year to the next.
o2bnited
2/26
Wouldn't the cluster in the lower left (for pitchers) be the relievers (along with the injuries and such)? And so if you separated by pitcher role, would that help? You could obviously approach the model itself from multiple angles (logit, piece-wise, two separate models completely) if you had some type of logit flag for GS > G*0.8 or something. Or is that more complicated than what you wanted to do? Anyway it would be interesting to see how the plot looks with pitchers separated into two groups with some simple rule of thumb. Interesting, stuff, though. Thanks Jeremy.
jgreenhouse
2/27
Yes, I think games per game started would be a good variable in a more sophisticated model. I might revisit projected playing time more toward opening day.
jgreenhouse
2/24
I should have tested using simply year n-1 as a predictor of year N. I also should have used the absolute error or RMSE or something to have tested. I will try to get to those later.
jgreenhouse
2/27
OK, I checked it out, and I think I've made it clear I don't feel Marcel is competent in projecting playing time. Using simply year n-1 to predict year N obviously comes in with a higher average error than Marcello, as Marcello is simply a best fit and year n-1 is unregressed. But year n-1 and Marcello have practically the same average absolute errors. Marcel has a much larger average absolute error.