Full disclosure: I had never really played fantasy baseball, at least in a serious or semi-serious capacity, prior to this season. My lack of participation had nothing to do with ulterior motives like taking a stance against W-L records and batting average. I just never got into it. Well, things have changed and, in deciding to try my hand at the massively popular game, I am finding that certain tendencies have awakened that I believed had been trained out of my baseball vernacular long ago. For instance, it is becoming increasingly tempting to drop a player after a poor week in exchange for a player in the midst of a hot streak. I mean, I know Jeff Francoeur isn’t going to hit .438/.583/.839, but my goodness, if I had that production, or even some semblance of it, instead of the .250/.345/.333 from Andrew McCutchen, I might have won both of my matchups so far.

Despite my own analytical prowess and well-known feelings about sample sizes, I find myself eerily entertaining ideas like this with each passing day: getting rid of someone I know will perform well by the end of the year for someone who is performing well now, regardless of whether that performance explains anything about his year-long attributes. After talking to our stable of fantasy writers, I came to learn that giving in to these temptations is actually a fairly prominent strategy with its own nickname: churning and burning. Fantasy owners will actively add or drop players based on results from the prior week or two, and even potentially try to cash in on unsustainable hot streaks by trading these players to, perhaps, less statistically oriented competitors.

That such a strategy is renowned enough to have a nickname got the motors moving in my mind, wondering if the strategy made sense. Do small samples derived from a week or so of playing time hold predictive value? Regardless of the answer, would our findings even be relevant in the context of fantasy baseball, as opposed to actual general managers making decisions?

We’re Going Streaking!

As was discussed in my opus on spring training statistics, it is human nature for fans to anchor to early-season performances when forming opinions on players, because the beginning of the year provides a tangible reference point. This is the same reason why diets or workout regimens are started by so many on the first of the month, or a Monday, as opposed to a random Wednesday the 12th. Fantasy owners operate differently, however, because each week tends to stand on its own, independent of past or future weeks. Last week, my team, the HeyNowHankKingsleys, lost 9-5 despite stellar pitching across the board because Kevin Youkilis, Justin Upton, McCutchen, Yunel Escobar and Adam Dunn all posted sub-.700 OPS marks, with Matt Wieters chiming in at .361. Each of these players should be firmly above .800 by the end of the year, but that fact does little to console the deep pain suffered due to the loss.

If each of them manages an OPS above 1.000 this week, wonderful, but it doesn’t retroactively change the standings just because they have regressed closer to their true talent levels. By examining players under this weekly lens, it is impossible to avoid noting who is benefiting from a hot streak and who is underperforming in the midst of a cold streak. But are streaks predictive?

To be frank, this is the point in my articles where I’ll generally explain the statistical tests or procedures to be used to answer the objective question of the piece, but I know the answer to this particular query without even flinching. I don’t need week-to-week correlations or RMSE pre- and post-tests to confirm that, no, streaks are not very predictive in the aggregate. In The Book, Tango, Lichtman and Dolphin took all hot or cold streaks over a five-game span from 2000-03 and compared the numbers that were to be expected after the streak to what was actually produced. The effect was faint at best and led to the conclusion that, realistically, what a player is expected to produce in a given frame based on a rolling projection matters much more. If a tiebreaker is needed, streaks work well in that regard, but they do not hold any real, practically significant statistical advantage.

Along the lines of that test, forgive my brief tangent into what I call “The Abreu Fallacy,” which aptly explains why the method they used is the right way to conduct a study like this. When evaluating whether streaks cause any effect, the numbers produced during the streak itself cannot be compared, straight up, to what transpires after the fact. This fallacy cropped up frequently after Bobby Abreu won the All-Star Home Run Derby in 2005. He went into the All-Star break on an absolute tear, mashed about 732 home runs in the derby at Comerica Park, and then proceeded to hit like Winston Abreu the rest of the year. Several writers took it upon themselves to compare the pre- and post-derby numbers of several players to determine whether the derby caused a decline, but they missed the boat, because Abreu and the others should never have been expected to keep up their pre-derby performance.

The accurate comparison is between what he was expected to do after the hot start and what he actually did. Relating this to streaks: to determine whether they are predictive, post-streak performance cannot be compared directly to streak performance, but rather to what is expected to happen after the streak, which could be as simple as a weighted average of the past three years and the numbers from the current year up to that point in time. Speaking of points in time, streaks are incredibly interesting in terms of the churn-and-burn strategy because they occur in the past. We know when one ends, because the player stops performing as well, but it is impossible to know when one will end. How do we know exactly when Francoeur’s numbers will begin their gradual decline? In the back of our heads we know it will happen, but hey, if it lasts another week, that could be all I need to win this matchup.
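As a rough illustration, that weighted-average baseline could be sketched like this; the 3/4/5/6 weights and the sample OPS figures are my own illustrative assumptions, not an established projection system:

```python
def expected_ops(past_three, current_ytd, weights=(3, 4, 5), w_current=6):
    """Rolling-projection sketch: a weighted average of the past three
    seasons' OPS (oldest first, with recent years weighted more) plus the
    current year to date. The weights are illustrative guesses."""
    vals = list(past_three) + [current_ytd]
    ws = list(weights) + [w_current]
    return sum(v * w for v, w in zip(vals, ws)) / sum(ws)

# A hypothetical hitter: roughly .780 OPS each of the last three years,
# .820 so far this season, and a 1.050 OPS over the past hot week. The
# fair "Abreu Fallacy" comparison is post-streak OPS against this
# baseline, not against the 1.050.
baseline = expected_ops((0.770, 0.790, 0.780), 0.820)  # ~.794
```

The point of the sketch is only that the baseline sits near the player's established level, nowhere near the streak line itself.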

Essentially, in employing this strategy, we would be using a backwards-looking concept to drive decisions about future performance. It can certainly work—and it must work on some level to have become such a common strategy—but from a statistical perspective, streaks in the aggregate hold next to no predictive value. What happened last week is unlikely to influence performance this week. But the question of relevance comes into play, as what happens in the aggregate does not apply to every single individual. While the broad concept holds true in that streaks are not predictive, there are certainly individuals for whom this might not apply.

The Strategy Itself

Not all churn and burns are created equal; some owners might rotate all roster spots save for the Utleys, Pujolses, and Halladays of the world. Others might choose to reserve just one offensive and one pitching spot for streak-driven decisions. Even though, overall, streaks boast little predictive value, the strategy itself is still valid for the same reason that people continue to invest in penny stocks: even though the percentages are not in their favor, capitalizing on the few that turn out well more than makes up for the other failures. Imagine if you entered last season with Casey Kotchman as your starting first baseman and picked up Garrett Jones after he hit .296/.345/.667 in his first seven games. In that scenario, your replacement would have hit .293/.374/.557 over 75 games and 329 plate appearances after that point, vastly superior to the player who was rostered heading into the swap.

The same goes for Joe Blanton, who might not have been drafted but who posted a 3.16 ERA, 1.21 WHIP and 7.5 K/9 from May 26 to the end of the season. Players like that would have greatly benefited those who took the chance. While the strategy might seem very risky given the lack of predictive value in the aggregate, much of that risk is mitigated if the strategy is employed somewhat conservatively (as in, don’t drop Chase Utley after a poor week), because some pickups really will pay off, and those that don’t can be dropped for someone else almost instantly.

So how do we actually test the strategy given that so much is open to interpretation? Well, one idea I wanted to throw out is to literally apply the test in a fantasy setting. As in, two faux leagues are set up with identical rosters; in one of them the only moves allowed deal with DL stints, while in the other, owners can run rampant with churning and burning, subject to a few restrictions. What I have in mind would work like this:

1) Two leagues are set up with, say, eight teams each

2) An autodraft is done in League #1

3) Those same rosters are imported into League #2

4) We define parameters for League #2 as far as who cannot be dropped. This could be as simple as anyone in the top four at catcher, second base, shortstop and center field; within the top six at first base, left field, right field and third base; within the top 12 starting pitchers and top six relievers. Numbers can change depending on how the consensus feels, but that would be the gist; you can drop Stephen Drew, but not Hanley Ramirez.

5) We define parameters for what constitutes a hot/cold streak. Maybe something like an OPS below .700 is a cold streak and above .900 is a hot streak.

6) In League #1, moves can occur only if someone gets hurt, or we could just say no moves, period, and if someone is DL’d we can control for that after the fact.

7) In League #2, add/drops based on streaks can be randomized or determined by the draft order, to ensure that no one team can load up on the hottest, streakiest hitters.

Let me throw it out to the audience—is this something anyone would be interested in seeing or doing? In the end, what we could begin to determine is whether better production could be had by sticking with a drafted player, who was obviously solid to begin with, or by exploiting streaks. Begin is the key word, as one test flight isn’t going to answer any questions, but if it can open the discussion and shed some light on the results, I’ll be satisfied. Additionally, to those who have employed this sort of strategy, what are your experiences? When has it worked or failed? Would you recommend it?