June 19, 2014
Should You Trust the Projections?
Last week, the sabermetric community had—well, not an argument, because the participants were generally professional and cordial to one another, but a debate about what we might expect over the rest of the season from a player who is currently enjoying a hot (or cold) streak. It all started with researcher Mitchel Lichtman (better known by his initials, MGL), who argued that when a player's in-season performance diverges from his projection, the smart move is still to trust the projection.
Dave Cameron followed with a post at FanGraphs, in which he summarized the debate.
The problem is that we’re asking the wrong question. To understand why, we need an analogy.
Suppose that some serious disease that no one had seen before were making the rounds. Naturally, biomedical researchers and public health officials would be hard at work immediately trying to figure out what was going on and would likely try to develop a test that could pinpoint whether someone was infected with this disease. Early detection of just about anything saves lives. In an ideal world, we’d want the test to get it right every time. If a person really were infected, we’d want the test to say “yes.” If a person were disease free, we’d want the test to say “no.” It’s rare to get a test that’s 100 percent accurate, but that’s the goal.
Now, let’s say that we can reasonably assume, based on some surveillance and epidemiology data, that 10 percent of the population is infected. But which 10 percent? Ah, that’s where I would come to the rescue with my extra super-duper magical test, because I am brilliant. I would simply declare, according to my test, which is actually some stray wires taped to a cardboard box, that no one actually has the disease. And I would be right 90 percent of the time. That’s an A-minus, mom!
Oh right, that doesn’t really help the people who are infected, does it? Okay, instead I'll say "Everyone has the disease!" Now, I have accurately identified everyone who is infected, without missing anyone. I’ve gone up to an A-plus! Yeah, there are cases where that sort of “just assume everyone has the disease” model works in public health, but coming back to baseball, it’s basically like saying in March, “I know that a couple of these 750 players are going to break out this year. I just know it.” Technically, you can take credit for having “called” every breakout in baseball that year, but your original statement isn’t useful.
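If you'd like to see the base-rate trap in actual numbers, here's a quick sketch (the 10 percent infection rate is the made-up figure from above, and the population size is arbitrary):

```python
# The base-rate trap in numbers: with 10% prevalence, "no one has it" is
# 90% accurate but catches nobody, while "everyone has it" catches every
# infected person but is wrong 90% of the time it says "yes".

def evaluate(predictions, truths):
    """Return (accuracy, sensitivity) for a list of yes/no calls."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    hits = sum(p and t for p, t in zip(predictions, truths))
    return correct / len(truths), hits / sum(truths)

population = 1000
truths = [True] * 100 + [False] * 900   # 10 percent actually infected

say_no_to_everyone = [False] * population
say_yes_to_everyone = [True] * population

print(evaluate(say_no_to_everyone, truths))   # (0.9, 0.0) -- the A-minus
print(evaluate(say_yes_to_everyone, truths))  # (0.1, 1.0) -- the "A-plus"
```

Neither degenerate test tells you anything about any individual patient, which is the whole point.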
In public health and statistics, we call this a signal detection problem. A signal detection problem has two parts: something that we’re looking for (the signal) and some test that tries to find it. You can visualize the problem like this:

                        Signal present      Signal absent
    Test says "yes"     Hit                 False alarm
    Test says "no"      Miss                Correct rejection

The two boxes where the test agrees with reality (hits and correct rejections) are the cases the test gets right. Now, about the other two:
There are two different kinds of errors that a test can make in a signal detection problem, and signal detection theory gives us two numbers that tell us how good a test is. One is a measure of how good a test is at sorting cases into the “good” boxes. This measure, called detectability (often abbreviated d’), is what you really want in a good test. The other measure is called response bias (often abbreviated with the Greek letter beta). This is a measure of which type of error your test will make more often.
You can think of it in terms of what you might do in a case where you looked at the evidence and found that it wasn’t quite clear whether you should go with “Definitely a breakout” or “Pshaw, just a small-sample fluke.” To which one do you usually give the benefit of the doubt? That’s your response bias. Again, in public health, there are cases where it makes sense to prefer one sort of error over another, but adjusting the response bias isn’t helping you to get more cases correctly classified. It’s just adjusting what sort of errors you will make. Sometimes that’s the only thing that you can do, and it can make the test better, but it’s no substitute for better detectability.
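For the statistically inclined, both measures can be computed from a test's hit rate and false-alarm rate under the standard equal-variance normal model. Here's a minimal sketch using only the Python standard library (the 80 percent and 20 percent rates at the end are made-up illustrative numbers, not anything from a real projection system):

```python
import math
from statistics import NormalDist

def dprime_and_beta(hit_rate, false_alarm_rate):
    """Detectability (d') and response bias (beta) from a test's rates."""
    z = NormalDist().inv_cdf               # inverse of the standard normal CDF
    zh, zf = z(hit_rate), z(false_alarm_rate)
    d_prime = zh - zf                      # how well the test separates signal from noise
    beta = math.exp((zf ** 2 - zh ** 2) / 2)  # likelihood ratio at the decision criterion
    return d_prime, beta

# Hypothetical test: catches 80% of real breakouts, false-alarms on 20% of flukes.
print(dprime_and_beta(0.8, 0.2))  # -> about (1.68, 1.0): decent d', no bias
```

A beta of 1.0 means the test gives neither answer the benefit of the doubt; pushing beta above 1.0 is exactly the "when in doubt, say fluke" posture.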
Here’s the problem with “Always trust the projections.” It’s also the problem with “Everyone (or no one) has the disease.” We are trying to figure out whether a player who is playing above his head really is breaking out, or if it’s just acne. Going with “always trust the projection” is a way of saying “adjust your response bias toward saying ‘No breakout’ rather than working on making the test a better detector.” If we made a list of players who have exceeded expectations this year (pick whatever definition of that you want), most will probably revert to form, but some really are emerging from their chrysalis and have become beautiful butterflies. Let’s say that 10 percent of them are real breakouts (just picking a number). Saying “small-sample fluke” all the time will be correct 90 percent of the time. And only minimally useful.
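You can see the difference between the two knobs in a small sketch, again assuming the equal-variance normal model (the true d' of 1.0 and the criterion values are made-up numbers): sliding the decision criterion toward "no breakout" changes which errors you make, but the detectability you can recover from the hit and false-alarm rates never moves.

```python
from statistics import NormalDist

nd = NormalDist()
d = 1.0  # true detectability of the test (made-up number)

# Slide the decision criterion toward "no breakout" (more conservative).
for criterion in (0.0, 0.5, 1.0):
    hit_rate = nd.cdf(d - criterion)   # P(call it a breakout | real breakout)
    fa_rate = nd.cdf(-criterion)       # P(call it a breakout | just a fluke)
    recovered_d = nd.inv_cdf(hit_rate) - nd.inv_cdf(fa_rate)
    print(f"criterion={criterion:.1f}  hits={hit_rate:.3f}  "
          f"false alarms={fa_rate:.3f}  d'={recovered_d:.3f}")
# Hit and false-alarm rates both fall as the criterion rises,
# but d' stays at 1.0: the test gets no better at telling them apart.
```

"Always trust the projection" is just the criterion cranked all the way up; it trades false alarms for misses without improving d' at all.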
The real question that teams are concerned with is the detectability question. Suppose that a team saw a player who was starting to break out at the end of a season and could detect that yes, this one was real. At the Winter Meetings, the team’s GM would invite the breakout player’s GM out for some lemonade-fueled debauchery on the hotel mini-golf course, and somewhere over by the windmill would mention an idea for a “minor” deal. Those are the kinds of moves that a World Series team is built on. Sticking to “always trust the projection” probably does keep you from over-reacting to (and over-paying for) a two-month hot stretch, and maybe if it’s one of your own guys, you can sell high on him, but even then, you’re only getting half the benefit that you could.
In fairness to Messrs. Lichtman and Cameron (Hi guys!), I doubt either one would significantly disagree with my general point, and likely they'd be all for a method that could better detect a real breakout when it's happening (or about to happen). They’d likely agree that in a perfect world, we’d have a perfect test, but since we don’t live in a perfect world or have a perfect test, it’s better to pick the option that makes you wrong the least often. That’s perfectly sound thinking from a statistical point of view, until you look at it from the point of view of a team or anyone else who needs to be able to pick out the real breakout. Anyone can adopt "trust the projections." There’s no strategic value in it at all. Tell me when I should disregard even my own model!
Again, to be fair, MGL's projection system (and others) allows some types of new information to rewrite the projection mid-season (for example, there was specific mention of a pitcher who is clearly losing velocity, which would be factored into the projection). But there's another problem. What happens when there’s information that the model doesn't account for? Sure, a good model tries to take everything into account, but what happens when a scouting report comes back that says, "No really, he really has changed his whole approach and it's working for him"? Of course, we can't privilege every report like that.
Your cousin's girlfriend's brother's boss who has been a Rockies fan for 40 years (yeah, I know) isn't a reliable source of information on Charlie Blackmon. And yes, ideally, a more complete model might find a way to incorporate that sort of feedback to make the model better, but we're kidding ourselves if we think our models are that complete at this point. The problem with "trust the projection" is that you're leaving out any information that isn't fueling that projection, but that might still be important. The fact that projection systems miss on a lot of breakout guys is evidence that we’re leaving out some critical information.
We should aspire to greater things than that, even though that aspiration is a mighty tall order. We have good ways of measuring what a player did on the field, and some nifty one-number catch-all stats, but very little in the way of measuring some of the more basic component skills. How good is Smith's pitch recognition? What does it mean that someone finally explained to him, in a way that he could understand, how not to chase breaking stuff low and away? How does that affect all the other variables? What does it mean that not only is his wrist actually healthy, but that he actually trusts it now? How does all that interact with the rest of his skillset? It's a harder question and a more humbling one. It's going to be messy to figure it out, with fits and starts and failures and maybe some long pauses between breakthroughs. But that sort of mistrust of even the most sophisticated model is the difference between saying something that's correct and reaching the point of saying something that's useful.