Well, this year's Sloan Sports Analytics Conference has come and gone, and I wasn't able to attend. Worse, I couldn't go to the SABR Analytics Conference either. Of course, I've followed along as best as I could, but there's no substitute for actually being in the room.

With analytics having their big moment in the sun, and with the topic of how analytics fit into sports still something of an open cultural question, there have been a few writers who have considered that intersection and written something about it. Over at SB Nation, Andrew Sharp wrote a review of the Sloan Conference (seems that he was in the room), which contained this excerpt:

If there's genius on display at Sloan, it's this: When scouts or coaches or old school GMs get something wrong, it's an example of traditional scouting methods failing. When analytics get something wrong, it's "randomness" that you can't control. A small part of a much bigger process, and teams and fans should trust that process until they get a better outcome.

This might be the most damning critique of sabermetrics (and sports analytics in general) I've ever seen. Worse, it might be true.

Let's first name what we're talking about. It's a well-known phenomenon in psychology called the self-serving bias. When something good happens to you, you will tend to see it as the result of your own hard work and talent. When something bad happens, you will tend to blame bad luck. When something happens to someone else, especially to a rival, those attributions are generally switched. This has been proven in the psychological literature about 500 times. It's everywhere. You will do it today, I promise. I will too.

If you're reading Baseball Prospectus, you're probably tempted to reflexively say "Sharp is wrong." I was too. But, if we're going to be intellectually honest, this critique deserves a more thoughtful answer. One thing that sabermetrics can pride itself on is that we've made great strides in analyzing baseball while minimizing the biases that go along with "the human element." If we're going to be good scientists, we can't let this bias bring us down. Otherwise, we're just cheerleaders for spreadsheets.

There is a lot of randomness in baseball. Balls take weird hops when they hit a pebble. Air currents push a ball just inside the line for a double rather than a loud foul. Teams have years where they win 80 percent of their one-run games. How much can we blame on random noise when we (as a group) don't get something right? How many of our correct predictions are the result of random noise breaking our way, rather than our own brilliant ideas? And is there room to say that "old school" methods (to the extent that the stereotype of the scout going purely by gut feel even exists anymore) might have some merit, but might also have to deal with the problem of bad luck as well? There isn't a way to answer these questions directly and fully, but the questions themselves raise some important issues.
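To get a feel for how far pure chance can stretch, here's a back-of-the-envelope sketch. It assumes, unrealistically, that every one-run game is a coin flip and that a team plays about 25 of them in a season; both numbers are illustrative assumptions, not measured facts.

```python
from math import comb

def p_hot_streak(games=25, win_share=0.8):
    """Exact probability that a true .500 team wins at least `win_share`
    of `games` coin-flip one-run games in a single season."""
    need = round(games * win_share)  # 20 of 25
    wins_at_least = sum(comb(games, k) for k in range(need, games + 1))
    return wins_at_least / 2 ** games

def p_any_team(n_teams=30, **kw):
    """Chance that at least one of `n_teams` independent .500 teams
    pulls it off in a given season."""
    p = p_hot_streak(**kw)
    return 1 - (1 - p) ** n_teams
```

Any single team is a long shot (about one in 500 under these assumptions), but spread across 30 teams, someone posts an 80 percent one-run record in roughly one season out of 17 by luck alone. That a team did it is not, by itself, evidence that anyone's model failed.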

Consider that almost no one, sabermetrically inclined or not, predicted the Orioles or A's (see, Moneyball works!) to be even close to the playoffs last year. Everyone missed badly on those two, except for a few optimistic fans. Sure, the Orioles caught some very lucky breaks along the way, and if the season were played again a few million times over in parallel universes, I don't think that they would get that lucky again very often. But here I am blaming luck for something that I didn't get right. Then again, had one of my models predicted the A's as AL West champions, I would certainly point to this as confirmation of my awesome powers of awesomeness. This despite the fact that the "model" most likely to have produced that prediction would have been me drawing a team from the AL West out of a hat and proclaiming it my pre-season favorite.

There are plenty of well-scouted draft picks who busted, and not just due to injuries. Look at the recap of the first round of any draft over the past few years. It's fun to play the "He was drafted in front of these other three really good players" game… unless you're a scouting director. Then again, there are plenty of sabermetric darlings out there who were supposed to be the next big thing, but who just didn't make it either. Matt Murton will patrol the outfield for the Hanshin Tigers this year. If we're going to blame scouts for trying to draft blue jeans models, should we not also point out that OBP is not the only thing in the world that makes a baseball player good?

I think that one blind spot that sabermetrics hasn't yet dealt with is that we've misunderstood what it is exactly that we're good at. Sabermetrics is good at taking a lot of observations and sorting through the patterns in them in an unbiased way, at least within the constraints of how the mathematical model that sorts those data is defined. If the model is a good one, then it will perform well in describing the past and predicting the future. But what if the problem is that we just don't really understand what we're trying to model and our equations are off? It's great to have unbiased residuals, but how's the R-squared doing? (And why do we so rarely report it?)
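The unbiased-residuals-versus-R-squared distinction is easy to see with a toy example (synthetic data, not real baseball numbers): a one-variable regression on outcomes driven mostly by a factor the model omits yields perfectly unbiased residuals and a dismal R-squared at the same time.

```python
import random

random.seed(0)

# Synthetic data: outcomes driven mostly by a factor the model omits.
n = 500
x = [random.gauss(0, 1) for _ in range(n)]       # the predictor we measure
hidden = [random.gauss(0, 3) for _ in range(n)]  # the one we don't
y = [0.5 * xi + hi for xi, hi in zip(x, hidden)]

# Ordinary least squares for a single predictor, by hand.
mx, my = sum(x) / n, sum(y) / n
beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
        / sum((a - mx) ** 2 for a in x))
alpha = my - beta * mx

resid = [b - (alpha + beta * a) for a, b in zip(x, y)]
mean_resid = sum(resid) / n                      # ~0 by construction: "unbiased"
ss_res = sum(r * r for r in resid)
ss_tot = sum((b - my) ** 2 for b in y)
r_squared = 1 - ss_res / ss_tot                  # small: the model explains little
```

The residuals average out to zero because least squares forces them to; that tells you nothing about whether the model captures what actually drives the outcome. Only the R-squared does, and here it's tiny.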

To be clear, I do believe that the amount of randomness present in baseball can overwhelm even a good model. But the reflexive use of "I got unlucky over a small sample size" to explain away any variance is a problem that needs to be addressed in the field. Maybe it's the truth, but it's too easy to say and accept without really thinking about it. "Luck" gives you a get-out-of-jail-free card from having to examine what really went wrong. It short-circuits critical self-analysis, which is the key to breaking out of the self-serving bias.

To the extent that Andrew Sharp's critique is correct, it means that too often in sabermetrics, we sneakily start with the assumption that our models are correct. My model for processing information is inherently sound, and any variance is the result of chance. Paired with the other side of the self-serving bias (Your model is inherently flawed, no wonder you got such bad results), it begins with the assumption that I'm right and you have no idea what you're doing. There's a PR problem with that approach—one that probably has a lot to do with how contentious the field is in baseball more generally—but worse, at that point, we're not even doing science.

Maybe our models are brilliant. But maybe they're not. Maybe we don't know as much as we think we do or as much as we'd like other people to believe that we do. Maybe the "old school" models, even with their flaws, have strengths, but they too got unlucky over a small sample size. Maybe the reason that it stung so much when I read this critique was that when I thought about it, the very unsettling conclusion that I had to come to was "Maybe I'm wrong."

Very thoughtful and important article. Being willing to examine failures and being willing to accept that sometimes one fails even if one did one's best is critical, but not easy.
Great article. One key point that I think needs to be looked at is that there is a large percentage of people who, right out of the gate, despise statistics because they don't understand, or don't want to understand, the meaning derived from the math. Scouts and coaches are given the benefit of the doubt out of the gate, while stats continually need to be vetted. In the words of the MLB Network resident genius Harold Reynolds: "I just know that that ain't right."
As a stats guy myself (in another field), I cringe when I see error variance attributed to "luck"- and this does happen a lot in sabermetric analysis (and most other walks of life, too).

Error variance is everything we failed to predict. Some of it is randomness or luck (chance would be a better word than luck), but most of it is due to things we neglected to consider in our model.

I would also love to see more variance built into sabermetric numbers. For example, given that we (pretty much) know how big a sample we need for certain stats to stabilize, we can build confidence intervals or something similar around our stats.

Then we can have a debate on candidate A with 6.2 (±0.5) WAR vs. candidate B with 5.9 (±0.3), or, with one-year defense stats, something like 2.0 (±1.6). We kinda do this with PECOTA's quartile projections; I'd love to see it with backward-looking stats, too. It would keep us grounded in the fact that stats often have a lot of error built in.
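Here's a sketch of how that debate could be made quantitative. It treats each WAR figure as an independent normal distribution with a known standard error, which is a strong assumption (real WAR errors are neither known nor necessarily normal), but it shows what error bars buy you.

```python
from math import erf, sqrt

def p_a_better(mu_a, se_a, mu_b, se_b):
    """Probability that A's true value exceeds B's, if each estimate is an
    independent normal with the given standard error (a strong assumption)."""
    z = (mu_a - mu_b) / sqrt(se_a ** 2 + se_b ** 2)
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

# The example above: 6.2 (+/- 0.5) WAR vs. 5.9 (+/- 0.3) WAR.
p = p_a_better(6.2, 0.5, 5.9, 0.3)
```

Under those assumptions, the 6.2-WAR player is actually better only about 70 percent of the time. A 0.3-WAR gap with that much error attached is nowhere near a settled argument, which a bare point estimate hides completely.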
"given that we do (pretty much) know how big a sample we need for certain stats to stabilize) we can build in confidence intervals or something around our stats."

Right. An obvious example, which I harped on in the Angels depth chart already, is PECOTA's projection of Pujols walking almost 12 percent of the time this year. The explanation in the bounce-back candidates piece a couple of weeks ago, which featured Pujols, was that PECOTA has a long memory. But that long memory is going to make the PECOTA projection for Pujols look bad at the end of the year, with his OBP coming in way under his projection if he walks about 60 percent as often as projected.

For players like Zobrist and Bautista, who were walking a lot in 2010 and still walking a lot in 2012, saying they will walk 13.2 percent (Zobrist) and 14.8 percent (Bautista) of the time still makes sense, because they demonstrably still had the high-walk skill at the end of 2012 in a big enough sample. Pujols did not. Isolating walk rates, strikeout rates, and other quick-to-stabilize rates in the projection engine, and giving them a different weight than the standard "long memory," would ultimately lead to better projections. Pujols' walk rates in 2009 and 2010, when he was a completely different player, are at this point virtually irrelevant.

Crediting PECOTA for its long memory without acknowledging where a long memory is a detriment is, I think, an example of what Russell is talking about.
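The memory-length idea in the comment above can be sketched in a few lines. The walk rates and decay constants here are made up for illustration (they are not Pujols' actual numbers or PECOTA's actual weights); the point is only how much the blend moves when you shorten the memory for a player in sharp decline.

```python
def weighted_rate(rates, decay):
    """Blend several seasons of a rate stat (most recent season first)
    using exponentially decaying weights."""
    weights = [decay ** i for i in range(len(rates))]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)

# Hypothetical walk rates, most recent season first, for a declining player.
walk_rates = [0.079, 0.098, 0.149, 0.161]

short_memory = weighted_rate(walk_rates, decay=0.6)   # leans on recent seasons
long_memory = weighted_rate(walk_rates, decay=0.95)   # treats old seasons ~equally
```

With these inputs, the long-memory blend lands near 12 percent and the recent-heavy one near 10.5 percent. For quick-to-stabilize stats like walk rate, the argument is that the shorter memory is closer to the player the team will actually get.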
Absolutely, epsilon will contain true randomness, model mis-specification, and omitted variables.

In an effort to correct the omitted variable problem, I suspect that the model mis-specification problem has become more acute in SABR analysis as defense and (now) baserunning are rolled up into a single value statistic that corrupts the information from the much better specified hitter-value function.
There's a great section in the Francona book that deals with this very issue from a manager's perspective:

"...the problem was that the number would change as we played. On Thursday Mike Lowell could be a good fit to play, but on Friday he could be a bad fit because of what happened to the numbers Thursday. It was a little too fluid for me."

At what point do you reach a level of statistical significance where that Thursday/Friday dynamic isn't the case?

Similarly, if you flip a coin and get heads seven times in a row, are the odds that you're going to get tails on the eighth flip 50/50? You might doubt it if you view the eighth flip as part of a series, but as an independent event, it's 50/50 just like the rest.

Good food for thought here...
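The coin-flip point above is easy to check directly: collect a pile of random sequences that open with seven straight heads and see how the eighth flip behaves.

```python
import random

random.seed(7)

def tails_rate_after_seven_heads(streaks_needed=10_000):
    """Among random 8-flip sequences that start with seven heads, return
    the fraction whose eighth flip is tails. Independence predicts ~0.5."""
    streaks = eighth_tails = 0
    while streaks < streaks_needed:
        flips = [random.random() < 0.5 for _ in range(8)]  # True = heads
        if all(flips[:7]):
            streaks += 1
            eighth_tails += not flips[7]
    return eighth_tails / streaks
```

No memory, no "due-ness": the rate hovers right around 50 percent, streak or no streak.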
It's mainly a communication issue--rather than point estimates these data can be presented as trend information.
One of the most interesting things I learned from following the Orioles last year was considering what (besides the knee-jerk reaction of "luck") are the characteristics of a team that wins a disproportionate number of one-run games.

It reminds me of the refinement of Voros McCracken's theory that pitchers don't influence the results of balls in play (though on further study, some do, at the edges).

The conclusion I reached from following the Orioles was that there are good reasons to be skeptical that a team that wins a disproportionate number of its one-run games can keep doing so, but the ones with a really good bullpen are the most likely to sustain success in one-run games.
Nice article that brings up a salient point.

Why is r^2 so rarely reported in sabermetric work? Is it a sort of early convention that's persisted?
I think the biggest issue with sabermetrics is confusing better with good. There are plenty of examples of sabermetric models that have value and provide interesting levels of analysis. The biggest problem I see with sabermetricians is their extrapolation from something demonstrably better to something considered right. Those are different things. We can say WAR has value and is informative, which isn't the same thing as saying WAR correctly values players' contributions.
The biggest issue, IMO, is reification of the useful simplifying assumption of replacement level talent. The baseline is built on quicksand (and it has only gotten worse throwing defense and baserunning into the mix).
I would give you a +1 but for some reason I can't do that.
I freely admit that in discussing players I never use WAR because I can't defend it. There was an article here the other day about "replacement level" where the definition was hard to pin down from paragraph to paragraph.
A subtle, important point.
"...we sneakily start with the assumption that our models are correct."

The curse of economics. Although the upside is that when sabermetric models are wrong, they aren't catastrophically wrong, as they can be in economics.
The difference is in baseball, when the models don't predict the real world data, the model gets changed. In economics, when that happens, they keep the model and throw out the data.
Great article. One of the big points that was made at the GM Panel at the SABR Analytics conference (either by Jed Hoyer or Rick Hahn, I forget which) is that if you give some information to a player and it doesn't work, he will never trust you again. It doesn't matter that you would be right 99.5% of the time. So, information presented to players needs to be presented correctly, even if it is "right".
Great article. I'm an accountant and part of my job is to prepare budgets. I love doing it because if I'm right I look good. If I am wrong, then I make a comment that I can't predict the future or I would have hit the lottery years ago. I never tell my boss that my model was incorrect.
Well done. I think one can never write about confirmation bias enough -- that belief is so ingrained in people. This article feels like a continuation of your essay at the back of the 2013 annual.
One question I haven't solved: We'd all agree that judging the process instead of the results is most important. And a good manager hires people "smarter" than him, including, but not limited to, a position like an analyst. But if that manager has to resist judging the results, and he's unable to fully grasp the methods being used, how is he to evaluate the researcher?

Of course, I'm really limiting things to the quality of research itself, not communication or interpersonal abilities.
I always thought that everyone had confirmation bias, and this proves it.