Let’s talk about the ESPN the Magazine team chemistry rankings. For those who haven’t seen them yet, I suggest going here, but if you’d like to skip to the good part, the centerpiece of ESPN’s predictions about the 2014 season is that they adjusted them for team chemistry. The article actually (seriously, no really) says that the Tampa Bay Rays are projected to win 1.7 extra games this year because of chemistry. This will be enough to win them the AL East.
This is evolution at a speed that would disprove some tenet of special relativity. I only wish I knew more about special relativity so that I could tell you which part. We’ve gone from chemical atheism (“Chemistry doesn’t matter. Winning breeds chemistry”) to chemical agnosticism (“Chemistry is this vague mysterious thing that you can’t easily measure and it might or might not have an effect.”) to fully drinking the chemically laced Gummiberry Juice (“Chemistry is worth 1.7 wins.”). Whoa.
I should probably cop to a couple of things. I have a bit of a head start on this one. Last summer, BP’s own Sam Miller wrote an article on team chemistry in—funny enough—ESPN the Magazine, in which I was quoted. He also talked to Santa Clara University professor Katerina Bezrukova, who spoke of her work on group dynamics, specifically how “fault lines” can appear based on any number of factors to divide members of a group from one another and can determine the success or failure of a group.
Since I’m thrilled to meet anyone who lives at the corner of Baseball Street and Psychology Lane, I contacted her and we chatted behind the scenes. I looked a bit into the #GoryMath and the theory behind her research. It’s not my area of expertise, but the theory was at least pleasantly plausible and the math that she (and her colleagues) used made sense, at least for what they were trying to investigate about workplace interactions in general. Like a lot of things in science, it needed more work before I would fully buy into it, but it passed the silliness test.
Flash forward to a few days ago, when the same Katerina Bezrukova and her collaborator Chester Spell re-appeared with their model, and some numbers to attach to it. They suggested three dimensions to team chemistry, including a demographic factor, an isolation factor, and an ego factor. To take the demographic factor as an example, they looked at race, nationality, and age. Methodologically, it seems that they ran a proximity/similarity matrix within each team based on those characteristics. The idea is that if a player is surrounded by other players who demographically resemble him, he will have more people to talk to, and that will lead to better living through chemistry.
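To make the proximity/similarity idea concrete, here is a minimal sketch of what such a matrix could look like. To be clear, this is my own toy version and not Bezrukova and Spell's method, which was not published in detail. The five-player roster, the trait labels, the three-year age window, and the equal weighting of the three traits are all invented for illustration.

```python
from itertools import combinations

# Hypothetical roster: (name, race/ethnicity label, nationality, age).
# Every value here is invented for illustration.
ROSTER = [
    ("A", "white", "USA", 27),
    ("B", "white", "USA", 31),
    ("C", "latino", "DOM", 24),
    ("D", "latino", "VEN", 25),
    ("E", "asian", "KOR", 27),
]

AGE_WINDOW = 3  # assumption: ages within 3 years of each other count as "similar"


def similarity(p, q):
    """Fraction of the three traits (race, nationality, age band) two players share."""
    matches = [p[1] == q[1], p[2] == q[2], abs(p[3] - q[3]) <= AGE_WINDOW]
    return sum(matches) / len(matches)


def similarity_matrix(roster):
    """Symmetric player-by-player matrix of similarity scores (1.0 on the diagonal)."""
    n = len(roster)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = similarity(roster[i], roster[j])
    return matrix


def mean_similarity_to_others(roster):
    """Average similarity of each player to every teammate (excluding himself)."""
    matrix = similarity_matrix(roster)
    n = len(roster)
    return {
        roster[i][0]: sum(matrix[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    }


if __name__ == "__main__":
    for name, score in mean_similarity_to_others(ROSTER).items():
        print(f"{name}: {score:.2f}")
```

A player with a low average score has few teammates who resemble him on these traits. Whether that tells you anything about who he actually talks to is, of course, the open question.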
The idea of fault lines comes in when there are individuals (or small groups) who have nothing in common with others. That presents an opportunity for conflict and poor chemistry. Isolation refers to the fact that some people are demographically unique within a team. The article specifically mentions Hyun-Jin Ryu and Kenley Jansen as the only Dodgers from South Korea and Curaçao, respectively, and offers them up as examples of isolated players. “Ego” refers to a combination of performance (they did not disclose what performance metric they used) and salary, with the idea being that you don’t want too many highly paid All-Stars around a bunch of guys who make the league minimum. It’s too easy to draw fault lines between the haves and have-nots.
I think we need to pull apart a few things here. With respect to the demographic and isolation factors, it’s easy to caricature Bezrukova and Spell’s argument as saying that friendships can’t form across racial and age lines (and to snark that they are advocating teams composed of players from only one group). The magazine story doesn’t make things better by citing that the Giants’ excellent chemistry rating is largely attributable to the fact that their relievers are mostly Latino while their starters are mostly Caucasian.
The fact of the matter is that in United States culture, we still (and I weep for this) base a lot of our cultural language around the color of someone’s skin. Blessedly, the most horrific expressions of that legacy are less common today than they used to be. There are people (a lot of people, I’d like to believe) who make honest efforts to purge themselves of those demons. It’s true that conflict can arise over plenty of issues other than demography (think politics, religion, and other stuff you’re not supposed to discuss at the Thanksgiving table) and that some clubhouses might actually look and act like public service announcements. Sadly, though, race and nationality can still be a source of conflict in a baseball clubhouse or in any other workplace.
If we can accept that we, as humans, haven’t fully figured out the whole race thing, then we have to admit that it could be hard for people to deal with. Maybe it doesn’t produce open hostility, but it could take the form of “I’m not going to make the effort to get to know you.” Whatever you want to call that, it most certainly would be a missed opportunity for two players to share knowledge that could make both of them better players or to share emotional support that could keep them grounded during the several-month grind of a baseball season. It’s not certain that demographic differences will produce conflict, but the risk goes up.
The “ego” factor rests on the idea that players will, deep down, feel jealous of other players who make more money than they do or who are more talented than they are, and that this jealousy will be a cause for conflict in the clubhouse. On the other hand, the article discusses how having no good players making a lot of money means that the team might lack leadership. Therefore, the researchers suggest that a nice balance is ideal for providing optimal chemistry.
There is research to back up the idea that people do feel ill at ease when interacting with people who differ from them significantly in income or wealth, although I don’t know how well it translates when many of the people in the room are millionaires. As far as the talent spread goes, it’s also true that groups of men often assign leadership roles to the people who are good at whatever the job is rather than the person who might make the best leader. But again, we’re dangerously close to trafficking in generalities. There probably are superstar baseball players who are so in love with their own reflections that everyone else hates them for being such jerks. There are also those who are really good but are beautifully humble about it and universally beloved because of it. But in the aggregate, having too big a spread in talent between players could lead to jealousy. Again, it’s a risk factor, not a certainty.
If I had to review the model as it applies to baseball, I would offer this: The ESPN model does what it says on the label. It takes publicly available data (age, race, national origin, salary, performance) and gives a numerical estimate of how far apart or close together a team is on these factors. If we assume that these factors are associated with an increased chance of conflict (reasonable) and that the conflict could affect performance (again, reasonable) then you could make the argument that this calculation is a decent stand-in for “team chemistry.”
This is a strange moment for sabermetrics. We’ve gone through a phase where we had to learn, sometimes in a painful way, the lesson that just because we can’t put a number on something doesn’t mean that it doesn’t exist. I’d argue that here we see the flipside. Just because you can do some mathematical gymnastics to put a number on something doesn’t mean that you’ve measured it.
The reaction to the ESPN piece has centered around a certain amount of outrage from fans of teams who were rated as poor (but those guys always look so happy together when the cameras are on! And they have beards!). Everyone wants to believe that their team has good chemistry. There’s also been a mystical/gnostic strain of critique that while chemistry exists, it is fundamentally unmeasurable (often leveled by the same fans of low-rated teams). At its core, though, I think the Bezrukova and Spell model is tripping over a simple question. It’s clear that they’ve measured something, but is it chemistry?
A model doesn’t have to be perfect to be useful, but how much imperfection can we accept before it stops being useful? There are several holes that can be poked in this model. For one, it focuses only on demographic factors as indicators of chemistry. What of personality traits? We could perhaps get a better read on a clubhouse by asking the question “How many jerks are in there?” What about accounting for leaders in the clubhouse who can help to smooth over some of these issues? Can a “clubhouse” be measured as a unified whole? Are demographic differences really causing arguments (or standoffishness between players) in the clubhouse? Perhaps other issues (religious players vs. non-religious players; drinkers vs. teetotalers; soda vs. pop?) are causing more divisions?
Beyond that, there are some #GoryMath issues. In the Venn diagram, the circles for “things we are measuring” and “things that could cause issues in a clubhouse” do not overlap perfectly, but even if we assumed that the only things that could ever cause a problem in an MLB clubhouse are these demographic issues, we’ve got an R-squared problem. Suppose that we had the most fractured clubhouse possible under the Bezrukova and Spell system. The model sees the demographic composition of the team, but not whether the team contains 25 horrible bigots or 25 guys who are thrilled because of all the cultural learning they are about to do. Fault lines do not necessitate conflict. They just increase the chances of it. When you plug a variable like that into a regression, you’re not going to pick up a huge amount of variance.
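A quick simulation shows why a risk factor of this kind tends to explain little variance. This is only a sketch under made-up assumptions: the fault-line index, the conflict probabilities (20 percent at best, 50 percent at worst), and the three-win cost of a conflict are numbers I invented, and none of them come from the ESPN model. Even in this fake world, where fault lines are the only thing that ever causes conflict, the index only shifts the odds.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation, which equals R-squared for a one-variable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def simulate_r_squared(n_teams=5000, seed=0):
    """Simulate team-seasons where fault lines raise the odds of conflict but never guarantee it."""
    rng = random.Random(seed)
    fault_index, win_change = [], []
    for _ in range(n_teams):
        fault = rng.random()              # hypothetical fault-line index from 0 to 1
        p_conflict = 0.2 + 0.3 * fault    # invented: conflict chance rises from 20% to 50%
        conflict = rng.random() < p_conflict
        fault_index.append(fault)
        win_change.append(-3.0 if conflict else 0.0)  # invented: a conflict costs 3 wins
    return r_squared(fault_index, win_change)


if __name__ == "__main__":
    print(f"R-squared: {simulate_r_squared():.3f}")
```

Under these assumptions the index explains only a few percent of the variance in win change, and that is before adding talent, injuries, and luck to the outcome.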
I haven’t even yet touched on the fact that they somehow (and they won’t say how, other than the fact that they used a regression) converted these chemistry ratings into estimates of how much a team could expect to gain or lose in terms of wins. Like any other black box model, that’s annoying. The scientist in me would politely reserve comment until they open up the box and show how they controlled for previous talent and what measures they used and what their training data set was.
However, I can comment on how the results were presented. Each team was given a rating for how much each of the three chemistry subscales (demographics, isolation, ego) contributed to its prediction. A team might lose 0.2 wins on one of the scales and gain 0.5 on another. That’s not hard to get out of a regression. What concerns me is that if we have very real concerns about how well the input variables explain the variance in what they are ostensibly predicting, the results that come from such a regression are going to have big standard errors of estimation. There’s a big difference between “I think this will have an effect of half a win plus or minus three” vs. “I think this will have an effect of half a win plus or minus a tenth of a win.” To present the estimates without that context makes them seem much more certain than they really are, and here, I think that matters a lot.
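For anyone who wants to see where that "plus or minus" comes from, here is a bare-bones sketch of a one-variable regression that reports the standard error along with the slope. The function name and the tiny data sets are hypothetical, and this is ordinary textbook least squares, not a reconstruction of whatever regression the researchers ran.

```python
import math


def ols_slope_with_se(xs, ys):
    """Return (slope, standard error of the slope) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (slope and intercept are estimated).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope, se


if __name__ == "__main__":
    # Hypothetical data: chemistry score (x) and wins above projection (y).
    chemistry = [0, 1, 2, 3]
    noisy_wins = [1, 0, 3, 2]
    slope, se = ols_slope_with_se(chemistry, noisy_wins)
    # Roughly a 95 percent interval: estimate plus or minus two standard errors.
    print(f"effect per point: {slope:.2f} +/- {2 * se:.2f}")
```

The point estimate alone looks tidy. The interval printed next to it is what tells you whether to take it seriously.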
So here’s what we have. We have a model that takes publicly available data and cleverly finds a way to turn it into an index of things that could cause conflict. That index doesn’t include everything that could cause conflict. Nor does it guarantee that the things it does include will cause conflict. But it is about the best that they could do with what data are easily available. In some sense, the problem with this research (other than the black box part) isn’t that it’s necessarily wrong. It’s just being oversold. This is a neat little first step in understanding team chemistry in baseball, but there’s still a lot to hash out. Using this as the centerpiece of the ESPN predictions gives it the appearance of being a finished product that’s ready for prime time, and that we’ve somehow solved the chemistry question. We haven’t. Yet.