
Let’s talk about the ESPN: The Magazine team chemistry rankings. For those who haven’t seen them yet, I suggest going here, but if you’d like to skip to the good part, the centerpiece of ESPN’s predictions about the 2014 season is that they adjusted them for team chemistry. The article actually (seriously, no really) says that the Tampa Bay Rays are projected to win 1.7 extra games this year because of chemistry. This will be enough to win them the AL East.

Right then.

This is evolution at a speed that would disprove some tenet of special relativity. I only wish I knew more about special relativity so that I could tell you which part. We’ve gone from chemical atheism (“Chemistry doesn’t matter. Winning breeds chemistry”) to chemical agnosticism (“Chemistry is this vague mysterious thing that you can’t easily measure and it might or might not have an effect.”) to fully drinking the chemically laced Gummiberry Juice (“Chemistry is worth 1.7 wins.”). Whoa.

I should probably cop to a couple of things. I have a bit of a head start on this one. Last summer, BP’s own Sam Miller wrote an article on team chemistry in—funny enough—ESPN the Magazine, in which I was quoted. He also talked to Santa Clara University professor Katerina Bezrukova, who spoke of her work on group dynamics, specifically how “fault lines” can appear based on any number of factors to divide members of a group from one another and can determine the success or failure of a group.

Since I’m thrilled to meet anyone who lives at the corner of Baseball Street and Psychology Lane, I contacted her and we chatted behind the scenes. I looked a bit into the #GoryMath and the theory behind her research. It’s not my area of expertise, but the theory was at least pleasantly plausible and the math that she (and her colleagues) used made sense, at least for what they were trying to investigate about workplace interactions in general. Like a lot of things in science, it needed more work before I would fully buy into it, but it passed the silliness test.

Flash forward to a few days ago, when the same Katerina Bezrukova and her collaborator Chester Spell re-appeared with their model, and some numbers to attach to it. They suggested three dimensions to team chemistry, including a demographic factor, an isolation factor, and an ego factor. To take the demographic factor as an example, they looked at race, nationality, and age. Methodologically, it seems that they ran a proximity/similarity matrix within each team based on those characteristics. The idea is that if a player is surrounded by other players who demographically resemble him, he will have more people to talk to, and that will lead to better living through chemistry.

The idea of fault lines comes in when there are individuals (or small groups) who have nothing in common with others. That presents an opportunity for conflict and poor chemistry. Isolation refers to the fact that some people are demographically unique within a team. The article specifically mentions Hyun-Jin Ryu and Kenley Jansen as the only Dodgers from South Korea and Curacao, respectively, and offers them up as examples of isolated players. “Ego” refers to a combination of performance (they did not disclose what performance metric they used) and salary, with the idea being that you don’t want too many highly-paid All-Stars around a bunch of guys who make the league minimum. It’s too easy to draw fault lines between the haves and have-nots.
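To make the demographic and isolation ideas concrete, here is a minimal sketch. It is not the researchers' method (they did not publish their exact calculation); the roster, the attribute choices, and the function names `pair_similarity`, `mean_similarity`, and `isolated` are all invented for illustration. It scores each pair of players by the share of attributes they have in common, averages that over the team, and flags anyone who is the only player with a given value.

```python
from collections import Counter
from itertools import combinations

# Hypothetical roster: (name, nationality, age band). Invented data, not real players.
roster = [
    ("A", "USA", "20s"),
    ("B", "USA", "30s"),
    ("C", "DOM", "20s"),
    ("D", "DOM", "20s"),
    ("E", "KOR", "20s"),
]

def pair_similarity(p, q):
    """Fraction of attributes (everything after the name) that two players share."""
    attrs_p, attrs_q = p[1:], q[1:]
    matches = sum(a == b for a, b in zip(attrs_p, attrs_q))
    return matches / len(attrs_p)

def mean_similarity(players):
    """Average pairwise similarity across every pair on the team."""
    pairs = list(combinations(players, 2))
    return sum(pair_similarity(p, q) for p, q in pairs) / len(pairs)

def isolated(players, attr_index):
    """Names of players who are the only one with their value of an attribute."""
    counts = Counter(p[attr_index] for p in players)
    return [p[0] for p in players if counts[p[attr_index]] == 1]

team_similarity = mean_similarity(roster)
isolated_by_nationality = isolated(roster, 1)
print(team_similarity, isolated_by_nationality)
```

Note what a score like this can and cannot see: it reads the roster sheet, not the clubhouse.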

I think we need to pull apart a few things here. With respect to the demographic and isolation factors, it’s easy to caricature Bezrukova and Spell’s argument as saying that friendships can’t form across racial and age lines (and snark that they are advocating teams composed of players from only one group). The magazine story doesn’t make things better by citing that the Giants’ excellent chemistry rating is largely attributable to the fact that their relievers are mostly Latino while their starters are mostly Caucasian.

The fact of the matter is that in United States culture, we still (and I weep for this) base a lot of our cultural language around the color of someone’s skin. Blessedly, the most horrific expressions of that legacy are less common today than they used to be. There are people (a lot of people, I’d like to believe) who make honest efforts to purge themselves of those demons. It’s true that conflict can arise over plenty of issues other than demography (think politics, religion, and other stuff you’re not supposed to discuss at the Thanksgiving table) and that some clubhouses might actually look and act like public service announcements. Sadly, though, race and nationality can still be a source of conflict in a baseball clubhouse or in any other workplace.

If we can accept that we, as humans, haven’t fully figured out the whole race thing, then we have to admit that it could be something that could be hard for people to deal with. Maybe it doesn’t produce open hostility, but it could take the form of “I’m not going to make the effort to get to know you.” Whatever you want to call that, it most certainly would be a missed opportunity for two players to share knowledge that could make both of them better players or to share emotional support that could keep them grounded during the several-month grind of a baseball season. It’s not certain that demographic differences will produce conflict, but the risk goes up.

The “ego” factor rests on the idea that players will, deep down, feel jealous of other players who make more money than they do or who are more talented than they are, and that this jealousy will be a cause for conflict in the clubhouse. On the other hand, the article discusses how having no good players making a lot of money means that the team might lack leadership. Therefore, the researchers suggest that a nice balance is ideal for providing optimal chemistry.

There is research to back up the idea that people do feel ill at ease when interacting with people who differ from them significantly in income or wealth, although I don’t know how well it translates when many of the people in the room are millionaires. As far as the talent spread goes, it’s also true that groups of men often assign leadership roles to the people who are good at whatever the job is rather than the person who might make the best leader. But again, we’re dangerously close to trafficking in generalities. There probably are superstar baseball players who are so in love with their own reflections that everyone else hates them for being such jerks. There are also those who are really good but are beautifully humble about it and universally beloved because of it. But in the aggregate, having too big a spread in talent between players could lead to jealousy. Again, it’s a risk factor, not a certainty.

If I had to review the model as it applies to baseball, I would offer this: The ESPN model does what it says on the label. It takes publicly available data (age, race, national origin, salary, performance) and gives a numerical estimate of how far apart or close together a team is on these factors. If we assume that these factors are associated with an increased chance of conflict (reasonable) and that the conflict could affect performance (again, reasonable) then you could make the argument that this calculation is a decent stand-in for “team chemistry.”

This is a strange moment for sabermetrics. We’ve gone through a phase where we had to learn, sometimes in a painful way, the lesson that just because we can’t put a number on something doesn’t mean that it doesn’t exist. I’d argue that here we see the flipside. Just because you can do some mathematical gymnastics to put a number on something doesn’t mean that you’ve measured it.

The reaction to the ESPN piece has centered around a certain amount of outrage from fans of teams who were rated as poor (but those guys always look so happy together when the cameras are on! And they have beards!). Everyone wants to believe that their team has good chemistry. There’s also been a mystical/gnostic strain of critique that while chemistry exists, it is fundamentally unmeasurable (often leveled by the same fans of low-rated teams). At its core, though, I think the Bezrukova and Spell model is tripping over a simple question. It’s clear that they’ve measured something, but is it chemistry?

A model doesn’t have to be perfect to be useful, but how much imperfection can we accept before it stops being useful? There are several holes that can be poked in this model. For one, it focuses only on demographic factors as indicators of chemistry. What of personality traits? We could perhaps get a better read on a clubhouse by asking the question “How many jerks are in there?” What about accounting for leaders in the clubhouse who can help to smooth over some of these issues? Can a “clubhouse” be measured as a unified whole? Are demographic differences really causing arguments (or standoffishness between players) in the clubhouse? Perhaps other issues (religious players vs. non-religious players; drinkers vs. teetotalers; soda vs. pop?) are causing more divisions?

Beyond that, there are some #GoryMath issues. The Venn diagram of “things we are measuring” and “things that could cause issues in a clubhouse” is not a perfect overlap, but even if we assumed that the only things that could ever cause a problem in an MLB clubhouse are these demographic issues, we’ve got an R-squared problem. Suppose that we had the most fractured clubhouse possible under the Bezrukova and Spell system. The model sees the demographic composition of the team, but not whether the team contains 25 horrible bigots or 25 guys who are thrilled because of all the cultural learning they are about to do. Fault lines do not necessitate conflict. They just increase the chances of it. When you plug a variable like that into a regression, you’re not going to pick up a huge amount of variance.
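A toy simulation shows why a "raises the odds" variable tends to explain little variance. Every number here is an assumption made up for the sketch (the conflict probabilities, the three-win cost of conflict, the five-win spread from everything else); none of it comes from the ESPN piece or from Bezrukova and Spell.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_team():
    """One invented team: a fault-line score and its wins relative to projection."""
    fault_line_score = rng.random()            # 0 = no fault lines, 1 = most fractured
    p_conflict = 0.2 + 0.2 * fault_line_score  # assumption: fault lines only raise the odds
    conflict = rng.random() < p_conflict
    # Assumption: conflict costs 3 wins; everything else (talent, luck) is noise with SD 5.
    wins_vs_projection = rng.gauss(0, 5) - (3 if conflict else 0)
    return fault_line_score, wins_vs_projection

def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R-squared for a one-variable regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

teams = [simulate_team() for _ in range(5000)]
scores = [t[0] for t in teams]
wins = [t[1] for t in teams]
r2 = r_squared(scores, wins)
print(f"R-squared: {r2:.4f}")
```

Even though conflict genuinely costs wins in this made-up world, the score only shifts the probability of conflict, so most of the variation in wins is left unexplained by it.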

I haven’t even yet touched on the fact that they somehow (and they won’t say how, other than the fact that they used a regression) converted these chemistry ratings into estimates of how much a team could expect to gain or lose in terms of wins. Like any other black box model, that’s annoying. The scientist in me would politely reserve comment until they open up the box and show how they controlled for previous talent and what measures they used and what their training data set was.

However, I can comment on how the results were presented. Each team was given a rating for how much each of the three chemistry subscales (demographics, isolation, ego) contributed to its prediction. A team might lose 0.2 wins on one of the scales and gain 0.5 on another. That’s not hard to get out of a regression. What concerns me is that if we have very real concerns about how well the input variables explain the variance in what they are ostensibly predicting, the results that come from such a regression are going to have big standard errors of estimation. There’s a big difference between “I think this will have an effect of half a win plus or minus three” vs. “I think this will have an effect of half a win plus or minus a tenth of a win.” To present the estimates without that context makes them seem much more certain than they really are, and here, I think that matters a lot.
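For readers who want to see where the "plus or minus" comes from, here is a minimal sketch of a one-variable least-squares fit that reports the slope together with its standard error. The data points are invented, the function name `ols_slope_with_se` is hypothetical, and nothing here reflects the regression the researchers actually ran.

```python
import math

def ols_slope_with_se(xs, ys):
    """Fit y = a + b*x by least squares; return (slope b, standard error of b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope, se

# Invented example: a chemistry score and wins relative to projection for five teams.
chemistry_score = [0, 1, 2, 3, 4]
wins_vs_projection = [1, 3, 2, 5, 4]

slope, se = ols_slope_with_se(chemistry_score, wins_vs_projection)
# A crude band of two standard errors either side, not an exact confidence interval.
print(f"slope = {slope:.2f}, roughly {slope - 2 * se:.2f} to {slope + 2 * se:.2f}")
```

The point estimate is the same whether the standard error is tiny or enormous, which is why reporting the first without the second can mislead.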

So here’s what we have. We have a model that takes publicly available data and cleverly finds a way to turn it into an index of things that could cause conflict. That index doesn’t include everything that could cause conflict. Nor does it guarantee that the things it does include will cause conflict. But it is about the best that they could do with what data are easily available. In some sense, the problem with this research (other than the black box part) isn’t that it’s necessarily wrong. It’s just being oversold. This is a neat little first step in understanding team chemistry in baseball, but there’s still a lot to hash out. Using this as the centerpiece of the ESPN predictions gives it the appearance of being a finished product that’s ready for prime time, and that we’ve somehow solved the chemistry question. We haven’t. Yet.

Seems like it should be fairly easy to apply at least a cursory test to their hypothesis. Go back a decade or so and check their model performance vs any number of the sabermetric win prediction models. Did they use any of this sort of testing in developing their model?
"There’s a big difference between “I think this will have an effect of half a win plus or minus three” vs. “I think this will have an effect of half a win plus or minus a tenth of a win.”"

BP has been writing about baseball stats for ~15 years and never once supplied these kinds of error bars.

Don't get me wrong, I think it's an extremely important thing to do, I just hope you have time to scurry back into BPs figurative glass house. ;)
I do wish BP (and everyone else) would publish error bars. I get why BP (and everyone else, including ESPN) doesn't. It's mostly a space issue, but the old stats prof in me says if mean, then standard deviation.
Exactamundo. Test #1 is to apply it to past teams. They didn't even bother doing that? Then it's worthless.
In fairness, they may have behind the scenes. They didn't report it in the article. Until they either say "We didn't do that" or reveal their methods (and results) the proper thing to do is reserve judgment.
It is a provocative line of thought that could, and should, forward the conversation on chemistry. Hopefully it doesn't get twisted to meet other agendas. The assumption that similarity breeds harmony & chemistry could have merit but I, too, would like some quantification based on previous results.
Seems ironic that just as ESPN brings Nate Silver into the fold, they publish this piece with such, ahem, "tenuous" statistical methodology.

I was mostly annoyed that they called out the A's for only having one non-white pitcher (Abad IIRC), when Jesse Chavez was always going to be on the team.
Here goes Puig, single handedly ruining the Dodgers again
I wonder if you could come up with a better "chemistry" rating simply by running a super-secret player poll in which they rate their manager on how much they respect him and like playing for him. Not to beat a dead horse here -- OK, I am going to beat a dead horse -- but you could call it the "Valentine Rating" or something like that and even run it around 14 February, before the team has endured any slings and arrows of outrageous fortune in actual games.
Regression results without any accompanying information on model structure or diagnostics are just about useless. Yes, one acquires shiny exciting coefficients to use, but one gets those no matter how well the model fits the data and whether or not it displays any predictive ability. Here, we have to take it on faith that those doing the study know what they're doing, and I've certainly seen many counter-examples among academics.
Until they "show their work" I am going to assume that they are mostly just making "stuff" up.
I am surprised that there is nothing here about last year's Red Sox. It was a goal of the front office to bring in strong clubhouse presences and it appears to have worked, but how do you quantify Jonny Gomes and his great line, "One day closer to the parade". It is clear that when he is on a team, that team seems to exceed the projections, and often by a great deal, but what other factors might enter the picture. I would also wonder what influence one or two outsized personalities might have. Big Papi's talk to the team in the dugout during last year's World Series was remarkable. I had never seen such a scene in a long career of watching baseball. Dustin Pedroia also appears to have a similar type of personality and commands a great deal of respect in the Boston dugout. John Farrell appeared to be a very positive, and calming, influence after the Valentine disaster. That begs the question of who sets the tone in a clubhouse, the players or the manager. IMHO, I agree that there is something here but that something lives in a sort of Never-Never Land that exists but doesn't and cannot be conjured up. Achieving that nirvana that is great team chemistry is serendipity and attempts to quantify it are like trying to prove the existence of Sasquatch.
Around the turn-of-the-century, I did a stint in the Israeli Army.
The typical Israeli has European or Middle Eastern skin colouring, but in the mid-80s and early-90s, groups of Jewish refugees from Ethiopia emigrated to Israel. As you should be able to imagine, these Ethiopians looked neither European nor Middle Eastern.
Today, they make up a bit less than 2% of the country's population. A typical army unit has between 50-200 people. By law of averages, one would expect the average army unit to have between 1-4 Ethiopians. This was not so. While I never saw the official order, all of my anecdotal experience tells me the following is - or at least was - true.
Army units were not allowed to have single soldiers of Ethiopian descent. They either had multiple, or they had none. In my own unit, we were once assigned a fresh young Ethiopian soldier. Within three days, she was transferred back out to a more duochromatic unit.
I look at the question of when the first MLB player will announce his homosexuality in a similar way. There will not be one guy who comes out. There will be multiple emerging simultaneously.
If economic parity does lead to better team chemistry, the Houston Astros will earn a zillion bonus points, since there isn't a highly paid player in the bunch.