
People love to talk about the mood of a franchise, or the collective feeling of its fanbase. Are they dispirited, optimistic? Ecstatic following a World Series win, or broken after an agonizing walkoff loss? For the most part, we leave it to the beat writers to gauge mood (which is not necessarily a bad thing), without any kind of backing for their proclamations (which might be a bad thing).

Hypothetically, fans are a reservoir of great wisdom (collectively, although perhaps not individually). So tapping into the mood of a fanbase could be more than interesting; it could be useful. But, beyond inquiring with potentially biased observers, there has been little we could do to objectively or quantitatively measure a fanbase’s mood.

In this article, I’m going to present one way to gauge the happiness of a fanbase, using a text analysis of the website Reddit. Reddit is an aggregation engine, to which individual users can submit links to other websites or original content, which is then upvoted, downvoted, and commented upon. Importantly, Reddit self-organizes into communities of like-minded individuals, one category of which is fans of a sports team. As a result, there is one team-specific subreddit (community) for each MLB team’s fans, along with a huge body of text from that team’s fans.

I used a freely available program[1] to harvest Reddit comments and posts en masse, over a month-long time period (roughly Jan. 5-Feb. 5). The program spits out a list of words, along with the number of times each word occurs. So, for example, the Yankees subreddit used the word “money” 25 times in the past month. The small-market Rays subreddit, on the other hand, used the same word merely five times.
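To make the word-count step concrete, here is a minimal sketch of how a list of comments can be turned into per-word counts. It is not the harvesting script credited in the footnote; the function name `count_words`, the tokenizing rule, and the sample comments are assumptions made for illustration.

```python
import re
from collections import Counter

def count_words(comments):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    counts = Counter()
    for text in comments:
        # Lowercase the text and treat runs of letters/apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical comments standing in for harvested subreddit text.
sample_comments = [
    "That contract is a lot of money.",
    "Money well spent, honestly.",
]
word_counts = count_words(sample_comments)
print(word_counts["money"])
```

Running the same function over every comment in a subreddit would give the kind of word list described above, one count per distinct word.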

To figure out how happy each team’s fanbase is, I did what’s called ‘sentiment analysis’ on each list of words. The idea is like this: Some words tend to be used in positive situations, and indicate that the writer is happier, while others are more negative in connotation, and suggestive of despair. For example, ‘excellence’ is a very positive word, and ‘deception’ an unpleasant one. If a team’s comments are filled with words like excellence, and bereft of words like deception, they are probably happy, and vice versa.

To do the sentiment analysis, I used a list of words (called AFINN-111[2]) which had been manually assigned levels of positivity from -5 to 5. To give you an idea of how it works, the word ‘excellence’ is rated a +3 on this list, while ‘deception’ is rated -3. Then I matched up words from the Reddit analysis with the sentiment list, multiplied each word’s rating by the number of times it was used in each subreddit, and added up the results. The higher the total score, which I called the total affect rating, the happier the fanbase[3].
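The scoring can be sketched in a few lines, under some loud assumptions. The two lexicon entries are the ratings quoted above, but the word counts are invented, the function name `affect_scores` is hypothetical, and the affect ratio is computed as positive points divided by the absolute value of negative points, which is only one reading of the definition in footnote 3.

```python
# Tiny excerpt of an AFINN-style lexicon: word -> rating from -5 to 5.
# Only these two ratings come from the article; a real list has many more words.
lexicon = {"excellence": 3, "deception": -3}

# Hypothetical word counts for one subreddit.
word_counts = {"excellence": 4, "deception": 1, "baseball": 10}

def affect_scores(word_counts, lexicon):
    """Return (total affect rating, affect ratio) for one subreddit."""
    positive = 0
    negative = 0
    for word, count in word_counts.items():
        rating = lexicon.get(word)
        if rating is None:
            continue  # words missing from the lexicon contribute nothing
        weighted = rating * count  # rating multiplied by times used
        if weighted > 0:
            positive += weighted
        else:
            negative += weighted
    total = positive + negative
    # Assumed definition: positive points per negative point.
    ratio = positive / abs(negative) if negative else float("inf")
    return total, ratio

total, ratio = affect_scores(word_counts, lexicon)
print(total, ratio)
```

With these made-up counts, ‘excellence’ contributes 3 × 4 = 12 and ‘deception’ contributes -3 × 1 = -3, so the total is 9 and the ratio is 4.0. The neutral word ‘baseball’ is ignored because it has no rating.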

Here’s what I found, for all 30 teams, sorted by total affect rating, our proxy for fanbase happiness.

| Name | Total Affect Rating | Projected Wins | Last Year's Wins | Affect Ratio |
|---|---|---|---|---|
| San Francisco Giants | 12082 | 84 | 88 | 1.983636 |
| New York Mets | 8087 | 81 | 79 | 1.823188 |
| St. Louis Cardinals | 7185 | 89 | 90 | 1.868383 |
| Atlanta Braves | 6967 | 75 | 79 | 2.008833 |
| Los Angeles Dodgers | 5214 | 97 | 94 | 1.574419 |
| Toronto Blue Jays | 4263 | 83 | 83 | 1.88444 |
| Seattle Mariners | 4172 | 87 | 87 | 2.596021 |
| Chicago Cubs | 4096 | 81 | 73 | 2.007131 |
| Boston Red Sox | 3914 | 88 | 71 | 2.100056 |
| Washington Nationals | 3706 | 91 | 96 | 1.988267 |
| Oakland Athletics | 2816 | 85 | 88 | 2.222753 |
| Baltimore Orioles | 2623 | 78 | 96 | 1.852454 |
| Chicago White Sox | 2214 | 79 | 73 | 2.410191 |
| Detroit Tigers | 2163 | 83 | 90 | 1.916525 |
| Milwaukee Brewers | 1984 | 80 | 82 | 2.242329 |
| Texas Rangers | 1849 | 79 | 67 | 1.912185 |
| Cincinnati Reds | 1618 | 79 | 76 | 2.091032 |
| Pittsburgh Pirates | 1574 | 81 | 88 | 2.569292 |
| San Diego Padres | 1540 | 85 | 77 | 2.295206 |
| Philadelphia Phillies | 1475 | 70 | 73 | 1.866627 |
| Houston Astros | 1289 | 77 | 70 | 2.141718 |
| Miami Marlins | 1184 | 80 | 77 | 2.624143 |
| Kansas City Royals | 1000 | 71 | 89 | 1.996016 |
| Minnesota Twins | 791 | 70 | 70 | 1.873068 |
| Cleveland Indians | 771 | 80 | 85 | 2.164653 |
| New York Yankees | 684 | 80 | 84 | 2.055556 |
| Arizona D-backs | 624 | 73 | 64 | 1.794904 |
| Colorado Rockies | 504 | 71 | 66 | 1.760181 |
| Los Angeles Angels | 433 | 91 | 98 | 1.80334 |
| Tampa Bay Rays | 320 | 86 | 77 | 2.5311 |

It’s Always Sunny in {Insert City Here}
First of all, let’s get this out of the way: Fanbases are all, without exception, pretty optimistic compared to other subreddits. On average, every fanbase maintains a substantially positive total affect. This finding makes a lot of sense, when you take into account the powerful selection bias involved in contributing to a team-specific subreddit—you probably aren’t going to do it unless you have some positive feelings (or at least hope) for the team of interest.

But perhaps these fanbases aren’t any happier than the rest of the internet. To check that, I looked at a few other subreddits, and calculated their levels of positive affect. For example, I scrutinized a collection of texts from city-based subreddits (for example, /r/Chicago, /r/Miami, etc.). No city I looked at had higher than the lowest affect ratio for a team-specific subreddit. All in all, this makes a lot of sense: baseball is an optional hobby, so if someone doesn’t like participating in it, they probably won’t.

The Causes of Fan Happiness
Next, I was curious about what factors correlate with the happiness of the redditors. The first and most obvious factor that might influence the happiness of a fanbase is its past performance. The Tigers, for example, are perennial contenders and finished last year with 90 wins. They’ve been to a World Series recently, and are known as a great organization. How much does that contribute to their mood? As a rough proxy for past success, I used last year’s number of wins.

Previous-year wins contribute surprisingly little to total happiness. The correlation is there (r=.3[4]), but it is not quite statistically significant.

Another possibility is that the fanbase is less concerned about the past performance, and more with the future. It’s possible that fans are already over the results of last season, and have moved on in their mood to thinking about next season. We can check this by going to PECOTA, which objectively projects the performance of every team for the next year. PECOTA stands in here for the conventional wisdom, reflecting what we think we know about next year’s likely performance.

Here, there is a slightly more substantial (r=.39) and also significant (p=.032) relationship. So it seems, on the surface at least, that Reddit fanbases are more concerned with the future than with dwelling on their past success.
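For readers who want to try the rank-order (Spearman) correlation themselves, the sketch below ranks each variable, averaging the ranks of ties, and then takes the ordinary Pearson correlation of the ranks. The data are copied from the table earlier in the article. The helper names are my own, and I make no claim that this reproduces the reported coefficients exactly, since the original calculation may have handled ties or inputs differently.

```python
from statistics import mean

# (total affect rating, projected wins, last year's wins), in table order.
teams = [
    (12082, 84, 88), (8087, 81, 79), (7185, 89, 90), (6967, 75, 79),
    (5214, 97, 94), (4263, 83, 83), (4172, 87, 87), (4096, 81, 73),
    (3914, 88, 71), (3706, 91, 96), (2816, 85, 88), (2623, 78, 96),
    (2214, 79, 73), (2163, 83, 90), (1984, 80, 82), (1849, 79, 67),
    (1618, 79, 76), (1574, 81, 88), (1540, 85, 77), (1475, 70, 73),
    (1289, 77, 70), (1184, 80, 77), (1000, 71, 89), (791, 70, 70),
    (771, 80, 85), (684, 80, 84), (624, 73, 64), (504, 71, 66),
    (433, 91, 98), (320, 86, 77),
]

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

affect = [t[0] for t in teams]
projected = [t[1] for t in teams]
last_year = [t[2] for t in teams]
print("affect vs. last year's wins:", round(spearman(affect, last_year), 2))
print("affect vs. projected wins:  ", round(spearman(affect, projected), 2))
```

Because only ranks enter the calculation, the Giants’ enormous total counts no more than any other first-place value, which is the appeal of a rank-order statistic when the relationship does not look linear.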

Individually, past performance and future projections contribute relatively little to explaining a fanbase’s mood. But perhaps together, there are some synergistic effects that can explain more of the variation. I put both predictors into a combined regression, and checked to see how well I could predict the resulting total affect rating.

Surprisingly, when combining the variables together[5], a very substantial improvement is possible. Using the complete model[6], I can predict the total affect rating astoundingly well (r=.7). So maybe fan happiness is, in aggregate and to a first approximation, a simple function of past success and future expectations.

Irrational Exuberance
Doing the predictions in this way allows us to also look at fanbases that are irrationally happy or sad. Here are the top five fanbases that are happier than their performances suggest that they should be:

| Name | Total Affect Rating | Predicted Affect Rating | Difference |
|---|---|---|---|
| San Francisco Giants | 12082 | 8008 | 4074 |
| Seattle Mariners | 4172 | 3338 | 834 |
| Atlanta Braves | 6967 | 6522 | 445 |
| Chicago White Sox | 2214 | 1846 | 368 |
| New York Mets | 8087 | 7814 | 273 |
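The Difference column is simply the actual total affect rating minus the model’s prediction. As a small sketch of that bookkeeping, the snippet below recomputes and sorts the differences from the rounded figures in the table above. The function name `residuals` is illustrative, and the predicted values are taken as given rather than refit.

```python
# (team, total affect rating, predicted affect rating), copied from the table above.
rows = [
    ("San Francisco Giants", 12082, 8008),
    ("Seattle Mariners", 4172, 3338),
    ("Atlanta Braves", 6967, 6522),
    ("Chicago White Sox", 2214, 1846),
    ("New York Mets", 8087, 7814),
]

def residuals(rows):
    """Return (team, actual - predicted) pairs, largest surplus first."""
    diffs = [(team, actual - predicted) for team, actual, predicted in rows]
    return sorted(diffs, key=lambda pair: pair[1], reverse=True)

for team, diff in residuals(rows):
    print(f"{team}: {diff:+d}")
```

A positive difference marks a fanbase that is happier than its wins and projections suggest, and a negative one marks a fanbase that is gloomier.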

There’s no surprise in number one. The Giants’ total happiness is off the charts, which I think must be the result of winning the World Series (again and again and again, in all even-numbered years since 2010). The magnitude of the effect is kind of incredible: Giants fans have a total affect number about 50 percent higher than the next happiest fanbase.

The other teams are a bit more surprising. The Seattle Mariners were significant to the playoff picture last year for the first time in a few seasons, and they project to be above average this year as well. Maybe this excess happiness is the side effect of that return to relevancy. A similar argument could be made for the White Sox, whose shrewd offseason has seen their postseason odds increase substantially. The Braves confuse me, both at the organizational and fanbase levels. The team is not projected to be competitive, nor were they last year, and yet their hopes spring eternally enough to invest $44 million in the dubious defense of Nick Markakis. On top of that, the team is undergoing a gruesome publicly funded stadium controversy, with allegations of political corruption. How the fans remain so optimistic is anybody’s guess.

And the reverse, the fanbases that are most groundlessly unhappy:

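| Name | Total Affect Rating | Predicted Affect Rating | Difference |
|---|---|---|---|
| San Diego Padres | 1540 | 1813.261696 | -273.262 |
| New York Yankees | 684 | 962.1412489 | -278.141 |
| Los Angeles Angels | 433 | 1162.718562 | -729.719 |
| Tampa Bay Rays | 320 | 1183.282363 | -863.282 |
| Toronto Blue Jays | 4263 | 5183.414512 | -920.415 |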

Three of the top five are in the AL East, and that might be more than coincidence. It must be frustrating to see your team regularly compete with great teams outside of the division, only to contend for division titles and wild cards with two of the richest teams in baseball, along with three less wealthy but exceedingly well-run teams (one of whom possesses occult powers). Beyond them, we have the Angels, who are as puzzling as the Braves above. They are good, young, and projected to win 91 games after pacing all of baseball with 98 wins last year. Their continuing despair is mysterious.

There could be a variety of reasons that explain the fanbases’ deviations from their expected behavior, some of which I’ve explained above. I have a faint and probably baseless hope that some of the deviations in expected happiness are the result of the fanbases being able to weigh and take into account factors beyond PECOTA’s considerable purview, like changes in coaching staff (the Rays and the Cubs) or other positive or negative indications from their organization. If that’s the case, then maybe the teams with exceptionally happy or sad redditors (relative to expectations) might be able to tell us something about the accuracy of the projections.

To that end, as the season goes on, I’m hoping to continue tracking the mood of the redditors, checking back in a few times during the year to see how their sentiment scores have changed. It would be fun to see when each fanbase gives up on a team, or if they simply don’t until the very last gasp; or how they react to winning or losing streaks, injuries to their core players, and so on. On top of that, although it’s a very long shot, maybe the mood of the fans will be able to tell us something PECOTA doesn’t know.



[1] Thanks to github user rhiever for making this script.

[2] Check out this paper for some details about the word sentiment list.

[3] Fanbases also differed in terms of their levels of Reddit participation, so in addition to the total affect rating, I calculated the ratio of positive to negative affect scores, which I term the affect ratio. The latter statistic corrects for the variation in participation, and could be used as another measure of fanbase ‘happiness’. Surprisingly, however, affect ratio was not correlated with the total number of words in a subreddit, indicating that participation and happiness are somewhat decoupled. The other results also mostly hold if I look at affect ratio instead of total, although some of the surprisingly happy/unhappy teams change.

[4] For these correlations, I am using the Spearman, i.e. rank-order, correlation coefficient, because the relationships don’t look linear to me.

[5] Along with the total number of words on each subreddit, to account for the level of participation.

[6] To guard against overfitting, I built a support-vector machine model with 2-fold cross-validation, because that’s all this small sample of data could bear. However, there still exists the possibility of overfitting, with so few datapoints. I would like to have more data than just the 30 teams, but unfortunately I am not yet able to harvest subreddit information from earlier than a year ago.