I can’t say that it started innocently, because it started with a tweet. But it started innocently enough:
As you’ll note throughout the above thread, MLB dot com would have you believe that there has been a veritable cornucopia of “mammoths.” Our giant cup labeled MAMMOTHS has runneth over. But as much as we may enjoy Scrooge McDuck-swimming in our vault filled with mammoths, there is something that nags at us with each lap. Why so many mammoths? Where did they come from?
We are taught from a young age to count our blessings, and since we’ve been blessed with many a mammoth I took it upon myself to tally up a total, starting from the beginning of the season, with a little help from a friend. Before we get to the cold, hard data I want to walk you through what exactly it is I’m counting, since it is actually a touch more confusing than you might think.
When you navigate to MLB dot com and decide you want to see some goddamn dingers, you’ll hover over the “video” tab and select the link that says “home runs.” That takes you to a screen with a ribbon underneath that lets you select which highlights you’d like to see, following a 15-second commercial each time you click. There are a couple of different descriptions for each homer, and we’ll be focusing on the one directly below the video, circled in red. We’re using that description because it matches the headline that shows up in the MLB At Bat app. This is relevant because the original adjective use noted by @ProductiveOuts was via the app.
Now that we know what we’re dealing with, let’s talk about how this data was culled. I received assistance from a friend who did “some manual filtering (e.g. get rid of ‘to left/right/center’), and then had [a] natural language processing package called spaCy analyze the text. It uses a model to tag each word with a part of speech and constructs a dependency tree, e.g. in ‘grand slam,’ ‘grand’ depends on ‘slam.’” Once that data was collected, certain terms were filtered out for overuse (solo, 2-run, 3-run, etc.), as they are both used an inordinate amount and do not really get to the core of the idea, which is to analyze the adjectives being used in these home run highlight headlines.
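For the curious, here is a rough sketch of what the filtering and tallying steps might look like. It is not Robert's actual code: it uses only the Python standard library, and a hand-written set of descriptors stands in for spaCy's part-of-speech tagging. The headlines, the `DESCRIPTORS` set, and every function name are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical headlines, invented for illustration (not real MLB.com data).
HEADLINES = [
    "Player A's mammoth solo homer to left",
    "Player B's towering 2-run shot to right-center",
    "Player C's go-ahead homer",
    "Player D's mammoth 3-run homer to right",
]

# Trailing direction phrases such as "to left" or "to right-center".
DIRECTION = re.compile(
    r"\s+to\s+(?:left|right|center)(?:-(?:left|right|center))?\s*$",
    re.IGNORECASE,
)

# Run-count descriptors the article discards as overused.
OVERUSED = {"solo", "2-run", "3-run"}

# Assumption: a fixed word list standing in for a real part-of-speech tagger.
DESCRIPTORS = {"mammoth", "towering", "go-ahead", "solo", "2-run", "3-run"}


def strip_direction(headline):
    """Remove a trailing 'to left/right/center' phrase, if present."""
    return DIRECTION.sub("", headline)


def descriptors(headline):
    """Return the descriptors in a headline, minus the overused ones."""
    words = strip_direction(headline).lower().split()
    return [w for w in words if w in DESCRIPTORS and w not in OVERUSED]


def tally(headlines):
    """Count how often each surviving descriptor appears."""
    counts = Counter()
    for headline in headlines:
        counts.update(descriptors(headline))
    return counts
```

Calling `tally(HEADLINES)` on the made-up sample counts "mammoth" twice and "towering" and "go-ahead" once each. A real tagger earns its keep on words a fixed list would miss, which is the whole reason to bring in something like spaCy.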
To get a sense of how this language processing package works, and how the data was culled, here is an example produced from the headline shown above about Bret Sayre and Wilson Karaman’s large, beefy progeny:
Now that we understand both what data was scoured for this project and how it was parsed, let’s take a look at the first run of sorted information:
We have 20 terms that were used to describe homers (not including solo, 2-run, etc., which were previously discarded). Of those, six tally 50 or more mentions over our sample period, from the beginning of the season through June 16th. Of those six, four (Go-Ahead, Back-to-back, Game-Tying, and Leadoff) are adjectives that apply to the state of the game rather than the particular quality of the home run. If we set those aside and isolate the adjectives that describe the home run itself, it looks more like this:
Ah, yes. That’s a bit more digestible. Here, we see that “towering” is the only adjective that outpaces “mammoth” in MLB dot com’s attempts to cycle through the thesaurus as the rabbit ball continues to fly over fences. We also have the tally of each word broken out by day of the week. It’s expected that Mondays and Thursdays would be the lowest in terms of adjective usage, just because those are the traditional major-league off days. The original idea in parsing the data out like this was to investigate whether a particular headline writer, or writers on particular shifts, were more likely to use “mammoth,” but the relatively equal distribution of the term places that type of conclusion out of reach, based on what we have in front of us.
The lack of booming, clutch homers on Sunday is rather disappointing. There also seems to be an inordinate number of big flies on Mondays, while towering shots are least frequently found on Fridays.
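If you want to try the day-of-the-week breakdown at home, a minimal sketch follows. It assumes you already have (date, adjective) pairs in hand. The records below are invented and the year is arbitrary; the `GAME_STATE` set comes from the four game-state adjectives named above, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# The four game-state adjectives named in the article, lowercased.
GAME_STATE = {"go-ahead", "back-to-back", "game-tying", "leadoff"}

# Hypothetical records, invented for illustration; the year is arbitrary.
RECORDS = [
    (date(2017, 6, 11), "mammoth"),
    (date(2017, 6, 12), "mammoth"),
    (date(2017, 6, 12), "go-ahead"),
    (date(2017, 6, 16), "towering"),
    (date(2017, 6, 12), "Mammoth"),
]


def tally_by_day(records, exclude=frozenset()):
    """Map each adjective to a Counter of weekday names, skipping excluded terms."""
    table = defaultdict(Counter)
    for day, adjective in records:
        adjective = adjective.lower()
        if adjective in exclude:
            continue
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        table[adjective][DAYS[day.weekday()]] += 1
    return table


table = tally_by_day(RECORDS, exclude=GAME_STATE)
```

Passing `exclude=GAME_STATE` keeps only the adjectives that describe the homer itself, so `table["mammoth"]` holds that word's weekday counts and "go-ahead" never makes it into the table.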
So what did we learn? “Mammoth” is a frequent word of choice for the headline authors over at MLB dot com, even if it isn’t the runaway leader. We’ve also learned that there appears to be a limited approved vocabulary to be used when describing a high-flying dong. Else we might see a soaring slam, a majestic moonshot, perchance even a titanic tater.
Perhaps the greatest lesson we’ve learned is not to write articles based on tweets.
Thank you to Robert Au for his legitimate research assistance in pursuit of a very illegitimate article.