Giancarlo Stanton is a big-time power hitter. Ben Revere is a fast guy with an empty hit tool. Joey Votto is Nick Johnson on a permanent hot streak. If pressed, we could do this all day with every position player in baseball, arbitrarily deciding which features to focus on for each player and tossing them into buckets.
But we don’t need to, because today we’re presenting some early results from Baseball Prospectus’ collaboration with Ayasdi. This partnership allows BP to use Ayasdi’s proprietary analytics software, called Ayasdi Core, with the goal of advancing baseball research. Ayasdi Core is built on the concept of Topological Data Analysis (TDA), which allows us to visualize the complex connections between data points from a given population. Here’s a brief explanation of TDA from Ayasdi:
[TDA] represents data using topological networks. A topological network represents data by grouping similar data points into nodes, and connecting those nodes by an edge if the corresponding collections have a data point in common. Because each node represents multiple data points, the network gives a compressed version of extremely high dimensional data.
Or, in layman’s terms: TDA creates a player similarity map.
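The node-and-edge rule in Ayasdi’s description can be sketched in a few lines. This is only an illustration of the stated rule (two nodes are linked when they share a data point), not Ayasdi Core’s implementation; the function name, node ids and player names are hypothetical.

```python
from itertools import combinations

def build_network(nodes):
    """Connect two nodes with an edge when they share at least one data point.

    `nodes` maps a node id to the set of data points (here, player names)
    grouped into that node. Returns a sorted list of (node_a, node_b) edges.
    """
    edges = []
    for a, b in combinations(sorted(nodes), 2):
        if nodes[a] & nodes[b]:  # non-empty intersection means a shared player
            edges.append((a, b))
    return edges

# Hypothetical node memberships, for illustration only.
nodes = {
    "n1": {"Player A", "Player B"},
    "n2": {"Player B", "Player C"},
    "n3": {"Player D"},
}
edges = build_network(nodes)
print(edges)
```

Here n1 and n2 are linked through the player they share, while n3 has no neighbors at all.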
Ultimately, Ayasdi’s tools help portray the shape of the data in a very compact and easily understandable way, which can help in our interpretation of said data. Imagine a world where we took all the data—all the data—and put it on a bunch of XY plots and attempted to draw conclusions. We’d quickly run into a problem: Our data would cluster in ways that might not be conducive to interpreting with regression lines or other methods. Consider data that fell like this:
This is, of course, an oversimplification of a complex phenomenon, but the point remains that data is often not best categorized by traditional methods. Ayasdi’s approach takes these issues into account and interprets the connections between data points in a different way. (For a more thorough description of how this works, I recommend reading this post from one of the company’s founders. There’s also a whitepaper available here for even more information.)
This isn’t the first time Ayasdi’s software has been used to unlock some of the secrets in sports data. At the MIT Sloan Sports Analytics Conference in 2012, Muthu Alagappan presented on how Ayasdi helped redefine the positions on a basketball court. Ayasdi has graciously allowed BP to replicate this work for the baseball world, and some of the early results I hinted at above appear below.
For this first step we looked at every position player who had at least 250 plate appearances last season, a sample that includes 311 different players. We then took several key statistics that each represent a basic component of a single skill on the field. This can be a bit complicated, but the analyses below can easily be replicated with any set of statistics you can imagine. This is, after all, an early presentation of results, so input is truly welcome. (We’ve already discussed internally alternatives to some of the measures below, most notably for capturing speed and defensive ability; those changes will go into our next run.) Here are the stats we used this time, and the reasons for their inclusion:
Stat: Reasoning
AVG: Gives a rough approximation of a player’s hit tool and ability to get on base by putting the ball in play.
OBP-AVG: Gives more context to AVG, allowing plate discipline to be acknowledged.
ISO: Simplest stat to isolate a batter’s power.
SB: Stolen bases are an approximation for speed in this analysis.
UZR/150: One of a variety of flawed defensive statistics to be included in the analysis.
These stats aren’t perfect, so keep in mind the rationale behind them as we explore the results of their inclusion in the analysis later. One important consideration in selecting these five stats was that isolating fewer components in the analysis makes it easier to compare groups because differences are less nuanced. So a dataset including these five qualities, along with a variety of other context stats (PA, Age, etc.), was uploaded into Ayasdi’s software for analysis.
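For readers who want to rebuild the three batting inputs from a standard stat line, here is a minimal sketch. It uses the usual definitions of AVG, OBP and SLG, and reads the second stat in the table as OBP minus AVG (an assumption about the label); the function name and the sample stat line are invented for illustration.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Derive AVG, OBP minus AVG, and ISO from standard counting stats."""
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    singles = h - doubles - triples - hr
    slg = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab
    # ISO is slugging minus average, i.e. extra bases per at-bat.
    return {"AVG": avg, "OBP-AVG": obp - avg, "ISO": slg - avg}

# Made-up stat line, not a real player.
line = rate_stats(ab=500, h=150, doubles=30, triples=5, hr=20, bb=50, hbp=5, sf=5)
print({name: round(value, 3) for name, value in line.items()})
```

SB and UZR/150 are taken as published and need no derivation.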
So what does the data look like as seen through Ayasdi Core? Well, it looks something like this:
That doesn’t look like much right now, but it shows some of the relationships and overall shape of the data. The topological map is created by examining connections between the players in each node. Clustering occurs when a lot of players are very similar and their nodes are densely packed and connected. More sparse areas represent players with unique skill sets that aren’t as common, for better or worse, across major-league baseball.
For a more thorough discussion of the methods and analyses used to build these groups, please see the Technical Notes section at the very bottom of this post.
***
Understanding how we got to the conclusions we’ll make in the following paragraphs is important, which is why we’ve spent so much time discussing the components of the analysis on a high level. The most interesting part, however, is what Core allows us to evaluate. We have been able to break down our overall population of 311 players into the following six (or, if you prefer, nine) groups:
· Power Outage
· Speedy Hit Tools
· Balanced Skill Sets
· Hit Tools
· Power Bats
· On Base Specialists
· Outliers 1
· Outliers 2
· One of a Kinds
Here is an annotated image that shows where each group is represented on the map:
This then leads into a necessary discussion around how these groups are isolated and what it means for a player to be a part of them. It’s easiest to start with some of our major groups, analyzing what separates them from the rest of the players included in the overall population.
Power Outage
The first group, which we also considered calling late-inning replacements, comprises those players who fill the bottom few spots on a roster. They have fringy power, on-base skills and strong hit tools, but they’re just enough over the line for speed and defense to build a career:
First, a note on colorizing the topological maps: We can apply color as a dimension, allowing us to view the overall map and how a particular skill is distributed. Each image in this article is colorized for a different skill, and in this case we used ISO. That means that in the image above blue represents your Elvis Andruses of the world, while players like Giancarlo Stanton are bright red.
As I noted previously, we might see individuals from this group putting up respectable numbers in certain other stats, but there’s nothing else that ties them all together in a significant way. Some of the players included in this group are Daniel Nava, Yunel Escobar, Sam Fuld, and Eugenio Suarez.
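The colorization described in this section can be approximated with a short sketch. It assumes a node’s color is driven by the average of the chosen stat across the players in that node, rescaled so 0 is the bluest node and 1 the reddest; that averaging rule, the function name and the sample values are assumptions, not a description of Ayasdi Core’s internals.

```python
def node_color_values(nodes, stat):
    """Return a 0-to-1 color value per node (0 = blue end, 1 = red end).

    `nodes` maps node id -> set of players; `stat` maps player -> stat value.
    """
    means = {
        node_id: sum(stat[p] for p in members) / len(members)
        for node_id, members in nodes.items()
    }
    lo, hi = min(means.values()), max(means.values())
    span = hi - lo
    # If every node has the same mean, fall back to the middle of the scale.
    return {n: (m - lo) / span if span else 0.5 for n, m in means.items()}

# Hypothetical ISO values and node memberships.
iso = {"Player A": 0.05, "Player B": 0.15, "Player C": 0.30}
nodes = {"n1": {"Player A", "Player B"}, "n2": {"Player B", "Player C"}, "n3": {"Player C"}}
colors = node_color_values(nodes, iso)
print(colors)
```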
Speedy Hit Tools
Working clockwise, the next group is named “Speedy Hit Tools.” They generally rank pretty highly compared to the rest of the population when it comes to average and defense, but OBP and ISO are definitely not strong suits for this group. The isolation of this group is most obvious when looking at a map colorized by the SB column:
The naming of this group isn’t perfect, but stolen bases is a tough category for this analysis since so many players skew toward the lower end of the spectrum. This group consists of players like Josh Harrison, Jose Altuve, Denard Span, and Starlin Castro.
It’s interesting to look at a player like Starlin Castro, who doesn’t steal a ton of bases, and question his inclusion in this group. His four steals from last season certainly don’t merit mention with a player like Altuve, but his inclusion makes sense considering the bigger picture. Castro is a near-perfect match for the overall archetype of this group in every way except for stolen bases. Here we see the power of these groupings: You might not see how much Castro has in common with Harrison if you’re focusing mostly on Harrison’s “defining” stat, steals; but the whole-player picture reveals it. (Castro does steal some bases.)
Balanced Skill Sets
This group ends up fairly middle of the road for nearly all of the statistics used in this study, so this is a perfect opportunity to look at what these groups look like within the analysis:
The balanced group sits, as a whole, near the mean for a lot of the stats used in this analysis, but we can see just a little bit of direction in terms of how they compare to their peers. They are slightly better than the rest of the players in the population for ISO, OBP-AVG, and AVG but are slightly worse when it comes to UZR/150 and SB. The average age of this group is within half a year of the rest of the population, so this group serves as a good average group against which others can be compared. This group is headlined by players who are reasonably good at a lot of things, like Russell Martin, Nolan Arenado, Ben Zobrist, and Alex Gordon. They are, you might note, the sorts of players who have lately been more serious MVP candidates by WAR than by actual voter sentiment.
Hit Tools
I considered naming this group the “Slow Hit Tools.” Let’s take a look at why this group got its name:
This entire group is made up of guys who posted high batting averages but didn’t necessarily possess the other skills to fit in other groups. They are very average defensively, don’t get on base via walks or HBP, and don’t steal bases. That said, they are better than average for AVG (obviously) and ISO, so these players can be considered high-contact, bat-first types with power that plays in games. Names from this group include Victor Martinez, Corey Dickerson, Justin Turner, and J.D. Martinez.
Power Bats
I won’t duplicate it here, but you can refer to the ISO image showcased in the “Power Outage” group earlier in this post to see the distribution of ISOs throughout our map. This group, much like the “Balanced Skill Sets” group, has a lot of overlap with neighboring groups because of its position between three other groups of players.
This group put up much better ISOs than the rest of the players in the dataset, but that’s not the end of their offensive prowess. They also are slightly better than average for OBP-AVG and AVG. These guys generally don’t steal bases and are among the worst defenders, but their bats easily make up for any shortcomings in those departments. It’s worth considering as well that this group’s extreme skew toward “feared” status gives them a leg up on OBP-AVG.
The interesting thing here is that there is some overlap with neighboring groups, as players with above-average skills in multiple facets of the game are difficult to bucket in just one place. This group includes Troy Tulowitzki, Andrew McCutchen, Victor Martinez, Mike Trout, and Lucas Duda.
You’ll notice that Martinez is represented in nodes in both the “Power Bats” and “Hit Tools” groups. So is Tulowitzki, for that matter. This is part of the beauty of Ayasdi’s model wherein players are grouped into nodes based on their performance in the stats we’re using for analysis. Take a look at where Tulo fits in the overall image below:
Tulowitzki is represented in the blue and red nodes in the image above, nestled right between the “Hit Tools” and “Power Bats” groups. Since his skill set fits in with his contemporaries from both groups, he’s among a handful of elite players (like Trout, Victor Martinez, and Jose Abreu) who serve as a primary link between the two groups.
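Because a player can sit in more than one node, finding the players who link two groups is a simple membership lookup. The sketch below assumes each group is stored as a list of node member sets; the function name, data layout and player labels are hypothetical.

```python
def bridging_players(group_nodes):
    """Return players who appear in nodes belonging to more than one group.

    `group_nodes` maps a group name to a list of node member sets.
    """
    membership = {}
    for group, nodes in group_nodes.items():
        for members in nodes:
            for player in members:
                membership.setdefault(player, set()).add(group)
    return {p: sorted(groups) for p, groups in membership.items() if len(groups) > 1}

# Hypothetical node memberships, for illustration only.
group_nodes = {
    "Hit Tools": [{"Player X", "Player Y"}, {"Player Y", "Player Z"}],
    "Power Bats": [{"Player Z", "Player W"}],
}
bridges = bridging_players(group_nodes)
print(bridges)
```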
On Base Specialists
The group of on-base specialists is easily distinguished by their non-batting-average activities (be it taking walks or getting hit by pitches). This group skews highly toward the top of that particular leaderboard, and is easy to distinguish visually with the proper colorization applied:
This group is one of the more predictable ones (they tend to skew low on pure AVG, and they also don’t steal bases), but there is something interesting lying in the data. The members of this group had, on average, 76 fewer plate appearances than the average player from this study. Perhaps some OBP specialists struggle to get on the field for other reasons, or maybe managers are still inclined to use AVG specialists instead when given the choice. The reasoning for this needs to be explored (you can hypothesize as easily as I can), but it’s certainly interesting given the focus that getting on base receives in the media. Players in this group include Joey Votto, Adam LaRoche, Carlos Santana, Mike Napoli, and George Springer.
Outliers 1
This group includes four players who didn’t fit within the larger group of players and only have connections among themselves. They are Collin Cowgill, Jed Lowrie, Logan Forsythe, and B.J. Upton. As a whole this group of players hovers right around the averages for the entire population for each of the five stats used, but had about 10 percent fewer plate appearances than the rest of their peers. You might call these guys a subcategory of the balanced players: They’re roughly flat across all five categories, but aren’t good enough to be considered “balanced.”
Outliers 2
This group consists of three players: Logan Morrison, Josh Rutledge, and Nick Castellanos. There isn’t much that is statistically significant that ties these players together, besides being young, average and slow.
One of a Kinds
Nine players didn’t have any connections to their peers. They not only weren’t included in the overall topological map, but they weren’t even connected to one another. This is a group of guys who don’t fit into any group, including this group. Guys falling into this category are: Devin Mesoraco, Adam Eaton, Torii Hunter, Martin Prado, Chris Owings, Tommy La Stella, Derek Jeter, Alexi Amarista, and Everth Cabrera.
Some of these players had one or two truly awful stats that precluded them from inclusion in another group. Some are long-term veterans who have seen their skill sets erode in all but one or two main categories. Either way, these players were unlike any others in baseball, for better or worse.
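The split between the main map, the small outlier groups and the one-of-a-kinds can be thought of as a connected-components problem. A minimal sketch, assuming the map is available as a list of node ids and edges (both hypothetical here), and not claiming this is how Core labels them:

```python
from collections import deque

def connected_components(node_ids, edges):
    """Return the connected components of the map, largest first."""
    neighbors = {n: set() for n in node_ids}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in sorted(node_ids):
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:  # breadth-first walk over everything reachable from start
            current = queue.popleft()
            component.add(current)
            for nxt in neighbors[current] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical map: one main cluster, one small outlier group, one isolated node.
components = connected_components(
    ["a", "b", "c", "d", "e", "f"], [("a", "b"), ("b", "c"), ("d", "e")]
)
print(components)
```

In this toy map the largest component plays the role of the main map, the two-node component resembles an outlier group, and the single unconnected node is a one-of-a-kind.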
***
This is really just the beginning of a deeper dive into TDA as it applies to baseball. There are, I don’t know, hundreds of possible applications that we have yet to delve into where Ayasdi Core and TDA might provide insight that has previously eluded researchers. These are just hypotheses, but some things that come to mind as the next level of analysis through TDA:
· Does team construction impact a team’s ability to over/underperform their Pythagorean record?
· Are certain batting orders able to capitalize on pairings of compatible skill sets to produce more runs than would normally be expected?
· Are some hitter types better suited to face certain types of pitcher?
· Which skills are systematically under/overcompensated in free agency?
· Can we apply this style of analysis to minor leaguers to determine if certain types of player are more likely to be busts than others?
And finally the one that we’ve spent months on that isn’t yet ready for public consumption:
· What does this sort of analysis look like for pitchers?
***
Technical Notes
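In order to analyze this information we had to apply some analyses in Core that identify and map the relationships in the data. The first step is using a metric to help categorize the data. In this case we’ve used Norm Angle, which “normalizes the columns [stats] for each column in the dataset to have mean 0 and variance 1, and then computes the angle distance on the data.” We want to use Norm Angle here because it allows for nulls (e.g., catchers who don’t have a UZR/150) and it works best with columns or stats that are related but not interdependent. Going by that description, the Norm Angle distance between two players x and y works out to:
$$d(x, y) = \arccos\left(\frac{\hat{x} \cdot \hat{y}}{\lVert \hat{x} \rVert \, \lVert \hat{y} \rVert}\right)$$
Where,
$$\hat{x}_j = \frac{x_j - \mu_j}{\sigma_j}$$
with $\mu_j$ and $\sigma_j$ the mean and standard deviation of stat $j$ across all players in the dataset.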
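As a rough sketch of the metric as quoted above (standardize each column to mean 0 and variance 1, then take the angle between two rows), the calculation might look like the following. It assumes the population standard deviation and complete rows with no nulls; how Core treats nulls is not covered in this post, and the sample stat lines are invented.

```python
import math
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale every column to mean 0 and variance 1 (population variance assumed)."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def angle_distance(u, v):
    """Angle in radians between two vectors; undefined for an all-zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine)

# Invented [AVG, ISO] lines: players 0 and 2 share a profile, player 1 is the opposite.
rows = [[0.300, 0.100], [0.250, 0.200], [0.290, 0.120], [0.240, 0.220]]
z = standardize_columns(rows)
same_shape = angle_distance(z[0], z[2])
opposite_shape = angle_distance(z[0], z[1])
print(same_shape, opposite_shape)
```

The point of the angle is that it compares the shape of a player’s profile relative to the league rather than its magnitude: two players who deviate from average in the same direction are close even if one deviates more.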
It’s worth noting that Core does all the work for you in applying this metric to each data point, so we don’t need to worry about calculation errors or the like. We then apply some lenses to our dataset that determine how the software interprets each player before mapping the overall relationships. In this case we used Metric PCA Coordinates 1 and 2 as lenses. Here’s a brief description from Ayasdi:
These lenses compute a variant of PCA coord lenses for data that does not use the Euclidean metric. When you use these lenses, Ayasdi Core first maps your data into a Euclidean space using the rows of the distance matrix as the coordinates and then performs PCA.
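To make that description concrete, here is a stdlib-only sketch of the first such lens: treat each row of the distance matrix as a point’s coordinates, center the columns, and project onto the leading principal direction. Power iteration stands in for a full eigendecomposition, only the first coordinate is computed, and the toy distance matrix is invented; this is an illustration of the quoted idea, not Ayasdi’s code.

```python
import math

def first_metric_pca_lens(dist, iterations=200):
    """Project rows of a distance matrix onto their first principal direction."""
    n = len(dist)
    col_means = [sum(dist[i][j] for i in range(n)) / n for j in range(n)]
    # Rows of the distance matrix become coordinates, centered column by column.
    X = [[dist[i][j] - col_means[j] for j in range(n)] for i in range(n)]
    cov = [[sum(X[k][i] * X[k][j] for k in range(n)) / n for j in range(n)]
           for i in range(n)]
    v = [1.0 / (j + 1) for j in range(n)]  # arbitrary deterministic start vector
    for _ in range(iterations):  # power iteration toward the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(X[i][j] * v[j] for j in range(n)) for i in range(n)]

# Toy distances between four points sitting at 0, 1, 10 and 11 on a line.
dist = [[0, 1, 10, 11], [1, 0, 9, 10], [10, 9, 0, 1], [11, 10, 1, 0]]
lens = first_metric_pca_lens(dist)
print(lens)
```

The sign of a principal direction is arbitrary, so only the relative placement matters: the two nearby pairs land on opposite sides of zero. A second coordinate would need the same procedure after removing the first direction.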
The topological map of the data is what results from running these metrics and lenses together, with alterations to either the metric or a lens causing a new and different map to come to life.
Once the map is created, statistical tests and colorization help identify groups that live within the overall topological map. Once each group is identified, statistical tests (specifically Kolmogorov-Smirnov testing) are used to compare the group to the remaining population. The results indicate how that group compares to the rest of the population, along with p-values to gauge the significance of each finding.
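For anyone who wants to reproduce the group-versus-rest comparison outside of Core, the two-sample Kolmogorov-Smirnov statistic is simply the largest gap between the two empirical distribution functions. The sketch below computes only that statistic with the standard library; the sample values are invented, and a library routine such as SciPy’s `scipy.stats.ks_2samp` would be the usual way to get the accompanying p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a at or below x
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Invented ISO values for a low-power group and for everyone else.
group_iso = [0.05, 0.06, 0.07]
rest_iso = [0.15, 0.20, 0.25]
d_separated = ks_statistic(group_iso, rest_iso)
d_identical = ks_statistic(rest_iso, rest_iso)
print(d_separated, d_identical)
```

A statistic near 1 means the group’s values barely overlap with the rest of the population on that stat; a value near 0 means the two distributions look alike.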
**Full exports for each group of players are available for download here.
Thanks to Devi Ramanan and the rest of the Ayasdi team for their support and willingness to collaborate with Baseball Prospectus.
Thanks to Sam Miller, Jon Shepherd, and Craig Goldstein for serving as a sounding board throughout the analysis process and indulging me in this endeavor.
I think another cool angle to take would be to compute this over different time periods, and see whether some clusters have gained or lost players, or whether there were other clusters in the past that have since gone "extinct".
Can't wait to see a similar approach taken for pitchers ... although perhaps not limited to "outputs" (e.g. for hitters, ISO / AVG / OBP) but inclusive of "inputs" (e.g. # of pitches, types of pitches, fastball speed ... ) ?
I've always thought there were a relatively small number of pitcher 'archetypes' ...
For offense, I think I would have gone with:
· Contact% for bat-on-ball ability
· Swing% or BB% for discipline
· ISO for raw power
· SBA/(times reached base) for speed
Still, this is really cool stuff. I've always assumed we could do some skill-based grouping but I've never known how it would be done. It would be absolutely fascinating to see this done on scouting report data, such as the figures Kiley McDaniel and the Fangraphs team have been putting together.
I do think things like contact%, swing%, etc. could be more fruitful stats to use. Ultimately this is, hopefully, the first phase in many iterations for this type of analysis.
"**Full exports for each group of players are available for download here."
but got a message saying the file has been moved.
Try this link instead...
https://www.dropbox.com/s/va2j30r4q5ftffu/Ayasdi%20Groups.xlsx?dl=0