
Giancarlo Stanton is a big-time power hitter. Ben Revere is a fast guy with an empty hit tool. Joey Votto is Nick Johnson on a permanent hot streak. If pressed, we could do this all day with every position player in baseball, arbitrarily deciding which features to focus on for each player and tossing them into buckets.

But we don’t need to, because today we’re presenting some early results from Baseball Prospectus’ collaboration with Ayasdi. This partnership allows BP to use Ayasdi’s proprietary analytics software, called Ayasdi Core, with the goal of advancing baseball research. Ayasdi Core is built on the concept of Topological Data Analysis (TDA), which allows us to visualize the complex connections between data points from a given population. Here’s a brief explanation of TDA from Ayasdi:

[TDA] represents data using topological networks. A topological network represents data by grouping similar data points into nodes, and connecting those nodes by an edge if the corresponding collections have a data point in common. Because each node represents multiple data points, the network gives a compressed version of extremely high dimensional data.

Or, in layman’s terms: TDA creates a player similarity map.
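To make the node-and-edge idea a bit more concrete, here is a minimal sketch in Python. It is not Ayasdi Core’s algorithm, which is proprietary. It assumes a single numeric "lens" value per data point, slices that lens into overlapping intervals, groups nearby points inside each slice with a crude single-linkage rule, and then links any two nodes that share a point. The names `single_linkage` and `build_network` and the parameters `n_intervals`, `overlap`, and `eps` are all hypothetical.

```python
from itertools import combinations
import math


def single_linkage(indices, points, eps):
    """Group indices so that points within eps of each other share a cluster."""
    clusters = []
    unvisited = set(indices)
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in unvisited
                    if math.dist(points[current], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def build_network(points, lens, n_intervals=3, overlap=0.5, eps=1.0):
    """Return (nodes, edges): nodes are sets of point indices, and an edge
    joins two nodes that have at least one point in common."""
    lo, hi = min(lens), max(lens)
    width = (hi - lo) / n_intervals
    pad = width * overlap / 2  # how far each interval spills into its neighbors
    nodes = []
    for k in range(n_intervals):
        start = lo + k * width - pad
        end = lo + (k + 1) * width + pad
        members = [i for i, v in enumerate(lens) if start <= v <= end]
        nodes.extend(single_linkage(members, points, eps))
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges


# Six toy points on a line, using the coordinate itself as the lens.
pts = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
nodes, edges = build_network(pts, [p[0] for p in pts])
```

Because the intervals overlap, a point near a boundary lands in two nodes, and that shared membership is what produces the edge between them.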

Ultimately, Ayasdi’s tools help portray the shape of the data in a very compact and easily understandable way, which can help in our interpretation of said data. Imagine a world where we took all the data—all the data—and put it on a bunch of XY plots and attempted to draw conclusions. We’d quickly run into a problem: Our data would cluster in ways that might not be conducive to interpreting with regression lines or other methods. Consider data that fell like this:

This is, of course, an oversimplification of a complex phenomenon, but the point remains that data is often not best categorized by traditional methods. Ayasdi’s approach takes these issues into account and interprets the connections between data points in a different way. (For a more thorough description of how this works, I recommend reading this post from one of the company’s founders. There’s also a whitepaper available here for even more information.)

This isn’t the first time Ayasdi’s software has been used to unlock some of the secrets in sports data. At the MIT Sloan Sports Analytics Conference in 2012, Muthu Alagappan presented on how Ayasdi helped redefine the positions on a basketball court. Ayasdi has graciously allowed BP to replicate this work for the baseball world, and some of the early results I hinted at earlier appear below.

For this first step we looked at every position player who had at least 250 plate appearances last season, a sample that includes 311 different players. We then chose several key statistics, each representing a basic component of a single skill on the field. This can be a bit complicated, but the analyses below can easily be replicated with any set of statistics you can imagine. This is, after all, an early presentation of results, so input is truly welcome. (We’ve already discussed internally alternatives to some of the measures below, most notably for capturing speed and defensive ability; those changes will go into our next run.) Here are the stats we used this time, and the reasons for their inclusion:




AVG
Gives a rough approximation of a player’s hit tool and ability to get on base by putting the ball in play.


OBP-AVG
Gives more context to AVG, allowing plate discipline to be acknowledged.


ISO
Simplest stat to isolate a batter’s power.


SB
Stolen bases are an approximation for speed in this analysis.


UZR/150
One of a variety of flawed defensive statistics to be included in the analysis.

These stats aren’t perfect, so keep in mind the rationale behind them as we explore the results of their inclusion in the analysis later. One important consideration in selecting these five stats was that isolating fewer components in the analysis makes it easier to compare groups because differences are less nuanced. So a dataset including these five qualities, along with a variety of other context stats (PA, Age, etc.), was uploaded into Ayasdi’s software for analysis.
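For anyone who wants to replicate the inputs, a rough sketch of how such a table might be assembled follows. It assumes each player arrives as a dictionary with `PA`, `AVG`, `OBP`, `SLG`, `SB`, and an optional `UZR/150`; those field names, the `build_features` helper, and the sample rows are all hypothetical and are not real player lines. ISO is computed with its standard definition, SLG minus AVG, and the 250 PA cutoff is treated as a minimum.

```python
MIN_PA = 250  # playing-time cutoff, treated here as a minimum


def build_features(rows, min_pa=MIN_PA):
    """Keep qualifying hitters and derive the five skill columns."""
    features = []
    for r in rows:
        if r["PA"] < min_pa:
            continue  # not enough playing time to include
        features.append({
            "name": r["name"],
            "AVG": r["AVG"],
            "OBP-AVG": round(r["OBP"] - r["AVG"], 3),  # walks and HBP only
            "ISO": round(r["SLG"] - r["AVG"], 3),      # extra bases per at-bat
            "SB": r["SB"],
            "UZR/150": r.get("UZR/150"),               # None when unavailable
        })
    return features


# Made-up rows purely for illustration.
sample = [
    {"name": "Player A", "PA": 600, "AVG": 0.280, "OBP": 0.350,
     "SLG": 0.450, "SB": 12, "UZR/150": 3.5},
    {"name": "Player B", "PA": 120, "AVG": 0.300, "OBP": 0.340,
     "SLG": 0.400, "SB": 2},
    {"name": "Player C", "PA": 410, "AVG": 0.250, "OBP": 0.360,
     "SLG": 0.480, "SB": 0},
]
table = build_features(sample)
```

Player B drops out on playing time, and Player C keeps a missing defensive value instead of a made-up zero, which matters later when the distance metric has to cope with nulls.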

So what does the data look like as seen through Ayasdi Core? Well it looks something like this:

That doesn’t look like much right now, but it shows some of the relationships and overall shape of the data. The topological map is created by examining connections between the players in each node. Clustering occurs when a lot of players are very similar and their nodes are densely packed and connected. More sparse areas represent players with unique skill sets that aren’t as common, for better or worse, across major-league baseball.

For a more thorough discussion of the methods and analyses used to build these groups, please see the Technical Notes section at the very bottom of this post.


Understanding how we got to the conclusions we’ll make in the following paragraphs is important, which is why we’ve spent so much time discussing the components of the analysis on a high level. The most interesting part, however, is what Core allows us to evaluate. We have been able to break down our overall population of 311 players into the following six (or, if you prefer, nine) groups:

· Power Outage

· Speedy Hit Tools

· Balanced Skill Sets

· Hit Tools

· Power Bats

· On Base Specialists

· Outliers 1

· Outliers 2

· One of a Kinds

Here is an annotated image that shows where each group is represented on the map:

This then leads into a necessary discussion around how these groups are isolated and what it means for a player to be a part of them. It’s easiest to start with some of our major groups, analyzing what separates them from the rest of the players included in the overall population.

Power Outage
The first group, which we also considered calling late-inning replacements, comprises those players who fill the bottom few spots on a roster. They have fringy power, on-base skills and strong hit tools, but they’re just enough over the line for speed and defense to build a career:

First, a note on colorizing the topological maps: We can apply color as a dimension, allowing us to view the overall map and how a particular skill is distributed. Each image in this article is colorized for a different skill, and in this case we used ISO, which means that in the image above blue represents your Elvis Andruses of the world, while players like Giancarlo Stanton are bright red.
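One simple way to think about color as a dimension is sketched below. It assumes a node’s color comes from the average of the chosen stat across the players in that node, rescaled so the lowest node is fully blue and the highest is fully red. The article’s sources don’t spell out how Core aggregates values, so the averaging rule and the function names `node_color_values` and `to_blue_red` are assumptions for illustration.

```python
def node_color_values(nodes, stat):
    """Average a stat over the members of each node."""
    return [sum(stat[i] for i in members) / len(members) for members in nodes]


def to_blue_red(values):
    """Rescale node values to 0.0 (blue) through 1.0 (red)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all nodes match
    return [(v - lo) / span for v in values]


# Two toy nodes that share player 1, colored by a made-up ISO column.
iso = [0.100, 0.200, 0.300]
shades = to_blue_red(node_color_values([[0, 1], [1, 2]], iso))
```

Swapping in a different column, such as stolen bases, recolors the same map without changing its shape.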

As I noted previously, we might see individuals from this group putting up respectable numbers in certain other stats, but there’s nothing else that ties them all together in a significant way. Some of the players included in this group are Daniel Nava, Yunel Escobar, Sam Fuld, and Eugenio Suarez.

Speedy Hit Tools
Working clockwise, the next group is named “Speedy Hit Tools.” They generally rank pretty highly compared to the rest of the population when it comes to average and defense, but OBP and ISO are definitely not strong suits for this group. The isolation of this group is most obvious when looking at a map colorized by the SB column:

The naming of this group isn’t perfect, but stolen bases is a tough category for this analysis since so many players skew toward the lower end of the spectrum. This group consists of players like Josh Harrison, Jose Altuve, Denard Span, and Starlin Castro.

It’s interesting to look at a player like Starlin Castro, who doesn’t steal a ton of bases, and question his inclusion in this group. His four steals from last season certainly don’t merit mention with a player like Altuve, but his inclusion makes sense considering the bigger picture. Castro is a near perfect match for the overall archetype of this group in every way except for stolen bases. Here we see the power of these groupings: You might not see how much Castro has in common with Harrison if you’re focusing mostly on Harrison’s “defining” stat, steals; but the whole player picture reveals it. (Castro does steal some bases.)

Balanced Skill Sets
This group ends up fairly middle of the road for nearly all of the statistics used in this study, so this is a perfect opportunity to look at what these groups look like within the analysis:

The balanced group sits, as a whole, near the mean for a lot of the stats used in this analysis but we can see just a little bit of direction in terms of how they compare to their peers. They are slightly better than the rest of the players in the population for ISO, OBP-AVG, and AVG but are slightly worse when it comes to UZR/150 and SBs. The average age of this group is within half a year of the rest of the population, so this group serves as a good average group against which others can be compared. This group is headlined by players who are reasonably good at a lot of things like Russell Martin, Nolan Arenado, Ben Zobrist, and Alex Gordon. They are, you might note, the sorts of players who have lately been more serious MVP candidates by WAR than by actual voter sentiment.

Hit Tools
I considered naming this group the “Slow Hit Tools.” Let’s take a look at why this group got its name:

This entire group is made up of guys that posted high batting averages but didn’t necessarily possess the other skills to fit in other groups. They are very average defensively, don’t get on base via walks or HBP, and don’t steal bases. That said, they are better than the average for AVG (obviously) and ISO, so these players can be considered high contact bat-first type players with power that plays in games. Names from this group include Victor Martinez, Corey Dickerson, Justin Turner, and J.D. Martinez.

Power Bats
I won’t duplicate it here, but you can refer to the ISO image showcased in the “Power Outage” group earlier in this post to see the distribution of ISOs throughout our map. This group, much like the “Balanced Skill Sets” group, has a lot of overlap with neighboring groups because of its position between three other groups of players.

This group put up much better ISOs than the rest of the players in the dataset, but that’s not the end of their offensive prowess. They also are slightly better than average for OBP-AVG and AVG. These guys generally don’t steal bases and are among the worst defenders, but their bats easily make up for any shortcomings in those departments. It’s worth considering as well that this group’s extreme skew toward “feared” status gives them a leg up on OBP-AVG.

The interesting thing here is that there is some overlap with neighboring groups, as players with above-average skills in multiple facets of the game are difficult to bucket in just one place. This group includes Troy Tulowitzki, Andrew McCutchen, Victor Martinez, Mike Trout, and Lucas Duda.

You’ll notice that Martinez is represented in nodes in both the “Power Bats” and “Hit Tools” groups. So is Tulowitzki, for that matter. This is part of the beauty of Ayasdi’s model wherein players are grouped into nodes based on their performance in the stats we’re using for analysis. Take a look at where Tulo fits in the overall image below:

Tulowitzki is represented in the blue and red nodes in the image above, nestled right between the “Hit Tools” and “Power Bats” groups. Since his skill set fits in with his contemporaries from both groups, he’s among a handful of elite players (like Trout, Victor Martinez, and Jose Abreu) who serve as a primary link between the two groups.

On Base Specialists
The group of on base specialists is easily distinguished by their non-batting average activities (be it taking walks or getting hit by pitches). This group skews highly toward the top of that particular leaderboard, and is easy to distinguish visually with the proper colorization applied:

This group is one of the more predictable ones (they tend to skew low on pure AVG, and they also don’t steal bases), but there is something interesting lying in the data. The members of this group had, on average, 76 fewer plate appearances than the average player from this study. Perhaps some OBP-specialists struggle to get on the field for other reasons, or maybe managers are still inclined to use AVG-specialists instead when given the choice. The reasoning for this needs to be explored (you can hypothesize as easily as I can), but it’s certainly interesting given the focus that getting on base receives in the media. Players in this group include Joey Votto, Adam LaRoche, Carlos Santana, Mike Napoli, and George Springer.

Outliers 1
This group includes four players who didn’t fit within the larger group of players and have connections only amongst themselves. They are Collin Cowgill, Jed Lowrie, Logan Forsythe, and B.J. Upton. As a whole this group of players hovers right around the averages for the entire population for each of the five stats used, but had about 10 percent fewer plate appearances than the rest of their peers. You might call these guys a subcategory of the balanced players: They’re roughly flat across all five categories, but aren’t good enough to be considered “balanced.”

Outliers 2
This group consists of three players: Logan Morrison, Josh Rutledge, and Nick Castellanos. There isn’t much that is statistically significant that ties these players together, besides being young, average and slow.

One of a Kinds
Nine players didn’t have any connections to their peers. They not only weren’t included in the overall topological map, but they weren’t even connected to one another. This is a group of guys who don’t fit into any group, including this group. Guys falling into this category are: Devin Mesoraco, Adam Eaton, Torii Hunter, Martin Prado, Chris Owings, Tommy La Stella, Derek Jeter, Alexi Amarista, and Everth Cabrera.

Some of these players had one or two truly awful stats that precluded them from inclusion in another group. Some are long-term veterans who have seen their skill sets erode in all but one or two main categories. Either way, these players were unlike any others in baseball, for better or worse.


This is really just the beginning of a deeper dive into TDA as it applies to baseball. There are, I don’t know, hundreds of possible applications that we have yet to delve into where Ayasdi Core and TDA might provide insight that has previously eluded researchers. These are just hypotheses, but some things that come to mind as the next level of analysis through TDA:

· Does team construction impact a team’s ability to over/underperform their Pythagorean record?

· Are certain batting orders able to capitalize on pairings of compatible skill sets to produce more runs than would normally be expected?

· Are some hitter types better suited to face certain types of pitcher?

· Which skills are systematically under/overcompensated in free agency?

· Can we apply this style of analysis to minor leaguers to determine if certain types of player are more likely to be busts than others?

And finally the one that we’ve spent months on that isn’t yet ready for public consumption:

· What does this sort of analysis look like for pitchers?


Technical Notes
In order to analyze this information we had to apply some analyses in Core that identify and map the relationships in the data. The first step is using a metric to help categorize the data. In this case we’ve used Norm Angle which “normalizes the columns [stats] for each column in the dataset to have mean 0 and variance 1, and then computes the angle distance on the data.” We want to use Norm Angle here because it allows for nulls (e.g., catchers who don’t have a UZR/150) and it works best with columns or stats that are related but not interdependent. Per Ayasdi, the formula for Norm Angle is:

[Images: Norm Angle equation, in two parts]
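As a rough illustration of the quoted description, the sketch below standardizes each column to mean 0 and variance 1 and then takes the angle between two players’ rows, i.e. the arccosine of their cosine similarity. This is my reading of "angle distance," not Ayasdi’s published formula. In particular, the way nulls are handled here (skipping them when standardizing, and comparing two players only on the columns both have) is an assumption, and the function names are hypothetical.

```python
import math


def zscore_columns(rows):
    """Scale each column to mean 0 and variance 1, leaving None untouched."""
    scaled = [list(r) for r in rows]
    for c in range(len(rows[0])):
        present = [r[c] for r in rows if r[c] is not None]
        mean = sum(present) / len(present)
        var = sum((v - mean) ** 2 for v in present) / len(present)
        std = math.sqrt(var) or 1.0  # constant column: leave values at 0
        for r in scaled:
            if r[c] is not None:
                r[c] = (r[c] - mean) / std
    return scaled


def angle_distance(u, v):
    """Angle in radians between two rows, over columns present in both."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-average row is distance 0 from anything
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding error
    return math.acos(cos)


def norm_angle_matrix(rows):
    """Pairwise angle distances between standardized rows."""
    z = zscore_columns(rows)
    return [[angle_distance(a, b) for b in z] for a in z]


# Four toy players described by two stats each.
distances = norm_angle_matrix([[0, 0], [2, 2], [0, 2], [2, 0]])
```

Two players with opposite profiles end up pi radians apart, while players whose above- and below-average stats don’t line up at all sit at a right angle to each other.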

It’s worth noting that Core does all the work for you in applying this metric to each data point, so we don’t need to worry about calculation errors or the like. We then apply some lenses to our dataset that determine how the software interprets each player before mapping the overall relationships. In this case we used Metric PCA Coordinates 1 and 2 as lenses. Here’s a brief description from Ayasdi:

These lenses compute a variant of PCA coord lenses for data that does not use the Euclidean metric. When you use these lenses, Ayasdi Core first maps your data into a Euclidean space using the rows of the distance matrix as the coordinates and then performs PCA.
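Read literally, that description can be sketched in a few lines of NumPy: treat each row of the distance matrix as a point in Euclidean space, center the columns, and project onto the first two principal components. This is an interpretation of the quoted text rather than Core’s actual implementation, and the function name `metric_pca_lenses` is hypothetical.

```python
import numpy as np


def metric_pca_lenses(dist, n_components=2):
    """Use rows of a distance matrix as coordinates, then run PCA on them."""
    X = np.asarray(dist, dtype=float)
    X = X - X.mean(axis=0)  # center each column before PCA
    # The rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T  # one row of lens values per data point


# Toy distance matrix for four points spaced evenly along a line.
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
lenses = metric_pca_lenses(dist)  # column 0 is lens 1, column 1 is lens 2
```

Each player ends up with two lens values, which is what lets the software slice the population into overlapping bins before grouping players into nodes.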

The topological map of the data is what results from running these metrics and lenses together, with alterations to either the metric or a lens causing a new and different map to come to life.

Once the map is created, statistical tests and colorization help identify groups that live within the overall topological map. Once each group is identified, statistical tests (specifically Kolmogorov-Smirnov testing) are used to compare the group to the remaining population. The results indicate how that group compares to the rest of the population, along with p-values to indicate the statistical significance of each finding.
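For readers unfamiliar with the test, here is a small sketch of the two-sample Kolmogorov-Smirnov statistic, which is simply the largest vertical gap between the empirical distributions of a stat inside the group and outside it. The function name `ks_statistic` and the sample values are hypothetical, and the sketch stops at the statistic itself rather than computing a p-value.

```python
from bisect import bisect_right


def ks_statistic(group, rest):
    """Largest gap between the two samples' empirical CDFs."""
    a, b = sorted(group), sorted(rest)
    gap = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up stolen-base totals for a group and for everyone else.
d = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 0 means the group looks like everyone else on that stat, and a value near 1 means the two distributions barely overlap. In practice, SciPy’s `scipy.stats.ks_2samp` returns both this statistic and a p-value.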

**Full exports for each group of players are available for download here.

Thanks to Devi Ramanan and the rest of the Ayasdi team for their support and willingness to collaborate with Baseball Prospectus.

Thanks to Sam Miller, Jon Shepherd, and Craig Goldstein for serving as a sounding board throughout the analysis process and indulging me in this endeavor.

It's an interesting visual approach to presenting cluster data, which I've always found to be a real challenge. My only caution is that developing multidimensional clustering solutions isn't really a fire-and-forget type of endeavor.
My thought too. I would be curious to see the same data analyzed using hierarchical clustering techniques, just to see how different the solution is.
Yeah, given the small population there's no reason at all not to employ a hierarchical approach.
This is one of the coolest articles I've seen here. This is why I tell people to pay for this site.
Really neat, Jeff.

I think another cool angle to take would be to compute this over different time periods, and see whether some clusters have gained or lost players, or whether there were other clusters in the past that have since gone "extinct".
That's an interesting angle I hadn't considered! We'll definitely have to add that to the list of possible analyses we want to use this tool for.
this was the first thing i thought of-- given the decline in offense, can we use TDA to identify which player types have been able to capitalize on the new pitching trends, and which have struggled.
Great stuff.

Can't wait to see a similar approach taken for pitchers ... although perhaps not limited to "outputs" (e.g. for hitters, ISO / AVG / OBP) but inclusive of "inputs" (e.g. # of pitches, types of pitches, fastball speed ... ) ?

I've always thought there were a relatively small number of pitcher 'archetypes' ...
This is in the works for sure. The difficult thing is identifying the right inputs that will produce a result. Outputs are easy because everyone has the same ones!
Nice article. Thanks also for the stats process stuff at the end.
my mind wanders to the speculation we do each offseason when a team seems to have a "plan" that we can't understand. Why did Boston choose the pitchers they chose? Why did the A's and Padres build the teams they built? Categorizing players and their connections, and how they fit in with the league as a whole, might give us evidence to build more informed theories from.
I agree!
Fascinating. I'll admit I was a bit surprised to see the stats you chose to use. I would have thought you'd look at things a bit more independent and which stabilize more quickly. Raw SB are driven in no small part by OBP. Average is influenced by speed and quite variable in a single season sample due to fluctuations in BABIP.

For offense, I think I would have gone with:
- Contact% for bat-on-ball ability
- Swing% or BB% for discipline
- ISO for raw power
- SBA/(times reached base) for speed

Still, this is really cool stuff. I've always assumed we could do some skill-based grouping but I've never known how it would be done. It would be absolutely fascinating to see this done on scouting report data, such as the figures Kiley McDaniels and the Fangraphs team have been putting together.
This is great feedback. I wasn't under the delusion that the stats I picked were perfect (far from it), but I wanted to use baseline stats that isolated one part of the game from another.

I do think things like contact%, swing%, etc. could be more fruitful stats to use. Ultimately this is, hopefully, the first phase in many iterations for this type of analysis.
Typically what you'll want to do before running a cluster solution on your population/sample is to cluster the variable set itself and let the data tell you which set of indicators provide the widest information spread/are least correlated with each other. Theory drives which variables go into that initial analysis but the clustering approach itself finds the variables that are most orthogonal with each other.
I can't figure out why Adam Eaton presents as a unique player. Seems like a prototypical leadoff/speed guy.
I think it's the walk rate and the lack of power. Ben Revere doesn't walk that much. Most guys who walk are in the power groups. Also, even though he's fast, he doesn't run as much as a Leonys Martin type. Last year he was Dexter Fowler with a few less HRs.
Feeding off of what tylersnotes said, maybe you use this to see how teams have been built over the years. Or show how WS champions have been built over the years? Or what separates them from all other teams in a given year? Just a thought. Great article, Jeff.
Excellent work, Jeff, and completely over my head, but I admire any effort to turn disparate numbers into coherent analysis. I'll be interested in seeing where you go with this.
Uh-huh, yeah. Um, so? Neat-o graphs but what, in the end, have you discovered -- that some players are like others? If your research is more profound than that, you might want to explain how.
Where does Scott Hatteberg appear on your chart? A streaking green and yellow star? Great stuff - interested to see where this goes.
Excellent article that has an endless array of interesting tangents.
I tried downloading the

"**Full exports for each group of players are available for download here."

but got a message saying the file has been moved.
My apologies!

Try this link instead...