
At Brooks Baseball, we’ve built a repository where you can access almost any information about any pitcher’s pitches and be confident that the pitch types were identified correctly. For example, you can ask how many times batters swung and missed at a Stephen Strasburg changeup, how often batters hit Chris Sale’s slider for a groundball, or what the overall called-strike rate is for Felix Hernandez’s fastball.

But PITCHf/x databasing is still in its infancy. Pitching is not the sum of individual statistics about individual pitches any more than a piece of music is the sum of an individual set of notes. Pitching is a sequence of events—the previous pitch’s execution may be as germane to the outcome of the at-bat as the current pitch’s execution. We often hear about how a pitcher might go up in the zone with a high fastball to raise a batter’s eye level and then down in the zone with a curveball. None of that was captured in the maze of tables and charts already available.

However, a new feature on BrooksBaseball.net written by Daniel Mack, a computer science doctoral student at Vanderbilt, allows you to visualize these sequences, providing a deeper understanding of pitching than ever before. (Technical details in appendix below.)

To access the sequences, simply go to any player card at Brooks Baseball and scroll down to the "Pitch Sequencing" section. To interpret what you see, you should first look at the title of each sequence, e.g.: "LHH Sequence: Out Zone | Low Over the Plate | Changeup –> Out Zone | High Over the Plate | Fourseam." First, this tells you this is a sequence thrown to left-handers. Next are descriptions separated by "|" that indicate different dimensions of a pitch. The "–>" represents the sequence marker, so the first pitch is then followed up by another pitch in order. In this case (which is the third left-hand sequence for Strasburg), we have two pitches that are out of the strike zone. The first is low, but over the plate, and it’s a changeup. The second is high, still over the plate, and is Strasburg's four-seam fastball. So the descriptions, in order, tell us: whether the pitch is in the strike zone; where it’s located in (or out of) the zone (if its location can be specified); and finally, the type of pitch, which can be as specific as an individual pitch type or as general as "Hard Pitch" or "Any Pitch."
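For readers who want to work with these titles programmatically, the format described above can be split apart with a few string operations. This is only a sketch: the names `PitchDescription` and `parse_sequence_title` are hypothetical, and it assumes every pitch in the title carries all three fields (zone, location, pitch type), which may not hold when a location cannot be specified.

```python
from dataclasses import dataclass


@dataclass
class PitchDescription:
    zone: str        # e.g. "Out Zone"
    location: str    # e.g. "Low Over the Plate"
    pitch_type: str  # e.g. "Changeup"


def parse_sequence_title(title):
    """Split 'LHH Sequence: A | B | C –> D | E | F' into handedness and pitches."""
    handedness, _, body = title.partition(" Sequence: ")
    # Accept both the en-dash arrow shown on the page and a plain ASCII arrow.
    body = body.replace("–>", "->")
    pitches = []
    for chunk in body.split("->"):
        fields = [field.strip() for field in chunk.split("|")]
        if len(fields) != 3:
            # Assumption of this sketch: zone, location, and type are all present.
            raise ValueError(f"expected 3 fields per pitch, got {fields!r}")
        pitches.append(PitchDescription(*fields))
    return handedness.strip(), pitches
```

Applied to the Strasburg title quoted above, this would return "LHH" and two pitch descriptions, the first a changeup and the second a four-seam fastball.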

Looking at these sequences by themselves helps us understand what a pitcher has a tendency to throw, but visualizing each sequence's usage and impact on at-bats over time can qualify its strategic potential for the pitcher. To aid in this process, we have selected heat maps as the visualization choice. To be clear, there are several interesting and valuable ways we could display this information, but when dealing with long stretches of data and an unpredictable distribution of sequences over those stretches, a heat map will draw the eye to areas of high contrast and indicate clearly where something occurs with more relative frequency than something else.

When interpreting these visualizations, it’s crucial to remember those words: relative frequency. If the data is uniformly distributed, there is a good chance it will look similar to a map where there is no data over that time (or low frequency, represented by bright green). Similarly, for maps where there are several high-intensity areas (which on these maps is represented by bright red), it may be that there was only a small amount of data, concentrated in a few places. This second case is easier to detect if there is no gradation in the heat map, but only high-frequency and low-frequency areas.
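To see why relative scaling produces both effects, consider a simple min-max normalization of the counts in one map. The exact color scaling used on the site is not described here, so treat this as an assumption chosen only because it reproduces the two cases above; `relative_intensity` is a hypothetical name.

```python
def relative_intensity(counts):
    """Scale counts to the range 0..1 relative to the other cells in the same map.

    Assumption: min-max scaling. 0.0 stands for the low-frequency color
    and 1.0 for the high-frequency color.
    """
    lowest, highest = min(counts), max(counts)
    if highest == lowest:
        # Perfectly uniform data has no contrast, so every cell looks "low".
        return [0.0 for _ in counts]
    return [(c - lowest) / (highest - lowest) for c in counts]


uniform = relative_intensity([5, 5, 5, 5])      # heavy but even usage
empty = relative_intensity([0, 0, 0, 0])        # no usage at all
sparse = relative_intensity([0, 1, 0, 0])       # a single occurrence
```

Under this assumption, `uniform` and `empty` come out identical, while the single occurrence in `sparse` is drawn at full intensity with no gradation around it.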

The maps show blocks of 100 at-bats over time, as labeled from the PITCH INFO database. There is a sliding window of 10 at-bats to provide some smoothing to these images (i.e., each block contains 10 of the at-bats of the previous block). These are separated on the x-axis into years. The maps are further detailed along the y-axis in three different ways. The first is the overall frequency of the sequence over time. This view shows at a glance when the sequence has been used: whether its use is fairly homogeneous across the pitcher’s career, or whether it developed and then stopped over a period of time. This view is shown by default for each sequence.
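The blocking scheme can be sketched as follows. This is an illustration under our reading of the description (100 at-bats per block, 10 of them shared with the previous block), not the code behind the site; `sliding_blocks` and `block_frequencies` are hypothetical names, and dropping a trailing partial block is a choice made here for simplicity.

```python
def sliding_blocks(at_bats, block_size=100, overlap=10):
    """Cut a chronological list of at-bats into overlapping blocks.

    Each block shares its first `overlap` at-bats with the end of the
    previous block. A trailing partial block is dropped in this sketch.
    """
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must be non-negative and smaller than block_size")
    step = block_size - overlap
    last_start = len(at_bats) - block_size
    return [at_bats[s:s + block_size] for s in range(0, last_start + 1, step)]


def block_frequencies(sequence_used, block_size=100, overlap=10):
    """Share of at-bats in each block where the sequence appeared.

    `sequence_used` holds one boolean per at-bat, in chronological order.
    """
    blocks = sliding_blocks(sequence_used, block_size, overlap)
    return [sum(block) / len(block) for block in blocks]
```

Each value returned by `block_frequencies` would correspond to one column of the overall frequency view.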

The second view shows the outcome of the at-bat, so you can see what happens in the at-bat when a sequence is used. Coupled with the overall frequency, this view helps us explore the effectiveness of the sequence. The final view uses a y-axis that explores the outcomes of the second pitch in the sequence, revealing the impact of a pitcher’s first pitch by showing what happens on the pitch that follows it. This is a wonderful complement to Brooks Baseball's information on the outcomes of individual pitches, providing a story about what pitches may have set those outcomes up.

As an example of how these views may be used, let’s take a look at Mat Latos. This is his top-ranked left-handed sequence:

This sequence is for back-to-back sinkers located in the strike zone. The overall frequency map shows that Latos used this pattern often in 2010 and then again this year, but not in 2011. In both cases, the sequence was used at the beginning of the season before falling out of favor. If we look at the outcomes of the at-bats for this sequence, we find that its effectiveness has changed over time:

We can see that over the blocks of 100 at-bats, batters facing this sequence in 2010 tended to strike out, then hit groundballs, then hit more line drives and fly balls. In an interesting symmetry, the evolution of this sequence in 2012 has occurred in the opposite direction, with fly balls and line drives giving way to more groundballs and more strikeouts.

This symmetry is also visible when looking at the outcomes of the second pitches in this sequence:

Echoing the at-bat outcome view, the second pitch at the beginning of 2010 was often taken for a strike or hit foul. Over time, batters started hitting this second pitch foul more often, with an increase in contact (and an odd increase in its being taken for a ball). Finally, at the end of this sequence’s heavier use in 2010, it was hit mostly for contact. In 2012, this trend has reversed: the second pitch started out most often hit for contact, then hit foul, and then became more effective, as batters began to take it for strikes. Using these three views, we can look for interesting sequences and their potential effectiveness.

There are three ways in which we can visualize each of these last two views. First, we can look across all the data, as we did above. However, we can further look at two other filters. The first looks at these outcomes when the sequence begins an at-bat. In this case, we are interested to see what happens when the sequence is a strategy for a pitcher to begin working a batter. In some cases, we may find that a pitcher doesn't use it much in the beginning, or that its impact is relatively minimal. In others, we may see it being used to induce foul contact, or even being taken for a ball. The other filter contrasts this opening sequence by examining the "closing" sequence. This filter looks at instances where the sequence brings the at-bat to an end. In this case, the pitcher may be using this as a strategy to pursue a strikeout (in which case we can check to see whether it’s usually swinging or looking), or if it's often hit for contact, whether it induces a groundball or is hit in the air. It's worth noting that in some cases, there is little information in these last two filters, since the sequence is used somewhere in the middle. In this case, the “all” filter is the best one to study.
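The three filters amount to asking where in the at-bat a sequence occurs. A minimal sketch, assuming each at-bat is stored as an ordered list of pitch labels and the sequence is matched on consecutive pitches (the function name and data layout are hypothetical):

```python
def classify_occurrences(at_bat, pattern):
    """Find every occurrence of `pattern` in `at_bat` and note its position.

    Both arguments are lists of pitch labels. An occurrence "opens" the
    at-bat if it starts on the first pitch and "closes" it if it ends on
    the last pitch. Every occurrence counts toward the "all" filter.
    """
    n, m = len(at_bat), len(pattern)
    occurrences = []
    for start in range(n - m + 1):
        if at_bat[start:start + m] == pattern:
            occurrences.append({
                "start": start,
                "opens": start == 0,
                "closes": start + m == n,
            })
    return occurrences
```

An occurrence that neither opens nor closes the at-bat shows up only under the "all" filter, which is why that filter is the one to study for mid-at-bat sequences.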

An interesting way to use these filters is to look at how opening with a sequence may be different from ending with a sequence. Let’s go back to Latos. First, the “First Pitch” filter:

There isn’t a lot of color on this map, indicating that the sequence is either relatively homogeneous or rarely occurs. It was more likely to be used to open an at-bat in 2010, when it would often open at-bats that more often ended with groundballs. Now, let’s take a look at this sequence when it ends the at-bat:

In this case, it appears that the sequence was used more often to end at-bats. We see that it ended with a lot of contact in both years, with 2010 seeing the bulk of the groundballs and fly balls. In 2012, this has been a sequence on which batters have hit line drives and some grounders. There are some strikeouts in both years, as well, but strikeouts are more prevalent in the all-data view. This would suggest that this sequence also led to strikeouts when it was neither the opening nor the closing approach.

We hope these pitch sequence visualizations will allow people to answer some important questions and find better ways to ask others. We're excited to see what you do with this information, and your feedback will help us make it even more useful.

Appendix
The basis for this visualization is the use of an algorithm from a family of data mining techniques known as Association Rules and Sequence Mining. These techniques are used to discover common patterns in large amounts of data. While Association Rules are useful for finding relations such as "if a person buys eggs, they also tend to buy milk," Sequence Mining extends this to an ordinal relation, which is most often temporal. In this case, Sequence Mining would find "If a person first buys eggs, then they will buy milk afterwards." The choice between these two is based on whether the practitioner is looking for things that occur together or things that happen together in order. In the case of our PITCHf/x data, since pitches are well ordered, we want to find these common sequences for each pitcher.
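The difference between the two ideas is easy to show with the eggs-and-milk example. The sketch below simply counts support directly; it is not an implementation of any particular mining algorithm, and the function names and sample data are made up for illustration.

```python
def cooccurrence_support(histories, a, b):
    """Association-rule style: fraction of histories containing both items, in any order."""
    hits = sum(1 for events in histories if a in events and b in events)
    return hits / len(histories)


def ordered_support(histories, first, then):
    """Sequence-mining style: fraction of histories where `then` occurs after `first`."""
    def occurs_in_order(events):
        if first not in events:
            return False
        # Checking after the earliest `first` is enough: anything that
        # follows a later `first` also follows the earliest one.
        return then in events[events.index(first) + 1:]

    hits = sum(1 for events in histories if occurs_in_order(events))
    return hits / len(histories)


histories = [
    ["eggs", "milk"],
    ["milk", "eggs"],
    ["eggs", "bread", "milk"],
    ["bread"],
]
together = cooccurrence_support(histories, "eggs", "milk")   # 0.75
in_order = ordered_support(histories, "eggs", "milk")        # 0.5
```

Three of the four shoppers bought both items, but only two bought eggs before milk, so the ordered pattern has lower support than the unordered one.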

The algorithm can handle multiple dimensions from the database, so we examine attributes such as the handedness of the batter, the location of the pitch, and the type of pitch thrown. These dimensions can not only be specified explicitly, but can also be represented in a hierarchy of values that provides generality. For example, we can represent the handedness of a batter in a simple two-level tree, where the root of the tree is every batter and the children of the root are LHH and RHH. The algorithm will first look to see if there is enough support for the handedness to be specified at the most specific level. If not, it will fall back to more general categories before the actual mining of the sequences. In this current iteration, the two dimensions where you will most commonly see this are pitch type and location. At the very bottom of the pitch-type hierarchy, every pitch is specified explicitly. The level above that groups pitches into "Hard" pitches, "Breaking" pitches, and "Offspeed" pitches. At the very top, all pitches are fully generalized to "Any Pitch."
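The idea of falling back to a more general label when a specific one lacks support can be sketched as below. This is a deliberately simplified stand-in, not the M2SP procedure itself: the assignment of individual pitch types to the "Hard," "Breaking," and "Offspeed" groups is our assumption for the example, and `generalize` is a hypothetical helper that uses a raw count as its support threshold.

```python
from collections import Counter

# Assumed three-level hierarchy: specific pitch -> group -> "Any Pitch".
PITCH_PARENT = {
    "Fourseam": "Hard", "Sinker": "Hard",
    "Slider": "Breaking", "Curveball": "Breaking",
    "Changeup": "Offspeed",
    "Hard": "Any Pitch", "Breaking": "Any Pitch", "Offspeed": "Any Pitch",
}


def generalize(values, parent, min_support):
    """Lift each label up the hierarchy until it occurs at least `min_support` times.

    Labels with enough occurrences stay as specific as they are; rarer
    labels are replaced by their parent, and the check is repeated until
    nothing changes or the root of the hierarchy is reached.
    """
    labels = list(values)
    while True:
        counts = Counter(labels)
        lifted = [
            parent[label] if counts[label] < min_support and label in parent else label
            for label in labels
        ]
        if lifted == labels:
            return labels
        labels = lifted


pitches = ["Fourseam"] * 5 + ["Sinker", "Slider", "Curveball", "Changeup"]
result = generalize(pitches, PITCH_PARENT, min_support=2)
```

With a threshold of two, the five four-seamers stay specific, the lone slider and curveball merge into "Breaking," and the single sinker and changeup end up as "Any Pitch" because even their groups are too rare.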

The location at its most specific level is a specific location either out of the zone (e.g., low and away from a left-hander) or inside the strike zone (e.g., middle-in). The strike zone is broken up into ninths for further specificity (e.g., high and inside the strike zone to a right-hander). In the case where a pitcher throws a pitch with significant movement and thus never really throws it with enough control to target specific areas of the strike zone, but does frequently throw it in the zone, we want to capture that it's often just "in the strike zone." In this case, the location dimension of the pitch will be abstracted for that data from a specific location to "In Zone." Locations can be further generalized to vertical locations relative to the batter ("High," "Middle," and "Low") and finally to the most general of all locations, "Anywhere."

The algorithm that handles the correct level of specificity of multi-dimensional data is based on "M2SP: Mining Sequential Patterns Among Several Dimensions" by Plantevit et al. After these dimensions have been mined to the appropriate specificity, the at-bats are then mined using a Sequence Mining algorithm known as SPAM (Sequential PAttern Mining) by Ayres et al. (which, for those interested in learning more about the inner workings of machine learning algorithms, can be found here). Working together, these two algorithms take two parameters. Both are termed "Minimum Support," since they are essentially the lowest thresholds at which a dimension can be considered specific enough and a pattern can be considered "frequent." Thus, these parameters specify how much support you are guaranteed to find in the data for a multi-dimensional pattern returned as a result. Clearly, some patterns are more frequent than others, so in our visualization, the patterns are shown in decreasing order of frequency. However, they are all mined with the same levels of support.
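The role of the pattern-level minimum support can be illustrated with a brute-force count of two-pitch sequences. To be clear, this is not SPAM, which uses a far more efficient representation; it is a small sketch with hypothetical names that assumes pitches must be consecutive and that support is counted once per at-bat.

```python
from collections import Counter


def frequent_pairs(at_bats, min_support):
    """Return consecutive two-pitch patterns seen in at least `min_support` at-bats.

    `at_bats` is a list of at-bats, each an ordered list of pitch labels.
    Results are sorted by decreasing support, ties broken alphabetically.
    """
    counts = Counter()
    for pitches in at_bats:
        # A set ensures a pattern repeated within one at-bat is counted once.
        counts.update({(a, b) for a, b in zip(pitches, pitches[1:])})
    frequent = [(pair, n) for pair, n in counts.items() if n >= min_support]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


at_bats = [
    ["Sinker", "Sinker", "Changeup"],
    ["Sinker", "Sinker"],
    ["Changeup", "Fourseam"],
    ["Sinker", "Sinker", "Changeup"],
]
ranked = frequent_pairs(at_bats, min_support=2)
```

Every pattern that survives is guaranteed to appear in at least `min_support` at-bats, and the ranking mirrors the decreasing-frequency order in which sequences are listed on the player card.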

Daniel would like to acknowledge his adviser Gautam Biswas for the discussions about this work and helping frame it for his dissertation. He would also like to acknowledge his colleague John Kinnebrew for help in providing ideas and code for this work, especially for the sequence-mining algorithms.
