The PITCHf/x optical video and TrackMan Doppler radar sensors estimate parameters of pitches, including the speed, horizontal movement and vertical movement. The data recorded by these systems can be used to develop pitcher similarity measures. These measures are valuable not only for comparing major-league pitchers to each other, but also for allowing the direct comparison of pitchers in other leagues (minor, amateur and foreign) to their MLB counterparts.
A pitcher similarity measure can be employed for multiple purposes by analysts. The identification of groups of similar pitchers can be used to generate optimized projection models [18], or to generate larger samples for predicting the outcome of batter/pitcher matchups [3], [20]. In addition, a similarity measure allows for individual pitchers to be monitored over time in order to detect possible changes in pitch characteristics, health and throwing mechanics.
Previous methods for quantifying pitcher similarity have been limited to the comparison of pitches of the same type, which makes these methods highly dependent on the outcome of pitch-classification algorithms. Kalk [8], [9] developed a similarity measure that compared pitches of the same type using variables that included pitch frequency, speed and movement. Loftus [11], [12], [13] improved on Kalk's approach by separating pitchers by handedness while using the Kolmogorov-Smirnov distance to compare distributions. Like Kalk's method, however, this approach only considers comparisons between pitches of the same type.
A difficulty for these methods is that different pitch types for a single pitcher or across multiple pitchers can have similar properties. This causes the pitch-frequency statistics used by similarity algorithms to depend heavily on the classification process; it also prevents the comparison of similar pitches that are classified as different pitch types.
In 2016, for example, Ubaldo Jimenez's sinker averaged 91.12 mph, 7.35 inches of horizontal movement and 8.53 inches of vertical movement, while Jeremy Hellickson's four-seam fastball had nearly identical averages of 90.81 mph, 7.63 inches of horizontal movement and 8.44 inches of vertical movement. Due to this issue, Loftus [13] conceded that his own method is best suited for comparing individual pitches as opposed to comparing pitchers based on their entire arsenal. Gennaro [3] has proposed a more qualitative approach to measuring pitcher similarity by using a hand-selected set of features and weightings. The features used by this method include a pitcher's two most-common pitch types and his most-common two-pitch sequence.
In this work, we develop a pitcher similarity measure that considers the speed and movement of every pitch. We note that other factors that are less indicative of a pitcher's raw stuff, such as pitch location [4], sequencing [5] and deception [14], also play a role in determining performance.
Given pitch speed and movement, we can plot a pitch as a point in a cube. Using data from Brooks Baseball, for example, we can plot a thousand Jon Lester pitches from 2016 with the speed (s) in miles per hour, along with the horizontal and vertical movement parameters (x, z) in inches, and where different colors represent different pitch types:
[Figure: Jon Lester pitches in 2016]
We also can do this for 1,000 Chris Sale pitches:
[Figure: Chris Sale pitches in 2016]
Lester and Sale clearly have different pitch distributions. But how different are they?
Here's a puzzle: Suppose that each of Lester's pitches in the plot is a ten-pound weight. Without worrying about pitch types, move each of Lester's thousand pitches so that, as a group, they end up at the same location as Sale's thousand pitches. To make this more interesting, find the way to move the pitches that requires the least work.
Too busy to solve the puzzle right now? That's OK. There's an algorithm called the Earth Mover's Distance, or EMD [16], which can figure out the easiest way to move the pitches and how much work is required. The idea is that the less work that's needed to rearrange Lester's pitches to match Sale's pitches, the more similar the two pitchers are to each other. Even better, the EMD algorithm is efficient and can normalize the distributions so that we don't need the same number of pitches in each plot.
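For readers who want to see the puzzle in miniature, here is a rough Python sketch. It assumes the simplest possible case: both pitchers have the same number of pitches and every pitch carries the same weight, so the cheapest plan pairs each pitch with exactly one target pitch and we can simply try every pairing. The function name and the three-pitch samples are made up for illustration, and a real implementation would use a proper transportation solver that also handles unequal pitch counts.

```python
from itertools import permutations
from math import dist


def emd_equal_sets(source, target):
    """EMD between two equal-size point sets with one unit of weight per point.

    With equal counts and equal weights, some optimal plan moves each source
    point to exactly one target point, so a search over pairings is exact.
    Only practical for a handful of points.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes the same number of pitches in each set")
    least_work = min(
        sum(dist(point, target[j]) for point, j in zip(source, pairing))
        for pairing in permutations(range(len(target)))
    )
    # Divide by the number of pitches to get the average distance moved per pitch.
    return least_work / len(source)


# Hypothetical (speed in mph, horizontal in., vertical in.) points, not real data.
pitcher_a = [(92.0, -6.0, 9.0), (85.0, 2.0, 1.0), (78.0, 6.0, -5.0)]
pitcher_b = [(93.0, -6.0, 9.0), (85.0, 2.0, 3.0), (78.0, 9.0, -5.0)]

print(emd_equal_sets(pitcher_a, pitcher_b))  # 2.0
```

In this toy case the cheapest plan moves the three pitches by 1, 2 and 3 units, for an average of 2 units per pitch.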
Things get more complicated because some paths are more difficult to traverse as we move pitches around in the cube. To be more specific, let's look at a plot of the speed and vertical movement (again, represented by s and z) for a large set of pitches from different pitchers in 2016.
[Figure: Scatterplot of Speed and Vertical Movement]
We see that s and z have a significant correlation, so that a pitch thrown with a higher speed will tend to have a higher vertical movement. This means that moving a pitch with the flow from the orange spot toward the red spot is easier than moving it against the flow toward the green spot. We can address this issue by combining a whitening transform [1] with the Earth Mover's Distance to account for both differences in the variances of the s, x and z variables and their correlation structure.
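As a rough illustration of what a whitening transform does, the sketch below uses Cholesky whitening, which is one of several equivalent choices and is not necessarily the one used in the technical details [6]. The (speed, vertical movement) points are hypothetical, and the helper names are ours. After whitening, each variable has unit variance and the variables are uncorrelated, so ordinary straight-line distance in the new coordinates already accounts for the "flow" in the scatterplot.

```python
from math import sqrt


def mean_and_cov(points):
    """Sample mean vector and covariance matrix of a list of equal-length tuples."""
    n, d = len(points), len(points[0])
    mu = [sum(p[k] for p in points) / n for k in range(d)]
    cov = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(a):
    """Lower-triangular L with L * L^T == a (a must be positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = sqrt(s) if i == j else s / low[j][j]
    return low


def whiten(points):
    """Cholesky whitening: solve L y = x - mean, so the covariance of y is the identity."""
    mu, cov = mean_and_cov(points)
    low = cholesky(cov)
    whitened = []
    for p in points:
        y = []
        for i in range(len(p)):  # forward substitution, one coordinate at a time
            y.append((p[i] - mu[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
        whitened.append(tuple(y))
    return whitened


# Hypothetical (speed in mph, vertical movement in inches) pairs, not real data.
pitches = [(95.0, 10.5), (93.0, 9.0), (90.0, 8.5), (87.0, 4.0), (84.0, 2.5), (79.0, -4.0)]
white = whiten(pitches)

# Changing units (mph to km/h here) leaves the whitened coordinates unchanged.
white_kmh = whiten([(s * 1.609344, z) for s, z in pitches])
```

One consequence of this design is that the choice of units for each axis stops mattering: rescaling speed from mph to km/h rescales the covariance in the same way, and the whitened points come out the same.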
Since a pitcher's approach depends on batter handedness, we use the whitened EMD to compare pitchers separately based on their pitch distributions against righthanded and lefthanded batters. The two values are then combined into a single measure of similarity. If you'd like more details [6] on how this all works, just follow the link.
Data Analysis
We will demonstrate the similarity measure for several applications including the identification of similar and dissimilar pitchers, the identification of unique pitchers, the quantification of year-to-year pitcher stability, and the quantification of pitcher variation with batter handedness and the count. All analysis in this article uses the pitch data from Brooks Baseball and the associated pitch classifications from Pitch Info. Pitch speed will be given in miles per hour, and the x and z movement parameters [15] will be specified in inches.
Similar Pitchers
For the 2016 season, we consider the 196 righthanded pitchers and the 63 lefthanded pitchers who threw at least 1,000 pitches during the regular season. For each of these pitchers, the most similar pitcher and the corresponding distance can be found here [7]. Smaller values of the distance correspond to more similar pitchers.
The most similar pair of righthanded pitchers in 2016 was Matt Harvey and Shelby Miller. Both threw four-seam fastballs with similar parameters (speed, horizontal movement and vertical movement) at similar frequencies. In particular, each pitcher threw 59-60 percent four-seamers to righthanded batters, and 56-57 percent four-seamers to lefthanded batters, with Harvey averaging 95.39 mph and Miller averaging 94.15 mph on these pitches. We also note that Harvey's slider (89.51 mph, 0.90 inches of horizontal movement, 4.28 inches of vertical movement) was like Miller's cutter (89.41, 1.17, 3.89), and each pitcher used this respective pitch 25-26 percent of the time against righthanded batters. Similarity metrics that do not compare pitches of different types would be unaware of the similarity of these pitches.
The most similar pair of lefthanded pitchers in 2016 was Jon Niese and Chris Rusin. The most frequent pitches for each lefthander against righthanded batters were their sinker and cutter, which they threw at similar frequencies and with similar properties. For their sinkers against RHB, we have 89.52 mph, 9.63 inches of horizontal movement, 4.30 inches of vertical movement and 27.2 percent frequency for Niese, and 90.32, 9.74, 4.88 and 24.4 percent frequency for Rusin. For their cutters against RHB, we have 86.74 mph, 0.30 inches of horizontal movement, 3.86 inches of vertical movement and 27.2 percent frequency for Niese, and 87.49, 1.62, 3.78 and 29.9 percent frequency for Rusin. Each pitcher's most frequent pitch to lefthanded batters was their sinker, which Niese threw 40.7 percent of the time and Rusin threw 38.8 percent.
Dissimilar Pitchers
The most dissimilar pair of righthanded pitchers in 2016 was Brad Ziegler and Marco Estrada, with a distance of 5.688. The difference largely was due to an extreme discrepancy in the vertical movement on their pitches. Ziegler threw 57.7 percent sinkers with an average vertical movement of 6.72 inches, while Estrada threw 50.1 percent four-seam fastballs with an average vertical movement of 13.01 inches. Ziegler had the smallest average vertical movement, 5.33 inches, over all of his pitches. Estrada had the highest vertical movement at 9.64 inches.
The most dissimilar pair of lefthanded pitchers was Zach Britton and Tommy Milone, with a distance of 4.238. Britton threw more than 90 percent sinkers, averaging more than 97 mph and with 3.70 inches of vertical movement. Milone averaged only 88.19 mph on his hardest and most frequent pitch, a four-seam fastball, which he threw 45.5 percent of the time with an average vertical movement of 11.45 inches.
Unique Pitchers
The similarity measure can also be used to find the most unique major-league pitchers. The righthanded pitchers with the greatest distance to their most similar match in 2016 are:
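Unique RHP | Distance to nearest RHP
Brad Ziegler | 2.8651
[name missing] | 1.7653
[name missing] | 1.4429
[name missing] | 1.3934
[name missing] | 1.3896
[name missing] | 1.3648
[name missing] | 1.2610
[name missing] | 1.2232
[name missing] | 1.1660
[name missing] | 1.1258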
Lefthanded pitchers with the greatest distance to their most similar match in 2016:
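Unique LHP | Distance to nearest LHP
Zach Britton | 1.7251
[name missing] | 1.4946
[name missing] | 1.4912
[name missing] | 1.4223
[name missing] | 1.3264
[name missing] | 1.2464
Tommy Milone | 1.1309
[name missing] | 1.0658
[name missing] | 0.9960
[name missing] | 0.9782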
Hard-throwing Aroldis Chapman fell short of the 1,000-pitch threshold, but would rank as the second-most unique lefthander behind Britton, with a distance of 1.5495 to the nearest lefthander, Tony Cingrani.
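The "most similar match" and "most unique" rankings above both come from the same lookup on a table of pairwise distances. Here is a minimal sketch of that lookup, assuming the pairwise distances have already been computed; the pitcher labels A through D and their distances are invented for illustration.

```python
def nearest_matches(distance):
    """Map each pitcher to (closest other pitcher, distance to that pitcher).

    `distance` holds one entry per unordered pair, keyed by a 2-tuple of names.
    """
    names = sorted({name for pair in distance for name in pair})

    def d(a, b):
        return distance[(a, b)] if (a, b) in distance else distance[(b, a)]

    matches = {}
    for a in names:
        best = min((b for b in names if b != a), key=lambda b: d(a, b))
        matches[a] = (best, d(a, best))
    return matches


def rank_by_uniqueness(distance):
    """Sort pitchers by distance to their nearest match, most unique first."""
    matches = nearest_matches(distance)
    return sorted(matches.items(), key=lambda item: item[1][1], reverse=True)


# Hypothetical pairwise distances for four made-up pitchers.
pairwise = {
    ("A", "B"): 0.4, ("A", "C"): 1.9, ("A", "D"): 1.1,
    ("B", "C"): 2.1, ("B", "D"): 0.9, ("C", "D"): 1.6,
}

for name, (match, dist_to_match) in rank_by_uniqueness(pairwise):
    print(name, match, dist_to_match)
```

In this toy table, pitcher C is the most unique because even his closest match, D, is 1.6 away.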
Visualizing Similarity
The similarity structure for a group of pitchers can be visualized using nonmetric multidimensional scaling (NMDS) [10]. We use NMDS to visualize properties of the similarity measure for unique righthanded and lefthanded pitchers. The NMDS result for the ten most unique righthanded pitchers, plus the two most prominent knuckleballers, R.A. Dickey and Steven Wright, is:
[Figure: NMDS Result for Unique Righthanded Pitchers in 2016]
The most unique righthander, Brad Ziegler, is in the far upper right in the figure. Ziegler's uniqueness is largely due to throwing a large share (57.7 percent) of sinkers with a low average velocity (84.74 mph) and heavy sink (7.28 inches of vertical movement). The closest pitchers to Ziegler in the plot are Steve Cishek and Aaron Nola, who each threw 40-44 percent sinkers but at a higher velocity than Ziegler. The pitchers in the plot with the highest average velocity over their pitches (Rodney, McCullers, Shaw) are in the lower-right quadrant. In this group, Rodney appears closest to Cishek and Nola due to also throwing a high percentage of sinkers (39.1 percent), but the high vertical movement on his pitches, particularly his four-seam fastball, pulls him to the left of these two. Bryan Shaw has the highest average velocity among pitchers in the figure and appears at the lowest point in the plot.
To the left of Rodney is a group of three pitchers (Estrada, Young, Clippard) who displayed the highest average vertical movement on their pitches among the pitchers in the figure. This high vertical movement was largely achieved by throwing 45-51 percent four-seam fastballs. Above this group is Jered Weaver, who threw pitches with a high average vertical movement, but also had the lowest average pitch velocity in the plot among the non-knuckleballers. Dickey and Wright appear together above Weaver and, as shown here [7], the two knuckleballers are the best match for each other over the 196 righthanded pitchers in the data set. We see that the most dissimilar righthanded pitchers in the entire data set, Ziegler and Estrada, are also the most separated in the plot.
The NMDS result for the ten most unique lefthanded pitchers, plus Aroldis Chapman, is:
[Figure: NMDS Result for Unique Lefthanded Pitchers]
The most unique lefthander, Zach Britton, is on the far-right edge of the plot. Britton achieved his uniqueness by throwing a high volume (92.0 percent) of very hard (97.44 mph) sinkers. The closest lefthander to Britton in the figure is Clayton Richard, who also threw a high volume (65.0 percent) of sinkers but at a lower velocity (91.59 mph). To the left of Richard and farther removed from Britton is Zach Duke, who also threw a large number of sinkers but at an even lower frequency (50.4 percent) and velocity (90.13 mph). The second-most unique lefthander in the group, Aroldis Chapman, who threw a lot (81.1 percent) of very hard (101.32 mph) four-seam fastballs, appears at the lowest point on the plot.
On the left side of the figure are four lefthanders (Milone, Lamb, Urias, Kershaw) who all favored the four-seam fastball with frequencies varying between 45.5 percent for Milone and 55.3 percent for Urias. The average four-seam velocity for the pitchers increases from top to bottom with mph values of 88.19 (Milone), 90.49 (Lamb), 93.32 (Urias) and 93.74 (Kershaw). To the right of these four pitchers are Drew Pomeranz and Rich Hill, who both complemented their four-seam fastball with a large percentage of curves with sharp downward movement. Hill is the closest pitcher to Andrew Miller in the plot. Since Miller's four-seam fastball is harder than Hill's, and Miller's most frequent off-speed pitch is a slider that is thrown substantially harder than Hill's curve, Miller appears lower than Hill. We see that the most dissimilar lefthanded pitchers in the full data set, Britton and Milone, are also the most separated in the plot.
Pitchers with Small Year-to-Year Variation
We can use the similarity measure to compare pitchers to themselves over time. For this purpose, we computed the similarity measure between 2015 and 2016 for each pitcher who threw at least 1,000 pitches in each regular season.
Righthanded pitchers who changed the least between 2015 and 2016 (with their age as of June 30, 2016):
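RHP | Distance | Age
R.A. Dickey | 0.1280 | 41
[name missing] | 0.2584 | 31
[name missing] | 0.2654 | 31
[name missing] | 0.2801 | 43
[name missing] | 0.2881 | 29
[name missing] | 0.2995 | 30
[name missing] | 0.3040 | 28
Jered Weaver | 0.3062 | 33
[name missing] | 0.3107 | 31
[name missing] | 0.3215 | 33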
Lefthanders:
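LHP | Distance | Age
Jon Lester | 0.2581 | 32
Carlos Rodon | 0.3056 | 23
[name missing] | 0.3357 | 35
[name missing] | 0.3572 | 32
[name missing] | 0.3922 | 27
[name missing] | 0.3963 | 26
[name missing] | 0.4007 | 26
[name missing] | 0.4147 | 31
[name missing] | 0.4150 | 30
Chris Rusin | 0.4169 | 29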
Many of the smallest changers are veterans, with 13 of the 20 pitchers in the tables being at least 30 years old at midseason 2016, and with all pitchers (except Carlos Rodon) being at least 26. Two of the smallest changers are the knuckleballers R.A. Dickey and Steven Wright. Unsurprisingly, Bartolo Colon is also one of the least-changing righthanders.
Pitchers with Large Year-to-Year Variation
Righthanded pitchers who changed the most between 2015 and 2016:
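RHP | Distance | Age | 2015 ERA | 2016 ERA
[name missing] | 1.1081 | 29 | 4.50 | 2.28
[name missing] | 0.9869 | 25 | 4.55 | 4.26
[name missing] | 0.9639 | 26 | 2.71 | 2.75
[name missing] | 0.9227 | 32 | 4.18 | 4.43
[name missing] | 0.9156 | 29 | 4.46 | 3.88
[name missing] | 0.9063 | 35 | 2.84 | 2.48
[name missing] | 0.8785 | 31 | 1.90 | 2.25
[name missing] | 0.8329 | 22 | 3.22 | 3.22
[name missing] | 0.8240 | 23 | 3.24 | 2.60
Aaron Nola | 0.8150 | 23 | 3.59 | 4.78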
Lefthanders:
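LHP | Distance | Age | 2015 ERA | 2016 ERA
[name missing] | 1.4217 | 27 | 3.90 | 3.79
[name missing] | 1.0952 | 26 | 4.60 | 2.52
[name missing] | 1.0056 | 26 | 5.30 | 2.92
[name missing] | 0.9570 | 25 | 7.53 | 4.53
[name missing] | 0.9151 | 26 | 4.48 | 6.04
[name missing] | 0.8312 | 23 | 3.75 | 3.38
Drew Pomeranz | 0.8008 | 27 | 3.66 | 3.32
[name missing] | 0.7765 | 27 | 4.08 | 3.51
[name missing] | 0.7258 | 28 | 4.49 | 5.44
Chris Sale | 0.6737 | 27 | 3.41 | 3.34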
We see that these pitchers are younger than their more stable counterparts, with only three of the 20 pitchers being at least 30 years old at midseason 2016. Six of the 10 righthanders, and eight of the ten lefthanders, improved their ERA from 2015 to 2016. Several of these pitchers (Phelps, Chavez, Montgomery, Hand, Pomeranz) went from starting in 2015 to relieving in 2016. Others near the top of the lists include Trevor Bauer and Kelvin Herrera, who made significant changes to their pitch mix [2] [19], along with James Paxton, who made a significant change to his pitching mechanics [17].
Pitchers with Small Platoon Distances
We can use our similarity measure to compute the difference between a pitcher's distribution of pitches against righthanded and lefthanded batters. We considered all pitchers who threw at least 1,000 pitches during the 2016 regular season.
Righthanded pitchers who changed the least with batter handedness:
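RHP | Distance | wOBA vs. R | wOBA vs. L
[name missing] | 0.0781 | .229 | .228
[name missing] | 0.0970 | .222 | .292
Will Harris | 0.1592 | .263 | .229
[name missing] | 0.1780 | .324 | .327
[name missing] | 0.2242 | .320 | .476
Adam Warren | 0.2338 | .343 | .258
[name missing] | 0.2352 | .318 | .333
R.A. Dickey | 0.2400 | .337 | .339
[name missing] | 0.2486 | .260 | .353
Steven Wright | 0.2517 | .303 | .271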
Lefthanders:
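LHP | Distance | wOBA vs. R | wOBA vs. L
Adam Conley | 0.2157 | .316 | .334
[name missing] | 0.2538 | .310 | .290
[name missing] | 0.2632 | .395 | .356
[name missing] | 0.2781 | .292 | .287
[name missing] | 0.2919 | .279 | .223
Drew Smyly | 0.3004 | .328 | .305
[name missing] | 0.3076 | .296 | .307
[name missing] | 0.3166 | .333 | .270
Zach Britton | 0.3253 | .180 | .226
Andrew Miller | 0.3339 | .207 | .220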
Several of these pitchers relied heavily on a single pitch type. Reed (72.2 percent), Allen (63.3 percent) and Conley (65.5 percent) threw a large fraction of four-seam fastballs. Dickey (87.6 percent) and Wright (83.1 percent) threw a large fraction of knuckleballs, while Harris (66.4 percent cutter), Britton (92.0 percent sinker) and Miller (60.7 percent slider) also threw a large fraction of a single pitch type. Throwing a similar distribution of pitches to righthanded and lefthanded batters is a characteristic of a pitcher's approach, but is not necessarily indicative of his platoon results. While several of the pitchers (Reed, McCullers, Dickey, Happ) who had a similar approach against righthanded and lefthanded batters exhibited a very small wOBA platoon split, others (Young, DeSclafani) had large wOBA platoon splits.
Pitchers with Large Platoon Distances
Righthanded pitchers who changed the most with batter handedness:
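RHP | Distance | wOBA vs. R | wOBA vs. L
Brad Ziegler | 1.8874 | .278 | .306
Jered Weaver | 1.1993 | .365 | .365
[name missing] | 1.0970 | .224 | .332
[name missing] | 1.0896 | .212 | .375
Kelvin Herrera | 1.0802 | .268 | .246
[name missing] | 0.9924 | .243 | .269
[name missing] | 0.9723 | .313 | .334
[name missing] | 0.9458 | .287 | .262
[name missing] | 0.9152 | .317 | .327
[name missing] | 0.8719 | .412 | .454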
Lefthanders:
LHP | Distance | wOBA vs. R | wOBA vs. L
Brad Hand | 1.1295 | .297 | .194
[name missing] | 1.0366 | .272 | .343
Tony Watson | 0.9595 | .302 | .253
Tommy Milone | 0.9032 | .362 | .357
[name missing] | 0.8817 | .322 | .231
Danny Duffy | 0.8480 | .325 | .201
[name missing] | 0.8313 | .269 | .302
Rich Hill | 0.8279 | .244 | .232
Patrick Corbin | 0.7693 | .363 | .324
Drew Pomeranz | 0.7610 | .287 | .284
We see that by using very different distributions of pitches to righthanded and lefthanded batters, several of these pitchers (Weaver, Milone, Pomeranz) had very small wOBA platoon splits while others (Iglesias, McGowan, Duffy) had large wOBA platoon splits.
None of the righthanders and only two of the lefthanders (Rivero and Siegrist) who changed the most in response to batter handedness threw a single pitch type at least 60 percent of the time. Seven of the righthanders (Ziegler, Weaver, Iglesias, McGowan, Herrera, Ramos, Chacin) contributed to their platoon variation by throwing a significantly higher fraction of sliders to righthanded batters and a significantly higher fraction of changeups to lefthanded batters. For the purposes of this analysis, “significantly” refers to a fraction that is higher by at least 10 percent. Similarly, four of the lefthanders (Rivero, Watson, Manaea, Corbin) threw a significantly higher fraction of sliders to lefthanded batters and a significantly higher fraction of changeups to righthanded batters.
Another popular strategy used by six of the pitchers who changed the most (Weaver, McGowan, Hand, Duffy, Siegrist, Corbin) was to throw a significantly higher fraction of four-seam fastballs to same-side batters, and a significantly higher fraction of sinkers to opposite-side batters. Righthander Kyle Hendricks employed the opposite approach by throwing a significantly higher fraction of sinkers to righthanded batters, and a significantly higher fraction of four-seam fastballs to lefthanded batters. Lefthanders Milone and Hill enhanced their platoon variation by throwing a significantly higher fraction of curveballs to lefthanded batters.
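The pitch-mix comparisons above can be sketched in a few lines of Python. This is only an illustration: it assumes "higher by at least 10 percent" means a gap of at least 10 percentage points in usage, and the function names and the 100-pitch samples are hypothetical.

```python
from collections import Counter


def pitch_mix(pitch_types):
    """Percentage of pitches thrown for each pitch type."""
    counts = Counter(pitch_types)
    total = sum(counts.values())
    return {ptype: 100.0 * n / total for ptype, n in counts.items()}


def platoon_shifts(vs_rhb, vs_lhb, threshold=10.0):
    """Pitch types whose usage differs by at least `threshold` percentage points.

    Positive values mean the pitch is thrown more often to righthanded batters.
    The 10-point default is our reading of "significantly" in the text.
    """
    mix_r, mix_l = pitch_mix(vs_rhb), pitch_mix(vs_lhb)
    shifts = {}
    for ptype in set(mix_r) | set(mix_l):
        diff = mix_r.get(ptype, 0.0) - mix_l.get(ptype, 0.0)
        if abs(diff) >= threshold:
            shifts[ptype] = diff
    return shifts


# Hypothetical righthander: more sliders to RHB, more changeups to LHB.
vs_rhb = ["four-seam"] * 50 + ["slider"] * 35 + ["changeup"] * 15
vs_lhb = ["four-seam"] * 50 + ["slider"] * 15 + ["changeup"] * 35

print(platoon_shifts(vs_rhb, vs_lhb))
```

For this made-up pitcher, the slider comes out at +20 points and the changeup at -20 points, while the four-seam fastball is left out because its usage does not change.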
Pitchers with Small Changes after Two Strikes
We can use the similarity measure to compute how much a pitcher changes his distribution of pitches as the count changes. For each pitcher who threw at least 1,000 pitches in 2016, we computed the similarity measure between the pitcher's distribution of pitches thrown before two strikes and his distribution of pitches thrown after two strikes.
Righthanders:
RHP | Distance
Jason Grilli | 0.2231
Addison Reed | 0.2744
[name missing] | 0.2877
Jered Weaver | 0.2944
Fernando Salas | 0.3022
[name missing] | 0.3100
Steven Wright | 0.3101
[name missing] | 0.3113
Seunghwan Oh | 0.3137
Jesse Chavez | 0.3192
Lefthanders:
LHP | Distance
Zach Britton | 0.2108
Ryan Buchter | 0.2379
[name missing] | 0.3354
Tony Cingrani | 0.3404
Chris Rusin | 0.3549
Tyler Anderson | 0.3730
Jeff Locke | 0.3734
[name missing] | 0.3756
[name missing] | 0.3857
Steven Matz | 0.4018
The two righthanders who changed the least (Grilli 62.4 percent four-seamer, Reed 72.2 percent four-seamer) and the two lefthanders who changed the least (Britton 92 percent sinker, Buchter 84.7 percent four-seamer) each threw a large fraction of a single pitch type in 2016. In addition, several of the other pitchers in the two tables (Wright 83.1 percent knuckler, Quackenbush 63.2 percent four-seamer, Oh 60.6 percent four-seamer, Cingrani 87.4 percent four-seamer, Bastardo 65.5 percent four-seamer) each threw over 60 percent of a single pitch type in 2016.
Pitchers with Large Changes after Two Strikes
The righthanded and lefthanded pitchers who changed the most after reaching two strikes in 2016 are listed below. Each of these pitchers threw a significantly higher fraction of a particular breaking ball with two strikes. The pitch with the largest increase in frequency after two strikes over all batters faced is referred to as the Delta Pitch in the lists. The Δf column indicates how much more frequently a pitcher threw the Delta Pitch after two strikes as compared to before two strikes. Brad Ziegler, for example, threw his slider 10.16 percent of the time before two strikes and 40.45 percent of the time after two strikes for a Δf of 30.29 percent.
Righthanders:
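RHP | Distance | Delta Pitch | Δf
Brad Ziegler | 2.4306 | slider | 30.29%
[name missing] | 1.4501 | curve | 20.86%
[name missing] | 1.3814 | curve | 26.17%
[name missing] | 1.2009 | slider | 25.65%
[name missing] | 1.1797 | curve | 29.98%
[name missing] | 1.0923 | curve | 18.28%
[name missing] | 1.0913 | curve | 31.83%
Raisel Iglesias | 1.0753 | slider | 26.01%
[name missing] | 1.0514 | slider | 12.19%
Aaron Nola | 1.0365 | curve | 20.94%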
Lefthanders:
LHP | Distance | Delta Pitch | Δf
Zach Duke | 1.3792 | curve | 29.97%
Clayton Kershaw | 1.3174 | curve | 17.31%
[name missing] | 1.1124 | slider | 34.10%
Brad Hand | 1.0659 | slider | 25.61%
Carlos Rodon | 1.0431 | slider | 24.30%
Chris Sale | 1.0141 | slider | 22.79%
Patrick Corbin | 0.9607 | slider | 33.72%
Gio Gonzalez | 0.9532 | curve | 19.23%
Francisco Liriano | 0.9263 | slider | 31.01%
Blake Snell | 0.8758 | curve | 14.26%
Among the pitchers in the lists with smaller values of Δf for their Delta Pitch, Fiers (six pitch types) and Darvish (seven pitch types) had a large set of possible pitch types with which to adjust frequencies. Lefthanders Kershaw and Snell used a higher fraction of sliders with two strikes in addition to a higher fraction of their Delta Pitch curveballs.
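The Delta Pitch and Δf columns defined above come down to a simple frequency comparison, sketched here in Python. The function name and the sample pitch lists are hypothetical; the sample is built so that the slider goes from 10 percent of pitches before two strikes to 40 percent after, similar in spirit to the Ziegler example (10.16 percent to 40.45 percent, a Δf of 30.29).

```python
from collections import Counter


def delta_pitch(before_two_strikes, after_two_strikes):
    """Return (pitch type, Δf) for the pitch with the largest usage increase.

    Δf is the after-two-strikes usage minus the before-two-strikes usage,
    both in percent of pitches thrown.
    """
    def mix(pitches):
        counts = Counter(pitches)
        return {ptype: 100.0 * n / len(pitches) for ptype, n in counts.items()}

    before, after = mix(before_two_strikes), mix(after_two_strikes)
    gains = {ptype: after[ptype] - before.get(ptype, 0.0) for ptype in after}
    best = max(gains, key=gains.get)
    return best, gains[best]


# Hypothetical pitch logs, not real data.
before = ["sinker"] * 90 + ["slider"] * 10
after = ["sinker"] * 60 + ["slider"] * 40

print(delta_pitch(before, after))  # ('slider', 30.0)
```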
Conclusion
We have developed a new tool that analysts can exploit to study a range of application areas. The similarity measure allows the direct comparison of pitchers across various contexts including MLB, MiLB, amateur and foreign leagues, which can improve predictions for how a pitcher will perform in a new environment. The identification of similar pitchers increases the sample sizes that can be used to forecast the outcome of batter/pitcher matchups and supports regression to more appropriate population means by projection models. The measure also can be used to monitor pitchers over time, and to develop improved models for the health risk and aging characteristics associated with different pitcher classes.
For fans, the new tool reveals similarities that we didn't know existed and shows us, once again, that there's more than one way to find success as a major-league pitcher.
Acknowledgment
The authors thank Tom Tango and Mitchel Lichtman for helpful comments on a previous draft of this article. All pitch data used in this study was obtained from Brooks Baseball.
References
[1] R. Duda, P. Hart and D. Stork. Pattern Classification. Wiley-Interscience, New York, 2001.
[2] A. Fagerstrom. (June 24, 2016). FanGraphs: Trevor Bauer looks like a completely different pitcher.
[3] V. Gennaro. The Big Data approach to baseball analytics. In SABR Analytics Conference, Phoenix, AZ, March 2013.
[4] G. Healey. The intrinsic value of a pitch. In SABR Analytics Conference, Phoenix, AZ, March 2017.
[5] G. Healey and S. Zhao. Using PITCHf/x to model the dependence of strikeout rate on the predictability of pitch sequences. Journal of Sports Analytics, 2017.
[6] G. Healey, S. Zhao and D. Brooks. Measuring pitcher similarity: Technical details.
[7] G. Healey, S. Zhao and D. Brooks. Most similar match tables, 2016.
[8] J. Kalk. (Feb. 12, 2008). Hardball Times: Pitcher similarity scores.
[9] J. Kalk. (Feb. 19, 2008). Hardball Times: Pitcher similarity scores (part 2).
[10] J. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964.
[11] S. Loftus. (Apr. 15, 2013). Beyond the Box Score: Pitcher similarity scores.
[12] S. Loftus. (Apr. 25, 2013). Beyond the Box Score: Testing and visualizing similarity scores.
[13] S. Loftus. (Nov. 25, 2013). Beyond the Box Score: Pitcher similarity scores 2.0.
[14] J. Long, J. Judge and H. Pavlidis. (Jan. 24, 2017). Baseball Prospectus: Introducing pitch tunnels.
[15] A. Nathan. (Oct. 21, 2012). Determining pitch movement from PITCHf/x data.
[16] Y. Rubner, C. Tomasi and L. Guibas. The Earth Mover's Distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99-121, 2000.
[17] E. Sarris. (June 9, 2016). FanGraphs: James Paxton's new angle on life.
[18] N. Silver. Why was Kevin Maas a bust? In J. Keri, editor, Baseball between the numbers, pages 253-271. Basic Books, New York, 2006.
[19] J. Sullivan. (April 13, 2016). FanGraphs: Now Kelvin Herrera is almost impossible.
[20] T. Tango, M. Lichtman and A. Dolphin. The Book: Playing the Percentages in Baseball. Potomac Books, Dulles, Virginia, 2007.
Am I correct in understanding that a score can only be computed/expressed vis a vis one pair of pitchers?
Can one impute the score between, for example, pitcher A and B if we know the score between A and C and then again, B and C?
How can one produce a list of pitchers who are all similar to one another? Is it some kind of "least squares" method whereby you look at all possible combinations of scores?
The similarity measure D(A,B) is computed for a pair of pitchers A and B. If we know D(A,C) and D(B,C), we cannot determine D(A,B). We can, however, compute D(A,B) directly. The measure does satisfy the triangle inequality, so that D(A,B) <= D(A,C) + D(B,C), or we could say in English that if A and B are both close to C then A can't be too far from B.
One way to define a "similarity group" is as a set of pitchers where the maximum distance between any pair of pitchers in the group is less than some value. We can use clustering techniques to try to identify appropriate groups.
Similar to what we did in The Book (looking at "families" of pitchers).
Have you done this at all?
What I mean is this: Your 3 axes are speed and x and z movement, right? Speed is scaled in mph and movement in inches. If you changed the speed axis to km/hour or feet per second, or anything else, wouldn't that change the EMD?
Same with the movement axes. What if it were cm and not inches? How does the EMD know how much to "weight" movement in one direction compared to another? I mean if we change the speed axis to miles per second, it wouldn't take much to move from Jered Weaver's FB to Aroldis Chapman's. Isn't EMD dependent on what arbitrary units/scale you choose for each parameter? And shouldn't you have some idea as to how you want to weight each parameter? What if, for example, you thought that in terms of similarity, speed differences mean almost nothing, and that movement is the most important thing? How would you tweak the EMD algorithm?
But mostly I am confused and concerned with how the EMD algorithm "knows" how to weight each of the parameters? Is 1 inch of movement a lot? 6 inches? How does the EMD algorithm know what is a lot of movement and what is a little? Since it is relative to other pitchers, it doesn't matter, but that's only if there were only one parameter. But since there are 3, YOU have to tell it how to weight each movement, no?