Pitcher throws, batter swings. The fundamental unit of baseball can be distilled into an interaction between the pitcher and the batter. These days, it seems most popular to optimize for end results such as home runs. While home runs are easy to quantify because of their direct effect on the score, they’re generally difficult to predict because they are rare events. Out of about 717,000 pitches, only 2.2 percent result in good contact.


Figure 1: A diagram of all pitch events separated into contact/non-contact outcomes, and contact quality by speed-angle classification.

On the other hand, another logical place to start is at the beginning: when the batter decides to swing. Baseball can be broken down into a sequence of conditional probabilities. The first event is the pitcher throwing the ball with a given set of physical characteristics. Given those characteristics, what is the probability that the batter will swing at the ball? If the batter swings, what’s the probability of hitting the ball? If he hits the ball, what are the probabilities of the hit locations? Given the hit location, what are the probabilities of the various ball-in-play outcomes? And so on, until you reach the end of the game. Each link of the probability chain can be modeled and optimized separately to produce more accurate predictions at each step, which improves the overall prediction.
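To make the chain concrete, here is a minimal Python sketch of multiplying conditional probabilities together. The function name `chain_probability` and every number in it are made up for illustration; none of them come from the dataset described in this article.

```python
from math import prod

def chain_probability(conditionals):
    """Multiply a sequence of conditional probabilities.

    Each entry is P(step | all earlier steps happened), so the product
    is the probability that every step in the chain happens.
    """
    for p in conditionals:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(conditionals)

# Hypothetical values, for illustration only.
p_swing = 0.47     # P(swing | pitch characteristics)
p_contact = 0.75   # P(contact | swing, pitch characteristics)
p_in_play = 0.50   # P(ball in play | contact, swing, pitch characteristics)

p_ball_in_play = chain_probability([p_swing, p_contact, p_in_play])
print(f"P(pitch ends as a ball in play) = {p_ball_in_play:.4f}")
```

Because the links are multiplied, improving the estimate of any single link (for example, the swing probability) carries through to everything downstream of it.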

Secondarily, predicting if and when a batter will swing can be used to evaluate a batter’s physical capabilities and decision making. The decisions that a pitcher and batter make form a game in which the batter needs to predict the location of the ball, combine that with knowledge of his own capabilities, and decide whether it’s valuable to swing.

In order to predict swing and non-swing events, I gathered the TrackMan dataset of all pitches thrown in 2018. It was cut down to include only information that the batter would have at the moment he decides to swing. That includes physical variables such as the release point and the metrics of the ball 50 feet away from the plate. Spin axis and angle aren’t included in the database and, considering their importance to ball flight, the position of the ball at the plate had to be included as a stand-in. Other variables in the dataset include strike zone size/location and game situation information such as balls and strikes, along with inning number, men on base, and run difference between the batting team and the fielding team.

I used this dataset to train a decision tree model to predict a batter’s swing or non-swing. A decision tree is a non-linear model that takes each variable in turn and tries to use it to split the dataset into swings and non-swings. Using the dataset of right-handed pitchers against right-handed batters, the model was 76 percent accurate when tested on a validation dataset.
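The article does not say which library or settings were used, so the following is only a sketch of the core idea: search each variable for the threshold that best separates swings from takes, then check accuracy on held-out pitches. It builds a single-split tree (a "stump") using Gini impurity, which is one common splitting criterion and an assumption here. The helper names and the handful of toy pitches are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (1 = swing, 0 = take)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (feature, threshold, impurity_decrease) for the best single split."""
    n = len(rows)
    parent = gini(labels)
    best = (None, None, 0.0)
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - weighted > best[2]:
                best = (f, t, parent - weighted)
    return best

def fit_stump(rows, labels, features):
    """Fit a depth-one tree: one split, majority label on each side."""
    feature, threshold, _ = best_split(rows, labels, features)
    if feature is None:
        raise ValueError("no split reduces impurity")
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    majority = lambda ys: Counter(ys).most_common(1)[0][0]
    return feature, threshold, majority(left), majority(right)

def predict(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

# Toy training pitches (made up): horizontal plate location in feet, strike count.
train_rows = [
    {"plate_x": 0.0, "strikes": 0}, {"plate_x": 0.2, "strikes": 1},
    {"plate_x": -0.3, "strikes": 2}, {"plate_x": 1.5, "strikes": 0},
    {"plate_x": 1.8, "strikes": 1}, {"plate_x": 1.2, "strikes": 2},
]
train_labels = [1, 1, 1, 0, 0, 0]

# Toy validation pitches, kept out of training.
valid_rows = [
    {"plate_x": 0.1, "strikes": 0}, {"plate_x": 1.6, "strikes": 2},
    {"plate_x": 0.9, "strikes": 2},
]
valid_labels = [1, 0, 1]

stump = fit_stump(train_rows, train_labels, ["plate_x", "strikes"])
correct = sum(predict(stump, r) == y for r, y in zip(valid_rows, valid_labels))
accuracy = correct / len(valid_rows)
print(stump, f"validation accuracy = {accuracy:.2f}")
```

A full decision tree repeats the same search inside each side of the split until some stopping rule is reached, which is how it can capture effects such as the swing zone widening with two strikes.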

So what does the model use to make its swing/non-swing determination? Variable importance is a metric used to calculate how useful a given variable is in a decision tree model. Basically, it’s a measure of how likely a variable is to be used in the tree, and how accurately it can split the dataset. In these models, the position of the ball at the plate is the most important variable, followed by strikes.
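The exact importance formula behind the chart is not stated, so the sketch below assumes one common definition: each split contributes its impurity decrease, weighted by the share of pitches that reach that split, and the totals are normalized to sum to one. The function name and the list of splits are hypothetical.

```python
from collections import defaultdict

def variable_importances(splits, n_total):
    """Aggregate weighted impurity decreases per variable and normalize.

    splits: list of (variable, n_samples_at_node, impurity_decrease).
    n_total: number of samples at the root of the tree.
    """
    raw = defaultdict(float)
    for variable, n_node, decrease in splits:
        # A split counts for more when more pitches pass through it.
        raw[variable] += (n_node / n_total) * decrease
    total = sum(raw.values())
    if total == 0:
        return dict(raw)
    return {variable: value / total for variable, value in raw.items()}

# Hypothetical splits from a small tree trained on 1,000 pitches.
splits = [
    ("plate_x", 1000, 0.20),
    ("plate_z", 600, 0.15),
    ("strikes", 400, 0.10),
    ("plate_x", 300, 0.05),
]

importances = variable_importances(splits, n_total=1000)
for variable, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{variable:8s} {score:.3f}")
```

Under this definition a variable scores highly either by being used often, by being used near the top of the tree where many pitches pass through, or by producing clean splits, which matches the informal description above.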


Figure 2: Chart of decision tree variable importances. plate_x/plate_z is the horizontal/vertical position of the ball at the plate, vz0 is the vertical velocity in feet per second, and ax is the horizontal (left/right) acceleration in feet per second squared.

This is convenient because there are easy ways of visualizing these three components. What does the global major-league swing zone look like against the average pitch?


Figure 3: Global swing decisions by plate_x/z and strikes with average pitch characteristics. Top Left: LHP vs. LHB, Top Right: LHP vs. RHB, Bottom Left: RHP vs. LHB, Bottom Right: RHP vs. RHB.
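A chart like this can be produced by holding every other variable at its average and sweeping only plate location for each strike count. The sketch below shows that sweep under stated assumptions: `toy_predict` is a made-up stand-in for a trained model, and the averages and grid spacing are invented, so the printed zones are illustrative rather than a reproduction of the figure.

```python
def swing_grid(predict, averages, x_values, z_values, strikes):
    """Predict swing (1) or take (0) over a grid of plate locations.

    Every variable other than plate_x, plate_z and strikes is held at
    the value given in `averages`.
    """
    grid = []
    for z in z_values:
        row = []
        for x in x_values:
            pitch = dict(averages, plate_x=x, plate_z=z, strikes=strikes)
            row.append(predict(pitch))
        grid.append(row)
    return grid

def toy_predict(pitch):
    """Hypothetical model: swing inside a box that widens with each strike."""
    margin = 0.15 * pitch["strikes"]
    in_x = abs(pitch["plate_x"]) <= 0.8 + margin
    in_z = 1.5 - margin <= pitch["plate_z"] <= 3.5 + margin
    return int(in_x and in_z)

averages = {"vz0": -6.0, "ax": -5.0}              # made-up average pitch
x_values = [-1.5 + 0.25 * i for i in range(13)]   # feet, left to right
z_values = [4.5 - 0.5 * i for i in range(9)]      # feet, top to bottom

grids = {s: swing_grid(toy_predict, averages, x_values, z_values, s)
         for s in (0, 1, 2)}

for strikes, grid in grids.items():
    print(f"strikes = {strikes}")
    for row in grid:
        print("".join("#" if swing else "." for swing in row))
```

Swapping `toy_predict` for a fitted model's prediction function, and repeating the sweep for each pitcher/batter handedness pairing, would give one panel per matchup.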

Overall, the most obvious trend is that batters expand the zone as the number of strikes increases. With no strikes, batters don’t like swinging at pitches on the corners. By these charts, the easiest way to get strike one against the average major-league batter is anywhere down in the zone, and especially on the corners. Against same-handed pitchers, batters prefer swinging at balls up in the zone. Similarly, against same-handed pitchers, they are more likely to swing at inside pitches and less likely to swing at outside pitches. As the swing zone expands with one and two strikes, there’s a slight tendency to swing at pitches up and in and down and away, especially in LHP vs. LHB matchups.

Figure 4: 24 Different individual batter profiles.

We can also train decision trees on individual batters, and those profiles show some interesting individual trends.

Some players prefer inside pitches, especially down and in.

Some players have a great eye and only swing at strikes, regardless of count.

Some players have a great eye and use an expanded strike zone regardless of count.

Some players will swing at anything (or they’re fooled a lot).

Some players like swinging at high balls.

Some players only like first strikes if they’re in a specific spot.

Some players are more likely to chase balls away on two strikes.

Some young players have great eyes too: Jeimer Candelario, Detroit’s young third baseman.

Looking at players with fewer pitches seen, it seems teams must tell pitchers either to swing at the first pitch or not to swing at all.

We can also examine what the model says about players when facing LHP and RHP. For example, Joey Votto’s profile shows that he has almost the same exact approach regardless of pitcher handedness.

On the other hand, Dee Gordon’s profile hammers home how uncomfortable he looks against LHP. The model predicts that he doesn’t swing at strike one or two, and that he will swing at anything thrown by a LHP with two strikes.

Overall, it looks like pitch location and strikes are by far the most important variables in a batter’s swing decision, which makes sense, because judging location is the primary skill batters have spent years training. Interestingly, almost all other game context doesn’t make much of a difference. Balls, men on base, being down or up a run: none appear to be too important when it comes to swing decisions.

If you have any questions and would like to contact the author for more information, please email him at rkdzeng@gmail.com.

Thank you for reading

This is a free article. If you enjoyed it, consider subscribing to Baseball Prospectus. Subscriptions support ongoing public baseball research and analysis in an increasingly proprietary environment.

morro089
10/10
Very good work (assuming it's all correct...because I sure can't validate). Lends credence to the thought "situational hitting" is a bit of a fallacy, at least in this era. [carriage return, new thought] This data looks at the "end result" of the pitch (pitch location at plate). Because batter decisions are made when the ball is about halfway to the plate ("decision distance"), I'd be curious about pitch location and spin rate at that point (and I'm sure that research has been done by people who have information). Did you find anything in your dataset regarding the amount of break, location and swing rate? Classic example would be the low and away slider that "starts" as a strike and is likely influencing your finding that "some batters swing at low and away". But considering you're just looking at the plate location and not "decision distance" location, these charts underlie the stupendous ability of MLBers to figure out where a pitch is going, even though they decide 30 feet before it gets to them.
rdzeng
10/10
Yes, that's definitely a concern. The problem is the dataset is missing important information about the spin axis and angle, so I had to use location of the pitch at the plate to accommodate that shortcoming. And yes, one of the big takeaways is the ridiculous eye of the average hitter, and the even more stupendous eye of some batters (having a great eye is not necessarily the characteristic of an amazing hitter).
rdzeng
10/10
Also, the dataset does contain spin rate and amount of break. But those variables alone are not predictive.