It has become fashionable to bemoan the absence of novel, raw baseball data on which the next generation of would-be analysts can hone their skills. In the case of Statcast, that certainly describes both the status quo and the foreseeable future, as far as public analysis is concerned.

However, Statcast isn’t the only potential source of fresh baseball data. This week, we’d like to think we have made at least a small contribution along these lines: by working with our newly released data on pitcher command, control, pitch tunnels, and pitch sequencing, both novice and seasoned analysts can unleash their creativity and hopefully teach the baseball community a thing or three.

That said, this data presents some unusual challenges that might be overlooked in the rush to “see what Excel can do” or apply the trendiest machine-learning technique. So, while I encourage readers to do whatever they want with our new data, I will also start you off with a few words of advice.

Inference Versus Prediction

First, effectively using tunnels data will almost certainly require you to appreciate the distinction statisticians make between “inference” and “prediction.” By “inference,” statisticians mean the process of identifying predictors that tend to be associated with certain outcomes. This usually involves isolating certain coefficients in a regression or classification problem and exploring whether they are consistently meaningful. Examples of inference would be comparing a new drug to a placebo in preventing disease or, in the baseball context, looking at the effect of ballparks on run-scoring. In both cases the outcome is important, but it is not the focus of the investigation; the focus is how much the drug or the ballpark contributes to it.

“Prediction,” on the other hand, is not particularly concerned with the precise contribution each input makes to an outcome. Rather, prediction seeks to forecast the outcome as correctly as possible as often as possible. Many baseball models tend to focus on prediction, deriving an “expected” rate of some event or another, such as a batter’s home run rate or a pitcher’s strikeout rate. Prediction is right in the wheelhouse of your most advanced machine-learning algorithms, which tend to build the shiniest, blackest box imaginable in exchange for terrific results. You often don’t really know how the algorithm got there; all you know is that it did a great job—whatever the hell it did.[1]

With tunnels, though, we usually already know the outcome we care about: a swinging strike, a home run allowed, the rate of weak grounders. What you typically want to know is what factor or combination of factors makes a particular pitcher or class of pitchers more or less effective with respect to that outcome. In other words, it’s not helpful to tell a young pitcher to “just be Justin Verlander.” Rather, you want him to figure out what makes Justin Verlander Justin Verlander, and to adopt the underlying skills that actually make Verlander what he is. Doing so means performing statistical inference. This, in turn, will tend to steer you toward “old-school” regression methods that identify meaningful coefficients to answer such questions.

Performing effective inference often means being mindful of (but not beholden to) p-values, applying intelligent cross-validation and regularization, and using Bayesian or quasi-Bayesian models to capture uncertainty, at least when sample sizes permit it (fully Bayesian sampling can become challenging with more than 10,000 samples or so). If you haven’t done any of these things before, tunneling data should give you an excellent way to learn.
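As a minimal sketch of what checking whether a coefficient is “consistently meaningful” might look like, the Python below fits a one-predictor least-squares slope and bootstraps it to get a rough interval. Everything in it is an assumption made for illustration: the variable names (`tunnel_distance`, `whiff_rate`), the simulated relationship, and the use of a simple percentile bootstrap as a stand-in for the cross-validated or Bayesian approaches mentioned above. None of it is drawn from the actual tunnels data.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (one predictor plus an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slope_interval(xs, ys, n_boot=1000, seed=1):
    """Rough 95% percentile interval for the slope, from resampled rows."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    return slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]


# Synthetic data (an assumption): whiff rate falls as tunnel distance grows.
rng = random.Random(0)
tunnel_distance = [rng.uniform(0.0, 1.0) for _ in range(200)]
whiff_rate = [0.30 - 0.10 * d + rng.gauss(0.0, 0.02) for d in tunnel_distance]

slope = ols_slope(tunnel_distance, whiff_rate)
low, high = bootstrap_slope_interval(tunnel_distance, whiff_rate)
print(f"slope = {slope:.3f}, bootstrap interval = ({low:.3f}, {high:.3f})")
```

The point of the exercise is the interval, not the fitted values: if the interval for a coefficient sits comfortably away from zero across resamples, that predictor is a candidate for a real association worth explaining.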

The Role of Interactions

Second, please recognize that many tunneling measurements are not automatically informative by themselves. Minimizing the distance between pitches isn’t very helpful if those pitches don’t break apart at the plate. Maximizing the difference in flight time between pitches won’t do much if both pitches are meatballs. Rather, you will often find that effective tunneling works in combination with other aspects of the pitcher’s approach. This means that you need to be mindful of interactions between predictors, and to be creative in trying them out. Generally speaking, if two predictors appear to be statistically meaningful, chances are that their interaction(s) will be as well.

Although typical linear models can account for interactions, other methods like trees and multivariate adaptive regression splines (MARS) often do a better job, particularly when the effect is non-linear and is present at certain levels of a predictor but not others. So far, we have found that often to be the case.
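One hedged way to see an interaction that is “present at certain levels of a predictor but not others” is to fit the same slope separately within each level. The sketch below does that on synthetic data; the names `closeness` and `late_break`, the 0.5 cutoff, and the simulated effect are all hypothetical and chosen only to make the pattern visible.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (one predictor plus an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data (an assumption): tight tunnels only help when the pitches
# also separate late, i.e. the effect of closeness depends on late_break.
rng = random.Random(0)
rows = []
for _ in range(400):
    closeness = rng.uniform(0.0, 1.0)
    late_break = rng.uniform(0.0, 1.0)
    effect = 0.20 * closeness if late_break > 0.5 else 0.0
    whiff = 0.10 + effect + rng.gauss(0.0, 0.02)
    rows.append((closeness, late_break, whiff))

# Fit the closeness slope separately at each level of late_break.
low_rows = [(c, w) for c, b, w in rows if b <= 0.5]
high_rows = [(c, w) for c, b, w in rows if b > 0.5]
slope_low = ols_slope([c for c, _ in low_rows], [w for _, w in low_rows])
slope_high = ols_slope([c for c, _ in high_rows], [w for _, w in high_rows])

print(f"closeness slope, low break:  {slope_low:.3f}")
print(f"closeness slope, high break: {slope_high:.3f}")
```

When the two slopes differ sharply, a single main-effect coefficient for `closeness` would understate the effect for one group and overstate it for the other, which is the case where trees or splines tend to earn their keep.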

The Problem of Multiple Right Answers

The third point is the trickiest of all: there are almost certainly multiple, distinct combinations of our statistics that work for different types of pitchers. This is antithetical to your typical regression or classification model, which usually looks for some midpoint as “the” correct approach. But when there are multiple “right” answers, that midpoint can by definition generate only one solution, sometimes with curious predictor combinations.[2] Worse, that midpoint may be largely useless, delivering you pitchers who are OK at many things, but not particularly good at any of them, or at baseball pitching in general.
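To make the midpoint problem concrete, here is a small sketch on synthetic data in which two hypothetical pitcher types respond to the same predictor in opposite directions. The labels `type_a` and `type_b`, the predictor, and the effect sizes are assumptions made purely for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (one predictor plus an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)

# Synthetic data (an assumption): the predictor helps type A and hurts type B.
x_a = [rng.uniform(0.0, 1.0) for _ in range(150)]
y_a = [0.10 + 0.20 * x + rng.gauss(0.0, 0.02) for x in x_a]
x_b = [rng.uniform(0.0, 1.0) for _ in range(150)]
y_b = [0.30 - 0.20 * x + rng.gauss(0.0, 0.02) for x in x_b]

slope_a = ols_slope(x_a, y_a)                  # one "right answer"
slope_b = ols_slope(x_b, y_b)                  # a different "right answer"
slope_pooled = ols_slope(x_a + x_b, y_a + y_b)  # the midpoint model

print(f"type A slope: {slope_a:.3f}")
print(f"type B slope: {slope_b:.3f}")
print(f"pooled slope: {slope_pooled:.3f}")
```

In this setup the pooled fit lands near zero and would suggest the predictor does not matter, even though it matters a great deal for both groups. That is the sense in which a single midpoint can describe no actual pitcher.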

How do you deal with the “multiple right answers” problem? Well, that’s part of the challenge for readers to solve. But two places to start would be regression trees/CART and unsupervised learning. Regression trees specialize in both drawing the forks in the road and taking them. Their answers tend to be the easiest models for lay people to visualize, and they tend to run very quickly. Unsupervised learning (such as clustering and principal component analysis) tries to consolidate observations (or pitchers, if you prefer) into subsets that share particular combinations of predictors, which can then be modeled separately against the outcome.
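The “fork in the road” idea can be sketched with the smallest possible regression tree: a single split chosen to minimize squared error. This is a simplified illustration and not a full CART implementation (no recursion, no pruning); the predictor `separation`, the 0.6 threshold baked into the synthetic data, and the `min_leaf` parameter are all assumptions.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, min_leaf=10):
    """Find the single threshold on xs that minimizes total squared error.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            # Place the threshold halfway between the neighboring x values.
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, threshold,
                    sum(left) / len(left), sum(right) / len(right))
    return best[1], best[2], best[3]


# Synthetic data (an assumption): whiff rate jumps once separation passes 0.6.
rng = random.Random(0)
separation = [rng.uniform(0.0, 1.0) for _ in range(300)]
whiff_rate = [(0.08 if s < 0.6 else 0.15) + rng.gauss(0.0, 0.01)
              for s in separation]

threshold, left_mean, right_mean = best_split(separation, whiff_rate)
print(f"split at {threshold:.3f}: "
      f"below -> {left_mean:.3f}, above -> {right_mean:.3f}")
```

A real tree repeats this search within each branch and across every predictor, which is how it can end up describing several distinct groups of pitchers instead of one average. The recovered threshold is also directly readable, which is the interpretability advantage over boosted or ensemble models.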

Russell Carleton used factor analysis for a similar purpose in a terrific piece this week. Be careful with boosting and ensemble methods, which tend to obscure the predictors in exchange for better outcome prediction; those methods are fine for getting a sense of what may be more or less important, but they will compromise your ability to find the actual thresholds that matter.

Finally, don’t rely solely on models to discover useful leads. Baseball managers and coaches are full of insight as to how or why they feel a pitcher is being successful. This is particularly true in the minor leagues, where a manager’s primary job is to tell the difference between a future big leaguer and an org guy. Many of these insights have a way of working their way into game articles or feature stories. In short, if you suspect there is a particular pitching approach worth investigating, take the time to poke around a bit and see what true baseball experts say about that approach or type of pitcher. Baseball lifers believe all sorts of interesting things, and some of those beliefs have plenty to do with successful pitching.

Moving Forward

None of this is intended to squelch anyone’s creativity in tackling our tunneling or any other data. It’s entirely possible that the best approach lies in something very simple, or in something none of us has even considered. In fact, we would be absolutely delighted if that turned out to be the case.

In the meantime, we wish you the best of luck, and we can’t wait to see where the data takes you.

[1] Deserved Run Average (DRA) actually does a bit of both inference and prediction, relying at least technically on prediction to fit pitcher DRA, but then using inference to create the DRA Runs table, assessing the effects of individual variables in each pitcher’s performance.

[2] To give just one example, the swinging strike model we composed for yesterday’s article actually kicked out CSAA entirely from the list of useful variables. It’s hard to believe that pitcher command does not play a role in at least certain pitcher types getting the desired level of whiffs.