
Because it’s Hall of Fame week, there have been plenty of Edgar Martinez partisans out there making the case for the former Mariners third baseman/designated hitter. Martinez’s case is dulled somewhat by the fact that he spent so much of his career designated to hit, and, perhaps more damningly, by the fact that he reached neither 3,000 hits nor 500 home runs (nor 50 stolen bases) in his career.

I’ve again heard the line about how Martinez absolutely owned Mariano Rivera. Martinez faced Rivera 23 times in his career and went 11-for-19 against him (also, three walks and one HBP) with three doubles and two home runs, good for a .579/.652/1.053 line off The Sandman. That’s a great bit of trivia, but … well, this is the part where I tell you that based on this article by a talentless hack who once named himself after an auxiliary kitchen utensil, 23 plate appearances is nice but it doesn’t tell us much of anything.
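For anyone who wants to check the arithmetic on that slash line, here is a minimal Python sketch. The function name `slash_line` is just an illustrative choice, and the sketch assumes no triples and no sacrifice flies in the matchup, since none are mentioned above.

```python
def slash_line(ab, hits, doubles, triples, homers, walks, hbp, sac_flies=0):
    """Return (AVG, OBP, SLG) from basic counting stats."""
    singles = hits - doubles - triples - homers
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    avg = hits / ab
    # Plate appearances counted for OBP: at-bats, walks, HBP, and sac flies.
    obp = (hits + walks + hbp) / (ab + walks + hbp + sac_flies)
    slg = total_bases / ab
    return avg, obp, slg


# Martinez vs. Rivera: 11-for-19, three doubles, two homers, three walks, one HBP.
# triples=0 is an assumption; the text does not mention any.
avg, obp, slg = slash_line(ab=19, hits=11, doubles=3, triples=0,
                           homers=2, walks=3, hbp=1)
print(f"{avg:.3f}/{obp:.3f}/{slg:.3f}")  # prints 0.579/0.652/1.053
```

The 19 at-bats, three walks, and one HBP add up to the 23 plate appearances quoted above.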

We know that Martinez was a great hitter (and a Hall of Fame-level hitter), but those numbers could just be total randomness. Against another Hall of Famer, Nolan Ryan, he was 1-for-19 lifetime with 10 strikeouts. He didn’t have any success off another noted ace closer from the era, going 1-for-14 with two walks against … oh, that’s his line against Billy Taylor.

We can’t really use that matchup information to predict much of anything. That’s the official sabermetric party line and we’re not allowed to question that.

Heh.

Warning! Gory Mathematical Details Ahead!

There’s a mismatch methodologically between the way that we normally think about determining reliability and then how we apply that to batter/pitcher matchups. For example, when we look at a batter and we want to see the reliability of some stat, we take all of his plate appearances from some starting point, line them up, and use some sort of reliability technique to get a number. But the unspoken reality is that those plate appearances come against a whole bunch of different pitchers.

Usually that’s a feature, rather than a bug for what we’re trying to accomplish. Usually the question is “at what point in a season do we have a reliable sample?” and during the course of a season a batter faces many pitchers. But at what point does a batter’s results against a specific pitcher become reliable? We dismiss managers who use “Smith is 7-for-9 against Jones lifetime, so that’s why I had him pinch-hit” as a justification for their move, because nine plate appearances isn’t a big enough sample to tell us anything.

But implicitly, that manager is saying: “You’re doing reliability on Smith against everyone, and Smith might be an awful hitter against everyone in general, but it’s not everyone out there on the mound. It’s Jones out there on the mound, and there seems to be something that he has on Jones. We should at least consider that the 7-for-9 represents different information than his results against the rest of the league. Maybe it’s even information that’s valuable.”

Hmmm.

I used data from 1993-2016 and looked at two statistics–walks per PA and strikeouts per PA–that are known to become reasonably reliable for both batters and pitchers in a relatively short period of time. I looked for pairs of hitters and pitchers who had faced each other at least 100 times during that time period. A sample of 100 PAs will give me enough to at least compare two 50 PA samples against each other (and so, determine reliability at 50 PA). There were 23 such pairs, which isn’t an amazing sample, but we will work with it.

(There’s a sampling problem here too. Who are the guys who would stick around long enough to get 100 PAs against a single pitcher? Guys who are good. Consistently good. That’s going to inflate our estimates of internal consistency a bit, but such is life.)

I used the KR-21 formula, consistent with my previous work, on that set of 23 batter/pitcher combos and figured out the reliability of K rate (and then BB rate) at 1 PA, 2 PAs, 3 PAs, and all the way up to 50 PAs. As a comparison, I also used data from 2012-2016 and did reliability for K rate in the traditional way for both batters (by themselves, against the league) and then pitchers (same).
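For readers who want to see the mechanics, here is a rough sketch of the standard Kuder-Richardson Formula 21 in Python. It is not the code behind the numbers in this article. The function name `kr21`, the use of population variance, and the strikeout counts in the example are all assumptions made for illustration. Note that the textbook formula divides by k - 1, so this version is undefined for a single PA and the sketch only accepts two or more.

```python
from statistics import mean, pvariance


def kr21(totals, k):
    """KR-21 reliability for k binary items (here, k plate appearances).

    totals: one count per unit, e.g. strikeouts in each pair's first k PAs.
    """
    if k < 2:
        raise ValueError("KR-21 needs at least two items")
    m = mean(totals)
    var = pvariance(totals)  # assumption: population variance of the totals
    if var == 0:
        raise ValueError("totals have no variance; reliability is undefined")
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))


# Hypothetical strikeout totals over 50 PAs for six batter/pitcher pairs.
made_up_totals = [5, 10, 15, 20, 8, 12]
reliability_at_50 = kr21(made_up_totals, k=50)
print(round(reliability_at_50, 3))
```

Repeating the call for k = 2, 3, and so on up to 50, each time counting strikeouts in only the first k PAs of every pair, would trace out a reliability curve of the kind described above.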

A graph of those findings appears below:

We see that all three lines track each other pretty well and that at 50 PAs all three rates are at least in that “close enough” range to being reliable. That’s interesting because we generally think of stats as being reliable over the course of a year or a half-year, as the batter (or pitcher) producing them during that time frame might change a little bit but not very much, so we can consider him “the same as he was three months ago” as long as we put scare quotes around it.

But the “batter/pitcher matchup” reliability line represents something different. To get to 100 PAs against a single pitcher, it can take eight or nine years under ideal circumstances (the hitter being in the same division as the pitcher and facing each other four times a year with three plate appearances in each of those games). It’s hard to say that the hitter and pitcher are “the same as they were nine years ago” by the end of that string.

What does stay consistent is that … well, it’s the same batter and pitcher, probably with at least some of the same mental approach and tendencies that have lasted over the years, even if the stuff deteriorates over time. So we at least know that there is coherent information in that

I did the same reliability analysis for walk rate (below), and found that after a little while matchup data is actually more reliable than “normal” pitcher or batter data calculated the old-fashioned way:

Let’s stop for a moment. The fact that a measure is reliable at a certain sampling frame means that it is a reliable descriptor of what the batter was in the past. Please note: in the past. It’s an assumption (not always a bad one) to believe that past performance will be consistent with future performance. Do matchup stats have any predictive value? We can test that.

We can even model the information that the manager would actually have on hand. Since we know that 50 PAs is “good enough” for the stats that we’re going to test, I found all situations in which a batter was facing a pitcher for at least the 50th time in his career (or at least since 1993). I calculated the combo’s strikeout rate per PA to date in their relationship. (That is, it updates with every step along the way.)
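The "rate to date" bookkeeping can be sketched as follows. This is an illustration only: the function name `matchup_rates_to_date`, the tuple layout, and the exact cutoff (50 prior PAs rather than, say, 49) are assumptions, not details taken from the analysis itself.

```python
from collections import defaultdict


def matchup_rates_to_date(plate_appearances, min_prior_pa=50):
    """Yield (batter, pitcher, prior_k_rate, struck_out) for each PA
    where the pair already has at least min_prior_pa meetings.

    plate_appearances: chronological iterable of
        (batter, pitcher, struck_out) tuples.
    """
    history = defaultdict(lambda: [0, 0])  # (batter, pitcher) -> [PAs, Ks]
    for batter, pitcher, struck_out in plate_appearances:
        pa, k = history[(batter, pitcher)]
        if pa >= min_prior_pa:
            yield batter, pitcher, k / pa, struck_out
        # Update only after yielding, so the current PA never leaks
        # into the rate that is supposed to predict it.
        history[(batter, pitcher)][0] += 1
        history[(batter, pitcher)][1] += int(struck_out)


# Tiny made-up example with a cutoff of 4 prior PAs instead of 50.
toy_pas = [("Smith", "Jones", i % 4 == 0) for i in range(6)]
rows = list(matchup_rates_to_date(toy_pas, min_prior_pa=4))
print(rows)
```

The point of the design is that each plate appearance is paired only with what was knowable before it happened, which is the information the manager would actually have on hand.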

I also found both the pitcher’s and the batter’s strikeout rate to date for that season, on the assumption that both had faced 50 batters/had 50 PAs at that point. At this stage, the manager has a reasonably reliable estimate of what both the pitcher and batter have been doing strikeout-wise this season and a good sample of what they’ve done against each other in the past. Which one is more predictive of whether the plate appearance will end in a strikeout?

I coded all plate appearances as either a strikeout or not, and converted the pitcher’s K rate to date for the current year, the batter’s K rate, and the historical matchup K rate between the two into logged-odds ratios. I put all three into a binary logistic regression that was set to stepwise mode.

For those who aren’t familiar with the technique, the first step is a conversion that takes plain old percentages and puts them into a numerical format that’s easier for logistic regression to work with (binary logistic regression models events that have two possible outcomes, mostly yes/no questions). The second step means that the regression will first look to see which of the variables is the strongest and then will look to see if any of the other variables can improve the regression by adding additional predictive power, and if so, which of those is the strongest.
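The logged-odds conversion itself is short enough to sketch in Python. The function names and the small clamping value `eps` (which keeps rates of exactly 0 or 1 from blowing up the logarithm) are assumptions for illustration; the article does not say how such edge cases were handled.

```python
import math


def log_odds(rate, eps=1e-6):
    """Convert a rate in [0, 1] to log-odds, clamping away from 0 and 1."""
    p = min(max(rate, eps), 1 - eps)
    return math.log(p / (1 - p))


def inverse_log_odds(x):
    """Map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


# A 20 percent strikeout rate has negative log-odds (less likely than not),
# and a 50 percent rate sits exactly at zero.
k_rate_logit = log_odds(0.20)
print(k_rate_logit, log_odds(0.50))
```

The three predictors described above (pitcher rate, batter rate, matchup rate) would each go through this conversion before being handed to the regression.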

So we have this year’s overall stats vs. the batter/pitcher matchup history. Which one will come shining through?

Well, this year’s stats were the winner. It happened again when I re-ran the same analyses on walks. I even looked at on-base events (looking at OBP to date for batter and pitcher, and the OBP of their previous meetings as the predictors) and that had the same pattern.

But …

The historical record of the two performers against one another still entered the regression significantly. That is, it helped to predict the outcome of the plate appearance over and above what just looking at this year’s stats would have told us. Looking a little deeper, we see that when we parse out some variance, the mixture (using -2 Log Likelihood for the initiated, which is the binary logistic regression analog to what R-squared does in good old OLS regression) was about two parts “this season” and one part “the historical record.”
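One plausible way to read that "two parts, one part" split is as shares of the total drop in -2 Log Likelihood as each block of predictors enters the model. The sketch below shows that bookkeeping in Python. The function names and the three deviance values are hypothetical, chosen only to show the arithmetic; they are not results from this analysis, and the exact variance-partitioning method used here may differ.

```python
import math


def neg2_log_likelihood(probs, outcomes):
    """-2 times the log likelihood of binary outcomes under predicted probs."""
    ll = 0.0
    for p, y in zip(probs, outcomes):
        ll += math.log(p) if y else math.log(1 - p)
    return -2 * ll


def deviance_shares(null_dev, step1_dev, step2_dev):
    """Split the total -2LL improvement between two stepwise steps."""
    total = null_dev - step2_dev
    return (null_dev - step1_dev) / total, (step1_dev - step2_dev) / total


# A coin-flip model on two PAs (one strikeout, one not): -2LL = 4 * ln(2).
coin_flip = neg2_log_likelihood([0.5, 0.5], [1, 0])

# Hypothetical -2LL values: intercept only, plus season stats, plus matchup.
season_share, matchup_share = deviance_shares(100.0, 94.0, 91.0)
print(round(season_share, 3), round(matchup_share, 3))  # prints 0.667 0.333
```

With these made-up numbers, the season stats account for two thirds of the improvement and the matchup history for the remaining third.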

There’s some signal in there!

The samples in those regressions were cases where the batter and pitcher were seeing each other for the 50th time (or more) so we know that we have a really good read on their history with each other. Maybe it’s not surprising that the historical record was significant in the regression. Then again, some of that information is five years old. Why does it still hold predictive sway?

I examined what would happen when I looked at plate appearances in which the pitcher and batter were seeing each other for between the 25th and 50th time. This means that they have some history together, but not as much as before. In those same logistic regressions, the matchup’s history continued to hold predictive power, though less than just looking at the current season totals and less so (in an R-squared sense) than it did when the batter and pitcher had been even longer-term rivals.

Let’s take it down to the 10th through 25th instances of a matchup. That historical variable just won’t shake out. Again, the R-squared is reduced, but the matchup numbers are still helping to predict the outcome. It seems reasonable to think then that historical matchup stats provide at least different information than current season stats provide, and that it is important information to consider.

Going Over the Hitter

There’s a phrase that I didn’t pay enough attention to over the years. In interviews, pitchers will sometimes talk about their preparation for a game and that they “go over a hitter.” It’s not a surprising phrase, but I don’t think I ever gave it its full weight. It speaks to the idea that pitchers view each hitter as a puzzle to be solved, individually. Sure, there are guys you try to blow away with just pure stuff, but there’s a chess match element to it. And so if you’ve got something on a guy (or if you just can’t figure him out) over the years, it probably leads to some similar results as long as that thing holds.

But I want to put a marker down on this article. To slightly paraphrase some scouting lingo, I want to put my graphing calculator on the table for this one. If there’s a sampling frame that we can get to where those historical matchups, even including data from several years ago, are picking up half the variance that this year’s stats are, then that means something. The standard response is that the matchup data, even in relatively large sizes, is useless. The evidence suggests otherwise.

There appears to be some stable, predictive power for the historical record to predict outcomes in the present, independent of what we would otherwise expect from recently collected data (which is what we’ve always been told to look at). It’s not the only factor to consider, but it seems silly to ignore it.

I don’t think we could ever use matchup data to directly inform strategic decisions. That is to say I don’t think we should give carte blanche to the idea that “Smith is 7-for-9 lifetime against Jones although he is also generally a .220 hitter, but we should believe the 7-for-9.” Maybe even if Smith was 20-for-50 lifetime against Jones, I don’t know that I’d accept that as direct evidence, although I’d start listening really hard.

Instead, let me frame it thusly: In all of these regressions, I am extracting one very rough piece of evidence from each plate appearance, which is whether it ended in a strikeout/walk/on-base event. That’s interesting information and it’s important information, but it’s not all of the information that’s available in those plate appearances. Were the outs loud or soft? Did the batter seem to be getting a good look at his release point?

If, with a rough-hewn, outcome-only measure, I can get to a very respectable slice of variance–relative to something that should be very powerful, which is recent performance in the current year against the league–and do it using just 50-100 PAs of history, what could I do if I went through and extracted all of the available information from 25 PAs or maybe even 10 PAs (and I also had the training and experience to fully dissect and appreciate that information)?

I might see that those seven hits were all little dinks where Smith got lucky to have a few balls fall in. And in that case, I wouldn’t pinch-hit him and you’d never know that I didn’t do it, because nobody notices the move that isn’t made. But there would probably be cases where I was convinced that the 7-for-9 was real, and at that point I might pinch-hit Smith into a key situation. I’m not gonna give away the tell that I found in the tape, but I’d mention the obvious fact that he’s had past success.

As sabermetricians we are fond of projecting what is going to happen based on evidence gained from the player as he faces the entire league and assuming that evidence is largely portable. That approach has the advantage of being able to draw on large samples, and it makes for convenient analysis. I have to wonder if we’d be better off looking into some of these batter/pitcher matchups on their own merit. Maybe hitters (and pitchers) are not always well-represented by their aggregate stats and are better understood as a series of encounters against specific pitchers (or batters) and how they negotiate those transactions and how well (and quickly) they learn those patterns.

That’s messy and we don’t really even have a public language to start on that. But what if it’s a better way to go about things? The evidence that we have available suggests that even if it’s not the majority of what we should be aiming for, it actually is a significant enough amount that we can’t reasonably ignore it.

So, the next time a manager makes one of those “but the small sample size data told me so” moves, maybe it was the wrong move or the wrong justification, but I think that’s a more open question than we had previously thought.

***

Because someone will ask, the aforementioned matchup pairs with 100 or more plate appearances:

Josh Beckett against Bob Abreu
Mark Buehrle against Michael Cuddyer and Torii Hunter
Bartolo Colon against Ichiro Suzuki
Roy Halladay against Johnny Damon, Derek Jeter, and David Ortiz
Felix Hernandez against Elvis Andrus
Livan Hernandez against Juan Pierre
Tim Hudson against Jimmy Rollins
John Lackey against Ichiro Suzuki and Michael Young
Greg Maddux against Craig Biggio
Jamie Moyer against Garret Anderson
Mike Mussina against Manny Ramirez
Kenny Rogers against Garret Anderson
Jason Schmidt against Luis Gonzalez
James Shields against Derek Jeter
Justin Verlander against Alex Gordon
Tim Wakefield against Jason Giambi, Derek Jeter, and Alex Rodriguez