Because it’s Hall of Fame week, there have been plenty of Edgar Martinez partisans out there making the case for the former Mariners third baseman/designated hitter. Martinez’s case is dulled somewhat by the fact that he spent so much of his career designated to hit and, perhaps more damning, by the fact that he reached neither 3,000 hits nor 500 home runs (nor 50 stolen bases) for his career.
I’ve again heard the line about how Martinez absolutely owned Mariano Rivera. Martinez faced Rivera 23 times in his career and went 11-for-19 against him (also, three walks and one HBP) with three doubles and two home runs, good for a .579/.652/1.053 line off The Sandman. That’s a great bit of trivia, but … well, this is the part where I tell you that based on this article by a talentless hack who once named himself after an auxiliary kitchen utensil, 23 plate appearances is nice but it doesn’t tell us much of anything.
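The slash line above can be checked by hand. Here is a minimal Python sketch, assuming no triples and no sacrifice flies in these plate appearances (the quoted counts already add up to 23, so there is no room for a sacrifice fly). The function name `slash_line` is just an illustrative label.

```python
def slash_line(at_bats, hits, doubles, triples, home_runs, walks, hbp, sac_flies=0):
    """Return (AVG, OBP, SLG) from raw counting stats."""
    singles = hits - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    avg = hits / at_bats
    obp = (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)
    slg = total_bases / at_bats
    return avg, obp, slg


# Martinez vs. Rivera, using the counts quoted above (triples assumed to be zero).
avg, obp, slg = slash_line(at_bats=19, hits=11, doubles=3, triples=0,
                           home_runs=2, walks=3, hbp=1)
print(f"{avg:.3f}/{obp:.3f}/{slg:.3f}")  # 0.579/0.652/1.053
```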
We know that Martinez was a great hitter (and a Hall of Fame-level hitter), but those numbers could just be total randomness. Against another Hall of Famer, Nolan Ryan, he was 1-for-19 lifetime with 10 strikeouts. He didn’t have any success off another noted ace closer from the era, going 1-for-14 with two walks against … oh, that’s his line against Billy Taylor.
We can’t really use that matchup information to predict much of anything. That’s the official sabermetric party line and we’re not allowed to question that.
Warning! Gory Mathematical Details Ahead!
There’s a mismatch methodologically between the way that we normally think about determining reliability and then how we apply that to batter/pitcher matchups. For example, when we look at a batter and we want to see the reliability of some stat, we take all of his plate appearances from some starting point, line them up, and use some sort of reliability technique to get a number. But the unspoken reality is that those plate appearances come against a whole bunch of different pitchers.
Usually that’s a feature, rather than a bug for what we’re trying to accomplish. Usually the question is “at what point in a season do we have a reliable sample?” and during the course of a season a batter faces many pitchers. But at what point does a batter’s results against a specific pitcher become reliable? We dismiss managers who use “Smith is 7-for-9 against Jones lifetime, so that’s why I had him pinch-hit” as a justification for their move, because nine plate appearances isn’t a big enough sample to tell us anything.
But implicitly, that manager is saying: “You’re doing reliability on Smith against everyone, and Smith might be an awful hitter against everyone in general, but it’s not everyone out there on the mound. It’s Jones out there on the mound, and there seems to be something that he has on Jones. We should at least consider that the 7-for-9 represents different information than his results against the rest of the league. Maybe it’s even information that’s valuable.”
I used data from 1993-2016 and looked at two statistics–walks per PA and strikeouts per PA–that are known to become reasonably reliable for both batters and pitchers in a relatively short period of time. I looked for pairs of hitters and pitchers who had faced each other at least 100 times during that time period. A sample of 100 PAs will give me enough to at least compare two 50 PA samples against each other (and so, determine reliability at 50 PA). There were 23 such pairs, which isn’t an amazing sample, but we will work with it.
(There’s a sampling problem here too. Who are the guys who would stick around long enough to get 100 PAs against a single pitcher? Guys who are good. Consistently good. That’s going to inflate our estimates of internal consistency a bit, but such is life.)
I used the KR-21 formula, consistent with my previous work, on that set of 23 batter/pitcher combos and figured out the reliability of K rate (and then BB rate) at 1 PA, 2 PAs, 3 PAs, and all the way up to 50 PAs. As a comparison, I also used data from 2012-2016 and did reliability for K rate in the traditional way for both batters (by themselves, against the league) and then pitchers (same).
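For readers who want to see the mechanics, here is a rough Python sketch of the textbook KR-21 formula applied to the first k plate appearances of each pair. Several things are assumptions rather than the article's actual code: the function names, the use of population variance, and the toy 0/1 sequences (1 = strikeout). The textbook form is undefined for a single item, so this sketch starts its curve at 2 PAs.

```python
from statistics import mean, pvariance


def kr21(total_scores, n_items):
    """Kuder-Richardson Formula 21 for a test made of n_items yes/no items.

    total_scores holds one total per subject (here: strikeouts by one
    batter/pitcher pair in its first n_items plate appearances).
    """
    if n_items < 2:
        raise ValueError("this form of KR-21 needs at least two items")
    m = mean(total_scores)
    var = pvariance(total_scores)  # population variance: an assumption
    if var == 0:
        raise ValueError("total scores have zero variance")
    k = n_items
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))


def reliability_curve(pa_sequences, max_pa):
    """KR-21 at 2, 3, ..., max_pa PAs, using each pair's first k outcomes."""
    curve = {}
    for k in range(2, max_pa + 1):
        totals = [sum(seq[:k]) for seq in pa_sequences]
        curve[k] = kr21(totals, k)
    return curve


# Toy data only: three made-up pairs with 10 PAs each.
toy_pairs = [[0] * 10, [1, 0] * 5, [1] * 10]
curve = reliability_curve(toy_pairs, max_pa=10)
```

With real data, `pa_sequences` would hold one chronological 0/1 list per qualifying batter/pitcher pair, and the curve would be read off at each sample size up to 50.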
A graph of those findings appears below:
We see that all three lines track each other pretty well and that at 50 PAs all three rates are at least in that “close enough” range to being reliable. That’s interesting because we generally think of stats as being reliable over the course of a year or a half-year, as the batter (or pitcher) producing them during that time frame might change a little bit but not very much, so we can consider him “the same as he was three months ago” as long as we put scare quotes around it.
But the “batter/pitcher matchup” reliability line represents something different. To get to 100 PAs against a single pitcher, it can take eight or nine years under ideal circumstances (the hitter being in the same division as the pitcher and facing each other four times a year with three plate appearances in each of those games). It’s hard to say that the hitter and pitcher are “the same as they were nine years ago” by the end of that string.
What does stay consistent is that … well, it’s the same batter and pitcher, probably with at least some of the same mental approach and tendencies that have lasted over the years, even if the stuff deteriorates over time. So we at least know that there is coherent information in that
I did the same reliability analysis for walk rate (below), and found that after a little while matchup data is actually more reliable than “normal” pitcher or batter data calculated the old-fashioned way:
Let’s stop for a moment. The fact that a measure is reliable at a certain sampling frame means that it is a reliable descriptor of what the batter was in the past. Please note: in the past. It’s an assumption (not always a bad one) to believe that past performance will be consistent with future performance. Do matchup stats have any predictive value? We can test that.
We can even model the information that the manager would actually have on hand. Since we know that 50 PAs is “good enough” for the stats that we’re going to test, I found all situations in which a batter was facing a pitcher for at least the 50th time in his career (or at least since 1993). I calculated the combo’s strikeout rate per PA to date in their relationship. (That is, it updates with every step along the way.)
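The "to date" bookkeeping described above might look something like the following sketch. It assumes a chronologically sorted log of `(batter, pitcher, is_strikeout)` tuples, which is a hypothetical input format, and it reads "to date" as excluding the plate appearance being predicted, so that the outcome never leaks into its own predictor.

```python
from collections import defaultdict


def matchup_rate_to_date(plate_appearances, min_meeting=50):
    """Yield (batter, pitcher, prior_rate, outcome) once a pair has enough history.

    plate_appearances: iterable of (batter, pitcher, is_strikeout) tuples in
    chronological order, with is_strikeout given as 0 or 1.
    min_meeting: a PA is emitted only if it is at least the pair's
    min_meeting-th meeting.
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [prior PAs, prior strikeouts]
    for batter, pitcher, is_strikeout in plate_appearances:
        history = counts[(batter, pitcher)]
        prior_pa, prior_k = history
        if prior_pa > 0 and prior_pa + 1 >= min_meeting:
            # The rate uses only PAs that happened before this one.
            yield batter, pitcher, prior_k / prior_pa, is_strikeout
        history[0] += 1
        history[1] += int(is_strikeout)


# Toy log with a low threshold so that something is emitted.
toy_log = [("A", "X", 1), ("A", "X", 0), ("B", "X", 1), ("A", "X", 1)]
rows = list(matchup_rate_to_date(toy_log, min_meeting=3))
```

The same running-count pattern would serve for the season-to-date batter and pitcher rates, keyed on player and season instead of on the pair.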
I also found both the pitcher’s and the batter’s strikeout rate to date for that season, on the assumption that both had faced 50 batters/had 50 PAs at that point. At this stage, the manager has a reasonably reliable estimate of what both the pitcher and batter have been doing strikeout-wise this season and a good sample of what they’ve done against each other in the past. Which one is more predictive of whether the plate appearance will end in a strikeout?
I coded all plate appearances as either a strikeout or not, and converted the pitcher’s K rate to date for the current year, the batter’s K rate, and the historical matchup K rate between the two into logged-odds ratios. I put all three into a binary logistic regression that was set to stepwise mode.
For those who aren’t familiar with the technique, the first step is a conversion that takes plain old percentages and puts them into a numerical format that’s easier for logistic regression to work with (binary logistic regression models events that have two possible outcomes, mostly yes/no questions). The second step, stepwise mode, means that the regression will first look to see which of the variables is the strongest predictor and then look to see if any of the other variables can add predictive power, and if so, which of those is the strongest.
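The conversion step is small enough to show directly. This is a sketch of the standard log-odds (logit) transform. The clipping with `eps` is my assumption for handling rates of exactly 0 or 1, where the log-odds would be undefined; the article does not say how such cases were treated.

```python
import math


def log_odds(rate, eps=1e-6):
    """Convert a rate in [0, 1] to natural-log odds, clipping away from 0 and 1."""
    rate = min(max(rate, eps), 1 - eps)
    return math.log(rate / (1 - rate))


def inverse_log_odds(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


k_rate = 0.20                # a made-up 20 percent strikeout rate
as_logit = log_odds(k_rate)  # roughly -1.386
```

A rate of 50 percent maps to zero, rates below that go negative, and rates above it go positive, which is the scale the regression's coefficients operate on.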
So we have this year’s overall stats vs. the batter/pitcher matchup history. Which one will come shining through?
Well, this year’s stats were the winner. It happened again when I re-ran the same analyses on walks. I even looked at on-base events (looking at OBP to date for batter and pitcher, and the OBP of their previous meetings as the predictors) and that had the same pattern.
The historical record of the two performers against one another still entered the regression significantly. That is, it helped to predict the outcome of the plate appearance over and above what just looking at this year’s stats would have told us. Looking a little deeper, we see that when we parse out some variance, the mixture (using -2 Log Likelihood for the initiated, which is an analog in binary logistic regression to what R-squared does in good old OLS regression) was about two parts “this season” and one part “the historical record.”
There’s some signal in there!
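One plausible way to read "two parts to one part" is as each step's share of the total drop in -2 Log Likelihood as predictors are added. The sketch below works under that reading, which is my assumption about the bookkeeping rather than a description of the actual analysis. The -2LL values in the example are invented purely to show the arithmetic.

```python
import math


def neg2_log_likelihood(predicted_probs, outcomes):
    """-2 times the log likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for p, y in zip(predicted_probs, outcomes):
        total += math.log(p) if y else math.log(1 - p)
    return -2 * total


def improvement_shares(neg2ll_by_step):
    """Share of the total -2LL drop contributed by each successive step.

    neg2ll_by_step: -2LL of the null model, then of the model after each
    added block of predictors, in order.
    """
    drops = [before - after
             for before, after in zip(neg2ll_by_step, neg2ll_by_step[1:])]
    total_drop = sum(drops)
    return [d / total_drop for d in drops]


# Invented values: null model, + this season's rates, + matchup history.
shares = improvement_shares([1000.0, 994.0, 991.0])  # about [0.667, 0.333]
```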
The samples in those regressions were cases where the batter and pitcher were seeing each other for the 50th time (or more) so we know that we have a really good read on their history with each other. Maybe it’s not surprising that the historical record was significant in the regression. Then again, some of that information is five years old. Why does it still hold predictive sway?
I examined what would happen when I looked at plate appearances in which the pitcher and batter were seeing each other for between the 25th and 50th time. This means that they have some history together, but not as much as before. In those same logistic regressions, the matchup’s history continued to hold predictive power, though less than just looking at the current season totals and less so (in an R-squared sense) than it did when the batter and pitcher had been even longer-term rivals.
Let’s take it down to the 10th through 25th instances of a matchup. That historical variable just won’t shake out. Again, the R-squared is reduced, but the matchup numbers are still helping to predict the outcome. It seems reasonable to think then that historical matchup stats provide at least different information than current season stats provide, and that it is important information to consider.
Going Over the Hitter
There’s a phrase that I didn’t pay enough attention to over the years. In interviews, pitchers will sometimes talk about their preparation for a game and that they “go over a hitter.” It’s not a surprising phrase, but I don’t think I ever gave it its full weight. It speaks to the idea that pitchers view each hitter as a puzzle to be solved, individually. Sure, there are guys you try to blow away with just pure stuff, but there’s a chess match element to it. And so if you’ve got something on a guy (or if you just can’t figure him out) over the years, it probably leads to some similar results as long as that thing holds.
But I want to put a marker down on this article. To slightly paraphrase some scouting lingo, I want to put my graphing calculator on the table for this one. If there’s a sampling frame that we can get to where those historical matchups, even including data from several years ago, are picking up half the variance that this year’s stats are, then that means something. The standard response is that the matchup data, even in relatively large sizes, is useless. The evidence suggests otherwise.
There appears to be some stable, predictive power for the historical record to predict outcomes in the present, independent of what we would otherwise expect from recently collected data (which is what we’ve always been told to look at). It’s not the only factor to consider, but it seems silly to ignore it.
I don’t think we could ever use matchup data to directly inform strategic decisions. That is to say I don’t think we should give carte blanche to the idea that “Smith is 7-for-9 lifetime against Jones although he is also generally a .220 hitter, but we should believe the 7-for-9.” Maybe even if Smith was 20-for-50 lifetime against Jones, I don’t know that I’d accept that as direct evidence, although I’d start listening really hard.
Instead, let me frame it thusly: In all of these regressions, I am extracting one very rough piece of evidence from each plate appearance, which is whether it ended in a strikeout/walk/on-base event. That’s interesting information and it’s important information, but it’s not all of the information that’s available in those plate appearances. Were the outs loud or soft? Did the batter seem to be getting a good look at his release point?
If, with a rough-hewn, outcome-only measure, I can get to a very respectable slice of variance–relative to something that should be very powerful, which is recent performance in the current year against the league–and do it using just 50-100 PAs of history, what could I do if I went through and extracted all of the available information from 25 PAs or maybe even 10 PAs (and I also had the training and experience to fully dissect and appreciate that information)?
I might see that those seven hits were all little dinks where Smith got lucky to have a few balls fall in. And in that case, I wouldn’t pinch-hit him and you’d never know that I didn’t do it, because nobody notices the move that isn’t made. But there would probably be cases where I was convinced that the 7-for-9 was real, and at that point I might pinch-hit Smith into a key situation. I’m not gonna give away the tell that I found in the tape, but I’d mention the obvious fact that he’s had past success.
As sabermetricians we are fond of projecting what is going to happen based on evidence gained from the player as he faces the entire league and assuming that evidence is largely portable. That approach has the advantage of drawing on large samples, and it makes for convenient analysis. I have to wonder if we’d be better off looking into some of these batter/pitcher matchups on their own merits. Maybe hitters (and pitchers) are not always well-represented by their aggregate stats and are better understood as a series of encounters against specific pitchers (or batters) and how they negotiate those transactions and how well (and quickly) they learn those patterns.
That’s messy and we don’t really even have a public language to start on that. But what if it’s a better way to go about things? The evidence that we have available suggests that even if it’s not the majority of what we should be aiming for, it actually is a significant enough amount that we can’t reasonably ignore it.
So, the next time a manager makes one of those “but the small sample size data told me so” moves, maybe it was the wrong move or the wrong justification, but I think that’s a more open question than we had previously thought.
Because someone will ask, the aforementioned matchup pairs with 100 or more plate appearances:
Josh Beckett against Bob Abreu
Mark Buehrle against Michael Cuddyer and Torii Hunter
Bartolo Colon against Ichiro Suzuki
Roy Halladay against Johnny Damon, Derek Jeter, and David Ortiz
Felix Hernandez against Elvis Andrus
Livan Hernandez against Juan Pierre
Tim Hudson against Jimmy Rollins
John Lackey against Ichiro Suzuki and Michael Young
Greg Maddux against Craig Biggio
Jamie Moyer against Garret Anderson
Mike Mussina against Manny Ramirez
Kenny Rogers against Garret Anderson
Jason Schmidt against Luis Gonzalez
James Shields against Derek Jeter
Justin Verlander against Alex Gordon
Tim Wakefield against Jason Giambi, Derek Jeter, and Alex Rodriguez
But when examining historical data, you are lumping in at-bats versus both right- and left-handed pitchers. So the signal you find could be attributed to same side vs. opposite side matchups and less so player to player.
When an observed ratio is reliable, A and B are relatively large, so you need lots of matchup data or very different matchup odds ratios to see an effect. Many of the matchups in your analysis may happen to not be much different from the in-season estimate, so matchups that are truly different may not be represented. It would also seem that matchup data would have greater explanatory power when the non-matchup performance is less reliable.
One, your first inquiry, looking at reliability of pitcher/batter matchups - what is that telling us? Of course they will be just as reliable as regular ole pitcher and batter stats. You're just capturing the talent in the batters and the pitchers (the log5 result) but not any "extra" information from "specific batter/pitcher" matchups.
As far as your second inquiry, again, of course batter/pitcher specific matchups will give you more information than just using log5 alone if you did not break it down by platoon handedness! How could you not do that?
Even if you did, it would still give you extra information. And that is because log5 is only an approximation of the result of a batter/pitcher matchup using limited variables.
The actual result of a batter/pitcher matchup, for example K rates, requires more variables, namely G/F ratios of batters and pitchers.
As well, log5 doesn't work that well at some of the extremes because the outcome actually reflects more on either the batter or pitcher whereas log5 always assumes equal contribution. Log5 also assumes independence between the batter and pitcher rates, which isn't true in most cases.
So basically the reason you find that batter/pitcher results add to your regression is because log5 simply doesn't use all the relevant information and doesn't treat what it does use properly. The batter/pitcher matchup results pick up some information that log5 does not.
The real question is whether batter A versus the universe of pitcher B's where pitcher B's were all the same in terms of handedness and G/F ratio (and maybe a few other things) is always the same regardless of the different results of each batter/pitcher matchup in that universe. I don't think you addressed that question at all.
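For readers who have not met log5, which the comment above leans on, here is a sketch of the usual odds-ratio form for a batter/pitcher matchup. The rates in the example are invented, and nothing here reflects how the article's regressions were built, since those used the batter and pitcher rates as separate predictors.

```python
def log5(batter_rate, pitcher_rate, league_rate):
    """Expected matchup rate from batter, pitcher, and league rates (odds-ratio form)."""
    numerator = batter_rate * pitcher_rate / league_rate
    denominator = numerator + (
        (1 - batter_rate) * (1 - pitcher_rate) / (1 - league_rate)
    )
    return numerator / denominator


# Invented strikeout rates: batter 25%, pitcher 25%, league 20%.
expected_k = log5(0.25, 0.25, 0.20)  # 4/13, roughly 0.308
```

Note the symmetry the commenter objects to: swapping the batter and pitcher rates gives the same answer, so the formula always treats the two contributions as equal.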
As to the second inquiry, what surprised me was the variance partitioning. "This year" still beat "matchup history" by a 2:1 ratio, but that's out of alignment with the usual guidance on this work. As you suggest, the answer might just be "Smith does well against GB pitchers and Jones is a GB pitcher" and maybe that's the answer for 80 percent of cases. That's fine and I'm happy to see more work in that area. (You're quite right I haven't gotten there yet!) Personally, I walked away from this one with "Gee, I really should be taking matchup data more seriously." If others walk away with that as well, I am a happy man.
Although to the first point, if you merely did a KR-21 or Cronbach's Alpha on large samples of batter data alone (against all pitchers) even if that data spans 7-8 years, surely you are going to get a pretty high reliability, no? Not only does talent not change THAT much over time, but if it changes systematically that won't even affect the correlation, no? In fact, I'm guessing that if you controlled for underlying sample sizes, you would get around the same reliability whether you used batters alone or batter/pitcher matchup data. The batter/pitcher matchup data is really just a proxy for batter v. any pitcher data, with smaller sample sizes (assuming the null hypothesis that batter/pitcher matchups are nothing more than a log5).
Surely you should have split the batter/pitcher matchups by handedness platoons though. Couldn't that alone account for most or all of the "1-part" batter/pitcher matchup in your regression?
Again my guess is that between handedness and G/F platoon, that will account for virtually all of your "1-part batter/pitcher matchup."
Technically you COULD call the G/F platoon stuff part of a "batter/pitcher" matchup even though I don't think most people look at it that way.
The ultimate question is how much would you regress a specific batter/pitcher result in X number of PA toward the log5 expectation? And my guess would be about what we have always thought which is close to 100% for any reasonably small sample and maybe 95% for any really large sample.
Which pretty much means what we thought it meant, which is, as with clutch, "You can ignore it other than as a tie breaker."
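To put a number on what "regress 95%" would mean in practice, here is a small sketch. It borrows the article's hypothetical Smith (7-for-9 against Jones, a .220 hitter overall) and uses .220 as a stand-in for the log5 expectation, which is a simplification on my part. The 95 percent figure is the commenter's guess, not an estimate.

```python
def regress_toward(observed_rate, expected_rate, regression_fraction):
    """Shrink an observed rate toward an expected rate.

    regression_fraction = 1.0 ignores the observation entirely;
    regression_fraction = 0.0 takes the observation at face value.
    """
    return expected_rate + (1 - regression_fraction) * (observed_rate - expected_rate)


# Smith's 7-for-9 regressed 95 percent toward .220.
estimate = regress_toward(7 / 9, 0.220, 0.95)  # roughly .248
```

Under that guess, a .778 matchup average moves the estimate by fewer than 30 points, which is the "tie breaker" scale the comment describes.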
No one ever said it means NOTHING. Or should say that. We don't say that about clutch, protection, chemistry, etc. We simply follow the evidence. So far I don't think we've found any significant evidence of "batter/pitcher effect" and I'm not sure that you have changed that to be honest, for the reasons we have discussed. You have merely opened up the inquiry again.