Since the original appearance of "Field General or Backstop?" in last
year’s Baseball Prospectus, we’ve received a great deal of praise for the
article, for which we’re grateful. We’ve also received
some thoughtful criticism that is worth responding to (there’s also been
some not so thoughtful criticism, but that’s what the Delete key is for).

The primary concern raised by attentive readers is that when comparing
catchers against their counterparts in consecutive years, the counterparts
can change. That is, while Scott Hatteberg caught over 100 games for Boston
in both 1997 and 1998, the backup catchers changed. Bill Haselman got
most of the rest of the playing time behind the plate in 1997, while Jim
Leyritz, Jason Varitek, and Mandy Romero split time backing up
Hatteberg in 1998.

The Z-score method introduced in "Field General or Backstop?" rates catcher
performance in terms of standard deviations beyond the collective
performance of the other catchers on the team. If the collective
performance baseline changes from year to year, it could produce variation
in the catcher’s Z-score even if his ability hasn’t changed. That is, if
Hatteberg is actually an average defensive catcher in both 1997 and 1998
(in an absolute sense), but Haselman is fantastic defensively, Hatteberg
will be negatively rated in 1997 (since he underperformed Haselman). If
Varitek and company are poor defensively, the same performance by Hatteberg
would be rated highly in 1998 (since he outperformed the others). Given
this genuine concern, the question then becomes whether this effect is
substantial enough to skew the results and alter the conclusions of the
original study.

As with many things in research, the design of the study involved
tradeoffs. In this case, the alternatives were to select the entire
universe of pitchers and catchers, maximizing sample size, or select just
those pitchers and catchers who qualified for selection (minimum of 100
PA/season) in two consecutive years, ensuring a consistent baseline for
comparison in both years. Restricting the data set to pairs of catchers who
worked with the same pitcher over the course of two years would eliminate
85+% of the potential data set that I considered. Reducing the sample size
increases the variability in the measurement. As I wanted to test as robust
and diverse a set of catchers as possible, I opted for the larger data set.

The effect of this choice is that the baseline for comparison varies
somewhat year to year, depending on the turnover rate (and mix of playing
time, even if there is no turnover) in backup catchers. However, if the
average variation across the data set is small, it will not overly
influence the results obtained.

A comparable analysis would be to compare the batting average of,
say, Tony Gwynn to that of a randomly selected Padres teammate each year. If he has
a higher average he "wins", if lower he "loses". Chances are that Gwynn
will outhit that teammate, regardless of who is selected, and despite
variations not only in Tony’s performance, but also the makeup of the team
roster from year to year as players arrive and depart. Similarly, in the
same comparison, Mario Mendoza would be expected to substantially under-hit
a typical teammate. You see more or less consistent results (win-win or
loss-loss) in consecutive years from players at the extremes. Players towards the
middle would be most subject to "flipping" from year to year as the
vagaries of who they are compared to and the expected small differences between
them take over.
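
The intuition can be checked with a small simulation. This is only a sketch under invented assumptions: the league mean, the spread of teammate ability, and the season-to-season noise are made-up numbers, not estimates from any real roster, and `same_result_rate` is a hypothetical helper name. It estimates how often a hitter gets the same result (win-win or loss-loss) against a randomly drawn teammate in two consecutive years.

```python
import random


def same_result_rate(true_avg, league_mean=0.265, spread=0.025,
                     noise=0.020, trials=20000, seed=1):
    """Share of simulated two-year spans ending win-win or loss-loss.

    All numeric defaults are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        outcomes = []
        for _year in range(2):
            # The player's observed average: true level plus seasonal noise.
            player = rng.gauss(true_avg, noise)
            # A random teammate: random true level, plus his own noise.
            teammate = rng.gauss(rng.gauss(league_mean, spread), noise)
            outcomes.append(player > teammate)
        same += outcomes[0] == outcomes[1]
    return same / trials


for label, avg in [("elite hitter", 0.340),
                   ("average hitter", 0.265),
                   ("weak hitter", 0.200)]:
    print(f"{label:15s} same result both years: {same_result_rate(avg):.2f}")
```

Under these assumptions the hitters at either extreme repeat their result most of the time, while the average hitter repeats only about as often as a coin flip would, which is the "flipping" described above.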

Part of the analysis in the article was to look at, collectively, all of
those players who were above average in one year (the "winners") and see if
they had any tendency towards winning (or, in the terms of the article,
posting a positive battery Z-score) in the second year. Even allowing that
there is some variance in the baseline of comparison (the composition of
the rest of the catching staff), there was essentially zero evidence for any kind of
trend for either good catchers to stay good (the Tony Gwynn case) or for
bad players to remain awful (the Mendoza case). As mentioned in the
article, I even looked at the extremes (those with a battery Z-score
greater than +1 or less than -1 — essentially those more than one standard
deviation from the mean), and found no evidence that good catchers would
start to stand out from the crowd. Ditto for Z > +2 or Z < -2. No matter how
you sliced it, nothing pointed to a consistent game-calling ability for
catchers, regardless of how exceptional they have appeared in the past.

Space and time constraints prevented a full treatment of this issue in
BP99. However, the beauty of the web is that we are not as bound by the
constraints of print. I went back to the data I collected, and looked only
at pairs of catchers who worked with a pitcher over two consecutive years
for at least 100 PA per catcher per year. Using a similar method to the one
in the article, I computed the PR/PA (Pitching Runs per Plate Appearance)
for the pitcher with each catcher, and took the difference between the two.
Then I compared it to the difference in PR/PA in the following year.

This may be better explained through an example:

Suppose that pitcher Able pitches to catchers Brown and Church in both 1990
and 1991.

In 1990, Able-Brown has a 0.100 PR/PA, while Able-Church has a 0.150 PR/PA,
for a difference of +0.050.

In 1991, Able-Brown has a 0.080 PR/PA, while Able-Church has a 0.065 PR/PA,
for a difference of -0.015.
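
The arithmetic in this example can be written out as a short sketch. It reads the 1991 Able-Church figure as 0.065, the value consistent with the stated difference of -0.015, and takes each difference as the second catcher's PR/PA minus the first's. The function name `pr_pa_difference` is a hypothetical label, not something from the original article.

```python
def pr_pa_difference(first_catcher: float, second_catcher: float) -> float:
    """Second catcher's PR/PA minus the first's, rounded to three places."""
    return round(second_catcher - first_catcher, 3)


# Pitcher Able with catchers Brown (first) and Church (second).
diff_1990 = pr_pa_difference(0.100, 0.150)
diff_1991 = pr_pa_difference(0.080, 0.065)

print(f"1990: {diff_1990:+.3f}")
print(f"1991: {diff_1991:+.3f}")
```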

So this creates a (much smaller) set of data points to compare, where two
catchers each work with the same pitcher over two years. We would end up
with a list of pitchers & catchers, each with two data points, as in the
following list:

                         1990     1991
Able-{Brown/Church}    +0.050   -0.015
Aaron-{Brown/Church}   -0.145   +0.004
Evans-{Felix/Gomez}    -0.200   -0.075
[...]

By taking the correlation between the two columns, we can determine whether
a catcher with relative success with a pitcher in one year is more likely
to have continued relative success the next. The correlation was virtually
zero (+0.01), and nearly identical to the correlations presented in the
original article. This, therefore, reinforces the original conclusion that
there is no statistical evidence for a substantial and persistent
game-calling ability that differentiates among major league catchers.
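
For readers who want to see the calculation, here is a minimal sketch of a Pearson correlation between the two columns. It assumes Pearson's r is the correlation meant, and it uses only the three illustrative rows shown in the list above, so its output is not the +0.01 obtained from the full data set. The helper name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative rows only: (year-1 difference, year-2 difference) in PR/PA.
rows = {
    "Able-{Brown/Church}": (0.050, -0.015),
    "Aaron-{Brown/Church}": (-0.145, 0.004),
    "Evans-{Felix/Gomez}": (-0.200, -0.075),
}

year1 = [first for first, _ in rows.values()]
year2 = [second for _, second in rows.values()]

r = pearson(year1, year2)
print(f"r = {r:+.2f}")
```

A value near +1 would mean that the catcher who did better with a pitcher in year 1 tended to do better again in year 2. A value near zero means year 1 tells you essentially nothing about year 2.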

The two new charts recreate two of the key comparisons from the original
article. The first chart plots the differences between catchers for a given
pitcher in year 1 against the same catchers with the same pitcher in year
2. If game-calling is a skill, good catchers in year 1 should tend to
continue to do well in year 2, resulting in a linear trend in the plot.
Recalling the original article, we saw exactly this linear pattern when we
looked at the ability for batters to hit home runs, or for pitchers to get
strikeouts. However, here (as in the original analysis), there is no
pattern. Instead we see a random scattering, indicating that there is no
relationship between how well a catcher works with a pitcher in one year and
how well he does the next.

The second chart follows up on another analysis done in the original article.
We split all the catcher combinations in year 1 into those above and those
below average, then looked at the range of performances the
following year for each subset. Again, recalling the results of the
original article, when you plot the cumulative probabilities for each
subset for commonly accepted abilities like HR rate and strikeout rate, the
two curves have a wide separation on the chart, indicating that there was a
significant difference in the distribution of performances between the two
subsets in the following year. As you can see in the new chart, even when
two catchers are compared only to each other, there is no significant trend
for the better catcher in one year to continue showing superior results.
The two curves are virtually on top of one another, with no separation
between the two distributions of performance.
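
The comparison behind the second chart can be sketched in code. Everything here is an assumption made for illustration: the data are synthetic, "above average" is taken to mean a positive year-1 difference, and the largest vertical gap between the two cumulative curves is used as a convenient one-number summary that the original analysis did not necessarily use. The names `ecdf`, `max_cdf_gap`, and `split_by_year1` are hypothetical.

```python
import random


def ecdf(sample, x):
    """Empirical cumulative probability: share of the sample at or below x."""
    return sum(v <= x for v in sample) / len(sample)


def max_cdf_gap(a, b):
    """Largest vertical distance between the two empirical CDFs."""
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


def split_by_year1(pairs):
    """Group year-2 values by whether the year-1 value was above zero."""
    above = [y2 for y1, y2 in pairs if y1 > 0]
    below = [y2 for y1, y2 in pairs if y1 <= 0]
    return above, below


rng = random.Random(0)
year1 = [rng.gauss(0, 0.05) for _ in range(500)]

# Synthetic "real skill" case: year 2 is year 1 plus a little noise.
skill = [(y1, y1 + rng.gauss(0, 0.03)) for y1 in year1]
# Synthetic "no skill" case: year 2 is unrelated to year 1.
no_skill = [(y1, rng.gauss(0, 0.05)) for y1 in year1]

gap_skill = max_cdf_gap(*split_by_year1(skill))
gap_no_skill = max_cdf_gap(*split_by_year1(no_skill))
print(f"gap with persistent skill: {gap_skill:.2f}")
print(f"gap with no skill:         {gap_no_skill:.2f}")
```

With a persistent skill the two curves separate widely, as they did for HR rate and strikeout rate. With no skill they lie nearly on top of one another, which is the pattern the catcher data show.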

Thus, even addressing a potential source of error in the original study,
the conclusion that there is no detectable game-calling ability still
stands. Whether you look across all pitcher-catcher combinations (as in the
original article), or whether you focus on two catchers who both work with
the same pitcher in consecutive years, there is no tendency for a catcher
to exert influence over the opposition’s offensive production. While the
position of catcher is still physically demanding, and may indeed be a
critically important defensive position, we’ll need to look elsewhere to
assess the full magnitude of his contribution.