Since the original appearance of "Field General or Backstop?" in last

year’s *Baseball Prospectus*, we’ve received a great deal of praise for

the article, for which we’re grateful. We’ve also received

some thoughtful criticism that is worth responding to (there’s also been

some not-so-thoughtful criticism, but that’s what the Delete key is for).

The primary concern raised by attentive readers is that when comparing

catchers against their counterparts in consecutive years, the counterparts

can change. That is, while **Scott Hatteberg** caught over 100 games for Boston

in both 1997 and 1998, the backup catchers changed. **Bill Haselman** got

most of the remaining playing time behind the plate in 1997, while **Jim Leyritz**,

**Jason Varitek**, and **Mandy Romero** split time backing up

Hatteberg in 1998.

The Z-score method introduced in "Field General or Backstop?" rates catcher

performance in terms of standard deviations beyond the collective

performance of the other catchers on the team. If the collective

performance baseline changes from year to year, it could produce variation

in the catcher’s Z-score even if his ability hasn’t changed. That is, if

Hatteberg is actually an average defensive catcher in both 1997 and 1998

(in an absolute sense), but Haselman is fantastic defensively, Hatteberg

will be negatively rated in 1997 (since he underperformed Haselman). If

Varitek and company are poor defensively, the same performance by Hatteberg

would be rated highly in 1998 (since he outperformed the others). Given

this genuine concern, the question then becomes whether this effect is

substantial enough to skew the results and alter the conclusions of the

original study.

As with many things in research, the design of the study involved

tradeoffs. In this case, the alternatives were to select the entire

universe of pitchers and catchers, maximizing sample size, or select just

those pitchers and catchers who qualified for selection (minimum of 100

PA/season) in two consecutive years, ensuring a consistent baseline for

comparison in both years. Restricting the data set to pairs of catchers who

worked with the same pitcher over the course of two years would eliminate

more than 85% of the potential data set that I considered. Reducing the sample size

increases the variability in the measurement. As I wanted to test as robust

and diverse a set of catchers as possible, I opted for the larger data set.

The effect of this choice is that the baseline for comparison varies

somewhat year to year, depending on the turnover rate (and mix of playing

time, even if there is no turnover) in backup catchers. However, if the

average variation across the data set is small, it will not overly

influence the results obtained.

A comparable analysis would be to compare the batting average of,

say, **Tony Gwynn** to a randomly selected Padres teammate each year. If he has

a higher average, he "wins"; if lower, he "loses". Chances are that Gwynn

will outhit that teammate, regardless of who is selected, and despite

variations not only in Tony’s performance, but also the makeup of the team

roster from year to year as players arrive and depart. Similarly, in the

same comparison, **Mario Mendoza** would be expected to substantially under-hit

a typical teammate. You see more or less consistent results (win-win or

loss-loss) in consecutive years from players at the extremes. Players towards the

middle would be most subject to "flipping" from year to year as the

vagaries of who they are compared to and the expected small differences between

them take over.
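
The flipping effect can be illustrated with a small simulation. This is only a sketch: the spread of teammate talent, the size of the one-season noise, and the three "true" averages (roughly a Gwynn-like hitter, a typical teammate, and a Mendoza-like hitter) are all assumed values chosen for illustration, and none of them comes from the study's data.

```python
import random

# All numbers below are illustrative assumptions, not data from the study.
TEAMMATE_MEAN = 0.260    # assumed true average of a typical teammate
TEAMMATE_SPREAD = 0.025  # assumed spread of true talent across the roster
SEASON_NOISE = 0.020     # assumed one-season sampling noise in an average


def beats_random_teammate(true_avg, rng):
    """Simulate one season: does the player outhit a randomly drawn teammate?"""
    teammate_true = rng.gauss(TEAMMATE_MEAN, TEAMMATE_SPREAD)
    player_observed = rng.gauss(true_avg, SEASON_NOISE)
    teammate_observed = rng.gauss(teammate_true, SEASON_NOISE)
    return player_observed > teammate_observed


def consistency_rate(true_avg, trials=20000, seed=1):
    """Share of simulated two-year spans that end win-win or loss-loss."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        year1 = beats_random_teammate(true_avg, rng)
        year2 = beats_random_teammate(true_avg, rng)  # a new teammate each year
        if year1 == year2:
            same += 1
    return same / trials


for label, avg in [("high extreme", 0.338), ("middle", 0.260), ("low extreme", 0.215)]:
    print(label, round(consistency_rate(avg), 2))
```

Under these assumptions the hitters at the extremes post the same result in both years far more often than the middle hitter, whose consistency rate sits near a coin flip.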

Part of the analysis in the article was to look at, collectively, all of

those players who were above average in one year (the "winners") and see if

they had any tendency towards winning (or, in the terms of the article,

posting a positive battery Z-score) in the second year. Even allowing that

there is some variance in the baseline of comparison (the composition of

the rest of the catching staff), there was essentially *zero* evidence for any kind of

trend for either good catchers to stay good (the Tony Gwynn case) or for

bad catchers to remain awful (the Mendoza case). As mentioned in the

article, I even looked at the extremes (those with a battery Z-score

greater than +1 or less than -1 — essentially those more than one standard

deviation from the mean), and found no evidence that good catchers would

start to stand out from the crowd. Ditto for Z > +2 or Z < -2. No matter how

you sliced it, nothing pointed to a consistent game-calling ability for

catchers, regardless of how exceptional they have appeared in the past.

Space and time constraints prevented a full treatment of this issue in

BP99. However, the beauty of the web is that we are not as bound by the

constraints of print. I went back to the data I collected, and looked only

at pairs of catchers who worked with a pitcher over two consecutive years

for at least 100 PA per catcher per year. Using a similar method to the one

in the article, I computed the PR/PA (Pitching Runs per Plate Appearance)

for the pitcher with each catcher, and took the difference between the two.

Then I compared it to the difference in PR/PA in the following year.

This may be better explained through an example:

Suppose that pitcher Able pitches to catchers Brown and Church in both 1990

and 1991.

In 1990, Able-Brown has a 0.100 PR/PA, while Able-Church has a 0.150 PR/PA,

for a difference of +0.050.

In 1991, Able-Brown has a 0.080 PR/PA, while Able-Church has a 0.065 PR/PA,

for a difference of -0.015.
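
The arithmetic in this example can be sketched in a few lines. The dictionary layout and the function name `catcher_difference` are illustrative choices, not the study's actual code, and Able-Church's 1991 value is taken as 0.065, the figure consistent with the stated difference of -0.015.

```python
# Hypothetical PR/PA values from the Able/Brown/Church example.
pr_per_pa = {
    1990: {"Brown": 0.100, "Church": 0.150},
    1991: {"Brown": 0.080, "Church": 0.065},
}


def catcher_difference(season, first, second):
    """Return the second catcher's PR/PA minus the first's, to 3 decimals."""
    return round(season[second] - season[first], 3)


# One difference per year, always computed in the same direction
# (Church minus Brown) so the two years are comparable.
differences = {
    year: catcher_difference(season, "Brown", "Church")
    for year, season in pr_per_pa.items()
}
print(differences)  # {1990: 0.05, 1991: -0.015}
```

Keeping the subtraction order fixed across years matters: swapping it in one year would flip the sign and manufacture a spurious reversal.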

So this creates a (much smaller) set of data points to compare, where two

catchers each work with the same pitcher over two years. We would end up

with a list of pitchers & catchers, each with two data points, as in the

following list:

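| Pitcher-{Catchers} | 1990 difference | 1991 difference |
| --- | --- | --- |
| Able-{Brown/Church} | +0.050 | -0.015 |
| Aaron-{Brown/Church} | -0.145 | +0.004 |
| Evans-{Felix/Gomez} | -0.200 | -0.075 |
| [...] | | |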

By taking the correlation between the two columns, we can determine whether

a catcher with relative success with a pitcher in one year is more likely

to have continued relative success the next. The correlation was virtually

zero (+0.01) and nearly identical to the correlations presented in the

original article. This, therefore, reinforces the original conclusion that

there is no statistical evidence for a substantial and persistent

game-calling ability that differentiates among major league catchers.
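
For readers who want to see the mechanics, here is a minimal sketch of the correlation step, assuming the ordinary Pearson coefficient is what is meant by "correlation". The function name `pearson` is mine, and the only inputs are the three illustrative rows listed above, so the value it prints says nothing about the real data set and does not reproduce the +0.01 figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Year-1 and year-2 PR/PA differences from the three illustrative rows.
year1 = [0.050, -0.145, -0.200]
year2 = [-0.015, 0.004, -0.075]

r = pearson(year1, year2)
print(round(r, 2))
```

With only three rows the coefficient is essentially noise; the point of the sketch is the procedure, which is applied to the full list of qualifying pitcher-catcher pairs.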

The two new charts recreate two of the key comparisons from the original

article. The first chart plots the differences between catchers for a given

pitcher in year 1 against the same catchers with the same pitcher in year

2. If game-calling is a skill, good catchers in year 1 should tend to

continue to do well in year 2, resulting in a linear trend in the plot.

Recalling the original article, we saw exactly this linear pattern when we

looked at the ability of batters to hit home runs, or of pitchers to get

strikeouts. However, here (as in the original analysis), there is no

pattern. Instead we see a random scattering, indicating that there is no

relationship between how well a catcher works with a pitcher in one year and

the next.

The second chart follows up another analysis done in the original article.

We split all the catcher combinations in year 1, categorizing them as

above or below average, then looked at the range of performances the

following year for each subset. Again, recalling the results of the

original article, when you plot the cumulative probabilities for each

subset for commonly accepted abilities like HR rate and strikeout rate, the

two curves have a wide separation on the chart, indicating that there was a

significant difference in the distribution of performances between the two

subsets in the following year. As you can see in the new chart, even when

two catchers are compared only to each other, there is no significant trend

for the better catcher in one year to continue showing superior results.

The two curves are virtually on top of one another, with no separation

between the two distributions of performance.
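
The split-and-compare procedure behind this chart can be sketched as follows. Several things here are assumptions: the split is made at the mean of the year-1 differences, the helper names are hypothetical, and the last three data rows are invented to pad out the three illustrative rows from the list above. The maximum vertical gap between the two cumulative curves (the two-sample Kolmogorov-Smirnov statistic) is my own choice of a one-number summary of "separation"; the article itself judges the curves by eye.

```python
def split_by_year1(pairs):
    """Split year-2 values by whether year 1 was above the year-1 mean."""
    mean_y1 = sum(y1 for y1, _ in pairs) / len(pairs)
    above = [y2 for y1, y2 in pairs if y1 > mean_y1]
    below = [y2 for y1, y2 in pairs if y1 <= mean_y1]
    return above, below


def cdf_at(sample, x):
    """Empirical cumulative probability: share of the sample at or below x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def max_cdf_gap(a, b):
    """Largest vertical distance between the two empirical CDFs."""
    return max(abs(cdf_at(a, x) - cdf_at(b, x)) for x in a + b)


# (year-1 difference, year-2 difference); hypothetical illustration only.
pairs = [
    (0.050, -0.015), (-0.145, 0.004), (-0.200, -0.075),
    (0.120, 0.030), (0.010, -0.040), (-0.060, 0.055),
]

above, below = split_by_year1(pairs)
print(round(max_cdf_gap(above, below), 3))
```

A gap near 0 means the two curves lie on top of one another, as in the catcher chart; a gap approaching 1 means the year-1 winners and losers produce almost non-overlapping year-2 results, as with a real, persistent skill.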

Thus, even after addressing a potential source of error in the original study,

the conclusion that there is no detectable game-calling ability still

stands. Whether you look across all pitcher-catcher combinations (as in the

original article), or whether you focus on two catchers who both work with

the same pitcher in consecutive years, there is no detectable tendency for a catcher

to exert a persistent influence over the opposition’s offensive production. While the

position of catcher is still physically demanding, and may indeed be a

critically important defensive position, we’ll need to look elsewhere to

assess the full magnitude of his contribution.