I’m not a particularly great baseball analyst. I’m constantly fascinated and
amazed by how little we really know about the game, and by the limitations
inherent in any analysis of it, whether through evaluation of performance
data or through observation. I tend to get caught up in the problems with
our tools, and in the poor resolution or granularity of the conclusions one
can reasonably draw. As a result, I tend to lose interest very quickly in the
sort of heavy lifting done by more capable analysts like Clay Davenport,
Keith Woolner, and Michael Wolverton.

A lot of what’s done at *Baseball Prospectus* and other outlets for
similar content is really iterative modeling. Basically, that’s using
multiple models that incorporate information from other models. This is done
all the time in almost every business.

For example, Procto and Gambill may know that they can expect one new
customer for Blammo detergent for every dollar they spend on advertising.
From each new customer, they can expect sales of half a box of detergent a
month for the next year, losing 3% of their customer base per month as they
go. From each customer, each month, they expect to bring in 10 cents in
profit. You can then crank the arithmetic to get an idea of how much money
you can expect to make, and you can identify changes in the marketplace.

We do the same thing in baseball analysis. We have models that link
together, and using data generated from the performance on the field, we can
put some parameters around the value of players’ performances. With that
information, we can make some calculations about the effects of those
performances on the team’s performance. Based on team performance, we can
model probable changes in a team’s revenue stream.

Using linear weights, a run value is assigned to everything that an
offensive player does so that his offensive contribution can be measured in
terms of runs. For example, a caught stealing might be worth -.6 runs, and a
stolen base might be worth +.3 runs. Linear weights are a model of
performance developed through a multiple regression analysis of team scoring
data. Basically, the model encapsulates the relationship between the number
of hits, walks, and steals teams have and how many runs they score.
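
As a minimal sketch of how such run values get applied (not of the regression that produces them), the snippet below uses only the two example weights quoted above. The stat line and the names `RUN_VALUES` and `linear_weights_runs` are hypothetical, and a full model would carry a weight for every offensive event.

```python
# Example run values quoted in the text; every other event (singles, walks,
# home runs, outs) would need its own weight in a complete model.
RUN_VALUES = {"SB": 0.3, "CS": -0.6}


def linear_weights_runs(counts, run_values=RUN_VALUES):
    """Sum each event count multiplied by its run value."""
    return sum(run_values[event] * n for event, n in counts.items())


# Hypothetical baserunning line: 40 steals, 20 times caught.
season = {"SB": 40, "CS": 20}
print(f"Baserunning runs: {linear_weights_runs(season):+.1f}")
```

With these two weights, a runner caught once for every two steals adds essentially nothing, which is the arithmetic behind the usual skepticism about low-percentage base stealers.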

So let’s say we have a left fielder (let’s call him Rickey), and we know,
through the use of linear weights, that he contributed about 110 runs last
year. We also know that the average left fielder in the league contributed
about 80 runs over the same season. We can use these values, derived from
the linear weights model, to assess the impact on the team’s performance.

Rickey’s team scored 800 runs last year and allowed 750. They went 86-76 for
the year. What would have been the probable impact had Rickey been replaced
with an average left fielder? Well, the offense probably would have dropped
by about 30 runs, to about 770. Using the Pythagorean model for winning
percentage, runs scored^2 / (runs scored^2 + runs allowed^2), we would
expect their winning percentage to drop to .513, for an overall record of
83-79. So we’d say that Rickey was about three wins better than an average
left fielder last year.
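
The Rickey arithmetic can be reproduced with a short Python sketch. The function names are illustrative, the exponent of 2 is the one given in the text, and the 162-game schedule is inferred from the 86-76 record.

```python
GAMES = 162  # inferred from the 86-76 record


def pythagorean_pct(runs_scored, runs_allowed, exponent=2):
    """Pythagorean winning percentage: RS^2 / (RS^2 + RA^2)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def expected_wins(runs_scored, runs_allowed, games=GAMES):
    """Expected wins over a season at the Pythagorean winning percentage."""
    return games * pythagorean_pct(runs_scored, runs_allowed)


# Rickey contributed about 110 runs; an average left fielder about 80.
with_rickey = expected_wins(800, 750)
with_average = expected_wins(800 - (110 - 80), 750)

print(f"With Rickey:  {with_rickey:.1f} wins")
print(f"With average: {with_average:.1f} wins")
print(f"Difference:   {with_rickey - with_average:.1f} wins")
```

The model puts the actual team at about 86 wins, matching its record, and the Rickey-less version at about 83, which is where the three-win figure comes from.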

We get pretty comfortable and facile with these numbers. We can take that
three-wins number and bounce it out to another model, one that demonstrates
that revenue is directly related to wins, at a marginal value of, say, $2
million per win, and come to the conclusion that Rickey’s compensation
should be about $6 million more than the average left fielder’s. We can also
fold in defensive numbers for increased precision. Clubs and agents do this
sort of thing all the time, but more thoroughly and with much more precision.

The problem is that we tend to be overreliant on these numbers. Most models
are developed from very large data sets, and the methods used all carry
certain assumptions and limitations. Baseball analysts are quick to point
out that stolen bases aren’t really all that valuable, particularly if
you’re not an effective base-stealer. But we need to understand the
limitations of the tools we’ve come to rely upon so heavily, and, by
extension, the limitations of our own understanding.

According to most measurements, a stolen base is worth somewhere in the
vicinity of a fifth of a run. That’s about a fiftieth of a win or so, using
the Pythagorean model and typical run-scoring numbers. But is it really?
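
For the curious, the run-to-win conversion can be sketched in Python. The 750-run baseline for an average team and the 162-game schedule are assumptions standing in for "typical run-scoring numbers", and the function names are illustrative.

```python
def pythagorean_pct(runs_scored, runs_allowed):
    """Pythagorean winning percentage with an exponent of 2."""
    return runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)


def wins_added(extra_runs, baseline_runs=750.0, games=162):
    """Wins gained by adding extra_runs to an otherwise average team.

    Assumption: the team scores and allows baseline_runs over the season.
    """
    base = pythagorean_pct(baseline_runs, baseline_runs)
    bumped = pythagorean_pct(baseline_runs + extra_runs, baseline_runs)
    return games * (bumped - base)


steal_in_wins = wins_added(0.2)  # a stolen base at about a fifth of a run
print(f"One stolen base: {steal_in_wins:.4f} wins")
print(f"Runs per win near the baseline: {1 / wins_added(1.0):.1f}")
```

With these inputs a run is worth a bit more than a tenth of a win, so a fifth of a run lands right around a fiftieth of a win. That is an average over all situations, which is exactly the limitation at issue here.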

Remembering back to about 1984…

Tie game, bottom of the ninth inning. **Dwayne Murphy** leads off and
reaches safely on an E4, takes second on a G5, steals third base, and scores
on a sacrifice fly. Was that stolen base worth 1/50th of a win? Hell, no, it
was worth well more than that.

Those types of nuances are something we cannot effectively capture using the
accounting system we have in baseball. (And that’s what it is: an accounting
system.) We often glibly make the assumption that for each tremendously
valuable steal, there are others that are of little to no value whatsoever,
so our models are still highly precise and accurate.

We need to keep in mind that we don’t know as much as we think. Players that
we might ride because they don’t walk, or because they steal at a 40%
success rate, or hit like **Vince Coleman** but play great defense, might be
much better than we give them credit for being.

I used to hate this argument when it was turned against me in
rec.sport.baseball many years ago, because its proponents tried to discredit
all of statistical analysis. Analysis is valuable, certainly. But there are
limits to its utility, and we need to be very careful not to place too much
emphasis on this kind of iterative modeling. Our models are built on enough
assumptions about baseball, data, and the world that we need to turn a
skeptical eye on them at every opportunity, just as we do with conventional
wisdom.

What we don’t know could fill a book. Perhaps even an annual one.

*Gary Huckabay is an author of Baseball Prospectus.*