We spend a lot of time analyzing baseball, studying it, trying to learn about it, and simply enjoying it. But what if I were to tell you that there was a secret to understanding baseball, a shortcut to knowing (almost) everything you would ever need to know?
Well, there is. And it’s hiding in plain sight – it’s the second line of the official rules of baseball: “The objective of each team is to win by scoring more runs than the opponent.”
Yes, it seems obvious… OK, it seems obvious because it is obvious. But if you free yourself from the obviousness of that statement and let yourself ruminate over it, it’s a pretty bold statement of some pretty powerful truths:
- Baseball is a team sport. What matters is not how well an individual plays in the abstract, but how well he is able to contribute to his team.
- The point is winning. When we measure something in baseball, we should always be asking ourselves, “How does this contribute to wins and losses?”
- You win by scoring more runs than the other team. Runs are the building blocks of wins. In order to understand how teams win games, we have to understand how they score – and prevent – runs.
The rub is in the “almost.” We know that the key to understanding baseball is knowing how teams score runs. That still leaves a lot of work for us to do, though.
So let’s break down – piece by piece – how run scoring works and how we can model the process. In this article, we’ll look at some of the fundamental principles of how teams score runs. Next, we’ll look at the history of run estimation. And to wrap it up, we’ll do a mathematical autopsy of sorts on some selected run estimators, to see how they work.
Start the clock
Baseball is somewhat unusual among modern team sports in that it does not feature, as such, a clock. The length of a game is – hypothetically, at least – infinite.
Instead of keeping time, we keep track of outs. You get 27 of them in a nine-inning game (ignoring, for a moment, that the home team does not bat in the bottom of the ninth when it is already leading – so in a win, the home team may use as few as 24 outs). Once you’ve made all of your outs, you’re done batting.
But so long as you have outs available, you can keep batting. And so long as you can keep batting, you can score more runs. That’s why outs are the lifeblood of an offense – conserving them is as important as clock management in other sports.
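To put a rough number on that, here is a minimal sketch. It assumes (a simplification, not anything stated above) that every plate appearance independently ends in an out with the same probability and that no outs are made on the bases. The function name and the sample on-base rates are made up for illustration.

```python
def expected_plate_appearances(out_rate: float, outs_available: int = 27) -> float:
    """Expected number of batters sent to the plate before the outs run out.

    Assumption: each plate appearance is an independent trial that ends in
    an out with probability `out_rate`, and no outs are made on the bases.
    Under that assumption the expected count is outs_available / out_rate.
    """
    if not 0 < out_rate <= 1:
        raise ValueError("out_rate must be in (0, 1]")
    return outs_available / out_rate


# Hypothetical team on-base rates; the out rate is simply 1 minus each one.
for on_base_rate in (0.300, 0.330, 0.360):
    pa = expected_plate_appearances(1 - on_base_rate)
    print(f"on-base rate {on_base_rate:.3f}: about {pa:.1f} plate appearances per 27 outs")
```

The fewer outs a team makes per trip to the plate, the more trips it gets, which is the sense in which avoiding outs keeps time on the clock.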
The other interesting thing to note about the lineup is that it’s cyclical – like a conveyor belt, it keeps looping over and over, until all the outs are used. That’s another unusual feature of baseball – everybody takes their turn batting in order. In basketball, your big scorer can take most of your team’s shots. In football, you can throw to your best wide receiver as much as you want to. (Yes, obviously there are reasons you may not want to, but there’s nothing keeping you from it.)
But in baseball everyone takes their turn in the batting order, regardless of situation. You can’t try to get your best hitter to take most of your at-bats. You can’t even try to get him most of your key at-bats. Once you’ve turned in the lineup, you can’t move hitters around.
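The conveyor belt is easy to sketch. The snippet below deals out a team’s plate appearances one lineup slot at a time; the function name and the figure of 38 plate appearances are hypothetical, chosen only for illustration.

```python
def plate_appearances_by_slot(total_pa: int, lineup_size: int = 9) -> list:
    """Split a team's plate appearances across lineup slots in batting order.

    The order simply cycles, so every slot gets the same number of full
    turns and the first few slots get one extra trip to the plate.
    """
    full_turns, extra = divmod(total_pa, lineup_size)
    return [full_turns + (1 if slot < extra else 0) for slot in range(lineup_size)]


# A hypothetical game in which the team sends 38 batters to the plate.
print(plate_appearances_by_slot(38))  # -> [5, 5, 4, 4, 4, 4, 4, 4, 4]
```

No slot can ever be more than one plate appearance ahead of any other, which is why a team cannot funnel its chances to one hitter.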
What this means is that, on offense, there is precious little “specialization” possible. Yes, certain lineup spots may entail more of a “table-setting” role, and some more of a “run-producing” role, but only as a matter of degrees. Every spot of the lineup can – and hopefully, does – contribute in all facets of run scoring.
The hitter's role
So what are the facets of run scoring? There are three jobs for the hitter at the plate:
- Avoid making outs. Again, outs are the clock in baseball. On offense, you want to keep as much time on the clock as possible, to give your team the most chances to score runs.
- Provide a baserunner. You want to get on base, to give other hitters a chance to drive you in. (The home run is a special case, where you in essence get on base to drive yourself in.)
- Advance the other runners. It’d be great to advance them all the way to home, but not necessary – any base advancement you can provide puts your team in a better position to score runs.
If you want to truly measure a hitter’s contribution to the offense, you need to measure all of these in the right proportion. All of them are important to team run scoring – and again, due to the cyclical nature of the lineup, no hitter can “specialize” in one to the exclusion of others without negative effects.
The counterpart to counting outs is counting bases – you want to accumulate as many bases as possible. After all, it takes four bases to score one run, right?
Now, ask yourself this – how many bases is a walk worth?
The answer is one, of course – if there is no runner on first. The batter advances one base on a walk. But if there’s a runner on first, he gets to advance a base as well, and the walk is worth two. If the bases are loaded, a walk is worth four total bases, as many as a solo home run. (And it drives in the same number of runs for the team as well.)
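The arithmetic can be written out as a short sketch. Only forced runners move on a walk, so the count depends on whether the bases behind each runner are occupied. The function name is made up for illustration.

```python
def bases_gained_on_walk(first: bool, second: bool, third: bool) -> int:
    """Total bases gained by the batter and any forced runners on a walk."""
    bases = 1  # the batter takes first base
    if first:
        bases += 1  # runner on first is forced to second
        if second:
            bases += 1  # runner on second is forced to third
            if third:
                bases += 1  # runner on third is forced home
    return bases


print(bases_gained_on_walk(False, False, False))  # bases empty   -> 1
print(bases_gained_on_walk(True, False, False))   # runner on 1st -> 2
print(bases_gained_on_walk(False, True, False))   # runner on 2nd -> 1 (not forced)
print(bases_gained_on_walk(True, True, True))     # bases loaded  -> 4
```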
Yet we’re used to talking about bases in terms of the batter alone – it’s how we count bases for the misleadingly named total bases and its rate counterpart, slugging average.
Of course it’s wrong, and of course we know it’s wrong – who would say that a walk and a single are equally valuable? Both move the batter one base, but a single can move the other runners farther, and that is precisely why the two aren’t equal. We understand this, but we often neglect to put this knowledge into practice.
You see, there are limits on how many bases a runner can advance. Only the hitter, for instance, can advance four bases. So for all other baserunners, a home run and a triple are indistinguishable. And for a runner on third, a home run and a single are equally valuable.
So by only paying attention to the batter’s rate of base advancement, we tend to overrate extra-base hits, relative to hits in general.
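A small sketch makes the cap visible. It assumes, purely for illustration, that every runner advances exactly as many bases as the batter, stopping at home; real runners often take an extra base on singles and doubles. The names below are hypothetical.

```python
HOME = 4  # home plate, counting first base as 1


def runner_gain(hit_bases: int, runner_base: int) -> int:
    """Bases gained by a runner on `runner_base` (1, 2, or 3), capped at home."""
    return min(hit_bases, HOME - runner_base)


def total_gain(hit_bases: int, occupied: tuple = ()) -> int:
    """Bases gained by the batter plus every runner already on base.

    Assumption: each runner moves up exactly `hit_bases` bases unless that
    would carry him past home plate.
    """
    return hit_bases + sum(runner_gain(hit_bases, base) for base in occupied)


HITS = {"single": 1, "double": 2, "triple": 3, "home run": 4}
SITUATIONS = (("bases empty", ()), ("runner on third", (3,)), ("bases loaded", (1, 2, 3)))
for label, occupied in SITUATIONS:
    gains = {name: total_gain(n, occupied) for name, n in HITS.items()}
    print(f"{label}: {gains}")
```

With the bases loaded, the home run gains 10 bases to the single’s 4 under these assumptions, a ratio of 2.5 to 1 rather than the 4 to 1 that batter-only total bases implies.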
And that’s the most important thing to remember about modeling run scoring – any assumption you don’t make explicitly will end up being made for you by something else – be it historical accident, peer pressure, least-squares fitting on collinear data sets, etc.
Or to put it another way – if you don’t explicitly think about it, you are going to end up with whatever implicit assumptions are in the underlying material. Many smart, talented people have been led astray by looking at data without first trying to grasp the fundamental principles at hand. It’s like trying to sail a boat without a chart – you wind up adrift at sea, with no landmarks to guide you.
Three types of models
There are, broadly speaking, three ways to model run scoring. Before we proceed with discussing them, let’s briefly ask, why might we want to model run scoring?
There are several reasons, probably too many to list. But I’ll go over the obvious ones. Building a model helps you examine the run-scoring process and forces you to think critically about it. It lets you apply understanding of the run-scoring process in a uniform way, to make comparisons between players and teams. And, frankly, some of us enjoy the process of making and perfecting models.
Now, on to the three basic models of run scoring (or run estimators, if you prefer):
- A linear model is a formula for estimating runs with no interaction between the basic terms – an additive model, if you will. It will sometimes feature multiplication by a constant (say, weighting a single at half of a walk), but it will never multiply two component inputs together. Examples: Batting Runs, Estimated Runs Produced, Equivalent Runs, OPS.
- A dynamic model is a formula that does feature interaction between the basic terms. Most dynamic models feature, at the very least, a “baserunners” term and an “advancement” term. Examples: Runs Created, Base Runs.
- More complicated models are not formulas at all, but algorithms: a series of instructions used to model run scoring. (Each of the instructions can be – and typically is – a formula in its own right.) Think of algorithms like computer programs (and in fact, computer programs are a type of algorithm). There are typically two kinds of algorithms used to model run scoring: a Monte Carlo simulation and a Markov chain.
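The difference between the first two kinds of model can be sketched in a few lines. The linear weights below are made-up round numbers, not taken from any published estimator, and the two team batting lines are hypothetical. The dynamic example uses the widely cited basic form of Runs Created, (H + BB) × TB / (AB + BB), which multiplies a baserunners term by an advancement term.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BattingLine:
    at_bats: int
    singles: int
    doubles: int
    triples: int
    home_runs: int
    walks: int

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.home_runs

    @property
    def total_bases(self) -> int:
        return self.singles + 2 * self.doubles + 3 * self.triples + 4 * self.home_runs


# Illustrative weights only (an assumption, not a published set of values).
ILLUSTRATIVE_WEIGHTS = {"singles": 0.5, "doubles": 0.8, "triples": 1.1,
                        "home_runs": 1.4, "walks": 0.3, "outs": -0.1}


def linear_runs(line: BattingLine, weights: dict = ILLUSTRATIVE_WEIGHTS) -> float:
    """Linear model: each event count times a constant, then summed."""
    outs = line.at_bats - line.hits
    return (weights["singles"] * line.singles + weights["doubles"] * line.doubles
            + weights["triples"] * line.triples + weights["home_runs"] * line.home_runs
            + weights["walks"] * line.walks + weights["outs"] * outs)


def basic_runs_created(line: BattingLine) -> float:
    """Dynamic model: a baserunners term multiplied by an advancement term."""
    baserunners = line.hits + line.walks
    advancement = line.total_bases
    opportunities = line.at_bats + line.walks
    return baserunners * advancement / opportunities


def value_of_one_more_walk(estimator, line: BattingLine) -> float:
    """Change in estimated runs when a single walk is added to the line."""
    return estimator(replace(line, walks=line.walks + 1)) - estimator(line)


# Two hypothetical team seasons, one low-scoring and one high-scoring.
low_offense = BattingLine(5500, 900, 250, 25, 120, 450)
high_offense = BattingLine(5600, 1050, 320, 35, 220, 650)
for name, line in (("low offense", low_offense), ("high offense", high_offense)):
    print(name,
          f"linear: {value_of_one_more_walk(linear_runs, line):.3f}",
          f"dynamic: {value_of_one_more_walk(basic_runs_created, line):.3f}")
```

With these inputs, the linear estimate values the extra walk identically in both contexts, while basic Runs Created values it more in the higher-scoring one. That context dependence is what the interaction between terms buys.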
Inherently, none of those approaches is superior to the others; each has advantages and disadvantages. The question one is trying to answer usually determines which of them is the “correct” approach at the time.
Speaking broadly, when dealing with questions about individual hitters, who have little control over the run environment as a whole, linear approaches work best. For entities that do have control over the run environment – like pitchers and whole teams – dynamic estimators work better. An algorithmic approach offers more fine-grained control over particular elements, but at the expense of much greater complexity.
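For a feel of what the algorithmic approach involves, here is a minimal Monte Carlo sketch. Everything in it is an assumption made for illustration: the event probabilities are invented, every batter is identical, runners advance exactly as many bases as the batter on a hit, only forced runners move on a walk, and outs never happen on the bases.

```python
import random

# Hypothetical per-plate-appearance probabilities (they sum to 1).
EVENT_PROBS = {"out": 0.68, "walk": 0.08, "single": 0.15,
               "double": 0.05, "triple": 0.005, "home_run": 0.035}
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def apply_event(event, bases):
    """Return (new_bases, runs) for a walk or hit.

    `bases` is [first, second, third] as booleans.
    """
    first, second, third = bases
    if event == "walk":
        runs = 1 if (first and second and third) else 0
        # Only forced runners move up.
        return [True, second or first, third or (first and second)], runs
    advance = HIT_BASES[event]
    runs = 0
    new_bases = [False, False, False]
    for index, occupied in enumerate(bases):
        if occupied:
            destination = index + 1 + advance  # base number after advancing
            if destination >= 4:
                runs += 1
            else:
                new_bases[destination - 1] = True
    if advance >= 4:
        runs += 1  # the batter scores on a home run
    else:
        new_bases[advance - 1] = True
    return new_bases, runs


def simulate_game(rng, innings=9):
    """Runs scored by one team over `innings` innings of three outs each."""
    events = list(EVENT_PROBS)
    weights = list(EVENT_PROBS.values())
    total_runs = 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            event = rng.choices(events, weights=weights)[0]
            if event == "out":
                outs += 1
            else:
                bases, runs = apply_event(event, bases)
                total_runs += runs
    return total_runs


def average_runs(n_games=5000, seed=0):
    """Average runs per game over `n_games` simulated games."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    return sum(simulate_game(rng) for _ in range(n_games)) / n_games


print(f"simulated runs per game: {average_runs():.2f}")
```

Even this toy version needs more moving parts than either formula, which is the complexity trade-off described above. In return, any single rule (say, how far runners advance on a single) can be changed on its own.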
Next time, we’ll look at the history of run estimation, and review a few of the key advances in our understanding of it. Until then, take care, and estimate your runs responsibly.