September 22, 2010
A Walk in the Park
One of the fascinating things for baseball fans is how much ballparks differ--the role that the park itself plays in baseball is probably unique in sports. Different ballparks bring a very different character to the proceedings, and of course, they can even change the course of events on the field.
It doesn’t help that MLB’s rules can be remarkably vague on the subject--at one point they state that “[a] distance of 320 feet or more along the foul lines, and 400 feet or more to center field is preferable,” which does little to indicate what might be allowed. This, of course, gives great latitude to ballpark designers, and they’ve taken advantage of it.
And of course, the size alone of a ballpark doesn’t determine how it affects scoring. Elevation, wind, etc. can play a role in the friendliness of a park to hitters, or a decided lack thereof.
For many years, we’ve been using park factors to try to adjust for these effects, so we can compare two players from different parks in an even light. What I want to do here is twofold--discuss the construction of park factors and discuss the philosophy of park adjustments. But for now, let’s start by looking at how park factors have traditionally been figured.
Traditional park factors
The traditional formula for park factors is essentially:
Runs Per Game at Home / Runs Per Game on the Road
I’m presenting the most simplistic version because it’s the easiest to discuss--there are a lot of complexities you can layer on top of that, but the simplistic version will serve as an adequate stand-in for them.
It helps, I think, to break down the road runs per game into components. Essentially, it’s the average of a team’s runs per game in each of the different road parks it plays in, weighted by the number of games it plays there.
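To make the arithmetic concrete, here is a minimal Python sketch of the simplistic version, with road runs per game built up as a games-weighted average across road parks. The function names and the sample numbers are hypothetical, and which runs you count (a team’s own, or both teams’) is left to the caller.

```python
def road_runs_per_game(road_splits):
    """Games-weighted average of runs per game across road parks.

    road_splits: list of (runs_per_game_in_park, games_in_park) tuples.
    """
    total_games = sum(games for _, games in road_splits)
    weighted_runs = sum(rpg * games for rpg, games in road_splits)
    return weighted_runs / total_games


def simple_park_factor(home_runs_total, home_games, road_splits):
    """Runs per game at home divided by runs per game on the road."""
    home_rpg = home_runs_total / home_games
    return home_rpg / road_runs_per_game(road_splits)


# Hypothetical team: 810 runs in 81 home games, and two road parks
# where it averaged 9.0 runs (40 games) and 8.0 runs (41 games).
example_factor = simple_park_factor(810, 81, [(9.0, 40), (8.0, 41)])
print(round(example_factor, 3))
```

A factor above 1 says more runs scored in the home park than in this team’s particular mix of road parks, which is not the same thing as an average park.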
The first thing that we should realize pretty quickly is that with an unbalanced schedule, not every team is going to have the same mix of road parks, or even a presumably random mix of road parks. The parks that make up the rest of your division are going to figure more prominently on your schedule. And of course, even if the schedule were totally random, you wouldn’t have a wholly representative set of road parks on your schedule--you don’t play road games in your own park, after all.
Of course, there are no guarantees that a park will play to its “true” tendencies in a run of six or eight games. One thing to consider is the distribution of pitching performance. How often are a team’s best pitchers pitching at home? From 1993-2009, here are the 10 teams whose Opening Day starters played most of their games on the road:
In contrast, the 10 teams whose Opening Day starters played most of their games at home:
This is just an example of one of the many ways in which a team’s distribution of run scoring between their home and away games isn’t simply a matter of parks or environment at all. (Sheer random chance can’t be discounted either–81 games at home and away aren’t really a large sample size, in the grand scheme of things.)
So the simplest way to take these noisy park factors and make them useful is to average them over a period of years, perhaps with some accounting for regression to the mean as well. This works, but you tend to lose some of your power to detect changes in parks over time (or changes in the leagues that the parks play in).
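As a rough illustration of that smoothing step, here is a small Python sketch that averages single-season factors and then pulls the result toward 1.0. The `regression_games` constant is a made-up placeholder, not a value from any published method; the only point is that fewer games of evidence means a harder pull toward average.

```python
def regressed_park_factor(yearly_factors, games_per_year=81, regression_games=200):
    """Average several seasons of park factors, then regress toward 1.0.

    regression_games is a hypothetical "ballast" constant: the larger it
    is relative to the games observed, the stronger the regression.
    """
    seasons = len(yearly_factors)
    raw_average = sum(yearly_factors) / seasons
    observed_games = seasons * games_per_year
    weight = observed_games / (observed_games + regression_games)
    return 1.0 + weight * (raw_average - 1.0)


# Three hypothetical seasons for one park.
print(round(regressed_park_factor([1.10, 1.20, 1.00]), 3))
```

The tradeoff mentioned above shows up directly: a longer window shrinks the noise, but a park that changed in the most recent season gets blended with years that no longer describe it.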
And if you look at the data we’ve had and the resources we’ve had to process that data, that method makes sense--there are some tradeoffs, but they’re understandable ones. But thanks to the tireless efforts of the folks at Retrosheet, we now have extensive records of every play going back to 1974 and the bulk of all plays going back to 1950. And computers and data storage just keep getting cheaper and cheaper every year.
A new method
So let’s go ahead and tackle this with a lot more resources than might have been used before. And let’s look at how different ballparks affect different baseball skills.
The most noticeable park effect is on the long ball--we pay a lot of attention to how parks affect home runs. But parks affect almost every aspect of baseball--everything from infield singles to doubles to strikeouts.
Now, what we have to be careful of is the interaction of different components. After all, if a park suppresses home run rates, that has to have an impact on other things--that usually means more balls in play, if nothing else.
Taking a cue from Voros McCracken’s work on DIPS, I broke batting lines down into a series of independent component rates. I use a larger number of components than Voros did--I didn’t have to restrict myself to using official pitching stats or official batting stats, for that matter. So infield singles are treated separately from other hits, for instance.
So for every player, in every park they played in, I put together a set of component batting lines--I call them batting lines, but I did them for pitchers as well (call that batting line against, if you prefer--pitchers as hitters were excluded). For a player’s home park, I adjusted the batting line based upon observed home field advantage over a five year period.
These were combined into what I called “road weight” batting lines for each park in which a player played. If the park under consideration was not a player’s home park, the road weight batting line was simply their batting line for that season. If it was their home park, their games played there were given 1/15th the weight and averaged with their road stats. All of these component batting lines were then regressed to the mean – “noisier” components were regressed more than more stable components.
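Here is a minimal Python sketch of one way to read that step. It assumes the 1/15 weight means a straight weighted average of home and road component rates, and it uses a simple “add some league-average opportunities” form of regression; the function names and the per-component `stabilizer` values are hypothetical.

```python
HOME_WEIGHT = 1 / 15  # weight on home-park games, per the description above


def road_weight_line(season_line, road_line, home_line, is_home_park,
                     home_weight=HOME_WEIGHT):
    """Build the "road weight" component line for one player in one park.

    Each line is a dict mapping component name -> rate per opportunity.
    """
    if not is_home_park:
        # Not the player's home park: use the full-season line as is.
        return dict(season_line)
    # Home park: blend a small share of home stats with road stats.
    return {
        component: home_weight * home_line[component]
        + (1 - home_weight) * road_line[component]
        for component in road_line
    }


def regress_line(line, league_line, opportunities, stabilizer):
    """Regress each component toward the league rate.

    stabilizer[component] is a hypothetical count of league-average
    opportunities to mix in; noisier components get larger values.
    """
    return {
        component: (line[component] * opportunities
                    + league_line[component] * stabilizer[component])
        / (opportunities + stabilizer[component])
        for component in line
    }


blended = road_weight_line(
    season_line={"hr": 0.037}, road_line={"hr": 0.030},
    home_line={"hr": 0.045}, is_home_park=True)
regressed = regress_line(blended, {"hr": 0.025}, 500, {"hr": 500})
print(blended, regressed)
```

Other readings of the 1/15 weight are possible (down-weighting home opportunities before pooling, for instance), and they would give slightly different lines.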
Then, for each batter-pitcher matchup in that park, I calculated an expected set of probabilities using the odds ratio method. (When the home team batted, the component home field advantage was added to the component batting lines of both hitter and pitcher.)
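For readers who haven’t run into it, the odds ratio method combines a hitter’s rate, a pitcher’s rate and the league rate on the odds scale. The Python sketch below shows the usual form for a single component; the `home_edge` parameter is my own naming for the home field adjustment described above, and the sample rates are invented.

```python
def odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)


def expected_rate(batter_rate, pitcher_rate, league_rate, home_edge=0.0):
    """Expected per-opportunity rate for one component of a matchup.

    home_edge is added to both the hitter's and the pitcher's rate
    when the home team is batting (0.0 otherwise).
    """
    batter_rate += home_edge
    pitcher_rate += home_edge
    matchup_odds = odds(batter_rate) * odds(pitcher_rate) / odds(league_rate)
    return matchup_odds / (1 + matchup_odds)


# An invented hitter with a 30% rate facing a league-average (25%) pitcher.
print(round(expected_rate(0.30, 0.25, 0.25), 3))
```

Against a league-average pitcher the hitter’s own rate comes straight back out, which is a handy sanity check. With many components handled one at a time, the resulting probabilities would also need to be rescaled so they sum to one; that step is left out here.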
This gives us a set of observed and expected component batting lines for each park. Subtracting one from the other gives us component park factors for each park. Actually, I weighted the observed line at 1/15th and combined it with the expected line, to account for the fact that the home park is not well reflected in the expected batting line – without this, you won’t get your park factors to average out properly.
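A small Python sketch of that subtraction, under one reading of the 1/15 weighting: the observed line is blended into the expected line to form a baseline, and the park factor for each component is the observed rate minus that baseline. The names and the sample rates are hypothetical.

```python
OBSERVED_WEIGHT = 1 / 15


def component_park_factors(observed, expected, observed_weight=OBSERVED_WEIGHT):
    """Additive park factor per component: observed minus blended baseline.

    observed and expected are dicts mapping component -> rate.
    """
    factors = {}
    for component in observed:
        baseline = (observed_weight * observed[component]
                    + (1 - observed_weight) * expected[component])
        factors[component] = observed[component] - baseline
    return factors


# Invented park: more home runs than expected, fewer triples.
print(component_park_factors(
    observed={"hr": 0.040, "triple": 0.004},
    expected={"hr": 0.025, "triple": 0.006}))
```

Under this reading the blend simply scales the raw observed-minus-expected gap by 14/15.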
Actually, they still don’t – see, the mix of road parks still isn’t totally random. So what we do is curse, and then recurse. Each player’s batting line in each park is adjusted using the factors derived from the process, and then the whole thing starts over again – the road weight lines, the regression, the odds ratio calculation. This continues until the park factors become relatively stable (around three iterations).
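The loop itself is easy to sketch. The Python below is a deliberately stripped-down stand-in: it works with one team-level run rate per park rather than player component lines, uses multiplicative factors, and skips the regression and odds ratio steps entirely. All names are hypothetical; the point is only the shape of “estimate, adjust, repeat until stable.”

```python
def iterate_park_factors(obs, games, home_park, tol=1e-9, max_iter=50):
    """Iteratively estimate multiplicative park factors.

    obs[team][park]   observed run rate for that team in that park
    games[team][park] games that team played in that park
    home_park[team]   name of that team's home park
    Returns (factors, iterations_used).
    """
    parks = sorted({park for team in obs for park in obs[team]})
    factors = {park: 1.0 for park in parks}
    for iteration in range(1, max_iter + 1):
        # Step 1: rate each team from its road games, adjusted by the
        # current factors for the parks it actually visited.
        quality = {}
        for team in obs:
            road = [p for p in obs[team] if p != home_park[team]]
            road_games = sum(games[team][p] for p in road)
            quality[team] = sum(
                games[team][p] * obs[team][p] / factors[p] for p in road
            ) / road_games
        # Step 2: re-estimate each park from observed rate over team quality.
        updated = {}
        for park in parks:
            visitors = [t for t in obs if park in obs[t]]
            park_games = sum(games[t][park] for t in visitors)
            updated[park] = sum(
                games[t][park] * obs[t][park] / quality[t] for t in visitors
            ) / park_games
        # Step 3: normalize so the factors average out to 1.
        mean = sum(updated.values()) / len(updated)
        updated = {park: value / mean for park, value in updated.items()}
        change = max(abs(updated[p] - factors[p]) for p in parks)
        factors = updated
        if change < tol:
            break
    return factors, iteration
```

On noiseless toy data with every team playing equally often in every park, this settles almost immediately. Real schedules and real noise are what make the extra passes necessary.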
Looking for value
What these park factors are good at is telling us, with a certain amount of uncertainty, how a player might have played in a different context. Which is what we’re interested in with park factors, isn’t it?
Well, not exactly.
What we do know is that some hitters and pitchers are especially well (or ill) suited to their home park. Juan Pierre is my favorite example – following three seasons in one of the greatest hitting parks in the modern era, Coors Field, he goes to Florida and puts up practically identical batting numbers. Why? Because Juan Pierre doesn’t hit enough deep fly balls to pick up even a handful of cheap home runs in Colorado.
Here’s the thing, though. Most of the hitters on the teams the Rockies have to play are hitting those extra home runs. Coors Field isn’t changing Juan Pierre, but it is changing the value of what he’s doing. It takes more runs to win games in Coors than it does in other parks, and Juan Pierre’s inability to take advantage of that is costing his team real wins. By the same token, a player who is able to adjust and take advantage of his home park is creating real value for his team, and we want to credit him for that. So in a value metric, what interests us isn’t necessarily what a player would do with a different home park but how a player compares to a league-average player in the same park.
So what we can do is take our component rates, rebuild a more standard batting line, feed that batting line through the process we use to compute True Average, come up with a run value of what that park is doing to the average player, and use that for our baseline of comparison.
And what we can also do, because we have the splits data, is park adjust a player’s road batting line based upon the particular parks in which he played, and not simply assume that his road parks were average. Because, again, the unbalanced schedule means that’s not necessarily so.
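In code, that road adjustment is just a weighted average of park factors over the places a player actually hit. A minimal Python sketch, with invented park names, plate appearance counts and factors:

```python
def road_park_adjustment(road_pa_by_park, park_factors):
    """Plate-appearance-weighted park factor for a player's road games.

    road_pa_by_park: dict of park -> plate appearances there (road only)
    park_factors:    dict of park -> multiplicative park factor
    """
    total_pa = sum(road_pa_by_park.values())
    return sum(
        pa * park_factors[park] for park, pa in road_pa_by_park.items()
    ) / total_pa


# Invented example: most road trips to a pitcher's park, a few to a hitter's park.
adjustment = road_park_adjustment(
    {"Park A": 30, "Park B": 70},
    {"Park A": 1.20, "Park B": 0.90})
print(round(adjustment, 2))
```

A result away from 1.0 is exactly the case the “road parks are average” assumption gets wrong.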
Parks of course have different effects based upon batter handedness (and by extension, pitcher handedness as well)–it’s a very simple set of changes to the method to derive platoon-based park factors as well. We’ll get into that more as we start to discuss some of the work that’s going into the PECOTAs this year.
We are very, very close to being done with the development of the new hitter WARP as well – we have offense, defense, positional adjustment and park. What remains to be discussed is replacement level and the conversion from runs to wins. That should be concluded soon.