Are you ready for some geography?
One of my favorite sabermetric pieces is Mike Jones’ study on market sizes. By making a couple of common-sense adjustments to standard assessments of market size based on city-level or metropolitan-level population, Mike was able to go a long way toward creating truer and more intuitive assessments of the relative market sizes of different baseball clubs. I liked his data so much that I used it as the basis for my attendance study in Baseball Between the Numbers.
This past weekend, however, I had a bright idea. Actually, it wasn’t such a bright idea, because I wound up spending the better part of three days on it. Forget about how we define metropolitan areas and how we dole out secondary markets–let’s try and account for the MLB affiliation of every person in the country!
Clearly, definitions of city boundaries are arbitrary. Both Northern California and Southern California are densely populated, but San Francisco occupies 47 square miles, while Los Angeles occupies 470 square miles. Definitions of metropolitan boundaries are arbitrary to a certain extent too. The areas may cross state boundaries, or run up against other cities. In some cases, a metropolitan statistical area (MSA) may be drawn too tightly from a baseball team’s perspective, excluding people in the exurbs who could reasonably attend a baseball game. In other cases, it might be drawn too broadly, including people who might have some lukewarm affiliation with the central city, but probably live too far away to attend major-league baseball games on a regular basis.
Breaking things down to the county or city level would resolve this problem. Indeed, it turns out that there is quite a wealth of population data available for free at the Census Bureau home page. My base unit of analysis was the 2006 population estimates on a county-by-county basis for the 48 contiguous states, plus Alaska, Hawaii, Puerto Rico and the District of Columbia. In some cases, where a county had a population of two million or more, I drilled down further to the city or census-tract level. In addition, I was able to track down population data at the metropolitan level for Canada. (Although this excludes rural Canada, Canada is highly urbanized, and if we wind up excluding a few farmers in Northern Saskatchewan, I think I can live with that).
Building this database provides several advantages, above and beyond the benefits of not having to rely on someone else’s definition of what constitutes a city or MSA. For one thing, we can very naturally account for a team’s potential secondary markets. For another, the area in between cities has a different character in different parts of the country. It’s often fairly dense along the Eastern Seaboard, and reasonably dense in the Midwest and the South, but generally completely barren in the Mountain West.
The other interesting piece of data available at the Census Bureau–something that required some digging to find–is their estimate for the latitude and longitude of the geographic center of each county. By using something called the Haversine formula, we are able to estimate how far each county is from each major league ballpark.
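For readers who have not run into it, the Haversine formula gives the great-circle distance between two points from their latitudes and longitudes. Here is a minimal Python sketch; the function name and the Earth-radius constant are assumptions chosen for illustration, not details from the study.

```python
import math

# Mean Earth radius in miles (an assumed constant, not taken from the study).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

Feeding it a county's geographic center and a ballpark's coordinates returns the straight-line mileage, which is the raw distance the rest of the model works from.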
So far, so good. Everything sounds very precise and very scientific. You can see that what I’m going to do is to build some sort of sliding scale for market size based on the distance between a given person and a given ballpark. But in order to get the model to really sing, we have to account for a couple of other wrinkles that straddle the boundary between the subjective and objective.
For one thing, we need some notion of a team’s sphere of influence. If you viewed things strictly in terms of geography, you’d find that the Cubs and the White Sox have nearly identical market sizes, when in fact the Cubs have quite a bit more influence, particularly once you get outside of Chicago proper and into suburbs and cornfields. A team’s sphere of influence can penetrate outward much farther if it has a strong brand.
There is no one perfect way to define the strength of a team’s brand, so what I did instead was to combine six or seven imperfect metrics in the hopes of coming up with a tasty sausage. In particular, the measures that I looked at were as follows:
- The ranking of each team in the ESPN Ultimate Standings in two categories that are closely associated with brand: perception of ownership and fan relations. Data was averaged over the past five years of the ESPN survey.
- Baseball avidity in each area, as measured by a 2002 Scarborough Research study.
- The number of "hits" for each team using Google Blogsearch. Alternate team names ("Oakland A’s," "Oakland Athletics") were accounted for.
- The number of regular-season wins for each team since 1901, provided continuous tenure in its current market. The Giants, for example, start counting upward from when they arrived in San Francisco, and do not get credit for what they did in New York.
- The amount of postseason success for each team, again provided continuous tenure in its market. One point was given for each playoff appearance, and a two-point bonus for each World Series championship.
- The value of the brand intangible for each club, as estimated by Forbes.
- The Forbes data again, this time transformed to a logarithmic scale.
As you can see, I’m trying to house a lot of different definitions of brand under one roof. A team’s "likability" plays a role, as reflected in the ESPN survey, but so too does its history of success, the amount of buzz that it generates in media circles, and so forth. The Forbes data is intentionally given double weight because it’s probably the most reliable data among our metrics in this exercise.
Each team was assigned a rating in each category, ranging from 50 for the lowest team to 100 for the highest team, with the rest of the teams linearly interpolated in between. The ratings across the seven categories were then averaged to produce the final result.
| Rank | Team | Absolute Score | Relative Score |
|---|---|---|---|
| 1 | New York Yankees | 94.8 | 1.44 |
| 2 | St. Louis Cardinals | 82.6 | 1.26 |
| 3 | Boston Red Sox | 80.1 | 1.22 |
| 4 | Chicago Cubs | 72.8 | 1.11 |
| 5 | New York Mets | 72.2 | 1.10 |
| 6 | Cleveland Indians | 71.7 | 1.09 |
| 7 | Atlanta Braves | 70.3 | 1.07 |
| 8 | Chicago White Sox | 68.2 | 1.04 |
| 9 | Houston Astros | 67.9 | 1.03 |
| 10 | Detroit Tigers | 67.1 | 1.02 |
| 11 | Los Angeles Angels | 67.0 | 1.02 |
| 12 | San Francisco Giants | 66.6 | 1.01 |
| 13 | Los Angeles Dodgers | 66.3 | 1.01 |
| 14 | Philadelphia Phillies | 66.2 | 1.01 |
| 15 | Seattle Mariners | 65.1 | 0.99 |
| 16 | San Diego Padres | 64.3 | 0.98 |
| 17 | Cincinnati Reds | 63.7 | 0.97 |
| 18 | Baltimore Orioles | 63.5 | 0.97 |
| 19 | Pittsburgh Pirates | 62.0 | 0.94 |
| 20 | Texas Rangers | 61.8 | 0.94 |
| 21 | Oakland A's | 61.5 | 0.94 |
| 22 | Arizona Diamondbacks | 61.4 | 0.93 |
| 23 | Minnesota Twins | 59.4 | 0.90 |
| 24 | Toronto Blue Jays | 59.2 | 0.90 |
| 25 | Washington Nationals | 58.7 | 0.89 |
| 26 | Milwaukee Brewers | 57.7 | 0.88 |
| 27 | Florida Marlins | 56.4 | 0.86 |
| 28 | Colorado Rockies | 56.3 | 0.86 |
| 29 | Kansas City Royals | 55.3 | 0.84 |
| 30 | Tampa Bay Devil Rays | 52.2 | 0.79 |
You’ll see two sets of ratings reflected in the chart. The first is the “raw” rating on a 50-100 scale, while the second is the score relative to the league average, which is the number that we’re going to use to tweak our market-size estimates. We could probably devote a column or two to the accuracy or lack thereof of these brand ratings–I think they seem pretty darn good–but we have a lot of other things to look at, so let’s move forward.
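To make the scoring mechanics concrete, here is a minimal Python sketch of the 50-100 scaling and the averaging step. The function names and the input layout (one dictionary of raw values per category) are assumptions for illustration. The sketch also assumes that a higher raw value is always better, so a ranking where 1 is best would need to be flipped first.

```python
from statistics import mean


def scale_50_100(raw):
    """Map raw category values linearly onto 50 (lowest team) to 100 (highest)."""
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    if span == 0:
        # Every team is tied in this category; put them all at the midpoint.
        return {team: 75.0 for team in raw}
    return {team: 50.0 + 50.0 * (value - lo) / span for team, value in raw.items()}


def brand_scores(categories):
    """Return (absolute, relative) scores from a list of {team: raw value} dicts."""
    scaled = [scale_50_100(raw) for raw in categories]
    teams = list(scaled[0])
    # Absolute score: plain average of the scaled category ratings.
    absolute = {team: mean(s[team] for s in scaled) for team in teams}
    # Relative score: absolute score divided by the league average.
    league_average = mean(absolute.values())
    relative = {team: absolute[team] / league_average for team in teams}
    return absolute, relative
```

Under this layout, the double weight on the Forbes data falls out naturally: the raw Forbes values and their logarithms are simply passed in as two separate categories.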
It became clear to me in thinking about market size that there are two ways to define a team’s market. On the one hand, you have a team’s market for attendance, which is going to involve a smaller geographic radius, since people need to be able to commute to the ballpark to attend a baseball game. On the other hand, you have a team’s media market, which is less subject to geographic constraints, but tends to be more of a winner-take-all affair. In general, the media market will be larger than the attendance market, but the attendance market is not necessarily a subset of the media market. A fan in northeast Pennsylvania is probably going to get the Phillies but not the Mets on TV, even though he could commute to New York about as easily as he could commute to Philadelphia.
We’ll concentrate on the attendance side of the coin first. My process for determining each team’s potential attendance market was as follows:
- The distance in miles between each county and each major league ballpark was determined using the Haversine formula. Before you ask, I was able to identify the exact geographic coordinates of each major league stadium.
- This raw distance was adjusted for out-of-state commuters. When I was running some gut-checks of the model, I found that many of the counter-intuitive results involved travel across state lines. The Indians were getting too much credit for southern Michigan, for example. Therefore, each team was assigned to its home state(s); the Royals were given both Kansas and Missouri, and the Nationals were given both Virginia and Maryland in addition to the District. The Blue Jays were assigned to the "state" of Canada. A 10 percent penalty was applied to an out-of-state commuter in a state without a home team; for example, a fan in South Carolina is assigned a 10 percent mileage penalty with respect to his distance to Turner Field. If the commuter comes from a state that does have a home team, a much harsher 50 percent mileage penalty is applied. For example, a fan in Western Massachusetts has a 50 percent penalty assigned to all teams but the Red Sox. I provided for a grace period of 10 miles before any penalties were applied, so that immediate border cities (such as Covington, Kentucky for the Reds) were not affected.
- The raw distance was further adjusted based on a team’s influence, by dividing the mileage by a team’s relative influence rating. What this does, effectively, is to expand a team’s geographic radius if it has a stronger brand. For example, the Red Sox (relative rating 1.22) get to draw attendance from a radius of 244 miles rather than the standard 200, while the Devil Rays (0.79) are confined to 158 miles.
A team’s 'Claim Percentage' for a given county is assigned based on the following formula (my apologies if this is starting to sound like Win Shares):
Claim Percentage = ((200 - Adjusted Distance) / 200) ^ 2.41
The "200" number you see in the formula corresponds to a maximum radius of 200 miles from which a team might draw attendance. The 2.41 exponent was chosen because it means that a fan 50 miles away from the ballpark is worth about half as much as a fan right next door to the ballpark. Both of these constants are arbitrary, since I am not aware of any empirical research that relates distance from the ballpark to the likelihood of attendance at a baseball game. However, I believe my choices produce results that are fairly intuitive, as reflected in this chart:
| Adjusted Distance | Claim Percentage |
|---|---|
| 0 | 100.0% |
| 1 | 98.8% |
| 5 | 94.1% |
| 10 | 88.4% |
| 25 | 72.5% |
| 50 | 50.0% |
| 75 | 32.2% |
| 100 | 18.8% |
| 150 | 3.5% |
| 200 | 0.0% |
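The distance adjustments and the Claim Percentage formula can be sketched in a few lines of Python. The function and constant names are hypothetical. Two readings are assumptions on my part: a "10 percent mileage penalty" is taken to mean multiplying the affected mileage by 1.10, and the 10-mile grace period is taken to mean that only the miles beyond the first 10 are penalized.

```python
MAX_RADIUS = 200.0   # maximum attendance radius in miles, from the formula
EXPONENT = 2.41      # exponent from the formula
GRACE_MILES = 10.0   # miles exempt from the out-of-state penalty


def adjusted_distance(raw_miles, relative_brand, penalty=0.0):
    """Apply the out-of-state penalty, then divide by the team's brand rating.

    penalty: 0.0 in-state, 0.10 from a state with no home team,
    0.50 from a state that has its own team.
    Assumption: only mileage beyond the grace period is penalized.
    """
    penalized_miles = max(raw_miles - GRACE_MILES, 0.0) * penalty
    return (raw_miles + penalized_miles) / relative_brand


def claim_percentage(adj_miles, max_radius=MAX_RADIUS):
    """Share of a county's population a team can claim at this adjusted distance."""
    if adj_miles >= max_radius:
        return 0.0
    return ((max_radius - adj_miles) / max_radius) ** EXPONENT
```

For instance, `claim_percentage(adjusted_distance(158, 0.79))` comes out at zero (give or take floating-point rounding), which is the Devil Rays' 158-mile limit expressed in code.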
- The Claim Percentage is multiplied by the county’s population to produce a raw attendance estimate.
- The raw attendance estimate is adjusted for dominance. Typically, baseball allegiance in any given area involves a tipping point of one kind or another; the more popular team or teams tend to crowd out all others, since fans of a secondary team will find that they can’t find their team’s games on TV, will have nobody to talk about the team with at the water cooler, and so forth. The mathematics of the dominance adjustment are a bit convoluted, but the basic idea is to reassign fans from one team to another by squaring the raw attendance estimates. So a team with a natural 3:2 advantage based on geography alone instead winds up at a 9:4 advantage.
- Finally, we check to see whether the raw attendance estimates between all teams in any given county add up to more than 150 percent of that county’s population. If so, the estimates are prorated downward to the 150 percent cap. Effectively, this means that in a market with two identical clubs, each team is assigned a maximum of 75 percent of its potential fan base. Once again, the selection of this constant is somewhat arbitrary.
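The last two steps can be sketched as follows. The exact reassignment math behind the dominance adjustment is not spelled out here, so the first function is only one possible reading: it keeps the county's total fixed and redistributes it in proportion to the squared estimates, which reproduces the 3:2 to 9:4 example. The function names are hypothetical.

```python
def square_shares(estimates):
    """Redistribute a county's total in proportion to each team's squared estimate.

    Assumption: the county total is preserved; only the split between teams moves.
    """
    total = sum(estimates.values())
    squares = {team: est ** 2 for team, est in estimates.items()}
    square_total = sum(squares.values())
    if square_total == 0:
        return dict(estimates)
    return {team: total * sq / square_total for team, sq in squares.items()}


def apply_cap(estimates, population, cap=1.5):
    """Prorate estimates downward if they sum to more than cap * population."""
    total = sum(estimates.values())
    limit = cap * population
    if total <= limit:
        return dict(estimates)
    scale = limit / total
    return {team: est * scale for team, est in estimates.items()}
```

With two identical clubs that each start at 90 percent of a county of 1,000 people, `apply_cap({"A": 900.0, "B": 900.0}, 1000)` trims each to 750, the 75 percent ceiling described above.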
Yep, this really is like Win Shares, what with its combination of superfluous precision and extreme subjectivity. Nevertheless, I think the model produces some fairly reliable results. For example, in Lake County in extreme Northeast Illinois, we come up with the following estimates:
Lake County, Illinois (Population 713,076)
| Team | Estimated Fans | % of Population |
|---|---|---|
| Cubs | 500,357 | 70.2% |
| White Sox | 384,211 | 53.9% |
| Brewers | 90,947 | 12.8% |
| Total | 975,515 | 136.8% |
I’d guess that those numbers are just about right. The Cubs have roughly a 7:5 edge over the White Sox, while the Brewers are penalized for being out-of-state, even though Milwaukee isn’t much farther from Lake County than Chicago is. For a somewhat more dramatic example, here is how the Northeast gets divvied up between the Red Sox and the Yankees:
The map is an oversimplification, since it does not account for the Mets, Phillies, Orioles, and so forth, but all of that is accounted for by the model. It’s fairly obvious what’s going on there, so I’ll let the pretty picture speak for itself.
We’ll break everything down on a team-by-team basis Friday, but first let me briefly describe my alternate method for calculating a team’s TV audience. There are several adjustments from the attendance version of the model, most of which are designed to reflect the winner-take-all nature of media coverage:
- The radius around each ballpark is expanded from a maximum of 200 miles to a maximum of 400 miles.
- The out-of-state penalty is increased to 100 percent for out-of-state markets with a natural home team; it remains at 10 percent for states with no home team.
- Teams are only given credit for their TV audience if they either have the highest Claim Percentage in the county, or have a Claim Percentage of at least 50 percent. That means that in most markets, there is only one TV team assigned, unless there are two or more that “obviously” deserve credit.
- The most popular team in the market is given a bonus, which is determined by taking the square root of its Claim Percentage. For example, a team with a natural 20 percent Claim Percentage sees this percentage boosted to about 45 percent, provided that it is the most influential team in the market.
- In addition, the most influential team in each market is guaranteed a minimum 10 percent share of the TV audience, even if the market lies beyond the 400-mile radius. This mostly applies to extreme rural areas; for example, the Mariners get assigned 10 percent of Alaska’s population.
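The eligibility rule, the square-root bonus, and the 10 percent floor can be sketched in Python as below. This is a rough illustration under stated assumptions: the input is a dictionary of each team's Claim Percentage in one county (already computed with the 400-mile radius and the TV penalties), and the "most popular" or "most influential" team is taken to be the one with the highest Claim Percentage unless the caller names a different leader. The function name and the `leader` parameter are hypothetical.

```python
import math

TV_FLOOR = 0.10      # guaranteed minimum share for the leading team
TV_THRESHOLD = 0.50  # non-leading teams need at least this Claim Percentage


def tv_shares(claims, leader=None):
    """Return {team: TV share} for one county from {team: Claim Percentage}."""
    if not claims:
        return {}
    if leader is None:
        # Assumption: the leading team is the one with the highest claim.
        leader = max(claims, key=claims.get)
    shares = {}
    for team, claim in claims.items():
        if team == leader:
            # Square-root bonus, with the 10 percent floor for remote markets.
            shares[team] = max(math.sqrt(claim), TV_FLOOR)
        elif claim >= TV_THRESHOLD:
            # A second team gets credit only if its claim is large enough.
            shares[team] = claim
    return shares
```

The optional `leader` argument is there for counties where every team's Claim Percentage is zero, since the highest-claim rule cannot pick a winner in that case and some other tiebreaker (nearest ballpark, say) would be needed.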
Here, then, is what the ping-pong balls say: the attendance and TV markets for each major-league club.
| Team | Attendance | Rank | Rel | TV/Media | Rank | Rel |
|---|---|---|---|---|---|---|
| NYA | 17,851,140 | 1 | 304 | 21,933,814 | 1 | 247 |
| NYN | 14,283,315 | 2 | 244 | 15,510,522 | 3 | 174 |
| LAN | 11,869,232 | 3 | 202 | 13,908,965 | 4 | 156 |
| LAA | 11,149,730 | 4 | 190 | 13,775,861 | 5 | 155 |
| PHI | 7,669,007 | 5 | 131 | 11,266,405 | 6 | 127 |
| CHN | 7,558,066 | 6 | 129 | 10,296,326 | 8 | 116 |
| CHA | 7,387,544 | 7 | 126 | 8,184,670 | 16 | 92 |
| BOS | 6,788,847 | 8 | 116 | 10,138,743 | 9 | 114 |
| TOR | 6,579,560 | 9 | 112 | 9,252,530 | 12 | 104 |
| OAK | 6,105,012 | 10 | 104 | 7,727,977 | 18 | 87 |
| SFN | 5,903,008 | 11 | 101 | 9,678,663 | 11 | 109 |
| WAS | 5,894,698 | 12 | 101 | 10,317,452 | 7 | 116 |
| ATL | 5,459,976 | 13 | 93 | 15,623,999 | 2 | 176 |
| BAL | 5,267,088 | 14 | 90 | 6,996,070 | 21 | 79 |
| DET | 5,206,887 | 15 | 89 | 8,288,697 | 15 | 93 |
| HOU | 4,990,053 | 16 | 85 | 9,757,806 | 10 | 110 |
| TEX | 4,922,605 | 17 | 84 | 9,065,761 | 14 | 102 |
| FLO | 4,226,982 | 18 | 72 | 5,737,102 | 23 | 65 |
| CLE | 3,854,535 | 19 | 66 | 5,760,772 | 22 | 65 |
| ARI | 3,730,833 | 20 | 64 | 5,371,744 | 24 | 60 |
| CIN | 3,681,420 | 21 | 63 | 9,067,568 | 13 | 102 |
| SEA | 3,470,303 | 22 | 59 | 8,145,144 | 17 | 92 |
| SDN | 3,433,886 | 23 | 59 | 4,713,421 | 26 | 53 |
| SLN | 3,003,085 | 24 | 51 | 7,502,913 | 19 | 84 |
| MIN | 3,001,789 | 25 | 51 | 5,239,711 | 25 | 59 |
| TBA | 2,999,411 | 26 | 51 | 7,329,379 | 20 | 82 |
| COL | 2,779,034 | 27 | 47 | 4,532,396 | 27 | 51 |
| PIT | 2,486,991 | 28 | 42 | 3,593,959 | 30 | 40 |
| MIL | 2,431,916 | 29 | 41 | 3,868,097 | 29 | 44 |
| KCA | 1,917,808 | 30 | 33 | 4,164,140 | 28 | 47 |
| Total | 175,903,763 | | | 266,750,607 | | |
I hope I haven’t lost you guys, because now the (comparatively) fun part is up next: our team-by-team breakdown. In addition to the attendance and TV estimates from my model, I have provided a comparison to Mike Jones' figures, and the raw census data from each team’s primary MSA. All of this runs in a Friday edition of LDL.