May 3, 2007
Lies, Damned Lies
Defining a Market, Part One
Are you ready for some geography?
One of my favorite sabermetric pieces is Mike Jones’ study on market sizes. By making a couple of common-sense adjustments to standard assessments of market size based on city-level or metropolitan-level population, Mike was able to go a long way toward creating truer and more intuitive assessments of the relative market sizes of different baseball clubs. I liked his data so much that I used it as the basis for my attendance study in Baseball Between the Numbers.
This past weekend, however, I had a bright idea. Actually, it wasn’t such a bright idea, because I wound up spending the better part of three days on it. Forget about how we define metropolitan areas and how we dole out secondary markets--let’s try and account for the MLB affiliation of every person in the country!
Clearly, definitions of city boundaries are arbitrary. Both Northern California and Southern California are densely populated, but San Francisco occupies 47 square miles, while Los Angeles occupies 470 square miles. Definitions of metropolitan boundaries are arbitrary to a certain extent too. The areas may cross state boundaries, or run up against other cities. In some cases, a metropolitan statistical area (MSA) may be drawn too tightly from a baseball team’s perspective, excluding people in the exurbs who could reasonably attend a baseball game. In other cases, it might be drawn too broadly, including people who might have some lukewarm affiliation with the central city, but are probably too far away to attend major-league baseball games on a regular basis.
Breaking things down to the county or city level would resolve this problem. Indeed, it turns out that there is quite a wealth of population data available for free at the Census Bureau home page. My base unit of analysis was the 2006 population estimates on a county-by-county basis for the 48 contiguous states, plus Alaska, Hawaii, Puerto Rico and the District of Columbia. In some cases, where a county had a population of two million or more, I drilled down further to the city or census-tract level. In addition, I was able to track down population data at the metropolitan level for Canada. (Although this excludes rural Canada, Canada is highly urbanized, and if we wind up excluding a few farmers in Northern Saskatchewan, I think I can live with that.)
Building this database provides several advantages, above and beyond the benefits of not having to rely on someone else’s definition of what constitutes a city or MSA. For one thing, we can very naturally account for a team’s potential secondary markets. For another, the area in between cities has a different character in different parts of the country. It’s often fairly dense along the Eastern Seaboard, and reasonably dense in the Midwest and the South, but generally completely barren in the Mountain West.
The other interesting piece of data available at the Census Bureau--something that required some digging to find--is their estimate for the latitude and longitude of the geographic center of each county. By using something called the Haversine formula, we can compute the great-circle distance from each county’s center to each major league ballpark.
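For readers who want to see the distance step concretely, here is a minimal sketch of the Haversine formula in Python. The function name, the Earth-radius constant, and the sample coordinates are assumptions made for illustration; the article does not show its own implementation.

```python
import math

# Mean Earth radius in miles (an assumed constant; the article does not state one).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot before the arcsine.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate, illustrative coordinates for the two Chicago ballparks.
wrigley = (41.948, -87.655)
white_sox_park = (41.830, -87.634)
print(round(haversine_miles(*wrigley, *white_sox_park), 1), "miles apart")
```

In the setup described above, the first point would be a county’s geographic center and the second a ballpark, with the calculation repeated for every county-ballpark pair.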
So far, so good. Everything sounds very precise and very scientific. You can see that what I’m going to do is to build some sort of sliding scale for market size based on the distance between a given person and a given ballpark. But in order to get the model to really sing, we have to account for a couple of other wrinkles that straddle the boundary between the subjective and objective.
For one thing, we need some notion of a team’s sphere of influence. If you viewed things strictly in terms of geography, you’d find that the Cubs and the White Sox have nearly identical market sizes, when in fact the Cubs have quite a bit more influence, particularly once you get outside of Chicago proper and into suburbs and cornfields. A team’s sphere of influence can penetrate outward much farther if it has a strong brand.
There is no one perfect way to define the strength of a team’s brand, so what I did instead was to combine six or seven imperfect metrics in the hopes of coming up with a tasty sausage. In particular, the measures that I looked at were as follows:
As you can see, I’m trying to house a lot of different definitions of brand under one roof. A team’s "likability" plays a role, as reflected in the ESPN survey, but so too does its history of success, the amount of buzz that it generates in media circles, and so forth. The Forbes data is intentionally given double weight because it’s probably the most reliable data among our metrics in this exercise.
Each team was assigned a rating in each category, ranging from 50 for the lowest team to 100 for the highest team, with the rest of the teams linearly interpolated in between. The ratings across the seven categories were then averaged to produce the final result.
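The scaling step can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual worksheet: the team names, category names, and raw numbers are invented placeholders, and treating the Forbes double weight as a weight of 2 in a weighted average is a guess at how the averaging was done.

```python
def rescale_50_100(raw):
    """Map raw values linearly so the lowest team gets 50 and the highest gets 100."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Degenerate case (every team tied): put everyone at the midpoint.
        return {team: 75.0 for team in raw}
    return {team: 50.0 + 50.0 * (value - lo) / (hi - lo)
            for team, value in raw.items()}


def weighted_average(ratings, weights):
    """Weighted mean of one team's category ratings."""
    total_weight = sum(weights[cat] for cat in ratings)
    return sum(ratings[cat] * weights[cat] for cat in ratings) / total_weight


# Invented raw numbers for three placeholder teams in two placeholder categories.
raw_by_category = {
    "forbes_value": {"Team A": 1200.0, "Team B": 600.0, "Team C": 300.0},
    "survey_score": {"Team A": 10.0, "Team B": 30.0, "Team C": 20.0},
}
# Assumption: "double weight" means the Forbes category counts twice.
weights = {"forbes_value": 2.0, "survey_score": 1.0}

scaled = {cat: rescale_50_100(vals) for cat, vals in raw_by_category.items()}
brand = {
    team: weighted_average({cat: scaled[cat][team] for cat in scaled}, weights)
    for team in raw_by_category["forbes_value"]
}
for team, score in sorted(brand.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {score:.1f}")
```

Because every category is forced onto the same 50-100 range before averaging, a category measured in dollars cannot swamp one measured in survey points.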
Team                        Absolute Score   Relative Score
 1. New York Yankees             94.8             1.44
 2. St. Louis Cardinals          82.6             1.26
 3. Boston Red Sox               80.1             1.22
 4. Chicago Cubs                 72.8             1.11
 5. New York Mets                72.2             1.10
 6. Cleveland Indians            71.7             1.09
 7. Atlanta Braves               70.3             1.07
 8. Chicago White Sox            68.2             1.04
 9. Houston Astros               67.9             1.03
10. Detroit Tigers               67.1             1.02
11. Los Angeles Angels           67.0             1.02
12. San Francisco Giants         66.6             1.01
13. Los Angeles Dodgers          66.3             1.01
14. Philadelphia Phillies        66.2             1.01
15. Seattle Mariners             65.1             0.99
16. San Diego Padres             64.3             0.98
17. Cincinnati Reds              63.7             0.97
18. Baltimore Orioles            63.5             0.97
19. Pittsburgh Pirates           62.0             0.94
20. Texas Rangers                61.8             0.94
21. Oakland A's                  61.5             0.94
22. Arizona Diamondbacks         61.4             0.93
23. Minnesota Twins              59.4             0.90
24. Toronto Blue Jays            59.2             0.90
25. Washington Nationals         58.7             0.89
26. Milwaukee Brewers            57.7             0.88
27. Florida Marlins              56.4             0.86
28. Colorado Rockies             56.3             0.86
29. Kansas City Royals           55.3             0.84
30. Tampa Bay Devil Rays         52.2             0.79
You’ll see two sets of ratings reflected in the chart. The first is the “raw” rating on a 50-100 scale, while the second is the score relative to the league average, which is the number that we’re going to use to tweak our market-size estimates. We could probably devote a column or two to the accuracy or lack thereof of these brand ratings--I think they seem pretty darn good--but we have a lot of other things to look at, so let’s move forward.
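As a quick check on how the two columns relate, the sketch below divides each absolute score from the chart by the mean of all 30. The variable names are placeholders, and the inputs are the rounded figures as printed, so treat this as an approximate reconstruction; it does appear to reproduce the relative scores to two decimals.

```python
# Absolute brand scores copied from the chart above (already rounded to one decimal).
absolute = {
    "New York Yankees": 94.8, "St. Louis Cardinals": 82.6, "Boston Red Sox": 80.1,
    "Chicago Cubs": 72.8, "New York Mets": 72.2, "Cleveland Indians": 71.7,
    "Atlanta Braves": 70.3, "Chicago White Sox": 68.2, "Houston Astros": 67.9,
    "Detroit Tigers": 67.1, "Los Angeles Angels": 67.0, "San Francisco Giants": 66.6,
    "Los Angeles Dodgers": 66.3, "Philadelphia Phillies": 66.2, "Seattle Mariners": 65.1,
    "San Diego Padres": 64.3, "Cincinnati Reds": 63.7, "Baltimore Orioles": 63.5,
    "Pittsburgh Pirates": 62.0, "Texas Rangers": 61.8, "Oakland A's": 61.5,
    "Arizona Diamondbacks": 61.4, "Minnesota Twins": 59.4, "Toronto Blue Jays": 59.2,
    "Washington Nationals": 58.7, "Milwaukee Brewers": 57.7, "Florida Marlins": 56.4,
    "Colorado Rockies": 56.3, "Kansas City Royals": 55.3, "Tampa Bay Devil Rays": 52.2,
}

# League average of the absolute scores.
league_average = sum(absolute.values()) / len(absolute)

# Relative score: each team's absolute score as a multiple of the league average.
relative = {team: round(score / league_average, 2) for team, score in absolute.items()}

print(f"League average: {league_average:.2f}")
for team in ("New York Yankees", "Chicago Cubs", "Tampa Bay Devil Rays"):
    print(f"{team}: {relative[team]:.2f}")
```

A relative score of 1.00 is a league-average brand, so multiplying a market estimate by this number inflates it for strong brands and shrinks it for weak ones.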
It became clear to me in thinking about market size that there are two ways to define a team’s market. On the one hand, you have a team’s market for attendance, which is going to involve a smaller geographic radius, since people need to be able to commute to the ballpark to attend a baseball game. On the other hand, you have a team’s media market, which is less subject to geographic constraints, but tends to be more of a winner-take-all affair. In general, the media market will be larger than the attendance market, but the attendance market is not necessarily a subset of the media market. A fan in northeast Pennsylvania is probably going to get the Phillies but not the Mets on TV, even though he could commute to New York about as easily as he could commute to Philadelphia.
We’ll concentrate on the attendance side of the coin first. My process for determining each team’s potential attendance market was as follows:
Yep, this really is like Win Shares, what with its combination of superfluous precision and extreme subjectivity. Nevertheless, I think the model produces some fairly reliable results. For example, in Lake County in extreme Northeast Illinois, we come up with the following estimates:
Lake County, Illinois (Population 713,076)
Cubs         500,357    70.2%
White Sox    384,211    53.9%
Brewers       90,947    12.8%
Total        975,515   136.8%
I’d guess that those numbers are just about right. The Cubs have roughly a 7:5 edge over the White Sox, while the Brewers are penalized for being out-of-state, even though Milwaukee isn’t much farther from Lake County than Chicago is. For a somewhat more dramatic example, here is how the Northeast gets divvied up between the Red Sox and the Yankees:
The map is an oversimplification, since it does not account for the Mets, Phillies, Orioles, and so forth, but all of that is accounted for by the model. It’s fairly obvious what’s going on there, so I’ll let the pretty picture speak for itself.
We’ll break everything down on a team-by-team basis Friday, but first let me briefly describe my alternate method for calculating a team’s TV audience. There are several adjustments from the attendance version of the model, most of which are designed to reflect the winner-take-all nature of media coverage:
Here, then, is what the ping-pong balls say: the attendance and TV markets for each major-league club.
Team   Attendance    Rank   Rel    TV/Media      Rank   Rel
NYA    17,851,140      1    304    21,933,814      1    247
NYN    14,283,315      2    244    15,510,522      3    174
LAN    11,869,232      3    202    13,908,965      4    156
LAA    11,149,730      4    190    13,775,861      5    155
PHI     7,669,007      5    131    11,266,405      6    127
CHN     7,558,066      6    129    10,296,326      8    116
CHA     7,387,544      7    126     8,184,670     16     92
BOS     6,788,847      8    116    10,138,743      9    114
TOR     6,579,560      9    112     9,252,530     12    104
OAK     6,105,012     10    104     7,727,977     18     87
SFN     5,903,008     11    101     9,678,663     11    109
WAS     5,894,698     12    101    10,317,452      7    116
ATL     5,459,976     13     93    15,623,999      2    176
BAL     5,267,088     14     90     6,996,070     21     79
DET     5,206,887     15     89     8,288,697     15     93
HOU     4,990,053     16     85     9,757,806     10    110
TEX     4,922,605     17     84     9,065,761     14    102
FLO     4,226,982     18     72     5,737,102     23     65
CLE     3,854,535     19     66     5,760,772     22     65
ARI     3,730,833     20     64     5,371,744     24     60
CIN     3,681,420     21     63     9,067,568     13    102
SEA     3,470,303     22     59     8,145,144     17     92
SDN     3,433,886     23     59     4,713,421     26     53
SLN     3,003,085     24     51     7,502,913     19     84
MIN     3,001,789     25     51     5,239,711     25     59
TBA     2,999,411     26     51     7,329,379     20     82
COL     2,779,034     27     47     4,532,396     27     51
PIT     2,486,991     28     42     3,593,959     30     40
MIL     2,431,916     29     41     3,868,097     29     44
KCA     1,917,808     30     33     4,164,140     28     47
Total 175,903,763                 266,750,607
I hope I haven’t lost you guys, because now the (comparatively) fun part is up next: our team-by-team breakdown. In addition to the attendance and TV estimates from my model, I have provided a comparison to Mike Jones' figures, and the raw census data from each team’s primary MSA. All of this runs in a Friday edition of LDL.