June 28, 2007
Lies, Damned Lies
Thanks largely to the smart and forward-thinking people in its Advanced Media wing, Major League Baseball has moved toward providing more and more information to its fans. One exception is in the department of my column today: All-Star voting. It used to be that baseball provided a rundown of every player's vote total, including write-ins that received a material number of votes, and all the way down to the player ranked at the bottom of his position pool. But now all we get is the top five at each position. How many Nationals loyalists voted for Nick Johnson--even though he's yet to play a game this season? Is Trot Nixon or Emil Brown the lowest ranked outfielder in the American League? How close is Curtis Granderson to cracking the top 15 (and how in the hell did Craig Monroe get in there?) Perhaps this is done for political reasons; nobody wants to hurt Nick Punto's feelings. But inquiring minds want to know these things.
In fact, not only should MLBAM be providing vote totals for every player listed on the ballot, but it should be breaking those vote totals down in as many ways as possible. Who has received the most votes in the past week? How much of Prince Fielder's support comes from Wisconsin? Are there significant differences between Internet and ballpark ballots? Who is winning the foreign vote?
This sort of thing would increase enthusiasm and participation for the balloting process, especially if tied in with community-oriented elements. Create a Facebook page for J.J. Hardy's candidacy where fans can write in with their arguments on his behalf. Send text messages to fans when their favorite player rises or falls in the standings. The All-Star Game is the one time each season when baseball just bucks up and has some fun, and yet the balloting counts are treated like some sort of trade secret.
I'm complaining about this because I've long wanted to write a column on team-by-team biases in All-Star balloting. How much does it help to be on the Yankees, or hurt to be on the Pirates? Unlike every other year, I've remembered that idea two weeks before the All-Star Game is played, rather than two weeks after. This analysis is made more difficult, of course, by the lack of comprehensive voting results. Nevertheless, we will press forward.
The idea is to create a simple model of the voting results based on a player's performance. For example, the model might conclude that David Wright should have about 1.0 million votes based solely on his performance. If instead David Wright has 1.4 million votes, that tells us something about the popularity of David Wright, or of the team that he plays for.
More specifically, the inputs for the model are a player's VORP thus far in 2007 and his VORP in 2006. There are certainly other variables that could be considered--do fans like batting average guys more than home run guys?--but since this column is meant to be on the whimsical side, we'll just keep things simple.
In addition, we need a player's vote total. We have this information for the top five qualifiers at each position, but for everyone else, we have to guess. It looks like there have been about six million ballots cast thus far--the cumulative totals for the top five players at each position generally sum up to between four million and five million, and you have to make some allowance for the downballot guys. Thus, we start with a baseline of six million votes, subtract out the voting totals for the top five, and divide the remaining votes evenly among the rest of the nominees.
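The fill-in step for the downballot guys can be sketched in a few lines of Python. This is a minimal illustration, not the actual spreadsheet behind the column: the function name, the 14-nominee ballot, and the sample totals are all assumptions made for the example, and the six-million baseline is the rough guess described above.

```python
def impute_downballot_votes(top_five_votes, n_nominees, baseline=6_000_000):
    """Estimate the vote total for each nominee outside the published top five.

    top_five_votes: published totals for the top five at one position.
    n_nominees: number of players listed on the ballot at that position.
    baseline: assumed number of ballots cast (a rough guess, not a known figure).
    """
    n_rest = n_nominees - len(top_five_votes)
    if n_rest <= 0:
        raise ValueError("need at least one nominee outside the top five")
    # Votes left over once the published top five are accounted for;
    # floor at zero in case the top five already exceed the baseline.
    remaining = max(baseline - sum(top_five_votes), 0)
    # Split the leftover evenly among everyone else at the position.
    return remaining / n_rest


# Hypothetical totals for illustration only, not real ballot figures.
top_five = [2_000_000, 1_000_000, 800_000, 600_000, 400_000]
per_player = impute_downballot_votes(top_five, n_nominees=14)
print(round(per_player))  # 133333
```

The even split is obviously crude, since the sixth-place player almost certainly has more votes than the last-place one, but it keeps the guesswork in one adjustable parameter.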
By using 2007 and 2006 VORP alone (technically, the square of 2007 and 2006 VORP, which tracks the results more closely than any sort of linear estimate), it turns out that the model is able to explain about 50 percent of the variance in the vote counts. That's actually not a bad total, considering how crude both the model and the underlying data set are. We gain a few more points of predictive power if we tweak the results to ensure that the proper number of votes is allocated at each position. Without this adjustment, there are too few predicted votes for National League catchers, for example, since there aren't many high VORPs at that position this year, but there are too many at first base in the American League, since the league has cherry-picked some of the stronger DHs and listed them on the ballot as first basemen.
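For readers who want to see what a fit on squared VORP looks like, here is a rough sketch using ordinary least squares in plain Python. Several things here are assumptions rather than details from the column: the intercept term, the function names, the choice to square VORP as-is (the column does not say how negative VORP was handled), and the toy inputs, which are generated from known coefficients rather than taken from real data. The per-position adjustment is left out.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations carry the right-hand side along.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_votes_model(vorp_2007, vorp_2006, votes):
    """Least-squares fit of votes ~ b0 + b1 * VORP2007^2 + b2 * VORP2006^2.

    Returns (coefficients, predicted votes, R-squared).
    """
    # Assumption: VORP is squared as-is, so a negative VORP loses its sign.
    rows = [[1.0, v7 ** 2, v6 ** 2] for v7, v6 in zip(vorp_2007, vorp_2006)]
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, votes)) for i in range(3)]
    beta = solve_linear_system(xtx, xty)
    predicted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(votes) / len(votes)
    ss_res = sum((y - p) ** 2 for y, p in zip(votes, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in votes)
    return beta, predicted, 1 - ss_res / ss_tot


# Toy inputs built from known coefficients; not real VORP or vote data.
toy_vorp_2007 = [5, 10, 20, 30, 15]
toy_vorp_2006 = [40, 10, 30, 20, 50]
toy_votes = [100_000 + 500 * a * a + 100 * b * b
             for a, b in zip(toy_vorp_2007, toy_vorp_2006)]
beta, predicted, r_squared = fit_votes_model(toy_vorp_2007, toy_vorp_2006, toy_votes)
print(beta, r_squared)
```

On noise-free toy data like this the fit should recover the coefficients up to floating-point error and report an R-squared of essentially 1; real ballots are much noisier, which is why the column's figure is closer to 0.5.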
One interesting result is that 2007 performance is much more predictive of a player's position on the ballot than 2006 performance. Specifically, each point of VORP accumulated in 2007 is worth about four times as much as a point of VORP accumulated in 2006. This is a slightly misleading result, since we're comparing partial-season and full-season numbers. Nevertheless, it's safe to conclude that fans are treating the All-Star Game as a reward for having a hot six, eight, or ten weeks to start the season, rather than necessarily going with the guys they'd pick if their lives depended on winning a ballgame tomorrow.
This is a somewhat ironic result, because for years and years the argument was that the fans were too slow to recognize changes in performance, and kept electing the same veterans year after year. In fact, the process has almost completely reversed itself, to the point where the fans barely consider a player's pedigree prior to 2007 at all. No doubt this has a lot to do with the introduction of Internet balloting, which puts the current year's statistics just a mouse click away, and perhaps more specifically the influence of fantasy baseball, since no crowd of fans is more firmly in the what-have-you-done-for-me-lately camp than avid fantasy gamers.
Ultimately, this process speaks pretty well to the influence of the information revolution in baseball, although I side with Joe Sheehan in not liking the result: I tend to prefer picking my All-Stars based on who I think the best players are at any given time, rather than who has been the hottest over a three-month stretch.
As alluded to earlier, the "trick" is to figure out the differences between a player's predicted vote total and his actual vote total, and to see how those differences play out by team. The model predicts, for example, that Alfonso Soriano should have about 0.8 million votes; he actually has 1.3 million, so we take the difference of 500,000 votes and place it in the Cubs' column. We then repeat this process for each of the eight players a team has listed on the ballot to come up with an estimate of the residual number of votes that a player picks up based on playing for a particular team; those results follow below.
Team         Residual votes
Mets                435,030
Red Sox             423,489
Brewers             395,426
Yankees             386,261
Dodgers             379,235
Tigers              338,581
Cardinals           117,479
Twins                98,612
Cubs                 28,921
Astros                (100)
Giants             (20,353)
Reds               (27,129)
Braves             (54,679)
A's                (70,231)
Angels             (76,663)
Devil Rays         (77,627)
Mariners           (90,137)
Royals            (104,549)
White Sox         (105,308)
Nationals         (128,867)
Rangers           (135,562)
Pirates           (137,710)
Padres            (149,781)
D'Backs           (171,242)
Orioles           (179,958)
Blue Jays         (194,114)
Phillies          (205,883)
Indians           (211,400)
Rockies           (224,608)
Marlins           (237,113)
The way to read this is that a typical player would pick up about 435,000 votes simply by virtue of playing for the Mets, or lose about 150,000 votes by playing for the Padres. The teams toward the top of the list are pretty much those teams that you'd expect to see. It was obvious that the Mets were going to do well when I saw Jose Valentin's name in the top five at his position.
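The bookkeeping behind the team column can be sketched as follows. This assumes the team figure is the average of actual-minus-predicted votes across a team's listed players, which is one reading of "the residual number of votes that a player picks up"; the function name is made up for the example, and the second Cubs row is a hypothetical teammate invented so there is something to average.

```python
from collections import defaultdict


def team_residuals(players):
    """Average (actual - predicted) votes per team.

    players: iterable of (team, actual_votes, predicted_votes) tuples.
    Averaging rather than summing is an assumption of this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for team, actual, predicted in players:
        totals[team] += actual - predicted
        counts[team] += 1
    return {team: totals[team] / counts[team] for team in totals}


sample = [
    ("CHN", 1_300_000, 800_000),  # Alfonso Soriano, figures from the column
    ("CHN", 300_000, 500_000),    # hypothetical teammate, not a real total
]
print(team_residuals(sample))  # {'CHN': 150000.0}
```

One big overperformer does not make a team effect on his own: the other players on the ballot can pull the average most of the way back toward zero, which is how the Cubs end up near the middle of the list despite Soriano's surplus.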
For the most part, those teams that get the biggest boost in the All-Star balloting are those that are doing the best at the box office; the correlation between the All-Star residuals and per-game attendance thus far in 2007 is .64. I do not know how much of this has to do with "ballot stuffing"--the relationship between attendance and All-Star voting is no stronger if you account for the number of home dates thus far in 2007--as opposed to attendance serving as a good proxy for a team's popularity overall.
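The correlation quoted above is presumably the ordinary Pearson coefficient; that is an assumption, since the column does not name the statistic. A minimal version, with a hypothetical function name, looks like this:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-movement of the two series around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

To reproduce the comparison, you would pass the thirty team residuals and the thirty per-game attendance figures in the same team order.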
There are also different degrees of "homerism" between different sets of fans. Fans in the northeast are very loyal to their clubs, with the notable exception of Philadelphia, a contrarian city where fans will find any excuse to rag on their own players. Fans in the upper Midwest are the next most loyal, especially in Milwaukee and Detroit, where the ballclubs are generating a ton of buzz right now. West Coasters are a lot more equivocal in their voting patterns.
We can subtract out the residual factor for each club to come up with "context-neutral" balloting results. I would not advocate doing this for picking the actual All-Star clubs; it is a popularity contest, after all, and the last thing I'd want to do is punish the Brewers because their fans are excited about them right now. Nevertheless, here is how the top vote-getters at each position would change:
Player           Team Lg  Pos  Actual  Adjusted
Ivan Rodriguez   DET  AL  C    1363K   1024K
Joe Mauer        MIN  AL  C    952K    853K
Victor Martinez  CLE  AL  C    567K    778K
David Ortiz      BOS  AL  1B   1810K   1387K
Justin Morneau   MIN  AL  1B   1063K   964K
Travis Hafner    CLE  AL  1B   503K    714K
Placido Polanco  DET  AL  2B   1270K   931K
Robinson Cano    NYA  AL  2B   966K    580K
B.J. Upton       TBA  AL  2B   490K    568K
Alex Rodriguez   NYA  AL  3B   2543K   2157K
Mike Lowell      BOS  AL  3B   892K    469K
Adrian Beltre    SEA  AL  3B   321K    411K
Derek Jeter      NYA  AL  SS   2127K   1741K
Miguel Tejada    BAL  AL  SS   624K    804K
Orlando Cabrera  ANA  AL  SS   512K    589K
Vlad Guerrero    ANA  AL  OF   2044K   2121K
Ichiro Suzuki    SEA  AL  OF   1410K   1500K
Magglio Ordonez  DET  AL  OF   1446K   1107K
Grady Sizemore   CLE  AL  OF   803K    1014K
Torii Hunter     MIN  AL  OF   1085K   986K
Manny Ramirez    BOS  AL  OF   1387K   964K
Sammy Sosa       TEX  AL  OF   515K    651K
Gary Sheffield   DET  AL  OF   958K    619K
Albert Pujols    SLN  NL  1B   1198K   1081K
Prince Fielder   MIL  NL  1B   1454K   1059K
Nomar Garciap.   LAN  NL  1B   1011K   632K
Russell Martin   LAN  NL  C    1291K   912K
Brian McCann     ATL  NL  C    716K    771K
Bengie Molina    SFN  NL  C    688K    708K
Chase Utley      PHI  NL  2B   1289K   1495K
Craig Biggio     HOU  NL  2B   747K    747K
Jeff Kent        LAN  NL  2B   862K    483K
Miguel Cabrera   FLO  NL  3B   1142K   1379K
David Wright     NYN  NL  3B   1425K   990K
Chipper Jones    ATL  NL  3B   773K    828K
Jose Reyes       NYN  NL  SS   1365K   930K
Jimmy Rollins    PHI  NL  SS   595K    801K
J.J. Hardy       MIL  NL  SS   1152K   756K
Ken Griffey Jr.  CIN  NL  OF   1641K   1668K
Alfonso Soriano  CHN  NL  OF   1333K   1304K
Carlos Beltran   NYN  NL  OF   1698K   1263K
Barry Bonds      SFN  NL  OF   1213K   1233K
Matt Holliday    COL  NL  OF   866K    1091K
Andruw Jones     ATL  NL  OF   916K    971K
Carlos Lee       HOU  NL  OF   809K    809K
Jim Edmonds      SLN  NL  OF   560K    443K
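The adjustment itself is a single subtraction: take a player's actual total and remove his team's residual. A quick sketch, with a hypothetical function name and two rows checked against the figures in this column (actual totals are only published to the nearest thousand, so the match is approximate by construction):

```python
def context_neutral_votes(actual_votes, team_residual):
    """Remove the estimated team effect from a player's actual vote total."""
    return actual_votes - team_residual


# Ivan Rodriguez: 1363K actual, Tigers residual of 338,581.
print(round(context_neutral_votes(1_363_000, 338_581) / 1000))  # 1024
# Victor Martinez: 567K actual, Indians residual of -211,400.
print(round(context_neutral_votes(567_000, -211_400) / 1000))   # 778
```

Players on teams with negative residuals gain votes under the adjustment, since subtracting a negative number adds it back.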
There are only two positions at which the projected starter would change if everyone played for the American Neutrals, and those are the two infield corners in the National League, where Albert Pujols pulls back just ahead of Prince Fielder, and Miguel Cabrera moves way ahead of David Wright. If picking Wright over Cabrera is the most we have to criticize in this year's All-Star balloting, then the fans have come an awfully long way.