
The new Yankee Stadium has received a lot of press this spring for the large number of homeruns hit there so far. On April 21, 2009, Buster Olney wrote at ESPN (http://sports.espn.go.com/mlb/news/story?id=4080195): “The New York Yankees might have a serious problem on their hands: Beautiful new Yankee Stadium appears to be a veritable wind tunnel that is rocketing balls over the fences…including 17 in the first three games in the Yankees’ first home series against the Indians. That’s an average of five home runs per game and, at this pace, there would be about 400 homers hit in the park this year — or an increase of about 250 percent. In the last year of old Yankee Stadium, in 2008, there were a total of 160 homers.”

The first mistake in Olney’s analysis is to take the homerun rate of five games and extrapolate it over a full season. The second is to compare that pace with the number hit in the old Yankee Stadium last year without considering whether different players were on the field. The accepted method of measuring park factors, on any statistic, is to compare the home totals of both the batters and pitchers to those compiled on the road, where playing in fifteen or more different parks minimizes the effect of any one park. The factors then allow us to estimate how these players would perform in a neutral environment.

As of this writing on May 20, the Yankees have played 19 games at home, which have seen a total of 71 homeruns, 37 by Yankees hitters and 34 by their opposition. They’ve played 21 games on the road, with 49 homeruns, 27 by Yankees hitters and 22 by their opposition. Dividing the 71 homers at Yankee Stadium by the 49 in the Yankees’ road games gives a factor of 1.45, indicating that the new Yankee Stadium inflates homerun rates by 45%. The Yankees have played two more games on the road than at home, so let’s instead find the ratio of the home HR% (HR/(AB-SO)) of .064 to the road game rate of .043, which is 1.48, slightly higher but basically the same.
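As a minimal sketch of the two calculations above (the function names are mine, not part of any standard library, and the .064 and .043 rates are taken as given because the underlying at-bat and strikeout totals aren't listed here):

```python
def raw_factor(home_count, road_count):
    """Home total divided by road total; ignores unequal games played."""
    return home_count / road_count


def rate_factor(home_rate, road_rate):
    """Ratio of per-opportunity rates, e.g. HR / (AB - SO)."""
    return home_rate / road_rate


# Yankees games through May 20, 2009, both teams' homers combined
home_hr = 37 + 34
road_hr = 27 + 22

print(round(raw_factor(home_hr, road_hr), 2))   # 1.45
print(round(rate_factor(0.064, 0.043), 2))      # 1.49 from the rounded rates
```

The second figure comes out at 1.49 rather than 1.48 only because the inputs here are already rounded to three places; the ratio of the unrounded rates would presumably land on the 1.48 quoted in the text.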

Are 19 home games, about a quarter of a season, a large enough sample to get a reliable factor? After two exhibitions and three regular season games, Olney calculates an increase of 250%. After 19 regular season home games, I calculate an increase of 45%. What is it likely to be by the end of the season?

From 1985 to 1991, a period of seven seasons, there were no changes in the National League in either ballparks or schedule. I ran a series of one-year, two-year and three-year factors to find out how much each varied from the seven-year ‘true’ value at each park. The chart below shows the standard deviation (SD) of the results for each category at each sample size. After one year all categories are fairly close to two-decimal-place accuracy, except homeruns, which take three years, and triples, which take even longer.

If Yankee Stadium still has a homerun factor of 1.45 at the end of the year, with an SD of .149, that means there’s about a 70% chance the ‘true’ value is between 1.30 and 1.60 (one SD either side), and a 95% chance of it being between 1.15 and 1.75 (two SDs). After 19 games it is still possible that Yankee Stadium could turn out to be an average park.


       SDT   XBH    SI    DO    TR    HR    BB    SO
1 Yr  .039  .083  .044  .091  .292  .149  .069  .044
2 Yr  .023  .057  .025  .060  .207  .085  .054  .030
3 Yr  .018  .046  .020  .045  .161  .060  .041  .023
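
To make the ranges concrete, here is a small sketch that turns the HR column of the chart into one-SD and two-SD bands around an observed factor. It assumes the errors are roughly normally distributed, which is what lets one SD be read as "about 70%" and two SDs as "about 95%"; the function name is hypothetical.

```python
# Standard deviation of the HR factor by years of data, from the chart above
HR_SD = {1: 0.149, 2: 0.085, 3: 0.060}


def factor_range(observed, sd, n_sd=1):
    """Return (low, high): the observed factor plus or minus n_sd standard deviations."""
    return (round(observed - n_sd * sd, 2), round(observed + n_sd * sd, 2))


for years, sd in HR_SD.items():
    one_sd = factor_range(1.45, sd)
    two_sd = factor_range(1.45, sd, n_sd=2)
    print(f"{years} yr: one SD {one_sd}, two SD {two_sd}")
```

With one year of data this reproduces the 1.30 to 1.60 and 1.15 to 1.75 ranges quoted above, and the bands shrink as the second and third seasons are added.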

A stadium having a factor of 1.45 tells us that plays in that park will be increased by 45% over normal rates. We can use this number to normalize the performance of batters and pitchers to what they would have done in a ‘neutral’ park. Each team is scheduled to play half its games at home and the other half at the various road parks. If we assume that the road parks average out to 1.00, then the ‘team’ factor applied to the season’s stats would be (home+road)/2, or in this case (1.45+1.00)/2, which is 1.22. Yankees hitters would be normalized by having their homerun percentage reduced by 22%, and the pitchers increased by 22%. However, with interleague play and unbalanced schedules, we cannot assume that a team’s road parks average 1.00. The Pirates play division games in Great American Ballpark, Miller Park, Wrigley Field and Minute Maid Park, all of which are among the easiest to homer in. The Rockies play division games in Petco Park, Dodger Stadium and AT&T Park, which are among the hardest. After the initial calculation of each park’s factors, use those to normalize each team’s road statistics and rerun to generate a new version of the factors. A third pass is even better, but more than that doesn’t add any meaningful accuracy.
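
One way to read the normalize-and-rerun step is sketched below on a made-up four-team league with an unbalanced schedule. Everything here is an assumption for illustration: the team names, the home run rates, the schedule and the function names are invented, and in real life the rates would come from observed stats rather than being known in advance.

```python
# Hypothetical league: A and B share one division, C and D the other.
# HOME_RATE is the home-run rate seen in each team's home park (invented values).
HOME_RATE = {"A": 0.040, "B": 0.040, "C": 0.040, "D": 0.060}

# ROAD_GAMES[team][park] = road games the team plays in that park:
# 36 at the division rival, 18 at each park in the other division.
ROAD_GAMES = {
    "A": {"B": 36, "C": 18, "D": 18},
    "B": {"A": 36, "C": 18, "D": 18},
    "C": {"D": 36, "A": 18, "B": 18},
    "D": {"C": 36, "A": 18, "B": 18},
}


def road_rate(team, park_rate):
    """Games-weighted mean of the given rates over the parks a team visits."""
    games = ROAD_GAMES[team]
    return sum(n * park_rate[p] for p, n in games.items()) / sum(games.values())


def iterate_factors(rounds):
    """Round 1 is the raw home/road ratio. Each later round first divides
    every park's home rate by its current factor, then recomputes road rates."""
    park_rate = dict(HOME_RATE)  # round 1: road parks taken at face value
    factors = {}
    for _ in range(rounds):
        factors = {t: HOME_RATE[t] / road_rate(t, park_rate) for t in HOME_RATE}
        # normalize each park by its own factor before the next pass
        park_rate = {t: HOME_RATE[t] / factors[t] for t in HOME_RATE}
    return factors


for n in (1, 2, 3, 5):
    print(n, {t: round(f, 2) for t, f in iterate_factors(n).items()})
```

Parks A, B and C are identical here, so they should share a factor of .89 (.040 against a four-park mean of .045), and D should come out at 1.33. The raw pass gives C .80 and D 1.50 because C plays so often in D; the second pass gives .94 and 1.26, the third .86 and 1.37, and by the fifth they are at .88 and 1.34, which is why a couple of reruns are usually enough.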

The chart shows that it takes at least three years to get a fairly accurate set of factors, but before that time has gone by a new stadium has likely been constructed, which means the road parks have changed. Assuming Yankee Stadium’s HR factor remains higher than that of the park it replaced, the factor for Fenway Park will go down because Red Sox hitters can be expected to hit more homers on the road. In 1978, Fenway was the fourth easiest park in the AL to homer in, but by 1999 it had dropped to ninth. Fenway hadn’t changed; it was all the other parks that changed. Can we legitimately say, “It used to be a hitter’s park, but now it’s a pitcher’s park”? It would make sense for each park’s factors to remain constant as long as there have not been any changes in that park. To find each team’s factors, multiply how many times they play in each park by each park’s factors, then divide the sum by the total number of games. The team factor can change each year with a different mix of road parks for each team, while the factors for each park do not change as long as the park hasn’t changed. When play-by-play data is available, team factors to adjust a season total are not needed. Instead, how each player performed in each ballpark can be normalized with that park’s factors, and then summed into an adjusted season total.

1978   American League               1999   American League
ParkID Name                  HRpf   ParkID Name                  HRpf
SEA02  Kingdome              1.55   DET04  Tiger Stadium         1.21
DET04  Tiger Stadium         1.21   BAL12  Camden Yards          1.13
TOR01  Exhibition Stadium    1.06   STP01  Tropicana Field       1.12
BOS07  Fenway Park           1.02   TOR02  SkyDome               1.10
MIN02  Metropolitan Stadium  1.00   ARL02  Ballpark at Arlington 1.10
CLE07  Cleveland Stadium     1.00   SEA02  Kingdome              1.09
ARL01  Arlington Stadium     0.94   NYC16  Yankee Stadium        1.07
OAK01  Oakland Coliseum      0.93   KAN06  Kauffman Stadium      1.05
NYC16  Yankee Stadium        0.92   BOS07  Fenway Park           1.02
ANA01  Anaheim Stadium       0.86   ANA01  Anaheim Stadium       1.01
MIL05  County Stadium        0.85   MIN03  Metrodome             0.98
CHI10  Comiskey Park         0.79   CHI12  Comiskey Park II      0.98
KAN06  Kauffman Stadium      0.77   OAK01  Oakland Coliseum      0.97
BAL11  Memorial Stadium      0.76   CLE08  Jacobs Field          0.95
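
The team factor described above is a games-weighted mean, which can be sketched in a few lines. The park factors are the 1999 HR values from the table; the schedule is hypothetical (a real one would list every park the team visits), and the function name is my own.

```python
def team_factor(games_by_park, park_factor):
    """Games-weighted mean of park factors over a team's whole schedule."""
    total_games = sum(games_by_park.values())
    weighted = sum(n * park_factor[park] for park, n in games_by_park.items())
    return weighted / total_games


# 1999 HR park factors, taken from the table above
HR_PF = {"BOS07": 1.02, "NYC16": 1.07, "BAL12": 1.13, "TOR02": 1.10, "STP01": 1.12}

# Hypothetical schedule: 81 home games plus 20 road games in each of four parks
red_sox_games = {"BOS07": 81, "NYC16": 20, "BAL12": 20, "TOR02": 20, "STP01": 20}

print(round(team_factor(red_sox_games, HR_PF), 3))  # 1.062
```

Because the park factors stay fixed, only the game counts need to change from year to year; swapping in a different mix of road parks moves the team factor without touching any park's own number.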

In calculating long term park factors, I first made a list of ballpark ‘versions’. Three Rivers Stadium opened in Pittsburgh in 1970, so that’s version 1. In 1975, an inner wooden fence was constructed, about 6 feet shorter, creating version 2, which lasted until its closing after the 2000 season. Version 2 of Veterans Stadium in Philadelphia existed from 1972 to 2003. Three Rivers v2 and Veterans v2 both existed from 1975 to 2000. For those 26 seasons, compare the Pirates’ and Phillies’ stats in Pittsburgh with the same two teams’ stats in Philadelphia. Repeat for every combination of ballpark versions, then compare the total home to road stats for the entire range of years.
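
Finding the seasons in which two park versions coexisted is the bookkeeping step behind each of those comparisons. The sketch below uses only the versions named above; the 1974 end date for Three Rivers version 1 is inferred from the fence going up in 1975, and the data layout and function name are assumptions of mine.

```python
from itertools import combinations

# (park, version) -> (first season, last season)
VERSIONS = {
    ("Three Rivers Stadium", 1): (1970, 1974),  # end year inferred from the text
    ("Three Rivers Stadium", 2): (1975, 2000),
    ("Veterans Stadium", 2): (1972, 2003),
}


def overlap(span_a, span_b):
    """Return (first, last) season shared by two spans, or None if they never coexist."""
    first = max(span_a[0], span_b[0])
    last = min(span_a[1], span_b[1])
    return (first, last) if first <= last else None


# Walk every pair of versions and report the seasons they have in common
for (a, span_a), (b, span_b) in combinations(VERSIONS.items(), 2):
    shared = overlap(span_a, span_b)
    if shared:
        seasons = shared[1] - shared[0] + 1
        print(a, "vs", b, shared, seasons, "seasons")
```

For Three Rivers v2 and Veterans v2 this gives 1975 to 2000, the 26 seasons mentioned above; the two Three Rivers versions never overlap, so that pair is skipped.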

I’ve spoken mainly of homeruns in this article, as that category is the one that varies the most between ballparks, ranging from 1.65 for the Polo Grounds 1954-1963 to 0.48 for the Astrodome 1977-1984. Other than the mile high Coors Field with its BABIP factor of 1.15, base hits range from Kansas City’s Municipal Stadium at 1.08 to Milwaukee’s County Stadium at 0.92. Candlestick Park in San Francisco had the highest SO factor at 1.11, while Coors Field is the hardest place to fan at 0.85. The bottom of the SO factor list is populated by the various incarnations of fields in Denver, Kansas City, Atlanta, Pittsburgh, Chicago and St. Louis: almost all of the major league cities away from the coasts and a thousand or more feet above sea level. The theory is that breaking pitches don’t move as much at higher altitudes, where the air is thinner, resulting in higher contact rates, but that’s another article.

In summary

  • Don’t expect more than two decimal places of accuracy.
  • It takes three seasons to get a good homerun factor.
  • Park factors should not change if the park does not change.
  • Team factors are the weighted mean of park factors, and can be applied to individual players’ statistics.
NAME                         ParkID Ver Since Games  SDT  XBH   SI   DO   TR   HR   BB   SO   
Angel Stadium of Anaheim      ANA01   4  1997   812 1.00 0.96 1.02 0.99 0.76 1.01 1.00 0.99
Rangers Ballpark in Arlington ARL02   1  1994  1027 1.03 1.02 1.02 1.02 1.35 1.10 1.00 0.95
Turner Field                  ATL02   1  1997   810 1.01 0.94 1.03 0.95 1.03 0.96 1.00 0.99
Oriole Park at Camden Yards   BAL12   1  2002  1100 0.98 0.89 1.01 0.89 0.70 1.13 1.02 0.97
Fenway Park                   BOS07   7  1956  3965 1.07 1.15 1.03 1.27 1.01 1.02 1.00 0.98
Wrigley Field                 CHI11   7  1956  4006 1.02 0.98 1.02 1.01 0.98 1.19 1.02 0.99
U.S. Cellular Field           CHI12   2  2001   569 0.99 0.96 1.00 0.97 0.80 1.26 1.02 0.97
Great American Ballpark       CIN09   1  2003   406 0.97 0.99 0.97 1.01 0.50 1.24 0.97 0.99
Progressive Field             CLE08   1  1994  1008 1.01 1.02 1.00 1.05 0.78 0.95 1.03 1.00
Coors Field                   DEN02   2  2005   244 1.10 0.97 1.11 1.03 1.24 1.09 0.98 0.85
Comerica Park                 DET05   2  2003   324 1.00 0.93 1.02 0.86 1.56 0.87 0.95 0.94
Minute Maid Park              HOU03   1  2000   648 1.02 1.00 1.02 0.98 1.39 1.18 0.96 1.00
Kauffman Stadium              KAN06   4  2004  7323 1.04 1.08 1.01 1.11 1.21 0.83 1.04 0.92
Dodger Stadium                LOS03   6  2001  7567 0.99 0.89 1.03 0.91 0.61 1.08 1.03 1.03
Land Shark Stadium            MIA01   2  1994  1017 1.00 0.99 1.01 0.95 1.36 0.92 1.06 1.05
Miller Park                   MIL06   1  2001   570 0.98 1.03 0.97 1.02 0.92 1.13 1.04 1.01
Hubert H. Humphrey Metrodome  MIN03   2  1983  1836 1.03 1.09 1.00 1.11 1.28 0.98 1.00 1.04
Shea Stadium                  NYC17   3  1985  1744 0.98 0.95 1.00 0.94 0.90 0.93 0.97 1.02
Yankee Stadium                NYC16   7  1988  1420 0.99 0.94 1.01 0.95 0.73 1.07 0.96 0.99
Oakland Coliseum              OAK01   6  1996   885 0.96 1.01 0.96 0.98 0.89 0.97 0.97 0.96
Citizens Bank Park            PHI13   1  2004   324 1.01 0.96 1.03 0.97 0.96 1.23 0.89 0.97
Chase Field                   PHO01   1  1998   729 1.05 1.06 1.03 1.07 1.60 1.11 1.03 0.92
PNC Park                      PIT08   1  2001   565 1.03 1.01 1.03 1.08 0.77 0.89 0.95 0.92
PetCo Park                    SAN02   2  2006   162 0.94 0.86 0.99 0.77 1.07 0.90 1.00 1.08
AT&T Park                     SFO03   2  2004   325 1.05 0.98 1.05 1.00 1.24 0.87 0.96 0.94
Safeco Field                  SEA03   1  1999   650 0.96 0.96 0.97 0.94 0.76 0.93 1.09 1.07
Busch Stadium III             STL10   1  2006   161 1.01 0.90 1.05 0.91 0.82 0.82 0.97 0.90
Tropicana Field               STP01   2  2001   561 0.99 1.01 0.99 0.97 1.29 0.98 0.98 1.02
SkyDome                       TOR02   1  1989  1320 1.00 1.10 0.96 1.10 1.11 1.10 1.02 1.01
Robert F. Kennedy Stadium     WAS10   3  1971   324 0.97 0.94 0.99 0.90 0.98 0.77 0.88 1.01
kgoldstein
5/24
I liked so much about this piece, as I've often thought about the true accuracy of park factors -- the part about how park factors should not change when the park doesn't was especially good -- but the opening part just really bugged me. I think it's become the norm to look for guys like Buster or Stark or Gammons writing something that people can rip, and it's kind of cheap. Buster was using the numbers to make a point about how many damn homers there have been. Just like you know that it's really not going to end up there, he knows that too.
wcarroll
5/24
I like this, but Brian did what I thought he would. He's very good at the high level stuff, but he only brought it down so far. For a Basics piece, where we're introducing people to a concept or stat, he stayed at a bit too high a level. I see enough of a process here that I think if this were a normal BP article that went through drafts and edits, it could have been tightened up. To use an Idol metaphor, Brian doesn't have that low range that we were looking for, but he's got a voice.
ckahrl
5/24
I was gratified to see someone tackle this topic with this kind of sobriety, but that might be in no small part because of the quick mainstream hysterics, which I felt were dealt with a nicely crafted blend of politeness and dismissiveness; what Brian's addressed here reminded me of the overreactions from among many in the fourth estate to what kind of park the Jays were playing in once they moved into the Skydome. That said, a few spelling errors and a tendency to resort to acronyms bedevil a fine bit of analysis; I'd like a piece labeled "Basics" to spell out 'standard deviation' instead of using 'SD.'
JayhawkBill
5/24
"To use an Idol metaphor, Brian doesn't have that low range that we were looking for, but he's got a voice." Brian Cartwright has one heck of a voice! Every Week One article was good, but Brian tackled an issue currently consuming the baseball writers' attention and he offered unique value to the readers of BP. Perhaps he wrote at a level that could have been misunderstood by some baseball fans, but he wrote at a level easily understood by the readers of BP and he offered that for which we seek, insight missing from sites such as ESPN.com, MLB.com, or our local Sunday newspapers. I gave just one thumbs up this week. Congratulations, Brian: you got my vote. Superb analysis for working on so short a deadline!
blcartwright
5/25
"Brian Cartwright has one heck of a voice!" I have sung backup to two different Grammy Award winners.
Oleoay
5/24
I like Brian's voice, but the two pieces I've read so far by him seem to be more suited to an academia journal, textbook or research paper at a SABR convention. Still, this came across to me as more reader-friendly than the Initial Entry. I think the first half of the article did a good job at explaining the basics to a new reader... I like succinct lines like: "71 homers at Yankee Stadium divided by 49 in the Yankees road games gives a factor of 1.45" and "A stadium having a factor of 1.45 tells us that plays in that park will be increased by 45% over normal rates." I do feel that, besides a change in schedule, there should've been a mention about park factors changing possibly because of stadium renovations or a humidor. Overall, I liked the article.
Oleoay
5/24
This article was one of the five that I gave a thumbs up to on my initial read through.
llewdor
5/25
"...the two pieces I've read so far by him seem to be more suited to an academia journal, textbook or research paper at a SABR convention." I'd count that as a positive feature.
Oleoay
5/25
Positive for the baseball community as a whole, I agree... but I wonder what Brian would do if he was given a month to seriously investigate a specific topic of his own idea and interest. Something new, as opposed to an analysis or revision of existing systems... Instead, he has weekly deadlines and word limits.
rbross
5/24
In light of Will's comments, some of the challenge here could be usefully eliminated by using simpler language and/or by clarifying the terms that you do use (e.g., what does "true value" mean? And you could very quickly define "standard deviation" for those of us who haven't taken a statistics course in more than a decade or at all. And you could note that "SD" refers to standard deviation). But as far as the complexity of your analysis goes, don't change a thing. Personally, I like reading BP articles that challenge me; that's what attracted me to the site in the first place. This is Baseball Prospectus, not the FOX game of the week. Your article seems to be less about park factors per se and more about how and when park factors can be used as a reliable statistic. Couldn't all that be simplified by referring to the "larger sample sizes are needed" mantra? Does it really matter (to the reader) exactly how large, and exactly what kind, of a sample size is needed to properly evaluate park factors? Moreover, is a new BP reader going to care about the reliability of one park factors figure versus another? That might be a petty critique, but you might want to think about not just what your topic is, but why you're writing about it (as opposed to other things). Nonetheless, you have fantastically impressive statistical analysis skills and I'd LOVE to read several more articles by you. Thumbs up!
skiier4384
5/24
I think he's going to end up as this year's Adam Lambert. A bit over the top, which will make him popular with the BP intellectual crowd, but I'm not sure if he broke down everything enough for the average reader.
skiier4384
5/25
*but I'm not sure that he did a great job of breaking down the topic of park factors sufficiently for the average reader.
gersh22
5/25
That moustache deserves a thumbs up
blcartwright
5/25
On Standard Deviation, I did not attempt to explain how it's calculated, but instead how it's used to show a range of possible answers, a measure of uncertainty in the calculation. 'SD' was on the second reference, as I chose to vary my wording. "The chart below shows the standard deviation...If Yankee Stadium still has a homerun factor of 1.45 at the end of the year, with a SD of .149, that means there's a 70% chance the 'true' value is between 1.30 and 1.60, and a 95% chance of it being between 1.15 and 1.75" 'true' being the real underlying value we are estimating, the result we would get with an infinite amount of data. Richard said: there should've been a mention about park factors changing possibly because of stadium renovations "Three Rivers Stadium opened in Pittsburgh in 1970...In 1975, an inner wooden fence was constructed, about 6 feet shorter, creating version 2"
Oleoay
5/25
Ok I must've skimmed the Version section of the article... so make it more obvious :) I also missed where you said "standard deviation" but I see it now. Next time, capitalize "Standard Deviation" or annotate it like "standard deviation (SD)" to make it more obvious what the abbreviation SD refers to.
roughcarrigan
5/25
He's wrong. Fenway Park *did* change between 1978 and 1999. Around 1989, (I might be off by a year) they built what was then called the "600 club" a second deck behind home plate that added about 600 (get it?) new seats set, bizarrely, completely behind plexiglass. Mike Greenwell and other Red Sox outfielders were all quite clear that this addition to the park caused balls to not carry as well as they had before it was constructed. It cut off the prevailing west-southwesterly wind that pushed balls out toward center and left center field in warm weather. Fenway Park did change.
JayhawkBill
5/25
Yes, the "600 Club" and a new press box were added in 1988. From 1967-1987, Batter Park Factors* at Fenway ranged from 118 (1977) to 99 (1986/87) with a median of 107. From 1988-2008, Batter Park Factors* at Fenway ranged from 111 (2007) to 97 (1997) with a median of 105. There's a lot of talk about how the change affected air currents, and I guess that there's maybe a little change, but Fenway is still a hitter's park most years. In terms of the article, though, the HRpf hadn't changed: it was 1.02 both seasons cited. * source Baseball Reference
roughcarrigan
5/25
I'm not sure about the numbers you post. For instance, my ridiculously large Bill James Presents Stats Inc.'s All Time Baseball Sourcebook, has Fenway's run index for 1977 as 137 with a home run index of a Coorstastic 146! Maybe my book's wrong but that's a pretty big difference.
JayhawkBill
5/25
Maybe both your source and my source are correct by the algorithms each choose. That's why I try to cite the external sources I use, and that's why I try to use BP stats when posting here: park factors can vary by definitions such as actual runs scored or runs created, or runs per game or runs per out, as examples.
hotstatrat
5/26
I wonder how often stadiums have changed their foul lines and home plate location that this study missed.
blcartwright
5/25
roughcarrigan - I was not aware of the '600 Club' - it's one of the things that doesn't show up when looking at a list of park dimensions. To be fair and objective, I just went back to my database and created a new version for Fenway from 1988 on. Base hits (babip) was unchanged at 1.07, but all the extra base hits dropped - it does appear that the ball did not carry as well. Doubles went from 1.30 to 1.22, triples from 1.03 to 0.97, and homers from 1.07 to 0.92. I thought it was a good example, and although now struck down, the point stays the same - if the park hasn't changed, the park factor shouldn't. It's the team factor, the weighted mean of all the parks each team plays in, that can change from year to year as the schedule or any one park changes.
roughcarrigan
5/25
Fair enough. I agree that the larger point holds.
Shkspr
5/25
If this is how Brian presents the basics, we're going to need Wolfram|Alpha to get through the competition. Thumbs up if it were just about the analysis, but I don't think the writing is there for a newbie.
SkyKing162
5/25
I really like all of Brian's work, but this was another example of not sticking to The Basics. That being said, I'd rather read and use Brian's non-basic research than most others' non-basic research.
dpowell
5/25
This might just be a factual question for Brian (or anyone else who's calculated park factors before). To adjust for the unbalanced schedule, it looks like you are doing the following steps: (1) Calculate park factors assuming balanced schedule; (2) Adjust the team's road stats based on park factors found in step 1; (3) Recalculate park factors; (4) Iterate for as long as you want (though it stops mattering very quickly). Why do you do it this way? Is this just the traditional way to do it? (I really have no idea.) My bigger point is that, unless I'm missing something, this method gets the wrong answer, doesn't it? Say every stadium is exactly the same, except for Coors which dramatically increases HRs. Team A plays a disproportionate number of games in Coors so their stadium ends up with a park factor of 0.9. Team B ends up with 0.97 (#s are illustrative only - not sure they really make sense). Team A and Team B _should_ end up with the same park factor. But this would never happen. Because A and B are exactly the same, they (on average) have the same numbers of HRs hit in them. But, after your adjustment, it looks like Stadium A has 0.9/0.97 as many HRs as Stadium B. The park factors should definitely converge quickly, but they're converging to the wrong numbers. Am I off-base? I can think of some straightforward ways to get the right numbers (which, honestly, I've always just assumed were used to get PFs) so I'm wondering why this method was used. Thanks!
blcartwright
5/26
I think you may be off base. I stepped through this in Excel to make sure everything worked as I expected. I created four teams, A and B in Division 1, C and D in Division 2. Each team plays the one other team in its division 36 times at home and 36 on the road, and plays the two teams in the other division 18 games each at home and on the road, for a total of 72 home and 72 road. Let's assume we have perfect knowledge. Teams A, B, and C have a home park home run rate of .040 while Team D's home park rate is .060. The mean of all four parks is .045. In real life, we do not know these numbers; all we know is how many home runs were hit by each batter off each pitcher in each ballpark. Traditional factors are expressed as home/road ratios. I am fairly alone in trying to determine 'normal' rates at each park, which is the rate at which a stat will occur if a league average selection of players played there over a long period of time. There's more math than I can ask you to wrap your head around right now.
In our test case, in round one of calculating factors, A and B are both .040 at home and .045 on the road for a factor of .89. C's home rate is .040, but it plays twice as many games at D, so its road rate is .050 for a factor of .80. D's home rate is .060 and its road rate is .040, so its factor is 1.50. C has the exact same ballpark as A and B, so it should have the same factor (.89), not .80.
In round 2, each team's expected road rates are calculated by multiplying the number of games against each opponent by the opponent's home park rate divided by their round 1 factor. A and B don't change, as they are in the other division. C's expected road rate is now .043 for a factor of .94, D's road rate is .048 for a factor of 1.26. In round 1 (raw), C's factor was too low and D's was too high. The new estimate is on the other side of the true value, but closer. Round 3, C's road rate is .046 (true .045), factor .86 (true .89), D's road rate .044, factor 1.37 (true 1.33). Round 4, C's road rate is .044, factor .90, D's road rate .046, factor 1.32. One last time, Round 5, C's road rate is .045, factor .88, D's road rate .045, factor 1.34.
D has a home park rate of .060. If a batter there had an observed rate of .060, a raw (round 1) factor would normalize that batter to .040, but we know league average is .045. After round 2 the batter is rated at .048, round 3 .044, round 4 .046, round 5 .045. Three rounds gets the results to within .001, which is close enough, so let's not waste any more time waiting for the computer to do the extra calculations. In the end, all four teams had an expected road park rate of .045, the same as the mean of the four home parks. You might ask, why not skip this exercise and just use this league average for the road rates? I assigned these values for this test, but in real life we do not know it. Two teams may have identical parks, but A has a lot of boppers while B has all slap hitters, which hides the truth of the park from us. This process is to strip out the players and show us the park.
This particular test shows that the iteration works. Team C played a disproportionate number of road games in ballpark D, causing it to have a different factor than A or B, when we knew that the true value should be the same. Each step of correcting for the road rates brought C's factor closer and closer to A and B. Also, we may think that A, B and C are 'average', while D is the outlier, which is to say that league average is .040, not .045. A, B and C should then have a factor of 1.00, while D's is 1.50. In a real life case where there are many more teams, ballparks and seasons, I believe that the long term league average would approach .040 and A, B and C would come out close to 1.00.
dpowell
5/27
Thanks for the lengthy response. I misinterpreted how you were doing this and it was 100% my fault. I'm guessing you're not responsible for this methodology, but I still don't think it's optimal. Admittedly, I'm surprised that - holding team quality constant - the park factors converge to the correct numbers. I'm not entirely convinced this holds generally, but I'll take it as given for now. I'm less surprised that if you allow heterogeneity in team quality but force all parks to be the same, that this method works fine as well. However, I'm pretty sure that this gets the wrong answer once you let teams have different HR rates and stadia have different park factors. Instead of iterating, there's a pretty straightforward way to get park factors. The problem with unbalanced schedules is that they weight some parks more than others. Just eliminate the implicit weights. Using your example above, I don't need to iterate. Instead of using (A vs C in your example) A=[.04*4]/[2*.04+.04+.06]=.89 and C=[.04*4]/[.04+.04+2*.06]=.8 and then iterating to get C to converge to A, you can calculate the values automatically by just "unweighting" the balanced schedule (and including the park's own HR rate in both the numerator and denominator): A=[.04*4]/[.04+.04+.04+.06]=.89 and C=[.04*4]/[.04+.04+.04+.06]=.89. In other words, you don't want to use a team's aggregate road numbers and aggregate home numbers. Each teamA-teamB matchup is an observation - calculate the ratio and don't weight any matchup more than any other. Just find the park factor for each team-team matchup and aggregate. I'm afraid I haven't explained my point well. I don't think the advantage of what I'm proposing is just to eliminate iterating. By using one aggregate ratio, you can't separately (correctly) identify the team effect and the stadium effect. Instead, you need to separate things by team and by park - this separately identifies each one, allowing you to identify the park factor.
Again, thanks for responding! This is a helpful discussion.
hotstatrat
5/25
Yeah, the swipe at Olney bugged me for the same reason it bugged KG. Then Brian proceeded to tackle park factors the way I would hope they are tackled. This wasn’t so artful as some of the more glib writers, nor was this as basic as requested, but I’m heavily rooting to see more articles from Brian Cartwright.
Oleoay
5/25
I like swipes at Buster Olney. In fact, I wrote a piece for ProTrade comparing him to an agreeable sock puppet... which is a shame because I used to like him.
metty5
5/25
I disagree about all of the newbie talk. Why "dumb" down analysis? If something needs explanation, make a post in the comments section. As a newbie to Brian's writing about a year ago on Fangraphs (it seems like a year, but who knows), I would do that. He answered. I learned. The idea is to challenge the baseball "norms" and get to the truth. Brian is a forward thinker, you don't want to bore your vets for the sake of newbies. The reason the newbies come is to learn. This is a great guy to learn from.
rbross
5/25
I agree completely. I want to read someone who is smarter than I am, who challenges me to become smarter than I think I am, and who advances baseball analysis beyond what we already knew or thought we knew. And Brian *does* stick to the basics--an analysis of the way in which a single statistical measurement, park factors, is computed. His analysis is complicated and could be made more readable with a few slight stylistic changes, but he does stick to the assignment.
rbross
5/25
to clarify: "I agree completely" with you, metty5, not the "newbie talk."
wcarroll
5/25
I can't disagree more. This isn't specific to Brian's article, so please don't take this as criticism of him, but if someone can't be introduced to sabermetrics in a way that's comfortable and even easy, many won't. The reason more people listen to Joe Morgan and Steve Phillips is because they seem authoritative and easily understandable, even if demonstrably wrong. Phillips made a point about comparing Mauer's swing to a figure skater last night that went against the laws of physics, but millions heard it and it makes just enough sense that I bet we'll hear it again (especially in Minnesota!) I never would have found BP if it weren't for Rob Neyer and I think Neyer is the gold standard for the "make something hard seem easy"/Basics niche. If you don't make converts, the revolution never moves forward.
SkyKing162
5/25
I agree with Will here. There's a lack of interesting introductory sabermetric writing out there. Too many people familiar with sabermetrics write ABOUT sabermetrics instead of baseball topics with a hint of saber-ideas. You can write a quality article about even a somewhat advanced topic without using a single acronym or non-whole numbers. Sure, people already into saber thoughts won't be as interested, but they aren't your audience at that point. Neyer and Joe Posnanski are currently the best at this sort of thing. You don't even know they're writing for novices, because they don't frame their writing that way. That being said, I assumed 95% of what's published on BPro isn't intended for newbies, the exception being these Basics articles. People who are already here are willing to take the time to do more research (which can simply be asking a question) when they encounter a new idea. This is why The Basics was a strange topic for week one of BP Idol. These authors all have something unique and mostly original to provide to baseball writing, and asking them to mostly put that aside right away handcuffs them. Yes, it's a good test of writing, but 90% of this contest should be about content, not writing. I like the topic, but maybe save it until after we've had a chance to figure out the authors' approaches for a few weeks. Here's something along the lines of what I was expecting. It discusses why batting average isn't good enough, how we can do better, what OPS is, and gives meaning to the OPS scale. On a slightly advanced note, it shows how various measures of offense correlate to run scoring, and that OPS is better than other options with similar complexity. http://www.redreporter.com/story/2007/7/13/0523/81591
Oleoay
5/26
More people listen to Joe Morgan and Steve Phillips because they are commentators on ESPN's broadcasts and on their webpage. Rob Neyer's only on their webpage and occasionally on TV in a short analyst segment. I don't think it has anything to do with Morgan's and Phillips's communication abilities, or else Jon Miller wouldn't act so blatantly flabbergasted at times.
MattBishoff
5/26
I am with Will and Sky and disagree on this one. The topic of the week was "Basics". Yes, many subscribe to BP for the breakthrough statistical analysis, but that was not the topic this week. IMO, it is unfair to the other competitors to reward someone who didn't "dumb" down their analysis during a weekly topic that is geared towards that sort of article.
blcartwright
5/26
I started off with an example that showed bad use of numbers - don't extrapolate a small sample and put numbers in context. Those are basics for any type of numerical analysis. As for context, I explained that the basic park factor is home divided by road, but that it's best to divide rates and not counting numbers, because you may play more or fewer games at home than on the road. Then I showed how that number is still an estimate, one that has less uncertainty as the sample size gets larger, which is another basic concept of any analysis. Then it goes a little higher, but I think it's important to understand that a park factor only applies to the games in that park, and if you want to use these types of numbers to adjust a player's stats, you have to use a weighted mean of all the parks on the schedule. I do not think that mine is any more advanced than Silver's "Science of Forecasting" or Sheehan's "Stolen Bases", two of the articles given to us as examples from the original 'Basics' series of five years ago. Please check them out.
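The home-divided-by-road idea in the comment above can be sketched in a few lines. This is only an illustration, not the author's actual code: the function names are made up for the example, the home run counts and rounded rates are the ones quoted from the article, and the schedule in the last call is hypothetical.

```python
def raw_factor(home_total, road_total):
    """Park factor from counting numbers: home total divided by road total."""
    return home_total / road_total


def rate_factor(home_rate, road_rate):
    """Park factor from rates, which is not skewed by unequal numbers of games."""
    return home_rate / road_rate


def schedule_weighted_factor(park_factors, games):
    """Weighted mean of park factors over every park on a player's schedule.

    park_factors: park name -> factor
    games:        park name -> games played there (hypothetical input)
    """
    total_games = sum(games.values())
    return sum(park_factors[p] * games[p] for p in games) / total_games


# Counts quoted in the article: 71 HR at home, 49 on the road.
print(round(raw_factor(71, 49), 2))          # about 1.45

# Rounded rates quoted in the article: .064 at home, .043 on the road.
print(round(rate_factor(0.064, 0.043), 2))   # about 1.49 from the rounded rates

# Hypothetical schedule: half the games in a 1.20 park, half in neutral parks.
print(schedule_weighted_factor({"home": 1.20, "road": 1.00},
                               {"home": 81, "road": 81}))
```

The rounded rates give roughly 1.49 rather than the article's 1.48, presumably because the article divided unrounded rates. The weighted mean shows why a factor of 1.20 for the home park works out to an adjustment of only about 1.10 on a full season of stats when the road parks average out to neutral.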
Oleoay
5/26
Writing about the Basics has nothing to do with "dumbing down" analysis. A writer can be just as analytical in an introductory article. I think the difference in an introductory article is that there's a bit more handholding as the writer takes the reader methodically through the process, making fewer assumptions about what a new reader already knows. An Introduction to Calculus class assumes a different set of skills and knowledge than an Introduction to Algebra/pre-algebra class, but that doesn't mean "dumber" calculus or algebra is used.
dogstar30
5/25
"If the park doesn't change, then the park factor shouldn't change." But if HRpf = Home HR / Road HR, wouldn't the denominator (and therefore the pf) potentially change every time *any* park changed? Granted, the change wouldn't be large, but ...
blcartwright
5/25
It's a concept. With that, you define that you will measure a factor between changes in the home park. On the most basic level you measure with home/road. Of course, as I pointed out, if road changes, the ratio will change. You try to counteract that by going back and adjusting the road stats with each road park's factors and rerunning. Once you get the uncertainty of the result under .05, consider it stable. Even with a higher variance, using the factor to adjust a player's stats will be less precise but still fairly reasonable. You just want to make sure that adding another season isn't going to result in wild swings in the calculations. This leads in to adding some amount of league average performance (regression) to moderate, cancelling out extreme results from smaller samples.
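The "adjust the road stats and rerun" step described above could look something like the following. This is a rough sketch under loud assumptions, not the procedure actually used for the article: it assumes each team plays an equal share of its road games in every other park, it takes a single home rate and road rate per team, and it rescales the factors to average 1.0 after each pass. The function name and the three-team league are hypothetical.

```python
def iterate_park_factors(home_rate, road_rate, passes=3):
    """Refine home/road park factors by neutralizing each team's road stats.

    home_rate, road_rate: team -> rate (e.g. HR per opportunity) at home / on the road.
    Assumes a balanced schedule: road games are split evenly among the other parks.
    """
    teams = list(home_rate)
    # Pass 0: the basic factor, home rate divided by road rate.
    factors = {t: home_rate[t] / road_rate[t] for t in teams}
    for _ in range(passes):
        updated = {}
        for t in teams:
            # Average factor of the parks this team visits on the road.
            others = [factors[p] for p in teams if p != t]
            road_environment = sum(others) / len(others)
            # Road rate restated as if it had been compiled in neutral parks.
            neutral_road = road_rate[t] / road_environment
            updated[t] = home_rate[t] / neutral_road
        # Rescale so the league-average factor is 1.0.
        mean = sum(updated.values()) / len(updated)
        factors = {t: f / mean for t, f in updated.items()}
    return factors


# Hypothetical three-team league, for illustration only.
home = {"A": 0.048, "B": 0.050, "C": 0.024}
road = {"A": 0.036, "B": 0.050, "C": 0.033}
print(iterate_park_factors(home, road, passes=3))
```

Each pass shrinks the distortion that comes from a team's road parks not being neutral themselves. The sketch uses a fixed number of passes rather than the uncertainty threshold mentioned in the comment.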
gwguest
5/25
There's something weird to me about ending the piece with a table.
gwguest
5/25
whoops, submitted too soon sorry. Anyway, ending with a table isn't enough to make me not vote, but I didn't feel like the text above the table was strong enough to just give me some data and a pat on the back.
daiheide
5/25
This is a good piece, of the sort that you might find in ESPN's "BP Daily" section. I didn't think Brian was particularly mean to Olney. Olney wrote something and Brian politely pointed out that it was false. BP writers shouldn't trash mainstream writers. But they probably should be pointing out inaccuracies.
hotstatrat
5/25
Olney wasn't false. People who don't have a good feel for the significance of the sample size may be misled by their ignorance. You can't blame that on Olney, he was just dramatizing how many home runs have been hit at the new Yankee Stadium so far.
llewdor
5/25
Despite really liking this piece, I'm left with a nagging question. Why shouldn't the park factor change if the park doesn't change? As long as the park changes relative to the average major-league park, that should be enough to change the park factor, shouldn't it?
jsnell
5/25
For me, this piece fell off track as a Basics article with this phrase: "so let's instead find the ratio of the home HR% (hr/(ab-so)) of .064 to their road game rate of .043, which is 1.48-slightly higher, but basically the same." I get it, I get it, but it's too much for Basics. You also introduce park factor numbers and show us a chart of them before giving us the nut graf about how the stat actually works: "A stadium having a factor of 1.45 tells us that plays in that park will be increased by 45% over normal rates..." -- that's the wrong order. As with many of these articles, this is actually a pretty good piece. Its failings are that, in my opinion, it's just not Basic enough, and makes too many assumptions about a reader's knowledge (standard deviations). Try imagining your mom when you write these Basics articles. Or, if you prefer, some dude sitting next to you at the ballpark who has been watching baseball for years but knows nothing about sabermetrics and thinks that Batting Average is the bee's knees.
wcarroll
5/25
What he said in the last paragraph ... exactly.
hotstatrat
5/26
I guess I don't get out enough, but I didn't know there were still baseball fans like that. I thought we all evolved, at least, to OBA and SlgA. I know they even post OBA on the Jumbotron here in Toronto. I haven't been to a game this year, but as I recall it gets more prominence than Batting Average. I'm not questioning Will, but I am surprised to hear BP gets readers who haven't moved on past Batting Average, HR, and RBI.
wcarroll
5/26
There are lots of people in the GAME who haven't moved past the traditional stats. I can remember Joe Sheehan speaking to a GM at the Winter Meetings a couple years ago about what he was looking for in a power hitter and his immediate response was "Well, he's got to have good RBI numbers."
Oleoay
5/26
The idea that Winter Meetings still exist in this era of Blackberrys/iPods and twittering shows there's still some traditional thinking at play :)
hotstatrat
5/26
Wow. Would you get sued if you told us who it was?
Oleoay
5/26
I'd be more worried about losing the source than about getting sued... and not only that, but if it was revealed Will gave up the source, other sources might be less likely to talk to him too...
SkyKing162
5/26
Right, and that's why the Astros have no hope. ; )
dpowell
5/25
I agree that I think this article ended up overreaching. There are plenty of issues with park factors that could've been explained before (and instead of) starting to calculate "time-invariant park factors." For example, these two sentences really bother me: "After the initial calculation of each park's factors, use those to normalize each team's road statistics and rerun to generate a new version of factors. A third time is even better, but more than that doesn't add any meaningful accuracy." I would've loved to hear more about why this is the accepted method since (as I commented above), I would assume this actually gets the wrong answer. Even if it's not a Basics article, it's always good to start at the beginning and explain why the community does something in a certain way. Instead, this article just glosses over a pretty major adjustment when there was plenty of opportunity to discuss it in detail. General question for anyone: is there an explanation/justification for this method somewhere?
natashaos
5/25
So I just finished reading all of the articles, and this article got my first vote. I will probably vote for 2 or 3 more, although I thought most of the articles were at least two or three wins above replacement. In any case, I thought that this article actually told me something new, and in a convincing way. However, I voted for this article because I thought that it was the best article, in a vacuum. At the same time, it is probably a few levels above a "Basics" article. One of my main problems with American Idol is that people vote based on each week, forgetting past results and future upside. I would rather vote on potential, at least this early in the competition, rather than how closely the week's prompt was followed. I think that Brian is capable of having a Hanley Ramirez-type breakout season, as soon as his manager stops asking him to bunt the runner over. At this point, I'd rather vote for that than the solid singles hitter with limited upside.
hotstatrat
5/26
At some point your prospect has to hit a home run. If he never does and you cut your reliable singles hitter, you're in trouble. I agree Brian seems as though he has the most potential, but he hasn't hit a home run to my reckoning, yet.
mbodell
5/25
I like this article a lot. I think there is a lot of good content there. Constructive criticism: 1. Don't use an acronym without introducing it first (at least not an acronym for something that wouldn't be found in the standard box scores). I wasn't so concerned with SD for standard deviation as I was with something like SDT in the table. I'm guessing it is singles+doubles+triples, but I'm not 100% sure. One tip to make it clearer, so that people don't miss your abbreviations, is to use full words the first time and also include your abbreviation there. So something like "The chart below shows the standard deviation (SD) of the results for each category at each sample size". Now it is crystal clear in the next paragraph that SD is standard deviation. 2. I know a lot of people have moved into component park factors instead of aggregate park factors. However, the guy in the ballpark who is slightly more refined in statistics than a true basic reader, but not yet a full Brian/Tango/MGL type, may have thought about park factors only in terms of runs, or hitters' park versus pitchers' park. It would be nice to either include the run factor of each park and/or explain why components are more important than runs. But touching on it in the tables would be good (unless I missed it and that is what SDT is). 3. I think using Buster's quote as an opening is fair, but one way to keep the tone from seeming too arrogant or snarky would be to credit Olney with intentionally exaggerating for effect and then have you come along to refine the argument to show how it is done for real. I think Olney knows that Yankee Stadium is unlikely to have 400 HR hit there this season, even if he maybe couldn't run a bunch of multiyear regressions. Overall though, the piece is easily better than the average BP article (which I think is the proper strict bar for BP Idol).
hotstatrat
5/26
Point 2 - Right on. I read these late last night and wrote my comments when my brain is normally sleeping, so I forgot to make this point. It would be nice to come up with an OBA multiplier and a Slugging Multiplier for those who don't have time to work out the effect on each component. Of course, it wouldn't be as accurate, but you could base it on league average component rates. I know BP doesn't use OBA and SlgA much, but I figured those are fairly "basic" entry level BP stats.
blcartwright
5/26
1) Valid point on acronyms. I will pay extra attention to that kind of thing. 2) Runs and home runs are probably the two most quoted factors, and they are the ones that vary the most from park to park. Home runs were the 'hook' for this article, and being a component it's a number you can turn around and use for another computation. It's hard to do that with a runs factor and that's mainly why I don't track them, but a runs factor is easy to comprehend. 3) I'm not really familiar with Olney, and I don't assume that he was intentionally exaggerating. Maybe so, but to my reading he showed all the wrong ways to use numbers.
vandorn
5/26
There are two ways to read Olney's comment: a) "Based on the first three games at Yankee Stadium, we can expect 400 home runs to be hit there this year." b) "Is 17 homers in three games a lot? Well, of course it is, but to put it in context, this rate, if sustained over the course of a whole season, would mean 400 home runs would be hit there this season. By way of comparison . . ." The first, as you point out, would be an ignorant use of statistics. But the second way is a reasonable, if not optimal, way to present the information. But the sentence after the Olney quote was labeled a "mistake" without any supporting statements, and a later passage attributes a 250% increase as if Olney had calculated a park factor himself. If Olney in another article called a player a "malingerer" or "clubhouse cancer" based on an interpretation of a quote when a more innocent interpretation existed, he would rightfully be excoriated by BP readers. Calling Olney's analysis a mistake isn't as strong a statement because it isn't a personal attack, but the logic still applies.
molnar
5/26
I liked the piece overall; it gets a definite thumb from me. But it could have stayed on course better - it ends up being at least as much about small sample size as about park factors, which muddies the main message. I don't think that was quite on target for this assignment. If one is in fact reading about park factors for the first time, it would be enough just to digest the fact that there are different park factors for different quantities, and get some idea of what the ranges are. Looking at the differences between the highest and lowest park factors might have sufficed; I wouldn't have used standard deviation at all.
DrDave
5/26
"Home run" is two words; getting that wrong in your first sentence is not a great start. Getting it wrong again later in the article is worse; you can't possibly sound like an expert at that point. One hopes that these would get caught by the editor, but BP is not the gold standard in that regard. Overall... odd mix of detail and generality. Nothing grabbed me in the prose, much as I love making fun of Buster Olney. C+
jdavlin
5/26
I had no idea who this writer was before BP Idol. I gather from some of the comments that Brian has a fine reputation in the baseball analysis community. He sure knows more about math and statistical mumbo jumbo than do I. What he does not seem to possess, based on this and his initial entry, is an ability to write in a manner that appeals to anyone without a degree in math. Putting aside one's feelings with regard to math-heavy analysis, Brian seems to have ignored the rules for this week's entries. The most basic requirement is to "craft an article around one statistic or concept and explain it." If the concept of "Park Factors" is explained in this article, the explanation eludes me.
omalleycat
5/26
While reading all the original submissions and the first week entries for all the contestants, I have wondered if this contest shouldn't have been called America's Next Great Sabermetrician. I originally subscribed to BP because I'm a sucker for Napoleonic War references in transaction analysis. Having said that, I think the criticisms leveled at this article as being not basic enough are misplaced. It seems to me that this is a cautionary tale. For the average baseball fan, first impressions seem to linger, and with every network throwing up the graph of the home runs hit in new Yankee Stadium vs. old Yankee Stadium, the danger of new Yankee Stadium being labeled a hitters' park is pretty great. The author, at least for me, helped put the brakes on that first impression with clear and easy-to-understand statistical analysis.
bsolow
5/26
More important to me was this clear misinterpretation of math in the article itself: "Yankees hitters would be normalized by having their homerun percentage reduced by 22%, and the pitchers increased by 22%." If the new Yankee Stadium increases home runs by 22% relative to other parks, in order to normalize the hitters' home run percentages, you should reduce them by 22%. However, to penalize pitchers by increasing their home runs allowed by 22% because they pitch in a home run-inflating ballpark is clearly adjusting the numbers in the wrong direction. Why should we penalize the Kevin Millwoods of the world for pitching in Texas? Isn't it bad enough being a Kevin Millwood? I liked parts of this article a lot, but I am a stickler for accuracy on the basics. At first glance (and only having read 2 or 3 others), it doesn't get a vote from me.
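The direction of the adjustment is easy to check with a tiny example. This is a sketch only: `neutralize` is a made-up helper, the 1.22 factor is the figure quoted in the comment, and the player rates are hypothetical. It also applies the factor to stats compiled in the home park alone, not to a full season.

```python
def neutralize(observed_rate, park_factor):
    """Restate a rate compiled in one park as a neutral-park rate.

    If the park multiplies home run rates by park_factor, dividing undoes it.
    The same division applies to a hitter's HR rate and a pitcher's HR-allowed rate.
    """
    return observed_rate / park_factor


factor = 1.22                 # factor quoted in the comment above
hitter_home_rate = 0.061      # hypothetical HR rate for a hitter at home
pitcher_home_rate = 0.0488    # hypothetical HR-allowed rate for a pitcher at home

# Both numbers go DOWN: the hitter gets less credit, the pitcher less blame.
print(neutralize(hitter_home_rate, factor))
print(neutralize(pitcher_home_rate, factor))
```

Strictly speaking, dividing by 1.22 lowers a rate by about 18%, not 22%, but the key point stands: the hitter's and the pitcher's home run rates both move in the same direction.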
blcartwright
5/26
Whoops. That was a misstatement. You are absolutely correct. I'm surprised no one (including myself) saw that yet.
RayDiPerna
5/26
The piece started off well but lost my attention. Nevertheless, I have a serious problem with Brian's analysis: ----------------------- "As of this writing on May 20, the Yankees have played 19 games at home, which have seen a total of 71 homeruns, 37 by Yankees hitters, 34 by their opposition. They've played 21 games on the road, with 49 homeruns, 27 by Yankees hitters and 22 by their opposition. 71 homers at Yankee Stadium divided by 49 in the Yankees road games gives a factor of 1.45—indicating the new Yankee Stadium inflates homerun rates 45%. The Yankees have played two more games on the road than at home, so let's instead find the ratio of the home HR% (hr/(ab-so)) of .064 to their road game rate of .043, which is 1.48—slightly higher, but basically the same. Is 20 games, a quarter of a season, enough of a sample size to get a reliable factor?" ------------------------------ This is the wrong question. The problem with the 1.48 number is not that the sample size may be too small, BUT THAT IT IS WRONG. Park factors are calculated using the SAME SET OF OPPONENTS at home and on the road. The Yankees have not played the same set of opponents at home and on the road. Therefore, the calculation is simply wrong.
blcartwright
5/26
In the more advanced version of park factors, yes, of course you hold the opponent constant. That is how I do it, and how I explained it later on: "compare the Pirates and Phillies stats in Pittsburgh with the same two teams stats in Philadelphia. Repeat for every combination of ballpark versions, then compare the total home to road stats for the entire range of years."
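For what it's worth, a matched-opponent version might be sketched like this. It is an assumption-heavy illustration, not the author's implementation: the `Series` record, the `matched_factor` function, and all of the numbers are hypothetical, and opportunities are taken to be AB minus SO as in the article's HR% definition.

```python
from dataclasses import dataclass


@dataclass
class Series:
    park: str          # where the games were played
    teams: frozenset   # the two clubs involved
    hr: int            # home runs by both teams combined
    opps: int          # opportunities (AB - SO) by both teams combined


def matched_factor(park, home_team, series):
    """Home/road rate ratio using only opponents faced both at home and away."""
    mine = [s for s in series if home_team in s.teams]
    home_pairs = {s.teams for s in mine if s.park == park}
    road_pairs = {s.teams for s in mine if s.park != park}
    matched = home_pairs & road_pairs
    if not matched:
        raise ValueError("no opponent has been played both at home and on the road")
    home = [s for s in mine if s.teams in matched and s.park == park]
    road = [s for s in mine if s.teams in matched and s.park != park]
    home_rate = sum(s.hr for s in home) / sum(s.opps for s in home)
    road_rate = sum(s.hr for s in road) / sum(s.opps for s in road)
    return home_rate / road_rate


# Hypothetical data: the Cubs series is ignored because there is no return series yet.
games = [
    Series("PIT", frozenset({"PIT", "PHI"}), hr=10, opps=400),
    Series("PHI", frozenset({"PIT", "PHI"}), hr=15, opps=400),
    Series("PIT", frozenset({"PIT", "CHC"}), hr=50, opps=400),
]
print(matched_factor("PIT", "PIT", games))
```

Dropping unmatched series costs sample size early in a season, which is the trade-off against the simpler home/road ratio discussed in this thread.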
RayDiPerna
5/26
Yes, but I don't think you get anything except gibberish when you calculate a "park factor" with a different set of opponents at home and on the road.
blcartwright
5/26
Agreed. I do it the way you say. I just waited until later in the article to explain it.
RayDiPerna
5/26
Also, I'm all for attacking mainstream writers (though BP doesn't do that as a general rule), but Olney doesn't deserve that criticism here. He was simply pointing out that a lot of home runs have been hit at the Stadium this year. He presented the home run "pace" over a full year only to put that pace in context (comparing it to last year) and demonstrate how high it was; he was not presenting an "analysis" of the park, or alleging that the park would give up that many home runs this year. He used words like "might" and "appears." He said AT THIS PACE there would be 400 homers.
RayDiPerna
5/26
As for Brian not quite being in tune with a "Basics" column, well, perhaps he went a little overboard. But I don't think it's a great idea for BP to dictate the theme of the piece each contestant must write, either. Speaking generally here (i.e., this has nothing to do with Brian), writers have certain skills; they're good at some things and bad at others. Presumably BP would hire the winner to write to his strengths -- e.g., a "notes" column such as Gammons might present; an analysis column such as Nate might present; or commentary such as Sheehan might present. Sure, you want writers who can handle various assignments, but I thought the goal here was to find a talented, interesting writer to add to BP's staff, and I don't see how shoehorning the writers into topic areas they may not be comfortable with furthers that goal. I was hoping to read entries that were chosen exclusively by the contestant according to his or her strengths and interests, instead of forced-theme columns.
tkniker
5/26
When I submitted and knew there would be "themes," I was expecting something more like "No-Hitters," where one had to write a piece in their own voice that was related to no-hitters. It'll definitely be a challenge if what we are requested to do is each write the type of piece that appears in BP. I could see this week's topic "Fantasy" being difficult if someone has really not played Fantasy Baseball at all (or only played it very little). But hey, them's the rules.
SkyKing162
5/26
Yeah, as even Will admits, he's not a big numbers guy. But he sure as hell deserves a spot as a BPro author.
Oleoay
5/27
Will can write (and has written) quality injury articles from a fantasy perspective. Not all fantasy articles need statistics.
RayDiPerna
5/27
Which is why it's kind of silly to have themes. Let the writers show what they can do. Don't shackle them with themes. If BP readers like what they're reading, the contestants will move on. If not, they won't. I really, really don't understand the point of the themes. But whatever.
Oleoay
5/27
Well, the problem is you have to narrow the field by a person each week... so if you don't choose something like "The Basics", then basically people are being judged on their initial entry and on whether they are lucky enough to get a first week topic that they know something about. Also, as I suggested in another thread, if a writer can't introduce and discuss a basic concept, it calls into question whether the writer can do so for more advanced topics. As far as what kind of writer the BP staff wants in the long run, it's honestly a bit hard to tell. Judging by the finalist selection, most of the initial entries had some grounding in statistics. We do know from Will's comments that there were comedy pieces and other non-statistics pieces submitted as well. They apparently _really_really_ wanted someone who conducted an external interview. In the end, though, their finalist selection biased who we would be able to evaluate. Some people agree or disagree with the finalists chosen, some are agreeing with the judges' comments and others aren't... I think overall, they did a good job in selecting the kinds of people we would be interested in. So although BP selected the initial group, at this point I think they merely want a writer that we paying citizens find entertaining/engaging/insightful/thought-provoking.
hotstatrat
5/27
Right. If they are really good, they should be able to find an angle to work each theme in their strength. "Basics" and "Fantasy" are general enough to do that. Furthermore, it wouldn't be a fair contest, if the writers didn't have to come up with something new each week. Some have articles stashed away. I suppose they could find a way to tweak what they may have already written and adapt it to a theme, but that's OK. A good research article most likely takes more than a week to corral and crunch the data.
caprio84
5/26
I thought it was good. Nice intro, and didn't think it was too "advanced" for a newbie...good job.
blcartwright
5/27
thanks, I will analyze this