
If you have ever tried to explain the concept of Pythagorean Record to a baseball novice, you have probably had to answer the following criticism: “That counts the extra runs at the end of a blowout as much as other runs, even though it does not matter whether you win 10-0 or 15-0.” The answer that we give to that criticism is that teams that can take advantage of blowouts have better offenses, and those types of teams will be more likely to win close games in the future. That is the reason that we have thousand-run estimators that try to approximate how many runs a team will score on average, and why we evaluate players with statistics like VORP, which is measured in runs over replacement player. Runs are the building blocks of wins, and you win by scoring more runs than your opponent. We cringe when we hear offenses evaluated by batting average because we know that the goal of offenses is to score runs, not get hits.


The Inning

However, with all of these run estimators that sabermetricians have developed, we often forget the context in which runs are scored: by innings. Teams get to score as many runs as they can before their opponents record three outs; then they get to try again eight more times. That environment, how much you can score before three outs, is the one to keep in mind when we talk about winning games. Nearly a decade ago, Keith Woolner wrote about the link between runs per inning and runs per game, and how well you can predict the frequency of zero-run innings, one-run innings, two-run innings, etc. by looking at how many runs teams score per game.

It is certainly true that the rate of scoring a certain number of runs in an inning and the average number of runs per game are related. In fact, teams that have more variance in their run-scoring per inning also have more variance in their run-scoring per game. This is tricky to show because teams that score more runs also have more variance in the number of runs they score per game. That makes sense: they have a lot of eight-run games, 10-run games, and 15-run games, and they needed enough big innings to put up those run totals, so they are bound to have a higher variance. Simply checking the variance of runs per game against how frequently those teams have big innings would obviously yield a positive correlation. Instead, I needed some way to neutralize the variance of runs per game. I initially tried dividing by the number of runs per game, but that statistic still had a positive correlation with runs per game. I tweaked things until I found a measure of the variance of runs per game that had essentially no correlation (-0.0025) with runs per game. I call it “Adjusted Variance,” or “AdjVar,” and it is defined as follows:

          (Variance of Runs/Game)
AdjVar = -----------------------
           ((Runs/Game) ^1.30)
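
As a rough illustration of the formula above, here is a minimal Python sketch, assuming a team season is available as a plain list of runs scored in each game. The function name `adj_var` and the sample data are hypothetical, and the sketch uses population variance because the article does not say which variance was used.

```python
from statistics import mean, pvariance

def adj_var(runs_by_game, exponent=1.30):
    """Variance of runs per game divided by (runs per game) ** exponent."""
    runs_per_game = mean(runs_by_game)
    if runs_per_game <= 0:
        raise ValueError("team must average more than zero runs per game")
    # Population variance is an assumption; the article does not specify.
    return pvariance(runs_by_game) / runs_per_game ** exponent

# Hypothetical data: two teams that both average 5 runs per game.
steady = [4, 5, 6, 5, 4, 6, 5, 5]
streaky = [0, 12, 1, 9, 2, 10, 0, 6]
print(adj_var(steady), adj_var(streaky))
```

Both invented teams score the same number of runs, so the difference in AdjVar comes entirely from how those runs are spread across games. The exponent is left as a parameter so that other values can be tried against real data.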

Looking at 1998-2008 data for each team (330 team seasons total), I found that this number was slightly positively correlated with the frequency of scoring zero runs in an inning (correlation = 0.097, two-sided p-value = 0.075), significantly correlated with the odds of scoring four or more runs in an inning (correlation = 0.258, two-sided p-value < 0.001), and significantly correlated with the odds of scoring five or more runs in an inning (correlation = 0.292, two-sided p-value < 0.001). That much should not come as a surprise; we predicted that teams that have more variance in their runs-per-inning scoring would have more variance in their runs-per-game scoring.
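
For readers who want to see how correlations of this size translate into significance levels with 330 team seasons, the sketch below converts a correlation into an approximate two-sided p-value. It is not necessarily how the figures above were computed: the function name is hypothetical, and it swaps in a normal approximation for the t distribution, which should be close at this sample size.

```python
import math

def approx_two_sided_p(r, n):
    """Approximate two-sided p-value for a Pearson correlation r over n points.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and then treats t as standard
    normal (an approximation that is reasonable for n in the hundreds).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # erfc(|t| / sqrt(2)) equals 2 * (1 - Phi(|t|)) for a standard normal.
    return math.erfc(abs(t) / math.sqrt(2.0))

for label, r in [("0 runs", 0.097), ("4+ runs", 0.258), ("5+ runs", 0.292)]:
    print(f"{label}: r = {r:.3f}, p ~ {approx_two_sided_p(r, 330):.3f}")
```

Under this approximation the 0.097 correlation comes out a little under 0.08, in the neighborhood of the reported 0.075, while the two larger correlations round to 0.000.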


Run-Scoring Variance and Pythagorean Record

The next step is to check if teams with more variance in their runs per game tend to underperform their Pythagorean records. In fact, this is true: the difference between actual wins and Pythagorean expected wins is negatively correlated with the AdjVar statistic above (correlation = -0.303, two-sided p-value < 0.001). Teams that are more volatile in their rate of scoring runs are going to lose more often than other teams that score a similar number of runs but are not as volatile.
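
The quantity being correlated here, actual wins minus Pythagorean expected wins, can be sketched as follows. The article does not state which exponent it used, so the classic exponent of 2 is an assumption, and the function names and the sample team are hypothetical.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from season run totals."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def wins_above_pythagorean(wins, games, runs_scored, runs_allowed, exponent=2.0):
    """Actual wins minus Pythagorean expected wins (negative = underperformed)."""
    expected_wins = games * pythagorean_win_pct(runs_scored, runs_allowed, exponent)
    return wins - expected_wins

# Hypothetical team: 88 wins in 162 games, 800 runs scored, 700 allowed.
print(round(wins_above_pythagorean(88, 162, 800, 700), 2))  # about -3.75
```

A team like this invented one would count as an underperformer, and the finding above is that such teams tend to have higher AdjVar.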

Now we know that teams that have high variance in their run-scoring by inning have more variance in their run-scoring per game. We also know that teams that have more variance in their run-scoring by game are not as likely to win as teams that put up the same number of runs but without as much of a spread. The next step is to figure out if there is any way to predict which offenses will have less variance in their runs per inning.


Which Offenses Spread Their Runs Around Better

Three years ago, Sal Baxamusa looked at 2006 team-scoring data and used the Weibull Distribution to predict how often they would score a certain number of runs. The Weibull Distribution does a pretty good job at predicting the number of times teams will put up certain run totals, but tends to underestimate how often teams are shut out. This is likely due to the fact that the talent level of pitchers is different, so analyzing how a team scores in general will not take this into account. You face Johan Santana sometimes, and you face Livan Hernandez at others, and Santana might shut you out more often than a model of hitting alone would predict. Baxamusa demonstrated that slugging teams were shut out less often, and also were more likely to score at least three runs in a game than their season run total and the Weibull Distribution would predict. This was useful information, but given the difficulties with the Weibull Distribution and the small sample size of just thirty data points, he was unable to check this in much detail.

By looking at runs per inning, we can look at a much larger sample: there were 477,884 half-innings from 1998-2008. Using this, we can check which types of offenses are more likely to spread their runs around and win more games as a result. The correlations between the odds of scoring at least a given number of runs in an inning and a number of common offensive rate statistics reveal even more evidence for Baxamusa’s suspicion that the teams that score with power are more likely to win than other teams who score similar numbers of runs.

For reference, note that the average team from 1998-2008 only scored in 29 percent of the innings that they played, but they scored two or more 14 percent of the time, they scored three or more six percent of the time, they scored four or more three percent of the time, and they scored five or more one percent of the time.
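
The "at least X runs" frequencies quoted above are simple cumulative shares. A minimal sketch of the tally, assuming the data is a list with one entry per half-inning giving the runs scored in it (the function name and the ten-inning sample are hypothetical):

```python
from collections import Counter

def at_least_frequencies(runs_by_half_inning, max_runs=5):
    """Return {k: share of half-innings with at least k runs} for k = 1..max_runs."""
    total = len(runs_by_half_inning)
    if total == 0:
        raise ValueError("need at least one half-inning")
    counts = Counter(runs_by_half_inning)
    result = {}
    for k in range(1, max_runs + 1):
        # Count every half-inning whose run total is k or more.
        at_least_k = sum(n for runs, n in counts.items() if runs >= k)
        result[k] = at_least_k / total
    return result

# Hypothetical sample of ten half-innings.
sample = [0, 0, 0, 1, 0, 2, 0, 0, 3, 0]
print(at_least_frequencies(sample))
```

Because each frequency counts everything at or above the threshold, the shares can only fall as the threshold rises, which is the pattern in the league figures above.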

Below I list the correlations between the frequencies of scoring at least a certain number of runs in an inning and both on-base percentage and slugging percentage. Note that each of these has a 0.887 correlation with runs per game. You will notice an interesting trend:


At least X
runs/inning     OBP    SLG
1               .822   .872
2               .741   .723
3               .603   .573
4               .746   .716
5               .667   .611

The trend that you probably noticed is that high-slugging teams are more likely to pick up at least a run in an inning, but high-OBP teams are more likely to have big innings. The reason that this is so important is that we have shown that being able to spread your runs around different innings is more valuable than scoring a lot of runs in one inning, in terms of wins and losses, since high variance in run scoring tends to be correlated with underperforming your team’s Pythagorean Record. This means that all of our standard measures of run-scoring are overweighting the contribution of OBP towards winning and underestimating the contribution of SLG towards winning.

The connection can be highlighted even further by using regression analysis to predict the probability that a team scores at least X runs in an inning. I regressed the probability of scoring at least one, two, three, four, and five runs in an inning on on-base percentage and slugging percentage and found the following formulas:

Prob(Scoring at least 1 run)  = -0.154 + 0.659*OBP + 0.526*SLG
Prob(Scoring at least 2 runs) = -0.224 + 0.686*OBP + 0.307*SLG
Prob(Scoring at least 3 runs) = -0.164 + 0.462*OBP + 0.171*SLG
Prob(Scoring at least 4 runs) = -0.090 + 0.235*OBP + 0.094*SLG
Prob(Scoring at least 5 runs) = -0.050 + 0.138*OBP + 0.039*SLG

The important thing to realize when looking at these formulas is that the coefficient on SLG gets smaller relative to the coefficient on OBP as you increase the number of runs per inning. Teams that string together a lot of baserunners are more likely to score by putting up big innings than teams that swing for the fences, who will spread their runs around better.
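
To make the relative weights concrete, the sketch below stores the per-inning coefficients exactly as printed above and computes the SLG-to-OBP coefficient ratio for each threshold. The function names and the .335 OBP / .425 SLG batting line are hypothetical and are not taken from the article's data.

```python
# (intercept, OBP coefficient, SLG coefficient) for scoring at least k runs
# in an inning, copied from the per-inning formulas above.
PER_INNING = {
    1: (-0.154, 0.659, 0.526),
    2: (-0.224, 0.686, 0.307),
    3: (-0.164, 0.462, 0.171),
    4: (-0.090, 0.235, 0.094),
    5: (-0.050, 0.138, 0.039),
}

def prob_at_least(k, obp, slg):
    """Fitted probability of scoring at least k runs in an inning."""
    intercept, b_obp, b_slg = PER_INNING[k]
    return intercept + b_obp * obp + b_slg * slg

def slg_to_obp_ratio(k):
    """How heavily SLG is weighted relative to OBP at threshold k."""
    _, b_obp, b_slg = PER_INNING[k]
    return b_slg / b_obp

# Hypothetical batting line, chosen only for illustration.
for k in sorted(PER_INNING):
    print(k, round(slg_to_obp_ratio(k), 2), round(prob_at_least(k, 0.335, 0.425), 3))
```

The ratio falls from about 0.80 for one or more runs to about 0.28 for five or more, although not perfectly steadily: it ticks up slightly between three runs (0.37) and four runs (0.40). For the assumed batting line, the fitted probabilities land close to the league-wide frequencies of 29, 14, 6, 3, and 1 percent quoted earlier.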

The link remains strong when you look at similar statistics for scoring at least a certain number of runs in a game:

Prob(Scoring at least 1 run)   =  0.673 + 0.336*OBP + 0.387*SLG
Prob(Scoring at least 2 runs)  =  0.238 + 0.815*OBP + 0.814*SLG
Prob(Scoring at least 3 runs)  = -0.205 + 1.46 *OBP + 1.06 *SLG
Prob(Scoring at least 4 runs)  = -0.584 + 2.00 *OBP + 1.21 *SLG
Prob(Scoring at least 5 runs)  = -0.889 + 2.65 *OBP + 1.11 *SLG
Prob(Scoring at least 6 runs)  = -0.973 + 2.51 *OBP + 1.14 *SLG
Prob(Scoring at least 7 runs)  = -0.909 + 2.26 *OBP + 0.974*SLG
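
One way to read the per-game formulas is to plug in two contrasting offenses. The sketch below does that with two invented batting lines, an on-base-heavy team and a power-heavy team. The profiles and function name are hypothetical, and the two lines are not matched for total run-scoring, so this only illustrates the direction of the effect.

```python
# (intercept, OBP coefficient, SLG coefficient) for scoring at least k runs
# in a game, copied from the per-game formulas above.
PER_GAME = {
    1: (0.673, 0.336, 0.387),
    2: (0.238, 0.815, 0.814),
    3: (-0.205, 1.46, 1.06),
    4: (-0.584, 2.00, 1.21),
    5: (-0.889, 2.65, 1.11),
    6: (-0.973, 2.51, 1.14),
    7: (-0.909, 2.26, 0.974),
}

def prob_at_least_per_game(k, obp, slg):
    """Fitted probability of scoring at least k runs in a game."""
    intercept, b_obp, b_slg = PER_GAME[k]
    return intercept + b_obp * obp + b_slg * slg

# Hypothetical team profiles (not from the article's data).
on_base_team = {"obp": 0.350, "slg": 0.400}
power_team = {"obp": 0.320, "slg": 0.440}

for k in sorted(PER_GAME):
    a = prob_at_least_per_game(k, **on_base_team)
    b = prob_at_least_per_game(k, **power_team)
    print(f"at least {k}: on-base {a:.3f}, power {b:.3f}")
```

For these invented lines, the power team has a slight edge at scoring at least one run (about .951 against .945), while the on-base team has the edge at seven or more (about .272 against .243). These are linear fits, so batting lines far from typical team values could produce results outside the 0-to-1 range.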


Conclusion

It is clear that power helps you score frequently, and on-base skill helps you pile on when you do score. In fact, a team’s home runs per at-bat has a 0.15 correlation with the number of wins the team gets beyond what its Pythagorean record predicts. Teams that hit more home runs do better than their Pythagorean Record suggests.

What this means is that power hitters are even more valuable than their VORP suggests. Power hitters not only change the scoreboard, but they change the scoreboard when it matters. The next time somebody tells you that a team is falling short because they rely too much on the long ball, you can reply that they may not rely on it enough.

oira61
8/03
I'm curious: BP never pays attention to batting average, but I wonder what, if any, correlation there is between batting average and outperforming the pythagorean projection. Intuitively, it seems like teams that have a higher percentage of their OBP as hits, rather than walks, will also score single runs in a few more innings than expected.
swartzm
8/03
BP doesn't discuss batting average much, but it does actually account for it in VORP and other statistics-- so we do count a single more than a walk, which you're right to point out we should. Pythagorean Record just uses run totals though, so I think your question would be if batting average is correlated with second order Pythagorean Record. I can't tell you about that but the correlation with regular Pythagorean Record is only 0.038, which is not statistically significant. As far as per inning scoring, it seems that AVG has a similar effect as OBP in that it is more important in putting together big innings than getting one or two runs across, but the effect is not as distinct as OBP's.
jgibson
8/03
Matt- Very interesting analysis. You were my favorite writer in Idol. Glad to see BP is keeping you around. Keep up the great work.
Edwincnelson
8/03
Amazingly well written, and well researched article. I was discussing the reasoning behind the LaRoche-Kotchman trade from the Braves' point of view this morning, and this article certainly sheds some light on how the Braves may have looked at the trade.
crperry13
8/03
Excellent point. This article and your comment explained that trade better than anything I've heard by the talking heads. I'd be surprised if this type of analysis was used though - more likely it was just a money swap.
sunpar
8/03
Agreed. I love theory-based articles with instant application.

Great job on this, Matt.
beeker99
8/03
This is really interesting stuff, Matt. I wonder if this could, in part, account for the Yanks usually finishing ahead of their Pythag record, especially during the later Torre years. IIRC, the Yanks were usually among the league leaders in home runs those years . . .
swartzm
8/03
It's possible that played some role, but I think that the commonly cited reason is that the late 90s Yankees had such a great bullpen that they did not let close games get away. I would imagine that their power-hitting helped them get runs in those close games, too, but looking back they also seemed to have pretty great OBP's those years as well. Interesting application, though. Thanks.
dethwurm
8/03
Thank you for this. I'd been wondering about something along these lines for awhile now, but was never able to organize it into a cohesive idea, much less figure out how to investigate it.

Are you writing for BP regularly now? I've really enjoyed your articles the last few weeks.
swartzm
8/03
Yup, writing weekly here. Thanks!
Vyse0wnz
8/03
This is exactly what BP needs. Swartz for President!
thegeneral13
8/03
Interesting stuff. I would like to see a little more exploration of the negative correlation between variance and performance vs. pythagorean record. Intuitively, good teams should benefit from low variance while bad teams should benefit from high variance. If Team A scores 6 runs a game with zero variance and Team B scores 5 runs a game with zero variance, Team A will win every game. If you add variance to either team, it will hurt Team A's winning percentage and help Team B's. As variance becomes infinite Team A's winning percentage will approach 50%. I'm not sure if I understand why your findings would suggest that low variance is uniformly good. I realize this is relative to the pythag model, but if the pythag model is a good approximation in aggregate, then I would think that a team with a positive run differential should outperform if it has below average variance while a team with a negative run differential should outperform if it has above average variance. Maybe if you divided the sample into two cohorts--teams that outscored their opponents and teams that were outscored--the results would look different. Just a thought.

I was also curious about the adjusted variance metric that you created. You mention that it is uncorrelated with runs scored, but is there any pattern to the residuals? If so it could introduce some bias. It seems like it would be better to regress variance in runs scored against runs scored and use the residuals as the predictors of the difference between wins and pythag predicted wins. Again, just a thought.

Like the article overall. Thought-provoking = good.
swartzm
8/03
I see your point, but it might help for me to explain a little bit more about the variance-- specifically, it's not normally distributed. The distribution for all teams has a very large right tail, so more variance almost always means that you outscore your opponents by a lot in blowouts but does not really change much otherwise. So even below average teams prefer more variance. If it were normally distributed, your point would be dead on and I see what you're saying.

I played around a little with dividing the sample into two cohorts as you suggest, but the problem then is that I have biased variables-- teams that outscored their opponents already have positive run differentials, so they are statistically more likely to fall short of their pythagorean record because we have eliminated teams above .500 with negative run differentials from our subsample but not the opposite.

I think I understand what you mean about the adjusted variance metric, but the results were strong enough that alternative specifications didn't really change anything. I like the approach, though. I think in general the distribution of runs scored with the long right tail probably explains the issue that you are looking at anyway, maybe?
thegeneral13
8/03
Thanks for the reply, Matt.

I suppose runs scored would follow a lognormal rather than normal distribution, since they are a product of multiple independent events. If you charted log(runs scored) I bet you'd get a normal distribution (long right tail would be gone), and the variance of that distribution would be more indicative of true volatility in run scoring for a team (wouldn't be skewed by blowouts). That would probably be the best method to come up with an adjusted variance metric. Then I'd regress that against the difference between wins and pythag expected wins. The analysis becomes a little more esoteric (what is a logrun?), but the conclusion would be stronger, I think.

If you take out the skew with that methodology I'd be interested to see what happens. My intuitive guess is that you'd see variance have little predictive value for the overall population, but significant predictive value for the individual cohorts, i.e. variance is good for the bad teams, and bad for the good teams. But I could be wrong. In fact I hope I'm wrong because that would be more interesting.

I'm not sure I understand the second paragraph of your response. I'm saying divide the population into "+ run diff" and "- run diff" and redo the regression of "wins minus expected wins" vs. "adjusted variance" for each of those cohorts. If you divide them based on run differential (as opposed to actual winning percentage) you shouldn't have any bias. In other words, it shouldn't be any easier or harder for a team in the "+" cohort to outperform or underperform its pythag than for a team in the "-" cohort.

Thoughts? Thanks again for the reply.
swartzm
8/04
It's actually not quite a log-normal distribution-- Baxamusa cited a math professor named Stephen Miller that showed it was Weibull. Obviously the two have similar shapes, but the issue is how they treat zeros.

I'm having trouble interpreting a "logrun", as well, and so I'm not sure what the point would be of transforming the variance. Instead, I think it's best to highlight the fact that some teams score a bunch of extra runs as blowouts, and ignoring that will be an issue if this tendency is more likely for certain offenses.

I'm pretty sure that the bias would come in from the fact that there are four subsamples of the population-- those over .500 with + run diff, those under .500 with - run diff, those over .500 with - run diff, and those under .500 with + run diff. By subsetting the groups into + run diff and - run diff, you create a + run diff group with teams over .500 and under .500, and so you are including teams with a + run diff under .500 but not teams with a - run diff and over .500. The residuals will be biased, because some negative residuals will have been eliminated.
markpadden
8/10
I don't follow your explanation. Additional volatility (variance) in per-game scoring will *always* help the inferior team's chances of beating an opponent. Run scoring does not have to be normally distributed for this to be true.
KaiserD2
8/03
I have been studying these issues for decades, albeit without such statistical sophistication, and I have a simpler (although not contradictory) explanation for this finding.

Teams that exceed their Pythagorean projection are doing well in close games--I believe that is a given. Most close games are low-scoring games. In low scoring games, as Bill James realized 20-30 years ago writing about the World Series, long-sequence offenses (with high OBA) do poorly and slugging teams do better. So yes, slugging teams would be expected to outperform long-sequence teams in close games.

Comments?

David Kaiser
oneofthem
8/03
you've touched on an important and neglected topic here. good work
Vyse0wnz
8/04
My only complaint about the article is that it seems to be missing a thesis. In the middle of reading it, I was asking myself, "What is he trying to prove?" That could be easily solved, and the article was great nonetheless, but it was definitely a bit disconcerting as a reader.
oneofthem
8/04
refining the underlying metrics-runs-(wins) relationship that should be the macro level holy grail of baseball analysis
marioreturns66
8/04
Great stuff Matt. I'd love to see it peer-reviewed, but really good job.

One thing I'd be curious about is how the run environment affects the big inning-->W% correlation. I.e. in a very low-run environment, does that still hold? Same with a very high-run environment. And does OBP still lead to bigger innings in these different envts?
swartzm
8/04
Interesting thoughts, thanks. I'm not sure I have enough data on hand to answer that very well. I checked the correlations year by year from 1998 to 2008, after reading your comment, for W-XW and AdjVar and got (all values negative): .12, .07, .10, .40, .42, .32, .35, .40, .43, .25, .50. Since run scoring was a little higher in 1998-2000 than 2001-2008, there does seem to be some tendency for less correlation in higher run scoring environments, and the correlation for 1998-2008 combined is more negative in the NL than in the AL (-.34 vs -.27), so that's maybe a little more evidence towards a lower correlation in higher run scoring environments. I'm not sure that this is all that conclusive because of the sample size (30 teams per year for the 11 correlations above), but maybe the higher average runs/game gives more room for variance on the low side.
bucswin611
8/04
Clearly the best of the Idol writers...great read!
WaldoInSC
8/04
This is outstanding research. In time it will have far-reaching consequences refining Pythagorean accounting. Encore!
SkyKing162
8/04
Great stuff. Something I'd been meaning to get to, but never did.

The next step would be to take RPG and variance of RPG for different style offenses and defenses and see which ones maximize winning games. In general, consistent offenses and inconsistent defenses make for better games.

I'd also like to see things broken down a bit more. Instead of OBP and SLG, how about AVG, ISO, and BB%?
swartzm
8/04
Glad to help with that :-) I checked those too, but wasn't very persuaded that the results helped much.

correlation BB% AVG ISO OBP SLG
%inng w/1+R .47 .75 .73 .82 .87
%inng w/2+R .45 .65 .59 .74 .73
%inng w/3+R .37 .53 .46 .60 .57
%inng w/4+R .42 .68 .57 .75 .72
%inng w/5+R .40 .60 .47 .67 .61

It seems that AVG tends to follow a similar pattern as OBP, and ISO tends to follow a similar pattern as SLG. BB% tends to look like SLG because of their correlation, I think (0.36).
Oleoay
8/04
I know that at BP, they often use third order wins to use the elements of run scoring and prevention to predict a team's success. I don't know what formula they use, but I would be interested to see if they weight SLG higher or OBP lower.

Another tangent to explore would be to see if "good bullpens" tend to be those that tend to allow less home runs/extra base hits, or those that tend to allow less baserunners.

Also, I wonder how well park factors correlate with this kind of concept, since some parks have low effects on OBP and higher effects on SLG, and vice versa.
swartzm
8/04
1st order record uses RS/RA and 2nd order record uses EqR for the team and its opponent; 3rd order record improves on 2nd order record by adjusting for difficulty of opponents. EqR is a run estimator in that it approximates the RS/RA of a team in a neutral setting over the course of the season. The point of the article was that run estimators like this will estimate RS/RA well, but that there is an additional factor to account for in terms of biased distributions of RS/RA.

To the extent that bullpens can control hits on BIP, it would be interesting to test their effects, but the effects would be pretty muted.
Oleoay
8/04
While I understand that the point of the article had to do with run distribution, there seemed to be a corollary that teams that get extra base hits/home runs tend to get shut out less in a single game than teams that get on base. The idea is that, on a given day, teams with better slugging don't need to chain as many events together to score runs.

That's why I was wondering how much SLG affected W-L record predictions and if an increased understanding of this principle would refine W-L predictions. Also, whether bullpens that reduced SLG led to an increased chance of winning. Also, home teams that play in parks with high SLG park factors might have more variance in their W-L records.

Am I off the wall?
swartzm
8/04
You are correct. The implication of the article is that SLG could refine W-L predictions a little bit better. If you had two equal quality bullpens in terms of Fair RA or something like that, but one was relatively more capable at reducing SLG and the other was relatively more capable at reducing OBP, I suppose that bullpen might fare a little bit better. Keep in mind that pitchers generally control K%/BB%/GB% and not much else. I guess the implication would be that a bullpen with higher BB% but higher K% and higher GB% might do a little at reducing SLG? That's probably true but the magnitude of that effect is likely pretty small. Certainly an interesting point, though.

The park factor thing should not matter because all games are played by two teams that are in the same stadium at the same time. The park would spread out or condense their run scoring equivalently.
Oleoay
8/04
Pitchers generally control K/BB/GB, but my understanding was that pitchers also appear to have an influence on home run rate too... so the question is whether HR rate allowed is a better indication of suppressing an opposing team's ability to score in any given inning than BB/9, for example.

As an example, assume you are managing the visiting team in extra innings. If the home team scores a run, you lose immediately. If you had two relievers (ignore handedness and platoon) to choose from, with similar VORP/WXRL/etc., one of which has a slightly better rate at reducing OBP (but gives up more home runs) and the other has a slightly better rate at suppressing SLG (but gives up more walks), would it be better to use the pitcher with the better SLG-suppressing skills or the OBP-suppressing skills? My instinct is that it would be better to use the one who suppresses SLG, since a pitcher could give up multiple walks/hits but still get the three outs.

As far as park factor goes, it might not matter much within the context of that game.. but if your team is in a park that allows more SLG, your team's W-L record might vary more from expected W-L based on Pythagorean Method since the increased chance of a "big inning" might cause more fluctuation. Perhaps, along those lines, those who play in so-called pitcher's parks have teams with records closer to their expected win-loss record. This might also have the added advantage of being better able to evaluate individual player performance.
anderson721
8/04
Okay, I'm puzzled. The Angels are a high OBP, average HR team, and they blow away pythag annually. Is it all bullpen and baserunning?
swartzm
8/04
It's mostly good bullpen. Although the OBP vs. SLG tendency is going to be true on the whole, it's not going to be the deciding factor for every team. The Angels have had a very good bullpen for several years, and that will definitely be a larger effect on Pythagorean Record. Additionally, the Angels OBP and SLG ranks aren't as lopsided as you may think, but they definitely lean a little towards the OBP side.
Scherer
8/04
Remember, too, the SLG is not really about HRs, or even power for that matter, as batting average is a significant component of SLG. Ichiro currently has a SLG similar to Curtis Granderson, but that doesn't mean he's a comparable HR hitter.

The Angels rank 10th in the AL in HR rate, and only 7th in ISO, but 3rd in SLG.
llewdor
8/04
I really enjoyed the article, but what I enjoyed most was that link to Woolner's article from 2000. That's the sort of thing I'd like to see more of from BP. I want more statistical heavy-lifting.
jdseal
8/04
The research and the premise behind this article were outstanding. Great to see you on the team, Matt. I'm a professional statistician so I actually understood most of this, but I felt that, by usual BP standards, there was more jargon and less explanation, and I fear more of the readers were lost. It's such good work, I hated to see that happening.
jessehoffins
8/05
Haha, you have two conflicting comments at the end here; some want more data some suggest less jargon. Good luck finding the balance. (I was taught stats by another penn econ phd, bob tayon, maybe you know him, following wasn't too hard)

The article was interesting. Questions I had: Any multicollinearity with runs per inning and runs per game, stemming perhaps from the correlation between high variance in runs scored (i.e. lots of high run innings) and additional high run innings coming in the same game? As you pointed out, to miss the context of the runs within the game is somewhat problematic. Easy ways I can think to get started would be testing for the correlation between scoring x runs in an inning and scoring in more than 29 percent of the remaining innings. After all, pitch counts, poor relievers, etc. Not sure what the implications of this might be.

Other things, aren't slugging and obp really highly correlated(A really quick attempt to prove that led me to realize that I have no interns or data sets. shoot)? Can you still do that last trick and compare the probability of scoring x+ runs per inning over many x's and say things about the coefficients relative to one another? What are the p values on those regressions like? Could you also run those regressions on the probability of scoring x runs per game, not x plus, that way ranges of runs scored could be aggregated, and the correlation for scoring one run in a game and slugging would give something negative perhaps.
swartzm
8/05
Wow, I can't believe you know Rob Tayon! He's a great friend of mine! I see that you call him Bob, which I guess means you've known him for a while, because I think he mostly goes by Rob now, at least among people he knew in grad school we all called him Rob. That's very cool.

Anyway, happy to respond to your questions...

--I'm not sure what you mean about multicollinearity with R/Inn and R/G. I didn't use the two variables in the same regression at any point, so I'm not sure what the problem would be.

--I don't have the data to compare the probability of scoring a lot of runs in an inning and of scoring a lot of runs in subsequent innings. I'm sure it's positively correlated, though, just because pitchers' ERAs show some persistence and park effects exist too. I'm not sure if that would complicate the results at all, though.

--OBP and SLG are very highly correlated, actually about 0.75 on a team level. But that's okay, because they aren't collinear. It's okay to run a regression on both.
--Running the regression for individual runs per inning would yield the following coefficients:

obp coef/slg coef
0 R: -404/-1259
1 R: -14/291
2 R: 310/188
3 R: 323/107
4 R: 137/77

That actually seems to strengthen my case, since the OBP has a negative (statistically insignificant) coefficient on the probability of scoring one run. The OBP coefficient for 0 R is also insignificant, and the SLG coefficient on 3 R is only weakly significant (p=.073). Everything else is strongly statistically significant (p<.004).

--The p-values on OBP and SLG coefficients for each of the x+ inning regressions are all 0.000.
jessehoffins
8/05
I should have said Professor Tayon. We didn't say either rob or bob, so it looks like I guessed wrong. Ahem.

I used the wrong word, I didn't mean multicollinearity exactly. I meant to ask specifically about the relationship between the variance of runs scored in an inning and the variance of runs scored in a game, or predictions about total runs scored in the game. This would be important to understand in the debate on OBP/SLG. If a team is set up to be as likely as possible to score a run in an inning, can that help them score more later in the game significantly? Does scoring 2 runs in an inning have a greater effect on scoring at other points in the game than 1? I don't think pitcher ERAs or team offensive levels are grounds for tossing it out necessarily. Certainly for the team offense you could normalize with some success against the year's average results. Obviously it could turn up meaningless results, but if they were significant in some way, there are a number of different narratives that could be spun.

-That's cool regarding individual runs per inning. Thanks for doing that. That coefficient on slugging for one run is huge, and is a complex mathematical way of describing why you walk Albert Pujols in close games (not last night though).

-As for the OBP/slg correlation, what happens if you drop one of the variables from the regression? Isn't that a good test for multicollinearity even though the variables aren't collinear. Apologies to Professor Tayon if I fail to grasp these concepts fully.
swartzm
8/05
I'm not sure I understand what you're saying. Certainly the coefficients would change if I took one out or the other. But I'm not trying to develop a production function for runs, just developing a way to look at correlations of two variables at the same time.

There certainly is a positive correlation with being able to score a lot and being able to score at all. That's why I normalized the variance of runs/game to find a scoring-level-neutral way of talking about more variance in runs/innings and runs/game. The key was only to link those two, so that I could justify looking at things on a per inning basis.
jessehoffins
8/05
I'm just wondering if your coefficients might be deceptively significant given the high correlation between the two. If by dropping one variable you found p stats changing dramatically there might be something wrong with the coefficients on the full regression.
dpowell
8/05
This would work the other way. If they're highly-correlated, then this should drive the standard errors for both variables up (towards insignificance).
swartzm
8/05
Thanks. Yeah, that's what happened when I just ran them again on obp and slg individually in single regressions. They became higher and more significant because their coefficients in single regressions now picked up both of their effects.
jdseal
8/08
Wow, this discussion is so wonderfully geeky I can hardly contain myself. I wish more of my colleagues (who, professionally-speaking SHOULD be able to have these kinds of discussions) could follow all that. Anybody here interested in a job analyzing marketing and survey data?
Oleoay
8/09
I might be interested, depending on location. I'm getting a little winded from Denver's altitude and am open to options. My background is in business intelligence.
Oleoay
8/06
Ok so here's a question.. is there a break-even point on OBP or on SLG where, if playing for one run and that one run will win the game and there is currently a runner on second base, and assuming that each other person in the lineup is a league-average hitter, it's better for a batter to bunt to advance a runner instead of trying to swing for extra bases?

For example, someone with a .350 OBP would get on base 35% of the time and thus advance a runner. However, they might have a 60-65% chance (or higher) to advance the runner with a bunt (costing an out). Let's flesh out his stats and say he is a .250 hitter with a .400 SLG. Does that make him a good enough hitter where, playing the probabilities, he has a better chance of advancing (or scoring) the runner if he swings than if he bunts?