
Baseball Prospectus' Director of Technology Harry Pavlidis will be chatting with readers Thursday at 1 p.m. ET. If you have any questions after reading this overview of Deserved Run Average, ask them here.

Introduction
Earned Run Average. Commonly abbreviated as ERA, it is the benchmark by which pitchers have been judged for a century. How many runs did the pitcher give up, on average, every nine innings that he pitched? If he gave up a bunch of runs, he was probably terrible; if he gave up very few runs, we assume he’s pretty good.

But ERA has a problem: it essentially blames (or credits) the pitcher for everything, simply because he threw the pitch that started the play. Sometimes, that is fair. If a pitcher throws a wild pitch, he can’t blame the right fielder for that. And if a pitcher grooves one down the middle of the plate, chances are that’s on him too. Not too many catchers request those.

However, most plays in baseball don’t involve wild pitches or gopher balls. Moreover, things often happen that are not the pitcher’s fault at all. Sometimes a pitcher throws strikes that the umpire incorrectly calls balls. Other times he induces grounders that his infielders aren’t adept enough to grab. And still other times, a routine fly ball leaves the park on a hot night at a batter-friendly stadium.

ERA doesn’t account for any of that. It just tells us, in summary fashion, how many runs were “charged” to the pitcher “of record.” And so, a starting pitcher who departs with a runner on first gets charged with that run even if the reliever walks the next three batters. The same starter would get charged if the reliever makes a good pitch, but the shortstop can’t turn a double play. And none of these runs count at all if they are “unearned”— an exclusion by which the home team’s scorer decides whether a fielder demonstrated “ordinary effort.”

The list of problems goes on. Pitchers who load the bases but escape are treated the same as pitchers who strike out the side. Pitchers with great catchers get borderline calls. Guys who can’t catch a break for months show immense “improvement.” Guys who are average one year wash out the next. ERA, in short, can be a bit of a mess, particularly when we have only a few months of data to consider.

The problem is this: We know which runs came across the plate, but we can’t tell, just from ERA, which runs were actually the pitcher’s fault. What we need is a reliable way to determine which runs the pitcher deserves to be charged with. That is the challenge we took on in creating Deserved Run Average (DRA).

The Search for an Alternative
Baseball researchers have spent the past few decades trying to figure out a better way to measure pitcher quality. Voros McCracken is popularly credited with discovering that pitchers have varying (and often little) control over the results of balls put in play. Running with that theme, Tom Tango proposed the metric of Fielding Independent Pitching, or FIP. FIP looks only at a pitcher’s home runs, strikeouts, hit batsmen, and walks. From these four statistics alone, FIP can account for almost 50 percent of the variance in runs allowed by pitchers each year. At the same time, most plays in baseball do not involve a strikeout or home run. And so, many researchers have tried to improve on FIP’s formula.
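For readers who like to see the arithmetic, here is a minimal sketch of FIP in its commonly published form. The weights (13, 3, and 2) are the standard ones; the additive constant changes every season to put FIP on the league's ERA scale, so the 3.10 below is only an illustrative assumption, as is the stat line.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching in its commonly published form.

    `constant` rescales FIP to the league ERA and changes every season;
    3.10 is an illustrative assumption, not an official value.
    `ip` should be true decimal innings (one out = 1/3 of an inning).
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical stat line: 20 HR, 50 BB, 5 HBP, 200 K in 200 innings.
example_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=200.0)
```

Nothing in the formula looks at balls in play, which is exactly the gap the rest of this article is about.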

A few years ago, our former colleague Colin Wyers (now employed by the Houston Astros) thought he had a better solution. Labeled Fair Run Average (abbreviated “FAIR RA” or “FRA”), Colin’s approach tried to adjust for, among other things, what he considered to be a “fair” number of actual innings pitched, and assigned a “fair” number of runs allowed for each pitcher as a result.

Unfortunately, Fair Run Average has not succeeded. While some of its assigned values make sense, others do not. Many researchers have noted what appears to be a bias in Fair Run Average against pitchers who generate a lot of groundballs—a skill generally thought to be desirable. Fair RA just has not caught on, and, more importantly, our understanding of the tools for measuring baseball performance has advanced since the time Fair RA was conceived.

Today, we are transitioning to a new metric for evaluating the pitcher’s responsibility for runs that crossed the plate. We call it Deserved Run Average, or DRA. Leveraging recent applications of “mixed models” to baseball statistics, DRA controls for the context in which each event of a game occurred, thereby allowing a more accurate prediction of pitcher responsibility, particularly in smaller samples. DRA goes well beyond strikeouts, walks, hit batsmen, and home runs, and considers all available batting events. DRA does not explain everything by any means, but its estimates appear to be more accurate and reliable than the alternatives. As such, DRA allows us to declare how many runs a pitcher truly deserved to give up, and to say so with more confidence than ever before.

Deserved Run Average
As you may have noticed, we are introducing DRA (and its underlying components) in two articles. This article provides an overview of these new statistics and is meant to be approachable for all readers. The second article, entitled DRA: An In-Depth Explanation, discusses in detail the inner workings of DRA for our readers who enjoy such things.

So, as an overview, here is what DRA does, step by step:

Step 1: Compile the individual value of all baseball batting events in a season.

When a batter steps into the box, a number of different events can ultimately occur. These range from a strikeout to a single to a double play to a home run. Over the course of a season, those events each, as a category, tend to result in an average number of additional (or fewer) runs. For example, a home run on average results in about 1.4 runs, because sometimes there are runners on base and sometimes there are not. By the same token, a double play tends to cost a team about three-quarters of a run. Although a double play can sometimes allow a run to score (such as when there happens to be a runner on third with no outs), it far more often ends the inning or empties the bases with no runs scored.

In the world of baseball statistics, the average seasonal value of these events is known as a “linear weight.” To understand the ultimate effect of the batting events, we first must assign the typical value of those events. So, DRA begins by collecting every single baseball batting event in a given season and assigning the average linear weight for the outcome of that play.
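As a rough illustration of the idea (not the actual DRA code), a linear weight can be sketched as the average run value observed for each type of event. The plays below are made up, and the run value attached to each one is assumed to have been computed already as the change in run expectancy plus any runs scored.

```python
from collections import defaultdict


def average_run_values(events):
    """Average the observed run value of each event type.

    `events` is an iterable of (event_type, run_value) pairs, where
    run_value is assumed to be the change in run expectancy plus the
    runs that scored on the play.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event_type, run_value in events:
        totals[event_type] += run_value
        counts[event_type] += 1
    return {etype: totals[etype] / counts[etype] for etype in totals}


# Made-up plays, for illustration only.
sample_events = [
    ("HR", 1.0), ("HR", 2.0), ("HR", 1.2),
    ("DP", -0.5), ("DP", -1.0),
]
weights = average_run_values(sample_events)
```

With these toy inputs the home run averages out to about 1.4 runs and the double play to about minus three-quarters of a run, in line with the values quoted above.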

Step 2: Adjust each batting event for its context.

Once we have the average value of each play in a season, we start making our adjustments. Home runs depend, among other things, on stadium, temperature, and the quality of the opposing batter. Ball and strike calls tend to favor the home team. The likelihood of a hit depends on the quality of the opposing defense. The pitcher’s success depends on how far he is ahead in the count, and both a catcher’s framing ability and the size of the umpire’s strike zone help get him there.

So, DRA next adjusts for the average effect of these factors beyond the pitcher’s control in each plate appearance, using what is known as a linear mixed model. These environmental factors include:

  • The overall friendliness of the stadium to run-scoring, accounting for handedness of the batter (using our park factors here at Baseball Prospectus);
  • The identity of the opposing batter;
  • The identity of the catcher and umpire;
  • The effect of the catcher, umpire, and batter on the likelihood of a called strike (e.g., framing / umpire strike zone, from 1988 onward);
  • The handedness of the batter;
  • The number of runners on base and the number of outs at the time of the event;
  • The run differential between the two teams at the time of the event;
  • The inning and also the half of the inning during which the event is occurring;
  • The quality of the defense on the field for each individual play (assessed through BP’s FRAA[1] metric);
  • Whether the defense is playing in their home stadium or on the road;
  • Whether the pitcher is pitching at home or away;
  • Whether the pitcher started the game or is a reliever; and
  • The temperature of the game at opening pitch (from 1998 onward).

There are two other aspects that affect how DRA scores pitchers.

First, rather than grade pitchers purely on the number of outs, like ERA does, DRA grades them on the basis of each plate appearance. Thus, pitchers who escape a bases-loaded jam are no longer treated the same as pitchers who retire all three batters they faced, simply because they both got three outs.

Second, DRA judges pitchers on the run expectancy of each play, rather than the runs that happen to cross the plate. If, for example, our hypothetical starter from earlier put a man on first and then was replaced, he would not be penalized the entire run if the reliever subsequently allowed that player to score. Rather, he would be penalized only by the likelihood that said player would have scored from first base on average, with the reliever getting charged the difference between that average likelihood and the full value of the run if it scores. Likewise, when a starter loads the bases, but the reliever gets the team out of it, the reliever doesn’t simply get credit for an out or two. Rather, he gets a bonus for all of the runs that were expected to score from a bases-loaded situation in an average situation, but didn’t. In this regard, true “stopper” relievers get more fairly recognized for their accomplishments, and we more accurately forecast their “deserved” runs allowed.
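The bookkeeping for an inherited runner can be sketched in a few lines. This is a simplified illustration rather than DRA's implementation: the function name is hypothetical, and the 40 percent chance of scoring from first base is an assumed figure, not a measured one.

```python
def split_inherited_run(p_score_at_exit, runner_scored):
    """Split responsibility for one inherited runner.

    p_score_at_exit: assumed probability that the runner scores from the
        base/out state in which the first pitcher left him.
    runner_scored: whether the runner eventually scored.
    Returns (first_pitcher_charge, reliever_charge) in runs.
    """
    if not 0.0 <= p_score_at_exit <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # The departing pitcher owns the expected cost of the situation he left.
    first_pitcher_charge = p_score_at_exit
    # The reliever owns whatever happened relative to that expectation.
    actual = 1.0 if runner_scored else 0.0
    reliever_charge = actual - p_score_at_exit
    return first_pitcher_charge, reliever_charge


# Hypothetical: a runner on first scores about 40 percent of the time.
starter, reliever = split_inherited_run(0.40, runner_scored=True)
stranded = split_inherited_run(0.40, runner_scored=False)
```

When the runner scores, the two charges add up to the full run. When he is stranded, the reliever's charge is negative, which is the "bonus" described above.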

The DRA component that emerges from all these adjustments is value/PA: the average value of each plate appearance which the pitcher completed during the season.

Step 3: Account for base-stealing activity.

Understanding the average weight of a batting event is essential, but run-scoring also depends on who happens to be on the base at the time. Billy Hamilton is much more likely to score when on base than Billy Butler, all other things being equal. Certain pitchers also hold runners better than others. A runner who is afraid of being picked off will have fewer steal attempts. Runners who stay closer to the base should have a harder time scoring. And runners who are thrown out trying to steal are erased from the basepaths entirely.

To account for these situations, and provide some insight into the effect of baserunning on each event, we created two additional statistics: one looking at base-stealing success and one looking at the frequency with which baserunners attempt to steal bases. They are both (potentially) part of DRA, but are also useful in and of themselves.

We’ve also made an effort to make these statistics more approachable. Because we are looking at how pitchers compare to other pitchers in controlling baserunners, we are describing these stats as Swipe Rate Above Average (SRAA) and Takeoff Rate Above Average (TRAA).

Swipe Rate, as its name implies, judges each participant in a base-stealing attempt for his likely effect upon its success. Using a generalized linear mixed model, we simultaneously weight all participants involved in attempted steals against each other, and then determine the likelihood of the base ending up as stolen, as compared to the involvement of a league-average pitcher, catcher, or lead runner, respectively.

Stated another way, Swipe Rate allows us to evaluate how good Yadier Molina’s arm is while controlling for the inherent ability of his pitchers to hold runners and the quality of the runners he is facing on base. Likewise, we evaluate the ability of individual pitchers to hold runners while controlling for the possibility that they may be throwing to a catcher with a subpar arm. And for baserunners in particular, we now have something much more accurate to evaluate their base-stealing ability than base-stealing percentage.

Remember that base-stealing percentage, by itself, is not very useful: using straight percentages, an elite base-stealer who swipes 90 percent of his attempts and tries to steal 40 times a year ranks lower than a catcher who had one lucky steal all year (and therefore has a 100 percent base-stealing percentage). In the same way that Controlled Strikes Above Average (CSAA) controls for the effect of other factors on catcher framing, Swipe Rate Above Average regresses baserunners’ steal-success rates against both themselves and others to provide a more accurate assessment of each participant’s effect on the likelihood of a stolen base.
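To see why regression toward the average matters, consider a much simpler stand-in for the mixed model: blending each runner's record with a block of league-average attempts. The league success rate of 70 percent and the weight of 20 "phantom" attempts are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def regressed_success_rate(successes, attempts,
                           league_rate=0.70, prior_attempts=20):
    """Shrink a raw steal-success rate toward the league average.

    league_rate and prior_attempts are illustrative assumptions; this is
    a simple shrinkage estimate, not the model behind SRAA.
    """
    if attempts < 0 or successes < 0 or successes > attempts:
        raise ValueError("need 0 <= successes <= attempts")
    blended_successes = successes + league_rate * prior_attempts
    return blended_successes / (attempts + prior_attempts)


elite_stealer = regressed_success_rate(36, 40)  # 90 percent on 40 tries
lucky_catcher = regressed_success_rate(1, 1)    # 100 percent on one try
```

After shrinkage the elite base-stealer comes out well ahead of the catcher with one lucky steal, which is the ordering a raw percentage gets backwards.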

The factors considered by the Swipe Rate are:

  • The inning in which the runner was on base;
  • The stadium where the game takes place;
  • The underlying quality of the pitcher, as measured by Jonathan Judge’s cFIP statistic;
  • The pitcher and catcher involved;
  • The lead runner involved.

Because the statistic rates pitchers above or below average in preventing stolen bases, average is zero, and pitchers generate either positive (bad) or negative (good) numbers. In 2014, here were the pitchers who were hardest to steal a base on:

| Name | Swipe Rate Above Average (SRAA) |
| --- | --- |
| Hisashi Iwakuma | -3.86% |
| Kyle Kendrick | -2.57% |
| Corey Kluber | -2.53% |
| Todd Redmond | -2.47% |
| Madison Bumgarner | -2.45% |
| Jake Odorizzi | -2.27% |

And here were the pitchers baserunners exploited the most last year:

| Name | Swipe Rate Above Average (SRAA) |
| --- | --- |
| Jake Arrieta | +2.84% |
| Roberto Hernandez | +2.75% |
| Phil Hughes | +2.24% |
| Tom Wilhelmsen | +2.19% |
| Yu Darvish | +2.14% |
| Drew Hutchison | +2.08% |

The model for TRAA (Takeoff Rate Above Average) is similar to SRAA, but more complicated. With Takeoff Rate, we don’t care whether the baserunner actually succeeds in stealing the base; what we care about is that he made an attempt. Our hypothesis is that base-stealing attempts are connected with the pitcher’s ability to hold runners. When baserunners are not afraid of a pitcher, they will take more steps off the bag. Baserunners who are further off the bag are more likely to beat a force out, more likely to break up a double play if they can’t beat a force out, and more likely to take the extra base if the batter gets a hit.

Takeoff Rate stats consider the following factors:

  • The inning in which the base-stealing attempt was made;
  • The run difference between the two teams at the time;
  • The stadium where the game takes place;
  • The underlying quality of the pitcher, as measured by Jonathan Judge’s cFIP statistic;
  • The SRAA of the lead runner;
  • The number of runners on base;
  • The number of outs in the inning;
  • The pitcher involved;
  • The batter involved;
  • The catcher involved;
  • The identity of the hitter on deck;
  • Whether the pitcher started the game or is a reliever.

Takeoff Rate Above Average is also scaled to zero, and negative numbers are once again better for the pitcher than positive numbers. By TRAA, here were the pitchers who worried baserunners the most in 2014.

| Name | Takeoff Rate Above Average (TRAA) |
| --- | --- |
| Bartolo Colon | -6.09% |
| Lance Lynn | -5.91% |
| Hyun-jin Ryu | -5.82% |
| Adam Wainwright | -5.75% |
| T.J. McFarland | -5.17% |
| Nathan Eovaldi | -5.17% |

And here were the pitchers who emboldened baserunners in 2014:

| Name | Takeoff Rate Above Average (TRAA) |
| --- | --- |
| Joe Nathan | 9.60% |
| Tim Lincecum | 9.41% |
| Drew Smyly | 8.80% |
| Tyson Ross | 8.08% |
| A.J. Burnett | 7.61% |
| Juan Oviedo | 7.55% |

Current 2015 ratings for Takeoff Rate Above Average are on our leaderboards. We don’t have enough data yet to release Swipe Rate Above Average, but we expect to have enough to work with in another month or so.

Step 4: Account for Passed Balls / Wild Pitches.

Under baseball’s scoring rules, a wild pitch is assigned when a pitcher throws a pitch that is deemed too difficult for a catcher to control with ordinary effort, thereby allowing a baserunner (including a batter, on a third strike) to advance a base. A passed ball is assigned when a pitcher throws a pitch that a catcher ought to have controlled with ordinary effort, but which nonetheless gets away, also allowing a baserunner to move up a base. The difference between a wild pitch and a passed ball, like that of the “earned” run, is at the discretion of the official scorer. Because there can be inconsistency in applying these categories, we prefer to consider them together.

Last year, Dan Brooks and Harry Pavlidis introduced a regressed probabilistic model that combined Harry’s pitch classifications from PitchInfo with a With or Without You (WOWY) approach. RPM-WOWY measured pitchers and catchers on the number and quality of passed balls or wild pitches (PBWP) experienced while they were involved in the game.

Not surprisingly, we have updated this approach to a mixed model as well. Unfortunately, Passed Balls or Wild Pitches Above Average would be quite a mouthful. Again, we’re trying out a new term to see if it is easier to communicate these concepts. We’re going to call these events Errant Pitches. The statistic that compares pitchers and catchers in these events is called Errant Pitches Above Average, or EPAA.

Unfortunately, the mixed model only works for us from 2008 forward, which is when PITCHf/x data became available. Before that time, we will rely solely on WOWY to measure PBWP, going back to 1988, which is when pitch counts were first tracked officially. For the time being, we won’t calculate EPAA before 1988 at all, and it will not play a role in calculating pitcher DRA for those seasons.

But, from 2008 through 2014, and going forward, here are the factors that EPAA considers:

  • The identity of the pitcher;
  • The identity of the catcher;
  • The likelihood of the pitch being an Errant Pitch, based on location and type of pitch, courtesy of PitchInfo classifications.

Errant Pitches, as you can see, has a much smaller list of relevant factors than our other statistics.
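The "above average" part of the statistic can be pictured as actual errant pitches minus expected errant pitches, per pitch. The sketch below is a deliberate simplification: it takes the per-pitch likelihoods as given (hypothetical numbers here) and does not attempt the mixed model that separates the pitcher's share from the catcher's.

```python
def errant_rate_above_expected(pitches):
    """Actual minus expected errant pitches, per pitch thrown.

    `pitches` is a list of (p_errant, was_errant) pairs, where p_errant is
    an assumed likelihood that a pitch of that type and location gets away,
    and was_errant records whether it actually did.
    Positive results mean more errant pitches than expected (bad).
    """
    if not pitches:
        raise ValueError("need at least one pitch")
    expected = sum(p_errant for p_errant, _ in pitches)
    actual = sum(1 for _, was_errant in pitches if was_errant)
    return (actual - expected) / len(pitches)


# Four hypothetical pitches with runners on, one of which got away.
sample_pitches = [(0.01, False), (0.02, False), (0.05, True), (0.02, False)]
rate = errant_rate_above_expected(sample_pitches)
```

The sample is tiny, so the resulting rate is far larger than anything in the tables that follow; over a full season of pitches the differences shrink to fractions of a percent.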

In 2014, the pitchers with the best (most negative) EPAA scores were:

| Name | Errant Pitches Above Average (EPAA) |
| --- | --- |
| Carlos Carrasco | -0.405% |
| Ronald Belisario | -0.403% |
| Jesse Chavez | -0.392% |
| Clay Buchholz | -0.380% |
| Felix Doubront | -0.378% |
| Daisuke Matsuzaka | -0.375% |

And the pitchers our model said were most likely to generate a troublesome pitch were:

| Name | Errant Pitches Above Average (EPAA) |
| --- | --- |
| Masahiro Tanaka | +0.611% |
| Jon Lester | +0.541% |
| Matt Garza | +0.042% |
| Dallas Keuchel | +0.334% |
| Drew Hutchison | +0.327% |
| Trevor Cahill | +0.317% |

Step 5: Calculate DRA (Deserved Run Average).

We’ve now got our components, so it is time to calculate each pitcher’s DRA. Here are the steps we follow:

First, we put all of our identified components—value/PA, Swipe Rate Above Average, Takeoff Rate Above Average, and Errant Pitches Above Average—together into a new regression, this time looking for their combined effect on run expectancy.[2] We added two more variables that struck us as relevant: the percentage of each pitcher’s plate appearances that came as a starter versus as a reliever (we call this Starter Pitcher Percentage, or SPP) and the total number of batters faced. That gives us a total of six potential predictors for each pitcher to come up with their DRA for a season. We regress these using a method known as “MARS.” If the detail interests you, we invite you to enjoy the In Depth article, which discusses it further.

Second, to smooth out season-to-season variation, and to tease out the most accurate connection between these variables and runs allowed, we actually train our model on the previous three seasons. From this we derive the most accurate connection between our potential predictors and actual runs allowed by pitchers in the current run environment.

Finally, we take the connections determined by our model and use them to calculate each pitcher’s DRA for the current season: his Deserved Run Average per nine innings. DRA does not distinguish between earned and unearned runs, because that distinction can be arbitrary and over the course of a season it tends to obscure rather than reveal differences between pitchers. We therefore adjust DRA so it is on the scale of Runs Allowed per nine innings (RA/9) rather than Earned Run Average (ERA). We understand that ERA is what many of you are used to, but once you get over that, you’ll be much happier.

We do ensure that, in converting runs per plate appearance to runs per nine innings, we use each pitcher’s individual ratio of batters-faced to innings pitched, rather than just a league average. This allows us to credit the pitchers who are most efficient, and avoid over-crediting pitchers who are putting baserunners on and getting lucky with the outcome. Pitchers in the latter category do not “deserve” the lower runs-allowed numbers they might (temporarily) be putting up.
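The conversion itself is simple arithmetic, sketched here with made-up numbers. The function name and both stat lines are hypothetical; the point is only that the same per-batter run rate turns into a different per-nine figure depending on how many batters a pitcher needs to get through an inning.

```python
def runs_per_nine(runs_per_pa, batters_faced, innings_pitched):
    """Convert a per-plate-appearance run rate to runs per nine innings,
    using the pitcher's own batters-faced-to-innings ratio.

    innings_pitched should be true decimal innings (one out = 1/3).
    """
    if innings_pitched <= 0:
        raise ValueError("innings pitched must be positive")
    batters_per_inning = batters_faced / innings_pitched
    return runs_per_pa * batters_per_inning * 9


# Two hypothetical pitchers with the same 0.10 runs per batter faced.
efficient = runs_per_nine(0.10, batters_faced=760, innings_pitched=200.0)
traffic_heavy = runs_per_nine(0.10, batters_faced=880, innings_pitched=200.0)
```

The pitcher who faces 3.8 batters per inning comes out at 3.42 runs per nine, while the one who faces 4.4 comes out at 3.96, even though their per-batter rates are identical.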

What It Means
So there you have it: DRA, explained. Most of you really don’t care how we got there; you just care that DRA will be easy to look up and be a good evaluator of pitcher performance. In both respects, you are in luck.

As for the first issue, past DRA is available on our leaderboards right now. In-season DRA during 2015 will be calculated each night after the previous day’s games have concluded. You will be able to use DRA not only to put past pitching performances in context[3] but also to monitor the value of pitchers as we progress through the 2015 season, and beyond. As with our other statistics, DRA will be available for you to download and use for your own comparisons and work.

As for the second issue, rest assured that your time spent reading this article was not in vain. DRA does a very good job of measuring a pitcher’s actual responsibility for the runs that scored while he was on the mound—certainly better than any metric we are aware of in the public domain. And only DRA gives you the assurance that a pitcher’s performance is actually being considered in the context of the batter, catcher, runners on base, as well as the stadium and stadium environment in which the baseball game occurred.

The detailed explanation of DRA’s effectiveness is saved for the accompanying In Depth article. But since you’ve made it this far, we’ll give you the Reader’s Digest version. There are two measures that we pay particular attention to in evaluating the accuracy of a new metric.

First, we look at how close, mathematically, the metric’s prediction is to the actual number of runs allowed with the pitcher on the mound. If the pitcher actually allowed four runs per nine innings, we test our alternative metric by how close it comes to that same number. The most commonly used calculation that does this is called the Root Mean Square Error or RMSE.

The second test looks at how accurately the metric ranks the various pitchers relative to each other. Why do we care about rank? Because we know that all pitcher run estimates are a bit off from their actual runs allowed, and more so early in the season. So as a check, we test whether it is at least ranking the pitchers correctly relative to each other. In other words, if the metric can’t estimate runs allowed down to the exact decimal point, the least it can do is tell the difference between Max Scherzer and Ricky Nolasco. This second approach is called the Spearman Correlation.
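Both tests are standard calculations, and a minimal sketch using only Python's standard library looks like this. The three-pitcher sample is made up for illustration; Spearman correlation is computed as the ordinary (Pearson) correlation of the two lists of ranks, with tied values sharing their average rank.

```python
import math


def rmse(estimates, actuals):
    """Root mean square error between estimated and actual values."""
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared) / len(squared))


def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(estimates, actuals):
    """Spearman correlation: the Pearson correlation of the two rank lists."""
    if len(estimates) != len(actuals) or len(estimates) < 2:
        raise ValueError("inputs must be the same length, with 2+ values")
    rx, ry = average_ranks(estimates), average_ranks(actuals)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(rx, ry))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in rx))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (spread_x * spread_y)


# Made-up RA/9 figures for three pitchers, for illustration only.
metric_estimates = [3.0, 4.0, 5.0]
actual_ra9 = [3.5, 4.5, 5.5]
error = rmse(metric_estimates, actual_ra9)
rank_agreement = spearman(metric_estimates, actual_ra9)
```

In this toy case the metric misses every pitcher by half a run, so the RMSE is 0.5, yet it orders all three correctly, so the rank correlation is perfect. That is why the two tests are worth reporting side by side.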

To judge DRA’s accuracy, we’ll compare it to the leading brand: FIP. We know FIP does a reasonable job of predicting a pitcher’s actual runs allowed in a season. Does DRA do a better job than FIP? It does.

We compared how well FIP and DRA predicted each pitcher’s RA/9 in each of the past four major-league seasons. We looked at their performance with all pitchers, and then two subsets: pitchers who faced at least 170 batters (about the workload of an established major-league reliever, or 40 IP), and pitchers who faced at least 660 batters (about 162 innings, which is a qualified major-league starter).

We then averaged the results over four seasons (2011–2014) to get a consistent (and recent) picture of each metric’s performance. Here is how it ended up:

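| Metric | Minimum BF | RMSE (lower is better) | Spearman Correlation (higher is better) |
| --- | --- | --- | --- |
| FIP | 0 | 3.59 | 0.64 |
| FIP | 170 | 1.14 | 0.70 |
| FIP | 660 | 0.67 | 0.72 |
| DRA | 0 | 2.65 | 0.76 |
| DRA | 170 | 0.96 | 0.76 |
| DRA | 660 | 0.54 | 0.78 |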

DRA is consistently superior to FIP at all sample sizes. By accounting for the context in which the pitcher is throwing, DRA allows us to determine which runs are most fairly blamed on the pitcher. DRA is particularly effective with smaller samples. Even for pitchers with only a few batters faced, DRA is already separating the good pitchers from the bad with superior accuracy.

In the end, of course, we are not satisfied simply to have brought you DRA. In addition to being useful in and of itself, DRA has become the new foundation of Pitcher Wins Above Replacement Player (PWARP) here at Baseball Prospectus. By integrating DRA into WARP, we can do a better job than ever of evaluating how much value individual pitchers delivered to their teams, both during the current season and as compared to past pitchers in other seasons and eras. The new PWARP figures featuring DRA are also available on the leaderboards, under the column “DRA_PWARP.”

Just for fun, here are the 25 best qualified starters by DRA over the past 25 years. You’ll note that in some cases their DRA basically matches their RA/9; in others, it does not. Our position, of course, would be that when DRA and RA/9 disagree, you should go with DRA, as it tells you how well the pitcher really pitched. Without further ado:

| Rank | Season | Name | DRA | RA/9 |
| --- | --- | --- | --- | --- |
| 1 | 2000 | Pedro Martinez | 1.03 | 1.87 |
| 2 | 2004 | Jason Schmidt | 1.23 | 3.30 |
| 3 | 1997 | Pedro Martinez | 1.49 | 2.46 |
| 4 | 1995 | Greg Maddux | 1.55 | 1.62 |
| 5 | 2004 | Randy Johnson | 1.64 | 3.29 |
| 6 | 2009 | Zack Greinke | 1.80 | 2.54 |
| 7 | 2009 | Tim Lincecum | 1.87 | 2.79 |
| 8 | 2013 | Jose Fernandez | 1.89 | 2.48 |
| 9 | 2013 | Max Scherzer | 1.90 | 3.09 |
| 10 | 2013 | Matt Harvey | 1.93 | 2.33 |
| 11 | 2013 | Clayton Kershaw | 1.93 | 2.14 |
| 12 | 2007 | Erik Bedard | 1.98 | 3.29 |
| 13 | 2011 | Justin Verlander | 2.01 | 2.66 |
| 14 | 1997 | Roger Clemens | 2.04 | 2.26 |
| 15 | 2004 | Johan Santana | 2.05 | 2.79 |
| 16 | 1992 | Curt Schilling | 2.11 | 2.63 |
| 17 | 1995 | Randy Johnson | 2.14 | 2.79 |
| 18 | 2011 | Josh Beckett | 2.15 | 3.07 |
| 19 | 2009 | Chris Carpenter | 2.17 | 2.32 |
| 20 | 2003 | Pedro Martinez | 2.17 | 2.52 |
| 21 | 2014 | Clayton Kershaw | 2.18 | 1.94 |
| 22 | 2009 | Josh Johnson | 2.20 | 3.38 |
| 23 | 1997 | Greg Maddux | 2.23 | 2.29 |
| 24 | 1992 | Juan Guzman | 2.27 | 2.84 |
| 25 | 2002 | Curt Schilling | 2.27 | 3.24 |

One caution: DRA is not (presently) adjusted for run-scoring across different eras. Rather, it is adjusted to the average runs-allowed by the league for that season. So, please don’t directly compare Pedro’s DRA of 1.03 in 2000 to somebody else’s DRA in 1985 or some other season.[4] A DRA metric that compares players across eras will be coming soon.

A second caution: DRA corrects for what is known as survival bias: the tendency of better pitchers to pitch more innings in a season. Applying the full DRA model early on can result in some extreme values. To avoid that, we will keep the model simple at first during the season, and model only value/PA to RE24. As we get further along, we’ll allow the full model to operate and achieve the best explanation of each pitcher’s performance.

Conclusion

We are excited about DRA, as well as the other statistics we have introduced: Swipe Rate Above Average (to measure base-stealing success), Takeoff Rate Above Average (to measure base-stealing attempts), and Errant Pitches Above Average (to measure passed balls and wild pitches).

Three final things to remember.

First, while DRA accounts for a great many things, DRA doesn’t need to be complicated for fans. DRA is on our leaderboards. Just look up the pitcher(s) that interest you, and you’ll have the best estimate of how good they’ve been in a particular season. If you want to leave the details to us, feel free.

Second, remember that DRA was created to evaluate past performance. If you want to project future performance of a pitcher, use PECOTA. And if you want to evaluate how talented the pitcher is regardless of his performance to date, use cFIP, which is also on our leaderboards. In fact, cFIP is in the same table as DRA so you can compare recent results with the likelihood of future improvement (or decline).

Finally, DRA is now the foundation for Pitcher Wins Above Replacement (PWARP) here at Baseball Prospectus. For the time being, if you want to see how many wins a pitcher has been worth in a particular season, check (you got it) our leaderboards and the column DRA_PWARP. (WARPs that appear on pitchers' player pages remain, for now, the old, FRA-based WARPs.) We’ll change the description to plain old WARP once people have gotten used to the new idea.

We welcome your comments, and hope you find DRA as useful as we do.

Special thanks to Rob McQuown for research assistance; to Rob Arthur, Rob McQuown, and Greg Matthews for their collaboration; to Stephen Milborrow for modeling advice; and to Tom Tango and Brian Mills for their review and insights.


[1] Please note that FRAA (Fielding Runs Above Average) is different than FRA (Fair Run Average), which we are replacing for everyday purposes with DRA.

[2] We use RE24/PA: the average effect of the pitcher on run expectancy per batter faced over the course of a season.

[3] We have populated DRA back to 1953.

[4] In fact, don’t try to suggest Pedro’s 2000 season is comparable to what anyone else has ever done at any time. It’s probably very unfair to the other player.

antonsirius
4/29
Only two of the top six in TRAA are left-handed? That's surprising
roarke
4/29
Once the statistic is available that compares pitchers across eras, I would be interested in seeing a leaderboard with which pitchers have the highest difference (both positive and negative) between DRA and RA/9.
joseconsuervo
4/29
I've been doing exactly this sort of thing for the last hour now. So cool to see this stuff. The thing I found most interesting was looking at the difference between FIP and DRA. There are some large discrepancies. Also fun to look at was the players where FIP predicted an increase in ERA and DRA a decrease, and vice versa. I really want xFIP in my spreadsheet instead of FIP but am way too lazy and excited to put that dataset together.
TroJim
4/29
This is awesome. Thanks.
mikedee
4/29
This is truly next level. Congratulations on this, I can't wait until this is widely available for all current and former players.
MylesHandley
4/29
Ok, I think I have THE MOST TRIVIAL question of all time. You say that you take gametime temperature into account (measured at start of the game). Does the average time per pitch then play a role? If we take temperature to be an important factor, then it stands to reason that pitchers that take longer to throw the ball (and extend the game) will allow for greater changes in gametime temperature (which will have some effect on conditions, which will have some effect on DRA). This could have an effect in the ten thousandths (maybe. probably not). Another comically trivial question is with regards to takeoff rate. You use cFIP as a general measure of pitcher skill, but another component that might not be measured (perhaps it is and I don't know) is speed of arsenal. Is that wrapped up in "the pitcher involved?" Seems like it's easier to steal on someone who throws 40% offspeed stuff than someone who throws 10% offspeed stuff, and it seems like someone who throws 98 mph is easier to steal off of (generally) than someone who throws 89 mph. Thanks for this awesome work.
bachlaw
4/29
Hi Myles, Certainly the temperature changes throughout a game and with more granular data we would have more information. But, simply accounting for the opening temperature makes a difference and there probably is some uniformity in how the temperature tends to decline over the course of a game, which might be captured in our by-inning controls. As for velocity, that is certainly part of it. But that is part of who the pitcher is, and if he has a good or bad rating in the takeoff rate, his velocity is by nature accounted for, just like his pickoff move or other aspects of how he pitches.
bloodface
4/29
Thanks for the reply, Jonathan. I'm curious if you are weighting a temperature of 80° in Arlington as roughly the same as 80° in, say, Seattle or Minneapolis? I agree that just providing some kind of measurement of weather is better than nothing at all, but I'm curious as to how humidity is being accounted for, as it does affect how a ball travels, and humidity-to-temp ratios vary across the country.
bachlaw
4/29
Right now we are sticking with temperature as that is something easy that is officially recorded data. Things like humidity and other aspects may be the subject of future work, although remember we are including stadiums as a control factor, so any relevant tendencies they have in the humidity department probably are being considered to some extent.
morro089
4/29
Once again being granular, but wouldn't stadium factors already have an "average season" temperature ingrained in their numbers? So DRA is doubling down on temp unless it's comparing that day's temp to the average temp.
bachlaw
4/29
Stadium factors do have an average temperature baked in, but their average temperature in April is different than their average temperature in August. The model told us that we explained more variance by including stadium and daily temperature both, which makes sense.
markpadden
4/29
No, that makes no sense. The only way you should be using temperature as an input is by how much it differs from the stadium norm. Why would you look at raw temps after adjusting for park?
rnolty
5/01
Seems to me you're trying to take two variables that are correlated (temperature and stadium) and transform them into two that are not (delta_temp and stadium). Makes some intuitive sense but it doesn't really create new information. I don't know regression well enough to know if that really helps. Doesn't the regression somehow "take account" of the correlation of input variables?
markpadden
5/28
It dilutes the information. If you try to make a linear coefficient for temp. across all stadiums (in addition to a generic park adjustment), it will be much less informative (contain more error) than if you did it by stadium. Same for humidity and wind vectors.
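To make the "temperature versus stadium norm" idea in this sub-thread concrete, here is a minimal sketch of how a raw game-time temperature could be re-expressed as a deviation from its park's average. The function name, the stadium codes, and the temperatures are all made up for illustration; nothing here is taken from the DRA model itself.

```python
from collections import defaultdict

def temp_vs_park_norm(games):
    """Return (stadium, raw_temp, delta_from_park_average) for each game.

    `games` is a list of (stadium, temp_f) pairs. The park average is
    computed from the games supplied, which is an assumption of this sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])  # stadium -> [sum of temps, game count]
    for stadium, temp in games:
        totals[stadium][0] += temp
        totals[stadium][1] += 1
    norms = {s: total / count for s, (total, count) in totals.items()}
    return [(s, temp, temp - norms[s]) for s, temp in games]

# Made-up sample: the same 80 degrees is cool for one park and warm for the other.
sample = [("ARL", 90), ("ARL", 80), ("SEA", 80), ("SEA", 60)]
deltas = temp_vs_park_norm(sample)
```

In this toy sample, 80° comes out five degrees below the norm in the first park and ten degrees above it in the second, which is the distinction the raw temperature alone does not show.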
markpadden
4/29
Humidity is an under-rated variable. Hint: the sogginess of the ball far outweighs the lighter air in humid conditions. Also, wind?
morro089
4/30
I imagine wind would be hard to capture given the structure of a park is a big factor in direction. Temp is a "simple" website query. And according to Jonathan, "double" counting the temp is better ("reduces noise") than only using park factors so it still has value...but yeah, not all of it. But do we know what the advantage of 80 deg vs 70 deg is and how it would affect the park factor? I have a feeling this isn't as easy as we think it is and they decided to invest their time into the context research instead of possibly insignificant temp differences.
markpadden
5/13
Temperature is huge. Ask anyone who bets baseball totals for a living.
JohnChoiniere
4/29
"There probably is some uniformity in how the temperature tends to decline over the course of a game" I would agree if you separate day/night games.
bcavers
4/29
Maybe I am having a reading comprehension fail, but I don't understand this line: "One caution: DRA is not (presently) adjusted for run-scoring across different eras. Rather, it is adjusted to the average runs-allowed by the league for that season." Doesn't the latter accomplish the former?
bloodface
4/29
I'm not positive, but I do not believe so. Adjusting to within the season does tell you something about that year, but it's not comparative outside of that season. So, within that season, you have an idea relative to everyone else, but you don't to anyone else outside of that season. So, comparing 1991 to 2014, while you would understand the numbers, presumably, within each of those seasons, there's nothing quantitative differentiating the relative value of those two seasons against each other.
bachlaw
4/29
No, they are different. The same DRA in 1998 is much less impressive in 2014 because it is harder to score runs.
morro089
4/29
Could you further explain how "Rather, it is adjusted to the average runs-allowed by the league for that season." doesn't account for the hardness to score runs difference between 1998 and 2014? Unless bloodface's point is that just because more runs were scored in a season you're not sure to blame bad pitching, good hitting or a more offensive environment for the additional runs scored?
ravenight
4/29
If the average runs-allowed in season A is 3 and in season B it is 4, then a guy who generated -0.5 runs-allowed-above-average would have a 2.50 DRA in season A, but a 3.50 DRA in season B. Even if this was done on a percentage basis (i.e., you measured his skill by saying he allowed 20% fewer runs than average) you would end up with different numbers (2.40 vs. 3.20). Conversely, a guy with a 3.00 DRA in season A is an average pitcher, whereas that same DRA in season B is well below (i.e. better than) the average.
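The arithmetic in this comment can be checked with a short sketch. The two helper functions are hypothetical names for the two ways of expressing skill described above (a fixed number of runs below average, or a percentage below average); they are not part of how DRA is actually computed.

```python
def dra_from_runs_below_average(league_ra9, runs_below_avg):
    """Additive view: the pitcher is a fixed number of runs per nine better than the league."""
    return league_ra9 - runs_below_avg

def dra_from_percent_better(league_ra9, pct_fewer):
    """Proportional view: the pitcher allows a fixed fraction fewer runs than the league."""
    return league_ra9 * (1 - pct_fewer)

season_a, season_b = 3.0, 4.0  # league-average runs allowed per nine, from the comment

additive = [dra_from_runs_below_average(lg, 0.5) for lg in (season_a, season_b)]
proportional = [round(dra_from_percent_better(lg, 0.20), 2) for lg in (season_a, season_b)]

print(additive)      # same skill, different league baselines
print(proportional)
```

Either way, the same underlying skill produces a different number in each season, which is why a season-adjusted figure is not automatically comparable across eras.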
draysbay
4/29
This is awesome stuff. I really enjoyed the cFIP stuff, too, so it's nice to see that you guys are able to port these datasets over to different uses. This seems more all-encompassing compared to cFIP. Sort of like wOBA vs. FIP for pitchers. One thing I caught is that it's Root Mean Square Error. There is no such thing as Real Mean Square Error.
bachlaw
4/29
Good catch; it was late. We'll correct that.
harold
4/29
I noticed that 24 of the 25 top pitching performances had a DRA < RA/9. Is that skew just a coincidence, or is it expected? I was trying to figure out the reason for that and came up with the following: 1) The DRA model doesn't quite match the RA/9 distribution, and so the fitting gets a little weird on the tails. 2) Starters get a little bit of a bonus relative to relievers (I might have read that in the details article?), and so in general starters DRA will skew lower than their RA/9.
bachlaw
4/29
Harold, You will definitely see a skew like that at first because DRA presumes everyone is average (~4.1 RA/9) until they prove they are not. The more innings you pitch, the more you prove you are something different from average. Relievers haven't pitched many innings yet so DRA is conservative with them.
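The "average until proven otherwise" behavior can be illustrated with a simple shrinkage estimate. This is only a sketch of the general idea, not the DRA method: the ~4.1 RA/9 league average comes from the reply above, while the `PRIOR_INNINGS` weight and the function name are assumptions chosen for illustration.

```python
LEAGUE_RA9 = 4.1      # the "~4.1 RA/9" league average mentioned in the reply
PRIOR_INNINGS = 50.0  # hypothetical weight given to the league average

def shrunk_ra9(runs_allowed, innings,
               league_ra9=LEAGUE_RA9, prior_innings=PRIOR_INNINGS):
    """Blend a pitcher's observed runs with league-average runs over phantom innings."""
    prior_runs = league_ra9 * prior_innings / 9.0
    return 9.0 * (runs_allowed + prior_runs) / (innings + prior_innings)

# Both pitchers have a raw 0.90 RA/9, but very different workloads.
reliever = shrunk_ra9(runs_allowed=1, innings=10)
starter = shrunk_ra9(runs_allowed=20, innings=200)
no_data = shrunk_ra9(runs_allowed=0, innings=0)
```

With few innings the estimate stays close to the league average; as innings accumulate, the pitcher's own results dominate.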
roarke
4/29
Is there any way to take into account sub-optimal management strategies? So if a pitcher gives up a single with no outs and the manager has the next batter bunt - the expected runs for the inning goes down, but that really isn't reflective of any particular talent of the pitcher (other than throwing a buntable pitch). It's probably a minuscule effect anyway, but I think this is an issue anytime you use expected runs as part of a metric.
Schere
4/29
Looks like great work, I'm excited to dig in. I'm not thrilled with the choice of name here, we've got enough moralizing. And besides, like Clint Eastwood told us in Unforgiven, "Deserve's got nothing to do with it."
morro089
4/29
I don't know if I like or dislike that ERA and DRA both sound the same. For each BP podcast this week I'd put the +/- at 2 for "Are you saying 'EeeRA' or 'DeeRA'?" But DRA only has one letter difference from ERA, so I have a clue when remembering it. Unlike EPAA, which I already forgot and had to scroll up to find. And for people new to metrics I would guess any sort of familiarity in terms is a plus.
walrus0909
4/29
You could always pronounce it "Dray," like the doctor. Plus then you get to make "forgot about DRA" puns when pitchers regress.
morro089
4/29
They 100%, definitely backronymed that for this specific reason.
KJOKBASEBALL
4/29
Looks very good. Not crazy about the DRA acronym though since Defensive Regression Analysis is already using DRA.
sbnirish77
4/29
Is the DRA model developed to be a best-fit model of the entire data set or a best-predictive model using cross-validation using some subdivision(s) of the data set? What data (years) were used for the regression and, if cross validation was done, how was the data divided into subsets for cross validation? A best fitting regression model is notorious for having limited predictive ability for new data sets outside those used to develop the model.
morro089
4/29
I haven't read it yet, but might want to head over to the other DRA page http://www.baseballprospectus.com/article.php?articleid=26196
tannerg
4/29
Maybe I'm wrong, but now we have a WARP that considers context for pitchers, but not for hitters? Am I wrong?
bachlaw
4/29
You are correct sir. Give us time! And, I think we were a lot more confident about our ability to rate batters than pitchers before this.
tannerg
4/29
But it's easy to measure context for hitters -- RE24. I thought the whole point of WAR/WARP was to be predictive? So I guess the question is... What is the point of WAR/WARP?
morro089
4/29
WAR/WARP is a "what have you done for me" not a "what are you going to do for me" stat. It's for who was better than whom in the past. The reason it is also referenced a lot for predicting is because the projected actions of players (AVG, SLG, FRAA, etc) are put into the WAR function and it spits out a number.
tannerg
4/30
No, it is certainly not. WAR ignores context on the offensive side.
nberlove
4/29
Have you checked to see how well Deserved Runs allowed for a team correlate with actual runs allowed?
jrbdmb
4/29
Will PVORP also be converted to using DRA at some point?
bachlaw
6/01
Yes, that has now been done for everything except projections.
jfcross
4/29
I think this is fantastic, guys. One (I think) significant point and one trivial one. First, I don't agree with "RMSE (lower is better)" in your table given the context. If my pitching statistic eliminates fewer of the problems with RA/9 and yours eliminates more, mine might well be more similar to RA/9 as a result, and therefore have a lower RMSE. So, I think instead of "RMSE (lower is better)" this column should be called "Similarity to RA/9 (interpretation is ambiguous)" -- RA is both what you're trying to get away from AND what you're trying to approximate. DRA might well be better than FIP (almost certainly it depends on what exactly you're trying to do) but this table doesn't begin to make that case since by this metric, RA/9 itself would obviously be the best choice. Second, (and this is the trivial one), it's possible to do a little double counting if you have a park factor and a temperature effect since parks have different average temperatures. Can't imagine this makes any difference though.
ravenight
4/29
Yeah, this struck me as a problem also - being much closer to RA/9 isn't actually a good thing, is it? It would be interesting to see how RA/9 correlates with itself relative to DRA's correlation. So then the question is, what are you trying to measure and how can you tell if you successfully measured it? I think the answer would be pretty hard to come up with. You are trying to measure how many fewer runs your performance should have resulted in, compared to an average performance in the same context. If we took the "should have resulted" runs from every pitcher in the league and added them up, should that equal the "actually resulted" runs? No, because we are laying some blame on the fielders and the catcher, and the hitters, and so on. So I guess that means that it should really be compared to RA/9 - (sum of all non-pitching sources of runs)/9, assuming that you trust those other metrics (like FRAA, BRR, and BRAA).
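The objection raised in these two comments is easy to see in a small sketch: any metric scored by its RMSE against RA/9 is "beaten" by RA/9 itself, whose error is zero by construction. The numbers below are made up purely for illustration and are not DRA or FIP values.

```python
import math

def rmse(estimates, actuals):
    """Root mean square error between two equal-length sequences."""
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared) / len(squared))

ra9 = [3.2, 4.5, 5.1, 2.8]          # made-up RA/9 values
some_metric = [3.6, 4.2, 4.6, 3.3]  # made-up values of a context-adjusted metric

error_of_metric = rmse(some_metric, ra9)
error_of_ra9_itself = rmse(ra9, ra9)
```

A lower RMSE therefore shows similarity to RA/9, and whether that similarity is desirable depends on what the metric is meant to strip out.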
markpadden
4/29
Absolutely, on the double-counting of temp. Once you park-adjust, all weather variables should be vs. park average, not raw.
BarryR
4/30
Park adjustments are a seasonal concept. If a pitcher pitches in Colorado on a snowy day in April, his park effect is inherently different from the park effect of someone pitching on a sunny day in July. In Wrigley, the park changes from day to day all year, based on the wind direction. I suspect there is some correlation between temperature and wind direction there, but I'm not certain and it may not be possible to find the daily wind direction in Wrigley.
markpadden
5/13
Not sure what your point is, but just FYI -- historical wind/temp data is definitely available for every game. And, yes, the wind direction-to-temp correlation is huge for Wrigley. Weather is a complex issue to dissect, but it absolutely needs to be accounted for in any serious attempt to adjust pitching performance by degree of difficulty.
rnolty
5/01
I've been pondering on this for a couple of hours, since I listened to the Effectively Wild podcast this morning. I think the most likely explanation is that the guys are being misleading with their language when they say DRA accounts for 72% of the variation, or when they are comparing DRA to RA/9. I think they mean the whole DRA model, not the actual DRA number. The whole idea is that some of the probability of a run scoring is under the pitcher's control, and some is not. The system uses every relevant variable to predict the number of runs, then subtracts out all the parts the pitcher has no control of and calls what's left the DRA. But I think when they compare "DRA" to RA/9, they really mean the runs predicted using all the variables, not just the pitcher-controlled variables. If I'm right, lower RMSE is indeed better.
bachlaw
6/01
rnolty, the basis for the claim is that the weighted Pearson correlation between the DRA and the RA/9 of all pitchers in the seasonal population is around that number. I think you can make arguments in favor of focusing on error or the correlation. We included both since DRA seems to be a better fit regardless of how you slice it. Thanks for reading and for the thoughts.
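For readers unfamiliar with the term, a weighted Pearson correlation can be sketched as below. The reply does not say what the weights are; weighting each pitcher by innings pitched is an assumption of this sketch, as are the function name and the sample values.

```python
import math

def weighted_pearson(x, y, weights):
    """Pearson correlation of x and y where each pair counts in proportion to its weight."""
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    cov = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y)) / total
    var_x = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x)) / total
    var_y = sum(w * (yi - mean_y) ** 2 for w, yi in zip(weights, y)) / total
    return cov / math.sqrt(var_x * var_y)

# Made-up values: a metric, RA/9, and innings pitched used as weights (assumption).
metric = [3.1, 3.9, 4.4, 5.0]
ra9 = [2.8, 4.2, 4.1, 5.6]
innings = [200.0, 180.0, 60.0, 25.0]

r = weighted_pearson(metric, ra9, innings)
```

Squaring `r` gives the share of (weighted) variance in one series that a linear fit on the other accounts for.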
bachlaw
6/01
Jared: I think the point of similarity is a good one. I'll use that in the update article which will come shortly including our revisions to DRA. On the parks, the model explains more variance when both plain old park and raw temperature are included than when only park is included. An interaction between raw temperature and stadium was not adding much value. We feel fine with just letting temperature and park work off each other in the way that they do, for the time being.
russell
4/29
Nice work guys! I look forward to digging into it.
morro089
4/29
Pedro's MLB team icon probably shouldn't be the Phillies. Probably. Somebody can prove me wrong. http://www.baseballprospectus.com/player_search.php?search_name=Pedro+Martinez
markpadden
4/29
After reading this article I have no idea what method you used to test it? Are you saying you calculated DRA in Year-1 and compared to Year? Or something else? Let's assume that's the case, and the DRA beats FIP when using only last year to predict this year. I would claim that one could quite easily use a modified xERA (adjusted for park and some pitcher qualities) that crushes FIP in accuracy for 1-year predictions. [Source: 10 years as a pro gambler]. I.e., you chose an incredibly easy target in FIP, since it relies on the absurdly noisy HR allowed stat. Don't get me wrong: I'm all in favor of this research. But your validity test should never be something as trivial as beating raw single-season FIP. Beat PECOTA. Beat Steamer. Beat something that a few people actually use for prediction.
markpadden
4/29
Correction: should have read "modified xFIP," not "modified xERA."
rnolty
5/01
As I hinted above, I've concluded that the validation was to calculate the runs scored per plate appearance, using every useful variable at their disposal, not just the pitcher-controlled variables. They chose their parameters to minimize some function (probably square) of (predicted_runs - actual_runs). After they had that, they quantified how many runs in their predicted_runs, positive or negative, were due to factors beyond the pitcher's control. After subtracting that off, you have the pitcher's DRA. But this is just conjecture on my part -- I hope you get an answer, but I think they've already moved on from this article!
JohnStryker
4/30
Any idea when DRA will be added to Fantasy Team Tracker? Did I miss that above?
randolph3030
4/30
It states the DRA adjusts for: The run differential between the two teams at the time of the event; Is there proof that runs are more or less likely to be scored depending upon the run differential of the two teams? Seems like a weird thing to adjust for, unless I'm missing the point, which I probably am.
rnolty
5/01
The proof (if you trust them :-) is that it made their model better -- otherwise they would have left it out. It makes sense to me -- the losing team will play small ball late in the game if it's close -- I don't know if the expectation value of runs goes up or down, but it certainly shifts some of the probability from more runs to one run.
bachlaw
6/01
rnolty is right: being up, down, or tied and the extent of the lead affects the decisions teams make.
evanpetty
4/30
The inspiration is to improve on RA/9, which is the benchmark to show that DRA is more representative than FIP. I can't get over the thought that it's a bit strange. Maybe just ironic?
bachlaw
6/01
Evan, it's a fair question. It is traditional to score metrics by how well they "predict" or "account" for run-scoring. But the "perfect" metric by that measure will always be run-scoring itself. So, an error rate of basically 0 or correlation of nearly 1 would be meaningless and silly. I think similarity to RA/9 is an important benchmark but I would place an awful lot of weight on whether sound methods appear to be getting followed. We'd like to think we can demonstrate both, which is why we feel good about DRA.
mabenson00
4/30
I'm confused by the predictive/ past performance part. If this is the best way to judge how good a pitcher has been, wouldn't it be the best way to predict how well they will do? EG. Pitchers who have been good, will mostly continue to be good.
mabenson00
4/30
To hopefully clarify... The goal is a better RA/9, because RA/9 involves luck/factors a pitcher can't control. But to judge how well it works, you see how well it fits with RA/9. What if, for example, FIP is farther from RA/9 because the luck/other factors are actually a bigger part of RA/9 than DRA says. You could test it against future RA/9 but you mention it isn't meant to be predictive.
bachlaw
6/01
Essentially this is a judgment call. We want to account for as much RA9 as is reasonably possible while still recognizing that luck plays a role. I feel pretty comfortable with 70% or so explained variance for run expectancy; I would feel a lot less comfortable with 90 or 95%. At that point, it's unlikely the statistic is doing much more than running in circles. So, I think it's important to consider both whether sound methods are being followed and the statistic's ability to account for run expectancy. I feel like we are able to do both.
Plucky
5/01
I'll pile on with weather suggestions- While I'm sure the difference would be trivial, it is possible to estimate temperature at the time of event with freely available data. Hourly-level historical station data can be found here https://www.ncdc.noaa.gov/cdo-web/datasets . The official weather station won't exactly match at-the-stadium temperature, but you could establish a stadium/station delta pretty easily. This will also allow you to estimate temperature for pre-1998 events
markpadden
5/13
http://mlbfarm.com/data/weather.csv (no affiliation with site) And it's definitely not a trivial variable when predicting run scoring.
wazvito
5/01
Why do you compare DRA to FIP rather than xFIP, considering that xFIP is more predictive than FIP?
bachlaw
6/01
We are trying to account for past results, not anticipate future events. xFIP performs very poorly as a descriptive statistic because its distribution is so narrow.
danrnelson
5/01
What would cause Yordano Ventura to be excluded from the top of the TRAA list? Only two players even attempted a steal on him last year, fewest among all SP.
bachlaw
6/01
He was still above average overall. The issue would be the particular lead-runners and catcher he was interacting with on those plays.
hotstatrat
5/04
Is there any FIP or xFIP nuance to DRA? i.e., is a home run weighted precisely as more valuable than a single in DRA as it is in reality on the average - without considering that the ball may have just dropped in? Is the assumption that a pitcher with a higher BABIP really probably is pitching worse than a pitcher with a lower one?
bachlaw
6/01
Hi John, we just use the average, park-adjusted linear weight for each event for starters. We then apply context to those through the average effect of the parks, defenders, batters-faced, and such. A pitcher who has a higher BABIP is held responsible for that, after we account for defense, park, framing, the umpire strike zone, and all the other factors that we think most likely explain what else besides luck could be the cause of it.
yacitus
5/22
I'm confused by what appears to be a contradiction between:
To judge DRA’s accuracy, we’ll compare it to the leading brand: FIP. We know FIP does a reasonable job of predicting a pitcher’s actual runs allowed in a season. Does DRA do a better job than FIP? It does.
...and...
Second, remember that DRA was created to evaluate past performance. If you want to project future performance of a pitcher, use PECOTA. And if you want to evaluate how talented the pitcher is regardless of his performance to date, use cFIP, which is also on our leaderboards. In fact, cFIP is in the same table to DRA so you can compare recent results with the likelihood of future improvement (or decline).
markpadden
5/28
It's about shit's relative adhesion to a wall...
bachlaw
6/01
The word "prediction" is tricky here. We talked about this briefly in the in-depth article I believe, but in a sense all outputs of any model are "predictions," yet not all models are necessarily aiming at the future. Sometimes we are trying to fit the past. DRA is an interesting hybrid because we are using data from the past three seasons to "predict" the value of CURRENT events as they happen. Over the course of a season, those events move into the past. We do this in part because otherwise we would not be able to offer in-season predictions: we would have to wait until every season was over before we fit its events. But if you're confused, that's probably where it arises from. DRA is "fitting" past events; cFIP is "fitting" anticipated future events. Both are predictions, but only one has significant utility for the future. I hope this helps.