From time to time—if not at all times—organizations must examine their own operations and ask some difficult questions.

The answers often reveal a range of things done right and things done wrong. Healthy organizations can handle those answers in more than one way—there are many routes to success, but even more to failure—but one hallmark of organizational integrity, to borrow from James Collins, is looking in the mirror when assigning blame and out the window when giving praise.

Here at BP we’ve been faced with an opportunity to ask ourselves some questions, and we’ve decided to grapple with the answers, even though in some cases we don't like them. In short, we have work to do in order to live up to our own high expectations. Despite our pride in much of the progress Baseball Prospectus has made, now is not the time to rest on our laurels. And some recent events make that abundantly clear.

After the 2010 season, Colin Wyers wrote about replacement level and how he was improving its integration with the rest of the component stats at Baseball Prospectus.

This is something of a culmination of work I’ve been doing over the past few months—taking a menagerie of stats available here at Baseball Prospectus and merging them together under the heading of “Wins Above Replacement Level.” We’ve had WARP for quite a while—and its close sibling, VORP, as well—but it has been rather distinct from the rest of our offerings. That’s coming to an end.

The goal of making WARP play well with the component statistics left behind at BP by previous staffers was worthwhile, but the implementation caused problems: we inadvertently raised replacement level for 2011 and 2012. Summing the WARP or VORP values for those two seasons produced league totals far lower than the pre-2011 data. By implication, replacement level was much higher, meaning a "replacement level team" would win more games than the data had indicated for previous seasons.

At any point starting in about May of 2011, it should have been clear to anyone looking closely at the stats that something was different, and not just because Colin had re-engineered (read: greatly improved) some of the WARP formulae or because offense was down in 2011.

For the record, we know that these re-engineered formulae work. The chart below shows league-wide WARP totals by year since 2000, along with the winning percentage of a notional “replacement level team” (really, it's just a subtraction of WARP from wins, so there's some noise there for a variety of good reasons, but it's close enough to give a good idea).

Year  WARP  BWARP  PWARP  WARP/Tm  Rep. Wins/Tm  Rep. Win%
2000   884    598    286     29.5          51.5      0.318
2001   913    588    326     30.4          50.5      0.312
2002   921    576    345     30.7          50.1      0.310
2003   939    585    354     31.3          49.7      0.307
2004   927    578    349     30.9          50.0      0.309
2005   941    577    364     31.4          49.6      0.306
2006   940    592    348     31.3          49.6      0.306
2007   907    588    319     30.2          50.8      0.314
2008   906    596    310     30.2          50.7      0.314
2009   887    596    291     29.6          51.4      0.317
2010   912    602    309     30.4          50.6      0.312
2011   838    563    275     27.9          53.0      0.328
2012   891    573    318     29.7          51.3      0.317
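The derived columns can be reproduced from the league totals. Here is a minimal sketch, assuming 30 teams and a 162-game schedule (so an average team wins 81 games); the constants and rounding are my reconstruction of the table, not BP's actual code:

```python
TEAMS, GAMES = 30, 162
AVG_WINS = GAMES / 2  # 81 wins for a .500 team

# League-wide WARP totals from the table above (a few sample years)
league_warp = {2000: 884, 2011: 838, 2012: 891}

for year, warp in sorted(league_warp.items()):
    warp_per_tm = warp / TEAMS                # "WARP/Tm"
    rep_wins_per_tm = AVG_WINS - warp_per_tm  # "Rep. Wins/Tm"
    rep_win_pct = rep_wins_per_tm / GAMES     # "Rep. Win%"
    print(year, round(warp_per_tm, 1),
          round(rep_wins_per_tm, 1), round(rep_win_pct, 3))
```

For 2000 this gives 29.5 WARP per team, 51.5 replacement-team wins, and a .318 replacement winning percentage, matching the table.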

Voila! Exactly the results we'd hoped to get.

Except…

One of the steps we take to improve the speed of queries—and thus to expand the scope of subjects we are able to research—is to put the seasonal replacement level for each event into our events database. In that process, we allowed some bad data to be introduced in 2011. We didn't catch it. It really was that simple, the data equivalent of a typo. We’ve corrected the data, and Baseball Prospectus WARP values for 2011 and 2012 are now representative of the theory we meant for them to represent.
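A slip like that suggests an automated guard. As a hedged sketch (my illustration, not BP's actual pipeline or schema): after replacement levels are loaded into the events database, re-total WARP by season and flag any season far outside the historical norm. League totals "much lower" than the pre-2011 data would have tripped a check like this immediately.

```python
def flag_outlier_seasons(league_warp, tolerance=0.10):
    """Return {year: total} for seasons whose league-wide WARP total
    deviates from the mean of all seasons by more than `tolerance`
    (as a fraction of the mean)."""
    mean = sum(league_warp.values()) / len(league_warp)
    return {yr: tot for yr, tot in league_warp.items()
            if abs(tot - mean) / mean > tolerance}

# Made-up example: a bad 2011 load drags the total well below ~900
totals = {2009: 887, 2010: 912, 2011: 700, 2012: 891}
print(flag_outlier_seasons(totals))  # flags only 2011
```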

Two additional things need to be pointed out about the scope of this problem: first, VORP was also affected, though FRAA and BRR were not—this was entirely an "at the plate" and "on the mound" problem. Second, slight adjustments were made to some previous-season WARP values, as some of our calculations rely on a multi-year smoothing of baseline data, even including forward-looking data when available.

While we're on the subject of evaluating our data: after extensive testing, we've decided that the 10-year projections just weren't producing the results we desired. Long-term projections are difficult to evaluate, and we intend to make that evaluation a more standardized, easily repeatable process in the future. But we hold our work to a certain standard, and in this case we didn't feel that standard was being met. Rather than put out an inferior product, we've sent the design team back to the drawing board to get 10-year projections and UPSIDE correctly formulated and out to the public in a timely manner going forward. We will be releasing the PECOTA percentiles soon, and that will conclude our pre-season projections releases.

It's not enough to fix these problems. We will be addressing these issues at their root—with a hard look at and overhaul of our internal processes and quality control.

But we also want to regain your trust. So we’re going to open the kimono and make our work transparent. Not only will this create a wealth of knowledge for everyone involved—readers and writers alike—but it will give BP the opportunity to leverage the wisdom of crowds.

We've recently named Harry Pavlidis our Director of Data Analysis. His first responsibility is to lead this effort. It will be a team undertaking, with all hands on deck. We will be sharing our progress and plans as they develop. But right now we're looking in the mirror. Looking hard.

Harry's first task is to conduct a full audit of our systems and stats. In essence, we're making him do his "day job"—assessing our systems and developing a plan to move forward. Harry will be bringing a process-driven approach to the effort, with the ultimate goal of improving our stat offerings. The experience he has in this area ranges from tiny start-ups to large, publicly traded companies. We'll all be working together to find the best-fitting tools and processes to bring BP up to the level of operational excellence we all expect.

Finally, I want to personally apologize for any inconvenience we may have caused our readers. The people we employ at BP are perfectionists. They spend more hours than anyone knows to get things done right and in a timely manner. They love this game and this company with a passion and will gladly fall on their sword if it means building a bigger and better Baseball Prospectus in the future. But if something goes awry at BP, it’s my fault and mine alone. I’m ultimately in charge, and I take full responsibility for any and all of our shortcomings. I’ve made mistakes and deserve any criticisms I receive. I may hold Baseball Prospectus to a high standard, but I hold myself to an even higher one. I’m sincerely sorry, and I promise you that I will continue to devote my blood, sweat, and tears to make BP the best it can possibly be.

Thank you for reading.

markpadden
3/18
Any way to dig up the code from Nate Silver's last year running PECOTA, and produce some provisional projections for this year? Even if they were flawed, they seemed to be fundamentally sound and offered an alternative view of the career paths of important players.
joechris96
3/18
I assume you mean the 10-year projections...I really wish it were that simple. And believe me, something that may have been sound 5 years ago may not be sound today. I know it's hard to take my word for that, but it's true. And honestly, we want to get things right for the long haul. Even if we could reconstruct Nate's code from 2008, and it was still relevant, we don't want to just patch in a one-time fix. Otherwise we'll be going through this situation every year. Thank you, though. I understand your desire for the numbers, and that's why we're taking this VERY seriously.
jrmayne
3/18
Joe, thanks for this.

As a pitchfork-and-torch salesman to the PECOTA mob the past few years, I've been very disappointed in the contrast between confidence level expressed and actual performance. (When Wayne Causey is your best comp for Bryce Harper, you're doing it wrong; predicted results for teams should average fewer than 86 wins.)

I hope that things go well on this front, and I wish y'all the best. I agree that there's a significant process problem. There are some good things BP is doing (Scoresheet Draft Aid!), and there are good articles. I'm glad to see that efforts are being made to fix some product rather than just have BP go to pure cash-cow mode.

Shorter ramble: Humility - while often a vice - is good if it leads to fixing.

--JRM
joechris96
3/18
The plan is to look at all aspects of our projections...from single year projections to multiple year projections and all the inputs that go into them. There's a lot more data available now than when Nate started the PECOTA process, and I, for one, would like to start incorporating more of it into our forecasts.
doog7642
3/18
I'm thankful that UPSIDE is being unretired and fixed. Thanks, BP.
kevinebert
3/18
We all appreciate the amount of work and effort that you guys put into the site. Believe me, it doesn't go unnoticed. I'm sure you guys will get it right.

One question - Dave Cameron and Sean Forman are rumored to be trying to come together on an agreed upon replacement level. Even though their WAR stats are computed differently, the idea is that people will have more faith in them if they start from the same place. Has BP given any thought to joining the discussion? The idea of an industry consensus on replacement level is highly intriguing.
joechris96
3/18
Yes, in fact, Colin is in discussion with Sean and the guys at Fangraphs.
kevinebert
3/18
That's great. Looking forward to hearing the resolution of these discussions.
joechris96
3/18
Oh, and thank you for the kind words. I didn't want to miss that :)
kmbart
3/18
I, for one, would be perfectly content with a seven-year projection. Ten years seems to me to be the equivalent of searching for signs of life on an extra-terrestrial planet with an optical telescope. What's the length of the average major-league career? Something like four years? I understand about the urge to get it right for a longer timeframe, but more correct over a shorter range is better, I feel.
joechris96
3/18
If we can do 7, we can do 10. It's not so much the amount of years as it is the formula, the aging curves, the inputs, etc. But I appreciate the comment, and we will look at all the options.
TheRedsMan
3/19
Joe, I'm sure you can do it, but at what point does the projection not only lose its ability to convey meaningful information but actually have its necessary uncertainty undermine the perceived value of the rest of the system? That is to say, at a certain point of uncertainty, the mere presence of data may suggest a level of confidence that simply cannot be "undone". Or put differently still, simply having a 10-year projection may project a hubris that turns off a less sophisticated consumer and which no amount of "but look at the confidence intervals" can offset.
joechris96
3/19
Those are excellent points...even better because I had the same thoughts the past few years! It seems as though the consumers, though, want the 10-year projections, so that's what we're going to try first. It doesn't mean we can't alter the plan in the future. This is an ongoing process, and I don't think we'll be disappearing anytime soon...at least I hope we won't be. There are lots of interesting things we want to uncover and experiment with, and forecasts and projections are some of them. Thanks.
gpurcell
3/18
Does this have any effect on the data from the Player Forecast Manager?
mcquown
3/18
2013 projections should not be impacted in any way, so you won't see any PFM changes due to this.
lloydecole
3/18
Good for you. This is a great way to run a business, and to communicate with your clients.

ps I do play in fantasy keeper leagues, and a useful UPSIDE number seems like a great way to evaluate players for the (very) long term. Thanks for taking a new look at that and trying to come up with a more meaningful number. I hope that forward-looking GMs may look at it too (are you listening, Ruben Amaro Jr.?)

Lloyd Cole
*PHILADELPHIA* (land of short-term baseball thinking)
Grasul
3/18
I think just some basic stat availability enhancements would be very useful: what team a player is currently on, whether he hits left/right/switch, etc., downloadable into a CSV. The ability to download CSVs is useful and appreciated, but another consideration might be the development of a baseball statistics API for subscribers. To my knowledge, there isn't a good one out there today aimed at individuals, and having an API could be a differentiator for BP.
mattidell
3/18
I work in software development, so I'm familiar with mistakes being made, processes being scrutinized and improved, apologies made with difficult explanations, etc. Personally, I wince at these kinds of apologies. It is tempting for the business side to see a mistake in development and feel compelled to solve every problem with broad strokes, like improving the process.

I'm sure Harry will do a great job with that task. But I have seen efforts to improve development process lead to one or more of the following outcomes:
1. Being overly ambitious and not actually implemented
2. Having too great an impact on the delivery of the product
3. Too general a solution and not solving the original problem

I am less concerned that a mistake was made and more concerned that lessons are learned. Whatever you can share would be appreciated, especially if you have some confidence you can actually improve the process.

You said something like this is ultimately your fault Joe... Did Colin and team tell you the product was susceptible to errors if they didn't improve the development process? And you ignored said advice? What mistakes did you make? What would you have done differently? What are you going to do differently? Are you proposing changes that will affect the product delivery timeline? Or affecting subscriber fees?

As was mentioned, we all appreciate your hard work and effort. Most of us understand mistakes can be made. The apology in the last paragraph makes me uncomfortable though. I'm not sure the right way to put this, other than that if you are apologizing emphatically for a problem which may be a natural consequence of developing a complex system, why would I think even more of your blood, sweat and tears is going to help?

Thanks.
joechris96
3/18
It's up to me to prove to you what I can do, Matt. I totally get that. How we roll out the solutions to the issues will become apparent in time. That's why Harry is here, and we will be working together closely.
BarryR
3/18
What percentage of BWARP is hitting and what percentage is fielding? Is this a fixed percentage or does it change from season to season?
Logic tells me that since offense and defense play an equal part in scoring, the fielding component should be half the difference between BWARP and PWARP, but there may be reasons why that isn't true, so I am curious as to the numbers here.
mcquown
3/18
Hi Barry,

Thanks for writing. It's not a constant. It changes over time as a function of BIP rates, mostly (the more TTO, the fewer BIP and so the less fielding matters and so the pitching WARP as a percentage of total WARP grows).
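A rough numeric sketch of that relationship (my illustration, not a BP formula): the three true outcomes involve no fielders, so the share of plate appearances ending with a ball in play bounds how much fielding can matter.

```python
def bip_rate(pa, so, bb, hr):
    """Fraction of plate appearances ending with a ball in play,
    i.e. excluding the three true outcomes (K, BB, HR)."""
    return 1 - (so + bb + hr) / pa

# Illustrative (made-up) league lines: as TTO rises, BIP falls, so
# fielding can influence fewer plays and pitching's share of total
# WARP grows.
low_tto = bip_rate(pa=100, so=15, bb=8, hr=2)   # 0.75 balls in play
high_tto = bip_rate(pa=100, so=25, bb=9, hr=3)  # fewer balls in play
print(low_tto, high_tto)
```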
BarryR
3/18
Rob

Okay, so fielding WARP is relative to pitching WARP. (As an aside, note that if you mouse over PWARP you get "Wins above replacement level as a BATTER" - really)

The question then is, if we break BWARP into its components, does batting WARP = pitching WARP + fielding WARP?
If it doesn't, what causes the variance between offense and defense? Is there more or less BWARP based on increased (or decreased) offense, or vice versa? Just trying to pin things down here.
dpease
3/18
Fixed PWARP, thank you.
BarryR
3/18
Joe

I'm just trying to figure out what the chart is. I see a drop in WARP between 2010 and 2011 of about 8%, with a little over 6% being recovered in 2012. Is this significant drop due to the "typo", as you put it? Was the 2012 "correction" due to your fixing the "mistake"? Or are these the numbers that are the end result of all the efforts you've made to get things where they should be? If these are the "correct" numbers, then what causes that kind of severe drop in a relative statistic, followed by a significant return in the other direction? Even if there is a drop in offense, the relative WARP should be fairly stable, unless the numbers inherently differ between higher and lower scoring eras. A year-to-year variance like that in the quality of replacement player seems quite odd, if that was the case.
mcquown
3/19
The erroneous numbers represented a much lower WARP total, and are not shown here.

You are correct that the numbers are inherently different based on the league offensive rates (among other things). While this chart seems to indicate that 2011 had a high level of wins for a 'replacement team' (a nebulous concept at best), that's a quirk of starting it at the year 2000 - going back further, 2011 is well within the range that was evidenced, not an outlier as it seems to be by choosing 2000 as the starting year.
BarryR
3/19
Okay. Before I can use WARP and refer to it like it's a meaningful metric, I need to have some confidence in it. I need to be able to understand how numbers are arrived at which are counter-intuitive.
So let's take a look at 2002 and 2011. Between 2002 and 2011, run scoring declined approximately 7%. At the same time, strikeouts increased 10% and HR declined 10% - both increasing the TTO positives for the pitchers, as did the decline in walks, also 7%. You said in your previous post that the more TTO, the more PWARP increases as a percentage of total WARP. Yet despite these changes in TTO, all in the pitchers favor, the percentage of PWARP dropped from 37.4% of total WARP to 32.8%. How did this happen?
Also what caused the 8% drop in total WARP in 2011 and the subsequent 6% increase in 2012? Was it a great year in the Pacific Coast League?
joechris96
3/19
There are two different topics really being addressed here, Barry. The focus of today's announcement was purely on replacement level. Your question goes more to the core of WARP. We didn't discuss it here, but we intend to peel back WARP one layer at a time over the next several months to provide insight into understanding the metric. So if you can hold your thoughts until we start that discussion, I think you'll get more of the answers you're looking for.
BarryR
3/19
Sorry, Joe, I didn't mean to hijack your topic. It's just that you raised the subject and presented me with numbers in a format which immediately led me to questions which couldn't really wait, as the numbers are sitting there and will probably not be presented in the future.
It's been over 40 years since I took a college math course and I freely admit that when confronted with a series of formulas and equations, my eyes glaze over. But before I dive in to any series laying bare the entrails of WARP, I still want answers to my questions.
You see, Rob answered my first question with a very logical construct, TTO goes up, pitcher impact, as stated on PWARP as a percentage of total WARP, increases. Makes absolute sense. Unfortunately, the numbers don't support it, they are the equivalent of dropping a rock and having it float skyward. Now I would wonder why something like that happened, especially if I was in the rock-dropping business. I expect your numbers people asked and answered that question and I would like to hear it. If they don't have an explanation, then I have little interest in seeing the layers peeled away, I don't want to cook with that onion.
Similarly, the second question seems like an obvious one. An 8% drop in WARP one year followed by a 6% rise the next - if I was involved in the analysis of that metric, my first question would be why did this happen. There may be a simple answer, one for each year. I just want to hear it, because there is another question which should follow that one - what is the effect of this drop/rebound? Was there an 8% drop in WARP across the board? Or was some class or group of WARP scores changed more than others? This goes to the heart of the reliability of the metric itself.
I need answers.
joechris96
3/19
You're not hijacking the topic, and I appreciate the fact you want answers, but we're not going to go into all the details, calculations, assumptions, etc. here. After we roll out the information on WARP, and you have many more details, if you still have questions, we'll answer them. That discussion will go to the heart of the reliability of the metric itself...which is what you're looking for.
drmorris
3/19
In recent years, BP's articles -- which is to say the insights, prose, and wit of its editorial staff -- have become a bigger draw for me than PECOTA. I'm obviously all for the process review and (presumed) improvements in your core analytical offerings, but let me take this moment to congratulate everyone at BP on a site that is delightful and thought-provoking far beyond the projections arms race.
joechris96
3/19
Thank you. I appreciate the kind words. I really do. We do some great things, but I'm not satisfied with that. I want to do even better things, and products like the 10-year projections and UPSIDE are the next to get a bulk of our attention.
Oleoay
3/19
I'll echo that. I originally came to BP for PECOTA and comparables and projections and VORP/WARP. I stay for the articles and insight but, sadly, haven't cared for WARP in a while (though the comparables in this year's Annual were better than in previous years). FRAA just seems to affect WARP so much that it's hard for me to really consider it anymore.

Out of curiosity, full disclosure-wise, but when did BP become aware that bad data had entered the system?
joechris96
3/19
It was probably about 4-6 weeks ago, when we were looking at the framework for the long-term projections. From there we had to perform several tests to make sure that what we thought was wrong was truly wrong. Then we had to clean the data, rerun the new calculations, test it again, etc., and that leaves us where we are today.
hannibal76
3/19
I really appreciate this message. UPSIDE was the reason I continued subscribing to BP for my first few years here and its demise was the reason I stopped (a friend was kind enough to gift me a subscription this year). Do you expect UPSIDE/10-year projections to be ready for next year?
joechris96
3/19
Yes, I fully expect both UPSIDE and 10-year projections to be ready for next year. I don't want to put a specific time frame on it this far in advance, but the main purpose behind adding resources is to devote more attention to both those products.
hannibal76
3/19
Awesome. Thank you!