There never was a stats vs. scouts war. If there was, it was silly. A good researcher knows that you never throw away perfectly good data.

Yes. Data.

It seems that because of that scene in Moneyball, people believe that there is still some sort of war against scouts going on. As someone who prefers to promote peace, love, and understanding (what's so funny about that?), perhaps what's needed is a better understanding of how we might be misunderstanding scouting reports. They are data, just in a different form than those of us who usually work the numbers are used to. Allow me to present a short, five-step plan for understanding how scouting data can (and perhaps should) be used to further our understanding of baseball in a perfectly scientific manner.

Step 1: Recognize a data collector when you see one.
Retrosheet data files are wonderful. PITCHf/x is beautiful. If FIELDf/x or HITf/x make it out into the wild, that will be some kind of awesome. Heck, I get excited looking at a good box score. Just because a data stream does not come pre-packaged as a spreadsheet does not make it useless.

Consider the scout, particularly the one who draws the 16-year-old amateur beat. He has spent a good chunk of his adult life watching really bad baseball, all in the hopes of finding one or two kids who are slightly less bad than the others, and projecting that in 10 years (!), this special kid will turn into one of the five best baseball players on the planet. He spends a lot of time thinking about how to accomplish this goal, because it’s his job. One of the things that happens when you spend a lot of time either thinking about or doing something is that you begin to, sometimes without even knowing it, pick up on subtle differences that no one else even knows to look for. The little tilt in the left elbow that distinguishes the guy who might develop some power from the guy who will not develop any further.

Let's give scouts credit for the skills that they have developed. A scout needs to be an expert in how the human body develops from late adolescence onward. He also needs to know how to break a pitching motion or a swing down into parts, understand how all those body parts work together, figure out which pieces can be altered, and, if so, what the swing or motion would look like with the changes in place. Then he has to piece it all back together at the macro level and visualize whether that would work against major-league pitching or hitting. He must do this in a vacuum, because the high school pitcher whom the prospect is currently facing has a career ahead of him of watching MLB games when he gets home from his real job. A scout also has to play amateur psychologist and try to get some idea of what makes this guy tick and whether he'll even take the teaching that he'd be offered. That's a very specialized skill set and one that's very hard to master.

In the analytical community, we're used to placing things within a horizontal context. How good has Miguel Cabrera been? We can compare his numbers to those of others who have been in the same league at the same time, which answers our question fairly well. What happens when you delete that context? Suppose that we only ever saw Miguel Cabrera bat against pitchers who hailed from one small geographical area. And then there were guys to whom we wanted to compare him and they all faced a different set of pitchers, and each set from a different geographical area. That design would fail every research methodology class out there. Except that's how things actually have to work in the real world.

The scout is the context. Through his previous experience of watching bad baseball, he's able to compare talent across time and place. He can look back over his years of doing this, make comparisons that way, and put things into that framework. Yes, that will have problems, but all data collection methods have problems. Let's recognize scouts for what they really are: highly skilled data collectors who are using a different research paradigm than we’re used to.

Step 2: Realize that while those data are going to be biased, bias can be overcome.
Most of the critiques of scouts as data sources are actually correct. Scouts are human and, like all humans, they will have cognitive blind spots of which they may not be aware. Humans are far too confident in their own predictions. The same sensitivity that allows scouts to detect subtle differences between players also leaves them open to influence from factors that have nothing to do with the player. (There's a study showing that applicants to medical school were more likely to be admitted if they interviewed on a sunny day rather than a rainy one.) Many of the advances of the statistical revolution came from exploiting the fact that people remained unaware of these biases and didn't have the ability to provide the appropriate horizontal context to test them. There's still plenty of room for that. The problem with data isn't bias itself. It's hidden bias.

There are ways, methodologically and statistically, to overcome bias. In fact, a process evaluation of the way in which scouting is actually done would be a fascinating project. Teams already know one method for reducing bias: the cross-checker. If a team is seriously considering a player to sign or draft or trade for, it will often send out a second scout (or several) to take a look. In research, we call this inter-rater reliability. If scouts agree with one another, we at least know that they are all on the same page. That doesn’t mean it’s the right page, but that's another discussion.
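To make "inter-rater reliability" concrete, here's a minimal sketch in Python, using made-up grades rather than any real scouting data. It computes Cohen's kappa, a standard agreement statistic that corrects for the agreement two raters would produce by chance:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 20-80 grades from a scout and a cross-checker on the same players.
scout   = [50, 60, 40, 50, 70, 50, 40, 60]
checker = [50, 60, 40, 60, 70, 50, 50, 60]
print(round(cohens_kappa(scout, checker), 2))  # prints 0.65
```

A kappa near 1 would mean the scout and cross-checker are on the same page; a kappa near 0 would mean they agree no more often than chance.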

Step 3: Learn some new research techniques
One of the biggest frustrations that quantitative folks face when looking at scouting data is that a good amount of it is in narrative form. Scouts tend to describe players in these strange combinations of lines and dots put together in sequence to make a form of communication known as "words." There's a language that goes with it, and while a "plus-plus fastball" sounds ever more worryingly like Newspeak, I don't quite get why ideas borrowed from mixed-methodology research or text-based analysis have never been well represented among baseball researchers. Are there descriptions or words that appear over and over in scouting reports? (Yes.) Do these descriptions tend to clump together in some noticeable ways? (Probably.) Do any of these clumps predict success in any meaningful way? Now, there's a question worth looking into, but it'll require some skills in QUALitative research methods.
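As a toy illustration of the descriptor-counting question (the report snippets below are invented, not drawn from real scouting reports), a first pass at content analysis can be as simple as tallying how often the scouting lexicon's terms show up:

```python
from collections import Counter

# Hypothetical scouting-report snippets; in practice these would come
# from public prospect write-ups.
reports = [
    "Plus-plus fastball, fringy command, projectable frame.",
    "Projectable frame with a plus hit tool and fringy glove.",
    "Plus-plus fastball; command is fringy but the frame is projectable.",
]

# Crude content analysis: tally how often each descriptor appears.
descriptors = ["fringy", "projectable", "fastball", "command"]
counts = Counter()
for text in reports:
    low = text.lower()
    for term in descriptors:
        counts[term] += low.count(term)

print(counts.most_common())
```

From there, the interesting questions are whether descriptors co-occur in clumps and whether those clumps predict anything, which calls for real qualitative coding rather than raw string matching.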

Do Jason Parks and Zach Mortimer rate players highly on their level of #want at the same rate? Do they rate the same players highly? Can we calibrate them against one another? Do they comment on the same things about the same players? Which is more powerful: the fact that a player got a high rating, or the fact that the scout noticed it enough to comment on it? We’ve now entered the world of content analysis.

No, teams won't release their scouting data directly to you, but there are plenty of websites out there that review prospects for a living. It's not exactly the same thing (especially since those sites only write up the "interesting" prospects), but an enterprising researcher could tap that data source, get his or her feet wet, and still run some truly groundbreaking analyses.

Step 4: Repeat after me: The 20/80 scale is an ordinal measure.
Okay, so there are numbers on a scouting report. There's the 20/80 scale. A 20 is "You tried hard." An 80 is elite. A 50 is, theoretically, MLB average. Some folks use 2 to 8. Some use 1 to 10. Some give out letter grades. Some give out stars or rainbows or stickers. It doesn't matter. They are all ordinal scales. Statistically minded folk are generally used to dealing with ratio and interval data, and there's a temptation to simply use the 20/80 scale the same way. You can't do that.

For those of you who missed that day in stats class, the gory details are as follows. An ordinal variable is one in which the values tell you the order in which the things being rated fall. We know that 60 is better than 50, which is better than 40. We know that four stars are better than three. But by how much? Baseball research has generally focused on measures that are on interval and ratio scales. On an interval scale, the “distance” between 40 and 50 is the same as the “distance” between 50 and 60. On a ratio scale, not only are the distances between numbers the same, but zero means “the absence of,” and one can make ratio comparisons between numbers (i.e., a value of 60 is twice as much as a value of 30). On-base percentage, which answers the question, “What percentage of the time did Smith not make an out?”, is on a ratio scale. An OBP of .000 means that he was never able to reach base during the period in question, and a player with a .400 OBP was twice as successful as a player with a .200 OBP.

The biggest mistake that I see people make (and have made myself!) in working with data is pretending that ordinal variables are actually interval or ratio variables. Because of the mathematical properties that interval/ratio variables have that ordinal variables do not, there are things that one can do with a ratio scale, but not with an ordinal scale. It’s not that ordinal variables are useless; there are methods developed for dealing with ordinal variables. It’s just that many of the favored methods in research (OLS regression, Pearson correlation, taking a simple average) don’t actually work when you shove an ordinal variable into them. Oh, Excel or R or whatever program you like to use will run the procedure and spit out a number, but it’s a garbage number. Caveat number cruncher!
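To show what "methods developed for dealing with ordinal variables" can look like, here's a sketch of Spearman's rank correlation, which is just Pearson's correlation applied to the ranks, so it uses only the ordering of the grades, never their spacing. The grades and outcome numbers are invented for illustration:

```python
def ranks(xs):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman's rho: Pearson correlation on the ranks -- appropriate
    # for ordinal data like 20-80 grades.
    return pearson(ranks(x), ranks(y))

# Hypothetical hit-tool grades and a later outcome measure.
grades  = [40, 50, 50, 60, 70]
outcome = [0.240, 0.260, 0.255, 0.280, 0.310]
print(round(spearman(grades, outcome), 2))
```

Because only the ordering matters, the result would be identical if the grades were relabeled 1-2-2-3-4, which is exactly the property you want when the "distance" between a 50 and a 60 is undefined.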

Step 5: Learn to live with (and love) error bars
There are methods for figuring out whether someone is good at predicting things. They mostly boil down to pulling old predictions and seeing whether or not they happened. And, of course, there will be mistakes. Predicting the future is hard, and those of us who do quantitative work don’t have a 100 percent accuracy rate either. Fear not, those errors are fascinating data unto themselves.

Scouting is an ongoing process, and studying any process in depth will reveal inefficiencies. Once you can spot an inefficiency, you might be able to address it. With the scouting process, the fixes can be implemented across a large system and in a high-leverage part of a baseball organization. I'd wager that teams do spend a good amount of time trying to address any weaknesses in their scouting processes, and for good reason. A well-functioning talent identification system produces cost-controlled players who put up real value.

But publicly, there's a surprising lack of investigation into these types of research questions, and a lack of the use of these types of research methods. There are, of course, plenty of errors that teams make, so perhaps there is a brilliant mind out there who can identify a few holes in how the system works. I'm often asked where the next frontier is in sabermetrics (there are several!), and I believe that the field is wide open for someone who wants to formally and systematically study the talent identification process.

Okay, got it:

Elvis Costello and the Attractions. On Armed Forces, 1979.

"(What's So Funny About) Peace, Love, and Understanding."
I didn't actually offer a billion points, but... one billion points to you!
Brinsley Schwarz. New Favourites of... Brinsley Schwarz, 1974.
I've seen it suggested that a 60 grade for a particular tool is one standard deviation above the mean, a 70 is two standard deviations above, and so on. Hence the lack of 80 grades. Obviously, this is easier/more accurate for some tools than others. I think I've seen work on other sites applying this and grading major leaguers' tools based on this approach.
This is roughly what I assume to be true, but we're fitting a well-known statistical concept to a process that we know was created without any concern for standard deviations. So you can bet that the 20-80 scale wasn't designed to be based on SDs.
60 is supposed to be 1 SD above the mean, and that's a good analogy for how to conceptualize and use 20/80, and there are probably plenty of people who use it that way. The problem is that what we're trying to get at is so nebulous and un-standardized that, if we're being super technically nerdy about our stats, it's better to just say it's ordinal and be done with it rather than make the ordinal/ratio jump.
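Under the "10 points per SD" convention discussed in this thread (which, as noted above, is an analogy rather than the scale's actual design), the mapping would look something like the sketch below; the clamping to 20-80 and the rounding to 5-point steps are my own assumptions:

```python
def z_to_grade(z):
    """Map a z-score to the 20-80 scale under the '10 points per SD'
    convention (50 = average, 60 = +1 SD, 70 = +2 SD, ...)."""
    grade = 50 + 10 * z
    grade = max(20, min(80, grade))   # clamp to the scale's endpoints
    return int(round(grade / 5) * 5)  # assume grades come in 5-point steps

print(z_to_grade(0.0), z_to_grade(1.0), z_to_grade(2.3), z_to_grade(-4.0))
```

The clamp at 80 is one way to read "hence the lack of 80 grades": a tool three-plus SDs above the mean is vanishingly rare.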
This shouldn't be behind the paywall.
Was there a reason you didn't use the term 'qualitative data' in your discussion of scouting data? You mention the word qualitative once in Step 3, but there's never any real explanation about the term. This article would have been a lot better by introducing the term and explaining why scouting data is qualitative data and how there are specific techniques to analyze data of this type.

Step 2 is ridiculous...anything dealing with humans is biased...pitch f/x is biased...most fielding ratings are biased...heck, radar guns are biased...how many times have we heard about a slow gun or a fast gun at a particular park? You make it seem like scouting is the only place where bias occurs in the baseball analysis narrative, and you know that is simply not the case.
or those darned cops with the fast guns...
Actually, the reason that I shy away from using the word qualitative is because it looks and sounds so much like quantitative data that it makes for a hard read. When I taught stats, I would spend five minutes just making sure that people were hearing the words correctly. You're correct that the term most certainly applies and in my head, that's what I call it.
Russell certainly knows other data sources are biased, the guy has to put up with me all the damned time.
In terms of inferential statistics, I don't think there are many things you can't do with ordinal (or even nominal) data. Ordinal data don't have fine-grained descriptive statistics, especially with a judgment-based (Likert) scale like the 20-80 system that doesn't have too many values. Median/mode and interquartile range aren't going to feel like they tell us anything. On the other hand, the series on the reliability of offensive measures in WARP suggests that differences in our ratio statistics are probably over-valued. What I mean is, if player 1 has a 0.05 higher projected OPS than player 2, then maybe we'd be better off viewing them as both having a 60 hit tool (or whatever number) and not pretend there is a reliable difference.
In like 5 years' time I think there will be a great opening for computational linguists to work with scouting data and discover all sorts of awesome things, but right now there seems to be a major lack of reliable scouting data available to the public. Kudos to BP for trying to change that in a big way.

I'd be shocked if no teams were already attempting to do it with their in-house scouting departments.
"It’s not that ordinal variables are useless; there are methods developed for dealing with ordinal variables." Recognizing that this isn't Coursera, what are those methods?
Spearman's rank-order correlation (rather than Pearson) is one. You could also use 20-30-40-50-60-70-80 as separate categories and load the variables that way into a regression, or code guys as perhaps 50 or better vs. 40 or worse and run things binarily. The trick is remembering that ordinal variables are more akin to nominal variables than interval or ratio variables.

People often shove 20-80 into a regression anyway and pretend that it's a ratio variable, which is OK if you're just looking for an effect size so obvious that it'll show up no matter what you do.

If you're trying to predict ordinal variables, there are ordinal logit regression techniques, although I don't know why (in this situation) a scouting score would ever be a dependent variable.
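A minimal sketch of the binary-coding idea mentioned above (all grades and outcomes invented): split players at 50-or-better versus 40-or-worse and compare outcome rates across the two groups, which never pretends the grades are interval-scaled:

```python
# Hypothetical 20-80 grades and whether each player "worked out"
# (a made-up binary outcome for illustration).
grades  = [40, 50, 60, 40, 70, 50, 30, 60]
reached = [0, 1, 1, 0, 1, 0, 0, 1]

above = [r for g, r in zip(grades, reached) if g >= 50]  # 50 or better
below = [r for g, r in zip(grades, reached) if g < 50]   # 40 or worse

print(sum(above) / len(above), sum(below) / len(below))
```

From there you could run a chi-squared test on the 2x2 table, or feed the binary indicator into a logistic regression alongside other predictors.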
The Wikipedia page on non-parametric statistics has a list of potential tests to browse over. It doesn't quite lay them out in terms of "if you're tempted to do this with ratio data, you should do this with ordinal data" but either there or if you follow the link you usually see some sort of statement about equivalent parametric tests. There are also more and more introductory texts as researchers in all kinds of fields are getting more serious about doing this right.