I've got a little puzzler for you – a brain teaser, if you will.

Here is a CSV file containing descriptive measures of a batter's batted ball distribution over his first 100 plate appearances, from two separate sources – Set A and Set B, as they are called. Then you have that batter's results for the rest of the season, in terms of BABIP, BACON (batting average on contact – included largely because I love having a reason to say BACON in a sabermetric context) and home runs on contact. Each player season has been identified by a "hash," in order to provide a unique identifier without giving any information about the player's identity. The reason is that I'm asking you all to participate in a blind taste test of two sources of information about the distribution of a player's batted balls, and how well they predict that player's future results.

Once people have had a chance to look over the data and provide their analysis, I'll go ahead and pull back the curtain and you can see whether people preferred Pepsi or Coke. Until then, have fun!

The formatting makes it almost impossible to read.
Geez, then drop it in Excel and run text-to-columns using commas as the delimiter.
Saving it as a .CSV text file and then opening it in Excel works great too...
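For anyone who would rather skip the spreadsheet, here is a minimal Python sketch of the same step using only the standard library. The `HASH` column name and the sample values are assumptions made up for illustration; only `LDA_RT_PRIOR` and `BACON_PRIOR` are names that appear in this thread.

```python
import csv
import io

def load_rows(fileobj, text_fields=("HASH",)):
    """Read comma-separated rows into dicts, converting every field to float
    except those named in text_fields (assumed here to be the hash column)."""
    rows = []
    for record in csv.DictReader(fileobj):
        rows.append({
            key: value if key in text_fields else float(value)
            for key, value in record.items()
        })
    return rows

# Tiny inline sample standing in for the real file; the values are made up.
sample = io.StringIO(
    "HASH,LDA_RT_PRIOR,BACON_PRIOR\n"
    "abc123,0.210,0.330\n"
    "def456,0.180,0.295\n"
)
rows = load_rows(sample)
print(rows[0]["LDA_RT_PRIOR"], rows[1]["BACON_PRIOR"])
```

With the real file, `open(path, newline="")` would take the place of the `io.StringIO` sample.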
I'll bite! Sort of.

I haven't done what Colin asked, but it's interesting to compare the consistency between the two providers. These are from least squares regressions of one provider's rate on the other's. As the agreement between the two drops, the slope (m) should approach zero.

GBA vs GBB: m = .93, R^2 = .81
LDA vs LDB: m = .19, R^2 = .22
FBA vs FBB: m = .64, R^2 = .45
PUA vs PUB: m = .38, R^2 = .31 (plus what looks like heterogeneous variance)

So, nice agreement on ground balls (slope is almost one, relatively high R^2), okish on fly balls, not so good on pop-ups, and terrible on line drives.

Some combinations:
All Air Balls, without pop-ups (LD+FB) A vs B: m = .70, R^2 = .56
All Air Balls, including pop-ups (LD+FB+PU) A vs B: m = .95, R^2 = .81
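
The slope and R^2 figures above come from simple one-variable least squares fits. A minimal sketch of that calculation follows, assuming one provider's rate is regressed on the other's (the comment does not say which way round) and that the combined "air ball" rate is just the sum of the component rates. The numbers in the example are made up and are not from the CSV.

```python
def slope_and_r2(x, y):
    """Ordinary least squares fit of y on x; returns (slope, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # For a one-variable fit, R^2 equals the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2

def air_ball_rate(ld, fb, pu):
    """Combine line drive, fly ball and pop-up rates into one rate per player."""
    return [a + b + c for a, b, c in zip(ld, fb, pu)]

# Made-up ground ball rates for four players from two hypothetical providers.
gb_a = [0.40, 0.45, 0.50, 0.55]
gb_b = [0.41, 0.44, 0.52, 0.54]
m, r2 = slope_and_r2(gb_a, gb_b)
print(f"m = {m:.2f}, R^2 = {r2:.2f}")
```

A slope near one with a high R^2 means the two providers move together; a slope near zero means one provider's number tells you little about the other's.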

From all that, I'd be pretty comfortable using GB vs. Air Ball (including pop-ups) distinctions from either provider. With the others, I'd worry: one or both providers aren't very reliable in correctly interpreting what they're seeing. If I get time later, I may try to actually answer the question, but I'm sort of hoping someone else beats me to it. :) -j
I used a boatload of fuzzy math and did some questionable conclusion-jumping, but my resulting spreadsheet tells me that A and B are pretty darn close to equal.

I'm sure I made a mess of it, so I'll be interested to know the result and the best way to reach it.
It appears A and B are fairly close but the data from the B provider did a better job of predicting future BABIP. I'm omitting the statistical details behind my assertion because I have no confidence in it.
I don't have any special statistical education/knowledge, but I'll give it a shot...

I was mainly focused on LD%, as this was the area where the two sets diverged the most (as far as I could tell), and my understanding is that LD% has the biggest impact on BABIP. Set A had much more extreme LD% values than Set B. The highest LD% in Set A is 40% vs. 28% in Set B, and the lowest is 4% in Set A vs. 14% in Set B. Set B's values fall in a narrower range than Set A's.

Based on this, I'd be more inclined to use Set B to predict future performance, since generally things regress to the mean over time. When Set A says a player has an LD% of 4%, while Set B is saying 18%, 18% is probably closer to what's actually going to happen in the future, because a 4% LD% is aberrantly low.
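
The range comparison above can be reproduced with a few lines of Python. This is only a sketch: the minimum and maximum in each list are the extremes quoted in this comment, while the middle values are made up to fill out the example.

```python
import statistics

def spread(values):
    """Summarize how widely a column of rates is dispersed."""
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": statistics.pstdev(values),
    }

# Extremes taken from the comment; the middle values are illustrative only.
ld_a = [0.04, 0.19, 0.22, 0.40]
ld_b = [0.14, 0.18, 0.21, 0.28]

print("Set A:", spread(ld_a))
print("Set B:", spread(ld_b))
```

A wider spread over 100 plate appearances is what you would expect from noisier observations, which is the intuition behind preferring the narrower set.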

Set A looks to me to be some form of raw play-by-play observed data. Set B, I don't really know, but it seems like it was derived in some way as opposed to being just observed results.

I took a look at a few of the extremes, and the Set B data doesn't appear to match up with reality. In particular, for Cristian Guzman in 1999 (hash=7b0dbb617dcdf7c9), it seems like Set B can't come from what actually happened. Set B is claiming Guzman hit 13 line drives, 57 ground balls, 1 fly ball, and no pop-ups in his first 100 PAs, but the play-by-play data conflicts with this. Unless something weird is going on with bunts, I'm not seeing how those results can be even close to accurate.
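
As a quick check on the arithmetic, the counts quoted above can be turned back into rates. This sketch assumes each rate is a count divided by the sum of the four batted ball types, which may not be how either provider defines its denominator.

```python
def rates_from_counts(counts):
    """Convert batted ball counts into shares of all batted balls."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

# Counts as quoted in the comment for Set B.
set_b_counts = {"LD": 13, "GB": 57, "FB": 1, "PU": 0}
rates = rates_from_counts(set_b_counts)

print(sum(set_b_counts.values()))   # 71
print(round(rates["LD"], 3))        # 0.183
print(round(rates["GB"], 3))        # 0.803
print(round(rates["FB"], 3))        # 0.014
```

Under that assumption, 13 line drives out of 71 batted balls works out to roughly the 18% figure mentioned earlier.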

I will be interested to learn what Set B is.

I will also be not all that shocked if I'm completely off base on some or all of this...
Sorry, just noticed that I wrote the wrong hash value - the correct one is 3c771e9fb7264. The row with LDA_RT_PRIOR = 0.042.

(for some reason, I can't use "Post Reply")
You should see some of the discussion here:

Note that you may fall into the same trap as I did. If you limit yourself to the "A" dataset's 4 batted ball parameters, and ignore the other 3 performance numbers, specifically BACON_PRIOR, you are conferring an advantage on the "B" dataset. That's because the B dataset's 4 batted ball parameters are actually the other 3 performance numbers, but translated into 4 batted ball parameters.

Therefore, you cannot focus on the labels, and presume that LD in A has any relationship to LD in B. And that means you can't discard the 3 performance numbers in the A dataset.
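
One way to make that comparison concrete is to fit the same future outcome on different sets of predictors and compare the fits. The sketch below is a plain least squares fit in pure Python. The data is made up, `ld_a` and `babip_future` are hypothetical names, and `BACON_PRIOR` is the only column name here taken from the thread. If the four batted ball rates sum to one, drop one of them before fitting with an intercept, otherwise the system is singular.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_r2(predictors, target):
    """In-sample R^2 of an OLS fit of target on the predictors plus an intercept."""
    n = len(target)
    cols = [[1.0] * n] + [list(c) for c in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * t for u, t in zip(ci, target)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_t = sum(target) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

# Made-up data: the future outcome is built from both columns on purpose.
ld_a = [0.18, 0.22, 0.15, 0.25, 0.20, 0.12]
bacon_prior = [0.33, 0.30, 0.35, 0.31, 0.36, 0.29]
babip_future = [0.1 + 0.2 * l + 0.5 * b for l, b in zip(ld_a, bacon_prior)]

r2_batted_only = fit_r2([ld_a], babip_future)
r2_with_performance = fit_r2([ld_a, bacon_prior], babip_future)
print(round(r2_batted_only, 3), round(r2_with_performance, 3))
```

In-sample R^2 can only go up when predictors are added, so a fair version of this test would score each set on players held out of the fit.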