
Baseball Prospectus home
  
  


May 24, 2011

Between The Numbers

A Batted Ball Puzzler

by Colin Wyers

I've got a little puzzler for you - a brain teaser, if you will.

Here is a CSV file containing descriptive measures of a batter's batted ball distribution over his first 100 plate appearances, from two separate sources - Set A and Set B, as they are called. Then you have that batter's results for the rest of the season, in terms of BABIP, BACON (batting average on contact - included largely because I love having a reason to say BACON in a sabermetric context) and home runs on contact. Each player season has been identified by a "hash," in order to provide a unique identifier without giving any information about the player's identity. The reason is that I'm asking you all to participate in a blind taste test of two sources of information about the distribution of a player's batted balls, and how well they predict that player's future results.

Once people have had a chance to look over the data and provide their analysis, I'll pull back the curtain and you can see whether people preferred Pepsi or Coke. Until then, have fun!

Colin Wyers is an author of Baseball Prospectus. 

9 comments have been left for this article.

R.A.Wagman

The formatting makes it almost impossible to read.

May 24, 2011 09:46 AM
rating: -1
 
CRP13

Geez, then drop it in Excel and run text-to-columns using commas as the delimiter.

May 24, 2011 09:57 AM
rating: 2
 
jinaz

Saving as a .CSV text file and then opening in Excel works great too...

May 24, 2011 10:27 AM
rating: 1
 
jinaz

I'll bite! Sort of.

I haven't done what Colin asked, but it's interesting to compare the consistency between the two providers. These are from least squares regressions. As the match between the data drops, the slope (m) should approach zero.

GBA vs GBB: m = .93, R^2 = .81
LDA vs LDB: m = .19, R^2 = .22
FBA vs FBB: m = .64, R^2 = .45
PUA vs PUB: m = .38, R^2 = .31 (plus what looks like heterogeneous variance)

So, nice agreement on ground balls (slope is almost one, relatively high R^2), okish on fly balls, not so good on pop-ups, and terrible on line drives.
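
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a simple least squares fit that returns the slope and R^2. It is only an illustration: the function name is made up, the five data points are invented rather than taken from the CSV, and treating Set A as x and Set B as y is an assumption, since the numbers above don't say which way the regression was run.

```python
def slope_and_r2(x, y):
    """Ordinary least squares fit of y on x; returns (slope, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # With one predictor, R^2 is the squared correlation
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2


# Invented ground-ball rates for five hypothetical player seasons
gb_a = [0.40, 0.45, 0.50, 0.55, 0.60]  # assumed: Set A, used as x
gb_b = [0.41, 0.44, 0.51, 0.54, 0.60]  # assumed: Set B, used as y
m, r2 = slope_and_r2(gb_a, gb_b)
print(f"m = {m:.2f}, R^2 = {r2:.2f}")
```

A slope near one with a high R^2 means the two sources move together; a slope near zero means one source tells you little about the other.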

Some combinations:
All Air Balls, without pop-ups (LD+FB) A vs B: m = .70, R^2 = .56
All Air Balls, including pop-ups (LD+FB+PU) A vs B: m = .95, R^2 = .81

From all that, I'd be pretty comfortable using GB vs. Air Ball (including pop-ups) distinctions from either provider. With the others, I'd worry: one or both providers aren't very reliable in correctly interpreting what they're seeing. If I get time later, I may try to actually answer the question, but I'm sort of hoping someone else beats me to it. :) -j

May 24, 2011 10:27 AM
rating: 0
 
CRP13

I used a boatload of fuzzy math and did some questionable conclusion-jumping, but my resulting spreadsheet tells me that A and B are pretty darn close to equal.

I'm sure I made a mess of it, so I'll be interested to know the result and the best way to reach it.

May 24, 2011 10:27 AM
rating: 0
 
eamuscatuli

It appears A and B are fairly close but the data from the B provider did a better job of predicting future BABIP. I'm omitting the statistical details behind my assertion because I have no confidence in it.

May 25, 2011 09:24 AM
rating: 0
 
vertumnus

I don't have any special statistical education/knowledge, but I'll give it a shot...

I was mainly focused on LD %, as this was the area where the two sets diverged the most (as far as I could tell), and my understanding is that LD % has the biggest impact on BABIP. Set A had much more extreme LD % values than Set B. The highest LD % in Set A is 40% vs. 28% in Set B, and the lowest is 4% in Set A vs. 14% in Set B. In other words, Set B covers a narrower range than Set A.

Based on this, I'd be more inclined to use Set B to predict future performance, since generally things regress to the mean over time. When Set A says a player has a LD % of 4%, while Set B is saying 18%, 18% is probably closer to what's actually going to happen in the future, because 4% LD % is aberrantly low.

Set A looks to me to be some form of raw play-by-play observed data. Set B, I don't really know, but it seems like it was derived in some way as opposed to being just observed results.

I took a look at a few of the extremes, and the Set B data doesn't appear to match up with reality. In particular, for Cristian Guzman in 1999 (hash=7b0dbb617dcdf7c9), it seems like Set B can't come from what actually happened. Set B is claiming Guzman hit 13 line drives, 57 ground balls, 1 fly ball, and no pop-ups in his first 100 PAs, but the play-by-play data conflicts with this. Unless something weird is going on with bunts, I'm not seeing how those results can be even close to accurate.

I will be interested to learn what Set B is.

I will also be not all that shocked if I'm completely off base on some or all of this...

May 25, 2011 21:44 PM
rating: 0
 
vertumnus

Sorry, just noticed that I wrote the wrong hash value - the correct one is 3c771e9fb7264. The row with LDA_RT_PRIOR = 0.042.

(for some reason, I can't use "Post Reply")

May 26, 2011 07:59 AM
rating: 0
 
TangoTiger

You should see some of the discussion here:
http://www.insidethebook.com/ee/index.php/site/comments/batted_ball_puzzler/

Note that you may fall into the same trap as I did. If you limit yourself to the "A" dataset's 4 batted ball parameters, and ignore the other 3 performance numbers, specifically BACON_PRIOR, you are conferring an advantage on the "B" dataset. That's because the B dataset's 4 batted ball parameters are actually the other 3 performance numbers, but translated into 4 batted ball parameters.

Therefore, you cannot focus on the labels, and presume that LD in A has any relationship to LD in B. And that means you can't discard the 3 performance numbers in the A dataset.
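
One way to check this for yourself is to fit the same outcome twice, once with only an "A" batted ball rate and once with BACON_PRIOR added, and compare the fits. The sketch below is a rough illustration under stated assumptions, not anyone's actual method: `fit_r2` is a hypothetical helper, the six rows are invented numbers rather than rows from the CSV, and the outcome list `babip_rest` is a made-up name for rest-of-season BABIP.

```python
def fit_r2(rows, y):
    """R^2 of an OLS fit of y on the columns of rows (intercept added).

    If you pass all four batted ball rates and they sum to one, drop one
    of them first, or the system below becomes singular.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    pred = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented values for six hypothetical player seasons
lda_rt_prior = [0.18, 0.22, 0.15, 0.25, 0.20, 0.17]
bacon_prior = [0.310, 0.345, 0.290, 0.360, 0.335, 0.300]
babip_rest = [0.295, 0.310, 0.285, 0.320, 0.305, 0.290]  # assumed outcome

r2_bb_only = fit_r2([[v] for v in lda_rt_prior], babip_rest)
r2_with_perf = fit_r2(list(zip(lda_rt_prior, bacon_prior)), babip_rest)
print(f"batted ball only: {r2_bb_only:.3f}, plus BACON_PRIOR: {r2_with_perf:.3f}")
```

In-sample R^2 never goes down when you add a predictor, so a fair comparison on the real file would score both models on held-out player seasons.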

May 26, 2011 10:42 AM
rating: 0
 