
October 26, 2011

Spinning Yarn

Can We Predict Hot and Cold Zones for Hitters?

by Mike Fast

A few weeks ago, during the division series, Brandon McCarthy remarked on Twitter that it would be more interesting for TBS to show a diagram of the batter hot and cold zones for every batter than to show the PitchTrax strike zone and pitch location graphic. He argued that knowledge of the hot and cold zones would give viewers additional insight into the battle between the pitcher and the batter.

The pitcher-batter confrontation lies at the heart of baseball; learning about it is one of my favorite pursuits. Thus, I was intrigued by McCarthy’s comments. As far as I know, no one has published a study about the reliability of batter hot and cold zone data. If batter performance in particular areas of the strike zone is very repeatable, that knowledge could be highly valuable, both to teams and to fans. On the other hand, perhaps such data is no more useful than knowing that a batter is 2-for-10 in the postseason or 3-for-7 against a given pitcher in his career, in which case it might be interesting for entertainment purposes but practically useless for decision-making in a game.

Batter hot and cold zone information has been provided for at least a decade by scouting services like Inside Edge, and teams routinely make this information available to their players. Starting in the 2010 season, MLB Advanced Media’s online Gameday application added hot and cold zone information developed from PITCHf/x data.

These hot and cold zone reports typically divide the strike zone into nine zones using a 3x3 grid and report the player’s past batting average in each zone over some time period, along with coloration to assist in recognition of the hot and cold areas for the batter. Other analysts and data sources have used other grid sizes to divide the zone, but the 3x3 grid is by far the most popular and recognizable presentation of this data.

Other analysts have recently used heat maps, often based upon PITCHf/x data, for similar purposes without being constrained to a 3x3 grid depiction of the data. TruMedia’s heat maps are a common example.

In addition, hot/cold zones may be based upon slugging average, runs above average, or some other metric besides batting average.

In order to evaluate the usefulness of the hot and cold zone data, I took the results of every plate appearance for which we have detailed pitch location data from PITCHf/x during the period 2007-2011 and assigned those results to the location of the final pitch of each plate appearance. I grouped the pitches by zones for each batter and calculated the average run values for each zone using linear weights. I split the data for each batter into two halves, randomly assigning games from 2007-2011 into each half for comparison.
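The split-half setup described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the study: the record layout `(game_id, zone, run_value)`, the function names, and the fixed random seed are all assumptions, and the run values are taken as already computed from linear weights.

```python
import random
from collections import defaultdict


def assign_halves(game_ids, seed=0):
    """Randomly assign each distinct game to half 1 or half 2."""
    games = sorted(set(game_ids))
    rng = random.Random(seed)  # fixed seed (an assumption) for a reproducible split
    rng.shuffle(games)
    cutoff = len(games) // 2
    return {g: (1 if i < cutoff else 2) for i, g in enumerate(games)}


def mean_run_value_by_half_and_zone(plate_appearances, half_of_game):
    """Average run value per (half, zone) for one batter.

    plate_appearances: iterable of (game_id, zone, run_value) tuples, where
    zone is the bin of the final pitch of the plate appearance.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for game_id, zone, run_value in plate_appearances:
        key = (half_of_game[game_id], zone)
        totals[key] += run_value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

Splitting by game rather than by individual plate appearance keeps all of a batter's plate appearances from one game in the same half.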

First, I examined the traditional division of the strike zone into nine zones. I divided all the pitches within the strike zone that ended a plate appearance into nine fixed bins. I separated the pitches vertically at 1.74, 2.30, 2.86, and 3.42 feet. (I ignored the height of the batter, but when I controlled for it later, it had little effect on the split-half correlations.) I separated the pitches horizontally by dividing the plate in thirds at +/-0.83 and +/-0.28 feet.
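As a rough sketch of the nine-zone binning, the function below maps a pitch location to a (row, column) cell using the boundaries quoted above. The names `px` and `pz` (horizontal and vertical location in feet), the row and column numbering, and the treatment of pitches exactly on a boundary are assumptions made for illustration.

```python
from bisect import bisect_right

# Boundaries in feet, taken from the text above.
X_EDGES = [-0.83, -0.28, 0.28, 0.83]   # horizontal: plate divided in thirds
Z_EDGES = [1.74, 2.30, 2.86, 3.42]     # vertical: fixed zone, batter height ignored


def zone_3x3(px, pz):
    """Return (row, col) for a pitch inside the strike zone, else None.

    row 0 is the lowest third and col 0 is the most negative px third.
    Each bin includes its lower edge and excludes its upper edge (an assumption).
    """
    col = bisect_right(X_EDGES, px) - 1
    row = bisect_right(Z_EDGES, pz) - 1
    if 0 <= col <= 2 and 0 <= row <= 2:
        return row, col
    return None
```

For example, `zone_3x3(0.0, 2.5)` falls in the middle cell, while a pitch at `px = 1.0` is outside the zone and returns `None`.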

I ran a regression for all the right-handed batters with at least 630 plate appearances in 2007-2011 that ended on a pitch in the strike zone. I used performance in a given zone in one half of the sample along with performance in the other eight zones in the same half of the sample to predict performance in that zone in the other half of the sample. The resulting regression equation is as follows, where performance is measured in runs above average:

Zone Performance in Split Half 2 = (0.17 * Zone Performance in Split Half 1) + (0.43 * Performance in Other Eight Zones in Split Half 1) + (0.40 * League Average Performance).

The correlation coefficient was r=0.30, and both input variables were highly significant (p<.0001).

The biggest problem with this data is that, even when considering five seasons, most batters had fewer than 100 plate appearances in each of the nine zones in either half of the sample. Our best predictive results occur when we give the observed zone performance a weight of only 17% and base the remainder of the prediction on the batter's performance in the other eight zones (43%) and the league average performance (40%).
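To make the weighting concrete, here is a small sketch that applies the nine-zone regression equation to one zone. The default weights come from the equation above; the function name and the input values (in runs above average, here assumed to be on a per-plate-appearance scale) are hypothetical.

```python
def predict_zone_performance(zone_perf, other_zones_perf, league_avg,
                             w_zone=0.17, w_other=0.43, w_league=0.40):
    """Weighted blend of observed zone, other-zone, and league performance."""
    return (w_zone * zone_perf
            + w_other * other_zones_perf
            + w_league * league_avg)


# Hypothetical batter: looks very hot in one zone (+0.050) but is only
# slightly above average elsewhere (+0.010); league average taken as 0.
prediction = predict_zone_performance(0.050, 0.010, 0.0)
print(round(prediction, 4))  # 0.0128
```

In this made-up case, an apparent +0.050 hot zone shrinks to roughly +0.013 once it is regressed toward the batter's other zones and the league average.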

There is some true signal there amidst the noise, but if we expect pitchers to form their pitching strategies based upon this data, that signal is very weak. Let’s look at a couple of typical examples.

The first group of zone data suggests that you want to pitch Michael Young on the outer third of the plate or over the middle of the plate if you keep the ball down. If you need to come inside, you might be able to sneak one up and in without too much damage.

If you executed that strategy against Young in the other half of the sample, you would do pretty well in all the zones on the outside third, but you would get killed whenever you tried to sneak one up and in, and you would not do too well with pitches down over the middle of the plate, either.

Yuniesky Betancourt is a weaker hitter, and the zone data from the first half of our sample suggests that the only places in the zone where he is a real threat are up and in and down over the middle of the plate. Moreover, he is very vulnerable both down and away and over the heart of the plate.

A pitcher who pitched to those cold zones in the other half of the sample would find Betancourt a decent hitter and might miss his weakest spots up and away or down over the middle of the plate.

Young and Betancourt are typical examples. At the extremes, Freddy Sanchez is an example of the best zone correlation between sample halves, and Mike Napoli is an example of the worst zone correlation between sample halves.
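One plausible way to score an individual batter's agreement between halves is a Pearson correlation across his zone values. The sketch below assumes that reading; the article does not spell out the per-batter calculation, and the function name and input layout are hypothetical.

```python
from math import sqrt


def pearson_r(half1_values, half2_values):
    """Pearson correlation between matched zone values from the two halves."""
    if len(half1_values) != len(half2_values) or len(half1_values) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(half1_values)
    mean1 = sum(half1_values) / n
    mean2 = sum(half2_values) / n
    cov = sum((a - mean1) * (b - mean2)
              for a, b in zip(half1_values, half2_values))
    var1 = sum((a - mean1) ** 2 for a in half1_values)
    var2 = sum((b - mean2) ** 2 for b in half2_values)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var1 * var2)
```

With only nine zone values per batter, a correlation computed this way is itself very noisy, so it is better suited to ranking examples than to drawing firm conclusions about any one hitter.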

The division of the strike zone into nine boxes does not seem to serve us very well. The hitters do have tendencies toward hot and cold areas, but dividing into nine pieces makes the data very noisy and unreliable, and it becomes difficult to pick the true tendencies out of the vagaries of the noise. Moreover, imagine what would happen to the sample sizes if we split the data further by pitch type or if we used only a single season of data.

Perhaps using fewer zones in order to increase the sample size would produce more statistically meaningful and practically useful results. I experimented with a few different possibilities but ultimately settled on using four zones, including the area outside the strike zone. I extended the sample beyond the boundaries of the strike zone to include the susceptibility of a batter to chasing bad pitches, with the added benefit of increasing the total sample of PA-ending pitches by over 50 percent.

I divided all the pitches that ended a plate appearance into four bins, separated at the vertical and horizontal midpoints of the pitch location distributions. I separated the pitches vertically at 2.4 feet. For right-handed batters, I separated the pitches horizontally at 0.07 feet, just slightly outside from the middle of home plate. For left-handed batters, I separated the pitches horizontally at -0.28 feet, a few inches outside from the middle of home plate.
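The quadrant assignment can be sketched as below, using the split points quoted above. The sign convention is an assumption inferred from the text (positive horizontal values are outside for right-handed batters, negative values are outside for left-handed batters), as are the names `px`, `pz`, and `bats` and the handling of pitches exactly on a split line.

```python
VERTICAL_SPLIT = 2.4                          # feet, from the text
HORIZONTAL_SPLIT = {"R": 0.07, "L": -0.28}    # feet, from the text


def quadrant(px, pz, bats):
    """Label a PA-ending pitch as up/down and in/away for the batter."""
    if bats not in HORIZONTAL_SPLIT:
        raise ValueError("bats must be 'R' or 'L'")
    split = HORIZONTAL_SPLIT[bats]
    # Assumed convention: inside is the negative side for right-handed
    # batters and the positive side for left-handed batters.
    inside = px < split if bats == "R" else px > split
    vertical = "up" if pz >= VERTICAL_SPLIT else "down"
    horizontal = "in" if inside else "away"
    return f"{vertical}-and-{horizontal}"
```

Unlike the nine-zone grid, this function has no "outside the zone" result: every PA-ending pitch lands in one of the four quadrants, which is what enlarges the sample.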

With larger sample sizes, the split-half correlation improved somewhat, as expected. However, even with only four zones, much noise remained in the results. Here is the regression equation for right-handed batters:

Zone Performance in Split Half 2 = (0.32 * Zone Performance in Split Half 1) + (0.32 * Performance in Other Three Zones in Split Half 1) + (0.36 * League Average Performance).

The correlation coefficient was r=0.46, and both input variables were highly significant (p<.0001).

With the larger zones yielding between 200 and 300 plate appearances in each half of the sample, both the split-half correlations and the statistical significance of the results improved.

Do the results have better baseball meaning? Let’s revisit a couple of our earlier examples.

We see that Young has a consistent hot zone up and in and that his weakest zone is down and away.

Betancourt is weak on the outside part of the plate and a little closer to capable on the inside half, particularly up and in.

The 3x3 grid contained about the same information as the 2x2 grid, but the 3x3 grid gave us a false sense of greater granularity than is present in the data, at least at the sample sizes we typically have available. Given the limits of the ability of most pitchers to locate within a small zone, the 2x2 grid is likely more representative of actual pitching strategy anyhow.

Heat maps of batter hot and cold zones should be regarded with a similar sense of skepticism, depending on the sample size involved. Such heat maps are drawn from the same underlying data and should have similar statistical correlation between sample halves. If a heat map has insufficient spatial smoothing of the data, it could be an even less reliable predictor of future performance than a 3x3 grid.

Levels of coloration, whether for the gridded bins or for heat maps, are another important facet of how accurately hot and cold zone information is communicated to the viewer. I chose to use six levels of coloration in the examples in this article, with the traditional red for hot and blue for cold. I observed, however, that the switch from light blue (for slightly below average) to light red (for slightly above average) seemed to have more visual impact than the change in performance warranted. I did not experiment further to find an optimal color palette, but I would warn both creators and viewers of hot/cold zone graphs that proper interpretation is heavily affected by the choice of palettes.
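The palette effect is easy to reproduce with a toy six-level mapping. The thresholds and color names below are hypothetical (the article does not give the cut points it used); the point is only that two nearly identical values on either side of average land in different color families.

```python
from bisect import bisect_right

# Hypothetical cut points, in runs above average, for six color levels.
THRESHOLDS = [-0.04, -0.02, 0.0, 0.02, 0.04]
COLORS = ["dark blue", "blue", "light blue", "light red", "red", "dark red"]


def color_for(runs_above_average):
    """Map a zone's performance to one of six color levels."""
    return COLORS[bisect_right(THRESHOLDS, runs_above_average)]


print(color_for(-0.001))  # light blue
print(color_for(0.001))   # light red
print(color_for(0.019))   # light red
```

Here a zone at -0.001 and a zone at +0.001 flip from blue to red, while zones at +0.001 and +0.019 share a color, even though the second pair differs far more in performance.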

Let’s close by looking at the True Average batting leaderboards for the four quadrants of the hitting area for batters with at least 1000 plate appearances in 2007-2011.

Batter              Hand  Up-and-In PA  Up-and-In TAv
Josh Hamilton       L     645           .392
Kevin Youkilis      R     686           .380
Josh Willingham     R     573           .377
Albert Pujols       R     697           .376
Paul Konerko        R     723           .367
Garrett Atkins      R     415           .365
Dan Uggla           R     698           .364
Evan Longoria       R     659           .362
Joey Votto          L     690           .362
Travis Hafner       L     564           .361
Average             R     551           .279
Average             L     447           .278
Ronny Paulino       R     341           .221
Julio Lugo          R     349           .218
Jason Kendall       R     571           .216
Gabe Gross          L     286           .216
Emilio Bonifacio    L     266           .213
Mark Kotsay         L     327           .213
Alcides Escobar     R     422           .213
Chris Getz          L     292           .212
Jack Hannahan       L     355           .211
Mike Jacobs         L     259           .205

Nelson Cruz, who gained attention for hitting six home runs against the Detroit Tigers in the AL Championship Series, just missed the top ten here, with a .356 TAv in plate appearances that ended on a pitch up and in. All six of his ALCS home runs were hit off pitches up and in.

Batter              Hand  Up-and-Away PA  Up-and-Away TAv
Lance Berkman       L     419             .453
Prince Fielder      L     793             .383
Chipper Jones       L     453             .381
Adrian Gonzalez     L     671             .374
Manny Ramirez       R     414             .373
Albert Pujols       R     772             .371
Carlos Beltran      L     411             .361
Matt Holliday       R     471             .359
Miguel Cabrera      R     631             .359
Vladimir Guerrero   R     530             .355
Average             R     459             .263
Average             L     439             .288
Clint Barmes        R     517             .214
Garrett Atkins      R     398             .212
Austin Jackson      R     240             .210
Carlos Gomez        R     329             .209
Brandon Inge        R     526             .207
Jason Kendall       R     417             .206
Franklin Gutierrez  R     480             .205
Ronny Cedeno        R     291             .202
Sean Rodriguez      R     246             .201
Jeff Mathis         R     279             .159

Batter              Hand  Down-and-In PA  Down-and-In TAv
Mark Teixeira       R     224             .426
Joey Votto          L     689             .395
Justin Morneau      L     549             .372
Ryan Braun          R     590             .372
Mark Teixeira       L     472             .370
Aramis Ramirez      R     511             .365
Miguel Cabrera      R     722             .363
Marlon Byrd         R     509             .362
Manny Ramirez       R     343             .361
Josh Hamilton       L     482             .359
Average             R     415             .276
Average             L     436             .278
Andy LaRoche        R     244             .228
Garrett Atkins      R     257             .226
Edgar Renteria      R     440             .225
Cristian Guzman     L     254             .221
Chris Getz          L     307             .221
Aaron Rowand        R     413             .216
Jason Varitek       L     243             .211
Omar Vizquel        L     249             .210
Aaron Miles         L     276             .208
Bobby Crosby        R     229             .193

Batter              Hand  Down-and-Away PA  Down-and-Away TAv
Jose Bautista       R     714               .308
Pablo Sandoval      L     382               .293
Alex Rodriguez      R     538               .291
Carlos Pena         L     623               .290
Edwin Encarnacion   R     569               .285
Shin-Soo Choo       L     523               .285
Mark Teixeira       R     254               .284
Milton Bradley      L     248               .280
Jim Thome           L     423               .279
Miguel Cabrera      R     693               .278
Average             R     539               .211
Average             L     436               .219
Rod Barajas         R     475               .169
Pedro Feliz         R     515               .169
Sean Rodriguez      R     264               .168
Yorvit Torrealba    R     438               .164
Jeff Mathis         R     384               .161
Craig Counsell      L     313               .161
Ivan Rodriguez      R     472               .158
Aaron Miles         L     278               .157
Andy LaRoche        R     339               .153
Miguel Olivo        R     588               .133

Hot and cold zone information for batters does have some predictive value for future performance, but all such data continually flirts with the problem of small sample sizes, more so as the hitting area is divided into smaller grids or more granular heat maps. Aggregating into bigger zones improves the predictability. A 2x2 grid works fairly well with multi-year samples, but for smaller samples of data, even that level of aggregation may not be sufficient to render the hot and cold zone information useful.

Mike Fast is an author of Baseball Prospectus. 