April stats are meaningless. OK, that’s not entirely fair. March stats are meaningless; April stats are just misleading. As Joe Sheehan pointed out yesterday, most everyone knows this and understands it, but when you love talking about baseball, no one wants to say “let’s wait until July.” Instead, we qualify all our statements before launching into discussions of **Brian Roberts**’ home run chase, **Tim Hudson**’s hard luck, and **Edgardo Alfonzo** chasing .400.

As an exercise in restraint, here are the best and worst hitters as of April 30, 2004, as ranked by MLVr (minimum 50 PA in April and 300 on the year):

| Batter | Year | AVG | OBP | SLG | MLVr |
|---|---|---|---|---|---|
| Barry Bonds | 2004 | .472 | .696 | 1.132 | 1.481 |
| Charles Johnson | 2004 | .333 | .458 | .875 | .848 |
| Lew Ford | 2004 | .419 | .471 | .710 | .784 |
| Adam Dunn | 2004 | .328 | .538 | .750 | .767 |
| Sean Casey | 2004 | .414 | .458 | .667 | .698 |
| Jim Thome | 2004 | .364 | .456 | .714 | .682 |
| Moises Alou | 2004 | .361 | .400 | .735 | .645 |
| Manny Ramirez | 2004 | .388 | .448 | .647 | .617 |
| Laynce Nix | 2004 | .365 | .397 | .714 | .617 |
| Ron Belliard | 2004 | .417 | .500 | .548 | .582 |

| Batter | Year | AVG | OBP | SLG | MLVr |
|---|---|---|---|---|---|
| Neifi Perez | 2004 | .220 | .260 | .275 | -.371 |
| Gabe Kapler | 2004 | .233 | .270 | .250 | -.380 |
| A.J. Pierzynski | 2004 | .236 | .267 | .250 | -.385 |
| Luis Rivas | 2004 | .190 | .227 | .317 | -.391 |
| Tike Redman | 2004 | .226 | .229 | .301 | -.391 |
| Jimmy Rollins | 2004 | .183 | .263 | .268 | -.392 |
| Ty Wigginton | 2004 | .188 | .216 | .333 | -.394 |
| Alex Gonzalez | 2004 | .182 | .222 | .312 | -.413 |
| Jason Phillips | 2004 | .162 | .275 | .221 | -.435 |
| Derek Jeter | 2004 | .168 | .255 | .232 | -.460 |

While **Barry Bonds** had already established his dominance, quite a few of these players (**Charles Johnson**, **Laynce Nix**, **Derek Jeter**, **Jimmy Rollins**) did not finish the year anywhere near where they began. Similarly, on the morning of May 1 last year, the Red Sox were 15-6, the Orioles 12-9, and the Yanks 12-11. Texas was leading the AL West, and the Cardinals were 12-11, a game and a half behind the Astros and Cubs, who were tied for the division lead at 13-9.

Though there are always a few outliers every April, simply dismissing the first month of the season is obviously not the way to go. Games in April count as much as games in September; it’s just that the ones in September have greater implications because the likelihood of various outcomes is vastly different. Much like leverage as it pertains to relievers, games later in the season have an apparently larger bearing on the standings. But a slow April, much like a starter who gets shelled in the early innings, can make those late games meaningless.

Similarly, with individual player statistics, we can estimate just how meaningful that first month is. There are a couple of different ways to do this. The first is to use something called confidence intervals for population proportions (referred to as “p-hat” because the symbol is a “p” with a “^” over it). P-hat allows us to determine how accurate our data is at varying degrees of confidence. Essentially, based on the sample size, the normal distribution curve, and the value in question, p-hat gives us a quick formula for a range within which the “true” value lies.
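The formula itself isn't spelled out here. One common textbook form, offered as an assumption rather than as the exact calculation used for this piece, is the normal-approximation interval for a proportion:

$$
\hat{p} \pm z \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}
$$

Here $\hat{p}$ is the observed rate (say, a hitter's OBP), $n$ is the number of chances (plate appearances), and $z$ is the cutoff from the normal curve for the chosen confidence level: roughly 1.96 for 95% and 1.28 for 80%.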

The best way to think of p-hat is like a coin. We “know” the coin will land on heads 50% of the time if we flipped it forever, but if we only flip it five times, obviously it’s not going to come up at 50%. As the number of flips increases, the more information we have about the coin and the closer the total proportion of heads flips will be to 50%. There’s a normal curve of outcome distributions with 50% being the most likely (in the middle of the curve) and higher and lower proportions of heads less likely (the tails). Selecting a certain percentage of the area under the curve gives us that much confidence that the “true” likelihood of a heads flip falls within the selected range. Using p-hat, we can estimate the minimum and maximum values that bound that area. The more times we flip the coin, the tighter the curve gets, and thus the closer the minimum and maximum values get to the mean for a particular confidence level.
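To put rough numbers on the coin example, here is a minimal Python sketch. It computes the exact binomial chance that a fair coin's share of heads lands between 40% and 60% for a given number of flips, rather than using the normal curve described above. The 40-60 band and the function name `prob_heads_share_within` are arbitrary choices made for illustration.

```python
from math import comb


def prob_heads_share_within(flips, low_pct=40, high_pct=60):
    """Exact chance that a fair coin's share of heads lands in [low_pct, high_pct]."""
    # Count the head totals that fall inside the band. Integer arithmetic keeps
    # the boundary cases (exactly 40% or 60% heads) from being lost to rounding.
    favorable = sum(
        comb(flips, heads)
        for heads in range(flips + 1)
        if low_pct * flips <= 100 * heads <= high_pct * flips
    )
    # Every sequence of flips is equally likely for a fair coin.
    return favorable / 2 ** flips


for flips in (5, 50, 500):
    print(flips, round(prob_heads_share_within(flips), 3))
```

With five flips, only two or three heads fall in the band, which is 20 of the 32 possible sequences, or 62.5%. As the flips pile up, the chance of staying inside the same band climbs toward certainty, which is the tightening curve in action.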

Getting back to ballplayers, in 2004, Bonds had an OBP of .696 over his first 92 PAs of the season. Using p-hat, we can say that there is a 95% chance that Bonds’ “true” OBP is between .602 and .790. If we want to scale back to an 80% confidence interval, the boundaries are .635 and .757. Bonds finished the 2004 season with a .609 OBP, within the 95% range but outside the 80% range; over the larger set of all ballplayers, though, p-hat is very accurate.
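The Bonds figures can be checked with a short Python sketch. It assumes the standard normal-approximation interval for a proportion, which is a guess at the method because the article doesn't show its calculation. The function name `proportion_interval` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_hat, n, confidence=0.95):
    """Normal-approximation interval for a proportion observed over n chances."""
    # Two-sided cutoff from the standard normal curve (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Bonds through April 2004: .696 OBP over 92 plate appearances.
for level in (0.95, 0.80):
    low, high = proportion_interval(0.696, 92, level)
    print(f"{level:.0%}: {low:.3f} to {high:.3f}")
# 95%: 0.602 to 0.790
# 80%: 0.635 to 0.757
```

Under that assumption, the output matches the ranges quoted above. Raising `n` with the same OBP narrows both ranges.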

Unfortunately, there are two problems with applying p-hat to the data above. The first is that p-hat is used with binomial variables, so something like OBP or AVG works well since it’s dealing with a simple question of yes/no: hit/no hit; on-base/not on-base. SLG and MLVr, however, are not simple binomials, and thus we can’t use p-hat for them.

Secondly, even after the season is over, the confidence intervals using p-hat are very large. This is because a 162-game season isn’t nearly long enough to confidently determine a player’s “true” ability. Keith Woolner discussed this with regard to teams a few weeks ago, but the same goes for players. A total of 600-700 plate appearances is a lot, but based on confidence intervals, even with a sample size that large, the 95% confidence range is typically between 90 and 100 points of OBP. Looking at everyone who had an OBP of .350 in 2004, that means that one out of every 20 of them had a “true” OBP of over .390 or under .310. Given a larger sample size, such as a full career, they’ll likely regress towards their “true” OBP. Thus, comparing confidence ranges based on April stats to confidence ranges based on full-season stats gets us into large areas of overlap as well as some rather complicated confidence measurements of the results.

Instead, in an effort to keep things a little simpler, let’s see how the actual April stats compare to the full-season results. Looking again at the list above, some of those names are right where we’d expect them. Bonds is on top, joined by **Adam Dunn**, **Jim Thome**, and **Manny Ramirez**. Aside from Jeter, most of the players on the bottom are some of the lightest hitters in baseball: **Neifi Perez**, **Luis Rivas**, and occasional #3 hitter **Tike Redman**. Far from being worthless, stats in April are more often than not a good indicator of the season to come.

Getting back to the sample group of all players registering at least 50 PA in April and 300 PA on the season, here’s how well correlated the stats are, or in essence, how well April numbers predict the rest of the season for 2000-2004:

Using April stats from that season, the coefficient of determination (r-squared) is .346, meaning that the April stats explain about 34.6% of the variance in MLVr. Given that they make up about 16.7% of the season total, that’s not very impressive. Contrast that with the previous season’s MLVr:

That’s not that much better, but notice both the change in scale (as April MLVr has a much wider range) and slope, indicating that there isn’t nearly as much regression to the mean from season to season as from April to the end of the year. Looking at the previous year’s MLVr reveals that we’re not dealing with a case where April stats pale in comparison to other simple predictive measures. Running the two together as a multivariable regression, r-squared rises to .5595 with the previous season’s MLVr about twice as valuable as April’s MLVr.

So where does all this leave us? For starters, April stats are not meaningless, but rather there are a few outliers every season that draw a lot of attention. On the other hand, those outliers cannot simply be written off as a hot streak or cold streak. Instead, when combined with the previous season’s stats or other projections, they can give an early indication about the expected performance of players this season. Averaging two helpings of last year’s MLVr with one helping of April’s MLVr will get you most of the way to an estimate of how a player is going to perform over the course of the season.
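That rule of thumb is a two-to-one weighted average, sketched below in Python. The function name `blended_mlvr` and the sample inputs are hypothetical, chosen only to show the arithmetic. The 2:1 weights come from the rough "twice as valuable" finding above, not from fitted regression coefficients.

```python
def blended_mlvr(previous_season, april):
    """Weight last year's MLVr twice as heavily as this April's MLVr."""
    return (2 * previous_season + april) / 3


# Hypothetical hitter: .300 MLVr last season, .600 MLVr this April.
print(round(blended_mlvr(0.300, 0.600), 3))  # 0.4
```

A hot April pulls the estimate up, but only a third of the way from last year's level toward the April number.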

The other lesson is that only 95% of players will fall into those large confidence ranges mentioned above. While it’s difficult to generate them for MLVr, of the top 20 players in AVG or OBP at the end of April, it’s a good bet that one of them will finish more than 90 points of OBP or 80 points of batting average above or below their current pace. Will it be Roberts? **Clint Barmes**? **Jacque Jones**? We’ll have to watch to find out.