The following is an edited transcript of an in-house discussion among the Baseball Prospectus team about Win Probability Added (WPA).

Colin Wyers: Sort of inspired by Will’s comments on Twitter (although I’ve actually been doing a lot of thinking about these things recently), I thought I’d share my favorite example of WPA producing questionable results.

The story so far: Mariners down a run, bottom of the ninth, two outs. Mike Sweeney hits a double to center. The WPA (at least this implementation—you can probably get slightly different answers depending on how you set up the win expectancy tables) is .092. Then Ichiro homers to win the game; WPA records .867.

I am willing, I suppose, to agree that the change in win probability at this point of the game is what WPA says it is. Where I sort of part ways is in giving the whole .867 to Ichiro. That says that Ichiro was nine times as important to winning the game as Sweeney. I, uh, disagree. But the way that WPA is being parceled out here, there’s nobody else to give the WPA to besides Ichiro—all WPA accrues to the batter on offense and the pitcher on defense.

But in a very real sense, it doesn’t matter what Ichiro does if Sweeney doesn’t double. And I mean very real: if Sweeney makes an out in that spot, the game ends before Ichiro can bat. Sweeney receives credit based upon an average performance from Ichiro, but Ichiro didn’t perform average.

That’s the problem I have with metrics that aren’t context-neutral: the context only runs one way. Sweeney provides "context" for what Ichiro does, but Ichiro’s actions don’t provide "context" for Sweeney. Sweeney’s value stays the same regardless of what Ichiro does.
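
The arithmetic behind the two WPA figures can be sketched in a few lines. The win expectancies used here are the 4.1 and 13.3 percent values quoted later in this discussion; the function and variable names are illustrative, and other win expectancy tables would give slightly different numbers.

```python
# Win expectancy (WE) for the Mariners at each step, using the figures
# quoted in this discussion. A win is WE = 1.0 by definition.
WE_BEFORE_SWEENEY = 0.041  # two outs, down a run, before the double
WE_AFTER_DOUBLE = 0.133    # after Sweeney's double (1.0 - .867)
WE_AFTER_HOMER = 1.0       # game over, Mariners win


def wpa(we_before, we_after):
    """WPA for one play: the change in the batting team's win expectancy."""
    return we_after - we_before


# All of each play's WPA is assigned to the batter.
sweeney = wpa(WE_BEFORE_SWEENEY, WE_AFTER_DOUBLE)
ichiro = wpa(WE_AFTER_DOUBLE, WE_AFTER_HOMER)

print(f"Sweeney WPA: {sweeney:.3f}")
print(f"Ichiro WPA:  {ichiro:.3f}")
print(f"Ichiro / Sweeney: {ichiro / sweeney:.1f}")
```

Note that the 0.133 used to credit Sweeney is an average over whatever typically follows a double in that spot. Nothing Ichiro does afterward feeds back into Sweeney's number, which is the one-way context described above.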

Will Carroll: The issue I have with this specific example is that it's not… reversible. If Sweeney doesn't get on, Ichiro's value would go down to zero. If Ichiro walks, his value is unlikely to go down nearly as much. I just don't get how that works out.

CW: Well, on one hand, we have a "fact": the odds of the Mariners winning did in fact improve by 86.7 percentage points on that play. That part of WPA is unassailable.

Then there's a decision to accredit all WPA for a play to the batter on offense and the pitcher on defense. Honestly, I'm about as uncomfortable with that decision as you are.

WC: Let's go a bit further—let's pretend Sweeney has speed and is distracting the pitcher. The team wants to keep him out of scoring position and grooves a fastball to Ichiro on an attempted steal. Ichiro yards it, win. He still gets all the credit? No, no… that's reductionism, not accuracy.

The more I look at this, the more I realize WPA is graph-crack. It makes a nice, easy chart that appears right at first glance, but it implies a level of accuracy and granularity that it simply can't have. It's good and likely "good enough" for many applications, like a "game heartbeat," but at heart, it is the reverse of the people that want to argue intangibles. By agreeing that we can't factor in everything, it throws up its hands, does something pretty, then goes back for another hit off the stat-bong.

Ben Murphy: Mmmmm… stat bong.

Rob McQuown: Sorry, but I would debate this "unassailable" point. WPA is based on "everything being equal," in the same way that every game begins at 50/50. But, obviously, everything is not equal. What is unassailable is that, historically, teams in the situation in which Ichiro came to bat won .867 less often than teams in the post-event state. The variable that is completely ignored in this calculus is that Ichiro's presence alters the percentages (as does the speed of the runner, as Will alluded to). A comparable situation would be using run expectancies, and suggesting that when Carlos Ruiz (batting eighth) gets on base to lead off an inning for the Phillies, the run expectancy is the same as when Shane Victorino gets on base to lead off an inning in front of the massive Phillies order… based on identical base-out states. We don't draw this conclusion, obviously, nor should we conclude that win expectancy is context-independent.

Put more simply, the WE after Sweeney's double might have been higher than 13.3 percent. In this case, it's hard to suggest that it actually was, since Mo is so awesome, but pretend it is Brad Lidge pitching or something. (The truth of the matter is that the original 2-out WE was probably lower than 4.1 percent given Sweeney was batting against Mo). Of course, we are almost compelled to use the 13.3 percent if we want enough sample-size-per-state to have any meaning, but I do think your original point of having variable "context" for each player is an area worth exploring.

Ken Funck: Take this contextual weakness of WPA, sprinkle in the yet-more-confounding variable of pitch sequencing, and you have that which most gives me the willies: the summing of Pitch Type Linear Weights per hundred pitches to calculate, say, the "most effective changeup in baseball." If Neftali Feliz strikes out the side on nine pitches, blowing eight straight fastballs past hitters before making the last guy flail helplessly at a changeup, by pitch-type linear weights his most effective pitch is the changeup.
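
Ken's nine-pitch inning can be worked through as a small sketch. The run values per strike below are made-up round numbers chosen only to show the mechanics, not real pitch-type linear weights, and the names are illustrative.

```python
from collections import defaultdict

# HYPOTHETICAL runs saved by the pitcher for the first, second, and third
# strike of a plate appearance. Made-up values, not real linear weights.
RUNS_SAVED = {1: 0.04, 2: 0.06, 3: 0.17}

# The imagined inning: three batters, three strikes each.
# "FB" = fastball, "CH" = changeup; only the ninth pitch is a changeup.
inning = [["FB", "FB", "FB"], ["FB", "FB", "FB"], ["FB", "FB", "CH"]]

totals = defaultdict(float)
counts = defaultdict(int)
for at_bat in inning:
    for strike_number, pitch in enumerate(at_bat, start=1):
        totals[pitch] += RUNS_SAVED[strike_number]
        counts[pitch] += 1

# Scale each pitch type's total to a per-100-pitches rate.
per_100 = {pitch: 100 * totals[pitch] / counts[pitch] for pitch in totals}

for pitch, value in per_100.items():
    print(f"{pitch}: {counts[pitch]} pitches, {value:.1f} runs saved per 100")
```

Under these assumed values the lone changeup rates about twice as well per hundred pitches as the fastball, purely because it was the pitch thrown with two strikes. The eight fastballs that set it up share their credit across the less valuable early-count strikes.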

Matt Swartz: Hear, hear! The pitch-type linear weights are so incredibly ignorant of context and used so horribly inappropriately that I almost wish they didn’t exist at all. It’s a shame because they could be used so effectively at a theoretical game level to figure out which pitches are being thrown too rarely or too much, but they say nothing about "the most effective changeup in baseball" or anything like that.

This article by Sky Andrecheck is a great one. Hitter performance on pitches has very little persistence year to year. The reason is that you throw pitches that are effective more often and pitches that are less effective less often until you reach equilibrium. Any time your fastball is more effective than your changeup in a given count, you’re probably throwing it too much. For an example no one is talking about lately, take Ryan Howard: his pitch-type linear weights say that in 2009, the slider was the pitch he hit best. No, it isn’t. Pitchers just threw it to him so often that he started looking for it, and so he hit it well enough. But it lowered his opportunities to see fastballs, so he capitalized on them less and did worse on them than he would if he saw them often enough to expect them. It’s a mixed-strategy Nash equilibrium, basically. But people misuse these stats.

WPA is a story stat. Like RBI, it’s fun to know if you like to have a number tell a tale. It’s not a moral. It’s an anecdote. It tells the story of the game. It’s limited in that sense, and should be taken as such.

Russell Carleton: This doesn't solve the leverage problem, but you can plug in a good set of context-neutered linear weights (Colin?) and parse out the credit that way, no?
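
One possible reading of Russell's suggestion is to pool the WPA from the two plays and split it in proportion to context-neutral run values. The sketch below assumes approximate linear-weight values of 0.77 runs for a double and 1.40 for a home run. Those figures, and the proportional split itself, are assumptions for illustration and are not something worked out in this discussion.

```python
# ASSUMED approximate context-neutral run values (linear weights).
# Illustrative only; a real analysis would use a fitted set of weights.
LINEAR_WEIGHTS = {"double": 0.77, "home_run": 1.40}

# WPA actually recorded on the two plays, as quoted in this discussion.
recorded_wpa = {"Sweeney": 0.092, "Ichiro": 0.867}
events = {"Sweeney": "double", "Ichiro": "home_run"}

pooled_wpa = sum(recorded_wpa.values())
total_weight = sum(LINEAR_WEIGHTS[event] for event in events.values())

# Hypothetical reallocation: each batter's share of the pooled WPA is
# proportional to the context-neutral value of what he did.
reallocated = {
    player: pooled_wpa * LINEAR_WEIGHTS[event] / total_weight
    for player, event in events.items()
}

for player, value in reallocated.items():
    print(f"{player}: recorded {recorded_wpa[player]:.3f}, reallocated {value:.3f}")
```

As Russell says, this leaves the leverage problem untouched: the pooled total is still driven by the game situation, and only the division between the two hitters changes.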

MS: Does WPA weight certain statistics too much or too little? Is it biased toward certain players, or just imperfect in attributing credit in a randomly distributed way?

CW: I don't know the answer to that yet. Right now I'm building my own set of win expectancy tables so I can look into some of these questions.

Here’s more fun with WPA—the 1996 Yankees bullpen. Mariano Rivera, the setup man that season, has a 5.4 WPA. John Wetteland, the Yankees closer, has a 4.2 WPA.

At first blush, it makes no sense. Rivera pitched close to twice as many innings as Wetteland, had a lower ERA, higher K, drastically lower HR rate, a practical tie in walk rate—Rivera was much better than Wetteland, to an extent that WPA isn't capturing.

The answer is leverage. Wetteland had a 2.37 aLI, compared to 1.54 for Rivera. Now, if we swapped out Rivera with an average setup man, Wetteland's aLI, and thus his WPA, would drop. Wetteland, and not Rivera, gets all the credit for the extra leverage above average he is being provided by Rivera.
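
A rough way to see how much of each reliever's WPA comes from leverage is to divide the season total by aLI. This is only a crude stand-in for a proper leverage adjustment, which would divide each play's WPA by that play's leverage index. The numbers are the ones quoted above, and the names are illustrative.

```python
# Season WPA and average leverage index (aLI) as quoted in this discussion.
relievers = {
    "Rivera": {"wpa": 5.4, "ali": 1.54},
    "Wetteland": {"wpa": 4.2, "ali": 2.37},
}

# Crude leverage-neutral figure: season WPA divided by aLI.
deleveraged = {name: r["wpa"] / r["ali"] for name, r in relievers.items()}

for name, value in deleveraged.items():
    print(f"{name}: WPA / aLI = {value:.2f}")

ratio = deleveraged["Rivera"] / deleveraged["Wetteland"]
print(f"Rivera relative to Wetteland: {ratio:.2f}x")
```

On this rough measure Rivera comes out at about twice Wetteland's figure, which is much closer to what the innings totals and rate stats suggest than the raw 5.4 versus 4.2.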

Clay Davenport: There's nothing wrong with the win expectancy values—they are what they are. The problem is that the end of a game is a boundary condition, and you are dealing with a model that doesn't really handle boundaries. The phrase that comes to mind (my mind, anyway) is "collapse of the wave-function"—a problem in quantum mechanics that occurs when you take an actual measurement, replacing a probabilistic distribution with a discrete point. Similarly, here we have a system that has been continuously considering all possibilities suddenly reducing to one real answer, and an inordinate amount of credit flows to the person who's there at the end.

CW: Yeah, I’m not suggesting I’m about to set the world on fire with new win-ex tables. I could probably use someone else’s for this; I just have a bad case of wanting to do things myself.

The point about a boundary condition is a good one. But if it’s "breaking" at the end of the game, how far back in the game is it at least bending?

CD: That's a very interesting question. Thinking about it after going to bed, it seemed that the win probability is essentially a function of (run differential)/(time remaining), and time remaining becomes zero at the end. Think of how the value of 1/N changes as you step N down from 100 toward 0 one at a time: the differences are almost imperceptible through most of the range, then accelerate rapidly as you approach zero.
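
Clay's 1/N picture is easy to check numerically. This is a minimal sketch of that analogy only, not a win expectancy model. It stops at N = 1 because 1/N is undefined at zero, and the function name is illustrative.

```python
def reciprocal_step(n):
    """Change in 1/N when N drops from n to n - 1 (requires n >= 2)."""
    return 1 / (n - 1) - 1 / n


# Sample the step size at a few points along the way from 100 down to 1.
for n in (100, 50, 10, 5, 2):
    print(f"N {n:>3} -> {n - 1:>2}: 1/N rises by {reciprocal_step(n):.4f}")
```

The first step moves 1/N by about one ten-thousandth, while the last one moves it by a full half. That is the same shape as the credit piling up on whoever bats at the end of the game.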

WC: Couldn't we take someone like Ichiro—who has a lot of PAs—and see what situations he's in most, then compare him to other players in similar situations to see if the WE is different? I'm curious to see what the range of possible player-adjusted WE might be.