DVOA and other topics in football analytics

Bellhorn

Lumiere
SoSH Member
Aug 22, 2006
2,328
Brighton, MA
I have noticed a few posts in recent weeks that have begun with something along the lines of "I'm not sure how DVOA works, but...". (There are also a couple of posters in this forum who seem to find the stat personally offensive, for reasons that I have not quite been able to fathom). So I am going to take out my recent football frustrations by opening up a discussion of this stat, which I hope will supplement the official Football Outsiders write-up (available here) and the Aaron Schatz chat from a couple of years ago in terms of making it more universally accessible. And I hope that others will chime in with thoughts on DVOA or on other topics in football analytics.

In a nutshell, DVOA (Defense-adjusted Value Over Average) is basically just yards per play, adjusted for the most significant elements of game context. This adjustment is a three-step process corresponding to the elements of the acronym. FO does a good job of explaining this, so I will only briefly touch on each:

Step 1: Value. Not all yards are created equal, due to differing implications for the probability of picking up a first down. A four-yard gain is much more valuable on 3rd and 4 than it is on 3rd and 15. So DVOA begins by assigning a success value to each play based on the down/distance context in which it occurred.

Step 2: Over Average. As other elements of game context change, so does the importance of achieving plays that are nominally successful by the standard established in Step 1. When trailing by 17 points with 5 minutes left, picking up a first down is of little consequence. DVOA compares the success value of plays to the average value to be expected, based on a database of plays in comparable game situations, in order to account for this.

These first two steps generate VOA, which can be a useful stat in its own right.

Step 3: Defense-adjusted. This is actually something of a misnomer: "opponent-adjusted" would be more accurate. Given the small number of games in an NFL season and the division-heavy schedule, some teams will find it much easier than others to achieve (context-adjusted) per-play success. So the final step is to adjust VOA in comparison with the baseline implied by the opponents faced.
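
To make the three steps concrete, here is a toy sketch of how they might fit together. FO does not publish its actual success values, baselines, or opponent adjustments, so every number and every function name below is invented purely for illustration; in particular, scaling the baseline by a per-play "opponent factor" is my assumption, not FO's documented method.

```python
# Toy illustration only. All values, baselines, and factors are invented;
# FO's real inputs and formulas are not public.

def voa(play_values, baseline_values):
    """Value over average: how far total success value sits above or below
    the total expected in the same situations, as a fraction."""
    total = sum(play_values)
    expected = sum(baseline_values)
    return (total - expected) / expected


def opponent_adjusted(play_values, baseline_values, opponent_factors):
    """Scale each baseline by a hypothetical opponent factor before comparing.
    A factor below 1.0 stands for a tough opponent, so less is expected."""
    adjusted = [b * f for b, f in zip(baseline_values, opponent_factors)]
    return voa(play_values, adjusted)


# Step 1: success value assigned to each of four plays (invented)
values = [1.0, 0.0, 2.0, 1.0]
# Step 2: league-average value in comparable situations (invented)
baseline = [0.8, 0.8, 1.0, 0.9]
# Step 3: the opponent is assumed to be 10% tougher than average
factors = [0.9, 0.9, 0.9, 0.9]

print(f"VOA:  {voa(values, baseline):+.1%}")
print(f"DVOA: {opponent_adjusted(values, baseline, factors):+.1%}")
```

The only point of the sketch is the ordering: the same plays look better once the baseline is lowered to reflect a stronger opponent.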

Again, all of this is well explained in the FO write-up, which also offers empirical confirmation that DVOA is better than unadjusted yards per play in terms of both autocorrelation and correlation with game results. The interesting question is why this is the case - in particular why, in contrast with baseball, autocorrelation improves when we include elements of game context. I think it can be most succinctly explained as follows: DVOA is based on the understanding that in football, the game state provides the players with actionable information. When it is 3rd and 4, both offense and defense know that the fourth yard is the make-or-break yard, and can (should) adjust their actions accordingly. And when defending a 17-point lead with five minutes remaining, the defense knows that a 12-yard gain for a first down is all but irrelevant, and will play the situation much differently than they would if they were trailing by 3 with three minutes remaining. As such, we can see that context-adjusted success, not raw yards per se, is the currency in which per-play performance should be measured in football.

So it is interesting to note that the sabermetric revolution has, to a large extent, proceeded in opposite directions in baseball and football. In baseball, most traditional statistics simply assign game results to the player most obviously connected to them (e.g. runs scored, RBI, W-L, and even ERA). But we now know that this practice is sub-optimal, as in baseball, game context does not (to first approximation) provide any such actionable information. The batter is always trying to hit the ball as hard as he can, or get on base via a walk, while the pitcher is always trying to prevent him from doing either.* So we need to remove exogenous game context factors from player stats such as RBI, in order to avoid crediting/debiting players for factors over which they have (virtually) no control. In football, on the other hand, responding to game context is a significant element of the player's performance on each snap, and as such needs to be added on to the raw yardage figures that have traditionally been used.

Reading through this forum, one finds various criticisms of DVOA, which are of varying levels of interest. We can begin by addressing those who seem almost to feel threatened by the stat, and noisily object whenever it is introduced into discussion (e.g. dcmissle's oh-so-eloquent "Fuck DVOA" post from last week.) I can only assume that this attitude is based on an assumption that proponents of DVOA somehow view it as the be-all and end-all of football analysis. If so, this assumption is mistaken in virtually all cases - I don't know of anyone, including FO writers themselves, who attempt to use the stat in this manner. (See Schatz's response to dcmissle in the chat, for example.) As discussed above, DVOA is probably the best stat we have for measuring the per-play performance of a given team at a given point in the season; as such, it forms a useful starting point for evaluation of various questions, such as an upcoming playoff match-up. But of course, a fully robust analysis will go well beyond this, and attempt to show how the particular game may differ due to individual player match-ups, game plans, etc. In the old days, we might have begun such a discussion by pointing out that Team A averaged, say, 8.0 yards per pass attempt over the course of a season, while their opponent gave up 7.0 yards per pass attempt on defense. While it would be perfectly correct to insist that further analysis remained to be done, would it really occur to anyone to respond to this with "Fuck YPA?"

Much more interesting are specific quibbles with the results that DVOA generates, which are occasionally counter-intuitive in the extreme. One example of this occurred in Week 1 of this season, when the Patriots did not score well by VOA in their narrow win over Buffalo, despite what seemed like a clear superiority on a per-play basis. Discussion in this forum centered on the possibility that FO's approach to the value-adjustment component (Step 1 above) is sub-optimal, and I do believe that this is likely to be the case. As FO's write-up makes clear, they assign "success points" to a play based on the yards gained relative to the down and distance. 1st down plays are successful if they gain 45% of the required yards, 2nd down plays if they gain 60%, and 3rd/4th down plays only if they actually gain a first down. This is based on research from The Hidden Game of Football, where the authors show that a team is approximately as likely to achieve a new first down when facing 2nd and 6 as on 1st and 10; therefore, gaining 40% of the yards required on 1st down has kept the team "on schedule" toward their next first down.
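
The success thresholds described above can be written down in a few lines. This is a minimal sketch of the pass/fail benchmark only: the function name is mine, and it ignores the fractional success points and any other adjustments that FO applies.

```python
# Share of the yards to go (in percent) needed for a "successful" play,
# as described in the post: 45% on 1st down, 60% on 2nd, 100% on 3rd/4th.
REQUIRED_PCT = {1: 45, 2: 60, 3: 100, 4: 100}


def is_successful(down, to_go, gained):
    """True if the gain meets the benchmark for this down and distance.
    Integer arithmetic avoids floating-point surprises at the boundary."""
    return 100 * gained >= REQUIRED_PCT[down] * to_go


print(is_successful(1, 10, 5))   # True: 5 yards clears the 4.5 needed
print(is_successful(1, 10, 4))   # False: 4 yards falls short of 4.5
print(is_successful(3, 4, 4))    # True: four yards on 3rd and 4 converts
print(is_successful(3, 15, 4))   # False: the same four yards on 3rd and 15
```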

When I first read about the DVOA methodology around ten years ago, I thought that this made good intuitive sense: staying on schedule in terms of first-down probability should confer a disproportionate degree of benefit, as it seemingly keeps the entire play-book open, and avoids ending up in obvious passing situations. So I was a little surprised a few years later when I found this post by Brian Burke at AdvancedNFLStats.com, where he shows that first-down probability is more or less linear with respect to yards required. There is no obvious inflection point at the more moderate distances on second and third down that would seem to lend themselves to greater offensive flexibility (at least, not until you get to around 3rd and 1). As such, while keeping on schedule for the next first down is perhaps mildly interesting as a benchmark for play success, there is no obvious reason to assign it disproportionate importance in the success points system.

It might be suggested here that FO handles this objection through their system of fractional success points - a success value of 1 is simply a benchmark on a continuous value scale, not an inflection point. But the insurmountable problem (or so I see it, anyway) remains: given what we see in Brian Burke's graphs, there is simply no way that a five-yard gain on 1st and 10 should be treated as equal in value to a three-yard gain on 3rd and 2. The former represents a minuscule rise in first-down probability; the latter, a gain of around 40%. Or in another way of looking at it: a series starting on 1st and 10 that gains 5, 3, and 0 yards will receive two success points (the first and second-down plays). A series that gains 0, 0, and 10 yards will receive one success point (the third-down play). But it is the latter series that gains a new first down. I see no way around the conclusion that their success point system is fundamentally miscalibrated, due to excessive reliance on the research from THGOF.
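
The two series can be checked mechanically. The sketch below walks a series play by play using only the whole-point thresholds (45% / 60% / 100%); the function name is hypothetical, and fractional success points are deliberately left out, so this is a simplification of whatever FO actually computes.

```python
# Percent of yards to go needed for a success on each down (from the post).
REQUIRED_PCT = {1: 45, 2: 60, 3: 100, 4: 100}


def score_series(gains, to_go=10):
    """Walk up to four plays starting on 1st down.
    Returns (number of successful plays, whether a first down was gained)."""
    successes = 0
    down = 1
    for gained in gains:
        if 100 * gained >= REQUIRED_PCT[down] * to_go:
            successes += 1
        to_go -= gained
        if to_go <= 0:
            return successes, True   # new first down, series is over
        down += 1
    return successes, False


print(score_series([5, 3, 0]))    # (2, False): two successes, no first down
print(score_series([0, 0, 10]))   # (1, True): one success, first down gained
```

The series with fewer success points is the one that moves the chains, which is exactly the oddity in question.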

This would seem to explain the strange DVOA outcome of the Week 1 NE-BUF game, as the Bills had a lot of moderate gains on first and second down that DVOA might tend to overrate. And it is worth noting that after another such anomalous game two years ago, Aaron Schatz actually acknowledged the possibility that their system gives too much credit to these partial successes on first and second down.

Nonetheless, even if this criticism turns out to be accurate, this is a comparatively minor flaw. And it should not detract from DVOA's demonstrated track record of superiority over unadjusted yardage stats, or from the conclusion that it represents a superior approach for measuring true performance on the football field, as discussed above.

Any other thoughts on DVOA? Other stats? Brian Burke's Expected Points Added is also interesting - I might discuss that in a later post.

--------------------

* There are exceptions to this, of course, even beyond the existence of small amounts of clutch ability that have been detected in recent years: when batting with a runner on 3rd, 1 out, and Lugo and Varitek due up next, the optimal approach will be different than it would be leading off the ninth inning down by two runs. But over the course of a season, these differences are very slight, and an attempt to explicitly account for them is likely to cause more problems than it solves. We don't see the marked change in incentives, depending on game situation, that we do in football.
 

Super Nomario

Member
SoSH Member
Nov 5, 2000
14,015
Mansfield MA
Good opening post, Bellhorn.
 
I think it was Burke, but it might have been Dave Berri, who criticized DVOA for the following items:
1) It's a black box. We unworthies aren't given access to the methods, formulas, or weightings used to calculate DVOA. This makes it difficult to evaluate in the cases (like the Buffalo game or the Pats / Jets game a few years ago) where its results don't make intuitive sense. Can we even judge whether they're assigning too much credit for partial success when we don't even know how much they're assigning? Apart from this, it limits its value as an analytical tool. I can't take a split of the Pats' offensive DVOA with and without Gronkowski, for instance, or calculate their defensive DVOA on third down.
2) It's neither a wholly descriptive nor a wholly predictive tool. It makes some allowances for controlling for randomness (such as assigning half a turnover for fumbles no matter which team recovers them) but not others (the randomness of INT rates, for instance). It doesn't adjust for clutch but does weight red zone plays higher. And because it's a black box, we can't tease out the predictive and descriptive elements to try to make it more sound.
3) The number itself is meaningless. A -12% or +4% DVOA isn't a thing except in reference to other values of DVOA. Contrast to EPA or WPA, which represent their values in terms of points or wins, or even stats like Y/A or ANYA that relate back to yards.
 

Shelterdog

Well-Known Member
Lifetime Member
SoSH Member
Feb 19, 2002
15,375
New York City
The biggest issue with DVOA is that it's not clear to me that it measures anything other than DVOA. On the FO website they still tout a bunch of correlations from the 2000-2005 seasons which suggest that DVOA is better at predicting next-year wins than simple stats and is reasonably good (although not as good as points scored/allowed or point differential) at correlating with same-year wins, but that's the extent of the evidence about why it works.
 

SMU_Sox

queer eye for the next pats guy
SoSH Member
Jul 20, 2009
8,923
Dallas
While they didn't have a good year against the spread in 2013 I made bank with them. Plus I used them to win an elimination pool.

They are usually really good with their top 5 picks against the spread, especially if you tease it. You have to know, though, that they are bad at adjusting for injured players. I'd say DVOA is 60% descriptive and 40% purely predictive, with the note that a lot of that descriptive part is predictive too.

Schatz has noted that when they predict games they don't just use DVOA. They actually use the spread too.
 

SeoulSoxFan

I Want to Hit the World with Rocket Punch
Moderator
SoSH Member
Jun 27, 2006
22,102
A Scud Away from Hell

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
Football Outsiders has not done an especially good job explaining whether their system beats much simpler, much more transparent systems like SRS (available at football reference - just a strength of schedule adjusted margin of victory measurement). 
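
For reference, an SRS-style rating is simple enough to sketch in a few lines: each team's rating is its average point margin plus the average rating of its opponents, solved by repeated substitution. The team names and scores below are made up, and the plain fixed-point loop assumes a well-connected schedule (here a four-team round robin); it is an illustration of the idea, not the exact procedure any site uses.

```python
def simple_rating_system(games, iterations=100):
    """games: list of (team, opponent, team_points, opponent_points).
    Returns {team: rating}, where rating = average margin of victory
    plus the average rating of the opponents faced."""
    margins, opponents = {}, {}
    for team, opp, pts, opp_pts in games:
        # Record each game from both teams' points of view.
        for t, o, diff in ((team, opp, pts - opp_pts), (opp, team, opp_pts - pts)):
            margins.setdefault(t, []).append(diff)
            opponents.setdefault(t, []).append(o)

    avg_margin = {t: sum(m) / len(m) for t, m in margins.items()}
    ratings = dict(avg_margin)  # start from raw average margin
    for _ in range(iterations):
        ratings = {
            t: avg_margin[t]
            + sum(ratings[o] for o in opponents[t]) / len(opponents[t])
            for t in ratings
        }
    return ratings


# Invented round-robin results among four hypothetical teams.
games = [
    ("A", "B", 27, 20), ("A", "C", 24, 10), ("A", "D", 30, 13),
    ("B", "C", 21, 17), ("B", "D", 20, 23), ("C", "D", 14, 10),
]
for team, rating in sorted(simple_rating_system(games).items()):
    print(f"{team}: {rating:+.2f}")
```

That is the whole model: one input (final scores) and one adjustment (schedule), which is what makes it a useful baseline to beat.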
 
With baseball projections, this was always the gold standard - show that PECOTA/ZiPS/etc... beat a very simple Marcel the Monkey type system. Granularity and complexity are great when they achieve something. It's never been clear to me that DVOA does.
 

coremiller

Member
SoSH Member
Jul 14, 2005
5,854
The black box issue is the biggest problem with DVOA. Without knowing all the weighting and how all the adjustments work it's impossible to evaluate. Every so often they leak a new piece of information, e.g., Aaron casually mentioned recently that the opponent adjustments take into account down and distance, so that they adjust a team's success on third downs based on the opponent's third down performance, and not the opponent's overall performance. Does this granularity increase the model's accuracy or just introduce more noise? No one knows.

The real problem, though, is not that their model is a black box, but that their dataset is proprietary. If there were open-source access to all the PBP data in a downloadable format, some smart people with free time on their hands could probably reverse engineer much of the model, or use the data to test out other models. But no one has the data.

Bellhorn, the standard FO response to your 5,3,0 vs 0,0,10 issue is that it's a predictive vs descriptive problem. A purely predictive model will prefer the 5,3,0 scenario because you had two successful plays out of three, which indicates you are more likely to have successful plays in the future than if you only had one. The model will hate the two zeros in the 0,0,10 scenario and penalize you for it.

A separate problem with FO is that they continue to tout their individual player stats, which often produce bizarre results and which are frequently worse than useless. That says nothing about the validity of the underlying DVOA model for team success, but it hurts their credibility. Their generally thin skin, and inclusion of a number of newer writers who have a tendency to make obvious mistakes and say outlandishly silly things, do not help in this department.

FO is frustrating. It's clearly a big improvement over unadjusted yardage stats, but I'm always left feeling like it could be much better than it actually is.
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
FWIW, as far as the "black box" complaints go, I just ran a quick regression, and found that 97.2% of the variation in DVOA can be explained just through simple box score stats. Now maybe that last 2.8% is where the magic happens, but it seems like a pretty small amount of it is happening.
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
Out of curiosity, I threw in some PFF data (which is obviously super black box itself), and got that soup up to a 99% correlation for in-sample data.
 

SMU_Sox

queer eye for the next pats guy
SoSH Member
Jul 20, 2009
8,923
Dallas
Can't edit on my mobile. What program did you use to run the regression?
 

bowiac

Caveat: I know nothing about what I speak
Lifetime Member
SoSH Member
Dec 18, 2003
12,945
New York, NY
SMU_Sox said:
Ok... I'm impressed. Do you have that bad boy available for download?
PM me with your e-mail address - I'll send you a copy. It's at 99% for single-season in-sample data, and 97% for 3 seasons (2011-2013) of in-sample data. The three-season data is significantly more useful because it retains a 97% correlation for an out-of-sample season (2010). I haven't bothered t-stat-ing any of this to remove extraneous coefficients, so while I'm pulling 28 variables right now, probably only seven or eight are really doing anything.
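
For anyone following along, the in-sample vs. out-of-sample distinction can be sketched with a one-variable least-squares fit. Everything here is a stand-in: the numbers are invented (they are not real DVOA or box score data), the single predictor replaces the 28 variables mentioned above, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented data: a box score stat (x) and a rating to be explained (y).
train_x = [4.8, 5.1, 5.5, 5.9, 6.2]      # seasons used to fit the model
train_y = [-12.0, -5.0, 2.0, 9.0, 15.0]
test_x = [5.0, 5.7, 6.0]                 # a held-out season
test_y = [-8.0, 6.0, 10.0]

intercept, slope = fit_line(train_x, train_y)
in_sample = r_squared(train_x, train_y, intercept, slope)
out_of_sample = r_squared(test_x, test_y, intercept, slope)
print(f"in-sample R^2:     {in_sample:.3f}")
print(f"out-of-sample R^2: {out_of_sample:.3f}")
```

The coefficients are fitted only on the training seasons and then scored, unchanged, on the held-out one; a fit that holds up there is much harder to dismiss as overfitting.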
 

SMU_Sox

queer eye for the next pats guy
SoSH Member
Jul 20, 2009
8,923
Dallas
For not using Excel for a simple regression? No. But if you made a model using data from 2005 onward you might have done it in something else. You said simple regression and then said you added something. I'm just trying to see what you did. I'm a math/stats nerd. I apologize.