Ok. I've been waiting a while before stepping in, and I probably shouldn't. But something needs to be said about Exponent's poor statistical analysis of the data they were given.
There are two primary problems with the statistical analysis in Exponent's report. The first has already been mentioned: using parametric statistics on non-normal distributions leads to inaccurate tests of statistical significance. In other words, the model Exponent used to answer the question, "Did the Patriots' balls drop in PSI more than the Colts'?" assumes that the data are distributed like major league career OBP numbers:
[Attachment 1041: histogram of MLB career OBP values]
Note how the most frequent values in the range of career OBP values fall in the center of the distribution. Also note that the drop-off in frequency is symmetrical on both sides. This is often referred to as a "bell curve" distribution because it is shaped like a bell. The model used for the analysis uses the standard deviation (technically the variance, but standard deviation is the more familiar term, so I'm using it here) to assess whether the drop in PSI for the Pats' balls is greater than what would be expected due to random chance. The problem is that the PSI-drop data look like this:
[Attachments 1042 and 1043: histograms of the PSI-drop data for the two sets of balls]
This is a common problem when dealing with small sample sizes: distributions of data look weird, and the standard deviation is inaccurate. As a result, the significance assessed from the standard deviation is also inaccurate. As a side note, I've looked at this small dataset in multiple ways, and it doesn't really matter how you slice it; the distributions always look non-bell-curvy.
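For contrast, the parametric comparison at issue leans on that bell-curve assumption; a two-sample t test is the textbook example. Here's a sketch with placeholder PSI-drop values (not the real measurements, and not necessarily the exact model Exponent fit):

```python
import numpy as np
from scipy import stats

# Placeholder PSI-drop values for illustration -- not the actual measurements
pats_drop = np.array([1.4, 1.0, 1.2, 1.5, 1.3, 1.1, 1.6, 1.2, 1.4, 1.3, 1.0])
colts_drop = np.array([0.5, 0.4, 0.6, 0.3])

# Welch's t test: compares the means using each group's standard deviation,
# and assumes both groups are (roughly) normally distributed
t, p = stats.ttest_ind(pats_drop, colts_drop, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```

When the normality assumption fails, the p value this produces can't be trusted, which is the whole point of the first problem above.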
One simple solution is to use a better model. The question is, how do we construct a better model? If the hypothesis is that the drop in PSI is greater for the Pats than the Colts, then the null hypothesis is that the drop in PSI is similar between the Pats and the Colts. Under this null assumption, if we mixed up the two sets of balls, there should be no difference in the average PSI drop. Therefore, we can construct a distribution of possible results by mixing up the groups in lots of different ways and calculating the difference of means each time. If our observed value, with the groups preserved, is far from the center of that distribution, then the probability of seeing a difference that extreme if the null assumption were true (a.k.a. the "p value") is small. This is known as a permutation test.
For those in the know: I ran 10,000 permutations, so the "p value" here is estimated quite precisely.
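For anyone who wants to play along at home, here is a minimal sketch of the procedure in Python. The PSI-drop arrays are the same placeholder values as above, not the actual halftime measurements, and the two-sided p value convention is my choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(pats_drop, colts_drop, n_perm=10_000):
    """Shuffle the pooled drops n_perm times and compare the observed
    difference of means against the shuffled differences."""
    observed = pats_drop.mean() - colts_drop.mean()
    pooled = np.concatenate([pats_drop, colts_drop])
    n_pats = len(pats_drop)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:n_pats].mean() - shuffled[n_pats:].mean()
    # p value: fraction of shuffled differences at least as extreme
    # as the one actually observed (two-sided convention)
    p = np.mean(np.abs(diffs) >= abs(observed))
    return diffs, observed, p

# Placeholder PSI-drop values -- not the real measurements
pats_drop = np.array([1.4, 1.0, 1.2, 1.5, 1.3, 1.1, 1.6, 1.2, 1.4, 1.3, 1.0])
colts_drop = np.array([0.5, 0.4, 0.6, 0.3])
diffs, observed, p = permutation_test(pats_drop, colts_drop)
print(f"observed difference = {observed:.2f}, p = {p:.4f}")
```

Shuffling the pooled values is what encodes the null assumption that the two sets of balls are interchangeable.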
Here, we actually have two permutation tests, one for Prioleau and one for Blakeman (see below for why). Below are the distributions for such a test assuming, like Exponent, that the Pats' balls started at 12.5 PSI and the Colts' at 13.0.
[Attachment 1044: permutation distribution for Blakeman's measurements]
The red dot here denotes our "observed" value if we don't assume the null hypothesis. This result is highly significant (p = 0.0012). However, we are dealing with a small sample size, and it could be that Blakeman's measures are off. Here is what a test of Prioleau's ratings looks like:
[Attachment 1046: permutation distribution for Prioleau's measurements]
As you can see from these ratings, the differences between the Pats' and Colts' balls are less clear. While the observed value lies along the left tail of the distribution (where the Pats' balls have a bigger PSI drop), it's not out at the end of the tail. The p value is 0.0796, which is above the threshold normally used to reject the null hypothesis (0.05).
One limitation of this approach is that we must assume the measures here are independent, which limits our ability to permute the data from both sets of measures to increase power. Worse, there are potential confounds in the repeated measures that make the differences between raters appear systematic, as opposed to random. This limits our ability to conclude anything from the data: in one set of ratings the results are statistically significant; in the other, they are not.
One thing that hasn't been modeled is the variation in PSI from the original pre-game values. We can simulate this by assuming, conservatively, that the standard deviation is 0.16 for all data (and that the 0.41 standard deviation for the Pats was due to something else). Making this assumption does not change the results much: Blakeman's measures remain significant (p = 0.0026) while Prioleau's are not (p = 0.0978). A sketch of one way to do this follows the figures.
[Attachments 1048 and 1047: permutation distributions with simulated 0.16 SD pre-game variation, one per rater]
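If it helps to see the mechanics, here's one way to fold that assumption in (my guess at the procedure, not necessarily exactly what I'd defend as the only option): each iteration re-draws Gaussian noise with the assumed 0.16 SD and adds it to every drop before shuffling. Same placeholder arrays as before:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test_noisy(pats_drop, colts_drop, sd=0.16, n_perm=10_000):
    """Permutation test that re-draws the assumed pre-game variation
    (Gaussian noise with the given SD) on every iteration."""
    observed = pats_drop.mean() - colts_drop.mean()
    pooled = np.concatenate([pats_drop, colts_drop])
    n_pats = len(pats_drop)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Simulated pre-game variation, added to every drop this iteration
        noisy = pooled + rng.normal(0.0, sd, size=len(pooled))
        shuffled = rng.permutation(noisy)
        diffs[i] = shuffled[:n_pats].mean() - shuffled[n_pats:].mean()
    return observed, np.mean(np.abs(diffs) >= abs(observed))
```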
The problem here is that our measure of standard deviation is inaccurate (due to the small-sample problems discussed above). In fact, this is a plausible explanation for why the Pats' balls vary more than the Colts' balls: the variance of the PSI drop is not stable. However, there is a way to estimate the standard deviation of the PSI using another resampling approach, called the bootstrap.
A bootstrap is a method for exploring the population underlying a small sample. The Pats' footballs and the Colts' footballs belong to a population we could call "inflated pre-game NFL footballs." One way we can simulate this population is by constructing a series of "average" inflated NFL footballs: resample the measured balls with replacement many times and take the mean of each resample. Because of math (specifically the central limit theorem), any distribution of averages will be shaped like a bell curve, which allows us to calculate the variance of that distribution. And since the standard deviation of an average of n values is the population standard deviation divided by √n, the population standard deviation will be approximately √n times the standard deviation of the bootstrap means (roughly four times here).
Now, one could argue that the Colts' and Pats' deviations should be handled separately, because the two sets may have been inflated to different values, so we can run the bootstrap on each group to see what deviations to plug in for the Colts and the Pats. This is a conservative estimate: it will reduce the standard deviations if that assumption is true, though the estimate may still be higher than the observed standard deviation. In this case the population standard deviation is about 3.3 times the bootstrap deviation for the Pats (√11 ≈ 3.3) and 2 times the bootstrap deviation for the Colts (√4 = 2).
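Here is a short sketch of that per-group bootstrap, again using the placeholder drop values rather than the real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_population_sd(drops, n_boot=10_000):
    """Resample with replacement, take the mean of each resample, and
    scale the SD of those means by sqrt(n) to estimate the population SD
    (since the SD of a mean of n values is sigma / sqrt(n))."""
    n = len(drops)
    means = np.array([rng.choice(drops, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    return np.sqrt(n) * means.std(ddof=1)

# Placeholder PSI-drop values -- not the real measurements
pats_drop = np.array([1.4, 1.0, 1.2, 1.5, 1.3, 1.1, 1.6, 1.2, 1.4, 1.3, 1.0])
colts_drop = np.array([0.5, 0.4, 0.6, 0.3])
for label, drops in [("Pats", pats_drop), ("Colts", colts_drop)]:
    print(f"{label}: estimated population SD ~ {bootstrap_population_sd(drops):.2f}")
```

With 11 Pats balls the scale factor is √11 ≈ 3.3, and with 4 Colts balls it is √4 = 2, which is where the numbers above come from.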
Here, we start to see the problem with small sample sizes. Below are the bootstrap distributions for the Pats' balls:
[Attachment 1049: bootstrap distributions for the Pats' balls, both raters]
You can easily see that the two raters differ significantly from one another (suggesting that the raters are actually unreliable, which limits the interpretability of any of this data). However, the variances around the population means are really similar: the population standard deviation estimated from the bootstrap is approximately 0.39 and 0.40 for the two raters, which matches the standard deviation observed in our data (0.41, I believe). Unfortunately, the distributions for the Colts look weird:
[Attachment 1050: bootstrap distributions for the Colts' balls, both raters]
While Prioleau's ratings look somewhat normal, Blakeman's look uniform. Both have a much wider spread than observed for the Patriots. I'm not suggesting that the officials did anything at all to obscure their measures of the Colts' balls. Rather, this is a problem of extrapolating a standard deviation from a sample of four footballs vs. eleven. In any case, the estimated standard deviation here is about 0.31 for both ratings. The fact that this standard deviation is higher may explain why the permutation test was less significant than the parametric test: the Colts' sample may be underestimating the true standard deviation, leading to an overestimate of significance.
Let's be conservative here and re-run our tests with a 0.3 standard deviation for both groups, just for grins.
[Attachments 1052 and 1051: permutation distributions with a 0.3 SD for both groups, one per rater]
As you can see from the above, we still have the same result: Blakeman's measures show a significant difference in PSI drop (p = 0.0082), while Prioleau's do not (p = 0.16).
Finally, let's say the data are flipped, and that Prioleau and Blakeman switched gauges between measuring the Colts' and the Pats' balls. In that case we get significant effects for both sets of measures (p = 0.0232 and p = 0.0438). However, the significance is not nearly as strong as before, and the latter p value would not survive controlling for the fact that I ran two tests (a Bonferroni correction, for example, would double it to 0.0876).
Regardless of whether you control for variance in the footballs, you have sets of measures that show a significant drop-off and sets of measures that do not. The strongest conclusion we can draw is that we are underpowered to detect any differences in inflation between the two datasets without knowing two things:
1) The PSI values of all footballs at both halftime and the start of the game.
2) Who used what gauge at both halftime and the start of the game.
Such knowledge would give us much better confidence in estimating the statistics we need to determine whether the difference in PSI drop differs from what would be observed by chance.
In truth, Exponent shouldn't be blamed solely here; they were given a sloppy dataset to analyze and chose to analyze it in a manner consistent with many peer-reviewed publications of sloppy data. In fact, many of their models maintain significance when a better test is used. The larger problem is that the refs were not paying attention when they measured the footballs at halftime or at the start of the game, a lapse that neither the Wells report nor the Exponent report explains.