The Statistical Perspective on DEFLATEGATE

Full disclosure: I’m a Patriots fan. I grew up in New England and I started watching NFL football right around the time Tom Brady became the starting quarterback for the Pats. We have been spoiled as fans to see our team go to the Super Bowl over and over again, taking home the trophy six times. But, I have to admit, the number of times the Patriots organization has been accused of cheating in one way or another does taint the joy of winning a little bit (and to be clear, I do not support the Patriots cheating to win games). One of those instances was when the Patriots were accused of purposefully deflating their footballs during the AFC championship game against the Indianapolis Colts in the 2014–2015 playoff season, affectionately referred to as the Deflategate scandal (a play on words – remember Watergate?).

Before we dive into some statistical analyses (as this IS a statistics blog), there are a few things you need to know. NFL regulations require that footballs be inflated to between 12.5 and 13.5 pounds per square inch (psi). Also, since 2006, protocol is for each team to use its own footballs while on offense, which allows a quarterback to use footballs that suit their preferences. I don’t want to get into all the weeds of this story, but here are the highlights that will be important: (1) the psi of the footballs used in this infamous game was not checked prior to the start of the game, but anecdotally, the Patriots claim their balls were 12.5 psi at the start of the game and the Colts indicated their balls started at 13 psi, (2) after rumors raised suspicion that the Patriots had intentionally deflated their footballs to gain an offensive advantage, the psi of 11 of their 12 footballs was measured at halftime, (3) only 4 of the Colts’ footballs were measured for comparison because the officials ran out of time before the second half began, (4) the officials used two different gauges to make the psi measurements, and (5) once the Patriots’ balls were found to be underinflated according to NFL regulations, they were properly inflated for the 2nd half of the game.

History tells you that the Patriots went on to win this game, and later defeat the Seattle Seahawks in the Super Bowl. I was living in Tacoma, WA, at the time, so you can imagine the amount of crap I was getting at work. Lots of hate and lots of blame. The Patriots played better in the 2nd half of that AFC championship game, so I just let the nasty talk slide because it did not seem like we had gained an advantage from playing with deflated footballs anyhow, and I suspect all of the footballs used in that Super Bowl were thoroughly checked. In the end, it was the Butler that did it. He won us that trophy. Tom Brady, our star quarterback, ended up serving a four-game suspension for his alleged role in telling a staff member to intentionally deflate balls. But anyhow, I digress…

As a statistician, I’ve always wanted to take a look at the psi data myself to see if the conclusions that were made in the official “Wells Report” seemed to stick, so let’s get to it. NOTE: I used SAS to perform all of my analyses.

First, let’s take a look at the data (taken from the Wells Report). Eleven Patriots footballs and four Colts footballs were each measured at halftime by two separate officials (total footballs measured = 15):

Codes.png

Additionally, the Wells Report indicates that Official #1 used one gauge to measure the Patriots footballs, and a second gauge to measure the Colts footballs, while Official #2 used the second gauge to measure the Patriots footballs and the first gauge to measure the Colts footballs. So I created an additional variable Gauge:

Codes2.png
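For readers following along in text, here is a minimal sketch of how such a variable could be derived in a DATA step from the gauge assignments described above. The data set names DEFLATE_RAW and DEFLATE, the character coding of TEAM, and the numeric coding of OFFICIAL are assumptions made for illustration and may not match the original code.

```sas
/* Hypothetical input: DEFLATE_RAW has one row per ball-by-official  */
/* measurement, with variables TEAM, BALL, OFFICIAL, and PSI.        */
data deflate;
  set deflate_raw;
  /* Official 1 used gauge 1 on Patriots balls and gauge 2 on Colts balls; */
  /* Official 2 used gauge 2 on Patriots balls and gauge 1 on Colts balls. */
  if (team = 'Patriots' and official = 1) or
     (team = 'Colts'    and official = 2) then gauge = 1;
  else gauge = 2;
run;
```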

At first, analyzing this data may seem very straightforward. There are two fixed factors that could affect the psi of the footballs: TEAM and GAUGE (NOTE: The effect of OFFICIAL cannot be examined as it is confounded with TEAM*GAUGE). Therefore, you may be tempted to write your model this way:

Codes3.png
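A minimal sketch of that tempting model, assuming PROC MIXED and a data set named DEFLATE with one row per measurement and variables PSI, TEAM, and GAUGE (the procedure choice and all names are assumptions, since the original code is only available as a screenshot):

```sas
/* Two-factor fixed-effects model that treats all 30 measurements   */
/* as if they came from 30 independent footballs.                   */
proc mixed data=deflate;
  class team gauge;
  model psi = team gauge team*gauge;
run;
```

With 30 rows and 4 cell means, a model like this leaves 30 – 4 = 26 degrees of freedom for error.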

However, this would be wrong. Why? When you run the code written this way, you will see that there are 26 degrees of freedom for the error term in the F tests.

Pseudoreplication Output.png

Because only 15 footballs were measured, we know that the number of degrees of freedom used to test the effect of TEAM is incorrect (the df should not exceed 14 total for this effect). When the code is written this way, it actually treats the data as if there were 30 footballs, not 15 footballs that were each measured twice. If you use the P-values produced from this code, you are committing one of the most common statistical errors: pseudoreplication. You have given your test more statistical power than you actually have, which means you are more likely to commit a Type I error (falsely rejecting the null hypothesis). In order to avoid making this kind of statistical error, you need to understand how many independent experimental units you have, thus knowing how many true replicates you have, which will dictate how many degrees of freedom you have. In this case, the two measurements made on each football are subsamples. One way to correct for this is to take the mean of the two measurements so that you only have a single psi value that represents each football:

Codes4.png
Codes5.png
AVG PSI Output.png
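One way the averaging step and the follow-up model could look is sketched below. It assumes the same hypothetical DEFLATE data set, with a BALL identifier alongside TEAM and PSI; the output data set BALL_MEANS and the variable AVG_PSI are made-up names for illustration.

```sas
/* Collapse the two subsamples per football into a single mean. */
proc means data=deflate noprint nway;
  class team ball;
  var psi;
  output out=ball_means mean=avg_psi;
run;

/* One row per football, so the error df is 15 - 2 = 13. */
proc mixed data=ball_means;
  class team;
  model avg_psi = team;
run;
```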

Now, looking at this output you will see how much less power this test has by looking at the F value for TEAM. When it looked like we had 30 footballs, the F value was much larger, 62.49, vs. 33.89 when we have only 15 footballs. In either case, F is large, and the P-value is very small, < 0.0001. Thus, in this particular case, the conclusion we would come to would not change, but this will not always be the case. At this point, what we have learned from this analysis is that there was a significant difference in the level of inflation of the footballs between the two teams at halftime. But WE ALREADY KNEW THAT. We were told that the Colts footballs were 0.5 psi higher than the Patriots footballs at the beginning of the game, so at this point, we have not learned anything new. I would expect the Patriots footballs to be less inflated at halftime if they were less inflated at the start of the game. I will come back to this later.

Another thing you may have noticed is that we lost information. By taking the average of both measurements, we have lost any information about the two different gauges that were used to measure the footballs. Also, we lose precision in the estimate of the experimental error. For these reasons, the data should be analyzed properly as a split-plot design. We do this by telling SAS that the experimental unit for the main-plot factor (TEAM) is the football (BALL), which means specifying the additional error term in the RANDOM statement.

Codes6.png
Split-Plot Output.png
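A split-plot specification along these lines might be sketched as follows. The data set and variable names are assumptions, and the main-plot error is written here as BALL(TEAM), which is one common way to declare footballs nested within teams; the original code may have named the term differently.

```sas
ods graphics on;  /* needed for the residual panels requested below */

proc mixed data=deflate;
  class team ball gauge;
  /* TEAM is the main-plot factor, GAUGE the subplot factor. */
  model psi = team gauge team*gauge / residual;
  /* Football within team is the main-plot experimental unit (Error A). */
  random ball(team);
run;
```

The leftover residual variance then plays the role of the subplot error (Error B).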

Now you can see in the output produced that we have two error terms (see Covariance Parameter Estimates). Also notice that not only do we have the same information we had for TEAM that we got in the previous analysis, but we also have information about GAUGE, and the interaction term between TEAM and GAUGE. We can check the output to make sure we have not committed pseudoreplication, by comparing the degrees of freedom reported against what we would expect them to be.

Expected degrees of freedom:

TOTAL DF: 30 (Total # data points) – 1 = 29

TOTAL MAIN PLOT DF: 15 (Total # of FOOTBALLS) – 1 = 14

FACTOR 1: TEAM (Levels = 2) DF: 2 – 1  = 1

ERROR A: TEAM*BALL DF: 14 (Total Main Plot DF) – 1 (Spent estimating effect of TEAM) = 13

FACTOR 2 : GAUGE (Levels = 2) DF: 2 – 1 = 1

INTERACTION: TEAM*GAUGE DF: 1 (DF for TEAM) * 1 (DF for GAUGE) = 1

ERROR B DF: 29 (Total DF) – 16 (Spent estimating above effects) = 13
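The same bookkeeping can be written generally. Using symbols introduced here only for illustration, let b = 15 footballs, t = 2 teams, and g = 2 gauges:

```latex
\begin{align*}
\text{Total} &: bg - 1 = 15 \cdot 2 - 1 = 29 \\
\text{TEAM} &: t - 1 = 1 \\
\text{Error A} &: b - t = 15 - 2 = 13 \\
\text{GAUGE} &: g - 1 = 1 \\
\text{TEAM} \times \text{GAUGE} &: (t - 1)(g - 1) = 1 \\
\text{Error B} &: (b - t)(g - 1) = 13 \cdot 1 = 13 \\
1 + 13 + 1 + 1 + 13 &= 29
\end{align*}
```

Because only two gauges were used, g – 1 = 1, so the Error B degrees of freedom equal the Error A degrees of freedom in this data set.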

Based on this information, we know that the numerator DF for TEAM should equal 1, and the denominator DF for the F Test for the effect of TEAM (the ratio of the mean squares of TEAM over ERROR A) should equal 13. CHECK! We have more replicates (15) for the effect of GAUGE as each gauge was used independently 15 times for a total of 30 data points (compared to 11 replicates for the level of TEAM = Patriots and 4 replicates for the level of TEAM = Colts). By coincidence in this case, ERROR B also has 13 degrees of freedom. CHECK!

Phew! We know we are no longer committing pseudoreplication, but there is something else we should do before we interpret the rest of the output, and that is to check our statistical assumptions, specifically normality and homogeneity of variances assumptions. This is done easily in PROC MIXED by adding the option RESIDUAL to the MODEL statement (see codes above). We examine the assumptions on the residuals (the differences between the observed values and their expected values based on the model) so that we can pool the data, otherwise our sample size would be too small to examine the assumptions. What we are looking for in the plot of residuals is a lack of a pattern such that the range of residual values should be the same across all predicted means.

Residuals1.png

Based on the plot of conditional residuals I am concerned about the homogeneity of variances assumption as the conditional residuals produce a horn shape.

Residuals2.png

When looking at the unconditioned residuals it appears that there is a wider range of residual values among the Patriots balls (the lower predicted means), regardless of gauge used, compared to the Colts balls (higher predicted means).

The good news is that even though it appears our data violate the assumption, we can tell SAS to estimate a separate variance for each Team’s balls. This is done using the GROUP option within the REPEATED statement.

Codes7.png
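As a rough sketch, the change amounts to adding one REPEATED statement to the split-plot model. As before, the data set name, the variable names, and the BALL(TEAM) form of the random term are assumptions for illustration.

```sas
ods graphics on;

proc mixed data=deflate;
  class team ball gauge;
  model psi = team gauge team*gauge / residual;
  random ball(team);
  /* Estimate a separate residual (Error B) variance for each team. */
  repeated / group=team;
run;
```

Comparing the fit statistics of this model with those of the single-variance model is one way to judge whether the extra variance parameter is worthwhile.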

You will see in the resulting output that a separate ERROR B term (variance) is now estimated for each level of TEAM (compare to the single “Residual” estimate in the previous output).

2Variances.png

When looking at the fit statistics, it appears that our new model that uses two separate variances is a better fit. To confirm, we will re-examine the plot of conditional residuals.

Residuals3.png

Indeed it looks like we have eliminated the troublesome pattern in the residuals. Also, looking at the right-hand panel, it appears that there is not a significant departure from normality among the residuals, so we can now feel comfortable interpreting the results. Since there is no significant interaction between TEAM and GAUGE, the difference between measurements that was due to using a different gauge is the same for both teams’ balls. It is interesting to note that there was a significant difference between the two gauges, such that measurements taken with gauge 1 were significantly lower than measurements taken with gauge 2 (P = 0.0130). Looking at the means, you can see that the Colts balls were within the regulatory limits for psi, measuring 12.53 psi on average. The Patriots balls had significantly less psi, measuring 11.30 psi on average (P = 0.0002).

End Results.png

Both teams’ balls deflated since the beginning of the game, as the Colts claimed their balls began at 13 psi and the Patriots claimed their balls began at 12.5 psi. So then, the real question should be whether the Patriots balls deflated MORE than the Colts balls. I’m not going to get into physics and whether the starting psi of the balls would affect the rate of deflation. I don’t know the answer to that. Therefore, the hypothesis I want to test is whether the difference in psi between the footballs at halftime is significantly greater than 0.5 psi (the difference at the start of the game). To do this, I will ask SAS to produce a 90% confidence interval for the difference between TEAM means in the LSMEANS statement:

LSMEANS TEAM / PDIFF CL ALPHA = 0.1; 

I have used the option CL to ask SAS to provide the lower and upper confidence limits for the mean difference between the Colts and Patriots balls. I have set ALPHA = 0.1 to get a 90% interval (as opposed to the default of 95%) because I have a one-sided alternative hypothesis. If the lower limit of the 90% confidence interval does not exceed 0.5, I would fail to reject the null hypothesis, and conclude there is no evidence that the Patriots balls deflated more than the Colts balls during the first half of the game.
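Written out, the one-sided test behind this interval looks roughly like the following. The symbols are introduced here for illustration: the estimated difference in least-squares means is written as d-hat, its standard error as SE, and the denominator degrees of freedom as nu. The 0.05 level is implied by using a two-sided 90% interval for a one-sided test.

```latex
\begin{align*}
H_0 &: \mu_{\text{Colts}} - \mu_{\text{Patriots}} \le 0.5 \\
H_A &: \mu_{\text{Colts}} - \mu_{\text{Patriots}} > 0.5 \\
\text{Reject } H_0 \text{ at } \alpha = 0.05 \text{ if} \quad
  & \hat{d} - t_{0.95,\,\nu}\,\mathrm{SE}(\hat{d}) > 0.5
\end{align*}
```

The left-hand side of the rejection rule is the lower limit of the two-sided 90% confidence interval, so only that limit needs to be compared against 0.5.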

Contrast2.png

Looking at the result, we see that the difference in psi between the Colts and Patriots balls is significantly greater than 0.5 psi (90% confidence interval: 0.81 to 1.65 psi). Therefore, either the difference between the inflation of the balls prior to the game was also greater than 0.5 psi (which we do not know because the balls were not measured by the officials prior to the game), or it appears someone tampered with the Patriots balls by deflating them. I have not read the Wells Report in full, but we know that the latter is the conclusion that the NFL made based on all of the other evidence, which is why Tom Brady ended up serving a four-game suspension.

Whatever. Tom Brady is still the GOAT.

Being NORMAL is Overrated!
