Power Analysis: An Underutilized Tool
Power. It’s something you hear about in statistics, but do you know what it really means?
Before we can talk about power, there is one thing we need to cover first, and that’s making mistakes. There are two types of errors one can make when drawing conclusions about population parameters based on sample statistics. The first, a Type I error, can only be made when you reject your null hypothesis and declare a significant difference of some kind (e.g., the population mean is different from a predicted value, or two populations are significantly different from one another). When you conclude that there is a difference, but in reality that difference does not exist, you have made a Type I error. The second kind, a Type II error, can only be made when you fail to reject the null hypothesis: you declare that two populations are not different from each other when, in reality, there is a difference. The table below depicts this concept.

                          Null hypothesis is true     Null hypothesis is false
  Reject null             Type I error (α)            Correct decision (power)
  Fail to reject null     Correct decision            Type II error (β)
Whenever you read the results of a study in the primary literature, you see a “p-value” associated with the concluding statements being made. For example, a paper may conclude that “male California sea lions weigh more than female California sea lions (P < 0.05)”. P-values come from measuring the area under a probability distribution, so they can range from 0 to 1. The p-value is a conditional probability, the condition being that there is truly no difference between the populations: it is the probability of observing the difference measured between two samples, or a difference even more extreme, just due to chance, given that the populations are actually not different. The most commonly accepted Type I error rate, otherwise known as alpha (α), is 0.05, or a 5% chance of committing a Type I error. Thus, researchers protect themselves by rejecting the null hypothesis and declaring a significant difference only when P < 0.05.
What about Type II errors? How does a researcher protect against making one of those? The answer here is POWER. The probability of committing a Type II error is called beta (β), and power = 1 – β. Therefore, if there is a true difference between two populations and you increase power, you decrease β, the chance of committing a Type II error. So how does one increase statistical power? Experimental design plays a huge role in maximizing statistical power. For example, you may choose to use very homogeneous experimental units, limiting the amount of random variation observed between units. Another strategy is choosing treatment levels that are more different from each other (e.g., comparing growth at 25°C vs. 35°C rather than 25°C vs. 28°C). Increasing sample size, or the number of experimental/observational units, also increases statistical power, as the sketch below shows. In general, you should design your experiment or study to have 80% power or more (β ≤ 0.2).
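To see the sample-size effect for yourself, here is a minimal sketch using SAS’s PROC POWER with made-up numbers (a mean difference of 2, a standard deviation of 3, and several candidate per-group sample sizes); the computed power climbs as n grows:

  proc power;
     twosamplemeans test=diff
        meandiff  = 2          /* hypothetical difference between means */
        stddev    = 3          /* hypothetical common standard deviation */
        npergroup = 10 20 40   /* candidate sample sizes per group */
        power     = .;         /* missing value: solve for power at each n */
  run;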
So let’s talk about power analysis. What is it and why should everyone be doing it?
Power analysis is most often used to determine the sample size required to detect a difference or effect of a given size with a certain amount of confidence. If you are writing a grant application, it will often be necessary to perform a power analysis in order to justify your sample size to reviewers. This is particularly true when your study uses animals or humans as subjects. A power analysis can also tell you whether a particular study is worth pursuing at all: you may find that you would need an unreasonably large sample size to detect the effect you are hoping to measure.
Power analysis is performed using the relationship between five pieces of information: sample size, the magnitude of the difference between populations (effect size), the amount of natural or random variation that exists within the population, the acceptable level of Type I error (α), and power (1 – β). To perform a power analysis, you need all but one of these pieces of information, from which you can calculate the one that is unknown. How the effect size is represented will depend on the type of statistical analysis you plan to use (e.g., comparing proportions vs. comparing means). I recommend using a statistical package to perform your power analysis. In some cases, you may need to make your best guess about one of these necessary pieces of information, for instance the variance. It is good to perform a literature search to see if you can find information about what you hope to measure, or, if possible, you can conduct a small pilot study to collect some preliminary data that may help inform your larger study.
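To see roughly how these five quantities interlock for a two-sample comparison of means, the normal-approximation formula below is a useful sketch (statistical packages refine this with the noncentral t distribution, so their answers will differ slightly):

\[ n_{\text{per group}} \approx \frac{2\,\sigma^{2}\,(z_{1-\alpha} + z_{1-\beta})^{2}}{\delta^{2}} \]

where δ is the difference in means you want to detect, σ is the common standard deviation, and the z’s are standard normal quantiles. A larger effect or less variation shrinks the required sample size; demanding a smaller α or higher power grows it.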
Example for a Two-Sample t-Test:
The horned lizard (Phrynosoma mcallii) is named for the fringe of spikes that surrounds its head. Its main predator is the loggerhead shrike, a small bird that skewers its prey on thorns or barbed wire to save it to eat later. Your research question might be: do the horns protect the lizards from being eaten by their main predator, the loggerhead shrike?
Study design: Compare the length of horns on dead (skewered) vs. live lizards. You plan to find at least 10 dead lizards and capture at least 30 live lizards. The sample sizes are unequal because it is more difficult to find dead lizards.
Predictions: It is known from a previous study that the horns of live lizards average 24.28 mm with a standard deviation of 2.63 mm. Let’s hypothesize that the mean horn length of dead lizards will be shorter by at least 2 mm. NOTE: Our alternative hypothesis in this case is one-sided because we are only interested in whether dead lizards have shorter spikes.
To determine the amount of power this study has, I would use the following code in SAS® (SAS Institute, Cary, NC):
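A PROC POWER call along these lines does the job (a sketch of the setup described above: a one-sided test, a 2 mm difference, and unequal group sizes):

  proc power;
     twosamplemeans test=diff
        sides    = 1         /* one-sided: dead lizards have shorter horns */
        meandiff = 2         /* hypothesized difference of 2 mm */
        stddev   = 2.63      /* standard deviation from the previous study */
        groupns  = (10 30)   /* 10 dead lizards, 30 live lizards */
        power    = .;        /* missing value: solve for power */
  run;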
In the SAS output you will see that this design results in a power of only 0.656. This would be considered insufficient power: it means that if we were to perform the study 100 times, in roughly 34 of those studies (1 – power = β = 0.344) we would fail to detect this difference even though it exists.
Let’s say, then, that we want to control for a power of 0.9, reducing the rate of failing to detect a difference to 10% (10 out of 100 studies). In this case, we make NTOTAL = . the missing value and specify POWER = 0.9.
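The modified call might look like this (a sketch; GROUPWEIGHTS keeps the 1:3 dead-to-live allocation while NTOTAL is solved for):

  proc power;
     twosamplemeans test=diff
        sides        = 1
        meandiff     = 2
        stddev       = 2.63
        groupweights = (1 3)   /* maintain the 1:3 dead-to-live allocation */
        ntotal       = .       /* missing value: solve for total sample size */
        power        = 0.9;
  run;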
The result is a computed NTOTAL = 84. Remember, the groups are not of equal size; this value was calculated based on the 1:3 (dead:live) allocation I specified, so we would interpret this as needing to sample 21 dead lizards and 63 live lizards to achieve the desired power in this study. You may have noticed that I did not specify a value for alpha in my code. When you don’t specify one, SAS uses α = 0.05 as the default.
SAS can also produce some informative plots, such as a curve of power against total sample size generated from the first set of code above. This helps you visualize how much power increases as you increase the total sample size.
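PROC POWER builds such a plot with a PLOT statement; adding one to the power calculation sweeps the total sample size across a range (the 20 to 120 range here is an arbitrary choice):

  proc power;
     twosamplemeans test=diff
        sides        = 1
        meandiff     = 2
        stddev       = 2.63
        groupweights = (1 3)
        ntotal       = 40
        power        = .;
     plot x=n min=20 max=120;   /* power as a function of total sample size */
  run;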
By altering the code, you could also investigate the effect size you would be able to declare significant, given the original design (N = 40), when you increase power to 0.8. Try it!
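If you get stuck, one way to set it up is to make MEANDIFF the missing value (again a sketch, keeping the unequal allocation):

  proc power;
     twosamplemeans test=diff
        sides        = 1
        stddev       = 2.63
        groupweights = (1 3)
        ntotal       = 40      /* the original design's total sample size */
        meandiff     = .       /* missing value: solve for detectable difference */
        power        = 0.8;
  run;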