NHST can tell you how likely it is that randomly sampled data would look like your data, or be even more extreme, given that the null hypothesis (e.g., there is no difference in the means across the groups to compare) is true. As a standard threshold, we use 0.05, and we call this alpha. This means that if randomly sampled data could look like your data with less than a 5% chance, you reject the null hypothesis (and claim that you observe a difference). Thus, if your p value is lower than alpha, we say that you have a significant result.
If a difference does not actually exist but we reject the null hypothesis, we make an error. We call this a Type I error, or false positive. As you have seen in various NHSTs, your p value will never be exactly zero (though it can be really, really small). The threshold alpha = 0.05 means that we simply decide that if the chance of making a Type I error is less than 5%, we can more or less ignore it because it is unlikely to happen. (Strictly speaking, the p value itself is not the probability of a Type I error; alpha is the Type I error rate we accept, while the p value is the probability of data at least as extreme as yours given that the null hypothesis is true.)
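To make the long-run meaning of alpha concrete, here is a small simulation sketch (in Python; the simulation setup is my illustrative choice, not part of the text): when the null hypothesis is true, roughly 5% of experiments still produce p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 5000

# Simulate experiments in which the null hypothesis is TRUE:
# both groups are drawn from the same normal distribution.
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:  # we (wrongly) reject the null hypothesis
        false_positives += 1

# The false-positive (Type I error) rate should be close to alpha = 0.05.
print(false_positives / n_experiments)
```

With 5,000 simulated experiments, the observed rejection rate should land close to 0.05, which is exactly the Type I error rate alpha controls.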
However, there is another kind of error we may make in NHSTs, as illustrated in the following table.
| | Reject the null hypothesis | Fail to reject the null hypothesis |
| --- | --- | --- |
| The null hypothesis is true. | Type I error (false positive) | True negative |
| The null hypothesis is false. | True positive | Type II error (false negative) |
The other possible error is that we fail to reject the null hypothesis although the null hypothesis is actually false (i.e., we conclude that there is no difference although there is one). This is called a Type II error, or false negative. Its probability is denoted beta. Thus, 1 - beta is the probability of correctly rejecting a false null hypothesis (a true positive), and it is called power. Power analysis is a way to estimate how likely we are to make a Type II error.
To summarize in slightly more mathematical notation: alpha = P(reject H0 | H0 is true), where H0 is the null hypothesis, and p = P(data at least as extreme as observed | H0 is true). If p is smaller than alpha, we call the result significant. beta = P(fail to reject H0 | H0 is false), and power = P(reject H0 | H0 is false) = 1 - beta.
There are four pieces of information involved in power analysis:

- alpha (the significance level),
- power (1 - beta),
- the effect size, and
- the sample size.

We only need to specify three of them; power analysis then estimates the remaining one. Depending on which quantity you estimate, power analysis has different names.
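The text later demonstrates this with R's pwr package; as an illustrative parallel (a sketch, assuming Python's statsmodels is available), its `solve_power()` takes the known quantities and solves for whichever one is left unspecified:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()  # power analysis for a two-sample t test

# A priori: given effect size, alpha, and power, solve for the sample size.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8, nobs1=None)

# Sensitivity: given sample size, alpha, and power, solve for the effect size.
d = analysis.solve_power(nobs1=64, alpha=0.05, power=0.8, effect_size=None)

print(n, d)
```

The same function covers both directions: leave the sample size unspecified for an a priori analysis, or leave the effect size unspecified for a sensitivity analysis.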
Here, we focus on a priori power analysis because I think it is the most important application of power analysis.
Probably the most important use of power analysis is to estimate the sample size needed to achieve the desired alpha and power. As explained in the section above, as long as you specify three of the parameters, you can calculate the remaining one. Estimating the required sample size is particularly useful when you design an experiment. Let's take a look at an example with t tests.
Let's say that we have two groups to compare in a within-subject design (so we are using a paired t test). We want to test for a difference at alpha = 0.05. It is somewhat hard to set a proper effect size, but let's say 0.5 (regarded as a medium-size effect). The power is also somewhat tricky to set. Here, we shall accept a 30% chance of a false negative (failing to reject the null hypothesis although it is false), so the power is 0.7.
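In R this calculation is done with the pwr package (pwr.t.test with type = "paired"); as a parallel sketch in Python, statsmodels' TTestPower covers the one-sample/paired case (assuming statsmodels is installed):

```python
import math
from statsmodels.stats.power import TTestPower

# Paired t test: effect size d = 0.5, alpha = 0.05, desired power = 0.7.
n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.7)

# n comes back as a fractional value (about 26.6);
# round up to a whole number of participants.
print(math.ceil(n))  # prints 27
```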
Thus, we need 27 samples for each group (i.e., 27 participants, since each participant provides a pair of measurements) to meet the criteria we specified.
For an ANOVA test, we can use the pwr.anova.test() function. The parameters are similar, except that k is the number of groups to compare and f is the effect size. This effect size is usually reported as Cohen's f^2 instead of f, but for power analysis we need f. Also note that f is not equal to eta-squared or partial eta-squared, so be careful about it. For example, suppose we have three groups to compare, we assume that the effect size of the underlying populations is f = 1.0, and we set the power to 0.7. We can then estimate the sample size we need as follows.
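As with the t-test case, here is a parallel sketch using statsmodels instead of R's pwr.anova.test() (an illustration under my reading of the statsmodels API: FTestAnovaPower takes Cohen's f as effect_size, and its nobs is the total sample size across all groups, unlike the per-group n of pwr.anova.test):

```python
import math
from statsmodels.stats.power import FTestAnovaPower

# One-way ANOVA: k = 3 groups, Cohen's f = 1.0, alpha = 0.05, power = 0.7.
n_total = FTestAnovaPower().solve_power(effect_size=1.0, alpha=0.05,
                                        power=0.7, k_groups=3)

# nobs here is the TOTAL sample size; divide by k for a per-group size.
n_per_group = math.ceil(n_total / 3)
print(n_total, n_per_group)
```

Because f = 1.0 is a very large effect, the required sample size comes out small; with more realistic effect sizes the required n grows quickly.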
Another thing I should mention is retrospective power analysis, in which we estimate the power (the observed power) based on the observed effect size. This looks similar to post hoc power analysis, but the difference lies in the effect size: retrospective power analysis uses the observed effect size, whereas post hoc power analysis uses a population effect size determined in an a priori manner. Retrospective power analysis is often considered useful for determining whether there is really no difference between the groups we are comparing.
However, retrospective power analysis is regarded as an inappropriate way to argue for the non-existence of differences between groups, and many researchers have suggested avoiding this type of analysis. A more detailed discussion is available in the following paper, which I recommend if you are interested: Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19-24.
Here is a quick summary of why retrospective power analysis is not appropriate. Power analysis assumes the effect size of the underlying population, and it is known that the observed effect size (the one calculated from your data) is generally not a good approximation of the true effect size (the one we want to use for power analysis).
When you want to claim that there is no difference, you want your p value to be high (so it is unlikely that you would say there is a difference although there isn't one) and your power to also be high (so it is unlikely that you would say there is no difference although there is one). However, the p value and the observed power (the power calculated based on the observed effect size) are tied to each other: as the p value becomes larger, the observed power becomes lower. Thus, you cannot get what you want from power analysis when your goal is to claim no difference.
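This inverse relationship is easy to see numerically. Here is a sketch (Python with scipy and statsmodels; the sample size and effect sizes are hypothetical numbers of my choosing): for a two-sided one-sample/paired t test with fixed n, the observed effect size determines both the p value and the observed power, so higher p always comes with lower observed power.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

n = 20  # a fixed, hypothetical sample size

for d_obs in [0.2, 0.4, 0.6, 0.8]:  # hypothetical observed effect sizes
    t = d_obs * np.sqrt(n)                # observed t statistic
    p = 2 * stats.t.sf(abs(t), df=n - 1)  # two-sided p value
    obs_power = TTestPower().power(effect_size=d_obs, nobs=n, alpha=0.05)
    print(f"d={d_obs:.1f}  p={p:.3f}  observed power={obs_power:.2f}")
```

As the observed effect size grows, p shrinks while observed power grows. So the combination one hopes for when arguing for "no difference" (a high p value together with high observed power) cannot occur.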
So if you see this type of power analysis, be very careful.