Misconceptions about P-values
P-values are currently the most common metric used to draw conclusions from research. Given this, one would expect they are well-understood and non-controversial. Nothing could be further from the truth. Recently, misconceptions about their use and misinterpretation of their meaning have even triggered a statement from the American Statistical Association (ASA).
Background to P-values
In its official statement, the ASA provided the following definition:
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
There are a number of nuances packed into this statement (e.g. "or more extreme") that can be ignored for our current purpose. Of primary concern is that the first step in calculating any p-value is to assume some model (hypothesis) is true. This is called the null hypothesis, meaning the hypothesis to be nullified. As noted by the ASA, any number of assumptions can be made as part of this model, the most common including:
- The difference between two groups is exactly zero on average.
- All measurements were independent of each other.
- The observations were sampled from a specified distribution (e.g. a normal distribution).
- The variance is equal for both groups.
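As a rough illustration of "assume a model, then ask how extreme the data are under it", the sketch below computes an exact permutation p-value for a difference in group means. The data values and the function name `permutation_p_value` are hypothetical, and the permutation test is just one convenient choice of null model (group labels are interchangeable and measurements are independent); it is not the procedure described in the ASA statement.

```python
from itertools import combinations

# Hypothetical measurements for two groups (illustrative values only).
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b):
    """Two-sided exact permutation p-value for the difference in group means.

    Null model (an assumption of this sketch): the group labels carry no
    information, so every relabeling of the pooled data is equally likely.
    """
    pooled = a + b
    observed = abs(mean(b) - mean(a))
    indices = range(len(pooled))
    as_extreme = 0
    total = 0
    # Enumerate every way of labeling len(a) of the pooled values as "group A".
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        relabeled_a = [pooled[i] for i in chosen_set]
        relabeled_b = [pooled[i] for i in indices if i not in chosen_set]
        diff = abs(mean(relabeled_b) - mean(relabeled_a))
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total


p_value = permutation_p_value(group_a, group_b)
print(p_value)  # 2 of the 20 relabelings are at least as extreme: 0.1
```

The result is the share of relabelings that are "equal to or more extreme" than what was observed, computed entirely under the assumption that the null model holds.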
Things that are not P-values
Over the years, researchers in industry, academia, and media have generated a litany of misunderstandings, misconceptions, and misuses of p-values. Below is a list of common offenses.
A p-value is not the probability that...
- An observation/effect isn't real or doesn't exist
- An observation/effect occurred by chance
- The null hypothesis is true
- The alternative hypothesis is false
- The research hypothesis is false
- An effect is small
- An observation is unimportant
- A true null hypothesis will be wrongly rejected (the false positive rate or Type I error rate)
- An observation will replicate
Transposing the Conditional Fallacies
The first four fallacies listed are all variants of the same statement. The claim that an observation is "real" means that there is indeed a deviation from the null hypothesis. This is opposed to the data being explainable by chance. Here, the null hypothesis corresponds to only chance deviations showing up in the data, as specified by the statistical model. Meanwhile, the "alternative hypothesis" corresponds to something in addition to chance influencing the data.
Assuming the truth of the null hypothesis gives it a probability of 100% during our calculation of the p-value. Thus, if the p-value is calculated to be 5%, these fallacies claim an obvious contradiction: 5% = 100%. These fallacies are an example of the inverse fallacy (also known as "transposing the conditional"). They claim that the "probability the data occurred due to chance" is the same as the "probability of getting the data if only chance were operating".
In more formal notation, the fallacy asserts that Pr(Data | Chance) = Pr(Chance | Data). Despite even a mention on the Wikipedia page about p-values, this family of errors remains rampant.
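To see that the two conditional probabilities need not agree, the sketch below applies Bayes' rule to a simplified setting where the "data" is the event of a significant result. All three inputs (`prior_null`, `alpha`, `power`) are hypothetical numbers chosen for illustration; none of them come from the text, and the prior in particular cannot be read off a p-value.

```python
# Hypothetical inputs, chosen only for illustration.
prior_null = 0.90  # assumed share of tested hypotheses where only chance operates
alpha = 0.05       # Pr(significant result | chance only)
power = 0.80       # assumed Pr(significant result | real effect)


def prob_null_given_significant(prior_null, alpha, power):
    """Bayes' rule: Pr(chance only | significant result)."""
    p_sig_and_null = prior_null * alpha
    p_sig_and_effect = (1.0 - prior_null) * power
    return p_sig_and_null / (p_sig_and_null + p_sig_and_effect)


posterior_null = prob_null_given_significant(prior_null, alpha, power)
print(f"Pr(significant | chance) = {alpha:.2f}")
print(f"Pr(chance | significant) = {posterior_null:.2f}")
```

Under these assumed inputs, the first probability is 0.05 while the second is about 0.36. Changing the assumed prior or power changes the second number but leaves the first untouched, which is why one cannot be substituted for the other.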
The fifth fallacy is that the p-value provides the probability the research hypothesis is false. It is yet another variant of the first four, but it also contains an additional error. Besides the above issues, this statement claims the research hypothesis is the only explanation for a deviation from the null hypothesis. This is almost never the case.
Example: A/B Testing
Web developers commonly perform A/B testing. Perhaps someone has redesigned their website to make it easier to navigate. Then they observe visitors spending more time at version A of the site than they did at version B. The developer may then conclude that version A was easier to navigate. They may think easier navigation provided a better user experience, thus people stayed longer to consume the content. In fact, the opposite conclusion could explain the difference just as easily. Version A may have been more difficult to navigate! Perhaps people spent more time at version A searching for the right content, to the detriment of the user experience.
Importance Fallacies
Importance fallacies claim that a p-value can tell us the probability that we have observed an interesting or substantial deviation from the null hypothesis. As we have discussed, the p-value is not the probability that the null hypothesis is true, either in part or in whole. E.g., it is not the probability that the difference is zero. Thus, it also cannot be the probability that the difference takes on any other value, whether of any importance or not.
Twisted Tongues
Overlaps between statistical jargon and common English contribute to importance fallacies. Small p-values are commonly termed statistically significant by statisticians. This may be an oversimplification when communicating with non-statisticians, who tend to read "significant" as "important". To defend each other against this fallacy, medical researchers often teach:
Statistical significance is not clinical significance.
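One way to see why a small p-value says nothing about the size of an effect is to note how strongly it depends on sample size. The sketch below uses a one-sample z-test with a known standard deviation, which is an assumption made for simplicity; the function name and the two scenarios are hypothetical.

```python
import math


def two_sided_z_p_value(mean_difference, sd, n):
    """Two-sided p-value for a z-test of 'the true mean difference is zero'.

    Assumes independent, normally distributed measurements with known sd.
    """
    z = mean_difference / (sd / math.sqrt(n))
    # For a standard normal Z, Pr(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical scenario 1: a tiny effect (0.01 sd) measured on a huge sample.
p_tiny_effect_huge_sample = two_sided_z_p_value(0.01, sd=1.0, n=1_000_000)

# Hypothetical scenario 2: a large effect (0.5 sd) measured on a small sample.
p_large_effect_small_sample = two_sided_z_p_value(0.5, sd=1.0, n=10)

print(p_tiny_effect_huge_sample)
print(p_large_effect_small_sample)
```

In this sketch the negligible effect yields a far smaller p-value than the large one, purely because of the number of observations.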
Error Rate Fallacies
Error rate fallacies assume the p-value is equivalent to a false positive rate. For a p-value of 5%, researchers often believe that repeating the experiment will yield a p-value at least this low 5% of the time. Small p-values, meaning those below a chosen cutoff α (commonly α = 0.05), are usually deemed "positive" results, so this cutoff is termed the false positive rate.
It is α, not the p-value, that describes what happens when the null hypothesis is true and the same experiments/observations are made repeatedly. If we set a cutoff of α = 0.05 (5%) for our p-values, then 5% of the p-values generated will be at least this small. The p-value obviously cannot have the same meaning as α, since it is being compared to it (and is free to take on any value between 0 and 1).
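The role of α can be checked with a small simulation: repeat an experiment many times with the null hypothesis true by construction, and count how often the p-value falls at or below the cutoff. The sketch below assumes a z-test on standard normal data; the function name, seed, sample size, and number of repetitions are arbitrary choices for illustration.

```python
import math
import random


def simulate_null_p_values(num_experiments, sample_size, seed=0):
    """Simulate two-sided z-test p-values when the null hypothesis is true.

    Each experiment draws independent N(0, 1) values, so the true mean is
    exactly zero and the standard deviation is known to be 1.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(num_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        z = (sum(sample) / sample_size) * math.sqrt(sample_size)
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values


alpha = 0.05
p_values = simulate_null_p_values(num_experiments=5000, sample_size=20)
fraction_below_alpha = sum(p <= alpha for p in p_values) / len(p_values)
print(f"Fraction of p-values <= {alpha}: {fraction_below_alpha:.3f}")
print(f"Smallest and largest p-value: {min(p_values):.4f}, {max(p_values):.4f}")
```

The fraction should land close to α, while the individual p-values range widely between 0 and 1. The cutoff is fixed in advance; each p-value is a different number computed from each data set.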