Misconceptions about P-values
P-values are currently the most common metric used to draw conclusions from research. Given this, one would expect them to be well understood and uncontroversial. Nothing could be further from the truth. Recently, misconceptions about their use and misinterpretations of their meaning have even triggered a statement from the American Statistical Association (ASA).
Background to P-values
In its official statement, the ASA provided the following definition:
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
There are a number of nuances packed into this statement (e.g. "or more extreme") that can be ignored for our current purpose. Of primary concern is that the first step in calculating any p-value is to assume some model (hypothesis) is correct/true. This is called the null hypothesis, meaning the hypothesis to be nullified. As the ASA notes, any number of assumptions can be made as part of this model, the most common of which include:
- The difference between two groups is exactly zero on average.
- All measurements were independent of each other.
- The observations were sampled from a specified distribution (e.g. a normal distribution).
- The variance is equal for both groups.
A key to spotting common fallacies about p-values is to realize they are contingent on these assumptions (collectively called the null hypothesis/model).
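To make this concrete, here is a minimal sketch in Python (using NumPy and SciPy) of a p-value calculation under such a null model. The data are simulated purely for illustration; the equal-means, independence, normality, and equal-variance assumptions above are baked into the choice of Student's t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measurements for two groups; under the null model both
# groups share the same mean, variance, and (normal) distribution.
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.0, scale=1.0, size=30)

# Student's t-test: the p-value is the probability, assuming the null
# model is true, of a mean difference at least as extreme as the one
# actually observed.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```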
Things that are not P-values
Over the years, researchers in industry, academia, and media have generated a litany of misunderstandings, misconceptions, and misuses of p-values. Below is a list of common offenses.
Transposing the Conditional Fallacies A p-value is not the probability that...
- An observation/effect isn't real or doesn't exist
- An observation/effect occurred by chance
- The null hypothesis is true
- The alternative hypothesis is false
- The research hypothesis is false
Importance Fallacies A p-value is not the probability that...
- An effect is small
- An observation is unimportant
Error Rate Fallacies A p-value is not...
- A false positive rate or Type I error rate
- The probability an observation will replicate
Transposing the Conditional Fallacies
The first four fallacies listed are all variants of the same statement. The claim that an observation is "real" means there is indeed a deviation from the null hypothesis, as opposed to the data being explainable by chance alone. Here, the null hypothesis corresponds to only chance deviations showing up in the data, as specified by the statistical model. Meanwhile, the "alternative hypothesis" corresponds to something in addition to chance influencing the data.
Assuming the truth of the null hypothesis gives it a probability of 100% during our calculation of the p-value. Thus, if the p-value is calculated to be 5%, these fallacies claim an obvious contradiction: 5% = 100%. These fallacies are an example of the inverse fallacy (also known as "transposing the conditional"). They claim that the "probability the data occurred due to chance" is the same as the "probability of getting the data if only chance was operating".
In more formal notation, the fallacy claims that Pr(Data | Chance) = Pr(Chance | Data), which does not hold in general. Despite even a mention on the Wikipedia page about p-values, this family of errors remains rampant.
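A rough simulation can show that these two probabilities are genuinely different quantities. The 50% proportion of true nulls and the 0.5 standard deviation effect size below are arbitrary assumptions chosen only to illustrate the point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n = 20_000, 30

# In half the simulated experiments only chance operates (null true);
# in the other half there is an assumed real effect of 0.5 SD.
null_true = rng.random(n_experiments) < 0.5
effects = np.where(null_true, 0.0, 0.5)

p_values = np.empty(n_experiments)
for i in range(n_experiments):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effects[i], 1.0, n)
    p_values[i] = stats.ttest_ind(a, b, equal_var=True).pvalue

significant = p_values < 0.05
# Pr(Chance | Data): among "significant" results, how often was it
# actually just chance? This fraction is not 5%.
print(f"Pr(Chance | p < 0.05) ≈ {null_true[significant].mean():.2f}")
```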
The fifth fallacy is that the p-value provides the probability the research hypothesis is false. It is yet another variant of the first four, but it also contains an additional error. Besides the above issues, this statement claims the research hypothesis is the only explanation for a deviation from the null hypothesis. This is almost never the case.
Example: A/B Testing
Web developers commonly perform A/B testing. Perhaps someone has redesigned their website to make it easier to navigate. Then they observe visitors spending more time at version A of the site than they did at version B. The developer may then conclude that version A was easier to navigate. They may reason that easier navigation provided a better user experience, so people stayed longer to consume the content. In fact, the opposite conclusion could explain a difference just as easily. Version A may have been more difficult to navigate! Perhaps people spent more time at version A searching for the right content, to the detriment of the user experience.
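As a sketch of how such a comparison might be analyzed, the snippet below runs a t-test on hypothetical dwell-time data (the numbers and the choice of test are illustrative assumptions, not a prescription). Whatever the p-value turns out to be, it cannot adjudicate between the two causal stories above.

```python
import numpy as np
from scipy import stats

# Hypothetical time-on-site measurements (seconds) for each version.
version_a = np.array([182, 240, 205, 310, 198, 275, 260, 221, 330, 244], dtype=float)
version_b = np.array([150, 172, 190, 165, 210, 158, 199, 176, 188, 162], dtype=float)

t_stat, p_value = stats.ttest_ind(version_a, version_b, equal_var=False)
print(f"mean A = {version_a.mean():.0f}s, mean B = {version_b.mean():.0f}s, p = {p_value:.4f}")

# The p-value only addresses "no difference on average"; it is silent on
# whether easier navigation or frustrated searching caused the difference.
```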
Importance Fallacies
Importance fallacies claim that a p-value can tell us the probability we have observed an interesting or substantial deviation from the null hypothesis. As we have discussed, the p-value is not the probability that the null hypothesis is true, either in part or in whole; e.g., it is not the probability that the difference is zero. Thus, it also cannot be the probability that the difference takes on any other value, important or not.
Twisted Tongues
Overlaps between statistical jargon and everyday English contribute to importance fallacies. Small p-values are commonly termed statistically significant by statisticians. This may be an oversimplification when communicating with non-statisticians. To guard against this fallacy, medical researchers often teach:
Statistical significance is not clinical significance.
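A small simulation makes the distinction vivid: with a large enough sample, a clinically trivial difference can still produce a tiny p-value. The sample size and effect below are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500_000  # very large samples

# Blood-pressure-like measurements differing by a trivial 0.1 mmHg on average.
control = rng.normal(120.0, 10.0, n)
treated = rng.normal(119.9, 10.0, n)

t_stat, p_value = stats.ttest_ind(control, treated, equal_var=True)
print(f"observed difference = {control.mean() - treated.mean():.2f} mmHg, p = {p_value:.1e}")

# The p-value can be tiny even though the effect is far too small to matter.
```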
Error Rate Fallacies
Error rate fallacies assume the p-value is equivalent to a false positive rate. For a p-value of 5%, researchers often believe that repeating the experiment will yield a p-value at least this small 5% of the time. Small p-values are usually deemed "positive" results, so this cutoff (α = 0.05) is termed the false positive rate.
α applies only when the null hypothesis is true and the same experiments/observations are made repeatedly. If we set a cutoff of α = 0.05 (5%) for our p-values, then under the null hypothesis 5% of the p-values generated will fall at or below this cutoff. The p-value obviously cannot have the same meaning as α, since it is the quantity being compared to α (and is free to take on any value between 0 and 1).
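This behavior is easy to verify with a quick simulation (the test and sample sizes below are illustrative assumptions): when the null hypothesis really is true, p-values are roughly uniformly distributed, so about 5% of them fall at or below α = 0.05, while any single p-value can land anywhere between 0 and 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments, n = 10_000, 25

p_values = np.empty(n_experiments)
for i in range(n_experiments):
    a = rng.normal(0.0, 1.0, n)  # both groups drawn from the same
    b = rng.normal(0.0, 1.0, n)  # distribution, so the null is true
    p_values[i] = stats.ttest_ind(a, b, equal_var=True).pvalue

print(f"fraction of p-values <= 0.05: {(p_values <= 0.05).mean():.3f}")
```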
Conclusion
Despite a long history of controversy, the p-value enjoys widespread use for drawing conclusions in business, policy, academia, and even medicine. Misconceptions and misinterpretations are rampant, but spotting them is not an esoteric task. Users do not need an advanced degree in mathematics to spot and avoid the logical errors. As seen above, many of the common fallacies share the same underlying flaw: the calculation of the p-value assumes the null hypothesis is 100% true and correct, so the p-value cannot possibly tell us about the scenarios in which that hypothesis is false.