Misconceptions about P-values

by Stephen Nawara, PhD.

P-values are currently the most common metric used to draw conclusions from research. Given this, one would expect they are well-understood and non-controversial. Nothing could be further from the truth. Recently, misconceptions about their use and misinterpretation of their meaning have even triggered a statement from the American Statistical Association (ASA).

Background to P-values

In its official statement, the ASA provided the following definition:

Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

There are a number of nuances packed into this statement (e.g. "or more extreme") that can be ignored for our current purpose. Of primary concern is that the first step in calculating any p-value is to assume some model (hypothesis) is correct/true. This is called the null hypothesis, meaning the hypothesis to be nullified. As noted by the ASA, any number of assumptions can be made as part of this model, the most common including:

The difference between two groups is exactly zero on average.
All measurements were independent of each other.
The observations were sampled from a specified distribution (e.g. a normal distribution).
The variance is equal for both groups.

Note: Testing the first assumption is typically of the most interest to researchers, so they often forget that the model is a precise mathematical statement. The derivation of the model requires all the assumptions to be correct. However, interestingly enough, the mathematical details behind the calculation are not critical to understanding the concept of a p-value. For any given data set, there will be numerous ways to calculate a p-value, each of which can give a different answer.
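As a rough illustration of that last point, the sketch below computes exact permutation p-values for one small data set in three ways. The data values, the function name exact_permutation_p, and the choice of a permutation test are all assumptions made for this example; they are not taken from the ASA statement. The null model used here is that the group labels are exchangeable, which is only one of many models that could be assumed.

```python
from itertools import combinations
from statistics import mean, median


def exact_permutation_p(a, b, statistic, two_sided=True):
    """Exact permutation p-value for statistic(a) - statistic(b).

    Null model (an assumption of this sketch): group labels are
    exchangeable, so every relabelling of the pooled data is equally likely.
    """
    pooled = list(a) + list(b)
    observed = statistic(a) - statistic(b)
    as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = statistic(g1) - statistic(g2)
        # "Equal to or more extreme than" the observed value
        # (small tolerance guards against floating-point ties).
        if two_sided:
            extreme = abs(diff) >= abs(observed) - 1e-12
        else:
            extreme = diff >= observed - 1e-12
        if extreme:
            as_extreme += 1
        total += 1
    return as_extreme / total


# Hypothetical measurements for two groups (made up for illustration).
group_a = [3.1, 4.5, 5.2, 6.8, 7.4, 9.0]
group_b = [2.0, 2.9, 3.7, 4.1, 5.5, 6.0]

p_mean_two_sided = exact_permutation_p(group_a, group_b, mean)
p_mean_one_sided = exact_permutation_p(group_a, group_b, mean, two_sided=False)
p_median_two_sided = exact_permutation_p(group_a, group_b, median)

print("mean difference, two-sided:  ", p_mean_two_sided)
print("mean difference, one-sided:  ", p_mean_one_sided)
print("median difference, two-sided:", p_median_two_sided)
```

Because the two groups are the same size and group A has the larger mean, the two-sided mean-based p-value is exactly twice the one-sided one, and the median-based value will generally differ from both. None of the three is "the" p-value for the data; each is conditional on a different model.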

A key to spotting common fallacies about p-values is to realize they are contingent on these assumptions (collectively called the null hypothesis/model).

Things that are not P-values

Over the years, researchers in industry, academia, and media have generated a litany of misunderstandings, misconceptions, and misuses of p-values. Below is a list of common offenses.

Transposing the Conditional Fallacies: A p-value is not the probability that...

An observation/effect isn't real or doesn't exist
An observation/effect occurred by chance
The null hypothesis is true
The alternative hypothesis is false
The research hypothesis is false

Importance Fallacies: A p-value is not the probability that...

An effect is small
An observation is unimportant

Error Rate Fallacies: A p-value is not...

A false positive rate or Type I error rate
The probability an observation will replicate

Transposing the Conditional Fallacies

The first four fallacies listed are all variants of the same statement. The claim that an observation is "real" means that there is indeed a deviation from the null hypothesis. This is opposed to the data being explainable by chance. Here, the null hypothesis corresponds to only chance deviations showing up in the data, as specified by the statistical model. Meanwhile, the "alternative hypothesis" corresponds to something in addition to chance influencing the data.

Assuming the truth of the null hypothesis gives it a probability of 100% during our calculation of the p-value. Thus, if the p-value is calculated to be 5%, these fallacies claim an obvious contradiction: 5% = 100%. These fallacies are an example of the inverse fallacy (also known as "transposing the conditional"). They claim that the "probability the data occurred due to chance" is the same as the "probability of getting the data if only chance were operating".

In more formal notation, the fallacy asserts that Pr(Data|Chance) = Pr(Chance|Data), which does not hold in general. Despite even a mention on the Wikipedia page about p-values, this family of errors remains rampant:

P-values: don't transpose the conditional
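To see how far apart the two conditional probabilities can be, here is a minimal sketch using Bayes' theorem. The prior probability that only chance is operating (0.9) and the probability of a "significant" result when something more than chance is operating (0.8) are hypothetical numbers chosen for illustration, and the function name is made up for this example. "Data" is simplified here to the event "the result fell below the 5% cutoff".

```python
def prob_chance_given_data(prior_chance, p_data_given_chance, p_data_given_effect):
    """Bayes' theorem: Pr(Chance|Data) from Pr(Data|Chance) and a prior."""
    prior_effect = 1.0 - prior_chance
    p_data = (p_data_given_chance * prior_chance
              + p_data_given_effect * prior_effect)
    return p_data_given_chance * prior_chance / p_data


# Hypothetical inputs (assumptions, not estimates from any real study).
p_data_given_chance = 0.05   # Pr(Data|Chance)
posterior = prob_chance_given_data(
    prior_chance=0.9,
    p_data_given_chance=p_data_given_chance,
    p_data_given_effect=0.8,
)

print(f"Pr(Data|Chance) = {p_data_given_chance:.2f}")
print(f"Pr(Chance|Data) = {posterior:.2f}")  # 0.36 with these inputs
```

With these made-up inputs, Pr(Data|Chance) is 0.05 while Pr(Chance|Data) is 0.36. Changing the prior changes the second number but not the first, which is one way to see that the two cannot be the same quantity.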

The fifth fallacy is that the p-value provides the probability the research hypothesis is false. It is yet another variant of the first four, but it also contains an additional error. Besides the above issues, this statement claims the research hypothesis is the only explanation for a deviation from the null hypothesis. This is almost never the case.

Note: Most commonly the predictions of the research hypothesis correspond to anything but the null hypothesis, but this need not be the case.

Example: A/B Testing

Web developers commonly perform A/B testing. Perhaps someone has redesigned their website to make it easier to navigate. Then they observe visitors spending more time at version A of the site than they did at version B. The developer may then conclude that version A was easier to navigate. They may think easier navigation provided a better user experience, thus people stayed longer to consume the content. In fact, the opposite conclusion could explain a difference just as easily. Version A may have been more difficult to navigate! Perhaps people spent more time at version A searching for the right content, to the detriment of the user experience.

Importance Fallacies

Importance fallacies claim that a p-value can tell us the probability we have observed an interesting/substantial deviation from the null hypothesis. As we have discussed, the p-value is not the probability that the null hypothesis is true, either in part or in whole. E.g., it is not the probability that the difference is zero. Thus, it also cannot be the probability that the difference takes on any other value, whether of any importance or not.

Note: Experimental design also determines a p-value. If any aspect of the null hypothesis is incorrect, a large enough sample size or careful enough measurements will eventually lead to a small p-value. This will occur even if the deviation from the null hypothesis is extremely small and exists in some minor auxiliary assumption.
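The effect of sample size can be sketched with a simple two-group z-test. Everything here is an assumption for illustration: the true difference (0.01 standard deviations), the known standard deviation of 1, the sample sizes, and the simplification that the observed difference lands exactly on the true difference. The function names are likewise made up.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def p_at_true_difference(true_diff, sigma, n_per_group):
    """P-value if the observed mean difference equals the true difference.

    Standard error of a difference of two means with equal group sizes
    and a common, known standard deviation: sigma * sqrt(2 / n).
    """
    standard_error = sigma * sqrt(2 / n_per_group)
    return two_sided_p_from_z(true_diff / standard_error)


tiny_difference = 0.01          # hypothetical, practically negligible
sample_sizes = [100, 10_000, 1_000_000]
p_by_sample_size = [
    p_at_true_difference(tiny_difference, sigma=1.0, n_per_group=n)
    for n in sample_sizes
]

for n, p in zip(sample_sizes, p_by_sample_size):
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The deviation from the null hypothesis is the same in every row; only the sample size changes. The p-value shrinks anyway, so a small p-value on its own says nothing about whether the deviation is large or important.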

Twisted Tongues

Overlaps between statistical jargon and common English contribute to importance fallacies. Small p-values are commonly termed statistically significant by statisticians. This may be an oversimplification when communicating to non-statisticians. To defend each other against this fallacy, medical researchers often teach:

Statistical significance is not clinical significance.

Error Rate Fallacies

Error rate fallacies assume the p-value is equivalent to a false positive rate. For a p-value of 5%, researchers often believe that repeating the experiment will yield a p-value at least this low 5% of the time. Small p-values are usually deemed "positive" results, so the cutoff used to call a result positive (α = 0.05) is termed the false positive rate.

It is α, not the p-value, that describes what happens when the null hypothesis is true and the same experiments/observations are made repeatedly. If we set a cutoff of α = 0.05 (5%) for our p-values, then 5% of the p-values generated will be at least this small. The p-value obviously cannot have the same meaning as α since it is being compared to it (and is free to take on any value between 0 and 1).
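A small simulation makes the distinction concrete. The sketch below repeats a hypothetical experiment many times with the null hypothesis true by construction (normal data with mean 0 and known standard deviation 1) and applies a one-sample z-test each time. The number of experiments, the sample size, the seed, and the function name are arbitrary choices for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_null_p_values(n_experiments, n_obs, seed=0):
    """P-values from repeated experiments in which the null model is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    results = []
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # z statistic for "mean is zero" with known standard deviation 1.
        z = fmean(sample) * sqrt(n_obs)
        results.append(2 * (1 - std_normal.cdf(abs(z))))
    return results


alpha = 0.05
null_p_values = simulate_null_p_values(n_experiments=5000, n_obs=20)
fraction_below_alpha = sum(p <= alpha for p in null_p_values) / len(null_p_values)

print(f"smallest p-value: {min(null_p_values):.4f}")
print(f"largest p-value:  {max(null_p_values):.4f}")
print(f"fraction of p-values <= {alpha}: {fraction_below_alpha:.3f}")
```

The individual p-values range across the whole interval from 0 to 1, while the fraction falling at or below the cutoff should come out close to α. The long-run rate is a property of the cutoff, not of any single p-value.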

Note: In the cases where the null hypothesis is false, the false positive rate is always zero. It is not possible to falsely conclude a false hypothesis is false.

Conclusion

Despite a long history of controversy, the p-value enjoys widespread use for drawing conclusions in business, policy, academia, and even medicine. Misconceptions and misinterpretations are rampant, but spotting them is not an esoteric task. Users do not need a higher degree in mathematics to spot and avoid the logical errors. As seen above, many of the common fallacies suffer from the same underlying flaw. Specifically, the calculation of the p-value assumes the null hypothesis is 100% true and correct, so the p-value cannot possibly tell us about the possible scenarios where it is incorrect.