Regina Nuzzo has a news feature in Nature about P values that every biomedical scientist should read. P values — the common measure of statistical significance (i.e. the “believability” of an experiment) — do not mean what most scientists think they mean.
The P-value calculation was originally developed in the 1920s by the statistician Ronald Fisher as a way to judge whether an observed result was worth looking into further.
Researchers would first set up a ‘null hypothesis’ that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil’s advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.
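To make the recipe in that passage concrete, here is a minimal sketch of a P-value calculation using a permutation test. The permutation approach, the function name `permutation_p_value` and the measurements are illustrative assumptions, not something taken from the article. The null hypothesis is "no difference between the two groups", and the P value is estimated as the fraction of random relabelings that give a difference at least as extreme as the observed one.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided P value for a difference in group means.

    Assumes the null hypothesis that group labels are interchangeable,
    then counts how often a random relabeling produces a difference
    at least as extreme as the one actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the data as if the null were true
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical measurements, invented purely for illustration.
control = [4.1, 5.0, 4.7, 5.3, 4.9, 4.4]
treated = [5.9, 6.3, 5.6, 6.8, 6.1, 5.7]
print(permutation_p_value(control, treated))
```

Note what the number does and does not say: it is the chance of seeing data this extreme if the null hypothesis were true, not the chance that the null hypothesis is true.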
For many biomedical experiments, a P-value below 0.05 or 0.01 is considered “statistically significant”, and the result is therefore interpreted as believable. Many experiments yield calculated P-values of 0.001 or even lower. Attracted by the apparent precision of a calculated P-value and its resemblance to a true probability calculation, working scientists have come to interpret the P-value as if it directly measured the probability of their result being correct. But that is not true. The P-value summarizes data in the context of a specific null hypothesis, but it does not take into account the odds that the real effect was there in the first place.
The mathematics are complicated, but by one widely used calculation quoted by Nuzzo, a P-value of 0.01 actually corresponds in the real world to a probability of at least 11% that the experimental result is a false alarm due to random chance. For P=0.05, that probability rises to at least 29%! Even worse, some scientists are guilty of data-dredging or “p-hacking”, the practice of trying different conditions until you get the P-value you want. As a consequence, the random-sampling assumptions behind the P-value go out the window and, if you’ve tortured the data enough, the calculation becomes meaningless. No wonder that the overall level of reproducibility of biomedical research has been called into question.
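The post does not name the calculation behind the 11% and 29% figures. One calibration that reproduces them is the minimum Bayes factor bound of Sellke, Bayarri and Berger, -e·p·ln(p), combined with even (50:50) prior odds that a real effect exists. Treat that attribution as an assumption. The sketch below, with hypothetical function names, shows the arithmetic and how the answer shifts with the prior odds.

```python
import math


def min_bayes_factor(p):
    """Lower bound on the Bayes factor in favor of the null hypothesis.

    Uses the bound -e * p * ln(p), which is only valid for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("bound is only valid for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def false_alarm_probability(p, prior_prob_real=0.5):
    """Lower bound on the probability that a 'significant' result is a false alarm.

    prior_prob_real is the assumed probability, before the experiment,
    that the effect is real (0.5 means a toss-up).
    """
    prior_odds_null = (1 - prior_prob_real) / prior_prob_real
    posterior_odds_null = min_bayes_factor(p) * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)


for p in (0.01, 0.05):
    print(f"P = {p}: at least {false_alarm_probability(p):.0%} chance of a false alarm")
# P = 0.01: at least 11% chance of a false alarm
# P = 0.05: at least 29% chance of a false alarm
```

Lowering `prior_prob_real` (a long-shot hypothesis) pushes the false-alarm probability up sharply for the same P-value, which is exactly the information the P-value alone leaves out.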
A statistically significant P-value is in fact just an invitation to repeat the experiment. A practicing scientist needs to realize that, even with a highly “significant” P-value, there is still a relatively high probability that the result will not repeat. The best advice — something that I learned in the first week of grad school — is that you shouldn’t believe anything until you see n=2. Better yet, n=3.