One of our statisticians recently co-authored a paper in Nature Methods on the use and misuses of p-values: "The fickle P value generates irreproducible results". After reading this paper I really felt that I'd learned something useful: that p-values, which we use to determine whether an experimental result is statistically significant, are potentially so variable in the kind of experiments we're doing (3-4 replicate RNA-seq) as to be of little value in their own right.
Many of the statistically minded readers of this blog may well be rolling their eyes at my naivety, but I find stats hard, and explanations of how and why to choose statistical tests are often impenetrable. This paper is clear and uses some good examples (see figure below). Although I would have liked them to pick apart some recent RNA-seq experiments!
In a recent lab meeting, when presented with a list of seven options for the definition of a p-value (some, none or all of which were true), only one member of a lab of a dozen or so got the right answer. This is a very small survey, but it does raise the question of how many users of p-values really understand what the p-value means.
The fickle p-value: the main thrust of the paper is that the p-value used to assess whether a test result is significant (the probability of obtaining a result at least as extreme if the null hypothesis were true) is itself variable, and that variability is exaggerated because small samples cannot reflect the source population well: in small samples, random variation has a substantial influence. Reporting the confidence interval for the p-value does allow us to see its likely variability, but if a clear experimental design is not reported then interpreting p-values is as meaningless as interpreting charts without error bars.
In the example used in the paper (see figure above) the authors sample ten times from two populations that have only a small difference, i.e. a small effect size. Whilst the number of replicate measurements appears high (10 compared to the usual 3 for RNA-seq), the size of the effect is small and this makes the results very susceptible to chance.
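The paper's simulation is easy to reproduce in spirit. Below is a minimal Python sketch (not the authors' code) that repeatedly draws two groups of 10 observations from populations separated by a small effect and records the t-test p-value each time; the effect size, spread and number of repeats are illustrative assumptions:

```python
# Sketch: how much does p vary when we repeat the same small-effect experiment?
# The populations below are illustrative, not the values from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_per_group = 10      # replicates per group, as in the paper's example
effect = 0.5          # small standardised effect size (assumption)
n_experiments = 1000  # number of repeated "experiments"

p_values = []
for _ in range(n_experiments):
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    treated = rng.normal(loc=effect, scale=1.0, size=n_per_group)
    p_values.append(stats.ttest_ind(control, treated).pvalue)

p_values = np.array(p_values)
print("median p:", np.round(np.median(p_values), 3))
print("5th-95th percentile of p:", np.round(np.percentile(p_values, [5, 95]), 3))
print("fraction of experiments with p < 0.05:", np.round(np.mean(p_values < 0.05), 2))
```

With an effect this small the p-values sprawl across almost the whole 0-1 range from one repeat to the next, which is exactly the fickleness the paper describes.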
Effect size affects sample size: One of the questions we ask users in experimental design discussions is how big an effect they are expecting to see. By understanding this, and ideally the biological variability within sample groups, we hope to be able to design a well-powered study. Most users are unable to specify effect size or sample-group variability with any precision, and the experiment is often the first time they've seen this kind of data. In essence we might perhaps consider every experiment a pilot.
Because of this, the only thing we can reasonably adjust in our experiment is the sample size, i.e. the number of biological replicates in each group. However, this always means more work, more time and more money, so users are often reluctant to increase replicate numbers much beyond four.
We might be better off asking users what the minimum effect size they think they can publish will be, and designing the experiment appropriately. Unfortunately we can't help determine what the smallest meaningful effect might be: most gene expression experiments start by looking for changes of 2-fold or more, but we also see knock-down experiments where only a small amount of protein is left with no discernible effect on phenotype.
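If a user can at least name the smallest effect they would be happy to publish, a standard power calculation converts that into a replicate number. A sketch using statsmodels (the effect sizes and the 80% power target are assumptions chosen for illustration, not recommendations from the paper):

```python
# Sketch: replicates per group needed to detect a given standardised effect
# (Cohen's d) at 80% power and alpha = 0.05. The effect sizes are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.5, 1.0, 1.5, 2.0):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative="two-sided")
    print(f"d = {d:>3}: ~{int(round(n))} replicates per group")
```

Three or four replicates per group only reach 80% power for very large standardised effects, which is why the effect-size conversation matters so much.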
What to use instead, or what to report along with p: The authors do suggest some other statistical measures to use instead of p. They also suggest reporting more details of the experiment: sample size and effect size. Perhaps the most important suggestion, and also a very simple one, is to report the 95% confidence interval associated with the p-value, which tells us the likely range of p under repeated sampling.
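As a concrete habit, reporting the estimated difference together with its 95% confidence interval next to p already says far more than p alone. Here is a minimal Python sketch with invented expression values (the numbers and group sizes are illustrative assumptions, not data from the paper); note that the interval here is around the effect estimate itself:

```python
# Sketch: report the estimated difference and its 95% CI alongside p,
# not just p. The expression values below are invented for illustration.
import numpy as np
from scipy import stats

control = np.array([5.1, 4.8, 5.3, 5.0])
treated = np.array([6.0, 5.7, 6.3, 5.4])

t_stat, p = stats.ttest_ind(treated, control)

# 95% CI for the difference in means (pooled-variance t interval)
n1, n2 = len(treated), len(control)
diff = treated.mean() - control.mean()
pooled_var = ((n1 - 1) * treated.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

print(f"difference = {diff:.2f}, "
      f"95% CI [{diff - t_crit * se:.2f}, {diff + t_crit * se:.2f}], p = {p:.3f}")
```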
Commentary on the paper: There has been a pretty good discussion of the paper in the commentary on theconversation. One of the commenters stated that they thought a p-value included an estimate of its own variability; the reply to this comment, and the follow-up, say a lot about how people like me view p.
“Geoff Cumming has quantified the variability around p. For example, if a study obtains P = 0.03, (regardless of statistical power) there is a 90% chance that a replicate study would return a P value somewhere between the wide range of 0 to 0.6 (90% prediction intervals), while the chances of P ≤ 0.05 is just 56%. You can't get much more varied than that!”
“I thought P already contained the '90% ...' calculations. When one hears that the probability of something is X +- Y 19 times out of 20, it sounds like that's what P is, and that it 'knows' it's own variability. This simplistic P that you seem to describe sounds almost childishly useless. Why quote a datum when it is known that the next test would produce a quite different result?”
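Cumming's figures come from his prediction-interval calculations, but the flavour of the result can be checked with a crude simulation: back-calculate the effect a two-group study of a given size must have shown to return p = 0.03, assume that observed effect is the true one, and see what p-values replications produce. That assumption is optimistic (Cumming's intervals also fold in the uncertainty of the estimate), so the numbers below won't match his exactly; the group size is also an arbitrary choice:

```python
# Crude sketch: if a two-group study (n = 10 per group) returned p = 0.03,
# how variable is p in exact replications? Assumes the observed effect is the
# true effect, which is optimistic, so Cumming's prediction intervals are wider.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10
df = 2 * n - 2

# Observed standardised effect implied by p = 0.03 with this design
t_obs = stats.t.ppf(1 - 0.03 / 2, df)
d_obs = t_obs * np.sqrt(2 / n)

replicate_p = []
for _ in range(20000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d_obs, 1.0, n)
    replicate_p.append(stats.ttest_ind(a, b).pvalue)

replicate_p = np.array(replicate_p)
print("90% of replicate p-values fall between",
      np.round(np.percentile(replicate_p, [5, 95]), 3))
print("fraction of replicates with p <= 0.05:",
      np.round(np.mean(replicate_p <= 0.05), 2))
```

Even under this generous assumption the spread of replicate p-values is enormous, and a "significant" repeat is far from guaranteed.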
Is your RNA-seq experiment OK: What is the impact of this on your last RNA-seq experiment? I asked one of the authors to look at data from my lab and the results were not as great as I'd like... but that experiment was done with a collaborator who refused to add more replicates, so it was somewhat doomed. Don't worry, this was an external project!
Want to learn more about stats: The authors have published a series of introductory statistics articles aimed at biologists in the Journal of Physiology. You might also check out theanalysisfactor ("making statistics make sense"). They have a great post titled 5 Ways to Increase Power in a Study, which I've tried to summarise below:
To increase power:
- Increase alpha
- Conduct a one-tailed test
- Increase the effect size
- Decrease random error
- Increase sample size
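These levers are easy to see numerically. A sketch of how each one moves the power of a two-sample t-test, with an assumed baseline of d = 1, four replicates per group and alpha = 0.05 (purely illustrative numbers):

```python
# Sketch: how the levers above change the power of a two-sample t-test.
# The baseline (d = 1, n = 4 per group, alpha = 0.05) is an assumption.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power

baseline = dict(effect_size=1.0, nobs1=4, alpha=0.05, alternative="two-sided")
print("baseline:             ", round(power(**baseline), 2))
print("increase alpha to 0.1:", round(power(**{**baseline, "alpha": 0.10}), 2))
print("one-tailed test:      ", round(power(**{**baseline, "alternative": "larger"}), 2))
print("double effect size:   ", round(power(**{**baseline, "effect_size": 2.0}), 2))
print("double sample size:   ", round(power(**{**baseline, "nobs1": 8}), 2))
# Decreasing random error raises the standardised effect size (d = difference / SD),
# so it acts like the "double effect size" line above.
```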
Just to avoid confusion, p-values and confidence intervals stem from different schools of statistics and shouldn't be mixed. A p-value does not have a confidence interval but it is possible to estimate a confidence interval around the effect size.
There are (at least) three schools of thought in statistics: Fisherian (p-values), frequentist (confidence intervals) and Bayesian. Bayesian is probably the most philosophically correct but p-values are likely the easiest to apply in practice. "R. A. Fisher in the 21st century" (http://projecteuclid.org/euclid.ss/1028905930) is a nice paper by Efron explaining differences between the three schools (figure 8 is especially helpful).
I also like the "P values are not error probabilities" (http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf) paper and wonder if the two commentators above might benefit from a read.