Friday 21 March 2014

Some help with your stats

Stats: not everyone's favourite subject, but something we can't avoid, so understanding the basics is a very good idea. We're lucky in my Institute to have biostatistical support in our Bioinformatics core facility, and we try to have a statistician with us every time we design a genomics experiment. The same questions come up time and time again: how many samples, and how deep to sequence? We're only slowly getting answers, but the experience we're building up helps nearly every time. I also find other sources of information can be really helpful and have listed a few of them below.

Books about stats: You can buy the very useful Lab Math by Dany Spencer Adams, published by Cold Spring Harbor Laboratory Press and available from just £32.69 on Amazon. The book covers the most common mathematical tools you might apply in molecular biology; anyone making up reagents, performing simple statistical tests or working with nucleic acids and proteins is likely to benefit from a quick read through it.

Stats from Nature Methods: You can now get all 35 Points of View columns in one place thanks to Nature Methods and the Methagora blog. I still feel that these could be collected together in a single document as an eBook. I've always liked the format that PoV took: short, focused articles that gradually introduce the important concepts in presenting data. I wrote about the series in the summer of last year.

Now Nature Methods have gone for a similar format, but with a focus on stats, in the Points of Significance column, which puts statistics in the limelight. Let's face it, there's little to be gained from a beautiful or carefully constructed visualisation if the data underneath are crippled by poor statistical analysis.

Stats from BiteSizeBio: a great series of articles by BiteSizeBio author Laura Fulford can be read together as a good stats introduction - Let’s Talk About Stats: Understanding the Lingo, Comparing Two Sets of Data, Comparing Multiple Datasets, and Getting the Most out of your Multiple Datasets with Post-hoc Testing.

Laura covers the language used and makes the very important point that you need to understand it to be able to talk to statisticians. Don’t forget that you need to explain your language to them too: RNA-seq, exomes, read-depth and single- vs paired-end are all likely to be a mystery to most statisticians. I particularly liked her coverage of sample size (n), variance, and false positives and false negatives. Laura’s advice is simple: “the larger your sample size, the better… [and] as an absolute minimum you need an n of 3 to perform a statistical test”. Variance is important to understand: if you have high variance within your test and control groups, then making comparisons between them is going to be more difficult; you have “noisy” data. Be careful to check that your data are normally distributed; if they are not, your statistical test might not be appropriate. Lastly, Laura makes the point that statistical tests are not perfect. They generate errors, and the two that most people watch out for are false positives (type 1 errors), where a result looks statistically significant but the difference is not real, and false negatives (type 2 errors), where a real difference is missed.
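To make the points about sample size and variance concrete, here is a minimal sketch in Python using only the standard library. The measurements and the `summarise` helper are made up for illustration; the n of 3 check simply encodes the minimum quoted above and is not a substitute for a proper power calculation.

```python
import statistics

MIN_N = 3  # the "absolute minimum" n quoted above, not a recommendation


def summarise(name, values):
    """Return n, mean and sample variance for one group of measurements."""
    n = len(values)
    if n < MIN_N:
        raise ValueError(f"{name}: n={n} is below the minimum of {MIN_N}")
    return {
        "group": name,
        "n": n,
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
    }


# Hypothetical measurements: same mean, very different spread.
tight = summarise("tight", [10.1, 9.9, 10.0])
noisy = summarise("noisy", [7.0, 13.0, 10.0])

for row in (tight, noisy):
    print(f"{row['group']}: n={row['n']} mean={row['mean']:.2f} "
          f"variance={row['variance']:.2f}")
```

Both groups have the same mean, but the "noisy" group has a far larger variance, so a difference of a given size between it and another group would be much harder to detect.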

In the second piece Laura provides a simple diagram to help you choose your statistical test. She describes the commonly used t-test and Mann-Whitney test for finding differences between two groups of samples, i.e. A vs B, tumour vs normal, treated vs untreated. If you are using a t-test, make sure your data are continuous, have a normal (or nearly normal) distribution and have equal variance between sample groups. The Mann-Whitney test is used for unpaired samples and does not care how your data are distributed (normal or otherwise) or what the variance is; it is a non-parametric test.
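As a rough illustration of what these two tests actually compute, the sketch below works out the test statistics by hand in standard-library Python. The data are invented, and the function names are my own. It stops at the statistics themselves: turning them into p-values needs the t and U distributions, which in practice you would get from a statistics package (for example `scipy.stats.ttest_ind` and `scipy.stats.mannwhitneyu`, if SciPy is available to you).

```python
import math
import statistics


def student_t(a, b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def mann_whitney_u(a, b):
    """U statistics for both groups: count pairs where one value beats the other.

    Ties count as half a win for each side. Only the ordering of the values
    matters, which is why no distribution is assumed.
    """
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Hypothetical measurements for two unpaired groups.
treated = [5.1, 4.9, 6.2, 5.8, 5.5]
untreated = [4.1, 3.9, 4.6, 4.4, 5.0]

t_stat = student_t(treated, untreated)
u_treated, u_untreated = mann_whitney_u(treated, untreated)

print(f"t = {t_stat:.3f} with {len(treated) + len(untreated) - 2} degrees of freedom")
print(f"U (treated) = {u_treated}, U (untreated) = {u_untreated}")
```

The t statistic depends on the means and variances, so it leans on the normality and equal-variance assumptions; the U statistic depends only on ranks.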

In the last article Laura covers statistical tests suitable for comparing more than two datasets. Again, the choice of test depends on the design of your experiment, but of course you’ll have included a discussion with a statistician in the design process before you generate any data. For experiments with a single variable, one-way ANOVA might be appropriate, e.g. treated vs untreated for two drugs. Experiments with more than a single variable require different tests such as two-way ANOVA, e.g. treated vs untreated for two drugs in male and female mice.
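For the single-variable case, a one-way ANOVA boils down to comparing the variation between group means with the variation within groups. The sketch below computes that F statistic by hand in standard-library Python; the three groups are invented and `one_way_anova_f` is a name I made up. As with the two-group tests, a p-value needs the F distribution from a statistics package (SciPy's `scipy.stats.f_oneway` is one option).

```python
import statistics


def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA across two or more groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical measurements: one untreated control and two drugs.
control = [4.1, 3.9, 4.6, 4.4, 5.0]
drug_a = [5.1, 4.9, 6.2, 5.8, 5.5]
drug_b = [4.5, 4.8, 5.2, 4.9, 5.1]

f_stat = one_way_anova_f(control, drug_a, drug_b)
print(f"F = {f_stat:.3f} with (2, 12) degrees of freedom")
```

A large F says that at least one group mean differs from the others, but not which one.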

Unfortunately, these tests for more complex experiments only tell you whether there is a statistically significant difference, not where it is; for that you need to do some post-hoc testing. You also need to consider multiple testing correction, especially when applying statistical tests to data like exomes and RNA-seq, where thousands of tests are run at once. Without it, a p-value threshold of 0.05 is going to throw up a lot of false positives.
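To see why correction matters, here is a small sketch in standard-library Python. The gene count and the p-values are invented, and the two corrections shown (Bonferroni and Benjamini-Hochberg) are common choices that I have picked for illustration rather than ones named in the articles above.

```python
ALPHA = 0.05

# With a hypothetical 20,000 genes and no real differences at all, a raw
# threshold of 0.05 would still be expected to flag about 5% of them.
n_genes = 20_000
expected_false_positives = n_genes * ALPHA
print(f"Expected false positives without correction: {expected_false_positives:.0f}")


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values from eight tests.
raw = [0.041, 0.001, 0.205, 0.008, 0.039, 0.074, 0.042, 0.06]
adjusted = benjamini_hochberg(raw)

n_raw = sum(p < ALPHA for p in raw)
n_bonferroni = sum(p < ALPHA / len(raw) for p in raw)
n_bh = sum(p < ALPHA for p in adjusted)

print(f"Significant at {ALPHA}: raw={n_raw}, "
      f"Bonferroni={n_bonferroni}, Benjamini-Hochberg={n_bh}")
```

Bonferroni is the stricter of the two because it controls the chance of any false positive; Benjamini-Hochberg instead controls the expected proportion of false positives among the results you call significant.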

Hopefully some of this helps you next time you're deciding how many times to replicate your experiment and thinking about what the variance might be within and between sample groups.

1 comment:

  1. This was a great post! I Tweeted it -- LinkedIn it -- and Google+d it.

