Stats: not everyone's favourite subject, but something we can't avoid, so understanding the basics is a very good idea. We're lucky in my Institute to have biostatistical support in our Bioinformatics core facility, and we try to have a statistician with us every time we design a genomics experiment. The same questions come up time and time again: how many samples, and how deep to sequence? We're only slowly getting answers, but the experience we're building up helps nearly every time. I also find other sources of information can be really helpful and have listed a few of them below.
Books about stats: You can buy the very useful Lab Math by Dany Spencer Adams, published by Cold Spring Harbor Laboratory Press and available from just £32.69 on Amazon. The book covers the most common mathematical tools you might apply in molecular biology; anyone making up reagents, performing simple statistical tests or working with nucleic acids and proteins is likely to benefit from a quick read through it.
Stats from Nature Methods: You can now get all 35 Points of View columns in one place thanks to Nature Methods and the Methagora blog. I still feel that these could be collected together in a single document as an eBook. I've always liked the format that PoV took: short, focused articles that gradually introduce the important concepts in presenting data. I wrote about the series in the summer of last year.
Now Nature Methods have gone for a similar format but with a focus on stats in the Points of Significance column, which puts statistics in the limelight. Let's face it, there's little to be gained from a beautiful or carefully constructed visualisation if the data underneath are crippled by poor statistical analysis.
Stats from BiteSizeBio: a great series of articles by BiteSizeBio author Laura Fulford can be read together as a good stats introduction: Let’s Talk About Stats: Understanding the Lingo; Comparing Two Sets of Data; Comparing Multiple Datasets; and Getting the Most out of your Multiple Datasets with Post-hoc Testing.
Laura covers the language used and makes the very important point that you need to understand this to be able to talk to statisticians, but don’t forget you need to explain your language to them too: RNA-seq, exomes, read-depth and single- vs paired-end are all likely to be a mystery to most statisticians. I particularly liked her coverage of sample size (n), variance, and false positives and negatives. Laura’s advice is simple: “the larger your sample size, the better… [and] as an absolute minimum you need an n of 3 to perform a statistical test”.
Variance is important to understand: if you have high variance within your test and control groups then making comparisons between them is going to be more difficult; you have “noisy” data. Be careful to check whether your data are normally distributed; if they are not, your statistical test might not be appropriate. Lastly, Laura makes the point that statistical tests are not perfect: they generate errors, and the two most people watch out for are false positives (type 1 errors), where a result looks statistically significant but there is no real difference, and false negatives (type 2 errors), where a real difference is missed.
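To make the "noisy data" point concrete, here is a minimal sketch in plain Python. The two sets of values are made up for illustration and the summarise helper is hypothetical, not something from Laura's articles. Both groups have the same mean and the same n of 3, but very different variance, and so a very different standard error of the mean.

```python
import math
import statistics


def summarise(values):
    """Return n, mean, sample variance and standard error of the mean."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance (n - 1 denominator)
    sem = math.sqrt(variance / n)           # standard error of the mean
    return {"n": n, "mean": mean, "variance": variance, "sem": sem}


# Made-up measurements, for illustration only.
tight = [10.1, 9.9, 10.0]   # replicates that agree well
noisy = [7.0, 13.0, 10.0]   # same mean, much more spread

tight_summary = summarise(tight)
noisy_summary = summarise(noisy)

print(tight_summary)
print(noisy_summary)
```

Because the standard error is the square root of variance divided by n, the only ways to tighten the estimate of a group mean are to reduce the variance or to add replicates, which is why the "how many samples" question keeps coming up.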
In the second piece Laura provides a simple diagram to help you choose your statistical test. She describes the commonly used t-test and Mann-Whitney test for finding differences between two groups of samples, i.e. A vs B, tumour vs normal, treated vs untreated. If using a t-test then make sure your data are continuous, have a normal (or nearly normal) distribution and have equal variance between sample groups. The Mann-Whitney test is used for unpaired samples and is a non-parametric test: it does not require your data to be normally distributed or the variances to be equal.
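The difference between the two tests is easier to see with numbers. The sketch below computes only the test statistics (no p-values) in plain Python; the data are invented and the function names student_t and mann_whitney_u are my own, not from any library. In practice you would use a stats package, and ideally your statistician, to get proper p-values.

```python
import math
import statistics


def student_t(a, b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def mann_whitney_u(a, b):
    """U statistic for group a: pairs where a > b count 1, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Invented values, for illustration only.
untreated = [3.9, 4.2, 4.6, 4.9]
treated = [5.1, 5.8, 6.3, 6.9]
treated_outlier = [5.1, 5.8, 6.3, 60.0]  # same ranks, one extreme value

print(student_t(treated, untreated), mann_whitney_u(treated, untreated))
print(student_t(treated_outlier, untreated), mann_whitney_u(treated_outlier, untreated))
```

Every treated value is larger than every untreated value in both cases, so U stays at its maximum of 16. The t statistic, on the other hand, drops sharply once the outlier inflates the variance. That is the practical meaning of "non-parametric": the Mann-Whitney test only looks at ranks.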
In the last article Laura covers statistical tests suitable when comparing more than two datasets. Again the choice of test is dependent on the design of your experiment, but of course you’ll have included a discussion with a statistician in the design process before you generate any data. For experiments with a single variable, one-way ANOVA might be appropriate, e.g. treated vs untreated for two drugs. Experiments with more than a single variable require different tests such as two-way ANOVA, e.g. treated vs untreated for two drugs in male and female mice.
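As a rough sketch of what one-way ANOVA is doing, the code below computes the F statistic by hand: the variation between group means divided by the variation within groups. The three groups (untreated, drug A, drug B) and their values are assumptions I've made up to match the single-variable example above, and one_way_anova_f is a hypothetical helper rather than a library function. Turning F into a p-value needs the F distribution, which is best left to a stats package.

```python
import statistics


def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Invented values, for illustration only.
untreated = [4.0, 5.0, 6.0]
drug_a = [6.0, 7.0, 8.0]
drug_b = [8.0, 9.0, 10.0]

f_value = one_way_anova_f([untreated, drug_a, drug_b])
print(f_value)
```

A large F says the group means differ by more than the within-group noise would suggest, but it does not say which groups differ from which.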
Unfortunately these tests for more complex experiments only tell you if there is a statistically significant difference, not which groups differ; for this you need to do some post-hoc testing. You also need to consider multiple testing correction, especially when applying statistical tests to data like exomes and RNA-seq; without it a threshold of p < 0.05 is going to throw up a lot of false positives.
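A quick sketch shows why correction matters. The 20,000 tests (roughly one per gene in an RNA-seq experiment) and the ten p-values are assumptions for illustration, and the functions are simple hand-written versions of two common corrections, Bonferroni and Benjamini-Hochberg, not the method from Laura's articles or any particular package.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives if no test has a real effect."""
    return n_tests * alpha


def bonferroni(p_values, alpha=0.05):
    """Call a test significant only if p <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Everything ranked at or below max_rank is called significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Assumed number of tests, e.g. one per gene.
print(expected_false_positives(20000))

# Invented p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
uncorrected = [p < 0.05 for p in p_values]
print(sum(uncorrected), sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

With 20,000 tests and no real effects at all you would still expect about 1,000 hits at p < 0.05. For the ten example p-values, five pass the uncorrected threshold, one survives Bonferroni and two survive Benjamini-Hochberg.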
Hopefully some of this helps you next time you're deciding how many times to replicate your experiment and thinking about what the variance might be within and between sample groups.