Thursday, 2 July 2015

Does survival of the fittest apply to bioinformatics tools?

What do 48 replicates tell you about RNA-seq differential gene expression (DGE) analysis methods? That two of the most widely used tools, DESeq and edgeR, are probably the best tools for the job*. These two tools also top the rankings of RNA-seq methods as assessed by citations, with 1204 and 822 respectively. These are conclusions from probably the most highly replicated RNA-seq study to date**. The authors aimed to identify the correct number of replicates to use and concluded that we should be using ~6 replicates for standard RNA-seq, and that we should consider increasing this to ~12 when identifying DGE irrespective of fold-change.

The paper is a very good one and certainly makes comforting reading in a place where people are open to increasing replicates. It also supports two messages we’ve been giving out in our Tuesday afternoon experimental design clinics for many years: with n=3 you only need one replicate to drop out and you’re screwed, and more replicates with fewer reads per replicate = an acceptable level of cost increase!
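To put some rough numbers on that second message, here is a minimal sketch of the arithmetic. The function name and both prices are made up for illustration (they are not from the paper or from any real price list); the only point is that if you halve the reads per replicate while doubling the replicates, the sequencing bill stays flat and only the library prep cost goes up.

```python
def experiment_cost(replicates_per_group, reads_per_replicate_m, groups=2,
                    library_prep_cost=50.0, cost_per_million_reads=2.0):
    """Total cost of an RNA-seq experiment under a simple linear model.

    All prices are hypothetical placeholders: library_prep_cost is per sample,
    cost_per_million_reads is per million reads sequenced.
    """
    samples = replicates_per_group * groups
    prep = samples * library_prep_cost
    sequencing = samples * reads_per_replicate_m * cost_per_million_reads
    return prep + sequencing


# n=3 per group at 30M reads each versus n=6 per group at 15M reads each.
baseline = experiment_cost(replicates_per_group=3, reads_per_replicate_m=30)
more_replicates = experiment_cost(replicates_per_group=6, reads_per_replicate_m=15)

print(f"n=3 at 30M reads: {baseline:.0f}")
print(f"n=6 at 15M reads: {more_replicates:.0f}")
print(f"increase: {more_replicates / baseline:.2f}x")
```

With these invented prices the total number of reads is identical in both designs, so the whole difference comes from the six extra library preps. Plug in your own facility's prices to see where the trade-off lands for you.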

But the paper also got me wondering: what made DESeq and edgeR so popular? Why does a bioinformatics tool sometimes end up dominating a field? Is it because the dominant tool has the best statistical approach? Is it easy to use? After talking to a couple of bioinformaticians here, the answer seems to be a combination of the following:
  • The tool is actually quite good at what it was designed to do
  • The tool is easy to use, is implemented well, can be configured for advanced use, is well documented, and is supported, ideally by the person who wrote it; if the tool is open source, support might also come from the community, e.g. Bioconductor (described very comprehensively in this paper).
  • The tool is published in a high-impact paper (not usually a methods paper)
The order of these may well depend on the person using the tool, and there are almost certainly other factors of more interest to bioinformaticians. However, the last point is one that deserves some consideration. When I spoke to one of our bioinformaticians, they made the point that biologists often want to reproduce an analysis they've seen in a paper ("I'd like a plot like this one please"), and the kind of papers biologists read are biology ones, not bioinformatics (or genomics) methods papers.

If a bioinformatician develops a tool to help answer a biological question and that tool is published in a high-impact biology paper, then they have lost the first-author slot but potentially gained in many other ways.


*The same conclusion was made about the quality of DESeq and edgeR results by researchers at the Queensland Brain Institute in 2014.

**The data for the S. cerevisiae experiment are available on ENA: project ID PRJEB5348. Samples were processed in four batches of 24 samples, with 12 of each strain in each batch, using Illumina’s TruSeq (stranded) mRNA kit. Seven pools were created and each was run in a lane as single-end 51 bp. This is a goldmine for looking at batch effects, albeit in an almost perfect experimental situation. This is discussed in a paper published alongside this one by the same group, and you should check out Geoff Barton's Blog for more...
