CoreGenomics: ENCODE's RNA-seq recommendations need revising

One of the great things about ENCODE was the amount of effort put into standardising how different groups did their experiments. As any core manager knows, a good SOP is a huge step towards high-quality data. The ENCODE consortium put together guidelines for ChIP, RNA and RIP. Those guidelines, published almost a year ago, are due for revision in a month or so. I'm looking forward to the revised document because, in my experience, the current recommendations have some serious problems.

The RNA guidelines are aimed at differential gene expression, at least I think that's what they mean!

"The guidelines and standards discussed here do not exhaustively cover the entire matrix of this experimental space, but instead emphasize best practices designed to support “reference quality” transcriptome measurements for major RNA sample type.

Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M pair-end reads of length >30NT, of which 20-25M are mappable to the genome or known transcriptome)."

My main issues are with the number of replicates and the depth of sequencing. First off, you might want to read the guidelines yourself before listening to me: see Standards, Guidelines and Best Practices for RNA-Seq: 2010/2011.

So what do I think the guidelines should say?

Replicates: 2 replicates is not enough, at least not for the majority of statistical methods being applied to the data today. Three is the minimum and has been for a long, long time. We always try to start with 4 so the experiment stays robust should one sample drop out. The more the better, but sequencing costs can increase dramatically if you are sequencing too deep, which leads me on to my next point...

Sequence Depth: 30M paired-end reads for differential gene expression is at least twice as many reads as necessary and possibly 6x too high. We've looked at the effect of single-end vs paired-end for calling DGE and it makes no difference; zero, zip, zilch, nada, none, naught, maru, líng, null, nula, cero, ṣifr, éfes, bado, soun, không, saiva, suziyam, midén, śūnya (enough of all these other words for zero).

And it certainly looks like 10-20M reads is the maximum you would want to use for DGE. If you have more sequencing power available then add more replicates and understand the biological variance in your groups better. I'd agree with ENCODE that more reads allow you to detect the very lowly expressed transcripts, but 100-200M PE76 is insane.

What impact does this have on my DGE experiment? Consider a relatively straightforward 24-sample experiment: a two-group comparison (treated vs untreated) with 2 cell lines at 2 time points, and everything replicated 3 times. You want to discover which genes and pathways are differentially expressed between the two groups, whether time has an impact and whether both cell lines behave the same way. I'm sure many of you will have done experiments like this.
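To make the sample count concrete, here is a minimal sketch that enumerates the design described above: 2 treatments x 2 cell lines x 2 time points x 3 replicates. The cell line and time point labels are placeholders I have made up for illustration; only the factor counts come from the example.

```python
from itertools import product

# Factor levels for the example design. The labels are hypothetical
# placeholders; only the number of levels comes from the post.
treatments = ["treated", "untreated"]
cell_lines = ["cell_line_A", "cell_line_B"]
time_points = ["T1", "T2"]
replicates = [1, 2, 3]

# One dictionary per library that would need to be prepared and sequenced.
samples = [
    {"treatment": t, "cell_line": c, "time_point": tp, "replicate": r}
    for t, c, tp, r in product(treatments, cell_lines, time_points, replicates)
]

print(len(samples))  # 2 * 2 * 2 * 3 samples in total
```

A sample sheet like this is also a handy starting point for the design matrix you would later hand to your DGE tool of choice.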

Assuming you use the guidelines from ENCODE, you need to generate about 480-720M reads, or 3-4 lanes of PE30+. In a typical core this will cost $60 for each library prep and $750 for each lane of sequencing: about $3000-4000 in total.

If you use just 10-20M SE50 reads then the libraries still cost $1500 to make, but the sequencing comes down to just 240-480M reads, or 1.5-3 lanes. That is a modest saving of perhaps $1000, but bring sample numbers up into the hundreds and the savings quickly rack up. And if we can run twice as many samples through our sequencers then the massive cost of the Illumina service contract will be shared by more users. Did you know it costs $80,000 to keep a HiSeq serviced!
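If you want to play with these numbers yourself, the arithmetic can be sketched roughly as below. The $60 library prep, $750 lane and 24 samples come from the example above. The lane yield of 160M reads is my assumption, back-calculated from "480M reads is about 3 lanes", and the sketch charges the same lane price for SE and PE runs, as the rough figures above do. Lanes are left fractional, so the totals will not match my rounded figures exactly.

```python
# Figures taken from the example above.
N_SAMPLES = 24
LIBRARY_PREP_USD = 60
LANE_USD = 750

# Assumption: ~160M reads per lane, inferred from "480M reads ~ 3 lanes".
READS_PER_LANE_M = 160


def experiment_cost(reads_per_sample_m, n_samples=N_SAMPLES,
                    reads_per_lane_m=READS_PER_LANE_M):
    """Return total reads (millions), fractional lanes and costs in USD."""
    total_reads_m = reads_per_sample_m * n_samples
    lanes = total_reads_m / reads_per_lane_m
    library_usd = n_samples * LIBRARY_PREP_USD
    sequencing_usd = lanes * LANE_USD
    return {
        "total_reads_m": total_reads_m,
        "lanes": lanes,
        "library_usd": library_usd,
        "sequencing_usd": sequencing_usd,
        "total_usd": library_usd + sequencing_usd,
    }


# Compare the ENCODE-style depth with the lower depths suggested here.
for depth in (30, 20, 10):
    print(depth, experiment_cost(depth))
```

Change `n_samples` to a few hundred and the gap between the 30M and 10M rows grows in proportion, which is the point about savings racking up.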

4 comments:

Aaron Statham, 11 April 2013 at 21:39
Is $60 for an RNA-Seq library prep accurate?

Georgi Marinov, 12 April 2013 at 05:54
ENCODE was not concerned with differential expression when those guidelines were written. Thus the two replicates and the higher sequencing depth.

Anonymous, 15 April 2013 at 16:15
$100 would probably be more accurate for a library prep when you consider consumables external to any prep kit (clean-up beads/columns, Bioanalyser chips, plastics etc).
Of course, that is a consumables-only cost. You may need to spend more money on lab staff than you do on the consumables.

Anonymous, 19 September 2013 at 18:10
The recommendation of 100-200M PE76 is for isoform level, I think.

Thursday, 11 April 2013
