Saturday, 15 March 2014

Cheaper RNA-seq: but you might have to give up strandeness and splicing

I'm going to start this post by saying something that may be controversial "splicing analysis is a niche, strandedness is useful in a minority of cases and that most RNA-seq users are only interested in getting Affy-like 3' biased differential gene expression data".

I'm constantly looking at how we perform RNA-seq analysis for differential gene expression (DGE) analysis with the major goal of making it cheap enough that replication levels increase from the commonly used 3 replicates to a more powerful 6+ replicates. Obviously users are put off doubling the cost of their experiment when everyone's been running triplicate as the maximum level of replication for the last ten years or more, so ideally we'll keep experimental costs the same while trying to improve the quality.

Now I do believe RNA-seq has made a dramatic difference to the cost of performing DGE experiments and is way cheaper than Affy arrays used to be. But a side-effect of moving over to RNA-seq is that most scientists are very interested in science and read widely so when they see papers exclaiming all the possibilities RNA-seq brings they'd like to get as much as possible from their own RNA-seq experiments.

What can you do with RNA-seq: i) differential gene expression in an (almost) unbiased manner, ii) analyse splicing isoforms using splice-junction reads, iii) identify the full extent of the transcriptome including non-coding, antisense and over-lapping transcripts often making use of strand information.

The challenge with the first statement is that there is always bias experiments, just because RNA-seq does not suffer from the problems microarrays has/had; of sensitivity, reliance on an annotated transcriptome, limitations of design/use of oligonucleotide probes, etc, does not mean RNA-seq isn't biased. The choice of time-point of cell-type in any experiment adds an immediate, although hopefully informed, bias, RNA extraction methodology will have an impact and RNA-seq library preparation, and sequencing will add other layers of bias on top.

Statement two gets many biologists salivating. We've all read some fantastic papers showing the biological importance of splicing and would like to be able to show similar results in the systems we're studying. However splicing analysis is limited by the fact that today we can only sequence through a single splice junction in each read meaning we need to infer which isoforms might be present, and if there is differential isoform usage. The competing methods of analysis work in quite different ways: cufflinks use of the Bowtie+TopHat spliced alignment versus DEX-seq's analysis of differential exon usage by couting reads in individual exons.

Statement three has been massively impacted by the introduction of stranded RNA-seq protocols. Methods that preserve the information about which strand the transcript came from have allowed an even more detailed view of the transcriptome than we have ever had especially in the cases of transcripts that overlap in genome location and the extent and biological importance of antisense transcription.

What does this have to do with making RNA-seq cheaper: Unfortunately ii) and iii) add significant complexity to the design, cost and analysis of RNA-seq experiments. The impact of splicing analysis is relatively simple to explain to users, we need many more reads and this increases the cost of the experiment in a linear fashion and users can decide if the splicing analysis is "worth" it. But the impact of getting strand information is less easy to explain (certainly for me in mammalian systems), very few of us are strand-savvy and we're most likely to ignore the stand information other than making nicer figures for presentations to show how well the method worked. Strand information does not come free-of-charge, the most popular method for stranded RNA-seq uses dUTP and uracil-DNA glycosylase (UDG) to remove the second-strand cDNA after ligation of sequencing adapters. Strand-preserving methods cost more to manufacture than non-stranded methods and hence cost more for end users.

The cost of RNA-seq in 2014: I recommend 10-20M single-end 50bp reads per sample for DGE studies and that allows 10 samples per lane on a HiSeq 2500/2000. At about £600 per lane this makes sequencing £60 per sample. Library prep reagents and labour are about £35 each, making RNA-seq one of the few NGS applications where sequencing costs are lower than sample prep.

Ideally we'd maintain a 10 fold difference between sequencing and library prep, but this would mean bringing prep down to under £15 or $10 per sample. Is it possible to squeeze oligo-dT mRNA enrichment, cDNA synthesis and an Illumina prep into this budget? I'm not sure this is possible outside of massive labs that can afford to make their own kits and buy enzymes in bulk.

We're thinking about how we might approach RNA-seq prep to keep costs as low as possible. Home-brew is certainly going to get us some way to $10 per prep, but some very creative thinking may be necessary. Would you give up strandedness for an RNA-seq kit that gave perfectly good DGE data but cost half as much as the leading stranded protocol?

Would it make much difference to your next experiment: not if you once you think carefully about your experiment you can clearly state that splicing and strandedness are not going to be primary areas for analysis. I do think most scientists don't have these uppermost in their minds when embarking on RNA-seq experiments and a quick search in PubMed might back me up: there are 3148 hits in PubMed for RNA+Seq or 2359 hits for "RNA-seq", but only 3% (73) come up with the terms RNA+seq+strand+specific, so strandedness is not yet massive, which was a surprise to me given the reception Josh Levin's 2010 Nat Methods paper: Comprehensive comparative analysis of strand-specific RNA sequencing methods got.

So here's my plea to RNA-seq users: think about what you really want to get from your RNA-seq experiments and only aim for "expensive" splicing experiments with 20-50M reads per sample if you really need to, think about whether you'd give up other features to get faster, cheaper RNA-seq experiments; and save the money and effort for the experiment that needs it.


  1. The most interesting application of RNA-seq after DGE is finding fusion genes and translocations! Finding fusion genes in RNA-seq works pretty well and there are plenty of successes!!!@

  2. Agree 100%. The other cost to remember is the analysis. Plain 'Affy-like' count data feeds straight into established pipelines (edgeR, DESeq, voom/limma); really getting into stranded, splice-aware, ncRNA data is an order of magnitude more work for the bioinformatician. A black hole for effort if it's not the primary goal of the experiment.

  3. Excellent points James. What is your preferred analysis pipeline for splicing/isoform detection?

  4. The author raises stand specific info problem without having a clue if it is a problem or not.

  5. By processing larger batches using automation and homebrew (where applicable), we currently have library prep costs around £40 - this includes QC and quant of the starting material, library prep, QC and quant of the library.
    The NextSeq 500 is ideal for DEG RNA-Seq; single read 75bp is less than £900 and you should be able to plex 24 samples in a run for ~£40 per sample sequencing. Unfortunately, labour costs pretty much double this price plus amortisation of equipment mean we're at about £200 per sample. The Holy Grail would be £100, but unless the university allows us to hire actual monkeys, it ain't gonna happen.
    Of course BaseSpace can also provide a ready made analysis pipeline for RNA-Seq on NextSeq 500. Although Tophat/Cufflinks is not cutting edge it should be adequate for most purposes.

  6. Have you done a power analysis that compares those two approaches?

  7. We do have 10$ RNA-seq library prep protocol, but we do not plan to publish it until at least Winter by political reasons. It works for both mammals and bacteria, and for any type of RNA, including cross-links.
    All reagents are commercial.

  8. I think this way is effective and affordable. People often think they can have a better result with a higher price when they do their experiments. But the fact is that they cost unnecessary. This can not only be applicated in research on RNA-Seq but also other experiments.