The amount of RNA used in an RNA-seq library prep is often listed as a competitive advantage by kit manufacturers. As late as 2004 I used up to 30ug of RNA in a microarray prep, and even a couple of years ago 100ng was considered "low". Nowadays kits are available for picogram quantities, but have you ever considered how much of the total RNA you measure is actually going to be informative?
The answer is not a lot! Stopping to think about this is important, as the amount of something in a sample directly affects how easily we can measure it. Wendell Jones (Global Head of Bioinformatics at Expression Analysis) gave a great talk at the recent RNA-seq Europe meeting, where he discussed the relative abundance of different RNAs and the ease (or not) of measuring these on different gene expression platforms. He kindly gave me a copy of his slide deck and I've used this as the basis of my figures below.
RNA QT/QC: We often measure RNA quantity with Ribogreen and quality using the Bioanalyser. When you take a look at a typical Bioanalyser trace you'll see two major peaks from the 18S and 28S ribosomal RNAs, the ratio of which is used to calculate the RIN. It should be obvious to all that what we see on the Bioanalyser are two stonking great peaks from just two rRNAs, and these, usually unimportant from our perspective, account for a large portion of the total RNA in our Eppendorfs.
What is total RNA: The RNA we get after a total RNA extraction is a complex mix of millions of transcripts. However, it is also a mix that is dominated by a very few species: tRNA, rRNA and some very highly expressed transcripts (e.g. Globin or Rubisco). The RNAs we are usually interested in are expressed at very low levels compared to these, and at first glance at the figure below you might just wonder how we measure any of them at all! We manage it because the most abundant RNAs, namely tRNA and rRNA, are uninteresting to most scientists, so we usually enrich for the mRNA/ncRNA, which are in the bottom 5% (by abundance).
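To put rough numbers on why enrichment matters, here is a minimal sketch that assumes reads are sampled in direct proportion to RNA abundance. It reads the "bottom 5%" above as a 5% share of total RNA, which is a simplification, and the read depth and the rare transcript's share of the mRNA pool are made-up values for illustration only.

```python
def expected_reads(total_reads, pool_fraction, transcript_fraction_in_pool):
    """Expected reads from one transcript, assuming reads are drawn
    in proportion to each RNA's share of the library."""
    return total_reads * pool_fraction * transcript_fraction_in_pool


TOTAL_READS = 20_000_000   # illustrative sequencing depth
MRNA_POOL = 0.05           # assumed mRNA/ncRNA share of total RNA
RARE_TRANSCRIPT = 1e-5     # hypothetical share of the mRNA pool

# Sequencing total RNA: most reads are spent on rRNA and tRNA.
unenriched = expected_reads(TOTAL_READS, MRNA_POOL, RARE_TRANSCRIPT)

# Idealised enrichment: every read comes from the mRNA/ncRNA pool.
enriched = expected_reads(TOTAL_READS, 1.0, RARE_TRANSCRIPT)

print(f"without enrichment: about {unenriched:.0f} reads")
print(f"with enrichment:    about {enriched:.0f} reads")
```

Under these assumptions the same rare transcript goes from roughly ten reads to a couple of hundred, simply because the reads are no longer being spent on rRNA. Real enrichment is never perfect, so treat the second number as an upper bound.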
If you look at a typical Bland-Altman plot (below) you'll see how the spread of gene expression data (e.g. two replicates) increases at lower RNA expression levels due to measurement noise. This was simplified in Wendell's presentation so we can more easily see that there is a point at which we move from quantitation of transcripts to detection. The line is of course artificial, and where you think it should be drawn will depend on many things.
RNA-seq Bland-Altman plot
We can take advantage of this and get flexibility in the dynamic range of our experiments by sequencing to different depths (usual), and/or increasing replicates (preferable).
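The widening spread at low expression is what you would expect from counting noise alone. As a rough sketch, assuming read counts for a transcript follow a Poisson distribution (a simplification that ignores biological variability and any library prep bias), the coefficient of variation is 1/sqrt(expected count). The function names here are my own, not from any package.

```python
import math


def poisson_cv(expected_count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected_count)


def cv_of_mean(expected_count, n_replicates):
    """Counting-noise CV of the mean across n independent replicates."""
    return poisson_cv(expected_count) / math.sqrt(n_replicates)


# Spread shrinks as the expected read count for a transcript rises.
for count in (10, 100, 1000, 10000):
    print(f"{count:>6} expected reads: CV ~ {poisson_cv(count):.1%}")

# Two ways to tighten the estimate for a transcript with ~100 reads:
deeper = poisson_cv(100 * 4)     # sequence one sample 4x deeper
replicated = cv_of_mean(100, 4)  # or run 4 replicates at the same depth
print(f"4x depth: {deeper:.1%}, 4 replicates: {replicated:.1%}")
```

In this toy model the two options reduce counting noise by the same amount. Replicates also tell you about biological variability, which the model leaves out entirely and deeper sequencing of one sample cannot capture.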
How does RNA-seq compare to other DGE methods: Wendell presented a great slide where he compared where the detection/quantitation boundary lies for multiple differential gene expression technologies. Most of us are rarely going to go past 20 or 50M reads for differential gene expression, so qPCR is still looking like a tool we'll be using for many years to come. How it competes against some of the newer targeted RNA-seq assays will be interesting to see, and the impact molecular barcodes will have on true transcript abundance measurements is going to give newer RNA-seq methods an edge over current ones.
How low can you go: It is important to remember that while there are methods that can work with incredibly low amounts of RNA, including single-cell RNA-seq, the lower your inputs go, the less chance you have of sequencing the RNAs you might be interested in, especially if they are low-abundance transcripts. Sampling error is something you really need to understand before dropping inputs very low. In a microarray experiment we showed that reduced RNA input had a clear impact on detection sensitivity, but no impact on specificity: even at low inputs, when we saw differential gene expression the results were accurate (see Lynch et al., "The cost of reducing starting RNA quantity for Illumina BeadArrays"). The same is (hopefully) going to be true for RNA-seq. For single cells it will be interesting to see what the community decides is the right read type and read depth to use. I'd be surprised if we go above 10M reads, and we might prefer to use 384 cells with just 1M reads each.
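A back-of-the-envelope way to think about sampling error at low input: if the number of molecules of a transcript that make it into the library is Poisson distributed, the chance of capturing at least one is 1 - exp(-expected copies captured). The copy numbers and the 10% capture efficiency below are hypothetical values chosen purely for illustration, not measurements from any kit.

```python
import math


def detection_probability(copies_in_sample, capture_efficiency):
    """Chance that at least one molecule of a transcript ends up in the
    library, assuming Poisson sampling of the molecules present."""
    expected_captured = copies_in_sample * capture_efficiency
    return 1.0 - math.exp(-expected_captured)


CAPTURE_EFFICIENCY = 0.1  # hypothetical fraction of molecules converted

# Fewer starting copies (lower input) means more missed transcripts.
for copies in (1, 10, 100, 1000):
    p = detection_probability(copies, CAPTURE_EFFICIENCY)
    print(f"{copies:>5} copies in the tube: P(detected) = {p:.2f}")
```

The point of the sketch is only the shape of the curve: abundant transcripts are almost always captured, while transcripts present at a handful of copies are missed much of the time, which fits with a loss of sensitivity rather than specificity.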
Acknowledgements: Thanks very much to Wendell for sharing these slides.
Hey,
that's a nice post. thanks for sharing your thought. I have a question regarding number of reads for gene expression analysis. Do you mean with 20 - 50M reads for differential gene expression the number of raw reads or reads that fell into gene regions?
We are using 10-20M single-end 50bp reads for mRNA DGE, this is reads that have passed the quality filters.
Hey James,
okay but sorry to be so persistent. What do you mean with quality filters? What I mean is the number of reads that went into a DGE analysis. For example, I have a sample which was sequenced with 20 million reads. After alignment about 11 million go into gene regions and are usable for DGE. So I use 11 million for the analysis.
Is that what you mean with quality filtered reads?
best
Mathias