CoreGenomics: RNA-seq and the problem with short transcripts

There are over 9,000 human transcripts shorter than 200bp, which is about 5% of the total number of transcripts. When analysing some recent RNA-seq data here at CRI we noticed that only 17 of almost 30,000 detected transcripts were shorter than 200bp. At 5% we might have expected around 1,500, so this is about 100 times lower than expected.
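As a quick check on that arithmetic, here is a minimal sketch using only the rounded figures quoted above. It assumes, naively, that detected transcripts would follow the same length distribution as the full annotation, which is exactly the assumption under question; the variable names are mine.

```python
# Figures quoted in the post; all are rounded, so the result is approximate.
short_fraction = 0.05      # ~5% of human transcripts are shorter than 200bp
detected_total = 30_000    # "almost 30,000" detected transcripts
detected_short = 17        # detected transcripts shorter than 200bp

# Naive expectation: short transcripts are detected in proportion to their
# share of the annotation.
expected_short = short_fraction * detected_total
fold_shortfall = expected_short / detected_short

print(f"expected short transcripts: {expected_short:.0f}")
print(f"observed short transcripts: {detected_short}")
print(f"shortfall: about {fold_shortfall:.0f}-fold")
```

The shortfall comes out a little under 100-fold, which is the order of magnitude that matters here.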

We have been asking why this might be the case. Note that this is control data from the MAQC samples (UHRR and Brain). It may be that short transcripts are more often expressed at low levels, but it may also be that we are not picking them up because they are too short.

Spike-in experiments can help:
A recent paper presented data from a complex RNA spike-in experiment. Jiang et al. individually synthesised 96 RNAs by in vitro transcription and pooled them; the sequences were either novel or taken from the B. subtilis and M. jannaschii genomes. The RNA was stored in a buffer from Ambion to maintain stability. The synthesised RNAs were 273-2022bp in length and distributed over a concentration range spanning six orders of magnitude. They observed a high correlation between RNA concentration and read number in RNA-seq data, and were able to investigate biases due to GC content and transcript length in different protocols.
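For anyone wanting to run the same kind of check on their own spike-ins, the concentration-versus-reads comparison can be sketched as a Pearson correlation on log-transformed values. This is only an illustration, not the method used by Jiang et al.; the function name, the pseudocount and the toy numbers are all my own assumptions.

```python
import math


def log_pearson(concentrations, read_counts, pseudocount=1.0):
    """Pearson correlation between log10(concentration) and
    log10(read count + pseudocount)."""
    if len(concentrations) != len(read_counts) or len(concentrations) < 2:
        raise ValueError("need two equally sized lists with at least 2 values")
    xs = [math.log10(c) for c in concentrations]
    ys = [math.log10(r + pseudocount) for r in read_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Synthetic toy values only (NOT data from the paper): concentrations spanning
# six orders of magnitude, with read counts exactly proportional to them.
toy_concentrations = [10 ** k for k in range(7)]
toy_reads = [3 * c for c in toy_concentrations]
print(log_pearson(toy_concentrations, toy_reads))
```

Working on the log scale matters because otherwise the few most concentrated spike-ins dominate the correlation.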

This type of resource is useful for many experiments but difficult to prepare.

I have thought about using Agilent's eArray to manufacture RNAs of up to 300bp in length, and using spot number to vary concentration: 15,000 unique RNA molecules, each spotted 1, 10, 100, 1000 or 10,000 times. This would create a very complex mix which should be reproducibly manufactured at almost any time for any group. It would also make it easy to vary the actual sequences used, for example to look at bias at the ends of RNA molecules in particular RNA-seq protocols.
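The spot-number idea can be sketched as follows. This is a back-of-envelope design only: it assumes the 15,000 sequences are split evenly across the five spot levels and that every spot yields the same amount of RNA, neither of which is established. The function names are hypothetical.

```python
from collections import Counter

# Spot multiplicities proposed in the post.
SPOT_LEVELS = (1, 10, 100, 1000, 10_000)


def assign_spot_counts(n_sequences, levels=SPOT_LEVELS):
    """Cycle through the levels so each level gets about the same number of
    sequences (assumption: an even split across levels)."""
    return [levels[i % len(levels)] for i in range(n_sequences)]


def relative_abundance(spot_counts):
    """Fraction of the pool contributed by each sequence, assuming every spot
    yields the same amount of RNA."""
    total = sum(spot_counts)
    return [count / total for count in spot_counts]


spots = assign_spot_counts(15_000)
fractions = relative_abundance(spots)

print("sequences per spot level:", dict(Counter(spots)))
print("dynamic range (max/min):", max(fractions) / min(fractions))
```

The total number of features the design needs falls out of the same list (`sum(spots)`), which is worth checking against the capacity of whichever array format is used.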

But the RNAs need to be small enough:
The current TruSeq protocol uses metal hydrolysis to fragment RNA, and no gel size selection is required later on. This makes it much easier to prepare RNA-seq libraries in a high-throughput manner. However, the technique possibly excludes shorter transcripts, or at least makes their observed expression, as measured by read counts, lower than reality.

The TruSeq fragmentation protocol produces libraries with inserts of 120-200bp. Illumina do offer some advice on varying insert size in their protocol, but there is not a lot of room to manipulate inserts with this kind of fragmentation and size selection. They also offer an alternative protocol based on cDNA fragmentation, but this has been sidelined in favour of RNA fragmentation because the latter gives better coverage across the transcript; see Wang et al.
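To see why a fixed insert window penalises short transcripts, here is a rough sketch of the "effective length" idea: count how many positions along a transcript could give rise to an insert of each allowed size. It assumes fragments start uniformly along the transcript and that insert sizes are uniformly spread over the 120-200bp window, which is a simplification rather than a model of the real fragmentation chemistry. The function name is mine.

```python
def effective_positions(transcript_len, insert_min=120, insert_max=200):
    """Average number of start positions that can yield an insert, taken over
    all insert lengths in the window (assumed equally likely)."""
    insert_lengths = range(insert_min, insert_max + 1)
    # An insert of length f fits at (transcript_len - f + 1) positions,
    # or nowhere if the transcript is shorter than f.
    total = sum(max(transcript_len - f + 1, 0) for f in insert_lengths)
    return total / len(insert_lengths)


for length in (100, 150, 200, 500, 1000, 2000):
    print(f"{length:>5} bp transcript: {effective_positions(length):8.1f} usable positions")
```

Under these assumptions a transcript shorter than the smallest insert contributes nothing at all, and one only slightly longer than 120bp contributes far less per base than a long transcript does, so even equal expression would give unequal read counts.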

The libraries prepared using the TruSeq protocol do not contain many observable fragments below 200bp (see image below), and 110bp or so of this is adaptor sequence. This suggests the majority of starting RNA molecules were fragmented to around 150-250bp in length, so shorter RNAs could be fragmented into pieces too small to make it into the final library.

From the TruSeq RNA library prep manual

I'd like to hear from people who are working on short transcripts to get their feedback on this. Does it matter, as long as all samples are similarly affected? After all, we are often interested in the differential expression of a transcript between two samples rather than between two transcripts in a single sample.

PS: What about smallRNA-seq?
There have been lots of reports on the bias of RNA ligase in small RNA and microRNA sequencing protocols. At a recent meeting Karim Sorefan from UEA presented some very nice experiments. They produced adapter oligos with four degenerate bases at the 5' end and compared the performance of these to standard oligos. The comparison made use of two reagents: a completely degenerate 21mer and a complex pool of 250,000 21mer RNAs. The initial experiment with the degenerate RNA should have resulted in only one read per 400M for any single RNA molecule. They clearly showed that there are very strong biases and were able to say something about the sequences preferred by RNA ligase. The second experiment used adapters with four degenerate bases and gave significantly improved results, showing little if any bias.
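To give a feel for what "four degenerate bases" means in practice, this small sketch enumerates every adapter variant produced by a four-base degenerate stretch at the 5' end. The fixed part of the adapter is a placeholder string of X characters, not a real adapter sequence, and the function name is hypothetical.

```python
from itertools import product


def degenerate_variants(fixed_part, n_degenerate=4, bases="ACGT"):
    """Yield the adapter with every possible n-base prefix at its 5' end."""
    for prefix in product(bases, repeat=n_degenerate):
        yield "".join(prefix) + fixed_part


# Placeholder only: 'X' stands in for the fixed adapter sequence.
PLACEHOLDER_FIXED_PART = "X" * 8

variants = list(degenerate_variants(PLACEHOLDER_FIXED_PART))
print(len(variants), "adapter variants, e.g.", variants[:3])
```

Four degenerate positions give 4^4 = 256 distinct adapter ends, so any given small RNA meets many different ligation partners rather than a single fixed sequence, which is presumably why the bias is reduced.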

This raised the question in my mind of whether the tissue-specific or cancer-specific miRNAs that have been published are quite as definitive as they seem. Many of the RNAs found using the degenerate oligos in the tissues they tested had never been seen in that tissue previously.

Jiang et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011.

Friday, 9 December 2011
