Friday, 30 March 2012

Can gene expression be considered a quantum problem?

I had a very interesting discussion over coffee yesterday at TRON, a new Translational Oncology institute in Mainz, Germany. TRON specialises in cancer immunology and genomics, or Immunomics. One of the things they are aiming to do is develop personalised cancer vaccines by finding immunogenic markers in tumours and using long RNA molecules to present these markers to T cells to activate and modulate the immune responses.

I was giving a seminar about technology development in next-generation sequencing, and during the day I spoke with several people about single-cell analysis.

A problem with analysing a single cell is that you only get one chance to look at its molecular biology, as we use destructive methods to assay the cell's genome and transcriptome. Single-cell gene expression usually requires the use of RNA amplification methods with their own biases, and this led the discussions to a question.

Is RNA expression limited by quantum mechanics, and is there a fundamental limit on the accuracy with which the level of gene expression can be measured in its true biological context? Analysing a single cell means analysing the whole transcriptome as a snap-shot of a dynamic biological system. Our discussions went around the question "is the absolute measurement of any one gene to some extent meaningless, when asking questions about what that level of gene expression means in terms of biology for that cell?"

We discussed technical issues with single-cell gene expression analysis, and both TRON and CRI are using the Fluidigm gene expression chips for this kind of analysis. However, replicate numbers need to be high, and analysis methods probably need to take into account variance in gene expression as much as mean expression levels and fold-changes.
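
To make the point about variance concrete, here is a minimal Python sketch. The expression values are invented for illustration (they are not TRON or CRI data) and the group names are hypothetical. It shows two groups of single cells with the same mean expression, so the fold-change of the means is 1, but with very different cell-to-cell variance.

```python
import math
import statistics

# Hypothetical relative expression values for one gene in single cells.
# These numbers are invented purely to illustrate the idea.
control_cells = [10.0, 11.0, 9.0, 10.0, 10.0, 10.0]
treated_cells = [0.0, 20.0, 1.0, 19.0, 0.0, 20.0]


def summarise(values):
    """Return mean, sample variance and coefficient of variation."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance (n - 1)
    cv = math.sqrt(variance) / mean if mean else float("nan")
    return {"mean": mean, "variance": variance, "cv": cv}


control = summarise(control_cells)
treated = summarise(treated_cells)

# Fold-change on the means alone hides the difference between the groups.
log2_fold_change = math.log2(treated["mean"] / control["mean"])

print(f"control: {control}")
print(f"treated: {treated}")
print(f"log2 fold-change of means: {log2_fold_change:.2f}")
```

A comparison based only on mean expression would call these two groups identical, which is why a per-gene measure of spread may be worth reporting alongside fold-change.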

Wednesday, 28 March 2012

What is "total RNA"?

I wrote this post after a recent experimental design discussion where we went round and round discussing RNA analysis.

There are many different RNA molecules in a cell. When I was studying there were only three: mRNA, ribosomal RNA and tRNA. Simple! However, over the last couple of decades our understanding of the complexity of RNA transcription and what different RNAs do in the cell has changed dramatically. Regulatory RNAs have become incredibly important to consider in biological systems and include both small RNAs, such as miRNA and piRNA, and long non-coding RNAs such as HOTAIR. These can target mRNA for degradation, block mRNA translation or even target genes for methylation and block transposon mobility.

We are still finding out an awful lot about the transcriptome that is surprising, and controversial (e.g. RNA editing).

Unfortunately our earlier ignorance of the varied RNA landscape has meant the term ‘total RNA’ is somewhat confused. As more and more scientists look deeper into the transcriptome and RNA biology, it is important to understand what total RNA is in a particular experimental context. I try to understand what it is the experimenter wants to assay and use that as the starting point for discussions about particular extraction chemistry, labelling and other methodologies.

Just for the record, for me total RNA means all the RNA in a cell.

What is in Total RNA: It’s all the RNA in a cell and includes lots of different RNAs and certainly not just messenger RNAs!
Here I have suggested a pretty comprehensive list but feel free to point out any I may have missed: mRNA, polyA RNA, polysomal RNA, tRNA, ribosomal RNA, lincRNA, miRNA, piRNA, siRNA, SRP RNA, tmRNA, snRNA, snoRNA, SmY RNA, scaRNA, gRNA, aRNA, crRNA, tasiRNA, rasiRNA, 7SK RNA

A historical perspective and Affymetrix U95A arrays: When I first started running gene expression arrays in 2001 we often saw degraded RNA at the bottom of a gel or Bioanalyser trace. Many users wanted to clean up this degraded RNA and so we would run a column based clean-up which effectively removed all the small RNA species from a sample. Nearly all extractions were performed with Trizol or something similar and column-based extractions were the exception rather than the rule.

Back then we were only interested in gene expression assayed by labelling mRNAs by oligo-dT primed cDNA synthesis in an Eberwine amplification reaction. We were not aware that the degraded RNA we were so happy to remove contained an undiscovered regulatory universe of RNA molecules. The techniques we were using to assay gene expression would not have picked these up anyway, but nevertheless we made a decision based on a very naive understanding of RNA biology. We should all have learnt from that lesson.

RNA extraction: Now when asked for a recommendation on RNA extraction I usually suggest Qiagen miRNeasy or another kit that preserves the small RNA fraction in a sample. Some users come to us with RNA already prepared and ask if we can also look at miRNAs but if they have used a kit that does not keep these small molecules we have to suggest they start again from scratch.

Kits that preserve small RNAs are only a little more expensive to use than others, and they do not involve much more lab work. Even if someone is interested only in mRNA gene expression I always try to persuade him or her to collect total RNA that includes the small RNA fraction. For many experiments the samples will only be collected once and might be reused in experiments many months or even years later.

What about assay biochemistry: Even if you have collected total RNA it is not necessarily possible to analyse all the different types of transcript in a single assay. Most mRNA gene expression arrays use very different labelling methods than microRNA arrays for instance. You can’t very easily label mRNAs and miRNA with the same chemistry.

With Illumina and other RNA-seq methods most users need to choose whether to assay mRNAs or microRNAs, and the methods for each use very different biochemistry. Whilst it is possible to use an RNA ligation method to analyse small RNAs and fragmented mRNAs together, very few people do so as this is more complex to prepare and analyse than running two independent experiments with the same samples as inputs.

What should you do for your next RNA experiment: If you are about to run a microarray or an RNA-seq experiment I’d encourage you to take a look at some of the literature out there and read the manuals for several kits to understand what is happening to RNA molecules in a particular method. By doing this you will be better placed to decide how best to approach your experiment and understand the limitations of the method you have chosen. An excellent resource is the Ambion website.

If you are lucky enough to have a good core-lab or service provider to run your experiment then take the opportunity to talk to them before you start the experiment. This means before you get cells out of the freezer or tissue from your mice, and hopefully you can work together on an experimental design that will get you the most from the samples. The extraction may only cost you a few £, $ or ¥ extra and make your day an hour or two longer in the lab, but you’ll be happy you did it when someone publishes something interesting about the regulation of your favourite gene in a similar sample set, as you’ll be able to go back to the freezer and confirm their finding in the samples you already have.

Friday, 23 March 2012

AmpliSeq design tool demo

I have previously blogged about amplicon sequencing (here, here and here) because I think it is one of the most exciting developments for clinical application of next-generation sequencing. Anyone can access the technology but there is a lot of choice about which system to use that can be off-putting. I'd certainly like to encourage more people to give it a go with their favourite genes, or start collaborating with other groups to design a panel you all might use.

I was asked if I'd like to play with Life Tech's new AmpliSeq design tool before general release and below you'll find out what I thought of it. All-in-all it looks good, is simple to use and the results include primer sequences.

Primer design history: many of you may well have sat down and tried to design primers by eye in a 1000bp stretch of DNA. I know my first experiences back in my undergraduate third year were not great but there were no tools available at the time.

The huge change for me was the release of Primer3. It was an update of a previously unreleased programme called simply Primer, written by Mark Daly and Steve Lincoln in Eric Lander's group in the very early '90s. Primer3 was released in 2000 and published in the Methods in Molecular Biology series. It is the basis of most primer design software I have ever used, but I don't know if it is used in the AmpliSeq designer.

What is AmpliSeq: AmpliSeq is Life Technologies' amplicon sequencing application. It allows up to 1536 amplicons in a single-tube PCR-based assay. It is a highly multiplexed PCR system rather than a droplet, microfluidic, hybridisation or extension-ligation based system. AmpliSeq requires as little as 10ng of DNA and produces 200 or 150bp amplicons, which should amplify well from FFPE samples.

AmpliSeq Designer: LifeTech are about to release a new product for custom panel creation: the AmpliSeq Designer. This tool allows creation of custom AmpliSeq panels based on your input of genes (in multiple formats). They say that once designed a panel can be ordered and delivered in weeks. Simply log into your Ion Community account, agree to a license (I'd suggest reading it as you don't want your gene list being scanned for interesting panels without getting something back yourself) and get designing!

I uploaded a list of 23 Cosmic genes (here is my list) as gene names and only got a couple of easily fixed errors. You can choose between 150bp and 200bp amplicons. There is currently a 5bp "padding" at the ends of exons but this will become user tuneable later on. Obviously you would not want the primers to be in the exons or you would never "sequence" the exon ends; you would just resequence your oligos.

This will be the subject of another post soon to be written about choosing oligo providers.

My first design consisted of 23 target genes, 124KB of sequence, 1081 amplicons and 90.89% coverage of targeted bases.

After hitting "submit targets" I got a message and an automated email saying the results would be emailed soon. They suggest 48 hours for design but I got mine back in just under 4 hours. The results download included four files (available as a zip file): a BED file, a data sheet, and coverage details and summary files. It looks like there is a very strong correlation between the exon length and the final coverage, with short exons being almost always successfully targeted but longer ones being less so.
There was an odd discrepancy between the number of amplicons reported in the web page (1081) and the summary files (1143); however, the data sheet containing primer sequences has 1081 pairs. I guess a few targets have dropped out completely and their primer sequences are never reported.

Primer sequences: A big congratulations to LifeTech for releasing these back to users. However as I did not read the license in any detail there may be restrictions on how these sequences are used.

There is a note in the data sheet file saying that AmpliSeq primer sequences contain proprietary modifications. Whether this makes them unsuitable for other applications (I am immediately thinking of many) is not clear but I am sure some users will give it a go. I'd certainly be calling for the other companies to release sequences rather than just send back panels of amplicons.

Barcoding: From the AmpliSeq data sheet it looks like only 32 barcodes are available at this time. It is also probable that these are single barcodes at one end of the molecule. I very much favour the two-barcode strategy, where the combination allows fewer primers to be used in a secondary PCR against tag sequences. Just 16 forward and 24 reverse oligos allow a single 384-well plate to be amplified and sequenced in one sequencing run.
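
The arithmetic behind the two-barcode approach can be sketched in a few lines of Python. The oligo names below (F01, R01 and so on) are hypothetical placeholders rather than real barcode sequences, and the plate layout is just one possible assignment.

```python
from itertools import product

# Hypothetical barcode names; real designs would use actual tag sequences.
forward_barcodes = [f"F{i:02d}" for i in range(1, 17)]  # 16 forward oligos
reverse_barcodes = [f"R{i:02d}" for i in range(1, 25)]  # 24 reverse oligos

# Every forward/reverse pair identifies one sample (one well).
pairs = list(product(forward_barcodes, reverse_barcodes))

# Lay the pairs out on a 384-well plate: 16 rows (A-P) by 24 columns.
rows = "ABCDEFGHIJKLMNOP"
well_to_pair = {
    f"{rows[r]}{c + 1}": (forward_barcodes[r], reverse_barcodes[c])
    for r in range(16)
    for c in range(24)
}

oligos_dual = len(forward_barcodes) + len(reverse_barcodes)
print(f"{oligos_dual} oligos give {len(pairs)} unique combinations")
print("Well A1 ->", well_to_pair["A1"])
print("Well P24 ->", well_to_pair["P24"])
```

With a single barcode at one end, the same 384 samples would need 384 distinct barcoded oligos rather than 40.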

Of course on Ion Torrent 384 samples would not get the required coverage for 1536 amplicons. But the Proton should be able to work with numbers like these and getting the tools and kits built today will save changing things later on.

The BED file: Included in the zip file is a BED file of my panel. This was very simple to upload to UCSC and visualise coverage across the genes targeted. I have zoomed into TP53 for the example below.

UCSC TP53 BED file

Coverage of my targets: Only KIT achieved 100% coverage of all 5358 target bases in the design. Amplicons for FGFR3 and AKT1 were the worst with only 75% coverage.

It will be very important for amplicon-sequencing users to understand what bases they are likely to see in the final sequence data before they run an experiment on their samples. It would probably help if the community thought a bit more about what the important metrics to report are for this kind of panel. Obviously most users will want to get 100% of all target bases covered. But this is unlikely to be possible with most panels as they increase in size, and we should consider how much stringency of coverage can be relaxed.

Coverage is reported for target regions as a percentage. This is the percentage of bases targeted by AmpliSeq and it would be important to monitor final coverage in case there is dropout of particular amplicons in the final sequence data.
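
As a rough illustration of how a percentage like this can be calculated, the Python sketch below merges amplicon intervals and counts how many target bases they overlap. The coordinates are invented and half-open (BED style). This is an assumption about the calculation, not a description of how the AmpliSeq Designer computes its figures.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def percent_covered(targets, amplicons):
    """Percentage of target bases overlapped by at least one amplicon."""
    targets = merge_intervals(targets)
    amplicons = merge_intervals(amplicons)
    total = sum(end - start for start, end in targets)
    covered = 0
    for t_start, t_end in targets:
        for a_start, a_end in amplicons:
            overlap = min(t_end, a_end) - max(t_start, a_start)
            if overlap > 0:
                covered += overlap
    return 100.0 * covered / total if total else 0.0


# Hypothetical exon targets and amplicon inserts on one chromosome.
targets = [(100, 300), (500, 600)]
amplicons = [(90, 220), (200, 280), (500, 600)]

print(f"{percent_covered(targets, amplicons):.1f}% of target bases covered")
```

Running the same function on the intervals actually observed in the sequence data, rather than on the design, would be one way to spot amplicon dropout.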

One thing that might help in design tools like the AmpliSeq one, Agilent's HaloPlex Design Wizard, Fluidigm's Access Array designer and Illumina's TruSeq Custom Amplicon Design Studio is the ability for users to specify bases that must be covered. TP53 for instance has several hot-spots a user would not want to miss, whereas other Cancer genes are mutated more or less randomly along their sequence. It may be important to specify particular bases for TP53, but 85-95% coverage for non-hot-spot genes may be OK if large numbers of patients are being screened and coverage drop out is random. If coverage drop out is not random then careful review of the literature before finalising a design might be warranted.
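
A "must cover" check of this kind is easy to sketch against the BED file a design tool returns. In the Python example below the BED lines, the hot-spot names and all coordinates are invented for illustration; real positions would come from your own design and from the literature.

```python
import io

# A tiny, made-up BED file: chrom, start (0-based), end (exclusive), name.
bed_text = """\
chr17\t1000\t1150\tamplicon_1
chr17\t1300\t1450\tamplicon_2
chr4\t2000\t2200\tamplicon_3
"""

# Hypothetical must-cover bases as (chrom, 1-based position).
must_cover = {
    "hotspot_A": ("chr17", 1100),
    "hotspot_B": ("chr17", 1200),
    "hotspot_C": ("chr4", 2200),
}


def read_bed(handle):
    """Parse BED lines into (chrom, start, end) tuples, skipping headers."""
    intervals = []
    for line in handle:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def is_covered(chrom, pos, intervals):
    """True if 1-based position pos lies inside any 0-based half-open interval."""
    return any(c == chrom and s < pos <= e for c, s, e in intervals)


intervals = read_bed(io.StringIO(bed_text))
missing = [name for name, (chrom, pos) in must_cover.items()
           if not is_covered(chrom, pos, intervals)]

print("Must-cover bases missing from the design:", missing)
```

A design tool could run a check like this before returning results and flag any panel in which a user-specified base is not covered.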

Sequencing results: Unfortunately I can't report back on the quality of the final panel, as my lab does not have an Ion Torrent. Of course, if LifeTech want to leave one in the lab for a few days with the reagents I'd happily give you all an update.

Other considerations: I'd like to see tools like this develop and ideally see an open source variant, perhaps in Galaxy? I am sure a lot of people would use such a tool so whoever puts it out can expect a good citation record! The things we are considering are flexibility in amplicon length rather than specific cutoffs, an ability to add tags for any sequencing platform, higher degrees of multiplexing, an ability to target SNPs by very short read sequencing (>10bp), etc.

Please feel free to comment on what you would like to see from Amplicon-sequencing tools.

Friday, 2 March 2012

Please sponsor my half-marathon for Cancer Research UK

This year I will be running a half-marathon and aiming to raise £2000 for Cancer Research UK. This is enough to sequence a Cancer genome which many of you will know is something we are doing in my lab. CRUK spent £332M in 2011 on Cancer research.

I'd like to ask readers of this blog to give generously by visiting my fundraising page.

The Core-Genomics blog got almost 9000 page-views last month, so that is only about 22p per read to raise £2000!

If you prefer to donate by sequencing cancer genomes for CRUK directly then let me know and we'll try to work out the best way of getting this done.

I am aiming to run in 2h30m which is about as long as it would take to sequence 30bp on the MiSeq in my lab. Each £1 raised will sequence about 60Mb on the HiSeq sequencer we use for Cancer genomes and £15 will buy 1Gb.

The race is on March 11th and I'll keep an eye out for faces in the crowd and I know some of you will be running alongside me (for a while at least). Thanks for your support and please do feel free to pass this along to anyone you think would be happy to sponsor my run for CRUK.

Thursday, 1 March 2012

Choosing between exome-arrays and exome-seq

Illumina and Affymetrix both recently released Exome microarrays.

The "death" of Gene expression microarrays: This is an interesting development considering there has been a lot of debate over the “death of arrays” in the last year or so.

It does look like 3’ expression arrays are fading fast due to the cost and quality of RNA-seq data and improvements in RNA-seq analysis, although we’ve not quite stopped using GX arrays yet.

SV-seq vs snpCGH: The relative merits of SV-seq vs snpCGH for copy number and LOH are less clear. 1M (or more) high-quality genotype calls can be obtained cheaply and quickly using arrays and the intensity information allows sensitive copy number detection. Whilst sequencing is ultimately more sensitive to structural-variation as we assay the DNA structure directly, there are real limitations in per-genotype quality with low to medium coverage. It is this that holds sequencing back and I think keeps a market open for snpCGH arrays. The costs and time for medium to high coverage SV-seq can’t yet compete with arrays where you can run 1000 samples in a couple of days for a few hundred dollars each.

Exome-arrays vs Exome-seq: Now exome arrays are vying to compete with exome-seq and to be honest I was quite surprised that array companies are bothering. Especially when one of them happens to be the leading next-gen sequencing company!

GenomeWeb covered the announcement of the products at last year's ASHG.

Affy and Illumina aim to offer a fast and economical method to assess exomic variants. They suggest we might use these to follow up exome-seq or to complement GWAS in large cohorts to achieve statistical power in exome centred studies. There has been a huge amount of content generated for microarrays over the past few years of sequencing. And variants in the exome are likely to be functionally relevant (although we should make sure regulatory regions are equally well assayed).

Affy’s exome array targets around 320,000 SNPs, InDels and other variants. It is being offered at $70 to customers.

Illumina’s exome array comes in three versions: the HumanExome targets around 250,000 SNPs, the OmniExpressExome targets 950,000 and the Omni5Exome targets 4.75M SNPs. According to the company's press release the 250,000 exonic SNPs were identified from an analysis of over 12,000 sequenced genomes. It was being offered at $45 to early customers. The HumanExome chip is being offered at $80.

A difference of 70,000 exonic SNPs sounds quite large but I’m not sure what the real impact will be on research projects. Whilst Affymetrix have not revealed the scale of interest Illumina have said they are expecting to process over 1M samples on their Exome-chips.

The costs quoted appear to be chip only prices and one of the issues facing researchers as chips drop in cost is the relatively static price of the processing. It looks like Illumina’s OmniExpressExome for instance will cost about £150-200 to run once service costs are incorporated.

At around £200-250, Exome-seq is slightly more expensive but currently much more labour intensive. Illumina recently dropped the price of an exome to $50; if you include a TruSeq library prep at $50 and one-sixth of a lane of PE75bp sequencing, the total Exome-seq cost is £200-250. The major limitation compared to arrays may not be cost but is more likely to be time, as it takes about 7 days to sequence 96 exomes on HiSeq 2000.
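
The sequencing side of that comparison is simple arithmetic, sketched below in Python. The one-sixth of a lane per exome and the 96 exomes come from the text above; the eight lanes per flow cell and two flow cells per HiSeq 2000 run are assumptions about the instrument configuration.

```python
import math

exomes = 96
exomes_per_lane = 6        # one-sixth of a lane per exome (from the text)
lanes_per_flow_cell = 8    # assumption: standard HiSeq 2000 flow cell
flow_cells_per_run = 2     # assumption: both flow cell positions in use

lanes_needed = math.ceil(exomes / exomes_per_lane)
flow_cells_needed = math.ceil(lanes_needed / lanes_per_flow_cell)
runs_needed = math.ceil(flow_cells_needed / flow_cells_per_run)

print(f"{exomes} exomes need {lanes_needed} lanes, "
      f"{flow_cells_needed} flow cells, {runs_needed} run(s)")
```

Under these assumptions 96 exomes fill one complete dual flow cell run, which is why the turnaround is set by the run time of the instrument rather than by the number of samples.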

Which might you use? I think the arrays are going to fill a real need for researchers who have large sample collections and want to access exome content. Many more samples can be run in a given time for a given budget on arrays than on a sequencer. I think smaller projects are likely to be run on a sequencer though. And improvements to the Exome-seq workflows might make chips seem clunky to use.

Can we get rid of a two day hyb and still generate high quality exome data?

When might we start using newer library prep technologies like Nextera to access low DNA samples for Exome-seq?

Can we analyse FFPE exomes? Both chips and sequencing suffer here but this is a huge potential market.

Arrays RIP? My career in genomics started over ten years ago when I set up an Affy lab processing expression arrays with just 8000 probes. I have run lots of arrays since then, including one project of 2500 HT12s which we completed in just five weeks; running this number of samples on a sequencer is tough. Whether arrays will completely disappear or not will be a discussion for a while yet, I think.