A couple of recent papers have demonstrated the ability to distinguish between 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) using modified bisulfite sequencing protocols. These methods are likely to make a real impact on epigenomics when combined with Bis-seq. By sequencing two genomes per sample, Bis-seq for mC and oxBS-seq or TAB-seq for hmC, a fuller picture of methylation will emerge. However it is not yet clear how biologically important the additional data will be, or how much it is worth to researchers.
Trying to complete this kind of experiment today on mammalian genomes requires quite a lot of sequencing muscle. Bis-seq depth guidelines are lacking, but most people would aim for 50x or greater coverage; sequencing both genomes therefore means around 100-fold coverage per sample, or an expensive experiment. This could leave the approaches as a niche application for people with a strong focus on epigenetics.
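For anyone planning budgets, here is a rough back-of-the-envelope sketch of that arithmetic in Python (my own assumed numbers for genome size and usable yield, not figures from either paper):

```python
# Rough back-of-the-envelope numbers: raw gigabases needed to reach 50x for both genomes.
GENOME_SIZE_GB = 3.2        # approximate human genome size in Gb (assumed)
TARGET_DEPTH = 50           # per-genome coverage aim
USABLE_FRACTION = 0.8       # assumed fraction of raw bases that align and pass filters

def raw_gb_needed(depth, genome_gb=GENOME_SIZE_GB, usable=USABLE_FRACTION):
    """Raw gigabases required to hit a given average aligned depth."""
    return depth * genome_gb / usable

per_genome = raw_gb_needed(TARGET_DEPTH)
per_sample = 2 * per_genome   # Bis-seq plus oxBS-seq or TAB-seq = two genomes per sample
print(f"~{per_genome:.0f} Gb per genome, ~{per_sample:.0f} Gb per sample")  # ~200 and ~400 Gb
```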
oxBS-seq: The University of Cambridge’s method prevents hmC from being protected in a normal bisulfite conversion. 5-hmC is oxidised (to 5-formylcytosine) such that upon bisulfite treatment the bases are converted to uracil. See Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution.
TAB-seq: The University of Chicago’s method differs by protecting 5-hmC from conversion. 5-hmC is glucosylated by beta-glucosyltransferase, protecting it from TET oxidation, whilst 5-mCs are oxidised such that during bisulfite treatment they behave as if they were not methylated and are converted to uracil. See Base-resolution analysis of 5-hydroxymethylcytosine in the Mammalian genome.
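To keep the logic of the two protocols straight, here is a minimal sketch of the read-out each cytosine state gives under standard bisulfite, oxBS and TAB treatment, and how combining two experiments identifies the modification (my summary of the chemistry, not code from either group):

```python
# Read-out of each cytosine state after conversion and sequencing, as I understand the chemistry.
READOUT = {
    "C":    {"BS": "T", "oxBS": "T", "TAB": "T"},   # unmodified C always converts
    "5mC":  {"BS": "C", "oxBS": "C", "TAB": "T"},   # 5mC survives BS/oxBS, is oxidised in TAB
    "5hmC": {"BS": "C", "oxBS": "T", "TAB": "C"},   # 5hmC survives BS, converts after oxBS, protected in TAB
}

def infer_state(bs_call, oxbs_call):
    """Combine BS-seq and oxBS-seq base calls at one position to infer the cytosine state."""
    for state, calls in READOUT.items():
        if calls["BS"] == bs_call and calls["oxBS"] == oxbs_call:
            return state
    return "ambiguous"

print(infer_state("C", "T"))  # 5hmC: protected in BS-seq but converted after oxidation
```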
How to make the methods accessible to all: An area that the two methods may have an impact on more quickly is for capture-based studies in cell lines where material is not limiting. It would be possible to create a PCR-free library and perform capture for an exome (or regions of equivalent genomic size) then follow this capture with bisulfite conversion and sequencing. This MethCap-seq (BisCap, oxBS-Cap or TAB-Cap) method could allow larger sample numbers to be run at the depth required without being too expensive.
We have been testing a Nextera-based capture prep in my lab which could potentially allow this to be done on very small inputs; however, the method is not yet released and requires quite a bit of amplification.
Another alternative would be to combine the methods with the recent Enhanced Reduced representation bisulfite sequencing (ERRBS) published by Maria E. Figueroa’s group in PLoS Genetics.
Wednesday, 27 June 2012
Has Qiagen bought the Aldi* or Wal-Mart* of DNA sequencing companies?
This update was added on 11 July 2012: An article on GenomeWeb made interesting reading as it talked about the current litigation between Columbia University and Illumina. In it Thomas Theuringer (Qiagen PR Director) said
"We support Columbia's litigation against Illumina, and are also entitled to royalties if we prevail". Maybe the royalties will be worth the reported $50M they paid?
Qiagen has just purchased Intelligent Bio-Systems and is, for now, the new kid on the NGS block. But what does IBS offer, what are the MaxSeq and Mini-20, can they compete against the monster that is Illumina and who owns the rights to SBS chemistry?
The Max and Mini instruments have been on my radar for a year or so now and I really did not think they would compete against Illumina when I first heard about them. I think one has been installed in Europe.
The instruments use technology from the noughties with maximum read lengths of 55bp and 1/3rd the yield of HiSeq. If the instrument and running costs are cheap enough users might be tempted. However a cheap alternative to a gold standard often turns out to be a disappointment.
If the title reads a little harsh then I’ll explain some of my thinking below.
*Aldi and Wal-Mart have a reputation in the UK for cheaper and lower quality products, they offer an alternative to supermarkets like Sainsburys. You should not make any NGS purchasing decisions based on a personal preference of where I like to do my weekly shop!
What did Intelligent BioSystems offer Qiagen and is there space in the NGS market for another player: Qiagen wants to get into the diagnostic sequencing space. However it is not clear if a Qiagen sequencer can compete against Roche, Life and Illumina in the research or clinical space. Clinically, Roche already have a strong diagnostics division, even if their 454 technology appears to be suffering from the strong competition of MiSeq and PGM. Illumina have thrown their weight behind clinical development and have a reputation for investing in R&D, so expect them to become a major player. Life have a reputation for delivering products that work (let's ignore SOLiD), and the continued development of PGM and now Proton should keep them in a strong position.
Can Qiagen take any market share: The instrument will have to work robustly, deliver high quality data, be competitive on cost and include bioinformatics solutions. It is not clear what will differentiate the Qiagen instrument from others already widely adopted.
Although I think it is going to be tough for Qiagen it is not impossible; and if there is a $25B diagnostics market (UnitedHealth Group’s personalized medicine paper suggests that the molecular diagnostics market could grow to $25bn by 2021), even a 5% share equates to $1.25B!
What will the Qiagen sequencers look like: The sequencers offered by IBS make use of sequencing-by-synthesis technology licensed from Columbia University. This technology was published in 2006 by Jingyue Ju in Nicholas Turro’s chemical engineering group at Columbia. Their PNAS paper describes a system that will be familiar to anyone using Illumina’s SBS: 3′-O-allyl dNTPs carrying allyl-linked fluorophores are incorporated by DNA polymerase, imaged, and then cleaved by palladium-catalysed deallylation ready for the next cycle. They sequenced 13bp in the 2006 paper, at almost the same time as Illumina were buying Solexa for $600M.
Perhaps the most intriguing aspect of the technology is that the flowcells are reusable! Sounds great, but will clinical labs see this as a benefit over disposable consumables? I think not, as there is too much risk that a sample will become contaminated. The same goes for removing the need for barcoding on the Mini-20 by using a 20-sample carousel. Barcoding is useful in labs just for sample tracking, even if you end up doing a run in a single lane/flowcell.
Max-seq: In the brochure they claim that “thousands of genomes have been sequenced utilizing 2nd generation technologies, such as the MAX-Seq”. I would argue that tens of thousands of genomes have been sequenced, but I am not aware of a single Max-seq genome to date.
Library prep uses ePCR bead-based or DNA nanoball (Rolony) methods. Libraries are loaded onto a flow-cell and sequenced with SBS chemistry. The instrument has dual-flowcells and each will generate 100M reads per lane in single or paired-end format at 35 or 55bp length, with >80% of bases being Q30 or higher.
The Azco website suggests that the Max-seq is thousands of dollars cheaper than a SOLiD or HiSeq instrument, and that run costs are 35% cheaper than Illumina or ABI. If you only get 25-50% of the data, this makes the system cost more like twice as much per Gb of data, for lower quality data and much shorter reads.
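A quick sanity check of that claim, using only the relative numbers quoted above (illustrative arithmetic, not Azco's pricing):

```python
# Illustrative arithmetic only: relative cost per Gb if a run is 35% cheaper but yields 25-50% of the data.
relative_run_cost = 0.65                 # 35% cheaper per run
for relative_yield in (0.25, 0.50):
    relative_cost_per_gb = relative_run_cost / relative_yield
    print(f"yield {relative_yield:.0%} of HiSeq -> ~{relative_cost_per_gb:.1f}x the cost per Gb")
# yield 25% -> ~2.6x, yield 50% -> ~1.3x; roughly "twice as much" in the middle of that range
```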
Mini-20: I am not certain the Mini-20 actually exists yet. The brochure on the Azco website has a picture of the Max-seq with a line drawing of a flowcell carousel. The carousel should allow loading up to 20 flowcells and run up to 10 samples per day (SE35bp). A flow cell will generate 35M reads in single or paired-end format at 35 or 55bp length, with >80% of bases being Q30 or higher and 4Gb per flowcell.
Cost per run is predicted to be about $300 per flowcell. However it is not clear what the price would be if you wished to dispose of the reusable flowcells after a single use.
Those numbers do not add up to me but they must have made sense to Qiagen.
Friday, 22 June 2012
Improving small and miRNA NGS analysis or an introduction to HDsRNA-seq
Small RNA biases have been very well interrogated in a series of papers released in the last 12 months. RNA ligation has been shown to be the major source of bias, and the articles discussed in this post offer some simple fixes to current protocols which should allow even better detection, quantification and discovery in your experiments.
Small RNA plays an important regulatory role, and this has been revealed by almost every method that can be used to measure RNA abundance: northern blots, real-time qPCR, microarrays and, more recently, next-generation sequencing. These methods do not agree particularly well with each other, and the most likely culprit is the technical biases of the different platforms.
Even though it has its own biases, small RNA sequencing appears to be the best method available for several reasons: it does not rely on probe design and hybridization, you can discriminate amongst members of the same microRNA family, and you can detect, quantitate and discover in the same experiment (Linsen et al ref).
Improving small RNA sequencing: As NGS has been adopted for smallRNA analysis, attention has appropriately focused on the biases in library preparation. Nearly all library prep methods use ligation of RNA adapters to the 3’ and 5’ ends of smallRNAs using T4 RNA ligases, before reverse transcription from the 3’ adapter and amplification by PCR. However RNA ligase has strong sequence preferences, and unless addressed these lead to bias in the final results of sequencing experiments.
All four of the papers below show major improvements to RNA-seq bias for small RNA protocols.
I particularly like the experiments performed in the Silence paper using a degenerate 21-mer RNA oligonucleotide. Briefly, the theory is that a 21-mer degenerate oligo has trillions of possible sequence combinations, so in a standard sequencing run each sequence should appear no more than once, as only a few million sequences are read. The results from a standard Illumina prep showed strong biases for some sequences that were significantly different from the expected Poisson distribution, with almost 60,000 sequences found more than 10 times instead of once as expected (the red line in figure A from their paper reproduced below). When they used adapters where four degenerate bases were added to the 5′ end of the 3′ adapter and to the 3′ end of the 5′ adapter, they achieved results much closer to those expected (blue line).
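As a sanity check on that expectation, a quick Poisson calculation (my own sketch with an assumed read number, not the authors' analysis) shows just how unlikely repeated 21-mers are in an unbiased library:

```python
# Expected count distribution for a degenerate 21-mer library with no ligation bias (Poisson model).
from math import exp, factorial

n_reads = 5_000_000            # assumed number of reads sampled in a run
kmer_space = 4 ** 21           # possible 21-mer sequences (~4.4 trillion)
lam = n_reads / kmer_space     # expected count per sequence if sampling were unbiased

def expected_sequences_with_count(k):
    """Expected number of distinct 21-mers observed exactly k times."""
    return kmer_space * exp(-lam) * lam ** k / factorial(k)

for k in (1, 2, 10):
    print(f"count {k}: ~{expected_sequences_with_count(k):.2g} sequences expected")
# count 1: ~5e6; count 2: ~3; count 10: effectively zero -- so ~60,000 sequences seen more
# than 10 times in the standard prep is a very clear signature of ligation bias.
```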
I don’t think we should be too worried about differential expression studies: as long as the comparisons used the same methods for both groups, the results we have are probably true. However we may well have missed many smallRNAs because of the bias, and our understanding of the biology is likely to be enhanced by these improved protocols.
The recent papers:
Jayaprakash et al NAR 2011: showed that RNA ligases have “significant sequence specificity” and that “the profiles of small RNAs are strongly dependent on the adapters used for sample preparation”. They strongly suggest modifications to current protocols for smallRNA library prep using a mix of adapters, "the pooled-adapter strategy developed here provides a means to overcome issues of bias, and generate more accurate small RNA profiles."
Sun et al RNA 2011: "adaptor pooling could be an easy work-around solution to reveal the “true” small RNAome."
Zhuang et al NAR 2012: showed that the biases of T4 RNA ligases are not simply sequence preferences but are also affected by structural features of the RNAs and adapters. They suggested "using adapters with randomized regions results in higher ligation efficiency and reduced ligation bias".
Sorefan et al Silence 2012: demonstrate that secondary structure preferences of RNA ligase impact cloning and NGS library prep of small RNAs. They present “a high definition (HD) protocol that reduces the RNA ligase-dependent cloning bias” and suggest that “previous small RNA profiling experiments should be re-evaluated” as “new microRNAs are likely to be found, which were selected against by existing adapters”, a powerful if worrying argument.
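For anyone wanting to picture what these HD adapters look like, here is a toy sketch of the idea; the adapter sequences below are placeholders, not the published ones, and in practice the degenerate bases are synthesised as NNNN on a single oligo rather than as 256 separate oligos:

```python
# Toy illustration of HD adapters: four random bases on the ligating end of each adapter.
import itertools

BASES = "ACGT"
THREE_PRIME_ADAPTER = "ACGTACGTACGTACGTACGT"   # placeholder sequence, not the published adapter
FIVE_PRIME_ADAPTER = "TTGACTGACTGACTGACTGA"    # placeholder sequence

def hd_variants(adapter, random_bases_first, n_random=4):
    """Enumerate the adapter pool represented by n degenerate bases on the ligating end."""
    for combo in itertools.product(BASES, repeat=n_random):
        tag = "".join(combo)
        yield tag + adapter if random_bases_first else adapter + tag

pool_3p = list(hd_variants(THREE_PRIME_ADAPTER, random_bases_first=True))   # NNNN at the 5' end of the 3' adapter
pool_5p = list(hd_variants(FIVE_PRIME_ADAPTER, random_bases_first=False))   # NNNN at the 3' end of the 5' adapter
print(len(pool_3p), len(pool_5p))   # 256 variants of each, so every small RNA sees a mix of adapter ends
```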
Monday, 18 June 2012
Even easier box plots and pretty easy stats help uncover a three-fold increase in Illumina PhiX error rate!
One of the things I wanted to do with this blog was share things that make my job easier. One of the jobs I often have to do is communicate numbers quickly and effectively, and a box plot can really help. I also have the same kind of trouble most people face with statistics: I find it hard! In this post I will discuss the GraphPad Prism package, which allows you to use stats confidently and make lovely plots (although annotating them is a nightmare). Recently the statisticians in our Bioinformatics core gave a short course in using GraphPad Prism. I thought I'd explain the box plot in a little more detail and tell you a bit about GraphPad.
Previously I had shown how to create a box plot using Excel. I went down this route because I did not have time to learn a new package and Excel is available almost everywhere. However the result is less than perfect and it is hard work, indeed one major reason for writing the previous blog post was so I had somewhere to go next time I needed to create a plot! Statisticians like box plots as they can get across a lot more than just the mean and also say something about the size of the population being investigated.
Explaining box plots: A box plot is a graphical representation of some descriptive statistics; generally a measure of central tendency (the median or mean) plus one of the following: standard deviation, standard error or interquartile range. A dot box plot is a version that shows these figures along with all the data points, which allows the size of the population to be clearly seen. This helps enormously when comparing sample groups and deciding whether a change in mean is statistically significant or not.
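If you would rather stay with free tools, a dot box plot is straightforward in Python with matplotlib; this is my own minimal sketch with made-up numbers, not a Prism recipe:

```python
# A dot box plot with matplotlib: box plots with every data point overlaid (made-up data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
group1 = rng.normal(1.00, 0.15, size=12)    # small group
group2 = rng.normal(1.05, 0.15, size=120)   # large group

fig, ax = plt.subplots()
ax.boxplot([group1, group2], labels=["Group 1", "Group 2"], showfliers=False)
for i, data in enumerate([group1, group2], start=1):
    jitter = rng.normal(i, 0.04, size=len(data))   # horizontal jitter so every point is visible
    ax.scatter(jitter, data, alpha=0.5, s=10)
ax.set_ylabel("Measurement")
plt.show()
```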
Dot box plots rule!
In figure 1 very similar means and standard errors are plotted, with each dot representing a sample; see how the removal of some data does not significantly affect the "results", but seeing the data allows you to make a call on how much you are willing to trust it.
In figure 2 you can clearly see that there appear to be some "outliers" in group 2, but these have no effect on the results as the number of measurements is so high compared to group 1. Deciding if any "outliers" are present in group 1 is much harder as the number of samples is so much lower. Removing outliers is really hard, and our statisticians generally advise against it.
GraphPad Prism: The software costs about $300 for a personal license. It might be a lot when budgets are tight but an academic license is not so expensive when shared across a department or institute. I’d certainly encourage you to take the plunge. It very quickly allows you to produce plots like the ones in this post as well as run standard statistical tests, and a whole lot more I won't go into. Take a look at their product tour if you want to find out more.
PhiX error rates: I used GraphPad to investigate an issue I had suspected for a while. We have been seeing a bias in the error rate on Illumina sequencing flowcells, where lane one appeared to be higher than the other lanes. Whilst the absolute numbers are not terrible and all lanes pass our QC, there may be a real impact on results if this is not taken into account; for instance when calling mutations from single-lane samples and comparing tumour to normal.
I took one month's GAIIx data (8 flowcells) and plotted the error rate for each lane. Entering the data into GraphPad is the most annoying bit and I usually copy and paste from Excel. However generation of the statistics and plots (figure 3) took about three minutes from start to finish.
A one-way ANOVA with a Bonferroni correction showed how significant the differences were, with a very significant difference between lanes 1 & 2 and the rest. In fact there appears to be more of a gradient across a flowcell, as lane 2 is affected but to a lesser degree than lane 1.
A two-way ANOVA allowed me to determine that in this data set lane accounted for 80% of the variance and instrument only 2.5%.
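For anyone without a Prism licence, the same one-way ANOVA is only a few lines of Python with scipy; the per-lane error rates below are made up for illustration:

```python
# One-way ANOVA of per-lane PhiX error rates with scipy (hypothetical numbers, % error per lane).
from scipy import stats

lane_errors = {
    1: [0.95, 1.02, 0.98, 1.10, 0.97, 1.05, 1.01, 0.99],   # one value per flowcell
    2: [0.55, 0.60, 0.58, 0.62, 0.57, 0.61, 0.59, 0.56],
    3: [0.32, 0.35, 0.31, 0.36, 0.33, 0.34, 0.30, 0.35],
    4: [0.31, 0.33, 0.30, 0.34, 0.32, 0.33, 0.31, 0.32],
}
f_stat, p_value = stats.f_oneway(*lane_errors.values())
print(f"one-way ANOVA across lanes: F = {f_stat:.1f}, p = {p_value:.2g}")
# Pairwise lane comparisons with a Bonferroni correction would then mirror the Prism analysis.
```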
The biggest headache with GraphPad is the woefully inadequate annotation of graphs. Quite simply you will have to get an image out of the software and into Illustrator or PowerPoint. I guess if they are making the stats easy we should not complain too much.
I am using GraphPad on a weekly basis and for most reports where I have to summarise larger datasets. Why don't you give it a try?
PS: I'll let you know what Illumina say about the error rates. Please tell me if you've seen similar.
Wednesday, 30 May 2012
Don’t bother with a biopsy; just ask for a drop of blood instead.
I’d like to highlight some research that I have been involved in, which has just been published in Science Translational Medicine; Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. The work makes extensive use of Fluidigm’s Access Array and Illumina sequencing, technologies that we have been running in my lab for over two and almost five years respectively. I’m proud of this work and I hope you like it as well.
Liquid biopsies for personalised cancer medicine: Tumours leak DNA into blood and the SciTM paper shows how this circulating tumour DNA (ctDNA) can be used in cancer management as a “liquid biopsy”. The study by Forshew et al demonstrates the feasibility of testing and detecting mutations present at about 2% MAF (Mutant Allele Frequency), in multiple loci, directly from blood plasma in 88 patient samples. The method has been christened Tagged Amplicon sequencing or TAm-seq.
We know that specific mutations can impact treatment, e.g. ErbB2 amplification & Herceptin, BRAF V600E & Vemurafenib, etc. Many cancers are heterogeneous, with metastatic clones differing from each other and/or the primary, and biopsying all tumour locations is unrealistic for most patients. Understanding this heterogeneity in each patient will ultimately help guide personalised cancer medicine.
ctDNA however has not been easy to assay. It is usually fragmented to about 150bp and is present in only a few thousand copies per ml of blood. There has been a recent explosion in non-invasive prenatal test development since the publications from the Lo and Quake groups. Whilst people were aware of ctDNA for many years, it is only similar technological advances that allow us to assay it in such a comprehensive manner.
How does the liquid biopsy work: ctDNA is first extracted from between 0.8 and 2.2ml of blood using the QIAamp Circulating Nucleic Acid kit (Qiagen). Tailed locus-specific primers are used for PCR amplification, and the loci targeted in the paper account for 38% of all point mutations in COSMIC. Plasma from 88 patients, a couple of controls and 47 FFPE samples were tested in duplicate, clearly demonstrating the utility and robustness of the method.
Each sample is pre-amplified in a multiplex PCR reaction to enrich for all targeted loci. This “pre-amp” is used as the template on a Fluidigm Access Array where each locus is individually amplified in 33nl PCR reactions that are recovered from the chip to produce a locus-amplified pool from each sample. A universal PCR adds barcodes and flowcell adapter sequences. 48 samples were pooled for each Illumina GAIIx sequencing run achieving 3200 fold coverage. We recently ran a library with 1536 samples in it from a collaborator using the technology in my lab. The potential of 12288 samples analysed in a single HiSeq run is astonishing.
How sensitive is the liquid biopsy: The paper presents results from a series of experiments to test the sensitivity, false discovery rate (using mixed samples) and validity (using digital Sanger-seq) of the TAm-seq method. ctDNA can be successfully amplified from as little as 0.8ml of plasma, far easier to get than a tissue biopsy! They were able to detect mutations at around 2% MAF with sensitivity and specificity >97%. The paper has some very good figures and I’d encourage you to read it and take a look at those.
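To get a feel for why that sort of depth is enough to call a 2% allele, here is a rough binomial calculation (my own assumed background error rate and thresholds, not figures from the paper):

```python
# Expected mutant reads at 3200x for a 2% allele, and the chance of that many reads from noise alone.
from math import comb

def prob_at_least(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p), via the complement of the lower tail."""
    return 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min))

depth = 3200
maf = 0.02             # mutant allele frequency to be detected
error_rate = 0.001     # assumed per-base background error rate after filtering

expected_mutant_reads = depth * maf                       # ~64 reads should carry the mutation
p_noise = prob_at_least(depth, error_rate, k_min=20)      # chance noise alone gives >= 20 variant reads
print(f"expected mutant reads: {expected_mutant_reads:.0f}")
print(f"P(>=20 error-only reads at {depth}x): {p_noise:.1e}")
```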
Two of the figures stand out for me. The first shows results from one ovarian cancer patient’s follow-up blood samples in which they identified an EGFR mutation not present in the initial biopsy of the ovaries. When they reanalysed the samples collected during initial surgery they could find the EGFR mutation in omental biopsies. A TP53 mutation was found in all samples except white blood cells (normal) and acts as a control (Figure A from the paper).
They presented an analysis of tumour dynamics by tracking 10 mutations, discovered from whole tumour genome sequencing, using TAm-seq in the plasma of a single breast cancer patient over a 16 month period of treatment and relapse (Figure D). They also demonstrated the utility of TAm-seq by comparison to the current ovarian cancer biomarker, CA125 (figures B & C not reproduced here).
What does this mean for clinical cancer genomes: There are many reports of whole genome sequencing leading to clinically interesting findings and some labs have started formal exome and even whole genome sequencing on patients. Whilst there is little doubt that tumour sequencing is likely to be useful in most solid tumours it is still hard to see how this will trickle down to the 100,000 or so new cancer patients we see in the UK. Challenges include cost, bioinformatics, incidental findings and ethics.
The TAm-seq method has fewer of these challenges (although it is not good for detecting amplifications) and I think it is a really big step in clinical cancer genomics and hope it translates quickly into the clinic to inform treatment and prognosis. Perhaps it will be the first technique to make a big splash in personalised cancer genomics?
Hopefully this liquid biopsy will be quickly translated into the clinic.
Perhaps a future version might even turn up in your doctor’s surgery on a MinION in a few years time as a basic screening tool?
Saturday, 26 May 2012
The exome is only a small portion of the genome
Ricki Lewis (Genetic Linkage blog) wrote a great guest post for Scientific American on what exome sequencing can’t do. It seems timely considering the explosion of interest in exome sequencing and exome arrays. Not so long ago most people I knew still talked about junk DNA; exome sequencing and exome arrays essentially allow users to ignore the junk and get on with real science. As Ricki points out, exome analysis is a phenomenally useful tool, but users need to understand what it can’t do to get the most from their studies.
Ricki listed 10 things exomes are not so good for; my list is a lot shorter, at just 4.
- Regulatory sequence is missing (although this is being added, e.g. Illumina).
- Not all exons are included.
- Structural variants (CNV, InDel, Inv, Trans, etc) are not easily assayed with current exome products.
- No two exome products are the same.
Exome analysis has had a real impact, especially on Mendelian diseases that remain undiagnosed. However users need to remember they are only looking at a very small portion of the genome. Ricki puts it this way “the exome, including only exons, is to the genome what a Wikipedia entry about a book is to the actual book”.
I posted a month or so ago about choosing between exome-chip and exome-seq. The explosion in exome-chips has been an even bigger surprise than exome-seq. Illumina admitted that they had been overwhelmed with demand for their array products. It appeared to be pretty clear that exome-seq would take off as soon as the cost came down to something reasonable. However according to Illumina over 1M samples have now been run on exome chips!
Of course analysis of an exome is allowing studies to happen that would never get off the ground if whole genome sequencing were the only option. The cost and relative ease of analysis makes the technology accessible to almost anyone. As the methods and content improve over the next couple of years this is going to get even easier.
The simplest thing for users to remember is that they are restricting analysis to a subset of the genome. This means that just because you don’t find a variant does not mean one is not lurking outside the exome; absence of evidence is not evidence of absence as statisticians would put it.
It is also helpful to remember that not all exomes are created equal. Commercial products are designed with a price and user in mind. Academic input is usually limited to a few groups and there are always other bits that could be added in. Illumina have done a great job including some of the regulome in their product but the commercial products are in a similar arms race to the one faced by microarray vendors a decade ago. Just because a product targets a bigger exome does not mean it is better for your study.
Exomes are well and truly here to stay. We'll probably see an exome journal soon enough as there is so much interest.
Thursday, 24 May 2012
AmpliSeq 2
Ion Torrent released AmpliSeq 2.0 a little while ago. The biggest change is an increase to 3072 amplicons per pool. I saw a Life Tech slide deck which had up to 6144 amplicons per pool, so it looks like there is more room for improvement.
Who knows when we might see a multiplex PCR exome!
PS: see a previous post about AmpliSeq for more general details.
Tuesday, 22 May 2012
Resources for the public understanding of cancer genomics
We have just had an open day at our institute, we had Science Week here in the UK a month ago, and I went to my son's school to talk about why people have differently coloured eyes. I like communicating science to other people; at work, in the pub, on the train, at school, etc, etc, etc. I am a science nerd and am happy to be known as one. Most of the time.
Public understanding of science (PUS) is important and we need to make sure the people funding us, and hopefully benefiting from the work we do, realise why we do what we do. However it is not always easy to find the time to put together something everyone will understand and engage with.
There are surprisingly few resources out there to get PUS materials and examples from. This post outlines the work I've done for our open day and discusses some of the resources that are out there to offer some inspiration. I'd also like to enthuse others about the idea of Creative Commons licensing of your PUS materials so others can use them. Then we just need to find somewhere to put them and organise them. Does anyone fancy writing a grant to Wellcome to get some funding for a web-resource?
Why bother with PUS in a genomics core facility: I wanted to share the recent posters I put together to show visitors to our open day what we do in a genomics core facility lab and how those shiny Illumina HiSeq, MiSeq and GAIIx (a little old and not so shiny!) machines can help us help patients.
Mutations in Cancer: The PUS poster comes as a pair showing huge Sanger-style sequence traces of the Braf V600E mutation in normal and mutant versions. Visitors are asked to "spot the difference" and identify the mutation, which is a very difficult thing to do by eye. We explain that finding this kind of mutation is what we are doing on our instruments about 1000 times a day. Of course this is a massive over-simplification of the possibilities offered by cancer genome sequencing, but hopefully it shows why we spend so much money on cancer genomes. Braf V600E was chosen as it is one of a few mutations that can be tested for to determine treatment. Tests we are developing in our institute will hopefully mean every cancer patient in the UK is screened for mutations like this from a tumour biopsy and, maybe one day, a simple blood sample.

Can you spot the difference between normal (left) and mutant (right) Braf sequences?
How can I tell other scientists about my posters: I wanted to make these posters available to others to use or modify for their own PUS events. The posters can be downloaded here under a creative commons license. However there does not appear to be a central repository of resources like this. A few sites do offer material under creative commons, like the University of Oxford maths department podcasts. I wonder if the resource we need today is somewhere to upload materials with keywords and abstracts in a searchable form. If these were available as shared documents then the community could work on them together. I am thinking something like GoogleDocs or a Wiki for PUS. This would be a wonderful thing for someone like Wellcome or MRC to fund. If your materials became widely used then they could also become something that was worth adding to your CV, demonstrating additional skills outside of research experience.
PUS sites I liked: There are people out there communicating about PUS. There is even a journal called "Public Understanding of Science" that has an interesting editorial blog post about open access publishing. PUS is a subscription journal and they appear to lean away from a PLoS model of open access, which seems totally at odds with the journal's remit to promote public understanding. How can scientists learn the best ways to communicate science if we have to subscribe to a journal to read about them? PUS covers all kinds of interaction between science and the public, with topics such as: "science, scientific and para-scientific belief systems, science in schools, history of science, education of popular science, science and the media."
A post on Diffusion summarises a seminar from Martin Rees. He advocates scientists doing more communications work and making sure the public and politicians get access to the very best explanations of science possible.
Marie Boran has some wonderful blog posts at The Strange Quark about public understanding of science. I'd also recommend her posts on what science is, science journalism and what communication means. She communicates things in a way people are likely to remember them. I particularly liked her way of presenting the scale of cells, apparently if you use an M&M to represent a red blood cell then a single grain of sugar would be about the same size as a bacterium.
Lastly I'd like to point everyone to Small Science Zines. This is a site that promotes the use of "Zines", small, easily and cheaply reproduced magazines. They provide instructions (zine-folding directions) to help you produce a simple eight page zine for easy distribution. The zine on DNA computing is an interesting if wordy read. I think I'll give one a go for Illumina sequencing so watch this space. Perhaps we can produce a series of NGS "how to" comics?
The site discusses how to communicate science and how to design zines. It does not all have to be about well honed presentations or laminated A0 posters. PUS could simply be the last time you told someone a science fact. Making a zine to show people is much more personal than pointing them to a website or asking them to visit the lab. All you need is "to know and care about your topic, and want to share this with others".
Wednesday, 16 May 2012
HiSeq 2500: how much will the "genome in a day" cost?
The launch of HiSeq 2500 generated a buzz as it came hot on the heels of the Ion Proton. Both instruments will allow users to generate a genome in a day. HiSeq 2500 was launched with a 127Gb in 24 hours spec. Current specs on the Illumina website are at 120Gb in 24 hours, 300M reads and PE150 supported (yielding 180Gb in 39 hours). All this for a $50,000 upgrade fee which makes it seem likely that many users will upgrade at least one instrument.
If you want to know more, read the 2500 app note on Illumina's website, although the yield figures in the table appear to be incorrect!
There has been much less noise about the likely cost of the data from the rapid run "MiSeq on steroids" format. A recent post on SEQanswers is the first sniff of HiSeq 2500 pricing, although it may not be accurate. It suggests a PE cluster kit will cost $1225 and a 200 cycle SBS kit will be $1690.
I used these figures to get to the possible cost per lane of a HiSeq 2500 run:
- PE100 multiplexed: £900 or $1500
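For transparency, here is how I got to that figure; a consumables-only sketch based on the rumoured prices above, which may well turn out to be wrong:

```python
# Consumables-only cost per lane for a PE100 rapid run, using the rumoured kit prices above.
pe_cluster_kit_usd = 1225
sbs_200_cycle_kit_usd = 1690
lanes_per_rapid_flowcell = 2

cost_per_lane_usd = (pe_cluster_kit_usd + sbs_200_cycle_kit_usd) / lanes_per_rapid_flowcell
print(f"~${cost_per_lane_usd:.0f} per lane for PE100")   # ~$1458, close to the $1500 quoted above
# Library prep, other consumables and instrument amortisation would come on top of this.
```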
This compares incredibly well to the normal output. In fact to me it looks like HiSeq 2500 rapid run mode could be the best choice for core labs like mine: it offers incredible flexibility, since a two-lane flowcell is quicker to fill up than an eight-lane one. And five dual-flowcell rapid runs will take less time and generate the same data as a dual eight-lane-flowcell standard run. The cost per Gb is going to be a little higher, but many users will see this as a fair trade-off for faster turn-around times.
The HiSeq 2500 rapid runs will also use on-instrument clustering. Exactly how this is going to fit inside the instrument with the available fluidics is not completely clear. I'd expect that we will have to run both positions in the same configuration using the current PE reagent rack.
Whether Illumina are able to really turn HiSeq 2500 into "MiSeq on steroids" and up read lengths to the 687bp presented at AGBT is still to be seen. They might have to if Ion Torrent can push their read-lengths out to current 454 lengths.
The competition: The latest specs from Life suggest that the Proton II chip will generate 20x coverage of a genome (and analyse in a day). However it is not clear if the run time will be longer or multiple chips will be run, current times are 2 hours per chip. A 20x genome in 2 hours would be great, but I don't think we can expect quite that from Life just yet. There is also a video of the first 4 Proton's to be installed (at BCM); "install to sequence in 36 hours" although the video only shows samples being centrifuged before loading and no real sample prep.
What's next: One thing I am happy to predict is that advances in sequencing technology are not going to stop any time soon, and when ONT come out from under their invisibility cloak we might finally get a peek at some data that shows what tomorrow holds.
Wednesday, 9 May 2012
NGS is the ultimate QC for your PCR oligos!
About two years ago when we started using Fluidigm Access Array sequencing we noticed something in the reads that was a bit of a surprise, although not totally unexpected once we realised what was going on. We were amplifying all the exons in seven cancer genes across 48 cancer cell lines, sequencing them in a single experiment and finding known SNPs at the expected allelic ratios. However we also found quite a large number of what seemed to be random deletions and truncations in the targeted regions, and these all occurred towards the beginning of the reads.
One of the “problems” with many amplicon sequencing methods is that they tail locus-specific primers with NGS adapters, which means you have to read through the PCR primer before you get to the interesting sequence in your samples. In our case the first 25bp or so of each read came from the primer.
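As an aside on handling this in analysis, here is a minimal sketch that hard-trims an assumed 25bp primer from the start of every read in a FASTQ file. The file names and the fixed 25bp are placeholders; in practice you would trim the actual primer sequence for each amplicon (tools such as cutadapt can do this properly).

```python
# Minimal sketch: hard-trim the first 25bp (assumed primer length) from each FASTQ read.
# File names are placeholders; real amplicon data needs per-amplicon primer trimming.

PRIMER_LEN = 25

with open("amplicons.fastq") as fin, open("amplicons.trimmed.fastq", "w") as fout:
    while True:
        header = fin.readline()
        if not header:
            break
        seq = fin.readline().rstrip("\n")
        plus = fin.readline()
        qual = fin.readline().rstrip("\n")
        fout.write(header)
        fout.write(seq[PRIMER_LEN:] + "\n")
        fout.write(plus)
        fout.write(qual[PRIMER_LEN:] + "\n")
```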
It appeared that we were seeing the by-products of the oligo production process: truncated and otherwise incorrect oligos. Oligo manufacturers use varying methods of clean-up and QC, but none are perfect, and it looks to me like NGS might be the ultimate, if slightly expensive, oligo QC tool.
One way to test this would be to get the same pair of oligos made by multiple companies, PCR amplify a control DNA template with all of them and then sequence the primer sequence only in a pooled sequencing run. Anyone fancy collaborating?
An oligo “primer”: I thought I would follow up on this with an overview of oligo manufacture and some tips for PCR primer design.
You can buy oligos from many places, and a very few labs still make their own. It is possible to get oligos of up to 400bp in length, chimeric oligos (DNA:RNA) and all sorts of modifications: fluorescent dyes, amino modifications, biotinylation, phosphorylation, 2'-deoxyinosine (INO), 5-methyl-dC, phosphorothioate, dI, dU, 2'-deoxyuridine (URI), amino C6-dT, spacers (deoxyabasic or C3), thiols and more. Most people are simply out to buy a primer for PCR or Sanger sequencing, but there are also RNA, siRNA, PNA, LNA and many other types available for many applications.
Choosing a provider most often comes down to cost and the price per base is very low for a standard desalted PCR primer. However the options offered by oligo manufacturers are numerous and some might be a prerequisite for your experiment.
Most standard oligos are supplied already quantified and even at your pre-determined concentration. The amount of oligo you actually get depends on the synthesis scale (how much is made), the efficiency of each coupling and the purification used. The lowest synthesis scale is generally fine for PCR applications; however, if you want specific clean-up of your oligo you may have to get a larger scale synthesised. Most providers suggest that you resuspend oligos in TE rather than water, which can be slightly acidic. I have always used a buffer with a lower amount of EDTA (10 mM Tris, pH 8.0, 0.1 mM EDTA), as EDTA at higher concentrations can inhibit some molecular biology applications.
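For anyone resuspending their own oligos, the arithmetic is simple; here is a tiny sketch, with purely illustrative numbers, that converts the quoted yield in nmol into the volume of low-EDTA TE needed for a chosen stock concentration.

```python
# Volume of low-EDTA TE needed to resuspend a dried oligo to a target stock concentration.
# Numbers are illustrative; use the yield printed on your own tube.

def resuspension_volume_ul(yield_nmol: float, stock_uM: float) -> float:
    # 1 µM = 1 nmol per mL = 0.001 nmol per µL
    return yield_nmol / (stock_uM / 1000.0)

print(resuspension_volume_ul(25, 100))   # 25 nmol to a 100 µM stock -> 250.0 µL
```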
Oligo QC: The current standard for QC is mass spectrometry (MALDI-TOF or ESI-MS) or electrophoresis (CE or PAGE). MALDI-TOF is used for most applications because of its high throughput, but ESI is better for oligos >50bp.
Oligo Yields: Most companies will specify yields that seem incredibly low compared to the starting nucleotide concentrations. Oligo synthesis is performed sequentially, with a single nucleotide coupled to the growing oligo in a 3’ to 5’ direction. Each coupling is often less than 99% efficient, so some of the oligos are not extended. This means the final product is a mix of full-length product (n) and truncated sequences (n-1, etc).
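The effect of coupling efficiency compounds with length, which is why longer oligos need purification. Here is a small sketch of the standard full-length-yield calculation; the efficiencies and lengths are illustrative.

```python
# Fraction of full-length product ~= coupling_efficiency ** (number of couplings).
# An n-mer needs n-1 couplings after the first base; numbers below are illustrative.

def full_length_fraction(oligo_len: int, coupling_efficiency: float) -> float:
    return coupling_efficiency ** (oligo_len - 1)

for n in (20, 25, 60, 100):
    for eff in (0.98, 0.99):
        print(f"{n:>3}-mer at {eff:.0%} coupling: {full_length_fraction(n, eff):.1%} full length")
```

At 99% coupling a 25-mer is still roughly 79% full length, but a 100-mer drops to around 37%, which is why the clean-up options below matter more as oligos get longer.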
Oligo purification can be performed in several ways: cartridge, HPLC or PAGE. On the Sigma website there is a very handy table showing which clean-up you should choose for different applications.

Sigma's clean-up guide
Desalting: This is the most common and cheapest clean-up method and is perfectly fine for standard PCR based applications. Desalting removes by-products from the oligo manufacturing process. If your oligos are >35bp then desalting will not remove the relatively high number of n-1 and other truncated oligos.
Cartridge: A reverse-phase cartridge purification from Sigma will give about 80-90% yield. Full-length oligos retain a 5'-DMT group and are separated from truncated sequences by making use of their higher hydrophobicity. However, this difference shrinks as oligos increase in length, so cartridge purification should not be used for anything >50bp.
HPLC: Reverse-phase HPLC purification offers higher resolution and gives yields of >90%. HPLC also allows purification of larger amounts of oligo (>1 µmol). Again, this method is not ideal for oligos longer than 50bp.
PAGE: Polyacrylamide gel electrophoresis can achieve single-base resolution and very high-quality purification, and is highly recommended for longer oligos (>50bp). Unfortunately the yield after gel extraction can be quite low.
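Pulling the length rules above together, here is a toy chooser; it encodes only what is stated in this post, not Sigma's full guide, which also weighs the application and any modifications.

```python
# Toy clean-up chooser based only on the length rules described in this post.
def suggest_cleanup(oligo_len: int, standard_pcr: bool = True) -> str:
    if oligo_len > 50:
        return "PAGE (single-base resolution, but low yield after gel extraction)"
    if standard_pcr and oligo_len <= 35:
        return "desalting (cheap and fine for standard PCR primers)"
    return "cartridge or HPLC (removes truncated products; OK up to ~50bp)"

print(suggest_cleanup(22))                        # typical PCR primer -> desalting
print(suggest_cleanup(60))                        # long oligo -> PAGE
print(suggest_cleanup(45, standard_pcr=False))    # -> cartridge or HPLC
```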
Primer design tips: Oligos for PCR, qPCR and sequencing are pretty easy to design if you follow a few simple rules. Once you have primers designed it pays to use the Blat and in silico PCR tools on the UCSC genome browser. Order the lowest synthesis scale for basic PCR applications and resuspend oligos in low-EDTA TE (10 mM Tris, pH 8.0, 0.1 mM EDTA). If you are in any doubt about contamination, throw oligos away and buy new ones!
- Use Primer 3: Primer 3 is the tool many others are based on; forget all the other rules and just use it! But in the spirit of educating readers, here are the other rules...
- Amplicon length: Standard PCR is fine up to 1-3kb; above this the primer design may not need to change, but the reaction conditions almost certainly will. For qPCR keep amplicons to around 100-200bp.
- Oligo length: 18-22bp is optimal and should give high specificity and good annealing characteristics. Longer than 30bp and the annealing time needed can increase, reducing reaction efficiency (see the quick checker sketch after this list).
- Melting Temperature: A Tm of 52-58°C works best in most applications; make sure primer pairs have Tms as close to each other as possible and test using a gradient cycler run.
- GC Content: The GC content (the number of G's and C's in the primer as a percentage of the total bases) of the primer should be 40-60%.
- Avoid secondary structure: If your primer forms secondary structures you will get a low or even zero yield of the final product. Secondary structure affects primer-template annealing and thus the amplification. Internal hairpins and primer dimers are common problems.
- Avoid repeats and runs: Repeats (e.g. di-nucleotides) and mono-nucleotide runs can easily misprime (and promote secondary structure); try to avoid them or keep them to fewer than 4-6bp.
- Don’t design to homologous regions: Sounds obvious but if your primer can anneal to multiple regions in the genome then you are likely to get multiple products in your reaction. Blast or Blat your final sequences before ordering.
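To finish, here is a minimal checker that flags primers falling outside the length, GC and Tm ranges above (this is the sketch referred to in the oligo-length tip). The Tm uses the simple Wallace rule (2°C per A/T, 4°C per G/C), which is only a rough guide for short oligos, so treat it as a sanity check rather than a substitute for Primer 3.

```python
# Quick primer sanity check against the rules above.
# Tm uses the Wallace rule, 2*(A+T) + 4*(G+C), a rough guide for short oligos only.

def gc_percent(seq: str) -> float:
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq: str) -> int:
    seq = seq.upper()
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

def check_primer(seq: str) -> list[str]:
    warnings = []
    if not 18 <= len(seq) <= 22:
        warnings.append(f"length {len(seq)}bp outside 18-22bp")
    if not 40 <= gc_percent(seq) <= 60:
        warnings.append(f"GC {gc_percent(seq):.0f}% outside 40-60%")
    if not 52 <= wallace_tm(seq) <= 58:
        warnings.append(f"Tm {wallace_tm(seq)}C outside 52-58C")
    return warnings

# Hypothetical example sequences, not real primers:
for primer in ("ACGTGATCAATGCTAGCTAA", "GGGCGCGCCGGGCCC"):
    print(primer, check_primer(primer) or "OK")
```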