Wednesday 27 February 2013

Fixing Illumina's low diversity problem

Illumina's is the most widely used next-generation sequencing technology, but like all technologies it is not perfect. You'll have to wait for the Oxford Nanopore sequencer for perfection! One challenge we have to deal with more and more in Illumina sequencing is the balance and number of libraries in a multiplex pool. If either of these is wrong, the final sequencing results can be useless or require significantly more sequencing than would normally be needed.

In this post I am going to outline the advice I give to my users.

The low diversity problem: The technology suffers from a low diversity problem that has been explained over at Pathogenomics. They describe this "fly in the ointment" for Illumina users sequencing amplicons and point to a SEQanswers thread suggesting some workarounds, e.g. spiking in lots of PhiX. The main problem derives from the need to find distinct clusters on the images, which is done on the fly by the HiSeq Control Software (HCS). The first step is template generation, which defines the X,Y coordinates of each cluster on a tile and is the reference for everything else. Template generation currently uses the first four cycles (but can be configured to use more), and this data is analysed after the fourth cycle is complete, which users will see as a pause by the instrument. Clusters are found in each of the 16 images (A, C, G & T for four cycles). A Golden Cycle (g) is determined as the one with the most clusters in A&C, and this is used as the reference for everything else. By comparing the merged A&Cg clusters with A&C and G&T in all cycles, a two-colour map of clusters is formed that distinguishes broad clusters from closely neighbouring but distinct ones. However, although there are four bases, Illumina instruments only have two lasers: red (A&C) and green (G&T, which I remember by thinking of the colour of a Gordon's gin bottle, green for G&T!), and in each cycle at least one of the two nucleotides for each colour channel needs to be read to ensure proper image registration.

See technote_rta_theory_operations.pdf for more detail.

The causes of index read failures: If the index read is unbalanced due to poor mixing or low numbers of indexes, then registration can fail due to the low diversity. The balance and number of libraries in a multiplex pool impact diversity, and low diversity means too many clusters fail demultiplexing and are thrown away. Illumina provide advice in their sample prep guides (e.g. TruSeq DNA PCR-Free Sample Preparation Guide (15036187 A)) on how best to quantify libraries before pooling and also on how to safely create low-plexity pools (more than 12 is usually safe). I'll now address both of these, Quantifying & Normalising and Deciding Index Plexity, in a little more detail.
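The colour-channel rule above is easy to check for yourself. Here is a minimal sketch (my own illustration, not an Illumina tool): for a pool of index sequences, every cycle of the index read should contain at least one red-channel base (A or C) and at least one green-channel base (G or T) somewhere in the pool.

```python
def unbalanced_cycles(indexes):
    """Return the (1-based) index-read cycles where the pooled indexes
    have no red-channel (A/C) or no green-channel (G/T) base,
    i.e. cycles at risk of registration failure."""
    red, green = set("AC"), set("GT")
    bad = []
    for cycle, bases in enumerate(zip(*indexes), start=1):
        seen = set(bases)
        if not (seen & red) or not (seen & green):
            bad.append(cycle)
    return bad

# Illumina's recommended 2-plex pair (TruSeq indexes 6 and 12) is balanced:
print(unbalanced_cycles(["GCCAAT", "CTTGTA"]))  # []
# An ad hoc pair of indexes 1 and 8 fails at cycle 1 (A/A: red channel only):
print(unbalanced_cycles(["ATCACG", "ACTTGA"]))  # [1]
```

Running something like this over a proposed pool before clustering is a cheap way to spot the low-plexity combinations that will cause trouble.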

Quantifying & Normalising: Simply use qPCR quantification to get the very best estimate of nM concentration. Bioanalyser, Qubit and Nanodrop all work but can be very inaccurate when compared to qPCR; it all depends on your libraries. These other methods also quantify molecules that cannot form clusters, such as molecules without both adapters, oligos and nucleotides, and can significantly over- or under-estimate pM concentration. A very good QC document is one I found from an Illumina Asia/Pacific meeting.

Illumina recommends the use of Kapa's KK4824 Library Quantification Kit - Illumina/Universal kit, although now that they are selling the Eco systems and their own NuPCR I'd expect an Illumina kit to come along any day now.

KAPA's kit is a SYBR-based real-time PCR assay that uses two primers complementary to the ends of TruSeq, Nextera and other library types. During PCR only amplifiable molecules contribute to the Ct reading, so the assay is more robust than spectrophotometry or fluorimetry. The method is not without its drawbacks, but if care is taken with preparation of the samples and controls then very accurate results are possible.
  • Pipette large volumes - we use a 1:99ul dilution (1ul of library into 99ul of diluent) followed by serial 10:90ul dilutions to create 1:100, 1:1,000, 1:10,000 and 1:100,000 dilutions. 
  • Replicate the dilutions - we make three independent serial dilutions. 
  • Use the controls - we aliquot controls to avoid freeze/thaw. 
  • Include NTCs - vital if you want to see contamination; add them at different stages to see where contamination is coming from.
For library normalisation after sample prep we have more recently moved to a single dilution, only running the 1:10,000 as technical replicates, because the results have been so robust. But for clustering we still make the three dilutions above and run the 1:10,000 dilutions as independent replicates. The reason for this is that normalisation can be off a little without impacting the experiment, but only getting 80M reads when we are expecting 160-180M is embarrassing!
All this uses lots of tips and is a pain in the lab but it does give very good results, which are seen in the final barcode balance analysis.

Two words of warning. Firstly, the total fragment length is required, not just the insert size; don't forget the adapters add another 120-130bp. Secondly, if you are using Illumina's TruSeq PCR-free kits then do not use the Bioanalyser to estimate library size, as it significantly over-estimates fragment length due to "the presence of certain structural features which would normally be removed if a subsequent PCR-enrichment step were performed". The Illumina sample prep guide has a nice figure illustrating this, comparing Bioanalyser traces with the insert size distribution determined by alignment, and suggests using 470bp for 350bp libraries or 670bp for 550bp libraries in the calculations after Kapa qPCR.
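The calculation itself is simple enough to sketch (my own illustration, using the standard ~660 g/mol per bp average mass of double-stranded DNA): convert a mass concentration into nM using the total fragment length, e.g. the 470bp figure Illumina suggest for a 350bp-insert PCR-free library.

```python
def library_conc_nM(ng_per_ul, fragment_bp, mw_per_bp=660.0):
    """Convert a library mass concentration (ng/ul) to molarity (nM),
    given the TOTAL fragment length in bp (insert + ~120-130bp of adapters)."""
    # ng/ul == mg/l; divide by g/mol to get mmol/l, then scale to nmol/l
    return ng_per_ul * 1e6 / (mw_per_bp * fragment_bp)

# A 350bp-insert PCR-free library quantified at 10 ng/ul, calculated with
# Illumina's suggested 470bp total length:
print(round(library_conc_nM(10, 470), 1))  # 32.2
```

Note how sensitive the answer is to fragment length: run the same 10 ng/ul through with the Bioanalyser's inflated estimate and the molarity drops accordingly, which is exactly why the over-estimation matters.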

Figure 27 from Illumina's TruSeq PCR-free guide

Improving qPCR: A big problem with qPCR as currently implemented is the need to know library size to calculate pM concentration. The assay uses an intercalating dye and is effectively measuring the amount of DNA present rather than the number of molecules. Fragment length is used to calculate the number of molecules, and if this is wrong then quantitation will also be inaccurate. Ideally we would not bother with the Bioanalyser, but we need to understand insert size; this potentially doubles the work we need to do for QC of a library.
In my group we discussed a few years ago the use of a TaqMan assay in place of SYBR, and other groups have subsequently published methods for this. A TaqMan probe assay counts the number of probe molecules hydrolysed in each cycle of the qPCR; as such the Ct gives a direct measurement of the number of template molecules present. We're working on an assay to use in the lab as none are available commercially, but if someone wants to make one for us please get in touch!
Illumina's low-plexity pooling guidelines. Simples!
Deciding Index Plexity, or "how many samples should I put in my sequencing pool": Illumina do provide advice in their sample prep guides on how to safely create low-plexity pools, but the diagrams can be confusing when you see them for the first time. Usually 12 or more libraries in a pool are "safe", but smaller pools can go badly wrong. Adding larger numbers of samples to the pool makes this issue disappear, but if a user gets it wrong then a whole lane or flowcell of sequencing can be ruined.
Since we started multiplexed sequencing I have been trying to convince users that small pools are the wrong way to go. Most people simply decided to put as many libraries in a single lane as they could and still get the required number of reads at the end of the run, e.g. 4 ChIP-seq samples pooled in 1 SE36bp lane that will generate 160M reads or 40M per sample.
I always preferred mixing all experimental samples into a super-pool and sequencing as many lanes as required. Users were wary of this because quantification was not robust, but as most have swapped to Kapa qPCR this is no longer a problem. However we still have users giving us pools of three or four samples, which can result in low diversity problems in the index read, causing too many perfectly good clusters to be thrown away when they do not demultiplex.
Fixing low diversity once and for all: Sequencing is cheap and fast, and at least three "simple" fixes appear to be possible. Firstly, we could simply use higher numbers of indexes to remove the problem entirely. Secondly, we could use a longer cluster definition "read", which would lessen the issue. Thirdly, we could make sure the first four bases of the index read are random nucleotides. These random bases could be added during adapter synthesis and would add four or eight cycles of sequencing to a run.

An added benefit of large "super-pools" being sequenced across multiple lanes and/or flowcells is that failures of any kind can be tolerated as long as most of the data is generated. If a tumour:normal pair are sequenced as single lanes and one lane fails, then no analysis can be performed. If the same pair are indexed and mixed with 7 other pairs and one lane on a flowcell fails, copy number analysis would be hardly affected at all. The same principle applies to pretty much any method.
Ultimately I'd like to see micro-molar scale synthesis of pairs of i5 and i7 indexes such that almost every sample in a lab is unique. The index pair would be used once and thrown away. For genomes the cost would be truly negligible; even for RNA-seq with a £60 sample prep, an additional £3-5 on oligos could be tolerated if it removed sample mix-ups. 80 oligos (32 i5 x 48 i7) allow 1,536 samples to be run, more than most labs run in a year.
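The combinatorial arithmetic is what makes dual indexing so cheap, and is worth spelling out (numbers from the text above; the script is just my illustration):

```python
# Dual indexing: every i5 oligo can pair with every i7 oligo,
# so the number of unique sample tags grows multiplicatively
# while the number of oligos to synthesise grows additively.
n_i5, n_i7 = 32, 48

oligos_to_synthesise = n_i5 + n_i7   # 80 individual oligos
unique_index_pairs = n_i5 * n_i7     # 1536 unique dual-index combinations

print(oligos_to_synthesise, unique_index_pairs)  # 80 1536
```

Doubling either oligo set doubles the number of unique pairs, which is why a modest synthesis order covers a year's worth of samples.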

Wednesday 20 February 2013

Genome partitioning: my moleculo-esque idea

De novo assembly of complex genomes is still harder than many would like.

For cancer genomes de novo assembly could be the ultimate method for discovery of all somatic events. But the analysis requires long reads or long-insert reads, and this generally eats up lots of DNA in complex methods. The Broad spoke at last year's AGBT about the "perfect" mix of reads required for using ALLPATHS to assemble a genome (300bp and 3kb I think), and Oxford Nanopore are promising us 100kb reads, which would more or less solve the problem, although it may be some time before we can access the technology!

An alternative solution is to sequence a genome the old fashioned way, clone-by-clone, but with NGS. At least one consortium is attempting a BAC tiling-path of the Wheat genome, which is one of the most complex genomes out there. A limitation is the need to sequence some 150-200,000 BAC clones!!!

My "moleculo" idea: At the end of 2011 I was researching alternative digital PCR systems and keeping up to date with genome-capture methods. While reading around these subjects I had the idea to mix the RainDance emulsion PCR with Illumina's Nextera tagmentation.

Using a set of transposomes that insert unique barcode tags, it would be possible to dilute large fragments of DNA such that a single 100kb fragment would be mixed with a single Nextera tag. The resultant transposition library prep would create a set of sequences that all came from the same genomic locus. Et voilà, a genome ready for two-step assembly: first a local de novo assembly would stitch together reads from single 100kb fragments, then the long reads would be stitched together to create the entire genome. Amplifying the DNA first would help, and both Moleculo and Population Genetics Technologies (Genome Pooling) have developed their methods to do this.

I had discussed using this on something like the Wheat genome with our Tech Transfer department, but they thought it was not practical or protectable. For Wheat I'd ask RainDance to make a library of emulsion droplets from my 200k BAC clones (in the same way they make primer libraries), combine the amplified DNA with the multiplexed Nextera droplets, and in a single tagmentation get a pretty awesome Wheat genome. It would be possible to use something like Lucigen's NxSeq fosmid prep kit to make a library for the human genome as well.

How does Moleculo work: I still don't know the details, but expect to find out more at AGBT this year, where there are several talks on the technology. It appears to be a combination of miniaturised barcoded genome library prep in microtitre plates, microfluidics or emulsions, and standard NGS. How much DNA it requires as input, and how confident you can be about the likelihood of two reads coming from a unique fragment, will be the most significant issues for users.

The likelihood of generating reads from a single fragment is going to depend on the number of barcodes available and the number of individual reactions performed. The RainDance method I described would allow many 100,000s of tagmentation reactions to be made, so even low-coverage sequencing of each one should yield a robust long-read assembly. There's a whole lot of maths and Poisson distribution statistics that need to be thought through!
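As a sketch of that Poisson thinking (my own back-of-envelope, not Moleculo's actual method): if fragments land in droplets at an average occupancy lambda, the fraction of occupied droplets containing two or more fragments, which would mix loci under one barcode, follows directly from the Poisson distribution.

```python
import math

def multi_fragment_fraction(lam):
    """Fraction of OCCUPIED droplets holding >= 2 fragments when
    fragment counts per droplet are Poisson-distributed with mean lam."""
    p0 = math.exp(-lam)         # P(empty droplet)
    p1 = lam * math.exp(-lam)   # P(exactly one fragment)
    return (1 - p0 - p1) / (1 - p0)

# At lam = 0.1 (dilute loading) only ~5% of occupied droplets are mixed;
# at lam = 1 it rises to ~42%, so dilution is everything:
for lam in (0.1, 0.5, 1.0):
    print(lam, round(multi_fragment_fraction(lam), 3))
```

The trade-off is the usual one for limiting dilution: the more dilute the loading, the purer each barcode, but the more droplets (and barcodes) you need to cover the genome.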

Illumina tells us more on their website and SEQanswers has a thread on the technology.

Thursday 14 February 2013

AGBT... where are the big announcements

The 14th AGBT meeting kicks off in one week. Unfortunately I won't be there this year, so don't expect a daily round-up of what's good from me! For those of you blogging and tweeting, AGBT have crafted some simple graphics to indicate what you can, and more importantly can't, talk about...

There don't seem to have been the big announcements from Illumina, Life and others that we have seen for the past few years. No MiSeq, no Proton; I guess you'll have to make do with what's on offer from the speakers instead, but of course that's the real reason we all go anyway.

Interestingly there appear to be no Gold or Platinum sponsors this year. I know budgets are tight, but who's paying for the parties and picking up the bar tab? A look at the sponsorship page shows that the welcome cocktails have not yet been sponsored, and nor have some of the dinners (I am sure they'll still be on though). And the legendary AGBT rucksack appears to have space for more corporate sponsorship, although I suspect there is no time to add anyone else's logo now.

I looked through the agenda and it includes some fantastic talks, especially on the use of NGS in the clinic. Below is the stand-out stuff I'll be sad to miss. Please do write about these if you are going.

The most interesting clinical talks from my perspective:
On Thursday Christine Eng and Kjersti Aagaard from Baylor present 450 clinical exome sequencing cases and their work on metagenomic medicine. Elizabeth Worthey, Medical College of Wisconsin presents experiences from their WGS-based genomic medicine clinic. Stephen Kingsmore, Children’s Mercy Hospital and Jonathan Berg, UNC Chapel Hill talk about experiences of pediatric genomic medicine and management of incidental findings. Matthew Wiggin, Boreal Genomics is talking about something right up my street: “Multiplexed Detection of Low Abundance, Tumor Related Nucleic Acids in the Plasma of Cancer Patients”. On Friday Steve Scherer, Sick Kids will talk about whole genome sequencing and its use in autism research. And on Saturday Phil Stephens from Foundation Medicine will talk about their genomic profiling of over 1,000 FFPE tumors with the Foundation One assay.

The most interesting non-clinical talks from my perspective:
Gabe Rudy, Golden Helix talks about his family's experience with their 23andMe exomes. Whilst they have no big pre-announcements, Illumina's Geoff Smith will be talking about their acquisition of Moleculo, “10kb Reads on HiSeq” (Geoff gives a good talk and I expect lots of people are looking forward to hearing about this technology). There is another talk on Moleculo from Jeremy Schmutz, HudsonAlpha Institute, “Evaluating Moleculo Long Read Technology for de novo Whole Genome Sequencing”. Michael Schatz, Cold Spring Harbor Laboratory is talking about the use of PacBio (I think) in assembling crop genomes, and X. Sunney Xie, Harvard University is talking about single cell detection of SNVs & CNVs with whole genome sequencing.

PS: I'm writing a post on an idea to create Moleculo style reads which I hope to have completed before the talks.

PPS: Don't forget to enter the competitions!

"My almost"... publication on improving Illumina clustering

I like encouraging people to try to come up with new ideas. One day I want to add 'inventor' to my CV and get a patent. I posted a while ago about an idea that came very close but was pipped to the post by Illumina (although I've still had no feedback about why they have not used their version of the idea!)

But it is not easy turning an idea into reality and getting that nice tangible end result: patent, paper or pat-on-the-back. Sometimes it does not turn out the way you want, and I thought I'd post an example of something we thought up a long time ago that is now out there in general use (I know other groups had much the same idea; coming up with a truly novel idea is the really hard part).

The work was completed by Nik Matthews and Kevin Howe in May, 2008.

NB: This post will be followed by occasional posts on work I wish we'd published but never got round to. It's too easy to forget that a lot of work goes on that never gets published but that still has an impact.

Some clustering history: Originally Illumina adaptors were quite different for single or paired-end and smallRNA applications. When clustering Illumina flowcells we used the "cluster station", and each lane was hybridised with a single sequencing primer in a strip-tube, with floppy manifolds all over the place. In my lab we have always run a very wide range of sequencing; ChIP-, miRNA-, RNA-, Genome-, Etc-seq; and because of this we were, most of the time, having to carefully load different sequencing primers into the strip-tubes and make sure neither the tubes nor the flowcell were inverted on the cluster station. Any mistake in the primer hyb meant first-base sequencing would fail, requiring a primer re-hyb.

Improving clustering in the 'Core Genomics' lab: Our idea to resolve this was a simple one: just add a mix of all primers to the strip-tube and only the correct primer will bind (Figure 1).
Figure 1 a: Libraries and their correct primers (lanes 1-6) produce sequence reads; where the wrong primer is used or the libraries are in the wrong tubes, no sequence data will be generated. b: Using a mix of primers allows any library type, or even mixed libraries, to be sequenced.

This modification to Illumina sequencing simplified our cluster generation protocols and reduced the likelihood of manual errors resulting in poor quality, or no sequencing results. Simply mixing standard and smallRNA sequencing primers removed the need for Illumina’s more complex ‘multi-primer hyb’ protocol. And as a by-product this modification allowed sample multiplexing, achieved by mixing two different library types in a flowcell lane and sequencing both libraries with their respective sequencing primers, using alignment to split the samples.

Our results showed no differences between mixed or normal/multi primer hyb and made primer-hyb quicker, easier and less error prone. We successfully mixed the ChIP-seq and smallRNA libraries described above in a single flow cell lane at 10:1 ratio (only a few reads are needed for smallRNA). The increase in sample throughput was achieved at a slight reduction in total sequence yield per sample, but for many applications a single lane now provides sufficient data.

Make sure your flowcell is the right way round: A final modification we'd asked for all that time ago was a flowcell that cannot be inverted; the simple addition of a notch to one corner would make it impossible to put the flowcell on an instrument in the wrong orientation. Illumina delivered (sort of) and the current HiSeq 2000 flowcells do have the notch; unfortunately the older instruments don't have the corresponding shape to prevent incorrect orientation.

Checking orientation with the data: We used to get a visual clue to orientation with the first generation of Illumina's live run reports. An inversion of the flowcell could be seen by looking at the cluster density visualisation. The “smile” seen is due to the drop in DNA concentration as the clustering solution moves through the lane from bottom to top, resulting in lower cluster density at the ‘out’ end of the flowcell.

We interpreted the images from the Illumina output as a smile for a good flowcell or a frown for a bad one!

PS: Yes; the wrong orientation frown was one of ours! Everyone makes mistakes. 
PPS: To avoid orientation issues we always run PhiX as a 1% spike on lanes 1-7 and 5% on lane 8.

Wednesday 13 February 2013

A genome, a genome my kingdom for my genome... and is there any horse in that burger?

Two big news stories have been in the headlines in the past seven days. The first is the discovery of Richard III's body, the second is the detection of horse DNA in processed meat products. Both made me realise they are great stories to engage our non-scientific mates with; let me know what your friends say down the pub!

Richard III: Last week it was confirmed that a skeleton found under a car park in Leicester, England is none other than that of Richard III. A team from the University of Leicester said analysis of DNA from the bones matched that from his descendants "beyond reasonable doubt". The team used DNA fingerprinting, carbon dating and other methods to try and identify the remains, which showed all the signs of being Richard's body from the description of wounds sustained at the battle of Bosworth. His death ushered in the Tudors (English readers will be aware of the impact the Tudors had on English and European history; American readers will have seen them as over-sexed royalty on HBO).

The team were able to use mitochondrial DNA from Michael Ibsen, a 17th-generation descendant working in London. DNA degradation was an issue, but the technology to work with degraded DNA is moving at such a rapid pace that we might expect a Richard III genome in another year or two.

Q: Is that burger 100% beef? A: Neigh!: DNA has been used to determine the percentage of beef in processed meats, and the scandal is getting hotter and hotter as each day passes. Last night on the BBC news Tesco confirmed that trace amounts of horse DNA were found in processed food. In the same breath the newsreader confirmed that the "trace amount" was 60%! That's marketing for you.

The levels of animal DNA in a product are determined using SNP genotyping or other DNA fingerprinting technologies. The technology used in Ireland that led to the current story comes from Identigen. It was developed in the lab of Prof Patrick Cunningham at Trinity College Dublin (he is a co-founder of Identigen) in response to the BSE crisis of the mid-90s. Their technology allows even highly-processed meat to be traced right back to the farm the animal was reared on.

DNA is collected on the farm as well as anywhere in the meat processing chain; samples are sent to Identigen for analysis and compared to their internal DNA database, allowing products to be traced back to the farm.

With high-throughput genotyping technologies like Sequenom costing just a few pence/cents per genotype, the cost of a comprehensive SNP panel to ID species is likely to be less than $1. If you want to cover all the major farmed animals and ID individual animals, then I'd expect several thousand SNPs would be needed. Comprehensive testing could be performed and it could be incredibly cheap. And there is no reason this technology could not be applied as animals are tagged under current regulations.

DNA fingerprinting is not just for crooks and paternity! CSI meets James Herriot!

Saturday 2 February 2013

The Hobbit: why, oh why, oh why?

Okay, so this post is completely off topic, but I finally got to see "The Hobbit: an unexpected journey" and it is bad. I had been led to believe it was as good as the LOTR trilogy; friends had been to see it multiple times and raved. I will rant...

I thought Peter Jackson was a real Tolkien fan, but now I realise he was the ultimate LOTR nerd. The Hobbit for him appears to have been an opportunity to make another three LOTR movies rather than try to recreate a wonderful tale of hobbits, dwarves, dragons and adventure. Instead the film is bloated by tales from outside of the Hobbit; appendices and stories from the Silmarillion and other places are taken and twisted by Jackson. Azog the goblin is saved from having his head chopped off by Dain (a second cousin of Thorin Oakenshield) and instead is cast as pursuing the dwarves to add action that is just not needed. Unless of course you are trying to squeeze in a few more action scenes reminiscent of LOTR!

Bilbo is played rather poorly by Martin Freeman, although I suspect his direction is more the problem than his acting. Bilbo comes across as a rude hobbit right from the start, and instead of welcoming the dwarves into his house is constantly trying to get them to leave. He wakes up and we are led to believe he finally can't resist the adventure, rather than being pushed into going by Gandalf. When captured by the trolls Bilbo calls himself a "burglar-hobbit" rather than a "burrahobbit"; little things perhaps, but important.

I have been described as a Hobbit purist, and I guess that means I was doomed to disappointment. Film critics gave mixed reviews; Philip French over at the Observer has obviously never read the Hobbit, as his review basically retells the story. Scott Mendelson came much closer to my point of view and suggests that the thing missing from the whole production was someone saying "No!" to Jackson. Rotten Tomatoes gives it 65%.

I will probably wait for the next two instalments to come out on DVD, as a box set, at under £10 and after Christmas...2015.

When you have young kids and can't get out to the cinema often, you really want to make the most of it. The Hobbit was a waste of a babysitter! The best thing about the movie was my wife and I smuggling kebabs and beers into the cinema. Three hours is a long movie. The Malibu & coke-in-a-can was a nice way to finish off the evening's disappointment.