Monday, 23 May 2016

Increased read duplication on patterned flowcells: understanding the impact of Exclusion Amplification

Next-generation sequencing is fantastic technology and its use has revolutionised our understanding of biology, but it is not perfect: multiple issues occur in every lab, from sample extraction through to the actual sequencing. Not all of these are well enough understood to be safely ignored, and in this post I'm going to talk about one that I'm trying to better understand right now - duplication of sequences in datasets, and in particular Exclusion Amplification duplication on HiSeq 4000.


Sources of read duplicates in Illumina data - courtesy Illumina 2016

What are duplicates in NGS data: There are several types of duplication that can occur in an NGS experiment (see graphic from Illumina above) and understanding the difference between these should concern us if we are going to "fix" the problem in the lab or bioinformatically:
  1. true biological duplication - a biological sample contains millions of cells and therefore millions of copies of the genome, so in a PCR-free whole genome sequencing library virtually no sequences should occur more than once in any lane. Even for PCR-amplified whole genome sequencing libraries most sequences will appear only once in the lane. Low-level duplication may be due to repeats, but high levels of duplication would be suggestive of a technical issue e.g. PCR over-amplification or target enrichment.
  2. PCR and enrichment duplicates - the easiest for everyone to get their head around. PCR amplifies DNA by copying the original molecules so you get duplicates. Start with a high enough input and you should not notice these PCR copies, but if you start with a very low input then duplicate reads in your final sequencing lane are much more likely and may be a problem. If there are biases in your PCR this will also over-amplify some regions leading to duplication that can affect biological interpretation. And if you are targeting a specific region of the genome then you will expect to see more duplication...as you have fewer unique regions in your sample to be analysed.
  3. Optical duplicates - these are only a problem for HiSeq 2500/MiSeq/NextSeq data. They come from large clusters being called as two separate clusters, or from local re-clustering of the original library molecule once it is released from polymerase copying. A read is called as an optical duplicate if the pair of reads are both on the same tile, and the distance between reads is less than the distance set in Picard's "Mark Duplicates".
  4. Exclusion Amplification duplicates (ExAmp) - these are only a problem for HiSeq 4000/X patterned flowcell data. They come from local re-clustering of the original library molecule once it is released from polymerase copying. Picard's "Mark Duplicates" can still be used to call optical duplicates, and it would be great if these could be marked in the SAM file separately from PCR duplicates.
A bit more on where ExAmp duplicates come from: The clustering on Illumina's patterned flowcells is very different from the original Solexa/Manteia developed method. Rather than random clustering with spacing determined by the concentration of library molecules, clustering takes place in controlled nanowells. A library molecule "lands" and is clustered before a second molecule can arrive in that location, but if a second molecule arrives early enough then a polyclonal cluster is formed, which will be discarded in the chastity filtering. See this video on Illumina's website - but look carefully at 1 minute into the video, where you can see the original library molecule leaving the clustering area and floating off to create havoc elsewhere on the flowcell!

The ExAmp duplication is happening because library molecules that have made one cluster are free to go back into solution and create a second cluster, and they tend to do this in close proximity to the original cluster. As such they can be counted in much the same way as optical duplicates.
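To make the "same tile, close together" idea concrete, here is a minimal Python sketch of a proximity check between two reads that are already known to align to the same place. It assumes the colon-separated Illumina read name layout (instrument:run:flowcell:lane:tile:x:y). The function names and the straight-line distance are my own simplification and not Picard's implementation. The 2500 default is borrowed from the value the Picard documentation suggests for patterned flowcells (its own default is 100), so treat it as a starting point rather than a recommendation.

```python
import math
from typing import NamedTuple


class ReadLocation(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_location(read_name: str) -> ReadLocation:
    """Pull lane, tile, x and y out of an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, optionally
    followed by a space and a comment field.
    """
    fields = read_name.split()[0].lstrip("@").split(":")
    lane, tile, x, y = (int(value) for value in fields[3:7])
    return ReadLocation(lane, tile, x, y)


def is_proximity_duplicate(name_a: str, name_b: str, max_distance: float = 2500) -> bool:
    """True if both reads sit on the same lane and tile within max_distance.

    The caller is assumed to have already established that the two reads
    are duplicates by alignment; this only asks whether they are neighbours.
    """
    a, b = parse_location(name_a), parse_location(name_b)
    if (a.lane, a.tile) != (b.lane, b.tile):
        return False
    return math.hypot(a.x - b.x, a.y - b.y) <= max_distance
```

Duplicates that pass this check are candidates for optical or ExAmp duplication; duplicates on different tiles are more likely to have come from PCR.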

How to reduce ExAmp duplication: There is a balance to be struck in clustering on patterned flowcells between high %PF data i.e. no polyclonal clusters, and low % duplication. Ideally we'd like both but patterned flowcells don't allow this. As you increase the loading concentration of an Illumina library onto a patterned flowcell you increase the rate at which molecules land for clustering, which means you get more polyclonal clusters. But if you load your libraries at too low a concentration you give those library molecules that have clustered an opportunity to recluster in unoccupied nanowells, and this means you get higher %duplicates. In our HiSeq 4000 training we discussed this in some depth but the simple fact is users need to make a choice between fewer duplicates or fewer polyclonal clusters.




Not all duplicates are created equal: But all is not lost, as duplicates can be understood and removed or accounted for in your downstream analysis. Although tools like FASTQC present a graph showing sequence duplication levels, understanding and interpreting duplication is complicated for many users by the fact they are working with single-end sequence data, e.g. most RNA-seq or ChIP-seq datasets. In the worst cases up to 70% of your RNA-seq reads can be called as PCR duplicates. After you've paid for 20M reads per sample, getting 6M unique reads might leave you feeling a little short changed to say the least! But are those 14M duplicates really something you need to worry about?
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Duplication presented in FastQC
FASTQC only analyses the first 100,000 reads in your FASTQ, and at the sequence level there is no way to distinguish between biological and technical duplication, so both are reported as duplicates in the chart above. A FASTQC warning or error is context dependent. For PCR-free genomes it should be taken seriously, but for RNA-seq and similar applications it may be an unnecessary worry. In fact the FASTQC team specifically mention this in their documentation: "In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge dupliction levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates."
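To show what a purely sequence-level duplication estimate looks like, here is a rough Python sketch that counts exact repeats among the first 100,000 reads of a FASTQ. This is not FASTQC's actual algorithm - the function names and the simple "1 minus unique over total" measure are my own assumptions - but it illustrates why such a number cannot tell a highly expressed transcript from a PCR copy: all it sees is identical strings.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (second line) of each four-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def duplication_fraction(lines: Iterable[str], max_reads: int = 100_000) -> float:
    """Fraction of the first max_reads reads that repeat an earlier sequence."""
    counts = Counter(islice(fastq_sequences(lines), max_reads))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total
```

Usage would be along the lines of `with open("sample.fastq") as handle: print(duplication_fraction(handle))`, where `sample.fastq` is a placeholder filename.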

Identifying and dealing with duplicates: Fortunately most NGS analysis pipelines have options (usually recommended) to remove or mark duplicates in downstream analysis e.g. the widely used Picard’s MarkDuplicates. This marks reads as duplicates based on their genome co-ordinates after comparing the alignment of their 5' ends: reads with identical 5' ends are duplicates and only the highest quality read is left unmarked. Duplicate inserts are marked in the SAM file, allowing downstream GATK tools to exclude duplicates from analyses (most do this by default). But you should understand the impact of duplication, and/or the impact of removing duplicates on your experiments before you decide to use this information. There are lots of conversations out there to help - on SEQanswers Malachi Griffith and Heng Li made several comments on the thread Removing duplicates is it really necessary? And in a 2010 Genome Biology exome paper the authors showed data from both single-end and paired-end sequencing. Marking duplicates in both showed very different results of ~28% and ~8% duplicates respectively. But when they reanalysed paired-end data as single-end the % duplication jumped back up. This is because of how the duplicates are being marked (see Picard explanation below), and as such users running single-end experiments e.g. RNA-seq should be very careful not to throw out lots of useable data in the mistaken belief they have too much PCR, or other duplication.
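The single-end versus paired-end difference is easier to see with a toy example. The Python sketch below is a deliberately simplified stand-in for coordinate-based duplicate marking and is not how Picard is implemented: the `Fragment` type and function names are hypothetical, and it ignores strand, base quality and soft-clipping. With single-end data only one 5' end is available as the key, so independent fragments that happen to start at the same base look identical. With paired-end data the other end of the insert is part of the key, which separates them.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Sequence, Tuple


class Fragment(NamedTuple):
    chrom: str
    read1_start: int  # 5' end of read 1
    read2_start: int  # 5' end of read 2, i.e. the other end of the insert


def count_duplicates(keys: Iterable[Tuple]) -> int:
    """Count every entry beyond the first one seen for each key."""
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


def single_end_duplicates(fragments: Sequence[Fragment]) -> int:
    """Duplicates when only the read 1 position is known."""
    return count_duplicates((f.chrom, f.read1_start) for f in fragments)


def paired_end_duplicates(fragments: Sequence[Fragment]) -> int:
    """Duplicates when both ends of the insert must match."""
    return count_duplicates((f.chrom, f.read1_start, f.read2_start) for f in fragments)


# Made-up fragments: three share a read 1 start, but only two share both ends.
example = [
    Fragment("chr1", 100, 350),
    Fragment("chr1", 100, 420),
    Fragment("chr1", 100, 350),
    Fragment("chr2", 500, 760),
]
print(single_end_duplicates(example), paired_end_duplicates(example))
```

In this made-up example the single-end view calls two of the four fragments duplicates while the paired-end view calls only one, which is the same direction of effect as treating paired-end data as single-end.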

Picard "MarkDuplicates": Picard MarkDuplicates does not distinguish between different types of biological or PCR duplication. But it can help identify if a sequencing dataset has a problem with optical, and probably ExAmp, duplicates allowing the user to proceed with caution. Ideally I'd like to be able to remove both optical and ExAmp duplicates for downstream analysis, by marking them separately in the SAM format.

You would not be expecting a highly diverse library if you're sequencing targeted amplicons, or might expect a reduction in diversity if sequencing a low-input genome library. And both RNA-seq and ChIP-seq involve a lot of PCR amplification that may not necessarily impact downstream analysis and interpretation. But unless you are certain that duplicate marking is a waste of time, I'd recommend you do it anyway.

However whether you should remove those apparent duplicates is not necessarily clear. We don't for our standard RNA-seq pipeline. And I'm going to focus my next post on how we decided not to remove duplicates in most RNA-seq experiments.

Thursday, 19 May 2016

Happy 10th birthday NGS!

NGS is 10...according to the latest Nature Reviews Genetics: Coming of age: ten years of next-generation sequencing technologies. Just by chance I was asked to give a talk to explain how Illumina sequencing works in a technology seminar series being delivered by the Core Heads at the CRUK Cambridge Institute, and as part of that I uploaded a slide-deck and created some animations for Twitter to explain how clustering works...I hope you like them.

Illumina paired-end dual-index clustering and sequencing
and here is a "slow-mo" version for people who could not keep up with the frame-rate!

Wednesday, 18 May 2016

How will Foundation's recent patent announcement affect the cancer testing environment?

Foundation Medicine were yesterday granted US Patent 9,340,830 "Optimization of Multigene Analysis of Tumor Samples". This is likely to stir up the can of worms that is tumour testing by NGS and is another patent in a complex landscape. The claims basically cover WGS library prep, exome sequencing, alignment and variant calling. It covers all sorts of mutation calling including SNVs at low freq (5%) and mid-freq (10%) or above, SNPs to assess CNV & LOH, fusions and other structural variants, as well as pharmacogenomic SNPs. It also includes in the test a DNA fingerprint.


Michael Pellini, Foundation's CEO, appears to be using some very positive language in describing the award of this patent; he said "we do not intend to block the use of methods covered by the patent in patient testing that may be offered by others". But how much of the patent claim is truly novel and might stand up in court remains to be seen. The basic idea of exome sequencing patients is old hat and Foundation were certainly not the first people to be doing this. The SNP ID of patients is an idea even I'd proposed over four years ago (here & here). But if Foundation's patent makes it harder for others to clamp down on competition that can only be a good thing.

Monday, 16 May 2016

How many genomes can the world sequence per year on X Ten?

Illumina's X Ten was a major announcement. It arguably delivered the "$1000 genome", kickstarted national population genomics as a science, and delivered the final blow to Complete Genomics. It is debatable as to whether Illumina needed to release X Ten at $1000, or with such huge capacity; and it is unclear what the X Ten economics really look like for customers or Illumina themselves, but the world can now sequence an unprecedented number of genomes: about 160,000 per year!


Where are all the X Tens: There look to be over 250 X Ten instruments, I searched online and at AllSeq (great) and Genohub (not so great). Installations include - Baylor College of Medicine (10), Broad Institute (14), CEN4GEN (5), Centre National de Génotypage (5), Core Genetics (?), DKFZ (10), Garvan Institute of Medical Research (10), GENEWIZ (10), Genome Quebec (5), GRAIL (?), HudsonAlpha (10), Human Longevity Inc (20), Macrogen (10), McDonnell Genome Institute (10), New York Genome Center (10), Novogene (10), SciLifeLab (10), Sidra Medical and Research Center (10), SNP&SEQ (?), Theragen Etex Bio (10), Wellcome Trust Sanger Institute (10), WuXi AppTec (10), Genomics England (Illumina) (30), Scottish Genomes Partnership (15)

How genomes are actually being sequenced: While each X Ten box can generate one thousand $1000 genomes per year it is unclear how heavily they are being used. There are very few reports of X Tens being used at capacity and even 80% capacity seems to be optimistic, so a large number of those boxes may be sitting idle. The reasons for this are likely to vary from lab to lab but three main factors need to be considered:
  1. Sample collection: getting enough patients recruited
  2. The cost of sequencing: finding the cash
  3. Analysis and interpretation: hiring enough Bioinformaticians
We're just about to send off our first X Ten project for 1000 genomes...tough to do on two HiSeq 4000s!

Friday, 6 May 2016

BaseSpace updated: no more "free" bioinformatics

I've been a big fan of Illumina's BaseSpace since it was launched in late 2011. It was the first truly simple to access and free to use cloud-based analysis infrastructure for NGS data. I've used it a lot - primarily for run monitoring, but also for RNA-seq and Exome QC analysis using the BaseSpace Apps. But the free-for-all analysis Smörgåsbord is ending, and users will be looking carefully at the costs to determine if BaseSpace offers real value when compared to their internal infrastructure.


Tuesday, 3 May 2016

How many reads to sequence a genome?

Last year I posted about the Lander-Waterman equation used to calculate the number of reads needed to sequence a sample. I explained that this general equation (C = LN/G) can be rearranged to allow you to compute the number of reads (N) to sequence a genome of known size (G) with specific coverage (C) and using reads of a specified length (L). Today I finally finished my Calculoid NGS reads calculator to make it easier to access...feel free to use and reuse as you see fit.
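Rearranging C = LN/G gives N = CG/L, which is simple enough to sketch in a few lines of Python. The function name and the rounding up to a whole read are my own choices, and the 3 Gb genome size in the example is a round illustrative figure rather than an exact one.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Number of reads N for coverage C of a genome of size G with reads of length L.

    Uses N = C * G / L, rounded up to a whole read.
    """
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Illustrative example: 30x coverage of a 3 Gb genome with 150 bp reads.
print(reads_needed(3_000_000_000, 30, 150))
```

For paired-end runs each pair contributes two reads of length L, so the number of clusters needed is half of N.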


Friday, 29 April 2016

SPRI alternatives for NGS

SPRI beads, generally in the form of AMPureXP beads, are almost ubiquitous in genomics applications such as library prep for NGS. The most popular thing I've ever written was a post on this blog four years ago: "How do SPRI beads work?" with almost 100,000 readers - people obviously want to understand how this wonderful technology can be used. They'd also like it to be cheaper as the Agencourt AMPureXP product (now Beckman) is somewhat expensive - so I've taken a look at some of the alternatives on the market, including DIY SPRI!

Thursday, 28 April 2016

The junior doctors strike...as explained to me

I met two dads last night, one a junior doctor picking his daughter up from the same Karate class as my son; the other, a consultant, at the pub after a long day (covering for junior doctors). Both of them spoke about the junior doctors strike and some of what they said is mind boggling. How have we come to such a sorry state in the NHS? Are doctors simply overpaid since contracts were changed by Blair and co? Are the managers of the NHS just after balancing a budget with no thought for the impact their decisions might have on other departments? And does Jeremy Hunt deserve to have his name used as Cockney rhyming slang?

This is how the situation was explained to me:
  • A BMJ paper in September 2015 showed that patients are 10-15% more likely to die if they are admitted on a weekend. [This work has been criticised as flawed due to the complexities of comparing patients admitted on different days - you may actually be sicker at the weekend. The paper also has an erratum, as it turned out that one of the authors, Bruce Keogh (medical director of NHS England), is "a long standing proponent of improving NHS services seven days a week". Why the BMJ should have to "have requested that this be included in the authors’ conflict of interest statement" I don't know - seems obvious to include it to me.]
  • Because of this the government in England has decided that NHS services should be available seven days a week.
  • They have proposed a new contract for junior doctors that changes the hourly rate they get after 7pm and at the weekends, and they won't get annual pay increments, instead pay progression will be linked to training.
  • Junior doctors already work weekends.
  • Junior doctors say the new contract will not add new doctors to the system, but it will make them cheaper to pay at the weekend.
  • Both parties cannot come to an agreement so junior doctors have gone on strike.
How will pay change: The Financial Times has a nice graphic showing how the current pay structure compares to those proposed by the government or the BMA. And the BBC also have a graphic showing the junior doctors rates for unsocial hours. I used their numbers to try and see how different the positions are and it really did not seem to be worth striking over - I make the BMA's suggestion cheaper than the government's!?!

From FT.com

The aim appears to be to get more doctors working at weekends, and it should make paying them at weekends cheaper than today so money will be saved. But the consultant I spoke to used a single example to show how much waste is happening in the NHS - and that fixing this would have far more impact than changing junior doctors rates of pay.

An example of NHS waste: The consultant described an elderly patient who needed a mental health referral before discharge. However the person who would do this was on holiday for a week, so the consultant's patient had to stay in hospital. This blocks the bed for 7 days at around £400 per night (that's the equivalent of The Goring in London, or The Waldorf Astoria in New York). If 10 patients are in the same situation then the bill for this person's one week's holiday comes to £28,000, or the equivalent of a junior doctor's salary...and most people get six weeks holiday in the NHS.

The NHS does wonderful things for the UK but successive governments have seriously messed it up. My wife recently spent time in Addenbrookes and whilst she received adequate care the communication from doctors and nurses was abysmal, the food can not be described using any positive adjectives, and the accommodations are pretty terrible given the cost of the bed for the night. I'd love to sing the praises of the NHS but it does feel like big changes are needed...but I am not suggesting I can come up with any useful ideas!





Thursday, 21 April 2016

Fluidigm's single cell issues appear to be fixed..sort of

Fluidigm hosted a webinar this afternoon to describe the performance of the redesigned IFCs for medium-sized cells (10-17um), and they've also released a whitepaper on their website describing their work. The redesigned medium-cell 96 IFCs have a >4-fold reduction in doublet rate, to around 7%.

This is a significant improvement (and I've written up my notes from the webinar below)...but is this level of doublets low enough for the kind of experiments single-cell users might be planning?



Monday, 11 April 2016

WGS versus Exome: what's best for cancer?

I've just started using Twitter polls and kicked things off with a couple asking followers whether they thought cancer research (or diagnostics) is best investigated using whole genome sequencing, higher depth exomes, much higher depth amplicomes, or even long-reads. Please do take the poll and join in the conversation here or over at Twitter.


Here's a link to the cancer research poll, and here's one to the cancer diagnostics poll. Pass them on to your colleagues. I'm going to follow up in a week with a post outlining my thoughts and particularly where long-reads might have a real impact in understanding structural variation in cancer.