Saturday, 19 January 2013

Understanding where sequencing errors come from

Next-generation sequencing suffers from many of the same issues as Sanger sequencing as far as errors are concerned. The huge amounts of data generated mean we are presented with far higher numbers of variants than ever before and screening out false positives is a very important part of the process. 

Most discussion of NGS errors focuses on two main points: low-quality bases and PCR errors. A recent publication by a group at the Broad highlights the need to consider sample processing before library preparation as well.
Low-quality bases: All the NGS companies have made big strides in improving the raw accuracy of the bases we get from our instruments. Read lengths have increased as a result and continue to do so. The number of reads has also increased to the point that, for most applications, over-sequencing is the norm, giving coverage high enough to rule out most issues with low-quality base calls.
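
To put a rough number on why coverage helps, here is a minimal sketch. It assumes base qualities are Phred-scaled (Q = -10 log10 p), that errors in different reads are independent, and that a miscall is equally likely to be any of the three wrong bases. None of these assumptions comes from the post, the function names are made up for illustration, and real error profiles are messier.

```python
from math import comb


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def prob_at_least_k_errors(depth: int, k: int, q: float) -> float:
    """Chance that k or more of `depth` reads show the SAME wrong base
    at one position purely through sequencing error.

    Assumes independent errors and that each of the three wrong bases
    is equally likely (a simplification).
    """
    p = phred_to_error_prob(q) / 3
    # Binomial probability of seeing fewer than k such errors.
    below_k = sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i) for i in range(k)
    )
    return max(0.0, 1 - below_k)


# Illustrative values only: Q30 bases at 30x coverage.
print(prob_at_least_k_errors(depth=30, k=1, q=30))
print(prob_at_least_k_errors(depth=30, k=3, q=30))
```

Under these assumptions, requiring several supporting reads makes a pure base-calling artefact very unlikely, which is the argument for over-sequencing. It does nothing for errors that are already present in the molecules before they reach the sequencer.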
PCR errors: All of the current NGS systems use PCR in some form to amplify the initial nucleic acid and to add adapters for sequencing. The amount of amplification can be very high, with multiple rounds of PCR for exome and/or amplicon applications. Several groups have published improved methods that reduce the amount of PCR or use alternative enzymes to increase the fidelity of the reaction, e.g. Quail et al.
We also still massively over-amplify most DNA samples to get through the “sample-prep spike”: PCR is used to amplify ng quantities of DNA purely so the library can be robustly quantified, only for it to be diluted back to pg quantities for loading onto the sequencer. Efforts to remove the need for this spike have resulted in protocols like Illumina’s latest PCR-free TruSeq sample prep. Most labs also now quantify libraries by qPCR, which allows much less amplification to be used. However, we still get people submitting 100 nM+ libraries to my facility, so I try to explain that they can use less amplification and that it will almost certainly improve their results.
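
The size of the spike is easy to see with some arithmetic. This is a rough sketch, assuming the usual approximation of 660 g/mol per base pair of double-stranded DNA. The 400 bp fragment length and 10 pM loading concentration are hypothetical values chosen for illustration, not figures from any particular protocol.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair, a common approximation for dsDNA


def ng_per_ul_to_nM(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a library concentration from ng/ul to nM."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_factor(stock_nM: float, target_pM: float) -> float:
    """Fold dilution needed to take a stock in nM down to a target in pM."""
    return stock_nM * 1000.0 / target_pM


# Hypothetical library: 10 ng/ul with a mean fragment length of 400 bp.
print(ng_per_ul_to_nM(10, 400))

# A 100 nM library loaded at a hypothetical 10 pM.
print(dilution_factor(100, 10))  # 10000.0
```

A 10,000-fold dilution means nearly all of the amplified material is thrown away, so the extra PCR cycles cost fidelity without adding anything that ends up on the sequencer.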
What about sample extraction and processing before library prep? Gad Getz's group at the Broad recently published a very nice paper presenting their analysis of artifactual mutations in NGS data that were a result of oxidative DNA damage during sample prep.

These false positives are not a huge issue for germline analysis projects using high-coverage sequencing, as they can be easily removed. However, for cancer analysis of somatic mutations or population-based screening, where low allelic fractions can be incredibly meaningful (e.g. analysis of circulating tumour DNA), they present a significant issue. Understanding sequencing and PCR errors helps correct for some of these false positives, but the Broad group demonstrate a novel mechanism for DNA damage during sample preparation using Covaris fragmentation. They do not say stop using your Covaris (a nice but expensive system for fragmenting genomic DNA); rather, they provide a method to reduce the oxidation of guanine to 8-oxoG by adding antioxidants to samples before acoustic shearing.
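
The difference between the germline and low-allelic-fraction cases can be illustrated with a toy filter. This is only a sketch: the read counts and the 20% threshold are hypothetical, and it is not the filtering method used in the paper.

```python
def allelic_fraction(alt_reads: int, depth: int) -> float:
    """Fraction of reads at a site that support the alternate allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def passes_af_filter(alt_reads: int, depth: int, min_af: float) -> bool:
    """Keep a call only if its allelic fraction reaches min_af."""
    return allelic_fraction(alt_reads, depth) >= min_af


# Hypothetical sites, each at 100x coverage.
germline_het = (48, 100)    # true heterozygous variant, about 50% of reads
damage_artefact = (2, 100)  # artefact from damaged template, about 2%
ctdna_variant = (1, 100)    # true tumour-derived variant, about 1%

for name, (alt, depth) in [
    ("germline het", germline_het),
    ("damage artefact", damage_artefact),
    ("ctDNA variant", ctdna_variant),
]:
    print(name, passes_af_filter(alt, depth, min_af=0.2))
```

A simple threshold cleanly separates the germline call from the artefact, but the same threshold discards the genuine low-fraction variant. Lowering it to rescue that variant lets the artefact back in, which is why damage introduced before library prep matters so much for this kind of experiment.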

They also make the point that their discovery of, and fix for, this issue made them think about the multitude of possibilities for similar non-biological mechanisms to impact our analysis of low allelic fraction experiments.
I think they are somewhat overly pessimistic about the outlook for cancer sequencing projects in questioning “whether we can truly be confident that the rare mutations we are searching for can actually be attributed to true biological variation”. The number of samples being used in studies today is increasing significantly and they are coming from multiple centres; I’d hope that both of these will reduce the impact of these non-biological issues. But I wholeheartedly agree with the authors that we should all stop and think about what can go wrong with our experiments at all steps, and not just focus on whether the sequencing lab has generated “perfect” sequence data or not!
PS: In the spirit of following my own advice, I linked to the papers in this post using the DOI; hopefully blog consolidators will pick it up and add it to the commentary about this article.

1 comment:

  1. Thanks for the great information. I'm currently looking at multiple runs of different transfections of the same sequence with the same point deletion. In the NGS results, there is a near-noise peak of the right type where my deletion occurs. What kind of confidence can I have that the problem is with my constructs or with the sequence?