Wednesday 11 June 2014

Why no PhiX on BaseSpace: %Q30 vs error rate, should you choose between them

We just had a shiny new NextSeq installed and the validation run gave 480M reads (about 20% above the spec of >400M reads) and 83.1%Q30 (against Illumina's spec of >75%Q30 at 2×150bp). But the initial reaction internally was that the error rates looked too high meaning we were likely trim back to 100bp. The validation run showed error rates of 0.22%, 0.37%, 0.43% at 35, 75 and 100bp respectively, and about 4% at 150bp.

I looked for some other NextSeq data on BaseSpace but none is currently available and the other runs I looked at did not report error rate. Why doesn't Illumina include error rates in their BaseSpace data? Probably because if no PhiX control sample was included, the chart is unavailable: as appears to be the case in these runs Genome in a day, PE150 HiSeq2500, Nextera rapid capture exome. One run I looked at did show error rates from a Metagenomics experiment of 2x250bp MiSeq v3 chemistry: 74.1%Q30 and error rates of 0.6%, 0.79%, 0.86% at 35, 75 and 100bp respectively, and over 5% at 300bp.

Why all the fuss about error rate: First off I'll say we rarely take a single metric in isolation when performing QC analysis for data we generate in my lab. Both error rate and %Q30 are important; as are %PF, phasing/prephasing and a host of other metrics.And of course MGA!

The new platforms are pushing out some incredible data, runs pass spec with evidently high %Q30 but the one metric that comes up time and time again in discussions I have with users and bioinformaticians is error rate. It has become the de facto measure by which reads are trimmed and 1% error seems to be the cut-off were using in analysing cancer genomes. So I started putting this post together as I wanted to know: what is the difference between %Q30 and error rate, can we use either in isolation, and are we being too conservative with a 1% error rate cutoff?
What do other people say about error rates: Firstly Illumina have a technote on how they compute quality scores, although this changed a while ago and introduce a "step" in the Qscore for the first few bases. Of course not everyone likes Illumina's quality scores and the Broad's GATK has a base quality score recalibration tool. I was also pointed to a post by BWA author Heng Li on the BWA mailing list:

"You don’t need to do quality trimming with bwa-mem. BWA-backtrack requires reads to be mapped in full length. Low-quality tail may greatly affect its sensitivity. Bwa-mem largely does local alignment. If a tail cannot be mapped well, it will be soft clipped."

What is the impact of errors: In a very simplified example of 3% error in all of the last 50bp of a 150bp read and high genome coverage, we effectively get a 1% error rate in the data. If this is a ChIP-seq experiment and mapping is the default analysis then errors are easy to tolerate. Even if this is a germline genome experiment and you'll be looking for genotype calls, this error rate should not be a big issue. But if you are running cancer samples and wanting to look at somatic mutations at differing allele frequencies in heterogeneous samples with normal contamination this 1% error rate may put a limit on what you can do with the data. In fact it might prevent some analyses from being run at all. However many of the tools available today use quality scores in variant calling so the problem is not as bad as it may seem at first glance.

We've been working with groups developing methods to look for somatic mutations in circulating tumour DNA. So far this has been down to the 1-2% MAF (mutant allele frequency). An across the board 1% error rate could be disastrous. Even a 0.1% error rate may limit MAF calling to 1% (assuming we stick with a 10:1 signal to noise). As such it appears that the habit of trimming data back to the highest quality is hard to give up! We could improve our data with quality trimming and compare variant calling performance against a gold standard, but we can't easily find a comprehensive gold-standard for these complex low-frequency mutant alleles.

Stuart Brown over at NextGenSeq discussed using overlapping reads and recalculating Q-scores by summing them (product of the two error frequencies) and allowing variants at one in ten thousand  or better to be called with a low false positive rate. So all hope is not lost, unless you don't have enough overlap.

To trim or not to trim: Error-, or more accurately mismatch-rate has an affect on the mapping of reads. Tools like Bowtie, BWA and NovoAlign try to align the whole read and will throw away reads with too many mismatches (i.e. you lose data). Some informaticians will hard trim based on quality to a specific length. BWAmem works with the data (simulation suggests alignments should be good given 2% error for an 100bp alignment, 3% error for a 200bp, 5% for 500bp MiSeq PE300 anyone?). Some tools: Trimmomatic, CutAdapt, TrimGalore, allow "smart" trimming of data so only the higher error portions of reads are lost, you still lose data but perhaps not as much as with tools that throw the whole read away.

So back to my earlier questions...
  1. What is the difference between %Q30 and error rate: Error rate is calculated based on our spiked in PhiX control sample, and shows us the number of bases where we have a mismatch to the control. Q scores are defined as a property that is logarithmically related to the base calling error probabilities (P) squared: Q30 = 1 error in 1,000 base calls or an accuracy of 99.9%.
  2. Can we use either in isolation: Either will suffice for a quick check on the quality of a run, useful for monitoring if the run was OK, etc. But both can be misleading when deciding if data is good enough for a specific application. Don't use either in isolation!
  3. Are we being too conservative with a 1% error rate cutoff? Possibly not. Q30 is an error every 1000bp or 0.1%. If you are expecting true variants at this rate then your in at the deep end trying to find the real variants from the noise.
Why is error rate missing: Quite simply there is no PhiX. Why is it absent? Illumina's TechSupport still recommend we spike PhiX at 1-5% for use as a quality control and troubleshooting. I'd like to QC their data but the one thing I want to look at is not there. Come on Illumina please add PhiX to your own runs and let us get the whole picture.


  1. I agree with your comments about NextSeq data. I'm not sure it's totally ready for 150bp reads yet - definite tailing off in quality for both the PhiX validation & our own samples. Luckily we are almost exclusively using it for RNA-Seq so the maximum length we do is 75PE (well, 80PE using some spare indexing cycle)

  2. There is a (very?) recent comparison dataset in SRA: "Comparison of NextSeq500, HiSeq, GAIIx and MiSeq Platforms with PhiX Phage Genome sequencing"

  3. even perfectly reflecting the quality of a base called, quality score underestimates true error of a read because a base not called (deletion) don't have a quality score.

    As for 1% error rate cut off for calling 0.1% variant, shouldn't we expect reasonable coverage should produce a consensus read that the error rate is much lower?


Note: only a member of this blog may post a comment.