Thursday, 23 February 2012

The Human genome on a single MiSeq run...can it be done?

One of the highlights of AGBT for me was Illumina's announcement of a 687bp "perfect" read on MiSeq. This was the longest read in a PE400bp dataset; the two ends shared a 113bp overlap in which both gave the same basecalls. My first question to Geoff Smith was: how long are the reads with one or two errors in the overlapping sequence?
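To make the arithmetic explicit: two 400bp ends that overlap by 113bp span 400 + 400 - 113 = 687bp. The sketch below is only an illustration of that idea, not Illumina's method. The function names (`merged_length`, `merge_perfect`) and the `min_overlap` parameter are my own, and it assumes read 2 is reported as the reverse complement of the far end of the fragment, which is the usual paired-end convention.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merged_length(read_len: int, overlap: int) -> int:
    """Length of the fragment spanned by two equal-length reads that overlap."""
    return 2 * read_len - overlap


def merge_perfect(read1: str, read2: str, min_overlap: int = 10):
    """Join read1 and the reverse complement of read2 when the end of read1
    matches the start of that reverse complement exactly.

    Tries the longest candidate overlap first. Returns the merged sequence,
    or None if no exact overlap of at least min_overlap bases is found.
    (Hypothetical helper: it ignores quality scores and tolerates no mismatches.)
    """
    r2 = reverse_complement(read2)
    for ov in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-ov:] == r2[:ov]:
            return read1 + r2[ov:]
    return None


if __name__ == "__main__":
    print(merged_length(400, 113))
```

Allowing one or two mismatches in the overlap, which is what my question to Geoff was really about, would mean replacing the exact string comparison with a mismatch count.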

I wonder if Illumina will be able to generate 1000bp reads on MiSeq?

Will Illumina be able to generate 700+bp reads on HiSeq2500? And if so what would you do with 250M of them?

Personally I am not certain our view on sequence error rate needs to remain the same given the length of reads. What can we do if we accept a higher per base error rate but trade this for longer and longer reads? Asking this question of people at CRI got me thinking.

The Celera Human genome was the first really huge "shotgun" genome effort, and shotgun sequencing is the approach we use almost exclusively today. I know they used publicly available data in their assembly, but the bare bones of their work was 27M sequence reads totalling 15Gb of sequence from shotgun clone libraries. These Sanger reads were just 500-700bp long, or about the length of Illumina's latest MiSeq reads. And the MiSeq is about to ramp up the number of reads to 25M.
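A back-of-the-envelope way to compare these numbers is fold coverage: total bases sequenced divided by genome size. The sketch below assumes a round 3Gb human genome, and the mean read lengths passed in are illustrative guesses within the ranges quoted above, not measured values. The function name `fold_coverage` is my own.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: a round 3Gb human genome


def fold_coverage(n_reads: int, mean_read_len: float,
                  genome_size: int = HUMAN_GENOME_BP) -> float:
    """Average number of times each base in the genome is sequenced."""
    return n_reads * mean_read_len / genome_size


if __name__ == "__main__":
    # 27M Sanger reads at an assumed 550bp mean (roughly 15Gb in total).
    print(f"Celera-style shotgun: {fold_coverage(27_000_000, 550):.1f}x")
    # 25M MiSeq reads, assuming every pair merged to about 600bp (optimistic).
    print(f"MiSeq, 25M reads:     {fold_coverage(25_000_000, 600):.1f}x")
    # 250M reads of 700bp on a HiSeq 2500, if that ever becomes possible.
    print(f"HiSeq, 250M reads:    {fold_coverage(250_000_000, 700):.1f}x")
```

Under these assumptions the 27M Sanger reads and a 25M-read MiSeq run both land at roughly 5x, which is why the comparison is tempting. Coverage alone says nothing about insert sizes or error rates, though.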

The thing we find hardest today is creating a good mix of reads from different insert sizes. Celera used 2, 10 and 50kb libraries, in a mix of 50% from 2kb, 40% from 10kb and just 10% from 50kb libraries. We can make the 2 and 10kb (just) today but 50kb libraries for sequencing on NGS are almost unheard of. As library prep methods get better, and if we can make mate-pair libraries with Nextera in the future, these kinds of issues become less of a problem.

I think the longer reads on MiSeq will have a large impact in the short-term. It will be great to see if they can be realised on HiSeq 2500 as well.

MiSeq v.X: The speed at which MiSeq runs is what allows these longer reads. How could we make it run even faster, and would this allow even longer reads? Could Illumina build a microfluidic chip sequencer? This might allow microlitre sequencing reactions on a tiny disposable flowcell incorporating the fluidics, and generate reads of well over 1000bp. Even if we only got 1M or so they would be incredibly useful. And a microfluidic sequencer should be cheap to produce and use, as very little reagent would be consumed. Maybe this could be competition for a MinION?

PS: In about 2000 we bought a Celera 3700 for use in my lab (not the one I am in now). It was completely knackered!


  1. James:

    My own view is that the Oxford Nanopore and related technologies will bury all the mate pair stuff, but if I'm wrong there is some interesting technology from Lucigen, which is basically a fosmid vector that can be converted en masse to mate pair libraries. So that would give you mate pairs in the 40kb range.
