Friday, 6 March 2015

10X Genomics: what's the fuss over phasing

At AGBT 2015 the big splash was clearly 10X Genomics and their new technology, the GemCode "toaster"; presumably so called because of its diminutive size, and not because your microtitre plate is launched out the top nice and warm! The system is available to order now for $75K, plus $500 per sample. Using an input of just 1ng means users can test this even with precious clinical samples. Hopefully the improved structural variant detection 10X are promising will have a significant impact on cancer research, perhaps making translocation discovery easier.

At AGBT David Jaffe was quoted on Twitter as saying "Remarkably, it works." but also "We can't phase long homopolymers with 10x, hard to do with any technology". Hanlee Ji has been collaborating with 10X for about 18 months and described a new 'Quantum genetics'. He presented data from NA12882 with 99% of variants and genes phased and big N50 blocks, up to 11.5Mb. Levi Garraway presented work on 10X exomes in prostate cancer and detection of rearrangements creating intronic gene fusions.

How does 10X technology work: 10X is a "synthetic long-read" technology. It works by capturing a barcoded-oligo-coated gel bead and 0.3x genome copies in a single emulsion droplet (sounds familiar), processing the equivalent of 1 million pipetting steps. The genome goes through a library prep that introduces one 14bp barcode from a pool of 750,000. After HiSeq sequencing the barcoded short-reads can be assembled into contiguous sequences averaging 50kb in length. Just 1ng of input DNA is split across 100,000 gel-coated beads (GEMs), with processing completed in 5 minutes. Samples are then pooled for amplification and sequencing, and the final data average 50kb phase blocks.
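
One way to picture how a shared barcode carries long-range information is the sketch below. It is not 10X's pipeline: it assumes the reads have already been aligned to a reference, and the (barcode, chromosome, position) tuples, the function name infer_molecules and the 50kb max_gap cut-off are all hypothetical choices made for illustration. Reads with the same barcode that land close together are taken to come from the same original long DNA molecule.

```python
from collections import defaultdict


def infer_molecules(reads, max_gap=50_000):
    """Group aligned reads into inferred source molecules.

    reads   -- iterable of (barcode, chrom, pos) tuples (hypothetical format)
    max_gap -- assumed cut-off: same-barcode reads further apart than this
               are treated as coming from different molecules
    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    # Collect read positions per (barcode, chromosome) pair.
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start = pos
                count = 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules


if __name__ == "__main__":
    toy_reads = [
        ("AAAC", "chr1", 1_000),
        ("AAAC", "chr1", 20_000),
        ("AAAC", "chr1", 900_000),
        ("GGTT", "chr1", 5_000),
    ]
    for molecule in infer_molecules(toy_reads):
        print(molecule)
```

In the toy data the two nearby AAAC reads are linked into one inferred molecule, while the distant AAAC read and the GGTT read each stand alone. Variants covered by reads from the same inferred molecule can then be assigned to the same haplotype.
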
The GemCode Gel-Bead and library kit contains the reagents to enable sample partitioning, molecular barcoding and library creation for 16 samples, and the GemCode chip kit enables sample partitioning and molecular barcoding for 48 samples, at 8 samples per chip. A multiplexing kit allows up to 96 samples to be pooled for sequencing.
[Figure: 10X technology workflow]
Why bother with phasing: we can now buy a 30x genome for $1000, but it's not perfect (then again, what eukaryotic genome is?). There are gaps in the sequence and regions that we can't resolve with the current short reads. Most of this does not make much difference to the majority of samples run through my lab: RNA-seq, ChIP-seq and exomes. But resolving maternal and paternal haplotypes is likely to be transformative in some medical applications, and we're likely to find out even more about structural variation in cancer.

Results presented at AGBT showed megabase phase blocks with high percentages of phased SNPs; HLA phasing; false positive deletions in control genomes; and discovery of an EML4/ALK fusion in a lung cancer cell line. A fully phased exome was also presented; this had around 160x coverage, which is high compared to today's standard, but not too high if the extra information is usable by researchers.
There was a good review of haplotyping applied to personalised medicine in Genome Medicine: Whole-genome haplotyping approaches and genomic medicine, by Gustavo Glusman, Hannah Cox and Jared Roach at the Institute for Systems Biology.
Other long-read technologies: There are other ways to get similar data, but 10X appears to be ahead of the game. The most obvious is the Moleculo technology Illumina purchased a while ago, although I've not heard much about this since. Kevin Gunderson from Illumina presented a poster at AGBT on the transposase contiguity (CPT-seq) paper from Jay Shendure's lab; this reported 30-50kb "reads", limited only by input DNA length. Nextera tagmented DNA remains contiguous and can be diluted into 96-well plates for PCR amplification with different barcodes. After pooling and long fragment sequencing, the "reads" can be assembled with N50 phase blocks of 1.4-2.3Mb. Complete Genomics LFR is another option, and works from just 100pg of DNA or 10-20 cells.
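
For anyone unsure what an N50 figure means, here is a minimal sketch using the usual definition: sort the blocks from longest to shortest and report the length of the block at which the running total first reaches half of the summed length. The function name n50 and the block lengths in the example are made up for illustration and are not data from any of the platforms above.

```python
def n50(lengths):
    """Return the N50 of a collection of block (or contig) lengths.

    Uses the common definition: the length L such that blocks of
    length >= L together cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up phase block lengths in kb; total is 290, half is 145.
    blocks_kb = [80, 70, 50, 40, 30, 20]
    print(n50(blocks_kb))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because the statistic is weighted by length, a handful of very long phase blocks can give a large N50 even when many short blocks are present.
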

What sets the 10X Genomics technology apart from other approaches to reconstruct long-range information from short reads, such as Illumina's TruSeq Synthetic Long-Read technology, originally developed by Moleculo, is its scale: the technology can partition DNA into more than 100,000 fractions, each containing about 0.3 percent of the genome, and has 750,000 different barcodes available.
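
A rough back-of-envelope check of these numbers is sketched below. The assumptions are mine and not from 10X: a human haploid genome of about 3.1 Gb, an average mass of about 650 g/mol per base pair, and DNA spreading evenly across partitions. The function names are hypothetical.

```python
AVOGADRO = 6.022e23        # molecules per mole
BP_MASS_G_PER_MOL = 650.0  # assumed average mass of one base pair
HUMAN_HAPLOID_BP = 3.1e9   # assumed haploid human genome size


def haploid_genome_mass_pg(genome_bp=HUMAN_HAPLOID_BP):
    """Approximate mass of one haploid genome copy in picograms."""
    grams = genome_bp * BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e12


def genome_fraction_per_partition(input_ng, partitions,
                                  genome_bp=HUMAN_HAPLOID_BP):
    """Average haploid genome equivalents per partition,
    assuming the input DNA is spread evenly."""
    copies = input_ng * 1000.0 / haploid_genome_mass_pg(genome_bp)
    return copies / partitions


def possible_barcodes(length_bp):
    """Number of distinct DNA sequences of the given length."""
    return 4 ** length_bp


if __name__ == "__main__":
    fraction = genome_fraction_per_partition(input_ng=1.0, partitions=100_000)
    print(f"genome fraction per partition: {fraction:.2%}")
    print(f"possible 14bp barcodes: {possible_barcodes(14):,}")
    print(f"share used by a 750,000 pool: {750_000 / possible_barcodes(14):.2%}")
```

Under these assumptions 1ng is roughly 300 haploid genome copies, so 100,000 partitions each receive about 0.3 percent of a genome, in line with the figure quoted above. A 14bp barcode allows 4^14 = 268,435,456 sequences, so a pool of 750,000 uses only a small fraction of the available space.
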

True long-reads are possible on PacBio, with reads over 20kb, and most recently on Oxford Nanopore: the recent Nanopore yeast genome had a contig N50 of over 470,000 base pairs. So there is always the possibility that synthetic long-reads might be outpaced by sequencing technology. However, for the foreseeable future combining billions of Illumina short-reads with synthetic long-reads appears to be a winning combination.

Open-source software: 10X are releasing their software, which includes a haplotype-aware genome browser, for anyone to develop. They are providing APIs, SDKs, and tools to allow bioinformaticians to develop novel tools for haplotype phasing, structural variant analysis, de novo assembly, etc. The open-sourcing is likely to be popular, and 10X say it like they mean it: "we believe that vibrant, open-source platforms are essential to the rapid development of high quality bioinformatics software". Only time will tell.

Long-reads need long DNA: A challenge for anyone who wants to get phased genomes is the need to have high-quality DNA. The limitation in phase block length is likely to be the length of DNA you can get, rather than the length of the synthetic read, or even of a real sequence read. Getting DNA over 100kb is tough!

Other coverage: OmicsOmics (Keith has some great analysis of the costs), GenomeWeb, Dale Yuzuki, Bio-ITWorld, NGS Expert, DeciBio.


  1. "the barcoded short-reads can be assembled into contiguous sequences averaging 50kb in length"

    I'm not getting the sense that is correct from Keith's post, which seems to suggest that the long inserts are not reassemble-able... Because they're not long range amplified before splicing and barcoding; therefore there are no overlapping short reads to reassemble the longer sequences with. You can only tell that they came from the same read, not their order, in other words. Unless I'm missing something which is certainly likely.

    1. I don't think the original template is "converted" into library, i.e. I don't think it gets cut up and has barcodes attached. The impression I've gotten is that they amplify with random primers that have barcodes attached, meaning they should get overlapping reads from the same fragments.

      Although in some of the visualizations I've seen it doesn't look like reads are present for every base of every fragment, so maybe it's a technical difference with a similar outcome.

  2. I just read that is providing 10X Genomics sequencing
