Thursday 23 February 2012

The Human genome on a single MiSeq run...can it be done?

One of the highlights of AGBT for me was Illumina's announcement of a 687bp "perfect" read on MiSeq. This was the longest read in a PE400bp dataset; the read had a 113bp overlap in which both ends gave the same basecalls. My first question to Geoff Smith was: how long are the reads with one or two errors in the overlapping sequence?
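The read length follows directly from the run format: two 400bp reads sharing a 113bp overlap give 400 + 400 - 113 = 687bp. Here is a minimal Python sketch of that kind of "perfect overlap" merge. It is only an illustration, not Illumina's method; the function names, the `min_overlap` parameter and the toy fragment are all my own assumptions.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merged_length(read_len: int, overlap: int) -> int:
    """Length of a merged pair of equal-length reads sharing `overlap` bases."""
    return 2 * read_len - overlap


def merge_pair(read1: str, read2: str, min_overlap: int = 10):
    """Merge a read pair whose ends overlap with zero mismatches.

    read2 is assumed to come from the opposite strand, so it is reverse
    complemented first. The longest exact overlap is tried first.
    Returns None when no exact overlap of at least `min_overlap` exists.
    """
    r2 = reverse_complement(read2)
    for olap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-olap:] == r2[:olap]:
            return read1 + r2[olap:]
    return None


# Toy example (hypothetical 30bp fragment, 20bp reads, 10bp overlap)
fragment = "ACGTTGCAAGGCTTACCGATGGTCAATGCC"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2, min_overlap=5)
```

Allowing one or two mismatches in the overlap, which is what my question to Geoff was really about, would mean replacing the exact string comparison with a mismatch count.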

I wonder if Illumina will be able to generate 1000bp reads on MiSeq?

Will Illumina be able to generate 700+bp reads on HiSeq2500? And if so what would you do with 250M of them?

Personally I am not certain our view on sequence error rate needs to remain the same given the length of reads. What can we do if we accept a higher per base error rate but trade this for longer and longer reads? Asking this question of people at CRI got me thinking.

The Celera Human genome was the first really huge "shotgun" genome effort, and shotgun sequencing is the approach we use almost exclusively today. I know they used publicly available data in their assembly but the bare bones of their work was 27M sequence reads totalling 15Gb of sequence from shotgun clone libraries. These Sanger reads were just 500-700bp long, or about the length of Illumina's latest MiSeq reads. And the MiSeq is about to ramp up the number of reads to 25M.
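To put those numbers side by side, here is a rough back-of-the-envelope coverage calculation in Python. The 3.1Gb genome size and the 600bp mean MiSeq read length are my assumptions for illustration, not figures from Celera or Illumina.

```python
HUMAN_GENOME_BP = 3.1e9  # assumed haploid human genome size


def fold_coverage(n_reads: float, mean_read_len_bp: float,
                  genome_bp: float = HUMAN_GENOME_BP) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len_bp / genome_bp


# Celera: 27M reads totalling 15Gb, so the mean read length is 15Gb / 27M
celera_mean_len = 15e9 / 27e6
celera_cov = fold_coverage(27e6, celera_mean_len)

# MiSeq: 25M reads at a hypothetical 600bp mean length
miseq_cov = fold_coverage(25e6, 600)

print(f"Celera: {celera_mean_len:.0f}bp mean read length, {celera_cov:.1f}x")
print(f"MiSeq (assumed 600bp reads): {miseq_cov:.1f}x")
```

Under these assumptions both come out at just under 5x, which is why the comparison is tempting. It says nothing about insert sizes, which is the harder problem.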

The thing we find hardest today is creating a good mix of reads from different insert sizes. Celera used 2, 10 and 50kb libraries: 50% of reads from 2kb, 40% from 10kb and just 10% from 50kb libraries. We can make the 2 and 10kb (just) today but 50kb libraries for sequencing on NGS are almost unheard of. As library prep methods get better, and if we can make mate-pair libraries with Nextera in the future, these kinds of issues become less of a problem.

I think the longer reads on MiSeq will have a large impact in the short-term. It will be great to see if they can be realised on HiSeq 2500 as well.

MiSeq v.X: The speed at which MiSeq runs is allowing these longer reads. How could we make it run even faster and would this allow even longer reads? Could Illumina build a microfluidic chip sequencer? This might allow microlitre sequencing reactions on a tiny disposable flowcell incorporating the fluidics and generate well over 1000bp reads? Even if we only got 1M or so they would be incredibly useful. And a microfluidic sequencer should be cheap to produce and use as very little reagent would be consumed. Maybe this could be competition for a MinION?

PS: In about 2000 we bought a Celera 3700 for use in my lab (not the one I am in now). It was completely knackered!

Wednesday 22 February 2012

AGBT wrap up

AGBT is over and what a meeting it was. I have never been to a conference where it felt like everyone, and it really did feel like everyone, was talking about the same thing over lunch. Oxford Nanopore made a big impact.

There has been some great coverage of AGBT and I'd encourage you to take a look:

Dan Koboldt over at MassGenomics had some of the best coverage I read. I spoke to Dan at the meeting and we discussed finding the time to post. I am not sure where he found the time this year, AGBT was busy. The guy does not sleep!

Nick Loman has a great round up of other blogs and I won't regurgitate it here so follow this link.

The number of hits I got on my blog made the efforts I put in very worthwhile (thanks for taking a look) but I am not sure I'll try to cover a meeting in the same way again. I put the notes together for internal use and thought it would be simple to strip out anything I wanted to keep quiet and tidy it up for posting on Core Genomics. However it was really hard to get it out in good time. I enjoy my blog posts and certainly don't want Core Genomics to put me under the same pressure as writing a paper!

This is supposed to be fun after all.

PS: I won an iPad 2 at the conference with RainDance and my kids were so happy. I entered most of the competitions at AGBT and take the view that as there are only about 1000 people at the meeting the odds must be quite good. Drop your business card in next time and see what you come home with!

Saturday 18 February 2012

AGBT 2012 day 2

Rick Myers, HudsonAlpha Institute. Genetics and Epigenetics of Human Gene Regulation: Eukaryotic genome architecture is complex, with cis and trans components and what looks like 1800 DNA binding proteins (what do they all do, and can we find out before 2020?). Talked about the promoter expansion in the EPM1 gene as an example of where transcription is important in disease and regulatory elements play a part.

A user guide to the encyclopedia of DNA elements (ENCODE) (Paper): Rick said this is a complex experiment, so now I don’t feel so bad about not understanding it as well as I think I should.

His lab is using Interactome, Transcriptome and Methylome analysis, suggested we need to integrate all three with allelic variation (genomics meets genetics)

Interactome: Protein:DNA interactions by ChIP-seq (the hardest part is getting a good antibody for the ChIP). Talked about NRSF binding as a repressor, a discovery made through ChIP-seq that was impossible with other methods. His group has so far looked at 241 interactions, 80 TFs, 18 cell lines (is this more than Duncan’s group?).

Tim Reddy, Jay Gertz and Flo Pauli: Allele-specific occupancy to show differential occupancy at a locus (email to Kerstin). (Paper) Genome Research Feb 2012.

Transcriptome: gave a simple intro to mRNA-seq; alignment and calling genes is still complex and an evolving technology. Can be very sensitive and far more accurate. His group performed the transposase RNA-seq library preps (paper) Gertz et al: Genome Research 2012. Down to 1ng of RNA.

Methylome: WGS for methylation is expensive. Using RRBS and Methyl450 chips in the lab. Measuring differential methylation at a locus genome-wide. Can you use WGS bisulfite low coverage sequencing of ctDNA to detect cancer early on by looking for higher than expected methylation of DNA?

Breast cancer: Combining all three in a clinical trial of anti-DR5 antibody and BrCa. Methylation at 114 CpGs is associated with response to antibody therapy. Possible companion diagnostic for treatment response likelihood.

Tom Gingeras, Cold Spring Harbor Laboratory. Important Lessons from Complex Genomes: How to distil down data from very large projects, ENCODE and modENCODE. Recent stats from GENCODE are around 50,000 genes (20000 protein coding), transcripts 160,000 (76000 are protein coding). This gives about 8 transcripts per gene locus and more than half are non-coding. Genic regions have multiple control regions and complex transcription. Transcripts are being identified at a faster rate than genes.

RNA analysis: Presented a slide showing the flowchart for RNA analysis from a whole cell to RNA-seq, RNA-PET and CAGE analysis of Cytoplasmic or Nuclear RNA, split at > or < 200bp and into PolyA+ or PolyA- (for RNAs >200bp). Long RNAs - 200M reads per replicate, 15 cell lines. Short RNAs - 35M reads per replicate. IDR analysis: Irreproducible Discovery Rate, Li et al 2011 (Paper). Statistics for two replicates, I know a lot of biologists that will be arguing “two is enough”! Transcription: Estimated that 80% of the genome is being transcribed (across 15 cell lines). There is pervasive and promiscuous transcription with lots of novel multi-exon transcripts and antisense transcripts. More than half of 245k unannotated single exon transcripts are intergenic and antisense (predominantly polyA-). Cells appear to transcribe all isoforms; there may be a dominant isoform present but the others can be compartmentalised within the cell. Transcript copy number can be different by orders of magnitude in different cells, showed a beautiful slide of expression level for coding and non-coding transcripts. (Paper) Landscape of transcription in Human cells submitted.

*Geoffrey Smith, Illumina: The MiSeq DNA Sequencing Platform and Application in Clinical Microbiology. Geoff reintroduced some of the MiSeq upgrades, 7Gb, longer read lengths (2x250), etc. Unfortunately most of his talk was about three very interesting cases of how clinical microbiology can use NGS in the form of MiSeq. As this work is out for review we were asked not to blog or tweet.

Illumina did present a poster showing a 687bp read, the longest read generated so far on MiSeq. This was done from a PE400bp run. Human genome sequencing is going to get a whole lot easier.

Jo Boland, NCI. Exome Sequencing on the PGM: Exome in a Day: First 318 chips in October, now running 6 PGMs, they also have 1 HiSeq2000. Single exomes on the PGM adds flexibility to the lab. If a sample drops out of a panel it needs to be repeated as quickly as possible to not delay the rest of a project. Flexibility is great.  
Exomes in a day: Nimblegen Capture in production on HiSeq. LifeTech TargetSeq on PGM in “Exome in a day” service. Exome capture still takes three days, it is only the sequencing that can be done in one day. POC on CEPH/UTAH 1463 trio. 3, 5, 7 & 10GB for Mother, Father and Son at 5Gb only. Data looked very good but 3.5Gb was just under 20x coverage. Tried exome in a day pipeline on a Melanoma family. Five PGM runs and 1 Ampli-seq panel. Generated 4.8Gb. Lots of variants reported.
318 improved loading onto chips will make a difference. OneTouch 200bp chemistry. Rapid exomes on PGM foreshadows what will be possible very shortly.

*Richard Roberts, New England Biolabs, “Characterization of DNA Methyltransferase Specificities Using Single-Molecule, Real-Time DNA Sequencing”: He started by saying "can we please spend a small fraction of the money going into NGS on finding out what the genes actually do!" This got a round of applause from the audience. Everyone wants to understand function.
His talk focused on methylation analysis. NEB have a strain of E. coli with no methylases, they can introduce a plasmid with a specific methylase and determine the impact on the genome in isolation. They did all their sequencing with Pacific Biosciences.
Bacterial genome sequencing should be backed up with methylation sequence, possibly on PacBio.

*Clive Brown, Oxford Nanopore Technologies, “Single Molecule ‘Strand’ Sequencing Using Protein Nanopores and Scalable Electronic Devices”

See my earlier post for a clearer description of what Clive presented.

Nanopores are small holes; ONT measure current changes and dwell time as DNA passes through a pore. ONT do not use anything already published in their chemistry. 160 enzymes tested, 20-400bps. Strand sequencing: DNA can be fed 3’-5’ or 5’-3’. Sample prep puts a hairpin at the end to allow forward and reverse sequencing. dsDNA sample prep is standard: duo, mono, 15 minute prep and Capture prep modes. Fragment, end repair, A-tail, ligate adapter.

The ASIC core is the heart of the system. These sensors can do 1000bps sequencing. On top of the ASIC is a sensor chip, an array of microwells layered with a membrane in which are nanopores. Very stable biological system. Not currently using dwell time for base calling. Actually reading 3-base Kmer blocks, so the sequence has to be determined from the series of blocks read.

PhiX Genome: 5 and 10kb reads. 4% error rate. They have seen a full length genome read from PhiX! Moved to Lambda genome in a similar single pass read, all 42KB!!! The 10k base is as good as the 40k base, data quality does not deteriorate. Errors come from “wobbly” Kmers. Error rate will be improved with pore evolution.

Sense-antisense reads combine to improve base quality, base analogues and SNPs.

Homopolymers cause the enzyme to tick?

RNA-seq with no cDNA conversion.

Epigenetic modification: MeC and hMeC have been looked at but many could be analysed.

Spiked known sample into blood and put blood directly onto chip. Can detect rabbit DNA in blood with no sample prep.

New product in 2012: MinION USB sized disposable DNA sequencer. About 150Mb per hour scalable to 1Gb for use in the field. Plug into a laptop and sequence! App does the analysis on the fly. 5-25Gbp per day but limited to 6 hours run time.

GridION 2000 and 4000 pore versions. Throughput is a function of number of pores and speed of sequencing. 25-125 Gb per day.

Pricing: $500-1000 per use. $10 per Gbp. Human genome in 15 minutes.

Friday 17 February 2012

Oxford Nanopore did not disappoint

AGBT is over as far as many attendees are concerned, and the sequencing landscape looks like it might be about to ramp up to the next gear. Again!

The new MinION (read about the USB sequencer later on) will cost $500. Assuming everyone at AGBT buys one, ONT get $5M in revenue from one conference. This is before anyone does anything real with the technology.

At least one lucky punter got a sneak preview. Nick Loman, my collaborator on the Google Map of sequencers, got an early interview with Clive, Zoe and Dan at ONT. Read his post here.

Clive Brown from ONT presented “Single Molecule ‘Strand’ Sequencing Using Protein Nanopores and Scalable Electronic Devices” at AGBT in a 20 minute slot. It was 20 minutes with lots of information, much of it jaw dropping if it delivers as well as everyone hopes. The company also have a press release which says “[Clive] outlined the Company's pathway to a commercial product with highly disruptive features including ultra long read lengths, high throughput on electronic systems and real-time sequencing results. Oxford Nanopore intends to commercialise GridION and MinION directly to customers within 2012".

I can't resist adding this image of Clive... I lifted it from Bloomberg (sorry Bloomberg). He looks like the cat that killed the rabbit! And yes, that is a DNA sequencer he is holding in his fingers.

ONT have commercialised DNA 'strand sequencing', not the exonuclease method Illumina licensed for $18M.

The sequencing chip:
ONT are not using any published chemistry in their system. The chips apparently have an ASIC core with a membrane covering them supporting the nanopores. The aim is to release chips with 4000 or 8000 pores each. The enzymes used in the sequencing run at 20-1000bp/sec. Currently ONT are only using the change in current to differentiate bases; dwell time has been proposed as a way to increase discriminatory power and sequence MeC, etc.

The sequencing is done by reading 3bp Kmers, i.e. the pore is large enough to hold three bases inside. These three bases give a characteristic signature and as the strand translocates one base at a time sequential Kmers allow discrimination of the sequence.
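The idea of stepping through overlapping 3-base Kmers can be sketched in a few lines of Python. This only shows the overlap logic, assuming the Kmers have already been called perfectly; working out which Kmer produced each current level is the hard part and is not attempted here. The function names are mine, not ONT's.

```python
def to_kmers(seq: str, k: int = 3) -> list:
    """The series of overlapping k-mers seen as a strand moves one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def from_kmers(kmers: list) -> str:
    """Rebuild the sequence from consecutive k-mers that each shift by one base."""
    if not kmers:
        return ""
    seq = kmers[0]
    for prev, cur in zip(kmers, kmers[1:]):
        # consecutive k-mers must share k-1 bases
        if prev[1:] != cur[:-1]:
            raise ValueError(f"{prev} and {cur} do not overlap by k-1 bases")
        seq += cur[-1]  # each step contributes exactly one new base
    return seq


kmers = to_kmers("GATTACA")
print(kmers)              # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
print(from_kmers(kmers))  # GATTACA
```

Each base is seen in three consecutive Kmers, which is presumably part of why a "wobbly" Kmer gives errors at predictable places.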

Sample prep: The current preferred sample prep is to add a hairpin to the molecule to allow sequencing in both directions. Load the first strand, run through the hairpin and start reading the second strand all in one contiguous read. DNA can be fed through 3’-5’ or 5’-3’.

The chips are very stable and can be loaded with blood to allow sequencing of DNA in solution with no sample prep at all. Clive is going to get a reputation as a mean b*****d after describing sending a vegetarian colleague to an abattoir to collect blood, then getting it from someone’s pet rabbit and finally sending another ONT employee into a river to collect raw sewage. All in the name of progress ;-)

Genome sequencing: Clive discussed a couple of projects. First ONT have sequenced PhiX. But they did it in a single contiguous read. First pass sequencing showed about a 4% error rate but the hairpin double-strand sequencing reduced this to 0.1-2% and the errors are in known locations of the genome as some Kmers are known to be “wobbly”. Next they sequenced Lambda, all 42Kb in single reads, and as 5 and 10Kb fragment libraries. The Lambda sequence was done as a 100kb read, 42kb from each direction. The most interesting thing here was that the quality of the 1st, 10000th and 40000th bases is the same. Unlike all other systems that show some form of decay in quality, ONT may be able to give us almost infinite read length, whole chromosomes perhaps?

RNA-seq and Epigenetic modification:
MeC and hMeC have been looked at but many others could potentially be analysed.

Clive briefly mentioned RNA-seq with no cDNA conversion but immediately pointed out that though this had been demonstrated there were no plans to commercialise the application at this time. An interesting feature of the ONT methodology is that once one strand has been sequenced a new one can load in and begin sequencing. I can imagine this is going to allow some really cool RNA-seq once they finish the development of that; even with only 4000 pores per chip, at 1000bp/sec you could sequence nearly 3M transcripts of 5Kb average length in an hour. With no cDNA intermediary!
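The arithmetic behind that estimate, as a quick Python sketch. It assumes every pore is busy all of the time, runs at the top quoted speed and loads the next molecule instantly, none of which I would expect to hold in practice.

```python
def molecules_per_hour(n_pores: int, bases_per_sec: float,
                       mean_length_bp: float) -> float:
    """Upper bound on molecules read per hour if all pores run continuously."""
    bases_per_hour = n_pores * bases_per_sec * 3600
    return bases_per_hour / mean_length_bp


transcripts = molecules_per_hour(n_pores=4000, bases_per_sec=1000,
                                 mean_length_bp=5000)
print(f"{transcripts:,.0f} transcripts per hour")  # 2,880,000
```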

GridION: The GridION was presented in a 20-node cluster format that, when loaded with 8,000 pore chips, could sequence a human genome in just 15 minutes. The GridION will be available first with 2000 pore chips; 4000 and 8000 will come in the mid-term. Using the “Run until” technology, a user can specify how much data they want from a sample and leave the instrument to run for just long enough to get all the data. I did not get a chance to see the pricing.
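A rough sanity check of the 15 minute genome claim, in Python. I am assuming one chip per node, every pore reading at the top quoted speed of 1000bp/sec for the whole run, and a 3.1Gb genome; these are my assumptions, not ONT specifications.

```python
HUMAN_GENOME_GB = 3.1  # assumed haploid human genome size in Gb


def cluster_yield_gb(nodes: int, pores_per_chip: int,
                     bases_per_sec: float, minutes: float) -> float:
    """Best-case yield in Gb, with one chip per node and all pores active."""
    total_bases = nodes * pores_per_chip * bases_per_sec * minutes * 60
    return total_bases / 1e9


yield_gb = cluster_yield_gb(nodes=20, pores_per_chip=8000,
                            bases_per_sec=1000, minutes=15)
coverage = yield_gb / HUMAN_GENOME_GB
print(f"{yield_gb:.0f}Gb in 15 minutes, about {coverage:.0f}x")
```

On those best-case numbers the cluster would produce around 144Gb, so even a large shortfall in pore occupancy or speed would still leave reasonable coverage of a human genome.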

The star of the show - MinION:
This is a disposable USB sequencer that will generate up to 1Gb of sequence in the field. Just plug into your laptop and sequence! The chips are capable of 5-25Gbp per day but runs are limited to 6 hours due to chemistry.

The possibility of sequencing in the field brings a whole new dimension to disease and agricultural research. Being able to swab a patient or collect a fungal sample and sequence to identify which drug or fungicide should be used is going to revolutionise research, health and agriculture.

We can't place an order just yet. I certainly hope to start some early access collaborations (hint, hint) and am still not really sure what difference this is going to make to the world.

PS: sorry if there is anything incorrect in this post, there was a lot in Clive's talk and I have been grabbing a few minutes here and there to get the post together.

AGBT 2012 day 1

Day one of my attendance at the 2012 AGBT conference.

I missed the Wednesday workshop, "Utilization of New DNA Sequencing Technologies" and the welcome reception which this year was sponsored by Agilent Technologies.
Getting through Miami airport was a chore, nearly two hours for security!

I decided to try and post my notes from the meeting on my blog. The idea was to allow those of you not able to come to learn some of the highlights of the meeting from my perspective. The problem I saw when writing these up for posting is that my perspective might not be as insightful as I'd like. I have posted anyway and will try to do the same for the rest of the meeting.

*Abstract-Selected Talks
Thursday morning 16th February 2012
Plenary session: clinical translation of genomics
Lynn Jorde: Whole-genome sequencing, mutation rates, and disease-gene detection.
Heidi Rehm: Supporting large scale sequencing in a clinical environment.
Lisa Shaffer: A genotype first approach to the identification of new chromosomal syndromes.
*Darrell Dinwiddie: Initial clinical experience with molecular testing for >600 severe childhood diseases by target enrichment and next-generation sequencing.
*Rong Chen: Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases.

Three common themes came out of this session: the adoption of panels for disease analysis, sequencing pipelines for discovery of disease-causing mutations, and the need for Sanger validation.

Panels for targeted sequencing: Panels will continue to develop as labs focus on different tests. It seemed that the likelihood of a single panel for Human disease is low; groups will design their own, some may just sequence the exome plus some additional loci. But it did seem that in the short-term specific disease based panels will be developed and used in many labs across the world. There is a lot to drive research in this direction, primarily the upfront cost of developing a panel, which can be high. If a panel can be shared then it can be manufactured in high volume and cost drops significantly. It also follows that the data can be more easily shared between labs, increasing the number of samples in meta-analysis.

Disease-causing variant analysis pipelines: I do wonder how long it might take for the best-practices to be developed for disease causing mutation pipelines. Ideally many labs would use the same workflow, making sharing much easier. However in the research setting people can't always agree on the best way to proceed.

Sanger sequencing validation: Sanger sequencing is still being used for confirmatory validation of results from NGS studies. However the number of Sanger tests that need to be run increases as NGS panels get larger. Can we really afford to keep the paradigm of Sanger validation? I suspect we'll ditch it and rely on NGS alone at some point. It may be that different platforms are used to give us additional confidence. Perhaps Illumina:TSCC sequencing followed up by Ion Torrent:Ampliseq kind of approaches?

The outlook: I suspect the clinic might drive us towards more uniform approaches in all three of the above points as there is a need for standardization. Sharing data across different clinics will also improve our ability to review and update knowledge based on new cases. This will be particularly so for rarer diseases where only a few cases are presented per year in one clinic.

Lynn Jorde: Whole-genome sequencing, mutation rates, and disease-gene detection.
Lynn Jorde, University of Utah School of Medicine: Direct estimation of mutation rates from whole-genome sequence data. Started the AGBT meeting with the obligatory slide showing the cost of genome sequencing and the increasing number of genomes sequenced.
Miller syndrome sequencing: family sequencing of a mother, father and two affected children. In 1981 this family was the first reported to show a recurrence of Miller syndrome, and there was no way of determining inheritance. A second family showing recurrence of Miller syndrome has since been identified. Showed a slide of 7 Miller syndrome patients but this was about 25% of cases worldwide! This is a rare disorder. Used 15ug of DNA for sequencing by Complete Genomics of all family members. The family self-identified and received a lot of publicity.
Mutation rate estimation: From the CGI data they saw 34000 Mendelian errors, but only expected around 0.01% to be true mutations. They used Agilent capture arrays to resequence all variants and only 28 were confirmed. This allowed them to calculate a mutation rate of 1.1x10^-8, each gamete having about 35 novel variants. Two other trios gave very similar estimates, which is about half the rate people thought before we had NGS data.
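The link between the rate and the per-gamete count is a single multiplication, shown here as a small Python sketch. I am assuming the rate is per base per generation and using a 3.1Gb haploid genome; the talk did not give the denominator the group actually used.

```python
HAPLOID_GENOME_BP = 3.1e9  # assumed haploid human genome size


def new_mutations_per_gamete(rate_per_base: float,
                             genome_bp: float = HAPLOID_GENOME_BP) -> float:
    """Expected number of new mutations carried by one gamete."""
    return rate_per_base * genome_bp


per_gamete = new_mutations_per_gamete(1.1e-8)
confirmed_fraction = 28 / 34000  # confirmed variants / candidate Mendelian errors

print(f"about {per_gamete:.0f} new variants per gamete")
print(f"{confirmed_fraction:.2%} of Mendelian errors confirmed")
```

The second number shows how heavily sequencing error dominates the raw Mendelian error calls, which is why the resequencing step was needed.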
Variant Discovery: He reported discovering 3M SNPs, 10k non-synonymous and 100 loss of function variants. Methods to identify these are rather ad-hoc so the group developed the Variant annotation analysis and selection tool, VAAST, see Yandell et al, Genome Research, July 2011 (Paper) to better identify disease-causing variants. Used VAAST in a "progeria-like" lethal x-linked disease with exome sequencing; a 20 minute analysis identified NAA10 as the cause of a new genetic condition, "Ogden syndrome". Mother was four months pregnant but could not inform her of findings, baby was affected and died the week the paper was published. A CLIA-approved test has been developed and the family will use it for pre-implantation diagnostics. VAAST increases in power as more control genomes are used. Collecting lots of control genomes will improve the power of this technique, and others like it.
Analysis of male-specific mutation rate: used a large family and compared identical by descent regions, showed male mutation rate was around 5x the female rate. Males are responsible for most of the change in the genome. Questions that might be asked - Effect of paternal age on mutation rate? Variation in DNA repair genes? Population variation?

Heidi Rehm: Supporting large scale sequencing in a clinical environment.
Heidi Rehm, Partners Healthcare Center for Personalized Genetic Medicine: Gave a CLIA lab perspective from the Laboratory of Molecular Medicine (LMM), which runs over 150 tests. The main focus is on large multi-gene panels, now moved to NGS. Also run MLPA, STR, dPCR, etc. Content per test has grown logarithmically from single gene tests to large panels, now SNP-arrays, exomes and WGS. LMM workflows have become more complex. The HCM cardio-chip (and OtoChip for hearing loss) included a test for GLA, the only gene where there is a treatment available; 2% of patients have the mutations and can be treated. You need to include as much as possible in a panel! Both HCM and OtoChip are Affymetrix sequencing arrays. Identifying the correct mutation has an impact on patients.
Drawbacks: Need to confirm NGS results, LMM and many others are still Sanger confirming results. Need to fill-in any regions known to be associated with a disease, we cannot have gaps. HCM NGS test requires Sanger validation for confirmation of variants, also un-callable bases, low-covered regions (less than 20x), non-covered regions. Sanger adds a lot of time and money. For the OtoChip NGS test LMM are developing Sanger validation assays at the same time as launching the test. Need to report on all variants and this takes time, 20min to 2 hours for each novel variant to be assessed using Google, PubMed, databases like dbSNP, DGV, etc. Myriad detect 10-20 new missense variants per week despite testing 150,000 people for BRCA1/2. Lots of new variants will be discovered. Not clear how many people we may need to sequence to reduce this number. In HCM 66% and in OtoChip 81% of clinically significant variants are found in only one family!
Sharing data: Need to get more clinically useful data out in the public domain and share it around. LMM have a grant submission for creation of a universal Human genomic mutation database, under review. Approached almost 100 labs and now have 160,000 cases of data that can be submitted if funded.
GeneInsight in Hum Mutat 32:1-5 2011 (Paper). Reports need to be reanalysed and possibly amended, as data in the research databases becomes more comprehensive. For HCM 4% of reports would have changed and had an effect on the patient. Similar to the reporting feedback that 23andMe offer to subscribers. We need electronic health care records. Data needs to feed into a network of labs as each will bring their own specific expertise on particular tests. The infrastructure is going to be hugely important.

Lisa Shaffer: A genotype first approach to the identification of new chromosomal syndromes.
Lisa Shaffer, Perkin Elmer, Inc.: This was a sponsored talk from Perkin Elmer. She introduced us to the traditional phenotype-first approach to syndrome identification e.g. Down's syndrome was reported in 1866 and Lejeune reported the etiology in the 1960s (need to check the date). In the 80s and 90s syndromes were reported based on known chromosomal deletions, easily visible under a light microscope at >10Mb in size. More recently arrays allow very quick analysis of deletion and size of variation and are a powerful tool to identify and characterise micro-deletions.
Signature Genomics Genoglyphix: (Google) database of 16000 abnormal cases, looking for deletions in particular genes, see Baliff et al Hum Genet 2011 (Paper). Lisa is proposing a genotype first approach, could easily be sequence-based. The important bit is to get the data in one place and curated to allow better analysis, similar to Heidi Rehm's talk above. Sharing is good for everyone.

*Darrell Dinwiddie: Initial clinical experience with molecular testing for >600 severe childhood diseases by target enrichment and next-generation sequencing.
Darrell Dinwiddie, Children's Mercy Hospital: Children's Mercy Centre for Pediatric Genomic Medicine, 20 physicians involved across every specialty in the hospital, 20 MD, PHD or genetic counselors. There are over 7000 Mendelian diseases described, 3,300 have a known molecular basis, and cause ~25% of infant deaths. We need to test as early and comprehensively as possible to impact treatment or palliative care, also for pre-conception carrier testing. Rule out specific diseases and better treat patients.
Testing at Mercy today: costs about $10,000, 1-10 diseases at a time, 3000 patients per year, turnaround up to 12 months, 50% cases diagnosed.
Testing at Mercy tomorrow: with genomic medicine should cost about $1000, 600 diseases at a time, 30000 patients per year, turnaround up to 1 week, 90% cases diagnosed.
Logistics are important: The labs are setup with a uni-directional workflow. Separation of DNA isolation, library prep, PCR-enrichment, QC, Sequencing. They have employed a semi-automated 96 samples processing platform. Caliper LabChip GX, Caliper robots, Covaris 96well shearing.
They are using Illumina TSCC of 526 genes, 8366 genomic regions. 4Gb of sequence per sample give 98% of targets at 16x or higher coverage, allows around 50 samples per lane on HiSeq. NGS genotyping test shows very good concordance to array based genotyping. They do not test children for carrier status or for disorders that will not affect them until adulthood.
Variant detection analysis: They have what looks like a pretty comprehensive variant detection pipeline. Spoke about the difficulties in detecting deletions, insertions and CNVs and that these can account for up to 80% of disease causing mutations in some diseases. Gross deletions detected by drop in sequence depth and by aberrant pair read mapping. They are completing a validation of 700 samples including Coriell samples, 384 from CHM, 220 from collaborators in Germany and Iran.
Summed up with a discussion on the power of Genomic medicine. Cases presented were sisters who had been tested over 5 years to figure out disease causing molecular diagnosis. Genomics analysis suggested APTX gene, which showed a homozygous mutation confirmed by Sanger sequencing, parents were both carriers. APTX causes CoQ10 deficiency that is treatable and this treatment has started in the sisters. Reported on current progression and the older sibling is no longer wheel-chair bound.

*Rong Chen: Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases.
Rong Chen, Personalis Inc.: Presented a nice slide showing how T2D genetic risk decreases when Humans migrate, frequency of risk alleles changes across populations. Why?
Broad thrifty gene hypothesis caused by the promotion of energy storage and usage appropriate to environments and/or energy intake or mismatch between genetic background and environment.
They used publicly available HapMap 3, HGDP and CGI data in their analysis.

Thursday afternoon 16th February 2012
Poster session: some interest in my poster but oh so many posters for people to choose from!

Plenary Session: Genomic Studies
Sequencing Thousands of Human Genomes. Goncalo Abecasis.
Goncalo Abecasis, University of Michigan: Genetic variants that modify LDL levels are responsible for about 30% of known association. Rare variants more often have larger effects. It is difficult to know with certainty if all susceptibility loci have been found.
The scale of rare variation: NHLBI Exome sequencing data, nearly 50% of variants are seen in a single individual. These singletons also appear to show more non-synonymous mutations. How much of the phenotypic variation do these rare variants explain?
How to study rare variation: whole genome sequencing, exomes, low coverage WGS, new GT arrays or genotype imputation. NHBLI performed exome sequencing of 400 samples from a 40,000 population showing the extremes of LDL phenotype. Also had data from 1600 other individuals with sequence data.
Sardinian sequencing: 6000 Sardinians from the Lanusei valley, currently 1700 individuals at 4x. Chose the key individuals in families to sequence allowing as many genomes to be analysed from imputation of genome in related individuals. Low coverage analysis allows you to get from a 5% error rate (66 samples) in genotype calls to less then 1% in (300 samples).
Problems in analysing so many samples: standard tools for genotype analysis take a long time when you want to run 1000 samples with low coverage sequencing. Different analysis methods give different results, and it is not easy to determine which to use. Consensus calling (Sanger, Broad and NHLBI methods) reduced error rates by 40%.
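To make the consensus calling idea concrete, here is a minimal sketch. The talk did not say how the three call sets were combined, so the simple majority vote below is my assumption, and the function name, site names and genotypes are all made up for illustration.

```python
from collections import Counter

def consensus_genotype(calls, min_agree=2):
    """Return the genotype that at least `min_agree` callers agree on, else None.

    Assumption: a plain majority vote; the real consensus method may differ.
    """
    genotype, votes = Counter(calls).most_common(1)[0]
    return genotype if votes >= min_agree else None

# Hypothetical calls at three sites from three pipelines (made-up data).
site_calls = {
    "chr1:1000": ["A/G", "A/G", "A/A"],  # two of three agree
    "chr1:2000": ["C/C", "C/T", "T/T"],  # no agreement, site is left uncalled
    "chr1:3000": ["G/G", "G/G", "G/G"],  # unanimous
}
consensus = {site: consensus_genotype(calls) for site, calls in site_calls.items()}
```

A site where the callers all disagree is dropped rather than guessed at, which is one plausible way a consensus could cut the error rate at the cost of a few missing calls.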
1000 genomes current integrated map: >1000 samples, 37.9M SNPs, 3.8M short indels, 14000 large deletions. 98.5% of SNPs but only 70% of indels can be validated. 10 samples account for 200000 indel false positives or 5% of errors!

ENCODE: Understanding Our Genome. Ewan Birney.
Ewan Birney, EBI: ChIP-seq 150, RNA-seq 100, DNase-seq 100 (samples assays), 164 assays for multiple transcription factors, six cell lines chosen for in-depth analysis: K562, HeLa, HepG2 plus GM12878, H1ESC, HuVec. 10 to 20 of the ChIP-seq assays have been run on nearly all 182 cell line samples.
ENCODE is a big project: with 11 main sites, 50 labs involved, 10 additional groups, 30 "lead" PIs, 350 authors on the main paper, 6 "high profile" papers and 45 companion papers.
Global analysis of the data without a priori knowledge came up with numbers very similar to what we already think is correct: exons 3%, ChIP-seq regions 12%, DNase 20%, histone mods 49%, RNA 80% of the genome is covered. The genome is doing a lot of stuff, 97% is not "junk".
Selection in primate specific regions: multiple alignments allow you to infer a particular sequence has inserted into the primate lineage, many of these appear to be under selection in Human populations.
GWAS and functional SNPs: many SNPs are not the functional SNP but simply linked target SNPs. ENCODE showed about 30% of reported SNPs are either functional or within 500bp of the functional SNP.
Ewan gave two big plugs for Rick Myers' DNA methylation and Tom Gingeras' RNA landscape talks tomorrow, so I'll try not to miss those. He talked about a lot of big experiments, very quickly, and went over time.
But I still don't really understand ENCODE, should I try harder?

Molecular Classification of Breast Tumors Using Gene Expression Profiling and Its Translation Into Clinical Practices. Charles Perou.
Charles Perou, The University of North Carolina at Chapel Hill: 2000 Breast Cancer mRNA-seq analysis. Did a lot of work up front on different RNA-seq methodologies. Compared Agilent microarrays with mRNA-seq and DSN RNA-seq. All three showed good concordance, especially so if only the 1600 GX BrCa list are used instead of all genes.
Ribosomal depletion: They performed a comparison of the different ribosomal RNA depletion methods; Ribominus (uses a few oligos to pull down), RiboZero (tiled oligos) or DSN (Double Strand Nuclease). All worked well, Chuck appeared to prefer the DSN approach as possibly more automatable.
GX diagnostics: The current diagnostic tests like PAM50 and Oncotype are useful, pathology is useful, but together they are better. Multiple assays give the very best results for prognosis and diagnosis. We should consider complex tests using protein and gene expression, CNV, InDels, methylation etc.
Adding copy-number data. Copy-number data by itself is good but both are only marginally better. It gets expensive to do all the possible technologies on large sample cohorts. The ideal would be to get these assays into clinical trials.

*Breast Cancer Progression From Earliest Lesions to Clinically Relevant Carcinoma Revealed by Deep Whole Genome Sequencing. Arend Sidow.
Arend Sidow, Stanford University: BrCa is a progressive disease, normal through CCL, DCIS, to IDC. Different stages are related by cell lineage and a tree of BrCa evolution is a more appropriate way to describe the disease.
Presented 3 cases with CCL, FEA, DCIS and IDC lesions. Take an FFPE specimen; histology and pathology determine which regions to access for sequencing, core the tissue, extract DNA, make libraries and sequence. 3 patients, 4-5 regions each, sequenced to 50x coverage. Looking at SNPs, SNVs, aneuploidies and structural variants. It is difficult to dissect tumours cleanly so there is always some normal contamination. Analysis of tumour evolution is possible by looking at which mutations (of any kind) are shared between different sites in the tumour. LOH and chromosome gain affect the absolute coverage of a region: LOH will lead to fewer reads, chromosome gain will result in more reads.
Cells go through around 40 cell divisions from fertilization to Puberty.
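The point about LOH, chromosome gain and normal contamination all moving the coverage can be put into a few lines. This is just a back-of-envelope sketch, not anything shown in the talk: it assumes read depth scales with the average number of copies of a region per cell in the sample, that LOH here means loss of one copy, and the 70% tumour purity is a made-up number.

```python
def expected_depth(diploid_depth, tumour_copies, purity, normal_copies=2):
    """Expected read depth for a region in a mixed tumour/normal sample.

    diploid_depth: depth seen where every cell carries two copies.
    tumour_copies: copies of the region in the tumour cells.
    purity: fraction of cells in the sample that are tumour (0 to 1).
    """
    avg_copies = purity * tumour_copies + (1 - purity) * normal_copies
    return diploid_depth * avg_copies / normal_copies

# 50x genome, hypothetical 70% tumour purity.
loss_depth = expected_depth(50, tumour_copies=1, purity=0.7)  # one copy lost
gain_depth = expected_depth(50, tumour_copies=3, purity=0.7)  # one copy gained
```

With these numbers a one-copy loss drops the region to roughly 32x and a one-copy gain lifts it to roughly 67x, rather than the 25x and 75x a pure tumour would give, which is why the normal contamination matters.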

Pacific Biosciences Roundtable Discussion/Dinner
Dinner was good, though the discussion was perhaps not as fun as last year's. John Rubin's movie was very entertaining but this year we only got a clip of Matt Damon in Contagion.

Thursday evening 16th February 2012
I spent most of the evening in the genomic technology concurrent session with a brief interlude in the Medical Sequencing and Genetic Variation session for Elliott Margulies' talk.

*Understanding Sequencing Bias Across Multiple Sequencing Technologies. Michael Ross, Broad:
I missed this one due to a medical emergency, lots of firemen and paramedics in a very tight space. Don't ask!

*Infinipair: Capturing Native Long-range Contiguity by in situ Library Construction and Optical Sequencing Within an Illumina Flow Cell. Jerrod Schwartz.
Jerrod Schwartz, University of Washington. A postdoc in Jay Shendure's group (that group does loads of cool work). De novo assembly does not work brilliantly with current NGS, mainly due to the size of inserts that are sequenced, currently around 400bp-5kb insert libraries. The HGP used much larger insert libraries. We need short, mid and long-range contiguity information. Current approaches require circularisation of long bits of DNA, which is inefficient. Optical mapping or optical sequencing are other approaches. They thought about what the ideal platform could look like: high-throughput, inexpensive, no circularisation, use current hardware, work at different length scales.
Infinipair: uses existing Illumina hardware. Load high molecular weight DNA onto a flowcell and generate spatially related reads. Take long DNA, hyb both ends to the flowcell, cut and make clusters out of the pair of covalently attached molecules, reads in clusters 1 and 2 should be spatially related in the genome. Used PCR-free library prep method to prepare libraries where the two adapters are only on one end allowing the molecule to hyb at both ends. 2, 3 and 4kb works well. For longer bits of DNA they used in situ stretching of DNA and Transpososome library prep to produce the other end for cluster generation. 5-8kb libraries have been sequenced using the stretching methods.
What's the big deal: no circularisation or PCR or anything in vivo. Works on 1-8kb, works on existing Illumina hardware. They are now improving hybridisation and conversion efficiency. Jerrod finished with a beautiful slide of Lambda DNA stretched out on a flow cell and stained with JoJo dye. No longer stars, more like the jump into Hyperspace of the Millennium Falcon!
PS: this talk was not easy to explain in a few notes, look out for the publication!

*The GnuBIO Platforms: Desktop Sequencer for the Clinical and Applied Markets. John Healy.
John Healy, GnuBio, Inc.: This was perhaps the fastest talk of the meeting, an interesting one and I'll certainly be watching GnuBio, but this was a presentation where I think less would certainly have been more.
The beta-version of GnuBio is coming mid 2012, a desktop sequencer with a small footprint and rackable format. Inject DNA into a microfluidic cartridge for a 200 gene resequencing panel and get sequence data back analysed in 3 hours. The chip is the machine (where have we heard that before).
Their picoinjector: A RainDance emulsion type system that injects DNA into each droplet rather than merging droplets. Each droplet has a distinct primer pair. After PCR each reaction-droplet is injected into a sequencing probe library droplet containing 6mers for hyb and quenching. Primer extension of the sequencing probe extends the reaction (TaqMan) and removes the quenching probe, allowing the fluorophore on the PCR amplicon to signal (added by labelling one of the PCR primers). Finally map all the mini-read contigs onto the genome and then reassemble!!! I did not understand the analysis methodology, it sounded like perhaps I should have but it was a fast description.
They showed early access work on a TP53 heterozygous InDel. All work so far completely sequences all amplicons, full-length, Q50+, over 10,000 fold coverage. Will this be the future for targeted sequencing?
What on earth is a Mahalanobis distance? Apparently it allows 15,000 barcodes in four dyes, 6 dyes gets to 300,000 addresses or 64Kb sequences.
Gene panels: You still need to buy oligos and make the panel before you can run an assay and this is still a problem with all the targeted approaches.
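Having looked it up since: a Mahalanobis distance measures how far a point is from the centre of a cluster in units of that cluster's own spread, taking correlation between the axes into account. My guess (and it is only a guess, the talk did not explain it) is that it is used to decide which dye-intensity cluster, and so which barcode, a droplet belongs to. Here is a small sketch for the two-dye case, with a made-up cluster.

```python
import math

def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of a 2-D point from a cluster.

    mean: (x, y) centre of the cluster.
    cov: 2x2 covariance matrix of the cluster, [[a, b], [c, d]].
    """
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    # Squared distance is the quadratic form v^T * inverse(cov) * v.
    return math.sqrt(dx * (ia * dx + ib * dy) + dy * (ic * dx + id_ * dy))

# Hypothetical barcode cluster: noisy in dye 1 (variance 4), tight in dye 2 (variance 1).
cluster_mean = (0.0, 0.0)
cluster_cov = [[4.0, 0.0], [0.0, 1.0]]
along_noisy_axis = mahalanobis_2d((2.0, 0.0), cluster_mean, cluster_cov)
along_tight_axis = mahalanobis_2d((0.0, 2.0), cluster_mean, cluster_cov)
```

Both test points are the same straight-line distance from the centre, but the one displaced along the tight axis comes out twice as far in Mahalanobis terms, so it is less likely to belong to that cluster. Presumably that is what lets many barcode clusters be packed into a few dyes.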

*Whole Genome Analysis of Monozygotic Twins and Their Parents: Accurate Detection of Rare Disease-Causing and De Novo Variants. Elliott Margulies.
Elliott Margulies, Illumina: Was at NHGRI and just recently moved to Illumina in Cambridge. Analysed a family of four, Mum, Dad and identical twins (concordant for a neurological phenotype), as part of the NIH undiagnosed diseases programme. Sequenced genomes on v3 chemistry, four lanes for each family member. Analysis filtering is still not easy: InDels, centromeres and larger deletions are all difficult.
Filtering to reduce variants: Used a "no Q20 evidence" filter for de novo alleles to reduce the number of variants to validate. Identified about 58 de novo changes in the twins. De novo InDels are still very hard to analyse. This is an area that needs improvement.
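My reading of the "no Q20 evidence" filter is that a candidate de novo allele in a twin is only kept if neither parent has even a single read of base quality 20 or better supporting it. That interpretation is my assumption rather than something spelled out in the talk, and the function names and read data in this sketch are made up.

```python
def has_q20_support(reads, allele, min_qual=20):
    """True if any read carries `allele` with base quality >= min_qual.

    reads: list of (base, phred_quality) pairs covering one position.
    """
    return any(base == allele and qual >= min_qual for base, qual in reads)

def is_candidate_de_novo(child_reads, mother_reads, father_reads, allele):
    """Keep an allele seen in the child only if neither parent shows Q20 evidence for it."""
    return (
        has_q20_support(child_reads, allele)
        and not has_q20_support(mother_reads, allele)
        and not has_q20_support(father_reads, allele)
    )

# Hypothetical pileups at one position (made-up data).
child = [("T", 35), ("T", 30), ("C", 32)]
mother = [("C", 33), ("C", 31), ("T", 8)]   # one low-quality T read, ignored
father = [("C", 36), ("C", 29)]
keep = is_candidate_de_novo(child, mother, father, "T")
```

The low-quality T read in the mother does not count as evidence here, so the site survives the filter and goes forward for validation.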
Elliott suggested that people report the proportion of the genome that is callable with a particular method along with error rates etc., allowing a more apples:apples comparison to be made.
Unfortunately this was a case where they have not yet found the mutation responsible in the twins. It's not always simple!
(Paper) Ajay et al Genome Research.

*Multiplexed Enrichment of Rare Alleles. Andre Marziali.
Andre Marziali, Boreal Genomics Inc/UBC
OnTarget system: Use 2D SCODA gels to capture specific loci to remove wild-type sequence. Showed Human:E. coli mix 23:1 input, output of 40% E. coli or a 1000-5000 fold enrichment. Recently improved the technology and presented BRAF V600E vs wildtype achieving a 1M fold enrichment.
(Paper) Winnowing DNA for rare sequences, PLoS ONE, Thompson et al. 2012.
The technology is depleting the wild-type so you don't need to know the mutations beforehand. Could you pool large populations and find all mutations in a population very quickly? Potentially multiplexing up to 100 targets.
PS: I came in late to this talk as the other "concurrent" session wasn't as concurrent as perhaps it should have been!

*Whole-Genome Amplification and Sequencing of Single Human Cells. Sunney Xie, Harvard University.
Sunney Xie, Harvard University: He has been working on single molecule analysis since 1988 (paper) Lu et al in Science 282:1877. His group reported fluorogenic sequencing in micro reactors in Nature Methods (not single molecule). Can you take a single cancer cell and report the genome or transcriptome? It is not currently possible to capture all the fragments from a single genome, so WGA is required. PCR and MDA are both methods to use but have limitations.
MALBAC genome amplification: He presented their method for genome amplification; MALBAC - multiple annealing and looping based amplification cycles. This is a low bias linear preamplification method that covers more of the genome than MDA methods. He showed CNV analysis of two single human cancer cells: some nice genome plots of CNVs, looks good and correlates to SKY karyotypes. There are amplification errors, and the rate appears to be around 2x104. Sequencing two sibling cells reduces the error and shared SNVs are confidently called. Surely this is reliant on the similarity of "sibling" cells; tumours are multi-clonal and this has got to make the analysis tougher.
RNA-seq: single-cell RNA-seq published in 2010 by Asim Surani's group. Ran two experiments simultaneously, genome and transcriptome, and saw that single cells have unique genomes and unique transcriptomes.
Digital RNA-seq: this is work recently published in PNAS. It is not possible to distinguish one copy from a few copies of DNA molecules due to intrinsic noise of PCR. Add barcodes to allow counts to be made on reads and barcodes as two ways to determine expression values. (Paper) Shiroguchi et al PNAS 109, 1347 2012.
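The barcode counting idea is easy to sketch. Each original molecule gets a barcode before amplification, so PCR copies of the same molecule share a barcode, and counting distinct barcodes per gene gives a molecule count that does not depend on how unevenly PCR amplified things. The gene names and barcodes below are made up, and real data would also need to cope with sequencing errors in the barcodes, which I have ignored.

```python
from collections import defaultdict

def digital_counts(reads):
    """Return {gene: (read_count, distinct_barcode_count)}.

    reads: iterable of (gene, barcode) pairs, one per sequenced read.
    """
    read_counts = defaultdict(int)
    barcodes = defaultdict(set)
    for gene, barcode in reads:
        read_counts[gene] += 1
        barcodes[gene].add(barcode)
    return {gene: (read_counts[gene], len(barcodes[gene])) for gene in read_counts}

# Hypothetical reads: both genes have three reads, but different molecule counts.
reads = [
    ("GAPDH", "AAC"), ("GAPDH", "AAC"), ("GAPDH", "GTT"),
    ("TP53", "CCA"), ("TP53", "CCA"), ("TP53", "CCA"),
]
counts = digital_counts(reads)
```

By read count the two genes look identical, but by barcode the first comes from two starting molecules and the second from one that was simply amplified more.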

The parties.
There were parties all over last night, lots of free drinks on offer but I was so wiped out from jet lag I went to bed reasonably early this year. I'll try harder tomorrow. You can't beat scientists having a good time!

Monday 6 February 2012

My AGBT poster

I will be at this year's AGBT and will be presenting poster number 124: " – mapping the world’s sequencers" on Thursday, February 16th from 1:10pm to 2:40pm. Come and say hello and see if your lab is on the map; I may even be wearing my new PacBio cufflinks!

This year I have put together a poster describing the Google map of next-generation sequencers that Nick Loman and I produce. I am sure many of you are already aware of the map and hopefully have your labs and NGS instruments on there as well.

The map has been changed little in two years. But when Nick came on board he made it possible for users to add themselves easily and has built the map on top of a database that allows us to pull out some interesting stats. There are now over 1800 NGS instruments in 650 labs across 45 countries, which is about 60-70% of installs. Over 150,000 page views suggests that a large number of people using NGS are also curious about who else has one.

There is an interesting graph on my map plotting the growth (and decline) of the major platforms. It is clear Illumina has the most sequencers out there today. However there appears to be a ten to one difference in Ion vs MiSeq, first-to-market is very powerful but will it last?

We are also looking for feedback from users about how the map should be developed, and have created a survey for you to complete to help with this. Would you like to see maps for other technologies (we are thinking of Proteomics and Flow Cytometry to begin with)? We are also looking to introduce some features to make finding a service provider much easier. If you would like to advertise your core facility or commercial service provision company on the site do get in touch.

As users ourselves, we do understand that people require different levels of service, and that core facilities and commercial service providers work differently. Some users will require sequencing only, whilst others may want to access library prep or bioinformatics services.

We really want your feedback so please stop by the poster and say Hi.

What did people think would be possible with a $1,000 genome five years ago?

In 2007 Nature Genetics asked "What would you do if it became possible to sequence the equivalent of a full human genome for only $1,000?" 49 people responded, and the list contains some of the great and good in Genome sciences including Mike Stratton, James Lupski, Detlef Weigel, Leena Peltonen-Palotie, et al. It is an interesting read and I'd encourage you to take a look. I'd also recommend James Watson's The Double Helix and Robert Weinberg's One Renegade Cell, as good reads that give a point of view from the past.

The answers these people gave included a lot of speculation but covered some common themes: personalized medicine, Cancer, epigenetics, exomes and the likely "death of arrays". The ethical challenges such cheap sequencing is creating came up several times. And a few answers focused on some of the more frivolous things we might be able to do with such cheap sequencing, such as understanding the genomics of beauty, musical ability and even the perfect golf swing!

It is not fair to comment negatively on any specific predictions; after all I am sure very few people on the planet thought we would be able to produce 1Tb of data in ten days on a HiSeq just five years after Illumina purchased Solexa.

Some of my favourite bits:
Michael Rhodes (Sr Manager Sequencing Portfolio at Life Technologies) pointed out "if the 6-Gb diploid human genome costs a mere $1,000, a 4-Mb haploid genome would cost $0.67." Julian Parkhill (Wellcome Trust Sanger Institute) asked the question "what will you do with a $1 bacterial genome?" Both of these answers resonate with a previous post on this blog: a $1000 genome requires a $1 sample prep. Perhaps they draw attention to the fact that a $1 genome might require an even cheaper prep.
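The arithmetic behind Michael Rhodes' figure is a straight scaling of cost with genome size. A quick check, assuming (as the quote does) that cost is simply proportional to the number of bases and ignoring fixed costs like sample prep; the function name is mine:

```python
def scaled_cost(reference_cost, reference_bases, target_bases):
    """Cost of sequencing `target_bases` if cost scales linearly with genome size."""
    return reference_cost * target_bases / reference_bases

HUMAN_DIPLOID_BASES = 6e9   # the 6-Gb diploid human genome from the quote
BACTERIAL_BASES = 4e6       # the 4-Mb haploid genome from the quote

bacterial_cost = scaled_cost(1000, HUMAN_DIPLOID_BASES, BACTERIAL_BASES)

# Turning it around: how many bases does $1 buy at the $1,000 genome rate?
bases_per_dollar = HUMAN_DIPLOID_BASES / 1000
```

That gives about $0.67 for the 4-Mb genome, and $1 buys 6 Mb at the same rate, so the sequencing itself would cost far less than any sample prep available today.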

Paul Nurse (Chief Exec of the Francis Crick Institute) dared to speculate that all this sequence data might finally squash "the creationists and the intelligent designers under a mountain of base pairs". I suspect no amount of data would bury their faith in what they believe.

Stephen Scherer (Hospital for Sick Children/University of Toronto) wanted "curiosity and imagination [to] trump deep pockets" in the field of genomics. It is certainly a worry of some of the new PIs I speak to that deep pockets appear to be winning the field. However some clever thinking is giving them the chance to stay a step ahead of the huge projects and publish some fantastic work.

Leena Peltonen-Palotie (University of Helsinki) suggested sequencing only one of a monozygotic twin pair "since their genomes are identical, and you get two phenotypes for the price of one genome sequence" and focusing on those twins that are "discordant for important diseases like schizophrenia, autism or Alzheimer disease."

James Lupski (Baylor College of Medicine) wanted to "sequence the haploid genome of 100 sperm from each of ten men in whom the diploid genomic sequence was determined." This might be almost possible today using the Nextera technology from Illumina. It is theoretically possible to sequence a genome with no amplification other than cluster generation on the Illumina system. 1000 single-cell transposome reactions, all individually bar-coded and pooled, then run on as many lanes as possible until the sample was exhausted. There would be 'holes' in the genome but it's perhaps as close as we can get today.

I wondered where are the large scale population studies that Paul Nurse (Cichlid fish explosion) and Trudy Mackay (500 inbred strains of Drosophila melanogaster) mentioned? These should be simple projects to generate the primary sequence data for today, and a 1000 genome project is almost in the realms of a PhD thesis or even a graduate dissertation.

Lastly I wonder if Samir Brahmachari (IGIB, India) actually did "invest in stocks of computer hardware companies involved in the data storage business"!

Thursday 2 February 2012


I hope you like my latest attempt at producing genomics-based art. I must start by thanking Pacific Biosciences for giving me over 300 SMRT cells to use in this project. I also need to say thanks to my daughter who helped glue them all on, it took almost two hours to finish.

At around $100 per SMRT cell the model cost about $35,000 to complete!

Any offers? Donations to Cancer Research UK will be considered.

Wednesday 1 February 2012

Oxford Nanopore confirmed at AGBT

ONT will be speaking at AGBT. 11:40 am-12:00 pm

Clive Brown, their Chief Technology Officer, will be talking about “Single Molecule ‘Strand’ Sequencing Using Protein Nanopores and Scalable Electronic Devices” just before lunch on Friday at 11:40. But only for 20 minutes.

Back in April last year Clive posted on SEQanswers "What if your reads were over 100kb, very accurate and you had a mountain of them?"

See you there.

PS: Illumina licensed ONT's exonuclease sequencing methods only, and I am not sure what deal they have on the strand sequencing methods. Either exonuclease is going to come out in a blast of publicity later in the year or Illumina will need to negotiate again over strand sequencing.