CoreGenomics: AGBT 2012 day 1

Day one of my attendance at the 2012 AGBT conference.

I missed the Wednesday workshop, "Utilization of New DNA Sequencing Technologies" and the
welcome reception which this year was sponsored by Agilent Technologies.
Getting through Miami airport was a chore, nearly two hours for security!

I decided to try and post my notes from the meeting on my blog. The idea was to allow those of you not able to come to learn some of the highlights of the meeting from my perspective. The problem I saw when writing these up for posting is that my perspective might not be as insightful as I'd like. I have posted anyway and will try to do the same for the rest of the meeting.

*Abstract-Selected Talks
Thursday morning 16th February 2012
Plenary session: clinical translation of genomics
Lynn Jorde: Whole-genome sequencing, mutation rates, and disease-gene detection.
Heidi Rehm: Supporting large scale sequencing in a clinical environment.
Lisa Shaffer: A genotype first approach to the identification of new chromosomal syndromes.
*Darrell Dinwiddie: Initial clinical experience with molecular testing for >600 severe childhood diseases by target enrichment and next-generation sequencing.
*Rong Chen: Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases.

Three common themes came out of this session: the adoption of panels for disease analysis, sequencing pipelines for discovery of disease-causing mutations, and the need for Sanger validation.

Panels for targeted sequencing: Panels will continue to develop as labs focus on different tests. It seemed that the likelihood of a single panel for Human disease is low; groups will design their own, and some may just sequence the exome plus some additional loci. But it did seem that in the short term specific disease-based panels will be developed and used in many labs across the world. There is a lot to drive research in this direction, primarily the upfront cost of developing a panel, which can be high. If a panel can be shared then it can be manufactured in high volume and the cost drops significantly. It also follows that the data can be more easily shared between labs, increasing the number of samples in meta-analysis.

Disease-causing variant analysis pipelines: I do wonder how long it might take for best practices to be developed for disease-causing mutation pipelines. Ideally many labs would use the same workflow, making sharing much easier. However, in the research setting people can't always agree on the best way to proceed.

Sanger sequencing validation: Sanger sequencing is still being used for confirmatory validation of results from NGS studies. However, the number of Sanger tests that need to be run increases as NGS panels get larger. Can we really afford to keep the paradigm of Sanger validation? I suspect we'll ditch it and rely on NGS alone at some point. It may be that different platforms are used to give us additional confidence. Perhaps Illumina:TSCC sequencing followed up by Ion Torrent:Ampliseq kind of approaches?

The outlook: I suspect the clinic might drive us towards more uniform approaches in all three of the above points as there is a need for standardization. Sharing data across different clinics will also improve our ability to review and update knowledge based on new cases. This will be particularly so for rarer diseases where only a few cases are presented per year in one clinic.

Lynn Jorde: Whole-genome sequencing, mutation rates, and disease-gene detection.
Lynn Jorde, University of Utah School of Medicine: Direct estimation of mutation rates from whole-genome sequence data. Started the AGBT meeting with the obligatory slide showing the cost of genome sequencing and the increasing number of genomes sequenced.
Miller syndrome sequencing: family sequencing of a mother, father and two affected children. In 1981 this family was the first reported to show a recurrence of Miller syndrome, and there was no way of determining inheritance. A second family showing recurrence of Miller syndrome has since been identified. He showed a slide of 7 Miller syndrome patients, but this was about 25% of cases worldwide! This is a rare disorder. They used 15ug of DNA for sequencing by Complete Genomics of all family members. The family self-identified and received a lot of publicity.
Mutation rate estimation: From the CGI data they saw 34000 Mendelian errors, but only expected around 0.01% to be true mutations. They used Agilent capture arrays to resequence all variants and only 28 were confirmed. This allowed them to calculate a mutation rate of 1.1x10^-8, each gamete having about 35 novel variants. Two other trios gave very similar estimates, which is about half the rate people thought before we had NGS data.
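As a back-of-envelope check on how a rate of 1.1x10^-8 turns into "about 35 novel variants" per gamete, here is a minimal sketch. It assumes the rate is per base per generation and uses an approximate haploid human genome size of 3.1 billion bases; that genome size and the function name are my assumptions, not numbers from the talk.

```python
# Rough arithmetic only: expected new variants = mutation rate x genome size.
MUTATION_RATE = 1.1e-8        # from the talk (assumed to be per base per generation)
HAPLOID_GENOME_BP = 3.1e9     # assumption: approximate haploid human genome size


def expected_new_variants(rate_per_base, genome_bp):
    """Expected number of new variants in one haploid genome copy."""
    return rate_per_base * genome_bp


per_gamete = expected_new_variants(MUTATION_RATE, HAPLOID_GENOME_BP)
per_child = 2 * per_gamete    # one gamete from each parent

print(f"per gamete: ~{per_gamete:.0f}, per child: ~{per_child:.0f}")
```

With these assumed inputs the per-gamete figure comes out in the mid-thirties, in line with the number quoted in the talk.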
Variant Discovery: He reported discovering 3M SNPs, 10k non-synonymous and 100 loss of function variants. Methods to identify these are rather ad-hoc so the group developed the Variant Annotation, Analysis and Selection Tool, VAAST, see Yandell et al Genome Research July 2011 (Paper), to better identify disease-causing variants. They used VAAST in a "progeria-like" lethal X-linked disease with exome sequencing; a 20 minute analysis identified NAA10 as the cause of a new genetic condition, "Ogden syndrome". The mother was four months pregnant but they could not inform her of the findings; the baby was affected and died the week the paper was published. A CLIA-approved test has been developed and the family will use it for pre-implantation diagnostics. VAAST increases in power as more control genomes are used. Collecting lots of control genomes will improve the power of this technique, and others like it.
Analysis of male-specific mutation rate: used a large family and compared identical-by-descent regions, showing the male mutation rate was around 5x the female rate. Males are responsible for most of the change in the genome. Questions that might be asked: Effect of paternal age on mutation rate? Variation in DNA repair genes? Population variation?

Heidi Rehm: Supporting large scale sequencing in a clinical environment.
Heidi Rehm, Partners Healthcare Center for Personalized Genetic Medicine: Gave a CLIA lab perspective from the Laboratory of Molecular Medicine (LMM), with over 150 tests; the main focus is on large multi-gene panels, now moved to NGS. They also run MLPA, STR, dPCR, etc. Content per test (www.genetests.org) has grown logarithmically from single gene tests to large panels, and now SNP-arrays, exomes and WGS. LMM workflows have become more complex. The HCM cardio-chip (and OtoChip for hearing loss) included a test for GLA, the only gene where there is a treatment available; 2% of patients have the mutations and can be treated. You need to include as much as possible in a panel! Both HCM and OtoChip are Affymetrix sequencing arrays. Identifying the correct mutation has an impact on patients.
Drawbacks: Need to confirm NGS results; LMM and many others are still Sanger confirming results. Need to fill in any regions known to be associated with a disease, we cannot have gaps. The HCM NGS test requires Sanger validation for confirmation of variants, and also for un-callable bases, low-covered regions (less than 20x) and non-covered regions. Sanger adds a lot of time and money. For the OtoChip NGS test LMM are developing Sanger validation assays at the same time as launching the test. Need to report on all variants and this takes time, 20 min to 2 hours for each novel variant to be assessed using Google, PubMed, and databases like dbSNP, DGV, etc. Myriad detect 10-20 new missense variants per week despite testing 150,000 people for BRCA1/2. Lots of new variants will be discovered. Not clear how many people we may need to sequence to reduce this number. In HCM 66% and in OtoChip 81% of clinically significant variants are found in only one family!
Sharing data: Need to get more clinically useful data out in the public domain and share it around. LMM have a grant submission under review for the creation of a universal Human genomic mutation database. They have approached almost 100 labs and now have 160,000 cases of data that can be submitted if funded.
GeneInsight in Hum Mutat 32:1-5 2011 (Paper). Reports need to be reanalysed and possibly amended as data in the research databases becomes more comprehensive. For HCM 4% of reports would have changed and had an effect on the patient. Similar to the reporting feedback that 23andMe offer to subscribers. We need electronic health care records. Data needs to feed into a network of labs as each will bring their own specific expertise on particular tests. The infrastructure is going to be hugely important.

Lisa Shaffer: A genotype first approach to the identification of new chromosomal syndromes.
Lisa Shaffer, Perkin Elmer, Inc.: This was a sponsored talk from Perkin Elmer. She introduced us to the traditional phenotype-first approach to syndrome identification, e.g. Down's syndrome was reported in 1866 and Lejeune reported the etiology in the 1960s (need to check the date). In the 80s and 90s syndromes were reported based on known chromosomal deletions, easily visible under a light microscope at >10Mb in size. More recently arrays allow very quick analysis of deletions and the size of variation, and are a powerful tool to identify and characterise micro-deletions.
Signature Genomics Genoglyphix: (Google) database of 16000 abnormal cases, looking for deletions in particular genes, see Baliff et al Hum Genet 2011 (Paper). Lisa is proposing a genotype-first approach, which could easily be sequence-based. The important bit is to get the data in one place and curated to allow better analysis, similar to Heidi Rehm's talk above. Sharing is good for everyone.

*Darrell Dinwiddie: Initial clinical experience with molecular testing for >600 severe childhood diseases by target enrichment and next-generation sequencing.
Darrell Dinwiddie, Children's Mercy Hospital: Children's Mercy Centre for Pediatric Genomic Medicine, 20 physicians involved across every specialty in the hospital, 20 MD, PhD or genetic counselors. There are over 7000 Mendelian diseases described, 3,300 have a known molecular basis, and they cause ~25% of infant deaths. We need to test as early and comprehensively as possible to impact treatment or palliative care, and also for pre-conception carrier testing. Rule out specific diseases and better treat patients.
Testing at Mercy today: costs about $10,000, 1-10 diseases at a time, 3000 patients per year, turnaround up to 12 months, 50% cases diagnosed.
Testing at Mercy tomorrow: with genomic medicine should cost about $1000, 600 diseases at a time, 30000 patients per year, turnaround up to 1 week, 90% cases diagnosed.
Logistics are important: The labs are set up with a uni-directional workflow, with separation of DNA isolation, library prep, PCR-enrichment, QC and sequencing. They have employed a semi-automated 96-sample processing platform: Caliper LabChip GX, Caliper robots, Covaris 96-well shearing.
They are using Illumina TSCC of 526 genes, 8366 genomic regions. 4Gb of sequence per sample gives 98% of targets at 16x or higher coverage, and allows around 50 samples per lane on HiSeq. The NGS genotyping test shows very good concordance to array-based genotyping. They do not test children for carrier status or for disorders that will not affect them until adulthood.
Variant detection analysis: They have what looks like a pretty comprehensive variant detection pipeline. Spoke about the difficulties in detecting deletions, insertions and CNVs, and that these can account for up to 80% of disease-causing mutations in some diseases. Gross deletions are detected by a drop in sequence depth and by aberrant read-pair mapping. They are completing a validation of 700 samples including Coriell samples, 384 from CHM, and 220 from collaborators in Germany and Iran.
Summed up with a discussion on the power of genomic medicine. The cases presented were sisters who had been tested over 5 years to find a molecular diagnosis for their disease. Genomic analysis suggested the APTX gene, which showed a homozygous mutation confirmed by Sanger sequencing; the parents were both carriers. APTX causes CoQ10 deficiency that is treatable and this treatment has started in the sisters. Reported on current progress: the older sibling is no longer wheelchair-bound.

*Rong Chen: Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases.
Rong Chen, Personalis Inc.: Presented a nice slide showing how T2D genetic risk decreases when Humans migrate, frequency of risk alleles changes across populations. Why?
Broad thrifty gene hypothesis caused by the promotion of energy storage and usage appropriate to environments and/or energy intake or mismatch between genetic background and environment.
They used publicly available HapMap 3, HGDP and CGI data in their analysis.

Thursday afternoon 16th February 2012
Poster session: some interest in my poster but oh so many posters for people to choose from!

Plenary Session: Genomic Studies
Sequencing Thousands of Human Genomes. Goncalo Abecasis.
Goncalo Abecasis, University of Michigan: Genetic variants that modify LDL levels are responsible for about 30% of known association. Rare variants more often have larger effects. It is difficult to know with certainty if all susceptibility loci have been found.
The scale of rare variation: NHLBI Exome sequencing data, nearly 50% of variants are seen in a single individual. These singletons also appear to show more non-synonymous mutations. How much of the phenotypic variation do these rare variants explain?
How to study rare variation: whole genome sequencing, exomes, low coverage WGS, new GT arrays or genotype imputation. NHLBI performed exome sequencing of 400 samples from a 40,000 population showing the extremes of LDL phenotype. Also had data from 1600 other individuals with sequence data.
Sardinian sequencing: 6000 Sardinians from the Lanusei valley, currently 1700 individuals at 4x. They chose the key individuals in families to sequence, allowing as many genomes as possible to be analysed by imputation of the genome in related individuals. Low coverage analysis allows you to get from a 5% error rate in genotype calls (66 samples) to less than 1% (300 samples).
Problems in analysing so many samples: standard tools for genotype analysis take a long time when you want to run 1000 samples with low coverage sequencing. Different analysis methods give different results, and it is not easy to determine which to use. Consensus calling (Sanger, Broad and NHLBI methods) reduced error rates by 40%.
1000 genomes current integrated map: >1000 samples, 37.9M SNPs, 3.8M short indels, 14000 large deletions. 98.5% of SNPs but only 70% of indels can be validated. 10 samples account for 200000 indel false positives or 5% of errors!

ENCODE: Understanding Our Genome. Ewan Birney.
Ewan Birney, EBI: ChIP-seq 150, RNA-seq 100, DNase-seq 100 (samples assays), 164 assays for multiple transcription factors, six cell lines chosen for in-depth analysis: K562, HeLa, HepG2 plus GM12878, H1ESC, HuVec. 10 to 20 of the ChIP-seq assays have been run on nearly all 182 cell line samples.
ENCODE is a big project: with 11 main sites, 50 labs involved, 10 additional groups, 30 "lead" PIs, 350 authors on the main paper, 6 "high profile" papers and 45 companion papers.
Global analysis of the data without a priori knowledge came up with numbers very similar to what we already think is correct: Exons 3%, ChIP-seq regions 12%, DNase 20%, Histone mods 49%, RNA 80% of genome is covered. The genome is doing a lot of stuff, 97% is not "junk".
Selection in primate specific regions: multiple alignments allow you to infer a particular sequence has inserted into the primate lineage, many of these appear to be under selection in Human populations.
GWAS and functional SNPs: many SNPs are not the functional SNP but simply linked target SNPs. ENCODE showed about 30% of reported SNPs are either functional or within 500bp of the functional SNP.
Ewan gave two big plugs for Rick Myers' DNA methylation and Tom Gingeras' RNA landscape talks tomorrow, so I'll try not to miss those. He talked about a lot of big experiments, very quickly, and went over time.
But I still don't really understand ENCODE, should I try harder?

Molecular Classification of Breast Tumors Using Gene Expression Profiling and Its Translation Into Clinical Practices. Charles Perou.
Charles Perou, The University of North Carolina at Chapel Hill: 2000 Breast Cancer mRNA-seq analysis. Did a lot of work up front on different RNA-seq methodologies. Compared Agilent microarrays with mRNA-seq and DSN RNA-seq. All three showed good concordance, especially so if only the 1600 GX BrCa list are used instead of all genes.
Ribosomal depletion: They performed a comparison of the different ribosomal RNA depletion methods; Ribominus (uses a few oligos to pull down), RiboZero (tiled oligos) or DSN (Double Strand Nuclease). All worked well, Chuck appeared to prefer the DSN approach as possibly more automatable.
GX diagnostics: The current diagnostic tests like PAM50 and Oncotype are useful, pathology is useful, but together they are better. Multiple assays give the very best results for prognosis and diagnosis. We should consider complex tests using protein and gene expression, CNV, InDels, methylation etc.
Adding copy-number data: Copy-number data by itself is good, but both together are only marginally better. It gets expensive to do all the possible technologies on large sample cohorts. The ideal would be to get these assays into clinical trials.

*Breast Cancer Progression From Earliest Lesions to Clinically Relevant Carcinoma Revealed by Deep Whole Genome Sequencing. Arend Sidow.
Arend Sidow, Stanford University: BrCa is a progressive disease, normal through CCL, DCIS, to IDC. Different stages are related by cell lineage and a tree of BrCa evolution is a more appropriate way to describe the disease.
Presented 3 cases with CCL, FEA, DCIS and IDC lesions. Take an FFPE specimen; histology and pathology determine which regions to access for sequencing, then core the tissue, extract DNA, make libraries and sequence. 3 patients, 4-5 regions each, sequenced to 50x coverage. Looking at SNPs, SNVs, aneuploidies and structural variants. It is difficult to dissect tumours cleanly so there is always some normal contamination. Analysis of tumour evolution is possible by looking at which mutations (of any kind) are shared between different sites in the tumour. LOH and chromosome gain affect the absolute coverage of a region: LOH will lead to fewer reads, chromosome gain will result in more reads.
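The point about coverage and normal contamination can be sketched with a simple mixing model. This is my own illustration, not anything shown in the talk: it assumes LOH means loss of one copy (copy-neutral LOH would leave depth unchanged in this model), that normal cells carry two copies, and the 70% tumour purity is a made-up value.

```python
def expected_coverage_ratio(tumour_copies, purity, normal_copies=2):
    """Expected read depth of a region relative to a pure diploid sample.

    tumour_copies: copy number of the region in the tumour cells
    purity: fraction of cells in the sample that are tumour (0 to 1)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    # Average copy number across the tumour/normal mixture
    mixed_copies = purity * tumour_copies + (1.0 - purity) * normal_copies
    return mixed_copies / normal_copies


PURITY = 0.7  # hypothetical: 30% normal contamination
for label, copies in [("one-copy loss", 1), ("unchanged", 2), ("one-copy gain", 3)]:
    ratio = expected_coverage_ratio(copies, PURITY)
    print(f"{label}: {ratio:.2f}x of diploid depth")
```

The more normal contamination there is, the closer both ratios drift back towards 1, which is why a clean dissection matters.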
Cells go through around 40 cell divisions from fertilization to puberty.

Pacific Biosciences Roundtable Discussion/Dinner
Dinner was good, not perhaps as fun a discussion as last year's. John Rubin's movie was very entertaining but this year we only got a clip of Matt Damon in Contagion.

Thursday evening 16th February 2012
I spent most of the evening in the genomic technology concurrent session with a brief interlude in the Medical Sequencing and Genetic Variation session for Elliott Margulies' talk.

*Understanding Sequencing Bias Across Multiple Sequencing Technologies. Michael Ross, Broad:
I missed this one due to a medical emergency, lots of firemen and paramedics in a very tight space. Don't ask!

*Infinipair: Capturing Native Long-range Contiguity by in situ Library Construction and Optical Sequencing Within an Illumina Flow Cell. Jerrod Schwartz.
Jerrod Schwartz, University of Washington. A postdoc in Jay Shendure's group (that group does loads of cool work). De novo assembly does not work brilliantly with current NGS, mainly due to the size of inserts that are sequenced, currently around 400bp-5Kb insert libraries. The HGP used much larger insert libraries. We need short, mid and long-range contiguity sequence information. Current approaches require circularisation of long bits of DNA, which is inefficient. Optical mapping or optical sequencing are other approaches. They thought about what the ideal platform could look like: high-throughput, inexpensive, no circularisation, uses current hardware, works at different length scales.
Infinipair: uses existing Illumina hardware. Load high molecular weight DNA onto a flowcell and generate spatially related reads. Take long DNA, hyb both ends to the flowcell, cut and make clusters out of the pair of covalently attached molecules; reads in clusters 1 and 2 should be spatially related in the genome. They used a PCR-free library prep method to prepare libraries where the two adapters are only on one end, allowing the molecule to hyb at both ends. 2, 3 and 4kb works well. For longer bits of DNA they used in situ stretching of DNA and Transpososome library prep to produce the other end for cluster generation. 5-8kb libraries have been sequenced using the stretching methods.
What's the big deal: no circularisation or PCR or anything in vivo. Works on 1-8kb, works on existing Illumina hardware. They are now improving hybridisation and conversion efficiency. Jerrod finished with a beautiful slide of Lambda DNA stretched out on a flow cell and stained with JoJo dye. No longer stars, more like the jump into Hyperspace of the Millennium Falcon!
PS: this talk was not easy to explain in a few notes, look out for the publication!

*The GnuBIO Platforms: Desktop Sequencer for the Clinical and Applied Markets. John Healy.
John Healy, GnuBio, Inc.: This was perhaps the fastest talk of the meeting, an interesting one and I'll certainly be watching GnuBio, but this was a presentation where I think less would certainly have been more.
The beta-version of GnuBio is coming mid 2012, a desktop sequencer with a small footprint and rackable format. Inject DNA into a microfluidic cartridge for a 200 gene resequencing panel and get sequence data back analysed in 3 hours. The chip is the machine (where have we heard that before).
Their picoinjector: A RainDance emulsion type system that injects DNA into each droplet rather than merging droplets. Each droplet has a distinct primer pair. After PCR each reaction-droplet is injected into a sequencing probe library droplet containing 6mers for hyb and quenching. Primer extension of the sequencing probe extends the reaction (TaqMan) and removes the quenching probe, allowing the fluorophore on the PCR amplicon to signal (added by labelling one of the PCR primers). Finally map all the mini-read contigs onto the genome and then reassemble!!! I did not understand the analysis methodology; it sounded like perhaps I should have, but it was a fast description.
They showed early-access work on a TP53 heterozygous InDel. All work so far completely sequences all amplicons, full-length, Q50+, at over 10,000 fold coverage. Will this be the future for targeted sequencing?
What on earth is a Mahalanobis distance? Apparently it allows 15,000 barcodes in four dyes, 6 dyes gets to 300,000 addresses or 64Kb sequences.
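For what it's worth, a Mahalanobis distance is a distance from a point to a cluster that is scaled by how spread out the cluster is in each direction. Here is a minimal two-dye sketch of why that could help when assigning a droplet to a barcode. The cluster positions, spreads and names are all hypothetical; I have no idea how GnuBio actually implement it.

```python
import math


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of a 2-D point from a cluster (mean, 2x2 covariance)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]]
    squared = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical barcode clusters in (dye 1, dye 2) intensity space
clusters = {
    "A": {"mean": (1.0, 1.0), "cov": ((0.04, 0.0), (0.0, 0.04))},  # tight cluster
    "B": {"mean": (2.0, 1.0), "cov": ((1.00, 0.0), (0.0, 0.04))},  # noisy in dye 1
}


def assign_barcode(point):
    """Pick the cluster with the smallest Mahalanobis distance."""
    return min(clusters, key=lambda k: mahalanobis_2d(point, **clusters[k]))


droplet = (1.5, 1.0)  # exactly halfway between the two means
for name, c in clusters.items():
    print(name, round(mahalanobis_2d(droplet, c["mean"], c["cov"]), 2))
print("assigned to", assign_barcode(droplet))
```

The droplet is the same straight-line distance from both clusters, but it is 2.5 spreads away from the tight cluster A and only 0.5 from the noisy cluster B, so it gets assigned to B.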
Gene panels: You still need to buy oligos and make the panel before you can run an assay and this is still a problem with all the targeted approaches.

*Whole Genome Analysis of Monozygotic Twins and Their Parents: Accurate Detection of Rare Disease-Causing and De Novo Variants. Elliott Margulies.
Elliott Margulies, Illumina: Was at NHGRI and just recently moved to Illumina in Cambridge. Analysed a family of four, mum, dad and identical twins (concordant for neurological phenotype), as part of the NIH undiagnosed diseases programme. Sequenced genomes on v3 chemistry, four lanes for each family member. Analysis filtering is still not easy; InDels, centromeres and larger deletions are all difficult.
Filtering to reduce variants: Used a "no Q20 evidence" filter for de novo alleles to reduce the number of variants to validate. Identified about 58 de novo changes in the twins. De novo InDels are still very hard to analyse. This is an area that needs improvement.
Elliott suggested that people report the proportion of the genome that is callable with a particular method along with error rates etc., allowing a more apples:apples comparison to be made.
Unfortunately this was a case where they have not yet found the mutation responsible in the twins. It's not always simple!
(Paper) Ajay et al Genome Research.

*Multiplexed Enrichment of Rare Alleles. Andre Marziali.
Andre Marziali, Boreal Genomics Inc/UBC
OnTarget system: Use 2D SCODA gels to capture specific loci to remove wild-type sequence. Showed Human:E. coli mix at 23:1 input, output of 40% E. coli or a 1000-5000 fold enrichment. Recently improved the technology and presented BRAF V600E vs wildtype achieving a 1M fold enrichment.
(Paper) Winnowing DNA for rare sequences PlosOne Thompson et al. 2012.
The technology is depleting the wild-type so you don't need to know the mutations beforehand. Could you pool large populations and find all mutations in a population very quickly? Potentially multiplexing up to 100 targets.
PS: I came in late to this talk as the other "concurrent" session wasn't as concurrent as perhaps it should have been!

*Whole-Genome Amplification and Sequencing of Single Human Cells. Sunney Xie, Harvard University.
Sunney Xie, Harvard University: He has been working on single molecule analysis since 1988 (paper) Lu et al in Science 282:1877. His group reported fluorogenic sequencing in micro reactors in Nature Methods (not single molecule). Can you take a single cancer cell and report the genome or transcriptome? It is not currently possible to capture all the fragments from a single genome, so WGA is required. PCR or MDA are both methods to use but have limitations.
MALBAC genome amplification: He presented their method for genome amplification, MALBAC - multiple annealing and looping based amplification cycles. This is a low bias linear preamplification method that covers more of the genome than MDA methods. He showed CNV analysis of two single human cancer cells: some nice genome plots of CNVs, which look good and correlate to SKY karyotypes. There are amplification errors; the figure given appears to be around 2x10^4. Sequencing two sibling cells reduces the error and shared SNVs are confidently called. Surely this is reliant on the similarity of "sibling" cells; tumours are multi-clonal and this has got to make the analysis tougher.
RNA-seq: single-cell RNA-seq was published in 2010 by Azim Surani's group. Ran two experiments, simultaneous genome and transcriptome, and saw that single cells have unique genomes and unique transcriptomes.
Digital RNA-seq: this is work recently published in PNAS. It is not possible to distinguish one copy from a few copies of DNA molecules due to intrinsic noise of PCR. Add barcodes to allow counts to be made on reads and barcodes as two ways to determine expression values. (Paper) Shiroguchi et al PNAS 109, 1347 2012.
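The barcode counting idea can be sketched in a few lines: count reads per gene, and separately count distinct barcodes per gene, so that PCR duplicates of one starting molecule only count once. This is just my illustration of the principle with made-up gene names and barcodes, not the analysis from the paper.

```python
from collections import defaultdict


def count_expression(reads):
    """Count expression two ways from (gene, barcode) pairs.

    Returns (read_counts, molecule_counts): raw reads per gene, and the
    number of distinct barcodes per gene (one per original molecule).
    """
    read_counts = defaultdict(int)
    barcodes_seen = defaultdict(set)
    for gene, barcode in reads:
        read_counts[gene] += 1
        barcodes_seen[gene].add(barcode)
    molecule_counts = {gene: len(codes) for gene, codes in barcodes_seen.items()}
    return dict(read_counts), molecule_counts


# Hypothetical reads: GENE_A has one molecule that PCR amplified three times
reads = [
    ("GENE_A", "AACG"), ("GENE_A", "AACG"), ("GENE_A", "AACG"),
    ("GENE_A", "TTGC"),
    ("GENE_B", "GGAT"),
]
read_counts, molecule_counts = count_expression(reads)
print("reads:", read_counts)
print("molecules:", molecule_counts)
```

Here GENE_A has four reads but only two distinct barcodes, so the digital count says two molecules rather than four.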

The parties.
There were parties all over last night, lots of free drinks on offer but I was so wiped out from jet lag I went to bed reasonably early this year. I'll try harder tomorrow. You can't beat scientists having a good time!

Friday, 17 February 2012