Saturday 28 April 2012

How do SPRI beads work?

Someone recently asked me, “how do SPRI beads work” and I realized I was not completely sure so I went to find out.

My lab uses kits. Lots and lots of kits; kits for DNA extraction, kits for PCR, kits for NGS library prep and kits for sequencing. Kits rock! But understanding what is going on at each stage of the protocol provided with the kit really helps with troubleshooting and modification. How many people have added 24ml of 100% Ethanol to a bottle of Qiagen’s PE buffer without stopping to ask what is already in the bottle? The contents of the PE bottle are… answer at the bottom of this post.

I hope this post helps you understand SPRI a bit better and think of novel ways to use it. I’d recommend reading A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries in Genome Biology 2011 for the with-bead methods discussed at the bottom of this post. I am sure we'll all be using their method a lot more in the future!

You are most likely to come across SPRI beads labeled as Ampure XP from Beckman.

First some handling tips:
  • Vortex beads before use.
  • Store beads at the correct temperature.
  • Allow enough time for beads to come to room temperature.
  • Pipetting is critical; use very careful procedures (SOPs) and well calibrated pipettes.
How do SPRI beads work? Solid Phase Reversible Immobilisation beads were developed at the Whitehead Institute (DeAngelis et al 1995) for purification of PCR amplified colonies in the DNA sequencing group. SPRI beads are paramagnetic (magnetic only in a magnetic field) and this prevents them from clumping and falling out of solution. Each bead is made of polystyrene surrounded by a layer of magnetite, which is coated with carboxyl molecules. It is these that reversibly bind DNA in the presence of the “crowding agent” polyethylene glycol (PEG) and salt (20% PEG, 2.5M NaCl is the magic mix). PEG causes the negatively-charged DNA to bind with the carboxyl groups on the bead surface. As the immobilization is dependent on the concentration of PEG and salt in the reaction, the volumetric ratio of beads to DNA is critical.

SPRI is great for low concentration DNA cleanup, which is why it is used in so many kits. The reagents are easy to handle and a user can process 96 samples very easily in a standard plate. Alternatively the protocol can be easily automated and tens or hundreds of plates can be run on a robot in a working day. The binding capacity of SPRI beads is huge: 1ul of AmpureXP will bind over 3ug of DNA.

This is the typical SPRI protocol from Beckman's website.

Size-selection with SPRI: Again the concentration of PEG in the solution is critical in size-selection protocols, so it can help to increase the volume of DNA you are working with by adding 10mM Tris-HCl pH 8 buffer or H2O to make the pipetting easier. The size of fragments eluted from the beads (or that bind in the first place) is determined by the concentration of PEG, and this in turn is determined by the mix of DNA and beads. A 50ul DNA sample plus 50ul of beads will give a SPRI:DNA ratio of 1, as will 5ul of each (but this is much harder to pipette accurately). As this ratio is changed the length of fragments binding and/or left in solution also changes: the lower the ratio of SPRI:DNA, the larger the final fragments will be at elution. Smaller fragments are retained in the buffer, which is usually discarded, so you can get different size ranges from a single sample with multiple purifications. Part of the reason for this effect is that DNA fragment size affects the total charge per molecule, with larger DNAs having larger charges; this promotes their electrostatic interaction with the beads and displaces smaller DNA fragments.
SPRI size selection from Broad "boot camp"
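The link between the SPRI:DNA ratio and the PEG concentration can be sketched with a little arithmetic. This is only an illustration, and it assumes the bead suspension is supplied in the 20% PEG, 2.5M NaCl mix mentioned above and that the DNA sample itself contains no PEG or salt; the function name is made up for the example.

```python
# Assumption: beads are suspended in 20% PEG, 2.5 M NaCl (the "magic mix"
# quoted in the post) and the DNA sample contributes no PEG or NaCl.
BEAD_PEG_PERCENT = 20.0  # % PEG in the bead buffer
BEAD_NACL_MOLAR = 2.5    # M NaCl in the bead buffer


def final_concentrations(sample_ul, bead_ul):
    """Return (SPRI:DNA ratio, final % PEG, final M NaCl) after mixing."""
    if sample_ul <= 0 or bead_ul <= 0:
        raise ValueError("volumes must be positive")
    total_ul = sample_ul + bead_ul
    ratio = bead_ul / sample_ul
    peg = BEAD_PEG_PERCENT * bead_ul / total_ul
    nacl = BEAD_NACL_MOLAR * bead_ul / total_ul
    return ratio, peg, nacl


if __name__ == "__main__":
    # Same 50 ul sample, different bead volumes (hypothetical values).
    for bead_ul in (25, 35, 40, 50, 90):
        ratio, peg, nacl = final_concentrations(50, bead_ul)
        print(f"50 ul DNA + {bead_ul} ul beads: ratio {ratio:.2f}, "
              f"PEG {peg:.1f}%, NaCl {nacl:.2f} M")
```

Under these assumptions a ratio of 1 gives 10% PEG whatever the volumes, but a 1ul pipetting error matters far more at small scale: 5ul DNA plus 6ul beads ends up at about 10.9% PEG, whereas 50ul plus 51ul is about 10.1%.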

With-bead SPRI cleanup: An incredibly neat modification of SPRI clean-up is the “with-bead” method developed by Fisher et al in the 2011 Genome Biology paper. Rather than using SPRI to clean-up discrete steps in a protocol these are integrated into a single reaction tube method, thus reducing the number of liquid transfer steps. After each step DNA is bound to beads by addition of the 20% PEG, 2.5M NaCl buffer, washes are performed as normal with 70% ethanol but the DNA is not eluted and transferred. Rather the master-mix for the next step in the protocol is added directly to the tube. The final DNA product is eluted from beads for further processing e.g. Illumina sequencing. This modification increases DNA yields in Illumina library prep as multiple transfer steps are removed, reducing the amount of DNA lost at each transfer.

How will you use SPRI in your lab?

PS: the Qiagen PE bottle contains 6ml of water.

PPS: If DNA and Carboxyl are both negatively charged what is happening on a SPRI bead? There appears to be a bit of a black hole in the literature I read through about how the SPRI beads actually work at the molecular level. Digging around some more it appears that the PEG can induce a coil-to-globule state change in the DNA with the NaCl helping to reduce the dielectric constant of the solvent, PEG also acts as a charge-shield. These mechanisms may be behind the "crowding" talked about on other sites. Perhaps someone out there can enlighten me?

Lis JT. Size fractionation of double-stranded DNA by precipitation with polyethylene glycol. Methods Enzymol (1975)
Lis J. Fractionation of DNA fragments by polyethylene glycol induced precipitation. Methods Enzymol (1980)
DeAngelis et al. Solid-phase reversible immobilization for the isolation of PCR products. NAR (1995)
Hawkins et al. DNA purification and isolation using a solid-phase. NAR (1994)

Friday 27 April 2012

My top ten things to consider when organising a conference

I just organised and hosted an NGS conference for 150 people in Cambridge which was well attended, had interesting talks and received positive feedback. It was a lot of work getting everything organised so I thought I’d put together my thoughts on the whole thing as a reminder of what to do next time and also mention some things I learnt during the process. The tips below should be useful for national meetings of 50-200 people.

If you are going international or much bigger then get more help, and lots more money!

First and foremost in my recommendations for event organisers is plan well and plan early!

Here are my top ten tips, there is some discussion about a few of them further down the page.
  1. Don't forget the science!
  2. What is your meeting about and who is the target audience?
  3. How long will the meeting be and what sort of format will you use?
  4. Who is going to organise it?
  5. Who is going to speak?
  6. Where will the meeting be held and what are the facilities like?
  7. Can you get sponsorship or do you need to charge a registration fee?
  8. How will you advertise and organise registrations?
  9. Who is making name badges?
  10. Get some feedback after the meeting!

1, 2, 3: The science and the audience are your most important considerations. Make sure you have a good understanding of what the audience are expecting to get from the meeting. Your own experience of good (and bad) conferences will help. Understanding the focus of the meeting will also help determine the probable length of your meeting. A lot can be accomplished in half a day; more than one and a half days might mean two nights' accommodation and that can put people off coming.

4, 5, 6: Organisation is key; get this wrong and the meeting might not be as successful as you'd hoped. Are you going to organise it yourself, will someone in your lab do it, or do you have an administrator who can help? Will the meeting be in your host institution or somewhere else, what sort of rooms do you have available and is there a pub nearby for afterwards?

7, 8, 9: Meetings cost money and people need to be told about them. If you can get sponsorship then do, many companies would love the opportunity for a captive audience. Charge more for a stand or the chance to give a talk, between £100-£1000 for a stand and as high as you can for a talk. You could also offer companies the chance to sponsor an academic speaker and cover their flights and accommodation, this goes down better with audiences. Registration fees can cost more to administer than they raise in revenue, speak to your finance officer about them first. We used Survey Monkey for registrations, users can answer questions about session preferences and food requirements and everything is delivered back in a simple spreadsheet format with email addresses for a confirmation from you to your attendees. Don't underestimate name badges. Making 200 yourself without knowing what a "mail merge" is would be hard work.

10: Get some feedback. I used Survey Monkey again to ask several questions about the 3rd CRI/Sanger NGS workshop (all the results are here). First and foremost was did people think the meeting was good (91% said yes)? Would they come again (90% said yes)? I also asked about their background and experience, as well as the aspects of the meeting they liked best to help shape next year's meeting. Leaving space for some free text comments was good as well; some people gave very useful constructive criticism.

Other things: I used a spreadsheet in Excel to keep track of the meeting, what the times were for each session, who the speaker was, had they confirmed, did I have a title, what were their contact details, etc. You can get a copy here.

A hard thing to get right is how much to over-book your meeting or under-cater it. You can bet there will be no-shows on the day and a registration fee will help prevent speculative registrations. But inevitably some people won't be able to make it. I would aim for filling the room allocated next time but only catering for 75% of the numbers to keep costs reasonable.

PS: Remember to ask people to switch off their mobiles. I was at a meeting where a phone rang loudly for a while before the current speaker realised it was theirs!

Thursday 26 April 2012

3rd CRI/Sanger NGS workshop

Over the past two days I have hosted the 3rd CRI/Sanger NGS workshop at CRI. I hosted the first of these in 2009 and the demand for information about NGS technologies and applications appears to be greater today than it was back then. We had just under 150 people attend, and had sessions on RNA-seq, Exomes and Amplicons as well as breakouts covering library prep, bioinformatics, exomes and amplicons. Our keynote speaker was Jan Korbel from EMBL.

I wanted to record my notes from the meeting as a summary of what we did for people to refer back to. I'd also hope to encourage you to arrange similar meetings in your institutes. Bringing people together to talk about technologies is an incredibly powerful way to get new collaborations going and learn about the latest methods.
Let me know if you organise a meeting yourself.

I will follow up this post with some things I learnt about organising a scientific meeting. Hopefully this will help you with yours.

3rd CRI/Sanger NGS workshop notes:

RNA-seq session:
Mike Gilchrist: from NIMR presented on his group's work in Xenopus development. He strongly suggested that we should all be doing careful experimental design and consider that variation within biological groups needs to be understood, as methodological and analytical biases are still present in modern methods. mRNA-seq is biased due to transcript length, library prep and sample collection, which needs to be very carefully controlled (SOPs and record everything).
Karim Sorefan: from UEA presented a fantastic demonstration of the biases in smallRNA sequencing. They used a degenerate 9nt oligo with 250,000 possible sequences as the input to a library prep and saw heavy bias (up to 20,000 reads from some sequences where only 15 were expected, with similar issues for a larger 21nt oligo with 10^14 sequences), mainly coming from RNA-ligation bias. smallRNAs are heavily biased due to RNA-ligase sequence preferences; some RNA-ligase bias comes from the impact of 2° structure on ligation (there are other factors as well). He presented a modified adapter approach where the 4bp at the 3’ or 5’ ends of the adapters are degenerate, which very significantly reduced bias to almost the theoretical optimum.
Neil Ward: from Illumina talked about advances in Illumina library preparation and the Nextera workflow, which is rapid (90 mins) and requires vanishingly small quantities of DNA (50ng) without any shearing and uses a dual-indexing approach to allow 96 indexes from just 20 PCR primers (8F and 12R, possibly scalable to 384 with just 40 primers). Illumina provide a tool called “Experiment Manager” for handling of sample data to make sure the sequencing run can be demultiplexed bioinformatically. Nextera does have some bias (as with all methods) but is very comparable to other methods. He showed some amplicon resequencing application (drop off of the ends of PCR products at about 50bp), with good coverage uniformity and demonstrated the ability to detect causative mutations.
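The dual-indexing numbers are just combinatorics: every forward index primer can be paired with every reverse one. A minimal sketch, using placeholder primer names (F1, R1, ...) rather than real Nextera index names, and assuming the 40-primer case is split 16 forward and 24 reverse, which the talk did not spell out:

```python
from itertools import product


def index_pairs(n_forward, n_reverse):
    """All (forward, reverse) index-primer combinations, with placeholder names."""
    forward = [f"F{i + 1}" for i in range(n_forward)]
    reverse = [f"R{j + 1}" for j in range(n_reverse)]
    return list(product(forward, reverse))


pairs_96 = index_pairs(8, 12)     # 8 + 12 = 20 primers
pairs_384 = index_pairs(16, 24)   # 16 + 24 = 40 primers (assumed split)
print(len(pairs_96), len(pairs_384))
```

The number of samples grows with the product of the two primer sets while the number of primers to buy grows only with their sum, which is why dual indexing scales so cheaply.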

Exome session:
Patrick Tarpey: Working in the Cancer Genome Project team at Sanger, Patrick talked about his work on exome sequencing. He refreshed everyone on how mutations can be acquired somatically over time, that there are differences between drivers and passengers, and that exome sequencing can help to detect drivers where mutations are recurrent e.g. SF3B1. They are assaying substitutions, amplifications and small InDels (rearrangements and epigenetic changes are not currently assayable). He presented a BrCa ER+/- T:N paired exome analysis using Agilent SureSelect on HiSeq with Caveman (substitutions) and Pindel (InDel) analysis. About 1/3rd of data is PCR duplicates or off-target so they are currently aiming for 10Gb with 5-6Gb on-target for analysis, ~80% bp at 30x coverage. Validation is an issue: does it need to be orthogonal or not? So far they have identified about 7000 somatic mutations in the BrCa screen (averaging around 10 indels and 50 substitutions), however some mutator phenotype samples have very large numbers of substitutions (50-200). An interesting observation was that TpC (Thymine precedes Cytosine) dinucleotides are prone to C>T/G/A substitutions (why is currently unknown). They found 9 novel cancer genes including two oncogenes (AKT2 & TBX3), and noted that many of the mutated genes in the BrCa screen abrogate JUN kinase signalling (defective in about 50% of BrCa in the study).
David Buck: Oxford Genomics Centre, spoke about their comparison of exome technologies (Agilent vs Illumina). His is a relatively large lab (5 HiSeqs) with automated library prep on Biomek FX. He talked about how exomes can be useful but have limitations (SV-seq is not possible, not clear about impact on complex disease analysis). They did not see a lot of difference in the exome products and he talked about the arms race between Agilent and Illumina. They recently encountered issues in a 650 exome project and saw drop-out of GC rich regions, almost certainly due to a wrong pH in NaOH from acidification during shipment. They had seen issues with SureSelect as well. All kits can go wrong; make sure you run careful QC of the data before passing it through to a secondary analysis pipeline.
Paul Coupland (AGBT poster link): Talked about advances in genome sequencing. He discussed nanopores and some of the issues with them, focusing on translocation speed, and there was some discussion about whether we would really get to a 50x genome in 15 mins. SMRT-cell sequencing: briefly, they are blunt-end ligating DNA to SMRT-bell adapters to create circular molecules for sequencing. Read lengths are up to 5kb but made of sub-reads (single pass reads) averaging 1-2kb, with a 15% error rate on single pass sequencing but <1% on consensus called sequence. The vast majority of errors are insertions, variability is often down to library prep rather than sequencing, and improved library prep methods are needed. He completed a P. falciparum genome project in 5 days from DNA, on 21 SMRT-cells for 80x genome coverage; long reads are really helpful in genome assembly. Paul also mentioned the Ion Torrent technology as well as 454, Visigen, Bionanomatrix, GnuBio, GeniaChip, etc.
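To get a feel for how a 15% single-pass error rate can fall below 1% at consensus, here is a toy majority-vote model. It is only a sketch and not how any real consensus caller works: it assumes errors are independent between passes and treats each pass as simply right or wrong at a base, which is pessimistic (wrong passes would also have to agree with each other) and ignores the fact that most errors are insertions.

```python
from math import comb


def majority_error(per_pass_error, passes):
    """Probability that more than half of `passes` independent reads are wrong at a base."""
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use a positive, odd number of passes to avoid ties")
    need = passes // 2 + 1  # wrong passes needed to out-vote the correct ones
    return sum(
        comb(passes, k) * per_pass_error**k * (1 - per_pass_error) ** (passes - k)
        for k in range(need, passes + 1)
    )


for n in (1, 3, 5, 7, 9):
    print(n, f"{majority_error(0.15, n):.4f}")
```

In this simplified model it takes roughly nine passes over the same base for the consensus error to drop under 1%; real consensus accuracy depends on the error profile and on read depth across molecules, not just passes around one circle.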

Amplicons and clinical sequencing:
Tim Forshew: Tim is a post-doc in Nitzan Rosenfeld's group here at CRI. He is working on analysis of circulating tumour DNA (ctDNA) and its utility for tumour monitoring in patients via blood. ctDNA is dilute, heavily fragmented and the fraction of ctDNA in plasma is very low. The group are developing a method called "TAM-seq" which uses locus-specific primers tailed with short tag sequences, followed by a secondary PCR from these to add NGS adapters and patient specific barcodes. They are pre-amplifying a pool of targets, then using Fluidigm to allow single-plex PCR for specificity, generating 2304 PCRs on one AccessArray. PCR products are recovered for secondary PCR and then sequenced. So far they have sequenced 47 OvCa (TP53, PTEN, EGFR*, PIK3CA*, KRAS*, BRAF* *hotspots only), with amplicons of 150-200bp (good for FFPE); from 30M reads on GAIIx for 96 samples they achieve >3000x coverage. Tim presented an experiment testing the limits of sensitivity using a series of known SNPs in five individuals mixed to produce a sample with 2-500 copies per SNP, and they looked at allele frequency for quantitation. Sanger-sequencing confirmation of specific TP53 mutations was followed up by digital PCR to show these were detectable at low frequency in the pool. Fluidigm NGS found all TP53 mutations and additionally detected an unknown EGFR mutation present at 6% frequency in the patient; this mutation was not present in the initial biopsy of the right ovary. Tim performed some "digital Sanger sequencing" to confirm EGFR mutations, and looking in more detail the biopsies of omentum showed the mutation whilst it was missing in both ovaries. The cost of the assay is around $25 per sample for duplicate library prep and sequencing of clinical samples, they can find mutations as low as 2% (probably lower as sequencing depth and technology get better), and they are now looking at the dynamics of circulating tumour DNA.
TAM-seq and digital PCR are useful monitoring tools and can detect response or relapse weeks or even months earlier than current methods, it is also possible to track multiple mutations in a patient as complex-biomarker assays probably allowing better analysis of tumour heterogeneity and evolution during treatment. He summarised this as a personalised liquid biopsy!
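The read-depth numbers in Tim's talk can be sanity-checked with some back-of-envelope arithmetic. This sketch assumes reads are spread evenly across samples and amplicons (they never are in practice), and the figure of 48 amplicons per sample is my assumption, taken from 2304 PCRs being 48 x 48 on one array; it was not stated in the talk.

```python
def mean_depth(total_reads, n_samples, n_amplicons):
    """Mean reads per amplicon per sample, assuming perfectly even coverage."""
    return total_reads / (n_samples * n_amplicons)


def expected_mutant_reads(depth, allele_fraction):
    """Expected number of reads carrying a variant at a given allele fraction."""
    return depth * allele_fraction


# 30M reads across 96 samples, with a hypothetical 48 amplicons per sample.
depth = mean_depth(30_000_000, 96, 48)
print(f"mean depth per amplicon: {depth:.0f}x")

# At the quoted 3000x, a 2% mutation is still supported by dozens of reads.
print(f"mutant reads at 3000x and 2%: {expected_mutant_reads(3000, 0.02):.0f}")
```

Even coverage would give well over the quoted 3000x, which leaves headroom for uneven amplification, and at 3000x a 2% allele should show up in about 60 reads.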

Christopher Watson: Chris works at St James's clinical molecular genetics lab, which runs a series of NGS clinical tests biweekly to meet turn-around-time (TAT); so far they have issued over 1500 NGS based reports. Theirs was the first lab in the UK to move to NGS testing, see the Morgan et al paper in Human Mutation. They perform triplicate lrPCR amplification for each test and pool the lrPCRs before shearing (Covaris) for a single standard Illumina library prep on a Beckman SPRIworks robot (moving to Nextera), aiming for reports to be automated as much as possible. They are still doing Sanger sequencing where there are not enough individuals to warrant NGS approaches. They run SE100bp sequencing reads (50x coverage) and are using NextGENe from SoftGenetics for data analysis, which gives mutation calls in standard nomenclature. Currently all variants are confirmed by Sanger-seq (100% concordance) but this is under review as the quality of NGS may make this unnecessary and other NGS platforms may be used instead. The lab have seen a 40% reduction in costs and a 50% reduction in hands on time, and their process is CPA accredited. They see clinicians moving from sequential analysis of genes to testing all at once, improving patient care, and now 50% of the workload at Leeds is NGS. Test panels cost £530 each per patient. Chris discussed some challenges (staff retraining, lrPCR design, bioinformatics, maximising run capacity), and they are probably moving to exomes or mid-range panels, the "medical exome". One thing Chris mentioned is that in the UK TAT for BRCA is required to be under 40 days, which was a lot longer than I thought.

Howard Martin: Howard is a clinical scientist working at EASIH at Addenbrooke's Regional genetics lab, where they are still doing most clinical work with Sanger-seq offered on a per gene basis, which can take many months to get to a clinically actionable outcome. They initially used 454 but this has fallen very much out of favour. Howard has led the testing and introduction of the Ion Torrent platform; he had the first PGM in the UK and likes it for a number of reasons: very flexible, fast run time, cheap to run, scalable run formats, etc. They are getting 60Mb from 314 chips. They have seen all the improvements promised by Ion Torrent and are waiting for 400bp reads. He presented an HLA typing project for bone marrow transplantation where they need to decode 1000 genotypes for the DRB1 locus. He is also doing HIV sequencing to identify sub-populations in individuals from a 4kb lrPCR. They are also concatenating Sanger PCRs as the input to fragment library preps to make use of the up-stream sample prep workflows for Sanger-seq; costs are good and data quality is good, with a very fast TAT.
Graeme Black: Graeme is a clinician at the St Mary's hospital in Manchester. He is trying to work out the best way to solve the clinical problem rather than looking specifically at NGS technologies. He needs to solve a problem now in deciphering genetic heterogeneity. The NGS is done in clinically accredited labs. He is working primarily on retinal dystrophies that are highly heterogeneous conditions involving around 200 genes. With Sanger-based methods clinicians need to decide which tests to run as many patients are phenotypically identical; they cherry-pick a few genes for Sanger sequencing but only get about 45% success rates. The FDA is already looking to accredit gene therapies for retinal disorders. Graeme would like success rates on tests to be much higher; complex testing should be possible and universally applied to remove inequality of access (~20% of patients get tested). They are now offering a 105 gene test (Agilent SureSelect on 5500 SOLiD) and Graeme presented data from 50 patients. A big issue is that the NHS finds it really difficult to keep up with the rapid change in technology and information; the NGS lab has doubled the computing capacity of the whole NHS trust! They have pipelined the analysis to allow variant calling and validation for report writing and are finding 5-10 variants per patient; results from the 50 patients presented showed 22 highly pathogenic mutations and they were able to report back valuable information to patients. He is thinking very hard about the reports that go back to clinicians as carrier status is likely to be important. 8/16 previously negative patients got actionable results and he saw a 60-65% success rate from the NGS test.
Current single-gene Sanger costs £400 and the complex 105 gene NGS test costs £900; the NHS spends more but patients get better results.
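Those two prices are worth a quick per-gene comparison. This is just arithmetic on the figures quoted in the talk; it ignores interpretation time, confirmation sequencing and everything else that goes into a real test cost, and the function names are made up for the example.

```python
from math import ceil


def cost_per_gene(test_cost_gbp, genes_covered):
    """Price of a test divided by the number of genes it covers."""
    return test_cost_gbp / genes_covered


def panel_break_even_genes(single_gene_cost_gbp, panel_cost_gbp):
    """Smallest number of single-gene tests that costs at least as much as the panel."""
    return ceil(panel_cost_gbp / single_gene_cost_gbp)


print(f"Sanger: £{cost_per_gene(400, 1):.2f} per gene")
print(f"NGS panel: £{cost_per_gene(900, 105):.2f} per gene")
print(f"Panel is cheaper from {panel_break_even_genes(400, 900)} Sanger genes onwards")
```

On these numbers the panel works out at under £9 per gene, and it becomes the cheaper option as soon as a patient would otherwise need three single-gene Sanger tests.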

Bioinformatics for NGS:
Guy Cochrane, Gord Brown, Simon Andrews, John Marioni: This was one of the breakout sessions where we had very short presentations from a panel followed by 40 minutes discussion. Guy: Is a group leader at the EBI, who are the major provider of Bioinformatics services in Europe; databases, web tools, genomes, expression, proteins etc, etc, etc. He spoke about the CRAM project to compress genome data. Currently we use 15-20 bits per base, lossless CRAM is 3-4 b/b, and it should be possible to get much lower with tools like reference based compression. Gord: Is a staff bioinformatician in Jason Carroll's research group at CRI. He spoke about replication in NGS experiments and its absolute necessity. Why have we seemingly forgotten the lessons learnt with microarrays? Careful experimental design and replication are required to get statistically meaningful data from experiments. Biological variability needs to be understood even if technical variability is low. Gord showed a knock-down experiment with singletons (looked OK) but replication showed how unreliable the initial data was. Reasons why people don’t replicate: cost (“if you can’t afford to do good science is it OK to do cheap bad science?”), sample collection issues, time, etc. John: Is a group leader at the EBI who is working on RNA-seq. He talked about the fact that it is almost impossible to give prescribed guidelines on analysis; the approach depends on your question, samples, experiment, etc. Things are changing a lot. Sequencing one sample at high depth is probably not as good as sequencing more replicates at lower depth (with similar experimental costs). There are many read mapping tools (genome; transcriptome, which is fast but you could miss stuff; de novo gene models; 60 tools at last count) but it is difficult to choose a “best” method. Many are using RPKM for normalisation but it is still not clear how best to normalise and quantify.
There is a lack of an up-to-date gold standard data set (look out for SEQC). John thought that differential expression detection does seem to be a reasonably solved problem, but more effort is needed for calling SNPs in the context of allele-specific expression. He pointed out that there is no VCF-like format for RNA-seq data that might be useful to store variation data in RNA-seq experiments. Simon: Is a bioinformatician at the Babraham Institute and wrote the FastQC package. It should be clear that we need to QC data before we put effort into downstream analysis! Sequencers produce some run QC but this may not be very useful for your samples. Library QC is also still important. Why bother? You can verify your data is high-quality, not contaminated and not (overly) biased. This analysis is also an opportunity to think about what you can do to improve any issues before starting the real analysis. You can also discard data, painful though that may be! Your provider should give you a QC report. Look at Q-scores, sequence composition (GC, RNA-seq has a bias in the first 7-10 nucleotides), trim off adapters, check species, duplication rates, aberrant reads (FastQC, PrinSeq, QRQC, FastX, RNA-seq QC, FastQScreen, TrimGalore).
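For anyone who has not met RPKM (reads per kilobase of transcript per million mapped reads), the calculation itself is tiny. The sketch below uses the standard definition; the read counts, gene lengths and library size are invented purely for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library of 20M mapped reads.
library_size = 20_000_000
short_gene = rpkm(500, 2_000, library_size)    # 500 reads on a 2kb gene
long_gene = rpkm(1_000, 4_000, library_size)   # 1000 reads on a 4kb gene
print(short_gene, long_gene)
```

The two made-up genes get the same RPKM even though one has twice the reads, because it is also twice as long. That is the transcript length correction doing its job; what RPKM cannot fix is a library dominated by a few very highly expressed genes, which is part of why normalisation is still debated.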

An issue that was discussed in this session was "barcode-bleeding", where barcodes appear to switch between samples. No-one is sure what is going on; are we confident that barcoding has well understood biases?

Our Keynote Presentation:
Jan Korbel: Jan is a group leader at the EMBL lab in Heidelberg; he was previously in Mike Snyder's group at Yale where he published one of the landmark papers in Human genomic structural variation. Jan talked about the phenotypic impact of SV in human disease: rare pathogenic SVs in very small numbers of people e.g. Down's, muscular dystrophy; common SVs in common traits e.g. psoriasis, cancer; and cancer-specific somatic SVs e.g. PrCa TMPRSS2/ERG, leukaemia BCR/ABL.

His group's interest is deciphering what SVs are doing in Human disease, using SV mapping from NGS data (Korbel et al 2007 Science). They have developed computational approaches to improve the quality of SV detection: read-depth, split-reads, assembly. He is now leading the 1000 genomes SV group. They produced the first population level map of SVs and looked at the functional impact in GX data. A strong interest is in trying to elucidate the biological mechanisms behind SV formation in the Human genome: non-allelic homologous recombination (NAHR) 23%, mobile element insertion (MEI) 24%, variable number of simple tandem repeats (VNTR) 5%, non homologous rearrangement (NH, NHEJ) 48% (%ages from 1000 genomes data). Some regions of the genome appear to be SV hotspots and are often close to telomeric and centromeric ends.

Cancer is a disease of the genome: it is often very easy to see lots of SV in karyotypes, every cancer is different, but what is causal or consequential? The Korbel group are working on the ICGC PedBrain project; in medulloblastoma there are only about 1-17 SNVs per patient with very poor prognosis. He presented their work on the discovery of “circular” (to be proven) double minute chromosomes. In their first patient they have confirmed all inter-chromosomal connections by PCR as present in the tumour only. In a 2nd patient chr15 was massively amplified and rearranged; at the same time Stephens et al at the Sanger Institute described “chromothripsis” in 2-3% of cancers. Jan asked the question “is the TP53 mutation in Li-Fraumeni syndrome driving chromothripsis?” They used Affy SNP6 and TP53 sequencing to look and saw a clear link between SHH and TP53 mutation in chromothripsis in medulloblastoma. The data is suggestive that chromothripsis is an early, possibly initiating event in some medulloblastomas, that these cancers do not follow the textbook progressive accumulation of mutations model of cancer development, and that it is primarily driven by NHEJ.

TP53 status is actionable and surveillance of patients with MRI, mammography increases survival and personalised treatments may be required as exposure to radiotherapy, etc could be devastating to these patients.

Wednesday 25 April 2012

How much are you using your HiSeq?

Over at GenomeWeb, everyone's favourite after CoreGenomics ;-), there is some coverage of an Illumina conference call. One statistic stood out for me and that was the "average annualised consumable utilisation of HiSeq", I think this is how much each HiSeq owner spends on consumables. Apparently this is $299,000.

I was a little surprised as it equates to only 25 or so paired-end 100bp runs, the kind being used in Cancer genome projects across the world. Even back in 2010 the figure was $400,000 or about 35 PE100 runs.
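The sums behind that comparison are simple enough to write down. This sketch just divides the quoted annual spend by my rough run counts, so the per-run figure is implied by those estimates rather than being a list price.

```python
def cost_per_run(annual_spend_usd, runs_per_year):
    """Implied consumables cost of one run."""
    return annual_spend_usd / runs_per_year


def weeks_between_runs(runs_per_year):
    """Average gap between runs, in weeks."""
    return 52 / runs_per_year


print(round(cost_per_run(299_000, 25)))   # 11960
print(round(cost_per_run(400_000, 35)))   # 11429
print(f"{weeks_between_runs(25):.1f} weeks between runs")
```

So both years imply something like $11-12k per PE100 run, and 25 runs a year is only about one run a fortnight per instrument.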

Why aren't we (as a group) spending more?

If you are operating an instrument feel free to drop me a line and discuss!

Sunday 22 April 2012

Battle of the benchtops: MiSeq vs Ion vs 454 and bacterial genomics

My collaborator Nick Loman is lead author on a Nature Biotechnology paper released today: Performance comparison of benchtop high-throughput sequencing platforms. I have been involved in several performance comparisons in the past and the one question people ask when they read papers like this is "which one should I buy?" The paper does not come down firmly one-way-or-the-other so read it and make your own decisions.

Nick talks about the new batch of laser-printer sized “personal sequencing machines”; GS Junior from Roche, MiSeq from Illumina and PGM from Ion Torrent. There is a lot of competition in this market mainly between Ion Torrent and Illumina, which Nick refers to as “lively”! In fact at a meeting I hosted last week I asked if anyone would consider using 454 to develop new methods on and no-one was willing. It looks like both PGM and MiSeq are making life very difficult for Roche.

Roche have responded with paired-end multiplexed sequencing protocols now and soon-to-be-released longer read lengths and automated library-prep and ePCR. However users have reported problems with the longer reads (>700bp).

Over the past four or five years the perfect mix of machines in many labs has been an Illumina GAIIx or HiSeq and a 454. Is this still the case with the personal genome sequencers?

You can read my summary of the paper at the end of this post. But first I thought I'd share some questions I asked Nick after reading the paper and his responses:

1.  Assuming Illumina release 700bp reads this year what is the impact on 454 and your kind of bacterial assembly? I saw the Broad presentation about 1x300bp reads which looks promising, and pairing those up could give an overlapping read of say 500-550bp which would be very nice. Qualities looked to drop off markedly on the poster I saw so that will need addressing if they are to be used for more than just scaffolding. Even with the long read kit I think 454 will find it increasingly tough to keep up unless they can increase the throughput and/or drop the run cost.

2.  Is increasing to PE250 and 5Gb on MiSeq a big or small impact on your comparison? For our applications (clinical & public health microbiology) we are interested in the per-strain cost, so greater throughput means higher multiplexing which is good. However library prep is increasingly the bulk of the costs, so some work needs to be done there on all the platforms. PE250 - if the qualities are high - should give assemblies very competitive with the 454 at a fraction of the cost.

3.  How would the current PGM 318 chip with PE200bp (or even longer) reads compare to the data you used? A 318 chip should give about 1Gb, or 3-4x what we got with the 316 chips, and make sequencing cheaper. With the PGM the maximum amplicon size you can use in the pipeline is still quite small, so running paired-end would mainly have the effect of improving read quality, as you are reading the same molecule twice. The 200bp kits had some quality issues when first released (see some discussion from Lex Nederbragt). I'd definitely like to repeat my comparison this summer with the latest and greatest kits available for all manufacturers.

4.  Another big assumption but if ONT get even 10kb reads and 1,000,000 per run is there any hope for the three technologies you compared? That's a very good question! There is a general feeling that all this faffing around with short reads may be completely pointless if a nanopore technology delivers on the promise. Certainly bacterial de novo assembly becomes a trivial problem if you can reliably get 10kb+ reads; in this E. coli strain that would cover all the repetitive regions. PacBio already can do great bacterial assemblies but the cost of this instrument puts it out of the running for labs like ours. What's most interesting about Nanopore is the apparent lack of sample preparation and amplification, plus the tiny form factor in a USB stick. As I said at the time of the announcement, this could be a major game-changer for applications like near-patient testing in clinical and public health microbiology.

5.  Thinking about workflow and assembly, have you considered using the Nextera library prep before? Yes, Nextera is very interesting and speeds up the time taken to make libraries, plus with the obvious advantage you can do a bunch in a 96-well plate. The downside it seems to us is that you don't get tight size selection of fragments which makes it less nice for paired-end runs for de novo assembly, where a fixed insert size is very helpful.

6.  What are the things NGS companies need to work on to make your job easier? Is it just more data, longer reads and faster? Can you name three things that would be on your "desert-island discs" list for them to focus on? I guess all the companies need to keep pushing in all directions, but probably the most important for us are: 1) workflow, making it as plug and play as possible to get from a clinical sample to a sequence 2) cost, getting the per-sample price down as far as possible - we want the $10 bacterial genome, so that is going to mean cheaper library preparation then throughput improvements 3) read lengths, the longer the better!

7.  Which one should people buy if they want to do bacterial genome sequencing? Ha, the $1000 question - I'd point people at the paper and get them to figure out what was appropriate for their needs and budget.
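Nick's point in answer 2 about per-strain cost can be made concrete with a little arithmetic. The sketch below is an illustration only and is not taken from the paper: the function name and every price in it are hypothetical placeholders, chosen to show how a fixed run cost is diluted by multiplexing while the library prep cost stays constant per sample.

```python
def cost_per_strain(run_cost, library_prep_cost, strains_per_run):
    """Per-strain cost: each strain pays for its own library prep
    plus an equal share of the sequencing run."""
    if strains_per_run < 1:
        raise ValueError("strains_per_run must be at least 1")
    return library_prep_cost + run_cost / strains_per_run


# Hypothetical prices for illustration only; these are NOT real list prices.
RUN_COST = 800.0
LIBRARY_PREP_COST = 50.0

for n in (1, 4, 12, 24, 96):
    total = cost_per_strain(RUN_COST, LIBRARY_PREP_COST, n)
    prep_share = LIBRARY_PREP_COST / total
    print(f"{n:>3} strains/run: {total:8.2f} per strain, "
          f"library prep = {prep_share:.0%} of cost")
```

With any set of prices the pattern is the same: as multiplexing goes up the run share shrinks and library prep becomes the bulk of the cost, which is why cheaper library prep has to come before throughput improvements can deliver a $10 genome.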

So what does the paper say? There are some simple conclusions from the initial instrument comparison: the 454 GS Junior produced the longest reads, the Ion Torrent PGM generated data fastest and the MiSeq produced the greatest amount of data. See table below.

Comparison analysis:
In the paper Nick and his colleagues compare instrument performance sequencing the O104:H4 E. coli isolate behind the 2011 food poisoning in Germany. They set a benchmark for comparison by first generating an O104:H4 reference assembly on GS FLX+ from long fragment and 8-kb insert paired-end libraries using Titanium chemistry reads (~800bp). They produced a 32-fold coverage very high-quality draft genome assembly.
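For anyone unfamiliar with the term, fold coverage is simply the total number of sequenced bases divided by the genome size. A minimal sketch follows. The 5.3 Mb genome size is a round-number assumption for illustration and is not a figure from the paper, and the function names are hypothetical.

```python
import math


def fold_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Average depth of coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


def reads_needed(target_coverage, mean_read_length_bp, genome_size_bp):
    """Number of reads needed to reach a target average coverage, rounded up."""
    return math.ceil(target_coverage * genome_size_bp / mean_read_length_bp)


# Assumed values for illustration: ~800bp reads (as quoted above) and a
# round-number 5.3 Mb E. coli genome (an assumption, not from the paper).
GENOME_SIZE_BP = 5_300_000
READ_LENGTH_BP = 800

n = reads_needed(32, READ_LENGTH_BP, GENOME_SIZE_BP)
print(n, "reads give", fold_coverage(n, READ_LENGTH_BP, GENOME_SIZE_BP), "fold coverage")
```

This is an average only; real coverage varies along the genome, so some regions will be sequenced far more deeply than others.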

They used this to compare de novo assemblies from each instrument. Contigs obtained from the 454 GS Junior data aligned to the largest proportion of the reference, with 3.72% of the reference unmapped, compared to 3.95% for MiSeq and 4.6% for Ion Torrent PGM. None of the instruments produced a single-contig, 100% accurate genome, and for each technology there is a trade-off between advantages and disadvantages.

As the paper is based on data generated several months ago it is certain to be out of date. Roche continue to improve 454 chemistry, Ion Torrent aim for 400bp reads and MiSeq is about to get PE250bp and has been reported as generating a near 700bp read. How would the comparison look with these improvements?

The impact on public health:
Moving NGS into public health labs is not going to be easy, so it had better come with real advantages. It is also likely that NGS data will need to compare well to current typing methods. In the paper Nick used the NGS assemblies to generate multi-locus sequence typing (MLST) profiles; MiSeq performed best at this but the other platforms also worked quite well.

The paper asked three questions in the discussion, one of which was “how much one should have to rely on human insight rather than automated analyses and pipelines”. There is a lot of discussion around clinical NGS and the need for highly automated reporting back to clinicians; perhaps we should make sure this question stays at the front of those discussions.

The challenge from ONT: I’d be surprised if anyone reading this has not heard of Oxford Nanopore. There is a sense that their technology will render the “short read” platforms (and in this instance I’ll include 454) obsolete. Generating an assembly from 100kb reads is game changing, especially if no sample prep is required. You can read a great article by Michael Eisenstein in the latest issue of Nature Biotechnology.

Thursday 19 April 2012

Breast cancer research extravaganza

Other than listing publications at the side of this blog, I would not normally cover in detail the work that my lab has contributed to or collaborated on. But there are two recent special cases, one just published (covered below) and another coming in five or six weeks (watch this blog).

Yesterday Nature published what I think will become the definitive molecular classification of Breast Cancer. Curtis et al: The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups, or METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) as we have referred to it over the past five years, is an integrated analysis of gene expression and copy number data in almost 2500 patient samples, funded by Cancer Research UK (link).

Other groups have published big studies and the mother of them all (ICGC) is proceeding very nicely (I have been involved in the Prostate and Oesophageal projects at CRUK). A recent paper on triple-negative breast cancer showed that no two tumours were the same. Even when we try to carefully subcategorise we still don’t get to a single disease. There is also an awful lot of intra-tumour heterogeneity we have to get to grips with.

The METABRIC project was led by Carlos Caldas, a senior group leader at the CRI (where I work), and Sam Aparicio in Vancouver, Canada. They looked at copy-number and gene expression, and their interaction, in tumours with high-quality clinical information. They were able to look at what influenced survival or things like age at diagnosis. The real power in the study came from the large numbers used, and because of this they were able to detect new sub-groups in breast cancer, bringing the number from five up to at least ten. Each disease has its own molecular fingerprint that might be used to help diagnosis and treatment.

A specific aim of the project was to identify at least four groups of patients by molecular classification:
  1. Patients without lymph nodes metastasis at very low risk of relapse who might be spared chemotherapeutic treatment.
  2. Patients with ER+ lymph nodes metastasis who might only need hormone therapy.
  3. Patients with ER- lymph nodes metastasis who might have a better prognosis with drugs other than hormones.
  4. Patients with more aggressive disease who are likely to relapse and may well benefit from intensive preventative therapy and follow-up.

Breast Cancer sub-classification: We have known for over twenty years that there are at least five breast cancer subtypes. Initial observations on oestrogen and progesterone receptor status gave us the first three: ER+, PR+ or double-negative (ER/PR-). The discovery that Her2 over-expression had a major impact on a proportion of breast cancers gave us the next two: Her2+ or triple-negative (ER/PR- and Her2-normal). Most people have heard of Her2 as the drug Herceptin is given to patients with the amplification (and my earliest work as a researcher after university was developing a test for Her2 amplification). Tests based on histopathology for ER, PR and Her2, or ones like PAM50 (also used in the METABRIC paper), can classify tumours, but even so some women respond better or worse than their classification might have suggested they would. This is probably because our understanding of breast cancer biology is not complete, and METABRIC gives us a huge leg-up in our understanding.

The analysis team led by Christina Curtis at USC performed a sub-classification of tumours based on 2477 DNA copy-number measurements on Affymetrix SNP6 and 2136 gene expression measurements on Illumina HT12 arrays as well as the matched clinical data. This was the first map of CNVs on breast cancer and is the largest study of its kind for either CNV or GX. It was the clustering of this data that led to discovery of at least ten clinical sub-types. And these clusters contained hotspots of activity suggesting drivers of breast cancer at these locations, Her2 amplifications in the Her2+ patients for instance. They also contained genes never linked to breast cancers before, for which there are drugs available for use in other cancers. This opens the door to new treatments for some patients hopefully in the next three to five years.
BrCa clustering figure from the paper
Prognosis of the ten clusters

Incorporating the data from METABRIC into databases like the NCBI’s International Standards for Cytogenomic Arrays db that contains data on over 30,000 CNV tests, will help clinicians improve patient treatment. The ISCA database provides the first rating scale for CNVs from 0 (no evidence in disease) to 3 (evidence of clinical significance) and should improve as more data gets added.

What did my lab do for the METABRIC project: When I went to the interview for my current job about seven years ago the lead author on METABRIC spoke about his ideas and hopes for the project. It was an inspiring one for me as I was primarily a microarray expert at the time (no NGS then) and the prospect of generating so much data on a single project was exciting.

My lab collaborated on some preliminary work to decide the best way to extract DNA and RNA from the samples. We also helped on a project to determine which microarray to use for copy-number analysis and got early access to the Illumina HT12 arrays, allowing high-throughput gene expression analysis.

It took almost a year to process 5-6000 sectioned tumour samples for DNA and RNA. Because of tumour heterogeneity we used a complicated SOP where each tumour was sectioned for H&E, DNA and RNA multiple times. Duplicate nucleic acid preps were processed in batches using Qiagen DNeasy and miRNeasy kits. We also QC’d every sample on gels, bioanalyser and nanodrop. A similar amount of work went on in the other major collaborators' labs in Vancouver, Canada.

The 2136 HT12 arrays were processed in my lab over just five to six weeks. We did not use robots to do any of the processing and just one person prepared the labelled cRNA for array hybridisation. This was done to minimise the technical variation in the samples. We also used a very carefully designed sample allocation to particular wells in the plates and on the final arrays. Again this allowed us to account for technical variables in the final data analysis.
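The exact allocation scheme we used is described in the paper's supplementary methods and is not reproduced here. As a rough illustration of the general idea only, the sketch below randomises samples across 96-well plates so that a known technical variable (here a hypothetical extraction batch label) is spread evenly across plates rather than confounded with them. The function name, the batch field and the plate layout are all assumptions made for the example.

```python
import random
from collections import defaultdict
from itertools import product

ROWS = "ABCDEFGH"
COLS = range(1, 13)
WELLS = [f"{r}{c}" for r, c in product(ROWS, COLS)]  # 96 wells: A1..H12


def allocate(samples, seed=0):
    """Assign (sample_id, batch) pairs to plate wells.

    Samples are shuffled within each batch, then dealt round-robin across
    plates so every plate gets a similar share of each batch. Well order
    within a plate is shuffled too, so batch is not tied to plate position.
    Returns {sample_id: (plate_number, well, batch)}.
    """
    rng = random.Random(seed)  # fixed seed makes the layout reproducible
    n_plates = -(-len(samples) // len(WELLS))  # ceiling division

    by_batch = defaultdict(list)
    for sample_id, batch in samples:
        by_batch[batch].append(sample_id)

    plates = [[] for _ in range(n_plates)]
    i = 0
    for batch in sorted(by_batch):
        ids = by_batch[batch]
        rng.shuffle(ids)
        for sample_id in ids:
            plates[i % n_plates].append((sample_id, batch))
            i += 1

    layout = {}
    for plate_no, contents in enumerate(plates, start=1):
        wells = WELLS[:]
        rng.shuffle(wells)
        for (sample_id, batch), well in zip(contents, wells):
            layout[sample_id] = (plate_no, well, batch)
    return layout


# Hypothetical sample sheet: 200 samples from three extraction batches.
samples = [(f"S{n:04d}", f"batch{n % 3}") for n in range(200)]
layout = allocate(samples, seed=1)
print(layout["S0000"])
```

Recording the resulting layout alongside the data is what lets plate and position be included as covariates in the final analysis.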

The copy-number arrays were all processed by Aros in Denmark who did a fantastic job.

The discussions leading up to lab work were intensive to say the least. I’d really recommend reading the supplementary methods for details of sample collection, pathology review, nucleic acid extraction and QC, and the microarray processing. Of course there was all the work that went on in the Bioinformatics lab as well, but that is a story I’ll leave someone else to tell.

PS: Please don't comment on how much better this may have been if we had used NGS! It was not even available when we started and processing 2000+ RNA-seq and SV-seq samples would have been almost impossible in a lab with just four sequencers today!

Curtis, C. et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups

Sam Aparicio's group published their research on triple-negative breast cancer last week:
Shah et al. (2012). The clonal and mutational evolution spectrum of primary triple-negative breast cancers.

Sunday 15 April 2012

The long way round...with a DNA sequencer?

I used to ride a BMW R1200RT on my daily commute from Wymondham to Cambridge. What I really wanted was an R1200GS like the ones Ewan and Charlie rode around the world on during their easterly 19,000-mile London to New York trip through Europe and Asia to Alaska and finally New York.

The other evening I was watching "Who Do You Think You Are?", the show where a celebrity is helped along a journey to trace their ancestors. It has been on for coming up to ten years now and there are ten international adaptations on air. It was the episode with Stephen Fry, where they traced his family tree to investigate his maternal Jewish ancestry.

Whilst watching I wondered if Ewan McGregor might fancy combining the two and adding in a bit of sequencing on the trip? We could fit a whole mobile next-generation sequencing lab onto a big BMW and ride around the UK tracing his ancestors and checking out how related people really are to each other. It would not be too hard to fit an Ion Torrent PGM, a One Touch template prep system and a PCR machine for AmpliSeq library prep (using a Human ID kit) all on the back of an R1200GS. With a PalmPCR machine in my pocket and a cool box for a few kits in a pannier the mobile lab is ready to go.

I'd prefer the bike to the Ion bus which is currently touring across the US. And of course if Oxford Nanopore release their MinION at the end of the year I could always use an old Vespa: no need for all the other heavy gear, and so much cooler around European streets in the summer!