Tuesday, 29 January 2013

Do participants in genomic studies really risk boredom?

There has been lots of news recently following the publication from Yaniv Erlich’s group at the Whitehead Institute on re-identifying individual participants in genomic studies. Anyone who has worked in forensics knows how little data is needed to unequivocally identify an individual. Erlich’s group demonstrated that it is possible to match an individual participant in a publicly available dataset with a relative who has deposited data in a genetic genealogy database. They did this by imputing Y-chromosome STR haplotypes from the public dataset and then searching genetic genealogy sites for matching STR profiles and probable surnames.
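
To make the matching step concrete, here is a purely illustrative Python sketch of the idea of querying a surname database with a Y-STR haplotype. The marker names, the example database and the matching threshold are all hypothetical, and the published method is considerably more sophisticated than this.

```python
# Illustrative sketch only: a toy version of surname inference from Y-STR
# haplotypes. Marker names, the example database and the matching threshold
# are hypothetical; real genealogy searches are far more sophisticated.

# A haplotype is a dict of Y-STR marker -> repeat count.
QUERY = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13}

# Stand-in for a public genetic genealogy database of surname -> haplotype.
GENEALOGY_DB = {
    "Smith":  {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS392": 13, "DYS393": 13},
    "Jones":  {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS392": 11, "DYS393": 12},
    "Venter": {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13, "DYS393": 13},
}

def shared_markers(query, record):
    """Count markers with identical repeat lengths between two haplotypes."""
    common = set(query) & set(record)
    matches = sum(query[m] == record[m] for m in common)
    return matches, len(common)

def candidate_surnames(query, db, min_fraction=0.9):
    """Return surnames whose haplotype matches the query closely enough."""
    hits = []
    for surname, haplotype in db.items():
        matches, compared = shared_markers(query, haplotype)
        if compared and matches / compared >= min_fraction:
            hits.append((surname, matches, compared))
    return sorted(hits, key=lambda h: -h[1])

if __name__ == "__main__":
    for surname, matches, compared in candidate_surnames(QUERY, GENEALOGY_DB):
        print(f"{surname}: {matches}/{compared} markers match")
```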

Identification: The paper is a good read and made good headlines; hopefully this is enough to push the debate over identification along a positive road.

Only a few weeks ago, when the UK Prime Minister, David Cameron, visited our lab, he explained that the £100M investment the UK is making in medical genomics would create a database where all data would be anonymised. I think politicians and policy makers need to be clear about what can and can’t be done with this data so the public are well informed about the benefits and risks of participating in genomic research. We do not want negative headlines putting people off participating just in case their name appears in the Daily Mail. Of course this means that consent may need to be an ongoing process, revisited as new developments arise. One day soon we will be able to get a pretty good genetic photo-fit from the kind of data already in the public domain.

Boredom: Christine Iacobuzio-Donahue runs the Johns Hopkins rapid autopsy programme. This aims to recruit patients and their families into a very emotive study requiring the collection of tumour tissue within a few hours of death. Their consent form lists the risks of participating, one of which appears to be boredom: “You may get tired or bored when we are asking you questions or you are completing questionnaires.” At least they are being honest!

Their patient information sheet also makes specific reference to the fact that your identity may be discoverable: “As stated above, your exact identity will not be provided in materials published as part of this study, but individuals who know you such as family members may be able to identify you to some degree of certainty from published information. Efforts will be made to make it difficult to identify you in this manner, but there is a risk that you will be identifiable by individuals who have prior knowledge of your diagnosis or medical history.”

As scientists I think we need to do as much as possible to promote the work we do and interact with the public. Making people aware of the difficulties of research and being realistic about the rate of progress is important. We need individuals to participate in research studies like the one at Johns Hopkins.

Ideally we don't want people getting put off because they are bored, so let's make those interactions as positive as possible!

Wednesday, 23 January 2013

Mate-Pair made possible?

Mate-pair sequencing has been used from the early days of genome sequencing to help with final assembly. The technique creates libraries with much larger inserts than a standard fragment library prep, often 10kb or more. However, making these libraries has never been easy and the methods used have not changed a great deal since the days of the HGP.

A few groups have been successful in creating mate-pair libraries, but my experience has not been so good. My lab has only tried once and did not get great results, and many of the groups I work with have found the preps difficult and the sequencing results less than encouraging. It looks like mate-pair is a technique where “green fingers” are required. This is not a situation I like, as I am a firm believer that as long as a protocol has been carefully put together then anyone should be able to follow it.

How does mate-pair work: High-molecular-weight genomic DNA is fragmented to an approximate size range, usually 3, 5 or 10kb, end-repaired with biotinylated nucleotides and then size-selected more precisely on a gel. Fragments are circularized by ligation and purified by a biotin:streptavidin cleanup. These circularized molecules are then subjected to a second round of fragmentation (to 300-500bp) and another biotin:streptavidin cleanup to remove all but the ligated ends of the original molecules. This DNA is then used for a standard fragment library prep and mate-pair sequencing. The size of the initial gel selection determines the mate-pair gap expected during alignment of the final sequence reads, and the orientation of the reads acts as a useful QC. Two major problems are the amount of DNA required, usually 10ug or more, and the creation of chimeric molecules during ligation that produce artifactual structural variants.
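
For readers who handle the data rather than the wet lab, here is a minimal sketch, not a validated QC tool, of how insert size and read orientation can be checked after alignment. It assumes a coordinate-sorted BAM, the pysam package and a placeholder file name; the thresholds, and the expectation that true mate-pairs align in the outward (RF) orientation with a large insert, reflect the classic Illumina-style protocol and should be checked against whichever kit you actually use.

```python
# A minimal sketch (not a validated QC tool) of checking mate-pair
# orientation and insert size after alignment. Assumes a coordinate-sorted
# BAM and the pysam package; file name and thresholds are placeholders.
# In classic Illumina-style mate-pair preps the true junction-spanning pairs
# align "outward" (RF) with a large insert, while chimeric or untrimmed
# pairs tend to look like ordinary FR paired-ends with small inserts.
import collections
import pysam

def classify_pair(read):
    """Classify a pair as FR (innie), RF (outie) or TANDEM from one read."""
    if read.is_reverse == read.mate_is_reverse:
        return "TANDEM"
    if read.reference_start <= read.next_reference_start:
        left_is_forward = not read.is_reverse       # this read is leftmost
    else:
        left_is_forward = not read.mate_is_reverse  # the mate is leftmost
    return "FR" if left_is_forward else "RF"

def mate_pair_qc(bam_path, min_insert=1000, max_insert=20000):
    """Tally pair orientations and collect plausible mate-pair insert sizes."""
    counts = collections.Counter()
    inserts = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            # Count each pair once, using read 1 of fully mapped primary pairs.
            if (read.is_unmapped or read.mate_is_unmapped or
                    read.is_secondary or read.is_supplementary or
                    not read.is_read1):
                continue
            if read.reference_id != read.next_reference_id:
                counts["interchromosomal"] += 1
                continue
            orientation = classify_pair(read)
            counts[orientation] += 1
            if orientation == "RF" and min_insert <= abs(read.template_length) <= max_insert:
                inserts.append(abs(read.template_length))
    return counts, inserts

if __name__ == "__main__":
    counts, inserts = mate_pair_qc("mate_pair_library.bam")  # placeholder path
    print(counts)
    if inserts:
        print("median mate-pair insert:", sorted(inserts)[len(inserts) // 2])
```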

Jan Korbel’s group at EMBL have been successfully using mate-pair sequencing in their analysis of structural variation in Cancer (Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome). They use a pretty standard protocol: they fragment DNA using a Hydroshear (see the bottom of this post for more details), cut gels for 5kb libraries, and try to minimise PCR to keep diversity high. They process 8 samples at a time and it takes 5 days, but the tedious and careful handling pays off in some very nice results. The biggest drawback is the requirement for 10ug of DNA.

Newer methods offer some promise for simpler mate-pair prep: At least two new products have been released that may help the rest of us produce reliable mate-pair libraries. Lucigen have developed a new BAC/Fosmid vector system for the creation of high-quality libraries with large inserts. Illumina have released a new Nextera based mate-pair kit. 

Lucigen pNGS: Lucigen's novel protocol for making clone-free libraries of 40-300kb insert size could have a dramatic impact on the de novo assembly of complex genomes. They have released the pNGS system, which includes a di-tagged vector containing the PCR amplification sites required to produce sequence-ready libraries. The vector lacks the promoter for lacZ and has transcriptional terminators which, according to Lucigen, “result in higher stability cloned inserts”.

Lucigen workflow
The system works by using random DNA fragmentation to produce inserts of 40-300kb that are gel-purified and cloned into the pNGS vector. The amplified BACs or Fosmids are then digested with 4bp cutters, leaving just the ends of the original insert and the vector sequence. This digested molecule is self-ligated to create a circular template for PCR amplification, producing the final sequencing library.

Nextera mate-pair: The new protocol from Illumina makes use of the Nextera technology Illumina acquired when it bought Epicentre in 2011. The protocol is not quite what I expected, as it uses a mix of Nextera and TruSeq reagents rather than relying on Nextera alone. Illumina have a TechNote on their website (technote_nextera_matepair_data_processing) that I'd recommend interested readers take a look at. It compares paired-end versus mate-pair plus paired-end sequencing of the Human genome and reports a modest but important increase in coverage statistics, most obvious in repeat regions of the Human genome.

Nextera mate-pair offers gel-free or gel-based size selection. The gel-free option allows a lower DNA input of just 1ug but generates a broader final mate-pair size distribution of 2-15kb, which may make analysis harder as you cannot simply use insert size as a QC to discard aberrant mate-pairs. The gel-based protocol requires 4ug of DNA but gives the user control over the final mate-pair size distribution as normal.

Transposomes loaded with biotinylated oligos are mixed with DNA at a much lower ratio than in the standard Nextera kits, which allows much larger fragments to be produced (2-15kb). The mate-pair library is then size-selected and circularized as in standard protocols (this is still a blunt-ended ligation, which is less efficient than the “sticky” one used in the Lucigen kit). Physical shearing breaks up the large circular molecules, and the biotinylated oligos added by in vitro transposition allow capture of only the mate-pair junction regions. These are purified and used as the template for a standard TruSeq library prep.

I like Nextera and we have been using it in capture projects and in the new XT formulation. My idea of what Nextera mate-pair might look like made use of two sets of transposomes loaded with different sequences: the first would create the large fragments, incorporate biotin and leave compatible ends for a "sticky" ligation; the second would be a standard Nextera prep to produce the final library, which could be streptavidin-purified before PCR.

Will structural variants in Cancer be easier to detect: We’ve done very little mate-pair in my lab because of the sample requirements, so I'm hoping these new developments will mean more users request the protocol and are able to make use of the additional structural variation data. For now many people seem to be happy with getting 80% or more of the variants from standard fragment libraries. However, protocols that allow generation of multiple mate-pair sizes that can be indexed for sequencing are likely to allow identification of important, and so far difficult to identify, rearrangements in Cancer genomes. Being able to run a single pooled sample containing tumour:normal libraries at ~350bp, ~3kb, ~10kb & ~40kb inserts should give very high resolution copy number and high-quality structural variation data. This may also be achievable with far fewer reads than are used today, with the bonus that significantly less DNA is used in the prep.
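
As a toy illustration of the read-depth copy number idea mentioned above (not any production pipeline), the sketch below takes per-bin read counts from tumour and normal libraries and reports a per-bin log2 ratio; the counts, bin size and normalisation are entirely made up.

```python
# A toy illustration (not a production pipeline) of read-depth copy number:
# count reads per fixed-size genomic bin in tumour and normal, normalise for
# library size, then take a log2 ratio per bin. All numbers are hypothetical.
import math

def binned_log2_ratio(tumour_counts, normal_counts, pseudocount=0.5):
    """Per-bin log2(tumour/normal) after simple library-size normalisation."""
    t_total = sum(tumour_counts) or 1
    n_total = sum(normal_counts) or 1
    ratios = []
    for t, n in zip(tumour_counts, normal_counts):
        t_norm = (t + pseudocount) / t_total
        n_norm = (n + pseudocount) / n_total
        ratios.append(math.log2(t_norm / n_norm))
    return ratios

if __name__ == "__main__":
    # Hypothetical per-bin read counts across one chromosome arm.
    tumour = [100, 105, 98, 210, 220, 205, 102, 99]
    normal = [100, 102, 101, 100, 103, 99, 100, 101]
    for i, r in enumerate(binned_log2_ratio(tumour, normal)):
        print(f"bin {i}: log2 ratio {r:+.2f}")
```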

Hydroshear vs Covaris vs sonication vs enzymes: There are many ways to chop DNA into fragments, but only a few will reliably give the larger sizes required for successful mate-pair library preparation. Most of us are using Covaris or Bioruptor instruments to produce 300-500bp fragments. These can also generate longer pieces of DNA, but they are inefficient at this as most of the DNA falls outside the required range.

The Hydroshear is a very clever piece of kit from Digilab that uses a pretty simple mechanism to break DNA into relatively tight fragment distributions. DNA is pushed by a syringe through a tight contraction in a tube. As the sample moves through the contraction the flow rate increases dramatically, stretching the DNA until it snaps. The process is repeated over several cycles until an equilibrium is reached, and the flow rate and the size of the contraction determine the final fragment size. The result is much tighter than the smear of DNA usually seen on a gel after sonication, e.g. 1.5-3kb.

I'm writing a post on DNA fragmentation methods and will go into more detail there.

Saturday, 19 January 2013

Understanding where sequencing errors come from

Next-generation sequencing suffers from many of the same issues as Sanger sequencing as far as errors are concerned. The huge amounts of data generated mean we are presented with far higher numbers of variants than ever before and screening out false positives is a very important part of the process. 

Most discussion of NGS sequencing errors focuses on two main points; low quality bases and PCR errors. A recent publication by a group at the Broad highlights the need to consider sample processing before library preparation as well.
 
Low quality bases: All the NGS companies have made big strides in improving the raw accuracy of the bases we get from our instruments. Read lengths have increased as a result and continue to do so. The number of reads has also increased to the point that for most applications over-sequencing is the norm to get high enough coverage to rule out most issues with low quality base calls.
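
As a reminder of what those quality values actually mean, the short sketch below shows the Phred relationship Q = -10 * log10(P) and a simple 3' quality trim; it is illustrative only and not any vendor's trimming algorithm, and the example read and qualities are made up.

```python
# A minimal sketch of how per-base quality scores are interpreted. Phred
# quality Q relates to error probability P by Q = -10 * log10(P), so Q20
# means a 1% chance the call is wrong and Q30 a 0.1% chance. The trimming
# below is a simple 3' cutoff, not any vendor's exact algorithm.

def phred_to_error_prob(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)

def trim_3prime(seq, quals, min_q=20):
    """Trim a read back from the 3' end until a base of quality >= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

if __name__ == "__main__":
    seq = "ACGTACGTAC"
    quals = [38, 37, 35, 30, 28, 25, 18, 12, 8, 2]   # hypothetical values
    print(phred_to_error_prob(20))                    # 0.01
    print(trim_3prime(seq, quals))                    # trims the low-quality tail
```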
 
PCR errors: All of the current NGS systems use PCR in some form to amplify the initial nucleic acid and to add adapters for sequencing. The amount of amplification can be very high, with multiple rounds of PCR for exome and/or amplicon applications. Several groups have published improved methods that reduce the amount of PCR or use alternative enzymes to increase the fidelity of the reaction, e.g. Quail et al.
 
We also still massively over-amplify most DNA samples to get through the “sample-prep spike”: using PCR to amplify ng quantities of DNA only so that the library can be robustly quantified and then diluted back to pg quantities for loading onto a sequencer. Efforts to remove the need for this spike have resulted in protocols like Illumina’s latest PCR-free TruSeq sample prep. Most labs also use qPCR to quantify libraries, allowing much less amplification to be used. However, we still get people submitting 100nM+ libraries to my facility, so I try to explain that they can use less amplification and it will almost certainly improve their results.
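
For anyone wondering what the quantification arithmetic looks like, here is a worked example converting a dsDNA library concentration to molarity using the usual 660 g/mol per base pair figure; the concentration, fragment length and target loading concentration are made up.

```python
# A worked example of the arithmetic behind library quantification and
# dilution. The 660 g/mol per base pair figure is the usual average for
# double-stranded DNA; concentration, fragment length and target are made up.

def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a dsDNA library concentration (ng/ul) to nanomolar."""
    return (conc_ng_per_ul * 1e6) / (mean_fragment_bp * 660.0)

def dilution_factor(current_nM, target_nM):
    """How many-fold to dilute to reach the loading concentration."""
    return current_nM / target_nM

if __name__ == "__main__":
    conc = 20.0   # ng/ul from a fluorometric or qPCR measurement
    frag = 450    # bp mean fragment size including adapters
    stock = library_molarity_nM(conc, frag)
    print(f"stock = {stock:.1f} nM")                          # ~67 nM
    print(f"dilute {dilution_factor(stock, 10):.0f}x to reach 10 nM")
```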
 
What about sample extraction and processing before library prep: Gad Getz's group at the Broad recently published a very nice paper presenting their analysis of artifactual mutations in NGS data that were a result of oxidative DNA damage during sample prep.

These false positives are not a huge issue for germline analysis projects using high-coverage sequencing, as they can be easily removed. However, for Cancer analysis of somatic mutations or population-based screening, where low allelic fractions can be incredibly meaningful (e.g. analysis of circulating tumour DNA), they present a significant issue. Understanding sequencing and PCR errors helps correct for some of these false positives, but the Broad group demonstrate a novel mechanism for DNA damage during sample preparation using Covaris fragmentation. They do not say stop using your Covaris (a nice but expensive system for fragmenting genomic DNA); rather, they provide a method to reduce the oxidation of guanine to 8-oxoG by adding anti-oxidants to samples before acoustic shearing.
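
As a crude illustration of why this matters for low allelic fraction calls (this is not the metric from the paper), the sketch below simply flags C>A / G>T changes at low allelic fraction as candidates for closer inspection, for example of read-pair orientation bias; the variant records are hypothetical.

```python
# An illustrative filter only (not the metric from the paper): 8-oxoG damage
# typically shows up as C>A / G>T changes at low allelic fraction, so one
# crude screen is to flag such calls for closer inspection. Variant records
# are simple (chrom, pos, ref, alt, allelic_fraction) tuples and hypothetical.

OXOG_CHANGES = {("C", "A"), ("G", "T")}

def flag_possible_oxog(variants, max_af=0.1):
    """Return variants consistent with oxidative-damage artifacts."""
    flagged = []
    for chrom, pos, ref, alt, af in variants:
        if (ref, alt) in OXOG_CHANGES and af <= max_af:
            flagged.append((chrom, pos, ref, alt, af))
    return flagged

if __name__ == "__main__":
    calls = [
        ("chr1", 1234567, "G", "T", 0.04),   # candidate artifact
        ("chr2", 7654321, "A", "G", 0.48),   # looks like a real het SNV
        ("chr3", 2222222, "C", "A", 0.06),   # candidate artifact
    ]
    for v in flag_possible_oxog(calls):
        print("review:", v)
```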

They also make the point that their discovery, and fix for this issue made them think about the multitude of possibilities for similar non-biological mechanisms to impact our analysis of low allelic fraction experiments.
 
I think they are somewhat overly pessimistic about the outlook for Cancer sequencing projects in questioning “whether we can truly be confident that the rare mutations we are searching for can actually be attributed to true biological variation”. The number of samples being used in studies today is increasing significantly and they are coming from multiple centres; I’d hope that both of these will reduce the impact of these non-biological issues. But I wholeheartedly agree with the authors that we should all stop and think about what can go wrong with our experiments at every step, and not just focus on whether the sequencing lab has generated “perfect” sequence data or not!
 
PS: In the spirit of following my own advice I linked to the papers in this post using their DOIs; hopefully blog consolidators will pick this up and add it to the commentary about these articles.

Friday, 18 January 2013

Do scientists make the most of web comments?

Social networking and commenting or reviewing on websites are run-of-the-mill for most of today's web users. Who would book a hotel without checking TripAdvisor, for instance? Even finding a plumber is made easier with sites like Mybuilder!

There are many other "social-networking" sites we use in our day to day lives and some attempt has been made to emulate these in our working lives. I am sure many readers of this blog use LinkedIn, Twitter, FaceBook and other sites. And it is almost impossible to buy something on-line without having access to comments and reviews from other users. However I am surprised how few comments are left by scientists on scientific websites.

One of the things I most like about blogging is the relatively large amount of direct feedback on the work I am doing. Readers leave comments on posts I have written, and I meet readers at meetings who are generally complimentary. But when I read journal articles online I don’t see the same level of commenting as I do on the blogs I read. Is there something stopping us from commenting? If there is, can we do something to enrich the community? Does anybody care? Please leave a comment if you do!

Technical reasons for a lack of commenting: An immediate problem is that comments appear to be stuck where they get left, so unless people visit that particular page they will never see your critical but enlightening comment. So why bother?

For NGS users a fantastic resource is SEQanswers, but it is targeted more at questions and answers on specific issues than at general discussions. Sites such as Mendeley are trying to build more features that make use of their large user community to make our lives easier. At a recent user forum we discussed the possible consolidation of tags based on tagging across all users. The suggestion was to allow Mendeley to analyse tag usage in the community and then suggest tags for you to use, possibly even auto-tagging documents as they are imported. They are already trying to disambiguate tags and to allow users to control the tag vocabulary we use.

A relatively new site is PubPeer, which aims to consolidate comments from various sources. Some journals are trying to do something similar. Nature Genetics has a metrics page for publications, which shows how many citations, tweets, page views, news and blog mentions an article has received. The counts are not comprehensive, as mentions don’t always link directly to the articles in a format that can be assessed; they need to include a DOI.

A DOI, or Digital Object Identifier, is assigned to the majority of scientific articles by the publisher. CrossRef is the organisation that hands the unique identifiers out, and linking with these should allow blog consolidators to scrape content, so I'll try to use DOIs rather than linking to a journal or PubMed in future. However, I am not entirely clear how the people behind the consolidation decide where to scrape from.
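
As a small illustration of why DOIs are machine-friendly, the sketch below follows the dx.doi.org redirect to the publisher's landing page using only the Python standard library; the DOI in the example is a placeholder, so swap in a real one.

```python
# A small sketch of why DOIs help machines: the resolver at dx.doi.org
# redirects any DOI to the publisher's landing page, giving consolidators
# one canonical identifier to link mentions back to. The DOI below is a
# placeholder and will not resolve; swap in a real one.
import urllib.error
import urllib.request

def resolve_doi(doi):
    """Follow the DOI resolver redirect and return the final article URL."""
    url = "http://dx.doi.org/" + doi
    request = urllib.request.Request(url, headers={"User-Agent": "doi-example"})
    with urllib.request.urlopen(request) as response:
        return response.geturl()

if __name__ == "__main__":
    try:
        print(resolve_doi("10.1000/xyz123"))  # placeholder DOI
    except urllib.error.HTTPError as err:
        print("DOI did not resolve:", err.code)
```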

However, I don't think the technical reasons are all that is stopping us, and there are so many opportunities to comment on published work that we could make more use of. If bloggers, news organisations and the like can use DOIs, then perhaps PubPeer, Mendeley and other efforts can consolidate the dispersed discussions and provide us with somewhere to focus our efforts. "You can lead a horse to water, but you can't make it drink", so the onus must be on us to comment in the first place.

Social reasons for a lack of commenting: Are we scientists worried that our unedited comments on the scientific literature might be taken negatively? We are free to discuss ideas in journal clubs and to ask questions after presentations, so why don't we use the resources available to us more widely?

If there is peer pressure, real or imagined, that makes people less likely to leave comments, then we should start an active discussion to promote commenting. I think the outcome would be a richer resource of peer-reviewed primary literature alongside news articles, reviews, comments and blogs, all potentially in one place.

Saturday, 12 January 2013

See the Taj Mahal and get your exome sequenced

There is an ongoing debate about consumer genetics. Companies such as 23andMe can provide what appear on the surface to be rich datasets, but comparisons of results between companies can lead to different conclusions. The raw data can also differ, although where the same array is used for genotyping the differences appear to be minimal.

Three camps appear to be emerging: the paternalistic "oh no you don'ts"; the ambivalent "not sure it matters"; and the righteous "how dare you tell me no's!", with the debate likely to rage for some time to come.

I had my 23andMe screen and found it very interesting, although certainly not life changing. As Susan Young writes over at the MIT Technology Review, the process is easy but not incredibly informative. However, she will keep coming back to her results as more and more information from the scientific community becomes interpretable by 23andMe and others. I wholeheartedly agree with her sentiment that consumer genetics should not be restricted.

The release of some fantastic papers last year, several of which were picked up by news organisations across the globe, means that the general public is becoming more aware of what can be done with genomics.

Foetal screening for Down's: The tests being offered by Sequenom (MaterniT21 PLUS), Verinata, Ariosa and Natera use non-invasive NGS assays to screen for trisomy 21, T18 and T13 (and more in the future). Wired has a great story covering this technology that I think readers would like.

The tests appear to be as sensitive (potentially more so as sequencing depth increases), more specific (it's not just macro-genomic disorders that can be identified) and affordable. A major incentive for their use is the very significantly lower risk when compared to amniocentesis, which carries a 1% risk of miscarriage as well as risks of injury, club foot, rhesus disease and infection. Hopefully these tests will have such an impact on screening that the NHS and other health care providers will adopt them. However, the take-up is not as quick as many informed mothers would like, and some of them are resorting to private screening at a cost of around $500.
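
For the curious, the read-counting principle behind these assays can be illustrated with a toy z-score calculation: count the fraction of cell-free DNA reads mapping to chromosome 21 and compare it with the distribution seen in known euploid pregnancies. Real tests add GC correction, careful normalisation and clinically validated cut-offs, and all the numbers below are hypothetical.

```python
# A toy illustration of the read-counting principle behind non-invasive
# trisomy screening (real assays add GC correction, normalisation and
# validated cutoffs). All numbers here are hypothetical.
import statistics

def chr21_zscore(sample_chr21_fraction, euploid_fractions):
    """Z-score of the sample's chr21 read fraction vs a euploid reference set."""
    mean = statistics.mean(euploid_fractions)
    sd = statistics.stdev(euploid_fractions)
    return (sample_chr21_fraction - mean) / sd

if __name__ == "__main__":
    # Hypothetical chr21 read fractions from euploid reference pregnancies.
    reference = [0.0130, 0.0131, 0.0129, 0.0132, 0.0130, 0.0131, 0.0129, 0.0130]
    sample = 0.0136   # hypothetical test sample
    z = chr21_zscore(sample, reference)
    print(f"z = {z:.1f}; values above ~3 are typically flagged for follow-up")
```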

What about cancer: There are a few companies offering tests for cancer: Genomic Health's OncotypeDX, Quest Diagnostics' lung cancer test and Foundation Medicine's FoundationOne test. This field is behind non-invasive prenatal testing, primarily because the tests are so much harder to validate; rather than a simple positive/negative on a trisomy, a much more complex signature might be required to understand prognosis. There is lots of opportunity for treatment where specific mutations are linked to response. Non-invasive testing is coming to cancer as well, although the possibility of screening asymptomatic patients is a way off. Cancer patients have formed some very strong advocacy groups, and as these tests start to make more impact on their diseases these groups are likely to push for the tests to be introduced. Companies will offer them on a fee-for-service model as well.

Will "health tourism" start to include exomes and genomes: So if the NHS and other providers can't keep up might a business arise for health tourism to countries where patients can simply turn up, have blood taken and get exome or genome results back in a week for $10,000? I'd not be surprised if you will be able to buy a holiday with genomics thrown into the "all-inclusive" package. Imagine a trip to Delih to visit nxgenbio and then onto the Taj Mahal? Regulation differs from country to country so simply hopping across a border or jumping on a plane could get you somewhere a test can be performed. This will have big implications for individuals, health care providers and insurers.

PS: if any travel organisation needs someone to review five star hotel complexes that offer exome sequencing I have no plans yet for my 2013 summer holidays!

Tuesday, 8 January 2013

Illumina's latest innovations, roll on 2013

At the 2012 Illumina European scientific summit just outside Barcelona in June, Sean Humphray presented work on HiSeq 2500 sequencing of clinical genomes. The results were pretty stunning, with very nice data generated and analysed in under a week; it really looked like we were going to be getting some nice Christmas presents from Illumina. Unfortunately Santa left it a bit late, but the goodies are probably worth the wait.

Illumina announced three improvements: to sample prep, new flowcells, and PE250 reads on the HiSeq 2500 (who needs a MiSeq?). All of these are likely to make an impact on the science we are all trying to do using these technologies. They should also keep Life Technologies on their toes, and the competition between Illumina and Life Tech is likely to keep driving costs down and innovation along. We'll need that now that the FTC have given the $117.6 million BGI:CGI deal the go-ahead.

You can listen to Jay Flatley's JP Morgan webcast here.

Better, faster sample prep: all of us making sequencing libraries would like methods to be better, faster and cheaper. Well two out of three ain't bad.

Illumina announced new rapid exome capture kits based on Nextera that allow the entire process to be completed in 1.5 days; combine this with your new HiSeq 2500 rapid run and you can go from DNA to exome sequence data in just 2.5 days. This comment has been edited as the kit is even faster than the previous version I'd used; I thought it was fast enough, but apparently the wizards at Illumina have worked even more magic!

New TruSeq DNA PCR-free kits should be available to order at the end of the month. Sean Humphray's presentation gave us the details of the prep: a 500ng input for the PCR-free protocol leads to very diverse libraries. This is likely to have a big impact on genome sequence quality and reduce GC biases due to PCR dropout.

Nextera Mate-Pair is an exciting innovation, as traditional mate-pair methods have not been easy for everyone to adopt. I first heard rumours about this over a year ago. The ability to make mate-pair libraries from just 1ug of DNA is likely to tempt many to try it. The protocol makes use of a Nextera reaction to create large fragments for circularisation; these are then used as input to a TruSeq library prep to produce the smaller fragments ready for sequencing. I'm going to post a follow-up on this so won't go into too much detail now.

I do wish Illumina would unify some of their Nextera kits, which today use different clean-up methods and have a few other idiosyncrasies between kits that should be pretty much the same.

New targeted RNA-seq should allow users to analyse a few hundred or a few thousand genes in many samples at low cost. The cost of targeted RNA-seq will need to be low, as differential gene expression can be completed with just a few million reads, which today might cost just $50. Cheap sequencing requires cheap sample prep.

A less well announced change is the use of on-bead reactions in TruSeq kits. This is likely to make many applications easier to run, require lower inputs and create higher yielding libraries. I hope this and bead-normalisation will be in almost all kits by the end of 2013.

Better flowcells and who are Moleculo? New flowcells are promised that will give us a significant increase in cluster density; roll on the summer! The figure of 300Gb from HiSeq 2500 announced by Illumina means we could get 3Tb of data in the same time a PE100 run currently takes on HiSeq 2000. That is a lot of data, and all of it on BaseSpace in real time.

Illumina also purchased Moleculo, a start-up company from Stephen Quake's group at Stanford University about which there is very little information out there (see the Moleculo coverage by NextGenSeek). They have developed methods to generate synthetic reads of up to 10kb, which will be useful for many applications. How this might be combined with Nextera mate-pair for high-quality de novo genome assembly remains to be seen.

BaseSpace underwhelms again: Lastly, Illumina made another big splash about BaseSpace. I am a fan, but of what it could be and not what it is. I hope it does develop into a rich community that does not get too restricted by Illumina. There are just eight apps today and the community has yet to develop a "killer app". I've lots of app ideas but building them is not my area of expertise. If anyone wants to collaborate on building or designing apps, give me a shout.