CoreGenomics: October 2013

Wednesday, 23 October 2013

MinION early access program

Update from GenomeWeb at the bottom!

Get ready for millions of minions! ONT have announced an early-access program for the MinION (GridION later) and it is almost free to access. I'm sure they will see huge demand from eager NGS users. But who will have projects best suited to the MinION technology, and who will be first to publish?

Let's not forget ONT's technology promises long reads. How long is not completely clear, but some applications will benefit more than others.

$1,000 buys you a MinION system, free flowcells (to an undisclosed limit), free sample prep and free sequencing reagents. Of course there is no such thing as a free lunch and ONT will require users to sign their End User License Agreement "to allow Oxford Nanopore to further develop the utility of the products, applications and customer support while also maximising scientific benefits for MAP participants". And the press release does give a lot of hope that ONT don't want to restrict your right to publish.

"MAP participants will be the first to publish data from their own samples. Oxford Nanopore does not intend to restrict use or dissemination of the biological results obtained by participants using MinIONs to analyse their own samples. Oxford Nanopore is interested in the quality and performance of the MiniION system itself."

I've signed up already!

You can too by visiting the ONT contact page and selecting the box marked 'Keep me informed on the MinION Access programme'.

Update: Some more details came from GenomeWeb a few minutes after I posted. According to their coverage, read lengths may be up to 100kb but the number of pores could be as low as 500. This is exactly the kind of detail we are going to need to determine the best applications to run tests on.

Tuesday, 22 October 2013

Bioinformatics at the top

A few years ago one of our junior group leaders made an interesting appointment; he recruited a bioinformatician into a research assistant role. Every lab has someone, or several people, who keep the lab running. They are the people making sure cells get cultured, supporting post-docs and PhD students, the nuts and bolts of most labs. But this recruitment stood out as the person was being appointed to look after "Big Data", not PCR & gels!

Now we have embedded bioinformaticians and bioinformatic research assistants in many of the groups, especially those heavily using Genomics technologies.

Computational groups seem to have changed too and all those in our Institute now have wet-lab scientists as part of their team. I think this is definitely the way to go and makes it much easier for groups to direct their research in a particular direction.

Computational biologists of all sorts are rapidly cropping up at the top end of the career ladder. In February 2013 Professor Simon Tavaré FRS, FMedSci was appointed as Director of the Cancer Research UK Cambridge Institute (the place I work), and earlier this week BBSRC announced that Dr Mario Caccamo has been appointed as the new Director of The Genome Analysis Centre (TGAC).

With statisticians and computational biologists leading the way as Directors of research institutions, and not just as group leaders, I wonder if we'll see a slightly different angle on some of our research?

Monday, 21 October 2013

Genomics England is go

Genomics England is steaming ahead to sequence 100,000 genomes from NHS patients. Today Genomics England and Illumina announced their intention to start the 1st 10,000 genomes as part of a sequencing contract run by Illumina.

Set up by the Department of Health and announced by the PM (when visiting my lab) in December 2012, the Genomics England project has some lofty goals. If the team can deliver then the NHS and the UK population really could benefit from the advances in molecular medicine. I for one would be glad to see the NHS take the lead on the world stage, as we've had some pretty big milestones so far in the UK:

1953 and the structure of DNA was discovered in the UK.

1977 Sanger DNA sequencing invented in the UK.

1997 Solexa sequencing invented in the UK.

2020 UK NHS 1st to screen all cancer patients with NGS?

The £100 million so far pledged by the UK government will (according to the Genomics England website):

train a new generation of British genetic scientists to develop life-saving new drugs, treatments and scientific breakthroughs;

train the wider healthcare community to use the technology;

fund the initial DNA sequencing for cancer and rare and inherited diseases; and

build the secure NHS data linkage to ensure that this new technology leads to better care for patients.

See the Science working group report if you'd like to know more about where they are going.

This morning Genomics England and Illumina announced their intention to start a 3 year programme of sequencing genomes. The 1st 10,000 genomes will be for rare diseases and this has real potential to impact many patients; ideally with treatments for their disease, but at the very least a hope that a causal mutation is discovered.

This is the first step for the NHS to develop the infrastructure required to bring WGS into routine clinical practice. But the UK is likely to need a big and shiny new sequencing space if we are going to do what David Cameron said and do all the sequencing in England. Note that this is very carefully stated: the sequencing will be in England; not China, not the US, and not Scotland!

Whether we will realistically sequence whole genomes from 100,000 patients is not clear. The infrastructure to do this in a timely fashion does not exist in the UK (yet). And as the technologies for sequencing improve, clinical exomes, amplicon panels and whole genomes will all need to be considered to find the best fit for different groups of patients.

With SynapDx in the US releasing an Autism Spectrum Disorder test using RNA-seq, it is clear that genomes are just the tip of the iceberg.

Friday, 18 October 2013

How good is your NGS multiplexing?

Here’s a bold statement: "I believe almost all NGS experiments would be better off run as a single pool of samples across multiple lanes on a sequencer."

So why do many users still run single-sample per-lane experiments or stick to multiplexes that give them a defined number of reads per lane for each sample in a pool? One reason is the maths is easy: if I need 10M reads per sample in an RNA-seq experiment then I can multiplex 20 samples per lane (assuming 200M reads per lane). But this means my precious 40 sample experiment now has an avoidable batch effect, as it is run on two lanes which could be on two separate flowcells, on different instruments, at different times, by different operators, in different labs… not so good now, is it!

And why doesn’t everyone simply multiplex the whole experiment into one pool in the first place? When I talk to users the biggest concern has been, and remains, the ability to create accurate pools. A poorly balanced large pool is probably worse than multiple smaller pools, as with the latter you can simply do more sequencing on perhaps one of the sub-pools to even out the sequencing in the experiment.

We have pretty well agreed standards on quality (>Q30) and coverage (>10x for SNP calling), but nothing for what the CV of a pool of barcoded libraries should be. What’s good and what’s bad is pretty much left up to individuals to decide.
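The arithmetic behind that choice can be sketched in a few lines of Python. The function names are made up for illustration, and the 200M reads per lane is the working assumption used above rather than a guaranteed yield.

```python
import math

def samples_per_lane(reads_per_lane, reads_per_sample):
    """Largest number of samples that still reach the target from one lane."""
    return reads_per_lane // reads_per_sample

def lanes_needed(n_samples, reads_per_lane, reads_per_sample):
    """Lanes needed so that every sample reaches its target read count."""
    return math.ceil(n_samples * reads_per_sample / reads_per_lane)

def reads_per_sample_single_pool(n_samples, n_lanes, reads_per_lane):
    """Expected reads per sample when ONE pool of all samples is run on
    every lane (assumes a perfectly balanced pool)."""
    return n_lanes * reads_per_lane / n_samples

READS_PER_LANE = 200_000_000    # assumed lane yield
READS_PER_SAMPLE = 10_000_000   # target for the RNA-seq example

print(samples_per_lane(READS_PER_LANE, READS_PER_SAMPLE))
print(lanes_needed(40, READS_PER_LANE, READS_PER_SAMPLE))
print(reads_per_sample_single_pool(40, 2, READS_PER_LANE))
```

Splitting 40 samples into two pools of 20 and running one pool of 40 across both lanes give the same expected depth per sample; the difference is that in the single pool every sample sees both lanes, so lane is no longer confounded with sample group.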
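For anyone wanting to put a number on their own pools, here is a minimal Python sketch of the CV calculation. The per-barcode read counts are invented for illustration and are not data from any real run.

```python
import statistics

def pool_cv(read_counts):
    """Coefficient of variation (%) of per-barcode read counts in a pool."""
    mean = statistics.mean(read_counts)
    sd = statistics.stdev(read_counts)  # sample standard deviation
    return 100 * sd / mean

# Hypothetical read counts (millions) per barcode after demultiplexing
balanced_pool = [9.8, 10.1, 10.3, 9.9, 10.0, 9.9]
unbalanced_pool = [4.0, 16.0, 9.0, 12.0, 6.0, 13.0]

print(f"balanced pool CV:   {pool_cv(balanced_pool):.1f}%")
print(f"unbalanced pool CV: {pool_cv(unbalanced_pool):.1f}%")
```

What CV counts as acceptable is exactly the open question; the calculation only makes pools comparable with each other.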

Here are some examples from my lab: pools 1, 2 & 3 are not great; 4 is very good.

Robust library quantification is the key: What do Illumina et al do to help? The biggest step forward in the last few years has been the adoption of qPCR to quantify libraries. Most people I speak to are using the Kapa kit or a similar variant. Libraries are diluted and compared to known standards. When used properly the method allows very accurate quantification and pooling. However, it has one very large problem: you need to know the size of your library to calculate molarity.

The maths once you have size is pretty simple:
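As a minimal sketch of that conversion in Python, assuming the commonly used average of 660 g/mol per base pair of double-stranded DNA (the function name and the example numbers are made up for illustration):

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA, a commonly used average

def molarity_nM(conc_ng_per_ul, mean_size_bp):
    """Convert a mass concentration (ng/ul) to molarity (nM) for a dsDNA
    library with the given mean fragment size in base pairs."""
    return conc_ng_per_ul / (AVG_BP_MASS * mean_size_bp) * 1e6

# The same mass concentration at two different (hypothetical) library sizes
for size_bp in (200, 400):
    print(f"10 ng/ul at {size_bp}bp = {molarity_nM(10, size_bp):.1f} nM")
```

Because size sits in the denominator, any error in the size estimate carries straight through to the molarity: call a 400bp library 200bp and the calculated molarity doubles.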

We find dilutions of 1:10,000 and 1:100,000 are needed to accommodate the concentrations of most of the libraries we get. We do run libraries in triplicate and qPCR both dilutions. It’s a lot of qPCR but the results are pretty good.

Unfortunately accurate sizing is not trivial and it can be a struggle to get this right. Running libraries on a gel or Bioanalyser is time consuming, and for some libraries it is difficult to determine a very accurate size, e.g. amplicons & Nextera. Some users don’t bother; they just trust that their protocol always gives them the same size. The Bioanalyser is not perfect either: read this post about Robin Coope’s iFU for improved Bioanalyser analysis. Get the sizing wrong and the yield on the flowcell is likely to be affected.

Even with accurate QT, pooling is still a complex thing to get right: Illumina try to help by providing guidelines to allow users to make pools of just about any size. However, these are a headache to explain to new users without the Illumina documentation. And pooling always has a big drawback in that you may need to sequence a couple of libraries again, and this can be impossible if they are not compatible.

We run MiSeq QC on most of the library preps completed in my lab. This is very cost effective if we are sequencing a whole plate of RNA-seq or ChIP-seq, at just £5 per sample. However if we only have 24 RNA-seq samples then we’ll only want 2 lanes of HiSeq SE50bp data; this means MiSeq QC is probably a waste of time and we’ll just generate the experimental data. Unfortunately the only way to know for sure that the barcode balance is good is to perform a sequencing run!

Mixing pools-of-pools to create "Superpools": We’ve been thinking about how we might handle pools-of-pools (Superpools) on HiSeq 2500. The instrument has a two-lane flowcell that requires a $400 accessory kit if you want to put a single sample on each lane. The alternative is to run two lanes, or a superpool of libraries from different users. We’ve tested this in our LIMS and can create the pools and the samplesheet and do the run, but in thinking about the process we’ve come up with a new problem. What do you do when the libraries you want to superpool are different sizes?

We can accurately quantify library concentration (if you can accurately size your libraries) but the clustering process favours small molecules. Consider the following scenario: in a superpool of two experiments on one HiSeq 2500 flowcell we have an RNA-seq library (275bp) and a ChIP-seq library (500bp). These are equimolar pooled and sequenced. When demultiplexed the RNA-seq library accounts for 80% of the run and the ChIP-seq 20%; consequently the RNA-seq user has too much data and the ChIP-seq user has too little. And all because the smaller RNA-seq library clustered more efficiently. How do you work that one out!
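One way to reason about that scenario is a toy model in which each library's share of clusters is proportional to its molar fraction multiplied by a relative clustering efficiency. This is an assumption made purely for illustration, not a validated model of clustering, and the 4:1 efficiencies below are simply the values that reproduce the 80/20 split in the example.

```python
def expected_read_fractions(molar_fractions, efficiencies):
    """Read share per library under the toy model:
    share is proportional to molar fraction x relative clustering efficiency."""
    weighted = [m * e for m, e in zip(molar_fractions, efficiencies)]
    total = sum(weighted)
    return [w / total for w in weighted]

def compensated_molar_fractions(target_read_fractions, efficiencies):
    """Molar fractions to pool at so the toy model returns the target shares."""
    raw = [t / e for t, e in zip(target_read_fractions, efficiencies)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical relative efficiencies: [RNA-seq 275bp, ChIP-seq 500bp]
efficiencies = [4.0, 1.0]

print(expected_read_fractions([0.5, 0.5], efficiencies))      # equimolar pool
print(compensated_molar_fractions([0.5, 0.5], efficiencies))  # pool for 50/50
```

Under these assumptions the ChIP-seq library would need to make up about 80% of the pool by molarity to get half the reads. Real efficiencies would have to be measured, for example from a QC run of the superpool.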

We’ve not empirically tested this but I think we will soon on our MiSeq.

Top tips for accurate pooling:

Perform robust QT
Mix libraries with high volume pipetting (~10ul)
Run MiSeq QC

PS: writing this post has got me thinking of better ways to confirm pooling efficiency than sequencing. Watch this space!

Tuesday, 15 October 2013

Hacking MiSeq updated and now hacking your BioAnalyser too!

In a post earlier this week I talked about the hacking of a MiSeq run by MadsAlbertsen. One comment on the post drew my attention to another paper I'd missed, where the authors hacked their MiSeq to perform 600bp reads (PE300). Considering this was a year before Illumina sold us kits I'd say that's quite an achievement!

The Genome Sciences Centre, British Columbia Cancer Agency in Vancouver, did the sequencing for the Spruce genome paper (1). One of the authors is Robin Coope (Group Leader, Instrumentation BCCA Genome Sciences Centre) and he has been behind some pretty cool engineering in the genome sciences. In the Spruce paper his group demonstrated how to crack open a MiSeq cartridge and replace the insides with a larger reagent reservoir so kits can be mixed allowing much longer runs than Illumina intended (at the time of publishing).

The image below is from their supplementary data, I don't recommend you do this at home!

I met Robin when he was speaking at European Lab Automation 2013 in Hamburg last June. He gave an excellent talk on Automation Challenges in Next Generation Sequencing; we also had excellent Wiener schnitzel and dark bier once the conference finished. He spoke about the problems of quantifying NGS libraries on Bioanalyser and qPCR; we want molarity but get DNA concentration and these are not the same thing! Current methods allow you to use a simple calculation to convert between the two but this is heavily reliant on library size estimation. It is pretty much impossible to get the size right in the first place without measuring it and most people use the BioAnalyser. This is where Robin's talk really got interesting for me...

Unfortunately I can't share the slides (it was a commercial conference) but you could email Robin and ask him for a copy (or to hurry up and publish). Basically he described the deficiencies of the Bioanalyser software and introduced the concept of intelligent Fluorescent Units (iFU) to change the way the Bioanalyser does its analysis.

The Bioanalyser does a reasonable job of calculating size and molarity that works well on “tight” libraries. Equally, a visual estimation of mean insert size gives good results, and cluster errors are more likely to come from mass quantification errors than from insert size estimation errors. However, for wider library distributions like Nextera or amplicons, iFU improves cluster density prediction, reducing cluster density error by 60% in the set (n=28) of amplicons he presented.

Of course none of this would be needed if we were using probe-based assays or digital PCR to count library molecules, but that is a whole other post!

Finally Robin went on to describe his group's work on the Barracuda, a robot for 96-sample gel size selection.

Monday, 14 October 2013

Detecting trisomies in twin pregnancies: now available from Verinata

Illumina acquired Verinata earlier this year and their verifi prenatal test is a non-invasive one that detects foetal aneuploidies as early as 10 weeks. Many others are developing or selling similar tests and the real excitement for me (as I already have kids and don't plan on any more) is the impact that developments in foetal medicine have for Cancer diagnosis and prognosis.

A press release on Illumina's website today announces the development of the verifi test for use in twin pregnancies. In a twin pregnancy the allele fraction from each twin in maternal blood is lower than in a single pregnancy, making detection harder. They have verified the test in over 100 twin pregnancies and achieved 100% detection of aneuploidy for Down's, Edwards' and Patau syndromes (trisomies 21, 18 & 13 respectively).

This shows how much development is still on-going in non-invasive testing by NGS.

Does this mean we can expect tests that will detect multiple cancers from ctDNA? Perhaps if we can improve sensitivity and can distinguish cancers based on specific patterns of mutation.

How do you count reads from a next-gen sequencer?

We’ve been asked a question many times over the years “should you count paired-end sequencing as one read or two?”

“Who cares” I hear you cry. But if you ask someone to give you back 200M reads for a sample and they give you 100M paired-reads who’s right? Again this may not be important to you but if you have to pay for 100M reads when you were expecting 200M you’re going to feel short-changed. And when those 100M might have cost £500 or more you care about the change!

My personal view is that it is better for us to count the number of molecules we sequenced from the initial library. This currency tells us something about how deeply we sequenced one sample compared to another. In the case of Illumina sequencing that means counting clusters (raw or PF is a whole other debate), for Ion Torrent I guess it would be positive wells (probably PF reads).

We get about 160-180M reads per lane on our HiSeq 2000, and we count a ‘pair’ of reads as a single data point. That is to say, a single end run or a paired end run with the same cluster density will give the same "read" count. This turns out to be useful when we want to compare performance of single-end and paired-end runs. I’m happy to listen to the point of view that says there are actually two sequencing reads generated in the paired-end run but I find it adds to the confusion new users have. I know others do too as I’ve been to many talks where people have quoted the number of reads they get for an instrument and I know it was way too high to be possible (my suspicion being a paired-end run was quoted).

The nice simple metric number of clusters per lane works well for me in most cases. This also allows me and my users to compare between different instruments easily; GAIIx 40M, HiSeq 170M, MiSeq 15M, etc.
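The counting convention described above is easy to pin down in code. This is a small Python sketch with made-up function names; it assumes a quoted paired-end read count includes both mates of every cluster.

```python
def clusters_from_reads(n_reads, paired_end):
    """Number of sequenced molecules (clusters) behind a quoted read count."""
    return n_reads // 2 if paired_end else n_reads

def bases_sequenced(n_clusters, read_length_bp, paired_end):
    """Total bases for a run, counted from clusters rather than reads."""
    reads_per_cluster = 2 if paired_end else 1
    return n_clusters * read_length_bp * reads_per_cluster

# "200M reads" means very different depths depending on who is counting
print(clusters_from_reads(200_000_000, paired_end=False))
print(clusters_from_reads(200_000_000, paired_end=True))
```

Quoting clusters plus the run type (for example 170M clusters, PE100) removes the ambiguity, and the base yield can still be worked out when it is needed.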

Unfortunately in trying to decide whether to upgrade to HiSeq 2500 and use rapid runs it gets confusing as I really need to consider the number of clusters per unit of time I can sequence to determine if the experiment will be cheaper on multiple rapid runs rather than one standard run! Instrument amortisation and maintenance charges are high so the more runs I can do in a week the better, I think. The life of a core lab manager is full of exciting stuff like this.

Thursday, 10 October 2013

Hack your MiSeq and get $400 off a 600bp run

I’ll start off by saying not quite, but you can read on to get an idea on how to increase read length of a MiSeq 500cycle v2 kit to get 600bp of data.

MadsAlbertsen posted on a SEQanswers thread about their protocol to squeeze a little more out of the MiSeq. They are using a hacked MiSeq Reagent kit v2 (500 cycles) and running 2x301bp, which is not supported by Illumina. “Do it at your own risk! (although it works nicely.)” is the message on the website. The group are using the modified protocol and hacked kits for bacterial 16S rRNA gene amplicon sequencing of the V1-3 variable region. The target region is 489bp in total in E. coli, but it can vary significantly depending on the target species (making the 2x301 run necessary).
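To see why the extra cycles matter, here is a rough Python sketch of how much the two reads of a pair overlap on an amplicon. It assumes the reads start at opposite ends of the amplicon and that a standard 500-cycle kit is run as 2x250; the 650bp amplicon is a made-up example of a longer target.

```python
def read_overlap_bp(amplicon_bp, read_length_bp):
    """Bases covered by both reads of a pair; negative means a gap is left
    unsequenced in the middle of the amplicon."""
    return 2 * read_length_bp - amplicon_bp

for read_length in (250, 301):
    for amplicon in (489, 650):
        overlap = read_overlap_bp(amplicon, read_length)
        print(f"2x{read_length} on a {amplicon}bp amplicon: {overlap}bp overlap")
```

On the 489bp E. coli region a 2x250 run leaves only a handful of overlapping bases for merging the reads, while 2x301 leaves over a hundred.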

How to hack your MiSeq kit: Make sure you follow the instructions on adding a little extra reagent to some of the wells.
5 mL of incorporation buffer from well 1 of a left-over reagent cartridge to well 1 of the 2x301 cartridge.
7 mL of scan mix from well 2 of a left-over reagent cartridge to well 2 of the 2x301 cartridge.
6.8 mL of cleavage mix from well 4 of a left-over reagent cartridge to well 4 of the 2x301 cartridge.
80 mL of incorporation buffer from a left-over incorporation buffer bottle.

Now simply set the MiSeq to 2x301 in the samplesheet and ignore the warning the software gives. Et voilà, 600bp for the price of 500. With a MiSeq v3 kit costing about $1400 that’s potentially a $400 saving.

Will we be doing this in my lab? No way, I’m far too conservative with users’ samples to play around like this. But I wish I could do more stuff like this, as it’s fun. It makes me want to come up with my own genomics Instructables.

Watch out for their paper: Saunders, A.M., Albertsen, M., McIllroy, S.J. & Nielsen, P.H. (in prep) MiDAS: the field guide to the activated sludge ecosystem.

Friday, 4 October 2013

It's not Open Access's fault!

Update: ...see the bottom of the post for more coverage on this "sting"

GenomeWeb has coverage of a story all of us should take a look at. A fake manuscript produced by a journalist from Science was accepted by 157 open-access publishers, a damning indictment of OA? Probably not. Damning indictment of peer review? Quite possibly.

Read Michael Eisen's blog for a pretty good discussion of the whole problem (NB: Michael Eisen is co-founder of PLoS).

The article has obviously struck a chord, as it is in Science's top 1% of articles ranked by Altmetric.

Peer review is full of problems and all of us suffer from that. But it is the system we have so we should work hard to do it right. So I'm off to re-read that paper I reviewed last week to see if the data really does stack up...

Update: GenomeWeb pointed to a live chat organised by Science on the impact of this paper. They bill this as a "chat about the dark side of open access and the future of academic publishing"! Still not making it clear what some of the problems were with their "study".

Zen Faulkes has a nice round-up of coverage on his blog, NeuroDojo (and a great header image).

PS: We still need to fix peer-review, anyone got any good ideas?
