CoreGenomics

CoreGenomics has moved

2017-01-23T20:58:00.001+00:00

Follow this link to Enseqlopedia/coregenomics...

"CoreGenomics is dead...long live CoreGenomics"...the CoreGenomics blog has moved to its new home: http://enseqlopedia.com/coregenomics. Please update your bookmarks and register to follow the new blog, for updates on the NGS map (coming soon), and to access the new Enseqlopedia (coming soon)!

Enseqlopedia: Last year I started the process of building the new Enseqlopedia site, after five years of blogging here on Blogger. Whilst Enseqlopedia is still being developed, the CoreGenomics blog has moved over and you can find all the old content there too. Commenting should be much easier for me to manage, so please do give me your feedback directly on the site.

NGS mapped: Currently I'm working on the newest implementation of the Googlemap sequencer map Nick Loman and I put together many years ago. The screenshot of the demo gives you an idea of what's changed. The big differences are a search bar that allows you to select technology providers and/or instruments. The graphics also now give a pie-chart breakdown of the instruments in that location...you can clearly see the dominance of Illumina!

Other technologies that will appear soon include single-cell systems from the likes of 10X Genomics, Fluidigm, Wafergen, BioRad/Illumina, Dolomite, etc, so users can find people nearby to discuss their experiences with (we're also restarting our beer & pizza nights as a single cell club here in Cambridge so keep an eye out for that on Twitter).

Lastly a change that should also happen in 2017 is the addition of users to the map. I'm hoping to give anyone who uses NGS technologies a way to list their lab, and highlight the techniques they are using. Again the aim is to make it easier for us to find each other and get talking.

Enseqlopedia.com is a big step for me. I hope you think it was worthwhile in a year or so. There's one feature I've not mentioned until now which I'm hoping you'll get to hear more about in the very near future - the Enseqlopedia itself. Watch out for it to appear in press.

Thanks so much for following this blog. I'm sad to leave Blogger. I hope you'll come with me to Enseqlopedia/coregenomics.

10X Genomics updates

2016-12-09T15:57:00.000+00:00

We had a seminar from 10X Genomics today to present some of the most recent updates on their systems and chemistry. The new chemistry for single-cell gene expression and the release of a specific single-cell controller show how much effort 10X have placed on single-cell analysis as a driver for the company. Phasing is looking very much the poor cousin right now, but still represents an important method to understand genome organisation, regulation and epigenetics.

Single cell 3'mRNA-seq V2: the most important update from my perspective was that 10X libraries can now be run on HiSeq 4000, rather than just 2500 and NextSeq. This means we can run these alongside our standard sequencing (albeit with a slightly weird run-type).

The new chemistry offers improved sensitivity to detect more genes per cell, improved sensitivity to detect more transcripts per cell, an updated Cell Ranger 1.2 analysis pipeline, and compatibility with all Illumina sequencers - sequencing is still paired-end but read 1 = 26bp for 10X barcode and UMI, Index 1 is the sample barcode, read 2 = the cDNA reading back to the polyA tail.
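The read layout described above can be sketched in a few lines of Python. This is only an illustration: the post gives the 26bp total for read 1, and the 16bp barcode plus 10bp UMI split used here is an assumption, as is the helper name `split_read1`.

```python
from typing import Tuple

# Assumed layout of read 1: cell barcode first, then UMI.
# The 16 + 10 split is an assumption for illustration; the post only gives the 26bp total.
BARCODE_LEN = 16
UMI_LEN = 10


def split_read1(read1: str) -> Tuple[str, str]:
    """Split read 1 into (cell_barcode, umi) using the assumed layout."""
    expected = BARCODE_LEN + UMI_LEN
    if len(read1) < expected:
        raise ValueError(f"read 1 is {len(read1)}bp, expected at least {expected}bp")
    barcode = read1[:BARCODE_LEN]
    umi = read1[BARCODE_LEN:expected]
    return barcode, umi


# Toy read: 16 A's standing in for the barcode, 10 C's standing in for the UMI
barcode, umi = split_read1("A" * 16 + "C" * 10)
print(barcode, umi)
```

Read 2 (the cDNA) and Index 1 (the sample barcode) are left alone here; only read 1 needs to be split before cells and molecules can be counted.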

It is really important in all the single-cell systems to carefully prepare and count cells before starting. You MUST have a single-cell suspension and load 100-2000 cells per microlitre in a volume of 33.8ul. This means counting cells is going to be very important as the concentration loaded affects the number of cells ultimately sequenced, and also the doublet rate. Counting cells can be highly variable; 10X recommend using a haemocytometer or a Life Tech Countess. Adherent cells need to be trypsinised and filtered using a Flowmi cell strainer or similar. Dead cells, and/or lysed cells, can confuse analysis by leaching RNA into the cell suspension - it may be possible to detect this by monitoring the level of background transcription across cell barcodes. The interpretation of QC plots provided by 10X is likely to be very important but there are not many examples of these plots out there yet so users need to talk to each other.

There is a reported doublet rate per 1000 cells of 0.8%, which keeps 10X at the low end of doublet rates on single-cell systems. However it is still not clear exactly what the impact is of this on the different types of experiment we're being asked to help with. I suspect we'll see more publications on the impact of doublet rate, and analysis tools to detect and fix these problems.

The sequencing per cell is very much dependent on what your question is. 10X recommend 50,000 reads per cell, which should detect 1200 transcripts in BMCs, or 6000 in HEK293 cells. It is not completely clear how much additional depth will increase genes detected before you reach saturation, but it is not worth going much past 150,000 reads per cell.

1 million single-cells: 10X also presented a 3D tSNE plot of the recently released 1 million cell experiment. This was an analysis of E18 mouse cortex, hippocampus, and ventricular zone. The 1 million single-cells were processed as 136 libraries across 17 Chromium chips, and 4 HiSeq 4000 flowcells. This work was completed by one person in one week - it is amazing to think how quickly single-cell experiments have grown from 100s to 1000s of cells, and become so simple to do.

Additional sequencing is underway to reach ~20,000 reads per cell. All raw and processed data will be released without restrictions.
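To get a feel for the scale of that experiment, here is a rough back-of-envelope sketch using only the numbers quoted above (1 million cells, 136 libraries, 17 chips, ~20,000 reads per cell). The function name and the returned keys are made up for illustration.

```python
def experiment_summary(cells: int, libraries: int, chips: int, reads_per_cell: int) -> dict:
    """Back-of-envelope numbers for a large single-cell experiment."""
    return {
        "libraries_per_chip": libraries / chips,   # channels used per Chromium chip
        "cells_per_library": cells / libraries,    # average cells captured per library
        "total_reads": cells * reads_per_cell,     # reads needed to hit the target depth
    }


# Figures quoted in the post for the 1 million cell dataset
summary = experiment_summary(
    cells=1_000_000, libraries=136, chips=17, reads_per_cell=20_000
)
for key, value in summary.items():
    print(f"{key}: {value:,.0f}")
```

With these inputs the experiment works out at 8 libraries per chip, roughly 7,350 cells per library, and 20 billion reads to reach the stated target depth.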

The number of cells required to detect a population is still something that people are working on. The 1 million cell dataset is probably going to help the community by delivering a rich dataset that users can analyse and test new computational methods on.

What's next from 10X: A new assay coming in Spring 2017 is for Single Cell V(D)J sequencing, enabling high-definition immune cell profiling.

The seminar was well attended showing how much interest there is in single-cell methods. Questions during and after the seminar included the costs of running single-cell experiments, the use of spike-ins (e.g. ERCC, SIRV, Sequins), working with nuclei, etc.

In answering the question about working with nuclei 10X said "we tried and it is quite difficult"...the main difficulty was the lysis of single-nuclei in the gel droplets. Whilst we might not be able to get it at single-cell resolution, this difficulty in lysing the nucleus rather than the cell might possibly be a way to measure and compare nuclear versus cytoplasmic transcripts.

MinION: 500kb reads and counting

2016-11-17T15:49:00.000+00:00

A couple of Tweets today point to the amazing read lengths Oxford Nanopore's MinION sequencer is capable of generating - over 400kb!

Dominik Handler Tweeted a plot showing read distribution from a run. In replies following the Tweet he describes the DNA handling as involving "no tricks, just very careful DNA isolation and no, really no pipetting (ok 2x pipetting required)".

and Martin Smith Tweeted an even longer read, almost 500kb in length...

Exactly how easily we'll all see similar read lengths is unclear, but it is going to be hugely dependent on the sample and probably having "green fingers" as well.

Here's Dominik's gel...

Unintended consequences of NGS-based NIPT?

2016-11-09T11:34:00.000+00:00

The UK recently approved an NIPT test to screen high risk pregnancies for foetal trisomy 21, 13, or 18 after the current primary screening test, and in place of amniocentesis (following on from the results of the RAPID study). I am 100% in favour of this kind of testing and 100% in favour of individuals, or couples, making the choice of what to do with the results. But what are the consequences of this kind of testing and where do we go in a world where cfDNA foetal genomes are possible?

I decided to write this post after watching "A world Without Downs", a documentary on BBC2 that was presented by Sally Phillips (of Bridget Jones fame), mother to Olly who has Down's syndrome. She presented a programme where the case for the test was made (just), but the programme was very clearly pro-Down's, although not quite to the point of being anti-choice.

My own personal experience of Down's is limited, and I'd watched the documentary more out of excitement to see how NGS is being rolled out across the NHS; particularly because the same technology is being applied in Cancer and is likely to transform patient treatment. My view before watching was that this new NIPT test could only be a good thing. The programme made me see that there are likely to be unintended consequences of this kind of testing, and that there may be darker sides to the use of the technology. It made me think more carefully about the issue, but in the end I'm still 100% in favour of the test.

Unintended consequences of cell-free DNA testing have been reported previously, with the discovery of cancer in an expectant mum first reported in 2013. How we deal with these issues is a matter of ongoing debate. For Down's the programme highlighted the negative way expectant mothers and fathers are given the news that they may have a Down's child; and that better information can only lead to a more informed choice - not difficult to agree with that. Unfortunately the programme can't escape its Herodotian title. This test won't lead to "A world Without Downs", but how people use the information might.

I'd highlighted the programme on Twitter after watching it, and posted again after reading an article in the Guardian, "Fears over new Down's syndrome test may have been exaggerated, warns expert", where Prof Sir David Spiegelhalter was quoted as saying that terminations would not go up - based on the current models being used. I did not disagree with his stats (I'd be crazy to do that), but models can be wrong, and that was the basis of my Tweet.

The main argument from Phillips in the programme is that this test will result in more terminations, and that means fewer people being born with Down’s syndrome. She visited Iceland, which she stated has not had a Down's syndrome child born in the last 5 years. This is surprising as I'd expect a country like Iceland to have a testing regime with as many false-negatives as anyone else - a few Down's children should have been born...and data from the WHO seem to suggest this is indeed the case.

Ultimately even if 100% of parents did choose to abort after receiving test results, as long as they were well informed before making their decision, then we've done the right thing. Haven't we?

Trisomies 13, 18 and 21 are the only things tested for right now. But the underlying technology could ultimately use whole genome sequencing and find the full spectrum of genetic abnormalities: such as an increased risk of psoriasis, glaucoma, and Alzheimer's. If my mum had decided these were not traits she wanted her baby to have I'd not be writing this blog.

Does the world have too many HiSeq X Tens?

2016-10-21T15:03:00.000+01:00

Illumina stock dropped 25% after a hammering by the stock market following their recent announcement that Q3 revenues would be 3.4% lower than expected at just $607 million. This makes Illumina a much more attractive acquisition (although I doubt this summer's rumours of a Thermo bid had any substance), and also makes a lot of people ask the question "why?"

The reasons given for the shortfall were "a larger than anticipated year-over-year decline in high-throughput sequencing instruments" i.e. Illumina sold fewer sequencers than it expected to. It is difficult to turn these revenue figures and statements into the number of HiSeq 2500s, 4000s or Xs that Illumina missed its internal forecasts by, but according to Francis deSouza Illumina "closed one less X deal than anticipated" - although he did not say if this was an X5, X10 or X30! Perhaps more telling was that deSouza was quoted saying that "[Illumina was not counting on a continuing increase in new sequencer sales]"...so is the market full to bursting?

Before diving into my own analysis (you might also like to read GenomeWeb's coverage), I would like to put these numbers in perspective. A 3rd quarter revenue of $607 million is nearly $2.5 billion over the full year (versus $2.3B in 2016 and $2.1B in 2015; numbers from Illumina data here). And revenues grew by 10% year on year. This does not seem like bad news from an academic user's perspective!

Is there such a thing as too many sequencers: Illumina have talked about how they were surprised by the interest in X Ten, and have sold far more units than they initially forecast. The word on the street seems to be that only a few X Ten labs are working at capacity: Broad, NYGC, Human Longevity. Illumina have said the reagent pull-through on X Ten has been about $650K/X/year, which is only half of the theoretical $1.2 million/X/year.

Sales of HiSeq 4000 appear not to have been as strong as the 2500 platform was on its launch. NextSeq seems to be popular with almost 1000 units out in the field, especially for NIPT use, but also in medium sized labs wanting their own sequencer. I suspect a fair number of MiniSeqs are rolling off the production line (although whether they offer good value for money is debatable).

But Illumina's main reasons for slightly lower than expected performance were clearly lower sales of instruments; and this was particularly so in Europe last quarter. Todd Campbell at The Motley Fool asked an important question about what's happening in Europe, "Europe was [growing] slower than the rest of the world", but he also posed the questions "Why? What's so special about Europe? What are the things that could be going into the reasoning behind Europe growing more slowly than the other parts of the world?" He went on to discuss competition (from Oxford Nanopore) as being a factor, but most telling was something he picked up on from the Illumina conference call when their results were announced at the end of Q2: "Europe is slowing is because of sharing of devices"!

I'd wholly subscribe to the "glut of capacity and increased use of outsourcing" hypothesis. If the glut does not go away, and if labs continue to move to outsourcing then Illumina will sell decreasing numbers of instruments and service contracts, but consumables pull-through should be higher from each box. Ultimately I think this is a win for Illumina as more science will be published using their technology - and that's really what we all want.

I run a core lab, and I know lots of other people who do in Europe, Africa, and across the world. Sharing Illumina (and other) instruments in core labs has been a part of science for a very long time. It makes good sense scientifically and economically (I know I'm biased). And from where I sit I can see many Illumina sequencers gathering dust (metaphorically speaking), or being run at 25% or lower utilisation. People got the funding to buy these amazing devices; but not the money to staff the lab, to service them, and to fund the projects to run on them. Perhaps worse is the opportunity cost of the lost science; science that could not be done because someone spent money on a sequencer rather than sequences.

Maybe instrument sales have slowed down in Europe because we've got wise to this problem, maybe scientists in Europe have seen how great core labs can be, and that shared devices with high utilisation are a good thing for science in general. But what happens, if I'm right, when the rest of the world realises it has too many sequencers but not as many results as they'd expected, and focuses on buying sequences rather than sequencers?

Will users continue to purchase consumables at an ever increasing rate: Illumina's business model has been described as "a simple razor and blade model: Illumina makes one-time sales of large machines at lower margins, then provides consumables needed for use in their operation on an ongoing basis."

A Tweet earlier in the year from Kinghorn Genomics is one of the few public figures I've seen for actual sequencing throughput on an X Ten. 1100 genomes in one month is astounding, but still 20-30% short of the 1500 per month figure in Illumina's specs. Very few owners openly discuss the numbers of samples going through their instruments, and Illumina are very cagey about reagent pull-through in individual labs. It seems pretty clear that X Ten labs simply can't pull in the required numbers of samples to match Illumina's specs. And even Kinghorn Genomics, at the high end of reagent pull-through, is at only around 73% utilisation.
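The utilisation figures in this section are simple ratios, and a quick sketch makes them easy to re-run with your own numbers. The inputs below are the ones quoted in this post (1100 versus 1500 genomes per month, and $650K versus $1.2 million pull-through per X per year); the function name is just for illustration.

```python
def utilisation(actual: float, theoretical: float) -> float:
    """Fraction of theoretical capacity actually used."""
    if theoretical <= 0:
        raise ValueError("theoretical capacity must be positive")
    return actual / theoretical


# Genomes per month on an X Ten: Kinghorn's Tweet versus Illumina's spec
genomes = utilisation(1100, 1500)

# Reagent pull-through per X per year: reported versus theoretical
pull_through = utilisation(650_000, 1_200_000)

print(f"genomes per month: {genomes:.0%} of spec")         # 73%
print(f"reagent pull-through: {pull_through:.0%} of max")  # 54%
```

On these numbers the best publicly reported lab is running at about three quarters of spec, and the average X is consuming just over half of the reagents it theoretically could.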

Illumina's consumables are a highly profitable business with gross profits around 70%, and these margins have been at that level for as long as I can remember. I don't want to skip over the fact that Illumina has also invested heavily in R&D, and is investing heavily in the clinical adoption of its core technology via Helix and Grail. So some of that 70% margin is going somewhere that is likely to be useful to me in the future. But Illumina have cited weakness in the HiSeq franchise outside of the X – both instrument shipments and consumables. Reagent pull-through on HiSeq was below their estimates of ~$350K per year, again pointing to a glut of sequencers, rather than sequencing projects. So reagents are perhaps the most important thing for Illumina to focus on.

Total revenue increased by $93.9 million to $1,171.9 million in the first half of 2016; up by 9% over 2015.

Consumables revenue (63% of total) increased by $128.4 million to $740.1 million in the first half of 2016; up by 21% over 2015 "driven by growth in the sequencing instrument installed base".

Instrument revenue (20% of total) decreased by $58.5 million to $243.2 million in the first half of 2016; down by 19% over 2015 "primarily due to lower shipments of our high-throughput platforms".

Service and other revenue (15% of total) increased by $23.2 million to $179.2 million in the first half of 2016; up by 15% over 2015 "driven by revenue from genotyping services and extended instrument service contracts associated with a larger sequencing installed base".
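The year-on-year percentages and revenue shares above can be checked from the dollar figures alone. Here is a minimal sketch that takes each current figure and its quoted change, backs out the first half of 2015, and recomputes growth and share of total. The dictionary layout and function names are just for illustration.

```python
# (H1 2016 revenue, change versus H1 2015), both in $ millions, as quoted in the post
FIGURES = {
    "Total": (1171.9, 93.9),
    "Consumables": (740.1, 128.4),
    "Instruments": (243.2, -58.5),
    "Service and other": (179.2, 23.2),
}


def growth(current: float, change: float) -> float:
    """Year-on-year growth, backing out the prior-year figure from the change."""
    prior = current - change
    return change / prior


def share(current: float, total: float) -> float:
    """Segment revenue as a fraction of total revenue."""
    return current / total


total_now = FIGURES["Total"][0]
for name, (current, change) in FIGURES.items():
    print(
        f"{name:<18} growth {growth(current, change):+.1%}  "
        f"share of total {share(current, total_now):.1%}"
    )
```

The recomputed growth rates round to the 9%, 21%, -19% and 15% quoted above. The instrument share comes out nearer 21% than 20%, and the three segments do not quite sum to the total, so presumably there is a small amount of other revenue not listed here.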

What locks us into Illumina: Capital costs are very high in replacing an Illumina fleet; my own lab has around £2 million invested (2x HiSeq 4000, 1x HiSeq 2500, 1x NextSeq, 2x MiSeq) - we couldn't simply go out and buy machines from another vendor, even if there were one. The real tie-in is the infrastructure we've built up around the use of Illumina sequencing. Users are unlikely to switch until there is a really good competitor out there...and Life Tech's SOLiD and Ion Torrent technologies just were not good enough.

Predicting the future: For the future I'm as confident as everyone else that NGS usage is going up, bigger projects, more samples, more sequencing, more data - that's a great scenario for Illumina. They might be a bit stuck with the next big leap in instrument yields, as this would need to jump significantly to make labs like mine purchase new boxes, and that could land them back in the same position as they were in 2011. If the economic case for a new machine can't be made then labs will find it hard to get funding for incremental changes. And if Illumina do make a big leap then many labs may prefer to share the infrastructure costs, and aim to bring down experimental costs. Where do Illumina go in the research space next if they can't bring us cheaper sequencing?

Q: What will Illumina announce at J.P.Morgan? A new sequencer? The $500 genome? Nanopores?

The use of NGS in oncology might take ten years to become profitable given the pace at which healthcare systems can adopt new technologies. I know from my experiences of the NHS that a few labs can be leading lights, but the majority need to be dragged into accepting change of almost any kind. Oncology is tough, but is a huge, and highly profitable, market so the effort from Illumina is likely to be worth it. Illumina certainly think so; SeekingAlpha quoted Francis deSouza (Illumina CEO) as saying "We spent a decade selling instruments to researchers who are experts and understand genomics. Now we're seeing applications take off, which is a much bigger market for us." Whether the recent stock fall was partly because the markets see the realisation of this "bigger market" as being too much of a future gamble is unclear to me. Verinata, Grail and Helix are really exciting ventures, but how quickly can they add to Illumina's revenues and profits?

The rapid adoption of NGS in NIPT might shed some light on the future. Verinata is now contributing high single digit percentages to Illumina's revenues, and this could reach 10% as soon as 2018. I'd highly recommend anyone who can get access to BBC iPlayer to watch the "A world without Downs" documentary!

I thought I'd finish up with a look to the future; particularly to the other NGS technology that we might be using alongside Illumina routinely by 2020 - Oxford Nanopore. The technology, soon to be "a genome centre in a box", and possibly iPhone compatible, is starting to gain traction outside of the hardcore fanboys and fangirls like Nick Loman and Josh Quick. Right now it is certainly unproven (in the commercial sense), closed-community (the MinION is available commercially, but users are generally talking in the Nanopore forum), and a niche tool. But R9 makes Nanopore sequencing easy, and the most recent updates from Clive Brown point to a future where we might use Nanopores alongside SBS. If the ONT tech is truly disruptive then there is a future that may be decidedly less orange!

I'd not want to forget to mention Pacific Biosciences now that Sequel appears to be getting some traction (over 100 instruments sold since the launch compared to 100-150 RSIIs). And the 50x drop in DNA required is going to make this a tool people with limited sample availability can now consider using.

But we should not forget that Illumina is a company that can deliver on innovation. Whilst Illumina did not invent SBS - Solexa, a small UK company, did; Illumina turned Solexa's $2.5 million revenues in 2006, into a $100 million business, in one year! Many readers will remember the release of the HiSeq, MiSeq, NextSeq, X Ten - all significant leaps for genomics; and I'm betting they've got some pretty cool tech up their sleeves yet.

Finally: Do you think there are too many sequencers out there? Should we focus on buying sequences rather than sequencers? If the majority of users answer yes to these questions then sequencer sales may well continue to decline in the short term. But reagent pull-through on each box should increase, and Illumina's focus for research sequencing might shift to "blades rather than razors", on driving utilisation of their installed base up.

Controlling for bisulfite conversion efficiency with a 1% Lambda spike-in

2016-10-21T08:00:00.000+01:00

The use of DNA methylation analysis by NGS has become a standard tool in many labs. In a project design discussion we had today somebody mentioned the use of a control for bisulfite conversion efficiency that I'd missed; as it's such a simple one I thought I'd briefly mention it here. In their PLoS Genet 2013 paper, Shirane et al from Kyushu University spiked-in unmethylated lambda phage DNA (Promega) to control for, and check, that the C/T conversion rate was greater than 99%.

The bisulfite conversion of cytosine bases to uracils, by deamination of unmethylated cytosine (as shown above) is the gold standard for methylation analysis.

Users identify the C/T transitions in a comparison of bisulfite treated/untreated samples, or by comparing to a known reference. However bisulfite treatment is a harsh biochemical reaction, and can cause large losses in template DNA. As such controlling for and measuring conversion efficiency is important in making conclusions about the methylation data from NGS experiments. As a reminder - bisulfite does not convert methylated or hydroxy-methylated cytosine, allowing users to discriminate between non-methylated cytosine (C) and methylcytosine (mC) or hydroxymethylated cytosine (hmC).
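Because the lambda spike-in is unmethylated, every cytosine in reads mapping to lambda should be read as T after treatment, so the conversion rate is simply the converted fraction. A minimal sketch, assuming you already have counts of converted and unconverted cytosine calls from the lambda reads (the counts and function names below are made up for illustration, and the 99% threshold is the one quoted from the paper):

```python
def conversion_rate(converted: int, unconverted: int) -> float:
    """Bisulfite conversion rate estimated from an unmethylated spike-in.

    converted:   cytosine positions read as T in reads mapping to lambda
    unconverted: cytosine positions still read as C in those reads
    """
    total = converted + unconverted
    if total == 0:
        raise ValueError("no cytosine calls on the spike-in")
    return converted / total


def passes_qc(converted: int, unconverted: int, threshold: float = 0.99) -> bool:
    """True if the conversion rate exceeds the chosen threshold."""
    return conversion_rate(converted, unconverted) > threshold


# Hypothetical counts, for illustration only
rate = conversion_rate(converted=9950, unconverted=50)
print(f"conversion rate: {rate:.2%}, passes: {passes_qc(9950, 50)}")
```

The same fraction, taken the other way round, gives the non-conversion rate, which is an estimate of the false-positive methylation rate to bear in mind when interpreting calls in the real sample.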

We're likely to start using this control if it works well in the project we have just kicked off. In the paper they added 1ng of lambda DNA to 1000 oocytes before performing a PBAT analysis. We'll aim for a 1% spike-in, but need to consider how much to add to each sample, and whether Lambda is the right spike-in as we're using an RRBS method for this project. To check the suitability I grabbed the Lambda sequence from GenBank and did an in silico MspI digest using WebCutter2.0. I found 330 cut sites - which should be plenty for checking efficiency.
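If you'd rather script the in silico digest than use a web tool, the idea is easy to sketch. MspI recognises CCGG and cuts between the two Cs, so it is enough to find every CCGG in the sequence and record the cut position. This is not what WebCutter does internally, just a minimal stand-in that treats the sequence as linear; the toy sequence below is made up, and you would need to load the real Lambda sequence yourself.

```python
import re

MSPI_SITE = "CCGG"
CUT_OFFSET = 1  # MspI cuts between the first and second C (C^CGG)


def mspi_cut_positions(seq: str) -> list:
    """0-based positions at which a linear sequence is cut by MspI."""
    seq = seq.upper()
    return [m.start() + CUT_OFFSET for m in re.finditer(MSPI_SITE, seq)]


def fragment_lengths(seq: str) -> list:
    """Lengths of the fragments produced by a complete MspI digest."""
    bounds = [0] + mspi_cut_positions(seq) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy sequence with two MspI sites, for illustration only
toy = "AACCGGTTTTCCGGAA"
print(mspi_cut_positions(toy))  # [3, 11]
print(fragment_lengths(toy))    # [3, 8, 5]
```

For RRBS the fragment lengths matter as much as the number of sites, because only fragments inside the size-selection window end up in the library, so it is worth looking at the length distribution and not just the site count.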

Want to learn more about bisulfite conversion in general? Take a look at Zymo's website, it's an excellent resource.

SIRVs: RNA-seq controls from @Lexogen

2016-10-17T10:14:00.000+01:00

This article was commissioned by Lexogen GmbH.

My lab has been performing RNA-seq for many years, and is currently building new services around single-cell RNA-seq. Fluidigm’s C1, academic efforts such as Drop-seq and inDrop, and commercial platforms from 10X Genomics, Dolomite Bio, Wafergen, Illumina/BioRad, RainDance and others make establishing the technology in your lab relatively simple. However the data being generated can be difficult to analyse and so we’ve been looking carefully at the controls we use, or should be using, for single-cell, and standard, RNA-seq experiments. The three platforms I’m considering are the Lexogen SIRVs (Spike-In RNA Variants), SEQUINs, and ERCC 2.0 (External RNA Controls Consortium) controls. All are based on synthetically produced RNAs that aim to mimic complexities of the transcriptome: Lexogen’s SIRVs are the only controls that are currently available commercially; ERCC 2.0 is a developing standard (Lexogen is one of the groups contributing to the discussion), and SEQUINs for RNA and DNA were only recently published in Nature Methods.

You can win a free lane of HiSeq 2500 sequencing of your own RNA-seq libraries (with SIRVs of course) by applying for the Lexogen Research Award.

Lexogen’s SIRVs are probably the most complex controls available on the market today as they are designed to assess alternative splicing, alternative transcription start and end sites, overlapping genes, and antisense transcription. They consist of seven artificial genes in-vitro transcribed as multiple (6-18) isoforms to generate a total of 69 transcripts. Each has a 5’triphosphate and a 30nt poly(A)-tail, enabling both mRNA-Seq and TotalRNA-seq methods. Transcripts vary from 191 to 2528nt long and have variable (30-50%) GC-content.

Want to know more: Lexogen are hosting a webinar to describe SIRVs in more detail on October 19th: Controlling RNA-seq experiments using spike-in RNA variants. They have also uploaded a manuscript to BioRxiv that describes the evaluation of SIRVs and provides links to the underlying RNA-Seq data. As a Bioinformatician you might want to download this data set and evaluate the SIRV reads yourself. Or read about how SIRVs are being used in single-cell RNA seq in the latest paper from Sarah Teichmann’s group at EBI/Sanger.

Before diving into a more in-depth description of the Lexogen SIRVs, and how we might be using them in our standard and/or single-cell RNA-seq studies, I thought I’d start with a bit of a historical overview of how RNA controls came about...and that means going back to the days when microarrays were the tool of choice and NGS had yet to be invented!

RNA quality control – MAQC: The use of controls is recommended in any experiment, and the lack of them is one of the oft cited reasons for the current reproducibility crisis. Nearly everyone who’s worked on differential gene expression in the last fifteen years has heard of the MAQC (MicroArray Quality Control) study. Although four sources of RNA were evaluated, Stratagene’s Universal Human Reference RNA and Ambion’s Human Brain RNA samples were chosen because of the number of genes expressed at a detectable level, and the size of the fold changes between the two samples. These two control samples were used to evaluate five microarray platforms, in an international project involving 137 participants from 51 organisations (see Nat Biotech 2006). Labs like mine adopted, and continue to use, the MAQC controls in our differential gene expression pipelines, which today are almost all based on RNA-seq methods. We used them in my lab to show how detection sensitivity drops as RNA inputs are reduced to under 100ng (something I keep meaning to repeat with RNA-seq).

The move to RNA-seq has had a dramatic impact on our ability to perform complex experiments. We are no longer limited to asking questions about the differential expression of genes where we have sequence information available to make an array. RNA-seq allows us to analyse the whole transcriptome; to assess differential gene expression (oligo-dT enriched mRNA-seq is the most widely used method), as well as differential splicing, allele specific expression, polyA tail length, transcription initiation and termination, microRNA, lincRNA, etc, etc, etc (see my "wish list" for controls at the bottom of this post).

The MAQC controls we used are simply not up to the more complex job that RNA-seq presents. Both the ABRF and SEQC papers used MAQC samples, which are admixtures of multiple individuals (I discussed these limitations in a 2014 post), but both included the ERCC controls as well.

Newer, more carefully designed and manufactured controls are available that can better serve the needs of biologists; and this is where SIRVs come in.

The SIRV workflow: from sample to answer

RNA quality control – Lexogen and beyond: SIRVs are designed to represent much of, but not all of, the complexity of Eukaryotic transcriptomes e.g. differential gene expression, differential splicing, polyA tail length variation, GC content, etc. SIRVs are designed to be added to samples before RNA extraction, or starting the RNA-seq library prep. They should allow an objective assessment of the technical biases in library preparation, sequencing and analysis; and ultimately should improve our ability to make biological insights from comparison of experimental conditions. They are a huge leap forward from the MAQC controls, and a significant step ahead of the ERCC1.0 controls, which are restricted to single-exon transcripts.

How are SIRVs made: SIRVs were designed to be similar to Human gene structures with overlapping multi-exon genes that are transcribed in both sense and antisense, with alternative splicing and alternative transcription start and end sites. Genes are in-vitro transcribed from linearized plasmids to produce full-length transcripts which are subject to very careful quality control and quantitation. This includes spectrophotometric, molecular weight, and Agilent Bioanalyser analyses. After QC and QT SIRV transcripts are mixed at equimolar concentrations (E0), or at 8-fold (E1) or 128-fold (E2) variations.

Designing SIRVs: A comparison of SIRV1 and KLK5

How are SIRVs used: Spiking SIRVs into your samples requires some careful consideration of how you’ll use the data they provide in downstream assessment. Today the most important control in my lab is simply whether the library prep has worked, or more importantly, where it did not work, whether it was the lab or the sample that was the cause of the failure. Our use of MAQC controls on a plate of samples is great, but extending this to an internal control in every sample is going to be better. However I don’t want controls to dominate the experiment or they’ll add too much to the costs of library preparation and sequencing.

SIRVs themselves don’t need much data to generate useful results and around 1% of your sequencing reads should be sufficient for most labs. However determining how much SIRV mix to add to your samples before extraction, or your RNA before library prep, can require some empirical testing as the amount of RNA in a sample or a cell differs so much. As a rule of thumb 95% of RNA is ribosomal RNA, and the other 5% is mRNA (and non-coding RNAs). For an experiment starting with 100ng of TotalRNA in an mRNA-seq workflow approximately 50pg would represent 1% of the 5ng of mRNA present.
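The rule of thumb above is easy to turn into a small calculator. This is only a sketch of the arithmetic in this post (5% of TotalRNA is mRNA, and the spike-in should be about 1% of that); the function name and defaults are illustrative, and it is no substitute for empirical testing on your own samples.

```python
def spike_in_mass_pg(
    total_rna_ng: float,
    mrna_fraction: float = 0.05,   # rule of thumb: ~5% of TotalRNA is mRNA
    spike_fraction: float = 0.01,  # aim for ~1% of the mRNA mass as spike-in
) -> float:
    """Mass of spike-in (in pg) to add for a given TotalRNA input (in ng)."""
    if total_rna_ng < 0:
        raise ValueError("RNA input cannot be negative")
    mrna_pg = total_rna_ng * 1000 * mrna_fraction  # convert ng to pg, take mRNA share
    return mrna_pg * spike_fraction


# The worked example from the text: 100ng TotalRNA in an mRNA-seq workflow
print(f"{spike_in_mass_pg(100):.0f}pg of SIRVs")  # 50pg
```

Changing `mrna_fraction` is the obvious knob if your sample type departs from the 95:5 rule of thumb, for example after ribosomal depletion.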

SIRVs are available in three configurations, E0, E1 & E2, that mix the in vitro transcribed RNAs at equimolar (mix E0), up to 8-fold (mix E1), or up to 128-fold (mix E2) variation in concentration. Importantly SIRVs are built in a modular format and should be compatible with other spike-in controls like the ERCC. Additional modules should address transcript lengths, polyA tail length variation, etc.

Coinciding with the webinar on October 19th, Lexogen will release the “SIRVs suite” (see "How are SIRVs analysed" below) for analysis of spike-in data. This will also include an "Experiment Designer" tool to calculate recommended spike-in ratios based on known or expected input for the RNA content, mRNA ratio, and type and efficiency of the workflow.

SIRVs in bulk RNA-seq: Bulk RNA-seq experiments can use SIRVs as process controls in place of the MAQC Brain and UHRR samples allowing a full 96 samples to be run on each plate. Assuming the 100ng TotalRNA input then just 50pg of SIRVs are needed per sample, with 5ng added to the oligo-dT master-mix used in the enrichment step. The use of SIRV E0 is recommended for process QC, but E1 and E2 may be useful when evaluating new methods for accuracy and precision of differential transcript detection and quantitation.

SIRVs in scRNA-seq: Single-cell RNA-seq has quickly adopted spike-in controls, with Hashimshony et al presenting their use of ERCC spikes in the CELSeq protocol. Both Wu et al 2013 and Treutlein et al 2014 used the ERCC mixes at a 1:40,000 dilution spiked into the cell lysis mix of the Fluidigm C1 protocol. And Svensson et al use the ERCC and SIRV spike-ins to assess sensitivity and accuracy of various protocols across a standard analysis pipeline. This demonstrates the utility of using RNA control spike-ins, but also the requirement for careful dilution to avoid swamping single-cell RNA-seq experiments with control data, or not having enough to QC data before interpreting results. Assuming each single cell has around 20pg of TotalRNA then just 200fg of SIRVs are needed per sample. The amount of SIRV added, and exactly where to add it in the protocol, is highly dependent on the single-cell RNA-seq protocol being used.

How are SIRVs analysed: Lexogen will release the Galaxy-based “SIRVs suite” for uploading, evaluating and comparing spike-in data. This will allow SIRV users to compare results from their experiments to anonymised data, and should help determine if your own experiment is any good. Back in 2003/4 I developed rptDB: a tool to compare QC data between Affymetrix arrays. This had over 3500 samples submitted to it, and allowed a quick easy call on whether your data was "good" or "bad" - highly context dependent of course! As a user, if I had received data from a core lab or service provider, or was downloading RNA-seq data for meta-analysis, then being able to select only data where SIRV, or other, controls had been used, and where results were shown to be high-quality, would most likely save me considerable time in cleaning up data before starting.

SIRVs are not designed to be used as a normalisation tool. Whilst spike-ins have been considered for this, they are not really reliable enough for standard normalisation procedures. The development of novel normalisation algorithms appears to offer hope for the future (see Risso 2014), and approaches like this might be applicable to SIRVs. I suspect this will be an active area of algorithm development over the next couple of years because of the huge interest in single-cell RNA-seq.

The competition: alternative RNA-seq controls

Sequins: ‘Sequins’ (sequencing spike-ins) were developed by the Garvan Institute and recently published in Nature Methods. Sequins are conceptually similar to SIRVs. They are a set of synthetic RNA isoforms that align to an artificial in silico chromosome, with no homology to known genomes. They represent full-length spliced mRNA isoforms, at a range of concentrations. They can be used to assess differential gene expression and alternative splicing pipelines. The authors state that sequins can be used for normalisation, and refer to the same Nature Biotech paper as I did above. In their Nature Methods paper they do show some very nice results from scaling normalisation using sequins and I hope these results will ultimately be achievable with any well-designed spike-in series.

In the back-to-back Nature Methods publications the team at Garvan show how sequins can be used in RNA-seq and DNA-seq experiments to assess biases and determine the limits of detection, quantitation and analytical methods. Sequin genes are mixed in a two-fold serial dilution, with a minimum of three genes per dilution, to span an ~10^6-fold range. The team also developed 24 Sequins to represent cancer fusion genes and used these to assess fusion gene detection and quantitation. They also reported that split reads significantly outperformed read-pairs in their correlation with Sequin concentration – this has a significant impact on the sequencing format as many groups today use paired-end reads where longer single-end reads may be more sensitive, and would also be around 40% cheaper.

ERCC 2.0: the original ERCC1.0 controls are a mix of 92 relatively simple single-exon transcripts of varying length and GC content. They are used in a mix at known concentrations spiked into samples before library preparation. ERCC2.0 aims to update the spikes to better represent the complexity of the transcriptome, and to provide FFPE derived controls. Again they are conceptually similar to SIRVs and Lexogen were one of 9 groups invited to present at the 2014 NIST ERCC2.0 workshop at Stanford University.

Conclusions: The use of controls in RNA-seq experiments is an absolute requirement if you want to get the best out of your experiments. Bulk RNA-seq can benefit from a relatively simple data QC of the controls before moving onto more complex differential gene expression and splicing analyses. And including spike-in controls may allow easier comparison of longitudinal data sets, or between labs. Single-cell RNA-seq has shown an absolute requirement to include spike-ins, although the very latest papers suggest that spiked-in transcripts may not truly mirror Human mRNAs in the protocols used, due to much shorter poly-A tails (30 vs 200+bp), and that they may underestimate detection sensitivity by up to ten-fold.

SIRVs, more recently Sequins, and soon ERCC2.0 controls can be further enhanced and manufacturers should not consider their job complete! With protocols like Pacific Biosciences' Iso-Seq and the advent of Oxford Nanopore's direct RNA-sequencing, longer and longer transcripts could be assessed and this will need to be controlled. Phased sequencing, possibly from long RNA molecules on 10X Genomics, is likely to need controls where variants are phased. Additionally PacBio and Nanopore sequencing also offer the ability to detect and quantify RNA base modifications. All of this shows how far the controls we might use still have to go.

My RNA controls wish list:

differential gene expression normalisation

differential splicing

allele specific expression

transcript and polyA tail length variation

GC content

transcription initiation and termination

non poly-adenylated RNAs e.g. microRNA, lincRNA

pseudogene mapping

limits of detection

RNA variant detection at different MAF

High-quality and degraded FFPE RNA

Spike-ins with corresponding baits for in-solution capture

Spike-in RNA encapsulated in synthetic cells

Phased variants on long RNAs

RNA base modifications

Please let me know what you’d like to add by leaving a comment below. http://biorxiv.org/content/early/2016/10/13/080747

Batch effects in scRNA-seq: to E or not to E(RCC spike-in)

2016-10-14T12:46:00.001+01:00

At the recent Wellcome Trust conference on Single Cell Genomics (Twitter #scgen16) there was a great talk (her slides are online) from Stephanie Hicks in the @irrizarry group (Department of Biostatistics and Computational Biology at Dana-Farber Cancer Institute). Stephanie was talking about the recent work she's been doing looking at batch effects in single-cell data, all of which you can read about in her paper on bioRxiv: On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data. You can also read about this paper over at NextGenSeek.

Adapted from Figure 1 in Hicks et al.

Almost without exception every new technology gets published with a slew of high-impact papers. And almost without exception those papers turn out to be heavily biased. This is not to say we should expect every wrinkle to be ironed out before initial publication - new technologies take a lot of effort and the faster they make it into the public domain the sooner the community can improve them and make them more robust. Often batch effect is the first problem identified: with arrays, with NGS, and now with single-cell RNA-seq.

Stephanie et al looked at 15 published single-cell RNA-seq papers and found that, in the 8 studies investigating differences between groups where confounding could be assessed, the extent of confounding ranged from 82.1% to 100% (see table 1 from the paper - 82,85,93,96,98,100 & 100%). All of these studies were designed such that the samples were confounded with processing batch. They report that the number of genes detected as expressed explained a significant proportion of observed variability, but that this varied across experimental batches. This confounding of biological question with experimental batch effectively cripples the project:

"Batch effects lead to differences in detection rates, which lead to apparent differences between biological groups"

However the authors do point out that relatively simple experimental design choices can be used to remove the problem.

What does this mean for ERCC and other spike-ins: In her final slides, see "The Wild West", Stephanie clearly explains the problems we face with batch effects and in normalising single-cell RNA-seq experiments.

Batch effects can be a big problem in scRNA-Seq data (but not always).
Batch effects and methods to correct for batch effects have been around for many years (lots of places to start).
Bad news: Poor experimental design is a big limiting factor… also, more complicated because of sparsity (biology and technology), capture efficiency, etc
Good news: Increase awareness about good experimental design. New methods specific for scRNA-Seq are being developed

It is looking more and more possible to use RNA spike-ins in scRNA-seq experiments specifically as a tool to help in the normalisation of the data, and also as a way to reduce/remove batch effects. Stephanie does state that there are still challenges in doing this, and also points to the use of UMI counts to help fix the problem by reducing amplification bias, etc.

However not every protocol recommends spike-ins and there is certainly no clear preference in the community - although I think this is beginning to emerge. Read about how ERCCs & SIRVs are being used in single-cell RNA-seq in the latest paper from Sarah Teichmann’s group at EBI/Sanger.

I'm putting effort into understanding spikes in a lot more detail and am sure we'll all be using them routinely in a few more months.

What does this mean for the choice of scRNA-seq platform: My briefest of surveys for the three platforms we're using or looking at in my lab are as follows. Fluidigm suggest using the ArrayControl RNA Spikes (Thermo Fisher Scientific AM1780). Drop-seq suggest using the ERCC spikes (although this is not mentioned in their online protocol). 10X Genomics don't say anything about spikes in their current protocols!

I generated the figure at the top of this post to show where these 3 scRNA-seq platforms fit into Stephanie's figure 1 from the paper. Both C1 and Drop-seq are completely confounded as only one sample is processed per batch. 10X Genomics allows up to 8 samples to be processed together so a replicated "AvsB" study could be completed with zero batch effect.

But in the future we're likely to need 12, 24 or even 96 sample systems that allow us to process a scRNA-seq experiment in one go. Whilst it may well be possible to design Fluidigm C1 chips that can process more samples, each with fewer cells, or for Drop-seq to emulate 10X Genomics, or even for 10X Genomics to move to a larger sample format chip; none of this will solve the problem of collecting large numbers of single-cell samples without introducing batch effects further upstream in the experiment.

The take home message is to spend time on experimental design, and to replicate your study - simple enough stuff! Biological replication will allow batches to be randomised during the experiment to scRNA-seq prep runs and across sequencing flowcells if necessary. This generally allows batch effects to be removed from the experiment, even if they are significant.
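As a rough illustration of what randomising replicates across prep runs might look like, here is a minimal sketch. The function name, the sample labels and the round-robin dealing are all my own assumptions rather than anything from Stephanie's paper; the point is only that each batch ends up with a similar number of samples from every biological group, so group and batch are not confounded.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Spread biological replicates across processing batches.

    samples   -- list of (sample_id, group) tuples
    n_batches -- number of scRNA-seq prep runs (or flowcells) available
    Returns a dict mapping batch index -> list of (sample_id, group).
    """
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    batches = defaultdict(list)
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)  # randomise which replicate lands in which batch
        # Deal replicates round-robin so every batch sees every group.
        for i, sample_id in enumerate(ids):
            batches[i % n_batches].append((sample_id, group))
    return dict(batches)


# A replicated "AvsB" study: four replicates per group, two prep runs.
samples = [(f"A{i}", "A") for i in range(1, 5)] + [(f"B{i}", "B") for i in range(1, 5)]
for batch, members in sorted(assign_batches(samples, n_batches=2).items()):
    print(batch, members)
```

With uneven replicate numbers the first batches pick up the extras, so a real design would want something a little smarter, but even this is far better than one group per run.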

Clinical trials using ctDNA

2016-10-11T07:04:00.001+01:00

DeciBio have a great interactive Tableau dashboard which you can use to browse and filter their analysis of 97 “laboratory biomarker analysis” immuno-oncology clinical trials; see: Diagnostic Biomarkers for Cancer Immunotherapy – Moving Beyond PD-L1. The raw data comes from ClinicalTrials.gov where you can specify a "ctDNA" search and get back 50 trials, 40 of which are open.

Two of these trials are happening in the UK. Investigators at The Royal Marsden are looking to measure the presence or absence of ctDNA post CRT in EMVI-positive rectal cancer. And AstraZeneca are looking for ctDNA as a secondary outcome to obtain a preliminary assessment of safety and efficacy of AZD0156 and its activity in tumours by evaluation of the total amount of ctDNA.

You can also specify your own search terms and get back lists of trials from OpenTrials, which went live very recently. The Marsden's ctDNA trial above is currently listed.

You can use the DeciBio dashboard on their site. In the example below I filtered for trials using ctDNA analysis and came up with 7 results:

Thanks to DeciBio's Andrew Aijian for the analysis, dashboard and commentary. And to OpenTrials for making this kind of data open and accessible.

Index mis-assignment to Illumina's PhiX control

2016-10-07T11:59:00.000+01:00

Multiplexing is the default option for most of the work being carried out in my lab, and it is one of the reasons Illumina has been so successful. Rather than the one-sample-per-lane we used to run when a GA1 generated only a few million reads per lane, we can now run a 24 sample RNA-seq experiment in one HiSeq 4000 lane and expect to get back 10-20M reads per sample. For almost anything other than genomes multiplexed sequencing is the norm.

But index sequencing can go wrong, and this can and does happen even before anything gets on the sequencer. We noticed that PhiX has been turning up in demultiplexed sample Fastq files. PhiX does not carry a sample index so something is going wrong! What's happening? Is this a problem for indexing and multiplexing in general on NGS platforms? These are the questions I have recently been digging into after our move from HiSeq 2500 to HiSeq 4000. In this post I'll describe what we've seen with mis-assignment of sample indexes to PhiX. And I'll review some of the literature that clearly pointed out the issue - in particular I'll refer to Jeff Hussmann's PhD thesis from 2015.

The problem of index mis-assignment to PhiX can be safely ignored, or easily fixed (so you could stop reading now). But understanding it has made me realise that index mis-assignment between samples is an issue we do not know enough about - and that the tools we're using may not be quite up to the job (but I'll not cover this in depth in this post).

Issues with index mis-assignment and quality were initially noticed when we detected Illumina's PhiX control in demultiplexed Fastq data. PhiX is supplied by Illumina as a non-indexed library and as such should never appear in demultiplexed Fastq files. In our default analysis pipeline it should only appear in the "lost-reads" file and should be around 1% in data from lanes 1-7, and 5% in data from lane 8 of an Illumina flowcell (the actual percentage of PhiX can vary for several reasons, so we're not surprised to see higher or lower percentages than expected). We are still running PhiX in almost every lane of sequencing as an easy control to monitor run quality. But if PhiX is getting a barcode what's going wrong?

The main concern is that if the barcode read is failing in some manner, and attributing barcodes incorrectly, this will lead to erroneous results. Index mis-assignment causes two major problems:

reads are lost because a spurious barcode was assigned; this data would usually be discarded, should be minimal, and can potentially be ignored.
barcodes are mis-assigned to the wrong sample; this is a much more serious issue, and understanding what causes it, and the likelihood of it happening, will be critical in reducing the technical factors that could limit low-frequency variant calling.

With PhiX on every lane we should be able to monitor index mis-assignment in every run. PhiX may also allow us to estimate the rate of mis-assignment between samples, which will be vital if users need to allow for this in their analysis, particularly in low-frequency variant calling.
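One way the monitoring could be expressed is sketched below. It assumes you have already counted PhiX reads (for example by aligning to the PhiX genome) in each demultiplexed sample Fastq and in the "lost-reads" file; the function name, the dictionary layout and the counts are made up for illustration and are not taken from our pipeline.

```python
def phix_index_gain_rate(phix_reads_per_sample, phix_reads_lost):
    """Fraction of all PhiX reads in a lane that ended up in a sample Fastq.

    phix_reads_per_sample -- dict of sample name -> PhiX read count found in
                             that sample's demultiplexed Fastq
    phix_reads_lost       -- PhiX read count in the "lost-reads" file, which is
                             where non-indexed PhiX is expected to land
    """
    assigned = sum(phix_reads_per_sample.values())
    total = assigned + phix_reads_lost
    if total == 0:
        raise ValueError("no PhiX reads counted for this lane")
    return assigned / total


# Made-up counts for a single lane, purely for illustration.
phix_per_sample = {"sample_01": 50, "sample_02": 150}
rate = phix_index_gain_rate(phix_per_sample, phix_reads_lost=999_800)
print(f"{rate:.4%} of PhiX reads picked up a sample index")
```

Tracking this number per lane over time would at least flag runs where something unusual is happening, even if translating it into a sample-to-sample mis-assignment rate needs more thought.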

Previous reports about multiplexing on Illumina sequencers: As was anticipated several years ago, multiplex sequencing has become a common tool in many studies. The level of multiplexing varies but it is almost ubiquitous – an exception is the Genomics England sequencing program, where indexed libraries are created but the sequencing contractor, Illumina, runs non-indexed sequencing with a single sample per lane.

Several key papers that describe this issue are listed below. Probably the most useful are Kircher et al from the Meyer lab at the MPI, Mike Quail and Peter Ellis's SASI-seq paper from the Sanger, and Jeff Hussmann's PhD thesis.

The Kircher paper presents data from three slightly different preps: no-CAP (standard library prep), SP-CAP (single-plex in-solution capture libraries), and MP-CAP (multi-plex in-solution capture libraries). They were able to determine the fraction of mis-tagging events caused by either barcode contamination during oligo synthesis, pooling or handling, by mixed clusters, or by PCR recombination. After removing possible contamination as a source of error they reported that both no-CAP and SP-CAP had low levels of index mis-assignment (0.018% and 0.034%) but that the MP-CAP libraries had more than ten times higher mis-assignment (0.390%). The low percentages in the first two libraries were due to mixed clusters that could not be eliminated by quality filtering. The high, almost 0.5%, mis-assignment in the MP-CAP library was due to PCR recombination during multiplex PCR after in-solution capture. Importantly they calculated that if this recombination is occurring primarily in the adapter sequences then half of the chimeric reads, almost 0.25% of all exome reads, would be mis-assigned to a sample if a single index was used, and that dual-indexing would be recommended.

Their analysis was confirmed by Mitra et al 2015 who went further in showing that the template read on the HiSeq was part of the problem - on HiSeq 2500 this is kept to 4 cycles to reduce memory requirements, but when Mitra et al increased template read lengths to 20 cycles they saw 2-5 fold better results for index mis-assignment. Such a long template read would kill most of our HiSeq instruments, but upgrading the memory is suggested by the authors and could be very economical given the impact of low quality cluster detection and index mis-assignment.

In Jeff's PhD he used reads from the shortest library molecules with read-through into the adapters to determine that the PhiX control uses the older ‘PE’ primers, which have no sequence complementarity to the standard indexing read primers; as such they cannot generate a signal during the index read. He noticed the same drop in quality scores for PhiX index reads compared to the indexed samples as we had. But he also shows that the PhiX reads that appear to be indexed are physically closer to an indexed cluster than PhiX reads with no index read. This led him to propose the same model of index bleeding as I have here.

Jeff also carefully investigated PCR-mediated recombination (as did Kircher et al) as an additional source of index mis-assignment. This was first reported back at the start of the 1990s by Meyerhans et al. In any PCR the polymerase can stall or fall off the template creating a short extension product, which can then hybridise in place of a primer in the next round of PCR. The issue with Illumina libraries is that such a product could create a chimeric index mis-assignment due to molecular swapping of indexes. This is likely to be most pronounced in multiplexed amplification after indexed library prep i.e. most exome and amplicon strategies. He also stated that his analysis "constituted overwhelming evidence that PCR-mediated recombination happens during cluster generation". His analysis was all on HiSeq 2500 "Manteia" clustering chemistry. This is likely to perform quite differently from the patterned flowcell "Exclusion Amplification" chemistry and we're looking into index mis-assignment on that right now.

In the SASI-seq paper Quail et al highlighted the issue of index mis-assignment and discussed the need for confirmation that contamination is not present before a data set is analysed. They presented a simple and inexpensive method to verify that results are not contaminated. They prepared a mix of three uniquely barcoded amplicons, of different sizes spanning the range of insert sizes one would normally use for Illumina sequencing, and added these to samples at a spike-in level of approximately 0.1%. They also designed a set of 384 11bp Illumina index sequences with high Hamming distance (5bp apart), giving higher levels of error correction and very low levels of barcode mis-assignment due to sequencing errors.
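To make the Hamming distance point concrete, here is a small sketch that checks the minimum pairwise distance of an index set and how many sequencing errors that distance lets you correct. The three 11bp sequences are made up for the example (they are not SASI-seq indexes) and the function names are my own; the only general fact used is that a set with minimum distance d can correct up to (d-1)/2 substitution errors, rounded down.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length index sequences differ."""
    if len(a) != len(b):
        raise ValueError("indexes must be the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(indexes):
    """Smallest Hamming distance between any two indexes in the set."""
    return min(hamming(a, b) for a, b in combinations(indexes, 2))


def correctable_errors(min_distance):
    """Substitution errors that can be corrected for a given minimum distance."""
    return (min_distance - 1) // 2


# Made-up 11bp indexes, purely illustrative.
indexes = ["AAAAAAAAAAA", "CCCCCAAAAAA", "AAAAAACCCCC"]
d = min_pairwise_distance(indexes)
print(f"minimum distance {d}, corrects up to {correctable_errors(d)} errors")
```

Note that this only protects against sequencing errors in the index read; it does nothing for mis-assignment caused by mixed clusters or recombination, where the wrong index is read perfectly well.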

Our PhiX mis-assignment analysis results: We took historical data to verify if PhiX mis-assignment was happening across all flowcells and could clearly see this was the case. (A) simply shows the percentage of PhiX we added to each lane. In (B) you can see that the majority of lanes show a reasonably low level of index mis-assignment to PhiX, at just 0.01-1% in single indexed samples (green), and 0.01-0.0001% in dual-indexed samples (red). Dual indexing appears to help significantly. We also saw that the level of PhiX contamination was worse on 2500 than 4000, and increased as the amount of PhiX used increased. In fact the rate of PhiX index mis-assignment was more strongly correlated with the amount of PhiX in the lane for single indexed samples than for dual indexed samples (C). We see PhiX appearing at as much as 1% of the sample in the very worst cases - however this is generally in single-indexed multiplexed sequencing with very high levels of PhiX e.g. low-diversity spiking.

Indexed versus non-indexed PhiX analysis: Whilst the Illumina PhiX control is not indexed, it is possible to purchase an indexed version from SEQMATIC. When we compared indexed versus non-indexed PhiX the results were clear - non-indexed PhiX shows around 0.02% bleed through, while the SEQMATIC indexed PhiX is around 0.005%; a four-fold reduction in bleed through.

Indexed versus non-indexed PhiX comparison

Index-read base-quality scores are worthless: We saw that mis-assigned PhiX (PhiX FQ below) reads generally had lower sequence read quality scores than the correctly assigned samples (D). The mis-assigned PhiX index reads also had generally lower quality scores than the correctly assigned samples (E & F), and it would be great to filter on base quality scores to remove mis-assigned reads. Unfortunately the quality score you get from an Illumina index read is pretty much useless. This is primarily due to its short length. Actually getting the index quality scores requires quite a bit of messing around with the default bcl2fastq pipeline.

These index Q-scores are currently discarded. Just to get the data for the plots below we had to rerun the flowcell through a modified bcl2fastq pipeline. Keeping index Q-scores would require changes to our default pipelines and an increase in our compute storage requirements. However we may be able to develop methods similar to Q-score binning, to reduce this extra data, and still allow an assessment of index quality.
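For anyone who does keep the index quality strings, the sketch below shows what a simple filter on them could look like. It assumes the qualities are stored in the usual Phred+33 Fastq encoding; the function names and the Q30 threshold are hypothetical choices of mine, and given how uninformative these short reads are the threshold would need testing before anyone relied on it.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 Fastq quality string into a list of integer Q-scores."""
    return [ord(ch) - 33 for ch in quality_string]


def index_read_passes(quality_string, min_q=30):
    """True if every base of the index read has a Q-score of at least min_q.

    min_q=30 is an arbitrary illustrative threshold, not a recommendation.
    """
    scores = phred33_scores(quality_string)
    return bool(scores) and min(scores) >= min_q


# Two made-up 8bp index quality strings: one clean, one with a poor base.
print(phred33_scores("IIII#III"))
print(index_read_passes("IIIIIIII"))
print(index_read_passes("IIII#III"))
```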

Going further than this Illumina sequencing might benefit from running a longer template read at the beginning of all reads e.g. read 1, i5, i7 and read 2. What the computational burden might be and exactly the impact on index mis-assignment this would have is difficult to predict. But even small reductions in errors like this would be worthwhile for low allele frequency applications. I'd expect that companies aiming for tumour screening in the general population (e.g. Grail) would benefit the most from doing these experiments.

PhiX mis-assignment analysis conclusions: Based on our analysis, and the results presented in Jeff's PhD, we've come to the conclusion that PhiX index mis-assignment is caused by two issues: index bleeding and/or poly-clonal clusters. And that this can be fixed or safely ignored.

In the figure above (1A) I've tried to present “index bleeding” - each library template cluster emits a signal according to its base-fluorophore, represented by the capitalised circles as GAT, (green=G/T, red=A/C), however this fluorescent signal “bleeds” outward from each cluster. A non-indexed PhiX cluster, represented by the lower-case circles, does not emit signal and is base-called from the erroneous "index bleeding" library cluster signal as gat. An indexed PhiX cluster emits a signal according to its base-fluorophore and is correctly base-called as CTA. In figure 1B I've tried to present what may be happening on mixed template poly-clonal clusters. These are caused by the random nature of clustering where some clusters are made from two template molecules, that may have seeded at different times. A cluster produced from a single library molecule (α) is correctly base-called as GAT. A mixed template non-indexed PhiX cluster (β) is base-called on the low-signal from the erroneous library cluster signal in the indexing read only, due to lack of PhiX index signal, as gat. A mixed template indexed PhiX cluster (γ) emits a signal according to its base-fluorophore that is higher than signal from the erroneous library cluster and is correctly base-called as CTA.

Index-bleeding should only be an issue for non-patterned flowcells, whilst poly-clonal clusters will be a problem on both patterned and non-patterned flowcells i.e. HiSeq 4000 and 2500.

How to fix the problem: for index mis-assignment to PhiX the fix is relatively straight-forward. Either use an indexed PhiX, or spike in an oligo to the indexing read primers such that PhiX generates a signal. Both strategies will mean the PhiX clusters generate a signal that outcompetes the index-bleeding, or poly-clonal cluster signals. PhiX will no longer appear in your demultiplexed fastq, or will be at such low levels you'd only see it if you specifically went looking.

Unfortunately index mis-assignment between samples is still an unresolved issue. In a follow up post I'm going to discuss what we've seen, and what the apparent causes are. Again some relatively simple fixes are available - but if you are using multiplexed sequencing to detect low-frequency alleles in populations (e.g. cancer, single-cells, population genomics) then you need to consider whether you understand how your experiments might be affected.

PS: I think it is pretty lax of Illumina not to provide an indexed PhiX. The V2 PhiX was indexed but V3 dropped this, probably due to there only being 96 TruSeq indexes. Come on Illumina sort this one out!

Useful references:

Kircher et al. 2011: Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform.
Quail et al. 2014: SASI-Seq: sample assurance Spike-Ins, and highly differentiating 384 barcoding for Illumina sequencing.
Hussmann J PhD thesis 2015: Expanding the applications of high-throughput DNA sequencing.
Phillipe et al. 2015: Accurate multiplexing and filtering for high-throughput amplicon-sequencing.
Carlsen et al. 2012: Don’t make a mista(g)ke - is tag switching an overlooked source of error in amplicon pyrosequencing studies?
Mukherjee et al. 2015: Large-scale contamination of microbial isolate genomes by Illumina PhiX control.
Williams et al. 2006: Amplification of complex gene libraries by emulsion PCR. Nature Methods, 3(7):545–550.
Meyerhans et al. 1990: DNA recombination during PCR.
Mamanova et al. 2010: Target-enrichment strategies for next-generation sequencing.

The future of Illumina according to @chrissyfarr

2016-09-20T15:34:00.001+01:00

In yesterday's Fast Company piece Christina Farr (on Twitter) gives a very nice write up of Illumina's history and where they are going with respect to bringing DNA sequencing into the clinic. I really liked the piece and wanted to share my thoughts after reading it with Core-Genomics readers.

To showcase how Illumina is impacting medicine Christina mentions two recent Illumina spin-outs: Helix (an Apple-esque app store for genome applications) and Grail (aiming to develop early cancer detection tests from deep sequencing of ctDNA). She also highlights some wonderful examples of where Illumina themselves have applied sequencing to clinical cases, from the Jaynome (Flatley's own genome) and the discovery of his having the condition malignant hyperthermia, to the more compelling rare disease cases such as Massimo, a boy with a genetic mutation causing HBSL (Hypomyelination in the Brain stem and Spinal cord), a new disease found only by the use of Illumina's technology.

Next-generation sequencing is changing medicine and the reality is when we say NGS most of us mean Illumina sequencing - for now at least.

New business models are emerging in genomics: Illumina's Helix is subsidising exome sequencing costs with the hope that users will pay to query the data over time and that this use will more than cover sequencing costs. In an era of very low borrowing costs buying in now to sequence 100 million genomes might only require users to sign up to a $10 a month plan for the rest of their lives, with queries costing a few dollars - in the case of Flatley's own malignant hyperthermia, which can result in sudden death while under general anesthesia, a user might query this before deciding on surgery, for instance. Or a family might check for an MT-RNR1 m.1555A>G mutation before their child is treated with gentamicin, saving the 1:500 kids with this particular variant from going deaf while in the ICU.

$10 per month is pretty low compared to life-insurance policies and if Illumina or others can do a deal with the "Man from the Pru" personalised genomes outside of the clinic really could become the norm. $10 per month over 10 years is $1200 versus a $1000 genome, but over 40 or even 80 years should be attractive, and this does not consider the reselling of consumer genomics data as 23andMe are showing is possible.

The negative impact of Illumina's lack of competition: Several times during her article Christina comes back to an issue Illumina are facing more and more: the fine line Illumina are walking to bring new products to clinical and even consumer markets without competing with their academic and clinical customers. The liquid biopsy market is predicted to be worth $1 billion by 2020 (personally I reckon a figure much higher than this), and NIPT possibly $2.4 billion by 2022. The size of these markets is a temptation for the company that is delivering most of the infrastructure being used to service them today.

John Stuelpnagel (Illumina's cofounder) and Jonathan Groberg (biotech analyst at UBS) both express some reservations about where Illumina are going in the comments Christina quotes in her article. John immediately jumps into one of the worries I hear about at conferences and meetings, especially when talking to the commercial sector: he says "people [companies] are apprehensive about Illumina and worried about if, and when, they might choose to compete against them". When asked about this fine line that Illumina should walk to stay on the right side of their customers Jon Groberg says "As Illumina moves into the clinical markets, it's making for some tough conversations", and Christina acknowledges that some of the people she spoke to were reluctant to talk openly. This comes out later in the article when Christina is interviewing Christian Henry (Illumina EVP & COO) about the purchase of Verinata and the signal it sends to Illumina's users, who may now be viewed as competitors. Whilst Henry is clear that competition with customers is "a foundational question for Illumina" (i.e. Illumina does not want to compete directly), Groberg adds that Illumina might be unable to NOT compete. And a description of Illumina as an "800-pound gorilla in genomics" by 23andMe’s director of research Joyce Tung is not completely flattering.

In the article Christina highlights Illumina's early days facing litigation from the likes of Affymetrix, where it was the underdog, to its own litigation against ONT, where it has been described as a bully trying to stifle competition. Illumina's dominance in the NGS market is so large that questions are being asked about whether it is unfairly abusing its monopoly position. As a long-term user, and having previously been described as "an Illumina fan-boy", I see Illumina's dominance as down to the simple fact that they bought the best technology (an element of luck), but they put a team together that made it work really really well (they made their own luck by investing and working hard). It is Illumina's investment in R&D that has given us the family of instruments from the MiniSeq to the HiSeq X. I'd love to see stronger competition, but it's not there yet, and some big guns have tried and failed (454, LifeTech and CGI). I hope Illumina don't become another ABI bullying other companies trying to get into the space, as well as users - 10 years ago ABI was not a nice company to work with and users were pretty happy to drop them and move over to Illumina. I'm sure Illumina are working on not making the same mistake. But in her article Christina mentions that some of the people she spoke to were afraid to talk openly about this aspect of their relationships with Illumina.

NGS is here to stay and it is going to become more and more common to hear about it in the news and even down the pub. Jay Flatley, Shankar Balasubramanian, David Klenerman et al, Solexa and Illumina will be remembered for developing a technology that changed the world (has anyone written a screenplay?). Illumina may not be an Apple yet, but it can't be far away. However predicting the future of NGS has proven to be tough; nearly everyone has under-estimated what/when something might be possible in the future. New technologies like Oxford Nanopore's sequencers are looking like they may be ready for the clinic in as little as two or three years.

I am certain that after almost ten years working with NGS the next ten are likely to be almost as exciting.

Reporting on Fluidigm's single-cell user meeting at the Sanger Institute

2016-09-16T14:49:00.001+01:00

The genomics community is pushing ahead fast on single-cell analysis methods as these are revolutionising how we approach biological questions. Unfortunately my registration went in too late for the meeting running at the Sanger Institute this week (follow #SCG16 on Twitter), but the Fluidigm pre-meeting was a great opportunity to hear what people are doing with their tech. And it should be a great opportunity to pick other users' brains about their challenges with all single-cell methods.

Imaging mass-cytometry: the most exciting thing to happen in 'omics?

Mark Unger (Fluidigm VP of R&D) started the meeting off by asking the audience to consider the two axes of single-cell analysis: 1) Number of cells being analysed, 2) what questions can you ask of those cells (mRNA-seq is only one assay) - proteomics, epigenetics, SNPs, CNVs, etc.

Right now Fluidigm has the highest number of applications that can be run on single cells, with multiple Fluidigm and/or user developed protocols on the Fluidigm Open App website; 10X Genomics only have single-cell 3' mRNA-seq right now, as do BioRad/Illumina and Drop-seq. But I am confident other providers will expand into non 3' mRNA assays...I'd go further and say that if they don't they'll find it hard to get traction as users are likely to require a platform that can do more than one thing.

There are three sessions over the two days:

Session I: Single-cell heterogeneity, classification and discovery
Session II: Immunotherapy in oncology—new insights at single-cell resolution
Session III: Single-cell functional biology

Session I: Single-cell heterogeneity, classification and discovery

Achieve new insights through single-cell biology. Candia Brown, Director Strategic Marketing, Fluidigm

Candia asked the audience "what are we trying to do with single-cell genomics methods?" She focussed her brief introductory presentation on understanding biological mechanisms and pathways, cell differentiation, cell lineage, etc, and on biomarker discovery, therapeutics...or even use in the clinic in the future? Much of the initial work has been done on identifying cell types within populations and to understand heterogeneity. Moving beyond this kind of classification requires more complex methods and analyses. Ultimately we'll need to be using spatio-temporal methods such as in-situ sequencing of carefully prepared samples, and combination analyses with data from RNA, DNA and proteins. We need to detect from single cells (this was a hot topic for Fluidigm at the beginning of 2016) and Candia showed examples of population classification and discussed how we might move past relatively "simple" atlasing studies to more complex experiments that aim to make mechanistic insights. Fluidigm aim to present all the latest updates on their tech during this meeting for the C1, Biomark, Helios and Polaris systems.

Dissecting cerebral organoids and fetal cortex using single-cell RNA-seq. Gray Camp, a post-doc in Svante Pääbo's group at Max Planck Institute for Evolutionary Biology, Germany. Gray is also collaborating closely with the Treutlein lab.

Cerebral organoids make biological experimentation easier in the same way that tumour organoids are better informing cancer biology. The group are deconstructing cellular heterogeneity in cerebral organoids using single-cell RNA-seq compared to bulk analysis. Now using organoids developed from patients to generate samples that recapitulate periventricular neuronal heterotopia. Also following the reprogramming of fibroblasts into induced neurons (recently published in Nature and in their News and Views). This great editorial in Development discusses the impact that organoids are having on biological research.

Becoming a new neuron in the cerebral cortex. Ludovic Telley, University of Geneva, Switzerland.

Ludovic is also talking about cells in the brain; single-cell methods are having a huge impact on brain biology. His talk focussed on the "L4 neurons", the main recipient of sensory input into the brain. The group use a novel technology called FlashTag to visualise and isolate neurons during their development (see the Science 2016 paper). Isolated neurons are then profiled using Fluidigm single-cell RNA-seq to track neuronal transcriptional programs. They found that waves of transcriptional activity are seen as each neuron progresses from proliferative to migratory and finally to connectivity phases.

A cost-effective 5’ selective single-cell transcriptome profiling approach. Pascal Barbry, Institut de Pharmacologie Moléculaire et Cellulaire, France.

Pascal's group are using Fluidigm single-cell methods to investigate mucociliary differentiation. Today he describes the modified SMART-seq method they developed incorporating on-chip barcoding and UMIs. This is somewhat similar to STRT-seq published in 2011, but now on the Fluidigm IFC. Pascal spent some time describing the impact of UMIs (Unique Molecular Identifiers), showed the figure from Cellular Research's PNAS paper, and mentioned one of the four methods to reduce RNA-ligation biases. After processing, cDNA is fragmented and 5' fragments are isolated by the biotin tag before completion of library prep and sequencing. He showed data on performance and reproducibility of the assay: reads are very biased to the 5' end of transcripts (but they have not compared directly to CAGE data), they saw about 25% efficiency of ERCC cloning, and the data suggest that more than 1 million reads per cell is unnecessary. Interestingly they saw a correlation of 0.9 for a C1+IonProton versus Drop-seq+Illumina, but with a reasonable number of genes that appear to be present in only one method! The script will appear on Fluidigm's Open App site after publication!

Pascal briefly mentioned their work on the 800 cell IFC; they're pretty happy so far, but would like to be sequencing on NextSeq, which needs lots of PhiX to be added due to the need to read through the oligo-dT sequence. He suggested starting sequencing from the 5' end instead.

Single-cell analysis of clonal dynamics and tumour evolution in childhood ALL. Virginia Turati, Enver lab UCL, UK.

ALL is the most common childhood cancer with 1 in 2000 affected and around 500 cases per year in the UK. ALL was one of the first diseases where branching evolution was described. The group are using Fluidigm C1 single-cell analysis in a "mouse clinic" built from primary patient tumour material, where treatments can be monitored over time. Analysis during chemotherapy of PDXs shows no impact on intratumour heterogeneity, i.e. PDXs recapitulate the patient tumour. Single-cell WGS was much more difficult than RNA-seq! But an average of 37 CNVs were found in each cell. They are generating around 10 million reads per cell to generate a coverage of around 0.2x. They saw multiple variants around the CDKN2A locus.
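The talk did not give read length or the fraction of reads lost to duplicates and poor mapping, so the sketch below works backwards from the two numbers quoted (about 10 million reads, about 0.2x) to the number of usable bases each read must contribute. The 3.2 Gb haploid genome size and the function names are my assumptions.

```python
HUMAN_GENOME_BP = 3.2e9  # assumed haploid human genome size


def coverage(n_reads, bases_per_read, genome_bp=HUMAN_GENOME_BP):
    """Mean depth of coverage from a read count and usable bases per read."""
    return n_reads * bases_per_read / genome_bp


def implied_bases_per_read(n_reads, target_coverage, genome_bp=HUMAN_GENOME_BP):
    """Usable bases per read needed to reach a given coverage."""
    return target_coverage * genome_bp / n_reads


usable = implied_bases_per_read(10e6, 0.2)
print(f"~{usable:.0f} usable bases per read gives {coverage(10e6, usable):.2f}x")
```

On those assumptions each read contributes roughly 64 usable bases, so the quoted figures are consistent with fairly short reads, or longer reads after filtering.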

Virginia presented some data that shows how small numbers of cells (Freddy) overlap transcriptomes with resistant cells, suggesting that these are evolving towards resistance. Understanding this process is key to improving outcomes for patients. They are aiming to identify a signature of resistant cells to use in the clinic.

See more with the C1™: explore the breadth of applications available on the C1 platform for single-cell genomics. Shaun Cordes, Senior Product Manager, Fluidigm.

Shaun gave an overview of the different methods users can run on the C1 system. He also confirmed the 10,000 cells coming soon, as is a Fluidigm automated imaging system which includes a cloud based software toolkit. New applications coming include single-cell protein analysis with two antibodies carrying probes that allow qPCR analysis (read more about the Proximity Ligation Assay approach in the Science 2015 paper).

Session II: Immunotherapy in oncology—new insights at single-cell resolution

Mass Cytometry applications from Fluidigm. Gary Impey, Director, Product Management - Mass Cytometry, and Robert Ellis, Director, Product Management, Fluidigm.

About half the audience are either using mass-cytometry already, or are considering using it. A search on PubMed for "mass-cytometry" or "CyTOF" results in 196 papers - a pretty high number given how new this method is. Gary is talking about how Fluidigm's Helios system can be used to interrogate cells for immunogenic markers. Gary referenced a Wall Street Journal article: Immunotherapy and cancers super survivors. David Lane (formerly Chief Scientist at CRUK) was quoted as saying “It’s the most exciting thing I’ve ever seen”. To get real insights we need highly-dimensional single-cell methods - Fluidigm's Helios CyTOF is one tool that can help.

Fluidigm currently have 50 high-purity metal isotope tags which allow generation of data with minimal biological or technical noise. Metals are tagged to antibodies and these are used to tag cell surface or intra-cellular markers.

Robert is presenting an overview of a new method called imaging mass-cytometry (see the figure at the top of this post - it may be the most exciting thing to happen in 'omics in a while). This allows spatial resolution of proteomic data from tissues in-situ. The system requires a new box to be bolted onto the Helios instrument to perform imaging: a UV laser vaporises tissue by scanning across the section one line at a time (approximately 1 µm per pixel), and the ionised tissue goes into the mass-cytometer for semi-quantitative analysis. It works with fixed or frozen tissue on standard microscope slides. The process takes approximately 1 hour to get a region 0.5mm square - highly detailed but highly focused (spatially). Robert presented software developed in the Bodenmiller group at ETH, Zurich. You can do LCM-style selection and pick defined regions.
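To put "highly detailed but highly focused" into numbers, here is a rough Python sketch built only from the figures above. It assumes "0.5mm square" means a 0.5 mm x 0.5 mm region, a 1 µm pixel, and a constant acquisition rate; the function names and the 1 cm x 1 cm comparison are mine.

```python
def pixels_in_region(side_mm, pixel_um):
    """Number of pixels in a square region scanned at a given pixel size."""
    pixels_per_side = round(side_mm * 1000 / pixel_um)
    return pixels_per_side ** 2


def pixels_per_second(side_mm, pixel_um, hours):
    """Acquisition rate implied by scanning a square region in a given time."""
    return pixels_in_region(side_mm, pixel_um) / (hours * 3600)


def hours_for_region(side_mm, pixel_um, rate_px_per_s):
    """Time to scan a square region at a fixed acquisition rate."""
    return pixels_in_region(side_mm, pixel_um) / rate_px_per_s / 3600


rate = pixels_per_second(0.5, 1, 1)  # quoted: ~1 hour for 0.5 mm square
print(f"{pixels_in_region(0.5, 1)} pixels at ~{rate:.0f} pixels per second")
print(f"A 1 cm x 1 cm section would take ~{hours_for_region(10, 1, rate):.0f} hours")
```

At that rate a whole section is out of reach, which is why the LCM-style region selection matters.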

Robert showed some wonderful images of imaging mass-cytometry compared to IHC or FISH. Also some slides from David Hedley's group at Toronto. You can label your own antibodies using a kit from Fluidigm, but Robert showed a slide of their Immuno-Onc panel with a broad concentration range for different antibodies - just how much empirical work is required to get the balance right is unclear!

Imaging Mass Cytometry—about proteins, tissues and biomedical research. Valerie Dubost and Markus Stoeckli (also on the SAB of Imabiotech a CRO for mass-cytometry imaging), Novartis, Switzerland.

Valerie is talking about her early access results from the imaging mass-cytometry methods presented by Robert. Valerie is a histologist so her perspective will be an interesting one, and potentially give insights into how likely this technology is to make it into the clinic. Novartis have moved quickly to build a cross-functional team to focus on mass-cytometry imaging technology application and development. Using FFZN and FFPE tissue, they incubate a panel of up to 30 antibodies, and slides are loaded into the imaging mass-cytometer for laser ablation and analysis.

Data presented included validation of the antibodies - this is critical and too many scientific papers are messed up by the use of poorly characterised antibodies. Comparison of IHC to IMC looked excellent. She showed beautiful images of cell segmentation by Voronoi boundaries. The need to carefully consider cellular architecture is important in interpreting results from IMC - you are still going to need a pathologist to help interpret this kind of data. Pathology:Molecular Pathology:IMC Pathology is going to increase our understanding of tissue architecture, and possibly interactions.

Session III: Single-cell functional biology

An introduction to single-cell functional biology. Simon Margerison, Senior Manager, Application Support, Fluidigm

Simon gave an overview of how the Helios and Polaris systems can be used to investigate functional single-cell biology. We heard lots about the Helios yesterday and Simon showed some cancer data using panels where 10 markers were used for phenotyping and 30+ markers used to investigate functional biology.

However Simon spent a little more time describing the Polaris system which was not really mentioned yesterday. This is a system that allows you to select 48 single cells and culture them for up to 24 hours while modulating the environment - this is automated cell culture and I'm hoping Polaris is the first of many such systems that will allow highly parametric experiments to be performed where instead of a simple A vs B, treated and untreated experiment, we'll do A, B, C, D, E, F & G, treated at different doses and times, all without being messed up in the tissue culture lab.

A holistic view of the mucosal immune system: identification of tissue- and disease-specific cellular networks. Frits Koning, Leiden University Medical Center, Netherlands

Frits is presenting work published recently in Immunity. His lab has built a mass-cytometry panel to look at heterogeneity of the adaptive and innate immune compartment, applied to human intestinal samples (Coeliac disease). He presented data from an initial cohort of 44 patients. 8 months to generate the data, 6 months to analyse it - a common bioinformatics challenge! He showed a merged scatterplot of all 2.5 million cells from all 44 patients; the different cell types clearly separate into the canonical immune cell populations. However the different samples (PBMC vs colon) and individual patients show very different enrichments for cell populations.

They were able to distinguish distinct mass-cytometry signatures that divide patients from controls, and were able to detect patients with mucosal lymphoid malignancies. His group has been working hard on developing computational methods to analyse these huge datasets quickly: all 5.2 million cells in 1 hour on a 32GB laptop! See the Cytosplore website for more details. Frits was very bullish about the use of mass-cytometry in the clinic and finished by saying "we are moving towards an unbiased diagnostic tool".

The nature and nurture of cell heterogeneity: single-cell functional analysis, temporal single-cell sequencing and imaging of gene edited macrophages. Esther Mellado, Wellcome Trust Centre for Human Genetics, UK.

Esther's work is the focus of a spotlight article on Fluidigm's website. She is running the Polaris system at the WTCHG and presented her work isolating single cells and perturbing them to understand the role of macrophages in HIV pathology, and in particular cells with mutations in the SAMHD1 gene and the effect of this mutation on HIV latency. They used multiple microenvironmental conditions in early and late activation so adjusted dosing for either 1 or 8 hours, comparing mutant and wild-type macrophages across 10 replicates. They performed high-resolution imaging off the Polaris to investigate morphology and behaviour. They saw that knockout of SAMHD1 has important paracrine signalling effects.

The WTCHG team call the Polaris their "10 Postdocs in a Box". It allows much more complex experiments to be performed than an individual in the lab can realistically manage. As I said above I'm hoping Polaris is the first of many automated cell culture systems - and ideally we'd see instruments that can handle bulk cells too.

Understanding cellular heterogeneity. Sarah Teichmann, Wellcome Trust Sanger Institute and EMBL-EBI, UK

Sarah is presenting her group's work on cellular heterogeneity; it turns out that much of this is of functional significance. She stumbled upon this when she could not relate the abundance of transcripts from bulk RNA-seq to counts from single-molecule RNA-FISH. Bulk RNA is limiting, single-cell rocks!

She presented data from a new publication just deposited on the BioRxiv: Temporal mixture modelling of single-cell RNA-seq data resolves a CD4+ T cell fate bifurcation. They used temporal modelling of single-cell RNA-seq to analyse development of Th1 and Tfh cell populations in mice infected with Plasmodium, and show that a single cell gives rise to both cell types. I'd really suggest reading the paper.

10X Genomics publications

2016-09-14T10:53:00.002+01:00

Anyone that's been reading Core-Genomics will have seen my interest in the technology from 10X Genomics. I've been watching and waiting for publications to come out to get a better understanding of how people are using the technology and thought you might like my current list of articles: many of these are on the BioRxiv and should be available in a reputable journal if you're reading this in 2017 or later!

The number of 10X Genomics publications is going to grow rapidly; and this list will only be updated sporadically!

Direct determination of diploid genome sequences. BioRxiv 2016 Aug.

This paper by Deanna Church and David Jaffe et al describes the 10X Genomics Chromium phasing technology. I've done a more comprehensive write up of this paper here on Core-Genomics. Essentially this is the paper to refer to if you're considering using Chromium phasing in your own research and want to better understand how it works and what you can do. The authors explain the basic principles of generating LinkedReads, and present data on 7 Human genomes successfully assembled from HiSeq X data using the Supernova algorithm. Assemblies are good with 100kb+ contigs and 2.5Mb phase blocks, and the HGP sample used had excellent alignment to the reference along a 162kb contig.

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter. BioRxiv 2016 Aug.

The authors present ABySS 2.0 and compare it to the previous version and 5 other assemblers: BCALM2, DISCOVAR, Minia, SGA and SOAPdenovo. They used the Genome in a Bottle data: 70X coverage Human genome using Illumina paired 250bp reads (PE250) as well as mate-pair data, 10X Genomics Chromium data, and BioNano optical mapping data. ABySS 2.0 generated an N50 of 3.5 Mb using only 35 GB of RAM (still won't run on your MacBook Pro). Whilst this is not a 10X paper per se they do discuss the limitations of current short-reads and the impact the 10X technology is likely to have on assembly: including the BioNano Genomics and 10X Chromium data increased N50 from 29 to 42 Mb. In Fig. 3 from the paper (see below) the authors show all of the 90 scaffolds over 3 Mb, which add up to 90% of the genome, and state that "most chromosome arms are reconstructed by 1 to 4 large scaffolds".

Fig.3 from Jackman/Vandervalk et al 2016

High-Quality Assembly of an Individual of Yoruban Descent. BioRxiv 2016 Aug.

The authors present a hybrid assembly of NA19240 using multiple technologies including PacBio, BioNano Genomics, Illumina sequencing, 10X Genomics LinkedReads, and BAC hybridization and sequencing. They explain the need for multiple technologies given that no single method "can fully resolve every genomic feature and/or region"; and argue that BAC tiling is still a useful technology. I'd be interested to know how useful this might be once 10X Genomics becomes standardised, as the time and cost involved in BAC library construction, mapping and sequencing, let alone the huge amount of DNA required, is quite outside the reach of most labs.

The assembly presented is the first in a set of 5 genomes which the authors are aiming to use to improve the diversity of the reference genome. They refer to "Gold" and "Platinum" genomes but I cannot tell which the final assembly was considered. The final assembly had an N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb, which according to the authors "represents one of the most contiguous high-quality human genomes".
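N50 comes up in nearly every assembly paper in this list, so here is a minimal Python sketch of how it is usually calculated: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name and the toy lengths are mine, not taken from any of the papers.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


# Toy assembly (arbitrary units): total is 290, so half is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half, so N50 is 70
```

The same function works for contig N50 and scaffold N50; only the input list changes.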

A hybrid approach for de novo human genome sequence assembly and phasing. Nat Methods. 2016 Jul:

This paper describes a combinatorial approach to de novo assembly and phasing analysis using Illumina sequencing, 10X Genomics (GemCode) LinkedReads, and BioNano Genomics mapping; again using NA12878.

Massively parallel digital transcriptional profiling of single cells. BioRxiv 2016 July:

This paper describes the 10X Genomics single-cell 3' mRNA-seq technology. I'd previously covered this paper here on Core-Genomics. Essentially this is probably the paper to read if you are considering 10X Genomics single-cell RNA-seq in your own research and want to better understand how it works and what you can do. The authors explain the basic principles of the methods, and present data from 250,000 cells across 29 samples. An awesome paper...when does it come out in a journal?

Ben Hindson (10X Genomics CSO) will be presenting this work at the San Diego Festival of Genomics if you'd like to know more.

"The 1/4 of a million cell RNA-seq paper!" http://ctt.ec/51O3p+

Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016 Jun (originally on BioRxiv).

This is the massive GIAB (Genome in a Bottle Consortium) paper describing NA12878 and other reference materials sequenced across multiple technologies. These include: Illumina WGS paired-end, mate-pair, Moleculo and exomes, PacBio, BioNano Genomics, Ion Proton exome, 10X Genomics GemCode, Oxford Nanopore MinION; and the now defunct SOLiD, and Complete Genomics paired-end and LFR technologies.

The coverage for the 10X Genomics data is only 25x and was produced using the GemCode platform so is not really representative of what 10X Genomics would recommend today. The data is available from 10X Genomics and from the GIAB ftp.

Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. BioRxiv 2016 Jun.

This paper presents a novel clustering method for single-cell RNA-seq data. One of the data sets they used was the 10X Genomics single-cell RNA-seq of PBMCs from Zheng et al 2016. They present their method: SIMLR (single-cell interpretation via multi-kernel learning), and show that it more accurately defines subpopulations from single-cell data than either t-SNE or PCA methods.

Fig 5: 2D visualisation of data from 5 cell sub-populations by PCA (b), SIMLR (c) and t-SNE (d).

Health and population effects of rare gene knockouts in adult humans with related parents. Science Apr 2016 (originally on BioRxiv).

This paper presents the use of 10X Genomics phased genome sequencing as a confirmatory method in a study identifying gene knockouts created by rare homozygous predicted loss of function (rhLOF) variants from exome sequencing data. In one case a PRDM9 rhLOF was confirmed by 10X Genomics sequencing. PRDM9 is a gene involved in the localisation of meiotic crossovers, however the individual was healthy and fertile. The results suggest there are alternative mechanisms of localising human meiotic crossovers, as PRDM9 LOF leads to infertility in mice and an inability to repair double-strand breaks. The authors state that we need to be careful when interpreting predicted loss of function events.

Third-generation sequencing and the future of genomics. BioRxiv April 2016.

This review of third-generation NGS systems describes 10X Genomics Chromium genome technology as a mapping, rather than a sequencing application. 10X Genomics is lumped in with BioNano Genomics, Dovetail Genomics cHiCago (HiC) method, genetic maps and mate-pair mapping. The paper includes a great table highlighting the characteristics of the different 3rd-gen platforms (reproduced below).

Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016 Mar.

This paper from Hanlee Ji's group at Stanford and 10X's Ben Hindson et al describes the 10X Genomics GemCode phasing technology. It is the first paper to demonstrate droplet methods for phasing and structural variant analysis. This is the other paper you should refer to if you are considering phasing in your own research, but the more up-to-date Chromium BioRxiv Church/Jaffe paper (see above) will give you better information about the technical performance today (Sept 2016).

Fig 1: 10X technology overview

This paper demonstrates what can be done for cancer genomes, and that is what makes it such an important read for people deciding if the 10X tech might be useful in their research here at the CRUK Cambridge Institute. I've previously written about why I'm excited about using phasing to resolve complex structural rearrangements and determine if multiple variants in the same gene are in cis/trans (cis- is on the same allele, trans- is on the second allele).
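The cis/trans idea can be illustrated with a toy Python sketch. It rests on the principle that reads sharing a barcode mostly come from the same long DNA molecule, and so from the same haplotype: if the reads carrying the variant allele at site A share barcodes with reads carrying the variant allele at site B, the two variants are in cis; if they share barcodes with reads carrying the reference allele at B, they are in trans. The function, the barcode names and the simple majority rule are all hypothetical and hugely simplified; real phasing software has to handle barcode collisions, sequencing errors and variants further apart than a single molecule.

```python
def phase_relationship(barcodes_alt_a, barcodes_alt_b, barcodes_ref_b):
    """Toy cis/trans call for two heterozygous variants from shared barcodes."""
    alt_a = set(barcodes_alt_a)
    cis_support = len(alt_a & set(barcodes_alt_b))    # alt A seen with alt B
    trans_support = len(alt_a & set(barcodes_ref_b))  # alt A seen with ref B
    if cis_support > trans_support:
        return "cis"
    if trans_support > cis_support:
        return "trans"
    return "unresolved"


# Hypothetical barcodes: two molecules carry both variant alleles.
print(phase_relationship({"BC1", "BC2", "BC3"}, {"BC2", "BC3", "BC9"}, {"BC7"}))
```

In this made-up example barcodes BC2 and BC3 support both variant alleles and none support the alternative, so the call is cis.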

For a single colorectal cancer patient they generated 50x Illumina WGS and 30x 10X Genomics WGS (the choice of the name 10X Genomics requires some explanation: written as it is here, only the inclusion of the word "genomics" in the sentence makes it easily interpretable. And given that most 10X Genomics phasing data will be generated on Illumina's X Ten we're having to ask for "10X on X Ten" or "an X-Ten 10X genome" - I get them mixed up in conversations and I know PIs and post-docs do too!) Multiple deleterious cancer mutations, including the known driver genes TP53 and NRAS, five rearrangements and 26 copy-number variants were found. The most interesting result presented was a C>T mutation in TP53 that causes a deleterious nonsynonymous R213Q substitution, confirmed in the LinkedRead data as being on one haplotype. The other haplotype was shown to be deleted in the same region leading to LOH, with the only copy present having the TP53 C>T mutation, resulting in a single but inactivated copy of TP53. This phased cancer genome was produced from 1ng of ~50kb DNA, from a sample with 70% tumour purity - this is pretty close to many samples that people are collecting, but the careful reporting of this kind of information is going to be vital as we understand which samples might sensibly be run on the 10X Genomics tech, and which we should leave for now.

A previously validated EML4-ALK translocation was detected in lung cancer cell line NCI-H2228. To target the exome for phasing 10X Genomics and Agilent have partnered on a modified capture panel that includes baits designed to target the introns and improve pull-down of the large genomic fragments. The sequencing of 200X was after removal of duplicates so this could be very deep sequencing indeed. However the 10X Genomics data revealed that this is not a simple inversion, but is more complex, with a deletion including exons 2–19 of ALK.

They discuss the Moleculo tech (actually refs 6-9 from the paper) from Illumina pointing out the main reasons that these methods are sub-optimal are the relatively large amount of DNA used and the relatively low number of partitions generated - both limiting how well the technology can be applied.

The authors conclude their discussion with the following statement "phased cancer genomes will provide new insight into the genomic alterations underlying tumor development and maintenance". I think the next few months will see other papers being published confirming how useful the technology really is. And who knows how soon we might see a phasing panel specifically for DNA repair genes being used in the clinic for instance?

Haplotypes drop by drop. Nat Biotechnol 2016 Mar.

In this news and views article Jacob Kitzman (University of Michigan) describes the data from the Zheng et al paper (see above) in the same issue, and explores the impact it might have in the field. This paper clearly describes the issue that clinicians want to understand: are both copies of a gene affected, e.g. as in cystic fibrosis, where two mutations, one on each allele, knock out both copies of the CFTR gene, or is the same haplotype hit twice with mutations in cis?

He suggests how other methods might be improved by the use of 10X phasing technology including metagenomics (we're trying this with a collaborator), and for phasing cDNA to analyse transcriptomes more deeply with regard to splice isoform diversity.

One of the questions Kitzman poses is "Whether the 10X Genomics platform will be widely adopted may depend as much on its cost above and beyond standard whole-genome shotgun sequencing as on its technical merit." The papers above are showing just how useful the 10X Genomics tech is turning out to be...but as I said at the start of this post this list is going to grow rapidly; and this list will only be updated sporadically!

10X Genomics phasing explained

2016-09-09T11:57:00.001+01:00

This post follows on from my previous one explaining the 10X Genomics single-cell mRNA-seq assay. This time round I'm really reviewing the method as described in a paper recently put up on the BioRxiv by 10X's Deanna Church and David Jaffe: Direct determination of diploid genome sequences. This follows on from the earlier Nat Methods paper which was the first 10X de novo assembly of NA12878, but on the GemCode system. While we are starting some phasing projects on our 10X Chromium box the more significant interest has been on the single cell applications. But if we can combine the two methods (or something else) to get single-cell CNV then 10X are onto a winner!

The paper describes the 10X Genomics Chromium phasing technology. They highlight the impact of their tech by first reminding us that the majority of Human genomes sequenced to date are analysed by alignment to the reference (an important point often forgotten by users). They say that only a few de novo Human assemblies have been created, but that most do not truly represent complex biological genomes. The authors only consider two published genomes as true diploid de novo assemblies - Levy et al. PLoS Biol 2008: The diploid genome sequence of an individual human and Cao et al. Nat Biotech 2015: De novo assembly of a haplotype-resolved human genome.

The method: They introduce the 10X Chromium library prep. This starts with 1.25ng of >50kb DNA, from which 16bp barcoded random genomic loci are copied (by polymerase extension?) inside the Chromium gel-beads. Each of these contains around 10 molecules per droplet equal to ~0.5 Mb of the genome. The most important bit of the tech is the ability to put just 0.01% of the diploid Human genome into a single droplet - this makes the probability of both alleles being present vanishingly small. With 2 lanes of X Ten you can expect to get about 60X Human genome coverage and the authors calculate the number of "linked reads" per molecule as 60, which equates to around 0.4x coverage (enough for shallow CNV sequencing to reveal clonality in Tumours perhaps).
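As a rough check on the numbers above, here is a minimal sketch of the arithmetic. The 50 kb molecule length, the 2 x 150 bp read pairs and the ~6.4 Gb diploid genome size are my assumptions for illustration, not values taken from the paper.

```python
# Back-of-envelope check of the per-droplet and per-molecule figures.
# Assumed (not from the paper): 50 kb molecules, 2 x 150 bp read pairs,
# and a diploid Human genome of ~6.4 Gb.

MOLECULE_LEN_BP = 50_000
MOLECULES_PER_DROPLET = 10
DIPLOID_GENOME_BP = 6.4e9
READ_LEN_BP = 150
LINKED_READS_PER_MOLECULE = 60  # read pairs, as quoted in the post


def droplet_fraction(molecule_len=MOLECULE_LEN_BP,
                     molecules=MOLECULES_PER_DROPLET,
                     genome=DIPLOID_GENOME_BP):
    """Fraction of the diploid genome carried by one droplet."""
    return molecule_len * molecules / genome


def per_molecule_coverage(linked_reads=LINKED_READS_PER_MOLECULE,
                          read_len=READ_LEN_BP,
                          molecule_len=MOLECULE_LEN_BP):
    """Sequence coverage of a single molecule from its linked read pairs."""
    return linked_reads * 2 * read_len / molecule_len


print(f"{droplet_fraction():.4%} of the diploid genome per droplet")
print(f"{per_molecule_coverage():.2f}x coverage per molecule")
```

Under these assumptions the droplet fraction comes out just under 0.01% and the per-molecule coverage a little under 0.4x, in line with the figures quoted above.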

Question to the authors: I do not understand the statement about smaller genomes getting lower linked read coverage: "For smaller genomes, assuming that the same DNA mass was loaded and that the library was sequenced to the same read-depth, the number of LinkedReads (read pairs) per molecule would drop proportionally, which would reduce the power of the data type. For example, for a genome whose size is 1/10th the size of the human genome (320 Mb), the mean number of LinkedReads per molecule would be about 6, and the distance between LinkedReads would be about 8 kb, making it hard to anchor barcodes to short initial contigs." My first assumption was that genome size would have no impact on linked read depth, but it would significantly affect the amount of the genome present in a single droplet. As such the smaller genome, with DNA fragments of the same size, should still have around 60 linked reads per DNA molecule, but a 10 Mb genome would mean 5% was in each droplet, making the phasing much harder to determine. Please feel free to explain this to me.
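One possible reading of the quoted passage: if the DNA mass and molecule length are fixed then the number of molecules loaded is fixed, while sequencing a smaller genome to the same read depth needs fewer total reads, so the reads available per molecule fall in proportion to genome size. A minimal sketch of that reading, assuming 50 kb molecules (my assumption) and the 60 linked reads per molecule quoted for Human:

```python
# Sketch of one interpretation of the quoted statement: same DNA mass and
# molecule length -> same number of molecules; same read depth on a smaller
# genome -> fewer total reads, so fewer reads per molecule.
# MOLECULE_LEN_BP is an assumption, not a value from the paper.

HUMAN_GENOME_BP = 3.2e9
HUMAN_LINKED_READS = 60      # read pairs per molecule, as quoted for Human
MOLECULE_LEN_BP = 50_000     # assumed


def linked_reads_per_molecule(genome_bp):
    """Scale the Human figure in proportion to genome size."""
    return HUMAN_LINKED_READS * genome_bp / HUMAN_GENOME_BP


def linked_read_spacing_kb(genome_bp):
    """Mean distance between linked reads along one molecule, in kb."""
    return MOLECULE_LEN_BP / linked_reads_per_molecule(genome_bp) / 1000


small_genome = 320e6  # the 1/10th-size example from the quote
print(linked_reads_per_molecule(small_genome))
print(round(linked_read_spacing_kb(small_genome), 1))
```

With these assumptions the 320 Mb example gives about 6 linked reads per molecule spaced roughly 8 kb apart, which agrees with the numbers in the quote.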

The data: In the paper they present data from seven Human genomes, sequenced on HiSeq X Ten, and assembled using the "pushbutton" Supernova algorithm (it won't run on your MacBook Pro as you'll need >384GB of RAM). In just two days per genome they generated 100kb+ contigs with 2.5Mb phase blocks. The 7 genomes include 4 with parental data to verify phasing results, as well as one sample used in the HGP. They include a figure (see below) showing the Supernova assembly of the HGP sample aligned to a 162kb clone which is part of the GRCh37 reference. It almost completely matches the reference sequence with the 8 variants including just 1 SNV (green), but 6 homopolymer and 1 di-nucleotide repeat length variants (blue/cyan). The second figure shows the representation of the path a FASTA sequence takes through the "megabubbles" separating parental alleles, and "microbubbles" caused by longer repeats and homopolymers.

Whose careful hand at 10X Genomics drew this representation of FASTA?

Tuning 10X phasing to your needs: Users may be able to "tune" scaffold N50 by varying DNA length or sequencing coverage. A single X Ten lane generating 30x coverage looks like it would push scaffold N50 down from 17 to 12 Mb. DNA quality is probably most important and I suspect many people will accept a significant improvement in phasing estimation from lower cost experiments.

Many groups will also want to run differently sized genomes and will need to estimate how much DNA to use and how much sequencing they'll require. For small genomes this gets really interesting and 10X could be an awesome metagenomics tool allowing strain level analysis of complex samples. For the larger non-Human genomes people will need to use a much smaller amount of DNA in a single run, which may limit the number of genome copies to an unreasonable level.

Human 3Gb = 1ng = 300 genome copies
Wheat 5Gb = 0.67ng = 135 genome copies
Maize 20Gb = 0.17ng = 8 genome copies
Salamander 50Gb = 0.07ng = 1.3 genome copies
Paris japonica 150Gb = 0.02ng = 0.15 genome copies
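
The copy numbers in the list above can be approximated from DNA mass and genome size. A minimal sketch, assuming the commonly used average of about 650 g/mol per base pair of double-stranded DNA and treating the listed genome sizes as haploid. The function name is mine, and the results are approximate, so they will not match the rounded figures above exactly.

```python
# Estimate how many genome copies a given DNA mass contains.
# Assumption: ~650 g/mol per base pair of double-stranded DNA (an average).

AVOGADRO = 6.022e23
BP_MOLAR_MASS = 650.0  # g/mol per bp, approximate


def genome_copies(mass_ng, genome_size_bp):
    """Number of haploid genome copies in mass_ng nanograms of DNA."""
    genome_mass_g = genome_size_bp * BP_MOLAR_MASS / AVOGADRO
    return mass_ng * 1e-9 / genome_mass_g


# Genome sizes and DNA amounts as listed in the post.
examples = [
    ("Human", 1.0, 3e9),
    ("Maize", 0.17, 20e9),
    ("Salamander", 0.07, 50e9),
]
for name, mass_ng, size_bp in examples:
    print(f"{name}: ~{genome_copies(mass_ng, size_bp):.1f} genome copies")
```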

Who's going to use Chromium phasing: Is this kind of data going to be relevant enough for people to adopt 10X Chromium as the default genome library prep? I suspect many teams are working on 100s or even 1000s of 10X Genomics genomes right now and we'll see many more publications very soon. If the $500 Chromium prep can add real value (biologically or clinically) then 10X have a real chance of becoming a new standard for library prep. If that's the case I guess we'll see how strong their IP is as the competition builds their own variants of the technology.

Nuclear sharks live for 400 years

2016-09-05T13:21:00.001+01:00

A wonderful paper in a recent edition of Science uses radiocarbon dating to show that the Greenland shark can live for up to 400 years - making it the longest lived vertebrate known. See: Eye lens radiocarbon reveals centuries of longevity in the Greenland shark (Somniosus microcephalus).

“Who would have expected that nuclear bombs [one day] could help to determine the life span of marine sharks?” The authors used measurements of 14C radiocarbon isotopes in eye lens nuclei to estimate life span of around 300 years, with the oldest animal approximately 392 years old. A complication in their analysis was the “bomb pulse”: the pulse of carbon-14 produced by nuclear tests in the 1950s. This creates a spike in radiocarbon levels, however only the two smallest, and presumably youngest, of the 28 animals analysed had the high 14C levels associated with the bomb tests.

Why eye lens nuclei? It turns out that the lens is made from metabolically inert crystalline proteins, and the nucleus, which is formed during prenatal development, retains proteins synthesised at age 0.

No sex until you're 150: But this longevity comes at a price, and for the Greenland shark the price is that sexual maturity is not reached for a very long time - around a female shark's 156th birthday!

Animals that live this long are rare, and horribly susceptible to Human activities; primarily fishing, shipping and pollution in the case of marine vertebrates. Most of the animals used in this study came from several years of collecting dead sharks, many of them accidentally ensnared when trawling for commercial catches.

Sequencing base modifications: going beyond mC and 5hmC

2016-09-02T14:21:00.000+01:00

A great new resource was recently brought to my attention on Twitter and there is a paper describing it on the BioRxiv: DNAmod: the DNA modification database. Nearly all of the modified nucleotide sequencing we hear and read about is modifications to Cytosine, mostly methyl cytosine and hydroxymethyl cytosine; you may also have heard about 8-oxoG if you are interested in FFPE analysis. All sorts of modified nucleotides occur in nature and may be important in biological processes where they can vary across tissues of an organism, or may just be chemical noise. The modifications are most important when they change the properties of the DNA strand, how it is read, and what might or might not bind to it e.g. mC.

The biology of base modification is very complex - DNA methyltransferase marking Cytosine with a 5-methyl, TET family enzymes oxidising 5-methylcytosine to 5-hydroxymethylcytosine, and thymine DNA glycosylase-mediated base excision repair back to unmodified Cytosine. Many groups have worked on methods to sequence modified bases, with Shankar Balasubramanian's research group here in Cambridge most closely associated with 5hmC-seq in his CEGX spinout.

DNAmod DB: The DNA modification database lists 38 modified bases, only 7 of which have only been observed synthetically. It gives a brief description of each modified base including the likely biological function, and most importantly for readers of Core Genomics it lists the methods that can be used to map the modifications in the genome.

Unfortunately it appears to miss the OxBS-seq method published by Booth et al in 2012, but does have the competing TAB-seq method published by Yu et al in the same year.

Not all bases are modified to the same extent: There are a total of 128 modified nucleotides reported in the unverified list on DNAmod. I'd assumed modifications would be about the same number for each of the biological building blocks but they vary quite significantly: Uracil has 45 mods (I'm guessing modifications in ribonucleotides need less careful control?), Adenine (39) has nearly twice as many modifications as Guanine (19), and Cytosine (13) and Thymine (12) have the least.

Citation: Sood AJ, Viner C, Hoffman MM. 2016. DNAmod: the DNA modification database. bioRxiv 071712.

Celebrating 10 years at the CRUK-Cambridge Institute today

2016-09-01T09:54:00.005+01:00

Today I have been working for Cancer Research UK for ten years! September 1st 2006 seems like such a short time ago but a huge amount has changed in that time in the world of Genomics. NGS has changed the way we do biology, and is changing the way we do medicine. The original Solexa SBS has been pushed hard by Illumina to give us the $1000 genome, and perhaps just as exciting are the results coming out of Oxford Nanopore's MAP community - this may be the technology to displace Illumina? What the next ten years will hold is difficult to predict, but today I wanted to focus on the highlights of the last ten years at CRUK for me.

CRUK-Cambridge Institute circa early 2006

I was employed to build a brand new genomics facility and was hired for my expertise in gene expression microarrays - previously I'd set up an Affymetrix facility at the John Innes Centre in what is now the Earlham Institute. Perhaps the one thing I remember from my interview is the answer to a question I'd posed at the end "Will the CRUK institute be using the new next-generation sequencing technologies?" NGS was still in its infancy then: in late 2005 the first 454 paper made a big splash, and a Solexa sequencer had been installed at the Sainsbury lab in Norwich and I'd heard interesting things about the technology.

The answer was something like "we want this facility to focus on microarrays, we'll see if the NGS comes to anything useful". Well everyone reading Core-Genomics knows how disruptive NGS was, microarrays are dead (for gene expression anyway) and virtually all the data we generate in my lab comes off an Illumina HiSeq sequencer.

When I arrived the site had only just been handed over by the builders. In January of 2007 we had the first instruments installed and were processing Sanger sequencing and Illumina arrays by the Spring. But we'd decided to get our first sequencer and our initial discussions with the Solexa rep ended up with the purchase of an Illumina GAI. The rest as they say is history.

Highlights from the last ten years: The Institute celebrates its 10th anniversary in February of 2017 so I'll not go into too much detail about the top ten projects the Genomics core has been involved with. But I did want to pick out three projects that I was personally involved with and that I think were major advances.

Understanding gene regulation: In a wonderful paper: Species-specific transcription in mice carrying human chromosome 21, Mike Wilson, in Duncan Odom's group, demonstrated that sequence differences in regulatory regions are the dominant force in governing when and where genes are expressed. Mike designed an incredibly elegant experiment using a Mouse model of Down's syndrome: the TC1 mouse carries an extra copy of chromosome 21, but it is a Human copy. That Human chromosome is in a mouse nuclear environment and this allowed the authors to show that the Mouse transcription factors bound to Human DNA in a Human specific context i.e. the DNA sequence was the dominant force driving gene expression. Mike and Duncan were instrumental in the development of NGS at the Institute. Mike was great to work with, and hosted probably the best "crash pad" in Cambridge; and Duncan has kept up an amazing pace of research over the whole of the last ten years.
Molecular subtyping of Breast cancer: The METABRIC project was a major reason I took the job at CRUK. It was the largest array project I ever worked on and had a huge impact on our understanding of Breast cancer, revealing novel subtypes of breast cancer with distinct clinical outcomes and subtype-specific driver genes. It was truly a landmark study. The Genomics core processed all of the UK-based samples, extracting DNA and RNA, quality controlling and normalising them for analysis. I managed the Affymetrix genotyping on SNP6.0 arrays, carried out as a service by Aros in Denmark. And my lab processed all of the 2500 Illumina HT12 arrays used in the study in just 6-8 weeks. Christina Curtis now runs her own lab at Stanford. And the Caldas group continues to lead on Breast cancer genomics; most recently we've been working with them on a PDX project where we introduced low-coverage WGS of pre-capture exome libraries to significantly improve CNV calling.
Liquid biopsy: probably the biggest advance I've been involved with, NGS analysis of ctDNA as a liquid biopsy, is changing the way we do cancer medicine. Tim Forshew in Nitzan Rosenfeld's group was the first person to use NGS to non-invasively identify mutations by sequencing the DNA from a patient's tumour circulating in their blood. In a hugely impactful Science Translational Medicine paper Tim and colleagues showed that this could be used to detect and quantify mutations seen in the tumour, that de novo mutations could be identified, and that a liquid biopsy could be used to monitor tumour progression in patients. Mohammad Murtaza (now Assistant Prof at TGEN) pushed the technology even further by showing that it was possible to perform whole exome analysis of ctDNA, and that this could be used to monitor tumour evolution. This was a groundbreaking study published in Nature, but when I presented it at AGBT the following year the audience was still highly skeptical of how widely ctDNA might be used - that has changed and now there are dozens of companies pursuing liquid biopsy including Nitzan and Tim's Inivata.

I've worked with some amazing people over the last decade many of whom have gone on to start their own labs. My team has been great; people have come and gone, marriages have happened and babies have been born. The CRUK Cambridge Institute continues to be an excellent place to work, and is still a world leader in Genomics, and I've played my part in helping that to happen. Here's to the next ten years.

Optalysys eco-friendly genomics analysis

2016-08-25T15:00:00.002+01:00

The amount of power used in a genome analysis is not something I'd ever thought of until I heard about Optalysys, a company developing optical computing that has the potential to be 90% more energy-efficient and 20X faster than standard (electronic) compute infrastructure. Read on if you are interested in finding out more, and watch the video below - featuring Prof Heinz Wolff!

Optalysys was originally spun out from the University of Cambridge and the technology needs a lot more explanation than I'll give: briefly they split laser light across liquid crystal grids where each "pixel" can be modulated to encode analogue numerical data in the laser beam, this diffracts forming an interference pattern and a mathematical calculation is performed - all at the speed of light. The beam can be split across many liquid crystals to increase the multiplicity and complexity of mathematical operations performed.

Optalysys and the Earlham Institute in Norwich are collaborating on a project to build hardware/software that will be used for metagenomic analysis. This is a long way from comparing 500 matched tumour and normal genomes in an ICGC project; but if Optalysys can build systems to handle this scale then the huge compute processing tasks might be carried out at a fraction of the current costs and whilst running from a standard mains power supply.

PS: do you remember the Great Egg Race as fondly as I do?

Upcoming Genomics conferences in the UK

2016-08-24T15:28:00.001+01:00

It is almost time for the kick off at Genome Science, probably the best organised academic conference in the UK. It runs from August 30th to September 1st next week and sadly I can't be there (just returned from holidays and too much going on). You can hear from a wide range of speakers in a jam packed agenda. This year it is hosted by the University of Liverpool, and the evening entertainment comes from Beatles Tribute Band “The Cheatles”!

What other conferences are available for Genomics in the UK, and which one should you attend if you too can't make it over to Liverpool? The Wellcome Trust Genome Campus is holding their first Single Cell Genomics conference from September 9th (sold-out I'm afraid). Personally I thought that the London Festival of Genomics was excellent and I've high hopes for the January 2017 meeting.

Often it is word of mouth that brings a conference to my attention, but there are a couple of resources out there to help.

AllSeq maintain a list of conferences.
GenomeWeb has a similar list, but it seems less focused than AllSeq.
NextGenSeek has a list for 2016, but nothing on the cards for 2017 yet.
Nature has an events page (searchable) that lists 50 upcoming NGS conferences.

PS: please do let me know if you've particular recommendations on conferences to attend. And do get in touch with the groups above to list your conference on their sites.

PPS: If you can justify it then the HVP/HUGO Variant Detection Training Course - "Variant Effect Prediction" running from 31st October 2016 is in Heraklion, Crete - a beautiful place to learn!

10X Genomics single-cell 3'mRNA-seq explained

2016-07-28T16:37:00.004+01:00

10X Genomics have been very successful in developing their gel-bead droplet technology for phased genome sequencing and more recently, single-cell 3'mRNA-seq. I've posted about their technology before (at AGBT2016, and March and November 2015) and based most of what I've written on discussion with 10X or from presentations by early access users. Now 10X have a paper up on the BioRxiv: Massively parallel digital transcriptional profiling of single cells. This describes their approach to single-cell 3'mRNA-seq in some detail and describes how you might use their technology in trying to better understand biology and complex tissues.

Technical performance of the GEMcode system: The paper is unfortunately based on the earlier GEMcode system rather than the latest Chromium, but the results are likely, though not definitely, going to be representative of what Chromium can deliver.

Technical performance was assessed using 1200 Human 293T or Mouse 3T3 cells, with 100,000 reads per cell. 71% of reads aligned to Human or Mouse genomes (38% and 33% respectively). Analysis of the UMIs allowed the authors to estimate a total number of cell-containing GEMs to be just over 1000 (482 and 538 Human or Mouse respectively). Only 8 GEMs appeared to have Human and Mouse cells co-located, as assessed by GEM barcoded reads aligning to both genomes. It is not easy (is it even possible?) to detect Human:Human or Mouse:Mouse cell doublets so the inferred doublet rate for this experiment was 1.6% (see figure 2a in the paper with multiplet GEMs as grey dots).
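
The step from 8 observed mixed-species GEMs to a 1.6% inferred rate can be sketched as follows. This is my reconstruction, assuming cells pair up at random so that only the Human:Mouse fraction of doublets is visible; it is not necessarily the exact calculation used in the paper.

```python
# Infer the total doublet rate from the doublets that can actually be seen.
# Assumption: random pairing, so a doublet is Human:Mouse with probability
# 2 * p_human * p_mouse, and same-species doublets go undetected.


def inferred_multiplet_rate(n_human, n_mouse, n_mixed):
    """Scale the observed mixed-species rate up to all doublets."""
    total = n_human + n_mouse
    p_human = n_human / total
    p_mouse = n_mouse / total
    visible_fraction = 2 * p_human * p_mouse  # share of doublets that are mixed
    observed_rate = n_mixed / total
    return observed_rate / visible_fraction


rate = inferred_multiplet_rate(n_human=482, n_mouse=538, n_mixed=8)
print(f"Inferred multiplet rate: {rate:.1%}")
```

With a roughly 50:50 mix only about half of all doublets are visible, so the observed rate of just under 0.8% roughly doubles to the 1.6% quoted.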

The 1.6% multiplet (doublet, triplet, or higher) rate appears low, but as cell numbers increase so does the multiplet rate. The authors describe a linear relationship of multiplet rate to cell loading from 1000-10000 cells (Supplementary Fig. 1a), however it is not clear how this rate changes at 20k, 30k, 40k, or 50k (the maximum loading recommended). What the impact is on experiments I do not know - but this is an area several labs are focusing on. The multiplet rate "approximately followed a Poisson distribution" as assessed by imaging experiments (Supplementary Fig. 1b). In these a Nikon microscope equipped with a high-speed camera capable of capturing 4000 frames per second imaged GEMs as they were created. 28,000 frames were analysed for single-cell encapsulation (7 seconds of video, which only represents about 1.5% of the time your Chromium is actually making GEMs) but the multiplet rate was 16% higher than expected - I don't think the authors delve deeply enough into the reasons for this. Multiplets are likely to add significant noise to analysis of single-cell experiments; every single-cell technology has to account for them, and cells like to stick together so users probably can't rely on actually having a single cell suspension in the first place.
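
To illustrate why the multiplet rate climbs with cell loading, here is a minimal Poisson sketch. The mean occupancy values are hypothetical and chosen only for illustration; they are not 10X specifications, and real loading evidently deviates from the ideal model.

```python
import math

# Ideal Poisson loading: the number of cells in a GEM is Poisson distributed
# with mean lam. The multiplet rate is the share of cell-containing GEMs
# that hold two or more cells.


def multiplet_rate(mean_cells_per_gem):
    """P(2 or more cells | at least 1 cell) under Poisson loading."""
    lam = mean_cells_per_gem
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_occupied = 1.0 - p_empty
    return (p_occupied - p_single) / p_occupied


# Hypothetical mean occupancies, for illustration only.
for lam in (0.01, 0.03, 0.1):
    print(f"mean {lam:.2f} cells/GEM -> multiplet rate {multiplet_rate(lam):.2%}")
```

For small occupancies the rate is close to half the mean occupancy, which is why it looks roughly linear in cell loading over a modest range.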

To further investigate this the authors also carried out mixing experiments with Human 293T (female & expressing XIST) and Jurkat cells (male & expressing CD3D). Figure 2e (see above) in the paper shows the PCA for these mixes at 100% 293T, 100% Jurkat, 50:50 or 10:90. The 50:50 mix shows a lot of cells in the space between the cell clusters; I'd suggest this indicates higher multiplet rates in this experiment than the 1.6% suggested? But I could not see the cell loading density used, which may explain the higher numbers of apparent multiplets.

Cell capture efficiency: The rate of cell capture is important especially where rare cell populations are being studied. 10X captures about 50% of the cells loaded into GEMs (Supplementary Tables 1&3), and whilst this could be increased it would come at the cost of an increased cell doublet/triplet rate. This might be a parameter users are willing to tweak depending on their needs and it would be interesting to ask how many users would accept higher doublets in return for 80-90% cell capture rates? What we really need in a single-cell system is the ability to image cells in droplets so we can exclude empty drops, doublets and triplets; I'd be interested to know if anyone is working on something like this?

The level of cross-talk between cell barcodes was about 1% (see Online Methods) but it is not clear in the manuscript where this cross-talk comes from. If it is error in reading the cell barcodes then this could be reduced by sequencing longer, more error-tolerant barcodes, and a longer barcode read (if >25bp) would allow a proper error estimation of the index read. But if this is coming from molecular cross-over during the downstream library prep (which is going to happen to some degree) then fixing it will be much more difficult (see these papers to learn more about PCR chimeras and their effect on NGS: NAR 2012, NAR 1990, JBioChem 1990, NAR 1995).

83% of UMIs were associated with cell barcodes suggesting that cell-free RNA does not significantly affect the results - this is an issue scSeq users will have to consider carefully as the amount of cell-free RNA or DNA in a sample is likely to be highly variable, and it may be that experiments with artificially high levels might show us the failure mode in these sample types.

Transcript counting: With 100,000 reads per cell the authors report a median detection rate of 4500 genes or 27,000 transcripts with little bias for GC content or gene length. However as a 3' assay I'd not expect a huge variation here, and this is something that would become much more important as 10X, and others, move to whole transcript assays. Clustering analysis was performed with Seurat (Satija et al., 2015).

SNV detection from scRNA-seq data: while deciphering population structure and discovering rare cells is great, many people will want to look for SNP/SNVs in their scRNA-seq data. The authors reported an analysis of a curated set of high quality SNVs observed in only 293T or Jurkat cells, but not both (see Online Methods). They showed that they could detect SNVs reliably, and that multiplet rates predicted from SNV were highly correlated with those from gene expression analysis. The paper is confusing in suggesting that each cDNA generates 250bp of sequence for SNV detection, but the sequencing run generates only 98bp in read 1 from the cDNA (I'd like to understand this better or see this corrected in the final version if it is a typo).

scRNA-seq from frozen cells: In the discussion the authors make a strong statement about the ability to analyse frozen cells: "the ability of GemCode to generate faithful scRNA-seq profiles from cryopreserved samples enables its application to clinical samples". The frozen cells in question were fresh cells recovered from whole blood, cryopreserved and "gently thawed" one week later (see Online Methods). Only a small number of genes (57) showed greater than 2-fold upregulation (no down regulated genes were reported), suggesting that freezing cells is possible. However I suspect that the minimal freezing time and "gentle" protocols will put many users off relying on cell storage until more comprehensive evaluation is undertaken. The fact that they got such good results is encouraging; we're working on a project with patient material that needs to be processed immediately for best results. Right now we're bringing cells over from the hospital about one hour after collection and processing straight away, but this is not an efficient use of the technology when the plastic chip holds 8 samples and costs $150 each time.

A few words about sequencing 10X scRNA-seq libraries: In the paper the authors say that after GEMcode prep "libraries then undergo standard Illumina short-read sequencing" - there is nothing standard about the run type you need to do for 10X. It is a 98.14.8.10 format run - 98bp 1st read (mRNA), 14bp Index 1 (UMI), 8bp Index 2 (sample index), 10bp 2nd read (Cell index) - I hope I got that right!

10X sequencing does not fit easily into a core lab running HiSeq instruments due to the run configuration (we need 8 lanes of the same sample type). I suspect this is going to get much easier as we do more and more 10X sequencing, but for now we're either running longer reads than necessary, or using NextSeq/2500 RapidRuns. Chromium genomes can now be run on X Ten as PE150 with no modification. Hopefully single-cell RNA-seq will move to a more standard single-end run for differential gene expression, this would make life easier for my team, and reduce costs by around 40%.

Summary: All in all this paper explains many of the things potential users of 10X single-cell are looking to understand. I'm expecting papers to be coming thick and fast over the next six months now people have the instruments in their hands.

It is going to be interesting to see how 10X develop their chemistry, particularly for whole transcriptome single-cell, for copy-number and for applications like G&T-seq or scM&T-seq, or even ATAC-seq.

How will RainDance fight back with their own single cell methods? And how does this 3'mRNA-seq assay compare to Fluidigm's C1? Both of these are questions I look forward to seeing answered. Ultimately the more technologies we have for single-cell the better, there are likely to be strengths and weaknesses in each. But I'd not be surprised if the one with the most open chemistry becomes dominant - this was part of Illumina/Solexa's success as it meant users could develop methods from a core technology.

PS: Supplementary Figure legends are available on BioRxiv, but not the figures - go figure! Online methods are also missing. Probably because the BioRxiv does not check if these have been submitted.

RNA-seq advice from Illumina

2016-07-25T17:36:00.000+01:00

This article was commissioned by Illumina Inc.

The most common NGS method we discuss in our weekly experimental design meeting is RNA-seq. Nearly all projects will use it at some point to delve deeply into hypothesis driven questions, or simply as a tool to go fishing for new biological insights. It is amazing how far a project can progress in just 30 minutes of discussion: methodology, replication, controls, analysis, and all sorts of bias get covered as we try to come up with an optimal design. However many users don't have the luxury of in-house Bioinformatics and/or Genomics core facilities so they have to work out the right sort of experiment to do for themselves. Fortunately people have been hard at work creating resources that can really help and most recently Illumina released an RNA-seq "Buyer’s Guide" with lots of helpful information...including how to keep costs down.

Illumina's "Buyer’s Guide": the guide offers advice on common RNA-Sequencing methods and should help new users in evaluating the many options available for next-generation sequencing of RNA. Anyone considering a differential gene expression analysis experiment should have RNA-seq as their platform of choice and the guide presents three simple steps for users to consider different aspects of their experiments.

1) First of all make sure you understand what your scientific question is! This sounds simple but all too often people want to get too much out of one experiment and end up getting in a bit of a mess. Better to answer one question well, than two questions badly. Once you've thought about this it should be clear whether you want to analyse mRNAs for a simple differential gene expression experiment, or are after something else e.g. splicing, and also if you'll need to look at more than just poly-adenylated mRNAs. And if possible try to determine ahead of time whether the genes you're interested in studying are highly expressed or very rare.

2) Once you've thought about this you can consider what sort of samples you have, are they low quality and/or low quantity? You should also consider who's going to do the work in the lab and who's going to analyse the sequence data?

3) Now you can really think about the final experimental design, what type of library preparation kit to use, replicate numbers, proper controls, depth of sequencing, etc. Illumina's RNA-seq buyer's guide describes some of the things you'll need to consider in choosing the read-depth and run-type, and also includes some tips for keeping the costs of your experiment down.

What do people mean when they say "RNA-seq": When people say "RNA-seq" most of them are talking about differential gene expression (DGE) by sequence analysis of reverse transcribed poly-adenylated mRNAs, but by changing the depth of sequencing or type of sequencing, and/or choosing a different library prep kit you can investigate so much more. The guide includes three different scenarios for RNA-seq experiments including basic differential gene expression; DGE and allele-specific expression plus isoforms, SNVs and fusions; and finally whole transcriptome analysis. These show the breadth of experiments you can consider once you've mastered this method.

The first two scenarios showcase the power of RNA-seq and demonstrate how using a single library prep method, but varying the sequencing allows very different questions to be asked of your samples. The guide recommends Illumina's TruSeq Stranded mRNA-seq kits (these are the ones we use most in my lab and we have done so ever since beta-testing the original RNA-seq kit many years ago). Scenario #1 is a simple DGE experiment and Illumina recommends you generate ≥ 10 million reads per sample, using single-end 50bp reads (SE50). Scenario #2 allows a full mRNA analysis by simply changing read depth to ≥ 25 million reads per sample, and using paired-end 75 bp reads (PE75).

If you are interested in more than poly-adenylated mRNAs then changing the RNA-seq library prep kit to Illumina's TruSeq Stranded Total RNA gets rid of ribosomal RNAs, letting you analyse both coding and non-coding RNA. Much greater read depth is needed and Illumina recommend ≥ 50 million PE75 reads per sample. Completing the RNA-seq line-up are the TruSeq small RNA kits, which allow you to analyse microRNAs and other smaller transcripts; usually this requires only ≥ 1-2 million SE50 reads per sample.

How do Illumina's recommendations stack up: The guide is pretty good in the suggestions it makes for common RNA-seq methods. I'd aim a bit higher for DGE and suggest ≥ 20 million reads per sample to allow profiling of high, medium and lowly expressed genes. I'm really not keen on the suggestion that MiSeq or NextSeq mid-output are good tools for RNA-seq, as in my experience most experiments, with sufficient replication, will be too large to fit into a single sequencing run. I'd argue that the cheapest way to get your RNA-seq data is going to be on HiSeq 4000, until of course we can run RNA-seq on X Ten. Of course not everyone should buy a HiSeq, and a MiniSeq, MiSeq or NextSeq may be a good fit for your own laboratory; but I'd encourage you to consider the benefits of using your local core lab first, especially if you are planning on doing experiments bigger than 12-24 samples. I'm not sure I'd argue quite as strongly for paired-end data and would prefer splicing, ASE and fusion detection to come from higher depth sequencing instead (50M SE50 reads cost about the same as 25M paired-75bp reads).
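To see how quickly a replicated experiment outgrows a small run, and what the single-end versus paired-end trade looks like in sequenced bases, here is a rough Python sketch. The function names are my own, the run output of 100 million reads is a placeholder rather than the specification of any instrument, and I am assuming "25M paired-end reads" means 25 million read pairs. It only counts reads and bases; the cost equivalence mentioned above depends on local pricing.

```python
import math


def bases_per_sample(reads: int, read_length: int, paired: bool) -> int:
    """Sequenced bases per sample; paired-end reads are counted as read pairs."""
    return reads * read_length * (2 if paired else 1)


def runs_needed(n_samples: int, reads_per_sample: int, reads_per_run: int) -> int:
    """Whole sequencing runs needed to reach the target depth for every sample."""
    return math.ceil(n_samples * reads_per_sample / reads_per_run)


if __name__ == "__main__":
    se50 = bases_per_sample(50_000_000, 50, paired=False)
    pe75 = bases_per_sample(25_000_000, 75, paired=True)
    print(f"50M SE50: {se50 / 1e9:.2f} Gb, 25M PE75: {pe75 / 1e9:.2f} Gb")

    # 24 samples at 20M reads each on a hypothetical 100M-read run.
    print(runs_needed(24, 20_000_000, reads_per_run=100_000_000), "runs")
```

Swap in the real output of your own instrument and flow cell before drawing any conclusions from the run count.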

Why does my lab focus on mRNA-seq DGE: My own choices for RNA-seq are primarily informed by the questions people say they want to answer in experimental design discussions - and nearly all of these are differential gene expression questions. As such my lab runs lots and lots of Illumina's stranded mRNA-seq kits. We only run some form of ribosomal reduction when the experiment warrants it, as these methods generally require deeper sequencing for the same differential gene expression analysis power. We've very few users who need to run FFPE RNA so although we tested the RNA Access kit, we've yet to really use it in a significant project. This is partly because the research groups coming to my lab understand the limitations of FFPE samples, and work hard to procure fresh frozen material wherever possible.

A brief bit about informatics: This article is focussed on the wetlab but without a good analysis pipeline you'll be stuck with some big but unusable Fastq files. The analysis requirements are heavily influenced by the biological questions being asked, by the samples available, and by the library preparation and sequencing performed. I'd always recommend the user to make sure they know what analysis is likely to be performed before generating data.

Many others have weighed in on how to use and design RNA-seq experiments (see the list of my favourite references at the bottom of this post). Nearly everyone agrees that replication is key, with most people suggesting 4-6 biological replicates. Most papers agree on read-depth being kept to under 20M reads per sample. The ENCODE RNA-seq guidelines are very different, recommending just two biological replicates and 30M paired-end reads per sample - I've never agreed with this, even when it was published in 2011, and have steered people to other resources. The Blogosphere also offers lots of help; a 2013 post by GKNO (Marth lab, U. Utah), and the RNA-seqlopedia (U. Oregon) are two great reads for people who want to know more.

All Illumina products listed are for research use only. Not for use in diagnostic procedures (except as specifically noted).

Further reading:

How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016. This paper goes a long way towards answering the question most people want to understand. They present a very highly replicated study and show that as many as 20 biological replicates were required to detect 85% of DGE accurately. They recommend using 6 biological replicates in RNA-seq experiments as a minimum, and edgeR or DESeq2 as the best tools. They used single-end sequencing and generated 0.8-2.6 million reads per technical replicate - equivalent to about 10M per biological sample.
Experimental Design and Power Calculation for RNA-seq Experiments. Methods Mol Biol. 2016. This book chapter reviews the major factors that influence the statistical power of detecting DGE.
Designing alternative splicing RNA-seq studies. Beyond generic guidelines. Bioinformatics. 2015. This paper describes how sequencing depth and length, library preparation and the level of replication affect the cost-effectiveness of single-sample and group comparison studies. They present data showing how short reads outperformed long reads for most analyses.
Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014. In this paper the authors compare and evaluate five differential expression analysis packages - DESeq, edgeR, DESeq2, sSeq, and EBSeq. They show that increasing sample size is preferable to increasing sequencing depth past 20 million reads.
RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014. This paper describes the explicit trade-off between numbers of biological replicates and depth of sequencing in increasing the power to detect DGE. They suggested that greater than 10M reads was unnecessary and that more replicates should be the strategy of choice to increase power and accuracy in RNA-seq studies.
Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013. This paper presents a quantitative statistical method to distinguish biological variability from technical noise in single-cell RNA-seq.
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics. 2013. This paper presents a web-based tool, Scotty, to assist in the design of RNA-seq experiments with appropriate sample size and read depth.
RNA-SeQC: RNA-seq metrics for quality control and process optimisation. Bioinformatics. 2012. Authors from the Broad Institute present the RNA-SeQC tool for quality control of data before DGE analysis. They provide metrics including yield, alignment and duplication rates; GC bias, rRNA content, regions of alignment (exon, intron and intragenic), continuity of coverage, 3'/5' bias and count of detectable transcripts.
Design and validation issues in RNA-seq experiments. Brief Bioinform. 2011. This paper reviews the experimental design issues pertinent to RNA-seq.
RNA-seq: technical variability and sampling. BMC Genomics. 2011. This paper analysed technical bias in 3 replicated RNA-seq experiments and showed that low coverage (less than 5 reads per base) leads to a significant increase in technical noise, and that understanding sampling bias is an issue that needs to be considered.
Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010. One of the first papers to suggest that a relatively low read-depth for RNA-seq of just 10 million reads "gave the same dynamic range as microarrays, with better quantification of alternate and highly abundant transcripts". However they used paired-end reads in their analysis.
RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008. In this paper the authors estimated the technical variance in RNA-seq and compared it to arrays for detecting differentially expressed genes.

Core Genomics is going cor-porate (sort of)

2016-07-21T21:41:00.001+01:00

I've just had my five year anniversary of starting the Core Genomics blog! Those five years have whizzed by and NGS technologies have surpassed almost anything I dreamed would have been possible when I started using them in 2007. My blog has also grown beyond anything I dreamed possible and the feedback I've had has been a real motivating factor in keeping up with the writing. It also stimulated my move onto Twitter and I now have multiple accounts: @CIGenomics (me), @CRUKgenomecore (my lab) and @RNA_seq, @Exome_seq (PubMed Twitter bots).

The blog is still running on the Google Blogger site I set up back in 2011 and I feel ready for a change. This will allow me to do a few things I've wanted to do for a while and over the next few months I'll be migrating core-genomics to a new WordPress site: Enseqlopedia.com.

Introducing Enseqlopedia: The new home of Core Genomics will be a chance for me to expand on something I've been doing for many years - explaining NGS to users. The same blog content is going to keep flowing, but other stuff will appear alongside, and I hope you'll find it informative and entertaining.

The Enseqlopedia name was chosen as I'll be adding content describing methods, linking to the best papers that demonstrate these or advance them, and hopefully making the new site a useful resource for the community. It will also be somewhere I can serve up more PubMed Twitterbot output in a single place outside of Twitter. I'd also like to reinvigorate the sequencer map Nick Loman and I put together many years ago. Some of the reasons for these changes have come about from my dissatisfaction with sites that serve up NGS news, but simply regurgitate press releases from academics or companies in the NGS field; I want to deliver more than this. Hopefully you already agree that my blog posts hit the spot, and I'm hoping the new stuff is of real interest to readers. I aim to make sure you can see that what appears has been carefully chosen and has an opinion behind it.

Core Genomics corp: The biggest change is going to be the appearance of commissioned or sponsored content i.e. stuff I get paid to post. I've not tried to monetise my blog before, mainly because I don't like unsightly ads all over the place; however, I've been asked reasonably frequently to write about new products in the NGS space and until now I've always turned the offers down. I have ghost written other content, but nothing on Core Genomics has been paid for - and all the topics have been chosen by me. The two new types of post will be tagged so you can tell immediately what you're reading:

Commissioned posts will be tagged "commissioned content" and labelled at the top of the post so you know who paid me to write the piece. All commissioned content will be taken on with full editorial control i.e. I decide what ends up in the final piece, and I will have written the post.

Sponsored posts will be tagged "sponsored content" and labelled at the top of the post so you know who wrote the piece. Sponsored content will only be accepted by me if I think readers of Core Genomics would be interested. The content is likely to be written by the sponsor and should be considered as an advert. Although I will decide whether a sponsorship opportunity will get posted I will NOT have full editorial control i.e. I get to decide on what sponsored content appears on the site, but I will not have written the post.

My first sponsored piece will be coming soon. Although the topic has been chosen by someone else, the opinions are very much my own. I'm not expecting to write much more than one sponsored post a month (so any NGS companies reading this better get their requests in soon), and I'm not going to write about something I really don't believe in.

I'll also be making it clearer what kind of consultancy work I'm happy to take on. Mostly this has been technological consulting for investors who want to understand market reactions to new instruments or developments (with Brexit came a rush of consultancy work). But I've also consulted for technology companies, and for research groups.

Thanks for reading Core Genomics - hopefully you'll be reading for another five years!

James.

Whole genome amplification improved

2016-07-16T12:15:00.002+01:00

A new genome amplification technology from Expedeon/Sygnis: TruePrime looks like it might work great for single-cell and low-input analyses - particularly copy number. TruePrime is a primer-free multiple displacement amplification technology. It uses the well established phi29 DNA polymerase and a new TthPrimPol primase, which eliminates the need to use random primers and therefore avoids their inherent amplification bias. The senior author on the TthPrimPol primase paper, Prof Luis Blanco, is leading the TruePrime research team.

I saw a recent poster which had results demonstrating equal amplification and homogeneous coverage (see image above), no primer artefacts, and high identification of both SNPs and CNVs. TruePrime gave very similar CNV data to unamplified DNA with very little apparent amplification or coverage bias from low coverage whole genome sequencing (12 million reads). Competitors "R" and "G" did not look so good.

What does TthPrimPol do in the cell: TthPrimPol is a DNA and RNA primase with DNA-dependent DNA and RNA polymerase activity. It is a unique human enzyme capable of de novo DNA synthesis solely with dNTPs and is found primarily in the nucleus - TthPrimPol -/- cells show inefficient mtDNA replication, but it is not an essential protein. In the mitochondria TthPrimPol provides the primers for leading-strand mtDNA synthesis in the replication fork. It is an important protein in the mitochondria where the highly oxidative environment leads to replication stress and genome instability. It is also capable of reading through template lesions such as 8oxoG, a common DNA lesion produced by reactive oxygen species that causes G to T and C to A substitutions. This may have a useful application in the amplification of FFPE damaged DNA.

Using TruePrime in single-cell sequencing: I can see several opportunities for using this technology in my lab, including on both of our single-cell systems, 10X Genomics and Fluidigm C1, for future copy-number methods. It is also likely to be useful for other low-input experiments and we're likely to couple it with Nextera XT or similar.

I'm sure we'll see some great work using this enzyme if it really works as well as the company suggest - if you are using TruePrime please do let me know how you are getting on!

How much time is lost formatting references?

2016-07-14T16:20:00.000+01:00

I just completed a grant application and one of the steps required me to list my recent papers in a specific format. This was an electronic submission and I’m sure it could be made much simpler, possibly by working off the DOI or PubMed ID? But this got me thinking about the pain of reformatting references and the reasons we have so many formats in the first place. It took me ten minutes to get references in the required format, and I've spent much longer in the past - all wasted time in a day that is already too full!

I use Mendeley as my reference manager of choice and it has a very good Word plug-in that makes it easy to add references and build a final reference list when writing papers. I used it for my PhD with over 160 references and it coped pretty well. Mendeley, EndNote, et al make changing reference styles pretty easy, but why do we have to bother at all?

In digging into this I came across a post by Jay Fitzsimmons at the Canadian Field-Naturalist blog. Jay's post is well written and describes the problem well - lots of citation styles, but no real evidence about which is most efficient.

How did reference styles evolve: Once upon a time the only way to access published information was to go to your University library and find the paper you were looking for (it wasn't that long ago). House styles were developed by publishers as a set of standards for the writing and design of articles in their periodicals. There was little or no effort to determine what the most efficient way to communicate the information in a reference was. A big reason for abbreviating information, or omitting article titles etc from references, was to reduce the amount of text - simply to save money for publishers of printed materials. There is even an ISO standard just for abbreviating journal titles! Even though we're in the electronic age there might still be good reasons to abbreviate references. Who wants to read a 300 author list (unless you're one of the authors of course)!

What do I think is important: It depends on why I’m looking at a reference in the first place but here are my priorities

The title is the most common reason I decide if this is a paper I should read, so I’d like to see it every time.
Second on my list is the year of publication; there’s sometimes no point looking at old references in a fast moving field (but beware this simple cull on useful reading materials).
Then I’d like a link to the paper - personally I’m happy with the PubMed ID or DOI.
Lastly is the lead author(s) as these are likely to be the people with most to gain from the publication in the immediate future.

As far as the authors go, in the context of my grant application, or perhaps a CV or job application, I’d prefer a simple numbering format: the author's place by numerical ordering in the author list and the total number of authors, perhaps with an asterisk to denote joint first or corresponding author status e.g. 2*/17 where I am the 2nd author in a list of 17 authors, but I'm a joint 1st author.

Lastly I’d set it all to a nice delimited format so a screen grab from almost anything can be easily imported into whatever I need to use the reference in.

I don't really care about the Journal and certainly not volume and/or page numbers as I am NOT going to look for this in the library!

So here’s my suggestion:
Murtaza/Dawson/Tsui. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature. 2013. DOI:10.1038/nature12065. PMID:23563269. 13/18.

Compare this to:
Murtaza M, Dawson SJ, Tsui DW, Gale D, Forshew T, Piskorz AM, Parkinson C, Chin SF, Kingsbury Z, Wong AS, Marass F, Humphray S, Hadfield J, Bentley D, Chin TM, Brenton JD, Caldas C, Rosenfeld N. Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature. 2013 May 2;497(7447):108-12.
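Getting from the long form to the short one is mechanical, which is rather the point. Here is a minimal Python sketch of the conversion; the function name, its parameters and the "Surname Initials" parsing are my own assumptions, not a feature of Mendeley, PubMed or any funder's system. Note that the author list as printed above has 18 names, so the position comes out as 13/18.

```python
def compact_reference(authors, title, journal, year, doi, pmid, me,
                      n_lead=3, starred=False):
    """Format a reference as: lead authors. title. journal. year. DOI. PMID. position/total.

    `authors` is a comma-separated string of "Surname Initials" entries,
    `me` is the surname whose position in the list should be reported, and
    `starred` adds an asterisk for joint first or corresponding authorship.
    """
    names = [a.strip() for a in authors.rstrip(".").split(",")]
    surnames = [n.rsplit(" ", 1)[0] for n in names]
    lead = "/".join(surnames[:n_lead])
    position = surnames.index(me) + 1  # 1-based place in the author list
    star = "*" if starred else ""
    return (f"{lead}. {title}. {journal}. {year}. "
            f"DOI:{doi}. PMID:{pmid}. {position}{star}/{len(surnames)}.")


AUTHORS = ("Murtaza M, Dawson SJ, Tsui DW, Gale D, Forshew T, Piskorz AM, "
           "Parkinson C, Chin SF, Kingsbury Z, Wong AS, Marass F, Humphray S, "
           "Hadfield J, Bentley D, Chin TM, Brenton JD, Caldas C, Rosenfeld N.")

TITLE = ("Non-invasive analysis of acquired resistance to cancer therapy "
         "by sequencing of plasma DNA")

if __name__ == "__main__":
    print(compact_reference(AUTHORS, TITLE, "Nature", 2013,
                            "10.1038/nature12065", "23563269", me="Hadfield"))
```

Because the output is delimited by full stops in a fixed order, it should also be straightforward to split back into fields when importing it elsewhere.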

I guess nothing is going to change in the field anytime soon. But I feel better for getting this off my chest. And I’ve sent feedback to the funder...

Comparison of DNA library prep kits by the Sanger Institute

2016-07-02T15:38:00.002+01:00

A recent paper from Mike Quail's group at the Sanger Institute compares 9 different library prep kits for WGS. In Quantitation of next generation sequencing library preparation protocol efficiencies using droplet digital PCR assays - a systematic comparison of DNA library preparation kits for Illumina sequencing, the authors used a digital PCR (ddPCR) assay to look at the efficiency of ligation and post-ligation steps. They show that even though final library yield can be high, this can mask poor adapter ligation efficiency - ultimately leading to lower diversity libraries.

In the paper they state that PCR-free protocols offer obvious benefits in not introducing amplification biases or PCR errors that are impossible to distinguish from true SNVs. They also discuss how the emergence of greatly simplified protocols that merge library prep steps can significantly improve the workflow as well as the chemical efficiency of those merged steps. As a satisfied user of the Rubicon Genomics library prep technology (e.g. for ctDNA exomes) I'd like to have seen this included in the comparison*. In a 2014 post I listed almost 30 different providers.

Hidden ligation inefficiency: The analysis of ligation efficiency by the authors sheds light on an issue that has been discussed by many NGS users - that of whether library yield is an important QC or not? Essentially yield is a measure of how much library a kit can generate from a particular sample, but it is not a measure of how "good" that library is. Only analysis of final library diversity can really act as a sensible QC.

The authors saw that kits with high adapter ligation efficiency gave similar yields when compared to kits with low adapter ligation efficiency (fig 4 reproduced above). They determined that the most likely cause was that the relatively high amount of adapter-ligated DNA going into PCR inhibits the PCR amplification reaction, leading to lower than expected yields. For libraries with low adapter ligation efficiency a much lower amount of adapter-ligated DNA would make it into PCR, but because there is no inhibition the PCR amplification reaction leads to higher than expected yields. The best performing kits were Illumina TruSeq Nano and PCR-free, and the KAPA Hyper kit, with ligation yields above 30%; and the KAPA HyperPlus was fully efficient.

Control amplicon bias: The PhiX control used had three separate PCR amplicons amplified to assess bias. The kits with the lowest bias, at less than 25% for each fragment size, were KAPA HyperPlus and NEBNext. The Illumina TruSeq Nano kit showed different biases when using the "Sanger adaptors" rather than "Illumina adaptors", which the authors suggest highlights that both adapter and fragment sequence play a role in the cause of this bias.

Which kit to choose: The authors took the same decision as most kit comparison papers and shied away from making overt claims about which kit was "best". They did discuss fragmentation and PCR-free as important points to consider.

If you have lots of DNA then aim for PCR-free to remove any amplification errors and/or bias.
If you don't have a Covaris then the newest enzymatic shearing methods e.g. KAPA fragmentase have significantly less bias than previous chemical fragmentation methods.

Ultimately practicability, the overall time and number of steps required to complete a protocol, will be uppermost in many users' minds. The fastest protocols were NEBNext Ultra kit, KAPA HyperPlus, and Illumina TruSeq DNA PCR-free.

*Disclosure: I am a paid member of Rubicon Genomics' SAB.