Tuesday 23 December 2014

Monday 22 December 2014

Extracting cell-free DNA from plasma

Researchers wanting to analyse circulating tumour DNA encounter two major problems: contamination by gDNA from white blood cells, and the relatively low amount of circulating DNA. Because of this, the protocols used for blood collection and circulating DNA extraction are critical. This post describes some of the challenges in more detail and reviews some of the kits available.

Friday 19 December 2014

Authorship and acknowledgment of core facility work

I've been lucky enough to work in an Institute that views its core facilities as a cornerstone of the scientific output, and we get acknowledged for most of the work we do. Other core facilities are not so lucky; and even here sometimes people simply forget to acknowledge everyone they should have in the rush to get their paper submitted.

Now the journal BioTechniques has introduced a new editorial policy that will require authors to state whether they worked with a core laboratory and, if so, to make sure this is acknowledged in the final manuscript. I hope other journals will take a similar decision. All core labs are funded, completely or in part, by grant income and are effectively subsidised. This is the main reason core labs are usually cheaper than commercial providers. This funding is not wasted by the host institution as the cores (should) offer a more flexible and bespoke service.

Tuesday 9 December 2014

Top 100 papers of 2014: courtesy of Altmetric

Congratulations to Marc Tischkowitz and colleagues for getting into Altmetric's Top 100 papers of the year. Their NEJM paper on increased risk of breast cancer in loss-of-function PALB2 carriers was number 87 and one of two papers from the University of Cambridge.

The Altmetric Top 100 is heavily influenced by the media coverage of "newsworthy" publications and it is perhaps not surprising that the number 1 spot is taken by Facebook's rather controversial "emotional contagion" paper (covered by Retraction Watch).
No surprise that medical and biological sciences top the list with around 60% of papers in these categories. I was pleased to see that 80% of articles included an author from the UK.

Monday 8 December 2014

Genomes, exomes or amplicomes: what's best for the clinic?

Clinical sequencing is looming and recently I've heard more, and better, arguments for using whole genomes. These have mainly focused on the quality of PCR-free genomes, which have uniform coverage and few regions of missing or very low coverage. Compared to the PCR and hybridisation-capture artefacts in exomes, or the focus on a biased set of genes in panels, a genome can sound very convincing (especially if you ignore the analysis/storage issues). Until recently the cost has still been too high, but Illumina recently released PCR-free on X Ten, reducing the cost substantially. However, there are so many different arguments that it is difficult to determine if we can perform a one-size-fits-all sequencing experiment.
Comparison of Illumina PCR-free to PCR+ kits

Recently I started talking to people who use terms like analytic validity, clinical validity and clinical utility. These meant something to me, but I was not so sure what they meant for genome sequencing assays, exomes and amplicomes, so I thought I'd find out.

Tuesday 2 December 2014

XNA ink in your Sharpie: an indelible marker for genomics

New research at the LMB (across the road from the Institute I work in) has demonstrated how alternate "genetic polymers" can be used in place of DNA and RNA: Synthetic Genetic Polymers Capable of Heredity and Evolution. XNAs [xeno-nucleic acids] use nucleotide analogues, and the trick has been to get these to work in a biological system by engineering polymerases and other enzymes.

XNA structure from Pinheiro et al 2014

Friday 28 November 2014

At the top of Mt Kilimanjaro

Nature ran a great news feature on the top 100 papers of all time. The graphic below appears in the article and the first thing I wondered was where papers I've been an author on sit. I know this is pure vanity but let's be honest, anyone looking at this is thinking the same thing!

Monday 24 November 2014


Here's the front cover of this week's Nature featuring an image by Kelly Krause (@kellybkrause), Nature's Creative Director. It is a wonderful picture which manages to blend both Man and Mouse, with the DNA helix as the mouse's tail, a lovely detail.

The Mouse ENCODE project joins ENCODE and modENCODE; the scale of the data generation and analysis in these combined projects is staggering. Ewan Birney (lead analysis coordinator for ENCODE) described the efforts over five years of 400 scientists for the first ENCODE project on his blog. The fact that we can attempt big-data projects like this and succeed is a testament to the power of Genomics & Next-Generation Sequencing (the next best thing to sliced bread).

Sunday 23 November 2014

Have I Got News For You visits my lab...sort of

I'd like to say that Ian Hislop and Paul Merton visited last week to record a special HIGNFY, but alas no. However, my lab did appear briefly on the most recent show; unfortunately the team mixed our institute up with King's College! Before I set the record straight I thought I'd include the snippet - it is not often you see people looking so bored when I'm talking to them...

Why the visit: Ed Miliband was at the Institute to meet the top brass of Cancer Research UK and to talk with cancer ambassadors. He spent a good thirty minutes listening to cancer survivors and asking them what would make a difference from their perspective. The Labour Party have recently pledged that NHS patients in England will wait no longer than one week for cancer tests. While here he also visited our light microscopy core and they got on Sky News!

Thursday 20 November 2014

Copy-number: exomes vs genomes, proposing a better strategy

Comparing CNVs
Exome sequencing (@Exome_seq) can be used to generate CNV maps but the data are noisy compared to WGS or SNP-arrays. In this post I’ll describe a novel workflow to get high-quality CNV from an exome-seq pipeline and show some very new data…

Image reproduced and enlarged at the bottom of this post.

Monday 17 November 2014

Ion Proton amplicons for clinical molecular diagnostics

A recent paper from the MD Anderson's Department of Hematopathology reports on their use of the AmpliSeq for a 409 gene panel on Ion Proton: Clinical massively parallel next-generation sequencing analysis of 409 cancer-related genes for mutations and copy number variations in solid tumours. I'm a big fan of amplicomes, another 'ome I know, and in this context used to mean all the amplicons in your panel. PCR is a great way to enrich for specific regions of the genome and we all understand the basics. For me amplicomes on NGS is perhaps the easiest NGS for the NHS. It simply replaces M13-PCR for a single amplicon with a multiplex PCR, and Sanger sequencing with NGS. If a doctor understood a Sanger test then this is perhaps an easier leap to make than trying to understand/explain what an exome is!

Back to the paper:  The MD Anderson group used their gene panel on FFPE material and reported on the "sensitivity, specificity, reproducibility, and applicability of using the Ion Proton 409-gene panel to routinely screen for SNVs, insertions/deletions, and CNVs". To do this they used 55 tumours (20 with paired normals) and four cell lines.

Fig 6A: ERBB2 amplifications as seen in 4 breast cancer samples.

The Ion Torrent AmpliSeq Comprehensive Cancer Panel uses 4000 primer pairs across four primer pools, each requiring just 15ng of input DNA (60ng total), and up to 96 samples can be indexed using Ion Xpress Barcodes. Alignment and analysis of data was performed using the Torrent Suite software, and the group's OncoSeek was used to generate a clinical report.

Concordance between platforms was high when the group compared results to an earlier PGM panel of 46 genes, and they detected pretty much everything they expected. The Proton did call some InDels that the PGM missed, which the group put down to improvements in calling software.

The group reported high sensitivity for SNVs down to 5% allelic fraction, and high reproducibility. This, coupled with the minimal FFPE input and fast turnaround (5 days), makes the platform combination one that could be used in a clinical molecular diagnostics laboratory. Additionally they reported that up to 10 samples could be multiplexed per run and that this fitted in particularly well with their workflows.

Friday 14 November 2014

Supercentenarian genomes: but are they the right ones

GenomeWeb covered a recent paper from Stuart Kim's group at Stanford University: Whole-Genome Sequencing of the World’s Oldest People. Unfortunately they did not find any longevity genes, and Kim was quoted saying "We were pretty disappointed."

There are many suggestions that longevity has a genetic component, but I can't help but consider that the environmental component is likely to be stronger, and this would mask the genetics. Kim's study was very small, just 17 genomes, so the chances of finding anything were equally small. But had there been low-hanging fruit this would almost certainly have been a Science paper rather than PLoS One (sorry PLoS).

Monday 10 November 2014

HiSeq X Ten: when might they be available one at a time?

In the Summer I posted a Christmas letter to Santa. He's already delivered number 2 (cheaper RNA-seq) and 3 (longer RNA-seq reads via PE250), and number 4 might be coming soon (exomes at PE125). I'd also asked that he not deliver HiSeq X Ten as an individual instrument just yet, but as there are just 44 days left till Christmas I thought I'd look ahead and see what might be the reasons for, or against, gift wrapping a single X Ten this Christmas.

Friday 7 November 2014

ctDNA in triple negative breast cancer: a study comparing Illumina & 454 sequencing

A recent circulating tumour DNA (ctDNA) paper describes a comparison of ctDNA to CTCs with respect to TP53 mutations in 40 triple negative breast cancer patients. See: Circulating tumour DNA and circulating tumour cells in metastatic triple negative breast cancer patients in the International Journal of Cancer.

Wednesday 5 November 2014

250bp paired-end on your HiSeq 2500...up yours X Ten ;-)

The 250bp paired-end HiSeq rapid run kits are almost here! Expect to get 150M reads per lane or more on the current rapid flowcells. What will you do with 150M 500bp OPES?

No word on pricing yet, and I could not find any PE250 data on BaseSpace. Looking forward to finding out more...

Friday 17 October 2014

Your own personal genome project

I was chatting with a colleague at work who'd asked me if I knew anywhere they could get their genome or exome sequenced. My genome has been sat in the freezer for over five years wanting to go onto a flow cell, but I've never been comfortable putting it on our own machines. I did get 23andMe'd a few years ago but they've closed the exome for now.

Today there are many sequencing service providers across the world. Would any of them be open to a consumer-led project? How many genomes/exomes would we need to sequence to get a price consumers were willing to pay? To test the market we've used AllSeq: "the global sequencing marketplace", and a couple of replies have now come in!

Thursday 9 October 2014

Twitterbots for NGS

Inspired by Casey Bergman at the University of Manchester, I've created three Twitterbots for NGS papers: @RNA_seq, @ChIP_seq and @Exome_seq. I'd not come across Casey's Twitter account (I don't actually use Twitter that much) or his lab website and blog, but I was directed there by a piece on the Nature website...How to tame the flood of literature.

What Casey has done is pretty simple and it is very well explained in his blog post, or by Rob Lanfear who has posted instructions on GitHub. There are three simple steps for PubMed and Twitter (and more for arxiv, peerj, etc).
  1. Set up a twitter account
  2. Set up a pubmed search
  3. Set up your dlvr.it account
The feeds I created only went live this evening and I'll follow Rob's advice to refine them over the next few weeks. Let me know if you like them, and why not create your own feed?
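
If you want to preview what a search term returns before wiring it into a feed, step 2 can be sketched in a few lines of Python. This is only an illustration: it builds a query URL for NCBI's E-utilities `esearch` endpoint rather than the RSS feed that PubMed generates for you, and the search term and the helper name `pubmed_search_url` are my own assumptions, not the settings used for the bots above.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build an esearch URL that returns PubMed IDs matching `term` as JSON."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # the same query string you would type into PubMed
        "retmax": max_results,   # how many IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical search term, for illustration only.
url = pubmed_search_url('"RNA-seq"[Title/Abstract]')
print(url)
```

Pasting the printed URL into a browser shows the matching PubMed IDs, which is a quick way to check that a query is neither too broad nor too narrow before the bot starts tweeting.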

PS: Casey has a great post on how to host a custom UCSC genome browser track with Dropbox.



Tuesday 7 October 2014

Count-down to AGBT 2015

Registration is open! The registration process for AGBT has been overhauled in the last few years; yes there's still a lot of people who don't get in, but it seems to be pretty fair. 

I hope to see you there. I'm going to be blogging and Tweeting again (assuming I get in). Say Hi if you read CoreGenomics and I'll buy you a beer (or rather grab one from Illumina, Agilent, etc)!

Thursday 2 October 2014

Whole exomes from single cells: Fluidigm C1 update

This was not a planned post but it follows on nicely from today's other one about exomes. This time I'm writing about Fluidigm's new single-cell exome-seq protocol. Yup that's right, whole exomes from 96 single cells! The C1 is an amazing piece of kit (wish I had one) and we've used it a little bit for mRNA-seq. The ability to sequence single-cell genomes and exomes means you can pretty much do whatever you want with a single-cell now. So how do the exomes look?

More on exomes

I've been finding out more about exomes: specifically QC analysis using HS Metrics in Picard. There are loads of useful metrics and I'm hoping to get to a point where I can explain these to users here and also look at the results to try and troubleshoot an experiment. I'm also trying to understand what sort of read length we should be using for exome analysis. An earlier post discussed my thoughts around moving to PE125 or switching to SE125 and running more lanes. In a follow-up post (watch this space) I'll try to consider the impact of different run modes: will users/reviewers accept any kind of read for an exome or will they baulk at seeing something different from the paired-end norm?
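
For anyone who wants to pull the numbers out of a Picard metrics file rather than read it by eye, here is a minimal sketch. It assumes the usual layout of a Picard metrics file (a `## METRICS CLASS` line, then a tab-delimited header row and value rows, then a blank line before any histogram). The column names are real HsMetrics fields, but the values and the bait set name in `EXAMPLE` are invented purely for illustration.

```python
import csv
import io


def read_picard_metrics(text):
    """Return the rows of the METRICS CLASS table in a Picard metrics file as dicts."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            start = i + 1  # the header row follows immediately
            break
    if start is None:
        raise ValueError("no '## METRICS CLASS' section found")
    table = []
    for line in lines[start:]:
        if not line.strip():  # a blank line ends the metrics table
            break
        table.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(table)), delimiter="\t"))


# Invented values, for illustration only; real files have many more columns.
EXAMPLE = "\n".join([
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics",
    "\t".join(["BAIT_SET", "PCT_SELECTED_BASES", "MEAN_TARGET_COVERAGE", "PCT_TARGET_BASES_30X"]),
    "\t".join(["example_baits", "0.80", "95.2", "0.91"]),
    "",
    "## HISTOGRAM\tjava.lang.Integer",
])

rows = read_picard_metrics(EXAMPLE)
for row in rows:
    print(row["BAIT_SET"], float(row["MEAN_TARGET_COVERAGE"]))
```

Every value comes back as a string, so convert with `float()` before comparing against whatever thresholds your lab settles on.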

PS: Your comments on this would be greatly appreciated!

Tuesday 30 September 2014

Blogger's spell checker tries to fix "blog"

Why does Blogger's spell check try to correct 'blog', and why does it suggest 'bog' or 'blag'? Are they telling me I'm writing s**t or trying to get something for free?

Please fix this one blogger.com.

Monday 29 September 2014

Thanks for reading

This morning someone made the 500,000th page view on the CoreGenomics blog. It amazes me that so many people are reading this and the last couple of years writing have been really good fun. I've met many readers and some fellow bloggers, and received lots of feedback in the way of comments on posts, as well as at meetings. I've even had people recognise my name because of my blogging; surreal! But the last few years have seen some big changes in how we all use social media like blogs, Twitter, etc. I don't think there is a K-index for scientific bloggers, perhaps Neil can look at that one next ;-)

Question: What do you see?

Sunday 28 September 2014

Making BaseSpace Apps in Bangalore

I'm speaking at the BaseSpace Apps developers conference in Bangalore tomorrow. It's my first App and my first time in India, so I'm pretty excited about the whole thing.

Tuesday 23 September 2014

Welcome to a new company built around ctDNA analysis: Inivata

Inivata is a new company spun out of Nitzan Rosenfeld's research group at the CRUK Cambridge Institute (where I work). His group developed and published the TAm-seq method for circulating tumour DNA amplicon sequencing. The spin-out aims to develop blood tests measuring circulating tumour DNA (ctDNA) for use as a "liquid biopsy" in cancer treatment. Inivata has been funded by Cancer Research UK's technology arm CRT, Imperial Innovations, Cambridge Innovation Capital and Johnson & Johnson Development Corporation; initial funding has raised £4 million.


Inivata is currently based in the Cambridge Institute and the start-up team includes the developers of the TAm-seq method: Nitzan Rosenfeld (CRUK-CI), Tim Forshew (now at UCL Cancer Institute), James Brenton (CRUK-CI) and Davina Gale (CRUK-CI).

The research community has really taken hold of cell-free DNA and developed methods that are surpassing expectations. Outside of cancer, cell-free DNA is having its largest impact in the pre-natal diagnostics market, and it has been shown to be useful in many types of cancer. The use of ctDNA to follow tumour evolution was one of the best examples of what's possible I've seen so far and it's been exciting to be involved in some of this work. Inivata are poised to capitalise on the experience of the founding team and I'll certainly be following how they get on over the next couple of years.

If you fancy working in this field then they are currently hiring: molecular biologist, and computational biologist posts.

This is likely to become a crowded market as people pick up the tools available and deploy them in different settings. ctDNA is floating around in blood plasma and is ripe for analysis. I expect there is still lots of development space for new methods, and ultimately I hope we'll be able to use ctDNA as a screening tool for early detection of cancer.

If we can enrich for mutant alleles using technologies like Boreal or Ice-Cold PCR then detection (not quantitation) may be possible far earlier than can be achieved today.

Monday 15 September 2014

Are PCR-free exomes the answer?

I'm continuing my exome posts with a quick observation. There have been several talks recently that I've seen where people present genome and exome data and highlight the drop-out of genomic regions primarily due to PCR amplification and hybridisation artefacts. They make a compelling case for avoiding PCR when possible, and for sequencing a genome to get the very best quality exome.

A flaw with this is that we often want to sequence an exome not simply to reduce the costs of sequencing, but more importantly to increase the coverage to a level that would not be economical for a genome, even on an X Ten! For studies of heterogeneous cancer we may want to sequence the exome to 100x or even 1000x coverage to look for rare mutant alleles. Unfortunately this is exactly the kind of analysis that might be messed up by those same PCR artefacts, namely PCR duplication (introducing allele bias) and base misincorporation (introducing artifactual variants).
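
To put rough numbers on why depth matters for rare alleles, here is a minimal sketch using a simple binomial sampling model. It assumes reads are independent draws, ignores sequencing error and the PCR artefacts mentioned above, and uses a made-up calling rule of at least 5 supporting reads; none of these assumptions come from a specific variant caller.

```python
from math import comb


def prob_at_least(k, depth, allele_fraction):
    """P(X >= k) where X ~ Binomial(depth, allele_fraction) counts mutant reads."""
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


MIN_SUPPORTING_READS = 5  # hypothetical calling threshold
ALLELE_FRACTION = 0.01    # a mutant allele present in 1% of molecules

for depth in (100, 1000):
    p = prob_at_least(MIN_SUPPORTING_READS, depth, ALLELE_FRACTION)
    print(f"{depth}x: P(>= {MIN_SUPPORTING_READS} mutant reads) = {p:.3f}")
```

Under these idealised assumptions a 1% allele is almost never seen in five or more reads at 100x but is seen most of the time at 1000x, which is the argument for very deep exomes. Real data will do worse once duplicates and errors are accounted for.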

PCR-free exomes: In my lab we are running Illumina's rapid exomes so PCR is a requirement to complete the Nextera library prep. But if we were to use another method then in theory PCR-free exomes would be possible. Even if we stick to Nextera (or Agilent QXT) then we could aim for very low-cycle PCR libraries. The amount of exome library we are getting is huge, often hundreds of nanomoles, when we only need picomoles for sequencing.

Something we might try testing is a PCR-free or PCR-lite (pardon the American spelling) exome to see if we can reduce exome artefacts and improve variant calling. If anyone else is doing this please let me know how you are getting along and how far we can push this.

Thursday 4 September 2014

The newest sequencer on the block: Base4 goes public

I've heard lots of presentations about novel sequencing technologies, many have never arrived, some have come and gone, all have been pretty neat ideas; but so far not one has arrived that outperforms the Illumina systems many readers of this blog are using.

Base4's pyrophosphorolysis sequencing technology

The latest newcomer is Base4's single-molecule microdroplet sequencing technology. The picture above explains the process very well. A single molecule of double-stranded DNA is immobilised in the sequencer, and single bases are cleaved at a defined rate from the 3' end by pyrophosphorolysis (the new Pyrosequencing perhaps?). As each nucleotide is cleaved it is captured into a microdroplet, where it initiates a cascade reaction that generates a fluorescent signal unique to each base. Because microdroplets are created at a faster rate than DNA is cleaved at the 3' end, the system generates a series of droplets that can be read out by the sequencer (a little like the fluorescent products being read off a capillary electrophoresis instrument).

Base4 are talking big about what their technology can deliver. They say it will be capable of sequencing 1M bases per second with low systematic error rates. The single-molecules mean no amplification and read-lengths should be long. Parallelisation of the technology should allow multiple single-molecules to be sequenced at the same time. How much and when will have to wait a little longer.
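
As a back-of-the-envelope check on what 1M bases per second would mean, here is a small worked example. The genome size of roughly 3.2 Gb and the 30x target are my assumptions, and I am treating the quoted rate as the total sustained output of one instrument, which Base4 have not confirmed.

```python
GENOME_SIZE_BP = 3_200_000_000  # assumed human genome size, roughly 3.2 Gb
COVERAGE = 30                   # assumed target depth
BASES_PER_SECOND = 1_000_000    # the rate quoted by Base4

total_bases = GENOME_SIZE_BP * COVERAGE
seconds = total_bases / BASES_PER_SECOND
hours = seconds / 3600

print(f"{total_bases:.2e} bases at {BASES_PER_SECOND:,} bases/s = {hours:.1f} hours")
```

On those assumptions a 30x human genome would take a little over a day of continuous sequencing, before any allowance for error rates or sample prep.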

I've been speaking to Base4 over the past few years after meeting their founder Cameron Frayling in a pub in Cambridge. Over the past two years Base4 has been developing their technology and recently achieved a significant milestone by demonstrating robust base-calling of single nucleotides in microdroplets. They are still small, with just 25 employees and are based outside Cambridge. I hope they'll be growing as we start to get our hands on the technology and see what it's capable of.

Low-diversity sequencing: RRBS made easy

Illumina recently released a new version of HCS v2.2.38 for the HiSeq. The update improves cluster definition significantly and enables low-diversity sequencing. It’s a great update and one that’s making a big impact on a couple of projects here.

Thursday 28 August 2014

SEQC kills microarrays: not quite

I've been working with microarrays since 2000 and ever since RNA-seq came on the scene the writing has been on the wall. RNA-seq has so many advantages over arrays that we've been recommending it as the best way to generate differential gene expression data for a number of years. However, the cost and lack of maturity in analysis meant we still ran over 1000 arrays in 2013, but it looks like 2014 might be the end of the line. RIP: microarrays.

Thursday 21 August 2014

FFPE: the bane of next-generation sequencing? Maybe not for long...

FFPE makes DNA extraction difficult; DNA yields are generally low, quality can be affected by fixation artefacts and the number of amplifiable copies of DNA is reduced by strand-breaks and other DNA damage. Add on top of this almost no standardisation in the protocols used for fixation and widely different ages of samples, and it's not surprising FFPE causes a headache for people that want to sequence genomes and exomes. In this post I'm going to look at alternative fixatives to formalin, QC methods for FFPE samples to assess their suitability in NGS methods, some recent papers and new methods to fix FFPE damage.
Why do we use formalin-fixation: The ideal material to work with for molecular studies is fresh-frozen (FFZN) tumour tissue, as nucleic acids are of high quality. But many cancer samples are fixed in formalin for pathological analysis and stored as Formalin-Fixed Paraffin-Embedded (FFPE) blocks, preserving tissue morphology but damaging nucleic acids. The most common artefacts are strand-breaks and C>T base substitutions, the latter caused by deamination of cytosine bases, which converts them to uracil and generates thymines during PCR amplification. Both of these reduce the amount of correctly amplifiable template DNA in a sample and this must be considered when designing NGS experiments.
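
One quick way to see whether a call set carries this deamination signature is to ask what fraction of the SNVs are C>T (or G>A, which is the same change read from the opposite strand). The sketch below assumes you already have a list of (ref, alt) pairs from your own variant calls; the function name and the example calls are invented for illustration.

```python
from collections import Counter


def deamination_fraction(snvs):
    """Fraction of SNVs that are C>T or G>A, given an iterable of (ref, alt) bases."""
    counts = Counter((ref.upper(), alt.upper()) for ref, alt in snvs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # G>A is C>T seen on the reverse strand, so count both.
    damaged = counts[("C", "T")] + counts[("G", "A")]
    return damaged / total


# Invented calls, for illustration only.
example_calls = [("C", "T"), ("G", "A"), ("A", "G"), ("T", "C")]
print(deamination_fraction(example_calls))
```

Comparing this fraction between a matched FFZN and FFPE pair is more informative than any absolute cut-off, since the background rate of C>T changes differs between tumour types.
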
Molecular fixatives: Our Histopathology core recently published a paper in Methods: Tissue fixation and the effect of molecular fixatives on downstream staining procedures. In this they demonstrated that, overall, molecular fixatives preserved tissue morphology as well as formaldehyde for most histological purposes. They presented a table listing the molecular-friendly fixatives and reporting the average read-lengths achievable from DNA & RNA (median read-lengths 725 & 655 respectively). All the fixatives reviewed have been shown to preserve nucleic acid quality, by assessment of qPCR Ct values or through RNA analysis (RIN, rRNA ratio, etc). But no-one has performed a comparison of these at the genome level, and the costs of sequencing probably keep these kinds of basic tests beyond the limits of most individual labs.

The paper also presents a tissue-microarray of differently fixed samples, which is a unique resource that allowed them to investigate the effects of molecular fixatives on histopathology. All methods preserved morphology, but there was a wide variation in the results from staining. This highlights the importance of performing rigorous comparisons, even for the most basic procedures in a paper (sorry to any histopathologists reading this, but I am writing from an NGS perspective).

The first paper describing a molecular fixative (UMFIX) appeared back in 2003. In it the authors describe the comparison of FFZN to UMFIX tissue for DNA and RNA extraction, with no significant differences between UMFIX and FFZN tissues on PCR, RT-PCR, qPCR, or expression microarrays. Figure B from their paper shows how similar RNA bioanalyser profiles were from UMFIX and FFZN.

UMFIX (top) and FFZN (bottom)


Recent FFPE papers: A very recent and really well written paper in May 2014 by Hedegaard et al compared FFPE and FFZN tissues to evaluate their use in exome and RNA-seq. They used two extraction methods for DNA and three for RNA with different effects on quality and quantity. Only 30% of exome libraries worked, but with 70% concordance (FFZN:FFPE). They made RNA-seq libraries from 20-year-old samples with 90% concordance, and found a set of 1500 genes whose expression differences appear to be due to fixation. Their results certainly make NGS analysis of FFPE samples seem to be much more possible than previous work. Interestingly they made almost no changes to the TruSeq exome protocol, so some fiddling with library prep, perhaps adding more DNA to reduce the impact of strand-breaks for instance, would help a lot (or fixing FFPE damage - see below). The RNA-seq libraries were made using RiboZero and ScriptSeq. Figure 2 from their paper shows the exome variants with percentages of common (grey), FFZN-only (white) and FFPE-only (red); there are clear sample issues due to age (11, 7, 3 & 2 years storage) but the overall results were good.

Other recent papers looking at FFPE include: Ma et al (Apr 2014): they developed a bioinformatics method for gene fusion detection in FFPE RNA-seq. Li et al (Jan 2014): they investigated the effect of molecular fixatives on routine histopathology and molecular analysis. They achieved high-quality array results with as little as 50ng of RNA. Norton et al (Nov 2012): they manually degraded RNA in 9 pairs of matched FFZN/FFPE samples, and ran both Nanostring and RNA-seq. Both gave reliable gene expression results from degraded material. Sinicropi et al (Jul 2012): they developed and optimised RNA-seq library prep and informatics protocols. And most recently Cabanski et al published what looks like the first RNA-access paper (not open access and unavailable to me). RNA-access is Illumina's new kit for FFPE that combines RNA-seq prep from any RNA (degraded or not) with exome capture (we're about to test this, once we get samples).

QC of FFPE samples: It is relatively simple to extract nucleic acids from FFPE tissue and get quantification values to see how much DNA or RNA there is, but tolerating a high failure rate, due to low-quality, in subsequent library prep is likely to be too much of a headache for most labs. Fortunately several groups have been developing QC methods for FFPE nucleic acids. Here I'll focus mostly on those for DNA.

van Beers et al published an excellent paper in 2006 on a multiplex PCR QC for FFPE DNA. This was developed for CGH arrays and produces 100, 200, 300 and 400bp fragments from non-overlapping target sites in the GAPDH gene from the template FFPE DNA. Figure 2 from their paper (reproduced below) shows results from a good (top) and a bad (bottom) FFPE sample.

Whilst the above method is very robust and generally predictive of how well an FFPE sample will work in downstream molecular applications, it is not high-throughput. Other methods generally use qPCR as the analytical method as it is quick and can be run in very high-throughput. Illumina sell an FFPE QC kit which uses comparison of a control template to test samples and a deltaCq method to determine if samples are suitable for arrays or NGS. LifeTech also sell a similar kit but for RNA, Arcturus sample QC, using two β-actin probes and assessing quality via their 3'/5' ratio. Perhaps the ideal approach would be a set of exonic probes multiplexed as 2, 3, or 4-colour TaqMan assays. This could be used on DNA and RNA and would bring the benefits of the van Beers and LifeTech methods to all sample types.
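
The deltaCq idea is simple enough to sketch. The code below assumes perfect doubling per PCR cycle and uses a pass/fail cut-off that I have picked purely for illustration; it is not the threshold from Illumina's kit, so check the kit documentation before using any number like this.

```python
def delta_cq(sample_cq, control_cq):
    """Difference in quantification cycle between an FFPE sample and the control template."""
    return sample_cq - control_cq


def amplifiable_fraction(dcq, efficiency=2.0):
    """Approximate amplifiable template relative to the control, assuming a fixed
    fold-amplification per cycle (2.0 means perfect doubling)."""
    return efficiency ** (-dcq)


HYPOTHETICAL_MAX_DELTA_CQ = 5.0  # illustrative cut-off only, not a kit specification


def passes_qc(sample_cq, control_cq, max_delta=HYPOTHETICAL_MAX_DELTA_CQ):
    return delta_cq(sample_cq, control_cq) <= max_delta


# Invented Cq values: the sample comes up 3 cycles later than the control.
dcq = delta_cq(27.0, 24.0)
print(dcq, amplifiable_fraction(dcq), passes_qc(27.0, 24.0))
```

A sample that comes up three cycles later than the control has roughly one eighth of the amplifiable template, which is a useful way to decide how much extra input DNA a library prep might need.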

Fixing FFPE damage: Another option is to fix the damage caused by formalin fixation. This is attractive as there are literally millions of FFPE blocks, and many have long-term follow up data. A paper in Oncotarget in 2012 reported the impact of using uracil-DNA glycosylase (UDG) to reduce C>T caused by cytosine deamination to uracil. They also showed that this can be incorporated into current methods as a step prior to PCR, something which we've been doing for qPCR for many years. There are not strong reasons to incorporate this as a step in any NGS workflow as there is little impact on high-quality templates.

NEB offer a cocktail of enzymes in their PreCR kit, which repairs damaged DNA templates. It is designed to work on modified bases, nicks and gaps, and blocked 3' ends. They had a poster at AGBT demonstrating the utility of the method, showing increased library yields and success rates with no increase in bias in sequencing data.

Illumina also have an FFPE restoration kit; restoration is achieved through treatment with DNA polymerase, DNA repair enzyme, ligase, and a modified Infinium WGA reaction, see here for more details.

These cocktails can almost certainly be added to: MUTYH works to fix 8-oxo-G damage; CEL1, which is used in TILLING analysis to create strand-breaks in mismatched templates, could be included; and lots of other DNA repair enzymes could be added to a mix to remove nearly all compromised bases. It may be possible to go a step further and fix compromised bases rather than just neutralise their effect.

Whatever the case it looks very much like FFPE samples are going to be processed in a more routine manner very soon.

Monday 18 August 2014

$1000 genomes = 1000x coverage for just £20,000

It strikes me that if you can now sequence a genome for $1000, then you could buy 1000x coverage for not much more than a 30x genome cost a couple of years ago! Using a PCR-free approach I can imagine that this would be the most sensitive tool to determine tumour, or population, heterogeneity. I’m sure that sampling statistics might limit the ability to detect low-prevalence alleles but I’m amazed by the possibility none-the-less.
  • 1 X-Ten run (one 30x genome) costs $1,000
  • 1000x requires about 33 X-Ten runs (30x each, 990x in total)
  • $33,000 = £20,625
If you’re running a ridiculously high Human genome project on X-Ten do let me know!
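
The arithmetic in the list above can be checked in a few lines. The $1,000 per 30x genome and the 33 runs come from the post; the exchange rate is simply the one implied by $33,000 = £20,625 and is not a quoted market rate.

```python
import math

COST_PER_30X_GENOME_USD = 1000
COVERAGE_PER_RUN = 30
TARGET_COVERAGE = 1000
USD_TO_GBP = 20625 / 33000  # rate implied by the figures in the post (0.625)

runs_rounded_down = TARGET_COVERAGE // COVERAGE_PER_RUN         # 33 runs, just under 1000x
runs_rounded_up = math.ceil(TARGET_COVERAGE / COVERAGE_PER_RUN)  # 34 runs, just over 1000x

for runs in (runs_rounded_down, runs_rounded_up):
    coverage = runs * COVERAGE_PER_RUN
    cost_usd = runs * COST_PER_30X_GENOME_USD
    cost_gbp = cost_usd * USD_TO_GBP
    print(f"{runs} runs = {coverage}x for ${cost_usd:,} (about £{cost_gbp:,.0f})")
```

Strictly, 33 runs gives 990x and it takes a 34th to pass 1000x, but either way the total is only a little over the £20,000 in the title.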

Thursday 14 August 2014

How many MiSeq owners are using the bleach protocol to minimise carryover?

A comment popped up on a post I'd written in April last year, "MiSeq (and 2500) owners better read this and beware", that made me think I'd ask readers the question in the title: "how many of you are using the MiSeq bleach wash protocol?"

The carryover issue led to a small residue of the last library to be run being sequenced in the subsequent run. This caused a potential problem for MiSeq users, particularly those interested in low frequency mutations. My post was prompted by some discussions with other MiSeq owners and a thread on SEQanswers, where Illumina posted a description of their experiences with reducing this carryover, and reported that it was typically seen at 0.1%.

The comment on my post was about a more aggressive bleach protocol which reduced carryover to almost undetectable levels (thanks Illumina), but that appears to have not been communicated to all users. It was impossible for me to find on the Illumina website, but it's not the easiest site in the world to navigate, so I thought I'd put the document up here for you to see (click the image for the TechNote).


You need to request this through your tech support team as it needs a new recipe on your MiSeq. And you really must follow the instructions to the letter: too much bleach and you'll probably kill your MiSeq!

Illumina demonstrated that this protocol can reduce carryover to less than 0.001%, or one read per 100,000. We've been using this as the default wash for many months and reports of carryover are nearly non-existent.
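To put those rates in context, here is a minimal sketch of how many reads in a run might come from the previous library. The run size of 15 million reads is a hypothetical figure chosen for illustration, and the function name is mine; only the "one read per N" rates come from the text (0.1% is one read per 1,000, 0.001% is one read per 100,000).

```python
def carryover_reads(total_reads, one_read_per):
    """Expected reads from the previous library at a rate of 1 in one_read_per."""
    return total_reads / one_read_per


if __name__ == "__main__":
    run_reads = 15_000_000  # hypothetical run size, not a quoted specification

    standard_wash = carryover_reads(run_reads, 1_000)    # 0.1% carryover
    bleach_wash = carryover_reads(run_reads, 100_000)    # 0.001% carryover

    print(f"Standard wash: about {standard_wash:,.0f} carried-over reads")
    print(f"Bleach wash:   about {bleach_wash:,.0f} carried-over reads")
```

If you are looking for mutations at frequencies of a fraction of a percent, the standard-wash figure is of the same order as the signal you are after, which is why the bleach wash matters.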

Sunday 27 July 2014

When will BGI stop using Illumina sequencers?

With BGI aiming to get their own diagnostic sequencing tests on the market, and the purchase and development of Complete Genomics technology - Omega, a question that could be asked is whether BGI will ever stop using Illumina technology?

BGI are still the largest install of HiSeqs but they have not purchased an X-Ten and it's not clear if they've switched over to v4 chemistry on the updated 2500. The cost of upgrades or replacement on a scale of 128 machines would be high, but BGI have deep pockets. So is this the beginning of the end for Illumina in China?

If BGI stops using Illumina will Illumina notice? I'm sure they will, and the markets will read lots into any announcement, but in the long run it's difficult to see China without an Illumina presence. The Chinese science community is booming, their research spend is second only to the US and is likely to climb much more quickly, and they have a massive health-care market that NGS can make a big impact on.

Once we hear what BGI can do with the CGI technology (exomes for instance) we might find out Illumina has a strong competitor, and with LifeTech/Thermo effectively putting Ion Torrent on hold, competition in the NGS market is something the whole community, including Illumina, needs.

PS: This is my last post for a couple of weeks while I'm off on holiday in Spain. Hasta luego!

Monday 21 July 2014

1st Altmetric conference - Sept 25/26th in London

I've been a user of Altmetric for a while now and very much like what they are doing with article metrics. I'm sure many Core Genomics readers will also have seen the Altmetric badge on their own papers. Now Altmetric are hosting their first conference.

The meeting aims to demonstrate how users are integrating Altmetric tools into their processes. Hopefully they'll cover lots of interesting topics and spend some time talking about how the community can keep tools like Altmetric from becoming devalued by gaming.

Might see you there...

Thursday 17 July 2014

A hint at the genome's impact on our social lives

GWAS is still in the news and still finding hits; the number of GWAS hits has increased rapidly since the first publication for AMD in 2005. Watch the movie to see the last decade of work!

A recent paper in PNAS seems to have got people talking: in "Friendship and natural selection" Nicholas Christakis and James Fowler describe their analysis of the Framingham Heart Study (FHS) data; specifically the data of people recorded as friends by participants. The FHS recorded lots of information about relatives (parents, spouses, siblings, children), but also asked participants "please tell us the name of a close friend". Some of those friends were also participants and it is this data the paper used to determine a kinship coefficient; higher values indicate that two individuals share a greater number of genotypes (homophily).

The study has generated a lot of interest and news (GenomeWeb, BBC, Altmetric) but also some negative comments, mainly about how difficult this is to prove in a study where you cannot rule out genetic relationships that the individuals themselves don't know exist (e.g. I don't know who my third cousins are and might make friends with one by chance).

The data in the supplemental files from the PNAS paper show a Manhattan plot (top) for the identified loci; it's not as stunning an example as you'd see in other fields. Compare it to a well characterised GWAS hit from a replicated study in Ovarian Cancer (bottom).

Monday 14 July 2014

Sequencing exomes: what sort of read to use?

What's the right way to sequence an exome? We've been looking at Illumina's v4 chemistry for HiSeq 2500 and wondering whether we should jump to PE125bp or not, or whether we should try to reconfigure our exome capture for shorter or longer fragments.

Exome-seq: Exomes have been a big hit; there are currently over 3000 publications in PubMed with the search term "exome". Given that the first in-solution exome paper was only published in 2009 that's pretty amazing, but then again the exome is an amazing research tool.

Note to readers: This post started out as a way of writing down my thoughts about whether we should move to longer reads for exomes. But it has become a bit more rambling as I started to find out I need to do some more digging. I may well come back to this post with an update or version two...

There are many ways to prepare an exome for sequencing and in my lab we're currently using Illumina's rapid exome kit. We're also about to compare this to Agilent's new SureSelect QXT kit, which is a direct competitor to Illumina's Nextera-based offering. We've never tried Nimblegen or AmpliSeq; however, this post is more about how to sequence the exome than how to prepare it, so enough of kit comparisons.

The standard exome: There are two things you need to consider when sequencing exomes: read depth and read-length. I'm not going to worry about depth in this post, and instead I'm going to focus on read-length. Today most labs appear to be running exomes at PE75bp, a standard which I am not sure has ever been agreed by anyone, but it has been accepted as being good enough for most projects (Illumina recommend PE75-100). I know of some groups that moved over to PE100 to simplify lab logistics as much as anything else, but I am not clear that there are significant benefits to increasing length so we've stuck at PE75 for the time being.

Are longer reads better: With the advent of v4 chemistry on HiSeq 2500 we should be able to generate high-quality paired-end 125bp reads, albeit with a slightly higher error rate at the end of the read. At first glance this additional data seems too good to ignore, especially when Illumina do not sell a 150-cycle SBS kit, and 3x 50-cycle SBS would not be that much cheaper (and would be more hassle for my lab staff!). By my reckoning PE75 costs £900 per lane whilst PE125 is £1200, or £300 for an extra 100bp of sequence per fragment. So if cost does not prevent us using PE125, should we simply switch?
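The cost comparison can be made a little more explicit by working out the lane cost per sequenced base (per cluster). This is a minimal sketch using only the two lane prices quoted above; the function name is mine and the prices are my own reckoning, not list prices.

```python
def cost_per_base(lane_cost_gbp, read_length, paired=True):
    """Lane cost divided by the bases sequenced per cluster (both reads if paired)."""
    bases_per_cluster = read_length * (2 if paired else 1)
    return lane_cost_gbp / bases_per_cluster


if __name__ == "__main__":
    pe75 = cost_per_base(900, 75)     # 150 bases per cluster
    pe125 = cost_per_base(1200, 125)  # 250 bases per cluster
    # Marginal cost of the extra 100 bases gained by moving from PE75 to PE125.
    marginal = (1200 - 900) / (250 - 150)

    print(f"PE75:  £{pe75:.2f} per base per cluster")
    print(f"PE125: £{pe125:.2f} per base per cluster")
    print(f"Extra bases cost £{marginal:.2f} each")
```

On these numbers the extra bases are cheaper than the first 150, so cost alone does not argue against PE125; the question is whether those extra bases are useful.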

Insert size vs read-length: As you can see below, the distribution of exome fragment sizes spans the read-length of the sequencer. The solid black line indicates 150bp (PE75): everything to the left of this will be fragments sequenced with overlapping reads (opes), whilst everything to the right is sequenced with non-overlapping reads (nopes). As read-length increases the percentage of fragments sequenced with an overlap also increases: at PE100 (dashed line) this is over 50% of fragments, and at PE125 (dotted line) it's about 75% of all fragments. Overlapping reads create some issues as the two reads are not independent, and tools need to take the overlap into account when calculating on-target coverage, etc; but they also offer the opportunity to increase variant calling quality by increasing Q-scores in the overlap region.
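The opes/nopes split is easy to compute if you have a list of insert sizes, for example from aligned read pairs. The sketch below assumes a read pair overlaps when the fragment is shorter than twice the read length. The fragment lengths are a made-up toy list, not our data, so the fractions it prints will not match the percentages quoted above.

```python
def fraction_overlapping(fragment_lengths, read_length):
    """Fraction of fragments whose paired-end reads overlap (the 'opes').

    A pair overlaps when the fragment is shorter than 2 x read_length;
    a fragment of exactly 2 x read_length is read end to end with no overlap.
    """
    limit = 2 * read_length
    overlapping = sum(1 for length in fragment_lengths if length < limit)
    return overlapping / len(fragment_lengths)


if __name__ == "__main__":
    # Hypothetical insert sizes in bp, for illustration only.
    fragments = [120, 140, 160, 180, 190, 210, 230, 260, 300, 350]

    for read_length in (75, 100, 125):
        opes = fraction_overlapping(fragments, read_length)
        print(f"PE{read_length}: {opes:.0%} opes, {1 - opes:.0%} nopes")
```

Swapping the toy list for real insert sizes from an exome library would give the numbers behind the solid, dashed and dotted lines.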

Exome libraries may not be the best size for sequencing: If a non-overlapping read is the best kind to generate then we may need to reconfigure library prep in the light of v4 chemistry. An interesting comparison can be made to the Agilent Bioanalyser trace below the computed insert size distribution. If you overlay and rescale the two images, then the Agilent trace appears to be peak-shifted to larger fragments, and the right-hand fragment distribution is much broader. This appears to demonstrate the preference of clustering:sequencing for shorter fragments.

Exome libraries are probably the best size for capturing exons: The average exon length in the Human genome is 170 bp with 80–85% of exons less than 200bp (Zhu et al & Sakharkar et al) so the 185bp average fragment length seems almost ideal.

Table reproduced from Sakharkar et al 2004

So what's the sweet spot for Exome capture and sequencing: The simple answer is I don't know, and several factors are likely to affect this. As we increase read-length we'll get more fragments with overlapping reads, which could be wasteful; the same happens if we decrease fragment size. Longer reads therefore give us more and more overlap, and potentially higher quality in the overlap region, but unless there are tools to make use of this the data are redundant. So fragments should not be shorter than the combined length of the read pair.

Fragments are also captured by probes of 95bp, so we should probably not make fragments shorter than the probes either.

Exome capture kits contain blocking oligos to prevent adapter:adapter hybridisation and off-target pull-down. As fragment length increases the amount of near-target sequence captured may also increase, meaning we should not make fragments too long. A long fragment also risks too much off-target enrichment through the secondary capture of off-target fragments.

Lastly (for now) we'd like to be able to use independent fragments for our analysis so read-pairs might be better replaced with longer single-reads, but twice as many. So perhaps the answer is probes that efficiently capture exons with little or no fragment:fragment hybridisation, coupled to single-end 185bp sequencing with low error-rate across the reads.

Monday 7 July 2014

Anatomy of a NextSeq flowcell

Personally I'm thinking that the aluminium plate might make a pretty nifty bottle opener!

Thursday 3 July 2014

How to find the best papers to read is tough

We've all been there: PubMed returns over 2500 RNA-seq papers, and there are still 800 left when you only search the title! How do you find the best papers to read? PubMed can help a little more with your quest to find out more about RNA-seq as there are just 19 reviews, but it's often primary papers you need to dig into to truly understand what's going on in a field. There are other ways to find out what's a hot paper and I've just started using a relatively new one: the Altmetric explorer.

Before I go any further I will say this is a demo account (thanks Altmetric) and their pricing plans are squarely aimed at institutions. Hopefully they'll find a way to make tools for individuals with perhaps more limited search functionality.

What does Altmetric Explorer do: The search tool allows you to filter the vast amount of data Altmetric has collected; you can even enter a PubMed search directly. The first thing I did was look at my own publication record to see who's talking about the papers I've co-authored. It turns out it is often just me (as far as Altmetric is concerned)!

I'd originally been in touch with the Altmetric team about using data from ORCID (I wrote about this last year) and seeing if it were possible to pull out more complex relationships between authors. The aim was to make creation of something like the Circos plot below easy to do for any group of individuals or even institutes. I'm still a long way from doing this but if anyone can offer some help that would be great!

The searches presented below simply use a PubMed search and list papers in order of most activity, as recorded by Altmetric. You can filter on lots of other metrics including keyword, date, journal, etc. Take a look and get in touch with the Altmetric team if you'd like to do more.
RNA-seq Altmetric activity:

ChIP-seq Altmetric activity:

My Altmetric activity:
PubMed = Hadfield J[author]