Thursday, 20 November 2014

Copy-number: exomes vs genomes, proposing a better strategy

Comparing CNVs
Exome sequencing (@Exome_seq) can be used to generate CNV maps but the data are noisy compared to WGS or SNP arrays. In this post I’ll describe a novel workflow to get high-quality CNV from an exome-seq pipeline and show some very new data…

Image reproduced and enlarged at the bottom of this post.

Copy-number is a driving event in cancer and as such we've done lots and lots of copy-number analysis. The majority of this has been on microarrays, which even today offer a very cost-effective way to get copy-number variation (CNV) and loss of heterozygosity (LOH) from a sample. The METABRIC study from Prof Carlos Caldas's lab used CNV to investigate the genomic and transcriptomic architecture of 2,000 breast tumours, and we ran over 2,000 Illumina HT12 gene expression arrays and 2,000 Affymetrix SNP6 arrays as part of the study.

We have moved our differential gene expression analysis from microarrays to RNA-seq, and several years ago we considered whether it would be technically possible and cost-effective to use NGS for CNV analysis as well. We discounted this due to the inability to detect LOH in low-coverage sequencing. But about two years ago several projects, some great papers, and a beta-test of Illumina’s Nextera exome method made us reconsider our decision.

Exome-seq for CNV: Exomes allow low-cost analysis (compared to genomes) of large numbers of samples and can detect low-frequency single nucleotide variants and small InDels; in cancer this allows us to analyse tumour heterogeneity. Exome sequencing relies on the capture and enrichment of exonic fragments from a whole genome sequencing library, using biotinylated oligonucleotide baits. The amount of PCR, both pre- and post-capture, and the use of one or two hybridisations add significant intra- and inter-sample technical variability, which shows up as strong batch effects. As such, the use of CNV tools developed for WGS is not recommended, and multiple methods have been developed specifically for exome CNV analysis.

Multiple exome-seq copy-number papers make the point that calling CNV from exome data is more cost-effective than from WGS; however, these assume that 30x coverage will be used. Most of the exome-CNV tools assume that the technical variability is systematic and can be removed by using controls or paired samples, or by removing some of the worst “noise”. At least four comparisons of exome CNV tools have been published. Together these reviewed the performance of nine tools: VarScan 2, eXome Hidden Markov Model (XHMM), ExomeCNV, CONTRA, ExomeCopy, ExomeDepth, ExCopyDepth, CoNIFER and cn.MOPS. One of the comparison papers reported that some of the exome CNV tools did not perform fantastically when compared to array CGH.

Exome sequencing also generates large amounts of off-target data that is often discarded. We have seen the number of useable bases, i.e. exonic bases, drop as we have moved to a standard run length of paired-end 125bp. Essentially we are sequencing libraries with average insert sizes under 250bp, so almost all fragments have some overlap between the two reads, many reads are from non-exonic bases (introns, UTRs) and many fragments will read through into adapter; together this can total over 50% of bases. One group has specifically addressed the off-target bases as useful in detecting CNV and developed the cnvOffSeq tool.
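
To make the overlap and read-through point concrete, here is a rough Python sketch that splits the bases sequenced from a single paired-end fragment into unique, double-sequenced and adapter bases. The function name and the simple model are my own illustration (perfect reads, no soft-clipping, insert size known exactly), and it says nothing about how many of the unique bases are actually exonic.

```python
def read_pair_breakdown(insert_size, read_length=125):
    """Split the bases from one paired-end fragment into categories.

    insert_size: length of the DNA fragment between the adapters (bp).
    read_length: cycles per read (125 for the PE125 runs discussed above).
    This is a simplified, hypothetical model for illustration only.
    """
    sequenced = 2 * read_length
    # Bases of each read that fall on the fragment rather than the adapter.
    on_fragment_per_read = min(read_length, insert_size)
    # Anything sequenced beyond the fragment end is adapter read-through.
    adapter = 2 * (read_length - on_fragment_per_read)
    # Fragment positions covered by at least one of the two reads.
    unique = min(insert_size, 2 * on_fragment_per_read)
    # Fragment bases sequenced by both reads, counted once as redundant.
    overlap = 2 * on_fragment_per_read - unique
    return {
        "sequenced": sequenced,
        "unique": unique,
        "overlap": overlap,
        "adapter": adapter,
    }


if __name__ == "__main__":
    for insert in (100, 150, 200, 250, 300):
        b = read_pair_breakdown(insert)
        print(insert, b, "unique fraction = %.2f" % (b["unique"] / b["sequenced"]))
```

Under this model a 200bp insert sequenced PE125 yields 200 unique bases out of 250 sequenced, and a 100bp insert yields only 100 unique bases, with the rest lost to overlap and adapter.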

A novel method for “exome-CNV”: When testing the new Nextera exome kits from Illumina I noticed that we had large amounts of library that was not needed for exome capture. I suggested that we could use this to perform low-coverage sequencing to detect copy-number. Although we would have to drop LOH analysis, the CNV data could be generated for perhaps £20-50 per sample, assuming we used 10-20 million single-end 50bp reads. I was inspired by seeing clear HER2 amplifications in WGS libraries run on a MiSeq QC where we had just 1M reads per sample!
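
It helps to see how shallow this sequencing really is. The back-of-envelope sketch below converts a read count into mean genome coverage and into the expected number of reads per bin, using the 100kb bin size quoted later in this post. The genome size of 3.1 Gb is my assumption, the function names are hypothetical, and the noise figure is a simple Poisson estimate that ignores GC bias, mappability and duplicates.

```python
import math

GENOME_SIZE_BP = 3.1e9  # assumption: approximate size of the human genome


def mean_coverage(n_reads, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average depth if every read mapped uniquely and uniformly."""
    return n_reads * read_length_bp / genome_size_bp


def expected_reads_per_bin(n_reads, bin_size_bp, genome_size_bp=GENOME_SIZE_BP):
    """Expected read count in one fixed-width bin under uniform coverage."""
    return n_reads * bin_size_bp / genome_size_bp


def poisson_cv(expected_count):
    """Coefficient of variation of a Poisson count: 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected_count)


if __name__ == "__main__":
    for n_reads in (1_000_000, 10_000_000, 20_000_000):
        per_bin = expected_reads_per_bin(n_reads, 100_000)
        print(
            "%d reads: %.3fx coverage at SE50, %.0f reads per 100kb bin, "
            "Poisson CV %.1f%%"
            % (n_reads, mean_coverage(n_reads, 50), per_bin, 100 * poisson_cv(per_bin))
        )
```

With these assumptions 10 million SE50 reads is only about 0.16x coverage, yet still gives a few hundred reads per 100kb bin, which is why copy-number is recoverable while LOH is not.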

The pipeline we are now testing uses the standard Nextera exome prep. However we generate two sequencing libraries at the end of the process. The first is the exome pool, and is sequenced as a 6-12 plex pool using PE125bp reads. The second is a pool of WGS libraries from the left-overs from capture and is sequenced as 20-plex pools using SE50bp reads.

As we are effectively making two libraries from a single prep and using just 50ng of DNA the workflow is very efficient. We do not always have library left over from capture, so we are considering normalising all libraries for capture and then taking a 1ul aliquot; this would reduce capture input by 2.5% and we do not think this should have too much of an impact. The paired nature of genome CNV and exome-seq will allow us to use both in future comparisons and to adopt different analysis methods as they are developed.

How does CNV analysis compare: The data presented below are representative of what we see at 100kb resolution; the analysis was done by Oscar Rueda in the Caldas group. The exome data are from 44M reads and the WGS data from 20M reads (we are sub-sampling to compare more fairly and to see how low we can go). It is clear that the exome data are significantly noisier, and we do seem to see evidence of slightly more CNV calls than expected. However the exomes are generating very useful copy-number data, and this does mean we’ll continue to evaluate whether the extra work and cost are worth it. Currently our view is that the gain in clean data far outweighs the increment in cost.
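
For readers unfamiliar with read-depth CNV, the basic idea behind a 100kb-resolution profile can be sketched in a few lines of Python: count reads in fixed-width bins and express each bin as a log2 ratio against the median bin. This is only a toy illustration with hypothetical function names, not the analysis used for the figure; a real pipeline would also correct for GC content and mappability, normalise for tumour content and ploidy, and segment the profile.

```python
import math
from collections import Counter
from statistics import median


def bin_counts(positions, bin_size=100_000):
    """Count read start positions per fixed-width bin on one chromosome."""
    return Counter(pos // bin_size for pos in positions)


def log2_ratios(counts, n_bins):
    """log2 of each bin count over the median non-empty bin count.

    Empty bins are returned as None rather than minus infinity.
    Raises statistics.StatisticsError if every bin is empty.
    """
    values = [counts.get(i, 0) for i in range(n_bins)]
    reference = median(v for v in values if v > 0)
    return [math.log2(v / reference) if v > 0 else None for v in values]


if __name__ == "__main__":
    # Toy data: four bins, the third carrying twice as many reads (a gain).
    reads_per_bin = [4, 4, 8, 4]
    positions = [
        b * 100_000 + k for b, n in enumerate(reads_per_bin) for k in range(n)
    ]
    print(log2_ratios(bin_counts(positions), n_bins=4))
```

In the toy data the third bin comes out at a log2 ratio of 1.0 and the others at 0.0; with real low-coverage data the same calculation gives a noisy profile whose spread shrinks as the number of reads per bin grows.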

This project has been done collaboratively with Michelle Pugh in my group, and Alejandra Bruna and Oscar Rueda in the Caldas group. Update 2016: this method was used in: Bruna et al Cell 2016: A Biobank of Breast Cancer Explants with Preserved Intra-tumor Heterogeneity to Screen Anticancer Compounds.


  1. Hi - what software do you use for the Genome-CNV?

  2. I also have the same question that Aaron already asked.

  3. The use of the Nextera library ahead of exome sequencing was also proposed in this Genome Research paper earlier this year:

  4. Hi, both analyses were done using the R package CNAnorm.

  5. Cool idea. If you merge the BAM files from WES and WGS, you can use CNVkit to do both steps in one shot. Just choose an off-target bin size to get the resolution appropriate for your off-target and WGS coverage.

    Does this really make LOH analysis impossible? If you do variant calling on the exome and filter the VCF, then LOH is usually still visible, though at a lower resolution.

  6. There are some groups working on LOH using low-coverage WGS to get a genome-wide analysis (it requires some tough stats), but I do like the idea of the combination. Hopefully we'll be able to look at this over the next couple of years as this becomes a standard workflow in the lab.