Comparing CNVs
Image reproduced and enlarged at the bottom of this post.
Copy-number alteration is a driving event in cancer, and as such we've done lots and lots of copy-number analysis. The majority of this has been on microarrays, which even today offer a very cost-effective way to get copy-number variation (CNV) and loss of heterozygosity (LOH) from a sample. The METABRIC study from Prof Carlos Caldas's lab used CNV to investigate the genomic and transcriptomic architecture of 2,000 breast tumours, and we ran over 2,000 Illumina HT12 gene expression arrays and 2,000 Affymetrix SNP6 arrays as part of the study.
We have moved our differential gene expression analysis from microarrays to RNA-seq, and several years ago we considered whether it would be technically possible and cost-effective to use NGS for CNV analysis as well. We discounted this because LOH cannot be detected in low-coverage sequencing. But about two years ago several projects, some great papers, and a beta-test of Illumina’s Nextera exome method made us reconsider our decision.
Exome-seq for CNV: Exomes allow low-cost analysis (compared to genomes) of large numbers of samples and can detect low-frequency single nucleotide variants and small InDels; in cancer this allows us to analyse tumour heterogeneity. Exome sequencing relies on the capture and enrichment of exonic fragments from a whole genome sequencing library, using biotinylated oligonucleotide baits. The amount of PCR, both pre- and post-capture, and the use of one or two hybridisations add significant intra- and inter-sample technical variability due to strong batch effects. As such the use of CNV tools developed for WGS is not recommended, and multiple methods have been developed specifically for exome CNV analysis.
Multiple exome-seq copy-number papers make the point that calling CNV from exome data is more cost-effective than from WGS; however, these assume that 30x coverage will be used. Most of the exome-CNV tools assume that the technical variability is systematic and can be removed by using controls or paired samples, or by removing some of the worst “noise”. At least four comparisons of exome CNV tools have been published. Together these reviewed the performance of nine tools: VarScan 2, eXome Hidden Markov Model (XHMM), ExomeCNV, CONTRA, ExomeCopy, ExomeDepth, ExCopyDepth, CoNIFER and cn.MOPS. One of the comparison papers reported that some of the exome CNV tools did not perform fantastically when compared to array CGH.
Exome sequencing also generates large amounts of off-target data that is often discarded. We have seen the number of useable bases, i.e. exonic bases, drop as we have moved to a standard run length of paired-end 125bp. Essentially we are sequencing libraries with average insert sizes under 250bp, so almost all fragments have some overlap, many reads are from non-exonic bases (introns, UTRs) and many fragments will read through into adapter; this can total over 50% of bases. One group has specifically addressed the off-target bases as useful in detecting CNV and developed the cnvOffSeq tool.
A novel method for “exome-CNV”: When testing the new Nextera exome kits from Illumina I noticed that we had large amounts of library that was not needed for exome capture. I suggested that we could use this to perform low-coverage sequencing to detect copy-number. Although we would have to drop LOH analysis, the CNV data could be generated for perhaps £20-50 per sample, assuming we used 10-20 million single-end 50bp reads. I was inspired by seeing clear HER2 amplifications in WGS libraries on a MiSeq QC run where we had just 1M reads per sample!
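To put those read numbers in context, here is a rough back-of-envelope sketch in Python. It is only arithmetic, not part of our pipeline, and it rests on assumptions I am adding for illustration: a human genome of about 3.1 Gb, every read mapping uniquely and uniformly, and 100kb bins (the resolution used in the comparison further down). The function names are hypothetical.

```python
# Back-of-envelope numbers for low-coverage WGS copy-number sequencing.
# Assumptions (not from the pipeline itself): ~3.1 Gb genome, all reads
# map uniquely and are spread evenly across the genome.

GENOME_SIZE_BP = 3.1e9  # assumed approximate human genome size


def mean_coverage(n_reads, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_per_bin(n_reads, bin_size_bp, genome_size_bp=GENOME_SIZE_BP):
    """Expected number of reads starting in one bin of the given width."""
    return n_reads * bin_size_bp / genome_size_bp


for n_reads in (1_000_000, 10_000_000, 20_000_000):
    cov = mean_coverage(n_reads, 50)          # single-end 50bp reads
    per_bin = reads_per_bin(n_reads, 100_000)  # 100kb bins
    # Pure counting (Poisson) noise: coefficient of variation = 1/sqrt(mean)
    cv = per_bin ** -0.5
    print(f"{n_reads:>12,d} reads: {cov:.3f}x coverage, "
          f"~{per_bin:.0f} reads per 100kb bin, Poisson CV ~{cv:.1%}")
```

Under these assumptions even 20M reads is only about a third of 1x coverage, which is why LOH is out of reach, yet each 100kb bin still collects several hundred reads, which is plenty for a read-depth copy-number estimate. Real data will be noisier than the Poisson figure because of GC content, mappability and PCR duplicates.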
The pipeline we are now testing uses the standard Nextera exome prep. However, we generate two sequencing libraries at the end of the process. The first is the exome pool, and is sequenced as a 6-12-plex pool using PE125bp reads. The second is a pool of WGS libraries from the left-overs from capture, and is sequenced as 20-plex pools using SE50bp reads.
As we are effectively making two libraries from a single prep and using just 50ng of DNA, the workflow is very efficient. We do not always have library left over from capture, so we are considering normalising all libraries for capture and then taking a 1ul aliquot for WGS; this would reduce capture input by 2.5% and we do not think this should have too much of an impact. The paired nature of genome CNV and exome-seq will allow us to use both in future comparisons and to adopt different analysis methods as they are developed.
How does CNV analysis compare: The data presented below are representative of what we see at 100kb resolution; the analysis was done by Oscar Rueda in the Caldas group. The exome data is from 44M reads and the WGS is from 20M reads (we are sub-sampling to compare more fairly and to see how low we can go). It is clear that the exome data are significantly noisier, and we do seem to see evidence of slightly more CNV calls than expected. However the exomes are generating very useful copy-number data, and this does mean we’ll continue to evaluate if the extra work and cost are worth it. Currently our view is that the gain in clean data is much more than the increment in cost.
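For readers new to read-depth copy-number analysis, the general idea can be sketched in a few lines of Python. This is a minimal illustration and not the analysis used for the figure: the function names and the toy counts are made up, and it normalises each bin against the sample's own median rather than correcting for GC content, mappability or a matched normal as real tools do.

```python
import math
from collections import Counter
from statistics import median


def bin_counts(read_starts, bin_size, n_bins):
    """Count reads per fixed-width bin from 0-based read start positions."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]


def log2_ratios(counts, pseudocount=0.5):
    """Per-bin log2 ratio against the sample median bin count.

    The pseudocount avoids log2(0) for empty bins. Assumes most of the
    genome is copy-number neutral, so the median bin represents the
    baseline copy number.
    """
    centre = median(counts)
    if centre <= 0:
        raise ValueError("median bin count must be positive")
    return [math.log2((c + pseudocount) / (centre + pseudocount))
            for c in counts]


# Toy example (invented numbers): seven 100kb bins, two of which carry
# roughly four times the baseline read depth, as an amplification might.
toy_counts = [100, 98, 102, 400, 405, 99, 101]
for i, ratio in enumerate(log2_ratios(toy_counts)):
    print(f"bin {i}: log2 ratio {ratio:+.2f}")
```

Bins near a log2 ratio of zero sit at the baseline copy number, and a run of adjacent bins well above or below zero is what a segmentation step would then call as a gain or loss. The noisier the per-bin ratios, the harder that segmentation becomes, which is the difference we see between the exome and WGS profiles.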
This project has been done collaboratively with Michelle Pugh in my group, and Alejandra Bruna and Oscar Rueda in the Caldas group.
Update 2016: this method was used in Bruna et al., Cell 2016: A Biobank of Breast Cancer Explants with Preserved Intra-tumor Heterogeneity to Screen Anticancer Compounds.
Hi - what software do you use for the Genome-CNV?
I also have the same question that Aaron already asked.
The use of the Nextera library ahead of exome sequencing was also proposed in this Genome Research paper earlier this year: http://www.ncbi.nlm.nih.gov/pubmed/25236618
Hi, both analyses were done using the R package CNAnorm.
Cool idea. If you merge the BAM files from WES and WGS, you can use CNVkit to do both steps in one shot (https://github.com/etal/cnvkit). Just choose an off-target bin size to get the resolution appropriate for your off-target and WGS coverage.
Does this really make LOH analysis impossible? If you do variant calling on the exome and filter the VCF, then LOH is usually still visible, though at a lower resolution.
There are some groups working on LOH using low-coverage WGS to get a genome-wide analysis (it requires some tough stats), but I do like the idea of the combination. Hopefully we'll be able to look at this over the next couple of years as this becomes a standard workflow in the lab.