Copy-number analysis is a useful tool for many researchers, and we use it a lot for the analysis of tumour samples. In the past this was done using SNP arrays, e.g. Affymetrix SNP6.0 in METABRIC, but today we're generally using low-coverage whole genome sequencing and tools like QDNAseq. I've posted before about our use of low-coverage WGS in our exome pipeline. Most recently we've got groups doing low-coverage WGS on large numbers of samples purely for copy-number analysis.
Low-coverage WGS makes CNV-seq fast and cheap, but a recent Genome Research paper suggests some great methodological improvements that push costs down to very low levels: SMASH, a fragmentation and sequencing method for genomic copy number analysis.
Figure: WGS and SMASH generate highly concordant CNV calls.
SMASH vs WGS: Both low-coverage WGS and SMASH use the same idea: map short reads to the genome and count how many fall into each genomic bin to estimate the copy-number for that bin. The clever bit about SMASH is that by using longer libraries made from concatamerised DNA, multiple fragments are sequenced in a single read. Essentially the same low-coverage WGS data is generated, but from far fewer reads, and that is what keeps costs down. The use of chimeric libraries probably removes any chance of detecting translocations and other non-CNV structural variants, but SMASH is designed to do CNV-seq as cheaply as possible.
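The bin-counting idea shared by both methods can be sketched in a few lines of Python. This is only an illustration: the fixed-width bins, the median normalisation and the function names `bin_counts` and `copy_ratio` are my assumptions, not what the paper or any particular tool does.

```python
from collections import Counter
from statistics import median

def bin_counts(positions, bin_size):
    """Count mapped fragments per fixed-width genomic bin.

    positions: iterable of (chromosome, 0-based start) tuples.
    Bins that receive no fragments are simply absent from the result.
    """
    counts = Counter()
    for chrom, start in positions:
        counts[(chrom, start // bin_size)] += 1
    return counts

def copy_ratio(counts):
    """Scale each bin's count by the median count over the observed bins."""
    centre = median(counts.values())
    return {bin_id: n / centre for bin_id, n in counts.items()}

# Toy data: the middle bin holds twice as many fragments as its neighbours.
toy_positions = [("chr1", 10), ("chr1", 20),
                 ("chr1", 150), ("chr1", 160), ("chr1", 170), ("chr1", 180),
                 ("chr1", 250), ("chr1", 260)]
ratios = copy_ratio(bin_counts(toy_positions, bin_size=100))
print(ratios)
```

In this toy example the middle bin comes out with a ratio of 2.0 relative to its neighbours, which is the sort of signal a copy-number caller would then segment. It makes no difference to this step whether the fragment positions came from one read each (WGS) or several per read (SMASH).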
Short Multiply Aggregated Sequence Homologies (SMASH) borrows heavily from the ideas behind SAGE, which was used for cDNA analysis. SMASH starts with random fragmentation of genomic DNA to around 40bp. The fragments are ligated to create concatamers of around 400bp-700bp in length, which are sequenced using longer paired-end reads. These chimeric reads are mapped using a strategy described in the paper. Briefly, they look for the "maximal almost-unique match" for each read and generate read-depth counts in genomic "bins" for copy-number calling. The use of chimeric libraries has an added benefit: because each counted read comes from a unique DNA fragment, this should remove the impact of PCR duplication and/or bias on CNV calling.
Although the authors discuss some of their wet-lab and dry-lab automation, they do not mention HiSeq 4000 or patterned flowcells. This is likely to be an issue: to work optimally on patterned flowcells the SMASH libraries will need to be less variable in length, at around 300-500bp.
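To give a feel for what splitting a chimeric read involves, here is a deliberately simplified Python toy. It is not the paper's "maximal almost-unique match" algorithm: I have swapped in a greedy search for the longest exact match that occurs exactly once in a small reference string. The names `occurrences` and `split_chimeric_read`, the `min_len` cut-off and the random toy reference are all assumptions made for illustration.

```python
import random

def occurrences(reference, segment):
    """Count exact occurrences of segment in reference, overlaps included."""
    count = 0
    start = reference.find(segment)
    while start != -1:
        count += 1
        start = reference.find(segment, start + 1)
    return count

def split_chimeric_read(read, reference, min_len=12):
    """Greedily split a read into exact matches that occur once in reference.

    Returns a list of (read_offset, reference_position, length) tuples.
    """
    hits = []
    i = 0
    while i + min_len <= len(read):
        # Grow the match from offset i for as long as it still occurs in the reference.
        length = min_len
        while i + length <= len(read) and read[i:i + length] in reference:
            length += 1
        length -= 1  # longest length that matched (min_len - 1 if nothing matched)
        segment = read[i:i + length]
        if length >= min_len and occurrences(reference, segment) == 1:
            hits.append((i, reference.find(segment), length))
            i += length  # jump past this fragment and look for the next one
        else:
            i += 1       # no unique match here, slide along by one base
    return hits

# Toy chimeric read: two 20bp pieces from distant parts of a random reference.
random.seed(1)
reference = "".join(random.choice("ACGT") for _ in range(2000))
read = reference[100:120] + reference[1500:1520]
for offset, position, length in split_chimeric_read(read, reference):
    print(offset, position, length)
```

Each reported hit is one countable fragment, so a single read contributes several positions to the bin counts. A real implementation would need an index rather than repeated string searches, and would have to cope with sequencing errors, both strands and repeats.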
The cost of SMASH: Because SMASH squeezes 4-6 unique DNA fragments into each read-pair, it brings the cost of sequencing down to very low levels. Assuming a cost per lane of £600 for single-end 50bp reads and £1200 for paired-end 150bp reads on HiSeq 4000, we can count around 350M DNA fragments with low-coverage WGS (SE50), but over 2 billion with SMASH (PE150). For copy-number analysis using 10M reads per sample this is equivalent to £35 for low-coverage WGS, and just £10 for SMASH!
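The arithmetic behind a per-sample comparison like this can be laid out explicitly. The sketch below uses the lane prices and the 350M figure from the paragraph above; that a PE150 lane also yields about 350M read-pairs, that "10M reads per sample" means 10M counted fragments, and the function name `cost_per_sample` are my assumptions. It counts lane cost only, so it comes out lower than the £35 and £10 quoted above, which presumably fold in costs that are not itemised here.

```python
def cost_per_sample(lane_cost, reads_per_lane, fragments_per_read, fragments_per_sample):
    """Return (cost per sample, samples per lane), counting lane cost only."""
    fragments_per_lane = reads_per_lane * fragments_per_read
    samples_per_lane = fragments_per_lane // fragments_per_sample
    return lane_cost / samples_per_lane, samples_per_lane

READS_PER_LANE = 350_000_000      # reads (or read-pairs) per lane, assumed equal for SE50 and PE150
FRAGMENTS_PER_SAMPLE = 10_000_000  # counted fragments needed per sample

wgs_cost, wgs_samples = cost_per_sample(600, READS_PER_LANE, 1, FRAGMENTS_PER_SAMPLE)
print(f"Low-coverage WGS: {wgs_samples} samples per lane, £{wgs_cost:.2f} each")

# SMASH packs 4-6 fragments into each read-pair, so show both ends of that range.
for per_pair in (4, 6):
    cost, samples = cost_per_sample(1200, READS_PER_LANE, per_pair, FRAGMENTS_PER_SAMPLE)
    print(f"SMASH, {per_pair} fragments per pair: {samples} samples per lane, £{cost:.2f} each")
```

Whatever the absolute numbers, the point is that the saving scales with the number of fragments packed into each read-pair, offset by the higher price of a paired-end 150bp lane.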
Using SMASH for RNA-seq, ChIP-seq & cfDNA: It might be very little work to adapt SMASH for RNA-seq. The protocol would be fragmentation of oligo-dT selected mRNA to 40bp, ligation/concatamerisation of the RNA fragments (similar to miRNA preps), random primed cDNA synthesis to around 400bp, and adapter ligation of the ds-cDNA. As the ligation can only happen once for any RNA fragment, we should get data very similar to RNA-seq using unique molecular identifiers, but without the hassle. Our RNA-seq sequencing costs would drop from £35 to £10 per sample, but we'd need library prep costs to come down too, and with a more complex protocol the cost benefits of SMASH may be offset. But if the removal of PCR duplicates from RNA-seq is valuable enough, then SMASH for RNA may be a significant step forward.
For ChIP-seq we could also concatenate the short DNA fragments released after immuno-precipitation. It may be even better to combine SMASH with a ChIP-exo protocol to better define the binding sites of transcription factors and other DNA binding proteins. Exonuclease V digestion of the genomic DNA not protected by the bound protein, followed by end-repair, concatamerisation and SMASH sequencing should work pretty well. And again it should remove the issue of PCR duplicates.
For cfDNA the fragment length of around 160bp means the sample needs further fragmentation to work in a SMASH protocol, whereas today's cfDNA methods simply ligate adapters directly to cfDNA, giving a very simple protocol. However, cfDNA in urine is significantly smaller at around 50bp, and may be perfect for SMASH? The use of urine would make minimally invasive tests for pregnancy and/or tumour monitoring truly non-invasive, and possibly push costs down even further?