Thursday, 8 September 2011

Sequencing versus arrays

This is the big question for many labs today, especially core facilities investing in technology for the longer term. Everyone wants to know when we might replace microarrays with NGS for applications like differential gene expression, splicing analysis, allele-specific expression, copy number variation, loss of heterozygosity, miRNA expression, etc.
I have been considering for some time the question of when CNV-Seq will replace SNP arrays for CNV and structural variation analysis, or RNA-Seq will replace gene expression arrays. This is part of my job as the head of a Genomics core, and it is a bit I really like about the job. From a quality of data perspective I'd say we should have stopped using arrays a year or so ago. However, other factors like nucleic acid requirements, ease of sample processing or analysis, cost of data generation, maturity of analysis tools, etc. need to be considered before we can make the switch.
Over at GenomeWeb they interviewed several researchers about the rapid take-up of array CGH in clinical labs. The interview discusses how a decision on using one technology over another is not always easy. They mention similar pros and cons as I do above; some are objective whilst others are subjective, and each lab can come up with a different answer. In the interview John Iafrate, an assistant professor at Harvard Medical School, says it's too early to tell if NGS will be more sensitive than aCGH, as there is not enough data yet and CNV algorithms are still developing and not widely adopted either.
Sequencing vs Arrays: For this post I am going to focus primarily on differential gene expression and copy number analysis and the costs associated with generating the primary data.
How much do gene expression arrays 'cost': In my lab we run lots of Illumina HT12 arrays, over 4000 in the last five years. We have an internal charge of £125 that includes the array, labeling of cRNA, personnel and service contracts on the scanner. This is not FEC accounting but we try to cover the major expenditures. The HT12 format is very amenable to high-throughput processing and we typically run 48-192 samples as a batch using 250ng RNA as input. We have a fantastic Bioinformatics core that have implemented an analysis pipeline based on beadarray to generate lists of differentially expressed genes. Usually gene lists are ready for the user to look at 5-7 days after receipt of samples.
How much do snpCGH arrays 'cost': I have been involved in outsourcing some snpCGH projects from my lab. We have used Aros in Denmark and they provide Affymetrix SNP6.0 and Illumina Omni1-Quad arrays at £250-350 and £290-360 each respectively, dependent on volume. Aros are a commercial service provider and these costs include the arrays and service provision. They typically run 96 samples per batch as a minimum using 500ng DNA as input. We have had very good experiences with them so far. Our Bioinformatics core uses a pipeline based around Affymetrix Power Tools (APT) and DNAcopy (Bioconductor) to generate segmented copy number calls, as well as structural variation and LOH analyses. Again data can be ready in a matter of weeks, usually 2-4 from sample receipt by Aros.
How much does the sequencing cost? I generated the charts below to show what the cost per sample of a CNV/SV-Seq or RNA-Seq data set might be. I assumed we would use PE50bp runs, that 10M, 20M or 100M reads are required to generate data at the same quality as an array and included sample prep using TruSeq. Illumina TruSeq v3 SBS chemistry yields about 300M reads per HiSeq lane. There are two charts as 100M reads makes the axis too difficult to interpret for the 10M and 20M read costs.

A PE50 lane costs £750 at CRI so we can run 3 multiplexed samples per lane for the same cost as a snpCGH array (~£300) and get 100M reads per sample. Or we can run 10 multiplexed samples per lane for the same cost as an HT12 array (£125) and get 30M reads per sample. If we really can perform differential gene expression with only 3M reads then we can multiplex 96 samples per lane. Pretty cool I think!
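The multiplexing arithmetic above can be sketched in a few lines of Python. This is only a rough illustration using the lane cost (£750) and lane yield (~300M reads) quoted in this post. The function names and the `prep_cost_gbp` parameter are made up for the example, and because no per-sample TruSeq prep cost is given here it defaults to zero, so the printed figures cover sequencing only.

```python
# Figures quoted in the post
LANE_COST_GBP = 750            # cost of one PE50 lane
READS_PER_LANE = 300_000_000   # approximate TruSeq v3 yield per HiSeq lane


def per_sample(samples_per_lane, prep_cost_gbp=0.0):
    """Return (reads per sample, cost per sample in GBP) for an evenly split lane.

    prep_cost_gbp is a hypothetical per-sample library prep cost; the post
    does not state one, so it defaults to zero.
    """
    reads = READS_PER_LANE / samples_per_lane
    cost = LANE_COST_GBP / samples_per_lane + prep_cost_gbp
    return reads, cost


def max_multiplex(reads_needed):
    """Largest number of samples per lane that still gives reads_needed each."""
    return int(READS_PER_LANE // reads_needed)


# The three multiplexing levels discussed in the post
for n in (3, 10, 96):
    reads, cost = per_sample(n)
    print(f"{n:>2} samples/lane: {reads / 1e6:.1f}M reads, "
          f"£{cost:.2f} sequencing per sample")
```

An even split is of course optimistic: real multiplexed lanes never balance perfectly, so some headroom on the read target is sensible.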
So why have we not just switched? I have followed the developments in snpCGH, mainly Omni and Axiom. And as we started to look at changing to a newer array in the institute I immediately thought we might skip straight to NGS instead. The cost of generating a sequencing data set is dropping significantly (see above); however, the other costs need to be taken into account. The sample prep and analysis pipelines are issues that I think need particularly careful consideration. The copy number analysis could probably be done with low coverage sequencing data sets, perhaps just 4-fold genome coverage? But structural variation and LOH may require more data, and 1M SNPs from an array are cheap today. I do not believe it is clear what coverage is required to generate CNV and LOH data of the same quality as arrays. For gene expression the outlook is more promising. As long as users want differential gene expression only, then as few as 1-5M reads could be enough to generate data comparable to arrays.
Yet today we still have recommendations to use 1ug of nucleic acid for these sample prep protocols (I know they are dropping). And the analysis tools are still developing very rapidly so pipelines are slow to develop.
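To relate the "4-fold genome coverage" figure mentioned above to read counts, here is a back-of-the-envelope sketch. It uses the simple mean coverage estimate (total sequenced bases divided by genome size) and assumes a human genome of roughly 3.1 Gb, which is my assumption for the example and not a number from this post. It also ignores duplicates, unmappable reads and uneven coverage, so treat the output as a lower bound on the sequencing needed. Note that the post does not say whether its read counts are single reads or read pairs, so the `paired` flag lets you try both interpretations.

```python
import math

HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate human genome size


def mean_coverage(n_fragments, read_length=50, paired=True,
                  genome_bp=HUMAN_GENOME_BP):
    """Mean fold coverage from n_fragments sequenced at read_length bp.

    With paired=True each fragment contributes two reads (e.g. PE50).
    """
    ends = 2 if paired else 1
    return n_fragments * read_length * ends / genome_bp


def fragments_for_coverage(target_fold, read_length=50, paired=True,
                           genome_bp=HUMAN_GENOME_BP):
    """Number of fragments needed to reach target_fold mean coverage."""
    ends = 2 if paired else 1
    return math.ceil(target_fold * genome_bp / (read_length * ends))


print(f"PE50 fragments for 4-fold coverage: {fragments_for_coverage(4):,}")
print(f"Mean coverage from 100M PE50 fragments: "
      f"{mean_coverage(100_000_000):.1f}-fold")
```

Under these assumptions, 4-fold coverage is in the same ballpark as the 100M read scenario in the cost charts if "reads" there means read pairs, and roughly twice that if it means single reads.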
I hope we can be one of the first institutes to say we've swapped from arrays to sequencing for general GX or CNV analysis, and I think we're pretty close to doing so. However, it is always going to be after careful consideration of what is best for the project being discussed. Some of those projects are going to use arrays for at least another couple of years.
PS Ideas on LOH analysis from low coverage sequencing data: I do wonder if LOH might be possible from low coverage sequencing, if the analysis does not focus on high quality individual SNP calls but instead uses a haplotyping approach based on low coverage and lower quality SNP calls over windows of 10-100 SNPs? Is anyone working on this?


  1. Wow, do you really think 1-5M reads is enough for diff exp? Are you talking 'standard' RNA-Seq, or some sort of tag-based approach? I'd generally recommend at least 10M (but I admit I don't have the data to back it up).

  2. I am not sure and would have said 10M from data published over a year ago by Manolis Dermitzakis whilst at Sanger.
    However in 2011 Illumina published this whitepaper on their website
    I'd have thought you'd already know about that article? Care to comment on what it says?

  3. We're just about to order our first custom amplicon project for the MiSeq. We have included in our panel a couple of genes in which the predominant mutation is either a large deletion or duplication. What do you think are the chances of being able to make some analysis of CNV using the MiSeq and a few representative amplicons from both within and outside of the deleted/duplicated region? Is it worth a go?