Thursday, 8 December 2011

Reference based compression: why do we need to do it?

We are producing a lot of sequence data (see the GoogleMap and a previous post), and we will continue to produce a lot more: Illumina have just made noises about 1TB runs on HiSeq using PE100bp runs (whether they will actually release this given recent stock market downgrades is unclear). Computers aren't keeping up: we are running out of space to store the data and bandwidth to move it around (Moore's law - compute power grows 60% annually; Kryder's law - data storage grows 100% annually; Nielsen's law - internet bandwidth grows 50% annually).

So we need to delete some data but what can we afford to throw away?

Reference based compression:
Ewan Birney’s lab at EBI published a very nice paper earlier this year presenting a reference-based compression method: new sequences are aligned to a reference and only the differences are encoded, rather than storing the raw data. At the 3rd NGS congress I had lunch with Guy Cochrane and two other sequencing users and we discussed some of the issues in compressing data. I had to apologise that I’d missed Guy’s talk, but it turned out his was at the same time as mine so at least I had a good excuse!

Efficient storage of high throughput sequencing data using reference-based compression. Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, et al. Genome Res 2011.

One of the discussions in compression of sequence data is what to use as a reference and how far to go in compressing the data. Quite simply, the current Human genome is not quite good enough: there are large amounts of sequence in a newly sequenced genome which don’t match the current Hg19 reference. The reference is improving, though, as more genome projects release data that can be incorporated.
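As a toy illustration of the basic idea (my own sketch, not the actual encoding from the paper), an aligned read can be stored as a mapping position plus its mismatches against the reference, and reconstructed from those on demand:

```python
# Toy sketch of reference-based read compression: store each aligned
# read as (position, length, mismatches) instead of the full sequence.
# Real implementations also handle indels, clipping and qualities.

def compress_read(reference, read, pos):
    """Encode an aligned read as (pos, length, [(offset, base), ...])."""
    diffs = [(i, b) for i, b in enumerate(read)
             if reference[pos + i] != b]
    return (pos, len(read), diffs)

def decompress_read(reference, record):
    """Rebuild the original read from the reference and the diff record."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGT"
read = "ACGAACGT"                       # one mismatch at offset 3
record = compress_read(reference, read, 0)
print(record)                           # -> (0, 8, [(3, 'A')])
assert decompress_read(reference, record) == read
```

For a read that matches the reference perfectly, the stored record is just a position and a length, which is where the big savings come from.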

One interesting comment in the paper for me is the exponential increase in efficiency of storage with longer read lengths. Longer reads are good for several other reasons, there is a significant drop in cost per bp, an increase in mappability and greater likelihood of resolving structural variants or splice isoforms with a single read. It looks like long read sequencing is going to be cheaper and better for us to use for many different reasons.

Today we still use Phred encoding (http://en.wikipedia.org/wiki/Phred_quality_score), and yet nearly all users remove data below Q30 (I am not sure anyone really cares about the difference between Q30 and Q40 when doing analysis). As such we may well be able to compress read quality information by reducing the range encoded to a handful of values (e.g. Q10, Q20 ... Q50), or even lower to just two bins: below Q30 and Q30-or-above.
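The two-bin idea is simple enough to sketch in a few lines. This is my own illustration, assuming Phred+33 ASCII quality strings and arbitrary "H"/"L" bin labels:

```python
# Lossy quality-score binning: collapse the full Phred range down to
# two symbols, below-Q30 ("L") vs Q30-and-above ("H").
# Assumes Phred+33 ASCII encoding; threshold and labels are illustrative.

def bin_quality(qual_string, threshold=30, offset=33):
    """Re-encode an ASCII Phred quality string as a two-symbol string."""
    return "".join("H" if ord(c) - offset >= threshold else "L"
                   for c in qual_string)

quals = "IIIII#####"        # five Q40 bases followed by five Q2 bases
print(bin_quality(quals))   # -> "HHHHHLLLLL"
```

A two-symbol alphabet needs only one bit per quality value before any further entropy coding, against the 40-odd distinct values in a raw Phred string.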

At a recent meeting Illumina presented some of their work on compressing data and their experiences in reducing the range of Qscores encoded when doing this. They saw:
  • 8 bits/base in the current Illumina bcl files
  • 7.3 bits/base are actually used in bcl for up to Q40 scores
  • 5-6 bits/base can be achieved by compressing bcl files (lossless compression)
  • 4 bits/base can be achieved with the same methods as above if only 8 Q scores are used.
  • 2 bits/base can be achieved if no quality scores are kept (lossy)
  • <1 bit/base if a BWT (Burrows-Wheeler transform) plus zero-quality compression are used (lossy). Apparently the BWT can be computed on the instrument and still allows someone to decompress and realign the data later on - sounds good to me.
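The 2 bits/base figure is easy to see: once quality scores are discarded, each of the four bases fits in two bits, so four bases pack into one byte. A minimal packing sketch (my own illustration, not Illumina's method):

```python
# Pack a DNA string at 2 bits/base: four bases per byte.
# The last group is padded with "A"; the caller keeps the true length
# so the padding can be trimmed when unpacking.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")   # pad the final group
        byte = 0
        for b in chunk:
            byte = (byte << 2) | CODE[b]
        out.append(byte)
    return bytes(out)

def unpack(data, n):
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 3])
    return "".join(bases[:n])

seq = "GATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
assert len(packed) == 2          # 7 bases -> 2 bytes instead of 7
```

Getting below 1 bit/base then relies on the BWT grouping similar contexts together so a downstream entropy coder can exploit the redundancy.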

Compression of cancer genomes:
The discussion over lunch got me thinking about what we can do with cancer genomes. Nearly all cancer genome projects, like ICGC, are sequencing matched tumour:normal pairs, and many of these pairs turn out to be very similar across most of the genome. Using the same basic idea as presented in the Birney group paper would allow us to sequence two genomes, the tumour:normal pair, but only store the data for one plus the differences. We should be able to implement this quite rapidly and reduce the storage burden on these projects; later on, as Human genome reference-based compression gets better, we might be able to store even less. Birney's group also report that 10-40% of reads in a new genome sequence do not map to the reference, and these are obviously harder to compress. With T:N pairs the rate of unmapped data should be significantly lower.
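In its crudest form, using the normal as the reference for the tumour means storing one sequence in full and the other only as a list of substitutions. A toy model (it ignores indels, copy number changes and rearrangements, which real pairs certainly have):

```python
# Toy tumour:normal compression: keep the normal sequence in full and
# store the tumour only as substitutions against it.

def tumour_diffs(normal, tumour):
    """Substitutions that turn the normal sequence into the tumour."""
    return [(i, t) for i, (n, t) in enumerate(zip(normal, tumour))
            if n != t]

def rebuild_tumour(normal, diffs):
    """Reconstruct the tumour sequence from the normal plus the diffs."""
    bases = list(normal)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)

normal = "ACGTACGTACGT"
tumour = "ACGTACCTACGT"      # one somatic substitution at position 6
diffs = tumour_diffs(normal, tumour)
print(diffs)                 # -> [(6, 'C')]
assert rebuild_tumour(normal, diffs) == tumour
```

Because somatic variants are sparse relative to genome size, the diff list stays tiny compared with storing the tumour sequence outright.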

Compression of other projects:
What about sequencing other than whole genomes? It may well be that we can use the increasing numbers of replicates in RNA-seq and ChIP-seq experiments to reduce the storage burden of the final data. Replicates could be compressed against each other: a triplicate experiment could be stored as a single sequence file, plus the variable positions at specific loci and the few reads that differ across the genome.
