Thursday 22 December 2011

Merry Christmas from Core genomics

Merry Christmas and a happy New Year to everyone that read my blog this year. I only started in May and have had some great feedback. I'd seriously encourage others to start blogging too. It's great fun and the discipline of trying to write something every week is tough but I hope my writing skills are improving.

Good luck with your research next year, I look forward to reading about it.

Predictions for 2012:
5M reads and 400bp on PGM.
20M reads and 500bp on MiSeq.
A PacBio eukaryotic genome.
Goodbye GAIIx.
$1500 genome.
$250 exome.
NGS in the clinic using amplicon methods.
Fingers crossed we are going to hear a lot more from Oxford Nanopore in 2012 as well; it may well be their year.

See you at AGBT.

Wednesday 21 December 2011

Reading this will make you itch, the 1000 head louse genome project.

It’s that time of year when kids come home from school and start complaining of “the itches”; the nits are back.
I don’t know about the rest of the world but here in the UK we used to have “nit nurses” who would go from school to school checking for headlice. As a kid I remember lining up with the other kids in class and having her fingers run through my hair looking for eggs and lice. It may be strange but for me it is a lovely memory!

Nit genomics: This year when the school sent home the inevitable letter that head lice had been confirmed in my daughter's class my thoughts turned to the louse genome (I am aware this is a bit nerdy). Has it been sequenced and what might we learn about the spread of nits and individual susceptibility through genomic analysis? After all, the Belly Button Biodiversity project turned out to be pretty interesting, didn’t it?

The nit's close relative, the body louse (Pediculus humanus humanus), was sequenced using Sanger shotgun methods in 2010. 1.3M. It is a very AT rich genome. It has the smallest insect genome yet sequenced and apparently is the first animal genome to be shown to have fragmented mitochondrial mini-chromosomes.

The head louse genome is only 108Mb. As these parasites are generally quite prolific it should be possible for me to collect a reasonable number from each of my kids' heads, and from mine and my wife's, over the few weeks I am combing them out with conditioner (wet combing is as effective as any insecticide-based treatment). I got four or five this morning from my daughter!

Ideally one would collect only larvae that have not yet started sucking blood to avoid having to sequence some of the Human genome as well (although I am not certain how much Human DNA would contaminate each louse).

With this sample it might be possible to get some idea of the population structure within a school, possibly through some molecular barcoding once we have good genes to target. Perhaps we can learn something about the spread of this organism through a community. As it is pretty harmless it should be easy to collect samples from schools all over the world. Are the head lice in Wymondham different from the lice in Waikiki? Do they even have lice in Waikiki?

If we could look deeper into the host could we find susceptibility loci and would screening of more susceptible individuals reduce the outbreaks we see each year? What else might we learn about this host:parasite interaction? Are different blood groups more or less affected by lice? There are so many questions we might answer.

I am not certain I will get the time to pursue this project but if there is an enterprising grad student that wants to take this on do get in touch.

Nit biology and evolution: I am writing this just because I am almost certainly never going to write about it again but I want to make sure I can explain it to my kids! Most of this comes from two papers I'd encourage you to read so see the references at the end.

Nits and lice are hemimetabolous rather than holometabolous insects. That is, they develop from nymphs to adults rather than going through a larva–pupa–adult transformation. The holometabolous strategy allows larvae and adults to occupy different ecological niches and as such has proven highly successful. However the niche occupied by nits is the same regardless of life-cycle stage. Nits are a strict Human obligate-ectoparasite and are provided with a homogeneous diet (our blood) and few xenobiotic challenges. As such it appears that lice were able to reduce their genome size by losing genes associated with odorant and chemo-sensing; they have 10-fold fewer odorant genes than other insects sequenced and relatively few detoxification-enzyme encoding genes. Basically they don't need to find food or deal with harmful plant toxins.

Lice have been in contact with us for a long time and Human and Chimpanzee lice diverged at the same time as we did from our common ancestor, about 5-6M years ago. We have been living and evolving together ever since with the body louse evolving relatively recently as we began to wear clothing. A paper by Kittler et al used a molecular clock analysis with 2 mtDNA and 2 nuclear loci across diverse human and chimpanzee lice. They saw higher diversity in African lice, similar to Human diversity, and estimated that body lice evolved around 70,000 years ago. They said this correlated with the time when Humans left Africa. I guess we had to wear something when we moved into Europe as it’s a whole lot colder than Africa.

Kirkness et al. Genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle. PNAS 107(27):12168–12173. (2010)
Kittler et al. Molecular evolution of Pediculus humanus and the origin of clothing. Curr Biol 13:1414–1417. (2003)

Friday 9 December 2011

Getting on the map

A recent advertising flyer from Life Tech seems to borrow from the GoogleMap of next-gen sequencers by suggesting to their community to "get on the map". The backdrop to the ad is a map of the world with highlighted areas where the reader might assume an Ion PGM is located (although the ad does not specifically claim this).

Imitation is the sincerest form of flattery:
The Ion Torrent backdrop locations match reasonably closely to the data on the GoogleMap (added by PGM owners) with respect to numbers of PGM machines by continental location.
Ion Torrent PGMs on the GoogleMap

Here is a comparison of the two sources:
North America 36 (ad) vs 39 (map)
South America 2 vs 0
Europe 31 vs 26
Africa NA (not visible) vs 2
Asia/India 21 vs 6
Australia 7 vs 17

I am all for encouraging users to register on the map. The feedback we tend to get is that coverage is quite representative, even if it captures only 60-70% of actual machine installations.

We will be updating the site in the next few months and I'd encourage you to add your facility or update it with your new toys.

We'd also be happy to get feedback from users about what they want to see on the map in the future.

RNA-seq and the problem with short transcripts

There are over 9000 Human transcripts <200bp in length, which is about 5% of the total number of transcripts. When analysing some recent RNA-seq data here at CRI we noticed that only 17 of almost 30,000 detected transcripts are shorter than 200bp. That is about 100 times fewer than might be expected.
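The size of that shortfall is easy to check. Below is a minimal sketch using only the rounded figures quoted above, and assuming (purely for illustration) that short transcripts would be detected in proportion to their share of the annotation, which ignores expression level entirely. The function name is hypothetical.

```python
# Rough check of the short-transcript shortfall using the rounded figures above.
# Assumption: short transcripts would be detected in proportion to their share
# of all annotated transcripts (about 5%); expression level is ignored.

def fold_depletion(detected_total, detected_short, short_fraction):
    """Return (expected number of short transcripts, fold shortfall)."""
    expected_short = detected_total * short_fraction
    return expected_short, expected_short / detected_short

expected, fold = fold_depletion(30_000, 17, 0.05)
print(f"expected ~{expected:.0f} short transcripts, observed 17")
print(f"shortfall: ~{fold:.0f}-fold")
```

On these numbers the shortfall is a little under 90-fold, so "about 100 times" is the right order of magnitude.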

We have been asking why this might be the case. Firstly this is control data from the MAQC samples, UHRR and Brain. It may be that short transcripts are more often expressed at low levels but it may be that we are not picking them up because they are too short.

Spike-in experiments can help:
A recent paper presented data from a complex RNA spike-in experiment. Jiang et al individually synthesised and pooled 96 RNAs by in vitro transcription which were either novel sequences or from the B. subtilis and M. jannaschii genomes. The RNA was stored in a buffer from Ambion to maintain stability. The synthesised RNAs were 273-2022bp in length and distributed over a concentration range spanning six orders of magnitude. They observed a high correlation between RNA concentration and read number in RNA-seq data and were able to investigate biases due to GC content and transcript length in different protocols.

This type of resource is useful for many experiments but difficult to prepare.

I have thought about using Agilent's eArrays to manufacture RNAs of up to 300bp in length and to use spot number to vary concentration (15,000 unique RNA molecules spotted 1, 10, 100, 1000 or 10,000 times). This would create a very complex mix which should be reproducibly manufactured at almost any time for any group. It would also be very flexible in varying the actual sequences used, to look at particular biases at the ends of RNA molecules in RNA-seq protocols.

But the RNAs need to be small enough:
The current TruSeq protocol uses metal hydrolysis to degrade RNA and no gel size selection is required later on. This is making it much easier to make RNA-seq libraries in a high throughput manner, however this technique possibly excludes shorter transcripts or at least makes the observed expression as measured by read counts lower than reality.

The TruSeq fragmentation protocol produces libraries with inserts of 120-200bp. Illumina do offer some advice on varying insert size in their protocol but there is not a lot of room to manipulate inserts with this kind of fragmentation and size selection. They also offer an alternative protocol based on cDNA fragmentation but this method has been sidelined by RNA fragmentation due to increased coverage of the latter method across the transcript, see Wang et al.

The libraries prepared using the TruSeq protocol do not contain many observable fragments below 200bp (see image below) and 110bp or so of this is the adaptor sequence. This suggests the majority of starting RNA molecules were fragmented to around 150-250bp in length, so shorter RNAs could be fragmented to a size too small to be sequenced in the final library.

From the TruSeq RNA library prep manual
I'd like to hear from people who are working on short transcripts to get their feedback on this. Does it matter, as long as all samples are similarly affected? Much of the time we are interested in the differential expression of a transcript between two samples rather than between two transcripts in a single sample.

PS: What about smallRNA-seq?
There have been lots of reports on the bias of RNA ligase in small and micro-RNA RNA-seq protocols. At a recent meeting Karim Sorefan from UEA presented some very nice experiments. They produced adapter oligos with four degenerate bases at the 5' end and compared performance of these to standard oligos. The comparison made use of two reagents, a completely degenerate 21mer and a complex pool of 250,000 21mer RNAs. The initial experiment with the degenerate RNA should have resulted in only one read per 400M for any single RNA molecule. They clearly showed that there are very strong biases and were able to say something about the sequences preferred by RNA ligase. The second experiment used adapters with four degenerate bases and gave significantly improved results, showing little if any bias.

This raised the question in my mind that the tissue specific or Cancer specific miRNAs published may not be quite so definitive. Many of the RNAs found using the degenerate oligos in the tissues they tested had never been seen in that tissue previously.

Jiang et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011.

Thursday 8 December 2011

Reference based compression: why do we need to do it?

We are producing a lot of sequence data (see the GoogleMap and a previous post) and will continue to produce a lot more: Illumina have just made noises about 1TB runs on HiSeq using PE100bp runs (whether they will actually release this given recent stock market downgrades is unclear). Computers are not keeping up; we are running out of space to store the data and bandwidth to move it around (Moore's law: compute power grows 60% annually; Kryder's law: data storage grows 100% annually; Nielsen’s law: internet bandwidth grows 50% annually).
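To see how quickly those three rates pull apart, here is a minimal sketch that simply compounds the annual figures quoted above. The rates are the ones given in this post rather than independently checked values, and the five-year horizon is an arbitrary choice for illustration.

```python
# Compound the annual growth rates quoted above over a number of years.
# The rates are taken from this post; the 5-year horizon is arbitrary.
ANNUAL_GROWTH = {
    "compute (Moore)": 0.60,
    "storage (Kryder)": 1.00,
    "bandwidth (Nielsen)": 0.50,
}

def growth_factor(annual_rate, years):
    """Total fold-increase after compounding annual_rate for `years` years."""
    return (1 + annual_rate) ** years

for name, rate in ANNUAL_GROWTH.items():
    print(f"{name:22s} x{growth_factor(rate, 5):.1f} after 5 years")
```

At these rates storage grows 32-fold in five years, compute about 10-fold and bandwidth under 8-fold, so moving data around falls behind fastest.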

So we need to delete some data but what can we afford to throw away?

Reference based compression:
Ewan Birney’s lab at EBI published a very nice paper earlier this year presenting a reference based compression method. New sequences are aligned to a reference and the differences are encoded rather than storing the raw data. At the 3rd NGS congress I had lunch with Guy Cochrane and two other sequencing users and we discussed some of the issues in compressing data. I had to apologise that I’d missed Guy’s talk but it turned out his was at the same time as mine so at least I had a good excuse!

Efficient storage of high throughput sequencing data using reference-based compression. Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, et al. Genome Res 2011.

One of the discussions in compression of sequence data is the issue of what to use as a reference and how far to go in compressing data. Quite simply, the current Human genome is not quite good enough: there are large amounts of sequence in a new genome which don’t match the current Hg19 reference, but the reference is improving as more genome projects release data that can be incorporated.

One interesting comment in the paper for me is the exponential increase in efficiency of storage with longer read lengths. Longer reads are good for several other reasons, there is a significant drop in cost per bp, an increase in mappability and greater likelihood of resolving structural variants or splice isoforms with a single read. It looks like long read sequencing is going to be cheaper and better for us to use for many different reasons.

Today we still use Phred encoding and yet nearly all users remove data lower than Q30 (I am not sure anyone really cares about the difference of Q30 to Q40 when doing analysis). As such we may well be able to compress read quality information by reducing the range encoded to steps of ten (Q10, Q20 ... Q50) or even lower to just two: <Q30, >Q30.
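As a rough illustration of what that binning looks like, here is a minimal sketch. It assumes Phred+33 ASCII quality strings (as in Sanger-style FASTQ), puts Q30 itself in the upper of the two bins, and lets anything below Q10 fall into a Q0 bin; all of those are my choices for the example, and the function names are hypothetical.

```python
# Lossy binning of Phred quality scores. Assumes Phred+33 ASCII encoding
# (Sanger-style FASTQ); the bin boundaries are illustrative only.

def decode_phred33(quality_string):
    """Convert a Phred+33 quality string to a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]

def bin_to_tens(q):
    """Round a Q score down to the nearest multiple of ten, capped at Q50."""
    return min((q // 10) * 10, 50)

def bin_to_two(q, threshold=30):
    """Collapse a Q score to one flag: True if q is at or above the threshold."""
    return q >= threshold

quals = decode_phred33("II5?+#")
print(quals)
print([bin_to_tens(q) for q in quals])
print([bin_to_two(q) for q in quals])
```

The two-bin version needs only one bit per base for quality, at the cost of losing any distinction between, say, Q31 and Q40.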

At a recent meeting Illumina presented some of their work on compressing data and their experiences in reducing the range of Qscores encoded when doing this. They saw:
  • 8 bits/base in the current Illumina bcl files
  • 7.3 bits/base are actually used in bcl for up to Q40 scores
  • 5-6 bits/base can be achieved by compressing bcl files (lossless compression)
  • 4 bits/base can be achieved with the same methods as above if only 8 Q scores are used.
  • 2 bits/base can be achieved if no quality scores are kept (lossy)
  • <1 bits/base if a BWT (Burrows-Wheeler transform, which apparently allows this to be done on the instrument and allows someone to uncompress and realign the data later on; sounds good to me) and zero quality compression are used (lossy)
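To put those bits/base figures into disk space, here is a minimal sketch. The 30x human genome at roughly 3.1Gb is my own assumption for scale and was not part of the talk; I have taken the upper end (6 bits/base) of the lossless range.

```python
# Storage needed at the bits-per-base figures listed above.
# Assumption (not from the talk): a 30x human genome at ~3.1 Gb, i.e. ~93 Gbases.
BITS_PER_BASE = {
    "raw bcl": 8,
    "lossless compressed bcl (upper end)": 6,
    "8 Q-score bins": 4,
    "no quality scores": 2,
}

def storage_gb(n_bases, bits_per_base):
    """Size in gigabytes (10^9 bytes) for n_bases at a given bit rate."""
    return n_bases * bits_per_base / 8 / 1e9

n_bases = 30 * 3.1e9
for label, bits in BITS_PER_BASE.items():
    print(f"{label:38s} {storage_gb(n_bases, bits):6.1f} GB")
```

Under that assumption, dropping from 8 to 2 bits/base takes a single genome from about 93GB to about 23GB.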

Compression of cancer genomes:
The discussion over lunch got me thinking about what we can do with Cancer genomes. Nearly all cancer genome projects, like ICGC, are sequencing matched tumour:normal pairs. Many of these pairs turn out to be very similar for most of the genome, so using the same basic idea as presented in the Birney group paper would allow us to sequence two genomes, the tumour:normal pair, but only store the data for one and the differences. We should be able to implement this quite rapidly and impact storage on these projects. Later on, as Human genome reference based compression gets better, we might be able to store even less. Birney's group also report that 10-40% of reads in a new genome sequence do not map to the reference and these are obviously harder to compress. Again, with T:N pairs the rate of unmapped data should be significantly lower.

Compression of other projects:
What about sequencing other than whole genomes? It may well be that we can use the increasing numbers of replicates in RNA-seq and ChIP-seq experiments to reduce the storage burden of the final data. Replicates could be compressed against each other. A triplicate experiment could be stored as a single sequence file with the variability at specific loci and a few different reads across the genome.