CoreGenomics: July 2011

Friday, 22 July 2011

Ion Torrent has the "Law" on their side

Bang went the starter gun in August 2010 when Life Technologies announced the purchase of Ion Torrent for $375M (plus another $350M if certain milestones are reached).

Yesterday in Nature, Life Technologies and Ion Torrent published the genome of Gordon Moore of “Moore’s law” fame, plus 3 bacterial genomes. I’d like to know how much Gordon thought about DNA sequencing before his law was quoted in almost every next-gen presentation over the last four or five years? Let alone what went through his head when approached by Ion and LifeTech! The Moore genome required about 1000 Ion chips or a billion detector wells. In the supplementary information for the paper Ion suggest a route to a 1B well chip (see later on in this post) which would be pretty awesome.

Over at Genetic Future there is a nice commentary on the Human Genome results. Daniel MacArthur states pretty categorically that this is nowhere near a cost effective way to sequence humans “yet”. I posted on SEQanswers in February that Ion could get to HiSeq output by 2013 after reading an interview with Jonathan Rothberg, in which he said to expect a 10-fold improvement from Ion every six months. It looks like they have achieved this so far, but will they keep it up to get to 100Gb by the middle of 2013?

Dan also makes a very strong argument against the paper’s claims on quality and validation with SOLiD. Dan’s comments on the low coverage are very persuasive. I’d excuse Life Tech and Ion for the 10x Moore genome, but there can be little excuse for Life Tech on a 15x SOLiD genome. They could have very easily brought this closer to the standard of 30-50x coverage. I hope it is obvious to everyone reading the paper that Ion and SOLiD come from the same company and that there is likely a vested interest in saying how great both technologies are.

Dan thought Life Technologies may have made a mistake putting the Human genome into the paper. I really hope that without it this work would not have made it into Nature! A genome today even on a new sequencing technology just does not feel like it should pass the bar of entry to Nature. It’s a tough club to get in to!

Nature News makes a lot of the fact that a bacterial genome can be sequenced in two hours. They seem to ignore library prep entirely and even with the new automated ePCR this is probably a day's work at least. Illumina's purchase of Epicentre gives them the Nextera technology and it really is possible to sequence a genome in an eight hour day. If you can get one delivered of course.

One of the interesting things to me was to look at what developments we might deduce in terms of throughput. In the paper the authors state read lengths of over 200bp and that only 20-40% of detector wells generate mappable reads. Moving past 100bp reads and increasing %mappable reads to closer to 100 will make a massive difference to the ultimate output. However I do not know enough about the technical challenges Life may face here. This is almost certainly the kind of development that will get Ion Torrent the further $350M from Life.

In the supplementary figures for the Nature paper Ion demonstrate 1.3um wells on a 1.68um pitch array allowing up to 165M detector wells. They say that this could increase to 1B but it sounds like a tough step to take. At 1B wells and 30bp then Ion is giving us the same yield as a v3 HiSeq flowcell today. How long it might take to get there is another matter. There are 20 supplemental figures. Number 8 shows the instrument and points out the major features, one of which is the accompanying iPod and its dock!
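
A rough sketch of the throughput arithmetic, using only numbers quoted in this post (165M wells, 100bp reads, 20-40% of wells giving mappable reads). The function name is made up for illustration, and the model assumes that every productive well yields exactly one full-length read, which is a simplification.

```python
def estimated_yield_gb(wells, read_length_bp, mappable_fraction):
    """Approximate mappable output of one chip, in gigabases.

    Assumes one read of read_length_bp from each productive well.
    """
    if not 0.0 <= mappable_fraction <= 1.0:
        raise ValueError("mappable_fraction must be between 0 and 1")
    return wells * read_length_bp * mappable_fraction / 1e9


if __name__ == "__main__":
    wells = 165_000_000  # detector wells on the 1.3um well array
    for fraction in (0.2, 0.4, 1.0):
        gb = estimated_yield_gb(wells, 100, fraction)
        print(f"{fraction:.0%} mappable wells at 100bp: {gb:.1f} Gb")
```

Swapping in a longer read length or a larger well count shows how strongly the three factors multiply together.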

One of the challenges mentioned in discussions I have had with users of the Ion Torrent is that the current emulsion PCR is limiting the size of amplicon that can be generated on the acrylamide beads. If the Ion technology is going to get to 400bp like 454 or even to 1000bp then a larger bead may be needed. If anyone can point me to an emulsion PCR paper that describes how size of bead affects amplicon length I’d be grateful. So far I have looked at "SNP genotyping by multiplexed solid-phase amplification and fluorescent minisequencing" and "Product length, dye choice, and detection chemistry in the bead-emulsion amplification of millions of single DNA molecules in parallel".

It is going to be interesting to see how Ion respond to Roche’s entry into semi-conductor sequencing. If one of them gets a system that can actually scale according to Moore’s law then Illumina will need to squeeze more out of SBS or possibly Oxford Nanopore?

Wednesday, 20 July 2011

How good are the ENCODE RNA-Seq guidelines?

The ENCODE consortium released its first set of data-standards guidelines and interestingly they are for RNA-Seq. ChIP-Seq guidelines will follow later which is a little surprising considering almost all the ENCODE data so far is ChIP-Seq (see below). In some ways I’d have preferred to see the ChIP-Seq document first. As ChIP-Seq is pretty mature it would have been clear how much ENCODE had taken into account the different lab and analytical methods and distilled what was important in an experiment.

There is a bit of a hole in the guidelines from my point of view as there is no comparison or recommendation on methods. When I first looked at the site this is exactly the information I was hoping to get. There is none, zip, zilch! I was also surprised that there are no references in these guidelines. I think this is a significant shortcoming from the ENCODE consortium and one that needs to be fixed. I would very much hope that there are protocol recommendations for the more mature ChIP-Seq methods when those guidelines are written.

A lot of the guideline recommendations come from experience of microarrays. This document is nowhere near as comprehensive as MIAME but I think it will be easier for users to adopt because of this. The Metadata section is a nice concise list of information to collect for an experiment, RNA-Seq or otherwise. I'd encourage anyone doing a sequencing or array experiment to read this list and think about other factors they might need to collect in their own experiments.

Whilst these guidelines are a reasonable start and outline many of the issues RNA-Seq users need to consider, they fall a long way short of being truly useful to someone considering where to start with an RNA-Seq experiment.

ENCODE data so far: About 20 labs have submitted data to ENCODE according to their data summary. When I looked there was no ChIP-Chip data in the summary; almost 85% of the data is from sequencing experiments with 63% ChIP-Seq, 8% RNA-Seq, 7% DNase-Seq, 4% Methyl-Seq and 2% FAIRE-Seq.

The Guidelines
Methods: RNA-Seq methods mentioned include transcript quantification, differential gene expression, discovery and splicing analysis. They don’t mention allele specific expression. Many types of input can be used in these methods: Total RNA (including miRNA of course), single cell RNA, smallRNA, polyA+ RNA, polysomal RNA, etc, etc, etc. The authors do state how immature RNA-Seq is and that the applications are evolving incredibly rapidly in almost every part of an experiment: sample prep, sequencing and analysis. They say they don’t aim to cover every possible application but instead focus on the major ones and also provide recommendations for providing meta-data, something too many scientists still don’t collect before and during an experiment, let alone submit with the data for analysis.

Metadata: recommendations include the usual suspects. For Cell lines; accession number, passage number, culture conditions, STR and Mycoplasma test results. For tissue the source and genotype if this is an animal, sample collection and processing methods, cellularity scores. And for the final RNA the method used for extraction and QC results (bioanalyser database anyone?)
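
As a sketch of how the cell line part of this list could be captured before an experiment starts, here is a small record with one field per item mentioned above. The class name, field names and example values are my own invention for illustration and are not taken from the ENCODE document.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CellLineSample:
    """One field per metadata item listed above (names are hypothetical)."""
    accession: Optional[str] = None
    passage_number: Optional[int] = None
    culture_conditions: Optional[str] = None
    str_test_result: Optional[str] = None
    mycoplasma_test_result: Optional[str] = None
    rna_extraction_method: Optional[str] = None
    rna_qc_result: Optional[str] = None


def missing_metadata(sample: CellLineSample) -> List[str]:
    """Return the names of any fields that have not been filled in yet."""
    return [f.name for f in fields(sample) if getattr(sample, f.name) is None]


if __name__ == "__main__":
    # Hypothetical, partly completed record.
    sample = CellLineSample(accession="EXAMPLE-0001", passage_number=12)
    print("Still to collect:", ", ".join(missing_metadata(sample)))
```

Checking for gaps like this at sample collection time is much easier than chasing the information when the data is submitted.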

Replication: They say that RNA-Seq experiments should be replicated (biological rather than technical) although ENCODE recommend a minimum of two replicates, which is very low. I defy anyone to find a statistician involved in microarray experiments that would settle for anything less than three and probably four replicates today. However they do give a get out clause for those who can’t replicate by stating “unless there is a compelling reason why this is impractical or wasteful”. An interesting point is that these guidelines suggest an RPKM correlation of at least 0.92 is required otherwise an experiment should be repeated or explained. I would have thought anyone publishing their experiments would already be explaining this and that reviewers would pick up on such poor correlations.
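
To make the replicate check concrete, here is a minimal sketch that computes RPKM (reads per kilobase of transcript per million mapped reads) and compares two replicates against the 0.92 threshold. I have assumed a Pearson correlation on untransformed RPKM values; check the guidelines for the exact statistic before relying on this. The function names are my own.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


def replicates_agree(rpkm_a, rpkm_b, threshold=0.92):
    """True if the two replicates correlate at or above the threshold."""
    return pearson(rpkm_a, rpkm_b) >= threshold
```

For example, a 2kb gene with 1,000 reads in a library of 10M mapped reads has an RPKM of 50.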

Read-depth: This is one of the hottest topics for RNA-Seq. It makes a massive difference to the final cost of the experiment and is a major determinant in the “microarrays vs sequencing” thought process. ENCODE suggest around 30M paired end reads for differential gene expression, however Illumina are suggesting you can use as few as 2M reads per sample today if you want the same sensitivity as Affy arrays. That’s a 15 fold difference and I suspect this will be revised in the next version of the guidelines. They do say that other methods will require more reads, up to 200M.

ENCODE aim to update this document annually; I am sure many will be encouraged by this as a useful endeavor. What about a step further with an open access Genomic journal that only covers annual reviews of methods, compares the variations and makes recommendations for a consensus protocol?

“Genome Methods Reviews” perhaps?

Friday, 15 July 2011

Understanding mutation nomenclature

I have recently been analysing some next-gen sequencing data for mutation detection. I realised pretty early on how long ago it was I had the opportunity to analyse some data of my own, as my day job is the generation of someone else's. I have been scanning through lots of data, COSMIC and many publications and thought I had better get up to speed on the naming conventions for different mutations and polymorphisms. Whilst doing so I thought I'd write a little primer to look back on and share with others.

First off I'd recommend some resources and references: Almost everything comes from these sources and the website is really comprehensive.
1. Antonarakis et al: Recommendations for a nomenclature system for Human gene mutations. 1998; Human Mutation 11:1-3. This was written before the Human Genome project was completed and a nice easy downloadable reference sequence from UCSC or Ensembl was not really conceivable. Looking back at papers like this reminds me how much we seem to take for granted in our day-to-day science!
2. den Dunnen et al. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. 2000; Human Mutation 15:7-12. This paper presented an updated set of recommendations to allow comprehensive reporting of more complex mutations, and as such has been very useful in Cancer genomics.
3. http://www.hgvs.org/mutnomen/recs.html: the Human Genome Variation Society website is probably the primary resource for mutation nomenclature.

How to report mutations:
The basic structure of a nucleotide change is as follows:

{Genomic locus or reference accession}{gDNA or cDNA}{nucleotide interval}{reference nucleotide}{type of change}{actual nucleotide}

e.g. NM_000546:c.76A>T, where nucleotide 76 in the TP53 cDNA reference has been changed from A to T.

Use the accession number for the reference sequence from RefSeq (e.g. NM_000546 for TP53 from HGNC), followed by a colon. Nucleotide numbers are preceded by "g.", "c.", "m.", "r." or "p." denoting genomic, cDNA, mitochondrial, RNA or protein sequence respectively. For c. and r. the A of the ATG start codon is nucleotide +1, the nucleotide 5' to +1 is denoted -1 and there is no nucleotide 0; g. numbering simply starts at the first nucleotide of the genomic reference sequence. If there is more than one mutation in a single allele these are listed together inside [square brackets]. Nucleotides are represented in capitals for DNA (A,C,G,T) and in lowercase for RNA (a,c,g,u); amino acids are represented by their one letter IUPAC codes (see the bottom of this post).

Different types of mutation:
Substitutions: c.76A>T
Deletions: c.76del or c.76_78delACT
Duplications: c.76dup or c.76_78dupACT
Insertions: e.g. c.76_77insG
Inversions: c.76_83inv8
Conversion: c.76_355conNM_004006.1:c.155_355
Indels: e.g. c.76_83delinsTG
Translocations: t(X;4)(p21.2;q35)
Complex: see details below.

Substitutions: ">" indicates a substitution at DNA level where one nucleotide has been replaced by another (e.g. c.76A>T, where nucleotide 76 in the reference has been changed from A to T).
Deletions: "del" indicates a deletion in the reference sequence where at least one nucleotide is removed (e.g. c.76del or c.76delA, where nucleotide 76, an A has been deleted and c.76_78del or c.76_78delACT where the three nucleotides at positions 76,77&78, ACT have been deleted. The bases deleted are not absolutely required in the description but can be helpful.)
Duplications: "dup" indicates a duplication of the reference sequence (like c.76dup or c.76dupA where nucleotide 76, an A has been duplicated and c.76_78dup or c.76_78dupACT where the three nucleotides at positions 76,77&78, ACT have been duplicated).
Insertions: "ins" indicates an insertion in the reference sequence (e.g. c.76_77insG where a G has been inserted between nucleotides 76 and 77).
Inversions: "inv" indicates an inversion of the reference sequence between the bases reported (e.g. c.76_83inv or c.76_83inv8 where the 8 nucleotides from position 76 to 83 have been inverted with respect to the reference.)
Conversion: "con" indicates a conversion of a region of the reference genome to another region. These can be complex events involving highly homologous sequences in the genome and thus may also include multiple changes in the converted sequence (e.g. c.76_355conNM_004006.1:c.155_355 where nucleotides 76 to 355 of the reference have been converted to nucleotides 155 to 355 from the sequence in GenBank accession NM_004006.1).
Indels: These genomic events are quite common but the recommended nomenclature requires that they are reported as DelIns (not quite so catchy!) for clarity (e.g. c.76_83delinsTG where the 8 nucleotides from position 76 to 83 have been deleted and a TG sequence has been inserted in their place).
Translocations: "t" indicates a translocation between genomic loci (e.g. t(X;4)(p21.2;q35), where a translocation has occurred between chromosome bands Xp21.2 and 4q35). A translocation should be reported in reference to a sequence accession number where possible.
Complex: similarly to big changes (see below) it is often easier to report a complex change in reference to a sequence accession number with a brief textual description.
Other identifiers: The underscore "_" indicates that a range of bases has been affected (see above for numerous examples), square brackets "[]" indicate an allele (e.g. c.[76A>T]) and round brackets "()" indicate that the exact position of the change is unknown, with the range of uncertainty reported using the underscore (e.g. c.(67_70)insG).
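
To check my own understanding of the structure, here is a minimal sketch of a parser for the simpler descriptions above (substitution, deletion, duplication, insertion, inversion and delins). It is a deliberately simplified pattern of my own and not an official HGVS grammar: it ignores conversions, translocations, allele brackets, uncertain positions, intronic offsets and protein changes.

```python
import re

# Optional accession and colon, sequence type, position or range, then the change.
DESCRIPTION = re.compile(
    r"^(?:(?P<accession>[A-Z]{2}_\d+(?:\.\d+)?):)?"
    r"(?P<seq_type>[gcmr])\."
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?P<change>.+)$"
)

# Checked in order; "delins" must be tested before "del".
CHANGE_TYPES = [
    (re.compile(r"^[ACGT]>[ACGT]$"), "substitution"),
    (re.compile(r"^delins[ACGT]+$"), "deletion-insertion"),
    (re.compile(r"^del[ACGT]*$"), "deletion"),
    (re.compile(r"^dup[ACGT]*$"), "duplication"),
    (re.compile(r"^ins[ACGT]+$"), "insertion"),
    (re.compile(r"^inv\d*$"), "inversion"),
]


def parse_variant(description):
    """Split a simple variant description into its parts."""
    match = DESCRIPTION.match(description)
    if match is None:
        raise ValueError(f"cannot parse {description!r}")
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    for pattern, name in CHANGE_TYPES:
        if pattern.match(match.group("change")):
            return {
                "accession": match.group("accession"),
                "seq_type": match.group("seq_type"),
                "start": start,
                "end": end,
                "type": name,
                "change": match.group("change"),
            }
    raise ValueError(f"unsupported change in {description!r}")


if __name__ == "__main__":
    for text in ("NM_000546:c.76A>T", "c.76_78delACT", "c.76_83delinsTG"):
        print(text, "->", parse_variant(text)["type"])
```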

What about big changes? It would be impractical to list 100s or 1000s of bases using this nomenclature. Large changes are denoted only by the number of bases changed (e.g. c.76_77ins1786); however, an accession number for a sequence file should also be provided.

When is an insertion a duplication, etc, etc, etc? If you take the example ACTTTGTGCC in the reference and ACTTTGTGGCC in your sample you could have a duplication of G at position 8 or an insertion of G between positions 8&9, c.8dupG or c.8_9insG. Duplicating insertions are described as duplications as this is a simpler and clearer description. Of course it is not clear whether the additional base is a true duplication resulting from polymerase slippage or is actually an insertion of a new but identical base from outside the reference sequence.
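
The rule can be sketched in a few lines: if the extra bases are identical to the reference bases immediately before the insertion point, report a duplication, otherwise report an insertion. The function name and the fixed "c." prefix are my own choices, and this sketch does not try to normalise positions within longer repeats.

```python
def describe_insertion(reference, position, inserted):
    """Describe extra bases found directly after a 1-based reference position.

    Returns a "dup" description when the inserted bases copy the reference
    bases that end at `position`, otherwise an "ins" description.
    """
    if not inserted:
        raise ValueError("inserted sequence must not be empty")
    if not 1 <= position < len(reference):
        raise ValueError("position must have a reference base on both sides")
    start = position - len(inserted) + 1  # first base of the candidate copy
    preceding = reference[start - 1:position] if start >= 1 else ""
    if preceding == inserted:
        if len(inserted) == 1:
            return f"c.{position}dup{inserted}"
        return f"c.{start}_{position}dup{inserted}"
    return f"c.{position}_{position + 1}ins{inserted}"


if __name__ == "__main__":
    reference = "ACTTTGTGCC"
    print(describe_insertion(reference, 8, "G"))  # extra G copies the G at 8
    print(describe_insertion(reference, 8, "A"))  # an A here is a true insertion
```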

What happens downstream of the first mutation? If a duplication or deletion in your sequence is followed by a subsequent change, I am not clear how the following changes are presented. Every base will now be wrong compared to the reference, so how does this complexity get resolved? (Sorry, but I need to add to this later, probably when I confront such a problem myself.)

What happens when the reference changes? An interesting point, and one that I had not considered too deeply, is what you use as the reference sequence and how you deal with changes in it. The Human genome has a pretty good reference and this is considered to be the wild-type allele. However the reference is being updated with new information and groups or users can request that the reference allele is changed when it is shown that a variant is actually the more common allele. Obviously this has an impact on what is reported in papers. The simple answer is to always state which genome build you used in your paper or report.

IUPAC codes: These are handy when referencing mutations so I thought I'd include the single letter codes for nucleotides and amino acids as I always forget these and there is a good chance I will remember I wrote this blog!

Symbols:
A   Adenine
C   Cytosine
G   Guanine
T   Thymine
R   G or A             puRine
Y   T or C             pYrimidine
M   A or C             aMino
K   G or T             Keto
S   G or C             Strong interaction (3 H bonds)
W   A or T             Weak interaction (2 H bonds)
H   A or C or T        "not-G, H follows G in the alphabet"
B   G or T or C        "not-A, B follows A"
V   G or C or A        "not-T (not-U), V follows U"
D   G or A or T        "not-C, D follows C"
N   G or A or T or C   aNy

Complementary symbols:
Symbol      A  C  G  T  B  D  H  K  M  S  V  W  N
Complement  T  G  C  A  V  H  D  M  K  S* B  W* N*
(* the symbol is its own complement)
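
The complement table translates directly into a lookup, sketched below. The R/Y pair is not in the table above; I have added it because puRine (G or A) and pYrimidine (C or T) are each other's complement. The function names are my own.

```python
# Complement of each IUPAC nucleotide symbol, taken from the table above.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "B": "V", "D": "H", "H": "D", "K": "M", "M": "K",
    "S": "S", "V": "B", "W": "W", "N": "N",
    "R": "Y", "Y": "R",  # added here, not listed in the table above
}


def complement(sequence):
    """Complement a sequence base by base, keeping its orientation."""
    return "".join(COMPLEMENT[base] for base in sequence.upper())


def reverse_complement(sequence):
    """Return the opposite strand read 5' to 3'."""
    return complement(sequence)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ACTTTGTGCC"))
```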
