CoreGenomics: Understanding mutation nomenclature

Friday, 15 July 2011

Understanding mutation nomenclature

I have recently been analysing some next-gen sequencing data for mutation detection. I realised pretty early on how long it had been since I last had the opportunity to analyse data of my own, as my day job is generating someone else's. I have been scanning through lots of data, COSMIC and many publications, and thought I had better get up to speed on the naming conventions for different mutations and polymorphisms. Whilst doing so I thought I'd write a little primer to look back on and share with others.

First off I'd recommend some resources and references. Almost everything here comes from these sources, and the website is really comprehensive.
1. Antonarakis et al: Recommendations for a nomenclature system for Human gene mutations. 1998; Human Mutation 11:1-3. This was written before the Human Genome project was completed and a nice easy downloadable reference sequence from UCSC or Ensembl was not really conceivable. Looking back at papers like this reminds me how much we seem to take for granted in our day-to-day science!
2. den Dunnen et al: Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. 2000; Human Mutation 15:7-12. This paper presented an updated set of recommendations to allow comprehensive reporting of more complex mutations, and as such has been very useful in Cancer genomics.
3. http://www.hgvs.org/mutnomen/recs.html: the Human Genome Variation Society website is probably the primary resource for mutation nomenclature.

How to report mutations:
The basic structure of a nucleotide change is as follows:

{Genomic locus or reference accession}{gDNA or cDNA}{nucleotide interval}{reference nucleotide}{type of change}{actual nucleotide}

e.g. NM_000546:c.76A>T, where nucleotide 76 of the TP53 coding sequence has been changed from A to T.
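
To make the structure above concrete, here is a minimal Python sketch that splits a simple substitution into its parts. The pattern and the function name parse_substitution are my own assumptions for illustration, not part of the HGVS recommendations, and it only handles single-nucleotide substitutions.

```python
import re

# Hypothetical pattern covering simple substitutions only.
# The colon after the accession is optional so older-style strings still parse.
SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+(?:\.\d+)?)"     # reference accession, optional version
    r":?"                                          # separator between accession and change
    r"(?P<level>[gcmr])\."                         # sequence type prefix
    r"(?P<position>-?\d+)"                         # nucleotide position
    r"(?P<ref>[ACGTacgu])>(?P<alt>[ACGTacgu])$"    # reference > observed nucleotide
)


def parse_substitution(description):
    """Return the fields of a substitution as a dict, or None if it does not match."""
    match = SUBSTITUTION.match(description)
    if match is None:
        return None
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


print(parse_substitution("NM_000546:c.76A>T"))
```

Anything that is not a plain substitution (deletions, ranges and so on) returns None here; a real parser would need a much fuller grammar.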

Use the accession number of the reference sequence from RefSeq (e.g. NM_000546, the mRNA reference for TP53; the gene symbol itself comes from HGNC), followed by a colon. Nucleotide numbers are preceded by "g.", "c.", "m.", "r." or "p.", denoting genomic, cDNA, mitochondrial, RNA or protein sequence respectively. For c. and r. the A of the ATG start codon is nucleotide 1, the nucleotide 5' to it is denoted -1 and there is no nucleotide 0; g. numbering simply starts at 1 with the first nucleotide of the genomic reference sequence. If there is more than one mutation in a single allele these are listed together inside square brackets, separated by semicolons (e.g. c.[76A>T;83G>C]). Nucleotides are represented in capitals for DNA (A,C,G,T) and in lowercase for RNA (a,c,g,u); amino acids are represented by their one letter IUPAC codes (see the bottom of this post).
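
The "no nucleotide 0" rule is easy to get wrong by one, so here is a small sketch of the arithmetic for c. numbering. The function names and the idea of passing the 0-based index of the A of the ATG are my own assumptions for illustration.

```python
def coding_position(index, atg_index):
    """Convert a 0-based sequence index to c. numbering (A of ATG = 1, no 0)."""
    offset = index - atg_index
    # Bases at or after the A count 1, 2, 3...; bases before it count -1, -2...
    return offset + 1 if offset >= 0 else offset


def index_from_coding_position(position, atg_index):
    """Convert a c. position back to a 0-based sequence index."""
    if position == 0:
        raise ValueError("there is no nucleotide 0")
    return atg_index + (position - 1 if position > 0 else position)


sequence = "GGCATGAAA"          # toy sequence, ATG starts at index 3
atg = sequence.index("ATG")
print([coding_position(i, atg) for i in range(len(sequence))])
```

For the toy sequence the positions run -3, -2, -1, 1, 2, 3 and so on, skipping 0.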

Different types of mutation:
Substitutions: c.76A>T
Deletions: c.76del or c.76_78delACT
Duplications: c.76dup or c.76_78dupACT
Insertions: e.g. c.76_77insG
Inversions: c.76_83inv8
Conversion: c.76_355conNM_004006.1:c.155_355
Indels: e.g. c.76_83delinsTG
Translocations: t(X;4)(p21.2;q35)
Complex: see details below.

Substitutions: ">" indicates a substitution at DNA level where one nucleotide has been replaced by another (e.g. c.76A>T, where nucleotide 76 in the reference has been changed from A to T).
Deletions: "del" indicates a deletion in the reference sequence where at least one nucleotide is removed (e.g. c.76del or c.76delA, where nucleotide 76, an A, has been deleted, and c.76_78del or c.76_78delACT, where the three nucleotides at positions 76, 77 and 78, ACT, have been deleted). The bases deleted are not absolutely required in the description but can be helpful.
Duplications: "dup" indicates a duplication of the reference sequence (e.g. c.76dup or c.76dupA, where nucleotide 76, an A, has been duplicated, and c.76_78dup or c.76_78dupACT, where the three nucleotides at positions 76, 77 and 78, ACT, have been duplicated).
Insertions: "ins" indicates an insertion in the reference sequence, reported with the two flanking positions (e.g. c.76_77insG, where a G has been inserted between nucleotides 76 and 77).
Inversions: "inv" indicates an inversion of the reference sequence between the bases reported (e.g. c.76_83inv or c.76_83inv8, where the 8 nucleotides from positions 76 to 83 have been inverted with respect to the reference).
Conversion: "con" indicates a conversion of a region of the reference genome to another region. These can be complex events involving highly homologous sequences in the genome and thus may also include multiple changes in the converted sequence (e.g. c.76_355conNM_004006.1:c.155_355 where nucleotides 76 to 355 of the reference have been converted to nucleotides 155 to 355 from the sequence in GenBank accession NM_004006.1).
Indels: These genomic events are quite common but the recommended nomenclature requires that they are reported as DelIns (not quite so catchy!) for clarity (e.g. c.76_83delinsTG, where the 8 nucleotides from positions 76 to 83 have been deleted and a TG sequence has been inserted in their place).
Translocations: "t" indicates a translocation between genomic loci (e.g. t(X;4)(p21.2;q35), where a translocation has occurred between chromosome bands Xp21.2 and 4q35). A translocation should be reported in reference to a sequence accession number where possible.
Complex: similarly to big changes (see below) it is often easier to report a complex change in reference to a sequence accession number with a brief textual description.
Other identifiers: The underscore "_" indicates that a range of bases has been affected (see above for numerous examples), square brackets "[]" indicate an allele (e.g. c.[76A>T]) and round brackets "()" indicate that the exact position of the change is unknown, with the range of uncertainty reported using the underscore (e.g. c.(67_70)insG).
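
As a quick way of checking which of the types above a description belongs to, here is a rough Python sketch that looks for the keywords listed. The CHANGE_TYPES table and the classify function are hypothetical names of mine, and keyword matching like this is only an approximation, not a validator.

```python
import re

# Checked in order, so "delins" is tested before "del" and "ins".
CHANGE_TYPES = [
    ("translocation", re.compile(r"^t\(")),
    ("delins", re.compile(r"delins")),
    ("deletion", re.compile(r"del")),
    ("duplication", re.compile(r"dup")),
    ("insertion", re.compile(r"ins")),
    ("inversion", re.compile(r"inv")),
    ("conversion", re.compile(r"con")),
    ("substitution", re.compile(r"[ACGT]>[ACGT]")),
]


def classify(description):
    """Return the first change type whose keyword appears in the description."""
    for name, pattern in CHANGE_TYPES:
        if pattern.search(description):
            return name
    return "unknown"


for example in ["c.76A>T", "c.76_78delACT", "c.76dup", "c.76_83delinsTG",
                "c.76_83inv8", "t(X;4)(p21.2;q35)"]:
    print(example, "->", classify(example))
```

The ordering of the table is the main design choice: the more specific keyword has to be tried first or every DelIns would be reported as a plain deletion.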

What about big changes? It would be impractical to list 100s or 1000s of bases using this nomenclature. Large changes are denoted only by the number of bases changed (e.g. c.76_77ins1786); however, an accession number for a sequence file should also be provided.

When is an insertion a duplication, etc, etc, etc? If you take the example ACTTTGTGCC in the reference and ACTTTGTGGCC in your sample, you could have a duplication of G at position 8 or an insertion of G between positions 8 and 9: c.8dupG or c.8_9insG. Duplicating insertions are described as duplications, as this is a simpler and clearer description. Of course it is not clear whether the additional base is a true duplication resulting from polymerase slippage or is actually an insertion of a new but identical base from outside the reference sequence.
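
The rule can be written down as a small check: if the inserted bases are identical to the reference bases immediately next to the insertion point, call it a duplication, otherwise an insertion. This is a sketch under my own assumptions (the function name describe_insertion, 1-based positions, a "c." prefix), and it does not attempt any further normalisation such as shifting the change along a repeat.

```python
def describe_insertion(reference, position, inserted):
    """Describe bases inserted after 1-based `position` of `reference`."""
    n = len(inserted)
    if n == 0:
        raise ValueError("nothing inserted")
    # Candidate duplicated stretches (1-based, inclusive): the n bases ending
    # at `position`, or the n bases starting just after it.
    for first in (position - n + 1, position + 1):
        last = first + n - 1
        if first >= 1 and reference[first - 1:last] == inserted:
            span = str(first) if n == 1 else f"{first}_{last}"
            return f"c.{span}dup{inserted}"
    return f"c.{position}_{position + 1}ins{inserted}"


reference = "ACTTTGTGCC"
print(describe_insertion(reference, 8, "G"))   # the example from the text
print(describe_insertion(reference, 8, "A"))   # a base that is not a copy of its neighbour
```

With the reference from the example, inserting G after position 8 is reported as c.8dupG, while inserting an A at the same place stays an insertion, c.8_9insA.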

What happens downstream of the first mutation? If a duplication or deletion in your sequence is followed by a subsequent change, I am not clear how the following changes are presented. Every base downstream will now be offset compared to the reference, so how does this complexity get resolved? (Sorry, but I need to add to this later, probably when I confront such a problem myself.)

What happens when the reference changes? An interesting point, and one that I had not considered too deeply, is what you use as the reference sequence and how you deal with changes in it. The Human genome has a pretty good reference and this is considered to be the wild-type allele. However the reference is being updated with new information, and groups or users can request that the reference allele is changed when it is shown that a variant is actually the more common allele. Obviously this has an impact on what is reported in papers. The simple answer is to always state the genome build you used in your paper or report.

IUPAC codes: These are handy when referencing mutations so I thought I'd include the single letter codes for nucleotides and amino acids as I always forget these and there is a good chance I will remember I wrote this blog!

Symbols:
A   Adenine
C   Cytosine
G   Guanine
T   Thymine
R   G or A             puRine
Y   T or C             pYrimidine
M   A or C             aMino
K   G or T             Keto
S   G or C             Strong interaction (3 H bonds)
W   A or T             Weak interaction (2 H bonds)
H   A or C or T        not-G, H follows G in the alphabet
B   G or T or C        not-A, B follows A
V   G or C or A        not-T (not-U), V follows U
D   G or A or T        not-C, D follows C
N   G or A or T or C   aNy
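
The table above translates directly into a lookup, which is handy for checking whether a sequence is compatible with a pattern written in ambiguity codes. A minimal sketch; the names IUPAC_BASES and matches are mine.

```python
# Each IUPAC symbol mapped to the bases it can stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def matches(pattern, sequence):
    """True if every base of `sequence` is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC_BASES[code] for code, base in zip(pattern, sequence))


print(matches("ACRT", "ACGT"))  # R allows G
print(matches("ACRT", "ACCT"))  # R does not allow C
```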

Complementary symbols:
Symbol      A  C  G  T  R  Y  B  D  H  K  M  S  V  W  N
Complement  T  G  C  A  Y  R  V  H  D  M  K  S* B  W* N*
(* the symbol is its own complement)
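
The complement table is most useful when reverse-complementing a sequence that contains ambiguity codes. Here is a short sketch using Python's built-in str.maketrans; the function name reverse_complement is mine, and R and Y (which complement each other) are included alongside the symbols in the table.

```python
# Symbol -> complement, in the same order for both strings.
COMPLEMENT = str.maketrans("ACGTRYBDHKMSVWN", "TGCAYRVHDMKSBWN")


def reverse_complement(sequence):
    """Complement every symbol, then reverse the result."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACTTTGTGCC"))
```

S, W and N map to themselves, so they pass through unchanged apart from the reversal.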
