Monday, 17 September 2012

HuID-seq blog

There has been a lot of recent activity around using NGS gene resequencing in the clinic. Although clinical DNA sequencing has been an important tool for several decades, the explosion in NGS methods for amplicon resequencing has made it feasible for just about any lab. Previous posts on this blog have discussed NGS amplicon methods and some of the tools needed to design amplicons.

It is not easy to say clearly which technologies will dominate in the clinical space, nor whether small targeted panels will be preferred over larger, more comprehensive panels, medical exomes or even whole genomes. But it does seem pretty clear that amplicon sequencing is going to be a very important clinical tool.

Why is patient ID important:
As we are more easily able to sequence not just multiple genes but also multiple patients in a single NGS run, it becomes very important to make sure results are not assigned to the wrong patient. Clinical molecular labs spend a huge amount of time and effort on making sure results don’t get mixed up, but I thought the tests themselves could be improved to determine which patient the results came from at the same time as the clinical results are being generated. Just add a large enough number of SNP loci to allow patient identification by comparison to a simple blood-based genotyping assay.

SNP-seq for patient ID:
I have been discussing using additional content in amplicon (and other) tests for a year or so, but have never found the time to get in the lab and demonstrate the idea. I asked our Tech-Transfer people about it and they said that, whilst it was a nice idea, there was little that could be protected from an IP perspective. As I am not going to get time to work on it, and as it can’t be protected easily, I am hoping this blog will help stimulate discussion and someone will take the idea on board for their research. I call the method HuID-seq.

Comparison to STR profiling: There is already a gold standard for human identification in the STR profiling used in forensic applications. Unfortunately these tests cannot simply be added to an NGS assay. What we need is a level of discrimination high enough that results cannot be sent to the wrong patient; in theory HuID-seq could be set at a level significantly lower than forensic STRs. Today 13 STR loci are used in the United States Combined DNA Index System (CODIS) forensic kits. SNPs have lower resolving power and more are likely to be needed, but before I get onto that a recent paper deserves a mention. In Biotechniques, Bornman et al published an NGS-based method that reproduces STR data very nicely (see Short-read, high-throughput sequencing technology for STR genotyping). Potentially this could be added to current tests, but I prefer SNPs for a number of reasons.

Why SNPs, which ones and how many: SNPs are a good choice because they are easily assayed by PCR amplification or by in-solution capture methods. SNPs are also already assayed by current NGS methods and, perhaps most importantly, if coding SNPs are used then RNA-seq data can also be used for HuID-seq. This will be important if array-based gene expression signatures are ported over to NGS. It also means that the HuID-seq method could be used to very good effect in research projects whatever the source of data. It is often important to quality control large experimental datasets to remove duplicate or wrongly assigned samples. In the supplementary information for the Nature paper The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups, the authors presented a novel eQTL-like approach to check that the same patients' samples were used for whole genome gene-expression and SNP genotyping arrays. They termed the method BeadArray Diagnostic for Genotype and Expression Relationships (BADGER).

Which ones: There were around 250,000 SNPs present on both Affymetrix and Illumina arrays. 2000 could be used for identification, 100 for ethnicity and 200 for gender. All SNPs would allow mapping of samples onto other SNP or sequence based data, e.g. SNP arrays, ChIP-seq, RNA-seq and exomes or genomes. We looked carefully at which of these SNPs might be used in a HuID-seq method and came up with some requirements.
  • MAF close to 0.5 (allele frequency 0.4-0.6)
  • Present on most widely used genotyping arrays.
  • Coding SNPs
  • Ability to predict identity
  • Ability to predict ethnicity
  • Ability to predict gender
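The first three requirements are simple enough to apply as a filter over a SNP annotation table. What follows is only a minimal sketch: the `Snp` record, its field names, the rsIDs and the array names are all hypothetical placeholders, and the 0.4-0.6 window is read as an allele-frequency range. The identity, ethnicity and gender requirements need real population data and are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    """Hypothetical annotation record for one candidate SNP."""
    rsid: str
    allele_freq: float      # frequency of one allele in the population
    is_coding: bool         # True if the SNP lies in coding sequence
    arrays: frozenset       # names of genotyping arrays that carry the SNP


def select_candidates(snps, required_arrays, freq_range=(0.4, 0.6)):
    """Keep coding SNPs in the frequency window that are on every required array."""
    low, high = freq_range
    return [
        s for s in snps
        if low <= s.allele_freq <= high
        and s.is_coding
        and required_arrays <= s.arrays   # subset test: on all required arrays
    ]


# Made-up example data; none of these identifiers are real SNPs or products.
candidates = [
    Snp("rs_hypothetical_1", 0.48, True, frozenset({"array_A", "array_B"})),
    Snp("rs_hypothetical_2", 0.12, True, frozenset({"array_A", "array_B"})),
    Snp("rs_hypothetical_3", 0.51, False, frozenset({"array_A", "array_B"})),
    Snp("rs_hypothetical_4", 0.55, True, frozenset({"array_A"})),
]

panel = select_candidates(candidates, frozenset({"array_A", "array_B"}))
print([s.rsid for s in panel])
```

Only the first record passes all three filters; the others fail on frequency, coding status and array coverage respectively.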
How many: The number of SNPs is going to be higher than the number of STR loci, but it need not be very high, and the “real-estate” taken up by such a HuID-seq panel need not be large in comparison to an NGS clinical test. SNPs need to be present very broadly in the population to be of any use. Ideally, using SNPs with a MAF of 0.5 gives any individual a 50:50 chance of being homozygous (for either allele) or heterozygous. If this 50:50 chance holds true for all SNPs, if the SNPs are inherited independently, and if the assay used is perfect (i.e. no errors in genotyping), then just 4 SNPs give a >90% chance of telling two individuals apart (1 - 0.5^4 = 93.75%). Increasing this to 8 SNPs gives about 99.6%, and 20 SNPs gives a 99.9999% chance. Choosing a set of SNPs such that there is at least one per chromosome arm leads to a set of about 48. The final number of SNPs used could be determined by the requirement for unique identification, cost or complexity of PCR.
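The arithmetic behind these numbers is easy to reproduce. The sketch below assumes that the quoted chance is the probability that two unrelated people differ at one or more SNPs, that the SNPs are independent, and that genotyping is error-free. The function names are mine. The second function is an extra not used in the figures above: it gives the per-SNP match probability when all three genotypes are distinguished under Hardy-Weinberg proportions, which makes each SNP somewhat more informative than the simple 50:50 model.

```python
def chance_distinct(n_snps, p_match=0.5):
    """Probability that two unrelated people differ at one or more of n_snps.

    p_match is the chance that they look identical at a single SNP; 0.5 is
    the simple homozygous-vs-heterozygous model used in the text.
    """
    return 1.0 - p_match ** n_snps


def hwe_match_probability(allele_freq):
    """Per-SNP chance that two people share the same genotype (AA, AB or BB),
    assuming Hardy-Weinberg proportions."""
    p = allele_freq
    q = 1.0 - p
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


for n in (4, 8, 20, 48):
    simple = chance_distinct(n)
    three_genotype = chance_distinct(n, hwe_match_probability(0.5))
    print(f"{n:>2} SNPs: 50:50 model {simple:.6%}, three-genotype model {three_genotype:.6%}")
```

Note that these are pairwise figures. The chance that nobody in a large batch of samples shares a profile is lower, which is one more reason to err on the side of more SNPs.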

Community cohesion: It makes sense that if this approach is going to be used then everyone should use the same set of SNPs. The best way to do this would be to get people from different clinical and research backgrounds in a room to discuss the whys and wherefores of different SNPs and to recommend a set to the community. If Life Tech, Illumina, Roche, or others start marketing too early on then we are likely to see multiple standards. This is exactly what we have in STR kits, with the US and Europe using different sets of loci.

There are already some SNPs being used to QC data by the Broad, but the GATK pages have been updated and the older page is now a dead link! Life Tech are launching an 8 SNP AmpliSeq-based Sample ID kit giving a resolution of 1:5000; you spike the SNP targeting reagents into your assay and go. Illumina are also collaborating on forensic products: Dr. Bruce Budowle at the Institute of Applied Genetics, University of North Texas discusses forensic NGS applications on YouTube on the IlluminaInc channel.

It probably has not escaped your notice that a HuID-seq panel could be used for cell line authentication as well. This is something that often gets attention but is easily forgotten by PhD students and post-docs until it is too late. Making a test part of their ChIP-seq or RNA-seq experiment and comparing back to a reference database would be a simple add-on.
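Comparing back to a reference, whether that is a cell line database or a patient's blood-based genotype, could be as simple as counting matching genotype calls. The following is a rough sketch only. The data layout (a dict of SNP ID to a two-letter genotype string), the names and the 0.95 concordance threshold are all assumptions for illustration, and a real implementation would need to handle missing calls, genotyping errors and the genetic drift seen in cultured lines.

```python
def normalise(genotype):
    """Treat 'AG' and 'GA' as the same call by sorting the two alleles."""
    return "".join(sorted(genotype.upper()))


def concordance(sample, reference):
    """Fraction of shared SNPs at which two genotype profiles agree."""
    shared = [snp for snp in sample if snp in reference]
    if not shared:
        return 0.0
    matches = sum(normalise(sample[s]) == normalise(reference[s]) for s in shared)
    return matches / len(shared)


def best_match(sample, reference_db, min_concordance=0.95):
    """Return the name of the closest reference profile, or None if no
    profile reaches min_concordance (an arbitrary, assumed threshold)."""
    best_name, best_score = None, 0.0
    for name, profile in reference_db.items():
        score = concordance(sample, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_concordance else None


# Made-up profiles; SNP IDs and line names are placeholders.
reference_db = {
    "line_A": {"snp1": "AG", "snp2": "CC", "snp3": "TT"},
    "line_B": {"snp1": "AA", "snp2": "CT", "snp3": "TT"},
}
observed = {"snp1": "GA", "snp2": "CC", "snp3": "TT"}

print(best_match(observed, reference_db))
```

A result of `None` is the interesting case: the sample matches nothing it was expected to match and deserves a second look before any data are interpreted.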

1 comment:

  1. Hi James,

    Both Broad and Sanger already use something like this. I believe the Broad uses 96 SNPs whereas Sanger uses ~30 (a single Sequenom plex). In both cases the markers were selected to be common across human populations, genotyped on all GWAS arrays, and well-captured by all exome capture arrays. Both sets include gender-specific markers.