Thursday 28 May 2015

Book Review: Next-Generation DNA Sequencing Informatics 2nd ed.

Next-Generation DNA Sequencing Informatics (2nd edition), edited by Stuart M. Brown (blogger) is a book to get you well and truly started in NGS Bioinformatics. The twelve chapters cover QC, sequence alignment, assembly, transcriptome and ChIP-seq analysis, visualisation, and much else besides. For this review I read the chapters on QC, RNA-seq and emerging technologies and applications.


Friday 15 May 2015

One Genome in a tiny plastic bottle

The GIAB consortium (@GenomeInABottle) took a major step forward today when it released the first NIST reference material for Human genome sequencing, and the story even made it into the New York Times. It comes at an important time, as we move into an era where millions of people are getting genome-based genetic tests. The GIAB standard will allow labs to demonstrate their capability to detect known variants, and to measure the noise introduced by their tests. The GIAB RM8398 is probably the most sequenced Human sample of all time, and has orders of magnitude more confirmed variants than anything else, including reference calls for SNPs and small indels, and homozygous reference genotypes for almost 80% of the genome. NA12878 has already been referenced in almost 250 publications on PubMed Central.

Wednesday 13 May 2015

#AGBT15 poster highlights

Oops, I forgot to publish this post after the meeting!

So many posters at AGBT and so little time. I did not get round to everything, and these highlights are just some of the many that caught my eye. I'd like to hope the organisers make the posters available in a more searchable and downloadable format next year, since we'll all be nipping into Disney or Harry Potter world! It would also be nice if these were downloadable for anyone, and not just conference attendees!

Cancer Genomics:
  • Detection of oncogenic fusion transcripts using Archer targeted sequencing technology
  • Optimizing tumour genome sequencing and analysis
  • Non-invasive cell-free tumour DNA-based detection of copy-number variations in breast and ovarian cancer samples
  • Detecting structural variants and phasing haplotypes from cancer exome sequencing using 1 ng DNA input (10X Genomics)
  • Comparative analysis of methods to detect copy number variations using single cell sequencing technique
  • Single cell DNA seq of breast cancer
  • Single cell sequencing identifies clonal stasis and punctuated copy-number evolution in triple-negative breast cancers
Epigenomics:
  • Differentiation of 5-methylcytosine and 5-hydroxymethylcytosine using the Illumina Infinium HumanMethylation450 BeadChip
  • A high-throughput method for single cell whole genome bisulfite sequencing
General Biology:
  • Mapping the "dark matter" of the genome with super long molecules – the unknown unknown
  • Unlocking protein expression in single cells using Fluidigm's C1 single-cell auto prep station
Genomics Application Development:
  • A novel target capture technique for next-generation sequencing
  • Exome library construction on an integrated microfluidic device
  • Transposase-mediated sample preparation improvements enable high-throughput variant detection using Human whole exome sequencing
  • Error proofing exome sequencing for somatic variant detection: combining analytic tools and lab process improvements to uncover and reduce artifacts in exome sequencing
  • Contiguity preserving transposition sequencing (CPT-seq) for phasing and assembly
  • Assembly of complete KIR haplotypes from a diploid individual by the direct sequencing of full-length Fosmids
  • Demultiplexing is a significant contributor to apparent cross-contamination between samples
  • Spatial single cell miRNA analysis reveals differential expression within tissue sections
  • G&T-seq: separation and parallel sequencing of the genomes and transcriptomes of single cells
  • Comparison of library prep kits for low input whole exome sequencing
  • Application of NGS technology to rare mutation detection in ctDNA and CTCs
  • High-throughput nucleic acid extraction and library construction from FFPE tissue specimens
Genomics Medicine:
  • Single-molecule DNA analysis of FMR-1 using SIMDEQ
  • Dissecting the genetic architecture of longevity using massive-scale crowd-sourced genealogy
  • Sequencing under a quality system: sensitive detection of molecular biomarkers from whole exomes and custom gene panels in support of clinical trials
  • Design and implementation of an informatics infrastructure for clinical genomics reporting and decision support
  • Centralizing an electronic informed consent platform to enable large-scale genomics research
  • Hypothesis-free gene fusion detection
  • Utilization of whole genome analysis approaches for personalised therapy decision making in patients with advanced malignancies
  • Clinical performance of exome capture technology: Impact of kits, coverage and analysis
  • Targeted RNA-sequencing for simultaneous expression profiling and detection of clinically relevant gene rearrangements in FFPE biopsies
  • A novel next-generation sequencing (NGS)-based companion diagnostic predicts response to the PARP inhibitor Rucaparib in ovarian cancer
Informatics/Computational Biology:
  • A full diploid assembly of a breast cancer genome using read clouds
  • Chromatin structure fully determines replication timing program in Human cells
  • Anchored assembly: accurate structural variant detection using short-read data
  • How much data do we really need for human genomes and exomes?
  • Industrial-scale complete DNA-seq and RNA-seq analysis kits for every researcher
  • Lab7 enterprise sequencing platform: a comprehensive NGS workflow solution
  • Error correction and de novo assembly of Oxford Nanopore sequencing
  • Dollars & science: publishing and patenting strategies in Biotech
  • Towards real-time surveillance approaches using nanopore sequencers
Microbiome:
  • Beer-omics: Microbial populations and dynamics in fermentation
  • 16S rRNA gene sequencing – comparing paired-end and single-end read data generation
  • City-scale DNA dynamics, disease surveillance, and metagenomics profiling
Technology Development:
  • Detection of genetic variants via enzymatic mismatch cleavage
  • Single-cell RNA profiling by spatial transcriptomics in prostate cancer
  • Air-seq: metagenomic sequencing for pathogen surveillance in bioaerosols
  • Integrated DNA and RNA sequencing of single cells
  • Predicting sequencing performance of FFPE samples prior to next-generation sequencing using KAPA human genomic DNA quantification and QC kit
  • Drop-seq: A droplet-based technology for single-cell mRNA-seq analysis on a massive scale

Tuesday 12 May 2015

Making high-throughput RNA-seq easier

The cost of sequencing has dropped precipitously over the last five years, but the cost of library prep has not moved by anywhere near as much. For some experiments this is a major headache, particularly for small genomes, metagenomics, and, for us, RNA-seq. Reducing the cost of consumables in kits is one way to bring down prices, but those savings count for less and less if the protocol still requires a person to spend a week in the lab; developing simpler methods is likely to have more impact. In Simultaneous generation of many RNA-seq libraries in a single reaction by Shishkin et al (Nat Methods 2015), a simple additional RNA-ligation step means hundreds of samples can be processed in a single-tube stranded RNA-seq library prep.


Friday 8 May 2015

I thought p-values were safe to use

One of our statisticians recently co-authored a paper in Nature Methods on the use and misuse of p-values: The fickle P value generates irreproducible results. After reading this paper I really felt that I'd learned something useful: that p-values, which we use to determine if an experimental result is statistically significant, are potentially so variable in the kind of experiments we're doing (3-4 replicate RNA-seq) as to be of little value in their own right.

Many of the statistically minded readers of this blog may well be rolling their eyes at my naivety, but I find stats hard, and explanations of how and why to choose statistical tests are often impenetrable. This paper is clear and uses some good examples (see figure below), although I would have liked them to pick apart some recent RNA-seq experiments!

In a recent lab meeting, when presented with a list of seven candidate definitions of a p-value (some, none or all of which might have been true), only one member of a lab of a dozen or so got the right answer. This is a very small survey, but it does raise the question: how many users of p-values really understand what a p-value means?


The fickle p-value: the main thrust of the paper is that the p-value used to assess whether a test result is significant (or rather, the probability that the result is due to random chance) is itself variable, and that variability is exaggerated because small samples cannot reflect the source population well: in small samples, random variation has a substantial influence. Reporting a confidence interval for the p-value does allow us to see its likely variability. But if a clear experimental design is not reported, then interpreting p-values is as meaningless as interpreting charts without error bars.

In the example used in the paper (see figure above), the authors sample ten times from two populations that differ only slightly, i.e. a small effect size. Whilst the number of replicate measurements appears high (10 compared to the usual 3 for RNA-seq), the effect is small, and this makes the results very susceptible to chance.
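To make this concrete, here's a minimal sketch (my own, not the authors' code) of that sampling experiment: draw repeated samples of ten from two normal populations that differ by an invented, small amount, run a t-test each time, and watch how much p jumps around between "identical" experiments.

```python
# Minimal sketch of the fickle-p idea: the population means and SD below are
# invented for illustration, giving a small standardised effect (Cohen's d = 0.5).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_repeats = 10, 20                  # 10 replicates per group, 20 repeat 'experiments'
mean_a, mean_b, sd = 10.0, 10.5, 1.0   # hypothetical small true difference

p_values = []
for _ in range(n_repeats):
    a = rng.normal(mean_a, sd, n)
    b = rng.normal(mean_b, sd, n)
    p_values.append(stats.ttest_ind(a, b).pvalue)

print("p-values from 20 repeats of the same experiment:")
print(np.round(sorted(p_values), 3))
```

Run it and the sorted p-values typically span everything from comfortably "significant" to nowhere near, even though the two populations never change.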

Effect size affects sample size: One of the questions we ask users in experimental design discussions is how big an effect they are expecting to see. By understanding this, and ideally the biological variability within sample groups, we hope to be able to design a well-powered study. Most users are unable to specify effect size or sample-group variability with any precision, and the experiment is often the first time they've seen this data. In essence we might perhaps consider every experiment a pilot.

Because of this, the only thing we can reasonably adjust in our experiment is the sample size, i.e. the number of biological replicates in each group. However, this means more work, more time, and more money, so users are often reluctant to increase replicate numbers much beyond four.

We might be better off asking users what the minimum effect size they think they can publish will be, and designing the experiment appropriately. Unfortunately we can't help determine what the smallest meaningful effect might be – most gene expression experiments start by looking for changes of 2-fold or more, but we see knock-down experiments where only a small amount of protein is left with no discernible effect on phenotype.
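If a user can at least name that minimum publishable effect, a standard power calculation will turn it into a replicate number. Here's a rough sketch using statsmodels; the effect sizes are hypothetical, standardised values (Cohen's d), not taken from the paper.

```python
# Rough sketch: required replicates per group for a two-sample t-test at
# alpha = 0.05 and 80% power, over a few hypothetical effect sizes.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()
for d in (0.5, 1.0, 2.0):              # hypothetical standardised effect sizes
    n = calc.solve_power(effect_size=d, alpha=0.05, power=0.8,
                         ratio=1.0, alternative='two-sided')
    print(f"d = {d}: ~{n:.0f} replicates per group for 80% power")
```

For a subtle effect (d = 0.5) the answer is roughly 64 replicates per group, which is exactly why three or four replicates only ever buys you the big effects.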

What to use instead, or what to report along with p: The authors do suggest some other statistical measures to use instead of p. They also suggest reporting more details of the experiment: sample size and effect size. Perhaps the most important suggestion, and also a very simple one, is to report the 95% confidence intervals associated with the p-value, which tell us the likely value of p under repeated sampling.

Commentary on the paper: There has been a pretty good discussion of the paper in the commentary on The Conversation. One of the commenters stated that they thought a p-value included an estimate of variability; the reply to this comment, and the follow-up, say a lot about how people like me view p.

“Geoff Cumming has quantified the variability around p. For example, if a study obtains P = 0.03, (regardless of statistical power) there is a 90% chance that a replicate study would return a P value somewhere between the wide range of 0 to 0.6 (90% prediction intervals), while the chances of P ≤ 0.05 is just 56%. You can't get much more varied than that!”

“I thought P already contained the '90% ...' calculations. When one hears that the probability of something is X +- Y 19 times out of 20, it sounds like that's what P is, and that it 'knows' it's own variability. This simplistic P that you seem to describe sounds almost childishly useless. Why quote a datum when it is known that the next test would produce a quite different result?”

Is your RNA-seq experiment OK: What is the impact of this on your last RNA-seq experiment? I asked one of the authors to look at data from my lab and the results were not as great as I'd like... but this experiment was done with a collaborator who refused to add more replicates, so it was somewhat doomed. Don't worry, this was an external project!

Want to learn more about stats: The authors have published a series of introductory statistics articles aimed at biologists in the Journal of Physiology. You might also check out theanalysisfactor ("making statistics make sense"). They have a great post titled 5 Ways to Increase Power in a Study, which I've tried to summarise below, with a rough power calculation sketched after the list:

To increase power:
  1. Increase alpha
  2. Conduct a one-tailed test
  3. Increase the effect size
  4. Decrease random error
  5. Increase sample size
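Out of curiosity, here is that list turned into numbers with a quick power calculation (again a sketch with statsmodels; the baseline of d = 0.5 and four replicates per group is invented to look like a typical RNA-seq design):

```python
# Sketch of how each lever moves power for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()
base = dict(effect_size=0.5, nobs1=4, alpha=0.05, alternative='two-sided')

print("baseline              :", round(calc.power(**base), 2))
print("1. alpha 0.05 -> 0.10 :", round(calc.power(**{**base, 'alpha': 0.10}), 2))
print("2. one-tailed test    :", round(calc.power(**{**base, 'alternative': 'larger'}), 2))
print("3. effect 0.5 -> 1.0  :", round(calc.power(**{**base, 'effect_size': 1.0}), 2))
print("5. n 4 -> 12 per group:", round(calc.power(**{**base, 'nobs1': 12}), 2))
# 4. "decrease random error" acts through the effect size: less noise means a
#    larger standardised effect (Cohen's d) for the same raw difference in means.
```

The message is much the same as the paper's: at typical replicate numbers only large effects are well powered, and the other levers on the list mostly nibble at the edges.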

Thursday 7 May 2015

Book Review: Statistics for Biologists

This is sort of a book review: Statistics for Biologists is a collection of articles from the Nature journals including the Nat Meth Points of Significance column. This collection brings together the POS articles published since 2013, which aim to provide biologists with a basic introduction to stats. There are practical guides and other resources including a box-plot tool for R (try creating one in Excel).