Thursday, 28 May 2015

Book Review: Next-Generation DNA Sequencing Informatics 2nd ed.

Next-Generation DNA Sequencing Informatics (2nd edition), edited by Stuart M. Brown (blogger) is a book to get you well and truly started in NGS Bioinformatics. The twelve chapters cover QC, sequence alignment, assembly, transcriptome and ChIP-seq analysis, visualisation, and much else besides. For this review I read the chapters on QC, RNA-seq and emerging technologies and applications.

Friday, 15 May 2015

One Genome in a tiny plastic bottle

The GIAB consortium (@GenomeInABottle) took a major step forward today when it released the first NIST reference material for human genome sequencing; the story even made it into the New York Times. It comes at an important time, as we move into an era where millions of people are getting genome-based genetic tests. The GIAB standard will allow labs to demonstrate their capability to detect known variants, and to measure the noise introduced by their tests. The GIAB RM8398 is probably the most sequenced human sample of all time, and has orders of magnitude more confirmed variants than anything else, including reference calls for SNPs, small indels, and homozygous reference genotypes for almost 80% of the genome. NA12878 has already been referenced in almost 250 publications on PubMed Central.

Wednesday, 13 May 2015

#AGBT15 poster highlights

Oops, I forgot to publish this post after the meeting!

So many posters at AGBT and so little time. I didn't get round to everything, and these highlights are just some of the many that caught my eye. I'd hope the organisers make the posters available in a more searchable and downloadable format for next year, since we'll all be nipping into Disney or Harry Potter world! It would also be nice if these were downloadable for anyone, and not just conference attendees!

Cancer Genomics:
  • Detection of oncogenic fusion transcripts using Archer targeted sequencing technology
  • Optimizing tumour genome sequencing and analysis
  • Non-invasive cell-free tumour DNA-based detection of copy-number variations in breast and ovarian cancer samples
  • Detecting structural variants and phasing haplotypes from cancer exome sequencing using 1ng DNA input (10X Genomics)
  • Comparative analysis of methods to detect copy number variations using single cell sequencing technique
  • Single cell DNA seq of breast cancer
  • Single cell sequencing identifies clonal stasis and punctuated copy-number evolution in triple-negative breast cancers
  • Differentiation of 5-methylcytosine and 5-hydroxymethylcytosine using the Illumina Infinium HumanMethylation450 BeadChip
  • A high-throughput method for single cell whole genome bisulfite sequencing
General Biology:
  • Mapping the "dark matter" of the genome with super long molecules – the unknown unknown
  • Unlocking protein expression in single cells using Fluidigm's C1 single-cell auto prep station
Genomics Application Development:
  • A novel target capture technique for next-generation sequencing
  • Exome library construction on an integrated microfluidic device
  • Transposase-mediated sample preparation improvements enable high-throughput variant detection using Human whole exome sequencing
  • Error proofing exome sequencing for somatic variant detection: combining analytic tools and lab process improvements to uncover and reduce artifacts in exome sequencing
  • Contiguity preserving transposition sequencing (CPT-seq) for phasing and assembly
  • Assembly of complete KIR haplotypes from a diploid individual by the direct sequencing of full-length Fosmids
  • Demultiplexing is a significant contributor to apparent cross-contamination between samples
  • Spatial single cell miRNA analysis reveals differential expression within tissue sections
  • G&T-seq: separation and parallel sequencing of the genomes and transcriptomes of single cells
  • Comparison of library prep kits for low input whole exome sequencing
  • Application of NGS technology to rare mutation detection in ctDNA and CTCs
  • High-throughput nucleic acid extraction and library construction from FFPE tissue specimens
Genomics Medicine:
  • Single-molecule DNA analysis of FMR-1 using SIMDEQ
  • Dissecting the genetic architecture of longevity using massive-scale crowd-sourced genealogy
  • Sequencing under a quality system: sensitive detection of molecular biomarkers from whole exomes and custom gene panels in support of clinical trials
  • Design and implementation of an informatics infrastructure for clinical genomics reporting and decision support
  • Centralizing an electronic informed consent platform to enable large-scale genomics research
  • Hypothesis-free gene fusion detection
  • Utilization of whole genome analysis approaches for personalised therapy decision making in patients with advanced malignancies
  • Clinical performance of exome capture technology: Impact of kits, coverage and analysis
  • Targeted RNA-sequencing for simultaneous expression profiling and detection of clinically relevant gene rearrangements in FFPE biopsies
  • A novel next-generation sequencing (NGS)-based companion diagnostic predicts response to the PARP inhibitor Rucaparib in ovarian cancer
Informatics/Computational Biology:
  • A full diploid assembly of a breast cancer genome using read clouds
  • Chromatin structure fully determines replication timing program in Human cells
  • Anchored assembly: accurate structural variant detection using short-read data
  • How much data do we really need for human genomes and exomes?
  • Industrial-scale complete DNA-seq and RNA-seq analysis kits for every researcher
  • Lab7 enterprise sequencing platform: a comprehensive NGS workflow solution
  • Error correction and de novo assembly of Oxford Nanopore sequencing
  • Dollars & science: publishing and patenting strategies in Biotech
  • Towards real-time surveillance approaches using nanopore sequencers
  • Beer-omics: Microbial populations and dynamics in fermentation
  • 16S rRNA gene sequencing – comparing paired-end and single-end read data generation
  • City-scale DNA dynamics, disease surveillance, and metagenomics profiling
Technology Development:
  • Detection of genetic variants via enzymatic mismatch cleavage
  • Single-cell RNA profiling by spatial transcriptomics in prostate cancer
  • Air-seq: metagenomic sequencing for pathogen surveillance in bioaerosols
  • Integrated DNA and RNA sequencing of single cells
  • Predicting sequencing performance of FFPE samples prior to next-generation sequencing using KAPA human genomic DNA quantification and QC kit
  • Drop-seq: A droplet-based technology for single-cell mRNA-seq analysis on a massive scale

Tuesday, 12 May 2015

Making high-throughput RNA-seq easier

The cost of sequencing has dropped precipitously over the last five years, but the cost of library prep has not moved by anywhere near as much. For some experiments this is a major headache, particularly for small genomes, metagenomics and, for us, RNA-seq. Reducing the cost of consumables in kits is one way to bring down prices, but the savings get smaller and smaller if the protocol still requires a person to spend a week in the lab; developing simpler methods is likely to have more impact. In Simultaneous generation of many RNA-seq libraries in a single reaction by Shishkin et al (Nat Methods 2015), a simple additional RNA-ligation step means hundreds of samples can be processed in a single-tube stranded RNA-seq library prep.

Friday, 8 May 2015

I thought p-values were safe to use

One of our statisticians recently co-authored a paper in Nature Methods on the use and misuse of p-values: The fickle P value generates irreproducible results. After reading this paper I really felt I'd learned something useful: that p-values, which we use to determine whether an experimental result is statistically significant, are potentially so variable in the kind of experiments we're doing (3-4 replicate RNA-seq) as to be of little value on their own.

Many of the statistically minded readers of this blog may well be rolling their eyes at my naivety, but I find stats hard, and explanations of how and why to choose statistical tests are often impenetrable. This paper is clear and uses some good examples (see figure below). I would have liked them to pick apart some recent RNA-seq experiments, though!

In a recent lab meeting, when presented with a list of 7 options for the definition of a p-value (some, none or all of which were true), only one member of a lab of a dozen or so got the right answer. This is a very small survey, but it does raise the question: how many users of p-values really understand what a p-value means?

The fickle p-value: the main thrust of the paper is that the p-value used to assess whether a test result is significant (or rather, how likely it is that a result at least this extreme arose by chance alone) is itself variable, and that this variability is exaggerated because small samples cannot reflect the source population well: in small samples, random variation has a substantial influence. Reporting the confidence interval for the p-value does allow us to see its likely variability. But if a clear experimental design is not reported then interpreting p-values is as meaningless as interpreting charts without error bars.

In the example used in the paper (see figure above) the authors sample ten times from two populations that have only a small difference, i.e. a small effect size. Whilst the number of replicate measurements appears high (10 compared to the usual 3 for RNA-seq), the size of the effect is small and this makes the results very susceptible to chance.
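The paper's sampling experiment is easy to reproduce with a quick simulation (my own sketch, not the authors' code; the populations and effect size here are assumed for illustration): draw repeated small samples from two normal populations separated by a small true effect, and watch the t-test p-value bounce around between runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_p_values(n=10, effect_size=0.5, n_experiments=1000):
    """Repeat the same two-group experiment many times and
    collect the t-test p-value from each repetition."""
    p_values = []
    for _ in range(n_experiments):
        a = rng.normal(0.0, 1.0, n)          # control group
        b = rng.normal(effect_size, 1.0, n)  # treated group, small true effect
        p_values.append(stats.ttest_ind(a, b).pvalue)
    return np.array(p_values)

p = simulate_p_values()
# Even with a real underlying effect, p swings from "highly
# significant" to "nowhere near significant" across repeats.
print(f"min p = {p.min():.4f}, median p = {np.median(p):.3f}, max p = {p.max():.3f}")
print(f"fraction of experiments with p < 0.05: {(p < 0.05).mean():.2f}")
```

Running this, identical experiments on the same two populations scatter their p-values across almost the whole 0-1 range, which is exactly the fickleness the paper describes.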

Effect size affects sample size: One of the questions we ask users in experimental design discussions is how big an effect they are expecting to see. By understanding this, and ideally the biological variability within sample groups, we hope to be able to design a well-powered study. Most users are unable to specify effect size or sample-group variability with any precision, and the experiment is often the first time they've seen this data. In essence we might perhaps consider every experiment a pilot.

Because of this, the only thing we can reasonably adjust in our experiment is the sample size, i.e. the number of biological replicates in each group. However, this always means more work, more time, and more money, so users are often reluctant to increase replicate numbers much beyond four.

We might be better off asking users what the minimum effect size they think they can publish will be, and designing the experiment appropriately. Unfortunately we can't help determine what the smallest meaningful effect might be – most gene expression experiments start by looking for changes of 2-fold or more, but we see knock-down experiments where only a small amount of protein is left with no discernible effect on phenotype.

What to use instead, or what to report along with p: The authors do suggest some other statistical measures to use instead of p, and they also suggest reporting more details of the experiment: sample size and effect size. Perhaps the most important suggestion, and also a very simple one, is to report the 95% confidence interval associated with the p-value, which tells us the likely range of p under repeated sampling.

Commentary on the paper: There has been a pretty good discussion of the paper in the commentary on The Conversation. One commenter stated that they thought a p-value included an estimate of its own variability; the reply to this comment, and the follow-up, say a lot about how people like me view p.

“Geoff Cumming has quantified the variability around p. For example, if a study obtains P = 0.03, (regardless of statistical power) there is a 90% chance that a replicate study would return a P value somewhere between the wide range of 0 to 0.6 (90% prediction intervals), while the chances of P ≤ 0.05 is just 56%. You can't get much more varied than that!”

“I thought P already contained the '90% ...' calculations. When one hears that the probability of something is X +- Y 19 times out of 20, it sounds like that's what P is, and that it 'knows' it's own variability. This simplistic P that you seem to describe sounds almost childishly useless. Why quote a datum when it is known that the next test would produce a quite different result?”

Is your RNA-seq experiment OK: What is the impact of this on your last RNA-seq experiment? I asked one of the authors to look at data from my lab and the results were…

Want to learn more about stats: The authors have published a series of introductory statistics articles aimed at biologists in the Journal of Physiology. You might also check out theanalysisfactor ("making statistics make sense"). They have a great post titled 5 Ways to Increase Power in a Study, which I've tried to summarise below:

To increase power:
  1. Increase alpha
  2. Conduct a one-tailed test
  3. Increase the effect size
  4. Decrease random error
  5. Increase sample size
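Point 5 is the lever we usually pull, and its effect is easy to see by simulation (a sketch with assumed effect size and variability, not numbers from the post): estimate power empirically as the fraction of simulated experiments that reach significance, at several sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(n, effect_size=0.5, alpha=0.05, n_experiments=2000):
    """Empirical power: the fraction of simulated two-group
    experiments that reach significance at the chosen alpha."""
    hits = 0
    for _ in range(n_experiments):
        a = rng.normal(0.0, 1.0, n)          # control group
        b = rng.normal(effect_size, 1.0, n)  # treated group
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_experiments

# Power climbs steeply with the number of replicates per group.
for n in (3, 10, 30, 100):
    print(f"n = {n:3d} per group -> power ~ {simulated_power(n):.2f}")
```

With 3 replicates and a modest effect, most "real" differences will be missed; the same simulation with a larger effect_size (point 3) or a larger alpha (point 1) shows those levers working too.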

Thursday, 7 May 2015

Book Review: Statistics for Biologists

This is sort of a book review: Statistics for Biologists is a collection of articles from the Nature journals including the Nat Meth Points of Significance column. This collection brings together the POS articles published since 2013, which aim to provide biologists with a basic introduction to stats. There are practical guides and other resources including a box-plot tool for R (try creating one in Excel).

Tuesday, 28 April 2015

Troubleshooting DNA ligation in NGS library prep

We're currently testing some new methods in the lab to find an optimal exome library prep (not the capture, just the prep). The ideal would be a PCR-free exome; however, we want to work with limited material, so maximising library prep efficiency is key and we'll still use some PCR. The two main factors we're considering are ligation temperature/time, and DNA:adapter molar ratio. The major impact of increasing ligation efficiency is to maximise library diversity, and this applies whatever your DNA input. Even if you're not working with low-input samples, high-diversity libraries minimise the sequencing required for almost all applications.

During discussions with some users it became evident that not everyone knows what the critical parts of a DNA ligation reaction are, and since adapter ligation is key to the success of many NGS library preps I thought it would be worthwhile summarising some key points here.
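As a worked example of the DNA:adapter molar ratio point (all numbers here are illustrative, not our protocol; the 650 g/mol per base pair figure is the standard average mass of double-stranded DNA), converting mass and fragment length to moles shows why adapters need to be in molar excess over inserts:

```python
# Rough molar-ratio calculation for adapter ligation.
# Assumes dsDNA at ~650 g/mol per base pair (a standard average);
# the input amounts below are illustrative only.
AVG_BP_MASS = 650  # g/mol per bp of dsDNA

def pmol(mass_ng, length_bp):
    """Convert a mass of dsDNA (ng) at a given length (bp) to picomoles."""
    grams = mass_ng * 1e-9
    mol = grams / (length_bp * AVG_BP_MASS)
    return mol * 1e12

# e.g. 500 ng of 350 bp sheared input vs adapters added at, say, 15 pmol
insert_pmol = pmol(500, 350)
adapter_pmol = 15.0
print(f"insert: {insert_pmol:.2f} pmol")
print(f"adapter:insert molar ratio ~ {adapter_pmol / insert_pmol:.1f}:1")
```

The same mass of DNA at half the fragment length contains twice the molar amount of ligatable ends, which is why the ratio has to be recalculated whenever input amount or shearing size changes.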

Image taken from Bob Lehman's 1974 Science paper

Friday, 24 April 2015

Mammoth de-extinction: not good for elephants, not good for science?

Two woolly mammoths (Mammuthus primigenius) have had their genomes sequenced by a team led by the Swedish Museum of Natural History: Complete Genomes Reveal Signatures of Demographic and Genetic Declines in the Woolly Mammoth. The BBC story included coverage of the Long Now Foundation and their plans for de-extinction via genetic rescue; "to produce new mammoths... repopulate [the] tundra and boreal forest in Eurasia and North America", but "not to make perfect copies of extinct woolly mammoths, but to focus on the mammoth adaptations needed for Asian elephants to live in the cold climate of the tundra".

The Mammoth genome story is likely to be big news and I think that is unfortunate, not just for the elephants that are going to get fur coats and be shipped off to cooler climes, but also for the perception of science and scientists. It perpetuates the mad-scientist image and people will inevitably think of films like Jurassic Park. I find it difficult to think of reasons why we would actually need/want to adapt Asian elephants (why not African elephants too) for modern Siberia. Is anyone honestly going to use genome editing on a large scale to make a hairy elephant so it can live in the cold? This kind of coverage is not especially good for science, but it is probably great for your rating on Google!

The mammoth genomes: The two genomes are separated by 40,000 years: the first was from a 44,800-year-old (Late Pleistocene) juvenile found in Siberia; the second from a 4,300-year-old molar taken from a mammoth that lived in what was probably the last extant population, on Wrangel Island.

Library construction was pretty standard, with the addition of a UNG step to remove uracil bases (resulting from cytosine deamination) that reduced C>T artefacts. Genomes were aligned to an (unpublished) 7x Sanger-seq African elephant (Loxodonta africana) genome, LoxAfr4. Alignment to the reference showed differences in the average length of perfectly mapping reads: 55bp and 69bp for the Siberian and Wrangel Island individuals respectively. Population size was estimated by measuring the density of heterozygous sites in the genomes; the authors are explicit in stating that their analysis is probabilistic and they "always quote a range of uncertainty". This analysis suggested two population bottlenecks: the first 280,000 years ago, and the second more recently, 12,500 years ago at the start of the Holocene. This second bottleneck indicates a probable significant drop in mammoth diversity in the time just before extinction, possibly due to in-breeding. The Wrangel Island sample had large regions termed "runs of homozygosity" covering about 23% of the genome.

There is a possibility that more genomes are coming as the group sequenced DNA from 10 individuals to find one good one, and are on the hunt for more to better understand mammoth diversity and the reasons behind extinction. 

De-extinction: Beth Shapiro at UCSC is the author of the book How to Clone a Mammoth, and she'll be doing a talk and book signing at Oxford University Museum of Natural History on Tuesday, 19 May 2015 at 6PM; she previously published research on the museum's Dodo.
In How to Clone a Mammoth she discusses the challenges, both scientific and ethical, that make de-extinction tough: "I question if it's something we should do at all"! Cloning is likely to be impossible, so we'll have to resort to genome editing or recombineering to get extinct species' DNA into their closest modern relatives.

We need good science stories to hook news broadcasters in whichever media we can, but with impact being one of the metrics many PIs are judged on nowadays it might be too tempting to spin a story a little too hard. Chris Mason's recent Cell paper on New York's subway metagenome got some tough criticism for over-playing the levels of Anthrax and Plague, although the paper itself is pretty clear about the realities of what the data show.

I don't imagine the headlines "Scientists discover people don't always wash their hands after going to the loo" or "Scientists confirm elephants are related to mammoths" would have elicited such high-profile coverage!

Monday, 20 April 2015

Book Review: Dr Henry Marsh "Do No Harm: Stories of Life, Death and Brain Surgery"

Do No Harm: Stories of Life, Death and Brain Surgery by Henry Marsh is one of the best books I've ever read; the last time a book made me cry I was a kid. I was in tears several times, including on a flight from London to Helsinki: no holding back, a pure emotional roller-coaster.

As a cancer research scientist (genomics core facility head in a cancer research institute) I rarely see or talk to cancer patients; we do have tours a few times a year and speaking to people living, and dying, with cancer brings home the real reason we're all doing our jobs.

Dr Henry Marsh is a neurosurgeon, and this book describes his career through the lives of patients he has operated on; some are cured, some die, and some are unlucky: they live, but with the terrible repercussions of treatment gone wrong. When something goes wrong in my lab I might get tied up in knots about the loss of an £11,000 Illumina PE125bp run (who's going to pay, were the samples irreplaceable, etc.), but when surgery goes wrong for Dr Marsh the results are catastrophic. He discusses the good and bad of his career with unflinching honesty and genuine emotion.

Read Chapter 1: Pineocytoma from the Orion Books website.