Saturday, 4 July 2015

The DarXer Side of publishing on the arXiv

The use of preprint servers like the original arXiv and bioRxiv appears to be growing among some of the groups I follow. You've only got to read Jeff Leek's post about this and their Ballgown paper (published at Nature Biotechnology), or Nick Loman's and Casey Bergman's 2012 blog posts, to see why. Ease of reporting new results, a good way to share preliminary data, a marker for being first to publish: these are all good points. But posting on the arXiv is not the same as publishing in a peer-reviewed journal (this post is not about the pros and cons of peer review), and I hope everyone would accept that? And in Nature, Jan Conrad at Stockholm University has written a commentary on arXiv's darker side; his focus is very Physics-heavy, but this is unsurprising given the birth of the arXiv in the physical sciences.
arXiv (biological science submissions expanded)

What is the arXiv: arXiv was born in 1991 as a central repository for scientific manuscripts in TeX formats (LaTeX etc), with a strong focus on physics and maths. Listings are in order of posting. There is no peer review, although according to Wikipedia "a majority of the e-prints are also submitted to journals for publication" (although they don't say how many of these are rejected). arXiv is pronounced "archive", as if the "X" were the Greek letter chi (χ).

Who publishes on the arXiv: Lots of people, but mainly physicists (see "Where do biologists go" below)! The one millionth post happened in 2014 and there are currently over 8,000 new posts per month. The figure above on the right shows how small the number of biological submissions is - about 1.6% of the monthly total (yellow is Quantitative Biology). On the left you can see a breakdown of submissions by biological sub-category (dark blue is genomics).

Where do biologists go: The bioRxiv was set up for preprints in the life sciences in late 2013 and is intended to complement the original arXiv (it has been covered here, here, here, here). It is grouped into multiple subdisciplines, including genomics, cancer biology and bioinformatics. Papers get digital object identifiers (DOIs) so you can cite them, and papers are submitted as New Results, Confirmatory Results, or Contradictory Results.

I could not find bioRxiv usage stats like those in the image above. Almost 1,600 papers have been submitted since launch, with around 30% in genomics or bioinformatics. Pathologists are conservative folks, which might explain why there are only 4 papers in that category - although I'd not have read this HER2 paper if I'd not written this post!

The darker side of the arXiv: Prof Conrad's commentary is driven by a slew of major 'discoveries' in his field, many of which are turning out to be false alarms. The worrying part of his article is that some of the authors of these pieces appear to have been aware of other data that disproved their theories, but chose to 'publish' regardless and then followed up with big press releases to raise their profile. This could have a negative impact on science funding and on the public perception of science, especially if the big news stories get shot down in flames.

He suggests that "online publishing of draft papers without proper refereeing have eroded traditional standards for making extraordinary claims". To illustrate this he references a recent arXiv paper reporting the discovery of dark matter, but using data that were preliminary and suggestive rather than final and conclusive. The same day saw a second paper that refuted this claim using the same data, but with a more sensitive analysis run on upgraded software. The crazy thing was that the first paper acknowledged this upgrade was coming, yet did not wait before 'publishing' on the arXiv to make their mark. The story was widely reported, with coverage focusing on the first claim, not the later refutation.


I wrote this post in response to a Tweet quoting the article: "Journals should discourage the referencing of arXiv papers." I think the article is a balanced one and contains important messages beyond the quote picked up on Twitter.


It is interesting to speculate about who will scrutinise the bioRxiv. The great Retraction Watch blog is unlikely to be able to keep up if the bioRxiv grows as quickly as its big brother. But bioRxiv papers need to be watched, and it will be interesting to see whether community moderation is effective.

Thursday, 2 July 2015

Does survival of the fittest apply to bioinformatics tools?

What do 48 replicates tell you about RNA-seq DGE analysis methods? That two of the most widely used tools, DESeq and edgeR, are probably the best tools for the job*. These two tools also top the rankings of RNA-seq methods as assessed by citations, with 1,204 and 822 respectively. These are the conclusions of probably the most highly replicated RNA-seq study to date**. The authors aimed to identify the correct number of replicates to use, and concluded that we should be using ~6 replicates for standard RNA-seq, increasing to ~12 when identifying DGE irrespective of fold-change.

Thursday, 25 June 2015

Sequencers for sale

Back in 2010 a HiSeq 2000 cost $690,000 - you'll be lucky to get anything for one today. A V4 2500 might get you $100k+, but your V4-capable V3 just $20k. Depreciation sucks!




In accountancy, depreciation refers to the decrease in value of an asset over time. Most of us come across this when buying a car - for me that's usually on the nice side, where I'm wondering why someone is selling a car they bought five years ago for so little!

But no asset lasts forever, and certainly not in the world of NGS. That "faster than Moore's law" graph you see at almost every NGS conference has a dark side, and instrument depreciation is part of it. If this depreciation were accounted for properly, then sequencing costs in most labs would jump due to the relatively low number of runs performed.
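As a rough illustration, take the 2010 list price quoted above and assume (my assumption, for illustration only) a $20k resale value after five years; the implied annual depreciation rate can be sketched as:

```python
# Sketch: implied annual declining-balance depreciation rate for a sequencer
# bought at $690,000 and assumed to be worth ~$20,000 five years later.
# The resale value and five-year horizon are illustrative assumptions.

def implied_annual_depreciation(purchase_price, resale_value, years):
    """Annual rate r such that purchase_price * (1 - r)**years == resale_value."""
    return 1 - (resale_value / purchase_price) ** (1 / years)

rate = implied_annual_depreciation(690_000, 20_000, 5)
print(f"Implied depreciation: {rate:.0%} per year")  # roughly half the value lost each year
```

Under those assumptions the instrument loses around half its value every year, which is far steeper than most lab costing models allow for.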

Incidentally if you have a MiSeq sitting idle then these are pretty hot right now - I can probably put you in touch with someone who wants to buy one if you are interested!

Wednesday, 24 June 2015

King of the hill: which journal is best?

A paper on the bioRxiv describes a new metric for ranking scientific journals, based on their efficiency of information distribution (citations) within the network of journals. It builds up a complex picture of the intricate relations between scientific journals; but basically Science, Nature and PNAS are the top 3 journals!

Monday, 22 June 2015

Even more microfluidic RNA-seq: Drop-seq

I realised I'd not covered the recent droplet sequencing papers (the post was sitting in my drafts pile) and wanted to make sure I did after posting on single-cell RNA-seq last week. Droplets have some benefits over microfluidics, where the current leader of the field was, until recently, limited to 96 cells per run. Droplets seem to offer almost unlimited numbers of cells, although realistically going past 1,000 seems to be beyond the needs of many experiments (do you agree?). And with the release of the upgraded C1 offering 800-cell capacity, the off-the-shelf ease of Fluidigm's system is likely to keep it popular for a while to come. However, the Fluidigm chips are biased in the cells they capture, as evidenced by the separate chips for differently sized cells, and the droplet methods' lower bias may ultimately win people over.

Droplets are also likely to be much more cost-effective, at least for the labs that can build their own systems. Drop-seq is around $0.10 per cell, compared to $10.00 on the 800-cell C1 chip (according to GenomeWeb). We've also been testing the Cellular Research technology, which gets to somewhere in between with the Resolve system. However, for ultimate spatial information, in situ sequencing is likely to be difficult to beat.
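Taking the per-cell figures above at face value, the cost gap per run is easy to sketch (the 800-cell run size matches the C1 chip; using it for Drop-seq too is just for a like-for-like comparison):

```python
# Per-run library prep cost using the per-cell prices quoted above
# ($0.10/cell for Drop-seq, $10.00/cell on the 800-cell C1 chip, per GenomeWeb).

DROP_SEQ_PER_CELL = 0.10   # dollars per cell
C1_PER_CELL = 10.00        # dollars per cell

def run_cost(per_cell_price, n_cells):
    """Total library prep cost for a run of n_cells."""
    return per_cell_price * n_cells

print(run_cost(DROP_SEQ_PER_CELL, 800))  # 80.0   - same 800 cells on Drop-seq
print(run_cost(C1_PER_CELL, 800))        # 8000.0 - a full C1 chip
```

A hundred-fold difference per cell adds up quickly, which is why the do-it-yourself droplet systems look so attractive despite the build effort.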



Monday, 15 June 2015

More microfluidic single-cell RNA-seq

A team from Columbia University present a microfluidic device that can capture single-cell lysates in picotiter plates and produce single-cell 3' tag RNA-seq libraries at $0.10 per cell: Scalable microfluidics for single cell RNA printing and sequencing. They discuss two methods - RNA "printing" on glass and RNA capture on beads - but the paper focuses on the second of these. It is an early paper, with two experiments, each of which could be improved by another round or two in the lab. However, in this fast-moving field I can't help but wonder if this is a method that will never be commercialised, and if the authors simply want to make their mark. I think the fact that single-cell methods are appearing thick and fast shows how much space there still is to innovate into. And for the investment pundits looking at Fluidigm et al, there is a danger that a clever molecular biology approach could allow single-cell analysis without any hardware - imagine library prep inside a cell?

Thursday, 11 June 2015

RNA-seq contextualised: what's possible in 2D and in situ RNA sequencing

When George Church talks about something, it often turns out to be a good idea to listen. He's been talking about in situ sequencing for many years, and the technology looks ready for take-off. It could remain a niche method used by a few very skilled groups, but if companies like Spatial Transcriptomics get their way we'll be using it routinely. I think in situ sequencing could be massive, but we'll see over the next eighteen months or so if I'm right. A big question remains over how easily the technology can move from sequencing RNA in the cytoplasm to DNA in the nucleus. Being able to call mutations in cells from a tissue biopsy would be great, but the inaccessibility of DNA might mean we'll stick to expressed mutations for now.

A simplified workflow of the Spatial Transcriptomics technology

Tuesday, 9 June 2015

Nanopore library prep kit anyone?

A year ago I surveyed the costs of Illumina library prep and found over 15 providers; they generally offer the standard Illumina library prep method, but with prices ranging from £15-£60 per sample. There was some real innovation amongst some of these kits too, enabling methods and applications not supported by Illumina's own TruSeq kits. Illumina have created an ecosystem for other companies to feed on.


With developments coming thick and fast from ONT and the people taking part in the MinION Access Programme, I wonder when the first nanopore library prep kit might be released by a third party? I'm sure a similar ecosystem will be created as MAPpers release protocols and the system becomes more widely adopted.

Saturday, 6 June 2015

Chromosome linkage to disease

I'm often trying to find an image for a post, and it can be tough to find something that can be used freely. The U.S. DOE has an image gallery from the Genomic Science program, which includes archived images from the Human Genome Project. Many of the images do not need permissions; simply credit the DOE Genomic Science program (http://genomicscience.energy.gov). I love their 2006 poster showing the numerous genetic disorders and traits mapped to specific chromosomes.


But I want more: Unfortunately I often can't find the kind of image I'm looking for... please let me know of good sites to use?

Update: I was contacted by Anna Hupalowska about her work, some wonderful stuff, take a look.

If anyone else gets in touch I'll start a list here of scientific illustrators for hire!

Wednesday, 3 June 2015

Exomes from a spot of blood

Michael Snyder has a great paper published in AJRCCM: Exome Sequencing of Neonatal Blood Spots Identifies Genes Implicated in Bronchopulmonary Dysplasia (BPD). BPD is a lung disease of premature babies that appears to have a strong genetic component, and it was investigated here using exome sequencing. The neat twist is that they used the blood spot taken from a heel-prick to extract DNA for exome library prep. In the paper they report finding over 250 rare nonsynonymous mutations after comparing 50 affected and unaffected twin pairs. Many of these were enriched for, and up-regulated in, a murine model of BPD.
 
Image from: http://en.wikipedia.org/wiki/Robert_Guthrie