Saturday, 4 July 2015

The DarXer Side of publishing on the arXiv

The use of pre-print servers like the original arXiv and bioRxiv appears to be growing among some of the groups I follow. You've only got to read Jeff Leek's post about this and their Ballgown paper (published at NatBiotech) or Nick Loman's or Casey Bergman's 2012 blog posts to see why. Ease of reporting new results, a good way to share preliminary data, a marker for 1st to publish, etc. are all good points; but posting on the arXiv is not the same as publishing in a peer-reviewed journal (this post is not about the pros and cons of peer-review), and I hope everyone would accept that. And in Nature, Jan Conrad at Stockholm University writes a commentary on arXiv's darker side; his focus is very physics-heavy, but this is unsurprising given the birth of the arXiv in the physical sciences.
arXiv (biological science submissions expanded)

What is the arXiv: arXiv was born in 1991 as a central repository for scientific manuscripts in the TeX format (LaTeX etc) with a strong focus on physics and maths. Listings are in order of posting. There is no peer review, although according to Wikipedia "a majority of the e-prints are also submitted to journals for publication" (they don't say how many of these are rejected). arXiv is pronounced "archive", as if the "X" were the Greek letter Chi (χ).

Who publishes on the arXiv: Lots of people, but mainly physicists (see "Where do biologists go" below)! The 1 millionth post happened in 2014 and there are currently over 8,000 new posts per month. The figure above on the right shows how small the number of biological submissions is, though - about 1.6% of the monthly total (yellow is Quantitative Biology). On the left you can see a breakdown of submissions by biological sub-category (dark blue is genomics).

Where do biologists go: The bioRxiv was set up for preprints in the life sciences in late 2013 and is intended to complement the original arXiv (it has been covered here, here, here, here). It is grouped into multiple subdisciplines, including genomics, cancer biology and bioinformatics. Papers get digital object identifiers (DOIs) so you can cite them, and papers are submitted as New Results, Confirmatory Results, or Contradictory Results.

I could not find bioRxiv usage stats such as those in the image above. Almost 1,600 papers have been submitted since the start, and 30% of them are in genomics or bioinformatics. Pathologists are conservative folks, which might explain why there are only 4 papers in this category - although I'd not have read this HER2 paper if I'd not written this post!

The darker side of the arXiv: Prof Conrad's commentary is driven by a slew of major 'discoveries' in his field, many of which are turning out to be false alarms. The worrying part of his article is that some of the authors of these pieces appear to have been aware of other data that disproved their theories but chose to 'publish' regardless, and they also followed up with big press releases raising their profile. This could have a negative impact on science funding and on public perception of science, especially if the big news stories get shot down in flames.

He suggests that "online publishing of draft papers without proper refereeing have eroded traditional standards for making extraordinary claims". To illustrate this he references a recent arXiv paper reporting the discovery of dark matter but using data that were preliminary and suggestive, rather than final and conclusive. The same day saw a second paper that refuted this claim using the same data but a more sensitive analysis based on upgraded software. The crazy thing was that the first paper acknowledged this upgrade was coming, but its authors did not wait for it before 'publishing' on the arXiv to make their mark. This story was widely reported, but with coverage focusing on the first claim, not the later refutation.

I wrote this post in response to a Tweet with this quote: "Journals should discourage the referencing of arXiv papers." I think the article is a balanced one and contains important messages beyond the quote picked up on Twitter.

It is interesting to speculate about who will scrutinise the bioRxiv. The great Retraction Watch blog is unlikely to be able to keep up if the bioRxiv grows as quickly as its big brother. But bioRxiv papers need to be watched and it'll be interesting to see if the community moderation is effective.