Friday, 22 July 2011

Ion Torrent has the "Law" on their side


Bang went the starter gun in August 2010 when Life Technologies announced the purchase of Ion Torrent for $375M (plus another $350M if certain milestones are reached).

Yesterday in Nature Life Technologies and Ion Torrent published the genome of Gordon Moore of “Moore’s law” fame, plus 3 bacterial genomes. I’d like to know how much Gordon thought about DNA sequencing before his law was quoted in almost every next-gen presentation over the last four or five years? Let alone what went through his head when approached by Ion and LifeTech! The Moore genome required about 1000 Ion chips, or a billion detector wells. In the supplementary information for the paper Ion suggest a route to a 1B well chip (see later on in this post), which would be pretty awesome.

Over at Genetic Future there is a nice commentary on the Human Genome results. Daniel MacArthur states pretty categorically that this is nowhere near a cost-effective way to sequence humans “yet”. I posted on SEQanswers in February that Ion could get to HiSeq output by 2013 after reading an interview with Jonathan Rothberg, in which he said to expect a 10-fold improvement from Ion every six months. It looks like they have achieved this so far, but will they keep it up to get to 100Gb by the middle of 2013?
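
As a quick sanity check on that claim, here is a minimal back-of-the-envelope sketch in Python. The 10-fold-every-six-months figure is from the Rothberg interview; the starting yield of roughly 10Mb per run in early 2011 is my own assumption.

    # Hypothetical extrapolation of Ion Torrent throughput, assuming a
    # 10-fold improvement every six months (the Rothberg interview figure).
    # The ~10Mb starting yield per run in early 2011 is my assumption.

    def projected_yield_gb(start_gb, months, fold_per_six_months=10):
        """Projected yield in Gb after 'months' of sustained growth."""
        return start_gb * fold_per_six_months ** (months / 6.0)

    start = 0.01  # ~10Mb per 314-chip run (assumed baseline)
    for months in (6, 12, 18, 24, 30):
        print("after %2d months: %8.1f Gb" % (months, projected_yield_gb(start, months)))
    # 24-30 months of 10x growth gives 100-1000Gb, i.e. HiSeq-scale
    # output by mid-2013 if, and only if, the trajectory holds.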

Dan also makes a very strong argument against the paper’s claims on quality and validation with SOLiD. Dan’s comments on the low coverage are very persuasive. I’d excuse Life Tech and Ion for the 10x Moore genome, but there can be little excuse for Life Tech on a 15x SOLiD genome. They could very easily have brought this closer to the standard 30-50x coverage. I hope it is obvious to everyone reading the paper that Ion and SOLiD come from the same company and that there is likely a vested interest in saying how great both technologies are.

Dan thought Life Technologies may have made a mistake putting the Human genome into the paper. I really hope that, without it, this work would not have made it into Nature! A genome today, even on a new sequencing technology, just does not feel like it should pass the bar of entry to Nature. It’s a tough club to get in to!

Nature News makes a lot of the fact that a bacterial genome can be sequenced in two hours. They seem to ignore library prep entirely, and even with the new automated ePCR this is probably a day’s work at least. Illumina's purchase of Epicentre gives them the Nextera technology, and it really is possible to sequence a genome in an eight-hour day. If you can get one delivered, of course.

One of the interesting things to me was to look at what developments we might deduce in terms of throughput. In the paper the authors state read lengths of over 200bp and that only 20-40% of detector wells generate mappable reads. Moving past 100bp reads and increasing the percentage of mappable reads to closer to 100% will make a massive difference to the ultimate output. However I do not know enough about the technical challenges Life may face here. This is almost certainly the kind of development that will get Ion Torrent the further $350M from Life.

In the supplementary figures for the Nature paper Ion demonstrate 1.3um wells on a 1.68um pitch array, allowing up to 165M detector wells. They say that this could increase to 1B, but it sounds like a tough step to take. At 1B wells and 30bp reads Ion would be giving us the same yield as a v3 HiSeq flowcell today. How long it might take to get there is another matter. There are 20 supplemental figures. Number 8 shows the instrument and points out the major features, one of which is the accompanying iPod and its dock!
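
For anyone wanting to play with these numbers, the yield model is simply wells on the chip times the fraction giving mappable reads times the read length. The sketch below uses illustrative figures loosely based on the paper's ranges; the specific combinations are my own assumptions.

    # Rough yield model for a semiconductor sequencing chip:
    # wells x fraction giving mappable reads x read length.
    # All specific values below are illustrative assumptions, loosely
    # based on the paper's ranges (165M-1B wells, 20-40% mappable,
    # 100-200bp reads).

    def chip_yield_gb(wells, frac_mappable, read_length_bp):
        return wells * frac_mappable * read_length_bp / 1e9

    print(chip_yield_gb(165e6, 0.3, 100))  # ~5 Gb from a 165M-well chip
    print(chip_yield_gb(1e9, 0.3, 100))    # ~30 Gb from a 1B-well chip
    print(chip_yield_gb(1e9, 0.9, 200))    # ~180 Gb if mappability and length improve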

One of the challenges mentioned in discussions I have had with users of the Ion Torrent is that the current emulsion PCR is limiting the size of amplicon that can be generated on the acrylamide beads. If the Ion technology is going to get to 400bp like 454, or even to 1000bp, then a larger PCR amplicon will need to be generated. If anyone can point me to a paper that describes how the size of the bead affects amplicon length I’d be grateful. So far I have looked at "SNP genotyping by multiplexed solid-phase amplification and fluorescent minisequencing" and "Product length, dye choice, and detection chemistry in the bead-emulsion amplification of millions of single DNA molecules in parallel".

It is going to be interesting to see how Ion respond to Roche’s entry into semiconductor sequencing. If one of them gets a system that can actually scale according to Moore’s law then Illumina will need to squeeze more out of SBS, or possibly Oxford Nanopore?


Wednesday, 20 July 2011

How good are the ENCODE RNA-Seq guidelines?

The ENCODE consortium released its first set of data-standards guidelines and interestingly they are for RNA-Seq. ChIP-Seq guidelines will follow later which is a little surprising considering almost all the ENCODE data so far is ChIP-Seq (see below). In some ways I’d have preferred to see the ChIP-Seq document first. As ChIP-Seq is pretty mature it would have been clear how much ENCODE had taken into account the different lab and analytical methods and distilled what was important in an experiment.

There is a bit of a hole in the guidelines from my point of view as there is no comparison or recommendation on methods. When I first looked at the site this is exactly the information I was hoping to get. There is none, zip, zilch! I was also surprised that there are no references in these guidelines. I think this is a significant shortcoming from the ENCODE consortium and one that needs to be fixed. I would very much hope that there are protocol recommendations for the more mature ChIP-Seq methods when those guidelines are written.

A lot of the guideline recommendations come from experience of microarrays. This document is nowhere near as comprehensive as MIAME but I think it will be easier for users to adopt because of this. The Metadata section is a nice concise list of information to collect for an experiment, RNA-Seq or otherwise. I'd encourage anyone doing a sequencing or array experiment to read this list and think about other factors they might need to collect in their own experiments.

Whilst these guidelines are a reasonable start and outline many of the issues RNA-Seq users need to consider, they fall a long way short of being truly useful to someone considering where to start with an RNA-Seq experiment.

ENCODE data so far: About 20 labs have submitted data to ENCODE according to their data summary. When I looked there was no ChIP-Chip data in the summary; almost 85% of the data is from sequencing experiments, with 63% ChIP-Seq, 8% RNA-Seq, 7% DNase-Seq, 4% Methyl-Seq and 2% FAIRE-Seq.

The Guidelines
Methods: RNA-Seq methods mentioned include transcript quantification, differential gene expression, discovery and splicing analysis. They don’t mention allele-specific expression. Many types of input can be used in these methods: total RNA (including miRNA of course), single-cell RNA, small RNA, polyA+ RNA, polysomal RNA, etc. The authors do state how immature RNA-Seq is and that the applications are evolving incredibly rapidly in almost every part of an experiment; sample prep, sequencing and analysis. They say they don’t aim to cover every possible application but instead focus on the major ones, and also provide recommendations for providing metadata, something too many scientists still don’t collect before and during an experiment, let alone submit with the data for analysis.

Metadata: recommendations include the usual suspects. For cell lines: accession number, passage number, culture conditions, STR and Mycoplasma test results. For tissue: the source and genotype if this is an animal, sample collection and processing methods, and cellularity scores. And for the final RNA: the method used for extraction and the QC results (bioanalyser database anyone?).

Replication: They say that RNA-Seq experiments should be replicated (biological rather than technical) although ENCODE recommend a minimum of two replicates, which is very low. I defy anyone to find a statistician involved in microarray experiments that would settle for anything less than three and probably four replicates today. However they do give a get out clause for those who can’t replicate by stating “unless there is a compelling reason why this is impractical or wasteful”. An interesting point is that these guidelines suggest an RPKM correlation of at least 0.92 is required otherwise an experiment should be repeated or explained. I would have thought anyone publishing their experiments would already be explaining this and that reviewers would pick up on such poor correlations.
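
As a concrete illustration of that replicate check, here is a minimal sketch that computes a correlation of per-gene RPKM values between two biological replicates and flags anything under the 0.92 threshold. The choice of Pearson correlation, the toy values and the input layout are my own assumptions rather than anything specified in the guidelines.

    # Minimal sketch of the replicate concordance check: correlation of
    # per-gene RPKM values between two biological replicates, flagged if
    # it falls below the 0.92 threshold mentioned in the guidelines.
    # Pearson correlation and the toy values are my own assumptions.
    import math

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    rep1 = [10.2, 55.0, 0.4, 120.3, 3.1]  # RPKM per gene, replicate 1 (toy values)
    rep2 = [11.0, 50.2, 0.6, 115.8, 2.8]  # RPKM per gene, replicate 2 (toy values)
    r = pearson(rep1, rep2)
    print("r = %.3f: %s" % (r, "OK" if r >= 0.92 else "repeat or explain"))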

Read-depth: This is one of the hottest topics for RNA-Seq. It makes a massive difference to the final cost of the experiment and is a major determinant in the “microarrays vs sequencing” thought process. ENCODE suggest around 30M paired-end reads for differential gene expression, however Illumina are suggesting you can use as few as 2M reads per sample today if you want the same sensitivity as Affy arrays. That’s a 15-fold difference and I suspect this will be revised in the next version of the guidelines. They do say that other methods will require more reads, up to 200M.
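
To show what that 15-fold gap means in practice, here is a small sketch of samples per lane and effective cost per sample at each read depth. The reads-per-lane figure and the lane cost are purely my assumptions for illustration.

    # What the 30M vs 2M reads-per-sample recommendations mean for
    # multiplexing and cost. The lane output and lane cost below are
    # assumptions for illustration only.
    reads_per_lane = 150e6  # assumed reads from one HiSeq v3 lane
    lane_cost = 2000        # assumed all-in cost per lane, in dollars

    for reads_per_sample in (30e6, 2e6):
        samples = int(reads_per_lane // reads_per_sample)
        print("%2.0fM reads/sample: %2d samples per lane, ~$%.0f per sample"
              % (reads_per_sample / 1e6, samples, lane_cost / samples))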

ENCODE aim to update this document annually, which I am sure many will see as a useful endeavour. What about a step further with an open access Genomics journal that only covers annual reviews of methods, compares the variations and makes recommendations for a consensus protocol?

“Genome Methods Reviews” perhaps?

Friday, 15 July 2011

Understanding mutation nomenclature

I have recently been analysing some next-gen sequencing data for mutation detection. I realised pretty early on how long ago it was I had the opportunity to analyse some data of my own, as my day job is the generation of someone else's. I have been scanning through lots of data, COSMIC and many publications and thought I had better get up to speed on the naming conventions for different mutations and polymorphisms. Whilst doing so I thought I'd write a little primer to look back on and share with others.

First off I'd recommend some resources and references: Almost everything comes from these sources and the website is really comprehensive.
1. Antonarakis et al: Recommendations for a nomenclature system for Human gene mutations. 1998; Human Mutation 11:1-3. This was written before the Human Genome project was completed and a nice easy downloadable reference sequence from UCSC or Ensembl was not really conceivable. Looking back at papers like this reminds me how much we seem to take for granted in our day-to-day science!
2. den Dunnen et al. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. 2000; Human Mutation 15:7-12 This paper presented an updated set of recommendations to allow comprehensive reporting of more complex mutations, and as such has been very useful in Cancer genomics.
3. http://www.hgvs.org/mutnomen/recs.html: the Human Genome Variation Society website is probably the primary resource for mutation nomenclature.

How to report mutations:
The basic structure of a nucleotide change is as follows:

{Genomic locus or reference accession}{gDNA or cDNA}{nucleotide interval}{reference nucleotide}{type of change}{actual nucleotide}

e.g. NM_000546:g.76A>T, where nucleotide 76 in TP53 has been changed from A to T.

Use the accession number for the genomic reference sequence from RefSeq (e.g. NM_000546 for TP53 from HGNC). Nucleotide numbers are preceded by a "g.", "c.", "m.", "r." or "p.", denoting genomic, cDNA, mitochondrial, RNA or protein sequence respectively. For c. and r. descriptions the A of the ATG start codon is nucleotide +1, the nucleotide 5' to +1 is denoted -1 and there is no nucleotide 0. If there is more than one mutation in a single allele these are listed together inside [brackets]. Nucleotides are represented in capitals for DNA (A,C,G,T) and in lowercase for RNA (a,c,g,u); amino acids are represented by their one-letter IUPAC codes (see the bottom of this post).

Different types of mutation:
Substitutions: c.76A>T
Deletions: c.76del or c.76_78delACT
Duplications:   c.76dup or c.76_78dupACT
Insertions: e.g. c.76insG
Inversions: c.76_83inv8
Conversion: c.76_355conNM_004006.1:c.155_355
Indels: e.g. c.76_83delinsTG
Translocations: t(X;4)(p21.2;q35)
Complex: see details below.

  • Substitutions: ">" indicates a  substitution at DNA level where one nucleotide has been replaced by another (e.g.  c.76A>T, where nucleotide 76 in the reference has been changed from A to T).
  • Deletions: "del" indicates a deletion in the reference sequence where at least one nucleotide is removed (e.g.  c.76del or c.76delA, where nucleotide 76, an A has been deleted and  c.76_78del or c.76_78delACT where the three nucleotides at positions 76,77&78, ACT have been deleted. The bases deleted are not absolutely required in the description but can be helpful.)
  • Duplications: "dup" indicates a duplication of the reference sequence (like c.76dup or c.76dupA, where nucleotide 76, an A, has been duplicated, and c.76_78dup or c.76_78dupACT, where the three nucleotides at positions 76,77&78, ACT, have been duplicated).
  • Insertions: "ins" indicates an insertion in the reference sequence (e.g. c.76insG where a G has been inserted between nucleotides 76 and 77).
  • Inversions: "inv" indicates an inversion of the reference sequence between the bases reported (e.g. c.76_83inv or c.76_83inv8, where the 8 nucleotides between positions 76 and 83 have been inverted with respect to the reference).
  • Conversion: "con" indicates a conversion of a region of the reference genome to another region. These can be complex events involving highly homologous sequences in the genome and thus may also include multiple changes in the converted sequence (e.g. c.76_355conNM_004006.1:c.155_355 where nucleotides 76 to 355 of the reference have been converted to nucleotides 155 to 355 from the sequence in GenBank accession NM_004006.1).
  • Indels: These genomic events are quite common but the recommended nomenclature requires that they are reported as DelIns (not quite so catchy!) for clarity (e.g. c76_83delinsTG where the 8 nucleotides between positions 76&83 have been deleted and a TG sequence has been inserted in their place).
  • Translocations: "t" indicates a translocation between genomic loci (e.g. t(X;4)(p21.2;q35), where a translocation has occurred between chromosome bands Xp21.2 and 4q35). A translocation should be reported in reference to a sequence accession number where possible.
  • Complex: similarly to big changes (see below) it is often easier to report a complex change in reference to a sequence accession number with a brief textual description.
  • Other identifiers: The underscore "_" indicates that a range of bases has been affected (see above for numerous examples), square brackets "[]" indicate an allele (e.g. c.[76A>T]) and round brackets "()" indicate that the exact position of the change is unknown, with the range of uncertainty reported using the underscore (e.g. c.(67_70)insG).
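
Because these descriptions follow such a regular pattern, the simple forms above can be pulled apart programmatically. Below is a small sketch in Python; it is nowhere near a full HGVS parser and the regular expression only covers the substitution, deletion, duplication, insertion, inversion and delins examples used in this post.

    # Sketch of parsing simple HGVS-style descriptions with a regular
    # expression. Illustrative only; it covers just the simple forms
    # used in this post, not the full nomenclature.
    import re

    HGVS = re.compile(r"""
        (?:(?P<accession>[A-Z]{2}_\d+(?:\.\d+)?):)?   # optional reference accession
        (?P<level>[gcmrp])\.                          # g./c./m./r./p. prefix
        (?P<start>-?\d+)(?:_(?P<end>-?\d+))?          # position, or range joined by "_"
        (?:(?P<ref>[ACGT])>(?P<alt>[ACGT])            # substitution, e.g. 76A>T
          |(?P<change>delins|del|dup|ins|inv)(?P<detail>[ACGT0-9]*))  # other changes
    """, re.VERBOSE)

    for example in ("NM_000546:g.76A>T", "c.76_78delACT", "c.76dupA",
                    "c.76insG", "c.76_83inv8", "c.76_83delinsTG"):
        m = HGVS.match(example)
        print(example, "->", {k: v for k, v in m.groupdict().items() if v})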

What about big changes? It would be impractical to list 100s or 1000s of bases using this nomenclature. Large changes are denoted only by the number of bases changed (e.g. c.76ins1786), however an accession number for a sequence file should also be provided.

When is an insertion a duplication, etc, etc, etc? If you take the example ACTTTGTGCC in the reference and ACTTTGTGGCC in your sample you could have a duplication of G at position 8 or an insertion of G between positions 8&9, c.8dupG or c.8_9insG. Duplicating insertions are described as duplications as this is a simpler and clearer description. Of course it is not clear whether the additional base is a true duplication resulting from polymerase slippage or is actually an insertion of a new but identical base from outside the reference sequence.
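
That decision rule is easy to express in code. The sketch below is my own illustration of the convention, not an official algorithm: if the inserted bases repeat the reference sequence immediately 5' of the insertion point, report a duplication, otherwise report an insertion.

    # Sketch of the "insertion or duplication?" convention: if the
    # inserted bases repeat the reference sequence immediately before
    # the insertion point, report a duplication, otherwise an insertion.
    # Positions are 1-based, as in the nomenclature above.

    def describe_insertion(reference, pos, inserted):
        """Describe bases 'inserted' between reference positions pos and pos+1."""
        n = len(inserted)
        if pos >= n and reference[pos - n:pos] == inserted:
            if n == 1:
                return "c.%ddup%s" % (pos, inserted)
            return "c.%d_%ddup%s" % (pos - n + 1, pos, inserted)
        return "c.%d_%dins%s" % (pos, pos + 1, inserted)

    ref = "ACTTTGTGCC"
    print(describe_insertion(ref, 8, "G"))  # c.8dupG, the example above
    print(describe_insertion(ref, 8, "A"))  # c.8_9insA, a genuine insertion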

What happens downstream of the first mutation? If a duplication or deletion change in your sequence is followed by a subsequent change I am not clear how the following changes are presented. Now every base will be wrong compared to the reference, how does this complexity get resolved? (sorry but I need to add to this later, probably when I confront such a problem myself).

What happens when the reference changes? An interesting point and one that I had not considered too deeply was what do you use as the reference sequence and how do you deal with changes in it. The Human genome has a pretty good reference and this is considered to be the wild-type allele. However the reference is being updated with new information and groups or users can request that the reference allele is changed when it is shown that a variant is actually the more common allele. Obviously this has an impact on what is reported in papers, as such it is always helpful to indicate which genome build has been used. The simple answer is to always reference the genome build you used in your paper or report.

IUPAC codes: These are handy when referencing mutations so I thought I'd include the single letter codes for nucleotides and amino acids as I always forget these and there is a good chance I will remember I wrote this blog!

Symbols:
A   Adenine 
C   Cytosine
G   Guanine
T   Thymine
R   G or A       puRine
Y   T or C       pYrimidine
M   A or C       aMino
K   G or T       Keto
S   G or C       Strong interaction (3 H bonds)
W   A or T       Weak interaction (2 H bonds)
H   A or C or T  "not-G, H follows G in the alphabet"
B   G or T or C  "not-A, B follows A"
V   G or C or A  "not-T (not-U), V follows U"
D   G or A or T  "not-C, D follows C"
N   G or A or T or C    aNy
 
Complementary symbols:
Symbol      A C G T B D H K M S V W N
Complement  T G C A V H D M K S B W N
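
For completeness, here is the same complement table expressed as a small Python lookup with a reverse-complement helper; this is simply the table above in code form.

    # The IUPAC codes and their complements as a lookup table, plus a
    # reverse-complement helper. This simply encodes the table above.
    COMPLEMENT = {
        "A": "T", "C": "G", "G": "C", "T": "A",
        "R": "Y", "Y": "R", "M": "K", "K": "M",
        "S": "S", "W": "W",
        "B": "V", "V": "B", "D": "H", "H": "D",
        "N": "N",
    }

    def reverse_complement(seq):
        return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

    print(reverse_complement("ACGTRYN"))  # NRYACGT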

Saturday, 25 June 2011

What sort of user are you?

Back in April Tales of the Genomic Repairman discussed five core facility stereotypes; Core of ill-repute, High $ Hooka core, the Waiting by the phone for the core to call core, the High maintenance core and the Easy core next door. I hope my users see the core I run as an easy core next door.

I thought I’d follow up after a conversation around Genomic Repairman's blog at a recent Illumina user meeting in Chantilly, where six core heads were at the same table at dinner. We all agreed that those stereotypes exist, but the same goes for users. Anyway, here is what we came up with...

The “the deadline is always tomorrow” user always professes that their experiment needs to happen now, the data is either needed for a paper (Nature, Cell or Science of course) or a looming grant deadline. If the core can’t deliver then this guy’s career is over. You meet them six weeks later in the pub after work and ask if the paper was accepted and they say, “sorry, I’ve not had time to look at the data yet”! We want to help when we can and my lab can run an array project in three days if samples are ready to roll. But whilst we all know priorities get changed you can't pull this too many times. Ever heard of the boy who cried wolf?

The “how can I do it cheaper” user always wants to cut corners to bring the cost of the experiment down. If you say four replicates they say “three”, when you say three they say “two, you can do stats on two can’t you?” Their experiments never quite seem to generate results that can be validated.

The “I didn’t quite follow the protocol” user brings their samples along but casually lets you know that they had to skip a clean-up or amplification step to get the samples in before heading off for the weekend. You can bet they’ll be disappointed when the job does not work and they are still asked to pay. This user is often unhappy with the quality of the work from the core.

The “I want to use this method that was just published last week” user can often lead to an exciting new application in the core. However this kind of user can also come back the week after and say forget about the last method, this one’s really great. It can be difficult to know when they are coming in with the next big thing rather than a fad. This user is enthusiastic but can take up a lot of time, however most cores relish the conversations and the opportunity to work on new ideas.

The " " silent user. Most of us round the table thought many users were too quiet. Whilst they seem happy with services offered, with the quality and turnaround of data, their feedback is limited. We want these users to talk to us more. Tell us what they like and don’t like. They have the opportunity to shape how our cores develop with their comments.

The “thanks a lot” user. This user always says thanks when they get their results. Even better, they always acknowledge the work the core did in the paper and sometimes include core staff as authors if it is warranted. The core is always happy to see this person back again and will always try to give that little extra. Service for this guy comes with a smile.

A last comment: Users of core facilities should speak with the staff and managers. If you have a core that’s not so easy-going then talk to them about it. Constructive criticism should be taken on board and, who knows, your Core of ill-repute could soon be as easy-going as you’d always hoped for.

Thursday, 23 June 2011

MiSeq vs Ion: How a little bug got involved in a big battle

There has been a lot said about the sequencing of E. coli from the recent outbreak in Germany. Over five isolates have been sequenced on almost every sequencing platform: HiSeq, 454, Ion and MiSeq. Interestingly I am not aware of a SOLiD or a Sanger genome. I’d recommend GenomeWeb's coverage as a good starter if you’re interested in finding out more.

Just this week Illumina made available a slide deck and data from their analysis of E. coli K12 MG1655 sequenced on HiSeq (PE100) and MiSeq (PE150).

The first 8 slides are the HiSeq MiSeq comparison:

Throughput seems to have gone up to 1.5Gb, which is a 50% increase over the initial spec, so growth seems to be as fast on MiSeq as it has been on the GA and HiSeq platforms. Albeit with just one data point so far. And there is still a long way to go before it gets to 25Gb, see my first blog.

Libraries for the comparison were made using TruSeq and not the Nextera kits from the recent Epicentre purchase, which I was a little surprised about. It would have been great to see a multiplexed run on the two platforms for a TruSeq and Nextera comparison in the same data set.

Essentially MiSeq outperforms HiSeq on quality by a tiny bit. Other than that the datasets are pretty much the same in all respects. The HiSeq run has the characteristic intensity fluctuation at about 75bp where laser power is adjusted mid-run. Both runs have a stepped profile in Mean Qscore, which I guess is a function of alignment getting better until about 20bp and then declining with read quality. In this presentation Illumina do not show the Mean Qscore at 150bp for MiSeq.

Illumina sub-sampled the HiSeq run for a de novo assembly comparison and the results were strikingly similar in all respects, other than MiSeq giving 11 contigs where HiSeq gave 12. I am not an assembly wiz so can’t think why this would be the case nor whether it would matter a great deal.

The next 7 slides are a comparison of MiSeq to Ion Torrent:

Again I am not a bioinformatician so can’t realistically comment on the fairness of this comparison and others are getting the data for more impartial assessment.

The Ion data was all from the 314 chip, which was specced at 10Mb, and the average run yield was 11-24Mb from the three sites that have generated data. So it looks like Ion is increasing the yield as expected. However it is very clear that the MiSeq has a huge advantage over the Ion platform with respect to yield, as it gave 1.7Gb from this run. Quality on Ion was lower but without the stepped profile seen in MiSeq and HiSeq, averaging out at Q31 for MiSeq vs Q19 for Ion.

The comparison is an interesting one, although it will probably be obsolete by the time I hit post due to the rapid developments from Ion. I did hear Broad have now obtained over 290Mb of data from a 316 chip, so it looks like their roadmap is on track.

It remains to be seen which platform is going to win this particular VHS vs Betamax battle (showing my age here). An unanswered question is also whether these platforms will ultimately replace systems like HiSeq, especially when whole genome sequencing can be purchased for $4000 with a 60 day turnaround. This could drop to $2500 next year if the 1Tb runs from Illumina work outside their development labs.

HiSeq costs $750,000 to buy including a cBot, MiSeq is $125,000. If yield is not your primary concern then MiSeq may well turn out to be the kind of instrument that finally democratises sequencing.

MiSeq vs HiSeq flowcell

So here it is, I was given one of these at AGBT and there is a picture in the MiSeq brochure. I thought I’d share an image with anyone wondering what they look like.



You can see it is very different from a HiSeq flowcell in that it is encased in a plastic housing. You can also see two inlet ports just above the ‘um’ in Illumina. These allow the much shorter and therefore faster fluidics to operate. The flowcell lane is bent back on itself, which is difficult to see in this picture.

I believe that only a portion of the MiSeq lane is imaged right now. But I am not exactly sure how much and this is key to working out how much data MiSeq might ultimately yield. A side-by-side comparison to a HiSeq flowcell shows that there is about 1/3rd the surface area of a HiSeq flowcell lane. Cluster density is being reported as the same on both platforms so if the clusters per mm2 is the same then a full MiSeq lane should generate about 1/3 the yield of a HiSeq lane.



However you could not access the increased imaging area without spending more time on imaging. If Illumina and users are willing to increase the imaged area then yields will go up. MiSeq only images one surface right now; imaging both surfaces would generate a two-fold increase but would double imaging time. If MiSeq is currently imaging one ‘surface’ from a possible two and one ‘tile’ from a possible three then it may be possible to increase output by 6-fold. Of course I am not sure if this is possible and it relies on my assumptions about how MiSeq imaging works.


MiSeq is sold with very fast run times as a unique selling point and a single-end 36bp run takes about four hours. If both surfaces can be imaged and three tiles rather than one are possible then the SE36 run time would go up to 6-8 hours but yield would jump from 0.3Gb to around 1.8Gb. For paired-end 150bp runs this time would go from 1 to 2 days and yield would rocket from 1.5Gb to 9Gb. Run costs would not change.
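
To make those what-ifs explicit, here is a speculative sketch of how yield and run time might scale with the number of surfaces and tiles imaged. The baseline figures, the assumption that only the imaging share of the run grows with imaged area, and the 25% imaging fraction are all mine, not Illumina's.

    # Speculative scaling of MiSeq yield and run time with the number of
    # surfaces and tiles imaged. The baseline figures, the linear scaling
    # and the 25% imaging share of the run are my assumptions.

    def scaled_run(base_yield_gb, base_hours, surfaces, tiles, imaging_fraction=0.25):
        """Scale a one-surface, one-tile baseline; only the imaging part
        of the run time grows with the imaged area."""
        area_factor = surfaces * tiles
        hours = base_hours * (1 - imaging_fraction) \
                + base_hours * imaging_fraction * area_factor
        return base_yield_gb * area_factor, hours

    print(scaled_run(0.3, 4, 1, 1))   # SE36 today: (0.3 Gb, 4.0 h)
    print(scaled_run(0.3, 4, 2, 3))   # both surfaces, three tiles: (1.8 Gb, 9.0 h)
    print(scaled_run(1.5, 24, 2, 3))  # PE150 with full imaging: (9.0 Gb, 54.0 h)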

I’d quite like my MiSeq to be shipped with a dial in the software that allows me to maximise data in the time I have available. If a run is going on overnight or the weekend for instance then why not let it run for longer and generate more data. This would be essentially for free.

More food for thought.

And ‘Yes’ that is James Watson’s signature on a HiSeq flowcell.

Monday, 20 June 2011

Why do the Chinese get such a good deal from Illumina?

Because they order 128 the day a new instrument is announced and have deep pockets.

I have been following developments at BGI over the past three or so years and was as stunned as anyone else at their announcement of a 128 HiSeq 2000 order in January of last year. 128 instruments was a lot then and is still a lot today. It is certainly the largest sequencing centre in the world, with significantly more capacity than any country in the world except the USA and UK.

BGI has 137 HiSeq 2000s as well as 1x 454, 27x SOLiD and 1x Ion Torrent according to the Google map of next-gen sequencers. The US has 712 instruments and the UK 132.

Using the latest 600Gb run chemistry they can put out 50Tb of data per week.
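
The arithmetic behind that figure looks roughly like the sketch below; the ~11-day duration of a 600Gb run is my own assumption.

    # Rough arithmetic behind the ~50Tb/week figure. The ~11-day run
    # time for a 600Gb HiSeq run is my assumption.
    hiseqs = 137
    gb_per_run = 600
    run_days = 11  # assumed duration of a 2x100bp 600Gb run
    weekly_tb = hiseqs * gb_per_run * (7.0 / run_days) / 1000
    print("%.0f Tb per week" % weekly_tb)  # ~52 Tb per week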

A recent Newsweek article profiles what is going on today over in Shenzhen. Note that BGI is nowhere near Beijing. It also specifies a price of just $500,000 for the HiSeq 2000. I thought $700,000 was nearer the mark, so this represents about a 30% discount. Pretty good.

Now all anyone else needs to do to realise a similar discount is spend $64M in a single purchase order. Reagents not included!

Roll on HiSeq version 2... ;-)

Thursday, 2 June 2011

The cost of maintaining those toys in the lab

I have just renewed several service contracts and this time of year always leaves me feeling a bit cold about how much money has just been spent. For my lab the bill is well over $200,000. That is nearly 50 genomes at today's prices, and people are only just publishing that many in a single paper!

At about 8-15% of the instrument purchase price a service contract is often seen as expensive. But I don't think there are many who choose to take a chance on their microarray scanner, real-time PCR instrument or sequencers - next-gen or otherwise. The impact of downtime is significant. The cost of an engineer visit can be very high and include very expensive travel costs. And the parts can be insanely expensive, over $50,000 on a laser for instance.

Two things amuse me about the whole service contract business. First are the names used for contracts; Gold, Silver, Advantage, Prestige, etc are all very aspirational and suggest we get something truly valuable. Second is how every company professes to make absolutely no profit. Companies are certainly not aiming to lose money here, and I guess it is not so easy to set margins as on a consumable item. However I do not believe that this is unprofitable.

Some bits of kit seem to plod along with nothing other than an annual preventative maintenance visit. Others seem to need constant engineer visits to keep them working. The cost of the contract often reflects this although I don't know if anyone has ever tried to compare cost vs reliability?

Many labs, mine included, try to recover some of the maintenance costs in charges made to users. However this can add huge amounts of money to the cost of accessing some little-used bits of equipment. And this is one of the major factors when I am deciding whether to keep a system going for another year. Sometimes a system needs to be internally subsidised. But commercial service providers can be good value; who still runs low-volume Sanger sequencing in-house, for instance?

Recently when I have spoken to people running similar labs the words 'risk management' keep coming up. It seems that more of us are considering not getting contracts on every item in the lab, but nearly all agree the big tickets need to be covered. I'd worry a little though if my centrifuge, bioanalyser or PCR machine broke and I could not run my HiSeq! I guess the crunch in the economy hits us all and anywhere we can make a saving needs to be explored.

At the end of the day service contracts are just insurance policies and ones that most of us would not be without. However I don't know of any large institution that has looked into this in any depth to see if the insurance is worth paying. Perhaps a more holistic view could be taken with no service but a big contingency pot for when things do go wrong. What could MRC or NIH save or would it cost almost as much to administer?

Friday, 27 May 2011

MiSeq: possible growth potential?

When Illumina announced the launch of MiSeq in January 2011, shortly before AGBT, there was a lot of noise in the community about what this instrument might do. There was an equal amount of noise from companies offering competing systems and their early users. This peaked with a series of ads from Ion Torrent.

Just recently Keith Robison over at "Omics! Omics!" has been discussing whether Ion Torrent can keep its current first-mover advantage in the face of Illumina's previous 'form' in this sort of competition. He mentions factors where the instruments might be compared and where differentiation might emerge:  similar all-in purchase price, similar per-run cost, similar read lengths (with PE on Illumina only for now), similar numbers of reads. He also says that MiSeq looks like it will have "double the total output, faster runs and less hands on time or trouble".

Lastly Keith discusses what Ion might do to keep ahead of the game. Well they have just released a hands-free ePCR system which will remove some of the pain. And the 318 chips with >400bp reads should yield well over 1Gb.

MiSeq was initially launched with specs of 1Gb of sequence in 24 hours and a cost of $125,000. The current spec is the same on Illumina's website but also adds that about 3.4M reads will be generated per run. I went to AGBT, got the bag and attended the MiSeq launch party (free coffee, no booze). One of the things I picked up was a MiSeq flowcell. You can see one in the brochure and it is much smaller than a HiSeq flowcell, about 1/3rd the size.

It is not clear what area is being imaged right now but if MiSeq achieves the same cluster density and can utilise the new 600Gb chemistry then it may be possible to guess at where overall yield might go. I suggested back in June 2010 that HiSeq 2000 might generate 2Tb of data from a run. The audience thought this way over ambitious but we have already seen data from internal >1Tb runs from Illumina on that platform.

If we stick with 600Gb as the HiSeq run output or 75Gb per lane then a MiSeq lane (1/3rd the size of a HiSeq lane) might yield as much as 25Gb.
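
For what it's worth, the arithmetic behind that guess is simply the sketch below; the eight-lanes-per-600Gb assumption follows the per-lane figure quoted above.

    # The arithmetic behind the 25Gb guess: 600Gb over 8 lanes gives
    # 75Gb per lane, and a MiSeq flowcell has roughly 1/3 of that area.
    hiseq_run_gb = 600.0
    lanes = 8
    miseq_area_fraction = 1.0 / 3
    miseq_estimate_gb = hiseq_run_gb / lanes * miseq_area_fraction
    print("MiSeq potential yield: ~%.0f Gb" % miseq_estimate_gb)  # ~25 Gb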

By my own estimates I'm still off by 100% on how much HiSeq might get to. And I'm suggesting MiSeq has room to grow 25x over its current 1Gb spec. Anyone interested in a sweepstake?