Friday 18 October 2013

How good is your NGS multiplexing?

Here’s a bold statement: "I believe almost all NGS experiments would be better off run as a single pool of samples across multiple lanes on a sequencer." 

So why do many users still run single-sample-per-lane experiments, or stick to multiplexes that give them a defined number of reads per lane for each sample in a pool? One reason is that the maths is easy: if I need 10M reads per sample in an RNA-seq experiment then I can multiplex 20 samples per lane (assuming 200M reads per lane). But this means my precious 40-sample experiment now has an avoidable batch effect, as it is run on two lanes which could be two separate flowcells on different instruments at different times by different operators in different labs… not so good now, is it?
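The lane arithmetic above can be made concrete with a quick sketch (the numbers are the ones from this post, not universal constants):

```python
# Illustrative lane arithmetic: 10M reads needed per sample,
# ~200M reads per HiSeq lane (figures from the example above).
reads_per_sample = 10_000_000
reads_per_lane = 200_000_000

samples_per_lane = reads_per_lane // reads_per_sample
print(samples_per_lane)  # 20 samples fit in one lane

# Pooling all 40 samples and running the pool on both lanes gives
# each sample ~5M reads per lane (10M in total), with every sample
# exposed to the same lane effects.
samples = 40
reads_per_sample_per_lane = reads_per_lane // samples
print(reads_per_sample_per_lane)  # 5000000
```

Either way each sample ends up with 10M reads; the difference is that in the pooled design any lane-to-lane variation hits every sample equally.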

And why doesn’t everyone simply multiplex the whole experiment into one pool in the first place? When I talk to users the biggest concern has been, and remains, the ability to create accurate pools. A poorly balanced large pool is probably worse than multiple smaller ones, as with the latter you can simply do more sequencing on one of the sub-pools to even out the sequencing in the experiment.

We have broadly agreed standards on quality (>Q30) and coverage (>10x for SNP calling), but nothing for what the CV of a pool of barcoded libraries should be. What’s good and what’s bad is pretty much left up to individuals to decide.
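For anyone wanting a number to argue over, the CV of demultiplexed read counts is trivial to compute; here is a minimal sketch with made-up counts for a tight and a poorly balanced pool:

```python
import statistics

def pool_cv(read_counts):
    """Coefficient of variation (%) of demultiplexed read counts:
    a simple single-number measure of pool balance."""
    mean = statistics.mean(read_counts)
    sd = statistics.pstdev(read_counts)  # population SD of the counts
    return 100 * sd / mean

# Hypothetical demultiplexed counts (millions of reads) per barcode
balanced = [9.8, 10.1, 10.3, 9.9]    # well-balanced pool
unbalanced = [4.0, 15.0, 8.0, 13.0]  # poorly balanced pool

print(f"{pool_cv(balanced):.1f}% vs {pool_cv(unbalanced):.1f}%")
```

Where the acceptable threshold sits (5%? 20%?) is exactly the standard that doesn’t yet exist.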

Here are some examples from my lab: pools 1, 2 & 3 are not great; 4 is very good.

Robust library quantification is the key: What do Illumina et al do to help? The biggest step forward in the last few years has been the adoption of qPCR to quantify libraries. Most people I speak to are using the Kapa kit or a similar variant. Libraries are diluted and compared to known standards. When used properly the method allows very accurate quantification and pooling; however, it has one very large problem: you need to know the size of your library to calculate molarity.

The maths once you have size is pretty simple: 
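For illustration, the standard concentration-to-molarity conversion (using 660 g/mol as the average mass of a double-stranded base pair) can be written as a one-liner; the function name is my own:

```python
def library_molarity_nM(conc_ng_per_ul, mean_size_bp):
    """Convert a dsDNA library concentration (ng/ul) to molarity (nM).
    660 g/mol is the average mass of one double-stranded base pair."""
    return conc_ng_per_ul / (660 * mean_size_bp) * 1e6

# e.g. a 10 ng/ul library with a mean size of 400bp:
print(round(library_molarity_nM(10, 400), 2))  # 37.88 nM
```

The size term in the denominator is exactly why getting the sizing wrong propagates straight into the molarity, and from there into cluster density.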

We find dilutions of 1:10,000 and 1:100,000 are needed to accommodate the concentrations of most of the libraries we get. We run libraries in triplicate and qPCR both dilutions. It’s a lot of qPCR, but the results are pretty good.
Unfortunately, accurate sizing is not trivial and it can be a struggle to get right. Running libraries on a gel or Bioanalyser is time-consuming, and for some libraries, e.g. amplicons & Nextera, it is difficult to determine an accurate size at all. Some users don’t bother; they just trust that their protocol always gives them the same size. The Bioanalyser is not perfect either; read this post about Robin Coope’s iFU for improved Bioanalyser analysis. Get the sizing wrong and the yield on the flowcell is likely to be affected.

Even with accurate QT, pooling is still a complex thing to get right: Illumina try to help by providing guidelines that allow users to make pools of just about any size. However, these are a headache to explain to new users without the Illumina documentation. And pooling always has a big drawback: you may need to sequence a couple of libraries again, and this can be impossible if their barcodes are not compatible.
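One part of barcode compatibility that is easy to automate is the edit-distance check: if every pair of barcodes differs at three or more positions, demultiplexing can tolerate one mismatch without ambiguity. A minimal sketch (function names are my own; the example sequences are the first three TruSeq indexes):

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

def compatible(barcodes, min_distance=3):
    """True if every pair of barcodes differs by at least min_distance
    bases; distance >= 3 lets you demultiplex allowing one mismatch."""
    return all(hamming(a, b) >= min_distance
               for a, b in combinations(barcodes, 2))

print(compatible(["ATCACG", "CGATGT", "TTAGGC"]))  # True
```

Note this only covers distance, not the base-balance-per-cycle requirements in Illumina’s low-plex pooling guidelines, which also matter for small pools.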

We run MiSeq QC on most of the library preps completed in my lab. This is very cost-effective if we are sequencing a whole plate of RNA-seq or ChIP-seq, at just £5 per sample. However, if we only have 24 RNA-seq samples then we’ll only want two lanes of HiSeq SE50bp data, so MiSeq QC is probably a waste of time and we’ll just generate the experimental data. Unfortunately, the only way to know for sure that the barcode balance is good is to perform a sequencing run!

Mixing pools-of-pools to create "Superpools": We’ve been thinking about how we might handle pools-of-pools (Superpools) on the HiSeq 2500. The instrument has a two-lane flowcell that requires a $400 accessory kit if you want to put a single sample on each lane. The alternative is to run two lanes, or a superpool of libraries from different users. We’ve tested this in our LIMS and can create the pools and the samplesheet and do the run, but in thinking about the process we’ve come up with a new problem: what do you do when the libraries you want to superpool are different sizes?

We can accurately quantify library concentration (if you can accurately size your libraries), but the clustering process favours small molecules. Consider the following scenario: in a superpool of two experiments on one HiSeq 2500 flowcell we have an RNA-seq library (275bp) and a ChIP-seq library (500bp). These are pooled equimolar and sequenced. When demultiplexed, the RNA-seq library accounts for 80% of the run and the ChIP-seq for 20%; consequently the RNA-seq user has too much data and the ChIP-seq user too little. And all because the smaller RNA-seq library clustered more efficiently. How do you work that one out?!

We’ve not empirically tested this but I think we will soon on our MiSeq.
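In the meantime the size bias can at least be sketched as a toy model. Everything here is an assumption: the size**-k efficiency function and the value of k are invented for illustration, not measured on any instrument:

```python
def read_fractions(sizes_bp, molar_fractions, k=2.0):
    """Toy model of size-biased clustering: assume clustering
    efficiency falls off as size**-k (k is a made-up tuning
    parameter), weight each library's molar fraction by it,
    then renormalise to predict its share of the reads."""
    weights = [m * s ** -k for m, s in zip(molar_fractions, sizes_bp)]
    total = sum(weights)
    return [w / total for w in weights]

# Equimolar pool of a 275bp RNA-seq and a 500bp ChIP-seq library:
rna, chip = read_fractions([275, 500], [0.5, 0.5])
print(f"RNA-seq {rna:.0%}, ChIP-seq {chip:.0%}")
```

With k=2 this toy model already skews an equimolar pool to roughly three-quarters RNA-seq; fitting k to real demultiplexing data is exactly the kind of model the Nextera commenter below is asking for.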

Top tips for accurate pooling:
  1. Perform robust QT
  2. Mix libraries with high-volume pipetting (~10µl)
  3. Run MiSeq QC
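Once each library has a robust qPCR molarity (tip 1), equimolar pooling reduces to simple volume arithmetic. A minimal sketch; the function name and the 1 pmol-per-library target are my own choices:

```python
def pooling_volumes_ul(molarities_nM, pmol_per_library=1.0):
    """Volume (ul) of each library needed to add the same number of
    moles of every library to the pool.
    1 nM = 0.001 pmol/ul, so volume = 1000 * pmol / nM."""
    return [1000 * pmol_per_library / nm for nm in molarities_nM]

# Three hypothetical libraries quantified by qPCR at 20, 10 and 40 nM:
vols = pooling_volumes_ul([20.0, 10.0, 40.0])
print([round(v, 1) for v in vols])  # [50.0, 100.0, 25.0]
```

Scaling pmol_per_library so that the smallest volume stays above ~10µl keeps the pipetting in the accurate range (tip 2).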
PS: writing this post has got me thinking of better ways to confirm pooling efficiency than sequencing. Watch this space!


  1. So do you lose some reads to issues with barcodes? For example, an error in the barcode read, or the barcode missing?

    Also, don't you have issues with cross-contamination or carry-over contamination from run to run?

  2. This size/clustering problem is important even for the simpler case of pooling lots of Nextera reactions that had high variance in final fragment size distribution. My labmates and I have been thinking about this issue a lot. We really need a model that can go from size distribution in to size distribution out.

  3. Has anyone tried using the SequalPrep™ Normalization Plate for Amplicon (Invitrogen/Life Tech) on these types of libraries? It can bind approx. 25ng per well. Or perhaps the same principle could be used with Illumina primers covering the wells of the normalization plate in a custom assay?

