Wednesday, 3 October 2012

How to do better NGS experiments

Design, replication, multiplexing.
These are the three things I spend most of my time discussing in experimental design meetings. In my institute we hold three 30-minute design sessions every week where users talk to my group and our bioinformatics core about how best to run a particular experiment. I do believe these relatively short discussions have a big impact on the final experiments, and the most common issues are the three I listed at the start.
Of course we spend lots of time talking about the pros and cons of different methods, RNA-seq vs arrays or mRNA-seq vs Ribozero RNA-seq for instance, but the big three get discussed even if it is clear what method should be used.
I'd encourage everyone to think about experimental design as much as possible. Simply thinking about the next step in your experimental process is not enough. Take time to plan where your experiments are going and what the most logical steps are to get there. Then make sure each experiment is planned to make best use of your available resources. Even cheap experiments can end up being expensive in lost time. Don't save experimental design for more costly array or sequencing based projects!

Design: This is important because it suggests that an experiment has had more than one person think about it more than once. Even “simple” experiments often have confounding factors that need to be considered, or require assumptions to be made about steps in the experiment where real data might turn out to be sorely lacking.
Designing an experiment often means sitting down and listing all the issues that might affect the results and highlighting the things that can, or can’t, be done to mitigate or remove these issues. This can be done by anyone with sufficient experience of the experiment being performed. We find it is best done together over a cup of tea or lunch in an informal discussion, just like our design sessions!
Replicates: Replication is vital in almost all experiments. Only if an experiment is truly limited to being done once should replication be ignored. Most people can come up with multitudes of reasons as to why more replicates are a bad idea. However, when confronted with data showing how increasing replicate numbers can make experiments more robust and more likely to find significant differences, many users are persuaded to add in four or even more replicates per group.
Biological replication is king and technical replicates are often a waste of time. Be wary of pooling samples to make replicates appear tighter: you are losing information about the biological distribution of your data that might be meaningful.
We find four replicates is the minimum to consider for almost all experiments. Three works well, but if one sample fails to generate results the whole experiment can be rendered useless. Four gives a big step up in the ability to detect differences between groups and five adds even more power, but after six replicates many experiments start to tail off in this additional power. Unfortunately it is difficult to predict in advance the number of replicates needed to get the best £:power ratio. It is easy to work out after the experiment is complete, and I have yet to go to a statistical seminar that does not put Fisher’s “statistical post mortem” quote* up at some point to ram this home!

* Currently at number 3 in the famous statistician quotes chart!
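
To get a feel for how power changes with replicate number, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration, not data from our experiments: one measurement per sample, normally distributed with unit standard deviation, a true difference of two standard deviations between groups, and a pooled two-sample t-test at the 5% level. The function name simulated_power is hypothetical.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t, keyed by degrees of
# freedom (df = 2n - 2 for a pooled two-sample test with n per group).
T_CRIT_05 = {4: 2.776, 6: 2.447, 8: 2.306, 10: 2.228, 12: 2.179, 14: 2.145}


def simulated_power(n_per_group, effect_size=2.0, n_sim=5000, seed=1):
    """Estimate power by simulation for 3 to 8 replicates per group.

    effect_size is the assumed true difference between group means,
    in units of the within-group standard deviation.
    """
    rng = random.Random(seed)
    t_crit = T_CRIT_05[2 * n_per_group - 2]
    hits = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        # Pooled variance for equal group sizes is the mean of the two variances.
        pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
        std_err = (2 * pooled_var / n_per_group) ** 0.5
        t_stat = (statistics.mean(treated) - statistics.mean(control)) / std_err
        if abs(t_stat) > t_crit:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    for n in range(3, 9):
        print(n, "replicates per group: estimated power", round(simulated_power(n), 2))
```

Real sequencing experiments test thousands of genes on count data, so treat the output as a rough guide to the shape of the curve rather than a replacement for a proper power calculation with your own variance estimates.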

Multiplexing: For me this is the one people seem to forget to think about. I think I am convincing many that the correct number of samples to multiplex in a single NGS run is all of your samples. Rather than run 4, 12 or 24 samples per lane and always stick to this, I prefer to argue that having all the samples in one pool and running multiple lanes makes the experiment more robust to any sequencing issues. If a lane has problems then there is still likely to be lots of data from all the other lanes in the run.

There are also some issues with demultiplexing low-plex pools on Illumina as the software struggles to identify clusters correctly if they are too similar. We have had users submit libraries for sequencing with just two or three samples pooled. These have failed to generate the usual yield of data and demultiplexing is horrible. There is nothing we can do, and it has been frustrating explaining to users that their four carefully pooled libraries have all failed when, if they had just mixed all the samples together in one super-pool and run 4 lanes, everything would have been fine!
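
The robustness argument for a super-pool can be sketched with some simple arithmetic. The Python below is only an illustration under stated assumptions: the sample count, lane count and lane yield are made-up round numbers, every sample is assumed to take an equal share of a lane, and the function name is hypothetical.

```python
def reads_per_sample(n_samples, n_lanes, reads_per_lane, failed_lanes, super_pool):
    """Return the reads each sample receives when some lanes fail.

    super_pool=True:  all samples are pooled together and the same pool
                      is loaded on every lane.
    super_pool=False: samples are split into one separate pool per lane.
    """
    if n_samples % n_lanes != 0:
        raise ValueError("this sketch assumes samples divide evenly across lanes")
    good_lanes = [lane for lane in range(n_lanes) if lane not in failed_lanes]
    if super_pool:
        share = reads_per_lane / n_samples
        return [share * len(good_lanes)] * n_samples
    samples_per_lane = n_samples // n_lanes
    result = []
    for sample in range(n_samples):
        lane = sample // samples_per_lane
        result.append(reads_per_lane / samples_per_lane if lane in good_lanes else 0.0)
    return result


if __name__ == "__main__":
    # Hypothetical run: 16 samples, 4 lanes, 160M reads per lane, lane 2 fails.
    pooled = reads_per_sample(16, 4, 160_000_000, {2}, super_pool=True)
    split = reads_per_sample(16, 4, 160_000_000, {2}, super_pool=False)
    print("super-pool, worst sample:", min(pooled))
    print("one pool per lane, worst sample:", min(split))
```

The total yield is the same in both layouts. The difference is that the super-pool spreads the loss evenly across every sample, while separate pools leave a quarter of the samples with no data at all.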

Putting it all together: When we plan experiments now I try to ask how many reads might be needed per sample for a specific application. Once this is fixed then we can decide on replicate numbers for the experiment. Finally we can work out how many lanes are likely to be needed given the variability of Illumina sequencing. If an extra lane is needed later on there is enough data to start analysis, but often we don’t need more data and the sequencing becomes as efficient as possible.
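
As a rough sketch of that planning order (reads per sample, then replicates, then lanes), the calculation might look like the Python below. The lane yield and the safety margin for run-to-run variability are assumptions you would replace with figures for your own instrument, and the function name lanes_needed is hypothetical.

```python
import math


def lanes_needed(reads_per_sample, n_groups, replicates, reads_per_lane, safety_margin=0.2):
    """Estimate lanes for a super-pool of n_groups * replicates samples.

    safety_margin is the assumed fraction of nominal lane yield held
    back to allow for variable sequencing output.
    """
    total_reads = reads_per_sample * n_groups * replicates
    usable_per_lane = reads_per_lane * (1 - safety_margin)
    return math.ceil(total_reads / usable_per_lane)


if __name__ == "__main__":
    # Hypothetical experiment: 20M reads per sample, 2 groups, 4 replicates,
    # and a nominal 150M reads per lane.
    print(lanes_needed(20_000_000, 2, 4, 150_000_000))
```

With these made-up numbers the experiment needs 160M reads against 120M usable reads per lane, so the sketch suggests two lanes.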

PS: feel free to comment on any aspect of design discussed or missed here. Please don't ask me for help designing your projects though!


  1. RE multiplexing, what about barcode bias?

    Technically speaking, shouldn't you randomise the barcodes for each sample across all of the different lanes?

    We find two problems with Illumina barcodes/indices/multiplexing: first is barcode bias; the second is load balancing across multiple samples with different properties and quality.

  2. Although technical reps are not as important as biological reps, in NGS they are useful in accounting for barcode effects. Having randomized barcoded technical reps in different lanes is the best approach for accounting for lane effects.

  3. Thanks to your post, a long-overdue post that we had been thinking about for a while has seen the light. It is titled "Tips for Next-Gen Sequencing Experiment Design: Randomization" and is here at