Monday 18 June 2012

Even easier box plots and pretty easy stats help uncover a three-fold increase in Illumina PhiX error rate!

One of the things I wanted to do with this blog was share things that make my job easier. One of the jobs I often have to do is communicate numbers quickly and effectively, and a box plot can really help. I also have the same trouble most people face with statistics: I find it hard! In this post I will discuss the Prism package from GraphPad. It allows you to use stats confidently and make lovely plots (although annotating them is a nightmare). Recently the statisticians in our Bioinformatics core gave a short course on using GraphPad Prism, so I thought I'd explain the box plot in a little more detail and tell you a bit about GraphPad.

Previously I had shown how to create a box plot using Excel. I went down this route because I did not have time to learn a new package and Excel is available almost everywhere. However, the result is less than perfect and it is hard work. Indeed, one major reason for writing the previous blog post was so I had somewhere to go next time I needed to create a plot! Statisticians like box plots as they can get across a lot more than just the mean and also say something about the size of the population being investigated.

Explaining box plots: A box plot is a graphical representation of some descriptive statistics: generally a measure of the centre (the mean in the plots here; the median in a classic box-and-whisker plot) and one of the following measures of spread: standard deviation, standard error or interquartile range. A dot box plot is a version that shows these figures along with all the data points, which allows the size of the population to be clearly seen. This helps enormously when comparing sample groups and deciding if a change in mean is statistically significant or not.
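For anyone who wants to see where those numbers come from, here is a minimal sketch in Python using only the standard library. The `describe` function and the values in `group` are my own illustration (made-up numbers, not data from this post), and GraphPad may compute quartiles with a slightly different method.

```python
import math
import statistics


def describe(values):
    """Summarise one sample group with the statistics a box plot can show."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (divides by n - 1)
    sem = sd / math.sqrt(n)            # standard error of the mean
    # Three cut points splitting the sorted data into four equal-sized parts.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "median": median,
        "iqr": q3 - q1,                # interquartile range
    }


# Made-up values for illustration only.
group = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(group)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the standard error shrinks as the number of samples grows while the standard deviation does not, which is one reason it helps to see the individual dots as well as the error bars.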

Dot box plots rule!

In figure 1, very similar means and standard errors are plotted, with each dot representing a sample. See how the removal of some data does not significantly affect the "results", but seeing the data allows you to make a call on how much you are willing to trust it.

Figure 1

In figure 2 you can clearly see that there appear to be some "outliers" in group 2, but these have no effect on the results as the number of measurements is so high compared to group 1. Deciding if there are any "outliers" in group 1 is much harder as the number of samples is so much lower. Removing outliers is really hard, and our statisticians generally advise against it.

Figure 2

GraphPad Prism: The software costs about $300 for a personal license. It might be a lot when budgets are tight but an academic license is not so expensive when shared across a department or institute. I’d certainly encourage you to take the plunge. It very quickly allows you to produce plots like the ones in this post as well as run standard statistical tests, and a whole lot more I won't go into. Take a look at their product tour if you want to find out more.

PhiX error rates: I used GraphPad to investigate an issue I had suspected for a while. We have been seeing a bias in the error rate on Illumina sequencing flowcells, where the error rate in lane 1 appeared to be higher than in other lanes. Whilst the absolute numbers are not terrible and all lanes pass our QC, there may be a real impact on results if this is not taken into account, for instance when mutation calling on single-lane samples or comparing tumour to normal.

I took one month's GAIIx data (8 flowcells) and plotted the error rate for each lane. Entering the data into GraphPad is the most annoying bit and I usually copy and paste from Excel. However, generation of the statistics and plots (figure 3) took about three minutes from start to finish.

A one-way ANOVA with a Bonferroni correction showed how significant the differences were, with a very significant difference between lanes 1 & 2 and the rest. In fact there appears to be more of a gradient across a flowcell, as lane 2 is affected but to a lesser degree than lane 1.
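If you are curious what the test is doing under the hood, the following is a rough sketch in plain Python, not what GraphPad actually runs. It computes the one-way ANOVA F statistic (variation between groups divided by variation within groups) and the Bonferroni-adjusted significance threshold for all pairwise comparisons. The function names and the three small groups are hypothetical and the numbers are made up; turning F into a p-value needs the F distribution, which I leave to a stats package.

```python
from itertools import combinations
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(n_groups, alpha=0.05):
    """Per-comparison threshold when every pair of groups is compared."""
    n_comparisons = len(list(combinations(range(n_groups), 2)))
    return alpha / n_comparisons


# Made-up values for illustration only, not the error rates from this post.
groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
print(f"Bonferroni threshold for 8 lanes: {bonferroni_alpha(8):.5f}")
```

With 8 lanes there are 28 pairwise comparisons, so each one has to clear a much stricter threshold than 0.05 to count as significant.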

A two-way ANOVA allowed me to determine that in this data set lane accounted for 80% of the variance and instrument only 2.5%.
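The "percent of variance" figures can be thought of as each factor's share of the total sum of squares. Here is a simplified sketch of that idea in Python, assuming a balanced table with one value per instrument and lane combination; my real data had several flowcells per instrument, so this is not the exact calculation behind the numbers above. The function name and the tiny table are hypothetical and the values are made up.

```python
from statistics import mean


def percent_of_variation(table):
    """table[i][j] is the value for row level i and column level j.

    Returns each factor's share of the total sum of squares, in percent.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = mean([v for row in table for v in row])
    row_means = [mean(row) for row in table]
    col_means = [mean([row[j] for row in table]) for j in range(n_cols)]

    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_residual = ss_total - ss_rows - ss_cols  # whatever the two factors do not explain

    return {
        "rows": 100 * ss_rows / ss_total,
        "columns": 100 * ss_cols / ss_total,
        "residual": 100 * ss_residual / ss_total,
    }


# Made-up values: rows are two instruments, columns are three lanes.
table = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
]
shares = percent_of_variation(table)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

In this toy table the columns (lanes) explain most of the variation and the rows (instruments) the rest, with nothing left over because the made-up values are perfectly additive.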

Figure 3
The biggest headache with GraphPad is the woefully inadequate annotation of graphs. Quite simply you will have to get an image out of the software and into Illustrator or PowerPoint. I guess if they are making the stats easy we should not complain too much.

I am using GraphPad on a weekly basis and for most reports where I have to summarise larger datasets. Why don't you give it a try?

PS: I'll let you know what Illumina say about the error rates. Please tell me if you've seen similar.


  1. Try the gold standard for such things - JMP. Lets you cut, slice and dice data any which way. Steep learning curve, terrible interface, same inability to properly annotate data that you complain about, but once you get over the hump it is quite powerful.

  2. The biggest headache with GraphPad is the woefully inadequate annotation of graphs. Nice posting!! Thanks

  3. Just use the FOSS R, and together with ggplot you get the best graphs EWA. Example:

