Previously I showed how to create a box plot using Excel. I went down this route because I did not have time to learn a new package and Excel is available almost everywhere. However, the result is less than perfect and it is hard work; indeed, one major reason for writing the previous blog post was so I had somewhere to go next time I needed to create a plot! Statisticians like box plots because they get across a lot more than just the mean and also say something about the size of the population being investigated.
Explaining box plots: A box plot is a graphical representation of some descriptive statistics: generally a measure of centre (the mean in the plots here; the classic box plot uses the median) and one measure of spread, such as the standard deviation, standard error or interquartile range. A dot box plot is a version that shows these figures along with all the data points, so the size of the population can be clearly seen. This helps enormously when comparing sample groups and deciding whether a change in mean is statistically significant.
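For anyone who wants to check the figures behind such a plot by hand, here is a minimal sketch in Python using only the standard library. The function name `describe` and the sample values are my own illustration, not anything taken from GraphPad or from the data in this post.

```python
import math
import statistics

def describe(values):
    """Return the summary figures a dot box plot typically overlays on the dots."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (divides by n - 1)
    sem = sd / math.sqrt(n)            # standard error of the mean
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {"n": n, "mean": mean, "sd": sd, "sem": sem, "iqr": q3 - q1}

# Hypothetical measurements, made up purely for illustration.
group = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(group)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the standard error shrinks as the number of points grows while the standard deviation does not, which is one reason it helps to see the dots as well as the error bars.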
Dot box plots rule!
In figure 1, very similar means and standard errors are plotted, with each dot representing a sample. Removing some of the data does not significantly affect the "results", but seeing the data lets you make a call on how much you are willing to trust it.
In figure 2 you can clearly see that there appear to be some "outliers" in group 2, but these have no effect on the results because the number of measurements is so high compared to group 1. Deciding whether there are any "outliers" in group 1 is much harder because the number of samples is so much lower. Removing outliers is really hard, and our statisticians generally advise against it.
GraphPad Prism: The software costs about $300 for a personal license. It might be a lot when budgets are tight but an academic license is not so expensive when shared across a department or institute. I’d certainly encourage you to take the plunge. It very quickly allows you to produce plots like the ones in this post as well as run standard statistical tests, and a whole lot more I won't go into. Take a look at their product tour if you want to find out more.
PhiX error rates: I used GraphPad to investigate an issue I had suspected for a while. We have been seeing a bias in the error rate on Illumina sequencing flowcells, where the error rate in lane 1 appeared to be higher than in the other lanes. Whilst the absolute numbers are not terrible and all lanes pass our QC, there may be a real impact on results if this is not taken into account, for instance when calling mutations in single-lane samples or comparing tumour to normal.
I took one month's GAIIx data (8 flowcells) and plotted the error rate for each lane. Entering the data into GraphPad is the most annoying bit, and I usually copy and paste from Excel. However, generating the statistics and plots (figure 3) took about three minutes from start to finish.
A one-way ANOVA with a Bonferroni correction showed how significant the differences were, with a very significant difference between lanes 1 and 2 and the rest. In fact there appears to be more of a gradient across the flowcell, as lane 2 is also affected, but to a lesser degree than lane 1.
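To make the test a little less of a black box, here is a rough Python sketch of the two ideas involved: the F statistic of a one-way ANOVA and a Bonferroni adjustment for multiple comparisons. The function names and the numbers are hypothetical and are not the flowcell data; I am also assuming all pairs of the eight lanes are compared, which may not match exactly what GraphPad does internally.

```python
from itertools import combinations

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical error rates (%) for three lanes, made up for illustration.
lanes = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(lanes)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")

# Assumption: every pair of the eight lanes on a flowcell is compared.
n_pairs = len(list(combinations(range(8), 2)))
print(f"{n_pairs} pairwise comparisons, per-test threshold {0.05 / n_pairs:.4f}")

adjusted = bonferroni([0.01, 0.04, 0.5])  # hypothetical raw p-values
```

Turning F into a p-value needs the F distribution, which is not in the standard library, so this sketch stops at the statistic itself.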
A two-way ANOVA allowed me to determine that in this data set lane accounted for 80% of the variance and instrument only 2.5%.
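As I understand it, the "percentage of variance" figures come from splitting the total sum of squares between the two factors. The sketch below shows that arithmetic in Python for the simplest case, a balanced table with one value per lane and instrument combination. The function name and the table are hypothetical, and the real analysis had replicate flowcells, so treat this as an illustration of the idea and not a reproduction of the GraphPad output.

```python
def variance_shares(table):
    """table[i][j] is the value for row level i and column level j.

    Returns the percentage of the total sum of squares attributable to
    rows, columns and what is left over (residual).
    """
    n_rows = len(table)
    n_cols = len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_resid = ss_total - ss_rows - ss_cols

    return {
        "rows": 100 * ss_rows / ss_total,
        "columns": 100 * ss_cols / ss_total,
        "residual": 100 * ss_resid / ss_total,
    }

# Hypothetical error rates (%): rows are lanes, columns are instruments.
table = [
    [1.0, 1.2],
    [0.5, 0.7],
    [0.3, 0.5],
]
shares = variance_shares(table)
for source, pct in shares.items():
    print(f"{source}: {pct:.1f}% of total variation")
```

In this made-up table the lane effect dominates by construction; the function assumes the values are not all identical, otherwise the total sum of squares is zero.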
I am using GraphPad on a weekly basis, and for most reports where I have to summarise larger datasets. Why don't you give it a try?
PS: I'll let you know what Illumina say about the error rates. Please tell me if you've seen similar.