We just published a paper describing the tool we use in my lab as the primary QC for all lanes of sequencing; we've used it on about 5000-6000 samples over the past 3-4 years.
Multi-genome alignment for quality control and contamination screening of next-generation sequencing data. Hadfield & Eldridge 2014
In designing the tool we aimed to produce something that gave us a fast and computationally inexpensive visual presentation of quality and yield. The tool is alignment-based and works on a 100,000 read sample of sequences and base quality scores extracted from a FASTQ file. Whilst we run MGA at the lane level, it could be run on any number of FASTQ files, e.g. every library in a multiplexed pool, to determine the quality/contamination of each.
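To make the sampling step concrete, here is a minimal sketch of drawing a fixed-size random sample of reads from a FASTQ file in a single pass. It uses reservoir sampling, which is my assumption for illustration and not necessarily how MGA selects its 100,000 reads. The function names `read_fastq` and `sample_reads` are hypothetical, and the parser assumes plain four-line FASTQ records.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            # End of file (or a truncated trailing record): stop.
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def sample_reads(handle, k=100_000, seed=0):
    """Return up to k reads chosen uniformly at random (reservoir sampling).

    Illustrative only: this is an assumed sampling scheme, not MGA's own code.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(read_fastq(handle)):
        if i < k:
            # Fill the reservoir with the first k reads.
            reservoir.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

Called on an open FASTQ file handle, `sample_reads` returns a list of (header, sequence, quality) tuples that could then be passed to an aligner. Memory use depends on the sample size, not on the size of the lane.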
I'm hoping the next step will be to release this as an App on BaseSpace.
How does MGA work: The MGA tool primarily displays yield, as counts of reads (clusters), and quality, as error rate, with additional information to help identify lanes/libraries that may need additional troubleshooting. In figure B from the paper below we presented a HiSeq 2000 flowcell with four good lanes and four not-so-good. Before I describe the results, can you guess which lanes might have problems?
Figure B from Hadfield & Eldridge 2014
The figure is meant to represent a flowcell with 8 lanes, and will display a single lane for a MiSeq run, or 4 lanes on a NextSeq 500!
Lanes 1-4 generated between 165-200M reads each. The green bar represents the genome of interest and error rate; these lanes are almost entirely human, as expected, with very little "contamination" from other genomes (represented by yellow) or unmapped sequences (represented by white).
Lanes 5-8 generated only 90-140M reads. There's lots of unmapped sequence (white) and also lots of adapter contamination (as represented by purple). Additionally lane 8 is almost entirely unmapped with almost the same number of reads coming from contaminants as from the genome of interest.
How do you get MGA: The paper was only published a couple of days ago but you can read it in provisional format at Frontiers in Genetics. You can also grab the software from https://github.com/crukci-bioinformatics/MGA.
I don't know the details of running an Illumina machine, but isn't lane 8 a kind of internal control? If so, that would be why it is almost white, and lanes 5-7 would represent contamination from lane 8? This has nothing to do with the software itself, but it might matter for interpreting the results in the paper...?
Lane 8 is only a control when designated as such, and is not the only lane that can be used as a control lane. In the brief description, it appears that a spread of 8 lanes of variable quality were used to illustrate the tool. From my brief glance at the paper, it appears that it could be immensely useful.