We just published a paper describing the tool we use in my lab as the primary QC for all lanes of sequencing. We've used this tool on about 5000-6000 samples over the past 3-4 years.
Multi-genome alignment for quality control and contamination screening of next-generation sequencing data. Hadfield & Eldridge 2014
In designing the tool we aimed to produce something that gave us a fast and computationally inexpensive visual presentation of quality and yield. The tool is alignment-based and works on a sample of 100,000 reads (sequences and base quality scores) extracted from a FASTQ file. Whilst we run MGA at the lane level, it could be run on any number of FASTQ files, e.g. every library in a multiplexed pool, to determine the quality/contamination of each.
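To give a feel for the sampling step, here is a minimal sketch of drawing a fixed-size random sample of reads from a FASTQ stream. This is not the MGA code: I'm using reservoir sampling as a stand-in, and the function names (`read_fastq`, `sample_reads`), the seed, and the assumption of strict 4-line FASTQ records are all mine for illustration.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, base qualities); a hypothetical record type for this sketch
FastqRecord = Tuple[str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly 4 lines per record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, None)
        sep = next(it, None)  # the '+' separator line
        qual = next(it, None)
        if seq is None or sep is None or qual is None:
            raise ValueError("truncated FASTQ record: " + header)
        yield header, seq.rstrip("\n"), qual.rstrip("\n")


def sample_reads(records: Iterable[FastqRecord],
                 k: int = 100_000,
                 seed: int = 0) -> List[FastqRecord]:
    """Reservoir-sample up to k records in one pass, without loading the file."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = rec     # replace with probability k / (i + 1)
    return reservoir


# Usage (the file name is hypothetical):
#   import gzip
#   with gzip.open("lane1.fastq.gz", "rt") as fh:
#       subset = sample_reads(read_fastq(fh))
```

The appeal of a single-pass sample like this is that memory use depends on the sample size rather than the size of the lane, which fits the "computationally inexpensive" aim above.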
I'm hoping the next step will be to release this as an App on BaseSpace.
How does MGA work? The MGA tool primarily displays yield (as counts of reads, i.e. clusters) and quality (as error rate), with additional information to help identify lanes/libraries that may need troubleshooting. In figure B from the paper, below, we presented a HiSeq 2000 flowcell with four good lanes and four not-so-good. Before I describe the results, can you guess which lanes might have problems?
|Figure B from Hadfield & Eldridge 2014|
The figure is meant to represent a flowcell with 8 lanes, and will display a single lane for a MiSeq run, or 4 lanes on a NextSeq 500!
Lanes 1-4 generated 165-200M reads each. The green bar represents the genome of interest and error rate; these lanes are almost entirely Human, as expected, with very little "contamination" from other genomes (represented by yellow) or unmapped sequences (represented by white).
Lanes 5-8 generated only 90-140M reads. There's lots of unmapped sequence (white) and also lots of adapter contamination (purple). Additionally, lane 8 is almost entirely unmapped, with almost the same number of reads coming from contaminants as from the genome of interest.