After sequencing (i.e. you have a FASTQ file with your reads), what is a good quality assessment/control workflow of the sequence data itself?
- What are specific “red flags” we should look out for? How do you look out for these potential issues?
- What are characteristics of “good data” vs “bad data”?
- What are some tools for quality assessment of sequence data?
- What metrics should you check? How do you determine what thresholds to use as cut offs for those metrics?
Are there any generally accepted “standards” regarding the metrics?
Hi @mjovanov, can you help us out by saying what type of data you have in your FASTQs, i.e. read length, sequencing platform, and library type? There are a few metrics that apply across the board, but to do a good job on QC, I find that post-alignment/assembly analysis is most fruitful. If you provide some more information about your experiment we can try to point you in the right direction.
Hi @rcorbett here is some of the info:
type: RNA-seq data,
length: 100bp reads,
library type: Paired reads.
What type of post-alignment analysis do you use for QC?
I generally run every FASTQ file through FastQC, then aggregate the results into a single report using MultiQC (which makes it easier to spot outliers). Here are some things I do when looking at sequencing data QC reports. This is by no means an exhaustive list, but it covers a couple of red flags I look for and how to fix them:
- A FASTQ has abnormally low sequencing depth or appears to be just very different from the other read files. Either something went wrong during sequencing or went wrong when downloading/uploading the file. Run md5sum or some other checksum tool on all of your samples and make sure the checksums match what they are supposed to be. If something doesn’t match, redownload the file. If the checksums do match, the anomaly is real and you should investigate how this affects things further at every step of the pipeline. It’s usually smart to just verify checksums anyways (just to maintain your sanity when transferring files around).
- Sequence quality is low/degrades towards the end of a read (or there are a lot of Ns). Run a FASTQ trimmer and trim by quality to fix (tool defaults are usually fine). I usually drop reads that get shorter than say 30 bp (make sure this also drops the paired read or you’ll mess up the read-pairing in your FASTQs!).
- Per base sequence content shows a pattern / per sequence GC content shows multiple peaks. For some experiments, this is indicative of contamination and you should try to remove contaminated reads. For other experiments you should expect to see a pattern: RNA-Seq library prep is non-random for the first ~12 bp of a read, and a metagenomics study would expect to see a really bumpy GC content graph because there are lots of different organisms being sequenced. Carefully consider what you’d expect to see here and check for contamination if you suspect something is off. This is kind of a frustrating one to interpret, esp. if you’ve never seen a particular type of data before, so often the only course of action is to proceed through your pipeline and evaluate later output to see if something’s amiss.
- Overrepresented sequences. BLAST these guys. Chances are it’s nothing, but it never hurts to see what they are.
- Adapter content detected. If this crops up, trim adapters. No exceptions. You can see which adapter sequences FastQC checks for in its contaminants list. Many adapter trimming tools combine this with quality trimming, so you can do both in one step if you want.
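To make the "drop the paired read too" point from the quality trimming bullet concrete, here is a minimal Python sketch. It is only an illustration, not how any particular trimmer works: the function names are made up, it assumes Phred+33 quality encoding, the Q20 cutoff is an arbitrary choice, and it only does a naive cut from the 3' end. In practice you would use a dedicated trimming tool with paired-end support.

```python
from typing import Optional, Tuple

Read = Tuple[str, str]  # (sequence, quality string); Phred+33 encoding is assumed


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Read:
    """Cut bases off the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and (ord(qual[end - 1]) - 33) < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_pair(
    r1: Read, r2: Read, min_q: int = 20, min_len: int = 30
) -> Optional[Tuple[Read, Read]]:
    """Trim both mates; return None if either mate ends up shorter than min_len."""
    t1 = trim_3prime(r1[0], r1[1], min_q)
    t2 = trim_3prime(r2[0], r2[1], min_q)
    if len(t1[0]) < min_len or len(t2[0]) < min_len:
        # Drop the whole pair so the R1 and R2 files stay in sync.
        return None
    return t1, t2


if __name__ == "__main__":
    good = ("A" * 40, "I" * 40)                # 'I' encodes Q40
    decayed = ("C" * 40, "I" * 20 + "#" * 20)  # '#' encodes Q2
    print(trim_pair(good, good) is not None)   # pair kept
    print(trim_pair(good, decayed))            # pair dropped
```

The key design choice is that the length check happens on the pair, not on each read separately, so a good R1 is discarded along with its too-short R2.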
This is amazing! Thanks.
Also, is using MultiQC the go-to way to apply FastQC to lots of FASTQ files?
Yeah, MultiQC is my favorite tool for this kind of thing. Otherwise things can get a little murky when you’re trying to look for outlier samples or trends in like 60 FastQC reports (no one expects you to open each and every one… that would be insanity).
(Also, a bit of a hidden side benefit is that MultiQC is a really nice looking initial overview of things, and is a great way to placate someone who wants results “now, now, now” while you do the rest of the analysis)
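If you want a quick scripted sanity check for the "abnormally low sequencing depth" red flag alongside the MultiQC report, something like the following Python sketch could work. The function names and the "less than half the median" cutoff are my own assumptions, not a standard, and the read counter assumes plain 4-line FASTQ records (optionally gzipped).

```python
import gzip
from statistics import median
from typing import Dict, List


def count_reads(path: str) -> int:
    """Count records in a FASTQ file, assuming 4 lines per record; handles .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def flag_low_depth(counts: Dict[str, int], fraction: float = 0.5) -> List[str]:
    """Return sample names whose read count is below fraction * median."""
    cutoff = fraction * median(counts.values())
    return sorted(name for name, n in counts.items() if n < cutoff)


if __name__ == "__main__":
    # Hypothetical read counts for four samples.
    counts = {"s1": 1000, "s2": 1100, "s3": 200, "s4": 950}
    print(flag_low_depth(counts))
```

Anything this flags is a prompt to go verify checksums and look at that sample's report more closely, not a verdict that the sample is bad.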
A similar question with answers was posted here. Please see:
Maybe it doesn’t answer all of your questions, but it’s clearly a start.
The linked post only gives examples of a couple of tools that are available (FastQC, which I have used; NGSQC; and MultiQC). However, it doesn’t address what “red flags” to look out for, the characteristics of “good” vs “bad” data, or which metrics are important to check and how to decide what thresholds to use, which were the questions I was hoping to get answered. If you have any information or tips about those questions, that would be great!