jessiep asked 2 years ago

I’ve noticed that there are certain regions (eg. 60kb zone at the start of RBFOX3) that have consistent mapping qualities of 0, which I’ve only checked in a few individuals so far. I realize this is due to those sequences heavily mapping to other parts of the genome, but was wondering a) the extent to which this happens on a “normal”, or non-pathologically implying basis, and b) how best to check for sequence similarity with other regions of the genome to find out where those reads are mapping to elsewhere.

4 Answers
rcorbett Staff answered 2 years ago

Hi @jessiep,
I may not be the best to answer your question about the likelihood of ambiguous alignments in genes, but if you only have a few regions of interest, an easy way to find out where else your read could possibly align you can provide the “-h” parameter to BWA.  If you are using a different aligner there should be a similar option.
From BWA mem:

-h INT[,INT] if there are <INT hits with score >80% of the max score, output all in XA [5,200]

As you may know the alignments of your reads to the 0 mapping quality region may not be indicative of something about your sample as much as it is indicative of something about the reference.  Ie. any sample with reads of the same length that map to that region would still have mapping quality of 0.

flefebvre Staff answered 2 years ago

Hi Jessie, as rcorbett explained a region with MAPQ=0 alignments in one sample will show alignments with MAPQ=0 in a second sample if the read length was the same.

Trying to answer a) and b), low-mappability regions are everywhere in the human genome with short Illumina reads. I would suggest you have a look at (and download) the “Mappability” and other Repeats tracks in the UCSC genome browser, e.g.

For instance you’ll see how RBFOX3 is flanked by a large gap (Ns) in the assembly and apparently segmental duplications.

It could also be interesting to look at hg38 alignments and tracks. The picture seems a bit different from hg19, at least for RBFOX3.
If you are trying to contrasts normal and pathological samples, some tools will account for low-mappability. For instance, the recently published PopSV by Jean Monlong at McGill is really good at this for copy-number-variants, if you have a large number of samples.

jessiep answered 2 years ago

Thanks. Will do both. In the meantime, in RBFOX3’s case, it turns out that a bunch of the reads in that area also map to one of the GRCh37 GL contigs (as per UCSC’s self-chain repeat track). Under the assumption they are unlikely to have variation of value for my analyses, would this justify dropping the GL contigs to increase mapping scores & reduce false positive variant calls?

flefebvre Staff answered 2 years ago

Hi Jessie, I would not recommend to start playing around with the reference use unless you are very clear on what is going on. Here is a link describing reference genome best practices for you to check against:
As for the specific suggestion of removing GL contigs I don’t see how this would help. I can imagine removing the GL contigs could force reads from elsewhere in the true genome to map downstream of RBFOX3; That would increase “mapping scores” but could introduce false positive variants.

jessiep replied 2 years ago

Hi Francois,

Thanks for the recommendations.

I guess I don’t see how, if a region quite consistently has mapping qualities of 0 (eg. start of RBFOX3), we could ever reliably call any variants (eg. deletions) there..?

flefebvre Staff replied 2 years ago

Hi Jessie, we can’t reliably call variants in those regions. For instance GATK 3.7 haplotypeCaller does not consider reads with a mapping quality below 20. htseq-count used in RNA-seq defaults at MAPQ > 10. freebayes at 1, the authors of lumpy (calling SVs) appear to recommend setting the threshold at 20 , etc.

Do you have a specific interest in calling variants in low complexity (LC) regions and repeats?

jessiep replied 2 years ago

Hi Francois,

I realize that the point of RepeatMasker and mapping quality thresholds are to minimize false positives in such LC/repeat regions, but perhaps my thresholds weren’t stringent enough.

I don’t have an interest in calling LC/repeat region variants, but RBFOX3 is very important neurodevelopmentally & mutations in it have been called previously by exome seq using a similar workflow to mine. Will go back & compare details.

jessiep replied 2 years ago

(Unless you have other tips/steps I could check..? Thanks again for the help!)