I ran RepeatMasker with the following command:
RepeatMasker -species mammal -pa 32 -s GCF_000493695.1_BalAcu1.0_genomic.fna
The organism is a minke whale, a mammal, and its genome is known to be ~40% repeats, but my analysis gives me only 2%.
It only detects simple repeats and low-complexity regions in the genome.
How am I supposed to make RepeatMasker detect all kinds of repeats? Do I need to download the database to my scratch folder?
Full disclosure: I'm not an expert with RepeatMasker, but I did some digging around and it seems that, out of the box, this software only detects low-complexity and simple repeats. You can adjust the sensitivity (as you've done using the -s option), but that's about it. Another option might be to find an alternate database.
I was also able to dig up the following paper:
“Minke whale genome and aquatic adaptation in cetaceans”
In this paper, they describe using a number of different masking tools:
“The genome was searched for repetitive elements using Tandem Repeats Finder version 4.04. Transposable elements were identified using homology-based approaches. The Repbase (version 16.10) database of known repeats and a de novo repeat library generated by RepeatModeler were used. This database was used to find repeats with software such as RepeatMasker version 3.3.0.”
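If it helps, here is a rough sketch of what the de novo part of that workflow might look like. This is not the authors' exact procedure, just the usual BuildDatabase / RepeatModeler / RepeatMasker sequence. The database name "minke" and the RM_EXECUTE switch are my own placeholders, and I'm assuming RepeatModeler writes its library to <name>-families.fa and accepts -pa (newer releases may want -threads instead). By default the function only prints the commands so you can check them first.

```shell
#!/usr/bin/env bash
# Sketch: build a de novo repeat library, then mask the genome with it.
# Commands are only printed unless RM_EXECUTE=1 is set (hypothetical switch).

run() {
  echo "+ $*"
  if [ "${RM_EXECUTE:-0}" = "1" ]; then
    "$@"
  fi
}

# usage: denovo_mask <genome.fna> <database name> <threads>
denovo_mask() {
  genome="$1"
  db="$2"
  threads="$3"

  # 1. Index the assembly for RepeatModeler.
  run BuildDatabase -name "$db" "$genome"

  # 2. Discover repeat families de novo (assumed output: <db>-families.fa).
  run RepeatModeler -database "$db" -pa "$threads"

  # 3. Mask the genome with the custom library instead of -species.
  run RepeatMasker -lib "${db}-families.fa" -pa "$threads" -s "$genome"
}
```

For example, "denovo_mask GCF_000493695.1_BalAcu1.0_genomic.fna minke 32" prints the three commands; set RM_EXECUTE=1 once they look right. Note that -lib and -species are alternatives in RepeatMasker, so this run would use only the de novo library.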
Let me know if this is helpful…
But if you're still having problems, I'll see if I can find someone with a little more knowledge to help you out
(many of us are on vacation right now!).
Thanks for the help and for taking the time.
I had presumed that RepeatMasker on Cedar is using Repbase. My output looks like this:
                            number of         length   percentage
                            elements*       occupied  of sequence
SINEs:                              0           0 bp       0.00 %
      Alu/B1                        0           0 bp       0.00 %
      MIRs                          0           0 bp       0.00 %
LINEs:                              0           0 bp       0.00 %
      LINE1                         0           0 bp       0.00 %
      LINE2                         0           0 bp       0.00 %
      L3/CR1                        0           0 bp       0.00 %
      RTE                           0           0 bp       0.00 %
LTR elements:                       0           0 bp       0.00 %
      ERVL                          0           0 bp       0.00 %
      ERVL-MaLRs                    0           0 bp       0.00 %
      ERV_classI                    0           0 bp       0.00 %
      ERV_classII                   0           0 bp       0.00 %
DNA elements:                       0           0 bp       0.00 %
      hAT-Charlie                   0           0 bp       0.00 %
      TcMar-Tigger                  0           0 bp       0.00 %
Unclassified:                       0           0 bp       0.00 %
Total interspersed repeats:                     0 bp       0.00 %
Small RNA:                          0           0 bp       0.00 %
Satellites:                         0           0 bp       0.00 %
Simple repeats:                941278    37980227 bp       1.56 %
Low complexity:                196297     9559739 bp       0.39 %
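The ~2% I quoted is just the last two rows added together. A small sketch of that arithmetic, assuming the summary table has been saved to a file (the function name sum_masked is made up for illustration); it adds up only the top-level categories so that subclasses such as Alu/B1 are not counted twice:

```shell
# Sum masked bp and percentage over the top-level categories of a
# RepeatMasker summary table (read from the given file, or from stdin).
sum_masked() {
  awk '
    /^[ \t]*(Total interspersed repeats|Small RNA|Satellites|Simple repeats|Low complexity):/ {
      # The length is the field just before "bp"; the percentage follows it.
      for (i = 2; i <= NF; i++) {
        if ($i == "bp") { bp += $(i - 1); pct += $(i + 1) }
      }
    }
    END { printf "%d bp, %.2f %%\n", bp, pct }
  ' "$@"
}
```

On the table above this gives 47539966 bp, 1.95 %, all of it from simple repeats and low-complexity regions, with nothing from interspersed repeats.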
The paper ‘Insights into the Evolution of Longevity from the Bowhead Whale Genome’ has the following paragraph under Experimental Procedures:
“Evaluation of Repeat Elements
To evaluate the percentage of repeat elements, RepeatMasker (v.4.0.3; http://www.repeatmasker.org/) was used to identify repeat elements, with parameters set as “-s -species mammal.” RMBlast was used as a sequence search engine to list out all types of repeats. Percentage of repeat elements was calculated as the total number of repeat region divided by the total length of the genome, excluding the N-region. ”
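As I read it, that calculation is repeat bp divided by the genome length with the N bases left out. A minimal sketch of how I might reproduce it on the unmasked FASTA (the function names are my own, and this is not the authors' code):

```shell
# Count bases that are not N/n in a FASTA file, skipping header lines.
non_n_length() {
  awk '
    !/^>/ {
      total += length($0)        # all bases on this line
      n += gsub(/[Nn]/, "")      # gsub returns how many N/n it replaced
    }
    END { printf "%.0f\n", total - n }
  ' "$1"
}

# Percentage of repeats: $1 = repeat bp, $2 = non-N genome length.
repeat_percent() {
  awk -v r="$1" -v g="$2" 'BEGIN { printf "%.2f\n", 100 * r / g }'
}
```

Usage would be something like: repeat_percent 47539966 "$(non_n_length GCF_000493695.1_BalAcu1.0_genomic.fna)". This has to be run on the original assembly, because RepeatMasker's default masked output replaces repeats with N.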
So do I need to download the Repbase database from the site (https://www.girinst.org/server/RepBase/index.php) into my scratch folder, or can it be installed on Cedar so I can run it?
For the sake of expediency, I would advise that you download this yourself for now.
There are some licensing details we need to look into, and currently the RepBase registration link is broken.
But we will look into how we might be able to provide this centrally in the near future, and follow up here.
I am trying to register in the database too.
The link for registration is
Hi again, I had a chance to read through the RepBase license agreement and also spoke with some other members of the team…
Under the current conditions of the license, it looks like Compute Canada won’t be able to host this centrally, which is unfortunate, but it is what it is.
And so, my earlier advice holds true, that is, you need to register and download your own personal copy that can be shared only within your research group.
Thank you very much,