I need to BLAST several genomes (>30) against the NR database from NCBI, which is a huge database. Each genome has between 10,000 and 50,000 proteins. For that, I am using blastp with 16 threads.
I downloaded the NR database to my scratch folder (CEDAR).
Right now I am testing a genome with almost 18,000 proteins. My search strategy was to split it into files of 10 proteins each and give each submitted job 8 files. That leaves me with more than 200 submitted jobs. Each job takes 6 to 12 h to search its 80 proteins.
The problem is that the jobs sit in the queue for too long (right now many of them have been waiting for more than 30 days). I am submitting the jobs with a 20 h maximum wall time and at least 8000 MB of memory.
I would like to know if there is a faster strategy for the BLAST search of these genomes. Also, is it possible to avoid the long waiting times in the queue?
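For anyone reproducing the chunking scheme described in the question, here is a minimal sketch of it, not the poster's actual script. The function names, the `chunk_00000.faa` naming and the output layout are my own assumptions; only the 10-proteins-per-file and 8-files-per-job figures come from the post.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_fasta(fasta_path: str, out_dir: str, per_file: int = 10) -> List[Path]:
    """Write chunks of `per_file` records each and return the chunk paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    with open(fasta_path) as handle:
        records = read_fasta_records(handle)
        while True:
            batch = list(islice(records, per_file))
            if not batch:
                break
            # Hypothetical naming scheme: chunk_00000.faa, chunk_00001.faa, ...
            chunk = out / f"chunk_{len(chunks):05d}.faa"
            chunk.write_text("".join(batch))
            chunks.append(chunk)
    return chunks


def group_for_jobs(chunks: List[Path], files_per_job: int = 8) -> List[List[Path]]:
    """Group chunk files into one list per submitted job."""
    return [chunks[i:i + files_per_job] for i in range(0, len(chunks), files_per_job)]
```

With about 18,000 proteins this gives about 1,800 chunk files and about 225 jobs of 80 proteins each, which matches the "more than 200" jobs mentioned above.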
I’m not sure about the strategy itself; perhaps someone else will chime in on that point.
What I can tell you is that time in the queue is related (amongst other things) to the wall time and memory you are requesting.
So I would suggest lowering the wall time to 16 h for starters. Since your jobs take 6 to 12 h, this still leaves 4 h of headroom over the longest ones (a quarter of the request) and should be adequate. If this doesn’t help, you can try reducing your wall time further, at least until the point where you see jobs failing because they have exceeded the specified wall time.
As for the memory, I’m guessing that breaking up your analysis into smaller jobs won’t help, since it’s the loading of the NR database that requires the 8 GB. However, you can check how accurate that estimate is. I usually like to submit with about a 10% overage on memory, so check the memory footprint of your finished jobs and see if you can reduce the memory request as well.
So give this a try and let me know.
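To make the "observed usage plus a margin" arithmetic above concrete, here is a small sketch. The helper names are hypothetical, and the observed values in the usage note are made-up examples rather than measurements from this thread; the real numbers would come from the accounting data of your finished jobs (for instance the Elapsed and MaxRSS fields reported by Slurm's `sacct`).

```python
def padded_request(observed: int, margin_pct: int) -> int:
    """Return `observed` plus `margin_pct` percent, rounded up to an integer.

    Integer arithmetic is used so that, e.g., 7000 + 10% is exactly 7700.
    """
    if observed < 0 or margin_pct < 0:
        raise ValueError("observed and margin_pct must be non-negative")
    # Ceiling division without floating point.
    return -(-observed * (100 + margin_pct) // 100)


def format_walltime(minutes: int) -> str:
    """Format a number of minutes as HH:MM:SS for a Slurm --time request."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


# Illustrative values only (assumptions, not measurements from the thread):
longest_run_min = 12 * 60   # longest job observed, in minutes
peak_memory_mb = 7000       # peak memory observed, in MB

walltime = format_walltime(padded_request(longest_run_min, 25))
memory_mb = padded_request(peak_memory_mb, 10)
```

With these example inputs, `walltime` is `15:00:00` and `memory_mb` is `7700`. The 25% wall-time margin is just one possible choice; the 10% memory margin is the rule of thumb mentioned above.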
Also, I can investigate on my end to see if we can find out why the wait times are so long. In order to do so, I need to know:
1. your userid
2. if you have a resource allocation
Thanks for your suggestions!
Other things to consider to speed up your BLAST search:
1) Try copying NR and your query FASTA to the node-local ramdisk (/tmp on Cedar, I think) and run makeblastdb at the beginning of the script on the node to generate the NR BLAST database on the ramdisk.
2) Use multi-threading (option -num_threads), even though its implementation in BLAST is not very efficient.
3) Lower the number of hits returned, if applicable (-max_target_seqs; the -max_hsps option can also help).
4) Also try limiting your hit list to near-identical hits with an E-value filter, if applicable (-evalue).
If you use these methods to reduce the job runtime, you can adjust your wall time accordingly. This should help to reduce the wait times in the queue.
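As a sketch of how options 2 to 4 could be combined in one call, the helper below assembles a blastp argument list. The function name and all default values (5 targets, 1 HSP, E-value 1e-5, tabular `-outfmt 6`) are illustrative assumptions, not recommendations from this thread; only the 16 threads come from the original question. Pick thresholds that suit your own analysis.

```python
import shlex
from typing import List


def build_blastp_command(
    query: str,
    db: str,
    out: str,
    threads: int = 16,          # from the original question
    max_target_seqs: int = 5,   # assumed value, tune as needed
    max_hsps: int = 1,          # assumed value, tune as needed
    evalue: float = 1e-5,       # assumed value, tune as needed
) -> List[str]:
    """Build a blastp argument list that limits the reported hits."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",  # tabular output (an assumption, not from the thread)
        "-num_threads", str(threads),
        "-max_target_seqs", str(max_target_seqs),
        "-max_hsps", str(max_hsps),
        "-evalue", str(evalue),
    ]


# Hypothetical file names, for illustration only.
cmd = build_blastp_command("chunk_00000.faa", "nr", "chunk_00000.tsv")
print(shlex.join(cmd))
```

Inside a job script the list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names.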
jflucier and jrosner, thanks for your suggestions!
jflucier, regarding suggestion #1, do I have to run “makeblastdb” for every job submission? In this case I think it might take a long time since the NR database is huge. Or do you think loading the NR database as I am doing now takes a longer time? Thanks!
I don't remember how long it takes.
Another possible strategy is to generate the BLAST database with makeblastdb only once. Once it is generated, submit your blastp jobs and have each one copy the files makeblastdb produced (*.phr, *.pin, *.psq for a protein database such as NR; *.nhr, *.nin, *.nsq are the nucleotide equivalents) to the compute node before calling blastp.
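The "build once, copy per job" idea can be sketched as follows. This is an assumption-laden outline, not a tested Cedar recipe: `stage_blast_db` is a hypothetical helper, the scratch path in the usage note is a placeholder, and using the `SLURM_TMPDIR` environment variable for node-local storage (with `/tmp` as a fallback) is my assumption about the cluster setup.

```python
import os
import shutil
from pathlib import Path


def stage_blast_db(db_dir: str, db_name: str, local_dir: str) -> str:
    """Copy every file of a prebuilt BLAST database to node-local storage.

    Returns the path to pass to `blastp -db`.
    """
    src = Path(db_dir)
    dst = Path(local_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # Matches nr.pal, nr.00.phr, nr.00.pin, nr.00.psq, ... for db_name "nr".
    files = sorted(p for p in src.glob(db_name + ".*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"no files for database {db_name!r} in {src}")
    for path in files:
        target = dst / path.name
        # Skip files that are already present with the same size.
        if not target.exists() or target.stat().st_size != path.stat().st_size:
            shutil.copy2(path, target)
    return str(dst / db_name)


# ASSUMPTION: node-local directory taken from SLURM_TMPDIR, else /tmp.
local_root = os.environ.get("SLURM_TMPDIR", "/tmp")
```

A job would then call something like `stage_blast_db("/scratch/<user>/nr_db", "nr", local_root)` once at the start and pass the returned path to blastp. The glob copies everything named `nr.*`, so keep unrelated files (archives, logs) out of the database directory, and check that the node-local storage is large enough for NR before relying on this.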
We had a chance to look at your job submissions, and it looks like you’re near the bottom of the priority list on cedar. To compound this matter, there are many, many jobs sitting in the queue. So what to do?
Well, jobs are partitioned into bins according to the wall time they request. You can see the bins here
(scroll down to the “Time Limits” section)
And so, your jobs are currently sitting in “bin 3” since you are asking for more than 12hrs. In order to move to “bin 2”, which will give your jobs a better chance for backfilling nodes, you should choose a wall time slightly under 12hrs.
Similarly, base nodes have 4 or 8 GB of memory per core, so try requesting just under 8 GB.
Also take some time to read through the document I posted above. There is a lot of good information in it that can help you determine optimal job submission parameters.
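To illustrate the binning argument, the sketch below maps a requested wall time to a bin number and prints matching `#SBATCH` lines. The full list of bin limits is an assumption on my part; the thread itself only establishes that 12 h is the boundary between "bin 2" and "bin 3", so check the "Time Limits" section of the documentation for the real values. The memory figure is likewise only an example of "just under 8 GB".

```python
from typing import Sequence

# ASSUMPTION: upper limits of the wall-time bins, in hours. Only the 12 h
# boundary is confirmed by the discussion above.
ASSUMED_BIN_LIMITS_H = (3, 12, 24, 72, 168, 672)


def walltime_bin(requested_hours: float,
                 limits: Sequence[float] = ASSUMED_BIN_LIMITS_H) -> int:
    """Return the 1-based index of the smallest bin that covers the request."""
    for index, limit in enumerate(limits, start=1):
        if requested_hours <= limit:
            return index
    raise ValueError("requested wall time exceeds the largest bin")


def sbatch_header(time_hms: str, mem_mb: int, cpus: int) -> str:
    """Build the resource lines of a Slurm job script."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --time={time_hms}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_mb}M",
    ])


print(walltime_bin(20))     # the current 20 h request
print(walltime_bin(11.75))  # a request slightly under 12 h
# Illustrative values: 11 h 45 min, 7800 MB, 16 threads.
print(sbatch_header("11:45:00", 7800, 16))
```

Under the assumed limits, the 20 h request lands in bin 3 and an 11 h 45 min request lands in bin 2, which is the move suggested above.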
The other thing you can do to improve your wait time in the queue is to submit a Compute Canada Resource Allocation Competition (RAC) application.
The main reason you are near the bottom of the queue is that everyone with an allocation will get onto the nodes before you do.
The call for this competition just went out, so you have until Nov 8th to submit a request.
This application needs to be submitted by your PI, and you will be interested in the Resources for Research Groups (RRG) allocation.
You can find all of the information about this here
Hope this helps!
Thanks for all the information! I will try the suggestions above to reduce the queue waiting times and the job duration.