I need to use SPAdes to assemble some genomes and will be using the Compute Canada server. SPAdes is globally available on Compute Canada.
I have done the following:
– used module load to load the correct SPAdes version as well as the necessary dependencies
– verified the installation using spades.py --test
– ran SPAdes through the job scheduler using the following command in the scratch directory: sbatch spades.py -1 path_to_file_in_scratch.fastq -2 path_to_file_in_scratch.fastq -o output_dir/ --careful --mem=180000
– when running SPAdes without using the scheduler, the program starts running correctly but eventually runs out of memory
The problems I encounter are as follows:
– I get a SPAdes log error saying that SPAdes was terminated by a KeyboardInterrupt (which I did not do, as sbatch had the program running in the background and had assigned it a job ID)
– the slurm-jobID.out file reads:
    File "/var/spool/slurmd/job32547551/slurm_script", line 21, in <module>
    import spades_init
    ModuleNotFoundError: No module named 'spades_init'
Does anyone know what the problem could be?
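As zhibin has noted, it may solve your problems to submit a small shell script to the cluster rather than simply prepending sbatch to your command. This will enable you to load the modules before running SPAdes. It should also address the ModuleNotFoundError: the traceback shows spades.py being run as a copy under /var/spool/slurmd/, where it cannot find spades_init alongside it. There are a couple of other hints that I can give you: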
# SPAdes Arguments #
The memory parameter (--memory) takes the number of GB. From the output of running spades.py without arguments:
-m/--memory <int> RAM limit for SPAdes in Gb (terminates if exceeded) [default: 250]
So your command above would be asking for 180 TB of RAM, far more than any Compute Canada node has. I’d recommend something like
--memory 180 instead.
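To make the unit mismatch concrete, here is a small sketch of the arithmetic. The variable names are made up for illustration, and it assumes decimal units (1 TB = 1000 GB).

```shell
#!/bin/bash
# SPAdes reads --memory as a number of GB, so compare what was
# intended with what was actually passed on the command line.
intended_gb=180          # what the job needs
passed_value=180000      # the value from the original command

# 180000 GB expressed in TB (decimal units, assumed for illustration)
passed_tb=$(( passed_value / 1000 ))

echo "intended: ${intended_gb} GB, passed: ${passed_value} GB (about ${passed_tb} TB)"
```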
By default, SPAdes uses 16 threads, but you can specify the number with the
--threads argument. If you’re going to be using 180GB of RAM, you will almost certainly be using all of the RAM on one node. If so, there will be no other jobs running on the same node, so you might as well use all of the CPUs.
If I were you, I’d also include the SPAdes argument
--threads 20 or even
--threads 40 to use the CPUs available to you.
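One way to keep the thread count in step with what the scheduler has given you is to read it from the job environment rather than hard-coding it. This is only a sketch: `pick_threads` is a hypothetical helper name, and it assumes the job was submitted with `--cpus-per-task`, in which case Slurm sets `SLURM_CPUS_PER_TASK` inside the job.

```shell
#!/bin/bash
# Hypothetical helper: use the CPU count Slurm allocated to this task,
# falling back to 16 (SPAdes' default thread count) when the variable
# is not set, e.g. when running outside of a job.
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-16}"
}

threads="$(pick_threads)"
echo "--threads ${threads}"
```

Inside a job script, the result could then be passed along as `--threads "$threads"` on the spades.py command line.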
# Sharing run requirement information with the scheduler #
When you submit a job to the queue, the scheduler tries to share the available resources as fairly as possible. If there are jobs that only require 10GB of RAM and 2 CPUs, many of them can fit on a single node. If a job only asks for 10GB of RAM, the scheduler will kill it if it uses more than its share, so that other people’s jobs running on the same node are unaffected. To assign jobs to nodes efficiently, the scheduler needs a little information about what your job requires.
For example, if we know that SPAdes is going to ask for 180GB of RAM, it will be helpful to let the scheduler know as much. The script you submit using
sbatch can include a number of "directives", or "options", that set resource requirements like memory, CPUs, time, etc. If you are going to need 180GB of RAM, it would be a good idea to include the directive
#SBATCH --mem=180G.
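Since the scheduler kills a job that goes over its memory request, it can help to derive the SPAdes limit from the Slurm request so the two never drift apart. The following is a minimal sketch, not a tested recipe: `spades_mem_gb` is a hypothetical helper, it assumes Slurm exports `SLURM_MEM_PER_NODE` in MB when `--mem` is used, and the fallback of 250 is simply SPAdes' own default quoted above.

```shell
#!/bin/bash
# Hypothetical helper: turn the job's memory request (assumed to be in MB
# in SLURM_MEM_PER_NODE) into the whole number of GB that SPAdes expects.
spades_mem_gb() {
    local slurm_mb="${SLURM_MEM_PER_NODE:-}"
    if [ -z "$slurm_mb" ]; then
        # Not inside a job with --mem set: fall back to SPAdes' default.
        echo 250
        return
    fi
    # Slurm treats 1G as 1024M, so --mem=180G shows up as 184320.
    echo $(( slurm_mb / 1024 ))
}

echo "--memory $(spades_mem_gb)"
```

With `#SBATCH --mem=180G`, this would give `--memory 180`, matching the value recommended earlier.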
# Putting it all together #
Let’s say you have the file
run.sh in the directory that contains your reads. The script
run.sh might look something like:
    #!/bin/bash
    #SBATCH --account=<YOUR ACCOUNT HERE>
    #SBATCH --time=24:00:00
    #SBATCH --cpus-per-task=40
    #SBATCH --ntasks=1
    #SBATCH --mem=180G

    module load mugqic/SPAdes/3.10.0
    spades.py -1 fwd.fastq -2 rev.fastq -o out --careful --memory 180 --threads 20
From that same directory, you can submit the job to the cluster by running:
$ sbatch run.sh
I hope this helps!
Hi Rob, thanks for the extensive answer. I managed to run the script yesterday but it was running quite slowly. Adding #SBATCH --cpus-per-task=40 and #SBATCH --ntasks=1 made it a lot faster.
Wonderful! More efficient use of the cluster is a win-win for everybody 🙂