SPAdes use on Compute Canada – Error: ModuleNotFoundError: No module named 'spades_init'
pauluga asked 2 years ago

I need to use SPAdes to assemble some genomes and will be using the Compute Canada server. SPAdes is globally available on Compute Canada.
I have done the following:
– used module load to install the correct SPAdes version as well as the necessary dependencies
– verified the correct installation using --test
– ran SPAdes through the job scheduler using the following command in the scratch directory:

sbatch -1 path_to_file_in_scratch.fastq -2 path_to_file_in_scratch.fastq -o output_dir/ --careful --mem=180000

– when running SPAdes without the scheduler, the program starts running correctly but eventually runs out of memory

The problems I encounter are as follows:
– I get a SPAdes log error saying the SPAdes program was terminated by a KeyboardInterrupt (which I did not trigger, as sbatch had the program running in the background and had assigned a job ID to it)
– the slurm-jobID.out files read:

"File "/var/spool/slurmd/job32547551/slurm_script", line 21, in <module>
    import spades_init
ModuleNotFoundError: No module named 'spades_init'"
Does anyone know what the problem could be?

2 Answers
Best Answer
Rob Syme Staff answered 2 years ago

As zhibin has noted, it may solve your problems to submit a small shell script to the cluster rather than simply prepending sbatch to your command. This will let you load the required modules before running SPAdes. There are a couple of other hints that I can give you:

# SPAdes Arguments #

Memory Usage

The memory parameter (--memory) takes the number of GB. From the output of running SPAdes without arguments:

-m/--memory    <int>   RAM limit for SPAdes in Gb (terminates if exceeded)
                       [default: 250]

So your command above would be requesting 180,000 GB (180 TB) of RAM, which would immediately exceed any of the Compute Canada nodes. I’d recommend something like --memory 180 instead.
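An easy way to trip over this is that sbatch and SPAdes use different units for the same kind of limit: sbatch's --mem defaults to megabytes, while SPAdes' --memory takes gigabytes. A quick sketch of the conversion, using the 180 GB figure from your command:

```shell
# Target limit: 180 GB of RAM (the figure from the question).
MEM_GB=180
MEM_MB=$(( MEM_GB * 1024 ))   # sbatch --mem is in megabytes by default

echo "sbatch directive: --mem=${MEM_MB}"    # or simply --mem=180G
echo "SPAdes argument:  --memory ${MEM_GB}" # SPAdes --memory is in gigabytes
```

Passing 180000 to sbatch therefore means roughly 180 GB, but the same number passed to SPAdes means 180,000 GB.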

CPU usage

SPAdes uses 16 CPUs by default, but you can specify the number with the --threads argument. If you’re going to be using 180GB of RAM, you will almost certainly be using all of the RAM on one node. If so, there will be no other jobs running on the same node, so you might as well use all of the CPUs.
If I were you, I’d also include the SPAdes argument --threads 20 or even --threads 40 to use the CPUs available to you.
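If you’d rather not hard-code the thread count, one sketch (assuming the script runs under Slurm, which sets SLURM_CPUS_PER_TASK when a --cpus-per-task directive is present) is to derive it from the allocation:

```shell
# Use however many CPUs Slurm allocated to this job; fall back to SPAdes'
# default of 16 when the variable is unset (e.g. when testing outside a job).
THREADS="${SLURM_CPUS_PER_TASK:-16}"
echo "will pass: --threads ${THREADS}"
```

That way the SPAdes thread count always matches whatever you requested in the #SBATCH directives.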

# Sharing run requirement information with the scheduler #

When you submit a job to the queue, the scheduler tries to share the available resources as fairly as possible. If there are jobs that only require 10GB of RAM and 2 CPUs, many of them can fit on a single node. The scheduler also makes sure that a job which asks for only 10GB of RAM is killed if it uses more than its share, so that other people’s jobs running on the same node are unaffected. To do the work of efficiently assigning jobs to nodes, the scheduler needs a little information about what your job requires.

For example, if we know that SPAdes is going to ask for 180GB of RAM, it is helpful to let the scheduler know as much. The script you submit using sbatch can include a number of “directives”, or “options”, that set resource requirements like memory, CPUs, time, etc. If you are going to need 180GB of RAM, it would be a good idea to include the directive #SBATCH --mem=180G.

# Putting it all together #

Let’s say you save the script below (named run_spades.sh here, but any name works) in the directory that contains your reads. It might look something like:

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=40
#SBATCH --ntasks=1
#SBATCH --mem=180G

module load mugqic/SPAdes/3.10.0
spades.py -1 fwd.fastq -2 rev.fastq -o out --careful --memory 180 --threads 40

From that same directory, you can submit the job to the cluster by running:

$ sbatch run_spades.sh

I hope this helps!

pauluga replied 2 years ago

Hi Rob, thanks for the extensive answer. I managed to run the script yesterday but it was running quite slowly. Adding #SBATCH --cpus-per-task=40 and #SBATCH --ntasks=1 made it a lot faster.

Rob Syme Staff replied 2 years ago

Wonderful! More efficient use of the cluster is a win-win for everybody 🙂

zhibin Staff answered 2 years ago

You cannot run jobs just by adding “sbatch” in front of your command. Here are the instructions on how to run jobs on the cluster:

pauluga replied 2 years ago

Thank you, this was the problem.