WGS GenPipes pipeline with dnaseq.py example setup
Marco asked 1 month ago



Hello … I would like to run an hg38 WGS GenPipes pipeline with dnaseq.py on Graham or Cedar. Is there an example setup available somewhere? Thanks in advance.

Rob Syme Staff replied 1 month ago

Hi Marco

There is a tutorial on the GenPipes documentation site that might be helpful: https://genpipes.readthedocs.io/en/latest/get-started/using_gp.html

Let me know if you’ve already had a read through the documentation and are looking for something else.

6 Answers
Marco answered 1 month ago



I started to test $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.graham.ini on Graham, but that file gives error after error.
I copied it to my local directory and edited it with elements taken from $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.base.ini, but that still gives a never-ending stream of errors when running dnaseq.py …
module load mugqic/python/2.7.14
module load mugqic/genpipes/3.1.4
dnaseq.py -t mugqic -c \
./dnaseq.graham.ini \
-r readset.dnaseq.txt \
-s 1-34 > ./dnaseqCommands_mugqic.sh
I then started to test $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.base.ini, which finishes and produces the dnaseqCommands_mugqic.sh file, but running that with sbatch also gives error after error. For example, the script contains qsub commands, which are not supported on Graham or Cedar.
dnaseq.py -l debug -t mugqic -c $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.base.ini -r ./readset.dnaseq.txt -s 1-34 > ./dnaseqCommands_mugqic.sh
I think my next step this morning will be to test “batch”. I am not sure what each of the scheduler options means, so it will be trial and error.
-j {pbs,batch,daemon,slurm}, --job-scheduler {pbs,batch,daemon,slurm} job scheduler type (default: slurm)
I am also wondering how to run dnaseqCommands_mugqic.sh on Graham. Obviously sh ./dnaseqCommands_mugqic.sh is not a good idea.
Running sbatch dnaseqCommands_mugqic.sh straight like that also doesn’t seem to be a good idea. It would quickly run out of memory and wall-time.
So, I added these lines to give the script some “space” to run, but then as mentioned it crashes with errors.
I tried to correct the errors in the script, but with the many qsub commands in the script, that seemed hopeless yesterday.
#!/bin/bash
#SBATCH --account=def-xxxxxxxxx
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=0
#SBATCH --time=2-00:00:00
Thanks for any help. If you are working with Graham or Cedar, please test $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.graham.ini and the corresponding Cedar file.
I think that's where new users should start and get their instructions. Hopefully this discussion will provide that.
Thanks again.

Marco answered 1 month ago



I tried to get it running with “batch” mode …
dnaseq.py -l debug -t mugqic -c $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.base.ini -r ./readset.dnaseq.txt -s 1-34 -j batch -o ./genpipe > ./dnaseqCommands_mugqic_batch.sh

dnaseq.py finishes without errors: INFO:core.pipeline:TOTAL: 85 jobs created

But when I test it for quick error checking with: sh ./dnaseqCommands_mugqic_batch.sh, I am getting …
Begin MUGQIC Job sym_link_fastq.pair_end.HEK-MCB2b at 2019-11-07T07:54:13
./dnaseqCommands_mugqic_batch.sh: line 72: : command not found
Line 72 can be seen in the attached screenshot (2019-11-07_0804).

Rob Syme Staff answered 1 month ago



Hi Marco
There’s a lot to unpack here, so I’ll try to break it down into components. If I’ve missed anything, or if something is unclear, please feel free to reply below.

# Problem 1 – GenPipes Versioning
I think that using the latest version of GenPipes will solve the bulk of your problems. I can’t see which version you’re using, but the latest version (at the time of writing) can be loaded in two steps:
Step 1: Modify your ~/.bashrc to include the module definitions. Your ~/.bashrc should include the lines:

export MUGQIC_INSTALL_HOME=/cvmfs/soft.mugqic/CentOS6
module use $MUGQIC_INSTALL_HOME/modulefiles

Step 2: Before using the genpipes tools in a session, load the module. This step can also be included in your ~/.bashrc if you want this particular version of genpipes always loaded when you log in.

$ module load mugqic/genpipes/3.1.4

Is this the GenPipes version you’re currently using? Loading the module will set environment variables such as $MUGQIC_PIPELINES_HOME, which will ensure that the right configuration INI files are being used.
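A quick sanity check (just standard shell commands, nothing GenPipes-specific) is to load the module, confirm the variable is set, and list the DNA-Seq pipeline directory, where dnaseq.base.ini and the cluster-specific INI files live:

$ module load mugqic/genpipes/3.1.4
$ echo $MUGQIC_PIPELINES_HOME
$ ls $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/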

# Problem 2 – Testing INI configurations
Question from Marco
“I started to test $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.graham.ini on Graham, but that file gives error after error. I copied it to my local directory to edit it…”
Answer
I’d recommend that you specify multiple configuration files in order of increasing specificity. If there is a parameter specified in the second ini file, it will override whatever was specified in the first ini file. Parameters specified in the third ini file will override those specified in the second ini file, etc.
The config files I usually use are:

  1. The pipeline defaults
  2. The cluster defaults
  3. A custom ini file overriding just those options particular to the current workflow

Something like:

dnaseq.py \
  --config $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.base.ini \
           $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.cedar.ini \
           custom.ini [other options...] > dnaseqCommands.sh

This keeps the custom ini as small and readable as possible.
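As a rough illustration, the custom ini might contain nothing more than a handful of overrides. The section and key names below are examples only; check dnaseq.base.ini and the cluster ini for the exact names your GenPipes version uses:

[DEFAULT]
# request a longer default walltime than the cluster ini provides
cluster_walltime = --time=24:00:00

[bwa_mem_picard_sort_sam]
# give the alignment step a full node's worth of cores
cluster_cpu = -N 1 -n 32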

# Problem 3 – Scheduler
You said that you’d like to run the pipeline on either Cedar or Graham. These facilities both use the SLURM scheduler. The default scheduler is slurm, but to make sure, you can use the “-j” argument to specify it manually:

dnaseq.py -j slurm [other options...] > dnaseqCommands.sh

For example, on Guillimin, which uses PBS, you would run

dnaseq.py -j pbs [other options...] > dnaseqCommands.sh

The “-j batch” option runs each of the commands locally without submitting jobs to the scheduler. I would not recommend this on Compute Canada infrastructure, as it will consume the resources on the login node which makes the system unstable for other users. This option is for using GenPipes on non-HPC infrastructure without a scheduling system.

# Problem 4 – Running the commands
The bash file created by the dnaseq.py command will take care of submitting jobs. This script can be run directly from a login node, so there is no need to submit it to the scheduler. From the login node, something like this should be sufficient to run the first five steps:

$ module load mugqic/genpipes/3.1.4
$ dnaseq.py \
    -j slurm \
    -s 1-5 \
    --readsets readset.dnaseq.txt \
    --config $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.base.ini \
             $MUGQIC_PIPELINES_HOME/pipelines/dnaseq/dnaseq.cedar.ini \
             custom.ini > dnaseqCommands.sh
$ bash dnaseqCommands.sh
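Once dnaseqCommands.sh has finished submitting, the jobs can be followed with the usual Slurm commands, for example:

$ squeue -u $USER        # pending and running jobs
$ sacct -j <jobid>       # accounting details for a finished job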

I hope this makes sense, but please let me know if I’ve misunderstood anything or if you have outstanding questions.

Marco answered 1 month ago



Thanks for the extensive explanation.
It feels like I’m making progress, but the script is still producing errors…

>bash ./dnaseqCommands_mugqic.sh
sbatch: invalid option -- 'l'
Try "sbatch --help" for more information

Marco replied 1 month ago

First sbatch line …
sbatch --mail-type=END,FAIL --mail-user=$JOB_MAIL -A $RAP_ID -D $OUTPUT_DIR -o $JOB_OUTPUT -J $JOB_NAME -l walltime=24:00:0 --mem=8G -l nodes=1:ppn=1 | grep "[0-9]" | cut -d\ -f4)

Marco replied 1 month ago

I had to replace …
-l nodes=1:ppn=1 with -n 1 -N 1
-l walltime= with --time=
throughout the dnaseqCommands_mugqic.sh file.
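Something like this sed pass would make those substitutions in one go (a rough approach; it keeps a .bak copy of the generated script):

sed -i.bak -e 's/-l nodes=1:ppn=1/-n 1 -N 1/g' \
           -e 's/-l walltime=/--time=/g' ./dnaseqCommands_mugqic.sh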

But there is a new error … sbatch: error: Unable to open file 3

Marco answered 1 month ago



It started a bunch of jobs. How does the system make sure they run in the correct order?
(screenshot attached: 2019-11-08_0905)

Rob Syme Staff answered 1 month ago



The correct ordering of jobs is determined by specifying job dependencies. If job B relies on the outputs of job A, then when GenPipes submits job B, it tells the scheduler that job B can’t start until job A has finished without error.
When job A is submitted, it returns an identifier, which GenPipes saves in a variable (let’s say $jobIDA). When it comes to submitting job B, GenPipes adds some arguments to the sbatch command. On Slurm, it will add something like:

sbatch [other args] --dependency=afterok:$jobIDA jobB.sh

This dependency specification ensures that the slurm scheduler won’t run job B until *after* job A has successfully completed.
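Written out by hand, the pattern is roughly the following (a sketch only; the script GenPipes generates parses the job ID slightly differently):

$ jobIDA=$(sbatch --parsable jobA.sh)   # --parsable makes sbatch print just the job ID
$ sbatch --dependency=afterok:$jobIDA jobB.sh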

If you’re interested in learning more about job dependencies for your own non-GenPipes work, I’d start here (but there are plenty of resources online). For complex dependency trees, managing these connections becomes unwieldy, and you’ll eventually end up writing a workflow management system (which is exactly how GenPipes began, I think).