Looks like job has completed but logfile is incomplete
Ella Bowles asked 2 years ago

According to my logfiles, alignment seems to have worked for some of my samples but not for all of them, yet when I look at the file sizes of the aligned samples it looks like everything is there. I’m wondering how I can go about diagnosing what is wrong, or if someone can help me figure this out.
For reference, I am working with GBS data using the Stacks pipeline and BWA for alignment to a reference genome. I’m aligning lake trout data to the Arctic char genome.
I’m not sure if I can post directories and paths to files to check here in the same way as I can when I contact my cluster support (I work on Graham), but in case I can, here goes. It looks in every way like “” found in “/home/ebowles/projects/def-salmo/ebowles/LT_stacksV2/scripts” has completed properly.  In “/home/ebowles/projects/def-salmo/ebowles/LT_stacksV2/alignments” all the samples are listed and have stuff in them, but in “align_per_samp_reads_test-12601216.out” found in the ROOT directory only 14 of the 19 samples have full information for alignment (the latter half of the script). Yet, at the beginning all 19 samples are listed as being complete.
Also, some of the .oe files found in the alignments directory have the following at the bottom:
[main] CMD: bwa mem -M /home/ebowles/projects/def-salmo/ebowles/LT_stacksV2/genome/bwa/saal /home/ebowles/projects/def-salmo/ebowles/LT_stacksV2/cleaned/LT-58.fq.gz
[main] Real time: 2918.384 sec; CPU: 2935.308 sec
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
And others have the following at the bottom:
[main] CMD: bwa mem -M /home/ebowles/projects/def-salmo/ebowles/LT_stacksV2/genome/bwa/saal /home/ebowles/projects/def-salmo/ebowles/LT_stacksV2/cleaned/LT-499.fq.gz
[main] Real time: 1052.495 sec; CPU: 1058.569 sec
For these latter samples the [bam_sort_core] line is missing. Why would this be the case?
With thanks,
Ella Bowles

MarkHahn replied 2 years ago

Can you provide the jobids relevant to the problem you’re describing?
From the scheduler history, it looks like you made two attempts earlier today which timed out (that means they hit the time limit you specified, and were killed):

[hahn@gra-login1 ~]$ sacct --allocations -u ebowles --starttime 2019-03-17 --format=jobid,start,end,ncpus,state
JobID        Start               End                 NCPUS      State
------------ ------------------- ------------------- ---------- ----------
12742458     2019-03-18T11:48:38 Unknown             32         RUNNING
12746923     2019-03-18T12:34:14 2019-03-18T13:34:31 4          TIMEOUT
12749225     2019-03-18T13:40:11 2019-03-18T15:40:35 4          TIMEOUT

When jobs are killed like that, we’d expect them to only get partway through their work (as you describe, I think).

The touchstone for figuring out your own jobs is to look at their primary output – that is, what they would print to your terminal, if you ran them directly. For instance, the output of the first job (~/projects/def-salmo/ebowles/LT_stacksV2/gstacks_test-12746923.out) clearly shows its fate:

Starting run at: Mon Mar 18 12:34:18 EDT 2019
Running gstacks on default BWA alignments…
slurmstepd: error: *** JOB 12746923 ON gra636 CANCELLED AT 2019-03-18T13:34:31 DUE TO TIME LIMIT ***
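
To apply the same check to a whole directory of job output, one option is to search for the scheduler's kill message. This is only a minimal sketch: it assumes the Slurm output files end in .out, and the function name timed_out_logs is made up for illustration.

```shell
# Print the names of Slurm output files that contain the time-limit kill message.
# $1: directory holding the *.out files (the directory layout is an assumption)
timed_out_logs() {
    grep -l 'DUE TO TIME LIMIT' "$1"/*.out 2>/dev/null
}
```

Calling it as timed_out_logs ~/projects/def-salmo/ebowles/LT_stacksV2 would list every .out file in that directory whose job was cancelled for running out of time.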

I notice that neither of these jobs used more than one processor, though they both allocated four. Perhaps that’s the problem? If you were expecting to take advantage of parallelism, perhaps the job wouldn’t have hit its elapsed-time limit if it had done work four times as fast. (This assumes that your workflow actually *is* parallel, that it can use multiple processors efficiently…)
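
One way to check whether a job really used the processors it allocated is to compare the CPU time it consumed with its elapsed time multiplied by the number of allocated CPUs. The sketch below is illustrative only: job_summary and cpu_efficiency are hypothetical helper names, and the seconds passed to cpu_efficiency have to be read off the sacct output by hand.

```shell
# Show elapsed time, time limit, allocated CPUs and consumed CPU time for a job.
# $1: Slurm job ID
job_summary() {
    sacct -j "$1" --format=JobID,Elapsed,Timelimit,AllocCPUS,TotalCPU,State
}

# Percentage of the allocated CPU time that was actually used.
# $1: CPU seconds consumed, $2: elapsed wall-clock seconds, $3: allocated CPUs
cpu_efficiency() {
    awk -v cpu="$1" -v wall="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * cpu / (wall * n) }'
}
```

For example, cpu_efficiency 3600 3600 4 prints 25.0, which is what a single-threaded program looks like on a four-CPU allocation. If the tool supports threads (bwa mem, for instance, takes -t), the thread count has to be passed explicitly; allocating more CPUs does not speed it up by itself.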

Ella Bowles replied 2 years ago

I think Jules Gagnon might have figured it out. This issue wasn’t with respect to gstacks (the output you reference here) but instead about alignment. I’ll post again with the solution that Jules gave me if it works. The job is currently running.

2 Answers
MarkHahn answered 2 years ago

Since this isn’t a domain question about bioinformatics/tools, you should probably open a ticket.
I see that you currently have a job running, and that there is recent output in the relevant directory. I’m not clear on what you mean by “looks like” the job has completed, though. (Also, an earlier job was terminated when it ran out of time.)

zhibin Staff answered 2 years ago

I cannot access your directory. I think the [main] Real time line is the last line from bwa, and [bam_sort_core] comes from samtools sort. Are all your result files in BAM format?
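
A minimal sketch of how one might check this, assuming the logs end in .oe and the sorted alignments end in .bam (both function names are made up for illustration). As far as I know, samtools sort prints the [bam_sort_core] merging line only when it had to write temporary files, so a small sample that was sorted entirely in memory may legitimately lack it; treat that as an assumption to verify, and rely on the BAM check rather than on the log line alone.

```shell
# List logs where bwa finished ("[main] Real time" is present) but
# samtools sort printed no "[bam_sort_core]" line.
# $1: directory holding the *.oe logs (suffix taken from the question)
logs_without_sort_merge() {
    for f in "$1"/*.oe; do
        [ -e "$f" ] || continue
        if grep -qF '[main] Real time' "$f" && ! grep -qF '[bam_sort_core]' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Run samtools quickcheck on every BAM file; it tests for a valid header
# and an intact end-of-file block. Returns non-zero if any file fails.
# $1: directory holding the *.bam files
check_bams() {
    status=0
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue
        if ! samtools quickcheck "$bam"; then
            printf 'FAILED: %s\n' "$bam"
            status=1
        fi
    done
    return "$status"
}
```

If check_bams reports nothing for the alignments directory, the files are at least structurally complete, and the missing log line is more likely a matter of sample size than of a failed sort.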

Ella Bowles replied 2 years ago

I think Jules Gagnon might have figured it out. More on this later if needed. Thank you, though.