Ticket 16892

Summary: issue finding stdout for slurm job id
Product: Slurm
Reporter: Mike Farias <mfarias>
Component: slurmctld
Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: felip.moll
Version: 23.02.0
Hardware: Linux
OS: Linux
Site: Arcus Bio
Version Fixed: 23.0.2

Description Mike Farias 2023-06-05 15:10:42 MDT
I can't seem to find any info on why this job stopped after one minute of execution (it was cancelled).

[root@ab-rnd-slurm-headnode-prod-01 mfarias]# sacct -j 385606 --format=JobID,Start,End,Elapsed,NCPUS
JobID                      Start                 End    Elapsed      NCPUS 
------------ ------------------- ------------------- ---------- ---------- 
385606       2023-06-05T14:12:04 2023-06-05T14:13:01   00:00:57          3 
385606.batch 2023-06-05T14:12:04 2023-06-05T14:13:02   00:00:58          3 
[root@ab-rnd-slurm-headnode-prod-01 mfarias]# 
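(One cheap thing to rule out at this point is the job's time limit; a walltime kill would normally show as TIMEOUT rather than CANCELLED, but Timelimit is a standard sacct field and easy to check, e.g.:

sacct -j 385606 --format=JobID,Timelimit,Elapsed,State
)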



[root@ab-rnd-slurm-headnode-prod-01 mfarias]# sacct --jobs 385606
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
385606       nf-NFCORE+      debug bioinform+          3 CANCELLED+      0:0 
385606.batch      batch            bioinform+          3     FAILED     15:0 
[root@ab-rnd-slurm-headnode-prod-01 mfarias]# 
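(The State column above is truncated to "CANCELLED+"; widening it shows which UID issued the cancellation, which is often the fastest way to tell whether a user, an admin, or the system killed the job:

sacct -j 385606 --format=JobID,State%30,ExitCode,DerivedExitCode

A state like "CANCELLED by <uid>" points to an scancel or API call from that UID. Note that the batch step's "15:0" is an exit code of 15, not a signal; a wrapper script that traps SIGTERM and then exits can produce this.)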


[root@ab-rnd-slurm-headnode-prod-01 work]# sacct -j 385606 -o workdir%-100
WorkDir                                                                                              
---------------------------------------------------------------------------------------------------- 
/mnt/scratch/nextflow/work/43/764ae49519cf36028be23f248dd3d3                                         
                                                                                                     
[root@ab-rnd-slurm-headnode-prod-01 work]# 
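(For the actual stdout/stderr paths Slurm was given at submit time, scontrol can report them, though only while the job record is still in slurmctld's memory, i.e. within MinJobAge of completion; after that it returns an invalid-job-id error:

scontrol show job 385606 | grep -E 'StdOut|StdErr|WorkDir'
)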

But when I navigate to that directory, I don't see any history for job ID 385606 in .command.err, etc.

[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# ls -al
total 25687760
drwxrwxr-x   2 slurm slurm        4096 Jun  5 15:13 .
drwxrwxr-x 453 slurm slurm       24576 Jun  5 14:26 ..
-rw-rw-r--   1 slurm slurm  7572501294 Jun  5 14:39 902001-001_Normal.recal.bam
-rw-r--r--   1 root  root      2829840 Jun  5 14:41 902001-001_Normal.recal.bam.bai
lrwxrwxrwx   1 root  root           89 Jun  5 14:28 902001-001_Normal.recal.cram -> /mnt/scratch/nextflow/work/20/eebf8b33536a35fa490a27f34096e6/902001-001_Normal.recal.cram
lrwxrwxrwx   1 root  root           94 Jun  5 14:28 902001-001_Normal.recal.cram.crai -> /mnt/scratch/nextflow/work/1d/cc55806cda7650e5aed64f327a445c/902001-001_Normal.recal.cram.crai
-rw-r--r--   1 root  root  18725871230 Jun  5 15:08 902001-001_Tumor.recal.bam
-rw-r--r--   1 root  root      2987416 Jun  5 15:13 902001-001_Tumor.recal.bam.bai
lrwxrwxrwx   1 root  root           88 Jun  5 14:28 902001-001_Tumor.recal.cram -> /mnt/scratch/nextflow/work/9e/3f523ac0ca392d3bf09ff2b07df760/902001-001_Tumor.recal.cram
lrwxrwxrwx   1 root  root           93 Jun  5 14:28 902001-001_Tumor.recal.cram.crai -> /mnt/scratch/nextflow/work/57/5a292cf3f85c06b7f216e92eaed9b2/902001-001_Tumor.recal.cram.crai
-rw-rw-r--   1 slurm slurm           0 Jun  5 14:28 .command.begin
-rw-rw-r--   1 slurm slurm         913 Jun  5 15:13 .command.err
-rw-rw-r--   1 slurm slurm         128 Jun  5 14:05 .command.log
-rw-rw-r--   1 slurm slurm         550 Jun  5 15:58 .command.out
-rw-rw-r--   1 slurm slurm       13013 Jun  5 14:04 .command.run
-rw-rw-r--   1 slurm slurm        1812 Jun  5 14:04 .command.sh
-rw-rw-r--   1 slurm slurm           0 Jun  5 14:28 .command.trace
-rw-rw-r--   1 slurm slurm           3 Jun  5 14:05 .exitcode
lrwxrwxrwx   1 root  root          120 Jun  5 14:28 generic_loci.dat -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/6e/f855f8f7342648ba7ffd9a111ffb26/generic_loci.dat
lrwxrwxrwx   1 root  root          133 Jun  5 14:28 Homo_sapiens_assembly38.fasta -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/8c/25007458bc5671f060a87c34640026/Homo_sapiens_assembly38.fasta
lrwxrwxrwx   1 root  root          137 Jun  5 14:28 Homo_sapiens_assembly38.fasta.fai -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/8d/82390e3907253bab3943022f822697/Homo_sapiens_assembly38.fasta.fai
lrwxrwxrwx   1 root  root          134 Jun  5 14:28 microsatellite_hg38_mantis.bed -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/bf/be62ee0ae9f0918f91ed836936e7cd/microsatellite_hg38_mantis.bed
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# cat command.err
cat: command.err: No such file or directory
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# cat .command.err

real	10m33.595s
user	21m1.035s
sys	0m7.246s

real	26m59.223s
user	48m51.329s
sys	0m33.823s
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:150: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if (offset is 0) or read.seq[0:offset] == locus.kmer[offset:]:
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:475: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if self.debug_output and (n % 10000 is 0):
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:531: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if qsize is 0:
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:614: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if loop_counter % proc_check_interval is 0:
/usr/src/MANTIS-1.0.5/structures.py:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if length is 0:
/usr/src/MANTIS-1.0.5/structures.py:101: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if token is 'S':
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]#
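(Everything in .command.err above, the time(1) summaries and the Python SyntaxWarnings, is the application's own stderr, not Slurm output. Assuming this is a Nextflow work directory, the task's recorded exit status should also be sitting in the .exitcode bookkeeping file listed above:

cat .exitcode
)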
Comment 1 Jason Booth 2023-06-05 15:38:52 MDT
Please attach the slurmctld.log and the slurmd.log from the compute node in question.
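(If the log locations are not known, slurm.conf reports them; the grep patterns below are the usual way to narrow the logs to one job, with the paths here as placeholders:

scontrol show config | grep -i logfile
grep 'JobId=385606' /var/log/slurm/slurmctld.log
grep '385606' /var/log/slurm/slurmd.log
)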
Comment 2 Felip Moll 2023-06-07 04:59:37 MDT
Mike,

In addition to Jason's request, can you tell us how this job was launched, and upload the batch script, if there is one?

The output you show corresponds to something the user application printed directly; it is probably not writing to stdout.

Thanks.
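(One way to answer the "how was it launched" question from the work directory itself: if Nextflow submitted the job, the sbatch directives it used, including where stdout was redirected (typically .command.log), should appear as #SBATCH header lines at the top of the wrapper script. A quick check, using the path from the sacct output above:

grep -i '#SBATCH' /mnt/scratch/nextflow/work/43/764ae49519cf36028be23f248dd3d3/.command.run
)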
Comment 3 Mike Farias 2023-06-22 11:22:06 MDT
Please close; we've solved this on our side.  Many thanks! R/Mike
Comment 4 Felip Moll 2023-06-25 00:08:00 MDT
Ok Mike, marking as infogiven.