Ticket 16892 - issue finding stdout for slurm job id
Summary: issue finding stdout for slurm job id
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 23.02.0
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2023-06-05 15:10 MDT by Mike Farias
Modified: 2023-06-25 00:08 MDT

See Also:
Site: Arcus Bio
Version Fixed: 23.0.2


Description Mike Farias 2023-06-05 15:10:42 MDT
I can't seem to find any info on why this job stopped after about one minute of execution (it was cancelled).

[root@ab-rnd-slurm-headnode-prod-01 mfarias]# sacct -j 385606 --format=JobID,Start,End,Elapsed,NCPUS
JobID                      Start                 End    Elapsed      NCPUS 
------------ ------------------- ------------------- ---------- ---------- 
385606       2023-06-05T14:12:04 2023-06-05T14:13:01   00:00:57          3 
385606.batch 2023-06-05T14:12:04 2023-06-05T14:13:02   00:00:58          3 
[root@ab-rnd-slurm-headnode-prod-01 mfarias]# 



[root@ab-rnd-slurm-headnode-prod-01 mfarias]# sacct --jobs 385606
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
385606       nf-NFCORE+      debug bioinform+          3 CANCELLED+      0:0 
385606.batch      batch            bioinform+          3     FAILED     15:0 
[root@ab-rnd-slurm-headnode-prod-01 mfarias]# 


[root@ab-rnd-slurm-headnode-prod-01 work]# sacct -j 385606 -o workdir%-100
WorkDir                                                                                              
---------------------------------------------------------------------------------------------------- 
/mnt/scratch/nextflow/work/43/764ae49519cf36028be23f248dd3d3                                         
                                                                                                     
[root@ab-rnd-slurm-headnode-prod-01 work]# 
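For a job record still held in slurmctld memory (running or only recently finished), the stdout/stderr paths Slurm itself used can be queried directly; a sketch using the job ID from this ticket:

```shell
# Ask slurmctld for the paths it captured for this job. Only works while
# the job record is still in memory; for older jobs the accounting DB
# (sacct) does not store the stdout/stderr paths.
scontrol show job 385606 | grep -Ei 'StdOut|StdErr|WorkDir'
```

If `scontrol` reports "Invalid job id specified", the record has already aged out and the paths must be inferred from how the job was submitted.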

But when I navigate to that directory, I don't see any history for job ID 385606 in .command.err, etc.

[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# ls -al
total 25687760
drwxrwxr-x   2 slurm slurm        4096 Jun  5 15:13 .
drwxrwxr-x 453 slurm slurm       24576 Jun  5 14:26 ..
-rw-rw-r--   1 slurm slurm  7572501294 Jun  5 14:39 902001-001_Normal.recal.bam
-rw-r--r--   1 root  root      2829840 Jun  5 14:41 902001-001_Normal.recal.bam.bai
lrwxrwxrwx   1 root  root           89 Jun  5 14:28 902001-001_Normal.recal.cram -> /mnt/scratch/nextflow/work/20/eebf8b33536a35fa490a27f34096e6/902001-001_Normal.recal.cram
lrwxrwxrwx   1 root  root           94 Jun  5 14:28 902001-001_Normal.recal.cram.crai -> /mnt/scratch/nextflow/work/1d/cc55806cda7650e5aed64f327a445c/902001-001_Normal.recal.cram.crai
-rw-r--r--   1 root  root  18725871230 Jun  5 15:08 902001-001_Tumor.recal.bam
-rw-r--r--   1 root  root      2987416 Jun  5 15:13 902001-001_Tumor.recal.bam.bai
lrwxrwxrwx   1 root  root           88 Jun  5 14:28 902001-001_Tumor.recal.cram -> /mnt/scratch/nextflow/work/9e/3f523ac0ca392d3bf09ff2b07df760/902001-001_Tumor.recal.cram
lrwxrwxrwx   1 root  root           93 Jun  5 14:28 902001-001_Tumor.recal.cram.crai -> /mnt/scratch/nextflow/work/57/5a292cf3f85c06b7f216e92eaed9b2/902001-001_Tumor.recal.cram.crai
-rw-rw-r--   1 slurm slurm           0 Jun  5 14:28 .command.begin
-rw-rw-r--   1 slurm slurm         913 Jun  5 15:13 .command.err
-rw-rw-r--   1 slurm slurm         128 Jun  5 14:05 .command.log
-rw-rw-r--   1 slurm slurm         550 Jun  5 15:58 .command.out
-rw-rw-r--   1 slurm slurm       13013 Jun  5 14:04 .command.run
-rw-rw-r--   1 slurm slurm        1812 Jun  5 14:04 .command.sh
-rw-rw-r--   1 slurm slurm           0 Jun  5 14:28 .command.trace
-rw-rw-r--   1 slurm slurm           3 Jun  5 14:05 .exitcode
lrwxrwxrwx   1 root  root          120 Jun  5 14:28 generic_loci.dat -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/6e/f855f8f7342648ba7ffd9a111ffb26/generic_loci.dat
lrwxrwxrwx   1 root  root          133 Jun  5 14:28 Homo_sapiens_assembly38.fasta -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/8c/25007458bc5671f060a87c34640026/Homo_sapiens_assembly38.fasta
lrwxrwxrwx   1 root  root          137 Jun  5 14:28 Homo_sapiens_assembly38.fasta.fai -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/8d/82390e3907253bab3943022f822697/Homo_sapiens_assembly38.fasta.fai
lrwxrwxrwx   1 root  root          134 Jun  5 14:28 microsatellite_hg38_mantis.bed -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/bf/be62ee0ae9f0918f91ed836936e7cd/microsatellite_hg38_mantis.bed
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# cat command.err
cat: command.err: No such file or directory
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# cat .command.err

real	10m33.595s
user	21m1.035s
sys	0m7.246s

real	26m59.223s
user	48m51.329s
sys	0m33.823s
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:150: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if (offset is 0) or read.seq[0:offset] == locus.kmer[offset:]:
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:475: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if self.debug_output and (n % 10000 is 0):
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:531: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if qsize is 0:
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:614: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if loop_counter % proc_check_interval is 0:
/usr/src/MANTIS-1.0.5/structures.py:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if length is 0:
/usr/src/MANTIS-1.0.5/structures.py:101: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if token is 'S':
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]#
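As an aside, the SyntaxWarnings in .command.err above are emitted by Python 3.8+ because the MANTIS code compares against literals with `is` (object identity) where `==` (value equality) is intended. A minimal sketch of why that distinction matters:

```python
# "is" tests object identity; "==" tests value equality. CPython happens
# to cache small integers (-5..256), so code like "x is 0" can appear to
# work and then silently fail for values outside the cache.
x = int("1000")   # built at runtime: a fresh int object
y = int("1000")   # another fresh object with the same value
print(x == y)     # True  - same value
print(x is y)     # False - different objects
```

These warnings come from the MANTIS application itself, not from Slurm, and are unrelated to the missing stdout.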
Comment 1 Jason Booth 2023-06-05 15:38:52 MDT
Please attach the slurmctld.log and the slurmd.log from the compute node in question.
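The log locations vary by site; assuming default-style paths (these paths are assumptions, not taken from the ticket), the relevant lines can be pulled with something like:

```shell
# On the head node: slurmctld's record of why the job was cancelled.
# The actual path is set by SlurmctldLogFile in slurm.conf.
grep 385606 /var/log/slurm/slurmctld.log

# On the compute node that ran the job: slurmd's view of the batch step.
# The actual path is set by SlurmdLogFile in slurm.conf.
grep 385606 /var/log/slurm/slurmd.log
```

`scontrol show config | grep -i logfile` prints the configured log paths if they differ from the defaults.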
Comment 2 Felip Moll 2023-06-07 04:59:37 MDT
Mike,

In addition to Jason's request, can you tell us how this job was launched, and upload the batch script, if any?

The output you show corresponds to something the user application wrote directly; the job is probably not writing to the stdout that Slurm captures.

Thanks.
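When a pipeline engine such as Nextflow generates the batch scripts, Slurm's captured stdout goes wherever the wrapper script redirects it. For jobs submitted by hand, explicit output options make the files easy to find; a sketch of a batch script (job name, resource values, and script name are illustrative, not from the ticket):

```shell
#!/bin/bash
#SBATCH --job-name=nf-task
#SBATCH --partition=debug
#SBATCH --cpus-per-task=3
# Write Slurm-captured stdout/stderr to known per-job files
# (%x = job name, %j = job ID). Without --output, sbatch defaults
# to slurm-%j.out in the submission directory.
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

srun ./run_task.sh
```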
Comment 3 Mike Farias 2023-06-22 11:22:06 MDT
Please close; we've solved this on our side.  Many thanks! R/Mike
Comment 4 Felip Moll 2023-06-25 00:08:00 MDT
Ok Mike, marking as infogiven.