| Summary: | issue finding stdout for slurm job id | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Mike Farias <mfarias> |
| Component: | slurmctld | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | felip.moll |
| Version: | 23.02.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Arcus Bio | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.0.2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Please attach the slurmctld.log and the slurmd.log from the compute node in question.

Mike, in addition to Jason's request, can you tell us how this job was launched, and upload the batch script, if any? The output you show corresponds to something the user application did directly; it is probably not writing to stdout. Thanks.

Please close; we've solved this on our side. Many thanks! R/Mike

Ok Mike, marking as infogiven.
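For context on the stdout question: where a batch job's output lands depends on how it was submitted. Below is a minimal sketch of the relevant sbatch options; the paths and script name are illustrative, not the actual submission used here.

```bash
# Without -o/-e, sbatch writes both stdout and stderr to slurm-%j.out
# in the job's working directory. With explicit options, %x expands to
# the job name and %j to the job ID. (Illustrative paths/script name.)
sbatch --output=/mnt/scratch/logs/%x-%j.out \
       --error=/mnt/scratch/logs/%x-%j.err \
       job_script.sh

# While the job is still known to slurmctld, the resolved paths can be checked:
scontrol show job 385606 | grep -E 'StdOut|StdErr'
```

Given the job name (nf-NFCORE+) and the .command.run/.command.out/.command.err files in the work directory, the job was presumably submitted by Nextflow, whose wrapper script typically directs output into those work-directory files rather than to a default slurm-%j.out, which fits the observation above.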
I can't seem to find any info on why this job stopped after one minute of execution (it was cancelled):

```
[root@ab-rnd-slurm-headnode-prod-01 mfarias]# sacct -j 385606 --format=JobID,Start,End,Elapsed,NCPUS
JobID                      Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
385606       2023-06-05T14:12:04 2023-06-05T14:13:01   00:00:57          3
385606.batch 2023-06-05T14:12:04 2023-06-05T14:13:02   00:00:58          3
[root@ab-rnd-slurm-headnode-prod-01 mfarias]#
[root@ab-rnd-slurm-headnode-prod-01 mfarias]# sacct --jobs 385606
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
385606       nf-NFCORE+      debug bioinform+          3 CANCELLED+      0:0
385606.batch      batch            bioinform+          3     FAILED     15:0
[root@ab-rnd-slurm-headnode-prod-01 mfarias]#
[root@ab-rnd-slurm-headnode-prod-01 work]# sacct -j 385606 -o workdir%-100
WorkDir
----------------------------------------------------------------------------------------------------
/mnt/scratch/nextflow/work/43/764ae49519cf36028be23f248dd3d3
[root@ab-rnd-slurm-headnode-prod-01 work]#
```

but when I navigate to that dir, I don't see any history for job ID 385606 in .command.err etc...

```
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# ls -al
total 25687760
drwxrwxr-x   2 slurm slurm        4096 Jun  5 15:13 .
drwxrwxr-x 453 slurm slurm       24576 Jun  5 14:26 ..
-rw-rw-r--   1 slurm slurm  7572501294 Jun  5 14:39 902001-001_Normal.recal.bam
-rw-r--r--   1 root  root      2829840 Jun  5 14:41 902001-001_Normal.recal.bam.bai
lrwxrwxrwx   1 root  root           89 Jun  5 14:28 902001-001_Normal.recal.cram -> /mnt/scratch/nextflow/work/20/eebf8b33536a35fa490a27f34096e6/902001-001_Normal.recal.cram
lrwxrwxrwx   1 root  root           94 Jun  5 14:28 902001-001_Normal.recal.cram.crai -> /mnt/scratch/nextflow/work/1d/cc55806cda7650e5aed64f327a445c/902001-001_Normal.recal.cram.crai
-rw-r--r--   1 root  root  18725871230 Jun  5 15:08 902001-001_Tumor.recal.bam
-rw-r--r--   1 root  root      2987416 Jun  5 15:13 902001-001_Tumor.recal.bam.bai
lrwxrwxrwx   1 root  root           88 Jun  5 14:28 902001-001_Tumor.recal.cram -> /mnt/scratch/nextflow/work/9e/3f523ac0ca392d3bf09ff2b07df760/902001-001_Tumor.recal.cram
lrwxrwxrwx   1 root  root           93 Jun  5 14:28 902001-001_Tumor.recal.cram.crai -> /mnt/scratch/nextflow/work/57/5a292cf3f85c06b7f216e92eaed9b2/902001-001_Tumor.recal.cram.crai
-rw-rw-r--   1 slurm slurm           0 Jun  5 14:28 .command.begin
-rw-rw-r--   1 slurm slurm         913 Jun  5 15:13 .command.err
-rw-rw-r--   1 slurm slurm         128 Jun  5 14:05 .command.log
-rw-rw-r--   1 slurm slurm         550 Jun  5 15:58 .command.out
-rw-rw-r--   1 slurm slurm       13013 Jun  5 14:04 .command.run
-rw-rw-r--   1 slurm slurm        1812 Jun  5 14:04 .command.sh
-rw-rw-r--   1 slurm slurm           0 Jun  5 14:28 .command.trace
-rw-rw-r--   1 slurm slurm           3 Jun  5 14:05 .exitcode
lrwxrwxrwx   1 root  root          120 Jun  5 14:28 generic_loci.dat -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/6e/f855f8f7342648ba7ffd9a111ffb26/generic_loci.dat
lrwxrwxrwx   1 root  root          133 Jun  5 14:28 Homo_sapiens_assembly38.fasta -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/8c/25007458bc5671f060a87c34640026/Homo_sapiens_assembly38.fasta
lrwxrwxrwx   1 root  root          137 Jun  5 14:28 Homo_sapiens_assembly38.fasta.fai -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/8d/82390e3907253bab3943022f822697/Homo_sapiens_assembly38.fasta.fai
lrwxrwxrwx   1 root  root          134 Jun  5 14:28 microsatellite_hg38_mantis.bed -> /mnt/scratch/nextflow/work/stage-9b81d142-a77d-4974-b36e-6e1645a0e963/bf/be62ee0ae9f0918f91ed836936e7cd/microsatellite_hg38_mantis.bed
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# cat command.err
cat: command.err: No such file or directory
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]# cat .command.err

real    10m33.595s
user    21m1.035s
sys     0m7.246s

real    26m59.223s
user    48m51.329s
sys     0m33.823s
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:150: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if (offset is 0) or read.seq[0:offset] == locus.kmer[offset:]:
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:475: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if self.debug_output and (n % 10000 is 0):
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:531: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if qsize is 0:
/usr/src/MANTIS-1.0.5/kmer_repeat_counter.py:614: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if loop_counter % proc_check_interval is 0:
/usr/src/MANTIS-1.0.5/structures.py:80: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if length is 0:
/usr/src/MANTIS-1.0.5/structures.py:101: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if token is 'S':
[root@ab-rnd-slurm-headnode-prod-01 764ae49519cf36028be23f248dd3d3]#
```
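On the "why was it cancelled" question, the truncated `CANCELLED+` state can usually be expanded to show who (or what) cancelled the job, and the cancel request is normally recorded by the controller. A minimal sketch; the log paths below are illustrative and depend on SlurmctldLogFile/SlurmdLogFile in slurm.conf.

```bash
# Widen the State column so "CANCELLED by <uid>" is not truncated,
# and include the derived exit code for the whole job.
sacct -j 385606 -o JobID,JobName%20,State%30,ExitCode,DerivedExitCode,Elapsed

# Search the controller log for the job (path depends on SlurmctldLogFile).
grep 'JobId=385606' /var/log/slurm/slurmctld.log

# And the slurmd log on the compute node that ran the job (path depends on SlurmdLogFile).
grep 385606 /var/log/slurm/slurmd.log
```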