Currently, if a user submits a job whose stdout or stderr directory does not exist, or on which the user lacks permissions, the job fails with reason "None". Further information is only available by parsing slurmd logs on the node where the job was scheduled. This is a confusing experience for users and often leads to a lot of wasted time for admins and inexperienced users debugging why a job fails and why there are no log files. It would be helpful to include a descriptive reason for the job failure in the accounting, so that users can quickly identify the issue using `sacct`.

Missing directory repro:

1. Submit a job with -o and/or -e pointing to directories that do not exist:

sbatch -A general -p all.q -c 2 --mem 1G -o /path/does/not/exist/out/out.log -e /path/does/not/exist/err/err.log --wrap "echo hello"

2. The output of sacct indicates the reason for failure is "None":

sacct -j 5871462 -o JobID,SubmitLine,Node,State,Reason,ExitCode,DerivedExitCode -p
JobID|SubmitLine|NodeList|State|Reason|ExitCode|DerivedExitCode|
5871462|sbatch -A general -p all.q -c 2 --mem 1G -o /path/does/not/exist/out/out.log -e /path/does/not/exist/err/err.log --wrap echo hello|gen-slurm-sexec-d01|FAILED|None|1:0|0:0|
5871462.batch||gen-slurm-sexec-d01|FAILED||1:0||
5871462.extern||gen-slurm-sexec-d01|COMPLETED||0:0||

3. slurmd log on the node where the job ran:

grep 5871462 /var/log/slurm/slurmd.log
[2022-04-06T17:22:30.189] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 5871462
[2022-04-06T17:22:30.189] task/affinity: batch_bind: job 5871462 CPU input mask for node: 0x3
[2022-04-06T17:22:30.189] task/affinity: batch_bind: job 5871462 CPU final HW mask for node: 0x3
[2022-04-06T17:22:30.227] [5871462.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.227] [5871462.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.233] Launching batch job 5871462 for UID 701322
[2022-04-06T17:22:30.248] [5871462.batch] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.248] [5871462.batch] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.251] [5871462.batch] error: Could not open stdout file /path/does/not/exist/out/out.log: No such file or directory
[2022-04-06T17:22:30.251] [5871462.batch] error: IO setup failed: No such file or directory
[2022-04-06T17:22:30.252] [5871462.batch] error: called without a previous init. This shouldn't happen!
[2022-04-06T17:22:30.252] [5871462.batch] error: called without a previous init. This shouldn't happen!
[2022-04-06T17:22:30.252] [5871462.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2022-04-06T17:22:30.254] [5871462.batch] done with job
[2022-04-06T17:22:30.259] [5871462.extern] done with job

Permission denied repro:

1. Submit a job with -o and/or -e pointing to a directory the user cannot write to:

sbatch -A general -p all.q -c 2 --mem 1G -o /root/out.log -e /root/err.log --wrap "echo hello"

2. sacct again reports reason "None":

sacct -j 5871478 -o JobID,SubmitLine,Node,State,Reason,ExitCode,DerivedExitCode -p
JobID|SubmitLine|NodeList|State|Reason|ExitCode|DerivedExitCode|
5871478|sbatch -A general -p all.q -c 2 --mem 1G -o /root/out.log -e /root/err.log --wrap echo hello|gen-slurm-sexec-d01|FAILED|None|1:0|0:0|
5871478.batch||gen-slurm-sexec-d01|FAILED||1:0||
5871478.extern||gen-slurm-sexec-d01|COMPLETED||0:0||

3. slurmd log on the node where the job ran:
grep 5871478 /var/log/slurm/slurmd.log
[2022-04-13T11:59:33.908] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 5871478
[2022-04-13T11:59:33.908] task/affinity: batch_bind: job 5871478 CPU input mask for node: 0x3
[2022-04-13T11:59:33.908] task/affinity: batch_bind: job 5871478 CPU final HW mask for node: 0x3
[2022-04-13T11:59:33.939] [5871478.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.939] [5871478.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.944] Launching batch job 5871478 for UID 701322
[2022-04-13T11:59:33.961] [5871478.batch] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.961] [5871478.batch] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.965] [5871478.batch] error: Could not open stdout file /root/out.log: Permission denied
[2022-04-13T11:59:33.965] [5871478.batch] error: IO setup failed: Permission denied
[2022-04-13T11:59:33.965] [5871478.batch] error: called without a previous init. This shouldn't happen!
[2022-04-13T11:59:33.965] [5871478.batch] error: called without a previous init. This shouldn't happen!
[2022-04-13T11:59:33.965] [5871478.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2022-04-13T11:59:33.968] [5871478.batch] done with job
[2022-04-13T11:59:33.973] [5871478.extern] done with job

Similar issue: https://bugs.schedmd.com/show_bug.cgi?id=6034
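In the meantime, our workaround is to follow the NodeList reported by sacct to that node's slurmd log by hand. A minimal helper sketch (assumptions: single-node batch jobs, SSH access to the compute node, and the default slurmd log path /var/log/slurm/slurmd.log; the script name is made up):

#!/bin/bash
# why_did_my_job_fail.sh <jobid>  (hypothetical helper)
# Look up the node the job ran on via sacct, then grep that node's slurmd log
# for error lines mentioning the job, e.g. "Could not open stdout file ...".
jobid="$1"
# -X: allocation line only, -n: no header, -P: parsable output.
node=$(sacct -j "$jobid" -X -n -P -o NodeList)
echo "Job $jobid ran on: $node"
# Assumes SSH access to the node and the default slurmd log location.
ssh "$node" "grep $jobid /var/log/slurm/slurmd.log" | grep -i 'error:'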
Hi

Sorry that I didn't respond earlier. Unfortunately, this isn't easy to solve, and I am still looking for the best solution to this issue. Commit cfad2383bcc slightly changes this behavior: instead of 1:0, the ExitCode is now 0:53. Signal 53 corresponds to the real-time signals and should be unique. I will let you know when I find the right solution, but I am afraid we have no time to include this in 22.05.

Dominik
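Thanks for the update. With that commit applied, a rough way to spot jobs that hit this path from the accounting side would be to filter on the new exit code; a sketch (the 0:53 value comes from your comment above and may of course change before a final fix):

# List jobs from the last day whose ExitCode is 0:53 (signal 53, per commit cfad2383bcc).
# -X: allocation lines only, -n: no header, -P: '|'-separated parsable output.
sacct -X -n -P -S now-1days -o JobID,JobName,State,ExitCode |
  awk -F'|' '$4 == "0:53" {print $1, $2, $3}'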
Hi

In 23.02 we added the ability to create directories for stdout/stderr output files automatically. Unfortunately, 23.02 still does not add any easy, user-visible way to check whether a job failed because its stdout/stderr files could not be opened. Could we drop the severity of this issue to enhancement?

Dominik
Thanks - the automatic creation of stdout/stderr directories should help significantly. I think dropping this to enhancement level is fine.
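For anyone else landing here: under 23.02 the original repro should behave roughly like this (a sketch, assuming the new directory creation applies to sbatch's -o/-e paths by default on your setup):

# With 23.02+, the missing directories for -o/-e should be created for the job,
# so this command now produces a log file instead of a FAILED job with reason "None":
sbatch -A general -p all.q -c 2 --mem 1G \
  -o /path/does/not/exist/out/out.log \
  -e /path/does/not/exist/err/err.log \
  --wrap "echo hello"
ls -l /path/does/not/exist/out/out.log   # present once the job has run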