Ticket 13835 - Improve job failure reason on permissions error or missing directory
Summary: Improve job failure reason on permissions error or missing directory
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 21.08.6
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-04-13 13:07 MDT by Ali Nikkhah
Modified: 2023-03-15 06:12 MDT
CC List: 2 users

See Also:
Site: U WA Health Metrics
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: Ubuntu
Machine Name:
CLE Version:
Version Fixed:
Target Release: 23.11
DevPrio: ---
Emory-Cloud Sites: ---


Description Ali Nikkhah 2022-04-13 13:07:17 MDT
Currently, if a user submits a job whose error or output directory does not exist, or on which the user lacks permissions, the job fails with reason "None". Further information is only available by parsing the slurmd logs on the node where the job was scheduled. This is a confusing experience for users and often wastes a lot of time for admins and inexperienced users trying to work out why a job failed and why there are no log files.

It would be helpful to include a descriptive failure reason in the accounting records, so users can quickly identify the issue using `sacct`.
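
As a stopgap on the user side, a wrapper could validate the -o/-e directories before submission. A minimal sketch, assuming the separate "-o PATH" / "-e PATH" argument form; the wrapper name and its parsing are hypothetical and not part of Slurm:

#!/bin/bash
# submit-checked.sh (hypothetical): check the directories behind -o/-e, then hand off to sbatch.
# Only the separate "-o PATH" / "-e PATH" argument form is parsed; everything else passes through.
for flag in -o -e; do
    # Grab the argument following the flag, if the flag is present at all (naive parsing).
    path=$(printf '%s\n' "$@" | grep -x -A1 -- "$flag" | tail -n1)
    if [ -z "$path" ] || [ "$path" = "$flag" ]; then
        continue
    fi
    dir=$(dirname "$path")
    if [ ! -d "$dir" ]; then
        echo "error: directory for $flag does not exist: $dir" >&2
        exit 1
    elif [ ! -w "$dir" ]; then
        echo "error: no write permission on directory for $flag: $dir" >&2
        exit 1
    fi
done
exec sbatch "$@"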

Missing directory repro:

1. Submit a job with -o and/or -e pointing at directories that do not exist:

sbatch -A general -p all.q -c 2 --mem 1G -o /path/does/not/exist/out/out.log -e /path/does/not/exist/err/err.log --wrap "echo hello"

2. The sacct output shows the failure reason as "None":

sacct -j 5871462 -o JobID,SubmitLine,Node,State,Reason,ExitCode,DerivedExitCode -p
JobID|SubmitLine|NodeList|State|Reason|ExitCode|DerivedExitCode|
5871462|sbatch -A general -p all.q -c 2 --mem 1G -o /path/does/not/exist/out/out.log -e /path/does/not/exist/err/err.log --wrap echo hello|gen-slurm-sexec-d01|FAILED|None|1:0|0:0|
5871462.batch||gen-slurm-sexec-d01|FAILED||1:0||
5871462.extern||gen-slurm-sexec-d01|COMPLETED||0:0||

3. The slurmd log on the node where the job ran (a sketch for doing this correlation by hand follows the log below):

grep 5871462 /var/log/slurm/slurmd.log
[2022-04-06T17:22:30.189] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 5871462
[2022-04-06T17:22:30.189] task/affinity: batch_bind: job 5871462 CPU input mask for node: 0x3
[2022-04-06T17:22:30.189] task/affinity: batch_bind: job 5871462 CPU final HW mask for node: 0x3
[2022-04-06T17:22:30.227] [5871462.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.227] [5871462.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.233] Launching batch job 5871462 for UID 701322
[2022-04-06T17:22:30.248] [5871462.batch] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.248] [5871462.batch] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-06T17:22:30.251] [5871462.batch] error: Could not open stdout file /path/does/not/exist/out/out.log: No such file or directory
[2022-04-06T17:22:30.251] [5871462.batch] error: IO setup failed: No such file or directory
[2022-04-06T17:22:30.252] [5871462.batch] error: called without a previous init. This shouldn't happen!
[2022-04-06T17:22:30.252] [5871462.batch] error: called without a previous init. This shouldn't happen!
[2022-04-06T17:22:30.252] [5871462.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2022-04-06T17:22:30.254] [5871462.batch] done with job
[2022-04-06T17:22:30.259] [5871462.extern] done with job
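
Until a descriptive reason is recorded in accounting, the failure has to be correlated by hand, roughly as follows (a sketch; the log path, the grep pattern, and passwordless ssh to compute nodes are site-specific assumptions):

# 1. Look up the node the failed job ran on (same fields as the sacct output above).
node=$(sacct -j 5871462 -X -n -o NodeList | tr -d ' ')

# 2. Grep that node's slurmd log for the job ID; the actual failure reason only appears here.
ssh "$node" "grep 5871462 /var/log/slurm/slurmd.log | grep -i 'Could not open'"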


Permission denied repro:

1. sbatch -A general -p all.q -c 2 --mem 1G -o /root/out.log -e /root/err.log --wrap "echo hello"

2. sacct -j 5871478 -o JobID,SubmitLine,Node,State,Reason,ExitCode,DerivedExitCode -p

JobID|SubmitLine|NodeList|State|Reason|ExitCode|DerivedExitCode|
5871478|sbatch -A general -p all.q -c 2 --mem 1G -o /root/out.log -e /root/err.log --wrap echo hello|gen-slurm-sexec-d01|FAILED|None|1:0|0:0|
5871478.batch||gen-slurm-sexec-d01|FAILED||1:0||
5871478.extern||gen-slurm-sexec-d01|COMPLETED||0:0||

3. grep 5871478 /var/log/slurm/slurmd.log
[2022-04-13T11:59:33.908] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 5871478
[2022-04-13T11:59:33.908] task/affinity: batch_bind: job 5871478 CPU input mask for node: 0x3
[2022-04-13T11:59:33.908] task/affinity: batch_bind: job 5871478 CPU final HW mask for node: 0x3
[2022-04-13T11:59:33.939] [5871478.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.939] [5871478.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.944] Launching batch job 5871478 for UID 701322
[2022-04-13T11:59:33.961] [5871478.batch] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.961] [5871478.batch] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=1024MB
[2022-04-13T11:59:33.965] [5871478.batch] error: Could not open stdout file /root/out.log: Permission denied
[2022-04-13T11:59:33.965] [5871478.batch] error: IO setup failed: Permission denied
[2022-04-13T11:59:33.965] [5871478.batch] error: called without a previous init. This shouldn't happen!
[2022-04-13T11:59:33.965] [5871478.batch] error: called without a previous init. This shouldn't happen!
[2022-04-13T11:59:33.965] [5871478.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[2022-04-13T11:59:33.968] [5871478.batch] done with job
[2022-04-13T11:59:33.973] [5871478.extern] done with job
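
To confirm the permission problem from a login node before involving an admin, the target directory can be tested as the submitting user (a sketch; the SUBMIT_USER placeholder and sudo access are assumptions, not taken from the ticket):

# SUBMIT_USER is a placeholder for the job owner; /root is the stdout directory from the repro above.
sudo -u "$SUBMIT_USER" test -w /root \
    && echo "writable" \
    || echo "not writable: opening stdout/stderr would fail with Permission denied"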



Similar issue: https://bugs.schedmd.com/show_bug.cgi?id=6034
Comment 3 Dominik Bartkiewicz 2022-05-12 06:53:06 MDT
Hi

Sorry I didn't respond earlier.
Unfortunately, this isn't easy to solve, and I am still looking for the best way to address it.
Commit cfad2383bcc slightly changes this behavior: instead of 1:0, the ExitCode is now reported as 0:53. Signal 53 falls in the real-time signal range and should be unique (a query sketch follows this comment).
I will let you know when I find the right solution, but I am afraid we won't have time to include this in 22.05.

Dominik
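
With that commit in place, jobs that fail during stdout/stderr setup can at least be filtered by exit code. A rough query, assuming the 0:53 value described in the comment above (which depends on the build's real-time signal numbering):

# List failed jobs from the last day whose ExitCode is 0:53, i.e. candidates for
# stdout/stderr open failures after commit cfad2383bcc.
sacct -S now-1days -s FAILED -X -n -o JobID,User,ExitCode -p \
    | awk -F'|' '$3 == "0:53" {print $1, $2}'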
Comment 5 Dominik Bartkiewicz 2023-03-14 07:54:39 MDT
Hi

In 23.02 we added the ability to automatically create missing directories for stdout/stderr output files (an example follows this comment). Unfortunately, 23.02 still does not provide an easy, user-facing way to check whether a job failed because its stdout/stderr files could not be opened.
Could we drop the severity level of this issue to enhancement?

Dominik
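
For reference, re-running the original repro on 23.02 should no longer fail in this way, since the missing directories are created for the job (hedged: this restates the 23.02 behavior described above, and the exact outcome may depend on site configuration):

# Same submission as in the original repro; on 23.02 the .../out/ and .../err/ directories
# should be created automatically instead of the job failing with reason "None".
sbatch -A general -p all.q -c 2 --mem 1G \
    -o /path/does/not/exist/out/out.log \
    -e /path/does/not/exist/err/err.log \
    --wrap "echo hello"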
Comment 6 Ali Nikkhah 2023-03-14 12:31:46 MDT
Thanks - the automatic creation of stdout/stderr directories should help significantly. I think dropping this to enhancement level is fine.