Ticket 14956 - Slurm IO setup failed if log location doesn't exist
Summary: Slurm IO setup failed if log location doesn't exist
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Carlos Tripiana Montes
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-09-13 08:35 MDT by Yann
Modified: 2022-09-15 03:53 MDT
CC: 1 user

See Also:
Site: Université de Genève
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Yann 2022-09-13 08:35:07 MDT
Dear team,

a user from one of our research groups is submitting a job with an sbatch script in which --output and --error point to a location that doesn't exist yet. The sbatch script itself runs a mkdir to create that location, but according to the slurmd log this isn't possible, and the job doesn't run at all.

[2022-09-13T16:06:52.164] Launching batch job 12877483 for UID 20821119
[2022-09-13T16:06:52.175] [12877483.batch] Considering each NUMA node as a socket
[2022-09-13T16:06:52.194] [12877483.batch] task/cgroup: _memcg_initialize: job: alloc=32000MB mem.limit=30400MB memsw.limit=30400MB job_swappiness=18446744073709551614
[2022-09-13T16:06:52.194] [12877483.batch] task/cgroup: _memcg_initialize: step: alloc=32000MB mem.limit=30400MB memsw.limit=30400MB job_swappiness=18446744073709551614
[2022-09-13T16:06:52.223] [12877483.batch] error: Could not open stdout file /srv/beegfs/scratch/users/u/userxx/workdir/pilots/Pilot__genericLight__1663059340560856/logs/pilot.out: No such file or directory
[2022-09-13T16:06:52.224] [12877483.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2022-09-13T16:06:52.231] [12877483.batch] get_exit_code task 0 died by signal: 53
[2022-09-13T16:06:52.234] [12877483.batch] done with job
[2022-09-13T16:06:52.266] [12877483.extern] done with job
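The pattern described above can be sketched as follows (paths and commands are illustrative, not the user's actual script). Run outside of Slurm, the script works; under Slurm, slurmd fails before the script's first line executes:

```shell
#!/bin/bash
# Hypothetical reproduction of the user's sbatch script.
#SBATCH --output=logs/pilot.out   # 'logs/' does not exist at submit time
#SBATCH --error=logs/pilot.err

# Too late: slurmd opens the stdout/stderr files *before* this script
# runs, so the job dies in IO setup and this mkdir never executes.
mkdir -p logs
echo "pilot running"
```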

The user told us they have always done it this way and it used to work fine; can you let us know whether this was ever possible?

By the way, if a user does this by mistake, is there a way for them to figure out their mistake without access to the slurmd log?

Many thanks

Yann
Comment 1 Carlos Tripiana Montes 2022-09-15 03:48:32 MDT
Hi Yann!

> As the user told us they did always like that and it was working fine, can
> you let us know if this was possible one day?

Well, maybe with another cluster manager, but I don't think that was ever the case with Slurm.

Looking at the error "error: Could not open stdout file [...]", what I see is that this message was introduced in a commit back in 2009. Moreover, the job would have failed even in older versions, just without logging this message. In other words, we have never created such paths on the fly; we have always only checked that the path already exists by the time a step is created.

But my bet is that the path used to be created externally by a meta-script (one in charge of automatically sending bursts of jobs).

> By the way, if the user do that by mistake, is there a way for him to
> figure out he did a mistake without accessing slurmd log?

This is the key point. You can submit a job and create the paths it needs externally afterwards, so having sbatch check for those paths would be very restrictive. Worse, such a check in sbatch would simply be wrong, because a job can land on compute nodes that have access to different filesystems/paths than the node from which sbatch was issued (something very common, e.g. login nodes without access to some filesystems).

So, in the end, sbatch cannot check this, and the path has to be either checked or created automatically when the job starts to run. As mentioned, we only check for it and leave the responsibility to the user. And if the check fails, since a batch job is disconnected from the user (it is not tied to a TTY), we have no way to inform them other than writing to some log file.
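Given that the directory must exist before the job launches on the compute node, one workaround is a small submit-side wrapper that creates it just before calling sbatch. A minimal sketch, assuming a shared filesystem and an illustrative log location (the sbatch line is shown commented, since it only makes sense on a cluster):

```shell
#!/bin/bash
# Illustrative submit wrapper: ensure the log directory exists before
# submission, since slurmd will not create it for the job.
LOGDIR="$PWD/slurm-logs"   # assumed location for this example
mkdir -p "$LOGDIR"          # no-op if it already exists
echo "submitting with logs in $LOGDIR"
# sbatch --output="$LOGDIR/%j.out" --error="$LOGDIR/%j.err" job.sh
```

Note this only helps when the submit node sees the same filesystem as the compute nodes, which is exactly the caveat described above.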

I hope this covers your doubts.

Cheers,
Carlos.
Comment 2 Yann 2022-09-15 03:53:23 MDT
Hi Carlos,

many thanks for the detailed answer! Indeed, the user was talking about other resource managers such as Torque; maybe it worked on another cluster and they were confused.

As you said, maybe the directory used to be created externally AND in the sbatch script :)
Well, sometimes our support work is trial and guesswork, as you can see.

I'm closing the ticket; thanks again.

Best

Yann