| Summary: | SPANK: returning error in remote context doesn't fail job | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 17.11.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Stanford | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 17.11.3 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Patch to fix logic flaw | ||
I would expect the job to fail. I'll investigate further. Created attachment 5971 [details]
Patch to fix logic flaw
The spank logic in slurm is bad and has been since inception. The logic is Slurm treats a spank function return value less than zero, but an underlying function was converting the -1 to 1 (you'll quickly see the flaw by looking the patch).
I'll get someone else to review the patch and get it committed in our next release.
(In reply to Moe Jette from comment #2) > Created attachment 5971 [details] > Patch to fix logic flaw > > The spank logic in slurm is bad and has been since inception. The logic is > Slurm treats a spank function return value less than zero, but an underlying > function was converting the -1 to 1 (you'll quickly see the flaw by looking > the patch). > > I'll get someone else to review the patch and get it committed in our next > release. Excellent, thanks for the quick turnaround! Cheers, -- Kilian This patch has been committed to Slurm and will be in our next release, 17.11.3. The commit is here: https://github.com/SchedMD/slurm/commit/c3d9135f014e7c24ac98035542f2ecaf3b77846a (In reply to Moe Jette from comment #9) > This patch has been committed to Slurm and will be in our next release, > 17.11.3. The commit is here: > https://github.com/SchedMD/slurm/commit/ > c3d9135f014e7c24ac98035542f2ecaf3b77846a Thanks Moe! Cheers, -- Kilian |
Hi! This is more a question than anything else, but I noticed that in a SPANK plugin set with "required" in plugstack.conf, if slurm_spank_init_post_opt() returns -1 in the "remote" context, the job still successfully completes. slurmstepd on the execution host logs this: slurmd[42898]: _run_prolog: run job script took usec=88567 slurmd[42898]: _run_prolog: prolog with lock for job 6245468 ran for 1 seconds slurmstepd[46706]: error: spank: required plugin spank_test.so: init_post_opt() failed with rc=-1 slurmstepd[46706]: task/cgroup: /slurm/uid_215845/job_6245468: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB slurmstepd[46706]: task/cgroup: /slurm/uid_215845/job_6245468/step_extern: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB slurmd[42898]: launch task 6245468.0 request from 215845.32264@10.10.6.34 (port 56006) slurmstepd[46731]: error: spank: required plugin spank_test.so: init_post_opt() failed with rc=-1 slurmstepd[46731]: task/cgroup: /slurm/uid_215845/job_6245468: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB slurmstepd[46731]: task/cgroup: /slurm/uid_215845/job_6245468/step_0: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB slurmstepd[46706]: done with job slurmstepd[46731]: done with job "error: spank: required plugin spank_test.so: init_post_opt() failed with rc=-1" sounds like it's doing the right thing. But the job actually starts, and completes with a return code of 0: # sacct -j 6245468 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 6245468 hostname test XXXXX 1 COMPLETED 0:0 6245468.ext+ extern XXXXX 1 COMPLETED 0:0 6245468.0 hostname XXXXX 1 COMPLETED 0:0 Shouldn't the job fail in that case? The documentation says: "If a SPANK plugin is required, then failure of any of the plugin's functions will cause slurmd to terminate the job." If slurm_spank_init_post_opt returns -1 in the "local" context, the job fails, as expected. Thanks, -- Kilian