Ticket 4651

Summary: SPANK: returning error in remote context doesn't fail job
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: Other    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.0   
Hardware: Linux   
OS: Linux   
Site: Stanford
Version Fixed: 17.11.3
Attachments: Patch to fix logic flaw

Description Kilian Cavalotti 2018-01-19 10:52:01 MST
Hi!

This is more a question than anything else, but I noticed that if a SPANK plugin configured as "required" in plugstack.conf returns -1 from slurm_spank_init_post_opt() in the "remote" context, the job still completes successfully.
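For context, a "required" plugin is declared in plugstack.conf with a line along these lines (the path is hypothetical, not the site's actual one):

```
required /usr/lib64/slurm/spank_test.so
```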

slurmstepd on the execution host logs this:


    slurmd[42898]: _run_prolog: run job script took usec=88567
    slurmd[42898]: _run_prolog: prolog with lock for job 6245468 ran for 1 seconds
    slurmstepd[46706]: error: spank: required plugin spank_test.so: init_post_opt() failed with rc=-1
    slurmstepd[46706]: task/cgroup: /slurm/uid_215845/job_6245468: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
    slurmstepd[46706]: task/cgroup: /slurm/uid_215845/job_6245468/step_extern: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
    slurmd[42898]: launch task 6245468.0 request from 215845.32264@10.10.6.34 (port 56006)
    slurmstepd[46731]: error: spank: required plugin spank_test.so: init_post_opt() failed with rc=-1
    slurmstepd[46731]: task/cgroup: /slurm/uid_215845/job_6245468: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
    slurmstepd[46731]: task/cgroup: /slurm/uid_215845/job_6245468/step_0: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
    slurmstepd[46706]: done with job
    slurmstepd[46731]: done with job


"error: spank: required plugin spank_test.so: init_post_opt() failed with rc=-1" sounds like it's doing the right thing. But the job actually starts, and completes with a return code of 0:

# sacct -j 6245468
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6245468        hostname       test      XXXXX          1  COMPLETED      0:0
6245468.ext+     extern                 XXXXX          1  COMPLETED      0:0
6245468.0      hostname                 XXXXX          1  COMPLETED      0:0


Shouldn't the job fail in that case? The documentation says: "If a SPANK plugin is required, then failure of any of the plugin's functions will cause slurmd to terminate the job."

If slurm_spank_init_post_opt returns -1 in the "local" context, the job fails, as expected.


Thanks,
-- 
Kilian
Comment 1 Moe Jette 2018-01-19 16:07:52 MST
I would expect the job to fail.
I'll investigate further.
Comment 2 Moe Jette 2018-01-19 16:58:25 MST
Created attachment 5971 [details]
Patch to fix logic flaw

The spank logic in Slurm has been flawed since its inception. Slurm treats a spank function return value of less than zero as an error, but an underlying function was converting the -1 to 1 (you'll quickly see the flaw by looking at the patch).
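The general shape of the flaw can be sketched as follows. This is a schematic illustration with hypothetical names, not the actual Slurm source: a plugin callback fails with the conventional -1, an intermediate layer collapses any nonzero return code into a positive error code, and the caller's "rc < 0" test then never fires.

```c
/* Plugin callback failing with the conventional -1. */
int plugin_init_post_opt(void)
{
    return -1;
}

/* Buggy wrapper: normalizes any failure to the positive code 1,
 * losing the sign the caller relies on. */
int call_plugin_buggy(void)
{
    return plugin_init_post_opt() ? 1 : 0;
}

/* Fixed wrapper: propagates the plugin's return value unchanged. */
int call_plugin_fixed(void)
{
    return plugin_init_post_opt();
}

/* Caller: the step is terminated only when rc is negative. */
int step_should_fail(int rc)
{
    return rc < 0;
}
```

With the buggy wrapper, step_should_fail() sees rc=1 and lets the job run despite the logged error; with the fix, the negative rc survives and the job is terminated as documented.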

I'll get someone else to review the patch and get it committed in our next release.
Comment 5 Kilian Cavalotti 2018-01-19 17:05:55 MST
(In reply to Moe Jette from comment #2)
> Created attachment 5971 [details]
> Patch to fix logic flaw
> 
> The spank logic in slurm is bad and has been since inception. The logic is
> Slurm treats a spank function return value less than zero, but an underlying
> function was converting the -1 to 1 (you'll quickly see the flaw by looking
> the patch).
> 
> I'll get someone else to review the patch and get it committed in our next
> release.

Excellent, thanks for the quick turnaround!

Cheers,
-- 
Kilian
Comment 9 Moe Jette 2018-01-30 11:04:10 MST
This patch has been committed to Slurm and will be in our next release, 17.11.3. The commit is here:
https://github.com/SchedMD/slurm/commit/c3d9135f014e7c24ac98035542f2ecaf3b77846a
Comment 10 Kilian Cavalotti 2018-01-30 11:05:01 MST
(In reply to Moe Jette from comment #9)
> This patch has been committed to Slurm and will be in our next release,
> 17.11.3. The commit is here:
> https://github.com/SchedMD/slurm/commit/
> c3d9135f014e7c24ac98035542f2ecaf3b77846a

Thanks Moe!

Cheers,
--
Kilian