Ticket 10454

Summary: 20.11 regression: spank plugins can't filter out invalid values anymore
Product: Slurm Reporter: Felix Abecassis <fabecassis>
Component: RegressionAssignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: cinek, dwightman, jbernauer, lyeager
Version: 20.11.1   
Hardware: Linux   
OS: Linux   
Site: NVIDIA (PSLA) Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 20.11.3 21.08pre1
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Felix Abecassis 2020-12-15 21:22:45 MST
In the SPANK plugin pyxis (https://github.com/NVIDIA/pyxis) we use the spank_option_register function of the SPANK API to register our options callbacks (of type https://github.com/SchedMD/slurm/blob/slurm-20-11-1-1/slurm/spank.h#L209-L216)

For example, we check if one argument is not passed an empty string here:
https://github.com/NVIDIA/pyxis/blob/c8f732fe77b0da39a1fe1a44d87897e265872837/args.c#L116-L119

And it aborst the job launch as expected in Slurm 20.02:

> $ srun -A admin -p admin -N1 --container-image= hostname ; echo $?
> srun: error: pyxis: --container-image: argument required
> srun: error: Invalid --container-image argument: 
> 255

After upgrading to 20.11 on a test system, we noticed that we can't filter out invalid arguments, the job proceeds and executes normally:

> $ srun -A admin -p admin -N1 --container-image= hostname ; echo $?
> srun: error: pyxis: --container-image: argument required
> srun: error: Invalid --container-image argument: 
> sc-sdgx-735
> 0

After a git bisect:

> $ git bisect good
> e9bb9ce288c6284a801b2bd4422d519da6751783 is the first bad commit
> commit e9bb9ce288c6284a801b2bd4422d519da6751783
> Author: Tim Wickberg <tim@schedmd.com>
> Date:   Fri Oct 23 02:07:43 2020 -0600
> 
>    Quietly error out when an option is unknown.
    
>    Let the rest of the argument handling figure this out.

> src/common/plugstack.c | 2 +-
> src/common/slurm_opt.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
 
It's unclear to me what is meant by "Let the rest of the argument handling figure this out" since spank_process_option is only called here in the Slurm code and the return value of slurm_process_option is ignored in most places.

Given that this is not mentioned in the release notes and that the commit message mentions "... when an option is unknown", it looks like this new behavior was not intended. Here the option is not unknown, it's a valid SPANK option rejected by our callback.
Comment 5 Luke Yeager 2020-12-18 16:15:16 MST
Raising this to sev2. This is a blocker for us - we can't upgrade to 20.11 until this is fixed.
Comment 6 Marcin Stolarek 2020-12-18 23:12:23 MST
The patch is under review. Do you want to give it a try being aware that it's not yet approved/merged?

Cheera,
Marcin
Comment 7 Luke Yeager 2020-12-19 06:26:42 MST
No that's ok, we can wait. I just wanted to bump this to make sure it didn't get lost in the noise. Thanks for looking into it!
Comment 18 Luke Yeager 2021-01-07 10:29:57 MST
Do you expect the fix be merged before 20.11.3?
Comment 20 Marcin Stolarek 2021-01-07 21:33:20 MST
Luke,

The patch passed our review and got merged ahead of Slurm 20.11.3[1]

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/30ff31b24eaac54a643abf3c3c40c8a7d174bca8
Comment 21 Luke Yeager 2021-01-08 09:24:26 MST
That's good news, thanks! We'll give it a try as soon as 20.11.3 is released.
Comment 22 Luke Yeager 2021-01-26 10:28:57 MST
Verified as fixed in 20.11.3.