| Summary: | SPANK: spank_option_getopt() from spank_task_init_privileged | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ben Matthews <matthews> |
| Component: | Other | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 17.11.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | UCAR | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | spank example1, spank example2 | | |
Description
Ben Matthews
2018-03-30 13:59:02 MDT
Hi Ben, I'm able to reproduce the "Generic error" returned from spank_option_getopt() when called within slurm_spank_task_init_privileged(). I have my doubts whether that should work or not; I will need to investigate further. Looking at the spank_option_getopt() implementation in src/common/plugstack.c, it seems that the opt value is first checked in an internal spank cache and, if not found, the logic tries to retrieve it from an env variable with a format similar to this:

SLURM_SPANK_OPTION_<spank_plugin_name>_<option_name>

I see this logged in the logs and it looks suspicious:

slurmd: debug: unsetenv (SPANK__SLURM_SPANK_OPTION_dummy_test)

especially the double "_" after the SPANK prefix. Will continue to dig further on this.

In some of my tests, I do see the environment variable, so it seems like it should work (and a copy of the environment variable with the header duplicated, which seems like a bug). At the end of the day, what I really want is a user-provided string inside slurm_spank_task_init_privileged(). If there's a better way to do that, I'm on board, but it seems like the callback thing might be a bit hard to get right without ever leaking memory anywhere. Ideally, if the flag is set on salloc, then I want that value, unless srun is given something different, in which case srun's value should be used. I think this is how it's supposed to work, but it's not entirely clear from the documentation. I'd rather it weren't an environment variable, but in this case, I guess if the user messes with the environment, they get what they deserve.

Based on testsuite/expect/test7.11.prog.c and 'man spank', I've managed to retrieve a user-provided new spank option from slurm_spank_task_init_privileged(). I am not using spank_option_getopt() though. While I study that function further, you can use that code as an example starting point for how to retrieve the options provided to salloc/sbatch or srun in remote context.
This is an example output of a slightly modified version of test7.11.prog.c:
alex@ibiza:~/t$ salloc --test_suite_sbatch=2 bash
salloc: _test_opt_process_sbatch: opt_arg_sbatch=2
salloc: Granted job allocation 30046
alex@ibiza:~/t$ srun --test_suite_srun=3 true
srun: _test_opt_process_sbatch: opt_arg_sbatch=2
srun: _test_opt_process_srun: opt_arg_srun=3
slurmstepd-compute1: error: slurm_spank_task_init_privileged: opt_arg_srun=3, opt_arg_sbatch=2
alex@ibiza:~/t$
(I printed the message from slurm_spank_task_init_privileged as slurm_error instead of slurm_info so I can see it in srun output without using --slurmd-debug=info)
I've realized that options passed to salloc/sbatch (S_CTX_ALLOCATOR) work differently from those passed to srun (S_CTX_LOCAL).

For the options passed to srun you can use the predefined struct spank_option with the symbol name spank_options. In this struct you can define the callback (_test_opt_process_srun) to do the sanity checks for the option and save it to a global variable (opt_arg_srun), which can be accessed later from slurm_spank_task_init_privileged().

salloc/sbatch can't use the predefined struct spank_option with the symbol name spank_options, so for the S_CTX_ALLOCATOR context you have to register a new struct in slurm_spank_init():

if (spank_option_register(sp, spank_options_reg) != ESPANK_SUCCESS)
    slurm_error("spank_option_register error");

spank_options_reg is the new struct with callback _test_opt_process_sbatch; this callback does the sanity check for the salloc/sbatch --test_suite_sbatch option and then saves the value to the opt_arg_sbatch global variable. This can then also be accessed from slurm_spank_task_init_privileged().
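Put together, the callback-plus-global pattern described above looks roughly like the following self-contained sketch. Note that struct spank_option_sketch and spank_options_sketch are simplified stand-ins invented here so the snippet compiles without <slurm/spank.h>; a real plugin would include that header and use its struct spank_option and spank_opt_cb_f definitions instead.

```c
#include <stdio.h>
#include <stdlib.h>

/* Global written by the option callback; in a real plugin it would be
 * read later from slurm_spank_task_init_privileged(). */
static int opt_arg_srun = 0;

/* Option callback, mirroring the spank_opt_cb_f signature
 * (int val, const char *optarg, int remote): sanity-check the
 * argument and stash it in the global. */
static int _test_opt_process_srun(int val, const char *optarg, int remote)
{
    (void) val;
    (void) remote;
    if (optarg == NULL)
        return -1;              /* ESPANK_BAD_ARG in a real plugin */
    opt_arg_srun = atoi(optarg);
    return 0;                   /* ESPANK_SUCCESS in a real plugin */
}

/* Simplified stand-in for struct spank_option from <slurm/spank.h>:
 * name, argument info, usage text, has_arg, val, callback. */
struct spank_option_sketch {
    const char *name;
    const char *arginfo;
    const char *usage;
    int has_arg;
    int val;
    int (*cb)(int, const char *, int);
};

/* NULL-terminated option table, the shape the real spank_options
 * symbol (or spank_option_register()) expects. */
static struct spank_option_sketch spank_options_sketch[] = {
    { "test_suite_srun", "[number]",
      "Sample option to pass to srun.", 1, 0, _test_opt_process_srun },
    { NULL, NULL, NULL, 0, 0, NULL }
};
```

When the user passes --test_suite_srun=3, the option parser invokes the callback with optarg "3", and slurm_spank_task_init_privileged() can later read opt_arg_srun from the same plugin instance.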
Does this make sense? Any questions?
In the meantime, I'll study spank_option_getopt(), which I think is an alternative approach to the previous register/callback mechanism. But I think you can live with the previous mechanism for now.
I've been able to use spank_option_getopt() inside slurm_spank_job_[prolog|epilog](). Actually, I think it's the only mechanism you can use in the S_CTX_JOB_SCRIPT context to retrieve the options. Furthermore, you can only retrieve registered options. From the spank man page:
As an alternative to use of an option callback and global variable, plugins can use the spank_option_getopt function to check for supplied options after option processing. This function has the prototype:
spank_err_t spank_option_getopt(spank_t sp,
struct spank_option *opt, char **optargp);
This function returns ESPANK_SUCCESS if the option defined in the
struct spank_option opt has been used by the user. If optargp
is non-NULL then it is set to any option argument passed (if the option
takes an argument). The use of this method is required to process
options in job_script context (slurm_spank_job_prolog and
slurm_spank_job_epilog).
alex@ibiza:~/t$ sbatch --test_suite_sbatch=6 --wrap="true"
sbatch: _test_opt_process_sbatch: opt_arg_sbatch=6
Submitted batch job 30061
alex@ibiza:~/t$
spank-epilog: error: slurm_spank_job_epilog: opt_arg_srun=0, opt_arg_sbatch=0
spank-epilog: error: slurm_spank_job_epilog: getopt success, option=6
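The check-after-processing flow that spank_option_getopt() enables can be sketched in a self-contained way as below. The mock types and mock_spank_option_getopt() are stand-ins invented for illustration only; in a real plugin, the real spank_option_getopt(sp, opt, &optargp) from <slurm/spank.h> plays this role inside slurm_spank_job_prolog/epilog, and only registered options can be looked up.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the opaque spank_t handle: here it just records which
 * registered option the user actually supplied, and its argument. */
struct mock_spank {
    const char *supplied_name;   /* option the user passed, or NULL */
    char *supplied_arg;          /* its argument, if any */
};

/* Stand-in for struct spank_option, reduced to the name field. */
struct mock_spank_option {
    const char *name;
};

/* Mimics spank_option_getopt(): returns 0 (success) only if the
 * option was actually used by the user; *optargp, when optargp is
 * non-NULL, receives the option argument. */
static int mock_spank_option_getopt(struct mock_spank *sp,
                                    struct mock_spank_option *opt,
                                    char **optargp)
{
    if (sp->supplied_name && strcmp(sp->supplied_name, opt->name) == 0) {
        if (optargp)
            *optargp = sp->supplied_arg;
        return 0;     /* ESPANK_SUCCESS */
    }
    return 1;         /* stand-in for a non-success spank_err_t */
}
```

In slurm_spank_job_epilog() the real call would look like `if (spank_option_getopt(sp, &opt, &optargp) == ESPANK_SUCCESS) ...`, which matches the "getopt success, option=6" line in the output above.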
Please, let me know if there's anything else or if we can close the bug. Thanks
So, to summarize: for the slurm_spank_task_init_privileged() hook, I must use the callback? For the slurm_spank_job_[prolog|epilog]() hooks, I must use spank_option_getopt()?

Two things:

- If I use the callback, and I want to store an arbitrary amount of data, which hooks might I have to free from? Doesn't your example (test7.11.prog.c) leak memory in the slurmd (at least the region pointed to by opt_out_file)?
- If the various approaches only work in some contexts, it'd be nice if that were documented more clearly. The documentation makes it fairly clear that you must use spank_option_getopt() from slurm_spank_job_[prolog|epilog](), but not that you must not use it from the other contexts (and it seems like a much more reasonable API - global state is usually bad :-) )

Created attachment 6550 [details]
spank example1
Created attachment 6551 [details]
spank example2
Hey Ben, I've attached two spank examples with opt passing to the init_privileged and epilog functions. The first one is based off testsuite/expect/test7.11.prog and the second is a minimal workable version of Singularity. You could start working based off either of these. There are more elaborate examples in the full Singularity and/or NERSC Shifter plugins:

https://github.com/singularityware/singularity/blob/master/src/slurm/singularity.c
https://github.com/NERSC/shifter
https://github.com/NERSC/shifter/tree/master/wlm_integration/slurm

(In reply to Ben Matthews from comment #9)
> So, to summarize,
>
> for the slurm_spank_task_init_privileged() hook, I must use the callback?

Either the callback, or just save the option to a global variable in init. The callback should save the option to a global variable or set an env variable anyway.

> for the slurm_spank_job_[prolog|epilog]() hooks, I must use
> spank_option_getopt()?

Correct.

> Two things:
>
> - If I use the callback, and I want to store an arbitrary amount of data,
> which hooks might I have to free from? Doesn't your example
> (test7.11.prog.c) leak memory in the slurmd (at least the region pointed to
> by opt_out_file)?

I've not tested that program against a memory leak tool like valgrind, but in theory whatever is malloc'd/xmalloc'd should be free'd/xfree'd.

> - If the various approaches only work in some contexts, it'd be nice if that
> were documented more clearly. The documentation makes it fairly clear that
> you must use spank_option_getopt() from slurm_spank_job_[prolog|epilog](),
> but not that you must not use it from the other contexts (and it seems like
> a much more reasonable API - global state is usually bad :-) )

I totally agree with that. Once I confirmed my job script context assumption with Moe, he changed the documentation to make it more clear:

https://github.com/SchedMD/slurm/commit/b1261c9571c7b1d1cae5433f246042003db5e941

Please, let us know if you need anything else from here.
Shifter is complex enough that I'm not 100% sure what it's doing - I think they're abusing the environment to get around this problem (which is quite fragile when you have users that tend to fill it), but Singularity and the Slurm test program both do the same concerning thing:

https://github.com/SchedMD/slurm/blob/master/testsuite/expect/test7.11.prog.c#L125
https://github.com/singularityware/singularity/blob/master/src/slurm/singularity.c#L248
https://github.com/singularityware/singularity/blob/master/src/slurm/singularity.c#L306

Perhaps you get away with this as a side effect of how the plugin infrastructure works (looking for some confirmation on that), but from where I sit, those strdup() calls are hiding malloc() calls without a corresponding free().

It seems like you have to do a check and potentially free from your callback, in slurm_spank_exit, slurm_spank_slurmd_exit, and slurm_spank_task_init_privileged, and I don't think that really covers all the possible cases (hence my concern about this callback/global state approach). I suppose people don't really notice this because you're probably only leaking a few tens of bytes per job (and likely only on the compute node), but I'd still rather not - we're tending toward being a high-throughput site. I think I still don't understand the control flow well enough to know all the places the free has to go, which is what I was hoping to get from you(?)

(In reply to Ben Matthews from comment #23)
> Shifter is complex enough that I'm not 100% sure what it's doing - I think
> they're abusing the environment to get around this problem (which is quite
> fragile when you have users that tend to fill it), but Singularity and the
> Slurm test program both do the same concerning thing:
>
> https://github.com/SchedMD/slurm/blob/master/testsuite/expect/test7.11.prog.c#L125
> https://github.com/singularityware/singularity/blob/master/src/slurm/singularity.c#L248
> https://github.com/singularityware/singularity/blob/master/src/slurm/singularity.c#L306
>
> Perhaps you get away with this as a side effect of how the plugin
> infrastructure works (looking for some confirmation on that), but from where
> I sit, those strdup() calls are hiding malloc() calls without a
> corresponding free().
>
> It seems like you have to do a check and potentially free from your
> callback, in slurm_spank_exit, slurm_spank_slurmd_exit,
> slurm_spank_task_init_privileged, and I don't think that really covers all
> the possible cases (hence my concern about this callback/global state
> approach).

I'll run the tests under valgrind next Monday and analyze which exit functions are the proper places to free the strdup'ed values from the init function. example1 doesn't strdup anything; it just atoi's the argument from the opt struct in the opt callback. I'll study whether that needs to be freed too and come back to you.

> I suppose people don't really notice this because you're probably only
> leaking a few 10s of bytes per job (and likely only on the compute node),
> but I'd still rather not - we're tending toward being a high-throughput
> site. I think I still don't understand the control flow well enough to know
> all the places the free has to go, which is what I was hoping to get from
> you(?)

Again, I'll dig into that further on Monday and let you know. I need to familiarize myself further with the SPANK infrastructure, which has been around since very old versions and is not something I write every day.

valgrind doesn't report definitely lost blocks for Example1 (spank_test.c) in slurmd nor srun.
Example2 (test7.11.prog.c) doesn't leak either in slurmd, but it does in srun:
==17600== 10 bytes in 1 blocks are definitely lost in loss record 6 of 183
==17600== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17600== by 0x5920C99: strdup (strdup.c:42)
==17600== by 0x72DB947: ???
==17600== by 0x53FE0D7: _do_option_cb (plugstack.c:1156)
==17600== by 0x53FEC2A: spank_process_option (plugstack.c:1187)
==17600== by 0x11902F: _set_options (opt.c:2382)
==17600== by 0x119862: initialize_and_process_args (opt.c:589)
==17600== by 0x11DDD0: init_srun (srun_job.c:644)
Adding this at the end of Example2 solves the strdup'ed leak:
int slurm_spank_exit(spank_t spank, int ac, char *argv[])
{
    free(job_image);
    job_image = NULL;
    return ESPANK_SUCCESS;
}
I'd rather use Slurm's xstrdup/xfree though, and I personally prefer the Example1 approach. Please, let me know if there's anything else you need here. Thanks.
Interesting, not what I'd expect - I'll put a free in the slurm_spank_exit() hook and call it good. The code that I have now seems to be working. I do have one last question though:

The section for the option parsing states that the "remote" context for the option parsing callback is slurmd, but experimentally, that callback also seems to run in slurmstepd (that is, when spank_context() == S_CTX_REMOTE). For now, I'm ignoring the "remote" parameter to the callback and using spank_context() instead, which seems to work. It's unclear whether the callback runs in slurmd at all (if not, that would help explain your valgrind results). Does the option callback run in all three contexts, assuming you call spank_option_register() in each context? For the sake of any future victims, can the documentation be clarified on this point?

Thanks for all the help, especially since Tim tells me you generally don't help with SPANK development.

(In reply to Ben Matthews from comment #26)
> Interesting, not what I'd expect - I'll put a free in the slurm_spank_exit()
> hook and call it good. The code that I have now seems to be working. I do
> have one last question though:
>
> The section for the option parsing states that the "remote" context for the
> option parsing callback is slurmd, but experimentally, that callback also
> seems to run in slurmstepd (that is, when spank_context() == S_CTX_REMOTE).
> For now, I'm ignoring the "remote" parameter to the callback and using
> spank_context() instead, which seems to work. Unclear whether the callback
> runs in slurmd at all (if not, that would help explain your valgrind
> results). Does the option callback run in all three contexts, assuming you
> call spank_option_register() in each context? For the sake of any future
> victims, can the documentation be clarified on this point?
If I register the option for all contexts, the option callback runs in the following contexts, with the 'remote' argument set as follows for an sbatch (allocator) request:

alex@ibiza:~/t$ sbatch --test_suite_sbatch=6 --wrap="true"
sbatch: error: _test_opt_process_sbatch: opt_arg_sbatch=6, remote=0, context=3
Submitted batch job 20037
alex@ibiza:~/t$
alex@ibiza:~/slurm/17.11/ibiza/log$ sudo grep remote= slurmd-compute1.log
[2018-04-10T11:27:41.437] [20037.batch] error: _test_opt_process_sbatch: opt_arg_sbatch=6, remote=1, context=2
alex@ibiza:~/slurm/17.11/ibiza/log$

So it runs in S_CTX_ALLOCATOR (context=3, remote=0) and also in S_CTX_REMOTE (context=2, remote=1). Note that the S_CTX_REMOTE context means the plugin is loaded by slurmstepd, and the argument 'remote'=1 means the function is being called from the remote host (where slurmd runs).

If I submit through srun (local context) directly, without being inside an allocation:

alex@ibiza:~/t$ srun -p xeon24 --test_suite_srun=3 true
srun: error: _test_opt_process_srun: opt_arg_srun=3, remote=0, context=1
slurmstepd-compute1: error: slurm_spank_task_init_privileged: opt_arg_srun=3, opt_arg_sbatch=0
alex@ibiza:~/t$
alex@ibiza:~/slurm/17.11/ibiza/log$ sudo grep remote= slurmd-compute1.log
...
[2018-04-10T12:02:33.021] [20039.0] error: _test_opt_process_srun: opt_arg_srun=3, remote=1, context=2
[2018-04-10T12:02:33.021] [20039.0] error: _test_opt_process_srun: opt_arg_srun=3, remote=1, context=2
alex@ibiza:~/slurm/17.11/ibiza/log$

The function runs in S_CTX_LOCAL (context=1, remote=0) and also in S_CTX_REMOTE (context=2, remote=1).

In short, it looks like the function runs in two contexts: local/allocator depending on salloc/sbatch or srun requests (all with remote=0, since it's the submission host), and remote context (slurmstepd, with remote=1 since it runs on the "remote" host). I've updated the documentation to clarify this:

https://github.com/SchedMD/slurm/commit/c67bdb8ea12a01956769fc74

Does it make sense?
> Thanks for all the help, especially since Tim tells me you generally don't
> help with SPANK development.

You're welcome.

Hi Ben. Is there anything else left here?

(In reply to Alejandro Sanchez from comment #28)
> Hi Ben. Is there anything else left here?

No, I think that covers everything. Thanks again, and sorry for disappearing there - a couple of us are at a conference this week.