I am trying to set "SLURM_TIMELIMIT" and "PBS_WALLTIME" environment variables via Lua submit filter. This is the function that gets called below. This works just fine with sbatch but when I do something like "salloc -N 1 -n 1 srun --pty /bin/bash" I get this in logs: Jul 13 15:29:16 pitzer-slurm01 slurmctld[116500]: error: _set_job_env_field: job_desc->environment is NULL I tried adding -E to srun, that did not help. FUNCTION: function set_environments(job_desc) local time_limit = job_desc.time_limit if job_desc.environment == nil then job_desc.environment = {} end if time_limit ~= nil then time_limit_secs = time_limit * 60 job_desc.environment.SLURM_TIMELIMIT = time_limit_secs job_desc.environment.PBS_WALLTIME = time_limit_secs end end Execution in slurm_job_submit function: set_environments(job_desc)
I've written a custom job_submit plugin that writes a SPANK environment variable as well as a user job environment variable sort of doing what I've already described: https://github.com/treydock/slurm/commit/347685f7bac6a2ac1cabe9eaf881a938533e1c2b During sbatch jobs, the SPANK environment variable is there but not when I launch a job allocation using salloc & srun. I think these are related as the goal here is to pass environment variables into not just the job but the prolog and in both cases the environment variables aren't making it out when using salloc & srun.
Trey, Thanks for the bug report. I can reproduce this in both 19.05 and 20.02 (and earlier). I confirmed that while salloc, sbatch, and srun all initialize job_desc->environment to NULL, only sbatch actually allocates this variable. This has the effect that you're seeing. However, I also confirmed that this is the correct behavior. The reason it's expected is that salloc and srun communicate with the compute nodes directly, so they don't send their environment (and some other things in job_desc) to the slurmctld. However, sbatch has to send everything to slurmctld, which will then send everything to the correct compute node(s) when the job is scheduled. This isn't documented from what I can tell, so we need to document which variables in job_desc aren't available to salloc and srun. We might end up documenting it in the code alongside the available fields, since we don't document it on the job_submit webpage[1] but instead just point to the code. I can offer a couple of possible suggestions that don't use the job_desc variable. You could use a cli_filter_plugin[2] to put the environment variable in the user's environment, which will then be propagated to the job's environment (works for all salloc, sbatch, and srun). Something really simple like this works for me: #define _GNU_SOURCE #include <inttypes.h> #include <ctype.h> #include <stdio.h> #include <string.h> #include <sys/types.h> #include <unistd.h> #include "src/common/slurm_xlator.h" #include "slurm/slurm_errno.h" #include "src/common/cli_filter.h" #include "src/common/slurm_opt.h" const char plugin_name[] = "cli filter none plugin"; const char plugin_type[] = "cli_filter/none"; const uint32_t plugin_version = SLURM_VERSION_NUMBER; extern int setup_defaults(slurm_opt_t *opt, bool early) { return SLURM_SUCCESS; } extern int pre_submit(slurm_opt_t *opt, int offset) { if (opt->time_limit != NO_VAL) { char *str; int size; int time_limit_secs; time_limit_secs = opt->time_limit * 60; /* Use snprintf to get length of string. */ size = snprintf(NULL, 0, "SLURM_TIME_LIMIT=%d", time_limit_secs); str = malloc(size + 1); /* Add 1 for NULL-terminating byte. */ snprintf(str, size + 1, "SLURM_TIME_LIMIT=%d", time_limit_secs); putenv(str); } return SLURM_SUCCESS; } extern void post_submit(int offset, uint32_t jobid, uint32_t stepid) { return; } However, as you already pointed out most environment variables aren't available in the job prolog. You can use a SPANK plugin to make these environment variables available to the job prolog. I gave one example of how to do this in bug 6609 comment 3. Do these suggestions work for you? [1] https://slurm.schedmd.com/job_submit_plugins.html [2] https://slurm.schedmd.com/cli_filter_plugins.html
So the CLI plugin will work for setting time limit like you illustrated but the next thing we'd like to set is GRES. This relates to this bug: https://bugs.schedmd.com/show_bug.cgi?id=9358 and was the goal of my job_submit/gres_env plugin. That plugin looks at job_desc->tres_per_node and sets a SPANK_GRES into spank env and SLURM_JOB_GRES into user env. How would I modify that example SPANK plugin you linked to access the job's requested GRES? With job_submit code it was real easy because job_desc is one of the arguments to the function but not clear to me how that would be done with SPANK. It looks like maybe I could set SLURM_JOB_GRES in the CLI plugin and then use spank_getenv to get that value and then use your example you linked to set the job control env. If I could be provided a possible example of how that might work, that would be greatly appreciated.
Also are there examples of how to use the Lua cli filter to do what you illustrated in C? I would prefer to do the simple operations in Lua if possible but having trouble finding examples.
I got the CLI filter plugin to work for GRES but the SPANK I tried the following and the response I get is this: sbatch: error: slurm_spank_init_post_opt: Error Valid only in remote context int slurm_spank_init_post_opt(spank_t spank, int ac, char *argv[]) { char val[30000]; spank_err_t err = ESPANK_SUCCESS; err = spank_getenv(spank, "SLURM_JOB_GRES", val, sizeof(val)); if(err != ESPANK_SUCCESS) { const char* errstr = spank_strerror(err); slurm_error("%s: Error %s", __func__, errstr); return 0; } spank_job_control_setenv(spank, "GRES", val, 1); return 0; } If I just use getenv, it works. I figured using spank_getenv was the proper way but I can't get that to work. This works: int slurm_spank_init_post_opt(spank_t spank, int ac, char *argv[]) { char *gres; if (getenv(GRES_ENV_VAR)) { gres = getenv(GRES_ENV_VAR); spank_job_control_setenv(spank, "GRES", gres, 1); return 0; } return 0; } I do end up with SLURM_SPANK_GRES and SLURM_JOB_GRES environment variables set in the job but that's not the end of the world having the extra SPANK one in job environment.
RE the question about spank: This is all documented on our spank page (https://slurm.schedmd.com/spank.html), but I'll just copy it here: " SPANK functions in the local and allocator environment should use the getenv, setenv, and unsetenv functions to view and modify the job's environment. SPANK functions in the remote environment should use the spank_getenv, spank_setenv, and spank_unsetenv functions to view and modify the job's environment. " slurm_spank_init_post_opt is called in the "local" environment (on the login node), not the remote environment. That's why you see that error. Continuing to quote from the doc: " Functions are also available from within the SPANK plugins to establish environment variables to be exported to the Slurm PrologSlurmctld, Prolog, Epilog and EpilogSlurmctld programs (the so-called job control environment). The name of environment variables established by these calls will be prepended with the string SPANK_ in order to avoid any security implications of arbitrary environment variable control. (After all, the job control scripts do run as root or the Slurm user.). These functions are available from local context only. spank_err_t spank_job_control_getenv(spank_t spank, const char *var, char *buf, int len); spank_err_t spank_job_control_setenv(spank_t spank, const char *var, const char *val, int overwrite); spank_err_t spank_job_control_unsetenv(spank_t spank, const char *var); " So you'll want to use these functions to have access to the environment variables in the prolog/epilog scripts. RE the question about the Lua cli_filter plugin: Here's a simple stub, you just need the file in the correct directory (usually the same spot as slurm.conf; see the cli_filter web page): function slurm_cli_setup_defaults(opt, early) return slurm.SUCCESS end function slurm_cli_pre_submit(opt, offset) return slurm.SUCCESS end function slurm_cli_post_submit(offset, jobid, stepid) return slurm.SUCCESS end However, I discovered a bug where it doesn't properly find all the options, including (unfortunately) time_limit. So until we fix that you won't be able to use the lua cli_filter plugin for those variables. I will create a separate bug to fix that and CC you on that bug. > It looks like maybe I could set SLURM_JOB_GRES in the CLI plugin and then > use spank_getenv to get that value and then use your example you linked to > set the job control env. If I could be provided a possible example of how > that might work, that would be greatly appreciated. Do you still need this? Today I've spent my time investigating the issue with the lua cli_filter plugin so I'm out of time for the day, but I could get you an example tomorrow or the next day. But it looks like you've figured it out mostly.
Thanks for clarification on SPANK plugin. The docs didn't mention that slurm_spank_init_post_opt was local context, at least not that I could find. I think all I need now is examples of how to take in CLI options in cli_filter/lua script and how to modify environment variables from that same Lua script.
I was experimenting and think I may have gotten what I need to work for GRES in cli_filter: function slurm_cli_pre_submit(options, cli_type) if options["gres"] ~= nil then io.popen("export SLURM_JOB_GRES="..options["gres"]) end return slurm.SUCCESS end If there is a better approach or more proper way, please let me know. I assume something similar could be used for setting a timelimit environment variable once the bug is fixed that was mentioned.
So io.popen works on my vagrant test environment where everything is on same node but not working on my test SLURM cluster where we more closely look like production, using the C CLI filter is working. Any tips on how to get the environment variables set from the Lua would be great.
One more example that doesn't work, but using posix lua module to do setenv: local posixstdlib = require("posix.stdlib") function slurm_cli_pre_submit(options, cli_type) if options["gres"] ~= nil then posixstdlib.setenv("SLURM_JOB_GRES", options["gres"]) end return slurm.SUCCESS end
Further testing , I think I was doing posix Lua wrong for my version of luaposix. This appears to work: This works but would still be good to know if this is correct way: require "osc_common" local posix = require("posix") function slurm_cli_pre_submit(options, cli_type) if options["gres"] ~= nil then local new_gres = gres_defaults(options["gres"]) options["gres"] = new_gres posix.setenv("SLURM_JOB_GRES", options["gres"]) end return slurm.SUCCESS end The gres_defaults just converts --gres=pfsdir to --gres=pfsdir:scratch
I've been trying to test lua-posix, but I found out that it's broken in Ubuntu 18.04 and that's my OS. Rather than upgrade my OS to test this one thing, I'll just point you to my coworker Marcin's example in bug 8821. He does exactly what you do in comment 16, so I'd say yes that is a good way to do it. About the bug I mentioned with resolving options in the lua cli_filter plugin - the way we resolve options is based on the "name" field in the option struct. These are found in src/common/slurm_opt.c in the Slurm source. The bug is that the "name" field isn't always the same string as the variable name. For the field time_limit its name is "time" so you can access it with "time" right now: opt.time or this way for the string equivalent: opt["time"] Another example is opt.job_name whose "name" is "job-name". Because of the hyphen you have to use the second way. This isn't just a straightforward fix because we use the "name" as the CLI option - i.e. to specify time_limit in CLI you do srun --time Anyway, if you want to use the lua cli_filter then you'll have to use this workaround for now and watch for the fix. I'll create the bug today and CC you on it so you can track when and in what Slurm version it's fixed.
I finally tested your example for C time limit setting, and it works. I did notice that in Lua cli_filter the opt["time"] for a job doing --time=01:00:00 returns "01:00:00" rather than 60 in minutes, while the C approach returns minutes for opt->time_limit. Is that expected and just a consequence of using Lua vs C?
Yeah. I'm not very familiar with Lua but based on my tests, opt["variable_name"] returns the string that was given by the user in the CLI. However, opt.variable_name returns the value of the variable in the C struct. In this case, time_limit is an integer, so C stores it as an integer (of type time_t) in struct slurm_opt_t and its value is appropriately 60.
Am I seeing the bug you mentioned or is something else going wrong? I added this to Lua slurm_cli_pre_submit: print(options.time_limit) The value printed is nil print(options.time) The value printed is 01:00:00. I was submitting with this: salloc -t 01:00:00 srun --pty /bin/bash Also tried your example from the bug you opened: slurm.log_info("job time limit: %d", options.time) Gives: salloc: error: pre_submit/lua: /etc/slurm/cli_filter.lua: [string "slurm.log (0, string.format(unpack({...})))"]:1: bad argument #2 to 'format' (number expected, got string) Change to: slurm.log_info("job time limit: %s", options.time) Prints: salloc: lua: job time limit: 01:00:00 Sorry if I misunderstood what you said or missed something obvious. Or is this a version difference where I'm missing a patch or something? Thanks
Actually I was mistaken. In the lua plugin it calls the "get_func" function pointer. That turns out to call mins2time_str() which does what the name implies. That's why you're seeing the 01:00:00 instead of 60. By the way, I created an internal ticket for us to better document this lua cli_filter plugin. This and other things that are non-obvious (or buggy, as we have found out) can make it confusing without closely and carefully following the Slurm source code.
I have C approach that was provided here but we opted to use Lua and wrote a complicated Lua function to parse time from string into minutes. I am going to resolve this case, thanks for the help! I am posting my solutions in case someone stumbles across this and wants to do what we're doing. function convert_time(time) if time == nil then return nil end minutes = time:match("^(%d+)$") if minutes ~= nil then return minutes end minutes, seconds = time:match("^(%d+):(%d+)$") if minutes ~= nil then new_time = tonumber(minutes) + (tonumber(seconds) / 60) return new_time end hours, minutes, seconds = time:match("^(%d+):(%d+):(%d+)$") if hours ~= nil then new_time = (tonumber(hours) * 60) + tonumber(minutes) + (tonumber(seconds) / 60) return new_time end days, hours = time:match("^(%d+)-(%d+)$") if days ~= nil then new_time = (tonumber(days) * 24 * 60) + (tonumber(hours) * 60) return new_time end days, hours, minutes = time:match("^(%d+)-(%d+):(%d+)$") if days ~= nil then new_time = (tonumber(days) * 24 * 60) + (tonumber(hours) * 60) + tonumber(minutes) return new_time end days, hours, minutes, seconds = time:match("^(%d+)-(%d+):(%d+):(%d+)$") if days ~= nil then new_time = (tonumber(days) * 24 * 60) + (tonumber(hours) * 60) + tonumber(minutes) + (tonumber(seconds) / 60) return new_time end return nil end function get_walltime(time_limit, partition) if partition == nil or partition == "" then partition = "batch" end if is_undefined_int(time_limit) or time_limit == nil then if partition == "debug" or partition == "gpudebug" then time_limit = 30 else time_limit = 60 end end return time_limit end CLI filter function: function slurm_cli_pre_submit(options, cli_type) local time = convert_time(options["time"]) local time_limit = get_walltime(time, options["partition"]) local time_limit_sec = tonumber(time_limit) * 60 posix.setenv("SLURM_TIME_LIMIT", time_limit_sec) posix.setenv("PBS_WALLTIME", time_limit_sec) return slurm.SUCCESS end I am able to unit test this code and it works, example unit test using luassert: function test_slurm_cli_pre_submit_timelimit_format3() opts = {} opts["time"] = "01:30:30" spy.on(posix, "setenv") ret = slurm_cli_pre_submit(opts, "") assert.spy(posix.setenv).was.called_with(match.has_match("SLURM_TIME_LIMIT"), match.has_match("5430")) assert.spy(posix.setenv).was.called_with(match.has_match("PBS_WALLTIME"), match.has_match("5430")) opts["time"] = "01:00:00" ret = slurm_cli_pre_submit(opts, "") assert.spy(posix.setenv).was.called_with(match.has_match("SLURM_TIME_LIMIT"), match.has_match("3600")) assert.spy(posix.setenv).was.called_with(match.has_match("PBS_WALLTIME"), match.has_match("3600")) luaunit.assertEquals(ret, nil) end
*** Ticket 9358 has been marked as a duplicate of this ticket. ***