| Summary: | --export=NONE results in idle sbatch jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | HHLR Admins <hhlr-admins> |
| Component: | Other | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | brian |
| Version: | 17.11.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5037, https://bugs.schedmd.com/show_bug.cgi?id=7734 | | |
| Site: | Hessen | | |
| Attachments: | excerpts from slurmd log files from nodes | | |
(In reply to HHLR Admins from comment #0)

Hey Benjamin,

`su` is needed on the node because when `--export=NONE` (or `--export=<some_vars>`) is specified with sbatch, it triggers the `--get-user-env` behavior as well. I'm not sure of the reason for this, and it's also not documented; I'll look into updating the docs to describe this behavior. From the sbatch documentation:

"--get-user-env[=timeout][mode]: This option will tell sbatch to retrieve the login environment variables for the user specified in the --uid option. The environment variables are retrieved by running something of this sort "su - <username> -c /usr/bin/env" and parsing the output."

Since `su` isn't found on your node, slurmd exits with a fatal error.

As for `User requested launch of zero tasks!` and `Unable to unlink domain socket`: I'm not sure why those appear. Could you give me the Slurm command that is failing and your slurm.conf?

Thanks,
Michael

---

Benjamin,
For now, to avoid triggering `--get-user-env`, just avoid `--export=NONE` and `--export=<some_vars>`.
If you still want to add <some_vars> to your current environment, but do not want to trigger `--get-user-env`, replace e.g. this:
$ sbatch --export=ALL,SOMEVAR=hi myscript.sh
with this:
$ SOMEVAR=hi sbatch --export=ALL myscript.sh
This should give you what you expect.
Let me know if that helps.
Thanks,
-Michael
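The prefix-assignment workaround relies on standard shell behavior: `VAR=value cmd` places the variable into `cmd`'s environment without exporting it to the surrounding session. A quick sketch verifying this, using `env` as a stand-in for sbatch (the mechanism is the same shell feature either way):

```shell
# Prefix assignment exports SOMEVAR only to the invoked command:
SOMEVAR=hi env | grep '^SOMEVAR='    # prints: SOMEVAR=hi
# ...while the surrounding shell session remains unaffected:
echo "${SOMEVAR:-unset}"             # prints: unset
```

Because the variable reaches sbatch's own environment, `--export=ALL` then propagates it to the job without ever needing `--export=<some_vars>`.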
Thanks, that sounds like a good way to work around it. We will also reinstate the `su` command on our nodes, which should hopefully get everything working as intended again. I am almost certain the other messages in the slurmd log were a result of the same issue.

Best,
Benjamin

---

Hi,

I got some further feedback from our users regarding this issue. It might be interesting to you, especially since it was initially unclear why the behaviour is what it is and why `--get-user-env` needs to be triggered when exporting the user environment is not intended:

"There is the option --export=NONE to `sbatch` not to forward the user's set environment variables to the submitted job. Unfortunately this will trigger --get-user-env to set the user's environment anyway, which sounds contrary to the specified NONE. I would expect that NONE means NONE (except the set SLURM environment variables and a bare minimum PATH=/bin:/sbin:/usr/bin:/usr/sbin or alike), but instead the behavior is something which might be called --export=USERDEFAULTS.

Besides specifying `if tty >/dev/null; then …` in the users' profiles to avoid exporting anything to a batch job: is there any way to get the desired behavior? What was the initial reason to force the users' environments onto the jobs? Other queueing systems default to not doing this, and one has to use `qsub -V …` to explicitly export the current environment in rare cases.

My experience is that any changes to the profiles after job submission might lead to a crash of those previously submitted jobs, and this can be really hard to investigate. With a self-contained script, all outside changes are irrelevant to the submitted jobs."

Best,
Benjamin

---

(In reply to HHLR Admins from comment #11)

Hey Benjamin,

We updated the docs to better describe the current behavior. See https://github.com/SchedMD/slurm/commit/5bc5a02a2a42f000c70b1cb00447347ad51add0f.
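The `if tty >/dev/null; then …` profile guard mentioned in the quoted feedback can be sketched as follows. The variable names are hypothetical; the idea is that `tty` fails in the non-interactive shell spawned by `su - <user> -c /usr/bin/env`, so `--get-user-env` would pick up only a minimal environment:

```shell
# Hypothetical snippet for a user's ~/.profile:
# export interactive-only settings when stdin is attached to a terminal,
# so they never leak into batch jobs via --get-user-env.
if tty >/dev/null 2>&1; then
    export EDITOR=vim                        # example interactive-only setting
    export MY_PROJECT_ROOT="$HOME/projects"  # example interactive-only setting
fi
```

This keeps batch jobs dependent only on what the job script itself sets.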
> My experience is, that any changes to the profiles after the job submission
> might lead to a crash of these former submitted jobs. And this can be really
> hard to investigate. With a self-contained script all outside changes are
> unimportant to the submitted jobs.

> Besides specifying `if tty >/dev/null; then…` in the users's profiles to
> avoid to export anything to a batch job: is there any way to get the desired
> behavior?

This depends on the behavior desired. If you don't want any env vars for an individual *job step*, then use an additional `--export=NONE` with *srun* to prevent inheriting any env vars from the containing sbatch script. Note that `srun --export=NONE` does NOT do an implicit `--get-user-env`, unlike sbatch.

If you don't want any env variables passed to the script in the first place (from the submission host OR the compute node), the following seems to work for me:

env -i sbatch --export=ALL ./my-script.sh

`env -i` will pass an empty environment to sbatch, and `--export=ALL` will propagate that empty environment AND prevent `--get-user-env` from triggering once the script arrives on the node. See https://stackoverflow.com/questions/9671027/sanitize-environment-with-command-or-bash-script.

> What was the initial reason to force the users's environments to
> the jobs? Other queueing systems default not to do this and one has to use
> `qsub -V …` to explicitly export the current environment in rare cases.

Part of the reasoning for keeping the implicit `--get-user-env` is that the script submitted by sbatch usually needs an environment in which to run. A lot of people's scripts might rely on this without realizing it. For example, how could `srun <cmd>` work in the script if no PATH variable existed? (In practice, though, it seems that PATH, PWD, HOSTNAME, TMPDIR, and possibly some other variables are still set. I'm not sure whether this is due to Slurm, shell or Linux defaults, or something else.)
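The sanitizing effect of `env -i` can be observed without Slurm at all: it launches any command with an empty environment, after which only the few variables the invoked shell sets for itself (such as PWD and SHLVL) reappear:

```shell
# Launch bash with a completely empty environment and inspect what survives:
env -i bash -c 'env'
# Inherited variables such as HOME are gone:
env -i bash -c 'echo "${HOME:-HOME is unset}"'
```

The same emptying applies when the command is sbatch, which is why combining it with `--export=ALL` propagates a near-empty environment to the job.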
The historical reason is that back when Slurm was just a resource manager and not a scheduler, other schedulers would sit on top of Slurm and schedule jobs. In particular, Moab (and Maui) would submit all jobs for all users from a daemon running as root, and Slurm didn't want jobs to inherit that root environment. Now that Slurm does its own scheduling, this no longer seems very applicable.

> "There is the option --export=NONE to `sbatch` not to forward the user's set
> environment variables to the submitted job. Unfortunately this will trigger
> --get-user-env to set the user's environment anyway, which sounds contary to
> the specified NONE. I would expect that NONE means NONE (except the set
> SLURM environment variables and a bare minimum
> PATH=/bin:/sbin:/usr/bin:/usr/sbin or alike), but instead the behavior is
> something which might be called --export=USERDEFAULTS.

Thanks for the input; I like the idea of --export=USERDEFAULTS, because it is explicit and intuitive. We'll need to discuss this internally to see what makes sense for the future.

Let me know if this works for you!

Thanks,
Michael

---

If this doesn't work for you, let me know. Closing out ticket.

Thanks,
Michael
Created attachment 8317 [details]
excerpts from slurmd log files from nodes

Hi,

I recently noticed that the --export flag can result in batch jobs that just idle, run into their time limit, and never write any output/error files.

- Not using the flag causes no problems.
- Using --export=ALL works fine.
- Using --export=NONE (or =<some variables of choice>) with an sbatch job causes the job to idle and not write the usual output/error files.
- Using --export=NONE (or =<some variables of choice>) with an interactive srun job (with --pty /bin/bash) works fine.

Slurmd logs from the latter two cases are attached.

In the case of the problematic job that does nothing, the prolog scripts are executed successfully, but then there are some troubling entries in the slurmd log:

- error: User requested launch of zero tasks!
- fatal: Could not locate command: /usr/bin/su
- error: Unable to unlink domain socket: No such file or directory

I do not understand why there is this message about requesting zero tasks, as the script properly specified -n 4. The message about /usr/bin/su is correct, as the node in question does not have the su binary. However, why this is necessary only when the user environment is not fully exported is unclear to me. I guess that the issue with the socket follows from the earlier problems and might be related to there being no output/error file.

Best,
Benjamin
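A minimal batch script of the kind described above might look like the following. This is a hypothetical reconstruction for illustration, not the reporter's actual script, and it requires a Slurm 17.11 cluster to reproduce the symptom:

```
#!/bin/bash
#SBATCH -n 4              # four tasks requested, yet slurmd logs "zero tasks"
#SBATCH --export=NONE     # triggers the implicit --get-user-env on 17.11
#SBATCH -o job-%j.out
#SBATCH -e job-%j.err

srun -n 4 hostname
```

On a node without /usr/bin/su, submitting this would hit the fatal error described in the attached logs; with su present (or without --export=NONE) it should run normally.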