| Summary: | srun jobs not getting correct environment |
|---|---|
| Product: | Slurm |
| Component: | Configuration |
| Version: | 21.08.2 |
| Hardware: | Linux |
| OS: | Linux |
| Reporter: | Gordon Dexter <gmdexter> |
| Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN |
| Severity: | 4 - Minor Issue |
| Site: | Johns Hopkins Univ. HLTCOE |
Description
Gordon Dexter, 2021-12-14 13:27:12 MST

Our users start interactive jobs with a wrapper script which we maintain. This wrapper script calls:

    srun -I --pty --export=HOME,TERM /bin/bash

Running this gets the user's environment by running their profile on the allocated node. However, it doesn't get the full environment: variables set in .bash_profile are missing.

As far as I can tell, Slurm gets the user's environment with the command `su - username -c "slurmstepd getenv"` when run via sbatch, but uses `su username -c "slurmstepd getenv"` when run via srun. Note the missing ` - `, meaning that the user does not get a proper login environment when using Slurm interactively.

The `--get-user-env=L` option sounds like it should solve this, but it's not available for srun (even though srun is implicitly starting a whole new job), and it doesn't work with salloc because our SlurmUser isn't root. Is there some way to force this setting on all jobs?

Reply
Ben Roberts

Please attach your slurm.conf & friends.

(In reply to Gordon Dexter from comment #0)
> This wrapper script calls srun -I --pty --export=HOME,TERM /bin/bash

Has any consideration been given to converting to interactive salloc jobs instead, via:

    LaunchParameters=use_interactive_step

Your installed version supports it, and I would expect it to make life easier than requiring a wrapper script.

> However, it doesn't get the full environment: variables set in .bash_profile are missing.

Is the .bash_profile in the user's home directory the same on every node?

> The --get-user-env=L sounds like it should solve this [...] Is there some way to force this setting on all jobs?

`--get-user-env` is probably not the solution you're looking for, as that option is for privileged users to submit jobs as a given user; it is generally expected that a user's environment will be correct at job submission. However, the `--export` argument exists to allow users to customize the job's starting environment. Sites that have complex environment needs will often use `--export=NONE` and then source a preferred configuration or a preferred set of environment modules.
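Ben's salloc suggestion corresponds to a slurm.conf setting. A minimal sketch, assuming a stock configuration (the InteractiveStepOptions value is the one this thread cites as the current default, shown only for reference):

```
# slurm.conf fragment (sketch): have salloc launch an interactive shell
# on the allocated node itself, removing the need for an srun wrapper.
LaunchParameters=use_interactive_step

# Optionally override what the interactive step runs; this is the
# default value quoted in this ticket.
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"
```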
Reply
Gordon Dexter

We generally prefer that jobs start in a fresh login environment, to eliminate a possible source of variation and error between jobs, and because some environment variables, like HOSTNAME or PATH, need to vary between nodes (which have different software loads than the submit hosts). We originally used --export=NONE but found that it resulted in $HOME and $USER being empty. Additionally, $TERM should be carried over for interactive jobs.

I did find a workaround of sorts: invoking /bin/bash with -l makes it act like a login shell. HOME, USER, and TERM are still not set unless you export them, but other variables set in .bash_profile are. So changing our script to add -l after the shell seems to work. Similarly, adding -i to force bash to be interactive seems to fix the user aliases, which were also a problem before.

As for using salloc with LaunchParameters=use_interactive_step, that's an interesting alternative. Instead of

    InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"

we'd probably have something like:

    InteractiveStepOptions="--export=HOME,USER,TERM --pty $SHELL -l"

This is because we want our 'srsh' wrapper script to behave more or less like qrsh (or even a normal ssh session) and not require users to prepend 'srun' to their commands. However, this doesn't give us the flexibility to have our wrapper script insert --x11 into the arguments when the user's DISPLAY is set, and --x11 causes the command to fail if DISPLAY is not set. It would be nice if srun had some '--x11-optional' argument that enabled X11 if possible but still let the session start normally if not. If you have any other suggestions to get this behavior, I'm all ears; otherwise this ticket can be closed.

Reply
Ben Roberts

Hi Gordon,

It sounds like you have a good idea of how to make sure jobs have the right environment with salloc. I was trying to come up with a way to do this using a job submit filter.
You should be able to look for the presence of the DISPLAY variable, but our current submit filter implementation doesn't allow you to set the x11 flag for a job submission. You are also right that there isn't an optional x11 flag that would cover both the case where DISPLAY is set and the case where it isn't. I'm afraid I don't have a solution for setting the x11 flag on only some jobs. I'll go ahead and close the ticket, as you suggested. If you have any follow-up questions, feel free to update the ticket and I'll get back to you.

Thanks,
Ben

Reply
Ben Roberts

Hi Gordon,

I was thinking about this some more, and we may actually be able to use the cli_filter plugin, rather than the job_submit plugin, to do what you want with the x11 flag. This is a client-side plugin rather than one that executes on the controller. It allows you to look at a job at various stages as it is being submitted and add flags if necessary. The thing to keep in mind is that since the cli_filter executes client-side, there is the possibility that users can circumvent it; but in your case the plugin would be designed to make their lives easier rather than to impose restrictions. You can find documentation for the plugin here: https://slurm.schedmd.com/cli_filter_plugins.html

Let me know if this sounds like something that would work.

Thanks,
Ben

Reply
Gordon Dexter

Yes, the cli_filter does sound like a possibility; we already use it for several other things. For now our wrapper script works well for us, and our users are used to it. Honestly, if we did go the salloc/use_interactive_step route, we would probably keep 'srsh', just as a wrapper or alias for salloc instead of srun. Thank you for your help.
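The wrapper workaround discussed in this thread (conditional --x11, plus bash -l -i to get a login, interactive shell) can be sketched as follows. The name 'srsh' and the srun flags come from the ticket; the script body itself is an assumption, not the site's actual wrapper, and the SRUN variable is overridable purely so the argument-building logic can be exercised without a cluster:

```shell
#!/usr/bin/env bash
# srsh (sketch): qrsh-like interactive shell via srun.

srsh() {
    # Flags from the ticket: immediate allocation, pseudo-terminal,
    # carry over only HOME, USER and TERM.
    local -a args=(-I --pty --export=HOME,USER,TERM)

    # Add --x11 only when the user actually has a display: a bare
    # --x11 makes the command fail when DISPLAY is unset.
    if [ -n "$DISPLAY" ]; then
        args+=(--x11)
    fi

    # bash -l reads .bash_profile (login shell); -i makes it
    # interactive so user aliases are loaded.
    "${SRUN:-srun}" "${args[@]}" /bin/bash -l -i
}
```

Running `SRUN=echo srsh` prints the constructed argument list instead of launching a job, which is a convenient way to check the DISPLAY handling without Slurm installed.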