Ticket 7734 - slurm environment changed after slurm upgraded from 18.08.7 to 19.05.2
Summary: slurm environment changed after slurm upgraded from 18.08.7 to 19.05.2
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 19.05.2
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact: Ben Roberts
 
Reported: 2019-09-11 23:43 MDT by whong
Modified: 2020-01-21 12:42 MST

See Also:
Site: Swinburne
Version Fixed: 19.05.6 20.02.pre1


Description whong 2019-09-11 23:43:46 MDT
We just upgraded our Slurm from 18.08.7 to 19.05.2.1 last Thursday. The upgrade process itself went well: we built the new Slurm version, performed a live DB migration, restarted slurmdbd and slurmctld, rebuilt all MPIs against the new version of Slurm, and restarted all slurmd daemons on compute and login nodes.

Then we ran into issues with srun jobs. It seems some environment handling changed in the new release; it is not documented and was hard to figure out. The following environment variables are required to prevent the job environment from being polluted by the shell jobs are submitted from, and also to fix srun not inheriting the sbatch environment:

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all
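For reference, a minimal sketch of this workaround as a site-wide shell snippet (the file path is illustrative; later comments in this ticket note that SBATCH_EXPORT_ENV is not actually a documented Slurm variable):

```shell
# Illustrative /etc/profile.d-style snippet (path and comments are assumptions,
# not from Slurm docs): keep the submission shell's environment out of batch
# scripts, while letting srun steps inherit the batch script's environment.
export SBATCH_EXPORT=none      # same as sbatch --export=none
export SBATCH_EXPORT_ENV=none  # not documented by Slurm; site workaround only
export SRUN_EXPORT_ENV=all     # srun inside the job exports everything
```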

We also noticed that X11 forwarding was broken due to permission problems with XAUTHORITY in the TmpFS directory, but we managed to fix it.

Is this change intentional? Could you please take a look?

Thanks,
Wei
Comment 1 Jason Booth 2019-09-12 11:57:22 MDT
Hi Wei

> Then we ran into issues on srun jobs. It seems some environment changed on the new release. It is not documented and hard to figure out. 

Could you describe what you are seeing on your end and what you noticed has changed? It is also not clear whether you are using the *_EXPORT options to help resolve the issue you are seeing, or whether you are just mentioning them in association with the problem.
Comment 3 Lewis Lakerink 2019-09-15 16:24:57 MDT
Ignoring X11 forwarding, because the permissions problem was specific to our setup and is now working; the x11 changes were clearly documented, thank you.

In regards to the environment issues Wei reported, there was a seemingly undocumented change to how sbatch handles the environment variables from the user's shell at sbatch time.

What used to work in 18 was the following:
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all

However, with 19, we additionally need to add the following to get the same behavior as 18:
SBATCH_EXPORT=none

That is to say, these three environment variables must be set in order to get the same functionality as 18, and the additional SBATCH_EXPORT environment variable was not in any of the release notes or news that I could find.

Without the SBATCH_EXPORT environment variable set in the user's environment at submission time, srun within a batch script wouldn't pick up the environment set up by the batch job.

And without SBATCH_EXPORT_ENV and SRUN_EXPORT_ENV, the user's environment at submission time would pollute the sbatch job environment.

While we've fixed this and operations have returned to normal on our cluster, we're just wondering:
A. whether we're unintentionally abusing these environment variables to control sbatch/srun when we don't have to, or
B. whether how Slurm handles SBATCH_EXPORT changed without being documented in the news or changelogs?

Cheers,
Lewis
Comment 4 Lewis Lakerink 2019-09-15 22:36:20 MDT
Sorry, that was a mistake.

For 18 we only had SBATCH_EXPORT=none; to get the same functionality on 19, we had to use all three above.
Comment 9 Nate Rini 2019-09-25 16:38:29 MDT
(In reply to Lewis Lakerink from comment #3)
> In regards to the environment issues Wei reported, there was a - seemingly -
> undocumented change to how sbatch handles the environment variables from the
> user's shell at sbatch time.
> 
> What used to work, in 18 was the following:-
> SRUN_EXPORT_ENV=all
> 
> However, with 19, we also additionally need to add to get the same behavior
> as 18:-
> SBATCH_EXPORT=none
> SBATCH_EXPORT_ENV=none
> SRUN_EXPORT_ENV=all
Is this what you meant per comment #4?

There is no env variable SLURM_EXPORT/SBATCH_EXPORT_ENV, only SLURM_EXPORT/SLURM_EXPORT_ENV. Defining SBATCH_EXPORT=none and SBATCH_EXPORT_ENV=none will have no effect on Slurm.
Comment 10 Robin Humble 2019-09-25 23:57:55 MDT
(In reply to Nate Rini from comment #9)
> There is no env variable SLURM_EXPORT/SBATCH_EXPORT_ENV only
> SLURM_EXPORT/SLURM_EXPORT_ENV. Defining SBATCH_EXPORT=none and
> SBATCH_EXPORT_ENV=none will have no effect on Slurm.

there is a lot of confusion here (including your conflicting statements about SLURM_EXPORT above; the first one should be SBATCH_EXPORT?), so let me start again...

the behaviour we want (pseudocode) is srun=all for interactive use, srun=all within batch scripts, and sbatch=none so that batch script envs are clean and unpolluted by login node envs.

in slurm 17/18, we set

SBATCH_EXPORT=none

in /etc/profile.d/ scripts and this gave us the desired behaviour.

'man sbatch' says:

SBATCH_EXPORT         Same as --export

which is where we got that from.

the upgrade to 19 broke this.
in 19 there was no PATH or LD_LIBRARY_PATH set after an srun inside a batch script. i.e. in a batch script

srun /usr/bin/env

showed no PATH or LD_LIBRARY_PATH at all.
this would break all jobs that use mpirun/srun in a batch script.

to work around this change of behaviour, we tried

SBATCH_EXPORT=none
SRUN_EXPORT_ENV=all

and this didn't fix it.
we tried a bunch of other things too.
in the end we found the combo

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all

which works to restore the pre-19 behaviour.

we were aware that SBATCH_EXPORT_ENV probably doesn't exist, but for some reason that helped. I'm 99% sure we tried SLURM_EXPORT_ENV and SLURM_EXPORT (undocumented) too, but that didn't fix it.

we also had a queue full of 1000s of jobs which already had SBATCH_EXPORT=none in their envs, so there was no way we could unset it. there probably still isn't.

I hope I haven't further muddied the waters with a mistake in the above! :)

cheers,
robin
Comment 12 Nate Rini 2019-09-26 12:48:42 MDT
Robin,

On one of your users, can you please call the following (as root):
> root@host# su - $USERID -c /usr/bin/env |grep PATH

Thanks,
--Nate
Comment 17 Robin Humble 2019-09-30 08:19:01 MDT
Hi Nate,

(In reply to Nate Rini from comment #12)
> On one of your users, can you please call the following (as root):
> > root@host# su - $USERID -c /usr/bin/env |grep PATH

sure.

 # su - someUsername -c /usr/bin/env | egrep 'PATH|SLURM|SBATCH|SRUN' | sort
LD_LIBRARY_PATH=/apps/slurm/latest/lib:/apps/slurm/latest/lib/slurm:/opt/nvidia/latest/usr/lib64
LIBRARY_PATH=/apps/slurm/latest/lib:/opt/nvidia/latest/usr/lib64
__LMOD_REF_COUNT_LD_LIBRARY_PATH=/apps/slurm/latest/lib:1;/apps/slurm/latest/lib/slurm:1;/opt/nvidia/latest/usr/lib64:1
__LMOD_REF_COUNT_LIBRARY_PATH=/apps/slurm/latest/lib:1;/opt/nvidia/latest/usr/lib64:1
__LMOD_REF_COUNT_MANPATH=/apps/slurm/latest/share/man:1;/opt/nvidia/latest/usr/share/man:1;/apps/lmod/lmod/lmod/share/man:1
__LMOD_REF_COUNT_MODULEPATH=/apps/Modules/modulefiles:1;/opt/module/modulefiles:1
__LMOD_REF_COUNT_PATH=/apps/slurm/latest/sbin:1;/apps/slurm/latest/bin:1;/opt/nvidia/latest/usr/bin:1;/usr/lib64/qt-3.3/bin:1;/usr/local/bin:1;/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1
MANPATH=/apps/slurm/latest/share/man:/opt/nvidia/latest/usr/share/man:/apps/lmod/lmod/lmod/share/man::
MODULEPATH=/apps/Modules/modulefiles:/opt/module/modulefiles
MODULEPATH_ROOT=/apps/lmod/modulefiles
PATH=/apps/slurm/latest/sbin:/apps/slurm/latest/bin:/opt/nvidia/latest/usr/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin
SBATCH_EXPORT_ENV=none
SBATCH_EXPORT=none
SRUN_EXPORT_ENV=all

cheers,
robin
Comment 18 Nate Rini 2019-09-30 10:05:44 MDT
Robin,

Working on a patch now. This is at the very least a doc bug. I will provide updates as there is progress.

Thanks,
--Nate
Comment 25 Nate Rini 2019-09-30 12:01:55 MDT
(In reply to Robin Humble from comment #17)
> SBATCH_EXPORT_ENV=none
This should have no effect on Slurm.
> SBATCH_EXPORT=none
This should stop your extern step from getting any environment from the caller.
> SRUN_EXPORT_ENV=all

Robin,

Could you please try the following in your environment before calling sbatch?

> SBATCH_EXPORT_ENV=none
> SRUN_EXPORT_ENV=all

Thanks,
--Nate
Comment 26 Daniel Grimwood 2020-01-16 01:55:06 MST
From what I can tell, sbatch does not export SRUN_*, only SLURM_*, so that last suggestion won't work. The sbatch man page seems to mention exporting only SLURM_* too. It would help avoid confusion if sbatch also exported all variables named SRUN_* in addition to SLURM_*. I'm not sure if that should be spun off into a feature request.

As an aside, it would help if SBATCH_EXPORT, SLURM_EXPORT_ENV, and SRUN_EXPORT_ENV were all consistent in naming convention.
Comment 27 Nate Rini 2020-01-16 09:12:38 MST
(In reply to Daniel Grimwood from comment #26)
> From what I can tell, sbatch does not export SRUN_*, only SLURM_*, so that
> last question won't work.

It should:

Here is an example of what it does on my test system:
> $ env SBATCH_EXPORT_ENV=none SRUN_EXPORT_ENV=all sbatch --wrap 'printenv SBATCH_EXPORT_ENV SRUN_EXPORT_ENV'
> Submitted batch job 2460
> $ cat slurm-2460.out 
> none
> all

.

> The sbatch man page seems to only mention
> exporting SLURM_* too.  It would help and avoid confusion if sbatch also
> exports all variables named SRUN_* in addition to SLURM_*.  I'm not sure if
> that should get spawned off into a feature request.
This ticket has a pending update (under QA review) to the docs that should make this a little more clear.

> As an aside, it would help if SBATCH_EXPORT, SLURM_EXPORT_ENV, and
> SRUN_EXPORT_ENV were all consistent in naming convention.
Please submit an RFE ticket. I believe these are named this way for historical reasons but that can be changed.

Thanks
--Nate
Comment 28 Daniel Grimwood 2020-01-16 18:28:06 MST
Hi Nate,

thanks.  I've started bug 8349 about the feature request.

For your example, it looks like the naming inconsistency got you too ;).

>env SBATCH_EXPORT_ENV=none SRUN_EXPORT_ENV=all sbatch --wrap 'printenv SBATCH_EXPORT_ENV SRUN_EXPORT_ENV'
none
all

The above works because SBATCH_EXPORT_ENV does not exist, so it defaults to all.

>env SBATCH_EXPORT=none SRUN_EXPORT_ENV=all sbatch --qos=high --wrap 'printenv SBATCH_EXPORT SRUN_EXPORT_ENV'
results in no output

>env SBATCH_EXPORT=all SRUN_EXPORT_ENV=all sbatch --qos=high --wrap 'printenv SBATCH_EXPORT SRUN_EXPORT_ENV'
all
all

> env SBATCH_EXPORT=none SRUN_EXPORT_ENV=all SLURM_bla=aaa sbatch --qos=high --wrap 'printenv SBATCH_EXPORT SRUN_EXPORT_ENV SLURM_bla'
aaa

So neither SRUN_* nor SBATCH_* always gets exported; only SLURM_* does.
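The prefix rule above can be modeled with a quick shell check (a toy illustration of the observed behavior, not Slurm's implementation; the variable names are made up):

```shell
# Toy model of the observed rule: of SLURM_*, SBATCH_*, and SRUN_* variables
# present at submission time, only the SLURM_* ones are always carried into
# the job environment. Here we start a clean environment and filter by prefix.
propagated=$(env -i SLURM_bla=aaa SBATCH_EXPORT=none SRUN_EXPORT_ENV=all \
    sh -c 'env' | grep '^SLURM_')
echo "$propagated"   # only SLURM_bla=aaa matches the always-propagated set
```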

With regards,
Daniel.
Comment 29 Robin Humble 2020-01-17 00:04:00 MST
I have a slurm test setup now, so I'll attempt to re-illustrate our problem.

here is a minimalist profile.d script

$ cat /etc/profile.d/site.sh
export SBATCH_EXPORT=none
# path to slurm
export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin

and here's something a bit like a real batch script with an srun/mpirun in it (note that I don't think sbatch --wrap is enough to detect our issue).

$ cat script 
#!/bin/bash
printenv | egrep 'EXPORT|PATH'
echo add to PATH and then srun
export PATH=$PATH:/lala/lala
srun /bin/printenv | egrep 'EXPORT|PATH'

my submit shell env is as per profile.d above.

$ sbatch script
Submitted batch job 57

$ cat slurm-57.out 
SLURM_EXPORT_ENV=none
SBATCH_EXPORT=none
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
add to PATH and then srun
SLURM_EXPORT_ENV=none

note that srun now propagates no PATH at all.
all batch scripts that have srun/mpirun in them will crash.
as there was no PATH at all I even needed /bin/printenv instead of just printenv in the batch script.

additionally, that SLURM_EXPORT_ENV came out of nowhere. I didn't set it; slurm did.

let's try again with our admittedly nonsensical setup... :)

$ cat /etc/profile.d/site.sh
export SBATCH_EXPORT=none
export SBATCH_EXPORT_ENV=none
export SRUN_EXPORT_ENV=all
# path to slurm
export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin

$ sbatch script
Submitted batch job 58

$ cat slurm-58.out 
SLURM_EXPORT_ENV=none
SBATCH_EXPORT=none
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
SRUN_EXPORT_ENV=all
SBATCH_EXPORT_ENV=none
add to PATH and then srun
SLURM_EXPORT_ENV=none
SBATCH_EXPORT=none
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin:/lala/lala
SRUN_EXPORT_ENV=all
SBATCH_EXPORT_ENV=none

and now we get the expected output.
things launched by srun inside the batch script have a PATH and they run ok.

I don't pretend to know why our combo of env variables works, but it does... :-/

slurm 18 only needed a profile.d with SBATCH_EXPORT=none.
slurm 19 broke that.

cheers,
robin
Comment 38 Nate Rini 2020-01-17 19:47:55 MST
(In reply to Robin Humble from comment #29)
> $ cat slurm-57.out 
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
> add to PATH and then srun
> SLURM_EXPORT_ENV=none

I get the same on my test cluster.
 
> $ cat slurm-58.out 
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
> SRUN_EXPORT_ENV=all
> SBATCH_EXPORT_ENV=none
> add to PATH and then srun
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin:/lala/lala
> SRUN_EXPORT_ENV=all
> SBATCH_EXPORT_ENV=none

I also get the same on my test cluster with the second site.sh.

> I don't pretend to know why our combo of env variables works, but it does...

Both SBATCH_EXPORT and SRUN_EXPORT_ENV are in the Slurm source.
SBATCH_EXPORT_ENV is not in the source and is treated like any other user-set environment variable.

For the sake of completeness, I tested it with the following site.sh too:
> [fred@login ~]$ cat /etc/profile.d/site.sh 
> export SBATCH_EXPORT=none
> export SRUN_EXPORT_ENV=all
> # path to slurm
> export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin
> [fred@login ~]$ printenv SBATCH_EXPORT_ENV
> [fred@login ~]$ cat /test.sbatch 
> #!/bin/bash
> printenv | egrep 'EXPORT|PATH'
> echo add to PATH and then srun
> export PATH=$PATH:/lala/lala
> srun /bin/printenv | egrep 'EXPORT|PATH'
> [fred@login ~]$ sbatch /test.sbatch 
> Submitted batch job 2
> [fred@login ~]$ cat slurm-2.out 
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin
> SRUN_EXPORT_ENV=all
> add to PATH and then srun
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin:/lala/lala
> SRUN_EXPORT_ENV=all

I see no change locally without SBATCH_EXPORT_ENV being defined at all.

> slurm 18 only needed a profile.d with SBATCH_EXPORT=none.

I tested the same on 18.08.9:
> [fred@login ~]$ cat /etc/profile.d/site.sh 
> export SBATCH_EXPORT=none
> export SRUN_EXPORT_ENV=all
> # path to slurm
> export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin
> [fred@login ~]$ printenv SBATCH_EXPORT_ENV
> [fred@login ~]$ cat /test.sbatch 
> #!/bin/bash
> printenv | egrep 'EXPORT|PATH'
> echo add to PATH and then srun
> export PATH=$PATH:/lala/lala
> srun /bin/printenv | egrep 'EXPORT|PATH'
> [fred@login ~]$  sbatch /test.sbatch 
> Submitted batch job 2
> [fred@login ~]$ cat slurm-2.out 
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin
> SRUN_EXPORT_ENV=all
> add to PATH and then srun
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin:/lala/lala
> SRUN_EXPORT_ENV=all
Correct, 18.08 did not require setting SRUN_EXPORT_ENV.

> slurm 19 broke that.

Slurm 19.05 had a major rewrite of how the user arguments and environment are handled (to add cli_filter). In this specific case, a previously unknown bug happened to be fixed. Per the sbatch man page in 18.08, SLURM_EXPORT_ENV should be set by sbatch in the job.

18.08:
> [fred@login ~]$ sbatch --wrap 'printenv SLURM_EXPORT_ENV'
> Submitted batch job 3
> [fred@login ~]$ cat slurm-3.out 
> [fred@login ~]$ 

19.05:
> [fred@login ~]$ sbatch --wrap 'printenv SLURM_EXPORT_ENV'
> Submitted batch job 2
> [fred@login ~]$ cat slurm-2.out 
> none

By testing a very simple job, we can see that sbatch in 18.08 is not setting SLURM_EXPORT_ENV when it should be. This in effect caused your jobs to work, even though they should not have in 18.08.

(In reply to Daniel Grimwood from comment #28)
> For your example, it looks like the naming inconsistency got you too ;).
> 
> >env SBATCH_EXPORT_ENV=none SRUN_EXPORT_ENV=all sbatch --wrap 'printenv SBATCH_EXPORT_ENV SRUN_EXPORT_ENV'
> none
> all
Good catch, it should have been SBATCH_EXPORT.
 
> So neither SRUN_* or SBATCH_* get exported always, only SLURM_*.
What is exported should be determined by '--export', SBATCH_EXPORT, SLURM_EXPORT_ENV, and SRUN_EXPORT_ENV. What is or is not exported by Slurm is (or at least should be) documented in the man page of each executable under "OUTPUT ENVIRONMENT VARIABLES". In an attempt to make this less confusing, the man pages have been updated with examples (via https://github.com/SchedMD/slurm/commit/adf9f8f477e6f1668017238a4bea6312b1c0670b).

Please also note there is a slightly related bug #7586 about srun not setting SLURM_EXPORT_ENV, which should be upstream soon.

Please tell me if you have any more questions or issues.

Thanks,
--Nate
Comment 39 Daniel Grimwood 2020-01-19 19:00:59 MST
Thanks Nate.
Yes, it's working as documented in the man page.
I will put in another feature request so that SRUN_* also gets exported by sbatch, so it won't be necessary to set SRUN_EXPORT_ENV in /etc/profile.d/site.sh (or equivalent) on the compute nodes.

Regards,
Daniel.
Comment 40 Nate Rini 2020-01-20 10:43:39 MST
Daniel,

Closing ticket per your response.

Thanks,
--Nate