We upgraded Slurm from 18.08.7 to 19.05.2.1 last Thursday. The upgrade itself went well: we built the new Slurm version, performed a live database migration, and restarted slurmdbd and slurmctld. We rebuilt all MPIs against the new version of Slurm and restarted all slurmd daemons on the compute and login nodes.

Then we ran into issues with srun jobs. It seems some environment handling changed in the new release. The change is not documented and was hard to figure out. The following environment variables are now required to prevent the job environment from being polluted by the shell that jobs are submitted from, and to fix the problem of srun not inheriting the sbatch environment:

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all

We also noticed that X11 forwarding was broken due to permission problems with XAUTHORITY in the TmpFS directory, but we managed to fix that.

Is this change on purpose? Could you please take a look?

Thanks,
Wei
Hi Wei,

> Then we ran into issues on srun jobs. It seems some environment changed on the new release. It is not documented and hard to figure out.

Would you describe what you are seeing on your end and what you noticed has changed? It is also not clear whether you are using the *_EXPORT variables to resolve the issue or are just mentioning them as related to the problem.
Ignoring X11 forwarding, because the permissions problem was specific to our setup and is now working. The X11 changes were clearly documented, thank you.

Regarding the environment issues Wei reported, there was a seemingly undocumented change to how sbatch handles the environment variables from the user's shell at sbatch time.

What used to work in 18 was the following:

SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all

However, with 19, we additionally need to add the following to get the same behavior as 18:

SBATCH_EXPORT=none

That is to say, all three environment variables must be set to get the same functionality as 18, and the additional SBATCH_EXPORT environment variable was not in any of the release notes or news that I could find.

Without SBATCH_EXPORT set in the user's environment at submission time, srun within a batch script wouldn't pick up the environment set up by the batch job. And without SBATCH_EXPORT_ENV and SRUN_EXPORT_ENV, the user's environment at submission time would pollute the sbatch job environment.

While we've fixed this and operations have returned to normal on our cluster, we're wondering whether:

A. We're unintentionally abusing the environment variables to control sbatch/srun when we don't have to, or
B. How Slurm handles SBATCH_EXPORT changed and was not documented in the news or changelogs?

Cheers,
Lewis
Sorry, that was a mistake. For 18 we only had SBATCH_EXPORT=none; to get the same functionality on 19, we had to use all three above.
(In reply to Lewis Lakerink from comment #3)
> In regards to the environment issues Wei reported, there was a - seemingly -
> undocumented change to how sbatch handles the environment variables from the
> user's shell at sbatch time.
>
> What used to work, in 18 was the following:-
> SRUN_EXPORT_ENV=all
>
> However, with 19, we also additionally need to add to get the same behavior
> as 18:-
> SBATCH_EXPORT=none
> SBATCH_EXPORT_ENV=none
> SRUN_EXPORT_ENV=all

Is this what you meant per comment #4?

There is no env variable SLURM_EXPORT/SBATCH_EXPORT_ENV, only SLURM_EXPORT/SLURM_EXPORT_ENV. Defining SBATCH_EXPORT=none and SBATCH_EXPORT_ENV=none will have no effect on Slurm.
(In reply to Nate Rini from comment #9)
> There is no env variable SLURM_EXPORT/SBATCH_EXPORT_ENV only
> SLURM_EXPORT/SLURM_EXPORT_ENV. Defining SBATCH_EXPORT=none and
> SBATCH_EXPORT_ENV=none will have no effect on Slurm.

there is lots of confusion here (including your conflicting statements about SLURM_EXPORT above - the first one should be SBATCH_EXPORT?), so let me start again...

the behaviour we want (pseudocode) is srun=all for interactive use, srun=all within batch scripts, and sbatch=none so that batch script envs are clean and unpolluted by login node envs.

in slurm 17/18, we set SBATCH_EXPORT=none in /etc/profile.d/ scripts and this gave us the desired behaviour. 'man sbatch' says:

SBATCH_EXPORT Same as --export

which is where we got that from.

the upgrade to 19 broke this. in 19 there was no PATH or LD_LIBRARY_PATH set after a srun inside a batch script, ie. in a batch script, srun /usr/bin/env showed no PATH or LD_LIBRARY_PATH at all. this would break all jobs that use mpirun/srun in a batch script.

to work around this change of behaviour, we tried

SBATCH_EXPORT=none
SRUN_EXPORT_ENV=all

and this didn't fix it. we tried a bunch of other things too. in the end we found the combo

SBATCH_EXPORT=none
SBATCH_EXPORT_ENV=none
SRUN_EXPORT_ENV=all

which works to restore the pre-19 behaviour. we were aware that SBATCH_EXPORT_ENV probably doesn't exist, but for some reason that helped. I'm 99% sure we tried SLURM_EXPORT_ENV and SLURM_EXPORT (undocumented) too, but that didn't fix it.

we also had a queue full of 1000's of jobs which already had SBATCH_EXPORT=none in their envs, so there was no way we could unset this. there probably still isn't.

I hope I haven't further muddied the waters with a mistake in the above! :)

cheers,
robin
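The symptom robin describes, an srun step that comes up with no PATH or LD_LIBRARY_PATH, can be checked mechanically. The following is a minimal sketch and not part of Slurm: `missing_vars` is a hypothetical helper that reads env-style output on stdin and names the requested variables that are absent. The sample input is made up to stand in for the output of srun /usr/bin/env in the broken case.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm tool): read `env`-style output on stdin
# and report which of the variable names given as arguments are absent.
missing_vars() {
    input=$(cat)
    for var in "$@"; do
        # A variable is present if some line starts with "NAME=".
        if ! printf '%s\n' "$input" | grep -q "^${var}="; then
            echo "missing: $var"
        fi
    done
}

# Made-up sample standing in for `srun /usr/bin/env` in the broken case.
printf 'SLURM_EXPORT_ENV=none\nSLURM_JOB_ID=57\n' | missing_vars PATH LD_LIBRARY_PATH
```

Inside a real batch script the sample line would presumably be replaced by piping srun /usr/bin/env into the helper, so that a scrubbed step environment shows up in the job output rather than as a failed mpirun.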
Robin,

For one of your users, can you please run the following (as root):
> root@host# su - $USERID -c /usr/bin/env |grep PATH

Thanks,
--Nate
Hi Nate,

(In reply to Nate Rini from comment #12)
> On one of your users, can you please call the following (as root):
> > root@host# su - $USERID -c /usr/bin/env |grep PATH

sure.

# su - someUsername -c /usr/bin/env | egrep 'PATH|SLURM|SBATCH|SRUN' | sort
LD_LIBRARY_PATH=/apps/slurm/latest/lib:/apps/slurm/latest/lib/slurm:/opt/nvidia/latest/usr/lib64
LIBRARY_PATH=/apps/slurm/latest/lib:/opt/nvidia/latest/usr/lib64
__LMOD_REF_COUNT_LD_LIBRARY_PATH=/apps/slurm/latest/lib:1;/apps/slurm/latest/lib/slurm:1;/opt/nvidia/latest/usr/lib64:1
__LMOD_REF_COUNT_LIBRARY_PATH=/apps/slurm/latest/lib:1;/opt/nvidia/latest/usr/lib64:1
__LMOD_REF_COUNT_MANPATH=/apps/slurm/latest/share/man:1;/opt/nvidia/latest/usr/share/man:1;/apps/lmod/lmod/lmod/share/man:1
__LMOD_REF_COUNT_MODULEPATH=/apps/Modules/modulefiles:1;/opt/module/modulefiles:1
__LMOD_REF_COUNT_PATH=/apps/slurm/latest/sbin:1;/apps/slurm/latest/bin:1;/opt/nvidia/latest/usr/bin:1;/usr/lib64/qt-3.3/bin:1;/usr/local/bin:1;/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1
MANPATH=/apps/slurm/latest/share/man:/opt/nvidia/latest/usr/share/man:/apps/lmod/lmod/lmod/share/man::
MODULEPATH=/apps/Modules/modulefiles:/opt/module/modulefiles
MODULEPATH_ROOT=/apps/lmod/modulefiles
PATH=/apps/slurm/latest/sbin:/apps/slurm/latest/bin:/opt/nvidia/latest/usr/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin
SBATCH_EXPORT_ENV=none
SBATCH_EXPORT=none
SRUN_EXPORT_ENV=all

cheers,
robin
Robin,

Working on a patch now. This is at the very least a documentation bug. I will provide updates as there is progress.

Thanks,
--Nate
(In reply to Robin Humble from comment #17)
> SBATCH_EXPORT_ENV=none

This should have no effect on Slurm.

> SBATCH_EXPORT=none

This should stop your extern step from getting any environment from the caller.

> SRUN_EXPORT_ENV=all

Robin,

Could you please try the following in your environment before calling sbatch?
> SBATCH_EXPORT_ENV=none
> SRUN_EXPORT_ENV=all

Thanks,
--Nate
From what I can tell, sbatch does not export SRUN_*, only SLURM_*, so that last question won't work. The sbatch man page seems to only mention exporting SLURM_* too. It would help and avoid confusion if sbatch also exports all variables named SRUN_* in addition to SLURM_*. I'm not sure if that should get spawned off into a feature request. As an aside, it would help if SBATCH_EXPORT, SLURM_EXPORT_ENV, and SRUN_EXPORT_ENV were all consistent in naming convention.
(In reply to Daniel Grimwood from comment #26)
> From what I can tell, sbatch does not export SRUN_*, only SLURM_*, so that
> last question won't work.

It should. Here is an example of what it does on my test system:
> $ env SBATCH_EXPORT_ENV=none SRUN_EXPORT_ENV=all sbatch --wrap 'printenv SBATCH_EXPORT_ENV SRUN_EXPORT_ENV'
> Submitted batch job 2460
> $ cat slurm-2460.out
> none
> all

> The sbatch man page seems to only mention
> exporting SLURM_* too. It would help and avoid confusion if sbatch also
> exports all variables named SRUN_* in addition to SLURM_*. I'm not sure if
> that should get spawned off into a feature request.

This ticket has a pending update to the docs (under QA review) that should make this a little clearer.

> As an aside, it would help if SBATCH_EXPORT, SLURM_EXPORT_ENV, and
> SRUN_EXPORT_ENV were all consistent in naming convention.

Please submit an RFE ticket. I believe these are named this way for historical reasons, but that can be changed.

Thanks,
--Nate
Hi Nate,

thanks. I've started bug 8349 about the feature request.

For your example, it looks like the naming inconsistency got you too ;).

>env SBATCH_EXPORT_ENV=none SRUN_EXPORT_ENV=all sbatch --wrap 'printenv SBATCH_EXPORT_ENV SRUN_EXPORT_ENV'
none
all

The above works because SBATCH_EXPORT_ENV does not exist as a Slurm variable, so the export setting defaults to all.

>env SBATCH_EXPORT=none SRUN_EXPORT_ENV=all sbatch --qos=high --wrap 'printenv SBATCH_EXPORT SRUN_EXPORT_ENV'
results in no output

>env SBATCH_EXPORT=all SRUN_EXPORT_ENV=all sbatch --qos=high --wrap 'printenv SBATCH_EXPORT SRUN_EXPORT_ENV'
all
all

> env SBATCH_EXPORT=none SRUN_EXPORT_ENV=all SLURM_bla=aaa sbatch --qos=high --wrap 'printenv SBATCH_EXPORT SRUN_EXPORT_ENV SLURM_bla'
aaa

So neither SRUN_* nor SBATCH_* always gets exported, only SLURM_*.

With regards,
Daniel.
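Daniel's submissions suggest a simple rule for the SBATCH_EXPORT=none case: of the variables passed on the sbatch command line, only those named SLURM_* show up in the job. The sketch below is a rough model of that observation only, not of how Slurm implements it, and `forwarded_with_export_none` is a made-up name.

```shell
#!/bin/sh
# Model of the observation in this comment only (not Slurm's implementation):
# with SBATCH_EXPORT=none, of the submit-time variables, only names that
# start with SLURM_ were seen inside the job.
forwarded_with_export_none() {
    grep '^SLURM_'
}

# The three variables from the last submission in this comment.
printf 'SBATCH_EXPORT=none\nSRUN_EXPORT_ENV=all\nSLURM_bla=aaa\n' | forwarded_with_export_none
```

The filter keeps SLURM_bla=aaa and drops the other two lines, which matches the single line of output Daniel saw.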
I have a slurm test setup now, so I'll attempt to re-illustrate our problem.

here is a minimalist profile.d script

$ cat /etc/profile.d/site.sh
export SBATCH_EXPORT=none
# path to slurm
export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin

and here's something that's a bit like a real batch script with a srun/mpirun in it. (note that I don't think sbatch --wrap is enough to detect our issue.)

$ cat script
#!/bin/bash
printenv | egrep 'EXPORT|PATH'
echo add to PATH and then srun
export PATH=$PATH:/lala/lala
srun /bin/printenv | egrep 'EXPORT|PATH'

my submit shell env is as per profile.d above.

$ sbatch script
Submitted batch job 57
$ cat slurm-57.out
SLURM_EXPORT_ENV=none
SBATCH_EXPORT=none
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
add to PATH and then srun
SLURM_EXPORT_ENV=none

note that srun now propagates no PATH at all. all batch scripts that have srun/mpirun in them will crash. as there was no PATH at all I even needed /bin/printenv instead of just printenv in the batch script.

additionally that SLURM_EXPORT_ENV just came out of nowhere. I didn't set that. slurm has set it.

let's try again with our admittedly nonsensical setup... :)

$ cat /etc/profile.d/site.sh
export SBATCH_EXPORT=none
export SBATCH_EXPORT_ENV=none
export SRUN_EXPORT_ENV=all
# path to slurm
export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin

$ sbatch script
Submitted batch job 58
$ cat slurm-58.out
SLURM_EXPORT_ENV=none
SBATCH_EXPORT=none
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
SRUN_EXPORT_ENV=all
SBATCH_EXPORT_ENV=none
add to PATH and then srun
SLURM_EXPORT_ENV=none
SBATCH_EXPORT=none
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin:/lala/lala
SRUN_EXPORT_ENV=all
SBATCH_EXPORT_ENV=none

and now we get the expected output. things launched by srun inside the batch script have a PATH and they run ok.

I don't pretend to know why our combo of env variables works, but it does... :-/

slurm 18 only needed a profile.d with SBATCH_EXPORT=none. slurm 19 broke that.

cheers,
robin
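The missing PATH in job 57 can be imitated without Slurm. In this sketch, env -i merely stands in for a step whose environment has been scrubbed; it is an analogy for the symptom, not a description of what srun does internally. The second command stands in for a step that inherits the batch script's environment, including the appended /lala/lala entry.

```shell
#!/bin/sh
# Analogy only: env -i starts the child with an empty environment, much like
# the srun step in job 57 that came up without PATH. /usr/bin/env is called
# by absolute path because a scrubbed environment has no PATH to search.
scrubbed=$(env -i /usr/bin/env | grep -c '^PATH=')

# Stand-in for a step that inherits the batch script's modified environment.
inherited=$(PATH="$PATH:/lala/lala" /usr/bin/env | grep -c '^PATH=')

echo "PATH lines seen by scrubbed child:  $scrubbed"
echo "PATH lines seen by inheriting child: $inherited"
```

The scrubbed child reports zero PATH lines and the inheriting child reports one, which is the same contrast as between the second halves of slurm-57.out and slurm-58.out.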
(In reply to Robin Humble from comment #29)
> $ cat slurm-57.out
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
> add to PATH and then srun
> SLURM_EXPORT_ENV=none

I get the same on my test cluster.

> $ cat slurm-58.out
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin
> SRUN_EXPORT_ENV=all
> SBATCH_EXPORT_ENV=none
> add to PATH and then srun
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/rjh/.local/bin:/home/rjh/bin:/lala/lala
> SRUN_EXPORT_ENV=all
> SBATCH_EXPORT_ENV=none

I also get the same on my test cluster with the second site.sh.

> I don't pretend to know why our combo of env variables works, but it does...

Both SBATCH_EXPORT and SRUN_EXPORT_ENV are in the Slurm source. SBATCH_EXPORT_ENV is not in the source and should be treated like any other user-set environment variable.

For the sake of completeness, I tested it with the following site.sh too:
> [fred@login ~]$ cat /etc/profile.d/site.sh
> export SBATCH_EXPORT=none
> export SRUN_EXPORT_ENV=all
> # path to slurm
> export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin
> [fred@login ~]$ printenv SBATCH_EXPORT_ENV
> [fred@login ~]$ cat /test.sbatch
> #!/bin/bash
> printenv | egrep 'EXPORT|PATH'
> echo add to PATH and then srun
> export PATH=$PATH:/lala/lala
> srun /bin/printenv | egrep 'EXPORT|PATH'
> [fred@login ~]$ sbatch /test.sbatch
> Submitted batch job 2
> [fred@login ~]$ cat slurm-2.out
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin
> SRUN_EXPORT_ENV=all
> add to PATH and then srun
> SLURM_EXPORT_ENV=none
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin:/lala/lala
> SRUN_EXPORT_ENV=all

I see no change locally without SBATCH_EXPORT_ENV being defined at all.

> slurm 18 only needed a profile.d with SBATCH_EXPORT=none.

I tested the same on 18.08.9:
> [fred@login ~]$ cat /etc/profile.d/site.sh
> export SBATCH_EXPORT=none
> export SRUN_EXPORT_ENV=all
> # path to slurm
> export PATH=$PATH:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin
> [fred@login ~]$ printenv SBATCH_EXPORT_ENV
> [fred@login ~]$ cat /test.sbatch
> #!/bin/bash
> printenv | egrep 'EXPORT|PATH'
> echo add to PATH and then srun
> export PATH=$PATH:/lala/lala
> srun /bin/printenv | egrep 'EXPORT|PATH'
> [fred@login ~]$ sbatch /test.sbatch
> Submitted batch job 2
> [fred@login ~]$ cat slurm-2.out
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin
> SRUN_EXPORT_ENV=all
> add to PATH and then srun
> SBATCH_EXPORT=none
> PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/rjh/slurm/19.05.2.1/bin:/home/rjh/slurm/19.05.2.1/sbin:/home/fred/.local/bin:/home/fred/bin:/lala/lala
> SRUN_EXPORT_ENV=all

Correct, 18.08 did not require setting SRUN_EXPORT_ENV.

> slurm 19 broke that.

Slurm 19.05 had a major rewrite of how the user arguments and environment are handled (to add cli_filter). In this specific case, a previously unknown bug happened to be fixed. Per the sbatch man page in 18.08, SLURM_EXPORT_ENV should be set by sbatch in the job.

18.08:
> [fred@login ~]$ sbatch --wrap 'printenv SLURM_EXPORT_ENV'
> Submitted batch job 3
> [fred@login ~]$ cat slurm-3.out
> [fred@login ~]$

19.05:
> [fred@login ~]$ sbatch --wrap 'printenv SLURM_EXPORT_ENV'
> Submitted batch job 2
> [fred@login ~]$ cat slurm-2.out
> none

By testing a very simple job, we can see that sbatch in 18.08 is not setting SLURM_EXPORT_ENV when it should be. This in effect caused your jobs to work in 18.08, even though they should not have.

(In reply to Daniel Grimwood from comment #28)
> For your example, it looks like the naming inconsistency got you too ;).
>
> >env SBATCH_EXPORT_ENV=none SRUN_EXPORT_ENV=all sbatch --wrap 'printenv SBATCH_EXPORT_ENV SRUN_EXPORT_ENV'
> none
> all

Good catch, it should have been SBATCH_EXPORT.

> So neither SRUN_* or SBATCH_* get exported always, only SLURM_*.

What is exported should be determined by '--export', SBATCH_EXPORT, SLURM_EXPORT_ENV, and SRUN_EXPORT_ENV. What is or is not exported by Slurm is (or at least should be) documented in the man pages of the executable under "OUTPUT ENVIRONMENT VARIABLES".

In an attempt to make this less confusing, the man pages have been updated with examples (via https://github.com/SchedMD/slurm/commit/adf9f8f477e6f1668017238a4bea6312b1c0670b).

Please also note that there is a slightly related bug#7586 about srun not setting SLURM_EXPORT_ENV; the fix should be upstream soon.

Please tell me if you have any more questions or issues.

Thanks,
--Nate
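Nate's explanation can be summarized as a precedence rule for srun inside a batch job. The sketch below assumes that SRUN_EXPORT_ENV, when set, overrides SLURM_EXPORT_ENV, and that srun passes everything through when neither is set. That assumption is consistent with the job outputs in this ticket, but it is a model and not Slurm source; `effective_srun_export` is a hypothetical name.

```shell
#!/bin/sh
# Model (not Slurm source): which export mode srun would end up with inside
# a batch job, given the two variables discussed in this ticket.
# Assumption: SRUN_EXPORT_ENV wins over SLURM_EXPORT_ENV; default is "all".
effective_srun_export() {
    if [ -n "${SRUN_EXPORT_ENV:-}" ]; then
        echo "$SRUN_EXPORT_ENV"
    elif [ -n "${SLURM_EXPORT_ENV:-}" ]; then
        echo "$SLURM_EXPORT_ENV"
    else
        echo "all"
    fi
}

# 19.05, job 57: sbatch set SLURM_EXPORT_ENV=none and nothing overrides it.
( unset SRUN_EXPORT_ENV; SLURM_EXPORT_ENV=none; effective_srun_export )

# 19.05, job 58: SRUN_EXPORT_ENV=all from profile.d overrides it.
( SLURM_EXPORT_ENV=none; SRUN_EXPORT_ENV=all; effective_srun_export )

# 18.08: sbatch did not set SLURM_EXPORT_ENV, so the default applied.
( unset SRUN_EXPORT_ENV SLURM_EXPORT_ENV; effective_srun_export )
```

Under this model the 18.08 behaviour and the three-variable workaround give the same result for srun, which would explain why SBATCH_EXPORT_ENV was not the variable doing the work.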
Thanks Nate. Yes, it's working as documented in the man page. I will put in another feature request so that SRUN_* also gets exported by sbatch, so that it won't be necessary to set SRUN_EXPORT_ENV in /etc/profile.d/site.sh (or equivalent) on the compute nodes.

Regards,
Daniel.
Daniel, Closing ticket per your response. Thanks, --Nate