Ticket 5537

Summary: SRUN_EXPORT
Product: Slurm
Reporter: Daniel Grimwood <daniel.grimwood>
Component: User Commands
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: matthews
Version: 17.02.9   
Hardware: Linux   
OS: Linux   
Site: Pawsey
Version Fixed: 19.05.0pre4

Description Daniel Grimwood 2018-08-07 22:14:37 MDT
Hi,

can we have a SRUN_EXPORT equivalent of SBATCH_EXPORT please?

Due to cross-cluster workflows and the benefits of starting with a clean environment, we often use
sbatch --export=NONE
but within jobscripts we then need to do
srun --export=ALL
because the sbatch line has set SLURM_EXPORT_ENV which srun then inherits.  If there's a SRUN_EXPORT then that would presumably take priority over SLURM_EXPORT_ENV.

Ideally in our slurm modulefile we'd like to set both
SBATCH_EXPORT=NONE
SRUN_EXPORT=ALL
as the defaults and then let users override these.

With regards,
Daniel.
Comment 1 Tim Wickberg 2018-08-08 00:58:42 MDT
The main issue I can see with this is inconsistent behavior if your users use srun commands off the login nodes. They'd then export the environment in that situation - but not in batch scripts - which could further confuse matters.

I believe you can accomplish what you're after today through the use of a trivial TaskProlog script:

===========
#!/bin/sh

# Slurm applies "export NAME=value" lines printed by a TaskProlog to the task environment.
echo "export SLURM_EXPORT_ENV=ALL"
===========

This would ensure an srun within a batch job has that setting overridden, but not for salloc or srun commands directly launched from the login nodes.

Let me know if that works for you; in some quick testing here it appears to handle the use case you described.

cheers,
- Tim
Comment 2 Daniel Grimwood 2018-08-08 01:18:04 MDT
Hi Tim,

thanks for the quick reply and sharing your concerns.  We'll discuss and test it internally.

I like the TaskProlog script but am a bit concerned about blanket overwriting of SLURM_EXPORT_ENV. I think we can do a bash test for whether SBATCH_EXPORT is set and only then set SLURM_EXPORT_ENV.

Part of the appeal of a new SRUN_EXPORT is that users could override it in their jobscripts and be in more control.

For the potential login node inconsistency, that can be dealt with by only having the slurm modulefile on the compute nodes do the export, while the login nodes do nothing.  Having said that, we only promote the use of srun to our users inside of sbatch and salloc.

With regards,
Daniel.
Comment 3 Daniel Grimwood 2018-08-12 21:20:14 MDT
Hi Tim,

we discussed internally and are not in favour of modifying the environment from within a TaskProlog script.  This can result in unexpected behaviour for our more advanced users, who may be changing these environment variables themselves.

We could set a SRUN_EXPORT=NONE on login nodes if that addresses your concerns about interactive srun.

Do you have any other suggestions?

With regards,
Daniel.
Comment 4 Tim Wickberg 2018-08-12 22:02:10 MDT
(In reply to Daniel Grimwood from comment #3)
> Hi Tim,
> 
> we discussed internally and are not in favour of modifying the environment
> from within a TaskProlog script.  This can result in unexpected behaviour
> for our more advanced users, who may be changing these environment variables
> themselves.

Why don't you just test in the TaskProlog if the environment variable was set by the user?

The TaskProlog script is launched with a copy of the user's environment (as it existed when the step was submitted).

> We could set a SRUN_EXPORT=NONE on login nodes if that addresses your
> concerns about interactive srun.

That still gets messy - now Slurm would have to decide to strip that back out of the environment or not, or it'd be getting propagated in some locations that would cause problems.

I'd suggest you revisit the TaskProlog - something like the following seems to address all your concerns, and is something you could drop in place today:

if [ -z "${SLURM_EXPORT_ENV}" ]; then
    echo "export SLURM_EXPORT_ENV=ALL"
fi
Comment 5 Daniel Grimwood 2018-08-14 00:54:37 MDT
Thanks Tim for the suggestion.

We tell our users to always set
#SBATCH --export=NONE
as it makes jobscripts reproducible, and we have quite a few workflows that span multiple Slurm clusters with different environments.  Because of this, SLURM_EXPORT_ENV is always set, so the -z test you suggested would never succeed and the override would never be applied.

We could change the test to pass when the environment variable is either unset or set to NONE, but even then it is possible that the user actually wants to export NONE in their srun.
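
A minimal sketch of that widened test, written as a TaskProlog-style script. The function name emit_export_override is hypothetical. As noted above, this cannot tell a NONE inherited from sbatch apart from a NONE the user deliberately chose for srun.

```shell
#!/bin/sh
# Hypothetical TaskProlog sketch: print an override when SLURM_EXPORT_ENV
# is unset, empty, or NONE. Slurm applies "export NAME=value" lines that a
# TaskProlog writes to stdout to the task environment.
emit_export_override() {
    case "${SLURM_EXPORT_ENV:-}" in
        ""|NONE) echo "export SLURM_EXPORT_ENV=ALL" ;;
        *) : ;;   # any other value is left untouched
    esac
}

emit_export_override
```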

Ideally for us, srun would not use the environment variable set by sbatch, so sbatch not setting SLURM_EXPORT_ENV would also work.

With regards,
Daniel.
Comment 6 Tim Wickberg 2019-04-17 23:25:45 MDT
Hey guys -

I've made it back to the states, and got a chance to look at this again based on what we'd discussed last week. I do see the value in this based on your description, and the new SRUN_EXPORT_ENV environment variable which will be in the 19.05 releases should cover your use case.

While we will not be including this in any 18.08 maintenance releases, if you desire, you should be able to back-port this easily - it's just a single line change to add the new input variable.

Commit details follow for posterity.

cheers,
- Tim

commit 9d529ae6d98d8bbea7325ac8faa9bcb20dc4d7f8
Author:     Tim Wickberg <tim@schedmd.com>
AuthorDate: Wed Apr 17 23:14:54 2019 -0600

    Add SRUN_EXPORT_ENV as input to srun.
    
    Overrides any setting for SLURM_EXPORT_ENV, which can make nesting jobs
    simpler.
    
    If SBATCH_EXPORT_ENV=NONE (which will cause SLURM_EXPORT_ENV=NONE to be set
    in the batch step) is used alongside SRUN_EXPORT_ENV=ALL, this allows for
    the batch environment to be reset, but then for changes made in the batch
    script (e.g., loading modules with 'module load') to propagate out as part
    of the step launch.
    
    The same can be accomplished by the user in their scripts by explicitly
    setting 'srun --export=ALL ...' for every step launch, but this should
    provide an easier mechanism for sites to make this behavior the default
    for their users by pushing this pair of environment variables into their
    users default profiles.
    
    Bug 5537.
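
The precedence the commit message describes can be sketched as a small shell function. This is an illustration only, not srun's actual code: the function name effective_srun_export is hypothetical, and both the command-line option taking priority and the fallback to ALL are assumptions based on srun's documented behaviour.

```shell
#!/bin/sh
# Illustrative only: which export setting an srun step would end up with.
# $1 is the value of an explicit --export= option, or empty if none was given.
effective_srun_export() {
    if [ -n "${1:-}" ]; then
        echo "$1"                   # assumed: command-line option wins
    elif [ -n "${SRUN_EXPORT_ENV:-}" ]; then
        echo "$SRUN_EXPORT_ENV"     # new in 19.05, overrides SLURM_EXPORT_ENV
    elif [ -n "${SLURM_EXPORT_ENV:-}" ]; then
        echo "$SLURM_EXPORT_ENV"    # e.g. NONE inherited from sbatch
    else
        echo "ALL"                  # assumed default when nothing is set
    fi
}

# Inside a batch step submitted with --export=NONE, with the site default set:
SLURM_EXPORT_ENV=NONE
SRUN_EXPORT_ENV=ALL
effective_srun_export          # prints ALL
effective_srun_export NONE     # prints NONE
```

With only SLURM_EXPORT_ENV=NONE set, as in releases before 19.05, the same function prints NONE, which is the behaviour the original report was working around with 'srun --export=ALL'.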