Ticket 5315

Summary: Aliases set in custom module not working if called from submission script
Product: Slurm    Reporter: Marco <marco.delapierre>
Component: User Commands    Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
CC: alex
Version: 17.02.9
Hardware: Cray XC
OS: Linux
Site: Pawsey
CLE Version: CLE 6.0 update 05

Description Marco 2018-06-14 01:25:43 MDT
Hi there,
we have a user at our HPC centre who sets up aliases in the following way (TCL Environment Modules in use):
1. user's .profile loads a custom module
2. this custom module sources a TCL file with additional procedure (i.e. function) definitions
3. the custom module uses the custom procedures to load custom modules
4. some of the custom modules, among other actions, create aliases

What happens is that these aliases do not work when called in certain situations.

When do they work?
- normal interactive shells (SLURM not used)
- script executed using bash -li $scriptname, with -i required to expand aliases (SLURM not used)
- interactive SLURM session through "salloc"

When don't they work?
- script executed using SLURM sbatch (shebang #!/bin/bash -l)

I have played around with this non-working case and found that:
- aliases work if the custom module is loaded inside the SLURM script rather than at point 1. above
- aliases work with the setup above, provided that at point 3. "module load" is used rather than the custom TCL procedures
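For reference, the core behaviour can be reproduced without Slurm at all. A minimal sketch (script path and alias name are made up for illustration):

```shell
# Aliases are local to the shell that defines them; they are not part of
# the exported environment, so a separately spawned script never sees them.
alias hi='echo hello'            # defined in the current shell only

cat > /tmp/noalias.sh <<'EOF'
#!/bin/bash
type hi 2>/dev/null || echo "hi: not found"
EOF
chmod +x /tmp/noalias.sh
/tmp/noalias.sh                  # prints "hi: not found"
```

This is the same boundary a batch script crosses: whatever aliases exist on the login node, they do not travel into the shell that runs the script.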

Besides the obvious workaround of loading the module inside the script, I was wondering if this issue rings any bells on your side. At the moment my feeling is that there is some subtle bad interaction between SLURM and Environment Modules, especially with regard to the way aliases are handled.

Any input would be much appreciated.

With kind regards,
Dr Marco De La Pierre
Comment 1 Alejandro Sanchez 2018-06-15 04:53:48 MDT
When a job allocation is granted for a submitted batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.

If this node hasn't loaded the custom modules that create the aliases, the batch script won't be able to do the alias substitutions, because the shell in the batch host doesn't maintain the list of aliases that were set in the submission host (login node).

It is common to load modules from within the batch script.
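As a sketch of what that looks like in practice (the module name "mytools" and the alias are hypothetical):

```shell
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# Set up the environment on the batch host itself, instead of relying on
# whatever the login-node .profile did at submission time.
shopt -s expand_aliases     # non-interactive shells skip alias expansion otherwise
module load mytools         # "mytools" is a made-up module name for illustration

# Aliases created by the module are now defined in this shell and,
# with expand_aliases set, will be expanded on subsequent lines.
mytools_alias
```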

Depending on your SallocDefaultCommand, your "salloc" session by default just executes $SHELL on the login node. A typical SallocDefaultCommand could be defined like this:

# For systems with generic resources (GRES) defined, the SallocDefaultCommand value should explicitly specify a zero count for the configured GRES
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu_bind=no --mpi=none $SHELL"

I guess your "salloc" session just executed $SHELL on the submission host; that shell sourced the .profile and triggered the chain of points 1 to 4.

If this chain isn't sourced in the batch host where the batch script is executed then the alias won't be available.

Does it make sense?
Comment 2 Marco 2018-06-18 23:51:04 MDT
Hi Alejandro,
thanks for coming back to me.
Actually, I know the batch host goes through steps 1. to 4., since the modules from point 4. do get loaded; for instance, all of the variables set by those modules are available in the batch environment.
The ONLY missing bit is the aliases set by some of those modules.
Comment 3 Alejandro Sanchez 2018-06-19 05:25:35 MDT
Hey Marco,

Although I doubt this is a Slurm problem, and it is more a how bash and bash builtins work when executing scripts, I've been digging a bit into the bash man page and found this:

       Aliases are not expanded when the shell is not interactive, unless the expand_aliases shell option is set using shopt (see the description of shopt under SHELL BUILTIN COMMANDS below).

So I tried this and it seems the alias defined in the ~/.profile file is properly expanded within the executed script when using 'bash -l' + 'shopt -s expand_aliases'.

alex@smd-server:~/tests$ ssh smd1 'grep alex ~/.profile'
alias alex='hostname -s'
alex@smd-server:~/tests$ cat test.bash
#!/bin/bash -l
shopt -s expand_aliases
alex
alex@smd-server:~/tests$ sbatch -w smd1 test.bash
Submitted batch job 20028
alex@smd-server:~/tests$ cat slurm-20028.out 
smd1
alex@smd-server:~/tests$

Could you try this out and see if it works for you? Thank you.
Comment 4 Marco 2018-06-19 23:42:56 MDT
Hi Alejandro,

thanks for sharing this. I am actually using bash -l, and have already tried the shopt tip without success.
I agree with you that this is most probably not a Slurm problem, and besides bash there are also the Environment Modules adding complexity to the situation. I filed this ticket mainly to ask whether you had seen this before on the Slurm side.
I have just a couple of last related questions then:
- are there any Slurm configuration variables/settings that could change this behaviour?
- what are the differences between how salloc and sbatch create their shell environments? (as a starting point, you previously mentioned SallocDefaultCommand)

Many thanks, best regards,
Marco
Comment 5 Alejandro Sanchez 2018-06-20 05:30:32 MDT
I think this FAQ entry is very related to this:

https://slurm.schedmd.com/faq.html#user_env

So when executing the job's spawned applications, Slurm doesn't source the ~/.profile and/or ~/.bashrc files.

There's the --export option to set which environment variables are propagated to the compute nodes (all by default). But bash doesn't store the list of alias substitutions in the environment, so changing that value won't do anything. With an SallocDefaultCommand like this:

alex@smd-server:~/tests$ scontrol show conf | grep -i salloc
SallocDefaultCommand    = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu_bind=no --mpi=none $SHELL -l
alex@smd-server:~/tests$

I get the .profile sourced for 'salloc'. Same if I manually execute 'srun --pty bash -l'.

Perhaps a workaround for this would be using a TaskProlog script:

https://slurm.schedmd.com/prolog_epilog.html
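A TaskProlog script communicates with the task by printing lines to stdout; lines of the form "export NAME=VALUE" are added to the task's environment. Since aliases themselves can't be exported, one possibility (a sketch only; the file path is made up) is to export BASH_ENV, which non-interactive bash shells source on startup, pointing at a file that defines the aliases:

```shell
#!/bin/bash
# Hypothetical TaskProlog script (configured via TaskProlog in slurm.conf).
# Printing "export NAME=VALUE" adds NAME to the task's environment.
echo "export BASH_ENV=/etc/slurm/task_aliases.sh"   # made-up path for illustration
```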

Please, let me know if there's anything else you need from here.
Comment 6 Marco 2018-06-20 19:01:53 MDT
Thanks Alejandro,

your comments have been valuable for better understanding how Slurm works under the hood.
For this specific case, as only one user is having issues, I think it is not worth changing system-wide Slurm configurations.
As he is only having trouble with scripts submitted via sbatch, I will advise him to load the modules from within the script rather than in his .profile.

Thanks again for your support, 
with kind regards,
Marco
Comment 7 Alejandro Sanchez 2018-06-21 03:41:10 MDT
(In reply to Marco from comment #6)
> Thanks Alejandro,
> 
> your comments have been valuable for better understanding how Slurm works
> under the hood.
> For this specific case, as only one user is having issues, I think it is not
> worth changing system-wide Slurm configurations.
> As he is only having troubles with scripts submitted with sbatch, I will
> advise him to load the modules from within the script rather than in his
> .profile. 
> 
> Thanks again for your support, 
> with kind regards,
> Marco

All right. Anyhow, as I tested in my comment 3, a combination of bash -l plus the shopt tip does source the .profile for me. What I think is that there's something odd happening in between the rest of your 1-4 steps with the TCL alias generation. I'm going to go ahead and close this. Please reopen if there's anything else.