Ticket 10377

Summary: SLURM_HINT=nomultithread preventing salloc from allocating
Product: Slurm Reporter: Kevin Buckley <kevin.buckley>
Component: Regression    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact: Brian Christiansen <brian>
Severity: 4 - Minor Issue    
Priority: --- CC: brian.gilmer, cinek, slbolling
Version: 20.11.0   
Hardware: Cray XC   
OS: Linux   
Site: Pawsey Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: SUSE Machine Name: chaos
CLE Version: 6.0 UP07 PS45 Version Fixed: 20.11.3
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: slurm.conf as requested
output from broken salloc
output from salloc

Description Kevin Buckley 2020-12-06 23:15:08 MST
As you may recall from past Pawsey tickets, our response to Cray
being unable to turn off hyperthreading in their XC blades has
been to add a 

SLURM_HINT=nomultithread 

into the Slurm modulefile.
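
For context, in an Environment Modules (Tcl) modulefile that addition is a single setenv line. A sketch only; the surrounding modulefile content is assumed:

```tcl
#%Module1.0
## Sketch: the rest of Pawsey's Slurm modulefile is assumed
setenv SLURM_HINT nomultithread
```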

I've just realised that I missed this bit in the RELEASE NOTES

 -- Add support for an "Interactive Step", designed to be used with salloc to
    launch a terminal on an allocated compute node automatically. Enable by
    setting "use_interactive_step" as part of LaunchParameters.

when coming to play with 20.11 last week but, having remedied that,
I found that I was then getting

elogin$ salloc -p acceptance
Password: 
salloc: Granted job allocation 11054
salloc: Waiting for resource configuration
salloc: Nodes nid00013 are ready for job
srun: fatal: Following options are mutually exclusive: --hint, --ntasks-per-core, --threads-per-core, -B, but more than one set by environment variables.
salloc: Relinquishing job allocation 11054
salloc: Job allocation 11054 has been revoked.
elogin$

As the only thing I could see being explicitly set in that bare
salloc invocation would be the SLURM_HINT EnvVar, I decided to
unset it in a few places, after which things worked as expected.

elogin$ salloc -w nid00019
Password: 
salloc: Granted job allocation 11058
salloc: Waiting for resource configuration
salloc: Nodes nid00019 are ready for job
kbuckley@nid00019:~$
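
The transcript above also shows the new Interactive Step dropping me straight into a shell on the compute node. For reference, enabling it is the one-line slurm.conf change from the release notes (a fragment; any other LaunchParameters options would be comma-separated on the same line):

```
LaunchParameters=use_interactive_step
```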


Assuming we would like to continue with the hint, for the
remaining life of our XC resources, is there a recommended
replacement for sites (surely not just us?) that were using
it prior to 20.11?

Kevin

PS

I have seen #9670 but the suggestion there is that the breakage
would have been visible in 20.02 as well, and we haven't been
(and still aren't) seeing any issues with having the SLURM_HINT
in the modulefile with 20.02.
Comment 1 Marcin Stolarek 2020-12-07 01:42:15 MST
Kevin,

The error message you see comes from the validate_hint_option function called on the Slurm client side[1].

Could you please share the result of `env` and `salloc -vvvv` commands, please attach your 20.11 slurm.conf as well?

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/blob/c358d2dcbdf2a0fd4b7e28a2962c78a9723b2c98/src/common/slurm_opt.c#L5599-L5604
Comment 2 Kevin Buckley 2020-12-07 01:59:31 MST
Will do the env and salloc tomorrow our time Marcin,
as we will have to break the TDS by putting the EnvVar
back in, but the slurm.conf you can have shortly.
Comment 3 Kevin Buckley 2020-12-07 02:02:39 MST
Created attachment 16991 [details]
slurm.conf as requested
Comment 4 Kevin Buckley 2020-12-07 18:35:58 MST
Created attachment 17021 [details]
output from broken salloc

Four files in the tarball

env.schedmd.10377
salloc-vvvv.schedmd.10377

env.schedmd.10377.login
salloc-vvvv.schedmd.10377.login

First pair generated from a shell on the eLogin node,
second from a shell on the internal login node.

Note that salloc from the eLogin node runs a wrapper
that spawns a shell on the internal login node and 
runs the real salloc from there.

Output is pretty much the same, bar some EnvVars that
differ because eLogin nodes don't have the same CLE OS
image as the in-the-Cray login node.

HTH
Comment 5 Kevin Buckley 2020-12-07 18:43:21 MST
Created attachment 17022 [details]
output from salloc

Left out the env and salloc outputs from an environment
where the SLURM_HINT=nomultithread has been removed.
Comment 6 Kevin Buckley 2020-12-17 20:56:43 MST
As a potential way forwards, I have been investigating the possibility
of using a Lua job_submit script to check for any explicit user setting
of --threads-per-core and, if one isn't found, adding one ahead of the
job launching.

I have found that the following does what's required
(Note some of the wrapping around this has been omitted
 and that the 65534 was determined by trial and error)

  if( job_desc.threads_per_core == 65534 ) then

    job_desc.threads_per_core = 1

  end
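
For completeness, a sketch of the full script with the wrapping restored. The function names follow the Lua job_submit plugin API; note that 65534 matches Slurm's NO_VAL16 sentinel (0xfffe), meaning "option not set", which would explain the trial-and-error value:

```lua
-- Sketch of a full job_submit.lua (function names per the Lua
-- job_submit plugin API). 65534 is Slurm's NO_VAL16 sentinel
-- (0xfffe), i.e. "option not set by the user".
local NO_VAL16 = 65534

-- Stub so the script can be exercised outside slurmctld
slurm = slurm or { SUCCESS = 0, ERROR = -1 }

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- No explicit --threads-per-core from the user: default to 1
    if job_desc.threads_per_core == NO_VAL16 then
        job_desc.threads_per_core = 1
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```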

However, whilst it does what is required, there are some subtle
changes to the SLURM_ variables found in the job environment, viz.

                          No explicit            Explicit               --threads-per-core=1
                          --threads-per-core     --threads-per-core=1   added by Lua script

SLURM_CPUS_ON_NODE=       48                     24                     24
SLURM_JOB_CPUS_PER_NODE=  48                     24                     24
SLURM_NPROCS=                                     1
SLURM_NTASKS=                                     1
SLURM_TASKS_PER_NODE=     48                      1                     24
SLURM_THREADS_PER_CORE=                           1


Note the three extra EnvVars when the user explicitly asks for
  --threads-per-core=1 
and that SLURM_TASKS_PER_NODE becomes 1.

Note also that when the Lua script adds the threads_per_core=1
into the job_desc, not even SLURM_THREADS_PER_CORE gets
populated.

Is that to be expected ?

It is a pretty noddy job submission script but I'll add the 
lines here for completeness:

#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --partition=acceptance
#SBATCH --time=00:05:00
#SBATCH --job-name=IS-2194-xthi_cray
#SBATCH --export=none
#SBATCH --constraint=haswell
##BATCH --threads-per-core=1    # Uncomment for explicit user setting
#
module load craype-haswell
#
env
srun --export=all --label \
  --ntasks=24             \
  --threads-per-core=1    \
 ./xthi.prgenv_cray
Comment 8 Marcin Stolarek 2021-01-05 09:27:10 MST
Kevin,

Sorry for the delay - I've been trying to reproduce the environment differences you've seen, looking at the code and running test jobs, without success.
1)Without job_submit plugin, command:
>sbatch --wrap="env"
>Submitted batch job 59110
2)Without job_submit plugin, command:
>sbatch --threads-per-core=1 --wrap="env"
>Submitted batch job 59111
3)With the job submit plugin (using the snippet from you):
>sbatch --wrap="env"
>Submitted batch job 59112

>[root@slurmctl bug10454]# diff slurm-59110.out slurm-59112.out
>26c26
>< SLURM_JOBID=59110
>---
>> SLURM_JOBID=59112
>35c35
>< SLURM_JOB_ID=59110
>---
>> SLURM_JOB_ID=59112
>46c46
>< SLURM_TASK_PID=8010
>---
>> SLURM_TASK_PID=9243
>[root@slurmctl bug10454]# diff slurm-59111.out slurm-59112.out
>8d7
>< SLURM_THREADS_PER_CORE=1
>27c26
>< SLURM_JOBID=59111
>---
>> SLURM_JOBID=59112
>36c35
>< SLURM_JOB_ID=59111
>---
>> SLURM_JOB_ID=59112
>47c46
>< SLURM_TASK_PID=8075
>---
>> SLURM_TASK_PID=9243

Am I doing something differently than you did?

cheers,
Marcin
Comment 9 Kevin Buckley 2021-01-05 19:00:10 MST
> Am I doing something differently than you did?

Hard to tell just from the output you have supplied.

I guess the question for you, given that your output doesn't
show the differences that we are seeing, is: do any of your
output files contain the variables we are seeing the diffs
in?
Comment 15 Marcin Stolarek 2021-01-11 06:18:03 MST
Kevin,

I just wanted to let you know that the patch for the initial issue - the use of SLURM_HINT to set the default behavior is under review. 

cheers,
Marcin
Comment 16 Kevin Buckley 2021-01-11 18:01:58 MST
On 2021/01/11 21:18, bugs@schedmd.com wrote:
> https://bugs.schedmd.com/show_bug.cgi?id=10377
> 
> --- Comment #15 from Marcin Stolarek <cinek@schedmd.com> ---
> Kevin,
> 
> I just wanted to let you know that the patch for the initial issue - the use of
> SLURM_HINT to set the default behavior is under review.

Thanks Marcin,

although as far as I am concerned, this comment from another recent "bug"

     https://bugs.schedmd.com/show_bug.cgi?id=10383#c36

saying

   The problem with a workaround that goes through environment variables is that it is brittle.
   Environment variables can be set and unset.
   One should not have to rely on an environment variable to get the correct behavior.

is one I am in very much agreement with, and wish our users would
simply tell the scheduler what they want, via --threads-per-core=N,
rather than "rely" on a behaviour, introduced by a EnvVar, that they
may not even be aware of.
Comment 20 Brian F Gilmer 2021-01-12 10:18:28 MST
A case was filed by a customer:
We upgraded our XC system with KNL nodes to 7.0.UP02 and SLURM 20.11
We set the environment variable SLURM_HINT=nomultithread
When you run salloc, it sets the environment variable SLURM_THREADS_PER_CORE=1
Now if you run srun hostname it will fail with
srun: fatal: Following options are mutually exclusive: --hint, --ntasks-per-core, --threads-per-core, -B, but more than one set by environment variables.
There is a schedmd case open on this already: BUG 10377. There is not a resolution at this time.
The slurm_opt.c source did not have a validate_hint function before this release. Also, when run on an older release of Slurm, the SLURM_THREADS_PER_CORE=1 variable is not set after running salloc.

---
It is not clear if comment #15 refers to a fix for this issue.
Comment 21 Marcin Stolarek 2021-01-12 11:31:28 MST
Brian,
>It is not clear if comment #15 refers to a fix for this issue.

Yes, I was referring to the case you described.

There is a different issue Kevin mentioned in comment 6, I'll open a separate bug to get it covered.

cheers,
Marcin
Comment 22 Kevin Buckley 2021-01-12 17:47:44 MST
Probably more for Brian's benefit, but it needs to be stated that
Pawsey (or anyone else, I'd wager) should no longer need to use 

 SLURM_HINT=nomultithread 

by default, as the same outcome, in respect of multithreading, 
can be achieved by telling its users to explicitly specify

 --threads-per-core=1

however, Pawsey is still looking at how best to get that message
across to its user community.


This issue has now become one of the slightly different EnvVars
seen in the job environment, depending on how the 

 --threads-per-core=1

request is supplied to the job submission.

Hoping that's useful,
Kevin
Comment 24 Marcin Stolarek 2021-01-13 06:33:32 MST
Kevin,

I've opened Bug 10620 as a follow-up to your questions from comment 6 and added you to CC there. I agree that there is no obvious reason for the behavior.

The initial issue demonstrated by:
>export SLURM_HINT=nomultithread
>salloc -p acceptance
>[...]
>srun: fatal: Following options are mutually exclusive: --hint, --ntasks-per-core, >--threads-per-core, -B, but more than one set by environment variables.

Got fixed by f450cc6837d8581[1].

>Probably more for Brian's benefit, but it needs to be stated that
>Pawsey (or anyone else, I'd wager) should no longer need to use 
>SLURM_HINT=nomultithread 
>by default, as the same outcome, in respect of multithreading, 
>can be achieved by telling its users to explicitly specify
>--threads-per-core=1

Those are, in some sense, equivalent, since --hint=nomultithread sets --threads-per-core=1. I think that a quite good explanation of the details behind --hint=nomultithread is in Bug 8119 comment 4.

If I'm reading correctly that you'd like a more elegant solution than SLURM_HINT being exported to users' environment, you can use a Lua cli_filter plugin based on the below snippet
> function slurm_cli_setup_defaults(options, cli_type)
>         options['threads-per-core'] = 1 -- either this, or: options['hint'] = 'nomultithread'
> 
>         return slurm.SUCCESS
> end
> function slurm_cli_post_submit(options, cli_type)
>         return slurm.SUCCESS
> end
> function slurm_cli_pre_submit(options, cli_type)
>         return slurm.SUCCESS
> end
It will effectively forbid specification of the conflicting options on Slurm 20.11.0 - 20.11.2, but should be fine starting with 20.11.3, where in case of conflict --hint/SLURM_HINT will be ignored with a user message at verbose level. 
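
For completeness: assuming the snippet above is saved as cli_filter.lua alongside slurm.conf, my understanding is that loading it is a one-line slurm.conf change (fragment):

```
# Load the Lua cli_filter plugin, which reads cli_filter.lua from
# the same directory as slurm.conf
CliFilterPlugins=lua
```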

Let me know if you have further questions. In case of no reply, I'll close the bug as fixed in 20.11.3.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/f450cc6837d85812378612cdd734f9c038bfbfce
Comment 25 Nate Rini 2021-01-13 13:58:35 MST
*** Ticket 10600 has been marked as a duplicate of this ticket. ***
Comment 26 Kevin Buckley 2021-01-13 20:05:23 MST
> If I'm reading correctly that you'd like to have more elegant solution
> than SLURM_HINT being exported to users environment, you can use a lua
> cli_filter plugin based on the below snippet

Not quite.

1) I don't need a more elegant solution, as no problem exists.

2) I don't think we should ever have been using SLURM_HINT, but
   telling our user community to tell the scheduler what they 
   want, explicitly, which, for no multithreading, is to just
   request  --threads-per-core=1

3) I have already said (comment 6) that I have created a 
   Lua script that would "do the right thing" for users who 
   /forget to/don't know how to/can't remember how to/ tell
   the scheduler what they want.

The outstanding issue here is how we, Pawsey, educate our user
community to do the right thing from now on.

To that extent, along with the splitting out of the extra issue
that the investigations around this one brought to light, the 
fix for this issue probably wasn't even required, though I am
sure some people (people who like using EnvVars) will find it 
useful!

Cheers,
Kevin
Comment 27 Kevin Buckley 2021-01-13 20:12:39 MST
I notice, after Nate added a link to a "duplicate issue",
that I can't follow the link ?

Is there a way to not render the link as clickable, so 
as to not have people following things they can't access.

Just out of interest, seeing as how it is a duplicate of
this issue, what's the big secret ?
Comment 28 Marcin Stolarek 2021-01-13 22:25:07 MST
>3) I have already said (comment 6) that I have created a 
>   Lua script that would "do the right thing" for users who 
>   /forget to/don't know how to/can't remember how to/ tell
>   the scheduler what they want.

The small difference is that I'd recommend use of CliFilter instead of JobSubmit. It gives you the possibility to set a default for --hint, its processing is done on the client side so it doesn't affect slurmctld performance, and the example avoids the direct check against the "NO_VAL" which may be an issue if the variable type changes in the future.

>Is there a way to not render the link as clickable, so 
>as to not have people following things they can't access.
I'll raise that question to the team, however, I'm afraid that it may be a Bugzilla limitation.

>Just out of interest, seeing as how it is a duplicate of
>this issue, what's the big secret ?
Technically it's because of the security group settings of the other bug. Obviously, I can't directly answer the "what's the big secret". Some sites find it useful to mark the whole communication with us private because of a variety of reasons. Mostly coming from customer security policy related to logs, ip addresses, or even host names being shared.

Is there anything else I can help you with here?

cheers,
Marcin
Comment 29 Kevin Buckley 2021-01-13 23:15:29 MST
On 2021/01/14 13:25, bugs@schedmd.com wrote:
> 
>>3) I have already said (comment 6) that I have created a 
>>   Lua script ...
> 
> The small difference is that I'd recommend use of CliFilter instead of
> JobSubmit. It gives you the possibility to set a default for --hint, its
> processing is done on the client side so it doesn't affect slurmctld
> performance, and the example avoids the direct check against the "NO_VAL"
> which may be an issue if the variable type changes in the future.

Thanks for expanding as to the "extra elegance" in your solution, Marcin.

As we are not using any CliFilter stuff (well, that I am aware of), I
would have missed the distinction.

Is that (Use CliFilter instead of JobSubmit) a "general" recommendation
now ?

I'll certainly try your suggestion out on our TDS but, as we are unlikely
to bump Slurm up on the production Crays until 20.11.3 comes out then we
won't have to use it in anger, as I'm sure the fact that we can go to
20.11.3 without anything breaking will mean we won't change anything.

Cheers again,
Kevin
Comment 30 Marcin Stolarek 2021-01-13 23:51:19 MST
>Is that (Use CliFilter instead of JobSubmit) a "general" recommendation
>now ?

Those are just different and should be used for different purposes. The JobSubmit plugin runs on the slurmctld side, which has the drawback of a potential performance impact in case of complicated logic being implemented there. On the other hand, users have no way to skip this logic, so if it's a site policy that has to be strictly enforced, JobSubmit is the best place for that. It's also the only place where you can filter job modification by the `scontrol update job...` command.

CliFilter runs at the user's command end, which gives you the possibility to work on options exactly as seen on the user side. For instance, the --hint/--pty options are not directly translated to the job descriptor you see on the slurmctld side. Since it's done on the user side, it doesn't impact infrastructure performance and scales trivially. It's also relatively easy to skip, e.g. with a user-built client (srun/sbatch) binary or a substituted slurm.conf.

>I'll certainly try your suggestion out on our TDS but, as we are unlikely
>to bump Slurm up on the production Crays until 20.11.3 comes out then we
>won't have to use it in anger, as I'm sure the fact that we can go to
>20.11.3 without anything breaking will mean we won't change anything.
Understood.

I'm closing the bug as fixed now, should you have any questions please reopen.


cheers,
Marcin
Comment 31 Kevin Buckley 2021-01-14 00:46:04 MST
Apologies for re-opening Marcin: need to ask one final thing 
as regards your cli_filter example.

You only seem to be setting an explicit value for a known
parameter, viz:

> options['threads-per-core'] = 1 #either or options['hint']='nomultithread'
> return slurm.SUCCESS

whereas our existing, though less elegant, solution checks
for the lack of the parameter altogether, which it assumes
means that a user didn't want hyperthreading or didn't know
about hyperthreading.

We do actually have some users using our MPP Crays for high 
throughput computing who do jump through hoops to remove our
default SLURM_HINT EnvVar and are then effectively requesting
--threads-per-core=2 
without actually being explicit.

So, in cli_filter-land, do I have to check for both known values

--threads-per-core=1
--threads-per-core=2

or rather (for "cores" that thread above 2), each value 
individually, or can I just check to see if the parameter 
has been populated and, if it has not, set a "default" 
for the user ?

Subtle difference, but I'm assuming you didn't just replicate
my job_submit attempt in cli_filter-land for a reason?
Comment 32 Marcin Stolarek 2021-01-14 00:58:21 MST
Kevin,

>Apologies for re-opening Marcin: need to ask one final thing as regards your cli_filter example.

No issue, I'm glad you asked. The cli_filter setup_defaults function is called before any option processing[1] is done; its purpose is to overwrite the default values.
If anything is directly specified as a command-line option, it will simply overwrite the value set in the slurm_cli_setup_defaults cli_filter function.

cheers,
Marcin

[1]https://slurm.schedmd.com/cli_filter_plugins.html
Comment 33 Kevin Buckley 2021-01-14 01:25:38 MST
OK: got it.

cli_filter's slurm_cli_setup_defaults() stanza is a 
Lua replacement for our default EnvVar approach, not
for the Lua job_submit code stanza.

However, if we adopted the cli_filter approach, our
high throughput users would then still need to be told
to explicitly use --threads-per-core=2, as opposed to 
us telling them, as we currently do, how to get rid of
the EnvVar. Basically a change to what we are telling 
them?

I still think we're likely to just use the 20.11.3 fix
to leave things as they are - so no need to tell the users
to do anything differently - and wait for our new system
to come on stream, where everything will be "new" (ho-ho!).

Cheers again: will close again for you.