| Summary: | SLURM_HINT=nomultithread preventing salloc from allocating | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kevin Buckley <kevin.buckley> |
| Component: | Regression | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | Brian Christiansen <brian> |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | brian.gilmer, cinek, slbolling |
| Version: | 20.11.0 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | Pawsey | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | SUSE | Machine Name: | chaos |
| CLE Version: | 6.0 UP07 PS45 | Version Fixed: | 20.11.3 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: |
slurm.conf as requested
output from broken salloc
output from salloc |
||
Kevin, The error message you see comes from the validate_hint_option function called on the Slurm client side[1]. Could you please share the output of the `env` and `salloc -vvvv` commands, and attach your 20.11 slurm.conf as well? cheers, Marcin [1]https://github.com/SchedMD/slurm/blob/c358d2dcbdf2a0fd4b7e28a2962c78a9723b2c98/src/common/slurm_opt.c#L5599-L5604
Will do the env and salloc tomorrow our time, Marcin, as we will have to break the TDS by putting the EnvVar back in, but the slurm.conf you can have shortly.
Created attachment 16991 [details]
slurm.conf as requested
Created attachment 17021 [details]
output from broken salloc
Four files in the tarball
env.schedmd.10377
salloc-vvvv.schedmd.10377
env.schedmd.10377.login
salloc-vvvv.schedmd.10377.login
First pair generated from a shell on the eLogin node,
second from a shell on the internal login node.
Note that salloc from the eLogin node runs a wrapper
that spawns a shell on the internal login node and
runs the real salloc from there.
Output is pretty much the same, bar some EnvVars that
differ because eLogin nodes don't have the same CLE OS
image as the in-the-Cray login node.
HTH
Created attachment 17022 [details]
output from salloc
The env and salloc outputs, left out of the previous attachment, from an
environment where the SLURM_HINT=nomultithread has been removed.
As a potential way forward, I have been investigating the possibility
of using a Lua job submit script to check for any explicit user setting
of --threads-per-core and, if one isn't found, add one ahead of the
job launching.
I have found that the following does what's required.
(Note that some of the wrapping around this has been omitted
and that the 65534 was determined by trial and error.)
if job_desc.threads_per_core == 65534 then
    job_desc.threads_per_core = 1
end
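For what it's worth, the 65534 found by trial and error is 0xFFFE, which matches Slurm's NO_VAL16 sentinel for a 16-bit option the user never set. A slightly fuller job_submit.lua sketch along those lines (hypothetical and untested here; the callback names follow the job_submit/lua plugin convention, and the sentinel value is hard-coded rather than relying on any symbol the plugin environment may or may not expose):

```lua
-- job_submit.lua (sketch): default to one hardware thread per core
-- when the user has not set --threads-per-core at all.
-- 65534 == 0xFFFE, Slurm's NO_VAL16 "option not set" sentinel.
local NO_VAL16 = 65534

function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.threads_per_core == NO_VAL16 then
        job_desc.threads_per_core = 1
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

A user who explicitly submits with --threads-per-core=2 never matches the sentinel, so the default is not forced on them.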
However, whilst it does what is required, there are some subtle
changes to the SLURM_ variables found in the job environment, viz:
| Variable | No explicit setting | Explicit #SBATCH --threads-per-core=1 | --threads-per-core=1 added by Lua script |
|---|---|---|---|
| SLURM_CPUS_ON_NODE | 48 | 24 | 24 |
| SLURM_JOB_CPUS_PER_NODE | 48 | 24 | 24 |
| SLURM_NPROCS | | 1 | |
| SLURM_NTASKS | | 1 | |
| SLURM_TASKS_PER_NODE | 48 | 1 | 24 |
| SLURM_THREADS_PER_CORE | | 1 | |
Note the three extra EnvVars when the user explicitly asks for
--threads-per-core=1
and that SLURM_TASKS_PER_NODE becomes 1.
Note also that when the Lua script adds the threads_per_core=1
into the job_desc, not even SLURM_THREADS_PER_CORE gets
populated.
Is that to be expected ?
It is a pretty noddy job submission script but I'll add the
lines here for completeness:
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --partition=acceptance
#SBATCH --time=00:05:00
#SBATCH --job-name=IS-2194-xthi_cray
#SBATCH --export=none
#SBATCH --constraint=haswell
##BATCH --threads-per-core=1 # Uncomment for explicit user setting
#
module load craype-haswell
#
env
srun --export=all --label \
--ntasks=24 \
--threads-per-core=1 \
./xthi.prgenv_cray
Kevin, Sorry for the delay - I've been trying to reproduce the environment differences you've seen, looking at the code and running test jobs, without success.
1) Without job_submit plugin, command:
>sbatch --wrap="env"
>Submitted batch job 59110
2) Without job_submit plugin, command:
>sbatch --threads-per-core=1 --wrap="env"
>Submitted batch job 59111
3) With the job_submit plugin (using the snippet from you):
>sbatch --wrap="env"
>Submitted batch job 59112
>[root@slurmctl bug10454]# diff slurm-59110.out slurm-59112.out
>26c26
>< SLURM_JOBID=59110
>---
>> SLURM_JOBID=59112
>35c35
>< SLURM_JOB_ID=59110
>---
>> SLURM_JOB_ID=59112
>46c46
>< SLURM_TASK_PID=8010
>---
>> SLURM_TASK_PID=9243
>[root@slurmctl bug10454]# diff slurm-59111.out slurm-59112.out
>8d7
>< SLURM_THREADS_PER_CORE=1
>27c26
>< SLURM_JOBID=59111
>---
>> SLURM_JOBID=59112
>36c35
>< SLURM_JOB_ID=59111
>---
>> SLURM_JOB_ID=59112
>47c46
>< SLURM_TASK_PID=8075
>---
>> SLURM_TASK_PID=9243
Am I doing something differently than you did? cheers, Marcin
> Am I doing something differently than you did?
Hard to tell just from the output you have supplied.
I guess the question for you, given that your output doesn't
show the differences that we are seeing, is: do any of your
output files contain the variables we are seeing the diffs
with?
Kevin, I just wanted to let you know that the patch for the initial issue - the use of SLURM_HINT to set the default behavior - is under review. cheers, Marcin
On 2021/01/11 21:18, bugs@schedmd.com wrote:
> https://bugs.schedmd.com/show_bug.cgi?id=10377
>
> --- Comment #15 from Marcin Stolarek <cinek@schedmd.com> ---
> Kevin,
>
> I just wanted to let you know that the patch for the initial issue - the use of
> SLURM_HINT to set the default behavior is under review.
Thanks Marcin, although as far as I am concerned, this comment from another recent "bug"
https://bugs.schedmd.com/show_bug.cgi?id=10383#c36
saying
"The problem with a workaround that goes through environment variables is that it is brittle. Environment variables can be set and unset. One should not have to rely on an environment variable to get the correct behavior."
is one I am very much in agreement with, and I wish our users would simply tell the scheduler what they want, via --threads-per-core=N, rather than "rely" on a behaviour, introduced by an EnvVar, that they may not even be aware of.
A case was filed by a customer:
We upgraded our XC system with KNL nodes to 7.0.UP02 and Slurm 20.11. We set the environment variable SLURM_HINT=nomultithread. When you run salloc, it sets the environment variable SLURM_THREADS_PER_CORE=1. Now if you run srun hostname it will fail with
srun: fatal: Following options are mutually exclusive: --hint, --ntasks-per-core, --threads-per-core, -B, but more than one set by environment variables.
There is a SchedMD case open on this already: BUG 10377. There is no resolution at this time. The slurm_opt.c program did not have a validate_hint function before this release. Also, when run on an older release of Slurm, the SLURM_THREADS_PER_CORE=1 variable is not set after running salloc.
---
It is not clear if comment #15 refers to a fix for this issue.
Brian,
>It is not clear if comment #15 refers to a fix for this issue.
Yes, I was referring to the case you described.
There is a different issue Kevin mentioned in comment 6; I'll open a separate bug to get it covered. cheers, Marcin
Probably more for Brian's benefit, but it needs to be stated that Pawsey (or anyone else, I'd wager) should no longer need to use
SLURM_HINT=nomultithread
by default, as the same outcome, in respect of multithreading, can be achieved by telling its users to explicitly specify
--threads-per-core=1
however, Pawsey is still looking at how best to get that message across to its user community. This issue has now become one of the slightly different EnvVars seen in the job environment, depending on how the --threads-per-core=1 request is supplied to the job submission. Hoping that's useful, Kevin
Kevin, I've opened Bug 10620 as a follow-up to your questions from comment 6 and added you to CC there. I agree that there is no obvious reason for the behavior. The initial issue demonstrated by:
>export SLURM_HINT=nomultithread
>salloc -p acceptance
>[...]
>srun: fatal: Following options are mutually exclusive: --hint, --ntasks-per-core,
>--threads-per-core, -B, but more than one set by environment variables.
got fixed by f450cc6837d8581[1].
>Probably more for Brian's benefit, but it needs to be stated that
>Pawsey (or anyone else, I'd wager) should no longer need to use
>SLURM_HINT=nomultithread
>by default, as the same outcome, in respect of multithreading,
>can be achieved by telling its users to explicitly specify
>--threads-per-core=1
Those are in some sense equivalent, since --hint=nomultithread sets --threads-per-core=1. I think that quite a good explanation of the details behind --hint=nomultithread is in Bug 8119 comment 4.
If I'm reading correctly that you'd like a more elegant solution than SLURM_HINT being exported to the users' environment, you can use a Lua cli_filter plugin based on the below snippet
> function slurm_cli_setup_defaults(options, cli_type)
>     options['threads-per-core'] = 1 -- either this, or options['hint'] = 'nomultithread'
>     return slurm.SUCCESS
> end
> function slurm_cli_post_submit(options, cli_type)
>     return slurm.SUCCESS
> end
> function slurm_cli_pre_submit(options, cli_type)
>     return slurm.SUCCESS
> end
It will effectively forbid specification of the conflicting options on Slurm 20.11.0 - 20.11.2, but should be fine starting with 20.11.3, where in case of conflict --hint/SLURM_HINT will be ignored with a user message at verbose level. Let me know if you have further questions. In case of no reply, I'll close the bug as fixed in 20.11.3. cheers, Marcin [1]https://github.com/SchedMD/slurm/commit/f450cc6837d85812378612cdd734f9c038bfbfce
*** Ticket 10600 has been marked as a duplicate of this ticket. ***
> If I'm reading correctly that you'd like to have more elegant solution
> than SLURM_HINT being exported to users environment, you can use a lua
> cli_filter plugin based on the below snippet
Not quite.
1) I don't need a more elegant solution, as no problem exists.
2) I don't think we should have ever been using SLURM_HINT; we should have been telling our user community to tell the scheduler what they want, explicitly, which is, for no multithreading, to just request --threads-per-core=1
3) I have already said (comment 6) that I have created a Lua script that would "do the right thing" for users who /forget to/don't know how to/can't remember how to/ tell the scheduler what they want.
The outstanding issue here is how we, Pawsey, educate our user community to do the right thing from now on.
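For anyone wiring this up, a hedged sketch of the configuration: assuming the stock cli_filter/lua plugin, the snippet lives in a cli_filter.lua placed alongside slurm.conf, and the plugin is enabled with:

```
# slurm.conf (sketch): enable the Lua CLI filter on the client side
CliFilterPlugins=cli_filter/lua
```

Since the filter runs in the client commands (salloc/sbatch/srun), no slurmctld restart should be needed for the Lua script itself, though the slurm.conf change must reach every submit host.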
To that extent, along with the splitting out of the extra issue that the investigations around this one brought to light, the fix for this issue probably wasn't even required, though I am sure some people (people who like using EnvVars) will find it useful! Cheers, Kevin
I notice, after Nate added a link to a "duplicate issue", that I can't follow the link? Is there a way to not render the link as clickable, so as to not have people following things they can't access? Just out of interest, seeing as how it is a duplicate of this issue, what's the big secret?
>3) I have already said (comment 6) that I have created a
> Lua script that would "do the right thing" for users who
> /forget to/don't know how to/can't remember how to/ tell
> the scheduler what they want.
The small difference is that I'd recommend use of CliFilter instead of JobSubmit. It gives you the possibility to set a default for --hint, its processing is done on the client side so it doesn't affect slurmctld performance, and the example avoids the direct check against the "NO_VAL" sentinel, which may be an issue if the variable type changes in the future.
>Is there a way to not render the link as clickable, so
>as to not have people following things they can't access.
I'll raise that question with the team; however, I'm afraid that it may be a Bugzilla limitation.
>Just out of interest, seeing as how it is a duplicate of
>this issue, what's the big secret ?
Technically, it's because of the security group settings of the other bug. Obviously, I can't directly answer the "what's the big secret". Some sites find it useful to mark all communication with us private for a variety of reasons, mostly coming from customer security policies related to logs, IP addresses, or even host names being shared. Is there anything else I can help you with here? cheers, Marcin
On 2021/01/14 13:25, bugs@schedmd.com wrote:
>
>>3) I have already said (comment 6) that I have created a
>> Lua script ...
> > The small difference is that I'd recommend use of CliFilter instead of
> > JobSubmit. It gives you the possibility to set a default for --hint, its
> > processing is done on the client side so it doesn't affect slurmctld
> > performance, and the example avoids the direct check against the "NO_VAL",
> > which may be an issue if the variable type changes in the future.
Thanks for expanding as to the "extra elegance" in your solution, Marcin. As we are not using any CliFilter stuff (well, that I am aware of), I would have missed the distinction. Is that (use CliFilter instead of JobSubmit) a "general" recommendation now?
I'll certainly try your suggestion out on our TDS but, as we are unlikely to bump Slurm up on the production Crays until 20.11.3 comes out, we won't have to use it in anger, as I'm sure the fact that we can go to 20.11.3 without anything breaking will mean we won't change anything. Cheers again, Kevin
>Is that (Use CliFilter instead of JobSubmit) a "general" recommendation now ?
Those are just different and should be used for different purposes. The JobSubmit plugin runs on the slurmctld side, which has the drawback of a potential performance impact if complicated logic is implemented there. On the other hand, users have no way to skip this logic, so if it's a site policy that has to be strictly enforced, JobSubmit is the best place for that. It's also the only place where you can filter job modifications made by the `scontrol update job...` command. CliFilter runs at the user's command end, which gives you the possibility to work on options exactly as seen on the user side. For instance, the --hint/--pty options are not directly translated to the job descriptor you see on the slurmctld side. Since it's done on the user side, it doesn't impact infrastructure performance and scales in a trivial way. It's also relatively easy to skip - using a user-built client (srun/sbatch) binary or a substituted slurm.conf.
>I'll certainly try your suggestion out on our TDS but, as we are unlikely
>to bump Slurm up on the production Crays until 20.11.3 comes out then we
>won't have to use it in anger, as I'm sure the fact that we can go to
>20.11.3 without anything breaking will mean we won't change anything.
Understood. I'm closing the bug as fixed now; should you have any questions, please reopen. cheers, Marcin
Apologies for re-opening, Marcin: I need to ask one final thing
as regards your cli_filter example.
You only seem to be setting an explicit value for a known
parameter, viz:
> options['threads-per-core'] = 1 -- either this, or options['hint'] = 'nomultithread'
> return slurm.SUCCESS
whereas our existing, though less elegant, solution checks
for the lack of the parameter altogether, which it assumes
means that a user didn't want hyperthreading or didn't know
about hyperthreading.
We do actually have some users using our MPP Crays for high
throughput computing who do jump through hoops to remove our
default SLURM_HINT EnvVar and are then effectively requesting
--threads-per-core=2
without actually being explicit.
So, in cli_filter-land, do I have to check for both known values
--threads-per-core=1
--threads-per-core=2
or rather (for "cores" that thread above 2-way), each value
individually, or can I just check to see if the parameter
has been populated and, if it has not, set a "default"
for the user?
Subtle difference, but I'm assuming you didn't just replicate
my job_submit attempt in cli_filter-land for a reason?
Kevin,
>Apologies for re-opening Marcin: need to ask one final thing as regards your cli_filter example.
No issue, I'm glad you asked. The cli_filter setup_defaults function is called before any option processing[1] is done; its purpose is to overwrite the default values. If anything is directly specified as a command-line option, it will just overwrite the value set in the setup_defaults cli_filter function. cheers, Marcin [1]https://slurm.schedmd.com/cli_filter_plugins.html
OK: got it. cli_filter's slurm_cli_setup_defaults() stanza is a Lua replacement for our default EnvVar approach, not for the Lua job_submit code stanza.
However, if we adopted the cli_filter approach, our high-throughput users would then still need to be told to explicitly use --threads-per-core=2, as opposed to us telling them, as we currently do, how to get rid of the EnvVar. Basically a change to what we are telling them?
I still think we're likely to just use the 20.11.3 fix to leave things as they are - so no need to tell the users to do anything differently - and wait for our new system to come on stream, where everything will be "new" (ho-ho!). Cheers again: will close again for you.
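Putting that answer into code: because slurm_cli_setup_defaults() runs before command-line parsing, a value set there is purely a default, so there is no need to enumerate --threads-per-core=1, =2, and so on. A hypothetical, untested cli_filter.lua for the Pawsey case, with callback signatures copied from Marcin's earlier snippet:

```lua
-- cli_filter.lua (sketch): make one thread per core the site default.
-- slurm_cli_setup_defaults runs before option parsing, so a user who
-- passes an explicit --threads-per-core=2 simply overrides this value.
function slurm_cli_setup_defaults(options, cli_type)
    options['threads-per-core'] = 1
    return slurm.SUCCESS
end

function slurm_cli_pre_submit(options, cli_type)
    return slurm.SUCCESS
end

function slurm_cli_post_submit(options, cli_type)
    return slurm.SUCCESS
end
```

No per-value checks are needed: a high-throughput user who wants both hardware threads submits with an explicit --threads-per-core=2 and the default never applies.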
As you may recall from past Pawsey tickets, our response to Cray being unable to turn off hyperthreading in their XC blades has been to add a SLURM_HINT=nomultithread into the Slurm modulefile.
I've just realised that I missed this bit in the RELEASE NOTES
-- Add support for an "Interactive Step", designed to be used with salloc to launch a terminal on an allocated compute node automatically. Enable by setting "use_interactive_step" as part of LaunchParameters.
when coming to play with 20.11 last week but, having remedied that, found that I was then getting
elogin$ salloc -p acceptance
Password:
salloc: Granted job allocation 11054
salloc: Waiting for resource configuration
salloc: Nodes nid00013 are ready for job
srun: fatal: Following options are mutually exclusive: --hint, --ntasks-per-core, --threads-per-core, -B, but more than one set by environment variables.
salloc: Relinquishing job allocation 11054
salloc: Job allocation 11054 has been revoked.
elogin$
As the only thing I could see being explicitly set in that bare salloc invocation was the SLURM_HINT EnvVar, I decided to unset it in a few places, after which things worked as expected.
elogin$ salloc -w nid00019
Password:
salloc: Granted job allocation 11058
salloc: Waiting for resource configuration
salloc: Nodes nid00019 are ready for job
kbuckley@nid00019:~$
Assuming we would like to continue with the hint for the remaining life of our XC resources, is there a recommended replacement for sites (surely not just us?) that were using it prior to 20.11? Kevin
PS I have seen #9670, but the suggestion there is that the breakage would have been visible in 20.02 as well, and we haven't been (and still aren't) seeing any issues with having the SLURM_HINT in the modulefile with 20.02.