Ticket 12993

Summary: node_features rebooting nodes already in the correct state with 2+ constraints
Product: Slurm
Reporter: Felix Abecassis <fabecassis>
Component: Scheduling
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: dwightman, jbernauer, lyeager
Version: 21.08.4
Hardware: Linux
OS: Linux
Site: NVIDIA (PSLA)
Version Fixed: 21.08.6 22.05.0pre1
Attachments: v3 21.08 - add NEWS
             21.08 v5

Description Felix Abecassis 2021-12-08 11:53:10 MST
On a production system running Slurm 20.11 with the pre-upstream node_features/helpers plugin (from https://bugs.schedmd.com/show_bug.cgi?id=9567) we noticed that Slurm will reboot *all* the nodes allocated to a job when rebooting only one node would have been sufficient. That is to say, only one node did not have the right active features; all the other nodes already had the requested node features.

The problem was reproduced on a single-node multiple-slurmd dev machine running Slurm 21.08.4 with the upstream node_features/helpers plugin, but it's unclear whether the issue lies within the node_features/helpers plugin itself or in the core logic shared by all node_features plugins.

Here is the repro on 21.08.4, first we define multiple features in node_features.conf:
```
Feature=nps=4 Helper=/etc/slurm/node_features_helpers/amd_rome
Feature=nps=2 Helper=/etc/slurm/node_features_helpers/amd_rome
Feature=nps=1 Helper=/etc/slurm/node_features_helpers/amd_rome
Feature=mig=off Helper=/etc/slurm/node_features_helpers/nvidia_mig_off
Feature=mig=on Helper=/etc/slurm/node_features_helpers/nvidia_mig_on
MutuallyExclusive=nps=1,nps=2,nps=4
MutuallyExclusive=mig=on,mig=off
```
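For context, each Helper= binary above is an executable that the node_features/helpers plugin runs to query and change a node's active features: run with no argument it prints the currently active feature, and run with a feature argument it stages that setting to take effect on the next reboot. A minimal sketch of such a helper in Python, assuming a simple state-file model like the nps.state file mentioned later in this ticket (the path, default value, and feature list are illustrative, not the site's actual scripts):

```python
#!/usr/bin/env python3
"""Illustrative sketch of a node_features/helpers Helper= binary.

With no argument: print the currently active feature.
With a feature argument: stage it (active after the next reboot).
The state file stands in for real hardware state and is hypothetical.
"""
import os
import sys
import tempfile

# Hypothetical state-file location; a real helper would touch hardware
# or firmware settings (e.g. AMD NPS mode) instead.
STATE_FILE = os.path.join(tempfile.gettempdir(), "nps.state")
VALID = {"nps=1", "nps=2", "nps=4"}

def current() -> str:
    """Return the feature currently recorded as active."""
    try:
        with open(STATE_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "nps=4"  # assumed default when no state exists yet

def stage(feature: str) -> None:
    """Record the feature that should be active after reboot."""
    if feature not in VALID:
        sys.exit(f"unknown feature: {feature}")
    with open(STATE_FILE, "w") as f:
        f.write(feature + "\n")

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print(current())
    else:
        stage(sys.argv[1])
```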

Then, let's assume we are in a state where node-1 is currently in mode nps=1, and node-[2-4] are in mode nps=4:
```
$ sinfo -a -o "%20N | %20b | %40f | %T"
NODELIST             | ACTIVE_FEATURES      | AVAIL_FEATURES                           | STATE
node-1               | nps=1,mig=on         | nps=4,nps=2,nps=1,mig=off,mig=on         | idle
node-[2-4]           | nps=4,mig=on         | nps=4,nps=2,nps=1,mig=off,mig=on         | idle
```

Let's launch a 4-node job with "nps=4,mig=on":
```
$ srun -A admin -N4 --constraint nps=4,mig=on bash -c 'echo $SLURM_NODELIST'
```

We expect that only node-1 needs to be rebooted, but all nodes are rebooted:
```
slurmctld: reboot_job_nodes: reboot nodes node-[1-4] features nps=4&mig=on
slurmctld: sched: _slurm_rpc_allocate_resources JobId=108 NodeList=node-[1-4] usec=651
```

As the summary mentions, this only happens when multiple constraints are passed to the job. If a single constraint is passed, only node-1 will be rebooted (assuming the same initial state):
```
$ srun -A admin -N4 --constraint nps=4 bash -c 'echo $SLURM_NODELIST'
```
slurmctld.log:
```
slurmctld: reboot_job_nodes: reboot nodes node-1 features nps=4
slurmctld: sched: _slurm_rpc_allocate_resources JobId=110 NodeList=node-[1-4] usec=618
```
So this use case works as expected: only a single node is rebooted.

Asking for just "nps=4" means "nps=4,mig=don't-care", so it's not strictly equivalent to "nps=4,mig=on". But the latter is merely more specific, so Slurm should still avoid rebooting nodes that are already in the right state.
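The expected behavior can be expressed compactly: a node needs a reboot only if at least one requested feature is missing from its active set. A small illustrative sketch of that decision, not Slurm's actual code (which operates on bitmaps inside slurmctld) and ignoring OR/XOR constraint syntax:

```python
def nodes_needing_reboot(requested: set, active_by_node: dict) -> list:
    """Return nodes whose active feature set is missing at least one
    requested feature. Simplified model: AND-only constraints."""
    return sorted(
        node for node, active in active_by_node.items()
        if not requested <= active  # requested not a subset of active
    )

# Initial state from this report: node-1 is nps=1, node-[2-4] are nps=4.
active = {
    "node-1": {"nps=1", "mig=on"},
    "node-2": {"nps=4", "mig=on"},
    "node-3": {"nps=4", "mig=on"},
    "node-4": {"nps=4", "mig=on"},
}

# --constraint nps=4,mig=on should reboot only node-1.
print(nodes_needing_reboot({"nps=4", "mig=on"}, active))  # → ['node-1']
```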

In our cluster, we always have 2 constraints injected into all submitted jobs to act as default constraint values (see https://bugs.schedmd.com/show_bug.cgi?id=12289); e.g. all jobs would have "nps=4,mig=off" injected even if they specified no constraints at all. Hence we are subject to this bug.
Comment 1 Marshall Garey 2021-12-08 12:19:02 MST
Hi Doug,

This may be fixed by the work Brian and I did in bug 12366. Because we considered it a behavioral change, we only put the commits into the master branch. It included some fixes for bug 12366 as well as a refactor of some logic for rebooting nodes due to node_features or power save.

Could you please run your tests on the current master branch and let me know if it fixes it or not? If it isn't fixed, can you upload the results of your test from the master branch?
Comment 2 Marshall Garey 2021-12-08 12:20:17 MST
Sorry Felix, I saw Doug's name in the node features commits and for some reason it stuck in my head. :)
Comment 3 Douglas Wightman 2021-12-08 12:21:52 MST
Hi Marshall! No worries.
Comment 4 Felix Abecassis 2021-12-08 13:26:46 MST
I still see the problem with Slurm master on this commit:
608264b67c Minor formating issues

Nodes 2, 3, 4 show that a reboot is initiated for features that are already in FeaturesActive:
```
slurmd: CPUs=12 Boards=1 Sockets=1 Cores=6 Threads=2 Memory=32013 TmpDisk=120025 Uptime=761863 CPUSpecList=(null) FeaturesAvail=nps=4,nps=2,nps=1,mig=off,mig=on FeaturesActive=nps=4,mig=on
slurmd: Node reboot request with features nps=4&mig=on being processed
```

In addition, on this branch the active features go into a weird state after the reboot of all nodes:
```
$ sinfo -a -o "%10N | %30b | %f"
NODELIST   | ACTIVE_FEATURES                | AVAIL_FEATURES
node-[1-4] | nps=4,mig=on,nps=4&mig=on      | nps=4,nps=2,nps=1,mig=off,mig=on
```

I'm wondering if this is the issue that was mentioned in https://bugs.schedmd.com/show_bug.cgi?id=12068#c18 (now hidden to us).
Comment 5 Marshall Garey 2021-12-08 13:55:54 MST
Thanks for the additional information and for testing it on master. I'll look into it and keep you updated on my progress.
Comment 8 Marshall Garey 2021-12-27 17:04:30 MST
Just letting you know I've reproduced this and identified the problem. I have a partial fix already. I'll keep you updated on my progress.

Thanks for the detailed reproducer.
Comment 13 Marshall Garey 2022-01-21 15:50:16 MST
Felix, please ignore that last comment. It was meant to be private.
Comment 16 Marshall Garey 2022-01-25 14:40:09 MST
Felix,

I've attached a patch to fix the problem (which I accidentally made public, but that's fine).

attachment 23091 [details]

Can you try out this patch and let us know if it works for you?
* Can you test the normal workflow?
* Can you also test 3+ constraints or anything else that you'd like to test?

There were actually a couple of problems: what you reported, but also that the rebooting logic never handled 3+ constraints.

I've done testing and it has started through our peer-review process, but we'd like to make sure we aren't breaking anything for you.

Thanks
Comment 17 Felix Abecassis 2022-01-25 21:53:34 MST
Hi Marshall,

It doesn't seem to work with the normal workflow, but I suspect the issue is simple: a missing conversion between "&" and "," as the separator between constraints.

With the patch applied, when I run the command from the original scenario:
```
$ srun -A admin -N4 --constraint nps=4,mig=on bash -c 'echo $SLURM_NODELIST'
```

I indeed see that only one node is rebooted (node-1), but it doesn't call the helper binaries, as it is confused by the feature string being passed:
```
slurmd: Node reboot request with features nps=4,mig=on being processed
slurmd: node_features/helpers: node_features_p_node_set: skipping unregistered feature "nps=4,mig=on"
```

The separator in this function is expected to be &:
https://github.com/SchedMD/slurm/blob/4af095d4aa99c956ead1bf59785b3189eab196f3/src/plugins/node_features/helpers/node_features_helpers.c#L499
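The mismatch can be illustrated with a small tokenizer sketch (hypothetical code, not the plugin's): slurmctld hands the plugin a comma-separated feature string, while node_features_p_node_set() at that link splits only on "&", so "nps=4,mig=on" arrives as one unrecognized token. Accepting either separator would sidestep the mismatch:

```python
import re

def split_features(spec: str) -> list:
    """Split a reboot-features string on either ',' or '&', the two
    separators discussed in this ticket (illustrative only)."""
    return [tok for tok in re.split(r"[,&]", spec) if tok]

print(split_features("nps=4,mig=on"))   # → ['nps=4', 'mig=on']
print(split_features("nps=4&mig=on"))   # → ['nps=4', 'mig=on']
```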
Comment 19 Marshall Garey 2022-01-27 09:02:50 MST
Felix,

Good find. If you change that ampersand to a comma, does everything work for you?

Basically, slurmctld has expected reboot features to be comma separated, so that's what I did in the patch that I gave you, but I missed that spot that you pointed out. However, I want to make sure that making reboot features comma separated doesn't break anything for you.

If there is a problem with comma separated reboot features, then we can also allow features to be ampersand separated.

Thanks for your testing.
Comment 22 Felix Abecassis 2022-01-27 16:47:36 MST
Comma as a separator is a tiny bit better to deal with in bash, as you don't need to quote the string nps=4,mig=on, compared to nps=4&mig=on (bash would background the command and set the variable mig=on).

There seems to be another problem with the patch after fixing the ampersand. If I ask for nps=4,mig=on, it will not reboot the nodes when it should have:

```
$ sinfo -a -o "%20N | %20b | %40f | %T"
NODELIST             | ACTIVE_FEATURES      | AVAIL_FEATURES                           | STATE
node-[1-4]           | nps=1,mig=on         | nps=4,nps=2,nps=1,mig=off,mig=on         | idle

$ srun -A admin -N4 --constraint 'nps=4&mig=on' bash -c 'echo $SLURM_NODELIST'
node-[1-4]
node-[1-4]
node-[1-4]
node-[1-4]

$ sinfo -a -o "%20N | %20b | %40f | %T"
NODELIST             | ACTIVE_FEATURES      | AVAIL_FEATURES                           | STATE
node-[1-4]           | nps=1,mig=on         | nps=4,nps=2,nps=1,mig=off,mig=on         | completing
```

The srun executes immediately, without a reboot.
But with just --constraint nps=4, Slurm will reboot all nodes as expected.
Comment 23 Marshall Garey 2022-01-27 17:10:52 MST
(In reply to Felix Abecassis from comment #22)
> Comma as a separator is a tiny bit better to deal with in bash, as you don't
> need to quote the string nps=4,mig=on, compared to nps=4&mig=on (bash would
> background the command and set the variable mig=on).
> 
> There seems to be another problem with the patch after fixing the ampersand.
> If I ask for nps=4,mig=on, it will not reboot the nodes when it should have:

Did the node(s) already have one of these features set as active? In my testing, the nodes do reboot if they do not have any of the required features. But, I did just find an issue where if the node has one of the requested features, then it doesn't reboot.

I'll look into this. Also, I'm looking into getting this (rebooting with multiple features) added into our expect test for the node_features/helpers plugin.
Comment 24 Felix Abecassis 2022-01-27 17:28:07 MST
> Did the node(s) already have one of these features set as active? 

It was on a single machine with multiple slurmds, so the line is a bit blurry, as the "state" of each fake feature is actually stored in a file:
```
$ cat /etc/slurm/node_features_helpers/nps.state
nps=4
```

But it was working without the patch, and sinfo is showing that Slurm believes that all nodes are in mode nps=1 before and after the job requesting nps=4.
Comment 25 Felix Abecassis 2022-01-27 17:29:13 MST
But the mig feature was present on all nodes already, yes.
Comment 26 Marshall Garey 2022-01-27 17:31:31 MST
(In reply to Felix Abecassis from comment #25)
> But the mig feature was present on all nodes already, yes.

Okay, thanks for confirming. It was another mistake in my patch and was easy enough to fix. I've added it to my notes as another thing we need to test.

I'll do some more testing (and try to be more thorough this time!) and upload a new patch tomorrow.
Comment 29 Marshall Garey 2022-01-31 15:48:09 MST
Felix,

I've made attachment 23179 [details] (21.08 v5) public. This is the latest version of my patch; make sure it's 21.08 v5.

I tested:

* No active features
* One active feature
  - on one node
  - on two nodes
* Two active features
  - on one node
  - on two nodes

submitting a job to

* One node
* Two nodes

I ran combinations of the above.

I checked to see if nodes that were supposed to reboot were queued for rebooting, and that nodes that were not supposed to reboot were not queued for rebooting. I also checked for the "skipping unregistered feature" log (to make sure that the script ran), and I ran scontrol show node to make sure that the features were actually updated.
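The test matrix above can be enumerated mechanically; a sketch of the combinations (labels are illustrative, not an actual test harness):

```python
from itertools import product

# Dimensions from the test description above.
feature_counts = ["one active feature", "two active features"]
spread = ["on one node", "on two nodes"]
job_size = ["1-node job", "2-node job"]

# "No active features" has no node-spread dimension.
cases = [("no active features", "", size) for size in job_size]
cases += list(product(feature_counts, spread, job_size))

print(len(cases))  # → 10 combinations to run and verify
```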

Could you test this patch?
Comment 30 Felix Abecassis 2022-02-01 21:31:22 MST
Yup, everything seems to be working great now! Thank you!

I suppose this fix will be in 22.05, is that correct?
Comment 31 Marshall Garey 2022-02-02 09:36:28 MST
Awesome, thanks for confirming that!

I hope this will go into 21.08. No feature changes were involved. I'll let you know what our final decision is, though.
Comment 34 Marshall Garey 2022-02-02 15:21:07 MST
Felix,

We've pushed this to github and the fixes will be part of the 21.08.6 release. Thanks for reporting this!

Closing as resolved/fixed.

Commits:

9367f592cc NEWS for the previous 3 commits
9548cd9d56 Make build_active_feature_bitmap2() handle more than two features
e5fbe27e40 node_features/helpers - fix handling reboot with multiple job features
12aab56696 Remove unnecessary xstrdup()