On a production system running Slurm 20.11 with the pre-upstream node_features/helpers plugin (from https://bugs.schedmd.com/show_bug.cgi?id=9567) we noticed that Slurm will reboot *all* the nodes allocated to a job when only rebooting one node would have been sufficient. That is to say only one node did not have the right active features, while all the other nodes had the requested node features already. The problem was reproduced on a single-node multiple-slurmd dev machine running Slurm 21.08.4, with the upstream node_features/helpers plugin, but it's unclear if the issue lies within the node_features/helpers plugin itself, or the core logic for all node_features plugins. Here is the repro on 21.08.4, first we define multiple features in node_features.conf: ``` Feature=nps=4 Helper=/etc/slurm/node_features_helpers/amd_rome Feature=nps=2 Helper=/etc/slurm/node_features_helpers/amd_rome Feature=nps=1 Helper=/etc/slurm/node_features_helpers/amd_rome Feature=mig=off Helper=/etc/slurm/node_features_helpers/nvidia_mig_off Feature=mig=on Helper=/etc/slurm/node_features_helpers/nvidia_mig_on MutuallyExclusive=nps=1,nps=2,nps=4 MutuallyExclusive=mig=on,mig=off ``` Then, let's assume we are in a state where node-1 is currently in mode nps=1, and node-[2-4] are in mode nps=4: ``` $ sinfo -a -o "%20N | %20b | %40f | %T" NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES | STATE node-1 | nps=1,mig=on | nps=4,nps=2,nps=1,mig=off,mig=on | idle node-[2-4] | nps=4,mig=on | nps=4,nps=2,nps=1,mig=off,mig=on | idle ``` Let's launch a 4-nodes job with "nps=4,mig=on": ``` $ srun -A admin -N4 --constraint nps=4,mig=on bash -c 'echo $SLURM_NODELIST' ``` We expect that only node-1 needs to be rebooted, but all nodes are rebooted: ``` slurmctld: reboot_job_nodes: reboot nodes node-[1-4] features nps=4&mig=on slurmctld: sched: _slurm_rpc_allocate_resources JobId=108 NodeList=node-[1-4] usec=651 ``` As the title mentions, it only happens if multiple constraints are passed to the job. If a single constraint is passed, only node-1 will be rebooted (assuming the same initial state): ``` $ srun -A admin -N4 --constraint nps=4 bash -c 'echo $SLURM_NODELIST' ``` slurmctld.log: ``` slurmctld: reboot_job_nodes: reboot nodes node-1 features nps=4 slurmctld: sched: _slurm_rpc_allocate_resources JobId=110 NodeList=node-[1-4] usec=618 ``` So this use case works as expected, a single node is rebooted. Asking for just "nps=4" means "nps=4,mig=don't-care", so it's not strictly equivalent compared to "nps=4,mig=on". But the latter is more specific, so Slurm should avoid the reboot in nodes that are already in the right state. In our cluster, we always have 2 constraints injected to all submitted jobs to act as default constraint values (see https://bugs.schedmd.com/show_bug.cgi?id=12289), e.g. all jobs would have "nps=4,mig=off" injected even if they specified no constraints at all. Hence we are subject to this bug.
Hi Doug, This may be fixed by the work Brian and I did in bug 12366. Because we considered it a behavioral change, we only put the commits into the master branch. It included some fixes for bug 12366 as well as a refactor of some logic for rebooting nodes due to node_features or power save. Could you please run your tests on the current master branch and let me know if it fixes it or not? If it isn't fixed, can you upload the results of your test from the master branch?
Sorry Felix, I saw Doug's name in the node features commits and for some reason it stuck in my head. :)
Hi Marshall! No worries.
I still see the problem with Slurm master on this commit: 608264b67c Minor formating issues Nodes 2, 3, 4 show that a reboot is initiated for features that are already in FeaturesActive: ``` slurmd: CPUs=12 Boards=1 Sockets=1 Cores=6 Threads=2 Memory=32013 TmpDisk=120025 Uptime=761863 CPUSpecList=(null) FeaturesAvail=nps=4,nps=2,nps=1,mig=off,mig=on FeaturesActive=nps=4,mig=on slurmd: Node reboot request with features nps=4&mig=on being processed ``` In addition, on this branch the active features go into a weird state after the reboot of all nodes: ``` $ sinfo -a -o "%10N | %30b | %f" NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES node-[1-4] | nps=4,mig=on,nps=4&mig=on | nps=4,nps=2,nps=1,mig=off,mig=on ``` I'm wondering if this is the issue that was mentioned in https://bugs.schedmd.com/show_bug.cgi?id=12068#c18 (now hidden to us).
Thanks for the additional information and for testing it on master. I'll look into it and keep you updated on my progress.
Just letting you know I've reproduced this and identified the problem. I have a partial fix already. I'll keep you updated on my progress. Thanks for the detailed reproducer.
Felix, please ignore that last comment. It was meant to be private.
Felix, I've attached a patch to fix the problem (which I accidentally made public, but that's fine). attachment 23091 [details] Can you try out this patch and let us know if it works for you? * Can you test the normal workflow? * Can you also test 3+ constraints or anything else that you'd like to test? There actually were a couple problems - what you reported, but also the rebooting logic never handled 3+ constraints. I've done testing and it has started through our peer-review process, but we'd like to make sure we aren't breaking anything for you. Thanks
Hi Marshall, It doesn't seem to work with the normal workflow, but I suspect the issue is simple, just a missing conversion between "&" and "," as the separator between constraints With the patch applied, when I run the command in the original scenario: $ srun -A admin -N4 --constraint nps=4,mig=on bash -c 'echo $SLURM_NODELIST' I indeed see that only one node is rebooting (node-1), but it doesn't call the helper binaries as it is confused by the features being passed: slurmd: Node reboot request with features nps=4,mig=on being processed slurmd: node_features/helpers: node_features_p_node_set: skipping unregistered feature "nps=4,mig=on" The separator in this function is expected to be &: https://github.com/SchedMD/slurm/blob/4af095d4aa99c956ead1bf59785b3189eab196f3/src/plugins/node_features/helpers/node_features_helpers.c#L499
Felix, Good find. If you change that ampersand to a comma, does everything work for you? Basically, slurmctld has expected reboot features to be comma separated, so that's what I did in the patch that I gave you, but I missed that spot that you pointed out. However, I want to make sure that making reboot features comma separated doesn't break anything for you. If there is a problem with comma separated reboot features, then we can also allow features to be ampersand separated. Thanks for your testing.
Comma as a separator is a tiny bit better to deal with in bash, as you don't need to quote the string nps=4,mig=on, compared to nps=4&mig=on (bash would background the command and set the variable mig=on). There seems to be another problem with the patch after fixing the ampersand. If I ask for nps=4,mig=on, it will not reboot the nodes when it should have: sinfo -a -o "%20N | %20b | %40f | %T" srun -A admin -N4 --constraint 'nps=4&mig=on' bash -c 'echo $SLURM_NODELIST' sinfo -a -o "%20N | %20b | %40f | %T" NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES | STATE node-[1-4] | nps=1,mig=on | nps=4,nps=2,nps=1,mig=off,mig=on | idle node-[1-4] node-[1-4] node-[1-4] node-[1-4] NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES | STATE node-[1-4] | nps=1,mig=on | nps=4,nps=2,nps=1,mig=off,mig=on | completing The srun executes immediately, without a reboot. But with just --constraint nps=4, Slurm will reboot all nodes as expected.
(In reply to Felix Abecassis from comment #22) > Comma as a separator is a tiny bit better to deal with in bash, as you don't > need to quote the string nps=4,mig=on, compared to nps=4&mig=on (bash would > background the command and set the variable mig=on). > > There seems to be another problem with the patch after fixing the ampersand. > If I ask for nps=4,mig=on, it will not reboot the nodes when it should have: Did the node(s) already have one of these features set as active? In my testing, the nodes do reboot if they do not have any of the required features. But, I did just find an issue where if the node has one of the requested features, then it doesn't reboot. I'll look into this. Also, I'm looking into getting this (rebooting with multiple features) added into our expect test for the node_features/helpers plugin.
> Did the node(s) already have one of these features set as active? It was on a single machine with multiple slurmd, so the line is a bit blurry as the "state" of each fake feature is actually stored in a file: $ cat /etc/slurm/node_features_helpers/nps.state nps=4 But it was working without the patch, and sinfo is showing that Slurm believes that all nodes are in mode nps=1 before and after the job requesting nps=4.
But the mig feature was present on all nodes already, yes.
(In reply to Felix Abecassis from comment #25) > But the mig feature was present on all nodes already, yes. Okay, thanks for confirming. It was another mistake in my patch and was easy enough to fix. I've added to to my notes of another thing we need to test. I'll do some more testing (and try to be more thorough this time!) and upload a new patch tomorrow.
Felix, I've made attachment attachment 23179 [details] (21.08 v5) public. This is the latest version of my patch. Make sure it's 21.08 v5. I tested: * No active features * One active feature - on one node - on two nodes * Two active features - on one node - on two nodes submitting a job to * One node * Two nodes I ran combinations of the above. I checked to see if nodes that were supposed to reboot were queued for rebooting, and that nodes that were not supposed to reboot were not queued for rebooting. I also checked for the "skipping unregistered feature" log (to make sure that the script ran), and I ran scontrol show node to make sure that the features were actually updated. Could you test this patch?
Yup, everything seems to be working great now! Thank you! I suppose this fix will be in 22.05, is that correct?
Awesome, thanks for confirming that! I hope this will go into 21.08. No feature changes were involved. I'll let you know what our final decision is, though.
Felix, We've pushed this to github and the fixes will be part of the 21.08.6 release. Thanks for reporting this! Closing as resolved/fixed. Commits: 9367f592cc NEWS for the previous 3 commits 9548cd9d56 Make build_active_feature_bitmap2() handle more than two features e5fbe27e40 node_features/helpers - fix handling reboot with multiple job features 12aab56696 Remove unnecessary xstrdup()