| Summary: | node_features rebooting nodes already in the correct state with 2+ constraints | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
| Component: | Scheduling | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | dwightman, jbernauer, lyeager |
| Version: | 21.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NVIDIA (PSLA) | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 21.08.6 22.05.0pre1 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: |
v3 21.08 - add NEWS
21.08 v5 |
||
|
Description
Felix Abecassis
2021-12-08 11:53:10 MST
Hi Doug, This may be fixed by the work Brian and I did in bug 12366. Because we considered it a behavioral change, we only put the commits into the master branch. It included some fixes for bug 12366 as well as a refactor of some logic for rebooting nodes due to node_features or power save. Could you please run your tests on the current master branch and let me know if it fixes it or not? If it isn't fixed, can you upload the results of your test from the master branch? Sorry Felix, I saw Doug's name in the node features commits and for some reason it stuck in my head. :) Hi Marshall! No worries. I still see the problem with Slurm master on this commit: 608264b67c Minor formating issues Nodes 2, 3, 4 show that a reboot is initiated for features that are already in FeaturesActive: ``` slurmd: CPUs=12 Boards=1 Sockets=1 Cores=6 Threads=2 Memory=32013 TmpDisk=120025 Uptime=761863 CPUSpecList=(null) FeaturesAvail=nps=4,nps=2,nps=1,mig=off,mig=on FeaturesActive=nps=4,mig=on slurmd: Node reboot request with features nps=4&mig=on being processed ``` In addition, on this branch the active features go into a weird state after the reboot of all nodes: ``` $ sinfo -a -o "%10N | %30b | %f" NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES node-[1-4] | nps=4,mig=on,nps=4&mig=on | nps=4,nps=2,nps=1,mig=off,mig=on ``` I'm wondering if this is the issue that was mentioned in https://bugs.schedmd.com/show_bug.cgi?id=12068#c18 (now hidden to us). Thanks for the additional information and for testing it on master. I'll look into it and keep you updated on my progress. Just letting you know I've reproduced this and identified the problem. I have a partial fix already. I'll keep you updated on my progress. Thanks for the detailed reproducer. Felix, please ignore that last comment. It was meant to be private. Felix,
I've attached a patch to fix the problem (which I accidentally made public, but that's fine).
attachment 23091 [details]
Can you try out this patch and let us know if it works for you?
* Can you test the normal workflow?
* Can you also test 3+ constraints or anything else that you'd like to test?
There actually were a couple problems - what you reported, but also the rebooting logic never handled 3+ constraints.
I've done testing and it has started through our peer-review process, but we'd like to make sure we aren't breaking anything for you.
Thanks
Hi Marshall, It doesn't seem to work with the normal workflow, but I suspect the issue is simple, just a missing conversion between "&" and "," as the separator between constraints With the patch applied, when I run the command in the original scenario: $ srun -A admin -N4 --constraint nps=4,mig=on bash -c 'echo $SLURM_NODELIST' I indeed see that only one node is rebooting (node-1), but it doesn't call the helper binaries as it is confused by the features being passed: slurmd: Node reboot request with features nps=4,mig=on being processed slurmd: node_features/helpers: node_features_p_node_set: skipping unregistered feature "nps=4,mig=on" The separator in this function is expected to be &: https://github.com/SchedMD/slurm/blob/4af095d4aa99c956ead1bf59785b3189eab196f3/src/plugins/node_features/helpers/node_features_helpers.c#L499 Felix, Good find. If you change that ampersand to a comma, does everything work for you? Basically, slurmctld has expected reboot features to be comma separated, so that's what I did in the patch that I gave you, but I missed that spot that you pointed out. However, I want to make sure that making reboot features comma separated doesn't break anything for you. If there is a problem with comma separated reboot features, then we can also allow features to be ampersand separated. Thanks for your testing. Comma as a separator is a tiny bit better to deal with in bash, as you don't need to quote the string nps=4,mig=on, compared to nps=4&mig=on (bash would background the command and set the variable mig=on). There seems to be another problem with the patch after fixing the ampersand. If I ask for nps=4,mig=on, it will not reboot the nodes when it should have: sinfo -a -o "%20N | %20b | %40f | %T" srun -A admin -N4 --constraint 'nps=4&mig=on' bash -c 'echo $SLURM_NODELIST' sinfo -a -o "%20N | %20b | %40f | %T" NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES | STATE node-[1-4] | nps=1,mig=on | nps=4,nps=2,nps=1,mig=off,mig=on | idle node-[1-4] node-[1-4] node-[1-4] node-[1-4] NODELIST | ACTIVE_FEATURES | AVAIL_FEATURES | STATE node-[1-4] | nps=1,mig=on | nps=4,nps=2,nps=1,mig=off,mig=on | completing The srun executes immediately, without a reboot. But with just --constraint nps=4, Slurm will reboot all nodes as expected. (In reply to Felix Abecassis from comment #22) > Comma as a separator is a tiny bit better to deal with in bash, as you don't > need to quote the string nps=4,mig=on, compared to nps=4&mig=on (bash would > background the command and set the variable mig=on). > > There seems to be another problem with the patch after fixing the ampersand. > If I ask for nps=4,mig=on, it will not reboot the nodes when it should have: Did the node(s) already have one of these features set as active? In my testing, the nodes do reboot if they do not have any of the required features. But, I did just find an issue where if the node has one of the requested features, then it doesn't reboot. I'll look into this. Also, I'm looking into getting this (rebooting with multiple features) added into our expect test for the node_features/helpers plugin. > Did the node(s) already have one of these features set as active?
It was on a single machine with multiple slurmd, so the line is a bit blurry as the "state" of each fake feature is actually stored in a file:
$ cat /etc/slurm/node_features_helpers/nps.state
nps=4
But it was working without the patch, and sinfo is showing that Slurm believes that all nodes are in mode nps=1 before and after the job requesting nps=4.
But the mig feature was present on all nodes already, yes. (In reply to Felix Abecassis from comment #25) > But the mig feature was present on all nodes already, yes. Okay, thanks for confirming. It was another mistake in my patch and was easy enough to fix. I've added to to my notes of another thing we need to test. I'll do some more testing (and try to be more thorough this time!) and upload a new patch tomorrow. Felix,
I've made attachment attachment 23179 [details] (21.08 v5) public. This is the latest version of my patch. Make sure it's 21.08 v5.
I tested:
* No active features
* One active feature
- on one node
- on two nodes
* Two active features
- on one node
- on two nodes
submitting a job to
* One node
* Two nodes
I ran combinations of the above.
I checked to see if nodes that were supposed to reboot were queued for rebooting, and that nodes that were not supposed to reboot were not queued for rebooting. I also checked for the "skipping unregistered feature" log (to make sure that the script ran), and I ran scontrol show node to make sure that the features were actually updated.
Could you test this patch?
Yup, everything seems to be working great now! Thank you! I suppose this fix will be in 22.05, is that correct? Awesome, thanks for confirming that! I hope this will go into 21.08. No feature changes were involved. I'll let you know what our final decision is, though. Felix, We've pushed this to github and the fixes will be part of the 21.08.6 release. Thanks for reporting this! Closing as resolved/fixed. Commits: 9367f592cc NEWS for the previous 3 commits 9548cd9d56 Make build_active_feature_bitmap2() handle more than two features e5fbe27e40 node_features/helpers - fix handling reboot with multiple job features 12aab56696 Remove unnecessary xstrdup() |