Summary: | Make sinfo -t distinguish between node states idle (responding) and idle~ (powered_down) | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | User Commands | Assignee: | Marcin Stolarek <cinek> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | brian, tim |
Version: | 24.05.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://support.schedmd.com/show_bug.cgi?id=22307 https://support.schedmd.com/show_bug.cgi?id=22585 | ||
Site: | DTU Physics | Version Fixed: | 25.05.0rc1 |
Ticket Depends on: | 22307 | ||
Ticket Blocks: |
Description
Ole.H.Nielsen@fysik.dtu.dk
2024-11-15 02:15:34 MST
Ole,

On one hand, lexically, a powered-down node won't respond; on the other hand, "responding" and "powered down" are separate bit flags for nodes.
I understand your use case; however, I'm hesitant to attempt to change the behavior, since it isn't a clear bug and .

I'll discuss how we can approach that with our CTO and let you know.

cheers,
Marcin

I'm out of the office, back on November 25.

Best regards,
Ole Holm Nielsen

Hi Marcin,

(In reply to Marcin Stolarek from comment #1)
> On one hand lexically powered down node won't respond, on the other hand
> "responding" and "powered down" are separate bit flags for nodes.
> I understand your use case, however, I'm hesitant in attempting to change
> the behavior, since it isn't a clear bug and .
>
> I'll discuss how we can approach that with our CTO and let you know.

Can I ask if you've had any progress in resolving this issue? I think it would be good to fix the sinfo command as requested.

Thanks,
Ole

Ole,

I'll let you know when we have a decision on the approach here. I know it has been a while, but we had SC and the next Slurm release in the meantime, which usually means a busy time for our senior staff.

cheers,
Marcin

Can I ask if you've had any progress in resolving this issue? I think it would be good to fix the sinfo command as requested.

Thanks,
Ole

Marcin is out of the office this week; however, he is currently investigating a few ways to approach the issue. I will have him update you first thing next week when he is back in the office.

Ole,

I'm sorry it takes so much time. This is one of the cases where the actual implementation probably won't be very complicated, but we want to make sure we're moving in the right direction.

I'm in touch with our CTO on that. We have to discuss the approach to make sure we're not introducing a breaking change in behavior and that we're not solving only a specific case of a more general issue. I'll keep you posted on that.

cheers,
Marcin

Hi Marcin,

(In reply to Marcin Stolarek from comment #10)
> I'm sorry it takes so much time. This is one of the cases where actual
> implementation probably won't be very complicated, but we want to make sure
> we're moving into right direction.
>
> I'm in touch with our CTO on that, we have to discuss the approach to make
> sure we're not introducing a breaking change in behavior or if we're not
> solving only a specific case of a more general issue.

At SC24 I had a chat with Skyler and Tim Mullen about this issue, and the comment was that it may be an oversight in the power saving plugin. This sounds quite plausible to me.

IHTH,
Ole

Ole,
I know it took us a while, but our conclusion is to add the ability to negate a state in the list given to `-t`, so your specific case can be handled by a command like:
>sinfo -t "idle,~powered_down".
I'll take care of the development and will let you know when the improvement is merged.
cheers,
Marcin
Hi Marcin,

(In reply to Marcin Stolarek from comment #15)
> I know it took us a while, but our conclusion is to add the ability to
> negate the state in the list given to `-t`, so your specific case can be
> handled by a command like:
> >sinfo -t "idle,~powered_down".
>
> I'll take care of the development and will let you know when the improvement
> is merged.

That sounds like an excellent solution! It still leaves the question of what the intended function of the "--responding" flag is; I can't make sense of it.

The ~powered_down seems to be a new node state which isn't available at present. Could you make sure that it also gets added to 24.05 (which we run at this time) and maybe also to 23.11?

Thanks,
Ole

Node states in Slurm are quite a complex topic. It's not that easy to plot a state diagram for them, since many of them are more like "flags". In these terms, the "NO_RESPOND"[1] bit isn't set for powered-down nodes. I understand that from the language perspective it seems unnatural to call a powered-down computer responding; however, from the code perspective it's simply a boolean value used for decisions. The way --responding is implemented relies directly on that value[2]. This code is very old, and we generally avoid behavior changes if they aren't a clear bug: it's hard to determine whether changing it would break scripts that other people (like you :) have developed over the years.

I can't commit to anything in terms of the version the changes will be included in. We only do that for paid developments. Since it's a new feature, it will be targeted to master, so the earliest possible release is 25.05.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/blob/slurm-24-11-1-1/slurm/slurm.h#L1014
[2]https://github.com/SchedMD/slurm/blob/slurm-24.11/src/sinfo/sinfo.c#L677-L678

Ole,
The discussed changes were merged to our master branch (commits: 89e67427..f9e99750) and are going to be released as part of the next Slurm major release (version 25.05).
Are you able to build Slurm from the master branch (in a test environment) to verify that it allows the desired selection of nodes?
The sinfo manual with the new feature documented looks like:
>-t, --states=<states>
> List nodes only having the given state(s). Multiple states may be comma separated and the comparison is case insensitive. If the states are separated by '+', then the nodes must be in
> all states. The state can be prefixed with '~' which will invert the result of match. Possible values include (case insensitive): ALLOC, ALLOCATED, BLOCKED, CLOUD, COMP, COMPLETING,
> DOWN, DRAIN (for node in DRAINING or DRAINED states), DRAINED, DRAINING, FAIL, FUTURE, FUTR, IDLE, MAINT, MIX, MIXED, NO_RESPOND, NPC, PERFCTRS, PLANNED, POWER_DOWN, POWERING_DOWN,
> POWERED_DOWN, POWERING_UP, REBOOT_ISSUED, REBOOT_REQUESTED, RESV, RESERVED, UNK, and UNKNOWN. By default nodes in the specified state are reported whether they are responding or not.
> The --dead and --responding options may be used to filter nodes by the corresponding flag.
Since the changes are limited to the sinfo command, it should be relatively safe to backport them to an older Slurm version locally, if needed.
cheers,
Marcin
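
The matching rules in the manual excerpt above (comma means any of the states, '+' means all of them, '~' inverts a single match, comparison is case insensitive) can be modelled in a few lines. This is only a sketch of the documented wording, not Slurm's implementation: the function names `parse_state_filter` and `node_matches` are hypothetical, and how the released sinfo combines '~' with ',' versus '+' should be checked against the 25.05 man page.

```python
# Illustrative model of the documented -t semantics; NOT Slurm's code.
# parse_state_filter and node_matches are hypothetical names.

def parse_state_filter(spec):
    """Split 'a,b+~c' into alternatives; each alternative is a list of
    (STATE_NAME, negated) terms that must all hold."""
    alternatives = []
    for alternative in spec.split(","):
        terms = []
        for token in alternative.split("+"):
            token = token.strip()
            negated = token.startswith("~")
            name = token[1:] if negated else token
            terms.append((name.upper(), negated))
        alternatives.append(terms)
    return alternatives


def node_matches(node_states, spec):
    """Return True if a node carrying node_states passes the filter spec."""
    states = {s.upper() for s in node_states}
    return any(
        # A plain term needs the state present; a '~' term needs it absent.
        all((name in states) != negated for name, negated in terms)
        for terms in parse_state_filter(spec)
    )


if __name__ == "__main__":
    awake = {"IDLE"}
    asleep = {"IDLE", "POWERED_DOWN"}
    print(node_matches(awake, "idle+~powered_down"))
    print(node_matches(asleep, "idle+~powered_down"))
```

Under this reading, the '+' form is the one that requires a node to be both idle and not powered down; the sketch cannot confirm whether the comma-separated form quoted earlier behaves the same way in the released command.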
Hi Marcin,

Thanks a lot for the update:

(In reply to Marcin Stolarek from comment #19)
> The discussed changes were merged to our master branch (commits:
> 89e67427..f9e99750) and are going to be released as part of the next Slurm
> major release (version 25.05).
>
> Are you able to build Slurm from master branch (in test environment) to
> verify if it allows desired selection of nodes?

We unfortunately don't have a test environment for trying out the master branch.

> The manual of sinfo with the new features documented looks like:
> >-t, --states=<states>
> > List nodes only having the given state(s). Multiple states may be comma separated and the comparison is case insensitive. If the states are separated by '+', then the nodes must be in
> > all states. The state can be prefixed with '~' which will invert the result of match. Possible values include (case insensitive): ALLOC, ALLOCATED, BLOCKED, CLOUD, COMP, COMPLETING,
> > DOWN, DRAIN (for node in DRAINING or DRAINED states), DRAINED, DRAINING, FAIL, FUTURE, FUTR, IDLE, MAINT, MIX, MIXED, NO_RESPOND, NPC, PERFCTRS, PLANNED, POWER_DOWN, POWERING_DOWN,
> > POWERED_DOWN, POWERING_UP, REBOOT_ISSUED, REBOOT_REQUESTED, RESV, RESERVED, UNK, and UNKNOWN. By default nodes in the specified state are reported whether they are responding or not.

The new prefix '~' will become useful, but people have to discover that it exists.

> > The --dead and --responding options may be used to filter nodes by the corresponding flag.

The meaning of --responding and --dead is still completely illogical to me! There is no hint as to the meaning of --responding. Can you perhaps expand the documentation to explain what --responding does, and why POWERED_DOWN nodes seem to be "responding"?

> Since the changes are limited to the sinfo command, it should be relatively
> safe to backport them to an older Slurm version locally, if needed.

Thanks, but for simplicity I prefer to wait for 25.05 later this year.

Best regards,
Ole

>We unfortunately don't have a test environment for trying out the master branch.

Understood. Just in case you don't know, it's possible to run a whole Slurm "cluster" on a single PC by building it with the configure option --enable-multiple-slurmd.

>The meaning of --responding and --dead is still completely illogical to me! [...]--responding does, and why POWERED_DOWN nodes seem to be "responding"?

I understand your frustration with the --responding and --dead options. Let me try to clarify --responding first. The option selects nodes that slurmctld has not marked as not responding. Slurmctld marks a node "not responding" when it expects the slurmd to reply and it does not. Following that, the `--dead` option means that the node was assigned the NO_RESPOND flag. It's a little bit like non-dual logic, so the --responding option is really ~(NO_RESPOND), which isn't fully justified. A human-world analogy may be that when a person lies down and doesn't respond, she isn't necessarily dead, but may be.

I'll check with our "Docs team" on how to make it clearer in the sinfo man page.

cheers,
Marcin

Ole,

We've merged an improvement to the docs, 2445007c[1], that should cover the remaining issue. I hope you'll find it more appropriate.

Is there anything else I can help you with in the ticket?

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/2445007c681fc7013fc5f0621f4bd2c5e3c07f7c

Please let me know if you've had a chance to review my last message.
If the issue is now resolved or if you no longer need assistance, please let me know as well so I can close the ticket.

cheers,
Marcin

I'm out of the office, back on June 6.

Best regards,
Ole Holm Nielsen

Dear Marcin,

(In reply to Marcin Stolarek from comment #23)
> Please let me know if you've had a chance to review my last message.
> If the issue is now resolved or if you no longer need assistance, please let
> me know as well so I can close the ticket.

Thanks very much for making this solution! We look forward to testing 25.05 soon. Please close this ticket.

Best regards,
Ole

Thanks for the confirmation.
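
The explanation of --responding and --dead given in this thread comes down to testing a single flag bit, independent of whether the node is powered down. A minimal sketch of that idea follows. The bit positions and the function names `passes_responding` and `passes_dead` are hypothetical and chosen only for illustration; the real flag definitions are in slurm/slurm.h.

```python
# Hypothetical bit positions for illustration; the real values are
# defined in slurm/slurm.h and are NOT reproduced here.
NO_RESPOND = 1 << 0
POWERED_DOWN = 1 << 1


def passes_responding(flags):
    """--responding as described in the thread: NO_RESPOND bit is clear."""
    return not (flags & NO_RESPOND)


def passes_dead(flags):
    """--dead as described in the thread: NO_RESPOND bit is set."""
    return bool(flags & NO_RESPOND)


if __name__ == "__main__":
    # A powered-down node that slurmctld never marked as not responding.
    powered_down_node = POWERED_DOWN
    print(passes_responding(powered_down_node))
    print(passes_dead(powered_down_node))
```

Because the two bits are independent, a powered-down node whose NO_RESPOND bit was never set still passes the --responding test in this model, which is the behavior that prompted the ticket.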