We would like to write SuspendProgram and ResumeProgram scripts as documented in https://slurm.schedmd.com/power_save.html. We have different types of node hardware whose power states have to be managed differently:

1. Nodes with a BMC NIC interface where "ipmitool chassis power ..." commands can be used.
2. Nodes where the BMC cannot be used for powering up, because the shared NICs go down when the node is off 🙁
3. Cloud nodes where special cloud CLI commands must be used (such as the Azure CLI).

I was thinking of adding a node feature to indicate the kind of power control mechanism available, for example along these lines for the 3 cases above:

NodeName=node001 Feature=power_ipmi
NodeName=node002 Feature=power_none
NodeName=node003 Feature=power_azure

At SC22 Skyler told me that SuspendProgram and ResumeProgram *actually* receive an undocumented 2nd argument after the hostlist 1st argument. The 2nd argument would contain some node feature information which I need for the purpose described above.

Question: Can you please document the 2nd argument in the slurm.conf man page and web pages? What is the minimum Slurm version where this is available?

Thanks,
Ole
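A suspend/resume wrapper built on such features could dispatch on the feature tag. Here is a minimal sketch of the dispatch logic only; the feature names match the ones proposed above, the helper function name is hypothetical, and the real ipmitool/az invocations are represented by the echoed strings rather than executed:

```shell
#!/bin/sh
# Sketch of a ResumeProgram dispatcher keyed on a per-node "power_*" feature.
# The helper name is hypothetical; real power commands are only echoed here.

power_cmd_for_feature() {
    # Map a node's power feature to the action we would take.
    case "$1" in
        power_ipmi)  echo "ipmitool chassis power on" ;;
        power_azure) echo "az vm start" ;;
        power_none)  echo "skip" ;;   # BMC unusable while the node is off
        *)           echo "unknown" ;;
    esac
}
```

In a real script the returned command would be run against the node's BMC address or cloud instance name, per node in the hostlist.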
I'm out of the office, back on Nov. 21.

Best regards,
Ole Holm Nielsen
Hi Ole,

I looked into the code surrounding the suspend and resume scripts and I don't see that they are able to take a second argument. I talked with Skyler about this and he said that he was conflating the suspend and resume script actions with the reboot actions, which do allow for a second argument. You can see that here:
https://github.com/SchedMD/slurm/blob/510ba4f17dfa559b579aa054cb8a415dcc224abc/src/slurmctld/power_save.c#L624
He apologizes for the confusion.

It sounds like your intent is to have the suspend and resume scripts take different actions based on the node they're acting on. Is that right? If so, Skyler recommended using an external node map to keep track of how to handle different types of nodes. If that's not feasible because things change too often, then you might be able to get feature information about the node being acted on with a call to 'scontrol show node' in the script. This is less ideal, since that can be an expensive call to the scheduler and can add stress to the system if there are a lot of these events.

Let me know if that helps.

Thanks,
Ben
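The external node map Ben suggests could be as simple as a flat file consulted by the scripts. A minimal sketch, assuming a hypothetical one-line-per-node "nodename power-method" format (a temporary file stands in for a fixed path such as a file under /etc/slurm, so the example is self-contained):

```shell
# Sketch of an external node map: one "nodename power-method" pair per line.
# The map location and format are assumptions, not anything Slurm defines.
MAP=$(mktemp)
cat > "$MAP" <<'EOF'
node001 ipmi
node002 none
node003 azure
EOF

# Look up how to power-manage a given node.
power_method() {
    awk -v n="$1" '$1 == n { print $2 }' "$MAP"
}
```

This avoids any RPC to slurmctld, at the cost of keeping the map in sync with the cluster by hand or via configuration management.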
Hi Ben,

(In reply to Ben Roberts from comment #5)
> He apologizes for the confusion. It sounds like your intent with this is to
> have the suspend and resume scripts take different actions based on the node
> it's acting on. Is that right? If so, Skyler recommended using an external
> node map to keep track of how to handle different types of nodes. If that's
> not feasible because things are changing too often then you might be able to
> get feature information about the node being acted on with a call to
> 'scontrol show node' in the script. This is not as ideal since that can be
> an expensive call to the scheduler, and can add stress to the system if
> there are a lot of these events.

Thanks for this suggestion for the suspend and resume scripts. However, it seems to me that a separate node map file would be difficult to maintain. A partition map would be a little simpler. I don't expect a lot of node suspend and resume events, so perhaps the load on slurmctld from the scripts would be negligible?

Instead of using 'scontrol show node' in the script, would this command perhaps be a little more lightweight:

$ sinfo -hN -O Nodelist,Features --nodes=<hostlist>

Another question: If the suspend and resume scripts should fail for whatever reason (programming error, unexpected response from the node, node not booting up, etc.), how will slurmctld respond in such cases? The manual https://slurm.schedmd.com/power_save.html does not describe whether any return codes can be passed back to slurmctld, or what happens if the script crashes.

Thanks,
Ole
Hi Ole,

If you have these different types of nodes divided by partition, that would make things easier to track. I agree that if you're not constantly suspending and resuming nodes, the stress placed on the server is not going to be significant.

As far as using sinfo rather than scontrol, the requests going to the controller are the same, so you can use whichever one you prefer to get that information. They both issue a REQUEST_PARTITION_INFO and REQUEST_NODE_INFO RPC, as seen below. Note that the REQUEST_STATS_INFO count goes up too, due to the 'sdiag' request.

$ sdiag | grep -A 6 "Remote Procedure Call statistics by message"
Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:954 ave_time:170  total_time:162698
        REQUEST_JOB_INFO                 ( 2003) count:941 ave_time:207  total_time:194958
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:285 ave_time:1113 total_time:317293
        REQUEST_STATS_INFO               ( 2035) count:17  ave_time:217  total_time:3689
        REQUEST_NODE_INFO                ( 2007) count:13  ave_time:322  total_time:4195
        REQUEST_PING                     ( 1008) count:2   ave_time:203  total_time:406

$ scontrol show nodes node01 > /dev/null
$ sdiag | grep -A 6 "Remote Procedure Call statistics by message"
Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:955 ave_time:170  total_time:162843
        REQUEST_JOB_INFO                 ( 2003) count:941 ave_time:207  total_time:194958
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:285 ave_time:1113 total_time:317293
        REQUEST_STATS_INFO               ( 2035) count:18  ave_time:217  total_time:3912
        REQUEST_NODE_INFO                ( 2007) count:14  ave_time:325  total_time:4554
        REQUEST_PING                     ( 1008) count:2   ave_time:203  total_time:406

$ sinfo -N --nodes=node01,node02 > /dev/null
$ sdiag | grep -A 6 "Remote Procedure Call statistics by message"
Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:956 ave_time:170  total_time:163098
        REQUEST_JOB_INFO                 ( 2003) count:941 ave_time:207  total_time:194958
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:285 ave_time:1113 total_time:317293
        REQUEST_STATS_INFO               ( 2035) count:19  ave_time:215  total_time:4097
        REQUEST_NODE_INFO                ( 2007) count:15  ave_time:322  total_time:4839
        REQUEST_PING                     ( 1008) count:2   ave_time:203  total_time:406

Let me look some more into the second half of your question and get back to you.

Thanks,
Ben
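If you want to watch this load from a monitoring script, the per-RPC counts can be scraped out of sdiag's output. A minimal sketch against a canned two-line excerpt in the same format (a live `sdiag` call would replace the canned text):

```shell
# Sketch: pull the count for a given RPC type out of sdiag-style output.
# Canned text stands in for a live "sdiag" call.
SDIAG_OUT='REQUEST_NODE_INFO ( 2007) count:14 ave_time:325 total_time:4554
REQUEST_PING ( 1008) count:2 ave_time:203 total_time:406'

rpc_count() {
    # Field 4 is "count:N"; strip the prefix and print N.
    printf '%s\n' "$SDIAG_OUT" | awk -v rpc="$1" \
        '$1 == rpc { sub("count:", "", $4); print $4 }'
}
```

Sampling this before and after a suspend/resume cycle gives the number of REQUEST_NODE_INFO RPCs the scripts generated.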
Hi Ole,

I looked into whether there was any mechanism to handle cases where the script crashes or returns different types of exit codes, but this is not something we look for. When resuming nodes, what we look for is slurmd responding to the controller within the timeout period. If that doesn't happen, the node is marked as DOWN. If the script used to resume the node fails, the exit code will be recorded in the slurmctld.log file for your review. This is described in the documentation with the following sentence:

If the slurmd daemon fails to respond within the configured ResumeTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued.

The suspension of nodes is similar, in that we don't handle different exit codes for the suspend script. If the script fails, we assume the node did power down, and it will be available again to be scheduled after the SuspendTimeout. The exit code of the script will be logged for review.

Let me know if you have questions about this.

Thanks,
Ben
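Since slurmctld only records the script's exit code and otherwise relies on the ResumeTimeout/SuspendTimeout machinery, the script itself can keep its own failure log for debugging. A minimal sketch of such a wrapper (the log location is an assumption, shown here as a temporary file, and `false` stands in for the real power command):

```shell
# Sketch: run the real power action and record its exit status ourselves,
# since slurmctld only notes the exit code and then waits on its timeouts.
LOG=$(mktemp)   # a real script would use a fixed log file path

run_logged() {
    "$@"
    rc=$?
    echo "$(date '+%F %T') cmd='$*' rc=$rc" >> "$LOG"
    return $rc
}

# 'false' stands in for e.g. an ipmitool call that fails; rc=1 gets logged.
run_logged false || true
```

A record like this makes it easy to correlate "not resumed by ResumeTimeout" messages in slurmctld.log with what the script actually attempted.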
Hi Ole,

I hope you had a great SC22. Thanks for stopping by the booth.

I have been looking at this due to another ticket, and I can really see the gap that currently exists in Slurm around making it convenient for the ResumeProgram and SuspendProgram to handle jobs and nodes using data directly from Slurm. It would be much appreciated if you would open an RFE to help improve this area of Slurm!

Just some of my current thoughts on how to improve this:
- add node information (e.g. node features, extra) to SLURM_RESUME_FILE
- add SLURM_SUSPEND_FILE, which also contains node information (e.g. node features, extra)

This way the <Resume|Suspend>Program can use Slurm data to influence handling of the node.

Cheers,
Skyler
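On the consuming side, a SuspendProgram could pick such data up via the environment. A minimal sketch of that side only; since the proposed SLURM_SUSPEND_FILE does not exist yet, both the variable and the file contents here are invented stand-ins, with mktemp simulating the file slurmctld would provide:

```shell
# Sketch of the consuming side of the proposed SLURM_SUSPEND_FILE.
# The variable and the file contents are hypothetical (see the RFE idea above);
# mktemp simulates the file that slurmctld would hand to the script.
SLURM_SUSPEND_FILE=$(mktemp)
echo 'node002 power_none' > "$SLURM_SUSPEND_FILE"

suspend_info=""
if [ -n "${SLURM_SUSPEND_FILE:-}" ] && [ -r "$SLURM_SUSPEND_FILE" ]; then
    suspend_info=$(cat "$SLURM_SUSPEND_FILE")
fi
```

With the node features available in the file, no sinfo/scontrol RPC to slurmctld would be needed at suspend time.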
Hi Ben,

(In reply to Ben Roberts from comment #10)
> I looked into whether there was any mechanism to handle cases where the
> script crashed or returned different types of exit codes, but this is not
> something we look for. When resuming nodes, the thing we look for is for
> slurmd to respond to the controller within the timeout period. If that
> doesn't happen then the node is marked as DOWN. If the script used to
> resume the node fails then the exit code will be recorded in the
> slurmctld.log file for your review. This is described in the
> documentation with the following sentence:
>
> If the slurmd daemon fails to respond within the configured ResumeTimeout
> value with an updated BootTime, the node will be placed in a DOWN state and
> the job requesting the node will be requeued.

I have enabled power_save on one of our partitions now. I do see the behavior that you describe, so all is well. For example:

[2022-12-05T06:00:59.441] node x140 not resumed by ResumeTimeout(1000) - marking down and power_save
[2022-12-05T09:28:59.839] node x081 not resumed by ResumeTimeout(1000) - marking down and power_save

There does not seem to be any exit code logged.

> The suspension of nodes is similar where we don't handle different exit
> codes for the suspend script. If the script fails we assume the node did
> power down and will be available again to be scheduled after the
> SuspendTimeout. The exit code of the script will be logged for review.

Thanks, that makes sense.

Ole
Hi Skyler,

(In reply to Skyler Malinowski from comment #11)
> I hope you had a great SC22. Thanks for stopping by the booth.

Thanks for the nice discussions at SC22.

> I have been looking at this due to another ticket and can really see the gap
> that exists in Slurm currently around making it convenient to the
> ResumeProgram and SuspendProgram to handle jobs and nodes using data
> directly from Slurm. It would be much appreciated if you would open an RFE
> to help improve this area of Slurm!
>
> Just for some of my current thoughts of how to improve this:
> - add node information (e.g. node features, extra) to SLURM_RESUME_FILE
> - add SLURM_SUSPEND_FILE which also contains node information (e.g. node
> features, extra)
> This way the <Resume|Suspend>Program can use Slurm data to influence
> handling of the node.

Thanks for the suggestion. I have created an RFE in bug 15550 now. Let's see if it gets accepted.

FYI: I have published my power_save scripts on GitHub at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save where the nodesuspend and noderesume scripts use an "sinfo" command to obtain the node features of the nodes being handled:

$ sinfo -hN -O "Nodelist: ,Features:" --nodes=$* > $TMPFILE

It would be better if we could avoid such queries to slurmctld and use SLURM_RESUME_FILE and a new SLURM_SUSPEND_FILE instead.

Best regards,
Ole
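The $TMPFILE produced by that sinfo command can then drive per-feature handling in nodesuspend/noderesume. A minimal sketch using canned sinfo-style output ("nodename features" per line) in place of the live query, with the same power_* feature names as proposed earlier in this ticket:

```shell
# Sketch: group nodes by their power_* feature from sinfo-style output.
# A here-document replaces the live "sinfo -hN -O ..." call for this example.
TMPFILE=$(mktemp)
cat > "$TMPFILE" <<'EOF'
node001 xeon24,power_ipmi
node002 xeon40,power_none
node003 cloud,power_azure
EOF

# List the nodes whose feature list contains the given feature.
nodes_with_feature() {
    awk -v f="$1" '$2 ~ f { print $1 }' "$TMPFILE"
}
```

The script can then loop over `nodes_with_feature power_ipmi`, `power_azure`, etc., and run the matching power commands on each group.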
Hi Ole, Since you have opened a feature request for this in bug 15550 I'll go ahead and close this ticket. Please let us know if you need anything in the future. Thanks, Ben