We would like to write SuspendProgram and ResumeProgram scripts as documented in https://slurm.schedmd.com/power_save.html. We have different types of node hardware whose power states have to be managed differently:

1. Nodes with a BMC NIC interface where "ipmitool chassis power ..." commands can be used.
2. Nodes where the BMC cannot be used for powering up, because the shared NICs go down when the node is off 🙁
3. Cloud nodes where special cloud CLI commands must be used (such as the Azure CLI).

I was thinking of adding a node feature to indicate the kind of power control mechanism available, for example along these lines for the 3 cases above:

NodeName=node001 Feature=power_ipmi
NodeName=node002 Feature=power_none
NodeName=node003 Feature=power_azure

At SC22 Skyler told me that SuspendProgram and ResumeProgram *actually* receive an undocumented 2nd argument after the hostlist 1st argument. The 2nd argument would contain some node feature information which I need for the purpose described above.

Question: Can you please document the 2nd argument in the slurm.conf man page and web pages? What is the minimum Slurm version where this is available?

Thanks,
Ole
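A suspend/resume wrapper built on such features could dispatch on the feature tag. Here is a minimal sketch of the dispatch logic only; the feature names match the ones proposed above, the helper function name is hypothetical, and the real ipmitool/az invocations are represented by the echoed strings rather than executed:

```shell
#!/bin/sh
# Sketch of a ResumeProgram dispatcher keyed on a per-node "power_*" feature.
# The helper name is hypothetical; real power commands are only echoed here.

power_cmd_for_feature() {
    # Map a node's power feature to the action we would take.
    case "$1" in
        power_ipmi)  echo "ipmitool chassis power on" ;;
        power_azure) echo "az vm start" ;;
        power_none)  echo "skip" ;;   # BMC unusable while the node is off
        *)           echo "unknown" ;;
    esac
}
```

In a real script the returned command would be run against the node's BMC address or cloud instance name, per node in the hostlist.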
I'm out of the office, back on Nov. 21.

Best regards,
Ole Holm Nielsen
Hi Ole,

I looked into the code surrounding the suspend and resume scripts and I don't see that they are able to take a second argument. I talked with Skyler about this and he said that he was conflating the suspend and resume script actions with the reboot actions, which do allow for a second argument. You can see that here:
https://github.com/SchedMD/slurm/blob/510ba4f17dfa559b579aa054cb8a415dcc224abc/src/slurmctld/power_save.c#L624
He apologizes for the confusion.

It sounds like your intent is to have the suspend and resume scripts take different actions based on the node they're acting on. Is that right? If so, Skyler recommended using an external node map to keep track of how to handle different types of nodes. If that's not feasible because things change too often, then you might be able to get feature information about the node being acted on with a call to 'scontrol show node' in the script. This is less ideal, since that can be an expensive call to the scheduler and can add stress to the system if there are a lot of these events.

Let me know if that helps.

Thanks,
Ben
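The external node map Ben suggests could be as simple as a flat file consulted by the scripts. A minimal sketch, assuming a hypothetical one-line-per-node "nodename power-method" format (a temporary file stands in for a fixed path such as a file under /etc/slurm, so the example is self-contained):

```shell
# Sketch of an external node map: one "nodename power-method" pair per line.
# The map location and format are assumptions, not anything Slurm defines.
MAP=$(mktemp)
cat > "$MAP" <<'EOF'
node001 ipmi
node002 none
node003 azure
EOF

# Look up how to power-manage a given node.
power_method() {
    awk -v n="$1" '$1 == n { print $2 }' "$MAP"
}
```

This avoids any RPC to slurmctld, at the cost of keeping the map in sync with the cluster by hand or via configuration management.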
Hi Ben,

(In reply to Ben Roberts from comment #5)
> He apologizes for the confusion. It sounds like your intent with this is to
> have the suspend and resume scripts take different actions based on the node
> it's acting on. Is that right? If so, Skyler recommended using an external
> node map to keep track of how to handle different types of nodes. If that's
> not feasible because things are changing too often then you might be able to
> get feature information about the node being acted on with a call to
> 'scontrol show node' in the script. This is not as ideal since that can be
> an expensive call to the scheduler, and can add stress to the system if
> there are a lot of these events.

Thanks for this suggestion for the suspend and resume scripts. However, it seems to me that a separate node map file would be difficult to maintain. A partition map would be a little simpler. I don't expect a lot of node suspend and resume events, so perhaps the load on slurmctld from the scripts would be negligible?

Instead of using 'scontrol show node' in the script, would this command perhaps be a little more lightweight:

$ sinfo -hN -O Nodelist,Features --nodes=<hostlist>

Another question: If the suspend and resume scripts should fail for whatever reason (programming error, unexpected response from the node, node not booting up, etc.), how will slurmctld respond in such cases? The manual https://slurm.schedmd.com/power_save.html does not describe whether any return codes can be passed back to slurmctld, or what happens if the script crashes.

Thanks,
Ole
Hi Ole,

If you have these different types of nodes divided by partition, that would make things easier to track. I agree that if you're not constantly suspending and resuming nodes, the stress placed on the server is not going to be significant.

As far as using sinfo rather than scontrol, the requests going to the controller are the same, so you can use whichever one you prefer to get that information. They both issue a REQUEST_PARTITION_INFO and REQUEST_NODE_INFO RPC, as seen below. Note that the REQUEST_STATS_INFO count goes up too, due to the 'sdiag' request.

$ sdiag | grep -A 6 "Remote Procedure Call statistics by message"
Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:954 ave_time:170  total_time:162698
        REQUEST_JOB_INFO                 ( 2003) count:941 ave_time:207  total_time:194958
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:285 ave_time:1113 total_time:317293
        REQUEST_STATS_INFO               ( 2035) count:17  ave_time:217  total_time:3689
        REQUEST_NODE_INFO                ( 2007) count:13  ave_time:322  total_time:4195
        REQUEST_PING                     ( 1008) count:2   ave_time:203  total_time:406

$ scontrol show nodes node01 > /dev/null
$ sdiag | grep -A 6 "Remote Procedure Call statistics by message"
Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:955 ave_time:170  total_time:162843
        REQUEST_JOB_INFO                 ( 2003) count:941 ave_time:207  total_time:194958
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:285 ave_time:1113 total_time:317293
        REQUEST_STATS_INFO               ( 2035) count:18  ave_time:217  total_time:3912
        REQUEST_NODE_INFO                ( 2007) count:14  ave_time:325  total_time:4554
        REQUEST_PING                     ( 1008) count:2   ave_time:203  total_time:406

$ sinfo -N --nodes=node01,node02 > /dev/null
$ sdiag | grep -A 6 "Remote Procedure Call statistics by message"
Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:956 ave_time:170  total_time:163098
        REQUEST_JOB_INFO                 ( 2003) count:941 ave_time:207  total_time:194958
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:285 ave_time:1113 total_time:317293
        REQUEST_STATS_INFO               ( 2035) count:19  ave_time:215  total_time:4097
        REQUEST_NODE_INFO                ( 2007) count:15  ave_time:322  total_time:4839
        REQUEST_PING                     ( 1008) count:2   ave_time:203  total_time:406

Let me look some more into the second half of your question and get back to you.

Thanks,
Ben
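If you want to watch this load from a monitoring script, the per-RPC counts can be scraped out of sdiag's output. A minimal sketch against a canned two-line excerpt in the same format (a live `sdiag` call would replace the canned text):

```shell
# Sketch: pull the count for a given RPC type out of sdiag-style output.
# Canned text stands in for a live "sdiag" call.
SDIAG_OUT='REQUEST_NODE_INFO ( 2007) count:14 ave_time:325 total_time:4554
REQUEST_PING ( 1008) count:2 ave_time:203 total_time:406'

rpc_count() {
    # Field 4 is "count:N"; strip the prefix and print N.
    printf '%s\n' "$SDIAG_OUT" | awk -v rpc="$1" \
        '$1 == rpc { sub("count:", "", $4); print $4 }'
}
```

Sampling this before and after a suspend/resume cycle gives the number of REQUEST_NODE_INFO RPCs the scripts generated.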
Hi Ole,

I looked into whether there was any mechanism to handle cases where the script crashes or returns different types of exit codes, but this is not something we look for. When resuming nodes, what we look for is slurmd responding to the controller within the timeout period. If that doesn't happen, the node is marked as DOWN. If the script used to resume the node fails, the exit code will be recorded in the slurmctld.log file for your review. This is described in the documentation with the following sentence:

If the slurmd daemon fails to respond within the configured ResumeTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued.

The suspension of nodes is similar, in that we don't handle different exit codes for the suspend script. If the script fails, we assume the node did power down, and it will be available again to be scheduled after the SuspendTimeout. The exit code of the script will be logged for review.

Let me know if you have questions about this.

Thanks,
Ben
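Since slurmctld only records the script's exit code and otherwise relies on the ResumeTimeout/SuspendTimeout machinery, the script itself can keep its own failure log for debugging. A minimal sketch of such a wrapper (the log location is an assumption, shown here as a temporary file, and `false` stands in for the real power command):

```shell
# Sketch: run the real power action and record its exit status ourselves,
# since slurmctld only notes the exit code and then waits on its timeouts.
LOG=$(mktemp)   # a real script would use a fixed log file path

run_logged() {
    "$@"
    rc=$?
    echo "$(date '+%F %T') cmd='$*' rc=$rc" >> "$LOG"
    return $rc
}

# 'false' stands in for e.g. an ipmitool call that fails; rc=1 gets logged.
run_logged false || true
```

A record like this makes it easy to correlate "not resumed by ResumeTimeout" messages in slurmctld.log with what the script actually attempted.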
Hi Ole,

I hope you had a great SC22. Thanks for stopping by the booth.

I have been looking at this due to another ticket, and I can really see the gap that currently exists in Slurm around making it convenient for the ResumeProgram and SuspendProgram to handle jobs and nodes using data directly from Slurm. It would be much appreciated if you would open an RFE to help improve this area of Slurm!

Just some of my current thoughts on how to improve this:
- add node information (e.g. node features, extra) to SLURM_RESUME_FILE
- add SLURM_SUSPEND_FILE, which also contains node information (e.g. node features, extra)

This way the <Resume|Suspend>Program can use Slurm data to influence handling of the node.

Cheers,
Skyler
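On the consuming side, a SuspendProgram could pick such data up via the environment. A minimal sketch of that side only; since the proposed SLURM_SUSPEND_FILE does not exist yet, both the variable and the file contents here are invented stand-ins, with mktemp simulating the file slurmctld would provide:

```shell
# Sketch of the consuming side of the proposed SLURM_SUSPEND_FILE.
# The variable and the file contents are hypothetical (see the RFE idea above);
# mktemp simulates the file that slurmctld would hand to the script.
SLURM_SUSPEND_FILE=$(mktemp)
echo 'node002 power_none' > "$SLURM_SUSPEND_FILE"

suspend_info=""
if [ -n "${SLURM_SUSPEND_FILE:-}" ] && [ -r "$SLURM_SUSPEND_FILE" ]; then
    suspend_info=$(cat "$SLURM_SUSPEND_FILE")
fi
```

With the node features available in the file, no sinfo/scontrol RPC to slurmctld would be needed at suspend time.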
Hi Ben,

(In reply to Ben Roberts from comment #10)
> I looked into whether there was any mechanism to handle cases where the
> script crashed or returned different types of exit codes, but this is not
> something we look for. When resuming nodes, the thing we look for is for
> slurmd to respond to the controller within the timeout period. If that
> doesn't happen then the node is marked as DOWN. If the script used to
> resume the node fails then the exit code will be recorded in the
> slurmctld.log file for your review. This is described in the
> documentation with the following sentence:
>
> If the slurmd daemon fails to respond within the configured ResumeTimeout
> value with an updated BootTime, the node will be placed in a DOWN state and
> the job requesting the node will be requeued.

I have enabled power_save on one of our partitions now. I do see the behavior that you describe, so all is well. For example:

[2022-12-05T06:00:59.441] node x140 not resumed by ResumeTimeout(1000) - marking down and power_save
[2022-12-05T09:28:59.839] node x081 not resumed by ResumeTimeout(1000) - marking down and power_save

There does not seem to be any exit code logged.

> The suspension of nodes is similar where we don't handle different exit
> codes for the suspend script. If the script fails we assume the node did
> power down and will be available again to be scheduled after the
> SuspendTimeout. The exit code of the script will be logged for review.

Thanks, that makes sense.

Ole
Hi Skyler,

(In reply to Skyler Malinowski from comment #11)
> I hope you had a great SC22. Thanks for stopping by the booth.

Thanks for the nice discussions at SC22.

> I have been looking at this due to another ticket and can really see the gap
> that exists in Slurm currently around making it convenient to the
> ResumeProgram and SuspendProgram to handle jobs and nodes using data
> directly from Slurm. It would be much appreciated if you would open an RFE
> to help improve this area of Slurm!
>
> Just for some of my current thoughts of how to improve this:
> - add node information (e.g. node features, extra) to SLURM_RESUME_FILE
> - add SLURM_SUSPEND_FILE which also contains node information (e.g. node
> features, extra)
> This way the <Resume|Suspend>Program can use Slurm data to influence
> handling of the node.

Thanks for the suggestion. I have created an RFE in bug 15550 now. Let's see if it gets accepted.

FYI: I have published my power_save scripts on GitHub at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save where the nodesuspend and noderesume scripts use an "sinfo" command to obtain the node features of the nodes being handled:

$ sinfo -hN -O "Nodelist: ,Features:" --nodes=$* > $TMPFILE

It would be better if we could avoid such queries to slurmctld and use SLURM_RESUME_FILE and a new SLURM_SUSPEND_FILE instead.

Best regards,
Ole
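The $TMPFILE produced by that sinfo command can then drive per-feature handling in nodesuspend/noderesume. A minimal sketch using canned sinfo-style output ("nodename features" per line) in place of the live query, with the same power_* feature names as proposed earlier in this ticket:

```shell
# Sketch: group nodes by their power_* feature from sinfo-style output.
# A here-document replaces the live "sinfo -hN -O ..." call for this example.
TMPFILE=$(mktemp)
cat > "$TMPFILE" <<'EOF'
node001 xeon24,power_ipmi
node002 xeon40,power_none
node003 cloud,power_azure
EOF

# List the nodes whose feature list contains the given feature.
nodes_with_feature() {
    awk -v f="$1" '$2 ~ f { print $1 }' "$TMPFILE"
}
```

The script can then loop over `nodes_with_feature power_ipmi`, `power_azure`, etc., and run the matching power commands on each group.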
Hi Ole, Since you have opened a feature request for this in bug 15550 I'll go ahead and close this ticket. Please let us know if you need anything in the future. Thanks, Ben