Ticket 15550 - Add node features to SLURM_RESUME_FILE and new SLURM_SUSPEND_FILE
Summary: Add node features to SLURM_RESUME_FILE and new SLURM_SUSPEND_FILE
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 21.08.8
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
Reported: 2022-12-05 05:57 MST by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2022-12-22 16:08 MST
Site: DTU Physics


Description Ole.H.Nielsen@fysik.dtu.dk 2022-12-05 05:57:07 MST
In bug 15439 comment 11 we discussed the need for ResumeProgram and SuspendProgram to have access to node features, so that they can perform different suspend/resume actions such as IPMI or cloud commands.

For example, we have defined node features such as:

$ scontrol show node x002
NodeName=x002 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=24 CPUTot=24 CPULoad=24.01
   AvailableFeatures=xeon2650v4,opa,xeon24,power_ipmi
   ActiveFeatures=xeon2650v4,opa,xeon24,power_ipmi
   (lines deleted)

It would be most helpful if SLURM_RESUME_FILE, as well as a new SLURM_SUSPEND_FILE, included the node feature information, so that ResumeProgram and SuspendProgram do not need to query slurmctld explicitly to read such features.
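
For illustration, here is a minimal Python sketch of the explicit query that a ResumeProgram or SuspendProgram currently has to make in order to learn a node's features. It parses the ActiveFeatures field from `scontrol show node` output like the example above. The function names, and the convention that a feature prefixed with `power_` names the power control method, are assumptions made for this sketch and are not part of Slurm.

```python
import re
import subprocess
from typing import List, Optional


def parse_active_features(scontrol_output: str) -> List[str]:
    """Extract the ActiveFeatures list from `scontrol show node` text."""
    match = re.search(r"\bActiveFeatures=(\S*)", scontrol_output)
    if not match or match.group(1) in ("", "(null)"):
        return []
    return match.group(1).split(",")


def power_method(features: List[str], prefix: str = "power_") -> Optional[str]:
    """Return the suffix of the first feature starting with `prefix`.

    Assumption for this sketch: a feature such as `power_ipmi` selects
    the suspend/resume mechanism ("ipmi"). Slurm itself attaches no
    meaning to the feature name.
    """
    for feature in features:
        if feature.startswith(prefix):
            return feature[len(prefix):]
    return None


def query_node_features(node: str) -> List[str]:
    """Ask slurmctld for one node's features (one scontrol call per node)."""
    result = subprocess.run(
        ["scontrol", "show", "node", node],
        capture_output=True, text=True, check=True,
    )
    return parse_active_features(result.stdout)


# Sample text taken from the scontrol output quoted above.
SAMPLE = """NodeName=x002 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUTot=24 CPULoad=24.01
   AvailableFeatures=xeon2650v4,opa,xeon24,power_ipmi
   ActiveFeatures=xeon2650v4,opa,xeon24,power_ipmi
"""

if __name__ == "__main__":
    print(power_method(parse_active_features(SAMPLE)))
```

If the features were included in SLURM_RESUME_FILE and a SLURM_SUSPEND_FILE, the `query_node_features` step, and with it the extra round trip to slurmctld, would no longer be needed.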

Thanks,
Ole
Comment 2 Jason Booth 2022-12-06 10:19:34 MST
Ole, is your site interested in funding/sponsoring this feature?
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2022-12-09 04:10:51 MST
Hi Jason,

(In reply to Jason Booth from comment #2)
> Ole is your site interested in funding/sponsoring this feature?

I have considered this question.  The suggested enhancement would give the power_save plugin's scripts SuspendProgram and ResumeProgram some extra relevant data to work with.

However, my power saving scripts in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save instead use a single call of sdiag to obtain the data directly from slurmctld.  Therefore I do not have a strong need for the suggested enhancement.

Thanks for considering this anyhow.

Best regards,
Ole
Comment 4 Jason Booth 2022-12-13 15:12:14 MST
Ole, I am moving this over to our sev 5 categories so that we can track this appropriately.