Ticket 15550

Summary: Add node features to SLURM_RESUME_FILE and new SLURM_SUSPEND_FILE
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmctld
Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN
QA Contact:
Severity: 5 - Enhancement
Priority: ---
CC: bas.vandervlies, skyler
Version: 21.08.8
Hardware: Linux
OS: Linux
Site: DTU Physics
Slinky Site: ---

Description Ole.H.Nielsen@fysik.dtu.dk 2022-12-05 05:57:07 MST
In bug 15439 comment 11 we discussed the need for ResumeProgram and SuspendProgram to have access to node features so that they can perform different suspend/resume actions, such as IPMI or cloud commands.

For example, we have defined node features such as:

$ scontrol show node x002
NodeName=x002 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=24 CPUTot=24 CPULoad=24.01
   AvailableFeatures=xeon2650v4,opa,xeon24,power_ipmi
   ActiveFeatures=xeon2650v4,opa,xeon24,power_ipmi
   (lines deleted)

It would be most helpful if SLURM_RESUME_FILE, as well as a new SLURM_SUSPEND_FILE, included the node feature information so that ResumeProgram and SuspendProgram do not need to query slurmctld explicitly to read such features.
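
To illustrate the request, here is a minimal sketch of how a ResumeProgram might use such data if it were available. The "node_features" key, its layout (node name mapped to a comma-separated feature string), and the helper names are hypothetical and only show the idea; Slurm does not write this key to SLURM_RESUME_FILE. The convention that a "power_" prefix selects the power action is likewise an assumption based on the power_ipmi feature shown above.

```python
import json
import os


def load_node_features(path):
    """Return {node_name: set_of_features} from a resume/suspend JSON file.

    ASSUMPTION: "node_features" is a hypothetical key illustrating the
    requested enhancement; it is not part of the file Slurm writes today.
    """
    with open(path) as fh:
        data = json.load(fh)
    raw = data.get("node_features", {})
    return {
        node: (set(feats.split(",")) if feats else set())
        for node, feats in raw.items()
    }


def pick_power_method(features):
    """Map a node's features to a power action name.

    ASSUMPTION: a feature named "power_<method>" (e.g. power_ipmi)
    selects the method; nodes without such a feature yield "unknown".
    """
    for feat in sorted(features):
        if feat.startswith("power_"):
            return feat[len("power_"):]
    return "unknown"


if __name__ == "__main__":
    # slurmctld sets SLURM_RESUME_FILE in the ResumeProgram environment.
    path = os.environ.get("SLURM_RESUME_FILE")
    if path:
        for node, feats in load_node_features(path).items():
            print(node, pick_power_method(feats))
```

With data of this kind in the file, the script could choose between IPMI and cloud actions per node without contacting slurmctld.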

Thanks,
Ole
Comment 2 Jason Booth 2022-12-06 10:19:34 MST
Ole, is your site interested in funding/sponsoring this feature?
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2022-12-09 04:10:51 MST
Hi Jason,

(In reply to Jason Booth from comment #2)
> Ole, is your site interested in funding/sponsoring this feature?

I have considered this question.  The suggested enhancement would give the power_save plugin's scripts SuspendProgram and ResumeProgram some extra relevant data to work with.

However, my power saving scripts in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save instead use a single call of sinfo to obtain the data directly from slurmctld.  Therefore I do not have a strong need for the suggested enhancement.
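
A single query of this kind could look like the following sketch. It assumes sinfo with node-oriented output (-N) and the format "%N %f" (node name and features); this is an illustration, not the exact command used by the linked scripts. The function names are hypothetical, and treating "(null)" as "no features" is an assumption about how sinfo prints an empty feature list.

```python
import subprocess


def parse_sinfo_features(text):
    """Parse 'sinfo -h -N -o "%N %f"' output into {node: set_of_features}.

    A node listed in several partitions appears on several lines;
    later lines simply overwrite earlier identical entries.
    """
    features = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        node = parts[0]
        feats = parts[1].strip() if len(parts) > 1 else ""
        # ASSUMPTION: "(null)" marks a node without features.
        if feats in ("", "(null)"):
            features[node] = set()
        else:
            features[node] = set(feats.split(","))
    return features


def query_features(nodelist):
    """Ask slurmctld once for the features of all nodes in nodelist."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-n", nodelist, "-o", "%N %f"],
        check=True, capture_output=True, text=True,
    )
    return parse_sinfo_features(result.stdout)
```

Calling query_features once with the node list passed to ResumeProgram or SuspendProgram gives the features for every affected node in one round trip to slurmctld.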

Thanks for considering this anyhow.

Best regards,
Ole
Comment 4 Jason Booth 2022-12-13 15:12:14 MST
Ole, I am moving this over to our sev 5 categories so that we can track this appropriately.