Summary: | sinfo does not print cloud nodes | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | User Commands | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | bart, ben, brian |
Version: | 21.08.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4751 | ||
Site: | DTU Physics | ||
Attachments: | slurm.conf |
Hi

I can't recreate this locally.
Could you send me the output from "scontrol -F show node"?

Dominik

(In reply to Dominik Bartkiewicz from comment #1)
> Hi
>
> I can't recreate this locally.
> Could you send me the output from "scontrol -F show node"

Yes, it doesn't print the cloud nodes:

    $ scontrol -F show node
    NodeName=test001 Arch=x86_64 CoresPerSocket=4
       CPUAlloc=0 CPUTot=8 CPULoad=0.00
       AvailableFeatures=xeonx5570
       ActiveFeatures=xeonx5570
       Gres=(null)
       NodeAddr=test001 NodeHostName=test001 Version=21.08.8-2
       OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022
       RealMemory=23000 AllocMem=0 FreeMem=21589 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=100000 Weight=10313 Owner=N/A MCS_label=N/A
       Partitions=xeon8
       BootTime=2022-05-29T20:49:32 SlurmdStartTime=2022-06-03T11:02:27
       LastBusyTime=2022-06-08T10:47:33
       CfgTRES=cpu=8,mem=23000M,billing=6
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=test002 Arch=x86_64 CoresPerSocket=4
       CPUAlloc=0 CPUTot=8 CPULoad=0.00
       AvailableFeatures=xeonx5570
       ActiveFeatures=xeonx5570
       Gres=(null)
       NodeAddr=test002 NodeHostName=test002 Version=21.08.8-2
       OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 08:57:35 EDT 2022
       RealMemory=23000 AllocMem=0 FreeMem=21598 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=100000 Weight=10313 Owner=N/A MCS_label=N/A
       Partitions=xeon8
       BootTime=2022-05-29T20:49:28 SlurmdStartTime=2022-06-03T11:02:27
       LastBusyTime=2022-06-08T10:47:33
       CfgTRES=cpu=8,mem=23000M,billing=6
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I was running a job on an Azure cloud node, and while the node is being powered down the sinfo command actually prints the node:

    $ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001,camd002
    camd001 idle+cloud+powering_down

After the node was powered down, sinfo again prints empty output.

Hi

You should set PrivateData=cloud in slurm.conf.
Let me know if this will solve this issue.

Dominik

Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #4)
> You should set PrivateData=cloud in slurm.conf.
> Let me know if this will solve this issue.

Wow, that fixes the issue! I run this as the root user:

    $ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001,camd002
    camd001 idle+cloud+powered_down
    camd002 down+cloud+powered_down+not_responding

This is, however, not consistent with the slurm.conf man-page:

    PrivateData
        This controls what type of information is hidden from regular users.
        By default, all information is visible to all users. User SlurmUser
        and root can always view all information.

        cloud   Powered down nodes in the cloud are visible.

The "cloud" parameter description is highly ambiguous, since the slurm and root users should always see the "cloud" data.

Can we agree that the lack of cloud node display for the slurm and root users is a bug? It would certainly be convenient if we didn't have to configure PrivateData in this case!

Thanks,
Ole

In comment #0 I also found that for a single cloud node name it DOES get printed, whereas for a node range expression nothing gets printed. This seems really buggy.

Hi

For sure the documentation is not precise, and I agree that PrivateData isn't the best place for this option.

But I don't think we want to change this default behavior.
We will fix the documentation and internally discuss if we can move this option to a more appropriate place in 23.03.

Dominik

(In reply to Dominik Bartkiewicz from comment #7)
> For sure the documentation is not precise, and I agree that PrivateData
> isn't the best place for this option.
>
> But I don't think we want to change this default behavior.
> We will fix the documentation and internally discuss if we can move this
> option to a more appropriate place in 23.03.

Thanks for your analysis and suggested resolution. I still believe that PrivateData=cloud should *NOT* be required, because the users slurm and root should by default see all cloud nodes (normal users should not see them).

IMHO, there is a bug in sinfo (and other tools?) causing cloud nodes to not be printed by default for the slurm and root users. Will SchedMD accept this argument and work towards a bug fix?

Thanks,
Ole

> Thanks for your analysis and suggested resolution. I still believe that
> PrivateData=cloud should *NOT* be required, because the users slurm and root
> should by default see all cloud nodes (normal users should not see them).

Not displaying down cloud nodes to any user has been the default behavior since CLOUD nodes were introduced. In slurm-14-11-0 we added an option to print them, and placed it in PrivateData. This never was, and was never supposed to be, private data, because all of this data is available in slurm.conf.

> IMHO, there is a bug in sinfo (and other tools?) causing cloud nodes to not
> be printed by default for the slurm and root users. Will SchedMD accept
> this argument and work towards a bug fix?

This is a documentation bug, and we will fix it.
Perhaps the default behavior isn't the best, but it is now too late to change it as a mere bug fix.
In my opinion, this option shouldn't be available in PrivateData because it is not private data. Instead, we can move it to SlurmctldParameters, or add show_flags to allow tools to control this behavior, and then deprecate cloud in PrivateData.
I will discuss this with the team internally and let you know what we propose.

Dominik

Hi

This commit updates the documentation:
https://github.com/SchedMD/slurm/commit/d889d317fddff

We already have an internal ticket (bug 4751) to track this issue.
I believe that in the next major release, we will implement this option in a better way. Please let me know if you need anything else.

Dominik

(In reply to Dominik Bartkiewicz from comment #16)
> This commit updates the documentation:
> https://github.com/SchedMD/slurm/commit/d889d317fddff
>
> We already have an internal ticket (bug 4751) to track this issue.
> I believe that in the next major release, we will implement this option in a
> better way. Please let me know if you need anything else.

Thanks very much for documenting this! I see that it's already in the 22.05.2 docs.

Best regards,
Ole

I'm closing this bug as infogiven.
Please let us know if you have any other questions or concerns.

Dominik
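The resolution reached in the comments can be summarized as a slurm.conf fragment. This is a minimal sketch, not the reporter's actual configuration: the node definition is copied from the bug description, and the comment lines are added here for illustration. If a site already sets other PrivateData values, cloud would presumably be appended to that comma-separated list rather than replacing it.

```
# slurm.conf (fragment)

# Make powered-down cloud nodes visible, per the slurm.conf man page entry
# for PrivateData=cloud quoted in this thread.
PrivateData=cloud

# Cloud nodes as defined in the reporter's test cluster.
NodeName=camd[001-002] Weight=10005 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=CLOUD RealMemory=26000 TmpDisk=10000 Feature=xeon8272cl,Azure
```

With this setting in place, the reporter's query for camd001,camd002, run as root, returned both nodes in their powered_down state.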
Created attachment 25405
slurm.conf

Our test cluster has 2 on-premise nodes and 2 Azure cloud nodes defined in slurm.conf:

    NodeName=test[001-002] Weight=10313 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=23000 TmpDisk=100000 Feature=xeonx5570
    NodeName=camd[001-002] Weight=10005 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=CLOUD RealMemory=26000 TmpDisk=10000 Feature=xeon8272cl,Azure

I need to use sinfo to inquire the state of nodes, and this doesn't work correctly with node range expressions for cloud nodes.

An "sinfo -N" lists only non-cloud nodes:

    $ sinfo -N -h -O NODELIST,StateComplete:40
    test001 idle
    test002 idle

Nodes with state CLOUD are not listed:

    $ sinfo -N -h -O NODELIST,StateComplete:40 -t CLOUD

If I specify a single cloud node name it DOES get printed:

    $ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001
    camd001 idle+cloud+powered_down
    $ sinfo -N -h -O NODELIST,StateComplete:40 -n camd002
    camd002 down+cloud+powered_down+not_responding

But if I specify a node range expression nothing gets printed:

    $ sinfo -N -h -O NODELIST,StateComplete:40 -n camd[001-002]
    $ sinfo -N -h -O NODELIST,StateComplete:40 -n camd001,camd002

Something seems to be inconsistent with the output from sinfo. The empty output from node range expressions would seem to be a bug.

I need sinfo to work correctly for my cloud node power up/down scripts. Can you help clarify what's going on?
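For power up/down scripts like the ones mentioned above, the StateComplete strings shown in this report can be split on "+" to test for individual flags. The following is a minimal sketch under stated assumptions: the function names node_has_flag and powered_down_nodes are hypothetical helpers made up for illustration, not part of Slurm, and the input is assumed to have the two-column "node state" form produced by the sinfo command used in this report.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not Slurm commands.

# node_has_flag STATE FLAG
# Succeed if the "+"-separated STATE string (e.g. "idle+cloud+powered_down")
# contains FLAG as a whole token, so "powering_down" does not match
# "powered_down".
node_has_flag() {
    local state="$1" flag="$2"
    case "+${state}+" in
        *"+${flag}+"*) return 0 ;;
        *) return 1 ;;
    esac
}

# powered_down_nodes
# Read "NODE STATE" lines on stdin (the format printed by
# "sinfo -N -h -O NODELIST,StateComplete:40") and print the names of the
# nodes whose state carries the powered_down flag.
powered_down_nodes() {
    local node state
    while read -r node state _; do
        [ -n "$node" ] || continue
        if node_has_flag "$state" powered_down; then
            printf '%s\n' "$node"
        fi
    done
}
```

A script could then run `sinfo -N -h -O NODELIST,StateComplete:40 | powered_down_nodes`. As this report shows, powered-down cloud nodes only appear in multi-node sinfo output once PrivateData=cloud is set, so without that setting the list would be empty.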