| Summary: | Experiencing issue when have "AccountingStorageEnforce" in slurm.conf | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Will Dennis <wdennis> |
| Component: | Limits | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NEC Labs | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Ubuntu |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf file, show qos part_gpu output, sacctmgr show associations output, job exceeding qos limits details, slurmctld.log (lines from 11/3) | | |
Will,

Can you please submit your slurm.conf and slurmctld.log? Also, could you provide the full output of these commands?

1) sacctmgr show qos part_gpu (full output)
2) sacctmgr show assoc
3) scontrol show job <jobid> (on any job not obeying the limit)

Thank you,
Caden

Created attachment 27610 [details]
slurm.conf file
Created attachment 27611 [details]
show qos part_gpu output
Created attachment 27612 [details]
sacctmgr show associations output
Created attachment 27613 [details]
job exceeding qos limits details
4 of 5 files you requested; other one too big for web attachment, will try to send as email attachment. Created attachment 27614 [details]
slurmctld.log, lines from 11/3
Had to cut the slurmctld.log file size down, so I just selected events from 11/3 (the day of the issue).

Will,

In the slurmctld.log I see this:

error: This association 47(account='ml', user='hxia', partition='(null)') does not have access to qos normal

Looking at your sacctmgr show assoc output, I see that no association has access to the "normal" QOS, just high and low. When you submit a job with srun without specifying a QOS, the default is the "normal" QOS. So users either need to specify which QOS (high or low) they are using on the srun submission line with --qos=<qos>, or you need to use sacctmgr to give everyone access to the default "normal" QOS, which is probably preferable.

After this, and after uncommenting your "AccountingStorageEnforce=associations,limits,qos" line, you should be good to go. When users submit to a partition, their jobs should then be limited by the partition QOS first, as per our resource limit hierarchy: https://slurm.schedmd.com/resource_limits.html

Unrelated to your bug, I see that you have cpu_bind and gres debug flags enabled in your slurm.conf. This puts extra strain on the controller and makes the slurmctld logs huge. If you don't have an explicit need for them, I would suggest turning them off; it will make your slurmctld logs more concise.

Let me know if this answers your questions,
Caden Ellis

Is there a way to add the "normal" QOS to all associations at once, or must it be done on each association individually?
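For reference, the per-association route Caden describes can be done with sacctmgr as well; a sketch, using the account/user names from the log line above purely as examples (run on a host with sacctmgr access to the slurmdbd; -i skips the confirmation prompt):

```shell
# Grant the "normal" QOS to a single user's association
# (account/user names here are examples taken from the error in the log)
sacctmgr -i modify user where name=hxia account=ml set qos+=normal

# Verify which QOS each association now has access to
sacctmgr show assoc format=cluster,account,user,qos%30
```

Doing this per user only makes sense for a handful of associations; for a whole cluster, a single cluster-level modification is less error-prone.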
Here is what I did, since all of my users are on the same cluster:

sacctmgr modify cluster cluster=<clustername> set qos+=normal

You could also change the default QOS so it doesn't use "normal", if that better fits your needs: https://slurm.schedmd.com/sacctmgr.html#OPT_DefaultQOS

Caden

Did this work for you, Will?

Yes, with the application of the "normal" QOS to all associations, then applying the "AccountingStorageEnforce" directive, it seems like we are good now. Thanks for the assist! (Also have turned off the Debug on CPU/gres, thanks for that; I believe this was an earlier setting from another bug tshoot...)

Closing
Hello,

We utilize various QOS params on this Slurm cluster, some on partitions. Here is an example of a partition QOS:

root@ml-slurm-ctlr:~# scontrol show partition gpu
PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=part_gpu
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=ml-gpu[01-02,04-11]
   PriorityJobFactor=50 PriorityTier=50 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=304 TotalNodes=10 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,GRES/gpu=4.0

root@ml-slurm-ctlr:~# sacctmgr show qos name=part_gpu format=name,maxtrespu%50
      Name                                          MaxTRESPU
---------- --------------------------------------------------
  part_gpu                         cpu=16,gres/gpu=4,mem=128G

A user filed a ticket with me today saying that the partition limits were not being respected even though they seemed to be set. Looking at my slurm.conf file, I saw that the issue was that there was no "AccountingStorageEnforce" stanza in the conf file. So I added it as follows:

AccountingStorageEnforce=associations,limits,qos

and then restarted the slurmctld service, followed by an scontrol reconfigure. However, after I had done this, I got a report from another user that they could not submit jobs. I tested, and saw this:

wdennis@ml-slurm-submit03:~$ srun --pty -p gpu -n 1 -t 0 --gres=gpu:1 --mem=120G bash -i
srun: error: Unable to allocate resources: Invalid qos specification

Commenting out the AccountingStorageEnforce line, followed by a service restart and scontrol reconfigure, resolved that issue, but again it seems we have no limits in place... How can I debug the "Invalid qos specification" error and get it working?
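The configuration pieces involved in this report, pulled together as a slurm.conf fragment for reference (a sketch: only the values quoted above are from this site's actual config; any other settings a real file needs are omitted):

```
# slurm.conf (fragment) -- QOS limits are only enforced when this is set
AccountingStorageEnforce=associations,limits,qos

# Partition with a QOS attached; the part_gpu limits apply to jobs in this partition
PartitionName=gpu Nodes=ml-gpu[01-02,04-11] Default=YES QOS=part_gpu DefaultTime=01:00:00 DefMemPerCPU=2048 TRESBillingWeights=CPU=1.0,GRES/gpu=4.0

# The per-user limits on the part_gpu QOS live in the accounting database,
# not in slurm.conf; they would have been set with something like:
#   sacctmgr modify qos part_gpu set MaxTRESPerUser=cpu=16,gres/gpu=4,mem=128G
```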