| Summary: | Experiencing issue when have "AccountingStorageEnforce" in slurm.conf | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Will Dennis <wdennis> |
| Component: | Limits | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NEC Labs | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Ubuntu |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf file, show qos part_gpu output, sacctmgr show associations output, job exceeding qos limits details, slurmctld.log (lines from 11/3) | | |
Will,

Can you please submit your slurm.conf and slurmctld.log? Also, could you provide the full output of these commands?

1) sacctmgr show qos part_gpu (full output)
2) sacctmgr show assoc
3) scontrol show job <jobid> (on any job not obeying the limit)

Thank you,
Caden

Created attachment 27610 [details]
slurm.conf file
Created attachment 27611 [details]
show qos part_gpu output
Created attachment 27612 [details]
sacctmgr show associations output
Created attachment 27613 [details]
job exceeding qos limits details
4 of 5 files you requested; other one too big for web attachment, will try to send as email attachment. Created attachment 27614 [details]
slurmctld.log, lines from 11/3
Had to cut the slurmctld.log file size down, so I just selected events from 11/3 (the day of the issue).

Will,

In the slurmctld.log I see this:

error: This association 47(account='ml', user='hxia', partition='(null)') does not have access to qos normal

Looking at your sacctmgr show assoc output, I see that no association has access to the "normal" QOS, just high and low. When you submit a job with srun without specifying a QOS, the default is the "normal" QOS. So users either need to specify which QOS (high or low) they are using on the srun submission line with --qos=<qos>, or you need to use sacctmgr to give everyone access to the default "normal" QOS, which is probably preferable.

After this, and after uncommenting your "AccountingStorageEnforce=associations,limits,qos" line, you should be good to go. When users submit to a partition, their jobs should then be limited by the partition QOS first, as per our resource limit hierarchy: https://slurm.schedmd.com/resource_limits.html

Unrelated to your bug, I see that you have cpu_bind and gres debug flags enabled in your slurm.conf. This puts extra strain on the controller and makes the slurmctld logs huge. If you don't have an explicit need for them, I would suggest turning them off; it will make your slurmctld logs more concise.

Let me know if this answers your questions,
Caden Ellis

Is there a way to add the "normal" QOS to all associations at once, or must it be done on each association individually?
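For reference, the per-association route Caden describes can be done with sacctmgr as well; a sketch, using the account/user names from the log line above purely as examples (run on a host with sacctmgr access to the slurmdbd; -i skips the confirmation prompt):

```shell
# Grant the "normal" QOS to a single user's association
# (account/user names here are examples taken from the error in the log)
sacctmgr -i modify user where name=hxia account=ml set qos+=normal

# Verify which QOS each association now has access to
sacctmgr show assoc format=cluster,account,user,qos%30
```

Doing this per user only makes sense for a handful of associations; for a whole cluster, a single cluster-level modification is less error-prone.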
Here is what I did, since all of my users are on the same cluster:

sacctmgr modify cluster cluster=<clustername> set qos+=normal

You could also change the default QOS so it doesn't use "normal", if that better fits your needs: https://slurm.schedmd.com/sacctmgr.html#OPT_DefaultQOS

Caden

Did this work for you, Will?

Yes, with the application of the "normal" QOS to all associations, then applying the "AccountingStorageEnforce" directive, it seems like we are good now. Thanks for the assist! (Also have turned off the Debug on CPU/gres, thanks for that; I believe this was an earlier setting from another bug tshoot...)

Closing
Hello,

We utilize various QOS params on this Slurm cluster, some on partitions. Here is an example of a partition QOS:

root@ml-slurm-ctlr:~# scontrol show partition gpu
PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=part_gpu
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=ml-gpu[01-02,04-11]
   PriorityJobFactor=50 PriorityTier=50 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=304 TotalNodes=10 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=1.0,GRES/gpu=4.0

root@ml-slurm-ctlr:~# sacctmgr show qos name=part_gpu format=name,maxtrespu%50
      Name                                          MaxTRESPU
---------- --------------------------------------------------
  part_gpu                         cpu=16,gres/gpu=4,mem=128G

A user filed a ticket with me today saying that the partition limits were not being respected even though they seemed to be set. Looking at my slurm.conf file, I saw that the issue was that there was no "AccountingStorageEnforce" stanza in the conf file. So I added it as follows:

AccountingStorageEnforce=associations,limits,qos

and then restarted the slurmctld service, followed by an scontrol reconfigure. However, after I had done this, I got a report from another user that they could not submit jobs. I tested, and saw this:

wdennis@ml-slurm-submit03:~$ srun --pty -p gpu -n 1 -t 0 --gres=gpu:1 --mem=120G bash -i
srun: error: Unable to allocate resources: Invalid qos specification

Commenting out the AccountingStorageEnforce line, followed by a service restart and scontrol reconfigure, resolved that issue, but again it seems we have no limits in place... How can I debug the "Invalid qos specification" error and get it working?
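The configuration pieces involved in this report, pulled together as a slurm.conf fragment for reference (a sketch: only the values quoted above are from this site's actual config; any other settings a real file needs are omitted):

```
# slurm.conf (fragment) -- QOS limits are only enforced when this is set
AccountingStorageEnforce=associations,limits,qos

# Partition with a QOS attached; the part_gpu limits apply to jobs in this partition
PartitionName=gpu Nodes=ml-gpu[01-02,04-11] Default=YES QOS=part_gpu DefaultTime=01:00:00 DefMemPerCPU=2048 TRESBillingWeights=CPU=1.0,GRES/gpu=4.0

# The per-user limits on the part_gpu QOS live in the accounting database,
# not in slurm.conf; they would have been set with something like:
#   sacctmgr modify qos part_gpu set MaxTRESPerUser=cpu=16,gres/gpu=4,mem=128G
```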