| Summary: | Memory limits for job array | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | David Chin <dwc62> |
| Component: | Limits | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | felip.moll |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14293 | | |
| Site: | Drexel | | |
| Attachments: | slurm.conf, scontrol show job output, slurmctld log, cgroup.conf, slurmctld log 20220719T18++ | | |
Description
David Chin 2022-07-20 09:29:12 MDT

Please attach the slurmctld.log and the output of "scontrol show job <JOBID>".

Created attachment 25940 [details]
scontrol show job output

Created attachment 25941 [details]
slurmctld log

Created attachment 25942 [details]
cgroup.conf
Chad Vizino:

Hi. I'm looking into this. Would you please also supply: slurmctld.log covering maybe an hour preceding and including the start of the array jobs on node051 (looks like they started at 2022-07-19T21:53:00)?

Counting running array jobs on each node (from the output in comment 3) I see:

>32 NodeList=node051
>48 NodeList=node054
>48 NodeList=node055
>20 NodeList=node056

Could you supply a "scontrol show node node051"?

David Chin:

(In reply to Chad Vizino from comment #6)
> Hi. I'm looking into this. Would you please also supply:
> [...]
> Could you supply a "scontrol show node node051"?

```
NodeName=node051 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node051 NodeHostName=node051 Version=21.08.8-2
   OS=Linux 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
   RealMemory=192000 AllocMem=0 FreeMem=189812 Sockets=4 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=874000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=def,long
   BootTime=2022-07-18T15:02:32 SlurmdStartTime=2022-07-18T15:04:36
   LastBusyTime=2022-07-21T05:33:42
   CfgTRES=cpu=48,mem=187.50G,billing=48
   AllocTRES=
   CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Drained by CMDaemon [root@2022-07-20T11:46:21]
```

Created attachment 25953 [details]
slurmctld log 20220719T18++
David Chin:

Hi Chad:

I think I found the solution, since we encountered exactly this in the past. We have to set "SelectTypeParameters=CR_Core_Memory" on the partitions. In the update using Bright, this setting was not retained.

In Bright:

[mgmtnode->wlm[clustername]->jobqueue[def]] set selecttypeparameters CR_Core_Memory

and that updates the config for the "def" partition in slurm.conf.

I've asked the user to scancel the current job array and resubmit. We should see if it works by tomorrow (the user is in Australia, so they are time-shifted).

Dave

Actually, I just tried it myself and it worked. I submitted an array requesting "--mem=90G", so I expect at most two tasks per node, and that's what I see:
```
3458855_84 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node014
3458855_85 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node015
3458855_86 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node015
3458855_87 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node016
3458855_88 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node016
3458855_89 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node017
3458855_90 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node017
3458855_91 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node018
3458855_92 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node018
3458855_93 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node019
3458855_94 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node019
3458855_95 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node020
3458855_96 def tstarr_1node dwc62 urcfadmprj R 0:06 15:00 1 1 90G node020
```
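As a sanity check on the "at most two tasks per node" expectation above, here is a small Python sketch of the arithmetic the scheduler effectively applies once memory is a consumable resource. The numbers are taken from this ticket (CfgTRES mem=187.50G on node051, and the `--mem=90G` request); the function name is illustrative, not a Slurm API.

```python
# Upper bound on concurrent array tasks per node imposed by memory alone,
# once SelectTypeParameters=CR_Core_Memory makes memory a consumable resource.

def tasks_per_node(node_mem_mb: int, task_mem_mb: int) -> int:
    """How many tasks fit if each one reserves task_mem_mb on a node_mem_mb node."""
    return node_mem_mb // task_mem_mb

node_mem_mb = int(187.50 * 1024)   # CfgTRES mem=187.50G (configured memory, in MB)
task_mem_mb = 90 * 1024            # sbatch --mem=90G, in MB

print(tasks_per_node(node_mem_mb, task_mem_mb))  # → 2
```

This matches the squeue output above: no node runs more than two of the 90G tasks.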
Chad Vizino:

(In reply to David Chin from comment #10)
> We have to set "SelectTypeParameters=CR_Core_Memory" on the partitions. In
> the update using Bright, this setting was not retained.
> [...]
> I've asked the user to scancel the current job array, and resubmit. We
> should see if it works by tomorrow (user is in Australia, so they are
> time-shifted).

Great! Glad you found that--memory does need to be factored into your scheduling. I had seen that setting (CR_Core) and was planning to ask about it next. Let me know how that goes.

We also have a suggestion about your cgroup configuration. You have set values for swap even though ConstrainSwapSpace=no. Also, the *KMem* parameters are known to be buggy in the kernel, so we suggest you remove them. Leave cgroup.conf like this:

>ConstrainCores=yes
>ConstrainRAMSpace=yes
>ConstrainSwapSpace=no
>ConstrainDevices=yes

Let me know if you have any more questions about this. In the meantime I'm going to drop the severity of this by one level for now, but will still actively watch this case.

David Chin:

Hi, Chad:
The user's jobs are now being scheduled as expected, i.e. taking the memory request into account.
We can close this out.
Regards,
Dave
Chad Vizino:

(In reply to David Chin from comment #14)
> The user's jobs are now being scheduled as expected, i.e. taking memory
> request into account.
>
> We can close this out.

Very good. Closing.
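For anyone hitting the same symptom (array tasks packing onto nodes without regard to --mem), the resolution above boils down to one slurm.conf change. A sketch of the relevant lines, assuming a cons_tres/cons_res select plugin; the node list and other partition options here are illustrative assumptions, not this site's actual configuration:

```
# slurm.conf -- schedule on cores AND treat memory as a consumable resource
SelectTypeParameters=CR_Core_Memory

# Per-partition form, as written by Bright for the "def" partition
# (node list is illustrative)
PartitionName=def Nodes=node[001-056] SelectTypeParameters=CR_Core_Memory
```

With only CR_Core, the scheduler counts cores but ignores memory requests when packing jobs, which is exactly the overcommit behavior reported in this ticket.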