Created attachment 25931 [details]
slurm.conf

I think this is an old issue that you helped me fix, but it is now showing up again since the update from 20.02 to 21.08. I am unable to find that old ticket using my email as "reporter" and selecting all statuses.

Our nodes have 192 GiB RAM installed. One user submits a job array with:

#SBATCH --mem=40G

(currently updated to "--mem=80G")

In Slurm 20.02, our configuration meant that no more than 4 array tasks ran on each node. Now, however, with "--mem=80G", there are up to 32 array tasks running on a node.

The job script is:

#!/bin/bash
#SBATCH --array=0-199
#SBATCH --account=diezrouxPrj
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=is379@drexel.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=80G
#SBATCH --time=45:00:00
#SBATCH --requeue

. /etc/profile.d/modules.sh
module load netlogo/5.3.1

my_task_id=$( printf %03d $SLURM_ARRAY_TASK_ID )

java -Djava.util.prefs.userRoot=$TMP -Djava.util.prefs.systemRoot=$TMP -cp ${NETLOGOHOME}/app/NetLogo.jar \
    org.nlogo.headless.Main \
    --threads $SLURM_NTASKS ...

The current slurm.conf is attached.

Thanks,
Dave Chin
Please attach the slurmctld.log and the output of "scontrol show job <JOBID>".
Created attachment 25940 [details] scontrol show job output
Created attachment 25941 [details] slurmctld log
Created attachment 25942 [details] cgroup.conf
Hi. I'm looking into this. Would you please also supply:

slurmctld.log covering maybe an hour preceding and including the start of the array jobs on node051 (looks like they started at 2022-07-19T21:53:00)?

Counting running array jobs on each node (from the output in comment 3) I see:

>32 NodeList=node051
>48 NodeList=node054
>48 NodeList=node055
>20 NodeList=node056

Could you also supply the output of "scontrol show node node051"?
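(For the record, per-node counts like those above can be produced with a quick pipeline over a saved copy of the "scontrol show job <JOBID>" output. The filename "scontrol-job.txt" is hypothetical; substitute wherever you dumped the output.)

```shell
# Count running array tasks per node from a saved "scontrol show job" dump.
# Matches only populated NodeList= fields; "(null)" entries are skipped.
grep -o 'NodeList=node[0-9]*' scontrol-job.txt | sort | uniq -c | sort -rn
```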
(In reply to Chad Vizino from comment #6)
> Hi. I'm looking into this. Would you please also supply:
>
> slurmctld.log covering maybe an hour preceding and including the start of
> the array jobs on node051 (looks like they started at 2022-07-19T21:53:00)?
>
> Counting running array jobs on each node (from the output in comment 3) I
> see:
>
> >32 NodeList=node051
> >48 NodeList=node054
> >48 NodeList=node055
> >20 NodeList=node056
> Could you supply a "scontrol show node node051"?

NodeName=node051 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node051 NodeHostName=node051 Version=21.08.8-2
   OS=Linux 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
   RealMemory=192000 AllocMem=0 FreeMem=189812 Sockets=4 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=874000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=def,long
   BootTime=2022-07-18T15:02:32 SlurmdStartTime=2022-07-18T15:04:36
   LastBusyTime=2022-07-21T05:33:42
   CfgTRES=cpu=48,mem=187.50G,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Drained by CMDaemon [root@2022-07-20T11:46:21]
Created attachment 25953 [details] slurmctld log 20220719T18++
Hi Chad:

I think I found the solution, since we encountered exactly this in the past.

We have to set "SelectTypeParameters=CR_Core_Memory" on the partitions. In the update using Bright, this setting was not retained.

In Bright:

[mgmtnode->wlm[clustername]->jobqueue[def]] set selecttypeparameters CR_Core_Memory

and that updates the config for the "def" partition in slurm.conf.

I've asked the user to scancel the current job array and resubmit. We should see if it works by tomorrow (the user is in Australia, so they are time-shifted).

Dave
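For reference, the relevant lines in the regenerated slurm.conf should end up looking something like the fragment below. This is a sketch, not copied from the attached config; in particular the SelectType plugin name (select/cons_res vs. select/cons_tres) depends on what Bright generates for your version.

```
# Treat both cores and memory as consumable resources, so each job's
# --mem request is counted against the node's RealMemory at schedule time.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```

With CR_Core alone, only cores are consumable, which is why the 80G jobs were packed 32 to a node.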
Actually, I just tried it myself and it worked. I submitted an array requesting "--mem=90G", so I expect at most two tasks per node, and that's what I see:

3458855_84  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node014
3458855_85  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node015
3458855_86  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node015
3458855_87  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node016
3458855_88  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node016
3458855_89  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node017
3458855_90  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node017
3458855_91  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node018
3458855_92  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node018
3458855_93  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node019
3458855_94  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node019
3458855_95  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node020
3458855_96  def  tstarr_1node  dwc62  urcfadmprj  R  0:06  15:00  1  1  90G  node020
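The packing arithmetic checks out. As a quick sketch (not Slurm code), the memory-only bound on tasks per node, using node051's reported RealMemory=192000 MB, is:

```python
def max_tasks_per_node(real_memory_mb, mem_per_task_mb):
    # With CR_Core_Memory, memory is a consumable resource, so at most
    # this many tasks fit on a node by the memory limit alone.
    return real_memory_mb // mem_per_task_mb

# Nodes report RealMemory=192000 (MB)
print(max_tasks_per_node(192000, 40 * 1024))  # 4 tasks with --mem=40G (the old 20.02 behavior)
print(max_tasks_per_node(192000, 90 * 1024))  # 2 tasks with --mem=90G, matching the squeue output above
```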
(In reply to David Chin from comment #10)
> I think I found the solution since we encountered exactly this in the past.
>
> We have to set "SelectTypeParameters=CR_Core_Memory" on the partitions. In
> the update using Bright, this setting was not retained.
>
> In Bright:
>
> [mgmtnode->wlm[clustername]->jobqueue[def]] set selecttypeparameters
> CR_Core_Memory
>
> and that updates the config for the "def" partition in slurm.conf
>
> I've asked the user to scancel the current job array, and resubmit. We
> should see if it works by tomorrow (user is in Australia, so they are
> time-shifted).

Great! Glad you found that--memory does need to be factored into your scheduling. I had seen that setting (CR_Core) and was planning to ask about it next. Let me know how that goes.

We also have a suggestion about your cgroup configuration. You have set values for swap even though you have ConstrainSwapSpace=no. Also, the *KMem* parameters are known to be buggy in the kernel, so we suggest you remove them. Leave cgroup.conf just like this:

>ConstrainCores=yes
>ConstrainRAMSpace=yes
>ConstrainSwapSpace=no
>ConstrainDevices=yes

Let me know if you have any more questions about this. In the meantime I'm going to drop the severity of this by one level, but will still actively watch this case.
Hi, Chad:

The user's jobs are now being scheduled as expected, i.e., taking the memory request into account.

We can close this out.

Regards,
Dave
(In reply to David Chin from comment #14)
> The user's jobs are now being scheduled as expected, i.e. taking memory
> request into account.
>
> We can close this out.

Very good. Closing.