Ticket 14579

Summary: Memory limits for job array
Product: Slurm Reporter: David Chin <dwc62>
Component: Limits    Assignee: Chad Vizino <chad>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
Priority: --- CC: felip.moll
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=14293
Site: Drexel
Attachments: slurm.conf
scontrol show job output
slurmctld log
cgroup.conf
slurmctld log 20220719T18++

Description David Chin 2022-07-20 09:29:12 MDT
Created attachment 25931 [details]
slurm.conf

I think this is an old issue that you helped me fix, but is now showing up again since the update from 20.02 to 21.08. I am unable to find that old ticket using my email as "reporter" and selecting all statuses.

Our nodes have 192 GiB RAM installed. One user submits a job array:

    #SBATCH --mem=40G

(the user has since updated this to "--mem=80G")

In Slurm 20.02, our configuration meant that no more than 4 array tasks ran on each node.

However, now, with "--mem=80G", there are up to 32 array tasks running on a single node, even though only two 80 GiB tasks should fit in 192 GiB.

The job script is:

#!/bin/bash
#SBATCH --array=0-199
#SBATCH --account=diezrouxPrj
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=is379@drexel.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=80G
#SBATCH --time=45:00:00
#SBATCH --requeue

. /etc/profile.d/modules.sh
module load netlogo/5.3.1

my_task_id=$( printf %03d $SLURM_ARRAY_TASK_ID )

java -Djava.util.prefs.userRoot=$TMP -Djava.util.prefs.systemRoot=$TMP -cp ${NETLOGOHOME}/app/NetLogo.jar \
    org.nlogo.headless.Main \
    --threads $SLURM_NTASKS ...


Current slurm.conf is attached.

Thanks,
    Dave Chin
Comment 1 Jason Booth 2022-07-20 11:16:41 MDT
Please attach the slurmctld.log and the output of "scontrol show job <JOBID>".
Comment 3 David Chin 2022-07-20 13:28:22 MDT
Created attachment 25940 [details]
scontrol show job output
Comment 4 David Chin 2022-07-20 13:28:39 MDT
Created attachment 25941 [details]
slurmctld log
Comment 5 David Chin 2022-07-20 13:28:58 MDT
Created attachment 25942 [details]
cgroup.conf
Comment 6 Chad Vizino 2022-07-20 15:23:41 MDT
Hi. I'm looking into this. Would you please also supply:

slurmctld.log covering maybe an hour preceding and including the start of the array jobs on node051 (looks like they started at 2022-07-19T21:53:00)?

Counting running array jobs on each node (from the output in comment 3) I see:

>32    NodeList=node051
>48    NodeList=node054
>48    NodeList=node055
>20    NodeList=node056
Could you supply a "scontrol show node node051"?
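The per-node counts above can be reproduced from the "scontrol show job" output with a short pipeline. The NodeList lines below are a hypothetical stand-in for the attached output, not the real data:

```shell
# Count running array tasks per node by tallying NodeList= lines.
# "sample" is a made-up excerpt of "scontrol show job" output.
sample='   NodeList=node051
   NodeList=node051
   NodeList=node054
   NodeList=node054
   NodeList=node054'
printf '%s\n' "$sample" | grep -o 'NodeList=node[0-9]*' | sort | uniq -c
```

Running the same pipeline against the real attachment (for running jobs only) gives the counts quoted above.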
Comment 8 David Chin 2022-07-21 06:11:39 MDT
(In reply to Chad Vizino from comment #6)
> Hi. I'm looking into this. Would you please also supply:
> 
> slurmctld.log covering maybe an hour preceding and including the start of
> the array jobs on node051 (looks like they started at 2022-07-19T21:53:00)?
> 
> Counting running array jobs on each node (from the output in comment 3) I
> see:
> 
> >32    NodeList=node051
> >48    NodeList=node054
> >48    NodeList=node055
> >20    NodeList=node056
> Could you supply a "scontrol show node node051"?

NodeName=node051 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node051 NodeHostName=node051 Version=21.08.8-2
   OS=Linux 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
   RealMemory=192000 AllocMem=0 FreeMem=189812 Sockets=4 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=874000 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=def,long
   BootTime=2022-07-18T15:02:32 SlurmdStartTime=2022-07-18T15:04:36
   LastBusyTime=2022-07-21T05:33:42
   CfgTRES=cpu=48,mem=187.50G,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Drained by CMDaemon [root@2022-07-20T11:46:21]
Comment 9 David Chin 2022-07-21 06:17:13 MDT
Created attachment 25953 [details]
slurmctld log 20220719T18++
Comment 10 David Chin 2022-07-21 06:39:31 MDT
Hi Chad:

I think I found the solution since we encountered exactly this in the past.

We have to set "SelectTypeParameters=CR_Core_Memory" for the partitions. This setting was not retained in the update performed through Bright Cluster Manager.

In Bright:

[mgmtnode->wlm[clustername]->jobqueue[def]] set selecttypeparameters CR_Core_Memory

and that updates the config for the "def" partition in slurm.conf
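For reference, the relevant slurm.conf lines end up looking like this (a sketch; the SelectType plugin name here is an assumption, so check the actual file Bright generates):

```
# slurm.conf -- relevant lines (sketch; plugin name assumed)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
```

With plain CR_Core, only cores are consumable resources, so the scheduler packs tasks up to the core count and ignores --mem; CR_Core_Memory makes memory a consumable resource as well.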

I've asked the user to scancel the current job array, and resubmit. We should see if it works by tomorrow (user is in Australia, so they are time-shifted).

Dave
Comment 11 David Chin 2022-07-21 06:54:36 MDT
Actually, I just tried it myself and it worked. Submitted an array requesting "--mem=90G" so I expect at most two tasks per node, and that's what I see.

      3458855_84  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node014
      3458855_85  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node015
      3458855_86  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node015
      3458855_87  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node016
      3458855_88  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node016
      3458855_89  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node017
      3458855_90  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node017
      3458855_91  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node018
      3458855_92  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node018
      3458855_93  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node019
      3458855_94  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node019
      3458855_95  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node020
      3458855_96  def tstarr_1node    dwc62   urcfadmprj  R       0:06       15:00   1    1      90G node020
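The two-per-node packing matches the arithmetic: the expected task count is the floor of configured node memory over the per-task request. A minimal check, using the values from the "scontrol show node node051" output above:

```shell
# Expected array tasks per node = floor(node memory / per-task request).
node_mem_mb=192000          # RealMemory from "scontrol show node node051"
task_mem_mb=$((90 * 1024))  # --mem=90G, in MiB
echo $((node_mem_mb / task_mem_mb))   # prints 2
```

The same arithmetic with --mem=80G also yields 2, which is why 32 tasks on one node indicated that memory was not being tracked at all.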
Comment 13 Chad Vizino 2022-07-21 08:05:01 MDT
(In reply to David Chin from comment #10)
> I think I found the solution since we encountered exactly this in the past.
> 
> We have to set "SelectTypeParameters=CR_Core_Memory" on the partitions. In
> the update using Bright, this setting was not retained.
> 
> In Bright:
> 
> [mgmtnode->wlm[clustername]->jobqueue[def]] set selecttypeparameters
> CR_Core_Memory
> 
> and that updates the config for the "def" partition in slurm.conf
> 
> I've asked the user to scancel the current job array, and resubmit. We
> should see if it works by tomorrow (user is in Australia, so they are
> time-shifted).
Great! Glad you found that; memory does need to be factored into your scheduling. I had seen that setting (CR_Core) and was planning to ask about it next. Let me know how that goes.

We also have a suggestion about your cgroup configuration. You have set swap-related values even though ConstrainSwapSpace=no, so they have no effect. Also, the *KMem* parameters are known to be buggy in the kernel, so we suggest removing them.

Leave cgroup.conf like this:

>ConstrainCores=yes
>ConstrainRAMSpace=yes
>ConstrainSwapSpace=no
>ConstrainDevices=yes
Let me know if you have any more questions about this. In the meantime I'm going to drop the severity of this ticket by one level, but will still actively watch this case.
Comment 14 David Chin 2022-07-22 06:56:44 MDT
Hi, Chad:

The user's jobs are now being scheduled as expected, i.e. taking memory request into account.

We can close this out.

Regards,
    Dave
Comment 15 Chad Vizino 2022-07-23 14:10:53 MDT
(In reply to David Chin from comment #14)
> The user's jobs are now being scheduled as expected, i.e. taking memory
> request into account.
> 
> We can close this out.
Very good. Closing.