| Summary: | High load, high %cpu usage, slow system, cgroups | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Renata Dart <renata> |
| Component: | Limits | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SLAC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi, is there any update on this issue, or do you need anything further from me? I'd like to understand more about setting ConstrainCores in cgroup.conf and whether that might help in high-load, excessive-CPU-usage situations like this one.
Thanks,
Renata

Renata - I have Marcin looking into this for you. He will reply to you by tomorrow.

Thanks!
Renata

Renata,

If you are not constraining cores, then it's possible to submit a single-core job that ends up using more cores. The user may even be unaware of the disruptive activity, since some applications/frameworks will, for instance, start as many tasks as the number of cores they discover.

I'd strongly recommend enabling the option in general. Is there any reason you decided to leave cores unconstrained?

Cheers,
Marcin

Hi Marcin, to be honest, I think it was just an oversight coupled with an incomplete understanding of what it did, even though the name seems obvious enough. I didn't actually think that the issue I included in this ticket was a case of using more cores than requested by the job. The high load and slowness for the interactive users was my main concern. Can you tell anything about that from what I included? Is there a Slurm cgroup (or other) setting that I should be using that may address that? And do I need to restart all the slurmds to set ConstrainCores, or just slurmctld?
Thanks,
Renata

Renata,

I may be wrong, since we just have an example job, but if the specification is consistent, user swmclau2 specified NumNodes=1 NumCPUs=16, which means that with multiple threads his process should go up to at most 1600% CPU. The top command shows ~26*100%, which means he's running at least 26 threads.

>The high load and slowness for the interactive users was my main concern.
Yep, but this is probably caused by other jobs running on the cores slurmctld selected for those users.

>Is there a slurm cgroup (or other) setting that I should be using that may address that? And do I need to restart all the slurmds to set ConstrainCores, or just slurmctld?
Yes, change your etc/slurm/cgroup.conf to have:
>ConstrainCores=Yes
and call `scontrol reconfigure`.

To verify that it works correctly, go to the directory where your cgroups are mounted (by default /sys/fs/cgroup/) and check the content of cpuset/slurm/uid_UidOfUser/job_JobId/cpuset.cpus - it should contain the list of allowed CPUs, e.g. "5-8".

This should be enough to limit a job to only the CPUs it was given; however, the TaskPlugin stack we recommend (in the general case) is:
>TaskPlugin=task/affinity,task/cgroup
This combination makes use of cgroups to limit the CPUs/memory/devices accessible to a job and gives end users the additional --cpu-bind options implemented in the task/affinity plugin.

Let me know if enabling the core constraint worked as expected. Keep in mind that it will affect only new jobs, so you may need to wait some time for the change to take full effect.

cheers,
Marcin

Hi Marcin, thanks for your analysis.
I have just implemented it and will see what happens. Just to be sure: I did not have to systemctl restart slurmd on the individual nodes, just restart slurmctld and scontrol reconfigure?
Renata

>on the individual nodes, just restart slurmctld and scontrol reconfigure?
Just `scontrol reconfigure` was needed to get this change in place, so you didn't have to bounce slurmctld.

Did you check whether the per-user, job, and step directories under the cpuset filesystem are created and contain cpuset.cpus files with content aligned with the result of scontrol show job -d JOBID? (I mean the CPU indices displayed in the per-node listing of the command)[1].

cheers,
Marcin

[1]JobId=36977 JobName=wrap
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=66944 Nice=0 Account=root QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:02 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-09-24T15:18:01 EligibleTime=2020-09-24T15:18:01
   AccrueTime=2020-09-24T15:18:01
   StartTime=2020-09-24T15:18:06 EndTime=Unknown Deadline=N/A
   PreemptEligibleTime=2020-09-24T15:18:06 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-24T15:18:06
   Partition=AllNodes AllocNode:Sid=slurmctl:7785
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=test01
   BatchHost=test01
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=(null)
   Nodes=test01 CPU_IDs=0-1 Mem=0 GRES=
                ^^^^^^^^^^^^-------------- List of CPUs assigned by slurmctld
   MinCPUsNode=1 MinMemoryNode=10M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/slurm-sources/bug9851
   StdErr=/slurm-sources/bug9851/slurm-36977.out
   StdIn=/dev/null
   StdOut=/slurm-sources/bug9851/slurm-36977.out
   Power=
   MailUser=(null) MailType=NONE

Hi Marcin, I just ran a job requesting 6 cores and see this on the running host:
[root@rome0142 job_61192]# pwd
/sys/fs/cgroup/cpuset/slurm/uid_1197/job_61192
[root@rome0142 job_61192]# cat cpuset.cpus
0-2,64-66
and the same under step_0 and step_batch.
Renata

Looks good. Let's check with the interactive users once MaxTime for the partition in question (shared), or for other partitions overlapping with it, has passed.

cheers,
Marcin

Hi Marcin, we are still seeing high load on the hosts with jobs running
from user swmclau2. We think it is related to the number of threads
his jobs are using. His userid is 14185:
[renata@rome0122 ~]$ squeue | grep rome0122
62173_751 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_752 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_756 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_758 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_761 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_762 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_772 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_773 shared trainer swmclau2 R 2:35:49 1 rome0122
[renata@rome0122 ~]$ w
11:16:40 up 12 days, 20:07, 1 user, load average: 1009.70, 799.51, 722.91
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
renata pts/0 sdf-login01.slac 11:16 0.00s 0.21s 0.03s w
[renata@rome0122 ~]$ ps axms | grep ^14185 | grep -c ' Rl'
1024
As the number of threads he uses drops, so does the load.
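(Editor's note: the `ps axms` check above can also be expressed against ps's per-thread output. A sketch, assuming procps `ps` on the compute node; `count_running_threads` is a hypothetical helper, not a site tool:)

```shell
# Hypothetical helper: count a user's threads currently in state R (running),
# roughly equivalent to `ps axms | grep ^14185 | grep -c ' Rl'` above.
count_running_threads() {
    ps -u "$1" -L -o stat= | grep -c '^R'
}
count_running_threads swmclau2
```

On a node where the load tracks one user's jobs, this count should rise and fall with the load average.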
I am wondering if the NUMA configuration is set up properly. The nodes are AMD EPYC 7702
systems and I have
SchedulerParameters=Ignore_NUMA
We haven't turned off NUMA balancing in the OS, though. The nodes
look like:
[renata@rome0001 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7702 64-Core Processor
Stepping: 0
CPU MHz: 1996.252
BogoMIPS: 3992.50
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63
NUMA node4 CPU(s): 64-79
NUMA node5 CPU(s): 80-95
NUMA node6 CPU(s): 96-111
NUMA node7 CPU(s): 112-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca
[renata@rome0001 ~]$
Does this seem like the right set-up?
Thanks,
Renata
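(Editor's note: for the OS side of this question, the kernel's automatic NUMA balancing can be inspected independently of Slurm. A sketch; the sysctl name is the standard Linux one, but availability depends on the kernel build:)

```shell
# Check whether automatic NUMA balancing is enabled.
# Prints 1 (enabled), 0 (disabled), or "not available" on kernels without it.
sysctl -n kernel.numa_balancing 2>/dev/null || echo "not available"
```

Disabling it (`sysctl -w kernel.numa_balancing=0`) is sometimes tried when per-job cpusets already pin work to specific cores; whether it helps is workload-dependent.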
Renata,

>we are still seeing high load on the hosts with jobs running from user swmclau2. We think it is related to the number of threads his jobs are using.

That's actually expected, and you're correct that it's related to the number of threads/processes running on the host. By definition, the load is the number of processes in the running or ready queue (using or waiting for CPU resources). However, if specific processes are bound to the appropriate cores, high load should not impact all CPUs, since other jobs are assigned their own individual resources and are not competing with that user's threads. Do you see "slowness" from an interactive job running on the same host as the highly threaded one?

>SchedulerParameters=Ignore_NUMA
Setting this for a cluster with AMD EPYC CPUs sounds reasonable.

cheers,
Marcin

Hi Marcin, there were no interactive jobs running this time. And when I logged in, the system seemed responsive. I'll monitor some more and try to check with any interactive users that I see running on the same host.
Thanks,
Renata

>And when I logged in, the system seemed responsive.
This is a good symptom: though the load is high, it doesn't impact all processes, since the set of threads responsible for it is limited to the appropriate CPUs.

>I'll monitor some more and try to check with any interactive users that I see running on the same host.
It should be all right. The "system responsiveness" verification you did is essentially the same thing.

cheers,
Marcin

Renata,

Did you have a chance to verify the solution - either logging into nodes with higher load or checking interactive users' experience?

cheers,
Marcin

Hi Marcin, I haven't been able to get anything definitive out of users, but I haven't heard any more complaints either. I continue to see high loads on the hosts running jobs for that one user, but those hosts are also very responsive. I think we can say that setting ConstrainCores has helped/fixed the issue, and I'll open a new ticket if needed.
Thanks for all of your help,
Renata
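(Editor's note: as a footnote to the cpuset.cpus verification earlier in the thread, a list such as "0-2,64-66" can be expanded mechanically and its size compared with the job's NumCPUs. A sketch; `expand_cpuset` is a hypothetical helper, not part of Slurm:)

```shell
# Hypothetical helper: expand a cpuset list like "0-2,64-66" into one CPU id
# per line, so the count can be checked against NumCPUs from `scontrol show job -d`.
expand_cpuset() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done
}
expand_cpuset "0-2,64-66" | wc -l    # the 6-core job above: prints 6
```

Six CPUs for a 6-core request, matching Renata's observation that the constraint took effect.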
Hi SchedMD, our Slurm cluster was set up a couple of months ago and its use has slowly been building. We have partitions set up so that the groups who have purchased hardware get priority access to their nodes, and we have a shared partition for everyone else. The hosts are all AMD 128-core systems. Recently we have experienced heavy load on some of our hosts and complaints about slowness from some of the interactive JupyterLab users. In addition to the high load, the shared user, swmclau2, in this case running 7 jobs on host rome0001, shows a high %CPU in top. We have cgroups turned on, but not ConstrainCores. Would turning ConstrainCores on help in this situation?

Here are our cgroup.conf and the cgroup entries in slurm.conf:
[renata@slurmctld1 slurm]$ cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=no
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MemorySwappiness=10
[renata@slurmctld1 slurm]$ grep -i cgroup slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
----------------------
Here is the squeue output for this host:
[renata@rome0001 ~]$ squeue | grep rome0001
58924 supercdms sys/dash jnels1 R 12:42 1 rome0001
58906 supercdms sys/dash swatkins R 4:34:34 1 rome0001
58478_633 shared trainer swmclau2 R 5:55:40 1 rome0001
58478_634 shared trainer swmclau2 R 5:55:40 1 rome0001
58478_635 shared trainer swmclau2 R 5:55:40 1 rome0001
58478_629 shared trainer swmclau2 R 19:58:14 1 rome0001
58478_630 shared trainer swmclau2 R 19:58:14 1 rome0001
58478_621 shared trainer swmclau2 R 1-00:33:44 1 rome0001
58478_622 shared trainer swmclau2 R 1-00:33:44 1 rome0001
-----------------------
Here is what top says:
top - 14:26:17 up 24 days, 22:11, 3 users, load average: 641.22, 686.43, 605.22
Tasks: 1561 total, 1 running, 1560 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52783219+total, 21961225+free, 29294486+used, 15275084 buff/cache
KiB Swap: 13126860+total, 13126860+free, 0 used. 23297265+avail Mem

   PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 39129 swmclau2  20   0   52.7g  38.0g  6676 S  2631  7.6 30518:11 python3
 39128 swmclau2  20   0   52.8g  37.9g  6676 S  2599  7.5 30447:55 python3
110887 swmclau2  20   0   52.8g  38.1g  6676 S  2494  7.6 22642:38 python3
 18655 swmclau2  20   0   52.8g  37.4g  6676 S  2111  7.4  5483:24 python3
 18652 swmclau2  20   0   52.6g  38.4g  6676 S  1544  7.6  5768:20 python3
 18658 swmclau2  20   0   52.3g  38.6g  6676 S 746.2  7.7  5633:50 python3
110886 swmclau2  20   0   52.3g  38.7g  6676 S 671.6  7.7 22519:58 python3
110894 ytl       20   0  174060   4164  1756 S   1.3  0.0  0:13.81 top
116969 renata    20   0  173960   3972  1656 R   0.7  0.0  0:00.15 top
     1 root      20   0  203240   8244  4216 S   0.3  0.0  7:29.54 systemd
     9 root      20   0       0      0     0 S   0.3  0.0 191:48.50 rcu_sched
   450 root      20   0       0      0     0 S   0.3  0.0  0:01.36 ksoftirqd/87
  2532 root      20   0   22280   1964   988 S   0.3  0.0 35:24.05 irqbalance
  3382 telegraf  20   0 4379296 107064 20988 S   0.3  0.0 869:17.36 telegraf
  3400 root      20   0  584228  21992  6720 S   0.3  0.0  1:58.71 tuned
  5160 root       0 -20   23.5g   1.3g 111884 S  0.3  0.3 44:02.95 mmfsd
 37605 swatkins  20   0  770272 104776  8396 S   0.3  0.0  0:14.69 jupyter-lab
110674 jnels1    20   0 1185724  91664  8308 S   0.3  0.0  0:07.73 jupyter-lab
And this is what the cpu usage in top looks like.
I couldn't figure out how to see all 128, but the first 50 look like this:
Tasks: 1605 total, 3 running, 1602 sleeping, 0 stopped, 0 zombie
%Cpu0  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
---------------------
A snapshot of vmstat:
[root@rome0001 log]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b swpd      free   buff    cache    si so bi  bo     in    cs  us sy id wa st
897  0    0 270594112 663736 14792704    0  0  0   0      0     0  37  3 60  0  0
897  0    0 270364064 663736 14792704    0  0  0  36 129107 30781 100  0  0  0  0
897  0    0 270156960 663736 14792704    0  0  0   0 129113 31025 100  0  0  0  0
897  0    0 269886080 663736 14792712    0  0  0   0 129249 30943 100  0  0  0  0
898  0    0 269633536 663736 14792712    0  0  0   0 129865 30952 100  0  0  0  0
905  0    0 269426752 663736 14792712    0  0  0   0 132271 31546 100  0  0  0  0
897  0    0 269164800 663740 14792712    0  0  0  12 132624 32051 100  0  0  0  0
897  0    0 268862560 663740 14792716    0  0  0   0 129201 31030 100  0  0  0  0
897  0    0 268625120 663740 14792716    0  0  0  20 129243 31167 100  0  0  0  0
897  0    0 268417840 663740 14792716    0  0  0   8 130159 31176 100  0  0  0  0
897  0    0 268235456 663740 14792716    0  0  0  36 129938 31075 100  0  0  0  0
897  0    0 267909248 663740 14792716    0  0  0   4 129157 30938 100  0  0  0  0
897  0    0 267632416 663740 14792716    0  0  0   0 129377 31151 100  0  0  0  0
897  0    0 267400320 663740 14792708    0  0  0   0 129294 31115 100  0  0  0  0
897  0    0 267199264 663740 14792708    0  0  0 156 129134 31003 100  0  0  0  0
---------------
User jnels1 was showing up repeatedly in the messages file as running out of memory, but that seemed to be confined to their cgroup, which seems to be doing the right thing. Just mentioning that it was happening repeatedly:
[2152088.364814] Memory cgroup out of memory: Kill process 109300 (python3) score 574 or sacrifice child
[2152088.374035] Killed process 108583 (python3), UID 14642, total-vm:4984520kB, anon-rss:1198820kB, file-rss:2860kB, shmem-rss:4kB
[2152142.565672] python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
---------------
Here is what the shared user's (swmclau2) job(s) look like:
[renata@rome0001 ~]$ scontrol show job 58478_633
JobId=58898 ArrayJobId=58478 ArrayTaskId=633 JobName=trainer
   UserId=swmclau2(14185) GroupId=ki(1092) MCS_label=N/A
   Priority=152 Nice=0 Account=shared QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:23:19 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2020-09-17T12:32:15 EligibleTime=2020-09-17T12:32:15
   AccrueTime=2020-09-17T12:32:15
   StartTime=2020-09-18T08:22:03 EndTime=2020-09-21T08:22:03 Deadline=N/A
   PreemptEligibleTime=2020-09-18T08:22:03 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-18T08:22:03
   Partition=shared AllocNode:Sid=sdf-login02:20302
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=rome0001
   BatchHost=rome0001
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=64320M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=4020M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/swmclau2/knn_4_cdf_zheng07/tmp.sbatch
   WorkDir=/sdf/home/s/swmclau2/Git/pearce/bin/trainer
   StdErr=/scratch/swmclau2/knn_4_cdf_zheng07/trainer_633.err
   StdIn=/dev/null
   StdOut=/scratch/swmclau2/knn_4_cdf_zheng07/trainer_633.out
   Power=
   MailUser=(null) MailType=NONE
Thanks,
Renata
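(Editor's note: one way to read the top output in this report is that a single process's %CPU divided by 100 gives a lower bound on its concurrently running threads. A quick arithmetic sketch using the busiest python3 process above:)

```shell
# Lower bound on concurrent threads for the busiest python3 process above:
# 2631 %CPU implies at least ceil(2631/100) = 27 threads on-CPU at once,
# well above the job's 16-CPU allocation when cores are unconstrained.
pct=2631
echo $(( (pct + 99) / 100 ))    # prints 27
```

This matches Marcin's later observation in the thread that the process was running at least ~26 threads despite requesting NumCPUs=16.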