| Summary: | High load, high %cpu usage, slow system, cgroups | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Renata Dart <renata> |
| Component: | Limits | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SLAC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi, is there any update on this issue, or do you need anything further from me? I'd like to understand more about setting ConstrainCores in cgroup.conf and whether that might help in high-load, excessive-CPU-usage situations like this one.
Thanks,
Renata

Renata - I have Marcin looking into this for you. He will reply to you by tomorrow.

Thanks!
Renata

Renata,

If you are not constraining cores, then it's possible to submit a single-core job that ends up using more cores. The user may even be unaware of the disruptive activity, since some applications/frameworks will, for instance, start as many tasks as the number of cores they discover.

I'd strongly recommend enabling the option in general. Is there any reason you decided to leave cores unconstrained?

Cheers,
Marcin

Hi Marcin, to be honest, I think it was just an oversight coupled with an incomplete understanding of what it did, even though the name seems obvious enough. I didn't actually think that the issue I included in this ticket was a case of using more cores than requested by the job. The high load and slowness for the interactive users was my main concern. Can you tell anything about that from what I included? Is there a Slurm cgroup (or other) setting that I should be using that may address that? And do I need to restart all the slurmds to set ConstrainCores, or just slurmctld?
Thanks,
Renata

Renata,

I may be wrong, since we just have an example job, but if the specification is consistent, user swmclau2 specified NumNodes=1 NumCPUs=16, which means that with multiple threads his process should go up to at most 1600% CPU. The top command shows ~26*100%, which means he's running at least 26 threads.

>The high load and slowness for the interactive users was my main concern.
Yep, but this is probably caused by other jobs running on the cores slurmctld selected for those users.

>Is there a slurm cgroup (or other) setting that I should be using that may address that? And do I need to restart all the slurmds to set ConstrainCores, or just slurmctld?
Yes, change your etc/slurm/cgroup.conf to have:
>ConstrainCores=Yes
and call `scontrol reconfigure`.

To verify that it works correctly, go to the directory where your cgroups are mounted (by default /sys/fs/cgroup/) and check the content of cpuset/slurm/uid_UidOfUser/job_JobId/cpuset.cpus - it should contain the list of allowed CPUs, e.g. "5-8".

This should be enough to limit a job to only the CPUs it was given; however, the TaskPlugin stack we recommend (in the general case) is:
>TaskPlugin=task/affinity,task/cgroup
This combination makes use of cgroups to limit the CPUs/memory/devices accessible to a job and gives end users the additional --cpu-bind options implemented in the task/affinity plugin.

Let me know if enabling the core constraint worked as expected. Keep in mind that it will affect only new jobs, so you may need to wait some time for the change to take full effect.

cheers,
Marcin

Hi Marcin, thanks for your analysis.
I have just implemented it and will see what happens. Just to be sure: I did not have to systemctl restart slurmd on the individual nodes, just restart slurmctld and scontrol reconfigure?
Renata

>on the individual nodes, just restart slurmctld and scontrol reconfigure?
Just `scontrol reconfigure` was needed to get this change in place, so you didn't have to bounce slurmctld.

Did you check whether the per-user, job, and step directories under the cpuset filesystem are created and contain cpuset.cpus files with content aligned with the result of scontrol show job -d JOBID? (I mean the CPU indices displayed in the per-node listing of the command)[1].

cheers,
Marcin

[1]JobId=36977 JobName=wrap
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=66944 Nice=0 Account=root QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:02 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-09-24T15:18:01 EligibleTime=2020-09-24T15:18:01
   AccrueTime=2020-09-24T15:18:01
   StartTime=2020-09-24T15:18:06 EndTime=Unknown Deadline=N/A
   PreemptEligibleTime=2020-09-24T15:18:06 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-24T15:18:06
   Partition=AllNodes AllocNode:Sid=slurmctl:7785
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=test01
   BatchHost=test01
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=(null)
   Nodes=test01 CPU_IDs=0-1 Mem=0 GRES=
                ^^^^^^^^^^^^-------------- List of CPUs assigned by slurmctld
   MinCPUsNode=1 MinMemoryNode=10M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/slurm-sources/bug9851
   StdErr=/slurm-sources/bug9851/slurm-36977.out
   StdIn=/dev/null
   StdOut=/slurm-sources/bug9851/slurm-36977.out
   Power=
   MailUser=(null) MailType=NONE

Hi Marcin, I just ran a job requesting 6 cores and see this on the running host:
[root@rome0142 job_61192]# pwd
/sys/fs/cgroup/cpuset/slurm/uid_1197/job_61192
[root@rome0142 job_61192]# cat cpuset.cpus
0-2,64-66
and the same under step_0 and step_batch.
Renata

Looks good. Let's check with the interactive users once MaxTime for the partition in question (shared), or for other partitions overlapping with it, has passed.

cheers,
Marcin

Hi Marcin, we are still seeing high load on the hosts with jobs running
from user swmclau2. We think it is related to the number of threads
his jobs are using. His userid is 14185:
[renata@rome0122 ~]$ squeue | grep rome0122
62173_751 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_752 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_756 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_758 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_761 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_762 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_772 shared trainer swmclau2 R 2:35:49 1 rome0122
62173_773 shared trainer swmclau2 R 2:35:49 1 rome0122
[renata@rome0122 ~]$ w
11:16:40 up 12 days, 20:07, 1 user, load average: 1009.70, 799.51, 722.91
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
renata pts/0 sdf-login01.slac 11:16 0.00s 0.21s 0.03s w
[renata@rome0122 ~]$ ps axms | grep ^14185 | grep -c ' Rl'
1024
As the number of threads he uses drops, so does the load.
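(Editor's note: the `ps axms` check above can also be expressed against ps's per-thread output. A sketch, assuming procps `ps` on the compute node; `count_running_threads` is a hypothetical helper, not a site tool:)

```shell
# Hypothetical helper: count a user's threads currently in state R (running),
# roughly equivalent to `ps axms | grep ^14185 | grep -c ' Rl'` above.
count_running_threads() {
    ps -u "$1" -L -o stat= | grep -c '^R'
}
count_running_threads swmclau2
```

On a node where the load tracks one user's jobs, this count should rise and fall with the load average.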
I am wondering if the NUMA configuration is set up properly. The nodes are AMD EPYC 7702
systems and I have
SchedulerParameters=Ignore_NUMA
We haven't turned off NUMA balancing in the OS, though. The nodes
look like:
[renata@rome0001 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7702 64-Core Processor
Stepping: 0
CPU MHz: 1996.252
BogoMIPS: 3992.50
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63
NUMA node4 CPU(s): 64-79
NUMA node5 CPU(s): 80-95
NUMA node6 CPU(s): 96-111
NUMA node7 CPU(s): 112-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca
[renata@rome0001 ~]$
Does this seem like the right set-up?
Thanks,
Renata
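(Editor's note: for the OS side of this question, the kernel's automatic NUMA balancing can be inspected independently of Slurm. A sketch; the sysctl name is the standard Linux one, but availability depends on the kernel build:)

```shell
# Check whether automatic NUMA balancing is enabled.
# Prints 1 (enabled), 0 (disabled), or "not available" on kernels without it.
sysctl -n kernel.numa_balancing 2>/dev/null || echo "not available"
```

Disabling it (`sysctl -w kernel.numa_balancing=0`) is sometimes tried when per-job cpusets already pin work to specific cores; whether it helps is workload-dependent.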
Renata,

>we are still seeing high load on the hosts with jobs running from user swmclau2. We think it is related to the number of threads his jobs are using.

That's actually expected, and you're correct that it's related to the number of threads/processes running on the host. By definition, the load is the number of processes in the running or ready queue (using or waiting for CPU resources). However, if specific processes are bound to the appropriate cores, high load should not impact all CPUs, since other jobs are assigned their own individual resources and are not competing with that user's threads. Do you see "slowness" from an interactive job running on the same host as the highly threaded one?

>SchedulerParameters=Ignore_NUMA
Setting this for a cluster with AMD EPYC CPUs sounds reasonable.

cheers,
Marcin

Hi Marcin, there were no interactive jobs running this time. And when I logged in, the system seemed responsive. I'll monitor some more and try to check with any interactive users that I see running on the same host.
Thanks,
Renata

>And when I logged in, the system seemed responsive.
This is a good symptom: though the load is high, it doesn't impact all processes, since the set of threads responsible for it is limited to the appropriate CPUs.

>I'll monitor some more and try to check with any interactive users that I see running on the same host.
It should be all right. The "system responsiveness" verification you did is essentially the same thing.

cheers,
Marcin

Renata,

Did you have a chance to verify the solution - either logging into nodes with higher load or checking interactive users' experience?

cheers,
Marcin

Hi Marcin, I haven't been able to get anything definitive out of users, but I haven't heard any more complaints either. I continue to see high loads on the hosts running jobs for that one user, but those hosts are also very responsive. I think we can say that setting ConstrainCores has helped/fixed the issue, and I'll open a new ticket if needed.
Thanks for all of your help,
Renata
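(Editor's note: as a footnote to the cpuset.cpus verification earlier in the thread, a list such as "0-2,64-66" can be expanded mechanically and its size compared with the job's NumCPUs. A sketch; `expand_cpuset` is a hypothetical helper, not part of Slurm:)

```shell
# Hypothetical helper: expand a cpuset list like "0-2,64-66" into one CPU id
# per line, so the count can be checked against NumCPUs from `scontrol show job -d`.
expand_cpuset() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done
}
expand_cpuset "0-2,64-66" | wc -l    # the 6-core job above: prints 6
```

Six CPUs for a 6-core request, matching Renata's observation that the constraint took effect.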
Hi SchedMD, our Slurm cluster was set up a couple of months ago and its use has slowly been building. We have partitions set up so that the groups who have purchased hardware get priority access to their nodes, and we have a shared partition for everyone else. The hosts are all AMD 128-core systems. Recently we have experienced heavy load on some of our hosts and complaints about slowness from some of the interactive JupyterLab users. In addition to the high load, the shared user, swmclau2, in this case running 7 jobs on host rome0001, shows a high %CPU in top. We have cgroups turned on, but not ConstrainCores. Would turning ConstrainCores on help in this situation?

Here are our cgroup.conf and the cgroup entries in slurm.conf:
[renata@slurmctld1 slurm]$ cat cgroup.conf
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
ConstrainCores=no
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MemorySwappiness=10
[renata@slurmctld1 slurm]$ grep -i cgroup slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
----------------------
Here is the squeue output for this host:
[renata@rome0001 ~]$ squeue | grep rome0001
58924 supercdms sys/dash jnels1 R 12:42 1 rome0001
58906 supercdms sys/dash swatkins R 4:34:34 1 rome0001
58478_633 shared trainer swmclau2 R 5:55:40 1 rome0001
58478_634 shared trainer swmclau2 R 5:55:40 1 rome0001
58478_635 shared trainer swmclau2 R 5:55:40 1 rome0001
58478_629 shared trainer swmclau2 R 19:58:14 1 rome0001
58478_630 shared trainer swmclau2 R 19:58:14 1 rome0001
58478_621 shared trainer swmclau2 R 1-00:33:44 1 rome0001
58478_622 shared trainer swmclau2 R 1-00:33:44 1 rome0001
-----------------------
Here is what top says:
top - 14:26:17 up 24 days, 22:11, 3 users, load average: 641.22, 686.43, 605.22
Tasks: 1561 total, 1 running, 1560 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.9 us, 0.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52783219+total, 21961225+free, 29294486+used, 15275084 buff/cache
KiB Swap: 13126860+total, 13126860+free, 0 used. 23297265+avail Mem

   PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 39129 swmclau2  20   0   52.7g  38.0g  6676 S  2631  7.6 30518:11 python3
 39128 swmclau2  20   0   52.8g  37.9g  6676 S  2599  7.5 30447:55 python3
110887 swmclau2  20   0   52.8g  38.1g  6676 S  2494  7.6 22642:38 python3
 18655 swmclau2  20   0   52.8g  37.4g  6676 S  2111  7.4  5483:24 python3
 18652 swmclau2  20   0   52.6g  38.4g  6676 S  1544  7.6  5768:20 python3
 18658 swmclau2  20   0   52.3g  38.6g  6676 S 746.2  7.7  5633:50 python3
110886 swmclau2  20   0   52.3g  38.7g  6676 S 671.6  7.7 22519:58 python3
110894 ytl       20   0  174060   4164  1756 S   1.3  0.0  0:13.81 top
116969 renata    20   0  173960   3972  1656 R   0.7  0.0  0:00.15 top
     1 root      20   0  203240   8244  4216 S   0.3  0.0  7:29.54 systemd
     9 root      20   0       0      0     0 S   0.3  0.0 191:48.50 rcu_sched
   450 root      20   0       0      0     0 S   0.3  0.0  0:01.36 ksoftirqd/87
  2532 root      20   0   22280   1964   988 S   0.3  0.0 35:24.05 irqbalance
  3382 telegraf  20   0 4379296 107064 20988 S   0.3  0.0 869:17.36 telegraf
  3400 root      20   0  584228  21992  6720 S   0.3  0.0  1:58.71 tuned
  5160 root       0 -20   23.5g   1.3g 111884 S  0.3  0.3 44:02.95 mmfsd
 37605 swatkins  20   0  770272 104776  8396 S   0.3  0.0  0:14.69 jupyter-lab
110674 jnels1    20   0 1185724  91664  8308 S   0.3  0.0  0:07.73 jupyter-lab
And this is what the cpu usage in top looks like.
I couldn't figure out how to see all 128, but the first 50 look like this:
Tasks: 1605 total, 3 running, 1602 sleeping, 0 stopped, 0 zombie
%Cpu0  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9  :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
---------------------
A snapshot of vmstat:
[root@rome0001 log]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b swpd      free   buff    cache    si so bi  bo     in    cs  us sy id wa st
897  0    0 270594112 663736 14792704    0  0  0   0      0     0  37  3 60  0  0
897  0    0 270364064 663736 14792704    0  0  0  36 129107 30781 100  0  0  0  0
897  0    0 270156960 663736 14792704    0  0  0   0 129113 31025 100  0  0  0  0
897  0    0 269886080 663736 14792712    0  0  0   0 129249 30943 100  0  0  0  0
898  0    0 269633536 663736 14792712    0  0  0   0 129865 30952 100  0  0  0  0
905  0    0 269426752 663736 14792712    0  0  0   0 132271 31546 100  0  0  0  0
897  0    0 269164800 663740 14792712    0  0  0  12 132624 32051 100  0  0  0  0
897  0    0 268862560 663740 14792716    0  0  0   0 129201 31030 100  0  0  0  0
897  0    0 268625120 663740 14792716    0  0  0  20 129243 31167 100  0  0  0  0
897  0    0 268417840 663740 14792716    0  0  0   8 130159 31176 100  0  0  0  0
897  0    0 268235456 663740 14792716    0  0  0  36 129938 31075 100  0  0  0  0
897  0    0 267909248 663740 14792716    0  0  0   4 129157 30938 100  0  0  0  0
897  0    0 267632416 663740 14792716    0  0  0   0 129377 31151 100  0  0  0  0
897  0    0 267400320 663740 14792708    0  0  0   0 129294 31115 100  0  0  0  0
897  0    0 267199264 663740 14792708    0  0  0 156 129134 31003 100  0  0  0  0
---------------
User jnels1 was showing up repeatedly in the messages file as running out of memory, but that seemed to be confined to their cgroup, which seems to be doing the right thing. Just mentioning that it was happening repeatedly:
[2152088.364814] Memory cgroup out of memory: Kill process 109300 (python3) score 574 or sacrifice child
[2152088.374035] Killed process 108583 (python3), UID 14642, total-vm:4984520kB, anon-rss:1198820kB, file-rss:2860kB, shmem-rss:4kB
[2152142.565672] python3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
---------------
Here is what the shared user's (swmclau2) job(s) look like:
[renata@rome0001 ~]$ scontrol show job 58478_633
JobId=58898 ArrayJobId=58478 ArrayTaskId=633 JobName=trainer
   UserId=swmclau2(14185) GroupId=ki(1092) MCS_label=N/A
   Priority=152 Nice=0 Account=shared QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:23:19 TimeLimit=3-00:00:00 TimeMin=N/A
   SubmitTime=2020-09-17T12:32:15 EligibleTime=2020-09-17T12:32:15
   AccrueTime=2020-09-17T12:32:15
   StartTime=2020-09-18T08:22:03 EndTime=2020-09-21T08:22:03 Deadline=N/A
   PreemptEligibleTime=2020-09-18T08:22:03 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-18T08:22:03
   Partition=shared AllocNode:Sid=sdf-login02:20302
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=rome0001
   BatchHost=rome0001
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=64320M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=4020M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/swmclau2/knn_4_cdf_zheng07/tmp.sbatch
   WorkDir=/sdf/home/s/swmclau2/Git/pearce/bin/trainer
   StdErr=/scratch/swmclau2/knn_4_cdf_zheng07/trainer_633.err
   StdIn=/dev/null
   StdOut=/scratch/swmclau2/knn_4_cdf_zheng07/trainer_633.out
   Power=
   MailUser=(null) MailType=NONE
Thanks,
Renata
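(Editor's note: one way to read the top output in this report is that a single process's %CPU divided by 100 gives a lower bound on its concurrently running threads. A quick arithmetic sketch using the busiest python3 process above:)

```shell
# Lower bound on concurrent threads for the busiest python3 process above:
# 2631 %CPU implies at least ceil(2631/100) = 27 threads on-CPU at once,
# well above the job's 16-CPU allocation when cores are unconstrained.
pct=2631
echo $(( (pct + 99) / 100 ))    # prints 27
```

This matches Marcin's later observation in the thread that the process was running at least ~26 threads despite requesting NumCPUs=16.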