Ticket 5243 - cpus are not assigned correctly in the cpuset cgroup on POWER8NVL
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.7
Hardware: Other Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-01 03:05 MDT by Cineca HPC Systems
Modified: 2018-10-31 07:30 MDT
3 users

See Also:
Site: Cineca


Attachments
/etc/slurm and job logs (53.00 KB, application/gzip)
2018-06-01 03:05 MDT, Cineca HPC Systems
Details
lstopo output (5.48 KB, text/plain)
2018-06-04 03:52 MDT, Cineca HPC Systems
Details
job with ConstrainDevices=no (53.95 KB, text/plain)
2018-06-06 08:01 MDT, Cineca HPC Systems
Details
extra debug (568 bytes, patch)
2018-06-21 07:59 MDT, Dominik Bartkiewicz
Details | Diff
slurmd logs with extra debugs (462.77 KB, text/x-log)
2018-06-27 02:38 MDT, Cineca HPC Systems
Details
open.c LD_PRELOAD library (1.08 KB, text/x-csrc)
2018-08-09 05:47 MDT, Dominik Bartkiewicz
Details
backtrace log produced by open.c (13.94 KB, text/x-log)
2018-08-10 03:24 MDT, Cineca HPC Systems
Details

Description Cineca HPC Systems 2018-06-01 03:05:19 MDT
Created attachment 6974 [details]
/etc/slurm and job logs

Hi Support,

we have installed Slurm 17.11.7 on a new cluster based on POWER8 with NVLink. Each node has two POWER8NVL processors and four Tesla P100 GPUs. SMT is enabled:

[afederic@davide44 ~]$ ppc64_cpu --info
Core   0:    0*    1*    2*    3*    4*    5*    6*    7* 
Core   1:    8*    9*   10*   11*   12*   13*   14*   15* 
Core   2:   16*   17*   18*   19*   20*   21*   22*   23* 
Core   3:   24*   25*   26*   27*   28*   29*   30*   31* 
Core   4:   32*   33*   34*   35*   36*   37*   38*   39* 
Core   5:   40*   41*   42*   43*   44*   45*   46*   47* 
Core   6:   48*   49*   50*   51*   52*   53*   54*   55* 
Core   7:   56*   57*   58*   59*   60*   61*   62*   63* 
Core   8:   64*   65*   66*   67*   68*   69*   70*   71* 
Core   9:   72*   73*   74*   75*   76*   77*   78*   79* 
Core  10:   80*   81*   82*   83*   84*   85*   86*   87* 
Core  11:   88*   89*   90*   91*   92*   93*   94*   95* 
Core  12:   96*   97*   98*   99*  100*  101*  102*  103* 
Core  13:  104*  105*  106*  107*  108*  109*  110*  111* 
Core  14:  112*  113*  114*  115*  116*  117*  118*  119* 
Core  15:  120*  121*  122*  123*  124*  125*  126*  127* 

The SMT thread IDs are the ones listed along each line.

Submitting a job that asks for 8 tasks (8 cores):

[afederic@davide44 ~]$ sbatch -n 8 -w davide44 sleep.sh
Submitted batch job 38

Slurm correctly reports the job properties

[afederic@davide44 ~]$ scontrol show job 38
JobId=38 JobName=sleep.sh
   UserId=afederic(28541) GroupId=interactive(25200) MCS_label=N/A
   Priority=4294901728 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-06-01T10:38:11 EligibleTime=2018-06-01T10:38:11
   StartTime=2018-06-01T10:38:11 EndTime=2018-06-01T11:08:11 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-06-01T10:38:11
   Partition=system AllocNode:Sid=davide44:67100
   ReqNodeList=davide44 ExcNodeList=(null)
   NodeList=davide44
   BatchHost=davide44
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=8000M,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/davide/home/userinternal/afederic/sleep.sh
   WorkDir=/davide/home/userinternal/afederic
   StdErr=/davide/home/userinternal/afederic/slurm-38.out
   StdIn=/dev/null
   StdOut=/davide/home/userinternal/afederic/slurm-38.out
   Power=

but it assigns all the "real" cores in the cpuset cgroup:

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch/cpuset.cpus 
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120

I also submitted a job using the -B flag

[afederic@davide44 ~]$ sbatch -B 1:8:1 -w davide44 sleep.sh 
Submitted batch job 41

but in this case even the number of CPUs assigned to the job is wrong:

[afederic@davide44 ~]$ scontrol show job 41
JobId=41 JobName=sleep.sh
   UserId=afederic(28541) GroupId=interactive(25200) MCS_label=N/A
   Priority=4294901725 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:05 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-06-01T10:43:09 EligibleTime=2018-06-01T10:43:09
   StartTime=2018-06-01T10:43:09 EndTime=2018-06-01T11:13:09 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-06-01T10:43:09
   Partition=system AllocNode:Sid=davide44:67100
   ReqNodeList=davide44 ExcNodeList=(null)
   NodeList=davide44
   BatchHost=davide44
   NumNodes=1 NumCPUs=64 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:8:1
   TRES=cpu=64,mem=62.50G,node=1,billing=64
   Socks/Node=1 NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/davide/home/userinternal/afederic/sleep.sh
   WorkDir=/davide/home/userinternal/afederic
   StdErr=/davide/home/userinternal/afederic/slurm-41.out
   StdIn=/dev/null
   StdOut=/davide/home/userinternal/afederic/slurm-41.out
   Power=

and the cpuset cgroup is always the same

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_41/step_batch/cpuset.cpus 
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120

Looking at the slurmd logs (debug5), it seems Slurm correctly determines the CPU topology with hwloc, but then something very strange happens when it assigns the cpuset.cpus value. This is the log of the first job (sbatch -n 8):

[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: job abstract cores are '0'
[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: step abstract cores are '0'
[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: job physical cores are '0-7'
[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: step physical cores are '0-7'
[2018-06-01T10:38:11.969] [38.batch] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_28541' already exists
[2018-06-01T10:38:11.969] [38.batch] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541'
[2018-06-01T10:38:11.969] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7,0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541'
[2018-06-01T10:38:11.969] [38.batch] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38' already exists
[2018-06-01T10:38:11.969] [38.batch] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'
[2018-06-01T10:38:11.972] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'
[2018-06-01T10:38:11.972] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'
[2018-06-01T10:38:11.972] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'

From the last line we were expecting to find 0-7 in the cpuset; instead we find all the "real" cores.
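For reference, on this SMT-8 topology the translation from Slurm's "abstract core" numbering to physical Linux CPU IDs is a simple stride-8 mapping. A hypothetical helper illustrating it (not Slurm's actual code):

```c
#define THREADS_PER_CORE 8   /* SMT-8 on POWER8NVL */

/* First and last Linux CPU id belonging to one physical core.
 * Abstract core 0 covers CPUs 0-7, which is why the log pairs
 * "job abstract cores are '0'" with "job physical cores are '0-7'". */
void core_to_cpu_range(int core, int *first, int *last)
{
    *first = core * THREADS_PER_CORE;
    *last  = *first + THREADS_PER_CORE - 1;
}
```

Note that the buggy cpuset (0,8,16,...,120) is instead the first thread of each of the 16 cores, i.e. a stride-8 selection across cores rather than the 8 threads of one core.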

Moreover, launching with srun using -B 1:8:X with X > 1, or -n Y with Y > 7, results in the following error:

[afederic@davide44 ~]$ srun: error: davide44: tasks 1-7: Exited with exit code 1
srun: Terminating job step 49.0
srun: error: davide44: task 0: Killed
srun: Force Terminated job step 49.0

The slurmd log reports an invalid argument in the call to sched_setaffinity:

[2018-06-01T10:55:37.195] [49.0] debug3: sched_getaffinity(89196) = 0x1010101010101010101010101010101
[2018-06-01T10:55:37.195] [49.0] debug3: get_cpuset (mask_cpu[256]) 0x00000000000000000000000000000001,0x00000000000000000000000000000002,0x00000000000000000000000000000004,0x00000000000000000000000000000008,0x00000000000000000000000000000010,0x00000000000000000000000000000020,0x00000000000000000000000000000040,0x00000000000000000000000000000080
[2018-06-01T10:55:37.196] [49.0] sched_setaffinity(89196,128,0x2) failed: Invalid argument
[2018-06-01T10:55:37.196] [49.0] debug3: sched_getaffinity(89196) = 0x1010101010101010101010101010101
[2018-06-01T10:55:37.196] [49.0] debug:  task_g_pre_launch: task/affinity: Unspecified error
[2018-06-01T10:55:37.196] [49.0] error: Failed to invoke task plugins: task_p_pre_launch error
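As context for the EINVAL (an interpretation, not a confirmed diagnosis): sched_setaffinity(2) fails with "Invalid argument" when the requested mask contains no CPU that is both online and permitted by the task's cpuset. Here the mask 0x2 selects only CPU 1, which is absent from the step's cpuset 0,8,16,...,120 shown above. Re-applying a mask the task is already permitted to run on, by contrast, always succeeds:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Reads the calling task's permitted CPU mask and applies it again.
 * This cannot hit the EINVAL case, because every bit in the mask is
 * by definition inside the task's cpuset.  Returns 0 on success. */
int reset_own_affinity(void)
{
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```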

I'm attaching the following files:

- slurm.tgz, a tgz of /etc/slurm
- job-XY.log, slurmd logs for job XY

thanks
Ale
Comment 1 Tim Wickberg 2018-06-01 11:41:26 MDT
Can you attach output from 'lstopo' as well?

If you disable the task/affinity plugin, is the current system usable (albeit without optimal affinity) for now?
Comment 5 Cineca HPC Systems 2018-06-04 03:52:02 MDT
Created attachment 6988 [details]
lstopo output
Comment 6 Cineca HPC Systems 2018-06-04 03:59:23 MDT
Hi
in the production cluster I'm using TaskPlugin=task/cgroup with ConstrainCores=no in cgroup.conf.
I will also remove task/affinity from the test cluster, where ConstrainCores=yes, and let you know.

thanks
ale
Comment 7 Cineca HPC Systems 2018-06-04 06:16:52 MDT
Hi Tim,
ok, the task/affinity plugin is now disabled on the two-node cluster as well, so the sched_setaffinity invalid-argument error no longer occurs.

What about the wrong core ids written in the cpuset.cpus?
I see from the slurmd logs that the HW topology (hwloc_topology_load) is discovered correctly in terms of S:C:T, but then, for an 8-core job, task/cgroup tries to assign the first 8 threads (1 core):

[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: job abstract cores are '0'
[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: step abstract cores are '0'
[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: job physical cores are '0-7'
[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: step physical cores are '0-7'

Then xcgroup_set_param sets these values in cpuset.cpus

[2018-06-04T13:39:57.919] [52.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_52/step_batch'
[2018-06-04T13:39:57.921] [52.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_52/step_batch'

so I would expect to find 0-7 in the cpuset.cpus parameter. 
Instead we find all the physical cores 

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_52/step_batch/cpuset.cpus
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120

Could it be some issue with cgroups in the running kernel (3.10.0-514.el7.ppc64le)?

Thanks
ale
Comment 8 Dominik Bartkiewicz 2018-06-06 06:35:43 MDT
Hi

I am still working on this.
Could you set ConstrainDevices to no and check whether this changes the behaviour?

Dominik
Comment 9 Cineca HPC Systems 2018-06-06 08:01:24 MDT
We get the same behaviour with ConstrainDevices=no.
I'm attaching the job logs.
Comment 10 Cineca HPC Systems 2018-06-06 08:01:48 MDT
Created attachment 7006 [details]
job with ConstrainDevices=no
Comment 11 Dominik Bartkiewicz 2018-06-21 07:58:18 MDT
Hi

Could you send me the output of these commands?

cat /sys/fs/cgroup/cpuset/slurm/{cpuset.effective_cpus,cpuset.cpus}

ls /sys/fs/cgroup/cpuset/slurm/uid_28541/

cat /sys/fs/cgroup/cpuset/slurm/uid_28541/{cpuset.effective_cpus,cpuset.cpus,tasks}

I will attach a patch with some extra debug messages.
It would be great if you could apply it and send me back the slurmd log.

Dominik
Comment 12 Dominik Bartkiewicz 2018-06-21 07:59:12 MDT
Created attachment 7133 [details]
extra debug
Comment 13 Cineca HPC Systems 2018-06-21 08:19:35 MDT
Hi Dominik,
these are the outputs 

[afederic@davide44 ~]$ srun -n 8 -w davide44 --pty bash

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/{cpuset.effective_cpus,cpuset.cpus}
0-127
0-127

[afederic@davide44 ~]$ ls /sys/fs/cgroup/cpuset/slurm/uid_28541/
cgroup.clone_children  cpuset.cpus            cpuset.mem_hardwall        cpuset.memory_spread_slab        job_4
cgroup.event_control   cpuset.effective_cpus  cpuset.memory_migrate      cpuset.mems                      notify_on_release
cgroup.procs           cpuset.effective_mems  cpuset.memory_pressure     cpuset.sched_load_balance        tasks
cpuset.cpu_exclusive   cpuset.mem_exclusive   cpuset.memory_spread_page  cpuset.sched_relax_domain_level

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/{cpuset.effective_cpus,cpuset.cpus,tasks}
0-7
0-7

thanks
ale
Comment 14 Dominik Bartkiewicz 2018-06-26 09:36:21 MDT
Hi

Have you had a chance to grab the extra debug output from slurmd?
 
Dominik
Comment 15 Cineca HPC Systems 2018-06-27 02:04:33 MDT
Dominik
I'm sorry. I'll do it today

thanks
Comment 16 Cineca HPC Systems 2018-06-27 02:38:05 MDT
Created attachment 7191 [details]
slurmd logs with extra debugs
Comment 17 Dominik Bartkiewicz 2018-07-03 07:14:18 MDT
Hi

It looks like something outside of Slurm is modifying the cgroups.
Do you use cgred or any other cgroup manager?

Dominik
Comment 18 Cineca HPC Systems 2018-07-06 07:22:14 MDT
No, we do not use any cgroup manager.
It seems to me that there is some issue with the IDs of real cores versus SMT cores.
For example, this job

[afederic@davide44 ~]$ srun -B 1:8:1 -w davide44 --pty bash

assigns these cores

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_40/{cpuset.effective_cpus,cpuset.cpus,tasks}
0-31,64-95
0-31,64-95

So it assigns 4 real cores plus their SMT threads on each socket, while it should assign all the real cores of one socket and *no* SMT threads:

[afederic@davide44 ~]$ ppc64_cpu --info
Core   0:    0*    1*    2*    3*    4*    5*    6*    7* 
Core   1:    8*    9*   10*   11*   12*   13*   14*   15* 
Core   2:   16*   17*   18*   19*   20*   21*   22*   23* 
Core   3:   24*   25*   26*   27*   28*   29*   30*   31* 
Core   4:   32*   33*   34*   35*   36*   37*   38*   39* 
Core   5:   40*   41*   42*   43*   44*   45*   46*   47* 
Core   6:   48*   49*   50*   51*   52*   53*   54*   55* 
Core   7:   56*   57*   58*   59*   60*   61*   62*   63* 
Core   8:   64*   65*   66*   67*   68*   69*   70*   71* 
Core   9:   72*   73*   74*   75*   76*   77*   78*   79* 
Core  10:   80*   81*   82*   83*   84*   85*   86*   87* 
Core  11:   88*   89*   90*   91*   92*   93*   94*   95* 
Core  12:   96*   97*   98*   99*  100*  101*  102*  103* 
Core  13:  104*  105*  106*  107*  108*  109*  110*  111* 
Core  14:  112*  113*  114*  115*  116*  117*  118*  119* 
Core  15:  120*  121*  122*  123*  124*  125*  126*  127* 

[afederic@davide44 ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 131072 MB
node 0 free: 124514 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 131072 MB
node 1 free: 128174 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

thanks

ale
Comment 19 Dominik Bartkiewicz 2018-07-06 09:25:57 MDT
Hi

OK, this means the problem is in the affinity plugin.
Could you apply this patch?


diff --git a/src/plugins/task/affinity/cpuset.c b/src/plugins/task/affinity/cpuset.c
index 61eba92..5521030 100644
--- a/src/plugins/task/affinity/cpuset.c
+++ b/src/plugins/task/affinity/cpuset.c
@@ -224,6 +224,7 @@ int slurm_set_cpuset(char *base, char *path, pid_t pid, size_t size,
                return SLURM_ERROR;
        }
        rc = write(fd, mstr, strlen(mstr)+1);
+       error("BUG 5243: rc=%d, file_path=%s, mstr=%s", rc, file_path, mstr);
        close(fd);
        if (rc < 1) {
                error("write(%s): %m", file_path);


Dominik
Comment 20 Cineca HPC Systems 2018-07-09 10:01:50 MDT
Dominik
sorry for the delay. I applied the patch, but I cannot find your log line

error("BUG 5243: rc=%d, file_path=%s, mstr=%s", rc, file_path, mstr);

in the slurmd.log
Are you sure that code path is being executed?

debug levels are

[root@davide44 ~]# scontrol show conf | grep -i debug
DebugFlags              = Backfill,BackfillMap,NodeFeatures,Priority,Protocol,TraceJobs
SlurmctldDebug          = debug5
SlurmctldSyslogDebug    = verbose
SlurmdDebug             = debug5
SlurmdSyslogDebug       = verbose

thanks
ale
Comment 21 Dominik Bartkiewicz 2018-07-09 12:56:04 MDT
Hi

Thanks for this info.

slurm_set_cpuset() can perform non-trivial manipulation of the CPU mask, which is why I suspected it.
Could you try disabling task/affinity and check whether this changes the assigned cores?

Dominik
Comment 22 Cineca HPC Systems 2018-07-10 05:39:50 MDT
Hi Dominik,
task/affinity plugin is already disabled

[root@davide44 ~]# scontrol show conf | grep -i TaskPlugin
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)

Tim asked me to disable it because it triggered a sched_setaffinity
invalid-argument error.
Please look at my first comment in this ticket.

thanks
ale
Comment 23 Dominik Bartkiewicz 2018-07-10 07:01:17 MDT
Hi


Sorry, that means it is not the task/affinity plugin.
Which version of hwloc do you use?
To be sure, could you check whether "srun --cpu-bind=verbose" returns any extra info?

Dominik
Comment 24 Cineca HPC Systems 2018-07-10 10:42:52 MDT
Hi,
I do not see any extra info from the --cpu-bind=verbose switch

[afederic@davide44 ~]$ srun --cpu-bind=verbose -B 1:8:1 -w davide44 --pty bash
[afederic@davide44 ~]$ 

[root@davide44 ~]# rpm -q hwloc
hwloc-1.11.8-4.el7.ppc64le

thanks
ale
Comment 26 Dominik Bartkiewicz 2018-07-25 05:17:26 MDT
Hi

Could you check if adding "Delegate=yes" to slurmd.service changes anything?

https://github.com/SchedMD/slurm/commit/cecb39ff087731d29252bbc36b00abf814a3c5ac

Dominik
Comment 27 Dominik Bartkiewicz 2018-07-31 04:51:05 MDT
Hi

Have you had a chance to test this?

Dominik
Comment 28 Cineca HPC Systems 2018-08-06 06:27:15 MDT
Hi Dominik

sorry for the delay; I was out of the office on holiday until today.

I tried it; the behavior is the same:

[afederic@davide44 ~]$ srun -B 1:8:1 -w davide44 --pty bash

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_46/{cpuset.effective_cpus,cpuset.cpus,tasks}
0-31,64-95
0-31,64-95

thanks
ale
Comment 30 Dominik Bartkiewicz 2018-08-09 05:47:03 MDT
Created attachment 7550 [details]
open.c LD_PRELOAD library

Hi

Could you try compiling this library and starting slurmd with LD_PRELOAD=/<...>/open.so?
The library should catch every attempt to open "cpuset.cpus" and log it to /tmp/open.log. If the modification of the cpus comes from Slurm, we will see it.


Dominik
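The attached open.c is not reproduced in this ticket; the following is a minimal sketch of that kind of LD_PRELOAD interposer. The details (log format, forwarding via a raw syscall rather than dlsym(RTLD_NEXT, "open")) are illustrative and may differ from the actual attachment:

```c
#define _GNU_SOURCE
#include <execinfo.h>    /* backtrace(), glibc-specific */
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LOGFILE "/tmp/open.log"

/* Interposed open(): logs every open of a path containing
 * "cpuset.cpus" (plus a backtrace), then forwards the call. */
int open(const char *path, int flags, ...)
{
    mode_t mode = 0;

    if (flags & O_CREAT) {            /* mode is only present with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t) va_arg(ap, int);
        va_end(ap);
    }

    if (strstr(path, "cpuset.cpus")) {
        /* Open the log file via a raw syscall so we do not recurse
         * into this wrapper. */
        int lfd = (int) syscall(SYS_openat, AT_FDCWD, LOGFILE,
                                O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (lfd >= 0) {
            void *frames[32];
            int depth = backtrace(frames, 32);

            write(lfd, path, strlen(path));
            write(lfd, "\n", 1);
            backtrace_symbols_fd(frames, depth, lfd);
            close(lfd);
        }
    }

    /* Forward to the kernel directly; a production interposer would
     * usually use dlsym(RTLD_NEXT, "open") here instead. */
    return (int) syscall(SYS_openat, AT_FDCWD, path, flags, mode);
}
```

Built with `gcc -shared -fPIC open.c -o open.so` and loaded via `LD_PRELOAD`, this captures a backtrace for every writer of cpuset.cpus, which is exactly what attachment 7560 below contains.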
Comment 31 Cineca HPC Systems 2018-08-10 03:24:33 MDT
Created attachment 7560 [details]
backtrace log produced by open.c

Hi Dominik
I attached the log produced by this job

[afederic@davide44 ~]$ srun -B 1:8:1 -w davide44 --pty bash

thanks
ale
Comment 32 Dominik Bartkiewicz 2018-08-10 06:31:44 MDT
Hi

According to this log, the modification of cpuset.cpus does not come from Slurm.

All modifications of cpuset.cpus go through xcgroup_set_param(), which logs each modification at debug3 and sets the cpus values properly. Because we have already disabled affinity in both task/affinity and task/cgroup (TaskAffinity=no in cgroup.conf), this also cannot come from any Slurm sched_setaffinity() call.

Could you try to manually create a cgroup in a way similar to this log, and check whether it works correctly:
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'
[2018-06-27T10:30:07.937] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'
[2018-06-27T10:30:07.937] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'
[2018-06-27T10:30:07.937] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'

I also have found this:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/SMT%20and%20cgroup%20cpusets

Dominik
Comment 33 Cineca HPC Systems 2018-08-10 09:06:16 MDT
Hi,
it seems to work

[root@davide44 ~]# mkdir /sys/fs/cgroup/cpuset/slurm/TEST
[root@davide44 ~]# echo 0-7 > /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.cpus
[root@davide44 ~]# echo 0-1 > /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.mems 
[root@davide44 ~]# echo 0 > /sys/fs/cgroup/cpuset/slurm/TEST/notify_on_release 

[root@davide44 ~]# cat /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.cpus /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.mems /sys/fs/cgroup/cpuset/slurm/TEST/notify_on_release
0-7
0-1
0

thanks
ale
Comment 34 Cineca HPC Systems 2018-08-10 09:14:35 MDT
Dominik, I forgot to tell you that we are on CentOS 7.5

[root@davide44 ~]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (AltArch) 

thanks
ale
Comment 35 Dominik Bartkiewicz 2018-08-10 13:35:40 MDT
Hi

Thanks for the test and the info.
I already knew you are running CentOS 7 from the hwloc version :)

I know that page describes the situation on Ubuntu, but if you are using software that changes the state of some threads, the results can be odd.

Dominik
Comment 36 Dominik Bartkiewicz 2018-08-22 07:09:47 MDT
Hi

We have tried to get access to a POWER8 machine, unfortunately without success.

Could you grant me a user-level remote account so that I can run tests remotely? This is against our normal no-remote-access rule, but I have no other ideas at the moment.

Dominik
Comment 37 Cineca HPC Systems 2018-08-22 08:17:08 MDT
Hi Dominik,

of course you are welcome to DAVIDE.
One of our User Support people will contact you asap.

thanks
ale
Comment 38 hpc-cs-hd 2018-08-22 08:31:28 MDT
Dear Dominik,

first of all thanks for your help. In order to obtain an HPC account on Davide, we would kindly ask you to register on our UserDB Portal at:

https://userdb.hpc.cineca.it/

Just click on "Create new user" and enter the requested information. Once the user is created, follow the "HPC Access" link (you will find it among the Available Services in the vertical menu on the left) and complete your registration by providing the required data in the "Institution" and "Documents for HPC" sections.

After that, please write to us so that we can associate your user with a project and send you a personal username and password to log in to

login.davide.cineca.it

Let us know in case of problems or doubts,

cheers

Isabella
Comment 39 Cineca HPC Systems 2018-08-24 04:29:41 MDT
Hi Dominik,

to help with your work, I have set up a test environment on

davide44: slurmctld & slurmd
davide45: slurmd

I have also allowed your account to run some sudo commands on both of these nodes:


User dbartkie may run the following commands on davide44:
    (root) NOPASSWD: /bin/cp slurm.conf /etc/slurm/, /bin/cp cgroup.conf /etc/slurm/, /bin/cp gres.conf /etc/slurm/, /bin/cp
        cgroup_allowed_devices_file.conf /etc/slurm/
    (root) NOPASSWD: /bin/systemctl restart slurmctld, /bin/systemctl restart slurmd


let me know if you need any other permissions

thanks
ale
Comment 40 Dominik Bartkiewicz 2018-10-09 08:26:22 MDT
Hi

Sorry for the late response.

First of all, I cannot recreate the invalid cpuset.cpus value. With or without the task/affinity plugin, no combination of cgroup.conf options reproduces this problem.

A note on the -B option:
       --cores-per-socket=<cores>
              Restrict node selection to nodes with at least the specified number of cores per socket.  See additional information under -B option above when task/affinity plugin is enabled. This option applies to job allocations.

Setting this does not guarantee that slurmctld will select 8 cores on one socket; it only restricts node selection to nodes with at least 8 free cores per socket.

Dominik
Comment 41 Cineca HPC Systems 2018-10-17 06:19:38 MDT
Hi Dominik

I see that you enabled the task/affinity plugin, but I cannot explain why it is now working. If you look at my first comment in this ticket, you can see that with task/affinity enabled, sched_setaffinity was always failing with "Invalid argument".

Thanks for the explanation of the -B switch, but I do not understand the CPUs/threads allocated when using the -n switch. Submitting a job with -n 4 results in 1 core, with all of its SMT threads, being allocated:

[afederic@davide44 ~]$ srun -n 4 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus 
0-7

Is this the expected behavior? I would expect 4 cores to be allocated.

I did some testing with other switches, and this is the only way I found to get 4 real cores allocated:

[afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus 
0-15,64-79

Also --ntasks-per-socket seems to work

[afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 --ntasks-per-socket=4 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus 
64-95

So I don't understand what is going on when I use only the -n switch.

thank you very much
ale
Comment 42 Dominik Bartkiewicz 2018-10-17 08:14:02 MDT
(In reply to Cineca HPC Systems from comment #41)
> Hi Dominik
> 
> I see that you enabled the task/affinity plugin but I cannot explain why
> it's now working. If you look at the my first comment in this bug report you
> can see that with task/affinity enabled sched_setaffinity was always
> crashing with "Invalid argument".

I know; I spent some time trying to reproduce this and I still have no idea what was wrong before.
Did you change the SMT mode after starting the slurmd daemon?

> 
> Thanks for the explanation of the -B switch, but I do not understand the
> cpus/threads allocated when using -n switch. Submitting a job with -n 4
> results in 1 core and all its SMTs allocated
> 
> [afederic@davide44 ~]$ srun -n 4 -w davide44 --pty bash
> [afederic@davide44 ~]$ cat
> /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus 
> 0-7
> 
> Is this the expected behavior? I would expect 4 cores to be allocated.
> 
> I did some testing with some other switches and I only found this way to get
> 4 real cores allocated
> 
> [afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 -w davide44 --pty bash
> [afederic@davide44 ~]$ cat
> /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus 
> 0-15,64-79
> 
Yes,

CR_Core
       Cores are consumable resources.  On nodes with
       hyper-threads, each thread is counted as a CPU to
       satisfy a job's resource requirement, but multiple
       jobs are not allocated threads on the same core.
       The count of CPUs allocated to a job may be
       rounded up to account for every CPU on an
       allocated core.
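On an SMT-8 node, the rounding described above means allocations come in multiples of 8 CPUs, which is why -n 4 yields the full 8-thread cpuset 0-7. A hypothetical illustration of the rule (not Slurm code):

```c
/* CPUs allocated under CR_Core: each thread counts as a CPU, but
 * the total is rounded up to whole cores. */
int cpus_allocated(int ntasks, int threads_per_core)
{
    int cores = (ntasks + threads_per_core - 1) / threads_per_core;
    return cores * threads_per_core;
}
```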


Check CR_ONE_TASK_PER_CORE in the slurm.conf man page.
I think you will want to use it.


Dominik

> Also --ntasks-per-socket seems to work
> 
> [afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 --ntasks-per-socket=4
> -w davide44 --pty bash
> [afederic@davide44 ~]$ cat
> /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus 
> 64-95
> 
> So I didn't understand what's going on when I use only the -n switch
> 
> thank you very much
> ale
Comment 43 Dominik Bartkiewicz 2018-10-24 09:39:22 MDT
Hi

Let me know if CR_ONE_TASK_PER_CORE works as you expected.
Do you have any additional questions, or was my previous answer enough?

Dominik
Comment 44 Cineca HPC Systems 2018-10-24 10:03:11 MDT
Hi Dominik
I tested CR_ONE_TASK_PER_CORE and it's working as expected.
I'm still investigating what was wrong with task/affinity when I opened the bug report.
Thank you very much 

ale
Comment 45 Dominik Bartkiewicz 2018-10-24 10:24:30 MDT
Hi

OK, can we drop severity to 3 now?

Dominik
Comment 46 Cineca HPC Systems 2018-10-24 10:33:59 MDT
Yes of course!
ale
Comment 47 Cineca HPC Systems 2018-10-31 07:17:03 MDT
ciao Dominik

just to inform you that yesterday we upgraded to Slurm 18.08.3.
Everything seems to be working fine, so if you like you can close this ticket.

Thanks for the help
Ale
Comment 48 Cineca HPC Systems 2018-10-31 07:20:18 MDT
ciao again Dominik ;-)

just to let you know: if you still need access to the POWER8 architecture,
I can leave the two-node cluster davide4[4,5] online for you.
Otherwise I will put the two nodes back into the production cluster.

Let me know

thanks
ale
Comment 49 Dominik Bartkiewicz 2018-10-31 07:30:06 MDT
Hi

I am glad to hear that.

You can put them back into the production cluster.
Once again, thanks for giving me access to the machine.

Closing as resolved/infogiven, please reopen if needed.

Dominik