Created attachment 6974 [details]
/etc/slurm and job logs

Hi Support,
we have installed Slurm 17.11.7 on a new cluster based on POWER8 with NVLink. All nodes have 2 POWER8NVL processors and 4 Tesla P100 GPUs. SMT is enabled:

[afederic@davide44 ~]$ ppc64_cpu --info
Core 0:  0*  1*  2*  3*  4*  5*  6*  7*
Core 1:  8*  9*  10* 11* 12* 13* 14* 15*
Core 2:  16* 17* 18* 19* 20* 21* 22* 23*
Core 3:  24* 25* 26* 27* 28* 29* 30* 31*
Core 4:  32* 33* 34* 35* 36* 37* 38* 39*
Core 5:  40* 41* 42* 43* 44* 45* 46* 47*
Core 6:  48* 49* 50* 51* 52* 53* 54* 55*
Core 7:  56* 57* 58* 59* 60* 61* 62* 63*
Core 8:  64* 65* 66* 67* 68* 69* 70* 71*
Core 9:  72* 73* 74* 75* 76* 77* 78* 79*
Core 10: 80* 81* 82* 83* 84* 85* 86* 87*
Core 11: 88* 89* 90* 91* 92* 93* 94* 95*
Core 12: 96* 97* 98* 99* 100* 101* 102* 103*
Core 13: 104* 105* 106* 107* 108* 109* 110* 111*
Core 14: 112* 113* 114* 115* 116* 117* 118* 119*
Core 15: 120* 121* 122* 123* 124* 125* 126* 127*

The SMT thread IDs are the ones listed on each line.

Submitting a job that asks for 8 tasks (8 cores):

[afederic@davide44 ~]$ sbatch -n 8 -w davide44 sleep.sh
Submitted batch job 38

Slurm correctly reports the job properties:

[afederic@davide44 ~]$ scontrol show job 38
JobId=38 JobName=sleep.sh
   UserId=afederic(28541) GroupId=interactive(25200) MCS_label=N/A
   Priority=4294901728 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-06-01T10:38:11 EligibleTime=2018-06-01T10:38:11
   StartTime=2018-06-01T10:38:11 EndTime=2018-06-01T11:08:11 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-06-01T10:38:11
   Partition=system AllocNode:Sid=davide44:67100
   ReqNodeList=davide44 ExcNodeList=(null)
   NodeList=davide44
   BatchHost=davide44
   NumNodes=1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=8000M,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/davide/home/userinternal/afederic/sleep.sh
   WorkDir=/davide/home/userinternal/afederic
   StdErr=/davide/home/userinternal/afederic/slurm-38.out
   StdIn=/dev/null
   StdOut=/davide/home/userinternal/afederic/slurm-38.out
   Power=

but it assigns all the "real" cores in the cpuset cgroup:

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch/cpuset.cpus
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120

I also submitted a job using the -B flag:

[afederic@davide44 ~]$ sbatch -B 1:8:1 -w davide44 sleep.sh
Submitted batch job 41

but in this case the number of CPUs assigned to the job is also wrong:

[afederic@davide44 ~]$ scontrol show job 41
JobId=41 JobName=sleep.sh
   UserId=afederic(28541) GroupId=interactive(25200) MCS_label=N/A
   Priority=4294901725 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:05 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-06-01T10:43:09 EligibleTime=2018-06-01T10:43:09
   StartTime=2018-06-01T10:43:09 EndTime=2018-06-01T11:13:09 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-06-01T10:43:09
   Partition=system AllocNode:Sid=davide44:67100
   ReqNodeList=davide44 ExcNodeList=(null)
   NodeList=davide44
   BatchHost=davide44
   NumNodes=1 NumCPUs=64 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:8:1
   TRES=cpu=64,mem=62.50G,node=1,billing=64
   Socks/Node=1 NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/davide/home/userinternal/afederic/sleep.sh
   WorkDir=/davide/home/userinternal/afederic
   StdErr=/davide/home/userinternal/afederic/slurm-41.out
   StdIn=/dev/null
   StdOut=/davide/home/userinternal/afederic/slurm-41.out
   Power=

and the cpuset cgroup is the same as before:

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_41/step_batch/cpuset.cpus
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120

Looking at the slurmd logs (debug5), it seems Slurm correctly determines the CPU topology with hwloc, but then, when it assigns the cpuset.cpus value, something very strange happens. This is the log of the first job (sbatch -n 8):

[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: job abstract cores are '0'
[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: step abstract cores are '0'
[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: job physical cores are '0-7'
[2018-06-01T10:38:11.969] [38.batch] debug:  task/cgroup: step physical cores are '0-7'
[2018-06-01T10:38:11.969] [38.batch] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_28541' already exists
[2018-06-01T10:38:11.969] [38.batch] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541'
[2018-06-01T10:38:11.969] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7,0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541'
[2018-06-01T10:38:11.969] [38.batch] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38' already exists
[2018-06-01T10:38:11.969] [38.batch] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38'
[2018-06-01T10:38:11.970] [38.batch] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'
[2018-06-01T10:38:11.972] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'
[2018-06-01T10:38:11.972] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'
[2018-06-01T10:38:11.972] [38.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_38/step_batch'

From the last line we were expecting to find 0-7 in the cpuset; instead we find all the cores.

Moreover, launching a job with srun with -B 1:8:X (X > 1) or -n Y (Y > 7) results in the following error:

[afederic@davide44 ~]$ srun: error: davide44: tasks 1-7: Exited with exit code 1
srun: Terminating job step 49.0
srun: error: davide44: task 0: Killed
srun: Force Terminated job step 49.0

The slurmd log reports an invalid argument in the call to sched_setaffinity:

[2018-06-01T10:55:37.195] [49.0] debug3: sched_getaffinity(89196) = 0x1010101010101010101010101010101
[2018-06-01T10:55:37.195] [49.0] debug3: get_cpuset (mask_cpu[256]) 0x00000000000000000000000000000001,0x00000000000000000000000000000002,0x00000000000000000000000000000004,0x00000000000000000000000000000008,0x00000000000000000000000000000010,0x00000000000000000000000000000020,0x00000000000000000000000000000040,0x00000000000000000000000000000080
[2018-06-01T10:55:37.196] [49.0] sched_setaffinity(89196,128,0x2) failed: Invalid argument
[2018-06-01T10:55:37.196] [49.0] debug3: sched_getaffinity(89196) = 0x1010101010101010101010101010101
[2018-06-01T10:55:37.196] [49.0] debug:  task_g_pre_launch: task/affinity: Unspecified error
[2018-06-01T10:55:37.196] [49.0] error: Failed to invoke task plugins: task_p_pre_launch error

I'm attaching the following files:
- slurm.tgz, a tgz of /etc/slurm
- job-XY.log, slurmd logs for job XY

thanks
Ale
Can you attach output from 'lstopo' as well? If you disable the task/affinity plugin, is the current system usable (albeit without optimal affinity) for now?
Created attachment 6988 [details] lstopo output
Hi,
on the production cluster I'm using TaskPlugin=task/cgroup with ConstrainCores=no in cgroup.conf. I will also remove task/affinity from the test cluster, where ConstrainCores=yes, and let you know.
thanks
ale
Hi Tim,
OK, the task/affinity plugin is now disabled on the 2-node cluster as well, so the sched_setaffinity invalid-argument error no longer occurs.

What about the wrong core IDs written in cpuset.cpus? I see from the slurmd logs that the HW topology (hwloc_topology_load) is discovered correctly in terms of S:C:T, but then, for an 8-core job, task/cgroup tries to assign the first 8 threads (1 core):

[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: job abstract cores are '0'
[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: step abstract cores are '0'
[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: job physical cores are '0-7'
[2018-06-04T13:39:57.918] [52.batch] debug:  task/cgroup: step physical cores are '0-7'

Then xcgroup_set_param sets these values in cpuset.cpus:

[2018-06-04T13:39:57.919] [52.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_52/step_batch'
[2018-06-04T13:39:57.921] [52.batch] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_52/step_batch'

so I would expect to find 0-7 in the cpuset.cpus parameter. Instead we find all the physical cores:

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_52/step_batch/cpuset.cpus
0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120

Could it be some issue with cgroups in the running kernel (3.10.0-514.el7.ppc64le)?
Thanks
ale
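For reference, on this SMT-8 layout the relation between Slurm's abstract core IDs and the kernel's physical CPU (thread) IDs is a simple stride of 8, matching the ppc64_cpu --info table. A minimal sketch of that mapping (illustrative only, not Slurm's actual code; the function name is made up):

```python
def abstract_core_to_physical(core_id, threads_per_core=8):
    """Map a Slurm abstract core id to the physical CPU (thread) ids
    it covers on an SMT-8 POWER8 node, where core c owns threads
    8c .. 8c+7 (as shown by ppc64_cpu --info)."""
    start = core_id * threads_per_core
    return list(range(start, start + threads_per_core))

# Abstract core 0 covers physical CPUs 0-7, which is what the
# "job physical cores are '0-7'" line in the slurmd log means.
print(abstract_core_to_physical(0))   # [0, 1, 2, 3, 4, 5, 6, 7]
# Core 8 is the first core of the second socket: threads 64-71.
print(abstract_core_to_physical(8))   # [64, 65, 66, 67, 68, 69, 70, 71]
```

This is why '0-7' is one whole core here, while '0,8,16,...,120' is one thread on each of the 16 cores.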
Hi,
I'm still working on this. Could you set ConstrainDevices to no and check whether this changes the behaviour?
Dominik
We get the same behaviour with ConstrainDevices=no. I'm attaching the job logs.
Created attachment 7006 [details] job with ConstrainDevices=no
Hi,
could you send me the output of these commands?

cat /sys/fs/cgroup/cpuset/slurm/{cpuset.effective_cpus,cpuset.cpus}
ls /sys/fs/cgroup/cpuset/slurm/uid_28541/
cat /sys/fs/cgroup/cpuset/slurm/uid_28541/{cpuset.effective_cpus,cpuset.cpus,tasks}

I will attach a patch with some extra debug messages. It would be great if you could apply it and send me back the slurmd log.
Dominik
Created attachment 7133 [details] extra debug
Hi Dominik,
these are the outputs:

[afederic@davide44 ~]$ srun -n 8 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/{cpuset.effective_cpus,cpuset.cpus}
0-127
0-127
[afederic@davide44 ~]$ ls /sys/fs/cgroup/cpuset/slurm/uid_28541/
cgroup.clone_children  cpuset.cpus            cpuset.mem_hardwall        cpuset.memory_spread_slab        job_4
cgroup.event_control   cpuset.effective_cpus  cpuset.memory_migrate      cpuset.mems                      notify_on_release
cgroup.procs           cpuset.effective_mems  cpuset.memory_pressure     cpuset.sched_load_balance        tasks
cpuset.cpu_exclusive   cpuset.mem_exclusive   cpuset.memory_spread_page  cpuset.sched_relax_domain_level
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/{cpuset.effective_cpus,cpuset.cpus,tasks}
0-7
0-7

thanks
ale
Hi,
have you captured this extra debug output from slurmd?
Dominik
Dominik I'm sorry. I'll do it today thanks
Created attachment 7191 [details] slurmd logs with extra debugs
Hi,
it looks like something outside of Slurm is modifying the cgroups. Do you use cgred or any other cgroup manager?
Dominik
No, we do not use any cgroup manager. It seems to me that there is some issue with the IDs of real cores versus SMT cores. For example, this job

[afederic@davide44 ~]$ srun -B 1:8:1 -w davide44 --pty bash

assigns these cores:

[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_40/{cpuset.effective_cpus,cpuset.cpus,tasks}
0-31,64-95
0-31,64-95

So it assigns 4 real cores plus their SMT threads on each socket, while it should assign all the real cores of one socket and *no* SMT threads.

[afederic@davide44 ~]$ ppc64_cpu --info
Core 0:  0*  1*  2*  3*  4*  5*  6*  7*
Core 1:  8*  9*  10* 11* 12* 13* 14* 15*
Core 2:  16* 17* 18* 19* 20* 21* 22* 23*
Core 3:  24* 25* 26* 27* 28* 29* 30* 31*
Core 4:  32* 33* 34* 35* 36* 37* 38* 39*
Core 5:  40* 41* 42* 43* 44* 45* 46* 47*
Core 6:  48* 49* 50* 51* 52* 53* 54* 55*
Core 7:  56* 57* 58* 59* 60* 61* 62* 63*
Core 8:  64* 65* 66* 67* 68* 69* 70* 71*
Core 9:  72* 73* 74* 75* 76* 77* 78* 79*
Core 10: 80* 81* 82* 83* 84* 85* 86* 87*
Core 11: 88* 89* 90* 91* 92* 93* 94* 95*
Core 12: 96* 97* 98* 99* 100* 101* 102* 103*
Core 13: 104* 105* 106* 107* 108* 109* 110* 111*
Core 14: 112* 113* 114* 115* 116* 117* 118* 119*
Core 15: 120* 121* 122* 123* 124* 125* 126* 127*

[afederic@davide44 ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 131072 MB
node 0 free: 124514 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 131072 MB
node 1 free: 128174 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

thanks
ale
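The cpuset list syntax above ('0-31,64-95') can be expanded with a small parser to double-check which cores it really covers. A sketch under the SMT-8 layout shown by ppc64_cpu --info (parse_cpuset is a made-up helper name, not part of any library):

```python
def parse_cpuset(spec):
    """Expand a kernel cpuset list string such as '0-31,64-95'
    (the format of cpuset.cpus) into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

cpus = parse_cpuset('0-31,64-95')
# With 8 threads per core, these 64 CPUs are cores 0-3 and 8-11
# with every SMT thread included -- not 8 whole cores on one socket.
cores = sorted({c // 8 for c in cpus})
print(len(cpus), cores)  # 64 [0, 1, 2, 3, 8, 9, 10, 11]
```

This confirms the observation above: the allocation spans 4 cores on each socket (threads included) instead of one socket's 8 real cores.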
Hi,
OK, this means the problem is in the affinity plugin. Could you apply this patch?

diff --git a/src/plugins/task/affinity/cpuset.c b/src/plugins/task/affinity/cpuset.c
index 61eba92..5521030 100644
--- a/src/plugins/task/affinity/cpuset.c
+++ b/src/plugins/task/affinity/cpuset.c
@@ -224,6 +224,7 @@ int slurm_set_cpuset(char *base, char *path, pid_t pid, size_t size,
 		return SLURM_ERROR;
 	}
 	rc = write(fd, mstr, strlen(mstr)+1);
+	error("BUG 5243: rc=%d, file_path=%s, mstr=%s", rc, file_path, mstr);
 	close(fd);
 	if (rc < 1) {
 		error("write(%s): %m", file_path);

Dominik
Dominik,
sorry for the delay. I applied the patch but I cannot find your log line

error("BUG 5243: rc=%d, file_path=%s, mstr=%s", rc, file_path, mstr);

in slurmd.log. Are you sure that code is being executed? The debug levels are

[root@davide44 ~]# scontrol show conf | grep -i debug
DebugFlags              = Backfill,BackfillMap,NodeFeatures,Priority,Protocol,TraceJobs
SlurmctldDebug          = debug5
SlurmctldSyslogDebug    = verbose
SlurmdDebug             = debug5
SlurmdSyslogDebug       = verbose

thanks
ale
Hi,
thanks for this info. slurm_set_cpuset() can perform non-trivial manipulation of the CPU mask, which is why I suspected it. Could you try disabling task/affinity and check whether this changes the assigned cores?
Dominik
Hi Dominik,
the task/affinity plugin is already disabled:

[root@davide44 ~]# scontrol show conf | grep -i TaskPlugin
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)

Tim asked me to disable it because it triggered a sched_setaffinity invalid-argument error. Please see my first comment in this issue.
thanks
ale
Hi,
sorry, that means it is not the task/affinity plugin. Which version of hwloc do you use? To be sure, could you check whether "srun --cpu-bind=verbose" returns any extra info?
Dominik
Hi,
I cannot see any extra info from the --cpu-bind=verbose switch:

[afederic@davide44 ~]$ srun --cpu-bind=verbose -B 1:8:1 -w davide44 --pty bash
[afederic@davide44 ~]$

[root@davide44 ~]# rpm -q hwloc
hwloc-1.11.8-4.el7.ppc64le

thanks
ale
Hi Could you check if adding "Delegate=yes" to slurmd.service change anything? https://github.com/SchedMD/slurm/commit/cecb39ff087731d29252bbc36b00abf814a3c5ac Dominik
Hi,
have you had a chance to test this?
Dominik
Hi Dominik,
sorry for the delay, I was out of the office on holiday until today. I tried it; the behavior is the same:

[afederic@davide44 ~]$ srun -B 1:8:1 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_46/{cpuset.effective_cpus,cpuset.cpus,tasks}
0-31,64-95
0-31,64-95

thanks
ale
Created attachment 7550 [details]
open.c LD_PRELOAD library

Hi,
could you compile this library and start slurmd with LD_PRELOAD=/<...>/open.so? This lib catches every attempt to open "cpuset.cpus" and logs it to /tmp/open.log. If the modification of the cpus comes from Slurm, we will see it.
Dominik
Created attachment 7560 [details] backtrace log produced by open.c Hi Dominik I attached the log produced by this job [afederic@davide44 ~]$ srun -B 1:8:1 -w davide44 --pty bash thanks ale
Hi,
according to this log, the modification of cpuset.cpus doesn't come from Slurm: all modifications of cpuset.cpus go through xcgroup_set_param(), xcgroup_set_param() logs each modification at debug3, and it sets the cpus values properly. Because we have already disabled affinity in both task/affinity and task/cgroup (TaskAffinity=no in cgroup.conf), this also can't come from any Slurm sched_setaffinity().

Could you try to manually create a cgroup in a similar way to this log and check whether it works fine:

[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5'
[2018-06-27T10:30:07.934] [5.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'
[2018-06-27T10:30:07.937] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'
[2018-06-27T10:30:07.937] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.mems' set to '0-1' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'
[2018-06-27T10:30:07.937] [5.0] debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0-7' for '/sys/fs/cgroup/cpuset/slurm/uid_28541/job_5/step_0'

I also found this:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/SMT%20and%20cgroup%20cpusets

Dominik
Hi,
it seems to work:

[root@davide44 ~]# mkdir /sys/fs/cgroup/cpuset/slurm/TEST
[root@davide44 ~]# echo 0-7 > /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.cpus
[root@davide44 ~]# echo 0-1 > /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.mems
[root@davide44 ~]# echo 0 > /sys/fs/cgroup/cpuset/slurm/TEST/notify_on_release
[root@davide44 ~]# cat /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.cpus /sys/fs/cgroup/cpuset/slurm/TEST/cpuset.mems /sys/fs/cgroup/cpuset/slurm/TEST/notify_on_release
0-7
0-1
0

thanks
ale
Dominik,
I forgot to tell you that we are on CentOS 7.5:

[root@davide44 ~]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (AltArch)

thanks
ale
Hi,
thanks for this test and the info. I already knew that you are using CentOS 7 from the hwloc version :) I know that page describes the situation on Ubuntu, but if you are using software that changes the state of some threads, the result can be odd.
Dominik
Hi,
we have tried to get access to a POWER8 machine, unfortunately without success. Could you grant me a user-level remote account so that I can run tests remotely? This is against our normal no-remote-access rule, but I don't have any other idea right now.
Dominik
Hi Dominik, of course you are welcome to DAVIDE. One of our User Support people will contact you asap. thanks ale
Dear Dominik,
first of all, thanks for your help. In order to obtain an HPC account on Davide, we would kindly ask you to register on our UserDB Portal at:

https://userdb.hpc.cineca.it/

Just click on "Create new user" and enter the requested information. Once you have created the user, follow the "HPC Access" link (you find it among the Available Services in the vertical menu on the left), and complete your registration by providing the required data in the "Institution" and "Documents for HPC" sections. After that, please write to us so that we can associate your user to a project and send you a personal username and password to log in to login.davide.cineca.it.

Let us know in case of problems or doubts,
cheers
Isabella
Hi Dominik,
to help your work I set up a test environment:

davide44: slurmctld & slurmd
davide45: slurmd

I also allowed your account to run some sudo commands on both nodes:

User dbartkie may run the following commands on davide44:
    (root) NOPASSWD: /bin/cp slurm.conf /etc/slurm/, /bin/cp cgroup.conf /etc/slurm/, /bin/cp gres.conf /etc/slurm/, /bin/cp cgroup_allowed_devices_file.conf /etc/slurm/
    (root) NOPASSWD: /bin/systemctl restart slurmctld, /bin/systemctl restart slurmd

Let me know if you need any other permissions.
thanks
ale
Hi,
sorry for the late response. First of all, I can't recreate the invalid value of cpuset.cpus: with or without the task/affinity plugin, none of the different combinations of cgroup.conf options reproduces this problem.

A note on the -B option:

    --cores-per-socket=<cores>
        Restrict node selection to nodes with at least the specified
        number of cores per socket. See additional information under
        -B option above when task/affinity plugin is enabled. This
        option applies to job allocations.

Setting this doesn't guarantee that slurmctld will select 8 cores on one socket; it only restricts node selection to nodes with at least 8 free cores on the socket.
Dominik
Hi Dominik,
I see that you enabled the task/affinity plugin, but I cannot explain why it's now working. If you look at my first comment in this bug report, you can see that with task/affinity enabled sched_setaffinity always failed with "Invalid argument".

Thanks for the explanation of the -B switch, but I do not understand the cpus/threads allocated when using the -n switch. Submitting a job with -n 4 results in 1 core and all its SMT threads being allocated:

[afederic@davide44 ~]$ srun -n 4 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus
0-7

Is this the expected behavior? I would expect 4 cores to be allocated.

I did some testing with other switches, and the only way I found to get 4 real cores allocated is

[afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus
0-15,64-79

--ntasks-per-socket also seems to work:

[afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 --ntasks-per-socket=4 -w davide44 --pty bash
[afederic@davide44 ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus
64-95

So I don't understand what's going on when I use only the -n switch.

thank you very much
ale
(In reply to Cineca HPC Systems from comment #41)
> Hi Dominik
>
> I see that you enabled the task/affinity plugin but I cannot explain why
> it's now working. If you look at my first comment in this bug report you
> can see that with task/affinity enabled sched_setaffinity was always
> failing with "Invalid argument".

I know; I spent some time trying to reproduce this and I still have no idea what was wrong before. Did you change the SMT mode after starting the slurmd daemon?

> Thanks for the explanation of the -B switch, but I do not understand the
> cpus/threads allocated when using -n switch. Submitting a job with -n 4
> results in 1 core and all its SMTs allocated
>
> [afederic@davide44 ~]$ srun -n 4 -w davide44 --pty bash
> [afederic@davide44 ~]$ cat
> /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus
> 0-7
>
> Is this the expected behavior? I would expect 4 cores to be allocated.
>
> I did some testing with some other switches and I only found this way to get
> 4 real cores allocated
>
> [afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 -w davide44 --pty bash
> [afederic@davide44 ~]$ cat
> /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus
> 0-15,64-79

Yes, that is expected with CR_Core:

    Cores are consumable resources. On nodes with hyper-threads, each
    thread is counted as a CPU to satisfy a job's resource requirement,
    but multiple jobs are not allocated threads on the same core. The
    count of CPUs allocated to a job may be rounded up to account for
    every CPU on an allocated core.

Check CR_ONE_TASK_PER_CORE in the slurm.conf man page; I think you would like to use it.

Dominik

> Also --ntasks-per-socket seems to work
>
> [afederic@davide44 ~]$ srun -n 4 --ntasks-per-core=1 --ntasks-per-socket=4
> -w davide44 --pty bash
> [afederic@davide44 ~]$ cat
> /sys/fs/cgroup/cpuset/slurm/uid_28541/job_*/step_0/cpuset.cpus
> 64-95
>
> So I didn't understand what's going on when I use only the -n switch
>
> thank you very much
> ale
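The CR_Core rounding described above can be modeled roughly as follows. This is a sketch under the assumption that, without --ntasks-per-core, tasks pack onto the threads of as few cores as possible (cpus_allocated is a made-up name, not a Slurm API):

```python
import math

def cpus_allocated(ntasks, threads_per_core=8, ntasks_per_core=None):
    """Rough model of CR_Core accounting on an SMT-8 node: every
    hardware thread counts as a CPU, and the allocation is rounded
    up to whole cores."""
    tasks_per_core = ntasks_per_core or threads_per_core
    cores = math.ceil(ntasks / tasks_per_core)
    return cores * threads_per_core

# srun -n 4: 4 tasks fit on one core's 8 threads,
# so the job gets 8 CPUs (matching the cpuset 0-7 seen above).
print(cpus_allocated(4))                     # 8
# srun -n 4 --ntasks-per-core=1: 4 whole cores, 32 CPUs
# (matching the 32-CPU cpuset 0-15,64-79 seen above).
print(cpus_allocated(4, ntasks_per_core=1))  # 32
```

This matches both observations in the thread: plain -n 4 rounds up to one whole core (8 CPUs), while --ntasks-per-core=1 forces one task per core and thus 4 cores.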
Hi,
let me know if CR_ONE_TASK_PER_CORE works as you expected. Do you have any additional questions, or was my previous answer enough?
Dominik
Hi Dominik I tested CR_ONE_TASK_PER_CORE and it's working as expected. I'm still investigating what was wrong with task/affinity when I opened the bug report. Thank you very much ale
Hi OK, can we drop severity to 3 now? Dominik
Yes of course! ale
ciao Dominik just to inform you that yesterday we upgraded to Slurm 18.08.3. All seems to be working fine so, if you like, you can close this bug. Thanks for the help Ale
ciao again Dominik ;-)
just to let you know that if you still need access to the POWER8 architecture, I can leave the 2-node cluster davide4[4,5] online for you. Otherwise I will put the 2 nodes back into the production cluster. Let me know.
thanks
ale
Hi,
I am glad to hear that. You can put them back into the production cluster. Once more, thanks for giving me access to the machine.

Closing as resolved/infogiven; please reopen if needed.
Dominik