Commit 3e59bf7e967c3c490d8ffef541d53a3a6e3f471f appears to be responsible for bad task binding logic. The failure seems consistent.

With the commit:

$ srun --cpu-bind=thread,verbose -n3 true
cpu-bind-threads=UNK - nid00001, task  0  0 [393]: mask 0x1 set
cpu-bind-threads=UNK - nid00001, task  1  1 [394]: mask 0x10 set
slurmstepd-nid00001: error: task/cgroup: task[2] unable to set taskset '0x00000002'
cpu-bind-threads=UNK - nid00001, task  2  2 [395]: mask 0x2 set    FAILED

Right before the commit:

$ srun --cpu-bind=thread,verbose -n3 true
cpu-bind-threads=UNK - nid00001, task  0  0 [17236]: mask 0x1 set
cpu-bind-threads=UNK - nid00001, task  1  1 [17237]: mask 0x10 set
cpu-bind-threads=UNK - nid00001, task  2  2 [17238]: mask 0x4 set

I am seeing this failure with the following configuration:

cgroup.conf:
ConstrainCores=yes
TaskAffinity=yes

slurm.conf:
SelectType=select/cons_res
TaskPlugin=task/cgroup
NodeName=nid00001 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7000 State=UNKNOWN Gres=craynetwork:4,gpu:volta:4

I will pursue this more over the weekend...
The following patch logs the problem more clearly:

diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 78cf3c8650..ed218bc13c 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -676,6 +695,9 @@ static int _task_cgroup_cpuset_dist_cyclic(
 #else
 				    (hwloc_bitmap_first(
					obj->allowed_cpuset) != -1)) {
+					char *cpu_bit_str = NULL;
+					hwloc_bitmap_list_asprintf(&cpu_bit_str, obj->allowed_cpuset);
+					info("AT %d TASK %d ALLOWED_CPUSET:%s", __LINE__, taskid, cpu_bit_str);
 #endif
					t_ix[(s_ix*cps)+c_ixc[s_ix]]++;
					j++;

This is what the slurmstepd log looks like with the patch and shows what is different from before the patch:

[2018-09-02T16:41:22.562] [19987.0] AT 700 TASK 0 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.561] [19987.0] AT 700 TASK 1 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.561] [19987.0] AT 700 TASK 1 ALLOWED_CPUSET:4
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:4
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:1   << ALLOWED_CPUSET:2 was good

Removing the hwloc xml file cache creation in slurmstepd.c makes the problem go away:

diff --git a/src/slurmd/slurmstepd/slurmstepd.c b/src/slurmd/slurmstepd/slurmstepd.c
index 320d26195a..399647a02d 100644
--- a/src/slurmd/slurmstepd/slurmstepd.c
+++ b/src/slurmd/slurmstepd/slurmstepd.c
@@ -628,7 +628,8 @@ _init_from_slurmd(int sock, char **argv,
			jobid, stepid);

	/* create hwloc xml file here to avoid threading issues late */
-	xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);
+	//THIS FIXES PROBLEM
+	// xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);

	/*
	 * Swap the field to the srun client version, which will eventually

I expect the problem isn't the cache per se, but that the cache needs to be created later in the logic.
This patch fixed the problem, but more study is required to determine if this is the right place to issue the call. Note that this logic executes for task launch, but not for batch jobs:

diff --git a/src/slurmd/slurmstepd/mgr.c b/src/slurmd/slurmstepd/mgr.c
index 2d3a53eedf..d91a5a1835 100644
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -1796,6 +1796,7 @@ _fork_all_tasks(stepd_step_rec_t *job, bool *io_initialized)
	/*
	 * Fork all of the task processes.
	 */
+	xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false); // THIS IS OK
	verbose("starting %u tasks", job->node_tasks);
	for (i = 0; i < job->node_tasks; i++) {
		char time_stamp[256];
I have identified exactly where the xcpuinfo_hwloc_topo_load() call needs to be in order to prevent the failure:

diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 6e4074a391..2d8941cb1c 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -1279,6 +1278,7 @@ again:
	/* attach the slurmstepd to the step cpuset cgroup */
	pid_t pid = getpid();
+	xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false); // FAILS BINDING HERE
	rc = xcgroup_add_pids(&step_cpuset_cg,&pid,1);
	if (rc != XCGROUP_SUCCESS) {
		error("task/cgroup: unable to add slurmstepd to cpuset cg '%s'",
@@ -1286,10 +1286,9 @@ again:
		fstatus = SLURM_ERROR;
	} else
		fstatus = SLURM_SUCCESS;
-
+	//xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false); // GOOD BINDING HERE
	/* validate the requested cpu frequency and set it */
	cpu_freq_cgroup_validate(job, step_alloc_cores);
-
 error:
	xcgroup_unlock(&cpuset_cg);
	xcgroup_destroy(&cpuset_cg);
I am guessing your system also has

FastSchedule = 2

meaning you have a fake configuration for the node?
Thanks Moe, this is indeed an issue. I modified the code slightly to run this right after task_g_pre_setuid(), which is where the cgroups that hwloc reads from are set up. Where you had it, the file was created by the user instead of root. I doubt that is that big of an issue, but it is still probably not what we want. Thanks for finding this. My modified patch, attributed to you, is in commit d005a1447da8. Please reopen if things don't work as expected.
(In reply to Danny Auble from comment #3)
> I am guessing your system also has
>
> FastSchedule = 2
>
> meaning you have a fake configuration for the node?

Correct. slurm.conf has:

Sockets=2 CoresPerSocket=4 ThreadsPerCore=1

Actual hardware:

Sockets=1 CoresPerSocket=4 ThreadsPerCore=2

I've been running like this to test GPU socket binding logic.
Thanks Moe, I was only able to make this problem happen when I did exactly as you had done in comment 5. Given that, I don't think this is a widespread issue.