Ticket 5663

Summary: task/cgroup: task[#] unable to set taskset
Product: Slurm Reporter: Moe Jette <jette>
Component: slurmd Assignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 18.08.0   
Hardware: Linux   
OS: Linux   
Site: SchedMD
Version Fixed: 18.08.1

Description Moe Jette 2018-08-31 22:28:52 MDT
Commit 3e59bf7e967c3c490d8ffef541d53a3a6e3f471f appears to be responsible for bad task binding logic. The failure seems consistent. With the commit:

$ srun --cpu-bind=thread,verbose  -n3 true
cpu-bind-threads=UNK  - nid00001, task  0  0 [393]: mask 0x1 set
cpu-bind-threads=UNK  - nid00001, task  1  1 [394]: mask 0x10 set
slurmstepd-nid00001: error: task/cgroup: task[2] unable to set taskset '0x00000002'
cpu-bind-threads=UNK  - nid00001, task  2  2 [395]: mask 0x2 set FAILED

Right before the commit:
$ srun --cpu-bind=thread,verbose  -n3 true
cpu-bind-threads=UNK  - nid00001, task  0  0 [17236]: mask 0x1 set
cpu-bind-threads=UNK  - nid00001, task  1  1 [17237]: mask 0x10 set
cpu-bind-threads=UNK  - nid00001, task  2  2 [17238]: mask 0x4 set

I am seeing this failure with the following configuration:
cgroup.conf:
ConstrainCores=yes
TaskAffinity=yes

slurm.conf:
SelectType=select/cons_res
TaskPlugin=task/cgroup
NodeName=nid00001 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7000 State=UNKNOWN Gres=craynetwork:4,gpu:volta:4
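With ConstrainCores=yes, each task's kernel-visible CPU list is whatever its cpuset cgroup grants it. A quick way to see that list for any process on a Linux node (here the current shell, as a stand-in for a step task):

```shell
# Show the scheduler-allowed CPU list for the current process.
# For a step task under task/cgroup with ConstrainCores=yes, this
# would show only the CPUs granted by the step's cpuset cgroup.
grep Cpus_allowed_list /proc/self/status
```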

I will pursue this more this weekend...
Comment 1 Moe Jette 2018-09-02 18:42:30 MDT
The following patch logs the problem more clearly:

diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 78cf3c8650..ed218bc13c 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c

@@ -676,6 +695,9 @@ static int _task_cgroup_cpuset_dist_cyclic(
 #else
                                            (hwloc_bitmap_first(
                                             obj->allowed_cpuset) != -1)) {
+char *cpu_bit_str = NULL;
+hwloc_bitmap_list_asprintf(&cpu_bit_str, obj->allowed_cpuset);
+info("AT %d TASK %d ALLOWED_CPUSET:%s", __LINE__, taskid, cpu_bit_str);
 #endif
                                                t_ix[(s_ix*cps)+c_ixc[s_ix]]++;
                                                j++;

This is what the slurmstepd log looks like with the patch and shows what is different from before the patch:
[2018-09-02T16:41:22.562] [19987.0] AT 700 TASK 0 ALLOWED_CPUSET:0

[2018-09-02T16:41:22.561] [19987.0] AT 700 TASK 1 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.561] [19987.0] AT 700 TASK 1 ALLOWED_CPUSET:4

[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:4
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:1	<< ALLOWED_CPUSET:2 was good

Removing the hwloc xml file cache creation in slurmstepd.c makes the problem go away:
diff --git a/src/slurmd/slurmstepd/slurmstepd.c b/src/slurmd/slurmstepd/slurmstepd.c
index 320d26195a..399647a02d 100644
--- a/src/slurmd/slurmstepd/slurmstepd.c
+++ b/src/slurmd/slurmstepd/slurmstepd.c
@@ -628,7 +628,8 @@ _init_from_slurmd(int sock, char **argv,
                                                 jobid, stepid);
 
        /* create hwloc xml file here to avoid threading issues late */
-       xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);
+//THIS FIXES PROBLEM
+//     xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);
 
        /*
         * Swap the field to the srun client version, which will eventually

I expect the problem isn't the cache per se, but that the cache needs to be created later in the logic. The following patch fixed the problem, but more study is required to determine whether this is the right place to issue the call. Note that this logic executes for task launch, but not for batch jobs:
diff --git a/src/slurmd/slurmstepd/mgr.c b/src/slurmd/slurmstepd/mgr.c
index 2d3a53eedf..d91a5a1835 100644
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -1796,6 +1796,7 @@ _fork_all_tasks(stepd_step_rec_t *job, bool *io_initialized)
        /*
         * Fork all of the task processes.
         */
+xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);        // THIS IS OK
        verbose("starting %u tasks", job->node_tasks);
        for (i = 0; i < job->node_tasks; i++) {
                char time_stamp[256];
Comment 2 Moe Jette 2018-09-03 17:36:19 MDT
I have identified exactly where the xcpuinfo_hwloc_topo_load() call needs to be in order to prevent the failure:
diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 6e4074a391..2d8941cb1c 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -1279,6 +1278,7 @@ again:
 
        /* attach the slurmstepd to the step cpuset cgroup */
        pid_t pid = getpid();
+xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);    // FAILS BINDING HERE
        rc = xcgroup_add_pids(&step_cpuset_cg,&pid,1);
        if (rc != XCGROUP_SUCCESS) {
                error("task/cgroup: unable to add slurmstepd to cpuset cg '%s'",
@@ -1286,10 +1286,9 @@ again:
                fstatus = SLURM_ERROR;
        } else
                fstatus = SLURM_SUCCESS;
-
+//xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);   // GOOD BINDING HERE
        /* validate the requested cpu frequency and set it */
        cpu_freq_cgroup_validate(job, step_alloc_cores);
-
 error:
        xcgroup_unlock(&cpuset_cg);
        xcgroup_destroy(&cpuset_cg);
Comment 3 Danny Auble 2018-09-04 13:22:31 MDT
I am guessing your system also has 

FastSchedule = 2

meaning you have a fake configuration for the node?
Comment 4 Danny Auble 2018-09-04 13:51:38 MDT
Thanks Moe, this is indeed an issue.

I modified the code slightly to run this right after task_g_pre_setuid(), which is where the cgroups are set up that hwloc reads from.

Where you had it, the file was created by the user instead of root.  I doubt that is a big issue, but it is still probably not what we want.

Thanks for finding this.

My modified patch attributed to you is in commit d005a1447da8.

Please reopen if things don't work as expected.
Comment 5 Moe Jette 2018-09-04 13:55:18 MDT
(In reply to Danny Auble from comment #3)
> I am guessing your system also has 
> 
> FastSchedule = 2
> 
> meaning you have a fake configuration for the node?

Correct. slurm.conf has
Sockets=2 CoresPerSocket=4 ThreadsPerCore=1

Actual hardware:
Sockets=1 CoresPerSocket=4 ThreadsPerCore=2

I've been running like this to test GPU socket binding logic.
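With FastSchedule=2 the slurm.conf topology is taken at face value, so the mismatch only shows up when comparing against the real hardware. One way to see the actual layout on a Linux node (counts derived from /proc/cpuinfo; the "physical id" field may be absent on some VMs):

```shell
# Count real sockets and total hardware threads, to compare with the
# (fake) Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 in slurm.conf.
sockets=$(grep 'physical id' /proc/cpuinfo | sort -u | wc -l)
threads=$(grep -c '^processor' /proc/cpuinfo)
echo "sockets=$sockets total_threads=$threads"
```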
Comment 6 Danny Auble 2018-09-04 13:57:00 MDT
Thanks Moe, I was only able to make this problem happen when I did exactly what you described in comment 5.  Given that, I don't think this is a widespread issue.