| Summary: | task/cgroup: task[#] unable to set taskset | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Moe Jette <jette> |
| Component: | slurmd | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 18.08.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | SchedMD | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 18.08.1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Moe Jette
2018-08-31 22:28:52 MDT
The following patch logs the problem more clearly:
diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 78cf3c8650..ed218bc13c 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -676,6 +695,9 @@ static int _task_cgroup_cpuset_dist_cyclic(
#else
(hwloc_bitmap_first(
obj->allowed_cpuset) != -1)) {
+char *cpu_bit_str = NULL;
+hwloc_bitmap_list_asprintf(&cpu_bit_str, obj->allowed_cpuset);
+info("AT %d TASK %d ALLOWED_CPUSET:%s", __LINE__, taskid, cpu_bit_str);
#endif
t_ix[(s_ix*cps)+c_ixc[s_ix]]++;
j++;
This is what the slurmstepd log looks like with the patch applied; it shows what differs from the behavior before the patch:
[2018-09-02T16:41:22.562] [19987.0] AT 700 TASK 0 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.561] [19987.0] AT 700 TASK 1 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.561] [19987.0] AT 700 TASK 1 ALLOWED_CPUSET:4
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:0
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:4
[2018-09-02T16:41:22.564] [19987.0] AT 700 TASK 2 ALLOWED_CPUSET:1 << ALLOWED_CPUSET:2 was good
Removing the hwloc xml file cache creation in slurmstepd.c makes the problem go away:
diff --git a/src/slurmd/slurmstepd/slurmstepd.c b/src/slurmd/slurmstepd/slurmstepd.c
index 320d26195a..399647a02d 100644
--- a/src/slurmd/slurmstepd/slurmstepd.c
+++ b/src/slurmd/slurmstepd/slurmstepd.c
@@ -628,7 +628,8 @@ _init_from_slurmd(int sock, char **argv,
jobid, stepid);
/* create hwloc xml file here to avoid threading issues late */
- xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);
+//THIS FIXES PROBLEM
+// xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false);
/*
* Swap the field to the srun client version, which will eventually
I expect the problem isn't the cache per se, but that the cache needs to be created later in the logic. This patch fixed the problem, but more study is required to determine if this is the right place to issue the call. Note that this logic executes for task launch, but not for batch jobs:
diff --git a/src/slurmd/slurmstepd/mgr.c b/src/slurmd/slurmstepd/mgr.c
index 2d3a53eedf..d91a5a1835 100644
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -1796,6 +1796,7 @@ _fork_all_tasks(stepd_step_rec_t *job, bool *io_initialized)
/*
* Fork all of the task processes.
*/
+xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false); // THIS IS OK
verbose("starting %u tasks", job->node_tasks);
for (i = 0; i < job->node_tasks; i++) {
char time_stamp[256];
I have identified exactly where the xcpuinfo_hwloc_topo_load() call needs to be made in order to prevent the failure:
diff --git a/src/plugins/task/cgroup/task_cgroup_cpuset.c b/src/plugins/task/cgroup/task_cgroup_cpuset.c
index 6e4074a391..2d8941cb1c 100644
--- a/src/plugins/task/cgroup/task_cgroup_cpuset.c
+++ b/src/plugins/task/cgroup/task_cgroup_cpuset.c
@@ -1279,6 +1278,7 @@ again:
/* attach the slurmstepd to the step cpuset cgroup */
pid_t pid = getpid();
+xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false); // FAILS BINDING HERE
rc = xcgroup_add_pids(&step_cpuset_cg,&pid,1);
if (rc != XCGROUP_SUCCESS) {
error("task/cgroup: unable to add slurmstepd to cpuset cg '%s'",
@@ -1286,10 +1286,9 @@ again:
fstatus = SLURM_ERROR;
} else
fstatus = SLURM_SUCCESS;
-
+//xcpuinfo_hwloc_topo_load(NULL, conf->hwloc_xml, false); // GOOD BINDING HERE
/* validate the requested cpu frequency and set it */
cpu_freq_cgroup_validate(job, step_alloc_cores);
-
error:
xcgroup_unlock(&cpuset_cg);
xcgroup_destroy(&cpuset_cg);
I am guessing your system also has

FastSchedule = 2

meaning you have a fake configuration for the node?

Thanks Moe, this is indeed an issue. I modified the code slightly to run this right after task_g_pre_setuid(), which is where the cgroups are set up and which hwloc is reading from. Where you had it, the file was created by the user instead of root. I doubt that is that big an issue, but it is still probably not what we want. Thanks for finding this.

My modified patch, attributed to you, is in commit d005a1447da8. Please reopen if things don't work as expected.

(In reply to Danny Auble from comment #3)
> I am guessing your system also has
>
> FastSchedule = 2
>
> meaning you have a fake configuration for the node?

Correct. slurm.conf has

Sockets=2 CoresPerSocket=4 ThreadsPerCore=1

Actual hardware:

Sockets=1 CoresPerSocket=4 ThreadsPerCore=2

I've been running like this to test GPU socket binding logic.
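The mismatch described above can be expressed as a node definition like the following sketch, based on the values quoted in this report (the NodeName is hypothetical):

```
# slurm.conf excerpt (sketch): with FastSchedule=2, slurmd trusts this
# fake definition instead of the detected hardware.
FastSchedule=2
NodeName=testnode Sockets=2 CoresPerSocket=4 ThreadsPerCore=1
# Actual hardware on the node: Sockets=1 CoresPerSocket=4 ThreadsPerCore=2
```

With this mismatch, the hwloc view of real CPUs and the configured socket/core layout disagree, which is what exposed the cpuset binding problem in this report.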