Ticket 5690 - cgroups with cpusets fails
Summary: cgroups with cpusets fails
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 18.08.0
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
 
Reported: 2018-09-09 13:02 MDT by Anthony DelSorbo
Modified: 2018-09-10 16:35 MDT

Site: NOAA
NOAA Site: NESCC
Machine Name: Selene


Description Anthony DelSorbo 2018-09-09 13:02:24 MDT
During a recent training session we reorganized our configuration files and arrived at a Slurm configuration with the following settings in slurm.conf:

TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=task/affinity,task/cgroup

This resulted in:

scontrol: error: Bad TaskPluginParam: task/affinity
scontrol: fatal: Unable to process configuration file

Assuming the configuration settings were in error, I changed this to:

TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Cores,Cpusets

While this cleared up the configuration error, I next ran into the following error on the compute node when running a simple job:

Sep  9 14:56:28 s0014 slurmstepd[81325]: Munge cryptographic signature plugin loaded
Sep  9 14:56:28 s0014 slurmstepd[81325]: error: slurm_build_cpuset: mkdir(/dev/cpuset/slurm2131): No such file or directory
Sep  9 14:56:28 s0014 slurmstepd[81325]: error: task_p_pre_setuid: slurm_build_cpuset() failed
Sep  9 14:56:28 s0014 slurmstepd[81325]: error: _spawn_job_container: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
Sep  9 14:56:28 s0014 slurmd[80884]: Launching batch job 2131 for UID 1209
Sep  9 14:56:28 s0014 slurmstepd[81331]: task affinity plugin loaded with CPU mask 0000000000...0ffffffffffff
Sep  9 14:56:28 s0014 slurmstepd[81331]: Munge cryptographic signature plugin loaded
Sep  9 14:56:28 s0014 slurmstepd[81331]: error: slurm_build_cpuset: mkdir(/dev/cpuset/slurm2131): No such file or directory
Sep  9 14:56:28 s0014 slurmstepd[81331]: error: task_p_pre_setuid: slurm_build_cpuset() failed
Sep  9 14:56:28 s0014 slurmstepd[81331]: error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
Sep  9 14:56:28 s0014 slurmstepd[81331]: error: job_manager exiting abnormally, rc = 4020
Sep  9 14:56:28 s0014 slurmstepd[81331]: job 2131 completed with slurm_rc = 4020, job_rc = 0
Sep  9 14:56:28 s0014 slurmstepd[81331]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
Sep  9 14:56:28 s0014 slurmd[80884]: error: task_p_slurmd_release_resources: rmdir(/dev/cpuset/slurm2131) failed No such file or directory

This, of course, results in the node being drained due to the error.

What is strange about this error is that the compute node is a RHEL 7 system, which has no /dev/cpuset.

Removing the "Cpusets" setting from the configuration, leaving:

TaskPluginParam=Cores

permits the job to succeed.

It's not clear to me how to proceed to resolve this issue.  Please advise.
Comment 1 Tim Wickberg 2018-09-09 13:07:21 MDT
(In reply to Anthony DelSorbo from comment #0)
> During a recent training session we reorganized our configuration files to
> arrive at a slurm configuration with the following setting in slurm.conf:
> 
> TaskPlugin=task/affinity,task/cgroup
> TaskPluginParam=task/affinity,task/cgroup

The config I sent over to you had TaskPluginParam removed, as those settings are not needed now that task/cgroup has been added to TaskPlugin.

I'm not sure where you got that setting from?

> It's not clear to me how to proceed to resolve this issue.  Please advise.

Please delete the TaskPluginParam config line entirely.
Comment 2 Tim Wickberg 2018-09-10 16:35:43 MDT
Updating to resolved/infogiven, please reopen if you have any further questions.

- Tim
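
The advice in comment 1 (drop TaskPluginParam once task/cgroup is listed in TaskPlugin) can be checked mechanically before restarting the daemons. What follows is only a minimal sketch, not a Slurm tool: the function name find_redundant_task_plugin_param is hypothetical, and the parser is a simplification that assumes one key=value setting per line and '#' comments, which is enough for the two settings discussed in this ticket.

```python
def find_redundant_task_plugin_param(conf_text):
    """Return the TaskPluginParam lines found in conf_text, but only when
    TaskPlugin lists task/cgroup (the situation described in this ticket).

    Hypothetical helper: a simplified reader for the two settings above,
    not a full slurm.conf parser.
    """
    task_plugins = []
    param_lines = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()  # slurm.conf parameter names are case-insensitive
        if key == "taskplugin":
            task_plugins = [p.strip().lower() for p in value.split(",")]
        elif key == "taskpluginparam":
            param_lines.append(line)
    # Only flag TaskPluginParam when task/cgroup is actually in use.
    if "task/cgroup" in task_plugins:
        return param_lines
    return []


if __name__ == "__main__":
    # The second configuration tried in the description.
    sample = "TaskPlugin=task/affinity,task/cgroup\nTaskPluginParam=Cores,Cpusets\n"
    for flagged in find_redundant_task_plugin_param(sample):
        print("consider removing:", flagged)
```

Run against the second configuration from the description, the sketch flags the TaskPluginParam=Cores,Cpusets line; with that line deleted, as requested in comment 1, it reports nothing.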