Ticket 12252

Summary: Slurmd and cgroupv2
Product: Slurm Reporter: Torkil Svensgaard <torkil>
Component: Documentation Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: rkv
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: DRCMR

Description Torkil Svensgaard 2021-08-11 04:41:49 MDT
Hi

I'm trying to get slurmd running on our login nodes for the sole purpose of pulling configuration files in configless mode.

It works fine on the nodes with cgroup v1 but fails on the nodes with cgroup v2:

"
# slurmd -Dvvvc
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:48 Boards:1 Sockets:1 CoresPerSocket:24 ThreadsPerCore:2
slurmd: error: Node configuration differs from hardware: CPUs=1:48(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)
slurmd: debug:  Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurm/d/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:48 Boards:1 Sockets:1 CoresPerSocket:24 ThreadsPerCore:2
slurmd: debug:  skipping GRES for NodeName=bigger2  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=bigger3  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=bigger9  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=chimera  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=kong  AutoDetect=nvml
slurmd: debug:  gres/gpu: init: loaded
slurmd: debug:  gpu/generic: init: init: GPU Generic plugin loaded
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: route/default: init: route default plugin loaded
slurmd: debug2: Gathering cpu frequency information for 48 cpus
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug2: _file_read_content: unable to open '/sys/fs/cgroup/cpuset//tasks' for reading : No such file or directory
slurmd: debug2: xcgroup_get_param: unable to get parameter 'tasks' for '/sys/fs/cgroup/cpuset/'
slurmd: error: unable to mount cpuset cgroup namespace: Device or resource busy
slurmd: error: unable to create cpuset namespace
slurmd: error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
slurmd: error: cannot create task context for task/cgroup
slurmd: error: slurmd initialization failed
"

How do I fix that?
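For reference, a quick way to confirm which cgroup hierarchy a node booted with is to check the filesystem type mounted at /sys/fs/cgroup (standard coreutils `stat`; this check is my addition, not from the original ticket):

```shell
# Print the filesystem type mounted at /sys/fs/cgroup:
#   cgroup2fs -> unified cgroup v2 hierarchy
#   tmpfs     -> legacy cgroup v1 controller mounts
stat -fc %T /sys/fs/cgroup
```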

Best regards,

Torkil
Comment 1 Jason Booth 2021-08-11 10:17:03 MDT
Hi Torkil - cgroupsv2 is not yet supported in Slurm, or any other scheduler at this time. We are currently looking at adding support for v2 in a future version. For now, I would suggest that you switch to the legacy cgroup v1 on systems that use Slurm.
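As a sketch of that workaround on a systemd-based distro: booting with the unified hierarchy disabled brings back the v1 controller mounts that slurmd's cpuset plugin expects. The exact tooling varies by distro; `grubby` below is a RHEL/Fedora assumption and not from the original ticket.

```shell
# Disable the unified (v2) cgroup hierarchy at boot (systemd >= 230).
# grubby is RHEL/Fedora tooling; on Debian/Ubuntu, add the same argument
# to GRUB_CMDLINE_LINUX in /etc/default/grub and run update-grub instead.
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"

# Reboot for the change to take effect, then verify:
#   stat -fc %T /sys/fs/cgroup   ->  tmpfs (legacy v1 mounts)
```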