Ticket 12252 - Slurmd and cgroupv2
Summary: Slurmd and cgroupv2
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Documentation
Version: 20.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-08-11 04:41 MDT by Torkil Svensgaard
Modified: 2021-08-11 10:17 MDT
1 user

See Also:
Site: DRCMR
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Torkil Svensgaard 2021-08-11 04:41:49 MDT
Hi

I'm trying to get slurmd running on our login nodes with the sole purpose of pulling configuration files via configless mode.

It works fine on nodes with cgroup v1 but fails on nodes with cgroup v2:

"
# slurmd -Dvvvc
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:48 Boards:1 Sockets:1 CoresPerSocket:24 ThreadsPerCore:2
slurmd: error: Node configuration differs from hardware: CPUs=1:48(hw) Boards=1:1(hw) SocketsPerBoard=1:1(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)
slurmd: debug:  Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurm/d/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:48 Boards:1 Sockets:1 CoresPerSocket:24 ThreadsPerCore:2
slurmd: debug:  skipping GRES for NodeName=bigger2  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=bigger3  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=bigger9  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=chimera  AutoDetect=nvml
slurmd: debug:  skipping GRES for NodeName=kong  AutoDetect=nvml
slurmd: debug:  gres/gpu: init: loaded
slurmd: debug:  gpu/generic: init: init: GPU Generic plugin loaded
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: route/default: init: route default plugin loaded
slurmd: debug2: Gathering cpu frequency information for 48 cpus
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug2: _file_read_content: unable to open '/sys/fs/cgroup/cpuset//tasks' for reading : No such file or directory
slurmd: debug2: xcgroup_get_param: unable to get parameter 'tasks' for '/sys/fs/cgroup/cpuset/'
slurmd: error: unable to mount cpuset cgroup namespace: Device or resource busy
slurmd: error: unable to create cpuset namespace
slurmd: error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
slurmd: error: cannot create task context for task/cgroup
slurmd: error: slurmd initialization failed
"

How do I fix that?
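(For readers hitting the same `unable to mount cpuset cgroup namespace` error: it is characteristic of slurmd 20.11 starting on a host that boots the unified cgroup v2 hierarchy. A quick way to check which hierarchy a node is running, sketched for a typical systemd-based Linux host, is:)

```shell
# Print the filesystem type mounted at /sys/fs/cgroup:
#   cgroup2fs -> unified cgroup v2 hierarchy (slurmd 20.11 fails as above)
#   tmpfs     -> legacy cgroup v1 hierarchy (works with this Slurm version)
stat -fc %T /sys/fs/cgroup
```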

Best regards,

Torkil
Comment 1 Jason Booth 2021-08-11 10:17:03 MDT
Hi Torkil - cgroup v2 is not yet supported in Slurm, nor by any other scheduler at this time. We are currently looking at adding support for v2 in a future version. For now, I would suggest switching back to the legacy cgroup v1 hierarchy on systems that run Slurm.
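(On systemd-based distros, reverting to the legacy v1 hierarchy is done with a kernel boot parameter. A sketch of the workaround, assuming a GRUB-based system with `grubby` available, e.g. the RHEL/Fedora family; it is not an official Slurm procedure:)

```shell
# Boot the node with the legacy cgroup v1 hierarchy.
# On Debian/Ubuntu, instead append the parameter to GRUB_CMDLINE_LINUX in
# /etc/default/grub and run update-grub.
grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
# Reboot, then verify: /sys/fs/cgroup should now be a tmpfs with per-controller
# subdirectories (cpuset, memory, ...) rather than a single cgroup2 mount.
stat -fc %T /sys/fs/cgroup
```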