Ticket 5497

Summary: How to disable slurmd memory cgroup use
Product: Slurm
Reporter: GSK-ONYX-SLURM <slurm-support>
Component: slurmd
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Version: 17.11.7
Hardware: Linux
OS: Linux
Site: GSK

Description GSK-ONYX-SLURM 2018-07-30 05:48:49 MDT
Hi.
We are being hit hard by what we believe is the memory cgroup issue.

[2018-07-27T15:50:07.660] task_p_slurmd_batch_request: 67020
[2018-07-27T15:50:07.660] task/affinity: job 67020 CPU input mask for node: 0x000000000002
[2018-07-27T15:50:07.660] task/affinity: job 67020 CPU final HW mask for node: 0x000001000000
[2018-07-27T15:50:07.661] _run_prolog: run job script took usec=9
[2018-07-27T15:50:07.663] _run_prolog: prolog with lock for job 67020 ran for 0 seconds
[2018-07-27T15:50:07.663] Launching batch job 67020 for UID 62356
[2018-07-27T15:50:07.845] [67020.batch] task/cgroup: /slurm/uid_62356/job_67020: alloc=5000MB mem.limit=5000MB memsw.limit=5000MB
[2018-07-27T15:50:07.845] [67020.batch] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_67020/step_batch' : No space left on device
[2018-07-27T15:50:07.855] [67020.batch] error: task/cgroup: unable to add task[pid=13971] to memory cg '(null)'
[2018-07-27T15:50:07.858] [67020.batch] task_p_pre_launch: Using sched_affinity for tasks

When this happens we have application job steps (slurmstepd) taking 50 minutes to complete instead of 50 seconds.  Previous attempts to resolve this have always ended up needing a reboot.  We then have a period of running OK before it blows up again.

In an attempt to prove that this memory cgroup issue really is the cause of the application job slowdown, I thought we might recover (without a reboot) if we stopped slurmd from using memory cgroups but continued to use cgroups for cores.

So I did the following:

1. Changed slurm.conf SelectTypeParameters from CR_CPU_memory to CR_CPU (sketched below)
2. Propagated slurm.conf
3. restarted backup slurmctld
4. restarted primary slurmctld
5. scontrol reconfigure
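
For reference, the change in step 1 amounts to something like this in slurm.conf (only the relevant line is shown; the rest of the file is unchanged):

# slurm.conf - before
SelectTypeParameters=CR_CPU_memory
# slurm.conf - after
SelectTypeParameters=CR_CPU

# then, once the file has been propagated and both slurmctlds restarted:
scontrol reconfigure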

That had no effect.  Jobs still took many minutes to complete and the slurmd logfile still contained

error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_67020/step_batch' : No space left on device
error: task/cgroup: unable to add task[pid=13971] to memory cg '(null)'

messages.  I tried restarting the slurmd daemon, but that had no effect either.
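
As an aside, a rough way to confirm that "No space left on device" refers to the memory cgroup controller rather than actual disk space is to try creating a cgroup by hand under the same hierarchy shown in the error ("probe" below is just an arbitrary test name):

# creating a directory under the memory controller creates a cgroup;
# ENOSPC here means the controller itself is refusing new cgroups
mkdir /sys/fs/cgroup/memory/slurm/probe
rmdir /sys/fs/cgroup/memory/slurm/probe   # clean up if the mkdir succeeded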

Next I edited cgroup.conf on the compute node and changed
ConstrainRAMSpace=yes

to be

ConstrainRAMSpace=no

and restarted the slurmd daemon.
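
For context, the node's cgroup.conf at that point looked roughly like this (the lines other than ConstrainRAMSpace are illustrative and may not match our file exactly):

# cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no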

This seems to have had some success.  Our application jobs are once again completing in under 60 seconds of elapsed time.

However, the slurmd log file still shows these messages...

[2018-07-30T12:45:04.797] [76245.batch] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_76245' : No space left on device
[2018-07-30T12:45:04.812] [76246.batch] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm/uid_62356/job_76246' : No space left on device
[2018-07-30T12:45:04.818] [76245.batch] error: task/cgroup: unable to add task[pid=30506] to memory cg '(null)'
[2018-07-30T12:45:04.820] [76245.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-07-30T12:45:04.823] [76246.batch] error: task/cgroup: unable to add task[pid=30507] to memory cg '(null)'
[2018-07-30T12:45:04.826] [76246.batch] task_p_pre_launch: Using sched_affinity for tasks

So my question is: how do we properly configure our compute node(s) to use cgroups for CPUs (cores) but *NOT* for memory?

Thanks.
Mark.
Comment 1 Marshall Garey 2018-07-30 06:52:18 MDT
See bug 5082. All you have to do is set ConstrainKmemSpace=No in cgroup.conf and you won't hit the issue anymore. But to recover the memory that has already leaked, you have to restart the node.
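
Concretely, that is just this one line in cgroup.conf on the compute nodes, with everything else left exactly as you have it now:

# cgroup.conf - the only change needed (it becomes the default in 18.08)
ConstrainKmemSpace=No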
Comment 2 GSK-ONYX-SLURM 2018-07-30 06:58:52 MDT
So to confirm, you are saying that I do not need to 

change SelectTypeParameters from CR_CPU_memory to CR_CPU
or
set ConstrainRAMSpace=no

I just need to set

ConstrainKmemSpace=No 

Is that correct?

Thanks.
Mark.
Comment 3 Marshall Garey 2018-07-30 08:53:22 MDT
Yes, that's correct. ConstrainKmemSpace=No is the only change - don't make any of the other changes that you made.

But you have to restart the node in order to reclaim the memory that was leaked.

Can you confirm that solves the problems you're experiencing?
Comment 4 GSK-ONYX-SLURM 2018-07-30 09:13:36 MDT
Ok, I will unwind the other changes, put the Kmem fix in place and get the server rebooted.
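
To be explicit, the net configuration we will end up with is roughly the following (all other settings left untouched):

# slurm.conf - reverted to the original value
SelectTypeParameters=CR_CPU_memory

# cgroup.conf - ConstrainRAMSpace reverted, Kmem fix added
ConstrainRAMSpace=yes
ConstrainKmemSpace=No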

I'll come back to you in the next day or so once we've had a chance to test this.

Thanks.
Mark.
Comment 5 Marshall Garey 2018-07-30 09:16:02 MDT
Sounds good.

Just for your information, ConstrainKmemSpace defaults to No in 18.08. We made that change because of this very bug that you are experiencing (and which several others have already hit).
Comment 6 Marshall Garey 2018-08-02 11:35:03 MDT
Have you had any more problems?
Comment 7 Marshall Garey 2018-08-06 09:55:41 MDT
Closing as resolved/duplicate of 5082.

*** This ticket has been marked as a duplicate of ticket 5082 ***