Ticket 447

Summary: cgroup proctrack plugin sporadically drains nodes with ESLURMD_SETUP_ENVIRONMENT_ERROR
Product: Slurm Reporter: John Morrissey <jwm>
Component: slurmd Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da
Version: 2.6.x   
Hardware: Linux   
OS: Linux   
Site: Harvard University Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: fix for proctrack/cgroup race condition
add locking to prevent race condition
revised locking patch

Description John Morrissey 2013-10-09 07:05:14 MDT
When the cgroups proctrack plugin is enabled, nodes are being drained sporadically due to an ENOENT passed up to slurm_container_create(). On the master:

  Oct  5 03:28:49 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921643.0 holy2a17201 usec=62979
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921628.0 holy2a17201 usec=84454
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921631.0 holy2a17201 usec=84888
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921640.0 holy2a17201 usec=85446
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921638.0 holy2a17201 usec=50547
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921636.0 holy2a17201 usec=536095
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921627.0 holy2a17201 usec=536344
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921642.0 holy2a17201 usec=92872
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921637.0 holy2a17201 usec=57975
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921634.0 holy2a17201 usec=38049
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921629.0 holy2a17201 usec=38515
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921626.0 holy2a17201 usec=39155
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921630.0 holy2a17201 usec=39645
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921635.0 holy2a17201 usec=40084
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921633.0 holy2a17201 usec=41559
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921641.0 holy2a17201 usec=42813
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921639.0 holy2a17201 usec=484
  Oct  5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921632.0 holy2a17201 usec=390
  Oct  5 03:28:57 holy-slurm01 slurmctld[16816]: error: slurmd error running JobId=1921631 on node(s)=holy2a17201: Slurmd could not set up environment for batch job
  Oct  5 03:28:57 holy-slurm01 slurmctld[16816]: update_node: node holy2a17201 reason set to: batch job complete failure
  Oct  5 03:28:57 holy-slurm01 slurmctld[16816]: update_node: node holy2a17201 state set to DRAINING


And on the compute node:

  Oct  5 03:28:53 holy2a17201 slurmstepd[39496]: scaling CPU count by factor of 2
  Oct  5 03:28:57 holy2a17201 slurmstepd[39496]: error: slurm_container_create: No such file or directory
  Oct  5 03:28:57 holy2a17201 slurmstepd[39496]: error: job_manager exiting abnormally, rc = 4014
  Oct  5 03:28:57 holy2a17201 slurmstepd[39496]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014
  Oct  5 03:28:57 holy2a17201 slurmd[holy2a17201][39283]: done with job
  Oct  5 03:28:57 holy2a17201 slurmstepd[39496]: done with job

I'm guessing the ENOENT is coming from one of the xcgroup_*() functions, or something deeper in their call chains. This seems more likely to happen when processing large numbers of short-lived jobs (under an hour, probably on the order of minutes or less; we originally thought it was picking up on errored user jobs).

This harkens back to https://groups.google.com/forum/#!msg/slurm-devel/LzcYLVM85-c/6nZWkQ31dPkJ, which looks like it was already fixed(?). Maybe there's another cgroups race here.
Comment 1 Danny Auble 2013-10-09 07:38:57 MDT
John, how easy is this to reproduce?

Could you send me your cgroup.conf?
Comment 2 John Morrissey 2013-10-09 08:50:54 MDT
(In reply to Danny Auble from comment #1)
> John, how easy is this to reproduce?

We see a large handful of nodes affected by this in a given week. I don't have an easily reproducible workload that triggers it, though. We're mostly backtracking through symptoms and correlating with the kinds of jobs we see running on affected nodes.

> Could you send me your cgroup.conf?

It's pretty simple:

--
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=no
ConstrainRAMSpace=no
--
Comment 3 Moe Jette 2013-10-22 04:06:26 MDT
Studying the code in src/plugins/proctrack/cgroup/proctrack_cgroup.c, there appears to be a possible race condition. When a job step starts, three cgroups are created: one each for the user, job, and step. When the step ends, those cgroups are deleted in the reverse order. I'm guessing that a step ending at the same time as another step is starting could result in a cgroup being deleted by one slurmstepd process while the other is going through a sequence of creates. Since this work is being performed by two different processes, the best solution is probably to add retry logic to the cgroup create sequence (e.g. if the job cgroup cannot be created because the user cgroup was removed after we created it, then recreate it). It seems pretty straightforward.
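The retry idea described above can be sketched roughly as follows. This is an illustration, not the actual patch: plain mkdir() stands in for the Slurm xcgroup_*() helpers, and create_step_hierarchy() is an invented name (MAX_CGROUP_RETRY does appear in the real patch, where it is tunable).

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_CGROUP_RETRY 3	/* illustrative; the patch makes this larger */

/* mkdir() that treats an already-existing directory as success,
 * standing in for the xcgroup create helpers. */
static int create_dir(const char *path)
{
	if (mkdir(path, 0755) && (errno != EEXIST))
		return -1;
	return 0;
}

/* Create the user/job/step hierarchy. If a child mkdir fails with
 * ENOENT, its parent directory was deleted by a concurrently exiting
 * step, so restart the whole create sequence from the top. */
static int create_step_hierarchy(const char *user_dir, const char *job_dir,
				 const char *step_dir)
{
	int i;

	for (i = 0; i < MAX_CGROUP_RETRY; i++) {
		if (create_dir(user_dir))
			continue;
		if (create_dir(job_dir)) {
			if (errno == ENOENT)
				continue;	/* user cgroup vanished: retry */
			return -1;
		}
		if (create_dir(step_dir)) {
			if (errno == ENOENT)
				continue;	/* job cgroup vanished: retry */
			return -1;
		}
		return 0;
	}
	return -1;
}
```

The key point is that the retry restarts from the top of the hierarchy, since any ancestor may have been removed between the create calls.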
Comment 4 Moe Jette 2013-10-22 07:48:51 MDT
Created attachment 475 [details]
fix for proctrack/cgroup race condition
Comment 5 Moe Jette 2013-10-22 07:50:04 MDT
Add retry logic to cgroup creation in case one job or step is starting while another is ending at the same time for the same user.
Comment 6 John Morrissey 2013-11-18 02:32:15 MST
We're still seeing this occasionally with 2.6.3rc3+attachment 475 [details], but not as often as before. The log output is similar; for example, this node spawned 30 jobs and about seven seconds later, the cgroups setup failed for one of them:

Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: error: slurm_container_create: No such file or directory
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: error: job_manager exiting abnormally, rc = 4014
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: Message thread exited
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: job 3228660 completed with slurm_rc = 4014, job_rc = 0
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014
Comment 7 Moe Jette 2013-11-24 12:14:07 MST
The problem would be most common with a single user that has many simultaneous, short-lived job steps on one node.

I'm afraid that cgroup operations are painfully slow, and I hate to substantially modify the logic in version 2.6. I would like to suggest that you increase the retry count in the patch (MAX_CGROUP_RETRY; make it as large as you like, say 800), and I will plan to rewrite the logic to use file locks for the next major release (14.03 in late March 2014).
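The file-lock approach suggested here could look roughly like the sketch below. This is a hedged illustration, not the 2.6.5 patch itself: with_cgroup_lock(), touch_file(), and the lock-file path are invented names; the real code serializes the create/delete sequences inside proctrack_cgroup.c.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

/* Run op(arg) while holding an exclusive advisory lock, so that two
 * slurmstepd processes cannot interleave their cgroup creates and
 * deletes for the same user. */
static int with_cgroup_lock(const char *lockfile, int (*op)(void *),
			    void *arg)
{
	int rc;
	int fd = open(lockfile, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX)) {	/* blocks until the peer unlocks */
		close(fd);
		return -1;
	}
	rc = op(arg);
	(void) flock(fd, LOCK_UN);
	close(fd);
	return rc;
}

/* Trivial stand-in for a cgroup create/delete sequence. */
static int touch_file(void *path)
{
	int fd = open((const char *) path, O_CREAT | O_WRONLY, 0600);

	if (fd < 0)
		return -1;
	close(fd);
	return 0;
}
```

Unlike the retry approach, the lock makes the whole create (or delete) sequence atomic with respect to the other stepd, so no ancestor cgroup can vanish mid-sequence.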
Comment 8 Moe Jette 2013-11-29 04:47:41 MST
Created attachment 540 [details]
add locking to prevent race condition
Comment 9 Moe Jette 2013-11-29 04:50:26 MST
This patch will be included in version 2.6.5 and replaces the retry logic in the previous patch with file locking. The second patch can be applied directly on top of version 2.6.4, or on top of the first patch on earlier versions of Slurm.
https://github.com/SchedMD/slurm/commit/3f6d9e3670cd931d987cb65e53e2cfbb4c153eb5.patch

I am going to close this bug assuming this second patch fixes the problem. Please re-open the bug if problems persist.
Comment 10 Moe Jette 2013-11-29 05:28:45 MST
Created attachment 541 [details]
revised locking patch

Variant of the previous patch to match the variable names and logic already in the version 14.03 code base.