When the cgroups proctrack plugin is enabled, nodes are being drained sporadically due to an ENOENT passed up to slurm_container_create(). On the master:

Oct 5 03:28:49 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921643.0 holy2a17201 usec=62979
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921628.0 holy2a17201 usec=84454
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921631.0 holy2a17201 usec=84888
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921640.0 holy2a17201 usec=85446
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921638.0 holy2a17201 usec=50547
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921636.0 holy2a17201 usec=536095
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921627.0 holy2a17201 usec=536344
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921642.0 holy2a17201 usec=92872
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921637.0 holy2a17201 usec=57975
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921634.0 holy2a17201 usec=38049
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921629.0 holy2a17201 usec=38515
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921626.0 holy2a17201 usec=39155
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921630.0 holy2a17201 usec=39645
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921635.0 holy2a17201 usec=40084
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921633.0 holy2a17201 usec=41559
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921641.0 holy2a17201 usec=42813
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921639.0 holy2a17201 usec=484
Oct 5 03:28:50 holy-slurm01 slurmctld[16816]: sched: _slurm_rpc_job_step_create: StepId=1921632.0 holy2a17201 usec=390
Oct 5 03:28:57 holy-slurm01 slurmctld[16816]: error: slurmd error running JobId=1921631 on node(s)=holy2a17201: Slurmd could not set up environment for batch job
Oct 5 03:28:57 holy-slurm01 slurmctld[16816]: update_node: node holy2a17201 reason set to: batch job complete failure
Oct 5 03:28:57 holy-slurm01 slurmctld[16816]: update_node: node holy2a17201 state set to DRAINING

And on the compute node:

Oct 5 03:28:53 holy2a17201 slurmstepd[39496]: scaling CPU count by factor of 2
Oct 5 03:28:57 holy2a17201 slurmstepd[39496]: error: slurm_container_create: No such file or directory
Oct 5 03:28:57 holy2a17201 slurmstepd[39496]: error: job_manager exiting abnormally, rc = 4014
Oct 5 03:28:57 holy2a17201 slurmstepd[39496]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014
Oct 5 03:28:57 holy2a17201 slurmd[holy2a17201][39283]: done with job
Oct 5 03:28:57 holy2a17201 slurmstepd[39496]: done with job

I'm guessing the ENOENT is coming from one of the xcgroup_*() functions, or something deeper in their call chains. This seems more likely to happen when processing large numbers of short-lived jobs (under an hour, probably on the order of minutes or less; we originally thought it was picking up on errored user jobs). This harkens back to https://groups.google.com/forum/#!msg/slurm-devel/LzcYLVM85-c/6nZWkQ31dPkJ, which looks like it was already fixed(?). Maybe there's another cgroups race here.
John, how easy is this to reproduce? Could you send me your cgroup.conf?
(In reply to Danny Auble from comment #1)
> John, how easy is this to reproduce?

We see a large handful of nodes affected by this in a given week. I don't have an easily reproducible workload that triggers it, though. We're mostly backtracking through symptoms and correlating with the kinds of jobs we see being run by affected nodes.

> Could you send me your cgroup.conf?

It's pretty simple:
--
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=no
ConstrainRAMSpace=no
--
Studying the code in src/plugins/proctrack/cgroup/proctrack_cgroup.c, there appears to be a possible race condition. When a job step starts, three cgroups are created: one each for the user, the job, and the step. When the step ends, those cgroups are deleted in the reverse order. I'm guessing that a step ending at the same time as another step is starting could result in a cgroup being deleted by one slurmstepd process while the other is going through its sequence of creates. Since this work is performed by two different processes, the best solution is probably to add retry logic to the cgroup create sequence (e.g. if the job cgroup cannot be created because the user cgroup was removed after we created it, then recreate the user cgroup and try again). It seems pretty straightforward.
Created attachment 475 [details]
fix for proctrack/cgroup race condition
Add retry logic to cgroup creation in case one job or step is starting while another is ending at the same time for the same user.
We're still seeing this occasionally with 2.6.3rc3 + attachment 475 [details], but not as often as before. The log output is similar; for example, this node spawned 30 jobs, and about seven seconds later the cgroups setup failed for one of them:

Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: error: slurm_container_create: No such file or directory
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: error: job_manager exiting abnormally, rc = 4014
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: Message thread exited
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: job 3228660 completed with slurm_rc = 4014, job_rc = 0
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014
The problem would be most common with a single user that has lots of simultaneous, short-lived job steps on one node. I'm afraid that cgroup operations are painfully slow, and I hate to substantially modify the logic in version 2.6. I would like to suggest that you increase the retry count in the patch (MAX_CGROUP_RETRY; make it as large as you like, say 800), and I will plan to re-write the logic to use file locks for the next major release (14.03, in late March 2014).
Created attachment 540 [details]
add locking to prevent race condition
This patch will be included in version 2.6.5 and replaces the retry logic of the previous patch with file locking. The second patch can be applied directly on top of version 2.6.4, or on top of the first patch on earlier versions of Slurm. https://github.com/SchedMD/slurm/commit/3f6d9e3670cd931d987cb65e53e2cfbb4c153eb5.patch

I am going to close this bug on the assumption that this second patch fixes the problem. Please re-open the bug if problems persist.
Created attachment 541 [details]
revised locking patch

A variant of the previous patch that matches the variable names and logic already in the version 14.03 code base.