| Summary: | cgroup proctrack plugin sporadically drains nodes with ESLURMD_SETUP_ENVIRONMENT_ERROR | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | John Morrissey <jwm> |
| Component: | slurmd | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | da |
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Harvard University | Slinky Site: | --- |
| Attachments: | fix for proctrack/cgroup race condition | ||
| | add locking to prevent race condition | ||
| | revised locking patch | ||
Description (John Morrissey, 2013-10-09 07:05:14 MDT)
Comment (Danny Auble):
John, how easy is this to reproduce? Could you send me your cgroup.conf?

Comment (John Morrissey), in reply to Danny Auble from comment #1:
> John, how easy is this to reproduce?

We see a large handful of nodes affected by this in a given week. I don't have an easily reproducible workload that triggers it, though. We're mostly backtracking through symptoms and correlating them with the kinds of jobs being run on the affected nodes.

> Could you send me your cgroup.conf?

It's pretty simple:

--
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=no
ConstrainRAMSpace=no
--

Comment:
Studying the code in src/plugins/proctrack/cgroup/proctrack_cgroup.c, there appears to be a possible race condition. When a job step starts, three cgroups are created: one each for the user, the job, and the step. When the step ends, those cgroups are deleted in the reverse order. A step ending at the same time another step is starting could result in a cgroup being deleted by one slurmstepd process while the other is partway through its sequence of creates. Since this work is performed by two different processes, the best solution is probably to add retry logic to the cgroup create sequence (e.g. if the job cgroup cannot be created because the user cgroup was removed after we created it, then recreate the user cgroup). It seems pretty straightforward.

Created attachment 475 [details]
fix for proctrack/cgroup race condition
Add retry logic to cgroup creation, in case one job or step is starting while another is ending at the same time for the same user.

Comment (John Morrissey):
We're still seeing this occasionally with 2.6.3rc3 plus attachment 475 [details], though not as often as before. The log output is similar; for example, this node spawned 30 jobs, and about seven seconds later the cgroup setup failed for one of them:
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: error: slurm_container_create: No such file or directory
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: error: job_manager exiting abnormally, rc = 4014
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: Message thread exited
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: job 3228660 completed with slurm_rc = 4014, job_rc = 0
Nov 11 23:27:49 holy2a21108 slurmstepd[62787]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4014
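The retry approach discussed in this thread can be sketched roughly as below. This is an illustrative reconstruction, not the actual patch: the function name `create_with_retry` and the directory arguments are hypothetical stand-ins, and the real logic lives in src/plugins/proctrack/cgroup/proctrack_cgroup.c. Only `MAX_CGROUP_RETRY` is taken from the thread itself.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MAX_CGROUP_RETRY 800	/* retry count suggested in this thread */

/*
 * Create the user/job/step cgroup directory chain, retrying if a
 * concurrently exiting step removes an ancestor between our mkdir()
 * calls.  Hypothetical sketch of the retry approach, not the real code.
 */
static int create_with_retry(const char *user_dir, const char *job_dir,
			     const char *step_dir)
{
	for (int i = 0; i < MAX_CGROUP_RETRY; i++) {
		/* (Re)create the whole chain on every attempt. */
		if (mkdir(user_dir, 0755) && errno != EEXIST)
			continue;
		if (mkdir(job_dir, 0755) && errno != EEXIST)
			continue;	/* user dir may have been removed */
		if (mkdir(step_dir, 0755) && errno != EEXIST)
			continue;	/* job dir may have been removed */
		return 0;		/* all three levels now exist */
	}
	return -1;	/* gave up; caller reports the setup error */
}
```

Note the inherent weakness John's follow-up hints at: a removal can still land after the final mkdir() but before the directory is used, which is why the later patch moved to file locking instead of retries.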
Comment:
The problem would be most common with a single user who has many simultaneous, short-lived job steps on one node. I'm afraid that cgroup operations are painfully slow, and I hate to substantially modify the logic in version 2.6. I suggest that you increase the retry count in the patch (MAX_CGROUP_RETRY; make it as large as you like, say 800). I plan to re-write the logic to use file locks for the next major release (14.03, in late March 2014).

Created attachment 540 [details]
add locking to prevent race condition
This patch will be included in version 2.6.5 and replaces the retry logic of the previous patch with file locking. It can be applied directly on top of version 2.6.4, or on top of the first patch on earlier versions of Slurm:
https://github.com/SchedMD/slurm/commit/3f6d9e3670cd931d987cb65e53e2cfbb4c153eb5.patch

I am going to close this bug on the assumption that this second patch fixes the problem. Please re-open it if problems persist.

Created attachment 541 [details]
revised locking patch
A variant of the previous patch, adjusted to match the variable names and logic already in the version 14.03 code base.
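The file-locking approach that replaced the retries can be sketched as below. This is only an illustration of the idea under stated assumptions: the function name `locked_mkdir` and the lock-file path are hypothetical, and the referenced commit's actual locking scheme in the Slurm code may differ in detail. The point is that an exclusive advisory lock serializes create and delete of the shared user cgroup across slurmstepd processes, eliminating the race rather than retrying around it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Serialize cgroup directory create/delete across slurmstepd processes
 * with an advisory flock() on a shared lock file.  Illustrative sketch;
 * names and the lock-file path are hypothetical.
 */
static int locked_mkdir(const char *lock_path, const char *dir)
{
	int fd = open(lock_path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX)) {	/* blocks until we own the lock */
		close(fd);
		return -1;
	}
	int rc = 0;
	if (mkdir(dir, 0755) && errno != EEXIST)
		rc = -1;	/* no concurrent rmdir can race us here */
	flock(fd, LOCK_UN);
	close(fd);
	return rc;
}
```

A deleting process would take the same lock around its rmdir() sequence, so a creator can never observe a half-torn-down hierarchy.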