| Summary: | Issue with cgroup: unable to write 5 bytes to cgroup | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | Configuration | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 20.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | cgroup_allowed_devices_file.conf, cgroup.conf, slurm.conf | | |
Description

ARC Admins, 2021-09-22 11:09:46 MDT

Created attachment 21387 [details]: cgroup.conf
Created attachment 21388 [details]: slurm.conf
---

Michael Hinton:

(In reply to ARCTS Admins from comment #0)
> We saw this message recently in our slurmd logs:
>
> [2021-08-25T21:33:25.999] [594135.batch] error: _file_write_content: unable
> to write 5 bytes to cgroup
> /sys/fs/cgroup/devices/slurm/uid_114200629/job_594135/devices.allow: Invalid
> argument

Well, _file_write_content() does a Linux write() on that file path. I'm guessing write() is returning an EINVAL because the file descriptor is pointing to a path that doesn't exist anymore. Maybe that cgroup hierarchy already got cleaned up by a different cgroup plugin, or something like that.

The good news is that in 21.08, we overhauled how our cgroups code works to avoid these types of edge cases and fleeting errors. My guess is that this is one of those errors: it complains about something, but doesn't really have any detrimental effects. Are you able to identify any detrimental effects other than these error messages?

For now, I think the solution is to live with the error messages and upgrade to 21.08 when able. If you upgrade to 21.08 and still see these errors, though, then that is something we will definitely track down.

Thanks!
-Michael

---

ARC Admins (David):

(In reply to Michael Hinton from comment #4)
> The good news is that in 21.08, we overhauled how our cgroups code works to
> avoid these types of edge cases and fleeting errors.

I saw that in SLUG 2021. I'm excited to get to 21.08!

> My guess is that this is one of those errors - it complains about something,
> but doesn't really have any detrimental effects. Are you able to identify
> any detrimental effects other than these error messages?

The array of 5000 is failing, whereas smaller arrays are succeeding. They are going to try arrays of 1000 and 2500 and report back.

David

---

Michael Hinton:

Hi David,

(In reply to ARCTS Admins from comment #5)
> The array of 5000 is failing, whereas smaller arrays are succeeding.
> They are going to try arrays of 1000 and 2500 and report back.

Your jobs are failing? I wouldn't expect these errors to fail the job, since most likely they just print out and the job continues executing. So I'm wondering if the jobs are failing for some other reason. Can you give me some logs for these failed jobs? (slurmd.log and slurmctld.log snippets, as well as the job output.)

Thanks,
-Michael

---

Michael Hinton:

David,

I'll go ahead and close this out as "info given," since I believe these errors are likely harmless and will be resolved in 21.08. However, feel free to reopen if you want to pursue the failures mentioned in comment 6.

Thanks!
-Michael

---

Michael Hinton:

David,

So the real solution here is to remove cgroup_allowed_devices_file.conf completely. It used to be required, but it has not been used since around Slurm 17.02. The Files= option in gres.conf now forms a whitelist of the device files that a job is allowed to use.

Starting in 21.08, however, cgroup_allowed_devices_file.conf started emitting these errors with the new cgroup code changes. We are looking into disabling cgroup_allowed_devices_file.conf more thoroughly so that Slurm doesn't even read it in to begin with.

-Michael

---

ARC Admins (David):

(In reply to Michael Hinton from comment #8)
> So the real solution here is to remove the cgroup_allowed_devices_file.conf
> completely. [...]

Michael,

Thanks for this! We plan to remove this file (cgroup_allowed_devices_file.conf) for this bug and a related one (bug 7390). I'll report back on how it goes once we've implemented it.

Best,
David
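The resolution described in the last two comments amounts to deleting cgroup_allowed_devices_file.conf, dropping any reference to it from cgroup.conf, and relying on the per-job device whitelist built from gres.conf instead. A hypothetical sketch of the two files after the change (the GRES names and device paths are placeholders, not taken from this site's attached configs):

```ini
# cgroup.conf: keep device constraint enabled, but stop pointing at the
# now-unused allowed-devices file (shown commented out for illustration)
ConstrainDevices=yes
# AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

# gres.conf: File= entries now form the whitelist of device files
# a job may access (placeholder GPU devices shown)
Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
```

With this layout, jobs that request a GRES are granted access only to the matching File= devices, which replaces the old site-wide allowed-devices list.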
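For context, the failing call discussed above is essentially a plain write(2) to a cgroup control file. The following is a minimal sketch of that pattern, not Slurm's actual _file_write_content() (whose internals differ); it simply shows how a write to a cgroup path can fail with an errno such as EINVAL and produce an error message without otherwise affecting the caller:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper mirroring the pattern: open a cgroup control
 * file, write a small buffer, and log (but tolerate) any failure. */
int file_write_content(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0) {
		fprintf(stderr, "unable to open %s: %s\n",
			path, strerror(errno));
		return -1;
	}
	ssize_t n = write(fd, buf, len);
	if (n < 0)
		fprintf(stderr,
			"unable to write %zu bytes to cgroup %s: %s\n",
			len, path, strerror(errno));
	close(fd);
	return (n < 0) ? -1 : 0;
}
```

If the kernel has already torn down the cgroup directory (as speculated above), the write fails, the error is logged, and execution continues, which is consistent with the observation that the message by itself need not fail the job.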