| Summary: | cray task plugin unable to successfully _get_numa_nodes at step termination | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmstepd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | david.gloe |
| Version: | 16.05.5 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14870 | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 16.05.6 17.02-pre3 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
detailed slurmd log with task cgroup enabled
detailed slurmd log without task cgroup enabled
slurmd log with ONLY task/cray and NO jobacctgather |
||
|
Description
Doug Jacobsen
2016-10-09 00:20:14 MDT
Created attachment 3574 [details]
detailed slurmd log with task cgroup enabled
had TaskPlugins=task/affinity,task/cgroup,task/cray
JobAcctGather/cgroup also enabled
Created attachment 3575 [details]
detailed slurmd log without task cgroup enabled
TaskPlugin=task/affinity,task/cray
JobAcctGather/cgroup still in use.
I'm seeing the issue regardless of whether or not the cgroup plugin is enabled. The jobacctgather plugin appears to be attempting to remove unrelated cgroups. I did confirm that edison running 16.05.4 does not have this issue (it does not use the affinity plugin, however). The affinity plugin does not seem to be a factor either (I doubted it would be). Of course, the other big difference between edison and cori is that cori is on CLE6 (kernel 3.18), but I don't think we were having this issue earlier on cori (though I cannot prove it). So I'm more confused than I was earlier about what is deleting the cpuset, and when.

I modified the cray release agent on this node to see if it was responsible for deleting the cgroup; it isn't. The release agent is getting triggered after the error message, and the directory already does not exist.

nid00021:/dev/cpuset # cat /sbin/cpuset_release_agent
#!/bin/sh
tmstamp=$(date +"%Y-%m-%dT%H:%M:%S.%N")
isthere=$(ls /dev/cpuset/$1 2>&1)
/bin/rmdir /dev/cpuset/$1
state=$?
echo "$tmstamp: releasing $1 cpuset: $state, $isthere" >> /tmp/release

(the typical cray release agent only has the rmdir line)

From /tmp/release:

2016-10-09T01:33:05.528733750: releasing /slurm/uid_56094/job_4593/step_batch cpuset: 1, ls: cannot access /dev/cpuset//slurm/uid_56094/job_4593/step_batch: No such file or directory
2016-10-09T01:33:05.610443333: releasing /slurm/uid_56094/job_4593/step_extern cpuset: 1, ls: cannot access /dev/cpuset//slurm/uid_56094/job_4593/step_extern: No such file or directory

From slurmd:

[2016-10-09T01:33:05.338] [4593.0] error: (task_cray.c: 716: _get_numa_nodes) Failed to open file /dev/cpuset/slurm/uid_56094/job_4593/step_0/mems: No such file or directory
...
[2016-10-09T01:33:05.520] [4593] error: (task_cray.c: 716: _get_numa_nodes) Failed to open file /dev/cpuset/slurm/uid_56094/job_4593/step_batch/mems: No such file or directory
...
[2016-10-09T01:33:05.608] [4593.4294967295] error: (task_cray.c: 716: _get_numa_nodes) Failed to open file /dev/cpuset/slurm/uid_56094/job_4593/step_extern/mems: No such file or directory

Tim Wickberg:

(In reply to Doug Jacobsen from comment #3)
> I'm seeing the issue regardless of whether or not the cgroup plugin is
> enabled. The jobacctgather plugin appears to be attempting to remove
> unrelated cgroups. I did confirm that edison running 16.05.4 does not have
> this issue (it does not use the affinity plugin, however).

This might be a stupid question, but I'm not very familiar with how the Cray task mechanisms are all set up: if you don't have task/cgroup, what would create the cgroup for the task/cray plugin?

If the answer is "nothing", I think you could explain this in two separate ways:

- With task/cgroup, the new cleanup code is removing the cpuset hierarchy before the task/cray plugin cleans up (as of 16.05.5).

- Without it enabled, the cpuset is never created, and task/cray can't find it to clean up.

I think you get the same symptom either way.

I'll look into whether this could be cached or avoided somehow. While it'd be easy enough to #ifdef out, I'd rather keep the cgroup cleanup code enabled on all systems, just to limit the discrepancies between normal and native Cray operation if possible.

FWIW, 16.05.5 and later should not need the ReleaseAgent set; that was the reason for back-porting commit 66beca68217, as systemd has a habit of removing slurmd's release_agent mount option seemingly at random.

Doug Jacobsen:

There are some libjob calls that the cray task plugin makes which, I believe, will create the cgroup; not 100% sure. I am sure that simply adding a task to a cgroup on cray is insufficient; they've rewritten a lot of the cgroup code in the kernel.

I'm starting to suspect the change to the affinity plugin more: it had code to explicitly rmdir the cgroup (not using xcgroup_delete). I'm guessing my 2am testing that cleared it may have been flawed.
I'm raising the priority of this a bit (also for myself).
I need to get the node compaction stuff working before cori can come back into service (which is about 1.5 weeks or less away), and until the cause is identified I'm hesitant to update edison to 16.05.5, which leaves edison exposed to a number of bugs that have since been resolved. I'll try to see if I can help pin down exactly what is removing the cpuset, but we may need to consider alternatives. I've tried disabling both the affinity and cgroup plugins and still see this behavior.

Danny Auble:

Doug, just out of curiosity: if you enable all your task plugins and switch the jobacct_gather plugin off cgroups, does everything work as you would expect?

Doug Jacobsen:

I tried that as well, but same result (I believe; I did a lot of testing and it was all 2am testing, so it's all suspect).

I'm seeing this as well on a Cray internal system. For sbatch jobs this error appears in the user's output. That's causing some of our test suite tests to fail.

Hello,

I just built 16.05.5 with a couple of patches (i.e., bug #3185) for our alva test system (the test system for edison).
Still getting this:

[2016-10-18T14:13:05.039] [293.0] core_spec/cray: init
[2016-10-18T14:13:05.179] [293.0] (switch_cray.c: 656: switch_p_job_init) gres_cnt: 2072 0
[2016-10-18T14:13:05.242] [293.0] task/cgroup: /slurm/uid_56094/job_293: alloc=64000MB mem.limit=64000MB memsw.limit=unlimited
[2016-10-18T14:13:05.242] [293.0] task/cgroup: /slurm/uid_56094/job_293/step_0: alloc=64000MB mem.limit=64000MB memsw.limit=unlimited
[2016-10-18T14:13:05.246] [293.0] in _window_manager
[2016-10-18T14:13:05.311] [293.0] Created file /var/opt/cray/alps/spool/status293
[2016-10-18T14:13:24.982] [293.0] Unlinked /var/opt/cray/alps/spool/status293
[2016-10-18T14:13:24.982] [293.0] error: (task_cray.c: 716: _get_numa_nodes) Failed to open file /dev/cpuset/slurm/uid_56094/job_293/step_0/mems: No such file or directory
[2016-10-18T14:13:24.982] [293.0] error: (task_cray.c: 507: task_p_post_step) get_numa_nodes failed. Return code: -1
[2016-10-18T14:13:24.996] [293.0] done with job
[2016-10-18T14:13:25.753] [293.4294967295] error: (task_cray.c: 716: _get_numa_nodes) Failed to open file /dev/cpuset/slurm/uid_56094/job_293/step_extern/mems: No such file or directory
[2016-10-18T14:13:25.753] [293.4294967295] error: (task_cray.c: 507: task_p_post_step) get_numa_nodes failed. Return code: -1
[2016-10-18T14:13:25.758] [293.4294967295] done with job

So we're seeing the same behavior in CLE5.2 and CLE6 (one candidate for the cgroup removal was a change in the kernel, for example). Also, alva is still running the NHC whereas cori is not. So this may narrow the search space.

-Doug

Hello,

I modified, as a test, the gerty slurm.conf to set:

TaskPlugin=task/cray

and commented out all the jobacctgather stuff, so it was disabled. I stopped and restarted all slurmd and slurmctld (just for good measure!), and the problem still occurs. I'm attaching a debugging version of the slurmd log.
I'm also pretty sure that the release_agent isn't deleting the cgroup:

nid00021:~ # cat /dev/cpuset/release_agent
/sbin/cpuset_release_agent
nid00021:~ # cat /sbin/cpuset_release_agent
#!/bin/sh
timestamp=$(date +%Y-%m-%dT%H:%M:%S)
echo "$timestamp: released $1 via release agent" >> /tmp/cpuset_release
/bin/rmdir /dev/cpuset/$1
nid00021:~ # cat /tmp/cpuset_release
cat: /tmp/cpuset_release: No such file or directory
nid00021:~ #

Thanks,
Doug

Created attachment 3606 [details]
slurmd log with ONLY task/cray and NO jobacctgather
OK, so the reason task/cray alone still gives the error is a different one: the cpuset cgroup was never created.

nid00021:~ # jstat -a
JID                OWNER        COMMAND
------------------ ------------ --------------------------------
0x0000000000000160 dmj          sleep
0x0000000000000161 dmj          /bin/bash
0x0000000000000162 dmj          /usr/bin/sleep
nid00021:~ # ls -l /dev/cpuset/slurm
total 0
-rw-r--r-- 1 root root 0 Oct 13 17:40 cgroup.clone_children
--w--w--w- 1 root root 0 Oct 13 17:40 cgroup.event_control
-rw-r--r-- 1 root root 0 Oct 13 17:40 cgroup.procs
-rw-r--r-- 1 root root 0 Oct 13 17:40 cpu_exclusive
-rw-r--r-- 1 root root 0 Oct 13 17:40 cpus
-rw-r--r-- 1 root root 0 Oct 13 17:40 expected_usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 13 17:40 mem_exclusive
-rw-r--r-- 1 root root 0 Oct 13 17:40 mem_hardwall
-rw-r--r-- 1 root root 0 Oct 13 17:40 memory_migrate
-r--r--r-- 1 root root 0 Oct 13 17:40 memory_pressure
-rw-r--r-- 1 root root 0 Oct 13 17:40 memory_spread_page
-rw-r--r-- 1 root root 0 Oct 13 17:40 memory_spread_slab
-rw-r--r-- 1 root root 0 Oct 13 17:40 mems
-rw-r--r-- 1 root root 0 Oct 13 17:40 notify_on_release
-rw-r--r-- 1 root root 0 Oct 13 17:40 sched_load_balance
-rw-r--r-- 1 root root 0 Oct 13 17:40 sched_relax_domain_level
-rw-r--r-- 1 root root 0 Oct 13 17:40 tasks
nid00021:~ #

My findings (all with NO jobacctgather enabled):

TaskPlugin=task/cray
  NO cpuset cgroup is created, so we get the error from the task/cray plugin because the cpuset never existed.

TaskPlugin=task/cgroup,task/cray
  The cpuset cgroup is created, but we get the error, presumably because the cpuset cgroup was deleted _before_ the task/cray step termination logic ran.

TaskPlugin=task/affinity,task/cray
  NO cpuset cgroup is created, so we get the error from the task/cray plugin because the cpuset never existed.

My findings with jobacctgather/cgroup:

TaskPlugin=task/cray
  NO cpuset cgroup is created (but cpu and cpuacct cgroups, possibly others, are), so we get the error from the task/cray plugin because the cpuset never existed.

TaskPlugin=task/cgroup,task/cray
  The cpuset cgroup is created, but we get the error from the task/cray plugin, presumably again because the cpuset was deleted before the termination logic ran.

TaskPlugin=task/affinity,task/cray
  NO cpuset cgroup is created, so we get the error from the task/cray plugin because the cpuset never existed.

Results:

1) jobacctgather/cgroup seems to be uninvolved
2) task/affinity seems to be uninvolved
3) the presence of task/cgroup creates the cpuset cgroup, but it doesn't last long enough (we think)
4) task/cray seems to absolutely rely on task/cgroup (I've always used both, but never thought of them as this tightly integrated)
5) in all tested cases we get the _get_numa_nodes error, either because the cpuset was never created, or because it didn't exist when the task/cray termination logic ran
6) the issue seems to be isolated to a change in the behavior of the task/cgroup plugin

Thanks for the patience working through this.

Tim Wickberg:

It's definitely commit 66beca68217. I'm trying to sort out a workaround (possibly caching the value as suggested), or rearranging the cleanup order to compensate. We're hoping to get time on kachina this week to test, or if you're willing to try a few patches on gerty I can have those over to you later today.

Doug Jacobsen:

Hi Tim,

I'm happy to test some patches. However, there is another, unexpected result: reordering the plugins has an impact.

TaskPlugin=task/cray,task/affinity,task/cgroup

seems to behave with respect to the job termination logic, so perhaps the ordering of the task plugins matters (I've never really considered it). This makes some sense reviewing the task_g_* functions in src/slurmd/common/task_plugin.c; I guess the order of the list is preserved from the config file.

I suppose the next question is whether there are any front-loaded interdependencies between task/cray and task/cgroup (perhaps none, since the libjob stuff was happy even without the cpuset; I may not have shared that result earlier).
-Doug

(In reply to Doug Jacobsen from comment #18)
> Hi Tim,
>
> I'm happy to test some patches. However there is another, unexpected result,
> reordering the plugins has an impact:
>
> TaskPlugin=task/cray,task/affinity,task/cgroup
>
> Seems to behave with respect to the job termination logic, so perhaps
> ordering of the task plugins matters (I've never really considered it).

I should have thought to ask you to change that. Yes, the ordering is preserved, and the plugins are always run through in that order. (Although I think it'd be structurally cleaner if they were "unwound" in the reverse order on cleanup: treat it as a stack of environmental modifications that needs to be backed off, rather than just a list to run through.)

I think I'd had some concern that moving cray up in the list might have some unexpected side effects, but looking at the task plugin, it doesn't really do much on launch, only on cleanup, so that's probably safe.

> This makes some sense reviewing the task_g_* functions in
> src/slurmd/common/task_plugin.c; I guess the order of the list is preserved
> from the config file.
>
> I suppose the next question is if there are any front-loaded
> interdependencies between task/cray and task/cgroup (perhaps none, since the
> libjob stuff was happy even without the cpuset, I may not have shared that
> result earlier).

Reading through it, I don't see any obvious reason why task/cray would need to happen after task/cgroup. That may actually be sufficient. If you don't see any issues in testing, I may make a minor patch to the plugin's init to throw a few error messages if task/cray is listed after task/cgroup, and make it explicit that task/cray depends on task/cgroup for proper operation.

Doug Jacobsen:

OK, there is the mysterious activity of the alpsc* functions in the task startup; perhaps David could clarify whether those will be OK running before cgroup.
On gerty and cori, I'm considering using:

TaskPlugin=task/affinity,task/cray,task/cgroup

and will put that in place in the next few minutes. On edison and alva, we don't use affinity (yet), so it'll just be cray,cgroup.

I think the reordering recommendation is fine. It looks like the cray documentation is a little clearer about the order, but the language could be strengthened about the relationship between the cgroup and cray task plugins. The slurm.conf entry on TaskPlugin is silent on the ordering piece.

I like the idea of forward and then reverse ordering; it makes sense for plugins that might have interdependencies. As you mention, some mechanism for generating warnings about out-of-order execution would be good (which may imply a more generalized interface in the multi-plugin setups, including spank, that would allow a developer to specify those dependencies).

Thanks,
Doug

(In reply to Doug Jacobsen from comment #20)
> I think the reordering recommendation is fine. It looks like the cray
> documentation is a little clearer about the order, but the language could be
> strengthened about the relationship between cgroup and cray task plugins.
> The slurm.conf entry on TaskPlugins is silent on the order piece.

I assume you're referring to http://slurm.schedmd.com/cray.html as the Cray documentation, or does Cray have some other notes you're working off? Obviously that page will get updated as soon as we're confident nothing breaks by running it ahead of cgroups.
> I like the idea of forward and then reverse ordering, it does make sense for
> plugins that might have inter-dependencies. As you mention, some mechanism
> for generating warnings about out-of-order execution would be good (which
> may imply a more generalized interface in the multi-plugin setups (including
> spank) that would allow a developer to specify those dependencies).

The ordering warning will go in as just a few strstr() calls in the task/cray init() to verify that the plugin order is correct and that the cgroup plugin is being used, triggering error() on 16.05 and fatal() on 17.02. Unless interaction between the various plugins becomes a significantly more common issue, it's not worth the extra hassle of abstracting out a plugin dependency system... Slurm has managed without one for 14 years so far. :)

We have documentation that goes over native Slurm here: https://pubs.cray.com/#/Collaborate/00328985-DD/DD00326571

We also have a slurm.conf template in the Slurm source at contribs/cray/csm/slurm.conf.j2. If we decide the reordering is the best solution, that template should be updated with the new ordering. As far as I know, the task_cray startup activities don't depend on cgroups.

Commit c3266fcae1a adds an error message to the task/cray plugin about this issue on the 16.05 branch, and updates the documentation and templates with the required ordering. Commit 7fcc20fa2ff on master changes this to a fatal() error instead.