Hi,

Using jobacct_gather/cgroup appears to scale the usage data reported by sacct by the number of tasks per node. For the following set of CPU-bound test jobs we would expect UserCPU to be roughly ntasks * walltime, but it ends up being roughly ntasks^2 * walltime:

    jobid    nodes x ntasks-per-node   usercpu      walltime
    6765457  1x1                       1:31:51      1:34:16
    6764424  1x2                       3:17:53      51:34
    6766943  1x8                       21:43:00     21:59
    6763262  1x16                      2-03:54:32   13:44

For example, the 1x2 job ran for 51:34 of walltime but reports 3:17:53 of UserCPU, which is close to 4x the walltime rather than the expected 2x.

This seems to be because the job accounting infrastructure expects accounting to be done per task (from slurmstepd/req.c):

    for (i = 0; i < job->node_tasks; i++) {
            temp_jobacct = jobacct_gather_stat_task(job->task[i]->pid);
            if (temp_jobacct) {
                    jobacctinfo_aggregate(jobacct, temp_jobacct);
                    jobacctinfo_destroy(temp_jobacct);
                    num_tasks++;
            }
    }

Each iteration ultimately calls jobacct_gather_cgroup.c:_prec_extra, which as far as I can tell returns accounting information for the whole step rather than for the individual task: with cgroup accounting all the task PIDs get lumped into a single step cgroup, with no differentiation between tasks for accounting purposes. Aggregating that step-wide figure once per task in the loop above is what causes the extra multiplication by the number of tasks.

Thanks for any help in solving this.

Martins
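To illustrate the failure mode, here is a minimal standalone sketch (not Slurm source; the cgroup path layout and the cpuacct_usage_ns helper are assumptions for illustration only) of what happens when a step-wide cpuacct counter is aggregated once per task:

    /* Minimal standalone sketch -- NOT Slurm source.  It reads a step-wide
     * cpuacct counter and adds it up once per task, the way the per-task
     * aggregation loop ends up doing when every task reports the same
     * step-level value.  The cgroup path below is an assumed example. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Read total CPU time (nanoseconds) charged to the given cgroup. */
    static unsigned long long cpuacct_usage_ns(const char *cgroup_dir)
    {
        char path[4096];
        unsigned long long usage = 0;
        FILE *fp;

        snprintf(path, sizeof(path), "%s/cpuacct.usage", cgroup_dir);
        fp = fopen(path, "r");
        if (!fp)
            return 0;
        if (fscanf(fp, "%llu", &usage) != 1)
            usage = 0;
        fclose(fp);
        return usage;
    }

    int main(int argc, char **argv)
    {
        /* Assumed step cgroup path; adjust to your hierarchy. */
        const char *step_cg = (argc > 1) ? argv[1] :
            "/sys/fs/cgroup/cpuacct/slurm/uid_1000/job_42/step_0";
        int ntasks = (argc > 2) ? atoi(argv[2]) : 16;
        unsigned long long per_step = cpuacct_usage_ns(step_cg);
        unsigned long long aggregated = 0;

        /* Mirrors the per-task loop: the same step-wide number is added
         * once for every task, so the total is ntasks times too large. */
        for (int i = 0; i < ntasks; i++)
            aggregated += per_step;

        printf("step usage: %llu ns, aggregated over %d tasks: %llu ns\n",
               per_step, ntasks, aggregated);
        return 0;
    }

With 16 CPU-bound tasks the step already accumulates about ntasks * walltime of real CPU time, so adding that once per task gives the ntasks^2 * walltime pattern seen in the table above.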
I should also mention that I have applied attachment 4185 from https://bugs.schedmd.com/show_bug.cgi?id=3531 to get cgroup accounting working at all. Before applying that patch we saw the same "0" values for memory that are reported in that bug.
Yeah, that patch now doesn't seem like the right way to fix this. Sorry for the confusion. I'll do some more testing on a stock 17.02 and try to come up with a better bug report.
Hi

I will try to improve this patch or find another solution for bug 3531.

Dominik
OK, thanks! I don't have a complete handle on it yet, but my best guess is a race condition when running all of:

    JobAcctGatherType = jobacct_gather/cgroup
    ProctrackType     = proctrack/cgroup
    TaskPlugin        = task/cgroup

With stock 17.02.03, when running those plugins with multiple tasks on a node, some PIDs get put in the task cgroup and some get put in the step cgroup. I believe that is the root cause.

Martins
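One quick way to check where the PIDs actually land is to dump /proc/<pid>/cgroup for each task of a running step. A small standalone sketch (not Slurm code; the exact cgroup path naming depends on your hierarchy):

    /* Standalone sketch -- not Slurm code.  Prints the cgroup membership of
     * each PID given on the command line, so you can see whether a task's
     * PID ended up under a task-level cgroup or only under the step cgroup.
     * Usage: ./pid_cgroup <pid> [<pid> ...] */
    #include <stdio.h>

    static void print_cgroups(const char *pid)
    {
        char path[64], line[4096];
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%s/cgroup", pid);
        fp = fopen(path, "r");
        if (!fp) {
            perror(path);
            return;
        }
        printf("PID %s:\n", pid);
        /* Each line is hierarchy-id:controllers:cgroup-path.  With the
         * cgroup plugins the path should contain the job and step (and,
         * if per-task cgroups are created, a task component as well --
         * the exact naming is an assumption, check your own hierarchy). */
        while (fgets(line, sizeof(line), fp))
            printf("  %s", line);
        fclose(fp);
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            print_cgroups(argv[i]);
        return 0;
    }

Running it against every task PID of a multi-task step should show whether some of them are attached only at the step level, as described above.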
This was solved with a different patch in 3531. *** This ticket has been marked as a duplicate of ticket 3531 ***
Great, thanks Danny!