Created attachment 3021 [details]
slurmd.conf for virtual node n80546a

Hi,

I noticed that in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c) the jobacct_mem_limit can be mis-calculated. This becomes apparent when submitting a job that requests 4194304MB of memory or more.

In the slurmd.log (with debug on) it is possible to see the parameters of the job being passed - in this case I requested 4380364MB ...

[2016-04-22T08:26:23.895] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm'
[2016-04-22T08:26:23.895] [133] slurm cgroup /slurm successfully created for ns memory: File exists
[2016-04-22T08:26:23.896] [133] parameter 'notify_on_release' set to '1' for '/cgroup/memory/slurm/uid_3748'
[2016-04-22T08:26:23.896] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748'
[2016-04-22T08:26:23.896] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] parameter 'memory.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] parameter 'memory.memsw.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] task/cgroup: /slurm/uid_3748/job_133: alloc=4380364MB mem.limit=4380364MB memsw.limit=4380364MB
[2016-04-22T08:26:23.897] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] parameter 'memory.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] parameter 'memory.memsw.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] task/cgroup: /slurm/uid_3748/job_133/step_batch: alloc=4380364MB mem.limit=4380364MB memsw.limit=4380364MB

SLURM then appears to set a limit of 186060MB, much lower than what I requested ...

[2016-04-22T08:26:24.899] [133] Job 133 memory used:2523564 limit:190525440 KB

As the executable consumes memory, the job exceeds this newly imposed limit and is cancelled by SLURM ...

[2016-04-22T08:27:43.904] [133] Job 133 memory used:200634796 limit:190525440 KB
[2016-04-22T08:27:43.904] [133] Job 133 exceeded memory limit (200634796 > 190525440), being killed

It appears that the line below in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c) results in an integer overflow, most likely because mem_limit is a uint32_t, so the product cannot hold values greater than 4,294,967,295 ...

jobacct_mem_limit = mem_limit * 1024; /* MB to KB */

Type-casting mem_limit to uint64_t as part of the calculation appears to resolve the issue ...

jobacct_mem_limit = (uint64_t)mem_limit * 1024; /* MB to KB */
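For illustration only, here is a minimal standalone C program (not Slurm code; the variable names simply mirror the report) showing how the 32-bit multiply wraps to exactly the 190525440 KB limit seen in the log, and how casting to uint64_t avoids it:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Requested memory in MB, as in the report (more than 4194304 MB). */
    uint32_t mem_limit = 4380364;
    uint64_t jobacct_mem_limit;

    /* Broken: mem_limit * 1024 is evaluated in 32-bit arithmetic and
     * wraps modulo 2^32 before being widened for the assignment. */
    jobacct_mem_limit = mem_limit * 1024;            /* MB to KB */
    printf("wrapped: %" PRIu64 " KB\n", jobacct_mem_limit);   /* 190525440 */

    /* Fixed: casting one operand to uint64_t forces a 64-bit multiply. */
    jobacct_mem_limit = (uint64_t)mem_limit * 1024;  /* MB to KB */
    printf("correct: %" PRIu64 " KB\n", jobacct_mem_limit);   /* 4485492736 */

    return 0;
}

4,485,492,736 KB minus 2^32 is 190,525,440 KB (about 186060 MB), which matches the limit SLURM actually enforced.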
Created attachment 3022 [details] SLURM configuration file
Created attachment 3023 [details] cgroup Configuration File
Created attachment 3032 [details] Patch
Thanks for the patch! Committed with a NEWS entry and a shorter commit message as fe85cc35470.

We do not plan on ever tagging a 14.11.12 release, but this will be included in 15.08.11 and 16.05-pre3 when released.

cheers,
- Tim