| Summary: | jobacct_mem_limit appears to be calculated incorrectly, resulting in the job being killed | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Sam Gallop <sam.gallop> |
| Component: | Contributions | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 14.11.4 | | |
| Hardware: | Other | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.11.12 15.08.11 16.05.0-pre3 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments:

- slurmd.conf for virtual node n80546a
- SLURM configuration file
- cgroup Configuration File
- Patch
Created attachment 3022 [details]: SLURM configuration file

Created attachment 3023 [details]: cgroup Configuration File

Created attachment 3032 [details]: Patch
Thanks for the patch! Committed with a NEWS entry and a shorter commit message as fe85cc35470. We do not plan on ever tagging a 14.11.12 release, but this will be included in 15.08.11 and 16.05-pre3 when released.

cheers,
- Tim
Created attachment 3021 [details]: slurmd.conf for virtual node n80546a

Hi,

I noticed that in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c) the jobacct_mem_limit can be mis-calculated. This becomes apparent when submitting a job requesting 4194304 MB of memory or more.

In slurmd.log (with debug on) it is possible to see the parameters of the job being passed - in this case I requested 4380364 MB ...

    [2016-04-22T08:26:23.895] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm'
    [2016-04-22T08:26:23.895] [133] slurm cgroup /slurm successfully created for ns memory: File exists
    [2016-04-22T08:26:23.896] [133] parameter 'notify_on_release' set to '1' for '/cgroup/memory/slurm/uid_3748'
    [2016-04-22T08:26:23.896] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748'
    [2016-04-22T08:26:23.896] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_3748/job_133'
    [2016-04-22T08:26:23.896] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748/job_133'
    [2016-04-22T08:26:23.896] [133] parameter 'memory.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133'
    [2016-04-22T08:26:23.896] [133] parameter 'memory.memsw.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133'
    [2016-04-22T08:26:23.896] [133] task/cgroup: /slurm/uid_3748/job_133: alloc=4380364MB mem.limit=4380364MB memsw.limit=4380364MB
    [2016-04-22T08:26:23.897] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
    [2016-04-22T08:26:23.897] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
    [2016-04-22T08:26:23.897] [133] parameter 'memory.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
    [2016-04-22T08:26:23.897] [133] parameter 'memory.memsw.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
    [2016-04-22T08:26:23.897] [133] task/cgroup: /slurm/uid_3748/job_133/step_batch: alloc=4380364MB mem.limit=4380364MB memsw.limit=4380364MB

SLURM then appears to set a limit of 186060 MB, much lower than what I requested ...

    [2016-04-22T08:26:24.899] [133] Job 133 memory used:2523564 limit:190525440 KB

As the executable consumes memory, the job exceeds this newly imposed limit and is cancelled by SLURM ...

    [2016-04-22T08:27:43.904] [133] Job 133 memory used:200634796 limit:190525440 KB
    [2016-04-22T08:27:43.904] [133] Job 133 exceeded memory limit (200634796 > 190525440), being killed

It appears that the line below in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c) results in an integer overflow: mem_limit is a uint32_t, so the product in KB wraps around once it exceeds 4,294,967,295 ...

    jobacct_mem_limit = mem_limit * 1024; /* MB to KB */

Type-casting mem_limit to uint64_t as part of the calculation appears to resolve the issue ...

    jobacct_mem_limit = (uint64_t)mem_limit * 1024; /* MB to KB */
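For illustration only, here is a minimal, self-contained C sketch (not the actual Slurm code; it reuses the mem_limit value from the report above) showing how the 32-bit multiplication wraps and how the uint64_t cast preserves the full value:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Requested job memory in MB, as in the report above. */
    uint32_t mem_limit = 4380364;

    /* Original expression: the multiplication is done in 32 bits, so
     * 4380364 * 1024 = 4485492736 KB wraps modulo 2^32 before being
     * assigned, leaving 190525440 KB (~186060 MB) - the limit seen in
     * the slurmd log. */
    uint64_t wrapped = mem_limit * 1024;

    /* Patched expression: casting one operand to uint64_t promotes the
     * multiplication to 64 bits, so the full 4485492736 KB survives. */
    uint64_t correct = (uint64_t)mem_limit * 1024;

    printf("wrapped: %" PRIu64 " KB\n", wrapped);  /* 190525440 */
    printf("correct: %" PRIu64 " KB\n", correct);  /* 4485492736 */
    return 0;
}
```

This also explains the 4194304 MB threshold mentioned above: 4194304 MB * 1024 is exactly 2^32 KB, the first value at which the 32-bit product wraps.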