Ticket 2660

Summary: jobacct_mem_limit appears to be calculated incorrectly, resulting in jobs being killed
Product: Slurm
Reporter: Sam Gallop <sam.gallop>
Component: Contributions
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue    
Priority: ---    
Version: 14.11.4   
Hardware: Other   
OS: Linux   
Site: -Other-
Version Fixed: 14.11.12 15.08.11 16.05.0-pre3
Attachments: slurmd.conf for virtual node n80546a
SLURM configuration file
cgroup Configuration File
Patch

Description Sam Gallop 2016-04-25 02:08:42 MDT
Created attachment 3021
slurmd.conf for virtual node n80546a

Hi,

I noticed that in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c) the value of jobacct_mem_limit can be miscalculated.  This becomes apparent when submitting a job requesting 4194304MB of memory or more.

In the slurmd.log (with debug on) it is possible to see the parameters of the job being passed - in this case I've requested 4380364MB ...
[2016-04-22T08:26:23.895] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm'
[2016-04-22T08:26:23.895] [133] slurm cgroup /slurm successfully created for ns memory: File exists
[2016-04-22T08:26:23.896] [133] parameter 'notify_on_release' set to '1' for '/cgroup/memory/slurm/uid_3748'
[2016-04-22T08:26:23.896] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748'
[2016-04-22T08:26:23.896] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] parameter 'memory.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] parameter 'memory.memsw.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133'
[2016-04-22T08:26:23.896] [133] task/cgroup: /slurm/uid_3748/job_133: alloc=4380364MB mem.limit=4380364MB memsw.limit=4380364MB
[2016-04-22T08:26:23.897] [133] parameter 'notify_on_release' set to '0' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] parameter 'memory.use_hierarchy' set to '1' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] parameter 'memory.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] parameter 'memory.memsw.limit_in_bytes' set to '4593144561664' for '/cgroup/memory/slurm/uid_3748/job_133/step_batch'
[2016-04-22T08:26:23.897] [133] task/cgroup: /slurm/uid_3748/job_133/step_batch: alloc=4380364MB mem.limit=4380364MB memsw.limit=4380364MB

SLURM then appears to set a limit of 186060MB, much lower than what I requested ...
[2016-04-22T08:26:24.899] [133] Job 133 memory used:2523564 limit:190525440 KB

As the executable consumes memory, the job exceeds this newly imposed limit and is cancelled by SLURM ...
[2016-04-22T08:27:43.904] [133] Job 133 memory used:200634796 limit:190525440 KB
[2016-04-22T08:27:43.904] [133] Job 133 exceeded memory limit (200634796 > 190525440), being killed

It appears that the line below in jobacct_gather_set_mem_limit (in slurm_jobacct_gather.c) results in an integer overflow, possibly because mem_limit is a uint32_t, so the multiplication is done in 32 bits and cannot produce a value greater than 4,294,967,295 ...
jobacct_mem_limit   = mem_limit * 1024;     /* MB to KB */

Casting mem_limit to uint64_t as part of the calculation appears to resolve the issue ...
jobacct_mem_limit   = (uint64_t)mem_limit * 1024;     /* MB to KB */
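
To illustrate the arithmetic, here is a minimal standalone sketch of the two expressions. The helper names mb_to_kb_wrapping and mb_to_kb_widened are hypothetical and are not Slurm functions; the sketch also assumes the destination variable is wide enough to hold the full result, which I have not confirmed against the declaration of jobacct_mem_limit.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only; neither exists in Slurm.
 */

/* Mirrors the original expression: the product is held in 32 bits,
 * so it wraps modulo 2^32 before being widened. */
uint64_t mb_to_kb_wrapping(uint32_t mem_limit_mb)
{
	uint32_t product = mem_limit_mb * 1024;	/* MB to KB, 32-bit result */
	return product;
}

/* Mirrors the proposed fix: widen the operand first, then multiply,
 * so the product is computed in 64 bits. */
uint64_t mb_to_kb_widened(uint32_t mem_limit_mb)
{
	return (uint64_t)mem_limit_mb * 1024;	/* MB to KB, 64-bit result */
}
```

For the 4380364MB request above, the wrapping version yields 190525440 (that is, 4485492736 - 4294967296), which matches the limit reported in the log, while the widened version yields 4485492736. The smallest request that wraps is 4194304MB, since 4194304 * 1024 is exactly 2^32.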
Comment 1 Sam Gallop 2016-04-25 02:12:28 MDT
Created attachment 3022
SLURM configuration file
Comment 2 Sam Gallop 2016-04-25 02:13:35 MDT
Created attachment 3023
cgroup Configuration File
Comment 3 Sam Gallop 2016-04-26 03:34:26 MDT
Created attachment 3032
Patch
Comment 4 Tim Wickberg 2016-04-26 05:21:30 MDT
Thanks for the patch!

Committed with a NEWS entry and a shorter commit message as fe85cc35470.

We do not plan on ever tagging a 14.11.12 release, but this will be included in 15.08.11 and 16.05-pre3 when released.

cheers,
- Tim