Summary: | jobacct_gather/linux not fully supported with pam_slurm_adopt leading to memory overlimit | ||
---|---|---|---|
Product: | Slurm | Reporter: | SafranTech <saf.cmh.safrantech-admin> |
Component: | Accounting | Assignee: | Jacob Jenson <jacob> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | CC: | csc-slurm-tickets, nate |
Version: | 17.11.2 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8656 | ||
Site: | -Other- | ||
Attachments: | steps to reproduce and slurmd log |
Hi SchedMD,

Our site (CSC - IT Center for Science) suffers from this issue as well. The report includes an excellent analysis and a reproducer, so could you fix this issue?

Best Regards,
Tommi Tervo
CSC

Hello,

Further to Tommi's comment, here is an update. On our side, until the bug is resolved, we work around it by adding this:

    JobAcctGatherParams=NoOverMemoryKill

For more details, the principal parameters in slurm.conf are:

    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherParams=NoOverMemoryKill
    TaskPlugin=task/cgroup
    TaskPluginParam=Cpusets

cgroup.conf contents:

    ConstrainCores=yes
    ConstrainRAMSpace=yes
    MaxRAMPercent=98
    ConstrainKmemSpace=no

Please be aware of what this parameter does (extract from the slurm.conf man page):

JobAcctGatherParams
    Arbitrary parameters for the job account gather plugin. Acceptable values at present include:

    NoShared
        Exclude shared memory from accounting.

    UsePss
        Use PSS value instead of RSS to calculate real usage of memory. The PSS value will be saved as RSS.

    NoOverMemoryKill
        Do not kill process that uses more then requested memory. This parameter should be used with caution as if jobs exceeds its memory allocation it may affect other processes and/or machine health.

        NOTE: It is recommended to limit memory by enabling task/cgroup in TaskPlugin and making use of ConstrainRAMSpace=yes cgroup.conf. If so, having JobAcctGather as an extra mechanism for memory enforcement is not recommended, so setting NoOverMemoryKill is advised.

Mohamed Hendawi

Marking this as a duplicate.

*** This ticket has been marked as a duplicate of ticket 8656 ***
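For sites that want to check whether they are in the situation the man page note warns about, here is a minimal sketch in Python. It assumes the 17.11-era semantics quoted above, where the gather plugin enforces memory limits unless NoOverMemoryKill is set. The function names `parse_conf` and `double_enforcement` are hypothetical helpers, not part of Slurm, and the parser only handles simple `Key=Value` lines.

```python
def parse_conf(text):
    """Parse simple 'Key=Value' lines; comments and blank lines are skipped.

    Keys are lowercased because Slurm configuration keys are case-insensitive.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def double_enforcement(slurm_conf_text, cgroup_conf_text):
    """Return True when both cgroups and jobacct_gather would enforce memory.

    Assumption: the gather plugin kills over-limit steps unless
    NoOverMemoryKill is listed in JobAcctGatherParams (17.11 behaviour
    as described in the quoted man page extract).
    """
    slurm = parse_conf(slurm_conf_text)
    cgroup = parse_conf(cgroup_conf_text)

    task_plugins = [p.strip().lower() for p in slurm.get("taskplugin", "").split(",")]
    cgroup_limits = (
        "task/cgroup" in task_plugins
        and cgroup.get("constrainramspace", "").lower() == "yes"
    )

    gather_type = slurm.get("jobacctgathertype", "").lower()
    gather_on = gather_type not in ("", "jobacct_gather/none")
    params = [p.strip().lower() for p in slurm.get("jobacctgatherparams", "").split(",")]
    gather_kills = gather_on and "noovermemorykill" not in params

    return cgroup_limits and gather_kills
```

Fed the slurm.conf and cgroup.conf values listed in this ticket, the check returns False, since NoOverMemoryKill is already set; without that parameter it returns True.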
Created attachment 9177: steps to reproduce and slurmd log

Dear Support,

When pam_slurm_adopt is activated, we observe that jobs are killed by the jobacct_gather plugin. It appears that an orphaned process is added after each termination of a tracked process, so the memory associated with the orphaned task increments the total memory of the job step (here step_extern). This leads to the job step being killed by the Slurm plugin.

In the attachment, you will find a complete description of the issue.

Regards,
Philippe
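To illustrate the symptom described here, the following is a toy model in Python, not Slurm's actual accounting code. It assumes that each process adopted into the step uses a fixed amount of memory and that, in the faulty case, its last sampled usage stays in the step total after it exits. The function name `simulate` and all numbers are made up for illustration.

```python
def simulate(cycles, proc_kb, limit_kb, leak_stale):
    """Return the first cycle at which the step total exceeds the limit.

    Each cycle, one process is adopted into the step, sampled, and then
    exits. With leak_stale=True the exited process's entry is never
    dropped (the hypothesised orphaned-task behaviour), so the total
    grows by proc_kb per cycle. Returns None if the limit is never hit.
    """
    tracked = {}       # pid -> last sampled memory in kB
    next_pid = 1000    # arbitrary starting pid for the toy model
    for cycle in range(1, cycles + 1):
        pid = next_pid
        next_pid += 1
        tracked[pid] = proc_kb  # process adopted and sampled

        # Accounting poll while the process is alive.
        if sum(tracked.values()) > limit_kb:
            return cycle

        # Process terminates; a correct tracker forgets it.
        if not leak_stale:
            del tracked[pid]
    return None
```

With 400 kB processes and a 1000 kB limit, the leaking variant trips the limit on the third short-lived process even though no more than one process is ever alive at a time, while the non-leaking variant never does.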