Ticket 3648 - Unable to find bug # 2643
Summary: Unable to find bug # 2643
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 16.05.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-03-31 10:44 MDT by Jenny Williams
Modified: 2017-04-18 20:48 MDT

See Also:
Site: UNC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Jenny Williams 2017-03-31 10:44:48 MDT
We are researching the error:

SLUB: Unable to allocate memory on node -1 (gfp=0x8020)


We found reference to this error here:
https://github.com/docker/docker/issues/27576

which in turn references this error:

https://bugs.schedmd.com/show_bug.cgi?id=2846

I am unable to find a bug 2846, nor is there any hit on the terms slub allocate unable ... Could the details of this specific error be forwarded to me? It might be helpful to us.

I appreciate the assistance.

Virginia ( Jenny ) Williams
UNC Chapel Hill
Comment 1 Tim Wickberg 2017-03-31 10:52:08 MDT
The bug is marked private; briefly, it has to do with a memory leak in the cgroup subsystems.

It was fixed with commit 85ab952adf26 in 16.05.7 and later.
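For anyone checking whether a given installation already carries that fix, a minimal sketch of the version comparison (the helper name and logic are illustrative, not part of Slurm; the 16.05.7 threshold is taken from the comment above):

```python
# Compare a Slurm version string against 16.05.7, the first release
# containing commit 85ab952adf26 per the comment above.
def has_cgroup_leak_fix(version: str, fixed=(16, 5, 7)) -> bool:
    # Split "16.05.3" into an integer tuple so "16.05.10" sorts after "16.05.7".
    parts = tuple(int(p) for p in version.split("."))
    return parts >= fixed

print(has_cgroup_leak_fix("16.05.3"))   # False: the version reported in this ticket
print(has_cgroup_leak_fix("16.05.10"))  # True
```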
Comment 2 Jenny Williams 2017-04-03 09:42:56 MDT
As a follow-on, is this error from the slurmd logs also related?


slurmd.log-20170320:[2017-03-16T10:24:17.707] _run_prolog: prolog with lock for job 2731931 ran for 0 seconds
slurmd.log-20170320:[2017-03-16T10:24:17.707] Launching batch job 2731931 for UID 237264
slurmd.log-20170320:[2017-03-16T10:24:17.757] [2731931] error: task/cgroup: unable to add task[pid=196105] to memory cg '(null)'
slurmd.log-20170320:[2017-03-16T10:24:18.059] [2731931] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
slurmd.log-20170320:[2017-03-16T10:24:18.060] [2731931] done with job
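A quick way to pull these task/cgroup failures out of a slurmd log, as a sketch (the message format is taken from the lines above; the regex and function name are assumptions, not a Slurm-provided tool):

```python
import re

# Matches slurmd lines like:
# [2017-03-16T10:24:17.757] [2731931] error: task/cgroup: unable to add task[pid=196105] to memory cg '(null)'
PATTERN = re.compile(
    r"\[(?P<time>[^\]]+)\] \[(?P<job>\d+)\] error: task/cgroup: "
    r"unable to add task\[pid=(?P<pid>\d+)\] to memory cg '(?P<cg>[^']*)'"
)

def find_cgroup_errors(lines):
    """Yield (job_id, pid, cgroup) for each matching log line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.group("job"), m.group("pid"), m.group("cg")

sample = [
    "slurmd.log-20170320:[2017-03-16T10:24:17.757] [2731931] error: "
    "task/cgroup: unable to add task[pid=196105] to memory cg '(null)'",
]
print(list(find_cgroup_errors(sample)))  # [('2731931', '196105', '(null)')]
```

The `'(null)'` cgroup path in the match is the telltale detail: the memory cgroup for the job was never created, so the task could not be attached to it.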
Comment 3 Tim Wickberg 2017-04-04 15:25:03 MDT
(In reply to Jenny Williams from comment #2)
> As a follow on, is this error from slurmd logs also related ?  
> 
> 
> slurmd.log-20170320:[2017-03-16T10:24:17.707] _run_prolog: prolog with lock
> for job 2731931 ran for 0 seconds
> slurmd.log-20170320:[2017-03-16T10:24:17.707] Launching batch job 2731931
> for UID 237264
> slurmd.log-20170320:[2017-03-16T10:24:17.757] [2731931] error: task/cgroup:
> unable to add task[pid=196105] to memory cg '(null)'
> slurmd.log-20170320:[2017-03-16T10:24:18.059] [2731931] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
> slurmd.log-20170320:[2017-03-16T10:24:18.060] [2731931] done with job

I don't believe that's directly related; but IIRC that may have been addressed by a separate fix to the cgroups subsystem. I'd expect that to go away with 16.05.10 if you get a chance to upgrade.

Is there anything else I can help answer on this?
Comment 4 Tim Wickberg 2017-04-18 20:48:50 MDT
> I don't believe that's directly related; but IIRC that may have been
> addressed by a separate fix to the cgroups subsystem. I'd expect that to go
> away with 16.05.10 if you get a chance to upgrade.
> 
> Is there anything else I can help answer on this?

Marking resolved/infogiven. If you're still seeing issues after an upgrade please reopen, or file a new bug, and we'll be happy to help.

- Tim