Slurm reports a cgroup error on some nodes:

[pdeperio@midway2-login1 170219-thread_test]$ srun -w midway2-0414 --account=pi-lgrandi --partition=xenon1t hostname
slurmstepd-midway2-0414: error: task/cgroup: unable to add task[pid=34313] to memory cg '(null)'
midway2-0414.rcc.local

but there is no error on another node:

[pdeperio@midway2-login1 170219-thread_test]$ srun -w midway2-0421 --account=pi-lgrandi --partition=xenon1t hostname
midway2-0421.rcc.local

Do you know what is wrong with the node reporting the error? Thank you!

Mengxing
This appears to be similar to an earlier ticket, bug 3364. I believe the underlying problem identified there is an issue in the Linux kernel itself, not in Slurm.

I'm guessing that the afflicted node has run more jobs than the nodes that are okay? Does rebooting midway2-0414 clear up the error? If it does, then I think we can conclude that the Linux kernel bug discussed in bug 3364 is still present and causing problems; if so, you'd need to upgrade to a fixed Linux kernel to resolve it.
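If it helps, here is a rough sketch of the reboot work-around. This assumes you have admin rights on the cluster, and that RebootProgram is set in slurm.conf for scontrol reboot to work; otherwise reboot the node out-of-band. The reason string is just an example:

# drain the node so no new jobs land on it while it is misbehaving
$ scontrol update NodeName=midway2-0414 State=DRAIN Reason="memory cgroup errors, bug 3364"
# ask Slurm to reboot it once it is idle (requires RebootProgram in slurm.conf)
$ scontrol reboot midway2-0414
# return it to service after it comes back up
$ scontrol update NodeName=midway2-0414 State=RESUME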
(In reply to Tim Wickberg from comment #1)
> This appears to be similar to an earlier ticket, bug 3364. I believe the
> underlying problem identified there is an issue in the Linux kernel itself,
> not in Slurm.
>
> I'm guessing that the afflicted node has run more jobs than the nodes that
> are okay? Does rebooting midway2-0414 clear up the error?

Hey Mengxing -

Have you had a chance to test this out as a work-around? Unfortunately, as I described before, I believe this is a problem with the Linux kernel, not something Slurm can directly resolve.

Knowing whether the reboot clears things up, and whether the node had run more jobs than average, would help verify that as the cause. A sketch of how to compare the two nodes' cgroup state is below.

- Tim
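One quick check before rebooting — a sketch, assuming the nodes run a cgroup-v1 hierarchy with the memory controller mounted at the usual /sys/fs/cgroup/memory (the exact Slurm sub-path depends on your cgroup.conf) — is to compare the kernel's live cgroup counts on the bad and good nodes:

# the num_cgroups column for "memory" in /proc/cgroups shows how many
# memory cgroups the kernel is tracking; a much larger count on
# midway2-0414 than on midway2-0421 would point at leaked cgroups
$ ssh midway2-0414 cat /proc/cgroups
$ ssh midway2-0421 cat /proc/cgroups
# leftover per-job directories under the Slurm memory hierarchy
# (path varies with cgroup.conf) are another symptom
$ ssh midway2-0414 ls /sys/fs/cgroup/memory/slurm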
I'm marking this resolved/timedout as I still haven't seen a response to comment 1 or comment 2. Please re-open if you'd like to continue to pursue this.

- Tim
Tim, thank you for the support!

Mengxing
*** Ticket 5082 has been marked as a duplicate of this ticket. ***