Summary: | cg group error | ||
---|---|---|---|
Product: | Slurm | Reporter: | mengxing cheng <mxcheng> |
Component: | User Commands | Assignee: | Tim Wickberg <tim> |
Status: | RESOLVED TIMEDOUT | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | ryelle |
Version: | 16.05.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | University of Chicago | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
mengxing cheng
2017-04-28 15:20:40 MDT
This appears to be similar to an earlier bug 3364. I believe the underlying problem identified there is an issue in the Linux kernel itself, not in Slurm.

I'm guessing that the afflicted node has run more jobs than the other okay nodes? If you reboot midway2-0414 does that clear up the error?

If it does, then I think we can conclude the Linux kernel bug discussed in bug 3364 is still present and causing problems; you'd need to find a way to upgrade to a fixed Linux kernel to resolve that if so.

(In reply to Tim Wickberg from comment #1)
> This appears to be similar to an earlier bug 3364. I believe the underlying
> problem identified there is an issue in the Linux kernel itself, not in
> Slurm.
>
> I'm guessing that the afflicted node has run more jobs than the other okay
> nodes? If you reboot midway2-0414 does that clear up the error?

Hey Mengxing -

Have you had a chance to test this out as a work-around?

Unfortunately, as I'd described before I believe this is a problem with the Linux kernel, and not something Slurm can directly resolve. Knowing if the reboot clears things up, and if the node had run more jobs than the average would help verify that as the cause.

- Tim

I'm marking this resolved/timedout as I still haven't seen a response to comment 1 or comment 2. Please re-open if you'd like to continue to pursue this.

- Tim

Tim, thank you for support!

Mengxing

*** Ticket 5082 has been marked as a duplicate of this ticket. ***
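
For anyone revisiting this ticket, the check requested in the comments (whether midway2-0414 had run more jobs than the average node) could be approximated with a small script. The sketch below is not from the ticket. It assumes accounting output has already been collected as `JobID|NodeList` lines, for example from something like `sacct -a -X -n -P -o JobID,NodeList`. The function names and the sample data are hypothetical, and multi-node entries are skipped to keep the parsing simple.

```python
from collections import Counter
from statistics import mean


def jobs_per_node(sacct_lines):
    """Count jobs per node from lines shaped like 'JobID|NodeList'.

    Assumption: one line per job allocation. Entries that name more than
    one node (compressed or comma-separated hostlists) are skipped.
    """
    counts = Counter()
    for line in sacct_lines:
        line = line.strip()
        if not line:
            continue
        _job_id, _, node_list = line.partition("|")
        if not node_list or "[" in node_list or "," in node_list:
            continue
        counts[node_list] += 1
    return counts


def compare_to_average(counts, node):
    """Return (jobs seen on `node`, mean jobs per node across all nodes seen)."""
    if not counts:
        return 0, 0.0
    return counts.get(node, 0), mean(counts.values())


# Hypothetical sample data, for illustration only.
sample = [
    "1001|midway2-0414",
    "1002|midway2-0414",
    "1003|midway2-0414",
    "1004|midway2-0413",
    "1005|midway2-[0412-0413]",  # multi-node entry, skipped by this sketch
]
counts = jobs_per_node(sample)
print(compare_to_average(counts, "midway2-0414"))
```

A node whose count sits well above the mean would be consistent with the kernel-side explanation from bug 3364, though on its own that comparison does not confirm the cause. The reboot test asked for in comment 1 is still the more direct check.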