Slurm reports a cgroup error on some nodes:

[pdeperio@midway2-login1 170219-thread_test]$ srun -w midway2-0414 --account=pi-lgrandi --partition=xenon1t hostname
slurmstepd-midway2-0414: error: task/cgroup: unable to add task[pid=34313] to memory cg '(null)'
midway2-0414.rcc.local

but there is no error on another node:

[pdeperio@midway2-login1 170219-thread_test]$ srun -w midway2-0421 --account=pi-lgrandi --partition=xenon1t hostname
midway2-0421.rcc.local

Do you know what is wrong with the node reporting the error? Thank you!

Mengxing
This appears to be similar to an earlier ticket, bug 3364. I believe the underlying problem identified there is an issue in the Linux kernel itself, not in Slurm.

I'm guessing that the afflicted node has run more jobs than the nodes that are okay? Does rebooting midway2-0414 clear up the error? If it does, then I think we can conclude that the Linux kernel bug discussed in bug 3364 is still present and causing problems; if so, you'd need to upgrade to a fixed Linux kernel to resolve it.
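If it helps, here is a rough sketch of the reboot work-around. This assumes you have admin rights on the cluster, and that RebootProgram is set in slurm.conf for scontrol reboot to work; otherwise reboot the node out-of-band. The reason string is just an example:

# drain the node so no new jobs land on it while it is misbehaving
$ scontrol update NodeName=midway2-0414 State=DRAIN Reason="memory cgroup errors, bug 3364"
# ask Slurm to reboot it once it is idle (requires RebootProgram in slurm.conf)
$ scontrol reboot midway2-0414
# return it to service after it comes back up
$ scontrol update NodeName=midway2-0414 State=RESUME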
(In reply to Tim Wickberg from comment #1)
> This appears to be similar to an earlier ticket, bug 3364. I believe the
> underlying problem identified there is an issue in the Linux kernel itself,
> not in Slurm.
>
> I'm guessing that the afflicted node has run more jobs than the nodes that
> are okay? Does rebooting midway2-0414 clear up the error?

Hey Mengxing -

Have you had a chance to test this out as a work-around? Unfortunately, as I described before, I believe this is a problem with the Linux kernel, not something Slurm can directly resolve.

Knowing whether the reboot clears things up, and whether the node had run more jobs than average, would help verify that as the cause. A sketch of how to compare the two nodes' cgroup state is below.

- Tim
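One quick check before rebooting — a sketch, assuming the nodes run a cgroup-v1 hierarchy with the memory controller mounted at the usual /sys/fs/cgroup/memory (the exact Slurm sub-path depends on your cgroup.conf) — is to compare the kernel's live cgroup counts on the bad and good nodes:

# the num_cgroups column for "memory" in /proc/cgroups shows how many
# memory cgroups the kernel is tracking; a much larger count on
# midway2-0414 than on midway2-0421 would point at leaked cgroups
$ ssh midway2-0414 cat /proc/cgroups
$ ssh midway2-0421 cat /proc/cgroups
# leftover per-job directories under the Slurm memory hierarchy
# (path varies with cgroup.conf) are another symptom
$ ssh midway2-0414 ls /sys/fs/cgroup/memory/slurm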
I'm marking this resolved/timedout as I still haven't seen a response to comment 1 or comment 2. Please re-open if you'd like to continue to pursue this.

- Tim
Tim, thank you for the support!

Mengxing
*** Ticket 5082 has been marked as a duplicate of this ticket. ***