Created attachment 6799 [details] backtrace of slurmstepd

This is in reference to bug 5111. We have upgraded to 17.11.5 and additionally cherry-picked commits 1675ada0a, a7c8964e, 3be9e1ee0 and e5f03971b. We continue to see the same problem: the agent queue size increases continuously (we've seen it go as high as 200000) and jobs stay in the completing state. I am attaching the output from node cdr1545 of `gdb -batch -ex "thread apply all bt full" -p 99504`, where 99504 is the slurmstepd process:

S USER PID   PPID NI RSS  VSZ     STIME %CPU TIME     COMMAND
S root 9489     1  0 7716 1506312 May03  0.0 00:00:01 /opt/software/slurm/sbin/slurmd
S root 99504    1  0 4536 306160  07:28  0.0 00:00:00 slurmstepd: [7774580.extern]

- Martin
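P.S. For completeness, the commands used to collect the trace were roughly as follows (a sketch only, run as root on the affected node; the pgrep pattern and output file name are our own choices):

```
# Find slurmstepd processes left behind by completing jobs.
pgrep -a slurmstepd

# Dump a full backtrace of every thread for each of them.
for pid in $(pgrep slurmstepd); do
    gdb -batch -ex "thread apply all bt full" -p "$pid" > "slurmstepd-$pid.bt" 2>&1
done
```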
The biggest problem right now is that jobs that get "started" by the scheduler hang in the prolog (`scontrol show job <jobid>` shows Reason=Prolog), but nothing ever gets sent to the nodes. After a while the JobState changes to COMPLETING and the job disappears from the system without any record for the user. Even after a slurmctld restart the agent queue size rarely drops below 2000, and then it quickly rises again.
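To make the symptom easier to follow, this is the kind of thing we watch on the controller (a sketch; the grep patterns assume the usual wording of the sdiag/scontrol output and may need adjusting):

```
# Track the agent queue size every 30 s on the slurmctld host.
watch -n 30 'sdiag | grep -i "agent queue size"'

# For a job stuck in the prolog, confirm its state and reason.
scontrol show job <jobid> | grep -E "JobState|Reason"
```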
Hi,

The good news is that the slurmd backtrace looks fine now; this should reduce at least some of the RPCs. Could you send me the current slurmctld.log, slurmd.log and sdiag output?

Dominik
Yesterday evening we paused the starting of new jobs by setting all partitions to State=Down. It took a while (more than 30 min.), during which Slurm mostly completed jobs, but then the agent queue size dropped to 0. We brought the partitions back up and Slurm has been stable since. I agree that the problems we were seeing yesterday may not be related to the thread deadlocks. It looks more like once the agent queue size grows above a certain limit, it keeps increasing and never recovers. We just had a "mini blip": the agent queue size rose to about 1500. I am attaching the slurmctld.log and sdiag output during that blip. Slurm did recover from this blip, and the agent queue size dropped back to 0. I am also attaching the slurmd.log from one particular node (cdr761) which may have contributed to the blip, because a user was running sbatch on it.
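For the record, the pause was done with scontrol roughly as follows (a sketch; <partition> stands for each of our production partitions):

```
# Stop new jobs from being scheduled while running work drains.
scontrol update PartitionName=<partition> State=DOWN

# ...wait for the agent queue size reported by sdiag to drop to 0...

# Resume normal scheduling.
scontrol update PartitionName=<partition> State=UP
```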
Created attachment 6815 [details] slurmctld.log
Created attachment 6816 [details] sdiag output
Created attachment 6817 [details] slurmd.log from cdr761
Created attachment 6877 [details] patch

Hi,

This patch fixes a minor race/deadlock in slurmstepd that was introduced in 17.11.6. Could you apply it and check if it helps?

Dominik
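P.S. Applying it would look roughly like this (a sketch only; the patch file name, paths and configure options are placeholders, adjust them to whatever you use on your cluster):

```
# Assumption: the attachment is saved as slurmstepd-race.patch and applies
# with -p1 to your patched 17.11.5 source tree.
cd slurm-17.11.5
patch -p1 < slurmstepd-race.patch
./configure --prefix=/opt/software/slurm && make -j && make install

# Restart slurmd on the compute nodes; only newly launched slurmstepd
# processes pick up the fix, existing steps keep running the old code.
systemctl restart slurmd
```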
Thanks Dominik. I've applied the patch on our cluster; by the looks of it, it will only apply to new slurmstepd processes, so it'll take a little while to have an effect. I'll have to have Martin update you on this, as I'm away for the next 2.5 weeks and we're doing an outage for the last week of May, so it won't be under any load during that time.
Additionally, we have now patched 17.11.5 with commit 6a74be8. The system has been stable ever since. We also plan to upgrade to 17.11.7 on May 30.
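For completeness, the commit was picked onto our local 17.11.5 branch roughly like this (a sketch; the branch name is a placeholder for our internal one):

```
# Assumption: a clone of the upstream Slurm repository with our local branch.
git fetch origin
git checkout our-17.11.5-patches
git cherry-pick 6a74be8
# Rebuild and redeploy slurmctld/slurmd afterwards.
```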
Hi,

Glad to hear that's all back to normal. If it's alright with you, I'm going to move this to resolved/fixed. If there's anything else I can help with, please reopen or file a new ticket.

Dominik