Created attachment 25399
slurmd.log from the machine

Looks like we had jobs hit walltime and Slurm tried to cancel them but wasn't able to. Any insight?

[2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD TERMINATED ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS ***
Hi,

> Looks like we had jobs hit walltime and slurm tried to cancel but wasn't
> able. Any insight?
>
> [2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD
> TERMINATED ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS
> ***

Yes, your description is correct.

When Slurm cancels a job, it sends Linux signals to all the processes in all of the job's steps. It initially sends SIGTERM:

[2022-05-31T15:34:03.256] [155896.batch] Sent signal 15 to StepId=155896.batch

After some time (KillWait), if the processes have not ended, it sends SIGKILL:

[2022-05-31T15:34:32.968] [155896.batch] Sent SIGKILL signal to StepId=155896.batch

With SIGKILL the processes should terminate, but if the node is having issues they may keep running. Most typically this is because they are in an "uninterruptible sleep" state (D), which is usually due to problems with a network filesystem.

Slurm keeps trying to kill those processes for some time (UnkillableStepTimeout), and finally decides to terminate the step and drain the node to avoid further problems with future jobs:

[2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD TERMINATED ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS ***

I assume that this gpu7 node was drained after this happened, right?

See the KillWait option to control the time between SIGTERM and SIGKILL:
- https://slurm.schedmd.com/slurm.conf.html#OPT_KillWait

And UnkillableStepTimeout to control the time between the first SIGKILL and the decision that the step is "unkillable":
- https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout

For further troubleshooting you can create and set an UnkillableStepProgram, which will be run on the node when that happens, so you can gather information about possible problems on the node:
- https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Hope this helps,
Albert
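As a rough cross-check of the timeline described above, the gaps between the three quoted slurmd.log lines can be computed directly. This is only a sketch: the helper names (parse_timestamp, find_event, signal_intervals) are made up for illustration, and it assumes the quoted SIGKILL line is the first SIGKILL sent to the step.

```python
from datetime import datetime

# Log lines copied from the slurmd.log excerpts quoted in this ticket.
LOG_LINES = [
    "[2022-05-31T15:34:03.256] [155896.batch] Sent signal 15 to StepId=155896.batch",
    "[2022-05-31T15:34:32.968] [155896.batch] Sent SIGKILL signal to StepId=155896.batch",
    "[2022-05-31T15:35:33.003] [155896.batch] error: *** JOB 155896 STEPD TERMINATED "
    "ON gpu7 AT 2022-05-31T15:35:33 DUE TO JOB NOT ENDING WITH SIGNALS ***",
]


def parse_timestamp(line):
    """Return the leading [YYYY-MM-DDTHH:MM:SS.mmm] stamp as a datetime."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")


def find_event(lines, marker):
    """Return the timestamp of the first line containing marker."""
    for line in lines:
        if marker in line:
            return parse_timestamp(line)
    raise ValueError(f"no log line contains {marker!r}")


def signal_intervals(lines):
    """Return (SIGTERM -> SIGKILL, SIGKILL -> step terminated) in seconds."""
    sigterm = find_event(lines, "Sent signal 15")
    sigkill = find_event(lines, "Sent SIGKILL")  # assumed to be the first SIGKILL
    gave_up = find_event(lines, "STEPD TERMINATED")
    return (
        (sigkill - sigterm).total_seconds(),
        (gave_up - sigkill).total_seconds(),
    )


if __name__ == "__main__":
    term_to_kill, kill_to_give_up = signal_intervals(LOG_LINES)
    print(f"SIGTERM -> SIGKILL:         {term_to_kill:.3f} s (compare with KillWait)")
    print(f"SIGKILL -> step terminated: {kill_to_give_up:.3f} s (compare with UnkillableStepTimeout)")
```

On the quoted lines this gives roughly 29.7 s between SIGTERM and SIGKILL and roughly 60.0 s between SIGKILL and the step being declared unkillable. Those figures can be compared against the KillWait and UnkillableStepTimeout values in the cluster's slurm.conf.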
Hi,

If this is OK with you, I'm closing this ticket as infogiven, but please don't hesitate to reopen it if you need further support.

Regards,
Albert
Sounds good, thank you