Summary: | defunct slurmd process leaves a sleep in the step_extern cgroup | |
---|---|---|---
Product: | Slurm | Reporter: | Cineca HPC Systems <hpc-sysmgt-info>
Component: | slurmd | Assignee: | Tim Wickberg <tim>
Status: | RESOLVED FIXED | Severity: | 3 - Medium Impact
Version: | 17.11.2 | CC: | chris, dmjacobsen
Hardware: | Linux | OS: | Linux
Site: | Cineca | See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4622
Version Fixed: | 17.11.3 | |
Attachments: | contents of /etc/slurm and slurmd logs | |
Tim Wickberg:

I believe this is the same underlying issue as in bug 4634, and it should be resolved in 17.11.3 (due to be released this afternoon) and later. If it's alright with you, I'd propose we close this as a duplicate of that bug; if you're still seeing issues after upgrading (or after applying the referenced patches directly), we can re-open this, or you can file a new, separate ticket.

- Tim

Cineca HPC Systems (comment #3):

Hi Tim,

thanks for the info. It's OK for us to close the bug. We can schedule an upgrade to 17.11.3 next week and we'll let you know if it solves this bug. I have just two questions:

* Can I take a look at bug 4634, please? At the moment your site doesn't give me access ;)
* We didn't receive the email of your comment. I checked the email preferences and they seem OK. Could you check what's wrong, please?

Thank you very much
Ale

Tim Wickberg (comment #4):

(In reply to hpc-sysmgt-info from comment #3)
> * Can I take a look at bug 4634, please? At the moment your site doesn't
> give me access ;)

Ah, sorry about that. That one is tagged private, unfortunately. The relevant patches are in commit d2c838070. However, we have found a few related issues and are working on an additional patch that closes a more likely source of these problems. That should be in the 17.11.3 release, which we expect to have out early next week.

> * We didn't receive the email of your comment. I checked the email
> preferences and they seem OK. Could you check what's wrong, please?

There does seem to have been a small hiccup getting that email out. I do see that email appears to be flowing (I'd switched into your account briefly to double-check some of your preferences, and can see that alert email made it over), and I'm verifying that this response gets sent over to your mail server.

- Tim

Tim Wickberg (comment #5):

Trying this again after one small tweak to your account. Comment #4 was not sent either, so I'm re-sending it here.

Tim Wickberg:

(In reply to Tim Wickberg from comment #5)
> Trying this again after one small tweak to your account. Comment #4 was not
> sent either, so I'm re-sending it here.

Please take a look at comment #4 when you get a chance. This is one more test message; this should hopefully get through to you.
Tim Wickberg:

I've removed one checkbox in your email preference settings that was stopping you from getting email. Having "The bug is in the UNCONFIRMED state" checked in the Reporter column is what was causing messages to you to be skipped. I'm not sure whether you enabled that intentionally or not?

Cineca HPC Systems:

Hi Tim,

I confirm that comment #6 arrived by mail. I had checked that box misunderstanding its meaning; thanks for fixing it. We will wait for 17.11.3 to be released.

Thank you very much
Ale

Tim Wickberg:

This is fixed with commit 108502e9504, and will be in 17.11.3 when released. Please re-open if you have any further questions, or if you still have problems after upgrading.

cheers,
- Tim

*** Ticket 4622 has been marked as a duplicate of this ticket. ***

*** Ticket 4733 has been marked as a duplicate of this ticket. ***
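For sites tracking the fix from source, one way to check whether the referenced commit has landed in a given release is to query the git tags that contain it. A minimal sketch, assuming the public SchedMD mirror on GitHub and Slurm's usual `slurm-17-11-3-1`-style tag naming (both are assumptions, not stated in this ticket):

```bash
# Clone the upstream Slurm sources (mirror URL is an assumption; use your own checkout).
git clone https://github.com/SchedMD/slurm.git && cd slurm

# List every tag that already contains the fix commit referenced above.
git tag --contains 108502e9504

# Or test one release tag directly; prints "fix included" when the tag has the commit.
git merge-base --is-ancestor 108502e9504 slurm-17-11-3-1 && echo "fix included"
```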
Cineca HPC Systems (original report):

Created attachment 6048 [details]: contents of /etc/slurm and slurmd logs

Hi support,

we are seeing a lot of jobs which stay in the completing state until we kill the sleep process inside the step_extern cgroup. In these cases, what we see on the involved nodes is a defunct slurmd:

```
[root@r131c17s02 ~]# ps --forest -lfe | egrep '[s]leep|[s]lurm'
1 S root 15957     1  0 80  0 - 923070 inet_c Jan23 ?  00:00:45 /usr/sbin/slurmd
1 Z root 15481 15957  0 80  0 -      0 exit   11:49 ?  00:00:00  \_ [slurmd] <defunct>
0 S root 15487     1  0 80  0 -  26973 hrtime 11:49 ?  00:00:00 sleep 1000000
[root@r131c17s02 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_29035/job_82290/step_extern/tasks
15487
```

We also see from the UNIX process-accounting logs that the step_extern slurmstepd died immediately:

```
[root@r131c17s02 ~]# lastcomm --command slurmstepd | grep D
slurmstepd      DX  root  __  0.10 secs Thu Feb  1 11:49
[root@r131c17s02 ~]# dump-acct /var/account/pacct | grep 'Feb 1 11:49' | grep slurm
slurmd     |v3| 0.00| 0.00|  0.00| 0| 0|3558912.00| 0.00| 15481 15957|Thu Feb  1 11:49:49 2018
slurmstepd |v3| 4.00| 6.00| 12.00| 0| 0| 195904.00| 0.00| 15482     1|Thu Feb  1 11:49:49 2018
```

So both the sleep and the slurmstepd processes end up as children of systemd (pid 1).

We tried to set up an UnkillableStepProgram to kill the sleep process, but the script is never invoked, presumably because the slurmd is defunct.

We attach the /etc/slurm directory (slurm.tgz) and the slurmd logs.

Thanks
Ale
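Until the fixed release is installed, leftover extern-step processes like the `sleep` above have to be found and removed by hand (the report notes that UnkillableStepProgram was never invoked here, since slurmd itself was defunct, so a cron job or manual run is the fallback). The script below is a minimal diagnostic sketch, not part of Slurm: the path pattern `/sys/fs/cgroup/cpuset/slurm/uid_*/job_*/step_extern/tasks` is taken from the output above, the script name is hypothetical, and whether killing the listed pids is safe has to be judged per job.

```bash
#!/bin/bash
# cleanup_step_extern.sh -- hypothetical helper, not part of Slurm.
# Walks the cpuset cgroup hierarchy Slurm uses (layout as observed in
# this report) and lists any tasks still parked in a step_extern cgroup.
# With -k it also SIGKILLs them. Must run as root on the affected node.

KILL=0
[ "$1" = "-k" ] && KILL=1

for tasks in /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/step_extern/tasks; do
    [ -e "$tasks" ] || continue
    while read -r pid; do
        # Skip pids that disappeared between the read and the ps call.
        cmd=$(ps -o comm= -p "$pid" 2>/dev/null) || continue
        echo "leftover pid $pid ($cmd) in $tasks"
        [ "$KILL" -eq 1 ] && kill -9 "$pid"
    done < "$tasks"
done
```

On the node shown above this would report pid 15487 (`sleep`) in job_82290's step_extern cgroup; per the report, killing that process is what lets the stuck job finish completing.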