| Summary: | node drained after JOB NOT ENDING WITH SIGNALS | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ali Siavosh <Ali.Siavosh-haghighi> |
| Component: | Other | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NYUMC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf, slurmd.log, slurmd.log cn-0017, slurmd log for gn-0004, slurmd.log, slurmd.log | | |
Description

Ali Siavosh — 2021-03-24 13:06:31 MDT

What version of Slurm are you running?

```
$ slurmctld -V
slurm 19.05.5
```

Felip Moll

(In reply to Ali Siavosh from comment #0)
> We are getting the error below right before the node with the job on it is drained. The log file contains the following:
>
> debug: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_104300/job_1053/step_0: Device or resource busy

Hi,

Can you show me the output of `cat /proc/mounts` on one of the afflicted nodes? Can you also upload your slurm.conf and slurmd log?

Thank you.

Ali Siavosh

Created attachment 18647 [details]
slurm.conf

Created attachment 18648 [details]
slurmd.log, for the same node where I ran `cat /proc/mounts`.
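The "Device or resource busy" error in the description means some process was still inside the step cgroup when slurmd tried to remove it. A hedged diagnostic sketch (POSIX shell; the cgroup path is the one from the error message and would need adjusting per node/job/step):

```shell
#!/bin/sh
# List PIDs still inside a (possibly stale) cgroup directory; any surviving PID
# explains the EBUSY ("Device or resource busy") on cgroup removal.
cg_pids() {
    # $1 = cgroup directory; prints one PID per line, nothing if empty/missing
    [ -r "$1/cgroup.procs" ] && cat "$1/cgroup.procs"
    return 0
}

# Path taken from the error in this ticket; adjust uid/job/step to your node.
CG=/sys/fs/cgroup/freezer/slurm/uid_104300/job_1053/step_0
for pid in $(cg_pids "$CG"); do
    ps -o pid,stat,comm -p "$pid"   # STAT "D" = blocked in uninterruptible I/O
done
```

A process stuck in "D" state here cannot be killed by signals, which is consistent with the node later draining for an unkillable step.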
```
[root@gpu-0004 ~]# cat /proc/mounts
rootfs / rootfs rw,size=395609072k,nr_inodes=98902268 0 0
proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=395609084k,nr_inodes=98902271,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/vg_main/root / ext4 rw,noatime,nodiratime,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=54418 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
/dev/md2 /boot ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/md0 /tmp ext4 rw,noatime,nodiratime,stripe=64,data=ordered 0 0
/dev/mapper/vg_main-var /var ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-opt /opt ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-usr_local /usr/local ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-var_log /var/log ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-var_lib /var/lib ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/md1 /boot/efi vfat rw,noatime,nodiratime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro 0 0
/dev/mapper/vg_main-var_log_audit /var/log/audit ext4 rw,noatime,nodiratime,data=ordered 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
gpfs /gpfs gpfs rw,relatime 0 0
test_hsm /test_hsm gpfs rw,relatime 0 0
```

Felip Moll

Hi Ali,

1. I see you have the memory and freezer cgroups under the same mountpoint:

```
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children 0 0
```

This is a configuration Bright sets up: it puts `JoinControllers=blkio,cpuacct,memory,freezer` into /etc/systemd/system.conf, which mixes blkio, freezer, memory, and cpuacct into one single mountpoint. This is not supported at this moment by Slurm, and you need to fix it. There's a discussion in bug 7536; see specifically comments 11 and 12 there about this issue, and let me know if the proposed workarounds fix your issues.

2. The slurmd log you attached is not useful. (Recommendation) Please make sure to change the debug levels in slurm.conf to more verbose values, for example:

```
SlurmctldDebug=debug
SlurmdDebug=debug
```

Then we may catch more in the logs.

3. In your slurm.conf you have this:

```
# Was 120 but nodes were draining because it takes too long to kill some jobs.
# If this change doesn't do it, we'll file a bug with SchedMD.
UnkillableStepTimeout=300
```

That's not fine. You need to set UnkillableStepTimeout to a value less than 127 or the node draining mechanism won't work correctly. There's a new bug we discovered here: bug 11103.

4. JobAcctGather plugin: I see you commented out the cgroup plugin and set the linux one:

```
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
```

Did this change recently? I wouldn't expect an issue like the one you're seeing with jobacct_gather/linux, but maybe you made this change without restarting the entire cluster and you're seeing old jobs being hit by this. Is that possible?

Please address these 4 points and give me feedback on how it goes. Thanks.

Felip Moll

I forgot to mention: 19.05.5 is not supported. I encourage you to upgrade to our latest stable when possible. Is that something you have in mind currently? Thanks.

Ali Siavosh

Yes, it will be coming in the next scheduled maintenance.

Thanks,
Ali Siavosh-Haghighi, PhD
Sr. HPC System Administrator, High-Performance Computing
NYU Langone Health, Medical Center Information Technology
227 E 30th St, #7-738, New York, NY 10016
O: 646-524-0860  C: 347-843-2357
siavoa01@nyumc.org
nyulangone.org
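Point 1 above can be checked mechanically on each node. A minimal sketch, assuming POSIX shell on a cgroup v1 system (the `awk` filter is my own, not a SchedMD tool): it flags cgroup hierarchies whose mountpoint joins several controllers and includes memory or freezer, which is exactly the Bright layout described above.

```shell
#!/bin/sh
# Flag cgroup v1 hierarchies where memory or freezer is co-mounted with other
# controllers (the "JoinControllers" layout Slurm does not support).
joined_cgroups() {
    # Reads a /proc/mounts-style table on stdin; prints offending mountpoints.
    # A comma in the mountpoint ($2) is the tell-tale sign of joined controllers.
    awk '$3 == "cgroup" && $2 ~ /,/ && $4 ~ /(memory|freezer)/ {print $2}'
}

# On a live node (prints nothing when the layout is clean or cgroup v2 is used):
joined_cgroups < /proc/mounts
```

An empty result on every compute node would mean the unsupported layout is gone.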
------------------------------------------------------------
This email message, including any attachments, is for the sole use of the intended recipient(s) and may contain information that is proprietary, confidential, and exempt from disclosure under applicable law. Any unauthorized review, use, disclosure, or distribution is prohibited. If you have received this email in error please notify the sender by return email and delete the original message. Please note, the recipient should check this email and any attachments for the presence of viruses. The organization accepts no liability for any damage caused by any virus transmitted by this email.
------------------------------------------------------------

Felip Moll

(In reply to Ali Siavosh from comment #9)
> Yes will be coming in the next scheduled maintenance.

Ok, please let me know when you apply the changes mentioned in comment 7. Regards.

Ali Siavosh

Hi Felip,

Thank you very much for your guidance. We are going to have a maintenance window on May 2nd, when we will upgrade Slurm. I implemented the fix for issue #3 and we will monitor the system.

For implementing #1: as I understand it, commenting out the corresponding line in system.conf should do it:

```
#JoinControllers=blkio,cpuacct,memory,freezer
```

And, again as I understand it, there is no need for a reboot. Though after I make this change and remount, `mount | grep cgroup` still shows:

```
cgroup on /sys/fs/cgroup/blkio,cpuacct,memory,freezer type cgroup (rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children)
```

Am I missing something?

Felip Moll

(In reply to Ali Siavosh from comment #11)
> We are going to have a maintenance window on May 2nd, when we will upgrade Slurm.

Cool. Just open a new bug if questions about the upgrade arise.

> I implemented the fix for issue #3 and we will monitor the system.

Good.

> Am I missing something?

I checked my notes and unfortunately it seems a node reboot is needed. I don't think there's an easy way to umount/remount in the correct place; please let me know if you find a reliable method.

For more information, here is the systemd documentation explaining JoinControllers:

> JoinControllers=cpu,cpuacct net_cls,netprio
>
> Configures controllers that shall be mounted in a single hierarchy. By default, systemd will mount all controllers which are enabled in the kernel in individual hierarchies, with the exception of those listed in this setting. Takes a space-separated list of comma-separated controller names, in order to allow multiple joined hierarchies. Defaults to 'cpu,cpuacct'. Pass an empty string to ensure that systemd mounts all controllers in separate hierarchies. Note that this option is only applied once, at very early boot. If you use an initial RAM disk (initrd) that uses systemd, it might hence be necessary to rebuild the initrd if this option is changed, and make sure the new configuration file is included in it. Otherwise, the initrd might mount the controller hierarchies in a different configuration than intended, and the main system cannot remount them anymore.

PS: Sorry, I realize now I have not pointed to exactly the correct comments. In comment 7, when I said "There's a discussion in bug 7536. See specifically comments 11 and 12 there about this issue", I should have pointed to bug 7536 comments 14 and 18.
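The remediation described above (comment out the line, rebuild the initrd, reboot) can be sketched as follows. This is an assumption-laden outline, not a vetted procedure: `dracut` is the RHEL/CentOS initrd tool, paths vary by distro, and a Bright-managed image may regenerate system.conf, so verify the change survives provisioning.

```shell
#!/bin/sh
# Sketch: comment out the JoinControllers line Bright adds to system.conf.
# The file path is parameterized so you can rehearse on a copy first.
SYSTEMD_CONF=${SYSTEMD_CONF:-/etc/systemd/system.conf}

disable_joincontrollers() {
    # $1 = path to a systemd system.conf; comments out any JoinControllers= line
    sed -i 's/^JoinControllers=/#JoinControllers=/' "$1"
}

# On a real node: apply the edit, rebuild the initrd (the option is read only
# at very early boot, so a stale initrd would re-join the controllers), reboot.
# disable_joincontrollers "$SYSTEMD_CONF" && dracut -f && reboot
```

After the reboot, `grep cgroup /proc/mounts` should show memory and freezer in separate hierarchies.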
Felip Moll

Hi Ali, just let me know how the system behaves with these changes; I am interested in seeing whether the issues have ceased. (Note your bug is tagged as sev-2, so I am monitoring it more proactively; feel free to lower the priority if it is not affecting you heavily.)

Reducing severity to 3.

Ali Siavosh

Hi Felip,

My colleague says we still have drained nodes now and then (not as many as before, though). We are going to update to 20.11.5 (the latest) on May 2nd. I may need to be in touch with you in that regard, but this is the latest status for now. Thanks.

Felip Moll

(In reply to Ali Siavosh from comment #15)

Hi Ali,

In comment 11 you said you implemented my point number 3 from comment 7. I need to know about the other points: did you get rid of the Bright setup which joined the controllers? Can you upload a new slurmd log at debug or debug2 level demonstrating the issues? They can even be legitimate drains, because if a job is unkillable then we drain the node. Let me know about this, please. Thanks!

Ali Siavosh

Hi Felip,

Let us accumulate the new log for a couple of days and I will send it to you. Thanks.

Felip Moll

(In reply to Ali Siavosh from comment #17)

Ok, I will be looking forward to your response.

Ali Siavosh

Created attachment 18925 [details]
slurmd.log cn-0017
Created attachment 18926 [details]
slurmd log for gn-0004
slurmd log in debug mode.
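When collecting logs like these, it helps to first list which nodes are drained and why. A small sketch assuming the standard Slurm CLI (`sinfo -R` prints each drained/down node with its reason; "Kill task failed" is the reason set when a step outlives UnkillableStepTimeout — exact output formatting varies by Slurm version):

```shell
#!/bin/sh
# Pull out nodes drained for unkillable tasks from `sinfo -R` output.
kill_task_drains() {
    # stdin: output of `sinfo -R`; prints only the "Kill task failed" lines
    grep -i 'kill task failed'
}

# On a live cluster:
# sinfo -R | kill_task_drains
# scontrol show node <nodename> | grep -i reason   # drill into one node
```

Matching the timestamps in these reasons against the attached slurmd logs shows whether a given log actually covers a drain event.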
Ali Siavosh

Hi Felip, I just added logs of two recently drained nodes.

Felip Moll

(In reply to Ali Siavosh from comment #21)

Hi, these logs cover less than one minute of runtime. I also see this:

```
[2021-04-13T10:15:58.693] slurmd version 19.05.5 started
[2021-04-13T10:15:58.693] killing old slurmd[85577]
```

It seems you started one slurmd while another was already running. The log I requested should cover the time when you see a node being drained, and that doesn't seem to be the case here. Have you reproduced the issue, and is this the complete log for the entire period? Can you explain? Thanks.

Ali Siavosh

Huh, that's a bit odd. Let me try again. Thanks.

Created attachment 19046 [details]
slurmd.log
Created attachment 19047 [details]
slurmd.log
Felip Moll

(In reply to Ali Siavosh from comment #25)
> Created attachment 19047 [details]
> slurmd.log

That log is more interesting. I am investigating the issue at the moment, but I need you to respond to points 1) and 4) of my comment 7. Do you really have jobacct_gather/linux on all nodes? I guess yes, since I see:

```
[2021-04-18T09:11:17.401] [13097231.extern] debug: Job accounting gather LINUX plugin loaded
```

I am investigating the log in the meanwhile.

Ali Siavosh

Hi Felip,

For #1 in comment 7, since it needs a node reboot, we may need to schedule it. For # it has not been a recent change. Shall I go ahead and revert it anyway?

Felip Moll

(In reply to Ali Siavosh from comment #27)
> For #1 in comment 7, since it needs a node reboot, we may need to schedule it.

Ok, that's something that will be needed. We don't currently support this configuration, and to be on the safe side you should split the mountpoints. Nevertheless, I still cannot be 100% sure it is the cause of your issue.

> For # it has not been a recent change. Shall I go ahead and revert it anyway?

No, if jobacct_gather/linux is set, then leave it as is.

Felip Moll

Hi Ali,

Is the issue still happening, and if it is, how often do you see it? I have also seen that the issue is happening in the extern step. Is it possible that some process is attached to the extern step (like an ssh session with pam_slurm_adopt) and the process run by this ssh is unkillable (e.g. blocked on I/O) for some reason?
Do you have an exact example of a job submission which triggers the error? Thanks.

Ali Siavosh

Hi Felip,

Since we upgraded to 20.11.5 on Sunday we have not seen any issue. It is a bit too early, but if you like we can close this and reopen the case if it happens again. Is this OK?

Thanks.

Felip Moll

(In reply to Ali Siavosh from comment #30)

This is OK. If you notice it again, please reopen it with the new info.

Thanks, Ali.
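For future reference, the unkillable-process hypothesis raised near the end of the thread (a process in the extern step blocked on I/O that no signal can kill) can be checked on a node with a short sketch. Assumptions: POSIX shell and a `ps` that supports `-eo`; the awk filter is illustrative, not a Slurm tool.

```shell
#!/bin/sh
# Find processes in uninterruptible sleep ("D" state, typically blocked on I/O).
# Such a process cannot be killed by signals; if one sits in a job's extern step
# cgroup (e.g. a session adopted by pam_slurm_adopt), the step cannot end and
# the node is eventually drained for an unkillable step.
d_state_pids() {
    # stdin: `ps -eo pid=,stat=,comm=` output; prints PIDs whose STAT starts with D
    awk '$2 ~ /^D/ {print $1}'
}

# On a live node (usually prints nothing on a healthy system):
command -v ps >/dev/null && ps -eo pid=,stat=,comm= | d_state_pids
```

Cross-referencing any PID printed here against the step's `cgroup.procs` file would confirm or rule out this failure mode.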