Ticket 11201

Summary: node drained after JOB NOT ENDING WITH SIGNALS
Product: Slurm Reporter: Ali Siavosh <Ali.Siavosh-haghighi>
Component: Other    Assignee: Felip Moll <felip.moll>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: NYUMC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf
slurmd.log
slurmd.log cn-0017
slurmd log for gn-0004
slurmd.log
slurmd.log

Description Ali Siavosh 2021-03-24 13:06:31 MDT
We are getting the error below right before the node running the job is set to drained. The log file contains the following:


debug:  _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_104300/job_1053/step_0: Device or resource busy

Also:

[root@bigpurple-hn1 slurm]# grep -B 2 -i drain /var/log/slurm/slurmctld
[2021-03-24T09:05:12.025] prolog_running_decr: Configuration for JobId=12698651_315(12702988) is complete
[2021-03-24T09:05:16.001] update_node: node gpu-0004 reason set to: Kill task failed
[2021-03-24T09:05:16.001] update_node: node gpu-0004 state set to DRAINING


[root@gpu-0004 ~]# grep 2021-03-24T09 /var/log/slurmd 
[2021-03-24T09:05:16.000] [12550589.extern] error: *** EXTERN STEP FOR 1255058 STEPD TERMINATED ON gpu-0004 AT 2021-03-24T09:05:15 DUE TO JOB NOT ENDING WITH SIGNALS ***
Comment 1 Jason Booth 2021-03-24 13:20:04 MDT
What version of Slurm are you running? 

> $ slurmctld -V
Comment 2 Ali Siavosh 2021-03-24 13:56:02 MDT
slurm 19.05.5
Comment 3 Felip Moll 2021-03-24 22:28:41 MDT
(In reply to Ali Siavosh from comment #0)
> We are getting the error below right before we get the node with the job on
> drained. The log file contains the followings:
> 
> 
> debug:  _slurm_cgroup_destroy: problem deleting step cgroup path
> /sys/fs/cgroup/freezer/slurm/uid_104300/job_1053/step_0: Device or resource
> busy

Hi,

Can you show me the output of:

cat /proc/mounts

..in one of the afflicted nodes?
Can you also upload your slurm.conf and slurmd log?

Thank you.
Comment 4 Ali Siavosh 2021-03-25 07:45:21 MDT
Created attachment 18647 [details]
slurm.conf

slurm.conf
Comment 5 Ali Siavosh 2021-03-25 07:46:21 MDT
Created attachment 18648 [details]
slurmd.log

for the same node where I ran cat /proc/mounts
Comment 6 Ali Siavosh 2021-03-25 07:47:09 MDT
[root@gpu-0004 ~]# cat /proc/mounts
rootfs / rootfs rw,size=395609072k,nr_inodes=98902268 0 0
proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=395609084k,nr_inodes=98902271,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/vg_main/root / ext4 rw,noatime,nodiratime,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=54418 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
/dev/md2 /boot ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/md0 /tmp ext4 rw,noatime,nodiratime,stripe=64,data=ordered 0 0
/dev/mapper/vg_main-var /var ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-opt /opt ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-usr_local /usr/local ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-var_log /var/log ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vg_main-var_lib /var/lib ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/md1 /boot/efi vfat rw,noatime,nodiratime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro 0 0
/dev/mapper/vg_main-var_log_audit /var/log/audit ext4 rw,noatime,nodiratime,data=ordered 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
gpfs /gpfs gpfs rw,relatime 0 0
test_hsm /test_hsm gpfs rw,relatime 0 0
Comment 7 Felip Moll 2021-03-25 10:16:56 MDT
Hi Ali.

1. I see you have the memory and freezer cgroups under the same mountpoint. This is a configuration Bright sets up, and it is not supported by Slurm at this moment. You need to fix this:

cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children 0 0

Bright sets JoinControllers=blkio,cpuacct,memory,freezer in /etc/systemd/system.conf, which mixes blkio, freezer, memory, and cpuacct into one single mountpoint.

There's a discussion in bug 7536. See specifically comments 11 and 12 there about this issue and let me know if the proposed workarounds fixed your issues.
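As a quick way to spot this condition (a sketch, assuming cgroup v1 and the standard whitespace-separated /proc/mounts format; the function name is mine, not a Slurm tool), something like the following lists any hierarchy where the freezer and memory controllers are co-mounted:

```shell
# List cgroup v1 mountpoints whose mount options show both the freezer
# and memory controllers co-mounted in a single hierarchy, which is the
# unsupported layout produced by Bright's JoinControllers setting.
joined_freezer_memory() {
    # $1: path to a mounts file (normally /proc/mounts)
    awk '$3 == "cgroup" && $4 ~ /(^|,)freezer(,|$)/ && $4 ~ /(^|,)memory(,|$)/ {print $2}' "$1"
}

# Example: joined_freezer_memory /proc/mounts
# Empty output means freezer and memory are in separate hierarchies.
```

On the /proc/mounts output below, this would print /sys/fs/cgroup/blkio,cpuacct,memory,freezer.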


2. The slurmd log you attached is not useful.

(recommendation) Please make sure to set the debug levels in slurm.conf to more verbose, human-readable values, for example:

SlurmctldDebug=debug
SlurmdDebug=debug

Then we may catch more in the logs.

3. In your slurm.conf you have this:

# Was 120 but nodes were draining because it takes too long to kill some jobs.
# If this change doesn't do it, we'll file a bug with SchedMD.
UnkillableStepTimeout=300

That won't work. You need to set UnkillableStepTimeout to a value below 127, or the node draining mechanism won't work correctly. There's a new bug we discovered here: bug 11103
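For example, a value that stays within that constraint (a sketch only; the rest of slurm.conf stays as attached):

```
# slurm.conf: keep UnkillableStepTimeout below 127 until bug 11103 is addressed
UnkillableStepTimeout=120
```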

4. JobAcctGather plugin: I see you commented out the /cgroup plugin and set the /linux one. Did this change happen recently? I wouldn't expect an issue like the one you're seeing with jobacct_gather/linux, but maybe you made this change without restarting the entire cluster, and old jobs are being hit by it. Is that possible?

JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup


Please fix these 4 points and give me feedback on how it goes.

Thanks
Comment 8 Felip Moll 2021-03-25 10:18:09 MDT
I forgot to mention,

19.05.5 is not supported. I encourage you to upgrade to our latest stable when possible. Is that something you have in mind currently?

Thanks
Comment 9 Ali Siavosh 2021-03-25 12:50:48 MDT
Yes, it will be coming in the next scheduled maintenance.

Thanks

Ali Siavosh-Haghighi, PhD
Sr. HPC System Administrator, High-Performance Computing

NYU Langone Health
Medical Center Information Technology
227 E 30th St, #7-738
New York, NY 10016

O: 646-524-0860
C: 347-843-2357
siavoa01@nyumc.org<mailto:siavoa01@nyumc.org>
nyulangone.org

Comment 10 Felip Moll 2021-03-26 05:59:38 MDT
(In reply to Ali Siavosh from comment #9)
> Yes will be coming in the next scheduled maintenance.
> 
> Thanks
> 

Ok, please, let me know when you apply the changes mentioned in comment 7.

Regards
Comment 11 Ali Siavosh 2021-03-26 10:36:44 MDT
Hi Felip,
Thank you very much for your guidance.
We are going to have a maintenance on May 2nd when we will upgrade slurm. 


I implemented the fix for issue #3 and we will monitor the system.


For implementing #1:
1. As I understand it, commenting out the corresponding line in system.conf, as below, should do it:
#JoinControllers=blkio,cpuacct,memory,freezer
Also, as I understand it, there is no need for a reboot. However, after I make this change and remount, "mount | grep cgroup" still shows:

cgroup on /sys/fs/cgroup/blkio,cpuacct,memory,freezer type cgroup (rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children)

Am I missing something?
Comment 12 Felip Moll 2021-03-26 11:15:55 MDT
(In reply to Ali Siavosh from comment #11)
> Hi Felip,
> Thank you very much for you guide.
> We are going to have a maintenance on May 2nd when we will upgrade slurm. 

Cool. Just open a new bug if questions about the upgrade arise.

> I implemented the #3 issue mentioned and we will monitor the system.
>

Good.

> 
> For implementing #1:
> 1. As I understand commenting the corresponding line as below in system.conf
> should do it:
> #JoinControllers=blkio,cpuacct,memory,freezer
> And again as I understand there is no need for reboot. Though after I make
> this change and remount, still the "mount |grep cgroup" shows:
> 
> cgroup on /sys/fs/cgroup/blkio,cpuacct,memory,freezer type cgroup
> (rw,nosuid,nodev,noexec,relatime,blkio,freezer,memory,cpuacct,clone_children)
> 
> am I missing something?

I checked my notes, and unfortunately it seems a node reboot is needed. I don't think there's an easy way to unmount and remount in the correct place.
Please let me know if you find a reliable method.

Just for more information, I leave here the systemd documentation explaining JoinControllers:

JoinControllers=cpu,cpuacct net_cls,netprio
Configures controllers that shall be mounted in a single hierarchy. By default, systemd will mount all controllers which are enabled in the kernel in individual hierarchies, with the exception of those listed in this setting. Takes a space-separated list of comma-separated controller names, in order to allow multiple joined hierarchies. Defaults to 'cpu,cpuacct'. Pass an empty string to ensure that systemd mounts all controllers in separate hierarchies.

Note that this option is only applied once, at very early boot. If you use an initial RAM disk (initrd) that uses systemd, it might hence be necessary to rebuild the initrd if this option is changed, and make sure the new configuration file is included in it. Otherwise, the initrd might mount the controller hierarchies in a different configuration than intended, and the main system cannot remount them anymore.
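Putting that together, the change implied above would look roughly like this (a sketch; assuming a Bright-managed /etc/systemd/system.conf, and remembering that a reboot, and possibly an initrd rebuild, is required for it to take effect):

```
# /etc/systemd/system.conf
[Manager]
# Remove the Bright-provided JoinControllers line, or set it empty so
# systemd mounts every controller in its own hierarchy:
JoinControllers=
```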



P.S.:
------
Sorry, I realize now that I did not point to the correct comments. In my comment 7, when I said:

> There's a discussion in bug 7536. See specifically comments 11 and 12 there about this issue and let me know if the proposed workarounds fixed your issues.

I should have pointed to bug 7536 comments 14 and 18.
Comment 13 Felip Moll 2021-03-30 06:16:46 MDT
Hi Ali,

Just let me know how the system behaves with these changes; I am interested in seeing whether the issues have ceased.

(Note your bug is tagged as sev-2, so I am monitoring it more proactively; feel free to lower the severity if it is not affecting you so heavily.)
Comment 14 Felip Moll 2021-04-05 05:47:50 MDT
Reducing sev. to 3.
Comment 15 Ali Siavosh 2021-04-06 08:14:38 MDT
Hi Felip,
My colleague says we still see drained nodes now and then (not as many as before, though). We are going to update to 20.11.5 (the latest) on May 2nd. I may need to be in touch with you in that regard, but this is the latest status for now.
Thanks
Comment 16 Felip Moll 2021-04-07 06:02:33 MDT
(In reply to Ali Siavosh from comment #15)
> Hi Felip,
> My colleague says we still have now and then drained nodes (not as much as
> before though). We are going to update to 20.11.5 (the latest) on May 2nd. I
> may need to be in touch with you in that regard. But this is the latest
> status for now.
> Thanks

Hi Ali,

In comment 11 you said you implemented my point number 3 from comment 7.

I need to know about the other points, did you get rid of the Bright setup which joined controllers?

Can you upload a new slurmd log at debug or debug2 level demonstrating the issues?

They could even be legitimate drains, because if a job is unkillable then we drain the node.

Please let me know about this.

Thanks!
Comment 17 Ali Siavosh 2021-04-07 13:43:56 MDT
Hi Felip,
Let us accumulate the new logs for a couple of days and I will send them to you.


Thanks

Felip Moll changed bug 11201:
Summary: "problem deleting step cgroup path" changed to "node drained after JOB NOT ENDING WITH SIGNALS"
Comment 18 Felip Moll 2021-04-08 02:44:08 MDT
(In reply to Ali Siavosh from comment #17)
> Hi Felip,
> Let us accumulate the new log for couple of days and I will send them to you.
> 

OK, I look forward to your response.
Comment 19 Ali Siavosh 2021-04-13 09:29:34 MDT
Created attachment 18925 [details]
slurmd.log cn-0017
Comment 20 Ali Siavosh 2021-04-13 09:30:30 MDT
Created attachment 18926 [details]
slurmd log for gn-0004

slurmd log at debug level.
Comment 21 Ali Siavosh 2021-04-13 09:46:19 MDT
Hi Felip,
I just added logs from two recently drained nodes.
Comment 22 Felip Moll 2021-04-14 08:26:51 MDT
(In reply to Ali Siavosh from comment #21)
> Hi Felip,
> I just added logs of two recent drained nodes.

Hi, these logs cover less than one minute of runtime...

I also see this:

[2021-04-13T10:15:58.693] slurmd version 19.05.5 started
[2021-04-13T10:15:58.693] killing old slurmd[85577]

It seems you started one slurmd while another was already running.

The log I requested should cover the time when you see a node being drained, and that doesn't seem to be the case. Have you reproduced the issue, and is this the complete log for the entire period?

Can you explain? Thanks
Comment 23 Ali Siavosh 2021-04-14 08:54:46 MDT
Huh, that's a bit odd. Let me try again.


Thanks

Comment 24 Ali Siavosh 2021-04-20 08:30:26 MDT
Created attachment 19046 [details]
slurmd.log
Comment 25 Ali Siavosh 2021-04-20 08:34:55 MDT
Created attachment 19047 [details]
slurmd.log
Comment 26 Felip Moll 2021-04-20 09:42:52 MDT
(In reply to Ali Siavosh from comment #25)
> Created attachment 19047 [details]
> slurmd.log

That log is more interesting. I am investigating the issue at the moment, but I need you to respond to points 1) and 4) of my comment 7.

Do you really have jobacct_gather/linux on all nodes? I assume yes, since I see:

[2021-04-18T09:11:17.401] [13097231.extern] debug:  Job accounting gather LINUX plugin loaded

I am investigating the log in the meantime.
Comment 27 Ali Siavosh 2021-04-20 13:53:03 MDT
Hi Felip,
For #1 in comment 7, since it needs a node reboot, we may need to schedule it.
For #4: it has not been a recent change. Shall I still go ahead and revert it?
Comment 28 Felip Moll 2021-04-20 14:40:28 MDT
(In reply to Ali Siavosh from comment #27)
> Hi Felip,
> For #1 in comment 7 since it needs node reboot we may need to schedule it.

OK, that is something that will be needed. We don't support this configuration for now, and to be on the safe side you should split the mountpoints.

Nevertheless, I still cannot be 100% sure it is the cause of your issue.

> For # It has not been a new change. Shall I go ahead and revert it still?

No, if jobacctgather/linux is set, then leave it as is.
Comment 29 Felip Moll 2021-05-05 06:44:03 MDT
(In reply to Felip Moll from comment #28)
> (In reply to Ali Siavosh from comment #27)
> > Hi Felip,
> > For #1 in comment 7 since it needs node reboot we may need to schedule it.
> 
> Ok, that's something it would be needed. We don't support for now this
> configuration and to be in the safe side you should split the mountpoints.
> 
> Nevertheless I can still not ensure 100% it is the cause of your issue.
> 
> > For # It has not been a new change. Shall I go ahead and revert it still?
> 
> No, if jobacctgather/linux is set, then leave it as is.

Hi Ali,

Is the issue still happening, and if it is, how often do you see it?

I have also seen that the issue happens in the extern step. Is it possible that some process is attached to the extern step (like an ssh session via pam_slurm_adopt) and that the process run by this ssh is unkillable (e.g. blocked on I/O) for some reason?

Do you have the exact example of a job submission which triggers the error?

Thanks
Comment 30 Ali Siavosh 2021-05-05 07:42:22 MDT
Hi Felip.
Since Sunday, when we upgraded to 20.11.5, we have not seen any issue. It is a bit too early, but if you like we can close it, and if it happens again we will reopen the case. Is this OK?

Thanks

Comment 31 Felip Moll 2021-05-05 09:38:04 MDT
(In reply to Ali Siavosh from comment #30)
> Hi Felip.
> From Sunday that we upgraded to 20.11.5 we have not seen any issue. It is a
> bit to early but if you like we can close it and if it happens again we
> reopen the case. Is this OK?
> 
> Thanks

This is OK. If you notice it again, please reopen it with the new info.

Thanks Ali