Description
Mark Allen
2023-03-28 16:18:56 MDT
Hi Mike,

How often does this issue happen? We need a bit more information from a node that failed:

1. Kernel version (uname -a)
2. cat /proc/mounts
3. Are you using Weka or Bright?
4. If you allow ssh into the nodes and you use pam_slurm_adopt, can you show me the output of: grep -ri systemd /etc/pam.d/

It would be ideal to set SlurmdDebug=debu2 and DebugFlags=cgroup, then reconfigure or restart the slurmd daemons in the cluster. We need to catch more logging information in slurmd.

> It would be ideal to set SlurmdDebug=debu2 and DebugFlags=cgroup, then
> reconfigure or restart slurmd daemons in the cluster. We need to catch more
> logging information in slurmd.

"debug2", not "debu2" -- sorry for the typo :)
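For reference, the two settings above can be staged and sanity-checked before reconfiguring. A minimal sketch (a temporary file stands in for the live slurm.conf here; on the cluster you would edit the real file and then run `scontrol reconfig` or restart slurmd):

```shell
# Sketch: stage the requested debug settings and verify they parse out.
# A temp file stands in for the real slurm.conf in this illustration.
conf=$(mktemp)
printf 'SlurmdDebug=debug2\nDebugFlags=cgroup\n' >> "$conf"
# Confirm both settings are present and spelled correctly:
grep -E '^(SlurmdDebug|DebugFlags)=' "$conf"
# On the cluster, apply with: scontrol reconfig  (or restart slurmd)
```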
Looking for a signature in our slurmd logs, it appears that it's happened for about 1,700 of the last 300,000 jobs, so maybe half a percent or so. Looking at the gross attributes of those jobs, there's no obvious pattern. It's happening across hosts/partitions with differing amounts of memory, different job memory requests, different users, etc.

Kernel is the same for all hosts:

Linux n002 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@n001 ~]# cat /proc/mounts
proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=65959208k,nr_inodes=16489802,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=23,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=21590 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fs1 /gpfs gpfs rw,relatime 0 0
gvfsd-fuse /gpfs/home/msconce/.gvfs fuse.gvfsd-fuse rw,nosuid,nodev,relatime,user_id=1221,group_id=1159 0 0
gvfsd-fuse /gpfs/home/ayb/.gvfs fuse.gvfsd-fuse rw,nosuid,nodev,relatime,user_id=2129,group_id=1159 0 0

We use Bright, but not the SLURM from there. We compile our own from source. I suppose it's (barely) possible that we're picking up bits from the Bright version, though that seems unlikely.

[root@n001 ~]# grep -ri systemd /etc/pam.d/
/etc/pam.d/fingerprint-auth-ac:-session optional pam_systemd.so
/etc/pam.d/password-auth-ac:#-session optional pam_systemd.so
/etc/pam.d/password-auth~:-session optional pam_systemd.so
/etc/pam.d/runuser-l:-session optional pam_systemd.so
/etc/pam.d/smartcard-auth-ac:-session optional pam_systemd.so
/etc/pam.d/system-auth-ac:-session optional pam_systemd.so
/etc/pam.d/system-auth~:-session optional pam_systemd.so
/etc/pam.d/systemd-user:# This file is part of systemd.
/etc/pam.d/systemd-user:# Used by systemd --user instances.
/etc/pam.d/lightdm-greeter:session optional pam_systemd.so
[root@n001 ~]# grep -ri adopt /etc/pam.d/
/etc/pam.d/sshd:# Pam Slurm adopt:
/etc/pam.d/sshd:account sufficient pam_slurm_adopt.so

I've set the logging and will watch for an instance.
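An aside on the pam.d output above: the pam_slurm_adopt documentation recommends disabling pam_systemd for ssh sessions, since under cgroup v1 systemd will otherwise move the session into its own cgroup hierarchy rather than leaving it in the job's. A hedged sketch of that check, run here against canned pam.d lines rather than the live files:

```shell
# Illustrative check: warn if pam_systemd is active alongside
# pam_slurm_adopt. (Canned lines here; on a node you would grep
# the real /etc/pam.d/ files, as shown in the comment above.)
pam_lines='account sufficient pam_slurm_adopt.so
-session optional pam_systemd.so'
if echo "$pam_lines" | grep -q 'pam_systemd'; then
  echo "pam_systemd present; may conflict with pam_slurm_adopt"
fi
```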
(Incidentally, it's "SlurmdDebug", and it doesn't appear that a "scontrol reconfig" is sufficient to activate it. Or maybe I did it wrong.) Thanks.

Here are some log lines from an occurrence. (They're not marked 'debug2', but I think it's enabled nonetheless.) The earlier lines mentioning this job look unremarkable. The slurmd daemon was restarted while the job was running; perhaps that's a factor.

...
[2023-03-30T15:43:46.185] debug: credential for job 24810326 revoked
[2023-03-30T15:43:46.185] [24810326.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.185] [24810326.batch] debug: _handle_signal_container for StepId=24810326.batch uid=450 signal=18
[2023-03-30T15:43:46.186] [24810326.batch] Sent signal 18 to StepId=24810326.batch
[2023-03-30T15:43:46.186] [24810326.extern] debug: Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.186] [24810326.extern] debug: _handle_signal_container for StepId=24810326.extern uid=450 signal=18
[2023-03-30T15:43:46.187] [24810326.extern] Sent signal 18 to StepId=24810326.extern
[2023-03-30T15:43:46.187] [24810326.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.187] [24810326.batch] debug: _handle_signal_container for StepId=24810326.batch uid=450 signal=15
[2023-03-30T15:43:46.187] [24810326.batch] error: *** JOB 24810326 ON n182 CANCELLED AT 2023-03-30T15:43:46 ***
[2023-03-30T15:43:46.187] [24810326.batch] Sent signal 15 to StepId=24810326.batch
[2023-03-30T15:43:46.188] [24810326.extern] debug: Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.188] [24810326.extern] debug: _handle_signal_container for StepId=24810326.extern uid=450 signal=15
[2023-03-30T15:43:46.188] [24810326.extern] Sent signal 15 to StepId=24810326.extern
[2023-03-30T15:43:46.188] [24810326.extern] debug: signaling condition
[2023-03-30T15:43:46.188] [24810326.extern] debug: task/affinity: task_p_post_term: affinity StepId=24810326.extern, task 0
[2023-03-30T15:43:46.188] [24810326.extern] error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1711/job_24810326/step_extern' : No such file or directory
[2023-03-30T15:43:46.188] [24810326.extern] error: common_cgroup_lock error: No such file or directory
[2023-03-30T15:43:46.188] [24810326.extern] debug: task_g_post_term: task/cgroup: Unspecified error
[2023-03-30T15:43:46.188] [24810326.extern] error: problem with oom_pipe[0]
[2023-03-30T15:43:46.188] [24810326.extern] fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument
[2023-03-30T15:43:46.188] [24810326.extern] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2023-03-30T15:43:46.189] [24810326.batch] debug: Handling REQUEST_STATE
[2023-03-30T15:43:46.209] [24810326.batch] debug: Handling REQUEST_STATE
[2023-03-30T15:43:46.242] [24810326.batch] task 0 (151774) exited. Killed by signal 15.
[2023-03-30T15:43:46.243] [24810326.batch] debug: task/affinity: task_p_post_term: affinity StepId=24810326.batch, task 0
[2023-03-30T15:43:46.243] [24810326.batch] error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1711/job_24810326/step_batch' : No such file or directory
[2023-03-30T15:43:46.243] [24810326.batch] error: common_cgroup_lock error: No such file or directory
[2023-03-30T15:43:46.243] [24810326.batch] debug: task_g_post_term: task/cgroup: Unspecified error
[2023-03-30T15:43:46.243] [24810326.batch] error: problem with oom_pipe[0]
[2023-03-30T15:43:46.243] [24810326.batch] fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument
[2023-03-30T15:43:46.260] debug: _step_connect: connect() failed for /cm/local/apps/slurm/var/spool/n182_24810326.4294967291: Connection refused
[2023-03-30T15:43:46.260] debug: Cleaned up stray socket /cm/local/apps/slurm/var/spool/n182_24810326.4294967291
[2023-03-30T15:43:46.260] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job24810326/slurm_script
[2023-03-30T15:43:46.260] debug: _step_connect: connect() failed for /cm/local/apps/slurm/var/spool/n182_24810326.4294967292: Connection refused
[2023-03-30T15:43:46.260] debug: Cleaned up stray socket /cm/local/apps/slurm/var/spool/n182_24810326.4294967292
[2023-03-30T15:43:46.260] debug: Waiting for job 24810326's prolog to complete
[2023-03-30T15:43:46.260] debug: Finished wait for job 24810326's prolog to complete
[2023-03-30T15:43:48.524] debug: Note large processing time from prep_g_epilog: usec=2264129 began=15:43:46.260
[2023-03-30T15:43:48.524] debug: completed epilog for jobid 24810326
[2023-03-30T15:43:48.525] debug: JobId=24810326: sent epilog complete msg: rc = 0

> [2023-03-30T15:43:48.524] debug: Note large processing time from
> prep_g_epilog: usec=2264129 began=15:43:46.260
> [2023-03-30T15:43:48.524] debug: completed epilog for jobid 24810326
> [2023-03-30T15:43:48.525] debug: JobId=24810326: sent epilog complete msg:
> rc = 0
I am not sure the logs are properly configured, since I don't see any "CGROUP" log messages, nor anything at debug2.
Can you run 'scontrol show config' and paste the output here? This is what is needed in slurm.conf:
SlurmdDebug=debug2
DebugFlags=cgroup
I checked, and a 'scontrol reconfig' is enough for slurmd to pick up the new log level. I understand you start slurmd through 'systemctl start', right?
Please, can you double-check?
Not sure what happened. Here's the current state:

root@hpc-hn1 ~ 669# scontrol show config | fgrep -ie debug
DebugFlags              = Cgroup
SlurmctldDebug          = debug2
SlurmctldSyslogDebug    = (null)
SlurmdDebug             = debug2
SlurmdSyslogDebug       = (null)

All daemons were restarted (systemctl restart) at least once, but perhaps I missed something. In any case, I'll take another run at this.

After a restart of all slurmd's, there was an instance of the bug, and the log lines look rather similar (missing what you want to see).

I'm remembering now discovering that each job is actually managed by one or more slurmstepd processes, and these do _not_ restart when slurmd is restarted. They also fail to pick up at least some parameter changes from slurm.conf, even if slurmctld and slurmd are restarted. (In that particular case, I changed some port numbers, which made quite a mess.)

So, a theory here is that already-existing jobs won't pick up these flag changes. I will watch for an instance of the bug in a job started _after_ our debug2 flag changes.

(In reply to Mark Allen from comment #7)
> After a restart of all 'slurmd', there was an instance of the bug, and the
> log lines look rather similar (missing what you want to see).
>
> I'm remembering now discovering that each job is actually managed by one or
> more slurmstepd, and these do _not_ restart when slurmd is restarted. They
> also fail to pick up at least some parameter changes from slurm.conf, even
> if slurmctld and slurmd are restarted. (In that particular case, I changed
> some port numbers, which made quite a mess.)

Ah, if the situation happens on old steps, it is expected that debug2 or CGROUP log messages don't show up. As you say, slurmstepds won't get the new settings; the settings will affect only new steps. I expect to see debug2 messages from slurmd itself at least, but not what we're after. Sorry, I assumed you were running new jobs.

> So, a theory here is that already existing jobs won't pick up these flag
> changes. I will watch for an instance of the bug for a job started _after_
> our debug2 flag changes.

Yes, that's what we should wait for.

We've seen a problem that may or may not be exactly the same issue, but it seems likely. I'm not sure what to upload, so here are a few things to start with. A few initial observations:

- There are a lot of zombie slurmd processes. That's surely a bug.
- There are processes from several rather old jobs. Checking one, it seems to be in the "completing" state. That's surely an issue as well.
- I'm not seeing "debug2" anywhere. I'm not sure whether that indicates we're not seeing log messages at that level. The macros are a bit of a mess to read; is "debug2" specifically supposed to be present in the log message?

Attachment uploads to follow.

# showjob 24106982
JobId= 24106982
Partition= bgmp
Account= bgmp
UserId=isisd (Isis Monarrez)
JobName=bash
NumNodes= 2
NumCPUs=2 NumTasks=2 Tasks/Node=0 CPUs/Task=1
TRES= cpu=2 mem=64G node=2
MinMemoryNode=32G
BatchHost= n278
NodeList=n[278-279]
SubmitTime= 2023-02-13T14:52:58
StartTime= 2023-02-13T14:52:58
EndTime= 2023-02-13T20:05:18
RunTime= 05:12:20
TimeLimit= 10:00:00
WorkDir= /gpfs/projects/bgmp/isisd/bioinfo/OHSU
Command= bash
JobState= COMPLETING Reason=None ExitCode=0:0

Created attachment 29681 [details]
ps -eflyH output
Created attachment 29682 [details]
slurmd log
Created attachment 29683 [details]
slurm cgroup tree
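Since slurmstepd keeps the settings it was launched with, one way to see which steps would still be logging at the old level is to compare slurmstepd start times against the time of the reconfigure. A rough sketch over canned `ps -eo lstart=,pid=,comm=` style output (the dates and PIDs here are illustrative, not taken from the attachments):

```shell
# Hypothetical sketch: steps launched before the reconfigure still log
# at the old level. Canned ps-style output for illustration; on a node
# you would pipe real output from: ps -eo lstart=,pid=,comm=
ps_out='Thu Mar 30 09:00:00 2023  4801 slurmstepd
Thu Mar 30 09:00:00 2023  4808 slurmstepd
Fri Mar 31 10:15:00 2023  9999 slurmstepd'
reconf_day='Mar 31'   # day the debug2 change was applied (assumed)
# Count steps started before that day (still on old settings):
echo "$ps_out" | grep -vc "$reconf_day"
```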
- Can I ask for a "dmesg -T" on the node?
- Can I see your prolog script?
- Can I see your most recent slurm.conf?

Things I noticed:

- The compute node has been running since 2022 without any reboot.
- There are VNC and Mate (graphical) sessions, with, for example, Rstudio or Firefox running since 2022 (users clafranc and luizah). Is this expected?
- slurmd is started with -vv, which overrides the debug level set in slurm.conf:
  /cm/shared/apps/slurm/current/sbin/slurmd --conf-server hpc-hn1,hpc-hn2 -vv
  How do you start slurmd? Can you remove this -vv? The configuration is taken from these two servers; I guess you changed the debug2 level on these servers and not only on the slurmd node, right?
- A couple of the zombie processes are prolog threads:
  Z root 27781 37651 0 80 0 0 0 do_exi 10:29 ? 00:00:00 [prolog] <defunct>
  Can I see your prolog scripts?
- In the log, at least we can see the "CGROUP:" messages.
- The zombie processes are from 10:29, the same time a bunch of jobs/steps were started on the node, all at once. Part of a job array?
- I see the following. It looks like the stdout file doesn't exist, and the batch step fails. I guess this job just ended in FAILED state?

[2023-04-04T10:29:55.744] [25584694.batch] error: Could not open stdout file /home/jhass2/jamming/JacobData/logs/ScatteringSweepLonger/25584454-239.out: No such file or directory
[2023-04-04T10:29:55.744] [25584694.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-04-04T10:29:55.744] [25584694.batch] debug: signaling condition
[2023-04-04T10:29:57.527] [25584722.batch] error: Could not open stdout file /home/jhass2/jamming/JacobData/logs/ScatteringSweepLonger/25584454-267.out: No such file or directory
[2023-04-04T10:29:57.527] [25584722.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-04-04T10:29:57.527] [25584722.batch] debug: signaling condition
[2023-04-04T10:29:57.528] [25584726.batch] error: Could not open stdout file /home/jhass2/jamming/JacobData/logs/ScatteringSweepLonger/25584454-271.out: No such file or directory
[2023-04-04T10:29:57.528] [25584726.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-04-04T10:29:57.528] [25584726.batch] debug: signaling condition
... etc ...

- Have you experienced any GPFS issues? (If possible, check the logs.)
- What are your system limits (ulimit)? Can this be related to a limits issue? (sysctl -a / https://slurm.schedmd.com/high_throughput.html)

Hi Mark,

Do you have any feedback for me? Thanks

Hi Felip,

Apologies for letting this go stale. It is definitely still important to us. I will try to get responses tomorrow.

As a meta comment, we're in the process of upgrading to RHEL 8 in about six months, which will mean cgroups v2. Given that window, even just mitigation of the problems would be great, even if the underlying problems aren't solved. What do I mean by that? For example, if restarting all slurmd's every 15 minutes would make this go away, we'd do it. (Yes, that's ridiculous.) If you have any ideas on that front, we're all ears.

Again, though, I'll try to get you answers to your questions soon.

Regards,
Mike

Created attachment 29941 [details]
dmesg -T
Created attachment 29942 [details]
sysctl out
Created attachment 29943 [details]
systemd unit file
Created attachment 29944 [details]
slurm.conf
Created attachment 29945 [details]
cgroup.conf
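The zombie prolog processes flagged above can be spotted generically from the ps state column. A hedged sketch using canned `ps -eo stat=,pid=,comm=` style lines (the real output is in the ps attachment; these lines are illustrative):

```shell
# Illustrative sketch: count zombie (defunct) processes by state flag.
# Canned ps-style lines; on a node: ps -eo stat=,pid=,comm=
ps_stat='Z 27781 prolog
S  3085 slurmd
Z 27790 prolog
S  4801 slurmstepd'
# State beginning with Z means the parent never reaped the child:
echo "$ps_stat" | awk '$1 ~ /^Z/ {n++} END {print n+0}'
```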
Hi,

We're now seeing 'debug2' messages in the logs, so I'm going to look for a fresh instance. In case it might help, I've uploaded a few things you mentioned from the original instance, but you're welcome to await the new instance. (Note that the two .conf files were grabbed today, so they might be a bit different; probably not in a way that matters here.)

Regarding your questions: yes, this node has been up for four or five months. We try to avoid taking nodes down except during maintenance events, twice per year.

The VNC/X sessions are expected and were started within Open OnDemand jobs. Those jobs ended long ago, so it's quite unexpected that the processes are still present. That said, some processes in those sessions do immediately re-parent to PID 1. My thought is that they still cannot escape the job's cgroup and that slurmd will thus kill them at the job's end no matter what. I'm fuzzy on the details of cgroups; I certainly hope that's true. In a test desktop job just now, I do see those re-parented processes in the output of 'systemctl status slurmd', which makes me think that they have not escaped.

I saw the zombie processes as well. These seem like processes that slurmd should somehow have waited for.

The prolog files come largely or completely from Bright. They don't seem to do much besides 'scontrol show' commands. I can still upload them if you like.

No GPFS issues that we're aware of. And in any case, I'd expect slurmd to throw a huge fit if there were any such disk I/O errors.

I'll work on capturing a new instance. Thanks for your help.

Mike

(In reply to Mark Allen from comment #21)
> Hi,
>
> We're now seeing 'debug2' messages in the logs, so I'm going to look for a
> fresh instance. In case it might help, I've uploaded a few things you
> mentioned from the original instance, but you're welcome to await the new
> instance.

The only interesting thing I can see is the numerous OOM kills:

]$ grep Killed n289-stuck-dmesg-20230420

but indeed I don't see anything related to NFS or the filesystem. I think it's worth waiting for a new instance if you're seeing debug2 messages.

> Regarding your questions, yes, this node has been up for four or five
> months. We try to avoid taking them down except during maintenance events,
> twice per year.
>
> The VNC/X sessions are expected and were started within Open OnDemand jobs.
> Those jobs ended long ago, so it's quite unexpected that the processes are
> still present.

A while ago we had a problem with X sessions and a fork(), where we were calling non-async-signal-safe functions after the fork, but we fixed that back in 2020:

commit 8fb07498b3fdc1e1ca7bb0aa2c95785ab2956f42
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Wed Jul 8 17:30:11 2020 +0200
Commit:     Tim Wickberg <tim@schedmd.com>
CommitDate: Fri Nov 6 10:10:26 2020 -0700

    Make x11 forwarding setup async-signal-safe.

This was causing random issues in X apps that suddenly crashed. When you fork a multi-threaded process, it is possible that one thread holds locks, and the forked process inherits those locks. That's why it is important to get the slurmd log; I want to see the process and what slurmd was doing around the time the issue happened.

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1699/job_25139838/step_batch' : No such file or directory
slurmstepd: error: common_cgroup_lock error: No such file or directory
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

There are some minor fixes around this (like e8bedd62f6bfdd338a90db3a7eb66c7958446625), and we know some places that are still uncorrected. It would be good to know whether all the jobs having this issue are X11-related, or always Python.

> That said, some processes in those session do immediately re-parent to PID
> 1. My thought is that they still cannot escape the job's cgroup and that
> slurmd will thus kill them at the job's end no matter what. I'm fuzzy on
> the details of cgroups--I certainly hope that's true.

slurmstepd is the one in charge of user processes, not slurmd directly.

One other idea is that these graphical or Python apps are running processes through a launcher (like systemd-run, or calls to dbus and systemd, which is possible since they're in a graphical session) which run outside Slurm's control, because they are really started by systemd and not Slurm. If you see processes outside our cgroups, that may be the cause, but that shouldn't make the jobs go stuck.

> In a test desktop job just now, I do see those re-parented processes in the
> output of 'systemctl status slurmd', which makes me think that they have not
> escaped.
>
> I saw the zombie processes as well. These seem like processes that slurmd
> should somehow have waited for.

If they are re-parented to PID 1, their parent is dead, which means slurmstepd is dead. That reinforces my theory that slurmstepd terminates abruptly or doesn't wait for the forked process for some reason.

> The prolog files are largely or completely coming from Bright. They don't
> seem to do much besides 'scontrol show' commands. I can still upload if you
> like.

Ok, we'll table that for now.

> No GPFS issues that we're aware of. And in any case, I'd expect slurmd to
> throw a huge fit if there were any such disk I/O errors.
>
> I'll work on capturing a new instance. Thanks for your help.
>
> Mike

Thanks Mike.

Thanks for your reply. I think we've made some real progress on our end, which I will detail in the next comment. Replying to your current comments first: yes, the OOM killer seems to be part of all of this. It's somewhat plausible that the triggering jobs involve X11 (likely a server) or, I suppose, Python. Nothing magical about these, as far as I know, but maybe their typical behavior is a trigger.

As far as I know, we don't specifically enable any Slurm-specific X11 stuff. On the other hand, the 'srun --x11' flag seems to work. Not sure if this is involved. But also, I don't think Open OnDemand desktop sessions use this feature; that is, I think they set up their own forwarding sockets.

I'm seeing the "Invalid argument" messages occasionally. If this is a true EINVAL, it makes me wonder whether these calls are buggy somehow. (See next comment also.)

If user processes can start significant programs via the system 'systemd' or 'dbus-daemon', that's news to me, but I'm not very familiar with these.

Re "If they are re-parented to pid 1 this means its parent is dead, which means slurmstepd is dead": I don't think the second part follows. That is, the parent is dead, but wasn't necessarily slurmstepd. Based on your comments, though, re-parenting (or forking) does seem like a place where something could be going amiss.

Next comment coming up.

Starting an OOD desktop session on an unused node did seem to reproduce at least part of this. (If nothing else, it definitely demonstrates a Slurm bug.) In short, this job is exceeding its Slurm memory request by a lot. On a fully used node, this would shortly lead to a lot of trouble. Not really sure which processes get whacked first, but it's all bad.

Here are some bits that seem relevant. I will try to gather more data, assuming it really is reproducible. My first thought is to capture high-quality strace output for the entire job, starting from slurmd. What else would be useful?
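On the question of whether processes that re-parent to PID 1 have escaped the job: membership can be read directly from /proc/<pid>/cgroup. A minimal sketch against canned cgroup v1 contents (the uid and job id here are illustrative):

```shell
# Hypothetical check: a process re-parented to PID 1 may still be
# confined by the job's cgroup. On a node: cat /proc/<pid>/cgroup
cgv1='10:memory:/slurm/uid_1345/job_25852232/step_batch
4:cpuset:/slurm/uid_1345/job_25852232/step_batch
1:name=systemd:/user.slice/user-1345.slice'
if echo "$cgv1" | grep -q ':memory:/slurm/'; then
  echo "still in the job memory cgroup"
else
  echo "escaped the job cgroup"
fi
```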
Mike

Linux n140 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

### mainly note 32G memory
# showjob 25852232
JobId= 25852232
Partition= racs
Account= hpcrcf
UserId=mcolema5 (Michael Coleman)
JobName=sys/dashboard/sys/bc_talapas_desktop
NumNodes= 1
NumCPUs=4 NumTasks=1 Tasks/Node=0 CPUs/Task=4
TRES= cpu=4 mem=32G node=1
MinCPUsNode=4 MinMemoryNode=32G
BatchHost= n140
NodeList=n140
SubmitTime= 2023-04-20T13:21:39
StartTime= 2023-04-20T13:21:43
EndTime= 2023-04-30T13:21:43
RunTime= 1-04:21:47
TimeLimit= 10-00:00:00
WorkDir= /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a
Command= (null)
StdOut= /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
StdErr= /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
JobState= RUNNING Reason=None ExitCode=0:0

### same from raw scontrol
# scontrol show job 25852232
JobId=25852232 JobName=sys/dashboard/sys/bc_talapas_desktop
   UserId=mcolema5(1345) GroupId=uoregon(131) MCS_label=N/A
   Priority=1200733 Nice=0 Account=hpcrcf QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-04:34:55 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2023-04-20T13:21:39 EligibleTime=2023-04-20T13:21:39
   AccrueTime=2023-04-20T13:21:39
   StartTime=2023-04-20T13:21:43 EndTime=2023-04-30T13:21:43 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-04-20T13:21:43 Scheduler=Main
   Partition=racs AllocNode:Sid=talapas-ln2:45953
   ReqNodeList=(null) ExcNodeList=(null) NodeList=n140 BatchHost=n140
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=32G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a
   Comment=
   StdErr=/home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
   StdIn=/dev/null
   StdOut=/home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
   Power=

### Started desktop, then gnome terminal within, then python3 in terminal.
### Within that, generated huge string (see below). I stopped it and started
### another python3 to use more memory. Obviously, this is blowing way past
### any Slurm job limits. The job is still running after several days,
### probably partly because there have been no other jobs run on this node.

top - 17:55:02 up 43 days, 4:21, 2 users, load average: 0.00, 0.01, 0.05
Tasks: 418 total, 1 running, 414 sleeping, 3 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19799787+total, 73960536 free, 11990907+used, 4128264 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 75955136 avail Mem

  PID USER     PR NI    VIRT    RES    SHR S %CPU %MEM   TIME+ COMMAND
 7374 mcolema5 20  0   93.3g  93.1g   2968 T  0.0 49.3  0:34.33 python3
 7593 mcolema5 20  0 9963112   9.3g   2968 S  0.0  4.9  0:07.86 python3
 9775 root      0 -20   19.1g   4.9g 268304 S  0.3  2.6 189:09.65 mmfsd
30994 root     20  0 2689992 275428  55436 S  0.0  0.1 27:35.41 cmd
 3480 root     20  0  186808 107552 107096 S  0.3  0.1  6:45.82 systemd-journal
 4881 mcolema5 20  0  434112  80968  13552 S  0.7  0.0  2:00.78 Xvnc
...

### partial ps output, omitting lots of kernel/root procs that seem irrelevant
S root 3085 1 0 80 0 7868 64496 inet_c Apr20 ? 00:00:02 /cm/shared/apps/slurm/current/sbin/slurmd --conf-server hpc-hn1,hpc-hn2
S root 4801 1 0 80 0 4436 108222 do_wai Apr20 ?
00:00:06 slurmstepd: [25852232.extern] S root 4806 4801 0 80 0 348 27014 hrtime Apr20 ? 00:00:00 sleep 100000000 S root 4808 1 1 80 0 4640 107706 do_wai Apr20 ? 00:17:47 slurmstepd: [25852232.batch] S mcolema5 4813 4808 0 80 0 1780 4005 do_wai Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script S mcolema5 4908 4813 0 80 0 1584 2430 do_wai Apr20 ? 00:00:00 bash /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/script.sh S mcolema5 4982 4908 0 80 0 28412 132928 poll_s Apr20 ? 00:00:02 mate-session S mcolema5 5047 4982 0 80 0 15336 364107 poll_s Apr20 ? 00:00:13 /usr/libexec/mate-settings-daemon S mcolema5 5051 4982 0 80 0 16560 102908 poll_s Apr20 ? 00:00:03 marco S mcolema5 5055 4982 0 80 0 16520 141292 poll_s Apr20 ? 00:00:02 mate-panel S mcolema5 8419 5055 0 80 0 25936 147685 poll_s Apr20 ? 00:00:01 mate-terminal S mcolema5 8428 8419 0 80 0 860 3698 unix_s Apr20 ? 00:00:00 gnome-pty-helper S mcolema5 8429 8419 0 80 0 5292 4835 do_wai Apr20 pts/0 00:00:00 -bash T mcolema5 7374 8429 4 80 0 97662500 24463434 do_sig 17:40 pts/0 00:00:34 python3 S mcolema5 7593 8429 1 80 0 9771892 2490778 poll_s 17:41 pts/0 00:00:07 python3 S mcolema5 5072 4982 0 80 0 31248 147109 poll_s Apr20 ? 00:00:55 caja S mcolema5 5075 4982 0 80 0 16836 71489 poll_s Apr20 ? 00:00:00 /usr/bin/python /usr/share/system-config-printer/applet.py S mcolema5 5076 4982 0 80 0 12480 96527 poll_s Apr20 ? 00:00:00 abrt-applet S mcolema5 5081 4982 0 80 0 9440 99829 poll_s Apr20 ? 00:00:00 mate-screensaver S mcolema5 5083 4982 0 80 0 19656 127696 poll_s Apr20 ? 00:00:00 nm-applet S mcolema5 5260 4813 0 80 0 1116 4005 pipe_w Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script S mcolema5 5263 5260 0 80 0 740 4005 do_wai Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script S mcolema5 5264 5263 0 80 0 616 1103 hrtime Apr20 ? 
00:00:03 tail -f --pid=4908 vnc.log S mcolema5 4881 1 0 80 0 80968 108528 ep_pol Apr20 ? 00:02:00 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n140:1 (mcolema5) -auth /home/mcolema5/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/mcolema5/.vnc/x509_cert.pem -x509key /home/mcolema5/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0 S mcolema5 4939 1 0 80 0 1000 15802 poll_s Apr20 ? 00:00:00 dbus-launch --autolaunch=5abba43653164143a1d82e3b311c9406 --binary-syntax --close-stderr S mcolema5 4940 1 0 80 0 1408 15543 ep_pol Apr20 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session S mcolema5 4944 1 0 80 0 2984 43813 poll_s Apr20 ? 00:00:00 /usr/libexec/dconf-service S mcolema5 4985 1 0 80 0 1000 15803 poll_s Apr20 ? 00:00:00 dbus-launch --exit-with-session mate-session S mcolema5 4986 1 0 80 0 1984 15673 ep_pol Apr20 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session S mcolema5 5004 1 0 80 0 4108 82793 poll_s Apr20 ? 00:00:00 /usr/libexec/at-spi-bus-launcher S mcolema5 5017 5004 0 80 0 2284 15543 ep_pol Apr20 ? 00:00:00 /usr/bin/dbus-daemon --config-file=/usr/share/defaults/at-spi2/accessibility.conf --nofork --print-address 3 S mcolema5 5008 1 0 80 0 3940 66656 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfsd S mcolema5 5139 5008 0 80 0 4824 84671 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfsd-trash --spawner :1.2 /org/gtk/gvfs/exec_spaw/0 S mcolema5 6331 5008 0 80 0 4648 102675 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfsd-network --spawner :1.2 /org/gtk/gvfs/exec_spaw/1 S mcolema5 6352 5008 0 80 0 4444 91345 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfsd-dnssd --spawner :1.2 /org/gtk/gvfs/exec_spaw/3 S mcolema5 5013 1 0 80 0 3388 97320 futex_ Apr20 ? 00:00:00 /usr/libexec/gvfsd-fuse /home/mcolema5/.gvfs -f -o big_writes S mcolema5 5026 1 0 80 0 4012 55229 poll_s Apr20 ? 
00:00:00 /usr/libexec/at-spi2-registryd --use-gnome-session S mcolema5 5040 1 0 80 0 2988 43813 poll_s Apr20 ? 00:00:00 /usr/libexec/dconf-service S mcolema5 5064 1 0 80 0 4048 141615 poll_s Apr20 ? 00:00:00 /usr/bin/pulseaudio --start --log-target=syslog S rtkit 5065 1 0 81 1 1812 49696 poll_s Apr20 ? 00:00:01 /usr/libexec/rtkit-daemon S mcolema5 5087 1 0 80 0 5280 74468 poll_s Apr20 ? 00:00:06 /usr/libexec/gvfs-udisks2-volume-monitor S root 5095 1 0 80 0 7184 114475 poll_s Apr20 ? 00:00:33 /usr/libexec/udisks2/udisksd S mcolema5 5192 1 0 80 0 3184 67364 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfs-mtp-volume-monitor S mcolema5 5198 1 0 80 0 3416 69637 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfs-gphoto2-volume-monitor S mcolema5 5204 1 0 80 0 4404 87572 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfs-afc-volume-monitor S mcolema5 5210 1 0 80 0 14244 110089 poll_s Apr20 ? 00:00:00 /usr/libexec/mate-panel/wnck-applet S mcolema5 5212 1 0 80 0 15976 123510 poll_s Apr20 ? 00:00:01 /usr/libexec/mate-panel/clock-applet S mcolema5 5214 1 0 80 0 9100 88456 poll_s Apr20 ? 00:00:00 /usr/libexec/mate-panel/notification-area-applet S mcolema5 5262 1 0 80 0 26560 70835 poll_s Apr20 ? 00:00:07 /packages/python/3.6.5/bin/python3.6 /packages/python/3.6.5/bin/websockify -D 40045 localhost:5901 S mcolema5 6357 5262 0 80 0 27800 72269 poll_s 17:34 ? 00:00:01 /packages/python/3.6.5/bin/python3.6 /packages/python/3.6.5/bin/websockify -D 40045 localhost:5901 S mcolema5 5414 1 0 80 0 5068 103195 poll_s Apr20 ? 00:00:00 /usr/libexec/xdg-desktop-portal S mcolema5 5419 1 0 80 0 3012 112041 poll_s Apr20 ? 00:00:00 /usr/libexec/xdg-document-portal S mcolema5 5424 1 0 80 0 2524 60648 poll_s Apr20 ? 00:00:00 /usr/libexec/xdg-permission-store S mcolema5 5436 1 0 80 0 8940 94966 poll_s Apr20 ? 
00:00:00 /usr/libexec/xdg-desktop-portal-gtk ### Here's the log from job startup, captured like so: fgrep -e 25852232 -C 100 /var/log/slurmd-20230421 ### The start and end are clipped some to include the relevant parts. [2023-04-20T13:21:31.048] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:21:31.048] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:21:43.270] debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG [2023-04-20T13:21:43.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG [2023-04-20T13:21:43.271] debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH [2023-04-20T13:21:43.271] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH [2023-04-20T13:21:43.271] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 25852232 [2023-04-20T13:21:43.271] task/affinity: batch_bind: job 25852232 CPU input mask for node: 0x000000F [2023-04-20T13:21:43.271] task/affinity: batch_bind: job 25852232 CPU final HW mask for node: 0x0000055 [2023-04-20T13:21:43.271] debug: Waiting for job 25852232's prolog to complete [2023-04-20T13:21:43.296] debug2: prep/script: _run_subpath_command: prolog success rc:0 output: [2023-04-20T13:21:43.339] debug2: debug level read from slurmd is 'debug2'. [2023-04-20T13:21:43.339] debug2: read_slurmd_conf_lite: slurmd sent 9 TRES. [2023-04-20T13:21:43.339] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded [2023-04-20T13:21:43.340] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded [2023-04-20T13:21:43.340] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded [2023-04-20T13:21:43.340] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded [2023-04-20T13:21:43.340] debug2: Received CPU frequency information for 28 CPUs [2023-04-20T13:21:43.341] debug: switch Cray/Aries plugin loaded. 
[2023-04-20T13:21:43.342] debug: switch/none: init: switch NONE plugin loaded [2023-04-20T13:21:43.342] debug: gres/gpu: init: loaded [2023-04-20T13:21:43.342] [25852232.extern] debug2: setup for a launch_task [2023-04-20T13:21:43.342] [25852232.extern] CGROUP: autodetect_cgroup_version: using cgroup version 1 [2023-04-20T13:21:43.343] [25852232.extern] debug: cgroup/v1: init: Cgroup v1 plugin loaded [2023-04-20T13:21:43.343] [25852232.extern] debug2: hwloc_topology_init [2023-04-20T13:21:43.343] [25852232.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml) found [2023-04-20T13:21:43.345] [25852232.extern] debug: CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1 [2023-04-20T13:21:43.366] [25852232.extern] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded [2023-04-20T13:21:43.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:21:43.366] [25852232.extern] debug: Message thread started pid = 4801 [2023-04-20T13:21:43.366] [25852232.extern] debug: Setting slurmstepd(4801) oom_score_adj to -1000 [2023-04-20T13:21:43.378] [25852232.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffff [2023-04-20T13:21:43.389] [25852232.extern] debug: task/cgroup: init: core enforcement enabled [2023-04-20T13:21:43.389] [25852232.extern] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:193357M allowed:100%(enforced), swap:0%(enforced), max:99%(191423M) max+swap:0%(191423M) min:200M kmem:100%(193357M permissive) min:200M [2023-04-20T13:21:43.389] [25852232.extern] debug: task/cgroup: init: memory enforcement enabled [2023-04-20T13:21:43.400] [25852232.extern] debug: task/cgroup: init: device enforcement enabled [2023-04-20T13:21:43.400] [25852232.extern] debug: task/cgroup: init: Tasks containment cgroup plugin loaded [2023-04-20T13:21:43.400] [25852232.extern] cred/munge: init: Munge credential signature plugin loaded 
[2023-04-20T13:21:43.400] [25852232.extern] debug: job_container/none: init: job_container none plugin loaded [2023-04-20T13:21:43.401] [25852232.extern] debug2: _spawn_job_container: Before call to spank_init() [2023-04-20T13:21:43.401] [25852232.extern] debug: spank: opening plugin stack /cm/local/apps/slurm/var/spool/conf-cache/plugstack.conf [2023-04-20T13:21:43.401] [25852232.extern] debug2: _spawn_job_container: After call to spank_init() [2023-04-20T13:21:43.401] [25852232.extern] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-3' [2023-04-20T13:21:43.401] [25852232.extern] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-3' [2023-04-20T13:21:43.401] [25852232.extern] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6' [2023-04-20T13:21:43.401] [25852232.extern] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6' [2023-04-20T13:21:43.426] [25852232.extern] task/cgroup: _memcg_initialize: job: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614 [2023-04-20T13:21:43.426] [25852232.extern] task/cgroup: _memcg_initialize: step: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614 [2023-04-20T13:21:43.426] [25852232.extern] debug: cgroup/v1: _oom_event_monitor: started. 
[2023-04-20T13:21:43.427] [25852232.extern] debug2: adding task 0 pid 4806 on node 0 to jobacct
[2023-04-20T13:21:43.428] [25852232.extern] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.extern taskid 0 max_task_id 0
[2023-04-20T13:21:43.429] [25852232.extern] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4806' for '/sys/fs/cgroup/cpuacct/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-20T13:21:43.429] [25852232.extern] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.extern taskid 0 max_task_id 0
[2023-04-20T13:21:43.441] [25852232.extern] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4806' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-20T13:21:43.441] debug: Finished wait for job 25852232's prolog to complete
[2023-04-20T13:21:43.441] Launching batch job 25852232 for UID 1345
[2023-04-20T13:21:43.444] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-04-20T13:21:43.480] debug2: debug level read from slurmd is 'debug2'.
[2023-04-20T13:21:43.480] debug2: read_slurmd_conf_lite: slurmd sent 9 TRES.
[2023-04-20T13:21:43.481] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded [2023-04-20T13:21:43.481] debug: acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded [2023-04-20T13:21:43.482] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded [2023-04-20T13:21:43.482] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded [2023-04-20T13:21:43.482] debug2: Received CPU frequency information for 28 CPUs [2023-04-20T13:21:43.482] debug: gres/gpu: init: loaded [2023-04-20T13:21:43.482] [25852232.batch] debug2: setup for a batch_job [2023-04-20T13:21:43.483] [25852232.batch] CGROUP: autodetect_cgroup_version: using cgroup version 1 [2023-04-20T13:21:43.483] [25852232.batch] debug: cgroup/v1: init: Cgroup v1 plugin loaded [2023-04-20T13:21:43.483] [25852232.batch] debug2: hwloc_topology_init [2023-04-20T13:21:43.483] [25852232.batch] debug2: xcpuinfo_hwloc_topo_load: xml file (/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml) found [2023-04-20T13:21:43.486] [25852232.batch] debug: CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1 [2023-04-20T13:21:43.506] [25852232.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded [2023-04-20T13:21:43.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:21:43.507] [25852232.batch] debug: laying out the 1 tasks on 1 hosts n140 dist 2 [2023-04-20T13:21:43.507] [25852232.batch] debug: Message thread started pid = 4808 [2023-04-20T13:21:43.507] debug2: Finish processing RPC: REQUEST_BATCH_JOB_LAUNCH [2023-04-20T13:21:43.507] [25852232.batch] debug: Setting slurmstepd(4808) oom_score_adj to -1000 [2023-04-20T13:21:43.508] [25852232.batch] debug: switch/none: init: switch NONE plugin loaded [2023-04-20T13:21:43.519] [25852232.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffff [2023-04-20T13:21:43.530] [25852232.batch] debug: task/cgroup: init: core 
enforcement enabled [2023-04-20T13:21:43.530] [25852232.batch] debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:193357M allowed:100%(enforced), swap:0%(enforced), max:99%(191423M) max+swap:0%(191423M) min:200M kmem:100%(193357M permissive) min:200M [2023-04-20T13:21:43.530] [25852232.batch] debug: task/cgroup: init: memory enforcement enabled [2023-04-20T13:21:43.541] [25852232.batch] debug: task/cgroup: init: device enforcement enabled [2023-04-20T13:21:43.541] [25852232.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded [2023-04-20T13:21:43.541] [25852232.batch] cred/munge: init: Munge credential signature plugin loaded [2023-04-20T13:21:43.542] [25852232.batch] debug: job_container/none: init: job_container none plugin loaded [2023-04-20T13:21:43.542] [25852232.batch] debug2: Before call to spank_init() [2023-04-20T13:21:43.542] [25852232.batch] debug: spank: opening plugin stack /cm/local/apps/slurm/var/spool/conf-cache/plugstack.conf [2023-04-20T13:21:43.542] [25852232.batch] debug2: After call to spank_init() [2023-04-20T13:21:43.542] [25852232.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-3' [2023-04-20T13:21:43.542] [25852232.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-3' [2023-04-20T13:21:43.542] [25852232.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6' [2023-04-20T13:21:43.542] [25852232.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6' [2023-04-20T13:21:43.549] [25852232.batch] task/cgroup: _memcg_initialize: job: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614 [2023-04-20T13:21:43.549] [25852232.batch] task/cgroup: _memcg_initialize: step: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614 [2023-04-20T13:21:43.549] [25852232.batch] debug: cgroup/v1: _oom_event_monitor: started. 
[2023-04-20T13:21:43.549] [25852232.batch] debug2: hwloc_topology_load [2023-04-20T13:21:43.569] [25852232.batch] debug2: hwloc_topology_export_xml [2023-04-20T13:21:43.570] [25852232.batch] debug2: Entering _setup_normal_io [2023-04-20T13:21:43.591] [25852232.batch] debug2: Leaving _setup_normal_io [2023-04-20T13:21:43.591] [25852232.batch] debug levels are stderr='error', logfile='debug2', syslog='fatal' [2023-04-20T13:21:43.591] [25852232.batch] starting 1 tasks [2023-04-20T13:21:43.592] [25852232.batch] task 0 (4813) started 2023-04-20T13:21:43 [2023-04-20T13:21:43.593] [25852232.batch] debug2: adding task 0 pid 4813 on node 0 to jobacct [2023-04-20T13:21:43.593] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0 [2023-04-20T13:21:43.594] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/cpuacct/slurm/uid_1345/job_25852232/step_batch/task_0' [2023-04-20T13:21:43.594] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0 [2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0' [2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0 [2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0' [2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0 [2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/devices/slurm/uid_1345/job_25852232/step_batch/task_0' [2023-04-20T13:21:43.602] [25852232.batch] debug: 
task/affinity: task_p_pre_launch: affinity StepId=25852232.batch, task:0 bind:mask_cpu [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_CPU in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_FSIZE in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_DATA in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_STACK in environment [2023-04-20T13:21:43.603] [25852232.batch] debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0 [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_RSS in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_NPROC in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_NOFILE in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_MEMLOCK in environment [2023-04-20T13:21:43.603] [25852232.batch] debug: Couldn't find SLURM_RLIMIT_AS in environment [2023-04-20T13:21:43.603] [25852232.batch] debug2: Set task rss(32768 MB) [2023-04-20T13:22:01.134] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:22:01.134] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:22:01.134] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:22:13.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:22:13.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:22:31.166] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:22:31.166] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:22:31.167] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:22:43.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:22:43.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:23:01.216] debug2: Start 
processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:23:01.216] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:23:01.217] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:23:13.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:23:13.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:23:31.256] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:23:31.256] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:23:31.257] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:23:43.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:23:43.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:24:01.283] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:24:01.283] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:24:01.284] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:24:13.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:24:13.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:24:31.319] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:24:31.319] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:24:31.320] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:24:43.366] [25852232.extern] debug2: profile signaling type Task ### Like that for a while and then boom and we're off [2023-04-20T13:28:43.366] [25852232.extern] debug2: profile signaling type Task [2023-04-20T13:28:43.506] [25852232.batch] debug2: profile signaling type Task [2023-04-20T13:29:01.890] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:29:01.890] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:29:01.890] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE [2023-04-20T13:29:13.366] [25852232.extern] 
debug2: profile signaling type Task
[2023-04-20T13:29:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:29:27.635] [25852232.batch] debug: cgroup/v1: _oom_event_monitor: oom-kill event count: 1
[2023-04-20T13:29:27.649] [25852232.extern] debug: cgroup/v1: _oom_event_monitor: oom-kill event count: 1
[2023-04-20T13:29:31.993] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:31.993] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:31.994] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:43.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:29:43.367] [25852232.extern] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0/memory.stat' for reading : No such file or directory
[2023-04-20T13:29:43.367] [25852232.extern] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-20T13:29:43.367] [25852232.extern] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4806 ppid 4801
[2023-04-20T13:29:43.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:29:43.811] [25852232.batch] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0/memory.stat' for reading : No such file or directory
[2023-04-20T13:29:43.811] [25852232.batch] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-20T13:29:43.811] [25852232.batch] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4813 ppid 4808
[2023-04-20T13:30:01.053] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:30:01.053] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:30:01.054] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:30:13.366] [25852232.extern] debug2: profile signaling type Task

### This is a sample log after the job has started. It's still generating output like this.

[2023-04-21T17:51:43.817] [25852232.batch] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-21T17:51:43.817] [25852232.batch] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4813 ppid 4808
[2023-04-21T17:52:11.845] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:11.845] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:11.845] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:13.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0/memory.stat' for reading : No such file or directory
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-21T17:52:13.367] [25852232.extern] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4806 ppid 4801
[2023-04-21T17:52:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-21T17:52:13.818] [25852232.batch] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0/memory.stat' for reading : No such file or directory
[2023-04-21T17:52:13.818] [25852232.batch] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-21T17:52:13.818] [25852232.batch] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4813 ppid 4808
[2023-04-21T17:52:41.873] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:41.873] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:41.874] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE

### This is probably the second python3 I started, but the technique was the same with both

mcolema5@n140:~ 4$ python3
Python 3.6.8 (default, Aug 13 2020, 07:46:32)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'z' * 200000000000
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>> x = 'z' * 100000000000
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>> x = 'z' * 10000000000
>>>
[2]+  Stopped                 python3

### It's nice that 'ulimit -m' matches the job request, but of course this doesn't have much effect.

mcolema5@n140:~ 5$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 773354
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 33554432
open files                      (-n) 51200
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 773354
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Last dmesg output was Apr 7, an oom-killer message. Could be relevant, but probably not.
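As an aside on why 'ulimit -m' matching the request "doesn't have much effect": on Linux, RLIMIT_RSS is advisory and not enforced by the kernel; the limit that actually matters for these jobs is the one task/cgroup writes into memory.limit_in_bytes. A minimal sketch of checking both for the current process (assumes cgroup v1 mounted as on these CentOS 7 nodes; the helper name is ours, not Slurm's):

```python
# Sketch: compare the advisory RLIMIT_RSS ('ulimit -m') with the cgroup v1
# memory limit that the kernel actually enforces for the current process.
import resource

def parse_memory_cgroup(cgroup_text):
    """Return the memory-controller path from /proc/<pid>/cgroup contents."""
    for line in cgroup_text.splitlines():
        # Each cgroup v1 line looks like 'N:controller[,controller...]:/path'.
        _, controllers, path = line.split(":", 2)
        if "memory" in controllers.split(","):
            return path
    return None  # no v1 memory controller line (e.g. pure cgroup v2)

if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_RSS)
    print("RLIMIT_RSS soft/hard:", soft, hard)  # advisory only on Linux
    try:
        with open("/proc/self/cgroup") as f:
            cg = parse_memory_cgroup(f.read())
    except OSError:
        cg = None
    print("memory cgroup:", cg)
    if cg is not None:
        limit_file = "/sys/fs/cgroup/memory" + cg + "/memory.limit_in_bytes"
        try:
            with open(limit_file) as f:
                print("enforced cgroup limit (bytes):", f.read().strip())
        except OSError as e:
            # the same ENOENT the slurmstepd logs show once the tree vanishes
            print("cgroup limit unreadable:", e)
```

Run inside a job step, the unreadable-limit branch reproduces the same "No such file or directory" condition the memory.stat errors above show.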
For completeness:

Apr 20 13:27:29 n140 mmfs: [N] Connecting to 172.20.8.90 n090 <c0n96>:[0]
Apr 20 13:27:29 n140 mmfs: [N] Connecting to 172.20.8.90 n090 <c0n96>:[1]
Apr 20 13:27:37 n140 dbus[5897]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Apr 20 13:27:37 n140 systemd: Starting Hostname Service...
Apr 20 13:27:37 n140 dbus[5897]: [system] Successfully activated service 'org.freedesktop.hostname1'
Apr 20 13:27:37 n140 systemd: Started Hostname Service.
Apr 20 13:29:27 n140 dbus[5897]: [system] Activating via systemd: service name='org.freedesktop.UPower' unit='upower.service'
Apr 20 13:29:28 n140 systemd: Starting Daemon for power management...
Apr 20 13:29:28 n140 dbus[5897]: [system] Successfully activated service 'org.freedesktop.UPower'
Apr 20 13:29:28 n140 systemd: Started Daemon for power management.
Apr 20 13:30:01 n140 systemd: Created slice User Slice of root.
Apr 20 13:30:01 n140 systemd: Started Session 127794 of user root.
Apr 20 13:30:01 n140 systemd: Started Session 127795 of user root.
Apr 20 13:30:01 n140 systemd: Started Session 127796 of user root.
Apr 20 13:30:01 n140 systemd: Started Session 127797 of user root.

One additional detail: the '/sys/fs/cgroup/cpu,cpuacct' tree has a lot of entries for this job, but the /memory tree is completely gone (at least as of right now).

I tried replicating this on another node, restarting slurmd and then attaching strace to it beforehand. No luck. That node does have a different amount of memory, though. The bug could just be intermittent. Or, if this is ultimately a race, strace can sometimes change or serialize things enough to mask the problem.

With the thought of trying strace on the original node, I canceled that job. The processes below were left after that. It's perhaps notable that three are running as a different user (two root, one rtkit).
The binaries themselves are not setuid, and I haven't tried to figure out how they get started. Perhaps this matters. I'll continue with the attempt to replicate on the original node under strace (and if that doesn't work, without strace). I suppose that if strace blocks the bug, we could run the slurm daemons that way and limp along. Hopefully we'll find something better. Is there an easy way to limit, eliminate, or just make deterministic the whole pthread mechanism? S root 3085 1 0 80 0 8204 65542 inet_c Apr20 ? 00:00:08 /cm/shared/apps/slurm/current/sbin/slurmd --conf-server hpc-hn1,hpc-hn2 S mcolema5 4881 1 0 80 0 80968 108528 ep_pol Apr20 ? 00:02:40 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n140:1 (mcolema5) -auth /home/mcolema5/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/mcolema5/.vnc/x509_cert.pem -x509key /home/mcolema5/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0 S mcolema5 4908 1 0 80 0 1584 2430 do_wai Apr20 ? 00:00:00 bash /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/script.sh S mcolema5 4982 4908 0 80 0 28400 132928 poll_s Apr20 ? 00:00:05 mate-session S mcolema5 5047 4982 0 80 0 15320 364107 poll_s Apr20 ? 00:00:45 /usr/libexec/mate-settings-daemon S mcolema5 5051 4982 0 80 0 16552 102908 poll_s Apr20 ? 00:00:05 marco S mcolema5 5055 4982 0 80 0 16532 141292 poll_s Apr20 ? 00:00:08 mate-panel S mcolema5 8419 5055 0 80 0 25936 147685 poll_s Apr20 ? 00:00:02 mate-terminal S mcolema5 8428 8419 0 80 0 860 3698 unix_s Apr20 ? 
00:00:00 gnome-pty-helper S mcolema5 8429 8419 0 80 0 5296 4835 n_tty_ Apr20 pts/0 00:00:00 -bash T mcolema5 7374 8429 0 80 0 97662500 24463434 do_sig Apr21 pts/0 00:00:34 python3 T mcolema5 7593 8429 0 80 0 9771892 2490778 do_sig Apr21 pts/0 00:00:07 python3 S mcolema5 5072 4982 0 80 0 31252 147109 poll_s Apr20 ? 00:04:15 caja S mcolema5 5075 4982 0 80 0 16836 71489 poll_s Apr20 ? 00:00:00 /usr/bin/python /usr/share/system-config-printer/applet.py S mcolema5 5076 4982 0 80 0 12480 96527 poll_s Apr20 ? 00:00:00 abrt-applet S mcolema5 5081 4982 0 80 0 9520 99829 poll_s Apr20 ? 00:00:03 mate-screensaver S mcolema5 5083 4982 0 80 0 19652 127696 poll_s Apr20 ? 00:00:00 nm-applet S mcolema5 4939 1 0 80 0 1000 15802 poll_s Apr20 ? 00:00:00 dbus-launch --autolaunch=5abba43653164143a1d82e3b311c9406 --binary-syntax --close-stderr S mcolema5 4940 1 0 80 0 1408 15543 ep_pol Apr20 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session S mcolema5 4944 1 0 80 0 2984 43813 poll_s Apr20 ? 00:00:00 /usr/libexec/dconf-service S mcolema5 4985 1 0 80 0 1000 15803 poll_s Apr20 ? 00:00:00 dbus-launch --exit-with-session mate-session S mcolema5 4986 1 0 80 0 1984 15673 ep_pol Apr20 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session S mcolema5 5004 1 0 80 0 4104 82793 poll_s Apr20 ? 00:00:00 /usr/libexec/at-spi-bus-launcher S mcolema5 5017 5004 0 80 0 2284 15543 ep_pol Apr20 ? 00:00:00 /usr/bin/dbus-daemon --config-file=/usr/share/defaults/at-spi2/accessibility.conf --nofork --print-address 3 S mcolema5 5008 1 0 80 0 3940 66656 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfsd S mcolema5 5139 5008 0 80 0 4824 84671 poll_s Apr20 ? 00:00:02 /usr/libexec/gvfsd-trash --spawner :1.2 /org/gtk/gvfs/exec_spaw/0 S mcolema5 6331 5008 0 80 0 4644 102675 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfsd-network --spawner :1.2 /org/gtk/gvfs/exec_spaw/1 S mcolema5 6352 5008 0 80 0 4648 91345 poll_s Apr20 ? 
00:00:02 /usr/libexec/gvfsd-dnssd --spawner :1.2 /org/gtk/gvfs/exec_spaw/3 S mcolema5 5013 1 0 80 0 3388 97320 futex_ Apr20 ? 00:00:00 /usr/libexec/gvfsd-fuse /home/mcolema5/.gvfs -f -o big_writes S mcolema5 5026 1 0 80 0 4012 55229 poll_s Apr20 ? 00:00:00 /usr/libexec/at-spi2-registryd --use-gnome-session S mcolema5 5040 1 0 80 0 2988 43813 poll_s Apr20 ? 00:00:00 /usr/libexec/dconf-service S mcolema5 5064 1 0 80 0 4048 141615 poll_s Apr20 ? 00:00:01 /usr/bin/pulseaudio --start --log-target=syslog S rtkit 5065 1 0 81 1 1812 49696 poll_s Apr20 ? 00:00:05 /usr/libexec/rtkit-daemon S mcolema5 5087 1 0 80 0 5280 74468 poll_s Apr20 ? 00:00:23 /usr/libexec/gvfs-udisks2-volume-monitor S root 5095 1 0 80 0 7196 114475 poll_s Apr20 ? 00:01:55 /usr/libexec/udisks2/udisksd S mcolema5 5192 1 0 80 0 3184 67364 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfs-mtp-volume-monitor S mcolema5 5198 1 0 80 0 3416 69637 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfs-gphoto2-volume-monitor S mcolema5 5204 1 0 80 0 4404 87572 poll_s Apr20 ? 00:00:00 /usr/libexec/gvfs-afc-volume-monitor S mcolema5 5210 1 0 80 0 14240 110089 poll_s Apr20 ? 00:00:00 /usr/libexec/mate-panel/wnck-applet S mcolema5 5212 1 0 80 0 15964 123510 poll_s Apr20 ? 00:00:05 /usr/libexec/mate-panel/clock-applet S mcolema5 5214 1 0 80 0 9100 88456 poll_s Apr20 ? 00:00:00 /usr/libexec/mate-panel/notification-area-applet S mcolema5 5260 1 0 80 0 1116 4005 pipe_w Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script S mcolema5 5263 5260 0 80 0 740 4005 do_wai Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script S mcolema5 5264 5263 0 80 0 616 1103 hrtime Apr20 ? 00:00:11 tail -f --pid=4908 vnc.log S mcolema5 5262 1 0 80 0 26560 70835 poll_s Apr20 ? 00:00:25 /packages/python/3.6.5/bin/python3.6 /packages/python/3.6.5/bin/websockify -D 40045 localhost:5901 S mcolema5 5414 1 0 80 0 5064 103195 poll_s Apr20 ? 
00:00:00 /usr/libexec/xdg-desktop-portal
S mcolema5 5419 1 0 80 0 3012 112041 poll_s Apr20 ? 00:00:00 /usr/libexec/xdg-document-portal
S mcolema5 5424 1 0 80 0 2524 60648 poll_s Apr20 ? 00:00:00 /usr/libexec/xdg-permission-store
S mcolema5 5436 1 0 80 0 8940 94966 poll_s Apr20 ? 00:00:00 /usr/libexec/xdg-desktop-portal-gtk
S root 6969 1 0 80 0 5340 107595 poll_s Apr20 ? 00:00:03 /usr/libexec/upowerd

No luck with what seems like a fairly identical attempt to replicate that. Looking at the slurmd logs, the bad and good runs are identical up to the point of the oom-kill message.

I did reboot the node right before, so perhaps some bitrot, generally or in Slurm, was cleaned up by that. Or maybe this is just intermittent.

I guess we could try running the slurm processes under valgrind, as suggested here: https://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads#comment52831476_982031 I've used it to good effect in the past, though never for pthread issues. Do you test the slurm daemons this way?

Given the realities of our situation, I'm also considering an out-of-band script to just whack processes that Slurm should have killed. For example, user processes running on a node where the user has no job running. Or memory usage exceeding the total memory request for the node. Etc.

Any ideas on that? Has anyone already written it?

Thanks! --Mike

(In reply to Mark Allen from comment #27)
> No luck with what seems like a fairly identical attempt to replicate that.
> Looking at the slurmd logs, the bad and good are identical up to the point
> of the oom-kill message.
>
> I did reboot the node right before. So, perhaps some bitrot, generally or
> in Slurm, was cleaned up by that.
>
> Or, maybe this is just intermittent.
>
> I guess we could try running the slurm processes under valgrind, as
> suggested here:
> https://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads#comment52831476_982031
>
> I've used it to good effect in the past, though never for pthread issues.
> Do you test the slurm daemons this way?
>
> Given the realities of our situation, I'm also considering an out-of-band
> script to just whack processes that Slurm should have killed. For example,
> user processes running on a node that the user has no job running on. Or,
> memory usage exceeding total memory request for the node. Etc.
>
> Any ideas on that? Has anyone already written it?
>
> Thanks! --Mike

Well, you've done an interesting analysis so far. I think running slurmd under valgrind or strace is a really bad idea: the performance hit will be enormous and you can have other nasty effects. We usually test the daemons under valgrind, for example:

valgrind --leak-check=full --show-leak-kinds=definite slurmd -Dv

Some comments:

- You're right: that a process has been reparented to 1 doesn't necessarily mean that slurmstepd is gone. If it's been started by some launcher it may be out of slurm control.

- Question 1: I highly suspect you're suffering from this. Starting an entire X11 session uses dbus to communicate with systemd and start a new session and processes; it is probable that the processes even escape the slurm cgroup. Could you run one of these tests and check the cgroup of *all the spawned processes* (i.e. in your example, 'cat /proc/<pid>/cgroup' for all the processes in the 'ps' output)?

- dbus is a daemon which allows client-server communication between systemd and user applications. When you start a graphical session, most of the applications, even the desktop widgets, usually communicate over dbus. They can ask for new 'systemd sessions' to be started (XDG_* variables), and systemd can decide to move processes from one cgroup to another.
I am concerned this is affecting you. These processes look good; all of them are under the slurmstepd:

S root 4801 1 0 80 0 4436 108222 do_wai Apr20 ? 00:00:06 slurmstepd: [25852232.extern]
S root 4806 4801 0 80 0 348 27014 hrtime Apr20 ? 00:00:00 sleep 100000000
S root 4808 1 1 80 0 4640 107706 do_wai Apr20 ? 00:17:47 slurmstepd: [25852232.batch]
S mcolema5 4813 4808 0 80 0 1780 4005 do_wai Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5 4908 4813 0 80 0 1584 2430 do_wai Apr20 ? 00:00:00 bash /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/script.sh
S mcolema5 4982 4908 0 80 0 28412 132928 poll_s Apr20 ? 00:00:02 mate-session
S mcolema5 5047 4982 0 80 0 15336 364107 poll_s Apr20 ? 00:00:13 /usr/libexec/mate-settings-daemon
S mcolema5 5051 4982 0 80 0 16560 102908 poll_s Apr20 ? 00:00:03 marco
S mcolema5 5055 4982 0 80 0 16520 141292 poll_s Apr20 ? 00:00:02 mate-panel
S mcolema5 8419 5055 0 80 0 25936 147685 poll_s Apr20 ? 00:00:01 mate-terminal
S mcolema5 8428 8419 0 80 0 860 3698 unix_s Apr20 ? 00:00:00 gnome-pty-helper
S mcolema5 8429 8419 0 80 0 5292 4835 do_wai Apr20 pts/0 00:00:00 -bash
T mcolema5 7374 8429 4 80 0 97662500 24463434 do_sig 17:40 pts/0 00:00:34 python3
S mcolema5 7593 8429 1 80 0 9771892 2490778 poll_s 17:41 pts/0 00:00:07 python3
S mcolema5 5072 4982 0 80 0 31248 147109 poll_s Apr20 ? 00:00:55 caja
S mcolema5 5075 4982 0 80 0 16836 71489 poll_s Apr20 ? 00:00:00 /usr/bin/python /usr/share/system-config-printer/applet.py
S mcolema5 5076 4982 0 80 0 12480 96527 poll_s Apr20 ? 00:00:00 abrt-applet
S mcolema5 5081 4982 0 80 0 9440 99829 poll_s Apr20 ? 00:00:00 mate-screensaver
S mcolema5 5083 4982 0 80 0 19656 127696 poll_s Apr20 ? 00:00:00 nm-applet
S mcolema5 5260 4813 0 80 0 1116 4005 pipe_w Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5 5263 5260 0 80 0 740 4005 do_wai Apr20 ? 00:00:00 /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5 5264 5263 0 80 0 616 1103 hrtime Apr20 ? 00:00:03 tail -f --pid=4908 vnc.log

But this one is under ppid 1, which means it has been started outside slurm?:

S mcolema5 4881 1 0 80 0 80968 108528 ep_pol Apr20 ? 00:02:00 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n140:1 (mcolema5) -auth /home/mcolema5/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/mcolema5/.vnc/x509_cert.pem -x509key /home/mcolema5/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0

I am really uncertain about why these cgroups disappear or cannot be opened. For example, pid 4806 is 'sleep 100000000', and its ppid 4801 is the slurmstepd extern, but it is still alive.

Question 2: when you see these messages, can you check where the slurmstepd or sleep process (or any other which is still alive and throwing this error) resides? I.e., in this case you would have run 'cat /proc/4806/cgroup' and 'cat /proc/4801/cgroup'.
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0/memory.stat' for reading : No such file or directory
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-21T17:52:13.367] [25852232.extern] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4806 ppid 4801
[2023-04-21T17:52:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-21T17:52:13.818] [25852232.batch] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0/memory.stat' for reading : No such file or directory

I think we now want to know about Question 1 and Question 2.

Replying to your comments.
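As a concrete form of the Question 2 check, something like the sketch below could be run against a surviving step process. This is an editor's sketch, not part of the ticket: the helper name is made up, and the /sys/fs/cgroup/memory mount path is an assumption matching the cgroup v1 errors above.

```shell
#!/bin/sh
# Sketch of the Question 2 check: print a PID's cgroup membership and
# verify that its cgroup v1 memory directory still exists on disk.
# check_pid_cgroup is a hypothetical helper name.
check_pid_cgroup() {
    pid="$1"
    echo "== PID $pid =="
    cat "/proc/$pid/cgroup" || return 1
    # cgroup v1 memory line: "N:memory:/slurm/uid_1345/job_.../..."
    # (the controller field may be comma-joined with others, hence the regex)
    memdir=$(awk -F: '$2 ~ /(^|,)memory(,|$)/ {print $3}' "/proc/$pid/cgroup")
    if [ -n "$memdir" ] && [ -d "/sys/fs/cgroup/memory$memdir" ]; then
        echo "memory cgroup dir still exists"
    else
        echo "memory cgroup dir missing (or not cgroup v1)"
    fi
}

# On the failing node this would be run against the stuck step, e.g.:
#   check_pid_cgroup 4801   # slurmstepd: [25852232.extern]
#   check_pid_cgroup 4806   # its sleep 100000000
check_pid_cgroup $$
```

Running it for both the slurmstepd and its children in one pass would show directly whether the processes or only the cgroup directories have gone missing.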
Yes, 'valgrind' is very slow and would only be for testing. Note, though, that valgrind has a number of distinct "tools". The command you mention is implicitly defaulting to the 'memcheck' tool. It's good that you're doing this.
But note also that the other tools--'helgrind' and 'drd'--mentioned in my prior link attempt to spot thread races. Given the intermittent nature of this and the fact that pthreads are involved here, that makes a race seem possible. (Also, still wondering about that possible pthread EINVAL status.)
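For reference, the memcheck invocation quoted in this thread adapts straightforwardly to the two race detectors. This is an editor's sketch, not a command from the ticket, and is for a test node only (the overhead makes it unusable in production); the commands are printed rather than executed here since valgrind and slurmd are only present there.

```shell
#!/bin/sh
# Race-detector variants of the memcheck command shown in this thread.
# slurmd -Dv runs the daemon in the foreground with verbose logging.
helgrind_cmd="valgrind --tool=helgrind slurmd -Dv"
drd_cmd="valgrind --tool=drd --check-stack-var=yes slurmd -Dv"

# Printed, not executed: run these by hand on a test node.
echo "$helgrind_cmd"
echo "$drd_cmd"
```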
The 'strace' alternative is a bit more general, in that its slowdown would likely be much smaller. (It only slows system calls.) For correctness, though, don't sell it short. There is some room for side-effects, but not much. IIRC, the default fake ptrace reparenting is somewhat visible to parents watching the child with options like WSTOPPED. That's pretty rare, though. The new --daemonize* flags might do even better, though I haven't looked closely.
(As a detail, altering the timing of broken code might expose the breakage, even if it normally "works".)
In any case, in addition to debugging the issue, I'm reaching for any easy way to just mitigate the issue, for us. Obviously, a true fix would be better.
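One minimal shape for the out-of-band script mentioned earlier is to diff the users who own processes on the node against the users with jobs there. Editor's sketch under stated assumptions: find_stray_users is a hypothetical helper, it only reports (it deliberately kills nothing), and the squeue/ps feeding commands in the comment are suggestions to adapt per site.

```shell
#!/bin/sh
# Dry-run sketch of the out-of-band cleanup idea: report users who own
# processes on this node but have no running job here. Reporting only;
# deciding what (if anything) to kill is left to the operator.
find_stray_users() {
    # $1: file of users with jobs on this node, $2: file of process owners
    jobs_sorted="${TMPDIR:-/tmp}/jobs_sorted.$$"
    sort -u "$1" > "$jobs_sorted"
    # users owning processes but absent from the job list
    sort -u "$2" | comm -13 "$jobs_sorted" -
    rm -f "$jobs_sorted"
}

# On a real node the inputs might be produced like this (assumption; the
# system-account filter needs tuning per site):
#   squeue -h -w "$(hostname -s)" -o '%u' > /tmp/job_users
#   ps -e -o user= | grep -v '^root$'    > /tmp/proc_users
#   find_stray_users /tmp/job_users /tmp/proc_users
```

The same pattern extends to the memory case: sum per-user RSS from ps and compare against the node's total requested memory from squeue.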
Regarding the reparenting, I see what you're saying about "launchers". But in addition, it's possible to have orphans without that. This generates one (and is in the output I'm about to upload).
{ nohup /bin/sleep 12345 & } &
from https://unix.stackexchange.com/questions/267001/how-to-create-an-orphan-process-without-terminating-the-parent-shell
No idea whether that's involved here, but it's a separate case to consider.
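For what it's worth, the double-fork orphaning is easy to demonstrate outside Slurm (editor's sketch; the temp-file name is arbitrary): the intermediate subshell exits immediately, so the sleep is reparented to init or the nearest subreaper, yet its /proc/<pid>/cgroup entry is unchanged, so cgroup-based cleanup can still find it.

```shell
#!/bin/sh
# Demonstrate the orphaning: after the intermediate subshell exits, the
# sleep gets a new parent, but its cgroup membership is untouched.
pidfile="${TMPDIR:-/tmp}/orphan_pid.$$"
( /bin/sleep 5 & echo $! > "$pidfile" ) &
wait                                    # intermediate subshell is gone now
pid=$(cat "$pidfile"); rm -f "$pidfile"
# field 4 of /proc/<pid>/stat is the ppid (comm "(sleep)" has no spaces)
ppid=$(awk '{print $4}' "/proc/$pid/stat")
echo "orphan $pid now has ppid $ppid (this shell is $$)"
cat "/proc/$pid/cgroup"                 # same cgroup as before the orphaning
```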
I'm about to upload some cgroup info for the session I have, which is one that did not exhibit failure. If I see another failure, I'll try to gather the same for that.
The generating commands were
ps -eflyH > ps-eflyH.txt
egrep -e . /proc/*/cgroup > cgroup.txt
ps -eH -O cgroup:1000 > ps-cgroup-1000.txt
Kind of a mess, but hopefully useful info is in there.
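Those commands could be folded into a single capture script to run the moment the next bad job is spotted, so nothing is missed. Editor's sketch: the slurmd log location is an assumption (adjust SLURMD_LOG per site), and the ps/dmesg calls are allowed to fail quietly on hosts where they need privileges.

```shell
#!/bin/sh
# Snapshot the diagnostics discussed in this ticket into one timestamped
# directory: process tree, per-PID cgroups, dmesg, and the slurmd log tail.
SLURMD_LOG="${SLURMD_LOG:-/var/log/slurmd.log}"   # assumption: adjust per site
dir="/tmp/cgroup-snap.$(date +%Y%m%dT%H%M%S)"
mkdir -p "$dir"
ps -eflyH             > "$dir/ps-eflyH.txt"       2>/dev/null || true
grep -e . /proc/*/cgroup > "$dir/cgroup.txt"      2>/dev/null || true
ps -eH -O cgroup:1000 > "$dir/ps-cgroup-1000.txt" 2>/dev/null || true
dmesg                 > "$dir/dmesg.txt"          2>/dev/null || true
tail -n 2000 "$SLURMD_LOG" > "$dir/slurmd-tail.txt" 2>/dev/null || true
echo "snapshot in $dir"
```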
Created attachment 30009 [details]
cgroup info for working OOD desktop
Created attachment 30010 [details]
cgroup info for working OOD desktop (cgroup contents)
Created attachment 30011 [details]
cgroup info for working OOD desktop (cgroups with ps)
(In reply to Mark Allen from comment #29)
> Replying to your comments.
>
> Yes, 'valgrind' is very slow and would only be for testing. Note, though,
> that valgrind has a number of distinct "tools". The command you mention is
> implicitly defaulting to the 'memcheck' tool. It's good that you're doing
> this.

Yep, indeed I showed how we check for memory. You can try with the others to find thread races, but just don't run it normally in production.

> In any case, in addition to debugging the issue, I'm reaching for any easy
> way to just mitigate the issue, for us. Obviously, a true fix would be
> better.

It's hard to find a mitigation if we don't know the cause :)

> Regarding the reparenting, I see what you're saying about "launchers". But
> in addition, it's possible to have orphans without that. This generates one
> (and is in the output I'm about to upload).
>
> { nohup /bin/sleep 12345 & } &

Yes, you're right. I am not concerned, though: you can see from your outputs that this sleep process is in the correct cgroup, so when the job is done it will be killed. So we want to check the cgroup. The commands you used are useful, especially the first two:

> The generating commands were
>
> ps -eflyH > ps-eflyH.txt
>
> egrep -e . /proc/*/cgroup > cgroup.txt

Let's capture the slurmd logs, dmesg, and this ps when the next occurrence happens.

I added a bug in "see also" with similar problems; they were using Bright, and /proc/mounts showed a joined controller in cgroups:

cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0

It seems it is not your case, but it is something that Bright does.

Thanks

(In reply to Felip Moll from comment #33)
> (In reply to Mark Allen from comment #29)
> > Replying to your comments.
> >
> > Yes, 'valgrind' is very slow and would only be for testing. Note, though,
> > that valgrind has a number of distinct "tools". The command you mention is
> > implicitly defaulting to the 'memcheck' tool. It's good that you're doing
> > this.
>
> Yep, indeed I showed how we check for memory.
> You can try with the others to find thread races, but just don't run it
> normally in production.

I briefly tried all three, mostly out of curiosity. I'm reminded that memcheck will often give some false positives (e.g., due to games libc plays). The other two did both flag some errors that seemed possibly real. I'm not knowledgeable enough about slurm or pthreads to go further, but it might be worth a look.

> It's hard to find a mitigation if we don't know the cause :)

For sure. Just mentioning it as our context. Obviously better for the larger userbase to find a true fix, especially if this isn't peculiar to v1.

> Let's capture the slurmd logs, dmesg, and this ps when the next occurrence
> happens.

Will do.

> It seems it is not your case, but it is something that Bright does.

As a detail, we're actually using vanilla Slurm, which we compiled ourselves, rather than the Bright version. It's certainly possible that there's Bright-specific stuff affecting this somehow.

Thanks!

> > Let's capture the slurmd logs, dmesg, and this ps when the next occurrence
> > happens.
>
> Will do.
Hi Mark,
Have you been able to capture more data of a new real situation?
Thanks!
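Returning to the joined-controller /proc/mounts pattern flagged earlier in the thread (the Bright-style 'blkio,cpuacct,memory,freezer' mount), a quick detector might look like this. Editor's sketch; the function name is made up, and note that cpu,cpuacct and net_cls,net_prio are commonly co-mounted and harmless, so only joins involving memory or freezer are flagged.

```shell
#!/bin/sh
# Flag cgroup v1 mounts where memory or freezer share a hierarchy with
# other controllers, e.g.:
#   cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,... 0 0
check_joined_controllers() {
    # reads /proc/mounts-formatted text on stdin
    awk '$3 == "cgroup" && $2 ~ /,/ && $2 ~ /memory|freezer/ {
             print "suspicious joined mount: " $2
         }'
}

check_joined_controllers < /proc/mounts
```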
Hi,

We appreciate your forbearance on this. We are so time-short, and relatively close to our RHEL 8 cutover, that we're going to pause on this, at least for now. With any luck, either cgroups v2 or other changes will somehow make this go away.

Since you're not hearing about similar issues from others, it seems likely that there's something unusual about our config or environment that's exposing the bug, if there is one. (It could be a kernel bug, or something else.)

Feel free to pause or close this report. If we're still experiencing it on RHEL 8, we'll reopen and reference this bug.

Here are a few brain-dump comments, in case they might help, or help restart the ticket later.

Didn't try very hard, but the 'valgrind' race checks seemed to show some tantalizing output. It might all be false positives, but still, worth checking out, as that sort of thing could produce rather nasty intermittent bugs. I will upload some sample traces. (They're rather flawed, but maybe useful.)

Also, while generating those, I saw some startup failures that seemed to be caused by timeouts. Not sure that's right, but given the short timeframe, it suggested that some parts of Slurm might have rather short (default) timeouts. Like under six seconds. If that's true, it seems like a recipe for instability. For myself, I'd choose one or two minutes as an absolute minimum timeout, unless there's a crucial reason. (We'll look at this in our config in the future.)

Regarding valgrind and strace being slow, it's true, but in our environment the load of slurmd and slurmstepd seems to be extremely low. Perhaps for loads where job steps are started at very high rates it would be different.

Anyway, we appreciate your help, and will update you later. Thanks! Mike

Created attachment 30160 [details]
valgrind drd/helgrind traces (poor, but maybe interesting)
Thanks Mark for your detailed analysis. Next week I am on PTO, but I will keep this bug open to check the outputs you sent. On the other hand, I will wait for any feedback you may have over the next week or two, be it because you upgrade or because you catch a case. I understand you're short on time, so don't worry. If after two weeks we have no news I will close the bug. Thanks :)

Mark, I checked your files, but I don't find anything concerning or related to your issue. Most of the messages refer to forked threads, but they don't mean a real race exists. Have you already upgraded, or caught a new situation and gathered some info (comment 33)?

Thanks for looking. At present, we're working hard on getting to RHEL 8, and thus cgroups v2. This might take a while. Once we're there, we'll be watching for further occurrences. (With any luck, there won't be any.) Thanks again!

Let's do one thing then: let's mark this bug as timed out, and when you have more feedback or new catches, just reopen it and we will continue! Thanks :)