Ticket 16390

Summary: jobs get stuck COMPLETING with cgroup v1, oom_pipe diagnostics
Product: Slurm Reporter: Mark Allen <racsadmin>
Component: slurmstepd Assignee: Felip Moll <felip.moll>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: lyeager
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=14191
Site: University of Oregon Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: ps -eflyH output
slurmd log
slurm cgroup tree
dmesg -T
sysctl out
systemd unit file
slurm.conf
cgroup.conf
cgroup info for working OOD desktop
cgroup info for working OOD desktop (cgroup contents)
cgroup info for working OOD desktop (cgroups with ps)
valgrind drd/helgrind traces (poor, but maybe interesting)

Description Mark Allen 2023-03-28 16:18:56 MDT
We've been seeing a new, intermittent error pattern over the last few months.  The obvious symptom is that jobs will get stuck in the COMPLETING state.  Users see error messages like this:

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1699/job_25139838/step_batch' : No such file or directory
slurmstepd: error: common_cgroup_lock error: No such file or directory
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

We're on RHEL 7, which means we're on cgroups v1.
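The 'No such file or directory' errors suggest the step's memory cgroup directory is already gone when teardown runs.  A quick check we can do on a node while a job sits in COMPLETING (a sketch only; the uid/job IDs are the ones from this report):

```shell
# Sketch: does the step's memory cgroup still exist? (IDs taken from this ticket)
CG=/sys/fs/cgroup/memory/slurm/uid_1699/job_25139838/step_batch
if [ -d "$CG" ]; then
    cat "$CG/memory.oom_control"   # oom_kill_disable / under_oom counters
else
    echo "step cgroup already removed"
fi
```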

We never used to see these.  What changed?  The things that come to mind are that we updated the version of SLURM, and we somewhat reduced the memory available for jobs on each node (to give more to the GPFS page pool).

It happens often enough to be a problem, but we're not sure we have a way to reproduce it.
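For spotting affected jobs, the simplest signal is anything parked in COMPLETING for a long time; on a live cluster that would be squeue with a state filter, and the filtering step can be sanity-checked on canned output:

```shell
# Live command would be: squeue --states=COMPLETING -o '%i %T %M %N'
# Self-check of an equivalent awk filter on sample squeue-style output:
printf '%s\n' \
  '25139838 COMPLETING 15:55:10 n101' \
  '25139900 RUNNING 01:02:03 n102' \
  | awk '$2 == "COMPLETING"'
```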

Here are some lines grep'ed from the slurmctld log (truncated).  I'm not sure whether or at what point the user might have explicitly 'scancel'ed the job, but I'm pretty sure it was hung prior to any such act.


[2023-03-26T18:10:27.734] _slurm_rpc_submit_batch_job: JobId=25139838 InitPrio=175891 usec=1971
[2023-03-27T05:49:24.880] debug:  email msg to xxx@uoregon.edu: Slurm Job_id=25139838 Name=feats_cifar10_nn_fast-walking-runoff-split_nmod0600_rng24-32_itr091 Began, Queued time 11:38:57
[2023-03-27T05:49:24.880] sched: Allocate JobId=25139838 NodeList=n101 #CPUs=3 Partition=gpu
[2023-03-27T05:49:25.114] prolog_running_decr: Configuration for JobId=25139838 is complete
[2023-03-27T05:49:25.114] Extending JobId=25139838 time limit by 1 secs for configuration
[2023-03-27T21:44:35.000] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:44:35.005] debug:  email msg to xxx@uoregon.edu: Slurm Job_id=25139838 Name=feats_cifar10_nn_fast-walking-runoff-split_nmod0600_rng24-32_itr091 Ended, Run time 15:55:10, CANCELLED, ExitCode 0
[2023-03-27T21:44:35.007] job_signal: 9 of running JobId=25139838 successful 0x8004
[2023-03-27T21:45:35.967] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:46:36.610] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:47:40.786] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:48:41.341] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:49:42.340] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:50:44.537] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:55:50.967] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T21:55:50.967] job_str_signal(3): invalid JobId=25139838
[2023-03-27T21:55:50.967] _slurm_rpc_kill_job: job_str_signal() uid=1699 JobId=25139838 sig=9 returned: Invalid job id specified
[2023-03-27T22:01:00.205] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T22:01:00.205] job_str_signal(3): invalid JobId=25139838
[2023-03-27T22:01:00.205] _slurm_rpc_kill_job: job_str_signal() uid=1699 JobId=25139838 sig=9 returned: Invalid job id specified
[2023-03-27T22:06:08.852] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T22:06:08.852] job_str_signal(3): invalid JobId=25139838
[2023-03-27T22:06:08.852] _slurm_rpc_kill_job: job_str_signal() uid=1699 JobId=25139838 sig=9 returned: Invalid job id specified
[2023-03-27T22:11:18.281] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
[2023-03-27T22:11:18.281] job_str_signal(3): invalid JobId=25139838
[2023-03-27T22:11:18.281] _slurm_rpc_kill_job: job_str_signal() uid=1699 JobId=25139838 sig=9 returned: Invalid job id specified
[2023-03-27T22:16:25.058] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=25139838 uid 1699
...


As a guess, the OOM killer is firing, and SLURM isn't dealing with that correctly in this case.
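One way to test that guess is to correlate kernel oom-kill records with the affected jobs.  The sample line below is a hand-written example of the usual el7 memcg OOM message, included just to exercise the extraction pipeline; on a node the input would come from `dmesg -T`:

```shell
# Sketch: pull the job id out of a memcg oom-kill line (sample line is made up,
# modeled on the el7 kernel's "Task in ... killed as a result of limit of ..." format)
sample='[Mon Mar 27 16:56:09 2023] Task in /slurm/uid_1699/job_25139838/step_batch killed as a result of limit of /slurm/uid_1699/job_25139838'
printf '%s\n' "$sample" | grep -o 'job_[0-9]*' | sort -u
```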


Here's a grep of the corresponding slurmd log lines:

/var/log/slurmd-20230326:[2023-03-27T05:49:25.118] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 25139838
/var/log/slurmd-20230326:[2023-03-27T05:49:25.118] task/affinity: batch_bind: job 25139838 CPU input mask for node: 0x0000031
/var/log/slurmd-20230326:[2023-03-27T05:49:25.118] task/affinity: batch_bind: job 25139838 CPU final HW mask for node: 0x0000501
/var/log/slurmd-20230326:[2023-03-27T05:49:25.118] debug:  Waiting for job 25139838's prolog to complete
/var/log/slurmd-20230326:[2023-03-27T05:49:25.451] debug:  Finished wait for job 25139838's prolog to complete
/var/log/slurmd-20230326:[2023-03-27T05:49:25.452] Launching batch job 25139838 for UID 1699
/var/log/slurmd-20230326:[2023-03-27T12:56:36.245] debug:  _fill_registration_msg: found apparently running StepId=25139838.extern
/var/log/slurmd-20230326:[2023-03-27T12:56:36.245] debug:  _fill_registration_msg: found apparently running job 25139838
/var/log/slurmd-20230326:[2023-03-27T21:44:35.007] debug:  _rpc_terminate_job: uid = 450 JobId=25139838
/var/log/slurmd-20230326:[2023-03-27T21:44:35.008] debug:  credential for job 25139838 revoked
/var/log/slurmd-20230326:[2023-03-27T21:44:35.008] debug:  _step_connect: connect() failed for /cm/local/apps/slurm/var/spool/n101_25139838.4294967291: Connection refused
/var/log/slurmd-20230326:[2023-03-27T21:44:35.008] debug:  Cleaned up stray socket /cm/local/apps/slurm/var/spool/n101_25139838.4294967291
/var/log/slurmd-20230326:[2023-03-27T21:44:35.008] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job25139838/slurm_script
/var/log/slurmd-20230326:[2023-03-27T21:44:35.008] debug:  signal for nonexistent StepId=25139838.batch stepd_connect failed: Connection refused
/var/log/slurmd-20230326:[2023-03-27T21:44:35.030] debug:  _step_connect: connect() failed for /cm/local/apps/slurm/var/spool/n101_25139838.4294967292: Connection refused
/var/log/slurmd-20230326:[2023-03-27T21:44:35.030] debug:  Cleaned up stray socket /cm/local/apps/slurm/var/spool/n101_25139838.4294967292
/var/log/slurmd-20230326:[2023-03-27T21:44:35.030] debug:  Waiting for job 25139838's prolog to complete
/var/log/slurmd-20230326:[2023-03-27T21:44:35.030] debug:  Finished wait for job 25139838's prolog to complete
/var/log/slurmd-20230326:[2023-03-27T21:44:41.675] debug:  completed epilog for jobid 25139838
/var/log/slurmd-20230326:[2023-03-27T21:44:41.676] debug:  JobId=25139838: sent epilog complete msg: rc = 0
/var/log/slurmd-20230328:[2023-03-27T05:49:25.307] [25139838.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.308] [25139838.extern] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
/var/log/slurmd-20230328:[2023-03-27T05:49:25.315] [25139838.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.315] [25139838.extern] debug:  Message thread started pid = 25115
/var/log/slurmd-20230328:[2023-03-27T05:49:25.315] [25139838.extern] debug:  Setting slurmstepd(25115) oom_score_adj to -1000
/var/log/slurmd-20230328:[2023-03-27T05:49:25.367] [25139838.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffff
/var/log/slurmd-20230328:[2023-03-27T05:49:25.411] [25139838.extern] debug:  task/cgroup: init: core enforcement enabled
/var/log/slurmd-20230328:[2023-03-27T05:49:25.411] [25139838.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257869M allowed:100%(enforced), swap:0%(enforced), max:99%(255290M) max+swap:0%(255290M) min:200M kmem:100%(257869M permissive) min:200M
/var/log/slurmd-20230328:[2023-03-27T05:49:25.411] [25139838.extern] debug:  task/cgroup: init: memory enforcement enabled
/var/log/slurmd-20230328:[2023-03-27T05:49:25.414] [25139838.extern] debug:  task/cgroup: init: device enforcement enabled
/var/log/slurmd-20230328:[2023-03-27T05:49:25.414] [25139838.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.415] [25139838.extern] cred/munge: init: Munge credential signature plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.433] [25139838.extern] debug:  job_container/none: init: job_container none plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.433] [25139838.extern] debug:  spank: opening plugin stack /cm/local/apps/slurm/var/spool/conf-cache/plugstack.conf
/var/log/slurmd-20230328:[2023-03-27T05:49:25.433] [25139838.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0,4-5'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.433] [25139838.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0,4-5'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.433] [25139838.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,8,10'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.433] [25139838.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,8,10'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.445] [25139838.extern] task/cgroup: _memcg_initialize: job: alloc=49152MB mem.limit=49152MB memsw.limit=49152MB job_swappiness=18446744073709551614
/var/log/slurmd-20230328:[2023-03-27T05:49:25.445] [25139838.extern] task/cgroup: _memcg_initialize: step: alloc=49152MB mem.limit=49152MB memsw.limit=49152MB job_swappiness=18446744073709551614
/var/log/slurmd-20230328:[2023-03-27T05:49:25.445] [25139838.extern] debug:  cgroup/v1: _oom_event_monitor: started.
/var/log/slurmd-20230328:[2023-03-27T05:49:25.469] [25139838.batch] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.470] [25139838.batch] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
/var/log/slurmd-20230328:[2023-03-27T05:49:25.477] [25139838.batch] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.477] [25139838.batch] debug:  laying out the 1 tasks on 1 hosts n101 dist 2
/var/log/slurmd-20230328:[2023-03-27T05:49:25.478] [25139838.batch] debug:  Message thread started pid = 25122
/var/log/slurmd-20230328:[2023-03-27T05:49:25.478] [25139838.batch] debug:  Setting slurmstepd(25122) oom_score_adj to -1000
/var/log/slurmd-20230328:[2023-03-27T05:49:25.478] [25139838.batch] debug:  switch/none: init: switch NONE plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.482] [25139838.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffff
/var/log/slurmd-20230328:[2023-03-27T05:49:25.485] [25139838.batch] debug:  task/cgroup: init: core enforcement enabled
/var/log/slurmd-20230328:[2023-03-27T05:49:25.485] [25139838.batch] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257869M allowed:100%(enforced), swap:0%(enforced), max:99%(255290M) max+swap:0%(255290M) min:200M kmem:100%(257869M permissive) min:200M
/var/log/slurmd-20230328:[2023-03-27T05:49:25.485] [25139838.batch] debug:  task/cgroup: init: memory enforcement enabled
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  task/cgroup: init: device enforcement enabled
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] cred/munge: init: Munge credential signature plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  job_container/none: init: job_container none plugin loaded
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  spank: opening plugin stack /cm/local/apps/slurm/var/spool/conf-cache/plugstack.conf
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0,4-5'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0,4-5'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,8,10'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.489] [25139838.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,8,10'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.497] [25139838.batch] task/cgroup: _memcg_initialize: job: alloc=49152MB mem.limit=49152MB memsw.limit=49152MB job_swappiness=18446744073709551614
/var/log/slurmd-20230328:[2023-03-27T05:49:25.497] [25139838.batch] task/cgroup: _memcg_initialize: step: alloc=49152MB mem.limit=49152MB memsw.limit=49152MB job_swappiness=18446744073709551614
/var/log/slurmd-20230328:[2023-03-27T05:49:25.497] [25139838.batch] debug:  cgroup/v1: _oom_event_monitor: started.
/var/log/slurmd-20230328:[2023-03-27T05:49:25.509] [25139838.batch] debug levels are stderr='error', logfile='debug', syslog='fatal'
/var/log/slurmd-20230328:[2023-03-27T05:49:25.512] [25139838.batch] starting 1 tasks
/var/log/slurmd-20230328:[2023-03-27T05:49:25.512] [25139838.batch] task 0 (25156) started 2023-03-27T05:49:25
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  task/affinity: task_p_pre_launch: affinity StepId=25139838.batch, task:0 bind:mask_cpu
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_CPU in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_FSIZE in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_DATA in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_STACK in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_RSS in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_NPROC in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_NOFILE in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
/var/log/slurmd-20230328:[2023-03-27T05:49:25.523] [25139838.batch] debug:  Couldn't find SLURM_RLIMIT_AS in environment
/var/log/slurmd-20230328:[2023-03-27T12:56:36.245] [25139838.extern] debug:  Handling REQUEST_STATE
/var/log/slurmd-20230328:[2023-03-27T12:56:36.245] [25139838.batch] debug:  Handling REQUEST_STATE
/var/log/slurmd-20230328:[2023-03-27T16:56:09.790] [25139838.batch] debug:  cgroup/v1: _oom_event_monitor: oom-kill event count: 1
/var/log/slurmd-20230328:[2023-03-27T16:56:09.803] [25139838.extern] debug:  cgroup/v1: _oom_event_monitor: oom-kill event count: 1
/var/log/slurmd-20230328:[2023-03-27T20:53:50.914] [25139838.batch] task 0 (25156) exited with exit code 0.
/var/log/slurmd-20230328:[2023-03-27T20:53:50.915] [25139838.batch] debug:  task/affinity: task_p_post_term: affinity StepId=25139838.batch, task 0
/var/log/slurmd-20230328:[2023-03-27T20:53:50.915] [25139838.batch] error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1699/job_25139838/step_batch' : No such file or directory
/var/log/slurmd-20230328:[2023-03-27T20:53:50.915] [25139838.batch] error: common_cgroup_lock error: No such file or directory
/var/log/slurmd-20230328:[2023-03-27T20:53:50.915] [25139838.batch] debug:  task_g_post_term: task/cgroup: Unspecified error
/var/log/slurmd-20230328:[2023-03-27T20:53:50.916] [25139838.batch] error: problem with oom_pipe[0]
/var/log/slurmd-20230328:[2023-03-27T20:53:50.916] [25139838.batch] fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument
/var/log/slurmd-20230328:[2023-03-27T20:53:50.916] [25139838.batch] debug:  signaling condition
/var/log/slurmd-20230328:[2023-03-27T21:44:35.008] [25139838.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
/var/log/slurmd-20230328:[2023-03-27T21:44:35.008] [25139838.extern] debug:  _handle_signal_container for StepId=25139838.extern uid=450 signal=18
/var/log/slurmd-20230328:[2023-03-27T21:44:35.008] [25139838.extern] Sent signal 18 to StepId=25139838.extern
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  _handle_signal_container for StepId=25139838.extern uid=450 signal=15
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] Sent signal 15 to StepId=25139838.extern
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  signaling condition
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  task/affinity: task_p_post_term: affinity StepId=25139838.extern, task 0
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1699/job_25139838/step_extern' : No such file or directory
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] error: common_cgroup_lock error: No such file or directory
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  task_g_post_term: task/cgroup: Unspecified error
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] error: problem with oom_pipe[0]
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  Handling REQUEST_STATE
/var/log/slurmd-20230328:[2023-03-27T21:44:35.009] [25139838.extern] debug:  jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded


Info from the job:

JobId=25139838 JobName=feats_cifar10_nn_fast-walking-runoff-split_nmod0600_rng24-32_itr091
   UserId=zhammoud(1699) GroupId=talapas(1159) MCS_label=N/A
   Priority=176064 Nice=0 Account=uoml QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=15:55:10 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-26T18:10:27 EligibleTime=2023-03-26T18:10:27
   AccrueTime=2023-03-26T18:10:27
   StartTime=2023-03-27T05:49:24 EndTime=2023-03-27T21:44:34 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-27T05:49:24 Scheduler=Main
   Partition=gpu AllocNode:Sid=talapas-ln1:49260
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n101
   BatchHost=n101
   NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
   TRES=cpu=3,mem=48G,node=1,billing=14,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=gpu:tesla_k80:1
     Nodes=n101 CPU_IDs=0,4-5 Mem=49152 GRES=
   MinCPUsNode=3 MinMemoryCPU=16G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/dev/stdin
   WorkDir=/gpfs/home/zhammoud/repos/debug/robust-feats-runoff-split/feature-robustness
   StdErr=/home/zhammoud/projects/.out/feats_cifar10_nn_fast-walking-runoff-split_nmod0600_rng24-32_itr091_25139838.out
   StdIn=/dev/null
   StdOut=/home/zhammoud/projects/.out/feats_cifar10_nn_fast-walking-runoff-split_nmod0600_rng24-32_itr091_25139838.out
   Power=
   TresPerNode=gres:gpu:1
   MailUser=xxx@uoregon.edu MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

Info for the node:

NodeName=n101 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=14 CPUEfctv=28 CPUTot=28 CPULoad=4.02
   AvailableFeatures=broadwell,e5-2690,k80
   ActiveFeatures=broadwell,e5-2690,k80
   Gres=gpu:tesla_k80:4(S:0-1)
   NodeAddr=n101 NodeHostName=n101 Version=22.05.6
   OS=Linux 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021
   RealMemory=246385 AllocMem=168624 FreeMem=231818 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=gpu,longgpu,preempt
   BootTime=2023-03-25T05:53:53 SlurmdStartTime=2023-03-25T05:57:02
   LastBusyTime=2023-03-26T12:57:43
   CfgTRES=cpu=28,mem=246385M,billing=56,gres/gpu=4
   AllocTRES=cpu=14,mem=168624M,gres/gpu=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Also:

[root@n101 ~]# uname -a
Linux n101 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@n101 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         257869       22442      231801         158        3625      233770
Swap:             0           0           0


Excerpts from slurm.conf:

NodeName=n[097,099-113,115-118,120] Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=246385 Feature=broadwell,e5-2690,k80 Gres=gpu:tesla_k80:4 Weight=100

PartitionName=longgpu State=up Default=NO MinNodes=1 MaxNodes=12 DefaultTime=14-00:00:00 MaxTime=14-00:00:00 AllowGroups=all AllowAccounts=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowQos=ALL QOS=clublimit LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=3 OverSubscribe=NO DefMemPerCPU=8790 TRESBillingWeights="CPU=2.0,Mem=0.219G,GRES/gpu=14.0" Nodes=n[101-115,120]

cgroup.conf:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
#CgroupReleaseAgentDir="/etc/slurm/cgroup"
#AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
ConstrainCores=yes
#TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainKmemSpace=no
ConstrainSwapSpace=yes
ConstrainDevices=yes
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=99
MaxSwapPercent=0
MinRAMSpace=200
MinKmemSpace=200


At a bare minimum, it appears that slurmstepd has somehow lost track of the cgroup situation and the error messages are just fallout after the fact.  Beyond that, it's hard to guess.

Any ideas?

Thanks!
Mike
Comment 1 Felip Moll 2023-03-29 05:04:53 MDT
Hi Mike,

How often does this issue happen? We need a bit more information from a node that failed:
1. Kernel version (uname -a)
2. cat /proc/mounts
3. Are you using Weka or Bright?
4. If you allow SSH into the nodes and use pam_slurm_adopt, can you show me the output of:
grep -ri systemd /etc/pam.d/

-

It would be ideal to set SlurmdDebug=debu2 and DebugFlags=cgroup, then reconfigure or restart slurmd daemons in the cluster. We need to catch more logging information in slurmd.
Comment 2 Felip Moll 2023-03-29 05:05:24 MDT
> It would be ideal to set SlurmdDebug=debu2 and DebugFlags=cgroup, then
> reconfigure or restart slurmd daemons in the cluster. We need to catch more
> logging information in slurmd.

"debug2" not "debu2" , sorry for the typo :)
Comment 3 Mark Allen 2023-03-30 13:38:12 MDT
Looking for a signature in our slurmd logs, it appears that it's happened about 1,700 out of the last 300,000 jobs.  So roughly half a percent.

Looking at the gross attributes of those jobs, there's not an obvious pattern.  It's happening across hosts/partitions with differing amounts of memory, different job memory requests, different users, etc.
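For reference, that count came from grepping the rotated slurmd logs for the fatal signature, roughly like this (the log naming is ours; the sample lines below just sanity-check the job-id extraction):

```shell
# Real scan (commented out; depends on our log layout):
#   grep -h 'problem with oom_pipe' /var/log/slurmd-* \
#     | grep -o '\[[0-9]*\.' | sort -u | wc -l
# Self-check on two lines from one job -- should count a single job:
printf '%s\n' \
  '[2023-03-27T20:53:50.916] [25139838.batch] error: problem with oom_pipe[0]' \
  '[2023-03-27T21:44:35.009] [25139838.extern] error: problem with oom_pipe[0]' \
  | grep -o '\[[0-9]*\.' | sort -u | wc -l
```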

Kernel is the same for all hosts:

Linux n002 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux


[root@n001 ~]# cat /proc/mounts
proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=65959208k,nr_inodes=16489802,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=23,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=21590 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fs1 /gpfs gpfs rw,relatime 0 0
gvfsd-fuse /gpfs/home/msconce/.gvfs fuse.gvfsd-fuse rw,nosuid,nodev,relatime,user_id=1221,group_id=1159 0 0
gvfsd-fuse /gpfs/home/ayb/.gvfs fuse.gvfsd-fuse rw,nosuid,nodev,relatime,user_id=2129,group_id=1159 0 0


We use Bright, but not the SLURM from there.  We compile our own from source.  I suppose it's (barely) possible that we're picking up bits from the Bright version, though that seems unlikely.


[root@n001 ~]# grep -ri systemd /etc/pam.d/
/etc/pam.d/fingerprint-auth-ac:-session     optional      pam_systemd.so
/etc/pam.d/password-auth-ac:#-session     optional      pam_systemd.so
/etc/pam.d/password-auth~:-session     optional      pam_systemd.so
/etc/pam.d/runuser-l:-session   optional        pam_systemd.so
/etc/pam.d/smartcard-auth-ac:-session     optional      pam_systemd.so
/etc/pam.d/system-auth-ac:-session     optional      pam_systemd.so
/etc/pam.d/system-auth~:-session     optional      pam_systemd.so
/etc/pam.d/systemd-user:# This file is part of systemd.
/etc/pam.d/systemd-user:# Used by systemd --user instances.
/etc/pam.d/lightdm-greeter:session   optional pam_systemd.so

[root@n001 ~]# grep -ri adopt /etc/pam.d/
/etc/pam.d/sshd:# Pam Slurm adopt:
/etc/pam.d/sshd:account    sufficient   pam_slurm_adopt.so


I've set the logging and will watch for an instance.  (Incidentally, it's "SlurmdDebug", and it doesn't appear that a "scontrol reconfig" is sufficient to activate it.  Or maybe I did it wrong.)

Thanks.
Comment 4 Mark Allen 2023-03-30 17:01:34 MDT
Here are some log lines from an occurrence.  (They're not marked 'debug2', but I think it's enabled nonetheless.)

The earlier lines mentioning this job look unremarkable.  The 'slurmd' daemon was restarted while the job was running.  Perhaps that's a factor.
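If a mid-job slurmd restart is a factor, one cross-check is whether the node's SlurmdStartTime falls inside the job's run window.  A sketch with made-up timestamps (the scontrol/sacct commands are standard Slurm CLIs, but the values below are illustrative, not from this job):

```shell
# On the cluster the inputs would come from:
#   scontrol show node n182 | tr ' ' '\n' | grep SlurmdStartTime
#   sacct -j 24810326 --format=Start,End --noheader
# Example comparison with made-up values (ISO-8601 timestamps sort lexically):
job_start='2023-03-30T12:00:00'    # hypothetical job start
job_end='2023-03-30T15:43:46'      # cancel time from this log
slurmd_start='2023-03-30T14:10:00' # hypothetical slurmd restart
awk -v s="$slurmd_start" -v a="$job_start" -v b="$job_end" \
  'BEGIN { if (s > a && s < b) print "slurmd restarted mid-job"; else print "no overlap" }'
```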

...
[2023-03-30T15:43:46.185] debug:  credential for job 24810326 revoked
[2023-03-30T15:43:46.185] [24810326.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.185] [24810326.batch] debug:  _handle_signal_container for StepId=24810326.batch uid=450 signal=18
[2023-03-30T15:43:46.186] [24810326.batch] Sent signal 18 to StepId=24810326.batch
[2023-03-30T15:43:46.186] [24810326.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.186] [24810326.extern] debug:  _handle_signal_container for StepId=24810326.extern uid=450 signal=18
[2023-03-30T15:43:46.187] [24810326.extern] Sent signal 18 to StepId=24810326.extern
[2023-03-30T15:43:46.187] [24810326.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.187] [24810326.batch] debug:  _handle_signal_container for StepId=24810326.batch uid=450 signal=15
[2023-03-30T15:43:46.187] [24810326.batch] error: *** JOB 24810326 ON n182 CANCELLED AT 2023-03-30T15:43:46 ***
[2023-03-30T15:43:46.187] [24810326.batch] Sent signal 15 to StepId=24810326.batch
[2023-03-30T15:43:46.188] [24810326.extern] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2023-03-30T15:43:46.188] [24810326.extern] debug:  _handle_signal_container for StepId=24810326.extern uid=450 signal=15
[2023-03-30T15:43:46.188] [24810326.extern] Sent signal 15 to StepId=24810326.extern
[2023-03-30T15:43:46.188] [24810326.extern] debug:  signaling condition
[2023-03-30T15:43:46.188] [24810326.extern] debug:  task/affinity: task_p_post_term: affinity StepId=24810326.extern, task 0
[2023-03-30T15:43:46.188] [24810326.extern] error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1711/job_24810326/step_extern' : No such file or directory
[2023-03-30T15:43:46.188] [24810326.extern] error: common_cgroup_lock error: No such file or directory
[2023-03-30T15:43:46.188] [24810326.extern] debug:  task_g_post_term: task/cgroup: Unspecified error
[2023-03-30T15:43:46.188] [24810326.extern] error: problem with oom_pipe[0]
[2023-03-30T15:43:46.188] [24810326.extern] fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument
[2023-03-30T15:43:46.188] [24810326.extern] debug:  jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2023-03-30T15:43:46.189] [24810326.batch] debug:  Handling REQUEST_STATE
[2023-03-30T15:43:46.209] [24810326.batch] debug:  Handling REQUEST_STATE
[2023-03-30T15:43:46.242] [24810326.batch] task 0 (151774) exited. Killed by signal 15.
[2023-03-30T15:43:46.243] [24810326.batch] debug:  task/affinity: task_p_post_term: affinity StepId=24810326.batch, task 0
[2023-03-30T15:43:46.243] [24810326.batch] error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1711/job_24810326/step_batch' : No such file or directory
[2023-03-30T15:43:46.243] [24810326.batch] error: common_cgroup_lock error: No such file or directory
[2023-03-30T15:43:46.243] [24810326.batch] debug:  task_g_post_term: task/cgroup: Unspecified error
[2023-03-30T15:43:46.243] [24810326.batch] error: problem with oom_pipe[0]
[2023-03-30T15:43:46.243] [24810326.batch] fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument
[2023-03-30T15:43:46.260] debug:  _step_connect: connect() failed for /cm/local/apps/slurm/var/spool/n182_24810326.4294967291: Connection refused
[2023-03-30T15:43:46.260] debug:  Cleaned up stray socket /cm/local/apps/slurm/var/spool/n182_24810326.4294967291
[2023-03-30T15:43:46.260] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job24810326/slurm_script
[2023-03-30T15:43:46.260] debug:  _step_connect: connect() failed for /cm/local/apps/slurm/var/spool/n182_24810326.4294967292: Connection refused
[2023-03-30T15:43:46.260] debug:  Cleaned up stray socket /cm/local/apps/slurm/var/spool/n182_24810326.4294967292
[2023-03-30T15:43:46.260] debug:  Waiting for job 24810326's prolog to complete
[2023-03-30T15:43:46.260] debug:  Finished wait for job 24810326's prolog to complete
[2023-03-30T15:43:48.524] debug:  Note large processing time from prep_g_epilog: usec=2264129 began=15:43:46.260
[2023-03-30T15:43:48.524] debug:  completed epilog for jobid 24810326
[2023-03-30T15:43:48.525] debug:  JobId=24810326: sent epilog complete msg: rc = 0
Comment 5 Felip Moll 2023-03-31 05:17:43 MDT
> [2023-03-30T15:43:48.524] debug:  Note large processing time from
> prep_g_epilog: usec=2264129 began=15:43:46.260
> [2023-03-30T15:43:48.524] debug:  completed epilog for jobid 24810326
> [2023-03-30T15:43:48.525] debug:  JobId=24810326: sent epilog complete msg:
> rc = 0

I am not sure the logs are properly configured, since I don't see any "CGROUP" log messages nor any debug2 output.

Can you do 'scontrol show config' and paste here the output? This is what is needed in slurm.conf:

SlurmdDebug=debug2
DebugFlags=cgroup

I checked it and a 'scontrol reconfig' is enough for slurmd to take the new log level. I understand you start slurmd through 'systemctl start', right?

Please, can you double-check?
Comment 6 Mark Allen 2023-03-31 10:56:50 MDT
Not sure what happened.  Here's the current state:

root@hpc-hn1 ~ 669# scontrol show config | fgrep -ie debug
DebugFlags              = Cgroup
SlurmctldDebug          = debug2
SlurmctldSyslogDebug    = (null)
SlurmdDebug             = debug2
SlurmdSyslogDebug       = (null)

All daemons were restarted (systemctl restart) at least once, but perhaps I missed something.

In any case, I'll take another run at this.
Comment 7 Mark Allen 2023-03-31 11:19:45 MDT
After a restart of all 'slurmd', there was an instance of the bug, and the log lines look rather similar (missing what you want to see).

I'm remembering now discovering that each job is actually managed by one or more slurmstepd, and these do _not_ restart when slurmd is restarted.  They also fail to pick up at least some parameter changes from slurm.conf, even if slurmctld and slurmd are restarted.  (In that particular case, I changed some port numbers, which made quite a mess.)

So, a theory here is that already existing jobs won't pick up these flag changes.  I will watch for an instance of the bug for a job started _after_ our debug2 flag changes.
Comment 8 Felip Moll 2023-04-03 06:09:35 MDT
(In reply to Mark Allen from comment #7)
> After a restart of all 'slurmd', there was an instance of the bug, and the
> log lines look rather similar (missing what you want to see).
> 
> I'm remembering now discovering that each job is actually managed by one or
> more slurmstepd, and these do _not_ restart when slurmd is restarted.  They
> also fail to pick up at least some parameter changes from slurm.conf, even
> if slurmctld and slurmd are restarted.  (In that particular case, I changed
> some port numbers, which made quite a mess.)


Ah, if the situation happens on old steps it is expected that debug2 or CGROUP log messages don't show up. As you say slurmstepds won't get the new settings. The settings will affect only new steps.

I expect to see debug2 messages from slurmd itself at least, but not what we're after.

Sorry I assumed you were running new jobs.

> So, a theory here is that already existing jobs won't pick up these flag
> changes.  I will watch for an instance of the bug for a job started _after_
> our debug2 flag changes.

Yes, that's what we should wait for.
Comment 9 Mark Allen 2023-04-04 15:45:27 MDT
We've seen a problem that may or may not be exactly the same issue, but it seems likely.

I'm not sure what to upload, so here are a few things to start with.

A few initial observations: There are a lot of zombie slurmd processes.  That's surely a bug.  

There are processes from several rather old jobs.  Checking one, it seems to be in the "completing" state.  That's surely an issue as well.

I'm not seeing "debug2" anywhere.  I'm not sure whether that indicates we're not seeing log messages at that level.  The macros are a bit of a mess to read.  Is the string "debug2" specifically supposed to appear in the log message?

Attachment uploads to follow.


# showjob 24106982
JobId= 24106982
Partition= bgmp
  Account= bgmp UserId=isisd (Isis Monarrez) JobName=bash

NumNodes= 2 NumCPUs=2 NumTasks=2 Tasks/Node=0 CPUs/Task=1
TRES= cpu=2  mem=64G  node=2
 MinMemoryNode=32G

BatchHost= n278 NodeList=n[278-279]

SubmitTime= 2023-02-13T14:52:58
 StartTime= 2023-02-13T14:52:58
   EndTime= 2023-02-13T20:05:18
   RunTime=            05:12:20
 TimeLimit=            10:00:00

WorkDir= /gpfs/projects/bgmp/isisd/bioinfo/OHSU
Command= bash



JobState= COMPLETING Reason=None ExitCode=0:0
Comment 10 Mark Allen 2023-04-04 15:46:51 MDT
Created attachment 29681 [details]
ps -eflyH output
Comment 11 Mark Allen 2023-04-04 15:47:28 MDT
Created attachment 29682 [details]
slurmd log
Comment 12 Mark Allen 2023-04-04 15:48:02 MDT
Created attachment 29683 [details]
slurm cgroup tree
Comment 13 Felip Moll 2023-04-10 09:37:42 MDT
- Can I ask for a "dmesg -T" in the node?
- Can I see your prolog script?
- Can I see your most recent slurm.conf?

Things I noticed:

- The compute node has been running since 2022 without any reboot.
- There are VNC and MATE (graphical) sessions, with for example RStudio or Firefox running since 2022 (users clafranc and luizah). Is this expected?
- The slurmd is started with -vv, which overrides the debug level set in slurm.conf:

 /cm/shared/apps/slurm/current/sbin/slurmd --conf-server hpc-hn1,hpc-hn2 -vv
 
How do you start slurmd? Can you remove this -vv? The configuration is taken from these two servers; I guess you changed the debug2 level on these servers and not only on the slurmd node, right?

- A couple of Zombie processes are prolog threads:

Z root     27781 37651  0  80   0     0     0 do_exi 10:29 ?        00:00:00     [prolog] <defunct>

Can I see your prolog scripts?

- In the log... at least we can see the "CGROUP:" messages.
- The zombie processes are from 10:29, the same time a bunch of jobs/steps were started on the node, all at once. Part of a job array?
- I see the following. It looks like the stdout file doesn't exist, and the batch step fails. I guess this job just ended in the FAILED state?:

[2023-04-04T10:29:55.744] [25584694.batch] error: Could not open stdout file /home/jhass2/jamming/JacobData/logs/ScatteringSweepLonger/25584454-239.out: No such file or directory
[2023-04-04T10:29:55.744] [25584694.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-04-04T10:29:55.744] [25584694.batch] debug:  signaling condition 
 
[2023-04-04T10:29:57.527] [25584722.batch] error: Could not open stdout file /home/jhass2/jamming/JacobData/logs/ScatteringSweepLonger/25584454-267.out: No such file or directory
[2023-04-04T10:29:57.527] [25584722.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-04-04T10:29:57.527] [25584722.batch] debug:  signaling condition

[2023-04-04T10:29:57.528] [25584726.batch] error: Could not open stdout file /home/jhass2/jamming/JacobData/logs/ScatteringSweepLonger/25584454-271.out: No such file or directory
[2023-04-04T10:29:57.528] [25584726.batch] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-04-04T10:29:57.528] [25584726.batch] debug:  signaling condition
... etc ...

- Have you experienced any GPFS issues? (if possible, check the logs)
- What are your system limits? (ulimit) Can this be related to a limits issue? (sysctl -a / https://slurm.schedmd.com/high_throughput.html )
Comment 14 Felip Moll 2023-04-18 10:28:35 MDT
Hi Mark,

Do you have any feedback for me?

Thanks
Comment 15 Mark Allen 2023-04-19 20:35:27 MDT
Hi Felip,

Apologies for letting this go stale.  It is definitely still important to us.  I will try to get responses tomorrow.

As a meta comment, we're in the process of upgrading to RHEL 8 in about six months, which will mean cgroups v2.  Given that window, even just mitigation of the problems would be great, even if the underlying problems aren't solved.

What do I mean by that?  For example, if restarting all slurmd's every 15 minutes would make this go away, we'd do it.  (Yes, that's ridiculous.)

If you have any ideas on that front, we're all ears.

Again, though, I'll try to get you answers to your questions soon.

Regards,
Mike
Comment 16 Mark Allen 2023-04-20 14:27:38 MDT
Created attachment 29941 [details]
dmesg -T
Comment 17 Mark Allen 2023-04-20 14:28:18 MDT
Created attachment 29942 [details]
sysctl out
Comment 18 Mark Allen 2023-04-20 14:29:02 MDT
Created attachment 29943 [details]
systemd unit file
Comment 19 Mark Allen 2023-04-20 14:37:40 MDT
Created attachment 29944 [details]
slurm.conf
Comment 20 Mark Allen 2023-04-20 14:38:15 MDT
Created attachment 29945 [details]
cgroup.conf
Comment 21 Mark Allen 2023-04-20 14:52:06 MDT
Hi,

We're now seeing 'debug2' messages in the logs, so I'm going to look for a fresh instance.  In case it might help, I've uploaded a few things you mentioned from the original instance, but you're welcome to await the new instance.

(Note that the two .conf files were grabbed today, so might be a bit different.  Probably not in a way that matters here.)

Regarding your questions, yes, this node has been up for four or five months.  We try to avoid taking them down except during maintenance events, twice per year.

The VNC/X sessions are expected and were started within Open OnDemand jobs.  Those jobs ended long ago, so it's quite unexpected that the processes are still present.

That said, some processes in those sessions do immediately re-parent to PID 1.  My thought is that they still cannot escape the job's cgroup and that slurmd will thus kill them at the job's end no matter what.  I'm fuzzy on the details of cgroups--I certainly hope that's true.

In a test desktop job just now, I do see those re-parented processes in the output of 'systemctl status slurmd', which makes me think that they have not escaped.
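To double-check this, a re-parented process's cgroup membership can be read directly from /proc (a sketch assuming the cgroup v1 `/slurm/uid_*/job_*` layout seen in the errors above; `cgroup_path` and `in_slurm_job` are just illustrative helper names):

```python
# Sketch: report which cgroup a PID belongs to by parsing
# /proc/<pid>/cgroup (Linux only). Assumes the cgroup v1 layout from
# the errors above: /slurm/uid_<uid>/job_<jobid>/step_<step>/...
def cgroup_path(pid, controller="memory"):
    """Return the cgroup path for `controller`, or None if absent."""
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            _, ctrls, path = line.rstrip("\n").split(":", 2)
            if controller in ctrls.split(","):
                return path
    return None

def in_slurm_job(pid):
    """True if the PID is still inside some /slurm/.../job_* cgroup."""
    path = cgroup_path(pid)
    return path is not None and "/job_" in path
```

A re-parented process that still shows a /slurm/uid_.../job_... path here has not escaped the job's cgroup, even though its parent is now PID 1.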

I saw the zombie processes as well.  These seem like processes that slurmd should somehow have waited for.

The prolog files are largely or completely coming from Bright.  They don't seem to do much besides 'scontrol show' commands.  I can still upload if you like.

No GPFS issues that we're aware of.  And in any case, I'd expect slurmd to throw a huge fit if there were any such disk I/O errors.

I'll work on capturing a new instance.  Thanks for your help.

Mike
Comment 22 Felip Moll 2023-04-24 03:25:20 MDT
(In reply to Mark Allen from comment #21)
> Hi,
> 
> We're now seeing 'debug2' messages in the logs, so I'm going to look for a
> fresh instance.  In case it might help, I've uploaded a few things you
> mentioned from the original instance, but you're welcome to await the new
> instance.

The only interesting thing I can see are the numerous kills because OOM:

]$ grep Killed n289-stuck-dmesg-20230420

but indeed I don't see anything related to NFS or the filesystem. I think it is worth waiting for a new instance if you're seeing debug2 messages.

 
> Regarding your questions, yes, this node has been up for four or five
> months.  We try to avoid taking them down except during maintenance events,
> twice per year.
> 
> The VNC/X sessions are expected and were started within Open OnDemand jobs. 
> Those jobs ended long ago, so it's quite unexpected that the processes are
> still present.

Some time ago we had a problem with X sessions and a fork(), where we were calling non-async-signal-safe functions after the fork, but we fixed that back in 2020.

commit 8fb07498b3fdc1e1ca7bb0aa2c95785ab2956f42
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Wed Jul 8 17:30:11 2020 +0200
Commit:     Tim Wickberg <tim@schedmd.com>
CommitDate: Fri Nov 6 10:10:26 2020 -0700

    Make x11 forwarding setup async-signal-safe.

This was causing random issues in X apps that suddenly crashed. When you fork a multi-threaded process, it is possible that one thread has locks held, and the forked process inherits those locks. That's why it is important to get the slurmd log; I want to see the process and what exactly slurmd was doing around the time the issue happened.

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1699/job_25139838/step_batch' : No such file or directory
slurmstepd: error: common_cgroup_lock error: No such file or directory
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1203 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

There are some minor fixes about this (like e8bedd62f6bfdd338a90db3a7eb66c7958446625), and we know some places that are still uncorrected. It would be good to know whether all the jobs where you're seeing this issue are X11-related, or always Python.
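The fork-inherits-locks failure mode described above can be sketched outside Slurm (this is illustrative Python, not Slurm code): if another thread holds a lock at fork() time, the child inherits the lock in its locked state, with no thread left in the child to ever release it.

```python
import os, threading, time

# Illustrative sketch (not Slurm code): a lock held by another thread
# at fork() time is inherited by the child in the locked state.
lock = threading.Lock()

def holder():
    with lock:
        time.sleep(2)           # hold the lock while the main thread forks

t = threading.Thread(target=holder)
t.start()
time.sleep(0.5)                 # make sure holder() owns the lock

pid = os.fork()
if pid == 0:
    # Child: only the forking thread survives, but the lock's memory
    # still reads "held" -- a blocking acquire here would hang forever.
    stuck = not lock.acquire(blocking=False)
    os._exit(0 if stuck else 1)

_, status = os.waitpid(pid, 0)
child_inherited_locked = (os.waitstatus_to_exitcode(status) == 0)
print("child inherited a locked lock:", child_inherited_locked)
t.join()
```

A pthread mutex in the same situation can additionally report EINVAL if its memory is freed or reinitialized, which is consistent with the `pthread_mutex_lock(): Invalid argument` fatals above.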

> That said, some processes in those session do immediately re-parent to PID
> 1.  My thought is that they still cannot escape the job's cgroup and that
> slurmd will thus kill them at the job's end no matter what.  I'm fuzzy on
> the details of cgroups--I certainly hope that's true.

slurmstepd is the one in charge of user processes, not slurmd directly. One other idea is that these graphical or Python apps are launching processes through a launcher (like systemd-run, or calls to dbus and systemd, which is possible since they're in a graphical session) which run outside Slurm's control because they are really started by systemd and not Slurm. If you see processes outside our cgroups, that may be the cause, but that shouldn't make the jobs get stuck.

> In a test desktop job just now, I do see those re-parented processes in the
> output of 'systemctl status slurmd', which makes me think that they have not
> escaped.
> 
> I saw the zombie processes as well.  These seem like processes that slurmd
> should somehow have waited for.

If they are re-parented to PID 1, this means their parent is dead, which means slurmstepd is dead. That reinforces my theory that slurmstepd terminates abruptly or doesn't wait for the forked processes for some reason.


> The prolog files are largely or completely coming from Bright.  They don't
> seem to do much besides 'scontrol show' commands.  I can still upload if you
> like.

Ok, we'll table that for now.


> No GPFS issues that we're aware of.  And in any case, I'd expect slurmd to
> throw a huge fit if there were any such disk I/O errors.
> 
> I'll work on capturing a new instance.  Thanks for your help.
> 
> Mike

Thanks Mike.
Comment 23 Mark Allen 2023-04-24 12:43:50 MDT
Thanks for your reply.  I think we've made some real progress on our end, which I will detail in the next comment.

Replying to your current comments first, yes, the OOM killer seems to be part of all of this.

It's somewhat plausible that the triggering jobs involve X11, likely a server, or I suppose Python.  Nothing magical about these, as far as I know, but maybe their typical behavior is a trigger.

As far as I know, we don't specifically enable any SLURM-specific X11 stuff.  On the other hand, the 'srun --x11' flag seems to work.  Not sure if this is involved.  But also, I don't think Open OnDemand desktop sessions use this feature.  That is, I think they set up their own forwarding sockets.

I'm seeing the "Invalid argument" messages occasionally.  If this is a true EINVAL, it makes me wonder whether these calls are buggy somehow.  (See next comment also.)

If user processes can start significant programs via the system 'systemd' or 'dbus-daemon', that's news to me, but I'm not very familiar with these.

Re "If they are re-parented to PID 1 this means their parent is dead, which means slurmstepd is dead": I don't think the second part follows.  That is, the parent is dead, but wasn't necessarily slurmstepd.  Based on your comments, though, re-parenting (or forking) does seem like a place where something could be going amiss.

Next comment coming up.
Comment 24 Mark Allen 2023-04-24 15:14:34 MDT
Starting an OOD desktop session on an unused node did seem to reproduce at least part of this.  (If nothing else, it definitely demonstrates a Slurm bug.)

In short, this job is exceeding its Slurm memory request by a lot.  On a fully used node, this would lead shortly to a lot of trouble.  Not really sure which processes get whacked first, but it's all bad.

Here are some bits that seem relevant.  I will try to gather more data, assuming it really is reproducible.  My first thought is to capture high-quality strace output for the entire job, starting from slurmd.  What else would be useful?

Mike

Linux n140 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

### mainly note 32G memory

# showjob 25852232
JobId= 25852232
Partition= racs
  Account= hpcrcf UserId=mcolema5 (Michael Coleman) JobName=sys/dashboard/sys/bc_talapas_desktop

NumNodes= 1 NumCPUs=4 NumTasks=1 Tasks/Node=0 CPUs/Task=4
TRES= cpu=4  mem=32G  node=1
MinCPUsNode=4 MinMemoryNode=32G

BatchHost= n140 NodeList=n140

SubmitTime= 2023-04-20T13:21:39
StartTime= 2023-04-20T13:21:43
   EndTime= 2023-04-30T13:21:43
   RunTime=          1-04:21:47
TimeLimit=         10-00:00:00

WorkDir= /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a
Command= (null)
StdOut= /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
StdErr= /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log

JobState= RUNNING Reason=None ExitCode=0:0

### same from raw scontrol

# scontrol show job 25852232
JobId=25852232 JobName=sys/dashboard/sys/bc_talapas_desktop
   UserId=mcolema5(1345) GroupId=uoregon(131) MCS_label=N/A
   Priority=1200733 Nice=0 Account=hpcrcf QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-04:34:55 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2023-04-20T13:21:39 EligibleTime=2023-04-20T13:21:39
   AccrueTime=2023-04-20T13:21:39
   StartTime=2023-04-20T13:21:43 EndTime=2023-04-30T13:21:43 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-04-20T13:21:43 Scheduler=Main
   Partition=racs AllocNode:Sid=talapas-ln2:45953
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n140
   BatchHost=n140
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=32G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a
   Comment=
   StdErr=/home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
   StdIn=/dev/null
   StdOut=/home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/output.log
   Power=



### Started desktop, then gnome terminal within, then python3 in terminal.  Within that, generated huge string (see below).  I stopped it and started another python3 to use more memory.  Obviously, this is blowing way past any Slurm job limits.  The job is still running after several days, probably partly because there have been no other jobs run on this node.
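
### A hypothetical reconstruction of the memory-hog step (the exact command wasn't captured; 90 GiB is just an example matching the ~93g RES visible in top):

```python
# Hypothetical reconstruction (exact command not captured). CPython
# ASCII strings cost ~1 byte/char, so this allocates roughly n_bytes
# of RSS -- far past the job's mem=32G request for large n_bytes.
def big_string(n_bytes):
    return "x" * n_bytes

# e.g. big_string(90 * 1024**3) for ~90 GiB
```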


top - 17:55:02 up 43 days,  4:21,  2 users,  load average: 0.00, 0.01, 0.05
Tasks: 418 total,   1 running, 414 sleeping,   3 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 19799787+total, 73960536 free, 11990907+used,  4128264 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 75955136 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7374 mcolema5  20   0   93.3g  93.1g   2968 T   0.0 49.3   0:34.33 python3
 7593 mcolema5  20   0 9963112   9.3g   2968 S   0.0  4.9   0:07.86 python3
 9775 root       0 -20   19.1g   4.9g 268304 S   0.3  2.6 189:09.65 mmfsd
30994 root      20   0 2689992 275428  55436 S   0.0  0.1  27:35.41 cmd
 3480 root      20   0  186808 107552 107096 S   0.3  0.1   6:45.82 systemd-journal
 4881 mcolema5  20   0  434112  80968  13552 S   0.7  0.0   2:00.78 Xvnc
...


### partial ps output, omitting lots of kernel/root procs that seem irrelevant

S root      3085     1  0  80   0  7868 64496 inet_c Apr20 ?        00:00:02   /cm/shared/apps/slurm/current/sbin/slurmd --conf-server hpc-hn1,hpc-hn2
S root      4801     1  0  80   0  4436 108222 do_wai Apr20 ?       00:00:06   slurmstepd: [25852232.extern]
S root      4806  4801  0  80   0   348 27014 hrtime Apr20 ?        00:00:00     sleep 100000000
S root      4808     1  1  80   0  4640 107706 do_wai Apr20 ?       00:17:47   slurmstepd: [25852232.batch]
S mcolema5  4813  4808  0  80   0  1780  4005 do_wai Apr20 ?        00:00:00     /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  4908  4813  0  80   0  1584  2430 do_wai Apr20 ?        00:00:00       bash /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/script.sh
S mcolema5  4982  4908  0  80   0 28412 132928 poll_s Apr20 ?       00:00:02         mate-session
S mcolema5  5047  4982  0  80   0 15336 364107 poll_s Apr20 ?       00:00:13           /usr/libexec/mate-settings-daemon
S mcolema5  5051  4982  0  80   0 16560 102908 poll_s Apr20 ?       00:00:03           marco
S mcolema5  5055  4982  0  80   0 16520 141292 poll_s Apr20 ?       00:00:02           mate-panel
S mcolema5  8419  5055  0  80   0 25936 147685 poll_s Apr20 ?       00:00:01             mate-terminal
S mcolema5  8428  8419  0  80   0   860  3698 unix_s Apr20 ?        00:00:00               gnome-pty-helper
S mcolema5  8429  8419  0  80   0  5292  4835 do_wai Apr20 pts/0    00:00:00               -bash
T mcolema5  7374  8429  4  80   0 97662500 24463434 do_sig 17:40 pts/0 00:00:34                 python3
S mcolema5  7593  8429  1  80   0 9771892 2490778 poll_s 17:41 pts/0 00:00:07                 python3
S mcolema5  5072  4982  0  80   0 31248 147109 poll_s Apr20 ?       00:00:55           caja
S mcolema5  5075  4982  0  80   0 16836 71489 poll_s Apr20 ?        00:00:00           /usr/bin/python /usr/share/system-config-printer/applet.py
S mcolema5  5076  4982  0  80   0 12480 96527 poll_s Apr20 ?        00:00:00           abrt-applet
S mcolema5  5081  4982  0  80   0  9440 99829 poll_s Apr20 ?        00:00:00           mate-screensaver
S mcolema5  5083  4982  0  80   0 19656 127696 poll_s Apr20 ?       00:00:00           nm-applet
S mcolema5  5260  4813  0  80   0  1116  4005 pipe_w Apr20 ?        00:00:00       /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  5263  5260  0  80   0   740  4005 do_wai Apr20 ?        00:00:00         /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  5264  5263  0  80   0   616  1103 hrtime Apr20 ?        00:00:03           tail -f --pid=4908 vnc.log
S mcolema5  4881     1  0  80   0 80968 108528 ep_pol Apr20 ?       00:02:00   /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n140:1 (mcolema5) -auth /home/mcolema5/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/mcolema5/.vnc/x509_cert.pem -x509key /home/mcolema5/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0
S mcolema5  4939     1  0  80   0  1000 15802 poll_s Apr20 ?        00:00:00   dbus-launch --autolaunch=5abba43653164143a1d82e3b311c9406 --binary-syntax --close-stderr
S mcolema5  4940     1  0  80   0  1408 15543 ep_pol Apr20 ?        00:00:00   /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
S mcolema5  4944     1  0  80   0  2984 43813 poll_s Apr20 ?        00:00:00   /usr/libexec/dconf-service
S mcolema5  4985     1  0  80   0  1000 15803 poll_s Apr20 ?        00:00:00   dbus-launch --exit-with-session mate-session
S mcolema5  4986     1  0  80   0  1984 15673 ep_pol Apr20 ?        00:00:00   /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
S mcolema5  5004     1  0  80   0  4108 82793 poll_s Apr20 ?        00:00:00   /usr/libexec/at-spi-bus-launcher
S mcolema5  5017  5004  0  80   0  2284 15543 ep_pol Apr20 ?        00:00:00     /usr/bin/dbus-daemon --config-file=/usr/share/defaults/at-spi2/accessibility.conf --nofork --print-address 3
S mcolema5  5008     1  0  80   0  3940 66656 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfsd
S mcolema5  5139  5008  0  80   0  4824 84671 poll_s Apr20 ?        00:00:00     /usr/libexec/gvfsd-trash --spawner :1.2 /org/gtk/gvfs/exec_spaw/0
S mcolema5  6331  5008  0  80   0  4648 102675 poll_s Apr20 ?       00:00:00     /usr/libexec/gvfsd-network --spawner :1.2 /org/gtk/gvfs/exec_spaw/1
S mcolema5  6352  5008  0  80   0  4444 91345 poll_s Apr20 ?        00:00:00     /usr/libexec/gvfsd-dnssd --spawner :1.2 /org/gtk/gvfs/exec_spaw/3
S mcolema5  5013     1  0  80   0  3388 97320 futex_ Apr20 ?        00:00:00   /usr/libexec/gvfsd-fuse /home/mcolema5/.gvfs -f -o big_writes
S mcolema5  5026     1  0  80   0  4012 55229 poll_s Apr20 ?        00:00:00   /usr/libexec/at-spi2-registryd --use-gnome-session
S mcolema5  5040     1  0  80   0  2988 43813 poll_s Apr20 ?        00:00:00   /usr/libexec/dconf-service
S mcolema5  5064     1  0  80   0  4048 141615 poll_s Apr20 ?       00:00:00   /usr/bin/pulseaudio --start --log-target=syslog
S rtkit     5065     1  0  81   1  1812 49696 poll_s Apr20 ?        00:00:01   /usr/libexec/rtkit-daemon
S mcolema5  5087     1  0  80   0  5280 74468 poll_s Apr20 ?        00:00:06   /usr/libexec/gvfs-udisks2-volume-monitor
S root      5095     1  0  80   0  7184 114475 poll_s Apr20 ?       00:00:33   /usr/libexec/udisks2/udisksd
S mcolema5  5192     1  0  80   0  3184 67364 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfs-mtp-volume-monitor
S mcolema5  5198     1  0  80   0  3416 69637 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfs-gphoto2-volume-monitor
S mcolema5  5204     1  0  80   0  4404 87572 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfs-afc-volume-monitor
S mcolema5  5210     1  0  80   0 14244 110089 poll_s Apr20 ?       00:00:00   /usr/libexec/mate-panel/wnck-applet
S mcolema5  5212     1  0  80   0 15976 123510 poll_s Apr20 ?       00:00:01   /usr/libexec/mate-panel/clock-applet
S mcolema5  5214     1  0  80   0  9100 88456 poll_s Apr20 ?        00:00:00   /usr/libexec/mate-panel/notification-area-applet
S mcolema5  5262     1  0  80   0 26560 70835 poll_s Apr20 ?        00:00:07   /packages/python/3.6.5/bin/python3.6 /packages/python/3.6.5/bin/websockify -D 40045 localhost:5901
S mcolema5  6357  5262  0  80   0 27800 72269 poll_s 17:34 ?        00:00:01     /packages/python/3.6.5/bin/python3.6 /packages/python/3.6.5/bin/websockify -D 40045 localhost:5901
S mcolema5  5414     1  0  80   0  5068 103195 poll_s Apr20 ?       00:00:00   /usr/libexec/xdg-desktop-portal
S mcolema5  5419     1  0  80   0  3012 112041 poll_s Apr20 ?       00:00:00   /usr/libexec/xdg-document-portal
S mcolema5  5424     1  0  80   0  2524 60648 poll_s Apr20 ?        00:00:00   /usr/libexec/xdg-permission-store
S mcolema5  5436     1  0  80   0  8940 94966 poll_s Apr20 ?        00:00:00   /usr/libexec/xdg-desktop-portal-gtk


### Here's the log from job startup, captured like so: fgrep -e 25852232 -C 100 /var/log/slurmd-20230421

### The start and end are clipped some to include the relevant parts.

[2023-04-20T13:21:31.048] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:21:31.048] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:21:43.270] debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG
[2023-04-20T13:21:43.270] debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
[2023-04-20T13:21:43.271] debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2023-04-20T13:21:43.271] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2023-04-20T13:21:43.271] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 25852232
[2023-04-20T13:21:43.271] task/affinity: batch_bind: job 25852232 CPU input mask for node: 0x000000F
[2023-04-20T13:21:43.271] task/affinity: batch_bind: job 25852232 CPU final HW mask for node: 0x0000055
[2023-04-20T13:21:43.271] debug:  Waiting for job 25852232's prolog to complete
[2023-04-20T13:21:43.296] debug2: prep/script: _run_subpath_command: prolog success rc:0 output:
[2023-04-20T13:21:43.339] debug2: debug level read from slurmd is 'debug2'.
[2023-04-20T13:21:43.339] debug2: read_slurmd_conf_lite: slurmd sent 9 TRES.
[2023-04-20T13:21:43.339] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-04-20T13:21:43.340] debug:  acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-04-20T13:21:43.340] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-04-20T13:21:43.340] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-04-20T13:21:43.340] debug2: Received CPU frequency information for 28 CPUs
[2023-04-20T13:21:43.341] debug:  switch Cray/Aries plugin loaded.
[2023-04-20T13:21:43.342] debug:  switch/none: init: switch NONE plugin loaded
[2023-04-20T13:21:43.342] debug:  gres/gpu: init: loaded
[2023-04-20T13:21:43.342] [25852232.extern] debug2: setup for a launch_task
[2023-04-20T13:21:43.342] [25852232.extern] CGROUP: autodetect_cgroup_version: using cgroup version 1
[2023-04-20T13:21:43.343] [25852232.extern] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-04-20T13:21:43.343] [25852232.extern] debug2: hwloc_topology_init
[2023-04-20T13:21:43.343] [25852232.extern] debug2: xcpuinfo_hwloc_topo_load: xml file (/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml) found
[2023-04-20T13:21:43.345] [25852232.extern] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
[2023-04-20T13:21:43.366] [25852232.extern] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-04-20T13:21:43.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:21:43.366] [25852232.extern] debug:  Message thread started pid = 4801
[2023-04-20T13:21:43.366] [25852232.extern] debug:  Setting slurmstepd(4801) oom_score_adj to -1000
[2023-04-20T13:21:43.378] [25852232.extern] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffff
[2023-04-20T13:21:43.389] [25852232.extern] debug:  task/cgroup: init: core enforcement enabled
[2023-04-20T13:21:43.389] [25852232.extern] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:193357M allowed:100%(enforced), swap:0%(enforced), max:99%(191423M) max+swap:0%(191423M) min:200M kmem:100%(193357M permissive) min:200M
[2023-04-20T13:21:43.389] [25852232.extern] debug:  task/cgroup: init: memory enforcement enabled
[2023-04-20T13:21:43.400] [25852232.extern] debug:  task/cgroup: init: device enforcement enabled
[2023-04-20T13:21:43.400] [25852232.extern] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-04-20T13:21:43.400] [25852232.extern] cred/munge: init: Munge credential signature plugin loaded
[2023-04-20T13:21:43.400] [25852232.extern] debug:  job_container/none: init: job_container none plugin loaded
[2023-04-20T13:21:43.401] [25852232.extern] debug2: _spawn_job_container: Before call to spank_init()
[2023-04-20T13:21:43.401] [25852232.extern] debug:  spank: opening plugin stack /cm/local/apps/slurm/var/spool/conf-cache/plugstack.conf
[2023-04-20T13:21:43.401] [25852232.extern] debug2: _spawn_job_container: After call to spank_init()
[2023-04-20T13:21:43.401] [25852232.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-3'
[2023-04-20T13:21:43.401] [25852232.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-3'
[2023-04-20T13:21:43.401] [25852232.extern] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6'
[2023-04-20T13:21:43.401] [25852232.extern] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6'
[2023-04-20T13:21:43.426] [25852232.extern] task/cgroup: _memcg_initialize: job: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614
[2023-04-20T13:21:43.426] [25852232.extern] task/cgroup: _memcg_initialize: step: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614
[2023-04-20T13:21:43.426] [25852232.extern] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-04-20T13:21:43.427] [25852232.extern] debug2: adding task 0 pid 4806 on node 0 to jobacct
[2023-04-20T13:21:43.428] [25852232.extern] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.extern taskid 0 max_task_id 0
[2023-04-20T13:21:43.429] [25852232.extern] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4806' for '/sys/fs/cgroup/cpuacct/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-20T13:21:43.429] [25852232.extern] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.extern taskid 0 max_task_id 0
[2023-04-20T13:21:43.441] [25852232.extern] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4806' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-20T13:21:43.441] debug:  Finished wait for job 25852232's prolog to complete
[2023-04-20T13:21:43.441] Launching batch job 25852232 for UID 1345
[2023-04-20T13:21:43.444] debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
[2023-04-20T13:21:43.480] debug2: debug level read from slurmd is 'debug2'.
[2023-04-20T13:21:43.480] debug2: read_slurmd_conf_lite: slurmd sent 9 TRES.
[2023-04-20T13:21:43.481] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-04-20T13:21:43.481] debug:  acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-04-20T13:21:43.482] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-04-20T13:21:43.482] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-04-20T13:21:43.482] debug2: Received CPU frequency information for 28 CPUs
[2023-04-20T13:21:43.482] debug:  gres/gpu: init: loaded
[2023-04-20T13:21:43.482] [25852232.batch] debug2: setup for a batch_job
[2023-04-20T13:21:43.483] [25852232.batch] CGROUP: autodetect_cgroup_version: using cgroup version 1
[2023-04-20T13:21:43.483] [25852232.batch] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2023-04-20T13:21:43.483] [25852232.batch] debug2: hwloc_topology_init
[2023-04-20T13:21:43.483] [25852232.batch] debug2: xcpuinfo_hwloc_topo_load: xml file (/cm/local/apps/slurm/var/spool/hwloc_topo_whole.xml) found
[2023-04-20T13:21:43.486] [25852232.batch] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
[2023-04-20T13:21:43.506] [25852232.batch] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2023-04-20T13:21:43.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:21:43.507] [25852232.batch] debug:  laying out the 1 tasks on 1 hosts n140 dist 2
[2023-04-20T13:21:43.507] [25852232.batch] debug:  Message thread started pid = 4808
[2023-04-20T13:21:43.507] debug2: Finish processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2023-04-20T13:21:43.507] [25852232.batch] debug:  Setting slurmstepd(4808) oom_score_adj to -1000
[2023-04-20T13:21:43.508] [25852232.batch] debug:  switch/none: init: switch NONE plugin loaded
[2023-04-20T13:21:43.519] [25852232.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfffffff
[2023-04-20T13:21:43.530] [25852232.batch] debug:  task/cgroup: init: core enforcement enabled
[2023-04-20T13:21:43.530] [25852232.batch] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:193357M allowed:100%(enforced), swap:0%(enforced), max:99%(191423M) max+swap:0%(191423M) min:200M kmem:100%(193357M permissive) min:200M
[2023-04-20T13:21:43.530] [25852232.batch] debug:  task/cgroup: init: memory enforcement enabled
[2023-04-20T13:21:43.541] [25852232.batch] debug:  task/cgroup: init: device enforcement enabled
[2023-04-20T13:21:43.541] [25852232.batch] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2023-04-20T13:21:43.541] [25852232.batch] cred/munge: init: Munge credential signature plugin loaded
[2023-04-20T13:21:43.542] [25852232.batch] debug:  job_container/none: init: job_container none plugin loaded
[2023-04-20T13:21:43.542] [25852232.batch] debug2: Before call to spank_init()
[2023-04-20T13:21:43.542] [25852232.batch] debug:  spank: opening plugin stack /cm/local/apps/slurm/var/spool/conf-cache/plugstack.conf
[2023-04-20T13:21:43.542] [25852232.batch] debug2: After call to spank_init()
[2023-04-20T13:21:43.542] [25852232.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-3'
[2023-04-20T13:21:43.542] [25852232.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-3'
[2023-04-20T13:21:43.542] [25852232.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0,2,4,6'
[2023-04-20T13:21:43.542] [25852232.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0,2,4,6'
[2023-04-20T13:21:43.549] [25852232.batch] task/cgroup: _memcg_initialize: job: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614
[2023-04-20T13:21:43.549] [25852232.batch] task/cgroup: _memcg_initialize: step: alloc=32768MB mem.limit=32768MB memsw.limit=32768MB job_swappiness=18446744073709551614
[2023-04-20T13:21:43.549] [25852232.batch] debug:  cgroup/v1: _oom_event_monitor: started.
[2023-04-20T13:21:43.549] [25852232.batch] debug2: hwloc_topology_load
[2023-04-20T13:21:43.569] [25852232.batch] debug2: hwloc_topology_export_xml
[2023-04-20T13:21:43.570] [25852232.batch] debug2: Entering _setup_normal_io
[2023-04-20T13:21:43.591] [25852232.batch] debug2: Leaving  _setup_normal_io
[2023-04-20T13:21:43.591] [25852232.batch] debug levels are stderr='error', logfile='debug2', syslog='fatal'
[2023-04-20T13:21:43.591] [25852232.batch] starting 1 tasks
[2023-04-20T13:21:43.592] [25852232.batch] task 0 (4813) started 2023-04-20T13:21:43
[2023-04-20T13:21:43.593] [25852232.batch] debug2: adding task 0 pid 4813 on node 0 to jobacct
[2023-04-20T13:21:43.593] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0
[2023-04-20T13:21:43.594] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/cpuacct/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-20T13:21:43.594] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0
[2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0
[2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: cgroup_p_task_addto: CGROUP: StepId=25852232.batch taskid 0 max_task_id 0
[2023-04-20T13:21:43.599] [25852232.batch] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '4813' for '/sys/fs/cgroup/devices/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-20T13:21:43.602] [25852232.batch] debug:  task/affinity: task_p_pre_launch: affinity StepId=25852232.batch, task:0 bind:mask_cpu
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_CPU in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_FSIZE in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_DATA in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_STACK in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug2: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_RSS in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_NPROC in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_NOFILE in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug:  Couldn't find SLURM_RLIMIT_AS in environment
[2023-04-20T13:21:43.603] [25852232.batch] debug2: Set task rss(32768 MB)
[2023-04-20T13:22:01.134] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:22:01.134] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:22:01.134] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:22:13.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:22:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:22:31.166] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:22:31.166] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:22:31.167] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:22:43.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:22:43.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:23:01.216] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:23:01.216] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:23:01.217] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:23:13.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:23:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:23:31.256] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:23:31.256] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:23:31.257] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:23:43.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:23:43.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:24:01.283] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:24:01.283] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:24:01.284] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:24:13.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:24:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:24:31.319] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:24:31.319] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:24:31.320] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:24:43.366] [25852232.extern] debug2: profile signaling type Task

### Like that for a while and then boom and we're off

[2023-04-20T13:28:43.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:28:43.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:29:01.890] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:01.890] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:01.890] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:13.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:29:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:29:27.635] [25852232.batch] debug:  cgroup/v1: _oom_event_monitor: oom-kill event count: 1
[2023-04-20T13:29:27.649] [25852232.extern] debug:  cgroup/v1: _oom_event_monitor: oom-kill event count: 1
[2023-04-20T13:29:31.993] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:31.993] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:31.994] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:29:43.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-20T13:29:43.367] [25852232.extern] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0/memory.stat' for reading : No such file or directory
[2023-04-20T13:29:43.367] [25852232.extern] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-20T13:29:43.367] [25852232.extern] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4806 ppid 4801
[2023-04-20T13:29:43.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-20T13:29:43.811] [25852232.batch] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0/memory.stat' for reading : No such file or directory
[2023-04-20T13:29:43.811] [25852232.batch] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-20T13:29:43.811] [25852232.batch] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4813 ppid 4808
[2023-04-20T13:30:01.053] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:30:01.053] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:30:01.054] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-20T13:30:13.366] [25852232.extern] debug2: profile signaling type Task


### This is a sample of the log from roughly a day after the job started.  It's still generating output like this.

[2023-04-21T17:51:43.817] [25852232.batch] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-21T17:51:43.817] [25852232.batch] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4813 ppid 4808
[2023-04-21T17:52:11.845] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:11.845] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:11.845] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:13.366] [25852232.extern] debug2: profile signaling type Task
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0/memory.stat' for reading : No such file or directory
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-21T17:52:13.367] [25852232.extern] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4806 ppid 4801
[2023-04-21T17:52:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-21T17:52:13.818] [25852232.batch] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0/memory.stat' for reading : No such file or directory
[2023-04-21T17:52:13.818] [25852232.batch] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0'
[2023-04-21T17:52:13.818] [25852232.batch] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4813 ppid 4808
[2023-04-21T17:52:41.873] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:41.873] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2023-04-21T17:52:41.874] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE


### This is probably the second python3 I started, but the technique was the same for both


mcolema5@n140:~ 4$ python3
Python 3.6.8 (default, Aug 13 2020, 07:46:32) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 'z' * 200000000000
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>> x = 'z' * 100000000000
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>> x = 'z' * 10000000000 
>>> 
[2]+  Stopped                 python3

### It's nice that 'ulimit -m' matches the job request, but of course this doesn't have much effect.

mcolema5@n140:~ 5$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 773354
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 33554432
open files                      (-n) 51200
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 773354
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
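
The 'ulimit -m' value really is cosmetic: Linux has accepted but ignored RLIMIT_RSS since kernel 2.4.30, so only the cgroup memory limit actually bites.  A quick sketch illustrating that (the exact resident size depends on the allocator, but the point is that no MemoryError occurs):

```python
import resource

# Lower the RSS soft limit to 1 MiB, preserving the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_RSS)
resource.setrlimit(resource.RLIMIT_RSS, (1 * 1024 * 1024, hard))

# Allocate ~64 MiB of resident memory.  On Linux this succeeds anyway:
# the kernel ignores RLIMIT_RSS, so only the cgroup memory limit
# (memory.limit_in_bytes in the v1 hierarchy) constrains the job.
buf = bytearray(64 * 1024 * 1024)
print(len(buf))  # 67108864 -- no MemoryError despite the 1 MiB "limit"
```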


The last dmesg output before this was an oom-killer message on Apr 7.  It could be relevant, but probably isn't.

For completeness:

Apr 20 13:27:29 n140 mmfs: [N] Connecting to 172.20.8.90 n090 <c0n96>:[0]
Apr 20 13:27:29 n140 mmfs: [N] Connecting to 172.20.8.90 n090 <c0n96>:[1]
Apr 20 13:27:37 n140 dbus[5897]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Apr 20 13:27:37 n140 systemd: Starting Hostname Service...
Apr 20 13:27:37 n140 dbus[5897]: [system] Successfully activated service 'org.freedesktop.hostname1'
Apr 20 13:27:37 n140 systemd: Started Hostname Service.
Apr 20 13:29:27 n140 dbus[5897]: [system] Activating via systemd: service name='org.freedesktop.UPower' unit='upower.service'
Apr 20 13:29:28 n140 systemd: Starting Daemon for power management...
Apr 20 13:29:28 n140 dbus[5897]: [system] Successfully activated service 'org.freedesktop.UPower'
Apr 20 13:29:28 n140 systemd: Started Daemon for power management.
Apr 20 13:30:01 n140 systemd: Created slice User Slice of root.
Apr 20 13:30:01 n140 systemd: Started Session 127794 of user root.
Apr 20 13:30:01 n140 systemd: Started Session 127795 of user root.
Apr 20 13:30:01 n140 systemd: Started Session 127796 of user root.
Apr 20 13:30:01 n140 systemd: Started Session 127797 of user root.
Comment 25 Mark Allen 2023-04-24 15:49:46 MDT
One additional detail.  The '/sys/fs/cgroup/cpu,cpuacct' tree has a lot of entries for this job, but the /memory tree is completely gone.  (at least as of right now)
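A quick way to cross-check which hierarchies still hold the job's tree (a hypothetical helper, not part of Slurm; the controller names and the /sys/fs/cgroup root are parameters so it can be pointed at any mount):

```python
import os

def surviving_controllers(job_dir, controllers, root="/sys/fs/cgroup"):
    """Return the cgroup v1 controllers under `root` that still contain
    `job_dir` (e.g. 'slurm/uid_1345/job_25852232')."""
    return [c for c in controllers
            if os.path.isdir(os.path.join(root, c, job_dir))]

# On the affected node this would list cpuacct but not memory:
# surviving_controllers("slurm/uid_1345/job_25852232",
#                       ["cpuacct", "memory", "devices", "freezer"])
```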
Comment 26 Mark Allen 2023-04-24 17:35:55 MDT
I tried replicating this on another node, restarting slurmd and attaching strace to it beforehand.  No luck.  That node does have a different amount of memory, though, and the bug could just be intermittent.

Or, if this is ultimately a race, strace can sometimes change or serialize things enough to remove the problem.

With the thought of trying to strace on the original node, I canceled that job.  The processes below were left after that.  It's perhaps notable that three are running as a different user (two root, one rtkit).  The binaries themselves are not setuid, and I haven't tried to figure out how they get started.  Perhaps this matters.

I'll continue with the attempt to replicate on the original node under strace (and if that doesn't work, without strace).

I suppose that if strace blocks the bug, we could run the slurm daemons that way and limp along.  Hopefully we'll find something better.

Is there an easy way to limit, eliminate, or at least make deterministic Slurm's use of pthreads?


S root      3085     1  0  80   0  8204 65542 inet_c Apr20 ?        00:00:08   /cm/shared/apps/slurm/current/sbin/slurmd --conf-server hpc-hn1,hpc-hn2
S mcolema5  4881     1  0  80   0 80968 108528 ep_pol Apr20 ?       00:02:40   /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n140:1 (mcolema5) -auth /home/mcolema5/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/mcolema5/.vnc/x509_cert.pem -x509key /home/mcolema5/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0
S mcolema5  4908     1  0  80   0  1584  2430 do_wai Apr20 ?        00:00:00   bash /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/script.sh
S mcolema5  4982  4908  0  80   0 28400 132928 poll_s Apr20 ?       00:00:05     mate-session
S mcolema5  5047  4982  0  80   0 15320 364107 poll_s Apr20 ?       00:00:45       /usr/libexec/mate-settings-daemon
S mcolema5  5051  4982  0  80   0 16552 102908 poll_s Apr20 ?       00:00:05       marco
S mcolema5  5055  4982  0  80   0 16532 141292 poll_s Apr20 ?       00:00:08       mate-panel
S mcolema5  8419  5055  0  80   0 25936 147685 poll_s Apr20 ?       00:00:02         mate-terminal
S mcolema5  8428  8419  0  80   0   860  3698 unix_s Apr20 ?        00:00:00           gnome-pty-helper
S mcolema5  8429  8419  0  80   0  5296  4835 n_tty_ Apr20 pts/0    00:00:00           -bash
T mcolema5  7374  8429  0  80   0 97662500 24463434 do_sig Apr21 pts/0 00:00:34             python3
T mcolema5  7593  8429  0  80   0 9771892 2490778 do_sig Apr21 pts/0 00:00:07             python3
S mcolema5  5072  4982  0  80   0 31252 147109 poll_s Apr20 ?       00:04:15       caja
S mcolema5  5075  4982  0  80   0 16836 71489 poll_s Apr20 ?        00:00:00       /usr/bin/python /usr/share/system-config-printer/applet.py
S mcolema5  5076  4982  0  80   0 12480 96527 poll_s Apr20 ?        00:00:00       abrt-applet
S mcolema5  5081  4982  0  80   0  9520 99829 poll_s Apr20 ?        00:00:03       mate-screensaver
S mcolema5  5083  4982  0  80   0 19652 127696 poll_s Apr20 ?       00:00:00       nm-applet
S mcolema5  4939     1  0  80   0  1000 15802 poll_s Apr20 ?        00:00:00   dbus-launch --autolaunch=5abba43653164143a1d82e3b311c9406 --binary-syntax --close-stderr
S mcolema5  4940     1  0  80   0  1408 15543 ep_pol Apr20 ?        00:00:00   /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
S mcolema5  4944     1  0  80   0  2984 43813 poll_s Apr20 ?        00:00:00   /usr/libexec/dconf-service
S mcolema5  4985     1  0  80   0  1000 15803 poll_s Apr20 ?        00:00:00   dbus-launch --exit-with-session mate-session
S mcolema5  4986     1  0  80   0  1984 15673 ep_pol Apr20 ?        00:00:00   /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
S mcolema5  5004     1  0  80   0  4104 82793 poll_s Apr20 ?        00:00:00   /usr/libexec/at-spi-bus-launcher
S mcolema5  5017  5004  0  80   0  2284 15543 ep_pol Apr20 ?        00:00:00     /usr/bin/dbus-daemon --config-file=/usr/share/defaults/at-spi2/accessibility.conf --nofork --print-address 3
S mcolema5  5008     1  0  80   0  3940 66656 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfsd
S mcolema5  5139  5008  0  80   0  4824 84671 poll_s Apr20 ?        00:00:02     /usr/libexec/gvfsd-trash --spawner :1.2 /org/gtk/gvfs/exec_spaw/0
S mcolema5  6331  5008  0  80   0  4644 102675 poll_s Apr20 ?       00:00:00     /usr/libexec/gvfsd-network --spawner :1.2 /org/gtk/gvfs/exec_spaw/1
S mcolema5  6352  5008  0  80   0  4648 91345 poll_s Apr20 ?        00:00:02     /usr/libexec/gvfsd-dnssd --spawner :1.2 /org/gtk/gvfs/exec_spaw/3
S mcolema5  5013     1  0  80   0  3388 97320 futex_ Apr20 ?        00:00:00   /usr/libexec/gvfsd-fuse /home/mcolema5/.gvfs -f -o big_writes
S mcolema5  5026     1  0  80   0  4012 55229 poll_s Apr20 ?        00:00:00   /usr/libexec/at-spi2-registryd --use-gnome-session
S mcolema5  5040     1  0  80   0  2988 43813 poll_s Apr20 ?        00:00:00   /usr/libexec/dconf-service
S mcolema5  5064     1  0  80   0  4048 141615 poll_s Apr20 ?       00:00:01   /usr/bin/pulseaudio --start --log-target=syslog
S rtkit     5065     1  0  81   1  1812 49696 poll_s Apr20 ?        00:00:05   /usr/libexec/rtkit-daemon
S mcolema5  5087     1  0  80   0  5280 74468 poll_s Apr20 ?        00:00:23   /usr/libexec/gvfs-udisks2-volume-monitor
S root      5095     1  0  80   0  7196 114475 poll_s Apr20 ?       00:01:55   /usr/libexec/udisks2/udisksd
S mcolema5  5192     1  0  80   0  3184 67364 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfs-mtp-volume-monitor
S mcolema5  5198     1  0  80   0  3416 69637 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfs-gphoto2-volume-monitor
S mcolema5  5204     1  0  80   0  4404 87572 poll_s Apr20 ?        00:00:00   /usr/libexec/gvfs-afc-volume-monitor
S mcolema5  5210     1  0  80   0 14240 110089 poll_s Apr20 ?       00:00:00   /usr/libexec/mate-panel/wnck-applet
S mcolema5  5212     1  0  80   0 15964 123510 poll_s Apr20 ?       00:00:05   /usr/libexec/mate-panel/clock-applet
S mcolema5  5214     1  0  80   0  9100 88456 poll_s Apr20 ?        00:00:00   /usr/libexec/mate-panel/notification-area-applet
S mcolema5  5260     1  0  80   0  1116  4005 pipe_w Apr20 ?        00:00:00   /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  5263  5260  0  80   0   740  4005 do_wai Apr20 ?        00:00:00     /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  5264  5263  0  80   0   616  1103 hrtime Apr20 ?        00:00:11       tail -f --pid=4908 vnc.log
S mcolema5  5262     1  0  80   0 26560 70835 poll_s Apr20 ?        00:00:25   /packages/python/3.6.5/bin/python3.6 /packages/python/3.6.5/bin/websockify -D 40045 localhost:5901
S mcolema5  5414     1  0  80   0  5064 103195 poll_s Apr20 ?       00:00:00   /usr/libexec/xdg-desktop-portal
S mcolema5  5419     1  0  80   0  3012 112041 poll_s Apr20 ?       00:00:00   /usr/libexec/xdg-document-portal
S mcolema5  5424     1  0  80   0  2524 60648 poll_s Apr20 ?        00:00:00   /usr/libexec/xdg-permission-store
S mcolema5  5436     1  0  80   0  8940 94966 poll_s Apr20 ?        00:00:00   /usr/libexec/xdg-desktop-portal-gtk
S root      6969     1  0  80   0  5340 107595 poll_s Apr20 ?       00:00:03   /usr/libexec/upowerd
Comment 27 Mark Allen 2023-04-24 18:41:32 MDT
No luck with what seems like an essentially identical attempt to replicate it.  Comparing the slurmd logs, the bad and good runs are identical up to the point of the oom-kill message.

I did reboot the node right before.  So perhaps some bitrot, in general or in Slurm, was cleaned up by that.

Or, maybe this is just intermittent.


I guess we could try running the slurm processes under valgrind, as suggested here: https://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads#comment52831476_982031

I've used it to good effect in the past, though never for pthread issues.  Do you test the slurm daemons this way?


Given the realities of our situation, I'm also considering an out-of-band script to just whack processes that Slurm should have killed.  For example, user processes running on a node where the user has no running job.  Or memory usage exceeding the total memory requested on the node.  Etc.
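
The core decision such a script has to make is small.  A sketch of just that check (gathering the "users with jobs here" set, e.g. from squeue output, and the process list from ps is left to the wrapper, and the system-user whitelist shown is only an example):

```python
def orphan_pids(procs, users_with_jobs, system_users=frozenset({"root", "rtkit"})):
    """procs: iterable of (pid, user) pairs, e.g. parsed from `ps -eo pid,user`.
    users_with_jobs: users that the scheduler says have a job on this node.
    Returns pids that Slurm should have cleaned up: owned by a non-system
    user who has no job here."""
    return [pid for pid, user in procs
            if user not in system_users and user not in users_with_jobs]

# Example against the ps output above, assuming the job is gone:
# orphan_pids([(4982, "mcolema5"), (5095, "root")], users_with_jobs=set())
# -> [4982]
```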

Any ideas on that?  Has anyone already written it?

Thanks!  --Mike
Comment 28 Felip Moll 2023-04-25 06:39:04 MDT
(In reply to Mark Allen from comment #27)
> No luck with what seems like a fairly identical attempt to replicate that. 
> Looking at the slurmd logs, the bad and good are identical up to the point
> of the oom-kill message.
> 
> I did reboot the node right before.  So, perhaps some bitrot, generally or
> in Slurm, was cleanup up by that.
> 
> Or, maybe this is just intermittent.
> 
> 
> I guess we could try running the slurm processes under valgrind, as
> suggested here:
> https://stackoverflow.com/questions/981011/c-programming-debugging-with-
> pthreads#comment52831476_982031
> 
> I've used it to good effect in the past, though never for pthread issues. 
> Do you test the slurm daemons this way?
> 
> 
> Given the realities of our situation, I'm also considering an out-of-band
> script to just whack processes that Slurm should have killed.  For example,
> user processes running on a node that the user has no job running on.  Or,
> memory usage exceeding total memory request for the node.  Etc.
> 
> Any ideas on that?  Has anyone already written it?
> 
> Thanks!  --Mike

Well, you've done an interesting analysis so far.

I think running slurmd under valgrind or strace is a really bad idea: the performance hit will be enormous and it can have other nasty side effects. That said, we do test the daemons under valgrind, for example:

valgrind --leak-check=full --show-leak-kinds=definite slurmd -Dv

Some comments:

- You're right that a process being reparented to PID 1 doesn't necessarily mean that slurmstepd is gone. If it was started by some launcher it may be outside Slurm's control.

- Question 1: I strongly suspect you're suffering from this. Starting an entire X11 session uses dbus to talk to systemd and start a new session and processes, and it is probable that some of those processes escape the Slurm cgroup. Could you run one of these tests and check the cgroup of *all the spawned processes* (i.e. in your example, 'cat /proc/<pid>/cgroup' for every process in the 'ps' output)?
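
For convenience, the whole scan can be done in one shot.  A sketch (not part of Slurm; it walks /proc directly instead of parsing ps, and optionally filters by UID):

```python
import os

def cgroups_by_pid(uid=None):
    """Map pid -> contents of /proc/<pid>/cgroup, optionally restricted to
    processes owned by `uid`.  Processes that exit mid-scan are skipped."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            if uid is not None and os.stat("/proc/" + entry).st_uid != uid:
                continue
            with open("/proc/%s/cgroup" % entry) as f:
                result[int(entry)] = f.read()
        except OSError:
            continue  # process vanished or is unreadable
    return result

# Any pid whose memory controller line does not sit under
# /slurm/uid_<uid>/job_<jobid>/... has escaped the step's cgroup.
```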

- dbus is a daemon that allows client-server communication between systemd and user applications. When you start a graphical session, most of the applications, even the desktop widgets, tend to communicate over dbus. They can ask systemd to start new sessions (XDG_* variables), and systemd can decide to move processes from one cgroup to another. I am concerned this is what is affecting you.

These processes look good; all of them are under the slurmstepd.

S root      4801     1  0  80   0  4436 108222 do_wai Apr20 ?       00:00:06   slurmstepd: [25852232.extern]
S root      4806  4801  0  80   0   348 27014 hrtime Apr20 ?        00:00:00     sleep 100000000
S root      4808     1  1  80   0  4640 107706 do_wai Apr20 ?       00:17:47   slurmstepd: [25852232.batch]
S mcolema5  4813  4808  0  80   0  1780  4005 do_wai Apr20 ?        00:00:00     /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  4908  4813  0  80   0  1584  2430 do_wai Apr20 ?        00:00:00       bash /home/mcolema5/ondemand/data/sys/dashboard/batch_connect/sys/bc_talapas_desktop/output/0ba6932e-de31-44ce-b4d7-3b2e75cd005a/script.sh
S mcolema5  4982  4908  0  80   0 28412 132928 poll_s Apr20 ?       00:00:02         mate-session
S mcolema5  5047  4982  0  80   0 15336 364107 poll_s Apr20 ?       00:00:13           /usr/libexec/mate-settings-daemon
S mcolema5  5051  4982  0  80   0 16560 102908 poll_s Apr20 ?       00:00:03           marco
S mcolema5  5055  4982  0  80   0 16520 141292 poll_s Apr20 ?       00:00:02           mate-panel
S mcolema5  8419  5055  0  80   0 25936 147685 poll_s Apr20 ?       00:00:01             mate-terminal
S mcolema5  8428  8419  0  80   0   860  3698 unix_s Apr20 ?        00:00:00               gnome-pty-helper
S mcolema5  8429  8419  0  80   0  5292  4835 do_wai Apr20 pts/0    00:00:00               -bash
T mcolema5  7374  8429  4  80   0 97662500 24463434 do_sig 17:40 pts/0 00:00:34                 python3
S mcolema5  7593  8429  1  80   0 9771892 2490778 poll_s 17:41 pts/0 00:00:07                 python3
S mcolema5  5072  4982  0  80   0 31248 147109 poll_s Apr20 ?       00:00:55           caja
S mcolema5  5075  4982  0  80   0 16836 71489 poll_s Apr20 ?        00:00:00           /usr/bin/python /usr/share/system-config-printer/applet.py
S mcolema5  5076  4982  0  80   0 12480 96527 poll_s Apr20 ?        00:00:00           abrt-applet
S mcolema5  5081  4982  0  80   0  9440 99829 poll_s Apr20 ?        00:00:00           mate-screensaver
S mcolema5  5083  4982  0  80   0 19656 127696 poll_s Apr20 ?       00:00:00           nm-applet
S mcolema5  5260  4813  0  80   0  1116  4005 pipe_w Apr20 ?        00:00:00       /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  5263  5260  0  80   0   740  4005 do_wai Apr20 ?        00:00:00         /bin/bash /cm/local/apps/slurm/var/spool/job25852232/slurm_script
S mcolema5  5264  5263  0  80   0   616  1103 hrtime Apr20 ?        00:00:03           tail -f --pid=4908 vnc.log

But this one is under ppid 1, which suggests it has been started outside Slurm?:

S mcolema5  4881     1  0  80   0 80968 108528 ep_pol Apr20 ?       00:02:00   /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n140:1 (mcolema5) -auth /home/mcolema5/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/mcolema5/.vnc/x509_cert.pem -x509key /home/mcolema5/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0

--

I am really uncertain why these cgroups disappear or cannot be opened... for example, pid 4806 is 'sleep 100000000', and its ppid 4801 is the slurmstepd extern, but it is still alive...

Question 2: when you see these messages, can you check where the slurmstepd or sleep process (or any other which is still alive and throwing this error) resides? I.e., in this case you would have run 'cat /proc/4806/cgroup' and 'cat /proc/4801/cgroup'.

[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0/memory.stat' for reading : No such file or directory
[2023-04-21T17:52:13.367] [25852232.extern] cgroup/v1: common_cgroup_get_param: CGROUP: unable to get parameter 'memory.stat' for '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_extern/task_0'
[2023-04-21T17:52:13.367] [25852232.extern] debug2: jobacct_gather/cgroup: _prec_extra: failed to collect cgroup memory stats pid 4806 ppid 4801
[2023-04-21T17:52:13.506] [25852232.batch] debug2: profile signaling type Task
[2023-04-21T17:52:13.818] [25852232.batch] cgroup/v1: common_file_read_content: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_1345/job_25852232/step_batch/task_0/memory.stat' for reading : No such file or directory

I think what we want to know now are the answers to Question 1 and Question 2.
Comment 29 Mark Allen 2023-04-25 17:24:23 MDT
Replying to your comments.

Yes, 'valgrind' is very slow and would only be for testing.  Note, though, that valgrind has a number of distinct "tools".  The command you mention is implicitly defaulting to the 'memcheck' tool.  It's good that you're doing this.  

But note also that the other tools--'helgrind' and 'drd'--mentioned in my prior link attempt to spot thread races.  Given the intermittent nature of this and the fact that pthreads are involved here, that makes a race seem possible.  (Also, still wondering about that possible pthread EINVAL status.)

The 'strace' alternative is a bit more general, in that its slowdown would likely be much smaller.  (It only slows system calls.)  For correctness, though, don't sell it short.  There is some room for side-effects, but not much.  IIRC, the default fake ptrace reparenting is somewhat visible to parents watching the child with options like WSTOPPED.  That's pretty rare, though.  The new --daemonize* flags might do even better, though I haven't looked closely.

(As a detail, altering the timing of broken code might expose the breakage, even if it normally "works".)

In any case, in addition to debugging the issue, I'm reaching for any easy way to just mitigate the issue, for us.  Obviously, a true fix would be better.

Regarding the reparenting, I see what you're saying about "launchers".  But in addition, it's possible to have orphans without that.  This generates one (and is in the output I'm about to upload).

    { nohup /bin/sleep 12345 & } &

    from https://unix.stackexchange.com/questions/267001/how-to-create-an-orphan-process-without-terminating-the-parent-shell

No idea whether that's involved here, but it's a separate case to consider.
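A runnable sketch of that trick (the 30-second sleep is just for the demo), showing that the orphan is reparented away from the shell yet stays in the shell's cgroup:

```shell
#!/bin/sh
# Demo: create an orphan via the double-background trick, then show that
# it has been reparented (ppid is no longer us) but still sits in our cgroup.
{ nohup sleep 30 >/dev/null 2>&1 & } &
wait                 # let the intermediate subshell exit
sleep 1              # give the kernel time to reparent the orphan
orphan=$(pgrep -n -x sleep)          # newest 'sleep' on the system
ps -o pid,ppid,comm -p "$orphan"
if diff -q "/proc/$$/cgroup" "/proc/$orphan/cgroup" >/dev/null; then
    echo "orphan shares our cgroup"
fi
kill "$orphan" 2>/dev/null
```

So, as you note below, such an orphan escapes the process tree but not the cgroup, which is why cgroup-based cleanup still catches it.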

I'm about to upload some cgroup info for the session I have, which is one that did not exhibit failure.  If I see another failure, I'll try to gather the same for that.

The generating commands were

    ps -eflyH > ps-eflyH.txt

    egrep -e . /proc/*/cgroup > cgroup.txt

    ps -eH -O cgroup:1000 > ps-cgroup-1000.txt

Kind of a mess, but hopefully useful info is in there.
Comment 30 Mark Allen 2023-04-25 17:27:26 MDT
Created attachment 30009 [details]
cgroup info for working OOD desktop
Comment 31 Mark Allen 2023-04-25 17:28:19 MDT
Created attachment 30010 [details]
cgroup info for working OOD desktop (cgroup contents)
Comment 32 Mark Allen 2023-04-25 17:29:00 MDT
Created attachment 30011 [details]
cgroup info for working OOD desktop (cgroups with ps)
Comment 33 Felip Moll 2023-04-26 04:13:55 MDT
(In reply to Mark Allen from comment #29)
> Replying to your comments.
> 
> Yes, 'valgrind' is very slow and would only be for testing.  Note, though,
> that valgrind has a number of distinct "tools".  The command you mention is
> implicitly defaulting to the 'memcheck' tool.  It's good that you're doing
> this.  

Yep, indeed I showed how we check for memory.
You can try the others to look for thread races, just don't run them routinely in production.

> In any case, in addition to debugging the issue, I'm reaching for any easy
> way to just mitigate the issue, for us.  Obviously, a true fix would be
> better.

It's hard to find a mitigation if we don't know the cause :)

> Regarding the reparenting, I see what you're saying about "launchers".  But
> in addition, it's possible to have orphans without that.  This generates one
> (and is in the output I'm about to upload).
> 
>     { nohup /bin/sleep 12345 & } &

Yes, you're right. I am not concerned, though: you can see from your outputs that this sleep process is in the correct cgroup, so when the job is done it will be killed.
So we want to check the cgroup. The commands you used are useful, especially the first two:

> The generating commands were
> 
>     ps -eflyH > ps-eflyH.txt
> 
>     egrep -e . /proc/*/cgroup > cgroup.txt

Let's capture the slurmd logs, dmesg, and this ps when the next occurrence happens.
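To make that easy in the heat of the moment, a one-shot wrapper along these lines could be kept ready on the nodes (a sketch; the slurmd log path is an assumption, adjust it for your site):

```shell
#!/bin/sh
# Capture the diagnostics in one go when a job next gets stuck COMPLETING.
outdir=$(mktemp -d /tmp/slurm-debug-XXXXXX)
ps -eflyH                     > "$outdir/ps-eflyH.txt"
grep -e . /proc/[0-9]*/cgroup > "$outdir/cgroup.txt" 2>/dev/null
ps -eH -O cgroup:1000         > "$outdir/ps-cgroup-1000.txt"
dmesg -T                      > "$outdir/dmesg.txt" 2>/dev/null || true
# slurmd log location is site-specific; this path is only a guess
cp /var/log/slurmd.log "$outdir/" 2>/dev/null || true
echo "diagnostics saved in $outdir"
```

That way the ps, cgroup, and dmesg snapshots are all taken from the same moment, which matters for an intermittent problem like this.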

I added a bug to "See Also" with similar problems; they were using Bright, and /proc/mounts showed a joined controller in cgroups:
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0

It seems it is not your case, but it is something that Bright does.
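For anyone checking this later, a quick way to spot such joined v1 hierarchies (a sketch; it filters out ordinary mount options so only controller names remain):

```shell
# List each mounted cgroup v1 hierarchy with its attached controllers.
# A line showing several controllers means they share one hierarchy,
# as in the Bright example above. Prints nothing on pure cgroup v2.
awk '$3 == "cgroup" {
    n = split($4, opt, ",")
    ctrls = ""
    for (i = 1; i <= n; i++)
        if (opt[i] !~ /^(rw|ro|nosuid|nodev|noexec|relatime|seclabel|xattr|name=.*)$/)
            ctrls = ctrls (ctrls ? "," : "") opt[i]
    print $2, "->", ctrls
}' /proc/mounts
```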

Thanks
Comment 34 Mark Allen 2023-04-27 11:50:44 MDT
(In reply to Felip Moll from comment #33)
> (In reply to Mark Allen from comment #29)
> > Replying to your comments.
> > 
> > Yes, 'valgrind' is very slow and would only be for testing.  Note, though,
> > that valgrind has a number of distinct "tools".  The command you mention is
> > implicitly defaulting to the 'memcheck' tool.  It's good that you're doing
> > this.  
> 
> Yep, indeed I showed how we check for memory.
> You can try with the others to find thread races, but just not run it
> normally in production.

I briefly tried all three, mostly out of curiosity.  I'm reminded that memcheck will often give some false positives (e.g., due to games libc plays).  The other two did both flag some errors that seemed possibly real.  I'm not knowledgeable enough about slurm nor pthreads to go further, but it might be worth a look.

> It's hard to find a mitigation if we don't know the cause :)

For sure.  Just mentioning it as our context.  Obviously it's better for the larger userbase to find a true fix, especially if this isn't peculiar to v1.

> Let's capture the slurmd logs, dmesg, and this ps when the next occurrence
> happens.

Will do.

> It seems it is not your case but is something that bright does.

As a detail, we're actually using vanilla Slurm, which we compiled ourselves, rather than the Bright version.  It's certainly possible that there's Bright-specific stuff affecting this somehow.

Thanks!
Comment 35 Felip Moll 2023-05-05 09:03:47 MDT
> > Let's capture the slurmd logs, dmesg, and this ps when the next occurrence
> > happens.
> 
> Will do.

Hi Mark,

Have you been able to capture more data from a new real occurrence?

Thanks!
Comment 36 Mark Allen 2023-05-05 18:21:05 MDT
Hi,

We appreciate your forbearance on this.  We are so time-short, and relatively close to our RHEL 8 cutover, that we're going to pause on this, at least for now.  With any luck, either cgroups v2 or other changes will somehow make this go away.  Since you're not hearing about similar issues from others, it seems likely that there's something unusual about our config or environment that's exposing the bug, if there is one.  (It could be a kernel bug, or something else.)

Feel free to pause or close this report.  If we're still experiencing it on RHEL 8, we'll reopen and reference this bug.

Here are a few brain-dump comments, in case they might help, or help restart the ticket later.

Didn't try very hard, but the 'valgrind' race checks seemed to show some tantalizing output.  It might all be false positives, but still, worth checking out, as that sort of thing could produce rather nasty intermittent bugs.  I will upload some sample traces.  (They're rather flawed, but maybe useful.)

Also, while generating those, I saw some startup failures that seemed to be caused by timeouts.  Not sure that's right, but given the short timeframe, it suggested that some parts of SLURM might have rather short (default) timeouts.  Like under six seconds.  If that's true, it seems like a recipe for instability.  For myself, I'd choose one or two minutes as an absolute minimum timeout, unless there's a crucial reason.  (We'll look at this in our config in the future.)
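For the record, the two defaults I'd check first in slurm.conf are MessageTimeout (10 seconds) and TCPTimeout (2 seconds); raising them would look something like this (illustrative values only; see the slurm.conf man page before changing anything):

```
# slurm.conf fragment (a sketch, not a recommendation for any site)
MessageTimeout=60    # RPC timeout between daemons; default is 10 seconds
TCPTimeout=10        # TCP connection timeout; default is 2 seconds
```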

Regarding valgrind and strace being slow, it's true, but in our environment, the load of slurmd and slurmstepd seems to be extremely low.  Perhaps for loads where jobs(steps) are started at very high rates it would be different.

Anyway, we appreciate your help, and will update you later.

Thanks!

Mike
Comment 37 Mark Allen 2023-05-05 18:22:07 MDT
Created attachment 30160 [details]
valgrind drd/helgrind traces (poor, but maybe interesting)
Comment 38 Felip Moll 2023-05-06 01:34:21 MDT
Thanks Mark for your detailed analysis.

Next week I am on PTO, but I will keep this bug open to check the outputs you sent. Meanwhile, I will wait for any feedback you may have during the next two weeks, whether because you upgrade or because you catch a new case.

I understand you're short on time so don't worry. If after two weeks we have no news I will close the bug.

Thanks :)
Comment 39 Felip Moll 2023-05-26 06:45:56 MDT
Mark, I checked your files but I don't find anything concerning or related to your issue.

Most of the messages refer to forked threads, but they don't mean a real race exists.

Have you already upgraded or caught a new situation and gathered some info (comment 33)?
Comment 40 Mark Allen 2023-05-26 13:09:32 MDT
Thanks for looking.

At present, we're working hard on getting to RHEL 8, and thus cgroups v2.  This might take a while.  Once we're there, we'll be watching for further occurrences.  (With any luck, there won't be any.)

Thanks again!
Comment 41 Felip Moll 2023-06-05 09:40:00 MDT
Let's do one thing then: let's mark this bug as timed out, and when you have more feedback or new catches, just reopen it and we will continue!

Thanks :)