Ticket 14179 - slurmstepd is using too much cpu resources after upgrade to patched 21.08.7
Summary: slurmstepd is using too much cpu resources after upgrade to patched 21.08.7
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 21.08.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Oriol Vilarrubi
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-05-26 08:46 MDT by Bruno Mundim
Modified: 2024-03-01 10:18 MST
2 users

See Also:
Site: SciNet
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name: Niagara
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf and cgroup.conf (2.77 KB, application/x-compressed-tar)
2022-05-26 14:22 MDT, Bruno Mundim
Details
gdb backtrace and strace of slurmstepd (24.96 KB, application/x-compressed-tar)
2022-05-27 10:01 MDT, Bruno Mundim
Details
slurmd log, strace, gdb backtrace and other relevant info (7.99 MB, application/x-compressed-tar)
2022-06-01 15:02 MDT, Bruno Mundim
Details
Comparisons between sleep with and without host, and with a python script (1.67 MB, application/x-compressed-tar)
2022-06-28 15:38 MDT, Bruno Mundim
Details
perf data collected and attached (172.89 MB, application/x-compressed-tar)
2022-08-23 09:46 MDT, Bruno Mundim
Details
perf data for slurmstepd for a gnu-parallel job (194.02 MB, application/x-compressed-tar)
2022-08-23 15:04 MDT, Bruno Mundim
Details
pstree, lsof etc for slurmd on both nodes (121.92 MB, application/x-compressed-tar)
2022-08-24 08:54 MDT, Bruno Mundim
Details
Data collected for a running case (221.41 MB, application/x-compressed-tar)
2022-08-24 14:12 MDT, Bruno Mundim
Details

Description Bruno Mundim 2022-05-26 08:46:37 MDT
Hi,

we recently upgraded Slurm from 20.02.7 to 21.08.7 with the CVE-2022-29500_29501_29502-21.08-v2.patch patch provided by Tim. Since then we have noticed that slurmstepd sometimes uses too much CPU. For example:

[root@nia0015 ~]# ps -auxwwf | grep -E "slurmstepd|RSS"
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      53520  0.0  0.0 112812   984 pts/0    S+   10:22   0:00          \_ grep --color=auto -E slurmstepd|RSS
root     234727 75.0  0.0 1896692 18432 ?       Sl   08:15  95:13 slurmstepd: [7454418.extern]
root     234893  0.0  0.0 272732  3756 ?        Sl   08:15   0:01 slurmstepd: [7454418.batch]

slurmd.log shows that some connection to slurmstepd is timing out:

[2022-05-26T08:15:13.843] [7454418.extern] task/cgroup: _memcg_initialize: job: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:15:13.844] [7454418.extern] task/cgroup: _memcg_initialize: step: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:15:15.562] Launching batch job 7454418 for UID 3054483
[2022-05-26T08:15:15.591] [7454418.batch] task/cgroup: _memcg_initialize: job: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:15:15.591] [7454418.batch] task/cgroup: _memcg_initialize: step: alloc=175000MB mem.limit=175000MB memsw.limit=unlimited
[2022-05-26T08:45:34.822] error: slurm_msg_sendto: address:port=10.20.1.15:33358 msg_type=5020: No error
[2022-05-26T08:56:47.589] error: Timeout waiting for slurmstepd
[2022-05-26T08:56:47.589] error: gathering job accounting: 0
[2022-05-26T09:06:53.607] error: Timeout waiting for slurmstepd
[2022-05-26T09:06:53.607] error: gathering job accounting: 0
[2022-05-26T09:16:15.661] error: Timeout waiting for slurmstepd
[2022-05-26T09:16:15.662] error: gathering job accounting: 0
[2022-05-26T09:27:07.624] error: Timeout waiting for slurmstepd
[2022-05-26T09:27:07.624] error: gathering job accounting: 0
[2022-05-26T09:36:11.682] error: Timeout waiting for slurmstepd
[2022-05-26T09:36:11.682] error: gathering job accounting: 0
[2022-05-26T09:38:41.675] error: /var/spool/slurm/scripts/nodecheck2: exited with status 0x0100
[2022-05-26T09:43:40.110] error: /var/spool/slurm/scripts/nodecheck2: exited with status 0x0100

and our nodecheck script picks up on slurmstepd's excessive CPU usage and drains the node for investigation. Nothing else points to a problem with this job from the slurmctld point of view:

[2022-05-26T08:15:12.424] _slurm_rpc_submit_batch_job: JobId=7454418 InitPrio=2630799 usec=11246
[2022-05-26T08:15:13.413] sched/backfill: _start_job: Started JobId=7454418 in compute on nia0015

This particular job asks for 1 node and 40 tasks per node and uses gnu-parallel to handle the several serial jobs launched.

Do you have any idea why slurmstepd is consuming so much CPU?

Thanks,
Bruno.
Comment 1 Jason Booth 2022-05-26 14:07:30 MDT
Bruno - I will have Carlos take a look at this and get back to you. In the interim please attach your current slurm.conf, and cgroup.conf.
Comment 2 Bruno Mundim 2022-05-26 14:22:38 MDT
Created attachment 25248 [details]
slurm.conf and cgroup.conf

I have attached a tarball with slurm.conf and cgroup.conf as requested. The case I reported this morning resolved itself once the job was done. That seems to be an exception, since in the previous cases we had to intervene. I have another case now and will follow it closely.
Comment 4 Carlos Tripiana Montes 2022-05-27 06:01:57 MDT
Bruno,

Is it possible to execute:

  gdb -batch -p [PID] -ex "thread apply all bt full"

for the runaway slurmstepd?

Additionally, could you please send us the output for:

  strace -p [PID]

for a few seconds, to see if we can spot any pattern there?

Let's see if we can narrow down the code area where the stepd is misbehaving.

I see "error: Timeout waiting for slurmstepd", and that's mostly the slurmd end complaining that the extern stepd is not answering some message. Possibly the stepd is stuck forever in some code section.

Thanks,
Carlos.
Comment 5 Carlos Tripiana Montes 2022-05-27 07:47:40 MDT
Looking into the info we already have:

[2022-05-26T08:45:34.822] error: slurm_msg_sendto: address:port=10.20.1.15:33358 msg_type=5020: No error
[2022-05-26T08:56:47.589] error: Timeout waiting for slurmstepd
[2022-05-26T08:56:47.589] error: gathering job accounting: 0

msg_type=5020 is RESPONSE_JOB_STEP_STAT. RPC only created in _rpc_stat_jobacct.

_rpc_stat_jobacct calls stepd_stat_jobacct. This function is the one which logs error("gathering job accounting: %d", rc);

stepd_stat_jobacct calls wait_fd_readable. This function is the one which logs error("Timeout waiting for slurmstepd");

This is the call tree for the error messages; it arises when slurmd receives a REQUEST_JOB_STEP_STAT RPC request:

wait_fd_readable(int, int) : int - /slurm-22.05_14152/src/common/fd.c
	stepd_stat_jobacct(int, uint16_t, slurm_step_id_t *, job_step_stat_t *) : int - /slurm-22.05_14152/src/common/stepd_api.c
		_rpc_stat_jobacct(slurm_msg_t *) : void - /slurm-22.05_14152/src/slurmd/slurmd/req.c
			slurmd_req(slurm_msg_t *) : void - /slurm-22.05_14152/src/slurmd/slurmd/req.c

The REQUEST_JOB_STEP_STAT RPC is issued from only two call sites:

slurm_job_step_stat(slurm_step_id_t *, char *, uint16_t, job_step_stat_response_msg_t * *) : int - /slurm-22.05_14152/src/api/job_step_info.c
	_do_stat(slurm_step_id_t *, char *, uint32_t, uint32_t, uint32_t, uint16_t) : int - /slurm-22.05_14152/src/sstat/sstat.c
		main(int, char * *) : int - /slurm-22.05_14152/src/sstat/sstat.c
	XS(int) : int - /slurm-22.05_14152/contribs/perlapi/libslurm/perl/Slurm.c

Please report back whether you use the Perl API or not. I'm fairly sure the issue can be reproduced by issuing an sstat command, but I need to know whether it was you or the user who did that, and how it was issued (the exact sstat command).
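For reference, the code path above can be exercised directly from the command line with sstat; a hypothetical invocation against the job in this ticket (the format fields here are illustrative, not what the user necessarily ran) would look like:

```
# sstat issues REQUEST_JOB_STEP_STAT for the given step; slurmd then
# forwards REQUEST_STEP_STAT to the stepd via the wait_fd_readable()
# path traced above.
sstat --jobs=7454418.extern --format=JobID,AveCPU,MaxRSS,NTasks
```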
Comment 6 Carlos Tripiana Montes 2022-05-27 08:20:30 MDT
Continuation.

What stepd_stat_jobacct triggers in the slurmstepd is a REQUEST_STEP_STAT request.

On the stepd side, this is handled in _handle_stat_jobacct function.

As the rogue stepd belongs to the extern step, we can point to a potential issue in this block:

	if (job->step_id.step_id == SLURM_EXTERN_CONT) {
		pid_t *pids = NULL;
		int npids = 0;

		/*
		 * We only have one task in the extern step on each node,
		 * despite many pids may have been adopted.
		 */
		num_tasks = 1;
		proctrack_g_get_pids(job->cont_id, &pids, &npids);

		for (i = 0; i < npids; i++) {
			temp_jobacct = jobacct_gather_stat_task(pids[i]);
			if (temp_jobacct) {
				jobacctinfo_aggregate(jobacct, temp_jobacct);
				jobacctinfo_destroy(temp_jobacct);
			}
			log_flag(JAG, "%s: step_extern cont_id=%"PRIu64" includes pid=%"PRIu64,
				 __func__, job->cont_id, (uint64_t) pids[i]);
		}

		xfree(pids);
	} else {

I can't see any way for the "for" loop to become unbounded, so the problem is deeper. proctrack_g_get_pids follows this call tree because you have ProctrackType=proctrack/cgroup in your slurm.conf:

cgroup_g_step_get_pids(pid_t * *, int *) : int - /slurm-22.05_14152/src/common/cgroup.c
	proctrack_p_get_pids(uint64_t, pid_t * *, int *) : int - /slurm-22.05_14152/src/plugins/proctrack/cgroup/proctrack_cgroup.c

This in turn calls cgroup_p_step_get_pids from either cgroup_v1.c or cgroup_v2.c, so I need to know whether your system is running Slurm under cgroup v1 or v2 to continue the investigation.
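One quick way to answer this is to check the filesystem type mounted at the standard cgroup mount point; a minimal sketch (assuming the usual /sys/fs/cgroup mount):

```shell
#!/bin/sh
# Infer the cgroup version from the filesystem mounted at /sys/fs/cgroup:
# cgroup2fs means a pure v2 (unified) hierarchy; tmpfs means v1 (or hybrid).
fstype=$(stat -fc %T /sys/fs/cgroup)
case "$fstype" in
    cgroup2fs) echo "cgroup v2 (unified hierarchy)" ;;
    tmpfs)     echo "cgroup v1 (or hybrid)" ;;
    *)         echo "unknown filesystem type: $fstype" ;;
esac
```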

If possible, I need you to send us a reproduction of this error with debug3 and DebugFlags=Cgroup,JobAccountGather set. If you can't, please let us know.

Regards,
Carlos.
Comment 8 Bruno Mundim 2022-05-27 10:01:03 MDT
Created attachment 25255 [details]
gdb backtrace and strace of slurmstepd

I have attached most of the information you asked for so far. The output of strace was just a wait call, doing nothing. I waited a few seconds before cancelling; let me know if I should wait longer.
Comment 9 Bruno Mundim 2022-05-27 10:06:40 MDT
(In reply to Carlos Tripiana Montes from comment #6)
> Continuation.
> 
> What stepd_stat_jobacct is triggering in the slurmstepd is request
> REQUEST_STEP_STAT.
> 
> On the stepd side, this is handled in _handle_stat_jobacct function.
> 
> As the rogue stepd belongs to extern step, we can point potential issue in
> this block:
> 
> 	if (job->step_id.step_id == SLURM_EXTERN_CONT) {
> 		pid_t *pids = NULL;
> 		int npids = 0;
> 
> 		/*
> 		 * We only have one task in the extern step on each node,
> 		 * despite many pids may have been adopted.
> 		 */
> 		num_tasks = 1;
> 		proctrack_g_get_pids(job->cont_id, &pids, &npids);
> 
> 		for (i = 0; i < npids; i++) {
> 			temp_jobacct = jobacct_gather_stat_task(pids[i]);
> 			if (temp_jobacct) {
> 				jobacctinfo_aggregate(jobacct, temp_jobacct);
> 				jobacctinfo_destroy(temp_jobacct);
> 			}
> 			log_flag(JAG, "%s: step_extern cont_id=%"PRIu64" includes pid=%"PRIu64,
> 				 __func__, job->cont_id, (uint64_t) pids[i]);
> 		}
> 
> 		xfree(pids);
> 	} else {
> 
> I can't see any way to make the "for" statement unbound, so it's something
> deeper. proctrack_g_get_pids calls this tree because you have
> ProctrackType=proctrack/cgroup in your slurm.conf:
> 
> cgroup_g_step_get_pids(pid_t * *, int *) : int -
> /slurm-22.05_14152/src/common/cgroup.c
> 	proctrack_p_get_pids(uint64_t, pid_t * *, int *) : int -
> /slurm-22.05_14152/src/plugins/proctrack/cgroup/proctrack_cgroup.c
> 

Just to remind you that I am using slurm 21.08.7 patched to address the security CVE. I see you seem to be inspecting source code from slurm-22.05_14152.

> Which calls cgroup_p_step_get_pids either from cgroup_v1/2.c. So I need to
> know if your system is running Slurm under Cgroup v1 or v2 to properly
> continue my investigation.
> 

I believe it is cgroup v1, the default on CentOS 7.9. Please take a look at the README file in the info I sent you a few minutes ago.

> I think, I need you to send us *if possible* a reproduction of this error
> having debug3 and DebugFlags=Cgroup,JobAccountGather set. If you can't
> please let us know.
> 

It depends. Can I enable these flags on one node only? Otherwise I will
try to reproduce the problem on our testing system and get back to you later.

Thanks,
Bruno.


> Regards,
> Carlos.
Comment 10 Carlos Tripiana Montes 2022-05-27 10:23:37 MDT
> Just to remind you that I am using slurm 21.08.7 patched to address the
> security CVE. I see you seem to be inspecting source code from
> slurm-22.05_14152.

I'm targeting 22.05 to see whether, once we are able to spot the issue, it is already fixed there. This is a fairly common situation. So I'm following your issue down into 22.05 and checking whether git blame shows lines with commits whose descriptions point to a fix for a similar issue.

> It depends. Can I use these flags localized on one node only? Otherwise I
> will
> try to reproduce the problem on our testing system and will get back to you
> later.

If you know how to hit the issue, setting these debug flags in just 1 compute node is enough.
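One way to do that, sketched here under the assumption that restarting a single slurmd is acceptable (option names as in the slurm.conf man page), is to set the level and flags in slurm.conf and then restart slurmd only on the node under test, so only that daemon picks up the new settings until the others are restarted or reconfigured:

```
# slurm.conf: raise slurmd verbosity and enable the requested debug flags
SlurmdDebug=debug3
DebugFlags=Cgroup,JobAccountGather
```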

I'm looking into the GDB trace and it seems like a deadlock on the pthread mutex "task_list_lock". But I'm not 100% sure yet, and I haven't seen any commit fixing such an issue, nor any known deadlock involving this mutex.

So, to summarise, debug logs would be much appreciated to speed up the investigation.

Thanks,
Carlos.
Comment 11 Carlos Tripiana Montes 2022-05-27 10:46:02 MDT
The usage of the task_list_lock mutex is fairly simple, and I can't see any new commits going around the lock.

But your GDB backtrace indeed suggests the threads are stuck trying to lock that mutex, so some other thread must have left it locked.

The question now is that I can't see how this could happen. My guess is that it's not a bug about failing to unlock the mutex, but a thread having died without releasing it.

All the threads in the GDB trace that are in functions using this lock are blocked on it. I haven't been able to spot one that uses the lock and is not blocked. This suggests a dead thread which, of course, is not present in the GDB trace.

All this reinforces the need for a reproducer with debug logs, so I can see how the story evolves in the logs and whether there's an error that points to a dead thread as well.

I'm still looking into this, so I may yet discover something new, but up to now this is what I know.

Cheers,
Carlos.
Comment 12 Bruno Mundim 2022-05-27 12:40:46 MDT
(In reply to Carlos Tripiana Montes from comment #10)
> > Just to remind you that I am using slurm 21.08.7 patched to address the
> > security CVE. I see you seem to be inspecting source code from
> > slurm-22.05_14152.
> 
> I'm targeting 22.05 to see if, when we'll able to spot the issue, it's
> already fixed. This is fairly common situation. So I'm following your issue
> down to 22.05 and see if git blame show lines with commits that their
> description point to a fix of a similar issue.
> 

Ok, That sounds good!

> > It depends. Can I use these flags localized on one node only? Otherwise I
> > will
> > try to reproduce the problem on our testing system and will get back to you
> > later.
> 
> If you know how to hit the issue, setting these debug flags in just 1
> compute node is enough.
> 

I tried on our test system with a submission script similar to the user's,
but I couldn't reproduce the bug yet. The script I used is very simple though:

###
#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=1
#SBATCH --account=scinet
#SBATCH --ntasks-per-node=40
#SBATCH --time=01:00:00

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1

module load NiaEnv/2019b gnu-parallel

HOSTS=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)
NCORES=40
parallel --env OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH --joblog slurm-$SLURM_JOBID.log -j $NCORES -S $HOSTS --wd $PWD "echo {}; sleep 600" ::: {001..120}

###

So far slurmstepd is fine, using only 1.0% CPU. I guess it won't help if I turn on the debug flags after the user's job triggers this bug, right? If so, I would have to turn on the slurmd debug flags on the whole cluster and monitor until the next case pops up, right? Will that slow down slurmd considerably and affect job launch performance, or will it just blow up the slurmd.log size?


> I'm looking into the GDB trace and seems a deadlock with the pthread mutex
> "task_list_lock". But I'm not 100% sure yet, nor seen any commit fixing this
> issue, nor a possible deadlock with this mutex.
> 
> So, to summarise, debug logs would be much appreciated to speed the
> investigation.
> 

ok, I will check the queue now if there is any similar job pending there and 
will turn on the debug flags.

> Thanks,
> Carlos.
Comment 13 Bruno Mundim 2022-05-27 15:04:57 MDT
Unfortunately I can't reproduce the bug on demand. I contacted one of our users whose job triggered this bug and asked him to resubmit it to a reservation early next week. Hopefully this way, with all debug flags turned on, we can reproduce it.

Thanks!
Bruno.
Comment 14 Bruno Mundim 2022-06-01 15:02:40 MDT
Created attachment 25319 [details]
slurmd log, strace, gdb backtrace and other relevant info

I created a reservation and asked the user to submit the exact same job that triggered this bug. For the first few minutes slurmstepd seems fine in terms of CPU consumption; then it starts to slowly use more and more CPU. I collected the slurmd logs with the flags you asked for, dumped the gdb backtrace several times, and gathered other info that might be relevant. It is all in the attached file. I cancelled the job after collecting this data, but slurmstepd was not killed by the scheduler and the node was drained with the "Kill task failed" reason. After killing -9 slurmstepd by hand and restarting slurmd, the node came back to normal.

Anyway, I hope this data is helpful. Please let me know how it goes and if I need to rerun it with other debug flags.

Cheers,
Bruno.
Comment 15 Carlos Tripiana Montes 2022-06-02 01:14:23 MDT
Thanks for the feedback. I'm going to have a look at it.

The only thing that remains unknown to me is the exact job script for this test. I guess it's similar to the snippet in Comment 12, but I just want to be sure.

I'll let you know of any discovery asap.

Regards,
Carlos.
Comment 16 Carlos Tripiana Montes 2022-06-02 02:19:43 MDT
I'm seeing evidence of this job spawning a massive number of processes. I'm fairly sure the spawning isn't happening as expected; I'd recommend getting rid of parallel and using srun to spawn the tasks. A job run with srun doesn't have to be MPI or otherwise parallel: srun can act as a mere spawner.
----

I wonder whether it is possible to confirm, or test again, the following:

At the moment you had to manually kill -9 the extern step, were there processes belonging to the user's execution still running?

How many of those, if any, were still rogue (count them if you test this again).

I'm asking this because I see processes like PID:259236 with parent PPID:20240 that are having their accounting gathered *after* the job has been cancelled. So they must have still been alive at that point in time.
----

Finally, I've been able to spot the running thread both in your 1st GDB backtrace and in the new ones. All the JAG threads are stuck waiting, and there's 1 in _get_offspring_data. So to clarify my previous investigation: I was wrong, there's no deadlock.

My worry is that I see a huge number of independent processes, each needing a thread to capture its resource usage. I'm fairly sure disabling job accounting gather would make the CPU usage drop to zero, but that's not the solution. The point is rather that the stepd CPU usage is plausible once you take into account that it is getting overloaded by a massive number of processes, each needing one thread for its accounting.
----

To summarize, I'd take the user's job and migrate it from parallel to srun, then see if the issue still happens. I'm not 100% sure, but I feel that would solve the issue.
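A minimal sketch of that migration for this workload (task count and runtime borrowed from the script in Comment 12; the exact srun step options may need tuning for your site):

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=01:00:00

# Each task becomes a real Slurm step that slurmstepd accounts for
# directly, instead of an SSH-spawned process adopted by the extern step.
for i in {001..120}; do
    # --ntasks=1 --exclusive: one task per step; steps queue until one
    # of the 40 allocated CPUs frees up.
    srun --ntasks=1 --exclusive bash -c "echo $i; sleep 600" &
done
wait
```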

Thanks,
Carlos.
Comment 17 Bruno Mundim 2022-06-03 15:43:23 MDT
(In reply to Carlos Tripiana Montes from comment #15)
> Thanks for the feedback. I'm going to have a look at.
> 
> The only thing it remains unknown to me is the exact jobscript for this
> test. I guess it's similar to the snippet in Comment 12, but just to be sure.
> 

I don't know the gory details of what this user is doing, but I have seen the same problem from other users using gnu-parallel. This user's script is roughly the same as the one I mentioned earlier:

#!/bin/bash
# SLURM submission script for multiple serial jobs on multiple Niagara nodes
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=06:00:00
#SBATCH --job-name Y1_OV_64

# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Turn off implicit threading in Python, R
export OMP_NUM_THREADS=1


module load NiaEnv/2018a
module load python/3.6.5
source ~/.virtualenvs/python_3_6_5/bin/activate
module load gnu-parallel
module load gcc/7.3.0
module load bedtools/2.27.1
module load r/3.5.1

HOSTS=$(scontrol show hostnames $SLURM_NODELIST | tr '\n' ,)

parallel --env OMP_NUM_THREADS,PATH,LD_LIBRARY_PATH --joblog slurm-$SLURM_JOBID.log -j ${SLURM_TASKS_PER_NODE:0:2} -S $HOSTS --wd $PWD "/path/to/user/executable/command_{}.sh" ::: {1..40}

Where the several command_1.sh, etc... are just python scripts running on the user data:

python /path/to/user/python_script.py USER_SCRIPT_ARGUMENT /path/to/user/data.bed /path/to/user/another/data.bed

> I'll let you know of any discovery asap.
> 
Thanks!

> Regards,
> Carlos.
Comment 18 Bruno Mundim 2022-06-03 16:13:09 MDT
(In reply to Carlos Tripiana Montes from comment #16)
> I'm seeing evidences of this job spawning a massive number of processes. I'm
> fairly sure the spawning isn't happening as expected, I'd recommend to get
> rid of parallel and use srun for the spawning of the tasks. It's not
> mandatory the job run with srun is MPI or whatever parallel thing, srun can
> act as a mere spawner.

I understand that srun can act as a spawner; however, it doesn't have the capabilities that the parallel command from the gnu-parallel package (https://www.gnu.org/software/parallel/) offers. For example, if you have lots of tasks with varying completion times, gnu-parallel will take care of the load balancing and keep as many cores busy as you have specified. In the case of Niagara, 40 tasks should be running on the 40 physical cores of the compute nodes all the time. The nice thing about this tool is that you can ask to run 4000 tasks 40 at a time, and if you reach the end of the allocation, when you submit your job again it will pick up the tasks that were not completed and restart from them. We have written a wiki page with a few examples: https://docs.scinet.utoronto.ca/index.php/Running_Serial_Jobs_on_Niagara.
Maybe srun could adopt most of these features in the future?

> ----
> 
> I wonder whether is possible or not to confirm or test again the following:
> 
> At the moment you had to manually kill -9 the extern step, there were
> processes belonging to the user execution still going around?
> 

Good question! I don't know the answer. I didn't record the output of ps
for all processes.

> How many of those, if any, were still roge (count it number if you test this
> again).
> 
> I'm asking this because I see processes like PID:259236 with parent
> PPID:20240, that are having their accounting gathered *after* the job has
> been cancelled. So they must have been still alive at that point in time.
> ----
> 

So maybe they could be stuck on IO? I will have to ask the user to submit
again. Please let me know if you have more ideas on what kind of extra info to
collect.

> Finally, I've been able to spot the running thread both in your 1st GDB
> backtrace and in the new ones. All the JAG threads are stuck waiting, and
> there's 1 in _get_offspring_data. So to clarify my previous investigation: I
> was wrong, there's no deadlock.
> 

Ok. What are these JAG threads you are referring to? jobacct_gather threads?

> My worries are that I see a huge amount of independent processes that needs
> a thread to catch it's resources' usage. I'm fairly sure disabling job
> accounting gather will make the cpu usage to just drop to zero, but that's
> not the solution. The question here is more that seems like the stepd cpu
> usage is fine, if you take into account that is getting overloaded by a
> massive number of processes -that needs 1 thread each to get the accounting.
> ----

What is the evidence for a massive number of processes? As I understood it, the user has only 40 tasks launched at a time. If I understood you correctly, that would require 40 extra JAG threads to collect resource consumption, right? Why would that overload slurmstepd?

> 
> To summarize, I'll take the user job, and migrate from parallel to srun.
> Then see if the issue still happens. I'm not 100% sure, but I feel that
> would solve the issue.
> 

We can't really do that because of the missing capabilities in srun. However,
it is true that so far this is happening only with gnu-parallel commands,
and only after the Slurm upgrade.

Thanks,
Bruno.


> Thanks,
> Carlos.
Comment 19 Carlos Tripiana Montes 2022-06-07 02:13:34 MDT
> For example, if you have lots of tasks with varying time for completion, gnu-parallel will take care of the load balance and keep as many cores busy as you have specified.
> In the case of Niagara 40 tasks should be running on the 40 physical cores of the compute nodes all the time. The nice thing about this tool is that you can ask to run 4000 tasks 40 at the time, and if you reach the end of the allocation, when you submit your job again, it will pick up the tasks that were not completed and restart from them.

OK, that makes sense. You are doing greedy stuff. Now I understand the logs better. Looking at the 2nd bunch of GDB traces, you'll see close to (slightly more than) 40 threads keeping track of the parallel tasks for the job account gather plugin (yes, JAG means exactly that :)). But if you look at the 1st GDB trace you'll see quite a few more than ~40. My guess is that, at that time, things had become quite a bit more overloaded.

> > At the moment you had to manually kill -9 the extern step, there were processes belonging to the user execution still going around?
> 
> Good question! I don't know the answer. I didn't record the output of ps for all processes.

This is something I think really needs to be checked. See below.

> So maybe they could be stuck at IO? I will have to ask the user to submit again. Please let me know if you have more ideas on what kind of extra info to collect.

Yes, this is certainly a possibility. We need to check why there are still JAG threads looking at the stats (those from /proc) of the parallel tasks.

> What is the evidence for a massive number of processes? As I understood the user has only 40 tasks launched at the time. If I understood you correctly that would request 40 extra JAG threads to collect resource consumption, right? Why would that overload slurmstepd?

In fact, you are right for the 2nd bunch of GDB traces. In the logs, you'll see a massive number of threads doing JAG work for the same jobs. But after looking at the GDB traces, you see the slurmd is consistently around 40 JAG threads. A second look at the logs then exposes the PIDs tracked by JAG in "blocks" of around 40 threads.

I was more worried about the 1st GDB trace, where you can clearly see way more than 40 JAG threads. Either the user did something wrong with parallel, or something else bad happened.

All in all, the point that is causing problems is related to things like:

2982:[2022-06-01T14:15:06.639] [7503353.extern] debug:  Handling REQUEST_ADD_EXTERN_PID
2983:[2022-06-01T14:15:06.639] [7503353.extern] debug:  _handle_add_extern_pid_internal: for StepId=7503353.extern, pid 250376

Every time you see this, a new JAG thread for this PID is created. What I don't really know yet is why this is happening. I wonder whether the parallel job is creating tasks that subsequently get detached from the parallel process. My bet is yes, because the "-S HOSTLIST" flag makes parallel spawn every single task over SSH.

Have a look at the comparison between these 2 commands:

parallel -j 40 "sleep 60" ::: {1..40}
parallel -j 40 -S localhost "sleep 60" ::: {1..40}

The second one exposes the issue I'm pointing out.

If the user only intends to use 1 node, just drop the "-S" flag so Slurm can keep the whole workload as part of the batch step. If SSH in the middle causes the parallel tasks to get detached from the batch step, Slurm tries to catch them all in the extern step to keep users from running beyond their allocation, but this causes the extra overhead you see. In any case, this is the explanation, and the workaround to get rid of the 40 JAG threads. I still can't tell why at some point the node got so overloaded that slurmd had ~150 JAG threads.
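The detachment itself can be reproduced without GNU parallel at all. A small stand-in sketch (plain shell, no Slurm involved) showing how a task whose intermediate parent exits gets reparented out of the original process tree:

```shell
#!/bin/sh
# Stand-in for what "-S host" provokes: the intermediate process (here a
# subshell; there, ssh/parallel) exits, and the task is reparented away
# from the job's process tree, so Slurm can only track it by adopting it
# into the extern step.
(sleep 30 &)                    # subshell exits at once; sleep is orphaned
sleep 1                         # give the kernel time to reparent it
pid=$(pgrep -xn sleep)          # newest remaining sleep process
ps -o pid,ppid,comm -p "$pid"   # PPID is now 1 (or a subreaper), not $$
kill "$pid"
```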

Regards,
Carlos.
Comment 20 Carlos Tripiana Montes 2022-06-15 04:20:23 MDT
Hi Bruno,

please update us whenever possible. It's been more than a week since our last communication.

Regards,
Carlos.
Comment 21 Bruno Mundim 2022-06-15 09:00:48 MDT
(In reply to Carlos Tripiana Montes from comment #20)
> Hi Bruno,
> 
> please, update us whenever possible. It's been more than a week since our
> last communication.
> 
> Regards,
> Carlos.

Sorry Carlos,

we have maintenance going on at the datacenter this week, and I will need an extra day or two to get back to you on this.

Thanks,
Bruno.
Comment 22 Bruno Mundim 2022-06-28 15:33:59 MDT
(In reply to Carlos Tripiana Montes from comment #19)
> > For example, if you have lots of tasks with varying time for completion, gnu-parallel will take care of the load balance and keep as many cores busy as you have specified.
> > In the case of Niagara 40 tasks should be running on the 40 physical cores of the compute nodes all the time. The nice thing about this tool is that you can ask to run 4000 tasks 40 at the time, and if you reach the end of the allocation, when you submit your job again, it will pick up the tasks that were not completed and restart from them.
> 
> OK, that makes sense. You are doing greedy stuff. Now I understand better
> the logs. Looking at the 2nd bunch of GDB traces, you'll see more than, but
> close to, 40 threads keeping track of the parallel tasks for the job account
> gather plugin (yes, JAG means exactly that :)). But if you look at the 1st
> GDB trace you'll see quite more than ~40. My guess is that, at that time,
> things gone quite more overloaded.
> 

But then why didn't it decrease over time?

> > > At the moment you had to manually kill -9 the extern step, there were processes belonging to the user execution still going around?
> > 
> > Good question! I don't know the answer. I didn't record the output of ps for all processes.
> 
> This is something I think it really needs to be checked. See below.
> 

I reran your suggestions below, plus a test with a Python script of mine, and the problem was absent in all 3 cases. I'm not sure how to reproduce it yet.

> > So maybe they could be stuck at IO? I will have to ask the user to submit again. Please let me know if you have more ideas on what kind of extra info to collect.
> 
> Yes, this could be for sure a possibility. We need to check why there's
> still JAG threads looking at the stats (those from /proc) of parallel tasks.
> 

Do you see evidence of that in the logs for the new tests (which I am about to upload)?

> > What is the evidence for a massive number of processes? As I understood the user has only 40 tasks launched at the time. If I understood you correctly that would request 40 extra JAG threads to collect resource consumption, right? Why would that overload slurmstepd?
> 
> In fact, you are right for the 2nd bunch of GDB traces. In the logs, you'll
> see a massive amount of threads for the same jobs doing JAG work. But after
> looking at the GDB traces, you see the slurmd is consistently around 40 JAG
> threads. Then, a second look at logs exposes the PIDs tracked by JAG in
> "blocks" of around 40 threads.
> 
> I was more worried about the 1st GDB trace, where you'll clearly see there
> were way more than 40 JAG threads. Either the user did something wrong with
> parallel, or something else bad happened.
> 

Could it be the extra ssh processes to launch the tasks on the node?


> All in all, the point that is causing problems is related to things like:
> 
> 2982:[2022-06-01T14:15:06.639] [7503353.extern] debug:  Handling
> REQUEST_ADD_EXTERN_PID
> 2983:[2022-06-01T14:15:06.639] [7503353.extern] debug: 
> _handle_add_extern_pid_internal: for StepId=7503353.extern, pid 250376
> 
> Every time you see this, a new JAG thread for this PID is created. What I
> don't really know yet is why this is happening. I wonder whether the
> parallel job is creating tasks that subsequently get detached from the
> parallel process. My bet is yes, because the "-S HOSTLIST" flag makes
> parallel spawn every single task using SSH.
> 

So you think the parallel process is forking itself 40 times?

> Have a look at the comparison between this 2 commands:
> 
> parallel -j 40 "sleep 60" ::: {1..40}
> parallel -j 40 -S localhost "sleep 60" ::: {1..40}
> 
> The second one exposes the issue I'm pointing out.
> 

Yes, I do see all the SSH connections collected by the sshd daemon.
However, I ran both commands above, sleeping for 10 minutes instead, and
I didn't see slurmstepd's CPU consumption increase.

> IF the user is only intending to use 1 node, just wipe out the "-S" flag so
> Slurm can keep the whole work as part of the batch step. If the SSH is in
> the middle causing the parallel tasks to get detached from the batch step,
> slurm tries to catch them all in the extern step to prevent users from running
> beyond what's allowed, but this causes the extra overhead you see. 

But why does that load persist after the connection is established? We can
certainly advise users to use the -S flag only when using more than one node;
however, I didn't notice any difference in slurmstepd CPU consumption.


> In any case, this is the explanation and the workaround to get rid of the
> 40 JAG threads. I can't really know why, at some point, the node got so
> overloaded that slurmd had ~150 JAG threads.
> 

Ok, I have attached the results of 3 runs. The first two use the sleep 600
command but differ in whether the -S flag is used. I couldn't see any
difference between them beyond the extra ssh processes. For the last run I
used a simple Python script that just loads 3 2-D arrays with random numbers
and keeps averaging their values in an infinite loop. I tried to simulate a
reasonably large memory footprint per core and constant CPU use. I do see a
spike in slurmstepd CPU usage at the beginning, but it drops back to almost
zero as time goes on.

The bottom line is that even with a Python script that actually does something,
I still couldn't reproduce the problem. Next I will try adding some heavy IO to see if that helps.
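For reference, a stdlib-only sketch of the kind of load generator described above. The actual script is not attached here, and it likely used NumPy rather than plain lists, so treat this as an approximation of "3 2-D arrays of random numbers averaged in an infinite loop":

```python
import random

def busy_average(n=2000, iterations=None):
    """Build three n x n arrays of random floats and repeatedly average
    their values. iterations=None loops forever (the load-generator mode
    used in the test); a finite value returns the last computed average."""
    arrays = [[[random.random() for _ in range(n)] for _ in range(n)]
              for _ in range(3)]
    total = n * n
    result = 0.0
    i = 0
    while iterations is None or i < iterations:
        # Mean of each array, then mean of the three means.
        result = sum(sum(sum(row) for row in a) / total for a in arrays) / 3.0
        i += 1
    return result

if __name__ == "__main__":
    # Pin one core and hold the arrays in memory indefinitely.
    busy_average()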

Thanks,
Bruno. 


> Regards,
> Carlos.
Comment 23 Bruno Mundim 2022-06-28 15:38:09 MDT
Created attachment 25691 [details]
Comparisons between sleep with and without host, and with a python script

Several output files and logs, comparing first the influence of the -S flag on the parallel command and on slurmstepd CPU consumption. No increase was found. The third set of output is for a simple Python script that keeps a reasonable memory footprint and the CPU busy. Again, no increase in slurmstepd usage was observed. I hope you can spot the differences between the runs that worked fine here and the user's previous one, where slurmstepd ended up slowly using more and more CPU resources over time.

Thanks,
Bruno.
Comment 24 Carlos Tripiana Montes 2022-07-11 02:37:57 MDT
Hi Bruno,

Sorry for the delay.

Overall, I'm still confident that the high number of JAG threads opened is due to the extra ssh processes used to launch the tasks.

This scenario is prone to cause CPU contention, as there are more threads than CPUs. If, for some reason, this coincides with a slowdown in grabbing the proc stats (e.g. due to IO overload), it could lead to your situation. In the 1st gdb trace, with way more than 40 threads, the logs exposed JAG threads still able to get stats from PIDs of an already-cancelled job, so those processes must have been alive at that time. This also points to some IO overload or similar.

The high CPU usage from slurmstepd isn't only related to the ssh spawning; it increases over time, and I think it's the JAG threads that are consuming cycles computing the energy or doing related accounting work. All in all, I can't confirm this unless we can put the process under perf record and see the cycles report, so I am just betting on the most probable cause. And, most importantly, as you've seen, this isn't easy to reproduce, so I think it's not the common situation. It must be something that happens only in an overload scenario.

I wonder whether you are willing/able to run perf record when the problem arises in a production job, or whether you have a reproducer job; that would tell us the truth. I didn't want to resort to that option until now, as it requires some time from you and I don't know if you have it.

Please tell us your options.

Thanks,
Carlos.
Comment 25 Carlos Tripiana Montes 2022-07-26 04:42:35 MDT
Hi,

It's been 2 weeks since my last reply. Please tell us what you think.

Thanks,
Carlos.
Comment 26 Bruno Mundim 2022-07-26 09:18:38 MDT
Hi Carlos,

Yes, I would like to try your suggestion. I am a bit short on time this week and next, but I can start working on it. Could you please elaborate on your suggestion to use perf?

Thanks,
Bruno.
Comment 27 Carlos Tripiana Montes 2022-08-02 04:23:40 MDT
If you spot a slurmstepd consuming a lot of CPU at that precise moment, collect the perf data by running:

perf record  -s --call-graph dwarf -p [PID OF slurmstepd] sleep 600
perf archive perf.data

Use the appropriate PID, then send us both perf.data.tar.bz2 and perf.data.

Thanks,
Carlos.
Comment 28 Carlos Tripiana Montes 2022-08-11 01:25:24 MDT
Bruno,

We need to close the bug as timed out. Just reopen it and provide the information we need if possible.

Thanks.
Comment 30 Bruno Mundim 2022-08-23 09:46:42 MDT
Created attachment 26430 [details]
perf data collected and attached

I have attached the perf data collected from one instance of slurmstepd with very high CPU usage. This example collects data for a user running R (https://www.r-project.org/) instead of the gnu-parallel program used previously.
Please let me know if you find it helpful.

Thanks,
Bruno.
Comment 31 Jason Booth 2022-08-23 09:55:09 MDT
Carlos is out of the office on vacation, so I will have Oriol look over this data and let you know what he finds.
Comment 32 Bruno Mundim 2022-08-23 15:04:08 MDT
Created attachment 26441 [details]
perf data for slurmstepd for a gnu-parallel job

Thank you! I have also attached data for a recent gnu-parallel job for which Slurm seems to have failed to kill slurmstepd; it is currently at 97% CPU usage.
Please let me know if the data for both cases is enough for your analysis so I can bring those nodes back into production.

Thanks,
Bruno.
Comment 33 Jason Booth 2022-08-23 15:21:13 MDT
From the compute node, please also gather the following.

> sudo lsof > /tmp/lsof_14179.out
> sudo pstree -p > /tmp/pstree_14179.out
> sudo pldd <PROCESS_ID> >> /tmp/pldd_14179.out
> sudo cat /proc/<PROCESS_ID>/maps | awk '{print $6}' | grep '\.so' | sort | uniq >> /tmp/proc_14179.out

Please use the last two commands with the process id of the slurmstepd and slurmd. 

Gather the output files and attach them to this issue. You can then return these nodes to the cluster.
Comment 34 Bruno Mundim 2022-08-24 08:54:51 MDT
Created attachment 26451 [details]
pstree, lsof etc for slurmd on both nodes

I have attached the output of the commands you asked for, for 2 nodes with the slurmstepd problem. Unfortunately, by the time I collected this data the slurmstepd process had already been killed. I took note of it and will submit new data when I observe this problem in the cluster again. 

Thanks,
Bruno.
Comment 35 Bruno Mundim 2022-08-24 14:12:54 MDT
Created attachment 26457 [details]
Data collected for a running case

I was able to catch a case while it was still running today. All the data you asked for is attached. I have contacted the user to try to understand what he is doing. Most of his R jobs trigger high CPU usage from slurmstepd, which takes a long time to clear even after the user's job finishes.

Thanks,
Bruno.
Comment 36 Oriol Vilarrubi 2022-08-25 13:03:02 MDT
Hi Bruno,

> Thank you! I have also attached here data for a recent gnu-parallel job which slurm seems to fail to kill slurmstepd and it is currently at 97% cpu usage.
> Please let me know if this data for both cases are enough for your analyses so I can bring those nodes to production again.

Yes, with all the data you've provided I have plenty of information to look into; you can bring those nodes back into production. I've read the whole bug history and I'm going through the logs and perf data you sent. For the moment I can confirm that the CPU usage is concentrated in the JAG threads, which confirms Carlos's idea that the issue is there. I will continue to study the perf data as well as the other information you provided.

To rule out points of failure, would you be open to changing jobacct_gather from linux to cgroup and seeing whether the same R job that increased the CPU usage does it again?

Regards.
Comment 37 Bruno Mundim 2022-08-25 14:13:11 MDT
(In reply to Oriol Vilarrubi from comment #36)

> In order to discard points of failure, would you be open to change to
> jobacct_gather from linux to cgroup and see if the same R job that increased
> the cpu usage does it again?

Do you think it is possible to make this change on the fly? Ideally I would identify the node with the slurmstepd problem, change the configuration just there, and reload slurmd. Would that be possible instead of changing the configuration for the whole cluster?

I have failed to come up with a test that reproduces this issue so far. A colleague just suggested creating one that spawns new processes and destroys them very quickly as a possible way to reproduce the problem. If you think that would be useful and have an idea for such a test case, please let me know.
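For what it's worth, a minimal sketch of the kind of process-churn test the colleague suggested. This is purely hypothetical (the ticket never settled on a reproducer): it rapidly spawns and reaps trivial child processes, which is the pattern expected to stress the per-PID JAG tracking in slurmstepd.

```python
import subprocess
import sys
import time

def churn(duration_s=60.0, batch=20):
    """Spawn `batch` short-lived children at a time for roughly
    `duration_s` seconds, reaping each batch immediately.
    Returns the total number of processes created and reaped."""
    spawned = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # Each child does nothing and exits at once.
        procs = [subprocess.Popen([sys.executable, "-c", "pass"])
                 for _ in range(batch)]
        for p in procs:
            p.wait()
        spawned += batch
    return spawned

if __name__ == "__main__":
    # Run inside an sbatch job step and watch slurmstepd's CPU usage.
    print(churn())
```

One could run this under sbatch on an otherwise idle node and watch whether slurmstepd's CPU usage climbs as the PID churn accumulates.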

Thanks,
Bruno.
Comment 38 Oriol Vilarrubi 2022-08-29 04:19:25 MDT
(In reply to Bruno Mundim from comment #37)
> (In reply to Oriol Vilarrubi from comment #36)
> 
> > In order to discard points of failure, would you be open to change to
> > jobacct_gather from linux to cgroup and see if the same R job that increased
> > the cpu usage does it again?
> 
> Do you think it is possible to do this change on the fly? Ideally I would
> identify the node with the slurmstepd problem and change the configuration
> just there and reload slurmd. Would that be possible instead of a change of
> configuration for the whole cluster?

Yes, it can be done, but it requires the node to be empty of jobs, as this changes the communication between slurmd and the steps; if any step were running, the communication would fail. Also, if you change the configuration on only one node, add the option NO_CONF_HASH to the DebugFlags parameter in your slurm.conf (https://slurm.schedmd.com/slurm.conf.html#OPT_NO_CONF_HASH); this way you will not get error messages in the log about differing config files.
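A minimal sketch of the per-node override being discussed, using the standard slurm.conf parameter names (the rest of the file is assumed to stay identical to the cluster-wide configuration):

```ini
# In the local slurm.conf on the drained/reserved node only:
JobAcctGatherType=jobacct_gather/cgroup

# Cluster-wide (or at least consistently set), so the per-node
# difference does not flood the logs with config-hash mismatch errors:
DebugFlags=NO_CONF_HASH
```

After the node is free of jobs, restarting slurmd on that node would pick up the change.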

> I failed to come up with a test to reproduce this issue so far. A colleague
> just suggested to create one that spawns new processes and destroy them very
> fast as a possible way to reproduce the problem. If you think it is useful
> and have an idea of a test case for it, please let me know.

Let me take a deeper look at all the information and logs you provided, as well as at the code, before answering that, as I want to be sure.

Regards.
Comment 39 Oriol Vilarrubi 2023-03-24 10:08:24 MDT
Hello Bruno,

First, let me apologize for the late response on this bug. I've been investigating it further and I think I have a lead we can use to further isolate the issue, but I want to investigate the code further before asking you to run any tests. The issue you are seeing is caused by the poll syscall taking more than 300 seconds. As Carlos already mentioned in past comments, this could be caused by having lots of processes doing too much IO on the system, but it can also be caused by a non-responsive slurmstepd, as the poll timeout is on the UNIX filesystem socket used for communication between slurmd and slurmstepd. As the next step, I would like you to try changing to the cgroup proctrack plugin, in order to be sure this is a general jobacct_gather issue and not a linux-specific one.

To do so, you will need to create a reservation for one node and, once there are no jobs on it, change the local slurm.conf on that node to use the cgroup job account gather plugin, restart the slurmd daemon, and then retest with the user job that tends to reproduce the issue. If we still see the "Timeout waiting for slurmstepd" errors, we know the error is not in the plugin but in the generic JAG platform. Meanwhile, I am working on a debug patch that you can apply to this same "reserved" node so we can obtain further information about what is happening.

Thanks, and sorry again for the late response.
Comment 40 Bruno Mundim 2023-04-24 09:38:19 MDT
Hi Oriol,

I am waiting for a user's code to trigger the problem again. Sometimes it happens 
almost daily, and sometimes we have a long hiatus like now.

Thanks,
Bruno.


(In reply to Oriol Vilarrubi from comment #39)
> Hello Bruno,
> 
> First let me apologize for the late response in this bug. I've been
> investigating it further and I think I came up with a lead that we can take
> to further isolate down the issue, but I want to investigate the code
> further before asking you to do any tests. The issue you are seeing is
> caused by the syscall poll taking more than 300 seconds, as Carlos already
> mentioned in past comments this could be caused to having lots of processes
> having too much IO on the system, but it can also be caused to non
> responsive slurmstepd as the poll timeout is in the UNIX filesystem socket
> that is used to communicate slurmd and slurmstepd. As the next step I would
> like you to try to change to the cgroup procktrack plugin in order to be
> sure this is a general jobaccountgather issue not a linux specific one.
> 
> In order to do so, you will need to create a reservation for one node, and
> when there are no jobs in it, you need to change the local slurm.conf on
> that node to have the cgroup job account gather plugin and restart the
> slurmd daemon, and then retest with the user job that tends to reproduce the
> issue. If we still see the errors "Timeout waiting for slurmstepd" we know
> that the error is not in the plugin but in the generic JAG platform.
> Meanwhile I am working on a debug patch, so that you can apply to this same
> "reserved" node and we can obtain further information on what is happening.
> 
> Thanks, and sorry again for the late response.
Comment 41 Oriol Vilarrubi 2023-09-12 09:37:45 MDT
Hi Bruno,

Did the user's code trigger the problem again?

Regards.
Comment 42 Bruno Mundim 2023-09-18 13:08:48 MDT
(In reply to Oriol Vilarrubi from comment #41)
> Hi Bruno,
> 
> Did the user's code triggered the problem again?
> 
> Regards.

It is happening less often, but when it happens I am usually late for the party. Sorry about that.

Thanks,
Bruno.
Comment 43 Oriol Vilarrubi 2024-02-16 06:50:45 MST
Hi Bruno,

Did the user's code trigger the problem another time? And have you by any chance tried the same thing on a newer Slurm version?

Regards.
Comment 44 Oriol Vilarrubi 2024-03-01 10:18:44 MST
Hi Bruno,

I am closing this bug, as at the moment the issue has not been appearing very often. Also, in newer Slurm versions some performance improvements have been made in the jobacct_gather part of the code for jobs with a lot of processes that are no longer present at the end of the job. Please do not hesitate to reopen this bug if you observe this behaviour again.

Regards.