Ticket 10605

Summary: slurmctld running slow after latest upgrade
Product: Slurm    Reporter: Andrew Bruno <aebruno2>
Component: slurmctld    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: bart, brian.mccaul, jbednasz, jennyw, kilian
Version: 20.11.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11257
Site: University at Buffalo (SUNY)
Version Fixed: 20.11.6 21.08pre1
Ticket Blocks: 10767
Attachments: log files
slurmctld backtrace
perf data
perf data archive
slurmctld backtrace when unresponsive
v1
slurmctld backtrace 3
sdiag 4m

Description Andrew Bruno 2021-01-11 12:19:38 MST
We upgraded to Slurm v20.11.2 (from 20.02.6) last week. Since the upgrade we've noticed slow response times from slurmctld and jobs failing to complete. The slurmctld process is consuming much more CPU than usual. Apologies in advance for the lack of specific information; we're having trouble tracking this down. Please let us know if you need any more info.

We're getting quite a few timeouts on the slurmctld, along with "pending RPCs at cycle end" warnings:

[2021-01-11T14:02:58.593] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2021-01-11T14:03:00.084] error: slurm_receive_msg [10.111.10.4:49986]: Socket timed out on send/recv operation
[2021-01-11T14:03:04.788] error: slurm_receive_msg [10.111.16.114:40806]: Socket timed out on send/recv operation
[2021-01-11T14:03:14.712] error: slurm_receive_msg [10.111.13.14:44926]: Socket timed out on send/recv operation
[2021-01-11T14:03:14.713] error: slurm_receive_msg [10.111.8.112:57796]: Socket timed out on send/recv operation

We're also seeing timeout issues on the slurmd side, specifically node health check failing:

[2021-01-09T01:01:42.706] error: Unable to register: Socket timed out on send/recv operation
[2021-01-09T01:01:50.511] health_check: timeout after 60s: killing pgid 9793

All sinfo, squeue, etc. utils are extremely slow (sometimes taking minutes to respond).

On the user side of things we're seeing the following behavior:

The srun command fails to complete. It hangs, giving the following errors:
srun: Job 4874039 step creation temporarily disabled, retrying (Socket timed out on send/recv operation)
srun: Job 4874039 step creation still disabled, retrying (Requested nodes are busy)
The job stays in that state until the requested time has expired.
Any srun command can encounter this problem.  Using the following line in Slurm scripts:

NPROCS=`srun --nodes=${SLURM_NNODES} bash -c 'hostname' | wc -l`

It did not make a difference whether or not the job had the --exclusive flag.

Here's the output from sdiag from last week when we first started noticing the issues:

*******************************************************
sdiag output at Fri Jan 08 18:37:06 2021 (1610149026)
Data since      Thu Jan 07 19:00:06 2021 (1610064006)
*******************************************************
Server thread count:  256
Agent queue size:     0
Agent count:          1
Agent thread count:   12
DBD Agent queue size: 0

Jobs submitted: 5476
Jobs started:   4967
Jobs completed: 3493
Jobs canceled:  1019
Jobs failed:    8

Job states ts:  Fri Jan 08 18:36:49 2021 (1610149009)
Jobs pending:   2409
Jobs running:   1882

Main schedule statistics (microseconds):
        Last cycle:   16456
        Max cycle:    523323
        Total cycles: 7406
        Mean cycle:   13526
        Mean depth cycle:  1201
        Cycles per minute: 5
        Last queue length: 2355

Backfilling stats
        Total backfilled jobs (since last slurm start): 10099
        Total backfilled jobs (since last stats cycle start): 4926
        Total backfilled heterogeneous job components: 0
        Total cycles: 609
        Last cycle when: Fri Jan 08 18:35:15 2021 (1610148915)
        Last cycle: 10661875
        Max cycle:  137957839
        Mean cycle: 9605637
        Last depth cycle: 2400
        Last depth cycle (try sched): 310
        Depth Mean: 1776
        Depth Mean (try depth): 232
        Last queue length: 2356
        Queue length mean: 1749
        Last table size: 147
        Mean table size: 114

Latency for 1000 calls to gettimeofday(): 48 microseconds


Also, here's the current output of sdiag:

*******************************************************
sdiag output at Mon Jan 11 14:13:20 2021 (1610392400)
Data since      Mon Jan 11 09:17:08 2021 (1610374628)
*******************************************************
Server thread count:  256
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 1412
Jobs started:   806
Jobs completed: 625
Jobs canceled:  53
Jobs failed:    0

Job states ts:  Mon Jan 11 14:12:53 2021 (1610392373)
Jobs pending:   3461
Jobs running:   1580

Main schedule statistics (microseconds):
        Last cycle:   17652
        Max cycle:    129673
        Total cycles: 1845
        Mean cycle:   14047
        Mean depth cycle:  1201
        Cycles per minute: 6
        Last queue length: 2321

Backfilling stats
        Total backfilled jobs (since last slurm start): 774
        Total backfilled jobs (since last stats cycle start): 774
        Total backfilled heterogeneous job components: 0
        Total cycles: 123
        Last cycle when: Mon Jan 11 14:12:08 2021 (1610392328)
        Last cycle: 20325399
        Max cycle:  99182802
        Mean cycle: 19552938
        Last depth cycle: 2397
        Last depth cycle (try sched): 465
        Depth Mean: 1909
        Depth Mean (try depth): 434
        Last queue length: 2321
        Queue length mean: 1853
        Last table size: 202
        Mean table size: 194

Latency for 1000 calls to gettimeofday(): 23 microseconds


Thanks for any help!

--Andrew
Comment 1 Jason Booth 2021-01-11 13:25:25 MST
Do you also have the "Remote Procedure Call statistics by message type" from those two sdiag outputs? If so, please also include those.
Comment 2 Jason Booth 2021-01-11 13:27:28 MST
Andrew - please also include your slurmctld.log and *.conf from your cluster. Have you also noticed a change in job type or quantity since upgrading? What other slurm.conf parameters have changed between versions?
Comment 3 Andrew Bruno 2021-01-11 13:51:27 MST
Created attachment 17416 [details]
log files
Comment 4 Andrew Bruno 2021-01-11 13:52:04 MST
(In reply to Jason Booth from comment #2)
> Andrew - please also include your slurmctld.log and *.conf from your
> cluster. Have you also noticed a changed job type or quantity since
> upgrading? What other slurm.conf parameters have changes between versions?

Requested log files have been attached. No changes have been made to slurm.conf params between versions.

Thanks!
Comment 6 Marcin Stolarek 2021-01-12 06:56:13 MST
 Andrew, 

Could you please take a core dump of the slurmctld process while it's running:
>#gcore SLURMCTLD_PID
>[..]
>Saved corefile core.XXX

Get its backtrace:
>#gdb PATH_TO_SLURMCTLD ./core.XXX
>(gdb)set pagination off
>(gdb)t a a bt f

and share the result with us?

cheers,
Marcin
Comment 7 Andrew Bruno 2021-01-12 07:06:14 MST
Created attachment 17426 [details]
slurmctld backtrace
Comment 8 Andrew Bruno 2021-01-12 07:06:31 MST
(In reply to Marcin Stolarek from comment #6)
>  Andrew, 
> 
> Could you please take a coredump of slurmctld process while it's running:
> >#gcore SLURMCTLD_PID
> >[..]
> >Saved corefile core.XXX
> 
> Get it's backtrace:
> >#gdb PATH_TO_SLURMCTLD ./core.XXX
> >(gdb)set pagination off
> >(gdb)t a a bt f
> 
> and share the result with us?
> 
> cheers,
> Marcin


Attached is the backtrace
Comment 9 Marcin Stolarek 2021-01-12 10:20:48 MST
Andrew,

Thanks for the backtrace, I'm looking into it.
Could you please additionally collect perf results:
>perf record -F 99 -p SLURMCTLD_PID --call-graph dwarf sleep 300
>perf archive perf.data

and share _both_ perf.data and perf.data.tar.bz2 (perf.data.tar.bz2 is not simply a compressed copy of perf.data; it contains the symbol annotations from your environment).

cheers,
Marcin
Comment 10 Andrew Bruno 2021-01-12 10:52:48 MST
Created attachment 17431 [details]
perf data
Comment 11 Andrew Bruno 2021-01-12 10:53:05 MST
Created attachment 17432 [details]
perf data archive
Comment 12 Andrew Bruno 2021-01-12 10:53:49 MST
(In reply to Marcin Stolarek from comment #9)
> Andrew,
> 
> Thanks for the backtrace, I'm looking into it.
> Could you please additionally collect perf results:
> >perf record -F 99 -p SLURMCTLD_PID --call-graph dwarf sleep 300
> >perf archive perf.data
> 
> and share _both_ perf.data and perf.data.tar.bz2 (.tar.gz2 is not a
> compressed version of perf.data it's a file containing call annotations from
> your environment).
> 
> cheers,
> Marcin

Hi Marcin, attached is the perf data you requested. Thanks!
Comment 13 Marcin Stolarek 2021-01-12 12:56:34 MST
Andrew,

I'm trying to understand if/how code changes between versions resulted in lower performance.

From the perf data and backtrace, it looks like slurmctld is under heavy pressure from step-creation requests; however, a fair amount of time is also spent in "dump_all_job_state", which may starve the step-creation RPCs. Could you please lower MinJobAge to 30 and check whether this at least mitigates the issue?
Is it possible that you have a new workload on the cluster?

cheers,
Marcin
Comment 14 Andrew Bruno 2021-01-12 13:26:25 MST
(In reply to Marcin Stolarek from comment #13)
> Andrew,
> 
> I'm trying to understand if/how code changes between versions resulted in
> lower performance.
> 
> From the perf and backtrace, it looks like slurmctld is under high pressure
> by step creation requests, however, a fair amount of time is also spent in
> "dump_all_job_state" which may starve step creation RPC. Could you please
> lower MinJobAge to 30 and check if this will at least mitigate the issue?
> Is it possible that you have a new workload on the cluster?
> 

We're not aware of any new workloads on our cluster. The job throughput actually went down after the upgrade. Total job steps went down as well:

report-2021-01-01.txt:Total job steps run: 92598
report-2021-01-02.txt:Total job steps run: 87249
report-2021-01-03.txt:Total job steps run: 82618
report-2021-01-04.txt:Total job steps run: 126335
report-2021-01-05.txt:Total job steps run: 26240
report-2021-01-06.txt:Total job steps run: 69723
report-2021-01-07.txt:Total job steps run: 61565
report-2021-01-08.txt:Total job steps run: 44268
report-2021-01-09.txt:Total job steps run: 45902
report-2021-01-10.txt:Total job steps run: 53213
report-2021-01-11.txt:Total job steps run: 24709

However, this could be due to the slow performance. It seems like most of the jobs with a stuck srun are hammering the slurmctld periodically, trying to get that job step created. The only way these jobs with a stuck srun complete is by running into the walltime limit. One thought we had was to try deleting those jobs; however, that does not explain how we got into this state to begin with.
Comment 15 Andrew Bruno 2021-01-12 13:40:49 MST
(In reply to Marcin Stolarek from comment #13)
> 
>  Could you please lower MinJobAge to 30 and check if this will at least mitigate the issue?

This did not go well. Set MinJobAge=30 and slurmctld was completely unresponsive. Reverting back to 300 allowed the slurmctld to at least come back up.
Comment 16 Andrew Bruno 2021-01-12 13:52:14 MST
Now we're seeing lots of these messages in the logs:

[2021-01-12T15:50:08.315] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=4860088, this should never happen

[2021-01-12T15:49:21.402] error: slurm_receive_msg [10.121.23.7:34568]: Zero Bytes were transmitted or received

[2021-01-12T15:48:47.789] error: gres/gpu: job 4875002 dealloc node cpn-u25-06 type tesla_v100-pcie-16gb gres count underflow (0 1)

Also, can't seem to run an scontrol reconf:


# scontrol reconf
slurm_reconfigure error: Socket timed out on send/recv operation

So we're now seeing lots of:

[2021-01-12T15:51:30.685] error: Node cpn-k06-05 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
Comment 17 Andrew Bruno 2021-01-12 15:01:13 MST
Created attachment 17444 [details]
slurmctld backtrace when unresponsive
Comment 18 Andrew Bruno 2021-01-12 15:02:47 MST
Attached another backtrace from our running slurmctld. We're now seeing the slurmctld run for a while after a restart and then become completely unresponsive. Let us know if we can provide any other details.
Comment 19 Andrew Bruno 2021-01-12 19:13:02 MST
We're seeing lots of these errors in the logs as well:

[2021-01-12T21:05:57.483] step_partial_comp: JobId=4877785 StepID=4294967292 invalid
[2021-01-12T21:05:57.483] step_partial_comp: JobId=4877777 StepID=4294967292 invalid
[2021-01-12T21:05:57.484] step_partial_comp: JobId=4877628 StepID=4294967292 invalid
[2021-01-12T21:06:12.423] step_partial_comp: JobId=4877604 StepID=4294967292 invalid
[2021-01-12T21:06:12.468] step_partial_comp: JobId=4877694 StepID=4294967292 invalid
[2021-01-12T21:06:12.961] step_partial_comp: JobId=4868314 StepID=4294967292 invalid
[2021-01-12T21:06:13.462] step_partial_comp: JobId=4877823 StepID=4294967292 invalid
[2021-01-12T21:06:32.761] step_partial_comp: JobId=4877508 StepID=4294967292 invalid
[2021-01-12T21:06:32.792] step_partial_comp: JobId=4877823 StepID=4294967292 invalid
[2021-01-12T21:06:33.304] step_partial_comp: JobId=4877707 StepID=4294967292 invalid
[2021-01-12T21:06:33.306] step_partial_comp: JobId=4875387 StepID=4294967292 invalid
[2021-01-12T21:06:34.345] step_partial_comp: JobId=4877601 StepID=4294967292 invalid

# grep -r 4294967292 /var/log/slurm/slurmctld.log | wc -l
559

This just keeps growing. 4294967292 seems a bit suspect, as it's 2^32 - 4; perhaps we're hitting a max unsigned int?
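That guess checks out arithmetically: the value sits at the very top of the 32-bit range, which is consistent with a reserved sentinel step ID rather than a real step number (a quick shell check; the sentinel interpretation is our reading at this point, not confirmed):

```shell
# 4294967292 is 2^32 - 4, i.e. 0xfffffffc -- the top of the uint32 range,
# where Slurm reserves special step IDs (the exact constants live in slurm.h).
printf '%u\n' $(( (1 << 32) - 4 ))   # 4294967292
printf '0x%x\n' 4294967292           # 0xfffffffc
```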
Comment 20 Marcin Stolarek 2021-01-13 04:21:34 MST
Created attachment 17449 [details]
v1

 Andrew,

Could you please apply the attached patch and let us know if it works?

cheers,
Marcin
Comment 21 Andrew Bruno 2021-01-13 06:44:20 MST
(In reply to Marcin Stolarek from comment #20)
> 
> Could you please apply the attached patch and let us know it works?
> 


Hi Marcin,

Thanks for the patch. We applied this and things seem to be a bit better. slurm commands like sinfo, squeue are more responsive.

However, we're still seeing these errors:

[2021-01-13T08:30:05.368] debug:  Note large processing time from _slurmctld_background: usec=1057632 began=08:30:04.310
[2021-01-13T08:30:07.628] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=1209215 began=08:30:06.419
[2021-01-13T08:30:07.633] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=1210543 began=08:30:06.423
[2021-01-13T08:30:08.402] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=2381759 began=08:30:06.020
[2021-01-13T08:30:09.394] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=1751710 began=08:30:07.642
[2021-01-13T08:30:09.401] debug:  Note large processing time from _slurm_rpc_dump_nodes: usec=1001186 began=08:30:08.400

Did this patch address the large processing times from _slurm_rpc_dump_nodes? 

It's also worth noting that last night we put in a reservation to prevent any new jobs from starting. At the reservation start time (19:10 EST) we had 1723 jobs running, and we are now down to about 800 jobs (as of 08:30 EST this morning). The reservation is still currently in place.

What do you recommend for next steps?

Thank you!
Comment 24 Marcin Stolarek 2021-01-13 07:17:14 MST
>Thanks for the patch. We applied this and things seem to be a bit better. slurm commands like sinfo, squeue are more responsive.

In one sentence: the patch allows parallel processing of the dump-jobs RPC (the backend of squeue/scontrol show job), at the potential cost of longer processing time for that RPC. It is a revert relative to the 20.02 code. It should not impact `_slurm_rpc_dump_nodes` either positively or negatively, which may suggest that this is a separate issue.

I'm happy to continue checking the details of those log messages, but can you check whether throughput after the patch is closer to what you expect (what you saw before the upgrade)?
>The reservation is still currently in place.
Let jobs run and verify whether the patch makes things better. If the performance is bad again, please:
-) collect a backtrace/profile
-) try reducing MinJobAge again (it may be that the change never took effect: the 2nd backtrace shows slurmctld stuck in reconfigure, which shouldn't happen with the patch applied).

cheers,
Marcin
Comment 26 Andrew Bruno 2021-01-13 09:18:12 MST
(In reply to Marcin Stolarek from comment #24)
> >Thanks for the patch. We applied this and things seem to be a bit better. slurm commands like sinfo, squeue are more responsive.
> 
> In one sentence the patch allows parallel processing of dump jobs rpc
> (backend of squeue/scontrol show job) with the potential cost of longer
> processing of it. It's a revert when compared to 20.02 code. This should not
> impact `_slurm_rpc_dump_nodes` neither positive nor negative, which may
> suggest that it's just a different issue.
> 
> I'm happy to continue checking the details of those log messages, but can
> you check that the throughput after the patch is closer to what you expect
> (saw before the upgrade)? 
> >The reservation is still currently in place.
> Let jobs run and verify if the patch makes things better, if the performance
> is bad again please:
> -) collect backtrace/profile
> -) try reducing MinJobAge again (it may be that you effectively didn't
> change MinJobAge, 2nd backtrace shows slurmctld stuck in reconfigure, this
> shouldn't happen with the patch applied).
> 
> cheers,
> Marcin

Sounds good, thanks for the info Marcin. There seems to be the following issues we've identified:

1. Large processing time from _slurm_rpc_dump_nodes

2. Long response times from slurm userland tools: squeue, sinfo, etc.

3. slurmctld was getting stuck in an infinite loop on scontrol reconf

4. debug messages showing job step ids being corrupt/invalid

With the patch applied, we've released the reservation. We've also set MinJobAge=30. Issue #3 above is resolved: slurmctld no longer hangs, and scontrol reconf completes without errors. Regarding issue #2, responsiveness of the slurm user tools appears to be better. However, we're still seeing #1 and #4, i.e. higher than usual load on the slurmctld, plus quite a few of these errors showing the jobid/stepid as invalid/corrupt (which may be safe to ignore?):

[2021-01-13T10:48:17.214] debug:  freed ports 12913-12921 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.378] debug:  freed ports 12682-12714 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.413] debug:  freed ports 12847-12879 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.415] debug:  sched: Running job scheduler
[2021-01-13T10:48:17.426] debug:  sched: already tested 1202 jobs, breaking out
[2021-01-13T10:48:17.460] debug:  freed ports 12748-12780 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.474] debug:  freed ports 12922-12930 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.814] debug:  freed ports 12781-12813 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.958] debug:  freed ports 12814-12846 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.969] debug:  freed ports 12880-12912 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:18.018] debug:  freed ports 12649-12681 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:18.129] debug:  freed ports 12715-12747 for JobId=Invalid StepId=CORRUPT

I'm attaching a backtrace as well.
Comment 27 Andrew Bruno 2021-01-13 09:18:43 MST
Created attachment 17455 [details]
slurmctld backtrace 3
Comment 28 Marcin Stolarek 2021-01-13 09:40:00 MST
>We've also set MinJobAge=30.
Did you notice issues with user commands (#2) without that? If not, then let's set it back to your previous value. 

We're working on some compatibility issues between the slurmd/slurmctld/slurmstepd code on 20.11 that appeared as a result of the change in the numeric representation of the batch StepId, but at the moment I'm not sure whether you're affected as well.


>1. Large processing time from _slurm_rpc_dump_nodes
Could you please share sdiag from the last 4 minutes (bf_interval*2):
>#sdiag -r && sleep 4m && sdiag
From the last backtrace I can see a number of REQUEST_NODE_INFO RPCs waiting on a lock while it is held by the thread handling another RPC. Those normally come from sinfo/scontrol show node/squeue execution. Is it possible that you have one of those running at high frequency? (sdiag should help to identify the user.)

cheers,
Marcin
Comment 29 Andrew Bruno 2021-01-13 10:20:41 MST
(In reply to Marcin Stolarek from comment #28)
> >We've also set MinJobAge=30.
> Did you notice issues with user commands (#2) without that? If not, then
> let's set it back to your previous value. 

We've set MinJobAge back to 300. So far things look good.


> 
> >1. Large processing time from _slurm_rpc_dump_nodes
> Could you please share sdiag from last 4 minutes (bf_interval*2):
> >#sdiag -r && sleep 4m && sdiag

Attached is sdiag from 4m.

> From last backtrace I can say that we have a number of REQUEST_NODE_INFO
> RPCs waiting for lock while the lock is acquired by the thread handling the
> RPC. Those normally come from sinfo/scontrol show node/squeue execution. Is
> it possible that you have one of those running with high frequency? (sdiag
> should help to identify the user)

We'll monitor sdiag to see if we can identify any high frequency RPCs for users.

Regarding the patch we're running: is your plan to include this patch in upstream v20.11.3?

Thanks again for all the help.

--Andrew
Comment 30 Andrew Bruno 2021-01-13 10:20:53 MST
Created attachment 17459 [details]
sdiag 4m
Comment 31 Marcin Stolarek 2021-01-13 11:04:23 MST
>In regards to the patch we're running. Is your plan to include this patch in upstream v20.11.3? 

20.11.3 will be released soon, and I don't expect the issue to be fixed there. It's nothing I can commit to at the moment, but we'll most likely end up with a different, more sophisticated long-term fix.

From the sdiag you shared, it looks like REQUEST_NODE_INFO comes in at high frequency: 36136 in 4 minutes, ~150 RPC/s, from the root user. Looking at the code, this RPC is never generated by the daemons, so it has to be coming from one or more of:
1)scontrol (either show node or show job)
2)sinfo
3)sview
4)squeue
5)slurmrestd
6)tool using API (maybe perlAPI or other translator)
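The rate quoted above works out as follows (a trivial check; 4 minutes = 240 seconds):

```shell
# 36136 REQUEST_NODE_INFO RPCs observed over the 4-minute (240 s) sdiag window:
echo $(( 36136 / 240 ))   # ~150 RPCs per second
```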

You can identify the hosts the requests are coming from by enabling the Protocol debug flag:
>scontrol setdebugflags +Protocol
A slurmctld debug level of at least verbose is required; you should then see messages like:
>slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 192.168.121.168:54486 uid 0

Once you identify the source, disable the debug flag with:
>scontrol setdebugflags -Protocol
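Once the flag is enabled, a short pipeline can tally the per-host counts (a sketch: the log path and exact message format are assumptions based on the example line above, demonstrated here against a small sample log):

```shell
# Tally REQUEST_NODE_INFO sources by IP from a slurmctld log captured with
# the Protocol debug flag on. Message format assumed from the example above.
count_node_info_sources() {
    grep 'received opcode REQUEST_NODE_INFO' "$1" \
        | sed -E 's/.* from ([0-9.]+):[0-9]+ .*/\1/' \
        | sort | uniq -c | sort -rn
}

# Demonstration against a hypothetical sample log:
cat > /tmp/sample.log <<'EOF'
slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 10.64.24.8:54486 uid 0
slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 10.64.24.8:54511 uid 0
slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 10.111.14.36:40102 uid 0
EOF
count_node_info_sources /tmp/sample.log   # prints per-IP counts, busiest first
```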

cheers,
Marcin
Comment 32 Marcin Stolarek 2021-01-13 11:14:04 MST
PS. Is it possible that one of the tools I mentioned is called by:
>HealthCheckProgram=/usr/sbin/nhc.ccr ?

Do you see new occurrences of #1 or #4 (from comment 26)?

cheers,
Marcin
Comment 33 Andrew Bruno 2021-01-13 11:35:26 MST
(In reply to Marcin Stolarek from comment #31)
> >In regards to the patch we're running. Is your plan to include this patch in upstream v20.11.3? 
> 
> 20.11.3 will be released soon. I don't expect to have the issue fixed there.
> It's nothing I can commit to at the moment, but we'll rather have a
> different/more sophisticated long-term fix.

Sounds good. Once 20.11.3 is released, if you could let us know whether we should apply the same patch or a different one, that would be great.

> 
> From the sdiag you shared it looks like REQUEST_NODE_INFO comes at high
> frequency 36136 in 4 minutes ~ 150 RPC/s from root user. Looking at the code
> this RPC is never generated from daemons, so it has to be coming from one or
> more of:
> 1)scontrol (either show node or show job)
> 2)sinfo
> 3)sview
> 4)squeue
> 5)slurmrestd
> 6)tool using API (maybe perlAPI or other translator)
> 
> You can identify the host where the requests are comming from enabling
> Protocol debugflag:
> >scontrol setdebugflags +Protocol
> Slurmctld debug level of at least verbose is required, but if you're on
> debug you should see messages like:
> >slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 192.168.121.168:54486 uid 0

I turned on protocol debugging for about a minute. It appears we're seeing an even distribution from all the nodes on our cluster. For example, here are the IPs and occurrence counts of REQUEST_NODE_INFO in the logs:

     11 10.64.24.8
     11 10.64.24.7
     11 10.64.24.51
     11 10.64.24.5
     11 10.64.24.49
     11 10.64.24.47
     11 10.64.24.46
     11 10.64.23.46
     11 10.64.23.45
     11 10.121.25.10
     11 10.121.24.5
     11 10.121.24.38
     11 10.121.24.36
     11 10.121.24.34
     11 10.111.14.135
     11 10.111.14.128
     11 10.111.14.127
     11 10.111.14.126
     11 10.111.14.12
     11 10.111.14.119
     11 10.111.14.117
     11 10.111.14.116
     11 10.111.14.114
     11 10.111.14.112
     11 10.111.14.107
     11 10.111.14.106
     11 10.111.14.105
     10 10.111.14.36
     10 10.111.14.35
     10 10.111.14.30
     10 10.111.14.3
     10 10.111.14.27
     10 10.111.14.26
     10 10.111.14.2
     10 10.111.14.18
      9 10.104.16.20
      9 10.104.16.2
      9 10.104.16.13
      9 10.104.15.2
      9 10.104.15.14
      9 10.104.15.10
      9 10.104.14.35
      9 10.104.14.20
      9 10.104.14.15


So it appears it's not just a single host.
Comment 34 Andrew Bruno 2021-01-13 11:40:36 MST
(In reply to Marcin Stolarek from comment #32)
> PS. Is it possible that one of the tools I mentioned is used by:
> >HealthCheckProgram=/usr/sbin/nhc.ccr ?
> 
> Do you see new occurrences of #1 or #4 (from comment 26)?
> 

Here's what our /usr/sbin/nhc.ccr looks like:


# cat /usr/sbin/nhc.ccr
#!/bin/bash
export DF_CMD="df --local"
/usr/sbin/nhc -t 60 HOSTNAME=`hostname -s`


It does not appear any of our nhc checks call any of these tools:

> 1)scontrol (either show node or show job)
> 2)sinfo
> 3)sview
> 4)squeue
> 5)slurmrestd
> 6)tool using API (maybe perlAPI or other translator)

The only thing we have is if one of the checks fails, nhc will call scontrol update node to set it offline.
Comment 35 Andrew Bruno 2021-01-13 11:48:23 MST
(In reply to Andrew Bruno from comment #33)
> (In reply to Marcin Stolarek from comment #31)

> 6)tool using API (maybe perlAPI or other translator)

I think I found what might be causing the RPCs. We run PCP on all our nodes, specifically a slurm pmda:

https://github.com/performancecopilot/pcp/blob/1fe1622641fb8a7fe2e030b96e45acde92ee96ee/src/pmdas/slurm/pmdaslurm.pl

This might explain the high frequency of REQUEST_NODE_INFO, ~150 RPC/s from the root user. If so, can you tell whether this is causing us any issues? We've been running this pmda for quite some time without noticing any problems.
Comment 37 Andrew Bruno 2021-01-13 17:01:44 MST
Hi Marcin,

Just wanted to give a quick update. We disabled the slurm pcp pmda on all our nodes and noticed a significant improvement. The load on slurmctld is much better, and the 4-minute sdiag sample now shows:

Remote Procedure Call statistics by user
        root            (       0) count:4060   ave_time:8208   total_time:33325109
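As a consistency check on those sdiag fields (assuming, per the sdiag docs, that times are reported in microseconds), ave_time should be total_time divided by count:

```shell
# ave_time ~= total_time / count for the RPC-by-user line above.
echo $(( 33325109 / 4060 ))   # 8208 microseconds, matching ave_time
```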

Also, we're no longer seeing the large processing time from _slurm_rpc_dump_nodes messages. 

Thank you so much for all your help debugging this. 

Cheers,

--Andrew
Comment 38 Marcin Stolarek 2021-01-13 23:09:26 MST
Andrew,

I'm glad that we're making progress. Do you agree that we can lower the case severity?

The problematic line is No. 78[1]: $slurm->load_node(), which actually fetches all nodes' information. Slurmctld provides this info in a consistent way, which requires blocking other threads' activity (oversimplifying: you can't gather node info while allowing job allocation or job completion handling in parallel).

As far as I understand, pcp (Performance Co-Pilot) is specifically designed to gather node-specific information, so a call to $slurm->load_node() is not the best fit for it?

>Sounds good. Once 20.11.3 is released, if you could let us know if we should apply the same patch or different patch that would be great.
The bug report will remain open till the final patch is merged with our public github repository. I'll keep you posted on the progress here.

cheers,
Marcin

[1]https://github.com/performancecopilot/pcp/blob/1fe1622641fb8a7fe2e030b96e45acde92ee96ee/src/pmdas/slurm/pmdaslurm.pl#L78
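To illustrate why a per-node poller calling load_node() multiplies load on slurmctld: with N nodes each polling every T seconds, the controller sees N/T RPCs per second cluster-wide. The figures below are purely hypothetical, chosen only to show how an ordinary-looking poll interval reproduces a rate like the ~150 RPC/s observed here:

```shell
# Hypothetical figures: 1500 nodes each calling $slurm->load_node() every 10 s.
NODES=1500
INTERVAL=10   # seconds between polls on each node
echo $(( NODES / INTERVAL ))   # 150 cluster-wide RPCs per second
```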
Comment 39 Andrew Bruno 2021-01-14 06:32:13 MST
(In reply to Marcin Stolarek from comment #38)
> Andrew,
> 
> I'm glad that we're making a progress, do you agree that we can lower the
> case severity?

Yes, agreed. We're running much better now.

> The bug report will remain open till the final patch is merged with our
> public github repository. I'll keep you posted on the progress here.

All sounds good. We run 2 other clusters; however, they have not been exhibiting the same behavior. Do you recommend holding off until 20.11.3, or should we apply this patch to our other clusters as well?

Thanks again for all the help!

--Andrew
Comment 40 Marcin Stolarek 2021-01-14 08:12:06 MST
>All sounds good. We run 2 other clusters, however they have not been exhibiting the same behavior. Do you recommend holding off until 20.11.3 or should we apply this patch to our other clusters as well.

It's a bit of a question about your operational procedures. I'd say: don't apply the workaround patch if you don't have the issue there. We're aware of more than a few customers that have already migrated to 20.11, and you're the first to report this issue.

Just to be 100% clear. This issue will not be fixed on 20.11.3.

cheers,
Marcin
Comment 41 Andrew Bruno 2021-01-14 08:17:32 MST
(In reply to Marcin Stolarek from comment #40)
> >All sounds good. We run 2 other clusters, however they have not been exhibiting the same behavior. Do you recommend holding off until 20.11.3 or should we apply this patch to our other clusters as well.
> 
> It's a little bit of a question on your operations procedures. I'd say -
> don't apply the workaround patch if you don't have an issue there. We're
> aware of more than a few customers that already migrated to 20.11 and you're
> the first to report this issue.
> 
> Just to be 100% clear. This issue will not be fixed on 20.11.3.
> 
> cheers,
> Marcin


Sounds good. Thanks again!
Comment 50 Jenny Williams 2021-03-10 07:57:29 MST
FWIW - we also are seeing this behavior. ( UNC Chapel Hill )
Comment 51 Jenny Williams 2021-03-10 07:58:18 MST
... on v. slurm 20.11.3
Comment 62 Marcin Stolarek 2021-03-23 12:58:20 MDT
The fix for the deadlock was merged into the slurm-20.11 branch as commit 6db0aca5a2d69[1] and will be released in Slurm 20.11.6.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/6db0aca5a2d690dfeb77eb16e727afa6b5c97ea8
Comment 63 Kilian Cavalotti 2021-03-23 13:27:37 MDT
(In reply to Marcin Stolarek from comment #62)
> The fix for a deadlock was merged into slurm-20.11 branch 6db0aca5a2d69[1]
> and will be released in Slurm 20.11.6.

Thanks for the fix!

Cheers,
--
Kilian
Comment 64 Jason Booth 2021-08-06 08:33:09 MDT
*** Ticket 12218 has been marked as a duplicate of this ticket. ***