| Summary: | slurmctld running slow after latest upgrade | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Andrew Bruno <aebruno2> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart, brian.mccaul, jbednasz, jennyw, kilian |
| Version: | 20.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=11257 | ||
| Site: | University of Buffalo (SUNY) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 20.11.6 21.08pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Ticket Depends on: | |||
| Ticket Blocks: | 10767 | ||
| Attachments: |
log files
slurmctld backtrace
perf data
perf data archive
slurmctld backtrace when unresponsive
v1
slurmctld backtrace 3
sdiag 4m
Do you also have the "Remote Procedure Call statistics by message type" from those two sdiag outputs? If so, please also include those.

Andrew - please also include your slurmctld.log and *.conf from your cluster. Have you also noticed a change in job type or quantity since upgrading? What other slurm.conf parameters have changed between versions?

Created attachment 17416 [details]
log files
(In reply to Jason Booth from comment #2)
> Andrew - please also include your slurmctld.log and *.conf from your
> cluster. Have you also noticed a changed job type or quantity since
> upgrading? What other slurm.conf parameters have changes between versions?

Requested log files have been attached. No changes have been made to slurm.conf params between versions. Thanks!

Andrew,

Could you please take a core dump of the slurmctld process while it's running:
>#gcore SLURMCTLD_PID
>[..]
>Saved corefile core.XXX

then get its backtrace:
>#gdb PATH_TO_SLURMCTLD ./core.XXX
>(gdb)set pagination off
>(gdb)t a a bt f

and share the result with us?

cheers,
Marcin

Created attachment 17426 [details]
slurmctld backtrace
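Marcin's core-dump recipe can be wrapped in a small script. This is a sketch under assumptions: the slurmctld binary path and core prefix are placeholders for your installation, and gdb's `t a a bt f` shorthand is spelled out as `thread apply all bt full`. The function only prints the commands (dry run) so it is safe to inspect before running against a production daemon.

```shell
#!/bin/sh
# Dry-run sketch of the live-backtrace recipe from the comment above.
# Assumptions: binary path and core prefix are placeholders.
SLURMCTLD_BIN=/usr/sbin/slurmctld
CORE_PREFIX=/tmp/slurmctld.core

backtrace_cmds() {
    # Print (not execute) the two steps: dump a core of the live process,
    # then take a full backtrace of every thread in batch mode.
    # "t a a bt f" is gdb shorthand for "thread apply all bt full".
    pid=$1
    echo "gcore -o $CORE_PREFIX $pid"
    echo "gdb -batch -ex 'set pagination off' -ex 'thread apply all bt full' $SLURMCTLD_BIN $CORE_PREFIX.$pid"
}

backtrace_cmds 4242
```

Note that `gcore -o PREFIX PID` appends the PID to the prefix, which is why the gdb line opens `$CORE_PREFIX.$pid`.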
(In reply to Marcin Stolarek from comment #6)
> Could you please take a coredump of slurmctld process while it's running
> [...] and share the result with us?

Attached is the backtrace.

Andrew,
Thanks for the backtrace, I'm looking into it.
Could you please additionally collect perf results:
>perf record -F 99 -p SLURMCTLD_PID --call-graph dwarf sleep 300
>perf archive perf.data
and share _both_ perf.data and perf.data.tar.bz2 (the .tar.bz2 is not a compressed copy of perf.data; it is a separate file containing the symbol annotations collected from your environment).
cheers,
Marcin
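Since both files must be uploaded, a quick pre-upload check can catch a missing `perf archive` step. A minimal sketch; the file names are the defaults that `perf record` and `perf archive` produce, and the directory argument is only for the demo.

```shell
#!/bin/sh
# Verify both perf artifacts exist and are non-empty before sharing.
# perf.data holds the samples; perf.data.tar.bz2 holds the build-id/symbol
# files that `perf archive` bundles so the data can be decoded off-host.
check_perf_artifacts() {
    dir=${1:-.}
    for f in perf.data perf.data.tar.bz2; do
        [ -s "$dir/$f" ] || { echo "missing or empty: $f"; return 1; }
    done
    echo "both perf artifacts present"
}

# Demo against a scratch directory with stand-in files.
d=$(mktemp -d)
echo sample > "$d/perf.data"
echo sample > "$d/perf.data.tar.bz2"
check_perf_artifacts "$d"
```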
Created attachment 17431 [details]
perf data
Created attachment 17432 [details]
perf data archive
(In reply to Marcin Stolarek from comment #9)
> Could you please additionally collect perf results [...]

Hi Marcin, attached is the perf data you requested. Thanks!

Andrew,

I'm trying to understand if/how code changes between versions resulted in lower performance.

From the perf and backtrace, it looks like slurmctld is under high pressure from step creation requests; however, a fair amount of time is also spent in "dump_all_job_state", which may starve the step creation RPCs. Could you please lower MinJobAge to 30 and check if this at least mitigates the issue? Is it possible that you have a new workload on the cluster?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #13)
> From the perf and backtrace, it looks like slurmctld is under high pressure
> by step creation requests [...] Is it possible that you have a new workload
> on the cluster?

We're not aware of any new workloads on our cluster. The job throughput actually went down after the upgrade.
Total job steps went down as well:

report-2021-01-01.txt:Total job steps run: 92598
report-2021-01-02.txt:Total job steps run: 87249
report-2021-01-03.txt:Total job steps run: 82618
report-2021-01-04.txt:Total job steps run: 126335
report-2021-01-05.txt:Total job steps run: 26240
report-2021-01-06.txt:Total job steps run: 69723
report-2021-01-07.txt:Total job steps run: 61565
report-2021-01-08.txt:Total job steps run: 44268
report-2021-01-09.txt:Total job steps run: 45902
report-2021-01-10.txt:Total job steps run: 53213
report-2021-01-11.txt:Total job steps run: 24709

However, this could be due to the slow performance. It seems like most of the jobs with a stuck srun are hammering the slurmctld periodically, trying to get their step created. The only way these jobs complete is by running into the walltime. One thought we had was to delete those jobs; however, that does not explain how we got into this state to begin with.

(In reply to Marcin Stolarek from comment #13)
> Could you please lower MinJobAge to 30 and check if this will at least
> mitigate the issue?

This did not go well. We set MinJobAge=30 and slurmctld became completely unresponsive. Reverting back to 300 allowed the slurmctld to at least come back up. Now we're seeing lots of these messages in the logs:

[2021-01-12T15:50:08.315] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=4860088, this should never happen
[2021-01-12T15:49:21.402] error: slurm_receive_msg [10.121.23.7:34568]: Zero Bytes were transmitted or received
[2021-01-12T15:48:47.789] error: gres/gpu: job 4875002 dealloc node cpn-u25-06 type tesla_v100-pcie-16gb gres count underflow (0 1)

Also, we can't seem to run an scontrol reconf:

# scontrol reconf
slurm_reconfigure error: Socket timed out on send/recv operation

So we're now seeing lots of:

[2021-01-12T15:51:30.685] error: Node cpn-k06-05 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Created attachment 17444 [details]
slurmctld backtrace when unresponsive
Attached is another backtrace from our running slurmctld. We're now seeing the slurmctld run for a while after a restart and then become completely unresponsive. Let us know if we can provide any other details.

We're seeing lots of these errors in the logs as well:

[2021-01-12T21:05:57.483] step_partial_comp: JobId=4877785 StepID=4294967292 invalid
[2021-01-12T21:05:57.483] step_partial_comp: JobId=4877777 StepID=4294967292 invalid
[2021-01-12T21:05:57.484] step_partial_comp: JobId=4877628 StepID=4294967292 invalid
[2021-01-12T21:06:12.423] step_partial_comp: JobId=4877604 StepID=4294967292 invalid
[2021-01-12T21:06:12.468] step_partial_comp: JobId=4877694 StepID=4294967292 invalid
[2021-01-12T21:06:12.961] step_partial_comp: JobId=4868314 StepID=4294967292 invalid
[2021-01-12T21:06:13.462] step_partial_comp: JobId=4877823 StepID=4294967292 invalid
[2021-01-12T21:06:32.761] step_partial_comp: JobId=4877508 StepID=4294967292 invalid
[2021-01-12T21:06:32.792] step_partial_comp: JobId=4877823 StepID=4294967292 invalid
[2021-01-12T21:06:33.304] step_partial_comp: JobId=4877707 StepID=4294967292 invalid
[2021-01-12T21:06:33.306] step_partial_comp: JobId=4875387 StepID=4294967292 invalid
[2021-01-12T21:06:34.345] step_partial_comp: JobId=4877601 StepID=4294967292 invalid

# grep -r 4294967292 /var/log/slurm/slurmctld.log | wc -l
559

This just keeps growing. 4294967292 seems a bit suspect, as it's 2^32 - 4; perhaps we're hitting a max unsigned int?

Created attachment 17449 [details]
v1
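Andrew's 2^32 - 4 arithmetic checks out, but rather than a wrapped counter, step IDs this close to UINT32_MAX are reserved sentinel values in the 20.11 step-id scheme (the extern/batch steps were given special numeric IDs in this release; the exact constant name is from memory, so treat it as an assumption). A quick check:

```shell
#!/bin/sh
# 4294967292 == 0xfffffffc == 2^32 - 4: a reserved sentinel step id in
# 20.11 (likely the extern container step), not an overflowed counter.
STEP=4294967292
printf 'decimal: %u\nhex: %#x\n' "$STEP" "$STEP"
[ "$STEP" -eq $(( (1 << 32) - 4 )) ] && echo "equals 2^32 - 4"
```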
Andrew,
Could you please apply the attached patch and let us know if it works?
cheers,
Marcin
(In reply to Marcin Stolarek from comment #20)
> Could you please apply the attached patch and let us know if it works?

Hi Marcin,

Thanks for the patch. We applied it and things seem to be a bit better; slurm commands like sinfo and squeue are more responsive. However, we're still seeing these errors:

[2021-01-13T08:30:05.368] debug: Note large processing time from _slurmctld_background: usec=1057632 began=08:30:04.310
[2021-01-13T08:30:07.628] debug: Note large processing time from _slurm_rpc_dump_nodes: usec=1209215 began=08:30:06.419
[2021-01-13T08:30:07.633] debug: Note large processing time from _slurm_rpc_dump_nodes: usec=1210543 began=08:30:06.423
[2021-01-13T08:30:08.402] debug: Note large processing time from _slurm_rpc_dump_nodes: usec=2381759 began=08:30:06.020
[2021-01-13T08:30:09.394] debug: Note large processing time from _slurm_rpc_dump_nodes: usec=1751710 began=08:30:07.642
[2021-01-13T08:30:09.401] debug: Note large processing time from _slurm_rpc_dump_nodes: usec=1001186 began=08:30:08.400

Did this patch address the large processing times from _slurm_rpc_dump_nodes?

It's also worth noting that last night we put in a reservation to prevent any new jobs from starting. At the reservation start time (19:10 EST) we had 1723 jobs running, and we are now down to about ~800 jobs (as of 08:30 EST this morning). The reservation is still currently in place.

What do you recommend for next steps? Thank you!

>Thanks for the patch. We applied this and things seem to be a bit better. slurm commands like sinfo, squeue are more responsive.

In one sentence: the patch allows parallel processing of the dump-jobs RPC (the backend of squeue/scontrol show job) at the potential cost of longer processing of it. It's a revert when compared to the 20.02 code. It should not impact `_slurm_rpc_dump_nodes` either positively or negatively, which may suggest that that is just a different issue.
I'm happy to continue checking the details of those log messages, but can you check whether the throughput after the patch is closer to what you expect (what you saw before the upgrade)?

>The reservation is still currently in place.

Let jobs run and verify whether the patch makes things better. If the performance is bad again, please:
-) collect a backtrace/profile
-) try reducing MinJobAge again (it may be that you effectively didn't change MinJobAge; the 2nd backtrace shows slurmctld stuck in reconfigure, which shouldn't happen with the patch applied).

cheers,
Marcin

(In reply to Marcin Stolarek from comment #24)
> Let jobs run and verify if the patch makes things better [...]

Sounds good, thanks for the info Marcin. These are the issues we've identified so far:

1. Large processing time from _slurm_rpc_dump_nodes
2. Long response times from slurm userland tools: squeue, sinfo, etc.
3. slurmctld was getting into an infinite loop and stuck on scontrol reconf
4. debug messages showing job step ids being corrupt/invalid

With the patch applied, we've released the reservation. We've also set MinJobAge=30. Issue #3 above is resolved: slurmctld no longer hangs, and scontrol reconf completes without errors. Regarding issue #2, the responsiveness of the slurm user tools appears to be better. However, we're still seeing #1 and #4, i.e. higher than usual load on the slurmctld, and also quite a few of these errors with the jobid/stepid invalid/corrupt (which may be safe to ignore?):

[2021-01-13T10:48:17.214] debug: freed ports 12913-12921 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.378] debug: freed ports 12682-12714 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.413] debug: freed ports 12847-12879 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.415] debug: sched: Running job scheduler
[2021-01-13T10:48:17.426] debug: sched: already tested 1202 jobs, breaking out
[2021-01-13T10:48:17.460] debug: freed ports 12748-12780 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.474] debug: freed ports 12922-12930 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.814] debug: freed ports 12781-12813 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.958] debug: freed ports 12814-12846 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:17.969] debug: freed ports 12880-12912 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:18.018] debug: freed ports 12649-12681 for JobId=Invalid StepId=CORRUPT
[2021-01-13T10:48:18.129] debug: freed ports 12715-12747 for JobId=Invalid StepId=CORRUPT

I'm attaching a backtrace as well.

Created attachment 17455 [details]
slurmctld backtrace 3
>We've also set MinJobAge=30.

Did you notice issues with user commands (#2) without that? If not, then let's set it back to your previous value. We're working on some compatibility issues between the slurmd/slurmctld/slurmstepd code on 20.11 that appeared as a result of the change in the numeric representation of the batch StepId, but I'm not sure at the moment whether you're affected as well.

>1. Large processing time from _slurm_rpc_dump_nodes

Could you please share sdiag from the last 4 minutes (bf_interval*2):
>#sdiag -r && sleep 4m && sdiag

From the last backtrace I can say that we have a number of REQUEST_NODE_INFO RPCs waiting for a lock while the lock is held by the thread handling the RPC. Those normally come from sinfo/scontrol show node/squeue execution. Is it possible that you have one of those running at high frequency? (sdiag should help to identify the user)

cheers,
Marcin

(In reply to Marcin Stolarek from comment #28)
> Did you notice issues with user commands (#2) without that? If not, then
> let's set it back to your previous value.

We've set MinJobAge back to 300. So far things look good.

> Could you please share sdiag from last 4 minutes (bf_interval*2):
> >#sdiag -r && sleep 4m && sdiag

Attached is sdiag from 4m.

> From last backtrace I can say that we have a number of REQUEST_NODE_INFO
> RPCs waiting for lock while the lock is acquired by the thread handling the
> RPC. [...]

We'll monitor sdiag to see if we can identify any high-frequency RPCs from users.

In regards to the patch we're running: is your plan to include this patch upstream in v20.11.3?

Thanks again for all the help.
--Andrew

Created attachment 17459 [details]
sdiag 4m
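The REQUEST_NODE_INFO rate quoted in the next comment follows directly from this attachment: 36136 calls over the 4-minute `sdiag -r && sleep 4m && sdiag` window. A sketch of the arithmetic (the count is copied from the attached sdiag output):

```shell
#!/bin/sh
# RPC rate from a 4-minute sdiag sample window.
COUNT=36136                 # REQUEST_NODE_INFO calls seen in the window
WINDOW=$((4 * 60))          # window length in seconds
awk -v c="$COUNT" -v w="$WINDOW" 'BEGIN { printf "%.1f RPC/s\n", c / w }'
```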
>In regards to the patch we're running. Is your plan to include this patch in upstream v20.11.3?

20.11.3 will be released soon; I don't expect to have the issue fixed there. It's nothing I can commit to at the moment, but we'll rather have a different/more sophisticated long-term fix.

From the sdiag you shared, it looks like REQUEST_NODE_INFO comes at high frequency: 36136 in 4 minutes, ~150 RPC/s, from the root user. Looking at the code, this RPC is never generated from daemons, so it has to be coming from one or more of:
1) scontrol (either show node or show job)
2) sinfo
3) sview
4) squeue
5) slurmrestd
6) a tool using the API (maybe the perl API or another translator)

You can identify the host the requests are coming from by enabling the Protocol debug flag:
>scontrol setdebugflags +Protocol

A slurmctld debug level of at least verbose is required, but if you're on debug you should see messages like:
>slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 192.168.121.168:54486 uid 0

Once you identify the source, disable the debug flag with:
>scontrol setdebugflags -Protocol

cheers,
Marcin

PS. Is it possible that one of the tools I mentioned is used by:
>HealthCheckProgram=/usr/sbin/nhc.ccr
?

Do you see new occurrences of #1 or #4 (from comment 26)?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #31)
> 20.11.3 will be released soon. I don't expect to have the issue fixed there.

Sounds good. Once 20.11.3 is released, if you could let us know whether we should apply the same patch or a different one, that would be great.

> You can identify the host where the requests are comming from enabling
> Protocol debugflag:
> >scontrol setdebugflags +Protocol

I turned on protocol debugging for about a minute. It appears we're seeing an even distribution from all our nodes on the cluster. For example, here are the IPs and occurrence counts of REQUEST_NODE_INFO in the logs:

11 10.64.24.8
11 10.64.24.7
11 10.64.24.51
11 10.64.24.5
11 10.64.24.49
11 10.64.24.47
11 10.64.24.46
11 10.64.23.46
11 10.64.23.45
11 10.121.25.10
11 10.121.24.5
11 10.121.24.38
11 10.121.24.36
11 10.121.24.34
11 10.111.14.135
11 10.111.14.128
11 10.111.14.127
11 10.111.14.126
11 10.111.14.12
11 10.111.14.119
11 10.111.14.117
11 10.111.14.116
11 10.111.14.114
11 10.111.14.112
11 10.111.14.107
11 10.111.14.106
11 10.111.14.105
10 10.111.14.36
10 10.111.14.35
10 10.111.14.30
10 10.111.14.3
10 10.111.14.27
10 10.111.14.26
10 10.111.14.2
10 10.111.14.18
9 10.104.16.20
9 10.104.16.2
9 10.104.16.13
9 10.104.15.2
9 10.104.15.14
9 10.104.15.10
9 10.104.14.35
9 10.104.14.20
9 10.104.14.15

So it appears it's not just a single host.

(In reply to Marcin Stolarek from comment #32)
> PS. Is it possible that one of the tools I mentioned is used by:
> >HealthCheckProgram=/usr/sbin/nhc.ccr ?

Here's what our /usr/sbin/nhc.ccr looks like:

# cat /usr/sbin/nhc.ccr
#!/bin/bash
export DF_CMD="df --local"
/usr/sbin/nhc -t 60
HOSTNAME=`hostname -s`

It does not appear that any of our nhc checks call any of these tools:
> 1) scontrol (either show node or show job)
> 2) sinfo
> 3) sview
> 4) squeue
> 5) slurmrestd
> 6) a tool using the API (maybe the perl API or another translator)

The only thing we have is that, if one of the checks fails, nhc will call scontrol update node to set it offline.

(In reply to Andrew Bruno from comment #33)
> 6) tool using API (maybe perlAPI or other translator)

I think I found what might be causing the RPCs. We run PCP on all our nodes, specifically a slurm pmda:

https://github.com/performancecopilot/pcp/blob/1fe1622641fb8a7fe2e030b96e45acde92ee96ee/src/pmdas/slurm/pmdaslurm.pl

This might explain the high frequency of REQUEST_NODE_INFO (~150 RPC/s from the root user)? If so, can you tell whether this is causing us any issues? We've been running this pmda for quite some time without noticing any issues.

Hi Marcin,
Just wanted to give a quick update. We disabled the slurm pcp pmda on all our nodes and noticed a significant improvement. The load on slurmctld is much better and the 4 minute sdiag sample now shows:
Remote Procedure Call statistics by user
root ( 0) count:4060 ave_time:8208 total_time:33325109
Also, we're no longer seeing the large processing time from _slurm_rpc_dump_nodes messages.
Thank you so much for all your help debugging this.
Cheers,
--Andrew
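The Protocol-debugflag hunt Andrew describes above can be scripted: enable the flag, wait, tally the senders, disable the flag. The slurmctld log path is an assumption, so the tally is demonstrated here on canned log lines and runs anywhere.

```shell
#!/bin/sh
# Tally REQUEST_NODE_INFO senders from slurmctld log lines produced with
# the Protocol debug flag (enable: scontrol setdebugflags +Protocol,
# disable: scontrol setdebugflags -Protocol).
tally_node_info() {
    grep 'received opcode REQUEST_NODE_INFO' |
        awk '{ for (i = 1; i < NF; i++)
                   if ($i == "from") { sub(/:.*/, "", $(i+1)); print $(i+1) } }' |
        sort | uniq -c | sort -rn
}

# Demo on canned lines; real input would be the slurmctld log, e.g.
# /var/log/slurm/slurmctld.log (path is an assumption).
printf '%s\n' \
  'slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 10.64.24.8:54486 uid 0' \
  'slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 10.64.24.8:54487 uid 0' \
  'slurmctld: slurmctld_req: received opcode REQUEST_NODE_INFO from 10.64.24.7:41000 uid 0' |
  tally_node_info
```

This produces a per-source-IP count, which is exactly the shape of the distribution Andrew pasted earlier in the thread.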
Andrew,

I'm glad that we're making progress; do you agree that we can lower the case severity?

The line causing the issue is No. 78[1]: $slurm->load_node(), which actually fetches all nodes' information. Slurmctld provides this info in a consistent way, which requires blocking other threads' activity (oversimplifying: you can't get node info while allowing job allocation or handling job completion in parallel). As far as I understand, pcp (Performance Co-Pilot) is specifically designed to gather node-specific information, so a call to $slurm->load_node() is not the best fit for it.

>Sounds good. Once 20.11.3 is released, if you could let us know if we should apply the same patch or different patch that would be great.

The bug report will remain open until the final patch is merged into our public github repository. I'll keep you posted on the progress here.

cheers,
Marcin

[1] https://github.com/performancecopilot/pcp/blob/1fe1622641fb8a7fe2e030b96e45acde92ee96ee/src/pmdas/slurm/pmdaslurm.pl#L78

(In reply to Marcin Stolarek from comment #38)
> I'm glad that we're making progress, do you agree that we can lower the
> case severity?

Yes, agreed. We're running much better now.

> The bug report will remain open till the final patch is merged with our
> public github repository. I'll keep you posted on the progress here.

All sounds good. We run 2 other clusters; however, they have not been exhibiting the same behavior. Do you recommend holding off until 20.11.3, or should we apply this patch to our other clusters as well?

Thanks again for all the help!
--Andrew
>All sounds good. We run 2 other clusters, however they have not been exhibiting the same behavior. Do you recommend holding off until 20.11.3 or should we apply this patch to our other clusters as well.

It's partly a question of your operational procedures. I'd say: don't apply the workaround patch where you don't have the issue. We're aware of more than a few customers that have already migrated to 20.11, and you're the first to report this issue.
Just to be 100% clear: this issue will not be fixed in 20.11.3.
cheers,
Marcin
(In reply to Marcin Stolarek from comment #40)
> I'd say - don't apply the workaround patch if you don't have an issue there.
> Just to be 100% clear. This issue will not be fixed on 20.11.3.

Sounds good. Thanks again!

FWIW - we also are seeing this behavior (UNC Chapel Hill), on slurm v20.11.3.

The fix for a deadlock was merged into the slurm-20.11 branch as 6db0aca5a2d69[1] and will be released in Slurm 20.11.6.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/6db0aca5a2d690dfeb77eb16e727afa6b5c97ea8

(In reply to Marcin Stolarek from comment #62)
> The fix for a deadlock was merged into slurm-20.11 branch 6db0aca5a2d69[1]
> and will be released in Slurm 20.11.6.

Thanks for the fix!

Cheers,
-- Kilian

*** Ticket 12218 has been marked as a duplicate of this ticket. ***
We upgraded to slurm v20.11.2 (from 20.02.6) last week. Since the upgrade we've noticed slow response times from the slurmctld and jobs failing to complete. The slurmctld process is consuming a higher than usual CPU load. Apologies in advance for the lack of specific information, but we're having trouble tracking this down. Please let us know if you need any more info.

We're getting quite a few timeout issues on the slurmctld, along with pending-RPCs-at-cycle-end warnings:

[2021-01-11T14:02:58.593] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2021-01-11T14:03:00.084] error: slurm_receive_msg [10.111.10.4:49986]: Socket timed out on send/recv operation
[2021-01-11T14:03:04.788] error: slurm_receive_msg [10.111.16.114:40806]: Socket timed out on send/recv operation
[2021-01-11T14:03:14.712] error: slurm_receive_msg [10.111.13.14:44926]: Socket timed out on send/recv operation
[2021-01-11T14:03:14.713] error: slurm_receive_msg [10.111.8.112:57796]: Socket timed out on send/recv operation

We're also seeing timeout issues on the slurmd side, specifically the node health check failing:

[2021-01-09T01:01:42.706] error: Unable to register: Socket timed out on send/recv operation
[2021-01-09T01:01:50.511] health_check: timeout after 60s: killing pgid 9793

All sinfo, squeue, etc. utilities are extremely slow (sometimes taking minutes to respond).

On the user side of things we're seeing the following behavior: the srun command fails to complete. It hangs, giving the following errors:

srun: Job 4874039 step creation temporarily disabled, retrying (Socket timed out on send/recv operation)
srun: Job 4874039 step creation still disabled, retrying (Requested nodes are busy)

The job stays in that state until the requested time has expired. Any srun command can encounter this problem, for example the following line in our Slurm scripts:

NPROCS=`srun --nodes=${SLURM_NNODES} bash -c 'hostname' | wc -l`

It did not make a difference whether or not the job had the --exclusive flag.
Here's the output from sdiag from last week when we first started noticing the issues:

*******************************************************
sdiag output at Fri Jan 08 18:37:06 2021 (1610149026)
Data since      Thu Jan 07 19:00:06 2021 (1610064006)
*******************************************************
Server thread count: 256
Agent queue size:    0
Agent count:         1
Agent thread count:  12
DBD Agent queue size: 0

Jobs submitted: 5476
Jobs started:   4967
Jobs completed: 3493
Jobs canceled:  1019
Jobs failed:    8

Job states ts:  Fri Jan 08 18:36:49 2021 (1610149009)
Jobs pending:   2409
Jobs running:   1882

Main schedule statistics (microseconds):
    Last cycle:        16456
    Max cycle:         523323
    Total cycles:      7406
    Mean cycle:        13526
    Mean depth cycle:  1201
    Cycles per minute: 5
    Last queue length: 2355

Backfilling stats
    Total backfilled jobs (since last slurm start): 10099
    Total backfilled jobs (since last stats cycle start): 4926
    Total backfilled heterogeneous job components: 0
    Total cycles:      609
    Last cycle when:   Fri Jan 08 18:35:15 2021 (1610148915)
    Last cycle:        10661875
    Max cycle:         137957839
    Mean cycle:        9605637
    Last depth cycle:  2400
    Last depth cycle (try sched): 310
    Depth Mean:        1776
    Depth Mean (try depth): 232
    Last queue length: 2356
    Queue length mean: 1749
    Last table size:   147
    Mean table size:   114

Latency for 1000 calls to gettimeofday(): 48 microseconds

Also, here's the current output of sdiag:

*******************************************************
sdiag output at Mon Jan 11 14:13:20 2021 (1610392400)
Data since      Mon Jan 11 09:17:08 2021 (1610374628)
*******************************************************
Server thread count: 256
Agent queue size:    0
Agent count:         0
Agent thread count:  0
DBD Agent queue size: 0

Jobs submitted: 1412
Jobs started:   806
Jobs completed: 625
Jobs canceled:  53
Jobs failed:    0

Job states ts:  Mon Jan 11 14:12:53 2021 (1610392373)
Jobs pending:   3461
Jobs running:   1580

Main schedule statistics (microseconds):
    Last cycle:        17652
    Max cycle:         129673
    Total cycles:      1845
    Mean cycle:        14047
    Mean depth cycle:  1201
    Cycles per minute: 6
    Last queue length: 2321

Backfilling stats
    Total backfilled jobs (since last slurm start): 774
    Total backfilled jobs (since last stats cycle start): 774
    Total backfilled heterogeneous job components: 0
    Total cycles:      123
    Last cycle when:   Mon Jan 11 14:12:08 2021 (1610392328)
    Last cycle:        20325399
    Max cycle:         99182802
    Mean cycle:        19552938
    Last depth cycle:  2397
    Last depth cycle (try sched): 465
    Depth Mean:        1909
    Depth Mean (try depth): 434
    Last queue length: 2321
    Queue length mean: 1853
    Last table size:   202
    Mean table size:   194

Latency for 1000 calls to gettimeofday(): 23 microseconds

Thanks for any help!
--Andrew
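Comparing the two dumps above, the backfill "Mean cycle" roughly doubled between the pre-issue and current samples (9605637 us vs 19552938 us; both values copied from the sdiag output). The ratio:

```shell
#!/bin/sh
# Backfill mean-cycle growth between the two sdiag samples above.
awk 'BEGIN { printf "mean backfill cycle grew %.2fx (9.6s -> 19.6s)\n",
             19552938 / 9605637 }'
```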