The system has 1,680 compute nodes. When I ran srun across 1,200 nodes, srun got stuck: I got hundreds of responses, and the rest failed.

# srun -n 1200 -N 1200 hostname

When I ran many tasks on a smaller number of nodes, it succeeded.

# srun -n 18432 -N 144 hostname

I think I need to tune some parameters, but I don't know what to change. Do you have any suggestions?
The OS of the compute nodes is RHEL 8.2.

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
# uname -r
4.18.0-193.el8.x86_64
Can you upload your slurm.conf, topology.conf, and the slurmctld.log? Also, do you have a timestamp of when you ran the srun command with 1200 nodes?

~Colby
Created attachment 15852 [details] slurm.conf of the system
Created attachment 15853 [details] topology.conf of the system
Created attachment 15854 [details] slurmctld.log

Hi, I attached the files. Could you check them?

JobID 12: 1200 nodes, failed
JobID 13: 1100 nodes, failed
JobID 14: 1000 nodes, succeeded

       JobID    JobName      NCPUS   NNodes      State               Start                 End
------------ ---------- ---------- -------- ---------- ------------------- -------------------
12             hostname     153600     1200 CANCELLED+ 2020-09-10T15:48:24 2020-09-10T15:50:59
12.0           hostname       1200     1200     FAILED 2020-09-10T15:48:24 2020-09-10T15:48:49
13             hostname     140800     1100 CANCELLED+ 2020-09-10T15:51:44 2020-09-10T15:52:58
13.0           hostname       1100     1100     FAILED 2020-09-10T15:51:44 2020-09-10T15:52:09
14             hostname     128000     1000  COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
14.0           hostname       1000     1000  COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
Hi, is there any update? We'd like to run HPL on all nodes.
Sorry about the delay. What happens when you run

>srun -n 18432 -N 400 hostname

and what happens when you run

>srun -n 18432 -N 450 hostname

~Colby
Thank you for the update. I can run both commands successfully.

>srun -n 18432 -N 400 hostname
>srun -n 18432 -N 450 hostname

$ sacct -o jobid,nnodes,ncpus,ntasks,state,exitcode
       JobID   NNodes      NCPUS   NTasks      State ExitCode
------------ -------- ---------- -------- ---------- --------
8592              400      51200           COMPLETED      0:0
8592.0            400      18432    18432  COMPLETED      0:0
8593              450      57600           COMPLETED      0:0
8593.0            450      18432    18432  COMPLETED      0:0
Try commenting out

TopologyPlugin=topology/tree
RoutePlugin=route/topology

in the slurm.conf, reload slurmctld, and run

>srun -n 1200 -N 1200 hostname

and

>srun -n 18432 -N 450 hostname

You do not need to run 18432 tasks; you can run fewer. Also, what type of network do you have set up for your cluster?

~Colby
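For reference, a minimal sketch of the slurm.conf change being asked for here (only the two plugin lines come from this ticket; the restart command assumes a systemd-managed controller, so adjust to your site):

```
# slurm.conf -- temporarily disable topology-aware scheduling and routing
#TopologyPlugin=topology/tree
#RoutePlugin=route/topology
```

After saving the file, restart the controller so the plugin change takes effect, e.g. `systemctl restart slurmctld` on the head node, and restart slurmd on the compute nodes if your site's procedure calls for it.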
Hi, I disabled the topology settings and restarted slurmctld and slurmd:

#TopologyPlugin=topology/tree
#RoutePlugin=route/topology

Then I ran:

>srun -n 1200 -N 1200 hostname
>srun -n 18432 -N 450 hostname

The latter succeeded. The former failed with the following error messages:

srun: error: task 1121 launch failed: Slurmd could not connect IO
srun: error: task 1077 launch failed: Slurmd could not connect IO
srun: error: task 390 launch failed: Slurmd could not connect IO
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:11881.0 tasks 390,1077,1121: failed to start
srun: step:11881.0 tasks 0-53,102-362,367,371,377-378,380,383,394,408-715,717,725,745,765-913,915-1064,1066-1074,1076,1095,1122-1199: exited
srun: step:11881.0 tasks 54-101,363-366,368-370,372-376,379,381-382,384-389,391-393,395-407,716,718-724,726-744,746-764,914,1065,1075,1078-1094,1096-1120: unknown
^Csrun: sending Ctrl-C to job 11881.0
srun: Job step 11881.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
OK, I need a bit more info from you. Run with the topology commented out in the slurm.conf:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

Then uncomment the topology and run:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

and send me the log files.
Created attachment 16036 [details] srun debug logs

Hi, I attached the requested logs.

topology_enabled: the log with topology/tree
topology_disabled: the log without topology/tree
Thank you for those logs. Let's try a few more things. In the slurm.conf, add a TCPTimeout field. Try first with

>TCPTimeout=30

and run

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

If it still doesn't work, try again with a timeout of 60.

If neither of those works, you will need to look in the slurmd logs on the nodes for error messages that look something like:

>Could not open output file
>Could not open error file

>egrep "Could not open output file|Could not open error file" slurmd.log

Running that command on the nodes should track down the error we are looking for. Send us some of the slurmd.log files from the nodes that show the error.

~Colby
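To illustrate what a hit from that egrep would look like, here is a small self-contained sketch: it writes a sample log fragment (the log lines and paths are illustrative, not taken from this system) and then runs the same pattern against it. `grep -E` is used in place of the deprecated `egrep` spelling but matches identically:

```shell
# Create a sample slurmd.log fragment (illustrative content only)
printf '%s\n' \
  'error: Could not open output file /home/user/out.txt' \
  'debug: normal startup line' \
  > /tmp/slurmd_sample.log

# Same pattern the ticket suggests; prints only the matching error lines
grep -E "Could not open output file|Could not open error file" /tmp/slurmd_sample.log
```

On a real cluster you would run the grep against the actual slurmd log path on each affected node (the path varies by site; check SlurmdLogFile in your slurm.conf).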
Checking back to see if adding the TCPTimeout helped.
I tried the setting below, and srun still failed.

>TCPTimeout=30

No error log was found with the following command:

>egrep "Could not open output file|Could not open error file" slurmd.log

I haven't tried TCPTimeout=60 yet. The system has now gone into production, so I cannot touch it. I might have a chance to try it next month.
When you have access to the system again, if setting the TCPTimeout to 60 does not work either, reopen this ticket or open a new one. I have to close this one for now.
changing status