Ticket 9796

Summary: srun hangs with around 1000 nodes
Product: Slurm Reporter: issp2020support
Component: slurmstepd Assignee: Director of Support <support>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: nate
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
Site: U of Tokyo Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf of the system
topology.conf of the system
slurmctld.log
srun debug logs

Description issp2020support 2020-09-10 01:05:14 MDT
The system has 1,680 compute nodes.

When I ran srun with 1,200 nodes, srun got stuck.
 * I got hundreds of responses, and the rest failed.
# srun -n 1200 -N 1200 hostname

When I ran many tasks on a small number of nodes, it succeeded.
# srun -n 18432 -N 144 hostname


I think I need to tune some parameters, but I don't know which ones.
Do you have any suggestions?
Comment 1 issp2020support 2020-09-10 01:07:25 MDT
The OS of the compute nodes is RHEL 8.2.

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

# uname -r
4.18.0-193.el8.x86_64
Comment 3 Colby Ashley 2020-09-10 12:07:17 MDT
Can you upload your slurm.conf, topology.conf, and the slurmctld.log? Also, do you have a timestamp for when you ran the srun command with 1200 nodes?

~Colby
Comment 4 issp2020support 2020-09-10 17:25:55 MDT
Created attachment 15852 [details]
slurm.conf of the system
Comment 5 issp2020support 2020-09-10 17:26:25 MDT
Created attachment 15853 [details]
topology.conf of the system
Comment 6 issp2020support 2020-09-10 17:32:17 MDT
Created attachment 15854 [details]
slurmctld.log

Hi, I attached the files.
Could you check them?


JobID 12: 1200 nodes, failed
JobID 13: 1100 nodes, failed
JobID 14: 1000 nodes, succeeded

       JobID    JobName      NCPUS   NNodes      State               Start                 End
------------ ---------- ---------- -------- ---------- ------------------- -------------------
12             hostname     153600     1200 CANCELLED+ 2020-09-10T15:48:24 2020-09-10T15:50:59
12.0           hostname       1200     1200     FAILED 2020-09-10T15:48:24 2020-09-10T15:48:49
13             hostname     140800     1100 CANCELLED+ 2020-09-10T15:51:44 2020-09-10T15:52:58
13.0           hostname       1100     1100     FAILED 2020-09-10T15:51:44 2020-09-10T15:52:09
14             hostname     128000     1000  COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
14.0           hostname       1000     1000  COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
Comment 7 issp2020support 2020-09-17 06:50:53 MDT
Hi,

Is there any update?
We'd like to run HPL on all nodes.
Comment 8 Colby Ashley 2020-09-17 10:36:17 MDT
Sorry about the delay.

What happens when you run

srun -n 18432 -N 400 hostname

and what happens when you run

srun -n 18432 -N 450 hostname

~Colby
Comment 9 issp2020support 2020-09-17 14:56:11 MDT
Thank you for the update.

I was able to run the following commands successfully.
>srun -n 18432 -N 400 hostname
>srun -n 18432 -N 450 hostname

$ sacct -o jobid,nnodes,ncpus,ntasks,state,exitcode
       JobID   NNodes      NCPUS   NTasks      State ExitCode
------------ -------- ---------- -------- ---------- --------
8592              400      51200           COMPLETED      0:0
8592.0            400      18432    18432  COMPLETED      0:0
8593              450      57600           COMPLETED      0:0
8593.0            450      18432    18432  COMPLETED      0:0
Comment 10 Colby Ashley 2020-09-18 13:37:36 MDT
Try commenting out

TopologyPlugin=topology/tree
RoutePlugin=route/topology

in the slurm.conf, restart slurmctld, and run

srun -n 1200 -N 1200 hostname

and

srun -n 18432 -N 450 hostname

You do not need to run 18432 tasks, though; you can run fewer.

Also, what type of network do you have set up for your cluster?

~Colby
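The edit suggested above amounts to commenting out two lines in slurm.conf. Here is a sketch of that change run against a throwaway copy of the file (the /tmp path and the two-line file contents are assumptions for illustration; on a real system you would edit /etc/slurm/slurm.conf and then restart slurmctld, since plugin changes generally require a restart rather than `scontrol reconfigure`):

```shell
# Work on a throwaway copy so this sketch is safe to run anywhere
cat > /tmp/slurm.conf.test <<'EOF'
TopologyPlugin=topology/tree
RoutePlugin=route/topology
EOF

# Comment out both topology-related settings
sed -i -e 's/^TopologyPlugin=/#TopologyPlugin=/' \
       -e 's/^RoutePlugin=/#RoutePlugin=/' /tmp/slurm.conf.test

cat /tmp/slurm.conf.test
```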
Comment 11 issp2020support 2020-09-18 21:07:07 MDT
Hi,

I disabled the topology settings and restarted slurmctld and slurmd,
#TopologyPlugin=topology/tree
#RoutePlugin=route/topology

then ran
>srun -n 1200 -N 1200 hostname
>srun -n 18432 -N 450 hostname

The latter succeeded.
The former failed with the following error messages:

srun: error: task 1121 launch failed: Slurmd could not connect IO
srun: error: task 1077 launch failed: Slurmd could not connect IO
srun: error: task 390 launch failed: Slurmd could not connect IO
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:11881.0 tasks 390,1077,1121: failed to start
srun: step:11881.0 tasks 0-53,102-362,367,371,377-378,380,383,394,408-715,717,725,745,765-913,915-1064,1066-1074,1076,1095,1122-1199: exited
.srun: step:11881.0 tasks 54-101,363-366,368-370,372-376,379,381-382,384-389,391-393,395-407,716,718-724,726-744,746-764,914,1065,1075,1078-1094,1096-1120: unknown
^Csrun: sending Ctrl-C to job 11881.0
srun: Job step 11881.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Comment 12 Colby Ashley 2020-09-24 13:10:56 MDT
OK, I need a bit more info from you.

Run with the topology commented out in the slurm.conf:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

Then uncomment the topology lines and run

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

and send me the log files.
Comment 13 issp2020support 2020-09-24 19:22:12 MDT
Created attachment 16036 [details]
srun debug logs

Hi,

I attached the logs requested.
  topology_enabled:  the log with    topology/tree
  topology_disabled: the log without topology/tree
Comment 15 Colby Ashley 2020-09-28 13:46:53 MDT
Thank you for those logs. Let's try a few more things.

In the slurm.conf, add a TCPTimeout parameter. Try first with

>TCPTimeout=30

and run

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

If it still doesn't work, try again with a timeout of 60. If neither of those works, you will need to look in the slurmd logs on the nodes for error messages that look something like

>Could not open output file 
>Could not open error file

>egrep "Could not open output file|Could not open error file" slurmd.log
Running that command on the nodes should track down the error we are looking for. Send us some of the slurmd.log files from the nodes that show the error.

~Colby
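The log search above can be demonstrated in isolation; this sketch fabricates a small slurmd.log sample (the /tmp path and log line contents are assumptions) and applies the same egrep. On a real cluster the command would be run per node against the actual slurmd log path, for example via pdsh or a one-task-per-node srun:

```shell
# Demo: a fabricated slurmd.log containing one of the errors being grepped for
cat > /tmp/slurmd.log <<'EOF'
launch task 11881.0 request from uid 0
error: Could not open output file /home/user/out.txt
EOF

# The same search Colby suggests; prints any matching error lines
egrep "Could not open output file|Could not open error file" /tmp/slurmd.log
```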
Comment 17 Colby Ashley 2020-10-05 12:02:37 MDT
Checking back to see if adding the TCPTimeout helped.
Comment 18 issp2020support 2020-10-07 01:33:08 MDT
I tried the setting below, and srun failed.
>TCPTimeout=30

No error log was found with the following command:
>egrep "Could not open output file|Could not open error file" slurmd.log

I haven't tried TCPTimeout=60 yet.
The system has now gone into production, so I cannot touch it.
I might have a chance to try it next month.
Comment 19 Colby Ashley 2020-10-07 13:25:23 MDT
When you have access to the system again, if setting TCPTimeout to 60 does not work either, please reopen this ticket or open a new one. I have to close this one for now.
Comment 20 Colby Ashley 2020-10-07 13:33:36 MDT
changing status