| Summary: | srun hangs with around 1000 nodes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | issp2020support |
| Component: | slurmstepd | Assignee: | Director of Support <support> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | nate |
| Version: | 20.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | U of Tokyo | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf of the system, topology.conf of the system, slurmctld.log, srun debug logs | | |
Description
issp2020support
2020-09-10 01:05:14 MDT
The OS of the compute nodes is RHEL 8.2.

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
# uname -r
4.18.0-193.el8.x86_64

Can you upload your slurm.conf, topology.conf, and the slurmctld.log? Also, do you have a timestamp of when you ran the srun command with 1200 nodes?

~Colby

Created attachment 15852 [details]
slurm.conf of the system
Created attachment 15853 [details]
topology.conf of the system
Created attachment 15854 [details]
slurmctld.log
Hi, I attached the files.
Could you check them?
JobID 12: 1200 nodes, failed
JobID 13: 1100 nodes, failed
JobID 14: 1000 nodes, succeeded
JobID JobName NCPUS NNodes State Start End
------------ ---------- ---------- -------- ---------- ------------------- -------------------
12 hostname 153600 1200 CANCELLED+ 2020-09-10T15:48:24 2020-09-10T15:50:59
12.0 hostname 1200 1200 FAILED 2020-09-10T15:48:24 2020-09-10T15:48:49
13 hostname 140800 1100 CANCELLED+ 2020-09-10T15:51:44 2020-09-10T15:52:58
13.0 hostname 1100 1100 FAILED 2020-09-10T15:51:44 2020-09-10T15:52:09
14 hostname 128000 1000 COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
14.0 hostname 1000 1000 COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
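As a quick sanity check on the accounting above, the NCPUS and NNodes columns are mutually consistent: every job allocates the same number of CPUs per node. A small Python check, with the numbers taken directly from the table:

```python
# NCPUS and NNodes from the sacct output above (job id -> (ncpus, nnodes)).
jobs = {12: (153600, 1200), 13: (140800, 1100), 14: (128000, 1000)}

for jobid, (ncpus, nnodes) in jobs.items():
    # Each allocation works out to the same per-node CPU count.
    print(f"job {jobid}: {ncpus // nnodes} CPUs per node")
```

This confirms the jobs differ only in node count, not in per-node layout, so the failure threshold is about scale rather than job shape.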
Hi, is there any update? We'd like to run HPL on all nodes.

Sorry about the delay. What happens when you run

>srun -n 18432 -N 400 hostname

and when you run

>srun -n 18432 -N 450 hostname

~Colby

Thank you for the update.
I can run the following commands successfully.
>srun -n 18432 -N 400 hostname
>srun -n 18432 -N 450 hostname
$ sacct -o jobid,nnodes,ncpus,ntasks,state,exitcode
JobID NNodes NCPUS NTasks State ExitCode
------------ -------- ---------- -------- ---------- --------
8592 400 51200 COMPLETED 0:0
8592.0 400 18432 18432 COMPLETED 0:0
8593 450 57600 COMPLETED 0:0
8593.0 450 18432 18432 COMPLETED 0:0
Try commenting out

>TopologyPlugin=topology/tree
>RoutePlugin=route/topology

in the slurm.conf, reload slurmctld, and run

>srun -n 1200 -N 1200 hostname
>srun -n 18432 -N 450 hostname

You do not need to run 18432 tasks; you can run fewer. Also, what type of network do you have set up for your cluster?

~Colby

Hi,
I disabled topology settings and restarted slurmctld and slurmd,
#TopologyPlugin=topology/tree
#RoutePlugin=route/topology
then ran
>srun -n 1200 -N 1200 hostname
>srun -n 18432 -N 450 hostname
The second command (-N 450) succeeded.
The first one (-N 1200) failed with the following error messages:
srun: error: task 1121 launch failed: Slurmd could not connect IO
srun: error: task 1077 launch failed: Slurmd could not connect IO
srun: error: task 390 launch failed: Slurmd could not connect IO
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:11881.0 tasks 390,1077,1121: failed to start
srun: step:11881.0 tasks 0-53,102-362,367,371,377-378,380,383,394,408-715,717,725,745,765-913,915-1064,1066-1074,1076,1095,1122-1199: exited
.srun: step:11881.0 tasks 54-101,363-366,368-370,372-376,379,381-382,384-389,391-393,395-407,716,718-724,726-744,746-764,914,1065,1075,1078-1094,1096-1120: unknown
^Csrun: sending Ctrl-C to job 11881.0
srun: Job step 11881.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
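As an aside, the compressed task lists in the srun messages above (e.g. "0-53,102-362,367,...") can be expanded to count how many tasks ended up in each state. A minimal Python sketch; the helper name is mine, not part of Slurm:

```python
def expand_task_list(spec):
    """Expand a Slurm-style task list such as "0-53,367,371" into a list of ints."""
    tasks = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            tasks.extend(range(int(lo), int(hi) + 1))
        else:
            tasks.append(int(part))
    return tasks

# The three tasks that failed to start in step 11881.0 above.
failed = expand_task_list("390,1077,1121")
print(len(failed), failed)
```

Expanding all three lists (failed, exited, unknown) and comparing their total against the 1200 launched tasks makes it easy to see which nodes never reported back.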
OK, I need a bit more info from you. Run with the topology commented out in the slurm.conf:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

Then uncomment the topology and run:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

and send me the log files.

Created attachment 16036 [details]
srun debug logs
Hi,
I attached the logs requested.
topology_enabled: the log with topology/tree
topology_disabled: the log without topology/tree
Thank you for those logs. Let's try a few more things. In the slurm.conf, add a TCPTimeout field. Try first with

>TCPTimeout=30

and run

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

If it still doesn't work, try again with a timeout of 60. If both of those do not work, you will need to look in the slurmd logs on the nodes for error messages that look something like

>Could not open output file
>Could not open error file

>egrep "Could not open output file|Could not open error file" slurmd.log

Running that command on the nodes should track down the error we are looking for. Send us some of the slurmd.log files from those nodes that have the error.

~Colby

Checking back to see if adding the TCPTimeout helped.

I tried the setting below, and srun failed.

>TCPTimeout=30

No error log was found with the following command.

>egrep "Could not open output file|Could not open error file" slurmd.log

I haven't tried TCPTimeout=60 yet. The system has now gone into production, so I cannot touch it. I might have a chance to try it next month.

When you have access to the system again, if setting the TCPTimeout to 60 does not work either, reopen this ticket or open a new one. I have to close this one for now.

Changing status.
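For reference, the knobs exercised over the course of this ticket all live in slurm.conf. A sketch of the relevant lines, with the values as tried here rather than as recommendations:

```
# Topology-aware scheduling and message routing
# (commented out for one of the tests in this ticket)
TopologyPlugin=topology/tree
RoutePlugin=route/topology

# TCP connect timeout in seconds, raised while debugging
TCPTimeout=30
```

After editing, slurmctld needs to be reloaded (and slurmd restarted, as the reporter did) for the topology change to take effect.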