The system has 1,680 compute nodes. When I ran srun across 1,200 nodes, srun got stuck: I got hundreds of responses, and the rest failed.

# srun -n 1200 -N 1200 hostname

When I ran many tasks on a smaller number of nodes, it succeeded.

# srun -n 18432 -N 144 hostname

I think I need to tune some parameters, but I don't know what to change. Do you have any suggestions?
The OS of the compute nodes is RHEL 8.2.

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
# uname -r
4.18.0-193.el8.x86_64
Can you upload your slurm.conf, topology.conf, and the slurmctld.log? Also, do you have a timestamp of when you ran the srun command with 1200 nodes?

~Colby
Created attachment 15852 [details] slurm.conf of the system
Created attachment 15853 [details] topology.conf of the system
Created attachment 15854 [details] slurmctld.log

Hi, I attached the files. Could you check them?

JobID 12: 1200 nodes, failed
JobID 13: 1100 nodes, failed
JobID 14: 1000 nodes, succeeded

       JobID    JobName      NCPUS   NNodes      State               Start                 End
------------ ---------- ---------- -------- ---------- ------------------- -------------------
12             hostname     153600     1200 CANCELLED+ 2020-09-10T15:48:24 2020-09-10T15:50:59
12.0           hostname       1200     1200     FAILED 2020-09-10T15:48:24 2020-09-10T15:48:49
13             hostname     140800     1100 CANCELLED+ 2020-09-10T15:51:44 2020-09-10T15:52:58
13.0           hostname       1100     1100     FAILED 2020-09-10T15:51:44 2020-09-10T15:52:09
14             hostname     128000     1000  COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
14.0           hostname       1000     1000  COMPLETED 2020-09-10T15:53:39 2020-09-10T15:53:42
Hi, is there any update? We'd like to run HPL on all nodes.
Sorry about the delay. What happens when you run

>srun -n 18432 -N 400 hostname

and what happens when you run

>srun -n 18432 -N 450 hostname

~Colby
Thank you for the update. I can run both commands successfully.

>srun -n 18432 -N 400 hostname
>srun -n 18432 -N 450 hostname

$ sacct -o jobid,nnodes,ncpus,ntasks,state,exitcode
       JobID   NNodes      NCPUS   NTasks      State ExitCode
------------ -------- ---------- -------- ---------- --------
8592              400      51200           COMPLETED      0:0
8592.0            400      18432    18432  COMPLETED      0:0
8593              450      57600           COMPLETED      0:0
8593.0            450      18432    18432  COMPLETED      0:0
Try commenting out

TopologyPlugin=topology/tree
RoutePlugin=route/topology

in the slurm.conf, reload slurmctld, and run

>srun -n 1200 -N 1200 hostname

and

>srun -n 18432 -N 450 hostname

You do not need to run 18432 tasks; you can run fewer. Also, what type of network do you have set up for your cluster?

~Colby
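For reference, a minimal sketch of the slurm.conf change being asked for here (only the two plugin lines come from this ticket; the restart command assumes a systemd-managed controller, so adjust to your site):

```
# slurm.conf -- temporarily disable topology-aware scheduling and routing
#TopologyPlugin=topology/tree
#RoutePlugin=route/topology
```

After saving the file, restart the controller so the plugin change takes effect, e.g. `systemctl restart slurmctld` on the head node, and restart slurmd on the compute nodes if your site's procedure calls for it.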
Hi, I disabled the topology settings and restarted slurmctld and slurmd:

#TopologyPlugin=topology/tree
#RoutePlugin=route/topology

Then I ran:

>srun -n 1200 -N 1200 hostname
>srun -n 18432 -N 450 hostname

The latter succeeded. The former failed with the following error messages:

srun: error: task 1121 launch failed: Slurmd could not connect IO
srun: error: task 1077 launch failed: Slurmd could not connect IO
srun: error: task 390 launch failed: Slurmd could not connect IO
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:11881.0 tasks 390,1077,1121: failed to start
srun: step:11881.0 tasks 0-53,102-362,367,371,377-378,380,383,394,408-715,717,725,745,765-913,915-1064,1066-1074,1076,1095,1122-1199: exited
srun: step:11881.0 tasks 54-101,363-366,368-370,372-376,379,381-382,384-389,391-393,395-407,716,718-724,726-744,746-764,914,1065,1075,1078-1094,1096-1120: unknown
^Csrun: sending Ctrl-C to job 11881.0
srun: Job step 11881.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
OK, I need a bit more info from you. Run with the topology commented out in the slurm.conf:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

Then uncomment the topology and run:

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

and send me the log files.
Created attachment 16036 [details] srun debug logs

Hi, I attached the requested logs.

topology_enabled: the log with topology/tree
topology_disabled: the log without topology/tree
Thank you for those logs. Let's try a few more things. In the slurm.conf, add a TCPTimeout field. Try first with

>TCPTimeout=30

and run

>srun --slurmd-debug=4 -n 1200 -N 1200 hostname

If it still doesn't work, try again with a timeout of 60.

If neither of those works, you will need to look in the slurmd logs on the nodes for error messages that look something like:

>Could not open output file
>Could not open error file

>egrep "Could not open output file|Could not open error file" slurmd.log

Running that command on the nodes should track down the error we are looking for. Send us some of the slurmd.log files from the nodes that show the error.

~Colby
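To illustrate what a hit from that egrep would look like, here is a small self-contained sketch: it writes a sample log fragment (the log lines and paths are illustrative, not taken from this system) and then runs the same pattern against it. `grep -E` is used in place of the deprecated `egrep` spelling but matches identically:

```shell
# Create a sample slurmd.log fragment (illustrative content only)
printf '%s\n' \
  'error: Could not open output file /home/user/out.txt' \
  'debug: normal startup line' \
  > /tmp/slurmd_sample.log

# Same pattern the ticket suggests; prints only the matching error lines
grep -E "Could not open output file|Could not open error file" /tmp/slurmd_sample.log
```

On a real cluster you would run the grep against the actual slurmd log path on each affected node (the path varies by site; check SlurmdLogFile in your slurm.conf).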
Checking back to see if adding the TCPTimeout helped.
I tried the setting below, and srun still failed.

>TCPTimeout=30

No error log was found with the following command:

>egrep "Could not open output file|Could not open error file" slurmd.log

I haven't tried TCPTimeout=60 yet. The system has now gone into production, so I cannot touch it. I might have a chance to try it next month.
When you have access to the system again, if setting the TCPTimeout to 60 does not work either, reopen this ticket or open a new one. I have to close this one for now.
changing status