We have been trying to troubleshoot job scaling issues on our cluster, and the OSU benchmark that we are running shows a big difference in performance when running interactively versus in batch mode. Can you provide us with some insight into what could be contributing to this behavior?
Hi Surendra,

I am aware of your problem and I also remember our conversation at SC. I need some information first, and maybe we can talk on a call later on; does that sound good? Here is my request:

- I would like to know the details of the tests you have done and how you ran them, namely the exact commands you used and the related slurmctld and slurmd logs. I need the times of the executions to correlate them with the logs.
- I understand all the benchmarks you mention have succeeded, but that when running "interactively" (what does that mean exactly? outside Slurm? salloc? srun?) they are faster than when running via sbatch.
- I would need a quick picture of the infrastructure: IB switches, network, nodes, and so on.
- From the OSU benchmark I'd like to know exactly which tests you have run and which of them are slow. Do you have the results? Also, is it slow in general or only in the start phase?
- Do you use pmi2 or pmix?

When I have this information and after I have analyzed it, it will be a better moment for the call. Does it make sense?
Created attachment 14210 [details] salloc rank ordering
I am from HPE and can help explain.

We are running the OSU benchmark named 'osu_mbw_mr'. This job tests the bandwidth of the InfiniBand fabric.

Normally we run with sbatch to submit a job to the scheduler, but we wanted to run an interactive test from a node, so we submitted the job using salloc. The job runs on 512 nodes; each node has 36 cores. When running with salloc, the job finishes in about 36 seconds, but it should take about 30 minutes.

Analysis reveals that when using salloc, Slurm changes the order of the cores in each node:

sbatch: Slurm identifies and orders the cores of the first node as 0-35, the cores of the second node as 36-71, and so on.
salloc: Slurm identifies and orders all the even cores of the nodes first, then all the odd cores.

Because of this ordering with salloc, the MPI ranks are paired within a node (intra-node), but they should be paired between nodes (inter-node) for the job to run correctly. So the bug here is: why does salloc order the cores in the nodes with all even cores first and then all odd cores? I attached a file showing the ordering when using salloc.
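As a hypothetical illustration of why the rank ordering matters (this sketch is not from the ticket's actual runs): osu_mbw_mr typically pairs the first half of the MPI ranks with the second half, so the node-level distribution chosen by srun's -m/--distribution flag decides whether each pair is intra-node or inter-node. A minimal 2-node, 4-tasks-per-node sketch:

  srun --nodes=2 --ntasks-per-node=4 -m block ./osu_mbw_mr
  # block over nodes: ranks 0-3 on the first node, ranks 4-7 on the second,
  # so pairs (0,4), (1,5), ... cross the fabric (inter-node)

  srun --nodes=2 --ntasks-per-node=4 -m cyclic ./osu_mbw_mr
  # cyclic over nodes: even ranks on the first node, odd ranks on the second,
  # so pairs (0,4), (1,5), ... stay on the same node (intra-node)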
Created attachment 14218 [details] whereami.c

(In reply to GregD from comment #9)
> I am from HPE and can help explain.
>
> We are running the OSU benchmark named 'osu_mbw_mr'. This job tests the
> bandwidth of the InfiniBand fabric.
>
> Normally we run with sbatch to submit a job to the scheduler, but we wanted
> to run an interactive test from a node, so we submitted the job using salloc.
> The job runs on 512 nodes; each node has 36 cores. When running with salloc,
> the job finishes in about 36 seconds, but it should take about 30 minutes.

I don't see a direct connection between the binding of tasks to cores and salloc taking 36 seconds instead of 30 minutes. Can you explain? Do you see any error? Can you show me exactly which commands you run and what output you get?

> Analysis reveals that when using salloc, Slurm changes the order of the
> cores in each node:
>
> sbatch: Slurm identifies and orders the cores of the first node as 0-35, the
> cores of the second node as 36-71, and so on.
> salloc: Slurm identifies and orders all the even cores of the nodes first,
> then all the odd cores.

Affinity of a task to a core is done at the node level by the task/affinity plugin. The plugin is the same in both situations because it is managed by slurmd/slurmstepd when the task is launched, so there shouldn't be any difference.

You have some control over how task binding is performed with the '-m, --distribution=' flag of srun/salloc/sbatch. You can also use --hint= to instruct how to bind tasks to cores/threads. TaskPluginParam in slurm.conf also modifies how distribution is done; I suggest leaving it commented out unless you specifically want different behavior. Other options that affect how tasks are distributed to nodes can be found in the slurm.conf man page: CR_Core, CR_CORE_DEFAULT_DIST_BLOCK, CR_Pack_Nodes and others.

> Because of this ordering with salloc, the MPI ranks are paired within a node
> (intra-node), but they should be paired between nodes (inter-node) for the
> job to run correctly.

Do you mean you just want one task per node? Then --ntasks=X --ntasks-per-node=1 may help.

> So the bug here is: why does salloc order the cores in the nodes with all
> even cores first and then all odd cores? I attached a file showing the
> ordering when using salloc.

I can't reproduce it on my system, so I suspect it is something related to configuration. Are we talking about the "eagle" cluster, or is it another test cluster? If it is another cluster, make sure you have the following settings:

slurm.conf:
TaskPlugin = task/affinity,task/cgroup
SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL   <-- may not be needed, depends on your use case

cgroup.conf:
ConstrainCores = yes
TaskAffinity = no

Also, can you run the following test? I have uploaded a whereami.c program; I want you to run it with salloc and sbatch and show me the results. You can spawn as many tasks as needed, e.g. 36.

1. Compile with: gcc -o whereami whereami.c
2. Run:
   srun --nodes=2 --ntasks-per-node=36 whereami
   salloc --nodes=2 --ntasks-per-node=36 -> srun whereami
   sbatch --nodes=2 --ntasks-per-node=36 --wrap "./whereami"
3. You can try other combinations to check the binding.
4. Let me know the results.
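The content of attachment 14218 is not reproduced in this thread; a minimal sketch of what such a whereami program could look like (assuming it only reports the Slurm rank, the hostname and the kernel's affinity mask, which matches the gcc-only build step above) is:

/* whereami.c -- minimal sketch; the actual attachment 14218 may differ.
 * Prints the Slurm task rank, the host it runs on, and the CPU affinity
 * reported by the kernel, so outputs from srun/salloc/sbatch can be compared.
 * Compile with: gcc -o whereami whereami.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char host[256] = "unknown";
    gethostname(host, sizeof(host));

    /* SLURM_PROCID is set by slurmstepd for every task launched by srun */
    const char *rank = getenv("SLURM_PROCID");

    /* Cpus_allowed_list in /proc/self/status shows which cores this task may use */
    char line[512], cpus[512] = "?";
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp) {
        while (fgets(line, sizeof(line), fp)) {
            if (!strncmp(line, "Cpus_allowed_list:", 18)) {
                strncpy(cpus, line, sizeof(cpus) - 1);
                cpus[strcspn(cpus, "\n")] = '\0';  /* trim trailing newline */
                break;
            }
        }
        fclose(fp);
    }

    printf("%s @ %s |%s\n", rank ? rank : "?", host, cpus);
    return 0;
}

Each task would print a line like "4 @ r4i7n35 |Cpus_allowed_list: 2", which matches the format of the outputs quoted later in the ticket.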
Created attachment 14232 [details] sbatch_slurm-3174930.out
Created attachment 14233 [details] srun_out
Greg, can you attach one of your OSU submit scripts to the ticket? We are running these jobs on the eagle cluster.
From the srun output I can say affinity is working so far.

1. I don't see the salloc output attached:
   salloc --nodes=2 --ntasks-per-node=36 -> srun whereami
2. I also want to see:
   sbatch --nodes=2 --ntasks-per-node=36 --wrap "srun ./whereami"
3. Please also send me the output of lstopo or lstopo-no-graphics on a node:
   lstopo or lstopo-no-graphics

Thanks
Created attachment 14262 [details] lstopo
Created attachment 14263 [details] sbatch_9009
Created attachment 14264 [details] salloc_9009
I have attached the salloc, sbatch and lstopo outputs to the ticket.
lstopo shows how your cores are numbered by hwloc:

socket 0: cores 0 to 17
socket 1: cores 18 to 35

The following excerpt from your attached salloc, srun and sbatch logs shows that each task is bound to a different core, alternating between both sockets:

task 0 is bound to the first core of socket 0
task 1 is bound to the first core of socket 1
task 2 is bound to the second core of socket 0
task 3 is bound to the second core of socket 1
...

This is the parsed excerpt which shows that the binding is done correctly:

srun:
0 @ r4i7n35 |Cpus_allowed_list: 0
1 @ r4i7n35 |Cpus_allowed_list: 18
2 @ r4i7n35 |Cpus_allowed_list: 1
3 @ r4i7n35 |Cpus_allowed_list: 19
4 @ r4i7n35 |Cpus_allowed_list: 2
5 @ r4i7n35 |Cpus_allowed_list: 20
6 @ r4i7n35 |Cpus_allowed_list: 3
7 @ r4i7n35 |Cpus_allowed_list: 21
8 @ r4i7n35 |Cpus_allowed_list: 4
9 @ r4i7n35 |Cpus_allowed_list: 22

sbatch:
0 @ r3i7n35 |Cpus_allowed_list: 0
1 @ r3i7n35 |Cpus_allowed_list: 18
2 @ r3i7n35 |Cpus_allowed_list: 1
3 @ r3i7n35 |Cpus_allowed_list: 19
4 @ r3i7n35 |Cpus_allowed_list: 2
5 @ r3i7n35 |Cpus_allowed_list: 20
6 @ r3i7n35 |Cpus_allowed_list: 3
7 @ r3i7n35 |Cpus_allowed_list: 21
8 @ r3i7n35 |Cpus_allowed_list: 4
9 @ r3i7n35 |Cpus_allowed_list: 22

salloc:
0 @ r3i7n35 |Cpus_allowed_list: 0
1 @ r3i7n35 |Cpus_allowed_list: 18
2 @ r3i7n35 |Cpus_allowed_list: 1
3 @ r3i7n35 |Cpus_allowed_list: 19
4 @ r3i7n35 |Cpus_allowed_list: 2
5 @ r3i7n35 |Cpus_allowed_list: 20
6 @ r3i7n35 |Cpus_allowed_list: 3
7 @ r3i7n35 |Cpus_allowed_list: 21
8 @ r3i7n35 |Cpus_allowed_list: 4
9 @ r3i7n35 |Cpus_allowed_list: 22

If this is not what happens in your application, there is a chance that the application itself is changing the affinity of its processes: any process running in a set of allowed cores can change its own binding, and I don't know what your test does internally.

Maybe you can explain in more detail which binding issue you are seeing and how it affects the performance? From here everything seems good at the moment.
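A side note on the pattern above (a sketch, not something requested in this ticket): the alternation between sockets is consistent with a cyclic distribution over sockets, which is commonly the default for the second component of the -m/--distribution flag. If consecutive ranks bound to consecutive cores of the same socket were desired instead, the distribution could be changed, for example:

  srun --nodes=2 --ntasks-per-node=36 -m block:block ./whereami
  # expected: ranks 0-17 on cores 0-17 (socket 0), ranks 18-35 on cores 18-35 (socket 1)

  srun --nodes=2 --ntasks-per-node=36 -m block:cyclic ./whereami
  # expected: ranks alternate between the two sockets, matching the excerpt above

The exact behavior also depends on TaskPluginParam and the rest of the site configuration.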
> Maybe you can explain in more detail which binding issue you are seeing and
> how it affects the performance?
> From here everything seems good at the moment.

Hi Surendra,

Do you have any feedback for me? I am interested in knowing whether you still see affinity issues.

Thanks!
This can be closed. Thanks!
(In reply to surendra from comment #22)
> This can be closed. Thanks!

OK Surendra.

Please reopen if questions arise. I was curious about what the issue turned out to be.

Regards