Ticket 14230 - Calculated placement is wrong with --gpus request
Summary: Calculated placement is wrong with --gpus request
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 22.05.0
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
 
Reported: 2022-06-02 14:49 MDT by Matt Ezell
Modified: 2022-11-02 11:02 MDT
CC: 4 users

See Also:
Site: ORNL-OLCF
Version Fixed: 22.05.5


Attachments
slurm.conf (4.09 KB, text/plain)
2022-06-07 07:38 MDT, Matt Ezell
Details
gres.conf (900 bytes, text/plain)
2022-06-07 07:39 MDT, Matt Ezell
Details

Description Matt Ezell 2022-06-02 14:49:44 MDT
Note: this description was written by Matt Davis, adding on CC

Config: Nodes have Sockets=8 ThreadsPerCore=2 CoresPerSocket=8 CPUs=128.

Description: As documented, when a step is invoked with the --gpus and --ntasks-per-gpu options set (without explicitly setting --ntasks), the number of tasks for the step is calculated. This calculation appears to happen after the distribution for the job has already been determined (see example 1), leading to inconsistencies. It is also documented that the number of CPUs needed for the step will be automatically increased if necessary to accommodate the calculated task count. This no longer appears to be the case, possibly because --exact is implied when --cpus-per-task is explicitly set. The CPU allocation appears to be done before that recalculation is made (see example 2).

Example 1: The distribution across nodes is cyclic because --nnodes=2 and --ntasks=2 are implied when srun is invoked. The task count is recalculated to ntasks-per-gpu*gpus=16 after the distribution has already been set. Block distribution would be expected, since now ntasks>nnodes.

$ salloc -N2 --gpus=16
$ srun -l --ntasks-per-gpu=1 bash -c 'echo $(hostname): $(grep Cpus_allowed_list /proc/self/status)' | sort -nk1
 0: borg001: Cpus_allowed_list: 0
 1: borg002: Cpus_allowed_list: 0
 2: borg001: Cpus_allowed_list: 8
 3: borg002: Cpus_allowed_list: 8
 4: borg001: Cpus_allowed_list: 16
 5: borg002: Cpus_allowed_list: 16
 6: borg001: Cpus_allowed_list: 24
 7: borg002: Cpus_allowed_list: 24
 8: borg001: Cpus_allowed_list: 32
 9: borg002: Cpus_allowed_list: 32
10: borg001: Cpus_allowed_list: 40
11: borg002: Cpus_allowed_list: 40
12: borg001: Cpus_allowed_list: 48
13: borg002: Cpus_allowed_list: 48
14: borg001: Cpus_allowed_list: 56
15: borg002: Cpus_allowed_list: 56
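The ordering problem can be illustrated with a small Python sketch. This is a hypothetical model of the logic described above, not actual Slurm code: the distribution type is chosen while ntasks still holds its implied value, and ntasks is only recalculated from --ntasks-per-gpu afterwards.

```python
# Hypothetical model of the ordering bug; not actual Slurm code.

def pick_distribution(ntasks, nnodes):
    # Per the description: block is expected when ntasks > nnodes,
    # cyclic otherwise.
    return "block" if ntasks > nnodes else "cyclic"

nnodes = 2
gpus = 16
ntasks_per_gpu = 1

# Step 1: srun starts with the implied ntasks == nnodes == 2,
# and the distribution is decided at this point.
ntasks = nnodes
dist = pick_distribution(ntasks, nnodes)   # "cyclic" -- decided too early

# Step 2: ntasks is recalculated only afterwards.
ntasks = ntasks_per_gpu * gpus             # 16

# Result: cyclic distribution with 16 tasks, although block
# would be expected once ntasks (16) > nnodes (2).
print(dist, ntasks)
```

Recomputing the distribution after the task count is known would yield block, matching the expected placement.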



Example 2: In the first srun, the total number of CPUs for the step is calculated using the implied value of --ntasks=1, leading to ntasks*cpus-per-task=2 CPUs allocated for the step. The recalculation of tasks to ntasks-per-gpu*gpus=2 happens after the CPU allocation, leading to the two tasks sharing the two allocated CPUs. In the second srun, --ntasks=2 is made explicit, so the total number of CPUs to be allocated is set correctly. The third srun shows that the CPUs are correctly allocated and bound when --exact is not implied.

$ salloc -N1 --gpus=8
$ srun -l -c2 --gpus=2 --ntasks-per-gpu=1 bash -c 'grep Cpus_allowed_list /proc/self/status' | sort -nk1
0: Cpus_allowed_list:    48,56
1: Cpus_allowed_list:    48,56

$ srun  -l  -n2 -c2 --gpus=2 --ntasks-per-gpu=1 bash -c 'grep Cpus_allowed_list /proc/self/status' | sort -nk1
0: Cpus_allowed_list:    48-49
1: Cpus_allowed_list:    56-57

$ srun -l --gpus=2 --ntasks-per-gpu=1 bash -c 'grep Cpus_allowed_list /proc/self/status' | sort -nk1
0: Cpus_allowed_list:    48
1: Cpus_allowed_list:    56
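A similar sketch (again a hypothetical model, not Slurm code) for the first srun of Example 2: the step's CPU count is sized from the implied ntasks=1 before the task count is recalculated, so the two tasks end up sharing the two allocated CPUs.

```python
# Hypothetical model of Example 2's ordering; not actual Slurm code.

cpus_per_task = 2
gpus = 2
ntasks_per_gpu = 1

# Step 1: CPUs for the step are sized while ntasks still holds
# its implied value of 1.
ntasks = 1
step_cpus = ntasks * cpus_per_task      # 2 CPUs allocated for the step

# Step 2: the task count is recalculated only afterwards.
ntasks = ntasks_per_gpu * gpus          # 2 tasks

# Result: 2 tasks must share the 2 allocated CPUs (1 CPU each),
# instead of the expected ntasks * cpus_per_task == 4 CPUs.
print(step_cpus, ntasks)
```

With an explicit -n2 (the second srun), step 1 would already see ntasks=2 and allocate 4 CPUs, which matches the correct binding shown above.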
Comment 1 Dominik Bartkiewicz 2022-06-07 07:32:20 MDT
Hi

I can reproduce some of the described behaviors.
Could you send me gres.conf and slurm.conf?

Dominik
Comment 2 Matt Ezell 2022-06-07 07:38:53 MDT
Created attachment 25396 [details]
slurm.conf
Comment 3 Matt Ezell 2022-06-07 07:39:43 MDT
Created attachment 25397 [details]
gres.conf
Comment 5 Dominik Bartkiewicz 2022-06-21 09:26:59 MDT
Hi

Did you have a chance to test the patch from bug 14229?
If yes, how does it change behaviors described in the initial comment? 
I have a patch that fixes the first case from Example 2, and it is waiting for QA.

Dominik
Comment 6 Matt Ezell 2022-06-21 21:45:53 MDT
(In reply to Dominik Bartkiewicz from comment #5)
> Hi
> 
> Did you have a chance to test the patch from bug 14229?
> If yes, how does it change behaviors described in the initial comment? 
> I have a patch that fixes the first cases from Example 2, and it is waiting
> for QA.
> 
> Dominik

I have been out of the office since the 15th. We will test the patch from 14229 this week. Thanks!
Comment 7 Matt Ezell 2022-06-22 12:05:42 MDT
(In reply to Dominik Bartkiewicz from comment #5)
> Did you have a chance to test the patch from bug 14229?
> If yes, how does it change behaviors described in the initial comment? 
> I have a patch that fixes the first cases from Example 2, and it is waiting
> for QA.

With the patch from 14229, the behavior listed in this bug seems to be unchanged.
Comment 13 Dominik Bartkiewicz 2022-10-12 05:34:23 MDT
Hi

Sorry that this took so long.
Those commits fix the reported issues and will be included in the next 22.05 release.
https://github.com/SchedMD/slurm/commit/2eb61bb7bb
https://github.com/SchedMD/slurm/commit/cbdac16a19
Please let me know if you have any additional questions or if this ticket is
ready to close.

Dominik
Comment 14 Matthew Davis 2022-10-12 05:40:30 MDT
I am out of the office from the 7th through the 12th.