Ticket 14230

Summary: Calculated placement is wrong with --gpus request
Product: Slurm
Reporter: Matt Ezell <ezellma>
Component: User Commands
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: brian.gilmer, davismj, lyeager, vergaravg
Version: 22.05.0
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=14229
Site: ORNL-OLCF
Version Fixed: 22.05.5
Attachments: slurm.conf
gres.conf

Description Matt Ezell 2022-06-02 14:49:44 MDT
Note: this description was written by Matt Davis, adding on CC

Config: Nodes have Sockets=8 ThreadsPerCore=2 CoresPerSocket=8 CPUs=128.

Description: As documented, when a step is invoked with the --gpus and --ntasks-per-gpu options set (without explicitly setting --ntasks), the number of tasks for the step is calculated automatically. This appears to happen after the distribution for the job has already been determined (see example 1), leading to inconsistencies.

It is also documented that the number of CPUs needed for the step will be automatically increased, if necessary, to accommodate the calculated task count. This no longer appears to happen, possibly because --exact is implied when --cpus-per-task is set explicitly. The CPU allocation appears to be done before that recalculation is made (see example 2).
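As a rough sketch of the documented ordering (assumed from the srun documentation, not from the Slurm source; variable names are illustrative only), the task count should be derived from --ntasks-per-gpu and --gpus before the node distribution is chosen:

```shell
# Values from Example 1 below.
gpus=16
ntasks_per_gpu=1
nnodes=2

# Step 1: derive the task count from the GPU request.
ntasks=$(( ntasks_per_gpu * gpus ))

# Step 2: only then pick the distribution; block is expected
# whenever ntasks exceeds nnodes.
if [ "$ntasks" -gt "$nnodes" ]; then
    dist=block
else
    dist=cyclic
fi
echo "ntasks=$ntasks dist=$dist"
```

The reported bug is that the second step effectively happens first, using the implied ntasks=2.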

Example 1: The distribution across nodes is cyclic because --nnodes=2 and --ntasks=2 are implied when srun is invoked. The task count is then recalculated to ntasks-per-gpu*gpus=16 after the distribution has been set. Block distribution is expected, since ntasks>nnodes after the recalculation.

$ salloc -N2 --gpus=16
$ srun -l --ntasks-per-gpu=1 bash -c 'echo $(hostname): $(grep Cpus_allowed_list /proc/self/status)' | sort -nk1
 0: borg001: Cpus_allowed_list: 0
 1: borg002: Cpus_allowed_list: 0
 2: borg001: Cpus_allowed_list: 8
 3: borg002: Cpus_allowed_list: 8
 4: borg001: Cpus_allowed_list: 16
 5: borg002: Cpus_allowed_list: 16
 6: borg001: Cpus_allowed_list: 24
 7: borg002: Cpus_allowed_list: 24
 8: borg001: Cpus_allowed_list: 32
 9: borg002: Cpus_allowed_list: 32
10: borg001: Cpus_allowed_list: 40
11: borg002: Cpus_allowed_list: 40
12: borg001: Cpus_allowed_list: 48
13: borg002: Cpus_allowed_list: 48
14: borg001: Cpus_allowed_list: 56
15: borg002: Cpus_allowed_list: 56
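A possible workaround, assuming the layout only goes wrong when srun has to infer the task count (an untested assumption, not verified against this cluster), is to pass -n and -m explicitly so the distribution does not depend on the late recalculation:

```shell
# Build the explicit invocation; -m block requests the expected
# block distribution instead of relying on the recalculation.
gpus=16
ntasks_per_gpu=1
ntasks=$(( ntasks_per_gpu * gpus ))
echo "srun -l -n${ntasks} -m block --ntasks-per-gpu=${ntasks_per_gpu} hostname"
```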

Example 2: In the first srun, the total number of CPUs for the step is calculated using the implied value of --ntasks=1, leading to ntasks*cpus-per-task=2 CPUs allocated for the step. The recalculation of tasks to ntasks-per-gpu*gpus=2 happens after the CPU allocation, so the two tasks share the two allocated CPUs. In the second srun, --ntasks=2 is made explicit, so the total number of CPUs to allocate is set correctly. The third srun shows that the CPUs are correctly allocated and bound when --exact is not implied.

$ salloc -N1 --gpus=8
$ srun -l -c2 --gpus=2 --ntasks-per-gpu=1 bash -c 'grep Cpus_allowed_list /proc/self/status' | sort -nk1
0: Cpus_allowed_list:    48,56
1: Cpus_allowed_list:    48,56

$ srun  -l  -n2 -c2 --gpus=2 --ntasks-per-gpu=1 bash -c 'grep Cpus_allowed_list /proc/self/status' | sort -nk1
0: Cpus_allowed_list:    48-49
1: Cpus_allowed_list:    56-57

$ srun -l --gpus=2 --ntasks-per-gpu=1 bash -c 'grep Cpus_allowed_list /proc/self/status' | sort -nk1
0: Cpus_allowed_list:    48
1: Cpus_allowed_list:    56
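The CPU accounting in the first srun of Example 2 can be sketched as follows (a reconstruction from the observed behavior, not from the scheduler code):

```shell
# Options from the first srun of Example 2.
cpus_per_task=2
gpus=2
ntasks_per_gpu=1

implied_ntasks=1                                  # --ntasks not given
step_cpus=$(( implied_ntasks * cpus_per_task ))   # CPUs reserved for the step
recalc_ntasks=$(( ntasks_per_gpu * gpus ))        # task count fixed up later

# The recalculated tasks share the CPUs reserved earlier instead of
# each getting cpus-per-task CPUs.
wanted_cpus=$(( recalc_ntasks * cpus_per_task ))
echo "reserved=$step_cpus wanted=$wanted_cpus"
```

This matches the output above: both tasks are bound to the same two CPUs (48,56) instead of two CPUs each.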
Comment 1 Dominik Bartkiewicz 2022-06-07 07:32:20 MDT
Hi

I can reproduce some of the described behaviors.
Could you send me gres.conf and slurm.conf?

Dominik
Comment 2 Matt Ezell 2022-06-07 07:38:53 MDT
Created attachment 25396 [details]
slurm.conf
Comment 3 Matt Ezell 2022-06-07 07:39:43 MDT
Created attachment 25397 [details]
gres.conf
Comment 5 Dominik Bartkiewicz 2022-06-21 09:26:59 MDT
Hi

Did you have a chance to test the patch from bug 14229?
If so, how does it change the behaviors described in the initial comment?
I have a patch that fixes the first case from Example 2, and it is waiting for QA.

Dominik
Comment 6 Matt Ezell 2022-06-21 21:45:53 MDT
(In reply to Dominik Bartkiewicz from comment #5)
> Hi
> 
> Did you have a chance to test the patch from bug 14229?
> If yes, how does it change behaviors described in the initial comment? 
> I have a patch that fixes the first cases from Example 2, and it is waiting
> for QA.
> 
> Dominik

I have been out of the office since the 15th. We will test the patch from 14229 this week. Thanks!
Comment 7 Matt Ezell 2022-06-22 12:05:42 MDT
(In reply to Dominik Bartkiewicz from comment #5)
> Did you have a chance to test the patch from bug 14229?
> If yes, how does it change behaviors described in the initial comment? 
> I have a patch that fixes the first cases from Example 2, and it is waiting
> for QA.

With the patch from 14229, the behavior listed in this bug seems to be unchanged.
Comment 13 Dominik Bartkiewicz 2022-10-12 05:34:23 MDT
Hi

Sorry that this took so long.
The following commits fix the reported issues and will be included in the next 22.05 release:
https://github.com/SchedMD/slurm/commit/2eb61bb7bb
https://github.com/SchedMD/slurm/commit/cbdac16a19
Please let me know if you have any additional questions or if this ticket is
ready to close.

Dominik
Comment 14 Matthew Davis 2022-10-12 05:40:30 MDT
I am out of the office from the 7th through the 12th.