Ticket 13626

Summary:	srun and --cpus-per-task
Product:	Slurm	Reporter:	Durai Arasan <arasan.durai>
Component:	Scheduling	Assignee:	Jacob Jenson <jacob>
Status:	RESOLVED INVALID	QA Contact:
Severity:	6 - No support contract
Priority:	---
Version:	20.11.8
Hardware:	Linux
OS:	Linux
Site:	-Other-	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---

Description Durai Arasan 2022-03-15 14:28:13 MDT

Hello SchedMD,

We are seeing strange behavior: srun executes the command twice, but only when --cpus-per-task=1 is set:

$ srun --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298286 queued and waiting for resources
srun: job 1298286 has been allocated resources
foo
foo

This does not happen when --cpus-per-task is set to another value:

$ srun --cpus-per-task=3 --partition=gpu-2080ti echo foo
srun: job 1298287 queued and waiting for resources
srun: job 1298287 has been allocated resources
foo

It also does not happen when --ntasks is specified explicitly:
$ srun -n1 --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298288 queued and waiting for resources
srun: job 1298288 has been allocated resources
foo

Relevant slurm.conf settings are:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# example node configuration
NodeName=slurm-bm-58 NodeAddr=xxx.xxx.xxx.xxx Procs=72 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=354566 Gres=gpu:rtx2080ti:8 Feature=xx_v2.38 State=UNKNOWN

On closer inspection of the job's environment variables in the "--cpus-per-task=1" case, the following variables are unexpectedly set to 2, even though only one task was intended:
SLURM_NTASKS=2
SLURM_NPROCS=2
SLURM_TASKS_PER_NODE=2
SLURM_STEP_NUM_TASKS=2
SLURM_STEP_TASKS_PER_NODE=2
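
One unconfirmed reading of these values, offered only as a sketch: with CR_Core_Memory and ThreadsPerCore=2, the smallest allocation is one whole core, i.e. 2 CPUs, and when --cpus-per-task is given without --ntasks, srun may size the default task count to fill the allocated CPUs. The helper below is hypothetical (estimate_default_tasks is not a Slurm command) and only reproduces that arithmetic under those two assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Slurm. It ASSUMES:
#   1. CR_Core allocates whole cores, so the allocated CPU count is
#      rounded up to a multiple of ThreadsPerCore.
#   2. With --cpus-per-task set and no --ntasks, the default task count
#      is (allocated CPUs) / (cpus per task), rounded down.
estimate_default_tasks() {
    local cpus_per_task=$1 threads_per_core=$2
    # Whole cores needed to hold one task's CPUs (ceiling division).
    local cores=$(( (cpus_per_task + threads_per_core - 1) / threads_per_core ))
    # CPUs (hardware threads) actually allocated under assumption 1.
    local allocated_cpus=$(( cores * threads_per_core ))
    # Tasks that fit into the allocated CPUs under assumption 2.
    echo $(( allocated_cpus / cpus_per_task ))
}

estimate_default_tasks 1 2   # the --cpus-per-task=1 case on this node type
estimate_default_tasks 3 2   # the --cpus-per-task=3 case on this node type
```

Under these assumptions the first call yields 2 and the second yields 1, which matches the number of "foo" lines in the two runs above. Passing -n1 would bypass the default entirely, which is consistent with the third run.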

Can you see what could be wrong?

- Durai