Ticket 13626 - srun and --cpus-per-task
Summary: srun and --cpus-per-task
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.8
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2022-03-15 14:28 MDT by Durai Arasan
Modified: 2022-03-15 14:28 MDT
Site: -Other-


Description Durai Arasan 2022-03-15 14:28:13 MDT
Hello SchedMD,

We are seeing srun execute the command twice, but only when --cpus-per-task=1 is set:

$ srun --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298286 queued and waiting for resources
srun: job 1298286 has been allocated resources
foo
foo

This is not seen when --cpus-per-task is another value:

$ srun --cpus-per-task=3 --partition=gpu-2080ti echo foo
srun: job 1298287 queued and waiting for resources
srun: job 1298287 has been allocated resources
foo

Nor is it seen when --ntasks is specified explicitly:
$ srun -n1 --cpus-per-task=1 --partition=gpu-2080ti echo foo
srun: job 1298288 queued and waiting for resources
srun: job 1298288 has been allocated resources
foo

Relevant slurm.conf settings are:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# example node configuration
NodeName=slurm-bm-58 NodeAddr=xxx.xxx.xxx.xxx Procs=72 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=354566 Gres=gpu:rtx2080ti:8 Feature=xx_v2.38 State=UNKNOWN

On closer inspection of the job's environment variables in the "--cpus-per-task=1" case, the following variables have unexpectedly acquired a value of 2:
SLURM_NTASKS=2
SLURM_NPROCS=2
SLURM_TASKS_PER_NODE=2
SLURM_STEP_NUM_TASKS=2
SLURM_STEP_TASKS_PER_NODE=2
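
One possible reading of these values, offered as an assumption and not a confirmed diagnosis: with CR_Core_Memory and ThreadsPerCore=2, the allocation may be rounded up to whole cores (2 CPUs per core), and when --ntasks is not given srun may derive the task count from the allocated CPUs divided by --cpus-per-task. The sketch below only reproduces that arithmetic; the function name implied_tasks is hypothetical and is not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not a Slurm command): estimate how many tasks
# srun would start when --ntasks is omitted, ASSUMING that
#   1. CPUs are allocated in whole cores (CR_Core_Memory), and
#   2. tasks = allocated CPUs / cpus-per-task (integer division).
implied_tasks() {
    cpus_per_task=$1
    threads_per_core=$2
    # Round the request up to whole cores.
    cores=$(( (cpus_per_task + threads_per_core - 1) / threads_per_core ))
    alloc_cpus=$(( cores * threads_per_core ))
    echo $(( alloc_cpus / cpus_per_task ))
}

implied_tasks 1 2   # --cpus-per-task=1 on a ThreadsPerCore=2 node
implied_tasks 3 2   # --cpus-per-task=3 on a ThreadsPerCore=2 node
```

Under these assumptions the first call prints 2 and the second prints 1, which matches the two runs shown earlier. Passing -n1 sets the task count explicitly, so no value would be derived in that case.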

Can you see what could be wrong?

- Durai