Ticket 16280

Summary: slurmrestd calculating wrong number of cpus through json job submit
Product: Slurm Reporter: Shawn Hoopes <shawn>
Component: slurmrestd Assignee: Ben Glines <ben.glines>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.0   
Hardware: Linux   
OS: Linux   
Site: SchedMD
Version Fixed: 23.02.1

Description Shawn Hoopes 2023-03-15 14:37:12 MDT
With the following .json:

[fred@login ~]$ cat job4.json 
{
       "job": {
                       "tasks": 8,
                       "name": "test",
                       "nodes": "2",
                       "cpus_per_task": 1,
                       "current_working_directory": "/tmp/",
                       "environment": [
                               "PATH=/bin:/usr/bin/:/usr/local/bin/",
                               "LD_LIBRARY_PATH=/lib/:/lib64/:/usr/local/lib"
                       ]
       },
       "script": "#!/bin/bash\nsrun sleep 100"
}
[fred@login ~]$ 


The job fails:
JobId=18 JobName=test
   UserId=fred(1010) GroupId=users(100) MCS_label=N/A
   Priority=4294901742 Nice=0 Account=bedrock QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-03-15T13:34:04 EligibleTime=2023-03-15T13:34:04
   AccrueTime=2023-03-15T13:34:04
   StartTime=2023-03-15T13:34:04 EndTime=2023-03-15T13:34:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-15T13:34:04 Scheduler=Main
   Partition=debug AllocNode:Sid=2001:db8:1:1::1:6:162
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[00-01]
   BatchHost=node00
   NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=31934M,node=2,billing=2
   AllocTRES=cpu=2,mem=31934M,node=2,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/tmp/
   StdErr=/tmp//slurm-18.out
   StdIn=/dev/null
   StdOut=/tmp//slurm-18.out
   Power=
   

The slurm-18.out output:
[fred@node00 ~]$ cat /tmp/slurm-18.out 
srun: error: Unable to create step for job 18: More processors requested than permitted
[fred@node00 ~]$
Comment 2 Ben Glines 2023-03-16 14:10:15 MDT
Looks like num_cpus is not being calculated correctly for the job submitted via slurmrestd.

Running with srun (no slurmrestd) seems to work just fine:

> $ srun -n8 -N2 --cpus-per-task=1 hostname
> tars
> tars
> tars
> tars
> tars
> tars
> tars
> tars
> $ scontrol show  jobs 117 | grep NumCPUs
>    NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Notice NumCPUs=8 since there are 8 tasks. Shawn's example using slurmrestd only shows 2:
>    NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Looking into this now.
Comment 9 Ben Glines 2023-03-20 16:07:20 MDT
Thanks for logging this bug, Shawn! The fix will be available in the 23.02.1 release.

*   02c5258be1 (HEAD -> master, origin/master) Merge remote-tracking branch 'origin/slurm-23.02'
|\  
| * f36e6c2ff6 (origin/slurm-23.02) job_mgr.c - Set min_cpus if not already set in job description

https://github.com/SchedMD/slurm/commit/f36e6c2ff6cbff81db05231b476e4a101c135696#diff-7ee66c4f1536ac84dc5bbff1b8312e2eef24b974b3e48a5c5c2bcfdf2eb8f3ce