Ticket 16280

Summary: slurmrestd calculating wrong number of cpus through json job submit
Product: Slurm Reporter: Shawn Hoopes <shawn>
Component: slurmrestd Assignee: Ben Glines <ben.glines>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.0   
Hardware: Linux   
OS: Linux   
Site: SchedMD
Version Fixed: 23.02.1

Description Shawn Hoopes 2023-03-15 14:37:12 MDT
With the following .json:

[fred@login ~]$ cat job4.json 
{
       "job": {
                       "tasks": 8,
                       "name": "test",
                       "nodes": "2",
                       "cpus_per_task": 1,
                       "current_working_directory": "/tmp/",
                       "environment": [
                               "PATH=/bin:/usr/bin/:/usr/local/bin/",
                               "LD_LIBRARY_PATH=/lib/:/lib64/:/usr/local/lib"
                       ]
       },
       "script": "#!/bin/bash\nsrun sleep 100"
}
[fred@login ~]$ 


The job fails:
JobId=18 JobName=test
   UserId=fred(1010) GroupId=users(100) MCS_label=N/A
   Priority=4294901742 Nice=0 Account=bedrock QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-03-15T13:34:04 EligibleTime=2023-03-15T13:34:04
   AccrueTime=2023-03-15T13:34:04
   StartTime=2023-03-15T13:34:04 EndTime=2023-03-15T13:34:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-15T13:34:04 Scheduler=Main
   Partition=debug AllocNode:Sid=2001:db8:1:1::1:6:162
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[00-01]
   BatchHost=node00
   NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=31934M,node=2,billing=2
   AllocTRES=cpu=2,mem=31934M,node=2,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/tmp/
   StdErr=/tmp//slurm-18.out
   StdIn=/dev/null
   StdOut=/tmp//slurm-18.out
   Power=
   

The slurm-18.out output:
[fred@node00 ~]$ cat /tmp/slurm-18.out 
srun: error: Unable to create step for job 18: More processors requested than permitted
[fred@node00 ~]$
Comment 2 Ben Glines 2023-03-16 14:10:15 MDT
Looks like num_cpus is not being calculated correctly for the job submitted via slurmrestd.

Running with srun (no slurmrestd) seems to work just fine:

> $ srun -n8 -N2 --cpus-per-task=1 hostname
> tars
> tars
> tars
> tars
> tars
> tars
> tars
> tars
> $ scontrol show  jobs 117 | grep NumCPUs
>    NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Notice NumCPUs=8 since there are 8 tasks. Shawn's example using slurmrestd only shows 2:
>    NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Looking into this now.
Comment 9 Ben Glines 2023-03-20 16:07:20 MDT
Thanks for logging this bug, Shawn! The fix will be available in the 23.02.1 release.

*   02c5258be1 (HEAD -> master, origin/master) Merge remote-tracking branch 'origin/slurm-23.02'
|\  
| * f36e6c2ff6 (origin/slurm-23.02) job_mgr.c - Set min_cpus if not already set in job description

https://github.com/SchedMD/slurm/commit/f36e6c2ff6cbff81db05231b476e4a101c135696#diff-7ee66c4f1536ac84dc5bbff1b8312e2eef24b974b3e48a5c5c2bcfdf2eb8f3ce