Ticket 16280 - slurmrestd calculating wrong number of cpus through json job submit
Summary: slurmrestd calculating wrong number of cpus through json job submit
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmrestd
Version: 23.02.0
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Glines
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-03-15 14:37 MDT by Shawn Hoopes
Modified: 2024-12-27 14:56 MST
CC List: 1 user

See Also:
Site: SchedMD
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Shawn Hoopes 2023-03-15 14:37:12 MDT
With the following JSON file:

[fred@login ~]$ cat job4.json 
{
       "job": {
                       "tasks": 8,
                       "name": "test",
                       "nodes": "2", "cpus_per_task": 1,
                       "current_working_directory": "/tmp/",
                       "environment": [
                               "PATH=/bin:/usr/bin/:/usr/local/bin/",
                               "LD_LIBRARY_PATH=/lib/:/lib64/:/usr/local/lib"
                       ]
       },
       "script": "#!/bin/bash\nsrun sleep 100"
}
[fred@login ~]$ 
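For context, a payload like job4.json is normally submitted through the slurmrestd REST API; a sketch follows (the host, port, API version, and JWT token handling are assumptions, not taken from this ticket -- adjust them to the deployment at hand):

```shell
# Typical submission of job4.json to slurmrestd (hypothetical
# host/port and API version; token auth assumed):
#
#   curl -s -X POST "http://restd-host:6820/slurm/v0.0.38/job/submit" \
#        -H "X-SLURM-USER-NAME: fred" \
#        -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
#        -H "Content-Type: application/json" \
#        -d @job4.json
#
# Sanity-check the CPU count the request implies before submitting:
tasks=8
cpus_per_task=1
echo "expected NumCPUs: $((tasks * cpus_per_task))"
```

With 8 tasks at 1 CPU per task, the allocation should carry 8 CPUs, which is the value the `scontrol show job` output below fails to reflect.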


The job fails:
JobId=18 JobName=test
   UserId=fred(1010) GroupId=users(100) MCS_label=N/A
   Priority=4294901742 Nice=0 Account=bedrock QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-03-15T13:34:04 EligibleTime=2023-03-15T13:34:04
   AccrueTime=2023-03-15T13:34:04
   StartTime=2023-03-15T13:34:04 EndTime=2023-03-15T13:34:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-15T13:34:04 Scheduler=Main
   Partition=debug AllocNode:Sid=2001:db8:1:1::1:6:162
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[00-01]
   BatchHost=node00
   NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=31934M,node=2,billing=2
   AllocTRES=cpu=2,mem=31934M,node=2,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/tmp/
   StdErr=/tmp//slurm-18.out
   StdIn=/dev/null
   StdOut=/tmp//slurm-18.out
   Power=
   

The slurm-18.out output:
[fred@node00 ~]$ cat /tmp/slurm-18.out 
srun: error: Unable to create step for job 18: More processors requested than permitted
[fred@node00 ~]$
Comment 2 Ben Glines 2023-03-16 14:10:15 MDT
Looks like num_cpus is not being calculated correctly for the job submitted via slurmrestd.

Running with srun (no slurmrestd) seems to work just fine:

> $ srun -n8 -N2 --cpus-per-task=1 hostname
> tars
> tars
> tars
> tars
> tars
> tars
> tars
> tars
> $ scontrol show  jobs 117 | grep NumCPUs
>    NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Notice NumCPUs=8 since there are 8 tasks. Shawn's example using slurmrestd only shows 2:
>    NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Looking into this now.
Comment 9 Ben Glines 2023-03-20 16:07:20 MDT
Thanks for logging this bug, Shawn! The fix will be available in the 23.02.1 release.

*   02c5258be1 (HEAD -> master, origin/master) Merge remote-tracking branch 'origin/slurm-23.02'
|\  
| * f36e6c2ff6 (origin/slurm-23.02) job_mgr.c - Set min_cpus if not already set in job description

https://github.com/SchedMD/slurm/commit/f36e6c2ff6cbff81db05231b476e4a101c135696#diff-7ee66c4f1536ac84dc5bbff1b8312e2eef24b974b3e48a5c5c2bcfdf2eb8f3ce
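The commit's effect can be modeled roughly as follows. This is a simplified shell sketch of the logic described in the commit message, not the actual job_mgr.c code; the function and argument names are hypothetical:

```shell
# Simplified model of the fix: if the client did not set min_cpus,
# derive it from tasks * cpus_per_task; otherwise keep the client's
# value. (Function and argument names are hypothetical.)
min_cpus_for() {
    local ntasks=$1 cpus_per_task=$2 min_nodes=$3 min_cpus=$4
    if [ "$min_cpus" -gt 0 ]; then
        echo "$min_cpus"
    elif [ "$ntasks" -gt 0 ] && [ "$cpus_per_task" -gt 0 ]; then
        echo $((ntasks * cpus_per_task))
    else
        echo "$min_nodes"          # fall back to one CPU per node
    fi
}

# job4.json: 8 tasks, 1 cpu/task, 2 nodes, min_cpus unset (0)
min_cpus_for 8 1 2 0   # prints 8
```

Before the fix, a slurmrestd submission with min_cpus unset effectively fell through to the one-CPU-per-node case, which matches the NumCPUs=2 seen in Shawn's job.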
Comment 10 Scott Hilton 2024-12-27 14:56:33 MST
*** Ticket 21714 has been marked as a duplicate of this ticket. ***