| Summary: | slurmrestd calculating wrong number of cpus through json job submit | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Shawn Hoopes <shawn> |
| Component: | slurmrestd | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 23.02.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | SchedMD | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.02.1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Looks like num_cpus is not being calculated correctly for a job submitted via slurmrestd. Running with srun (no slurmrestd) seems to work just fine:

```
$ srun -n8 -N2 --cpus-per-task=1 hostname
tars
tars
tars
tars
tars
tars
tars
tars
$ scontrol show jobs 117 | grep NumCPUs
   NumNodes=2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
```

Notice NumCPUs=8, since there are 8 tasks. Shawn's example using slurmrestd shows only 2:

```
   NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
```

Looking into this now. Thanks for logging this bug, Shawn!

The fix will be available in the 23.02.1 release:

```
* 02c5258be1 (HEAD -> master, origin/master) Merge remote-tracking branch 'origin/slurm-23.02'
|\
| * f36e6c2ff6 (origin/slurm-23.02) job_mgr.c - Set min_cpus if not already set in job description
```

https://github.com/SchedMD/slurm/commit/f36e6c2ff6cbff81db05231b476e4a101c135696#diff-7ee66c4f1536ac84dc5bbff1b8312e2eef24b974b3e48a5c5c2bcfdf2eb8f3ce
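The commit subject above ("Set min_cpus if not already set in job description") suggests the fix derives the minimum CPU count from the task layout whenever the submitted job description leaves it unset. A minimal sketch of that idea in Python, not the actual Slurm C code; the function name and the `None` sentinel (standing in for Slurm's NO_VAL) are illustrative:

```python
def derive_min_cpus(min_cpus, num_tasks, cpus_per_task):
    """Illustrative sketch: if the job description never set min_cpus,
    derive it from the requested tasks, as the referenced fix appears to do.
    `None` stands in for Slurm's "not set" sentinel."""
    if min_cpus is None:
        return num_tasks * cpus_per_task
    return min_cpus

# The slurmrestd case from this report: 8 tasks, 1 CPU per task,
# min_cpus never set by the JSON path -> NumCPUs should be 8, not 2.
print(derive_min_cpus(None, 8, 1))  # -> 8
```

Without such a default, the allocation falls back to one CPU per node (2 nodes, hence NumCPUs=2), which is what the failing job below shows.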
With the following .json:

```
[fred@login ~]$ cat job4.json
{
  "job": {
    "tasks": 8,
    "name": "test",
    "nodes": "2",
    "cpus_per_task": 1,
    "current_working_directory": "/tmp/",
    "environment": [
      "PATH=/bin:/usr/bin/:/usr/local/bin/",
      "LD_LIBRARY_PATH=/lib/:/lib64/:/usr/local/lib"
    ]
  },
  "script": "#!/bin/bash\nsrun sleep 100"
}
```

The job fails:

```
JobId=18 JobName=test
   UserId=fred(1010) GroupId=users(100) MCS_label=N/A
   Priority=4294901742 Nice=0 Account=bedrock QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-03-15T13:34:04 EligibleTime=2023-03-15T13:34:04 AccrueTime=2023-03-15T13:34:04
   StartTime=2023-03-15T13:34:04 EndTime=2023-03-15T13:34:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-15T13:34:04 Scheduler=Main
   Partition=debug AllocNode:Sid=2001:db8:1:1::1:6:162
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[00-01]
   BatchHost=node00
   NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=31934M,node=2,billing=2
   AllocTRES=cpu=2,mem=31934M,node=2,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/tmp/
   StdErr=/tmp//slurm-18.out
   StdIn=/dev/null
   StdOut=/tmp//slurm-18.out
   Power=
```

The slurm-18.out output:

```
[fred@node00 ~]$ cat /tmp/slurm-18.out
srun: error: Unable to create step for job 18: More processors requested than permitted
[fred@node00 ~]$
```
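A quick way to spot this symptom in any affected job is to compare NumCPUs against NumTasks × CPUs/Task in the `scontrol show job` output: when the allocation is smaller than the task layout requires, the job step fails exactly as above. A small diagnostic sketch in Python (the parsing is illustrative and only handles simple `key=value` tokens like those on this line):

```python
def cpu_fields(scontrol_line):
    """Parse simple key=value tokens from one line of `scontrol show job`
    output and return (NumCPUs, NumTasks, CPUs/Task) as integers."""
    fields = dict(tok.split("=", 1) for tok in scontrol_line.split() if "=" in tok)
    return int(fields["NumCPUs"]), int(fields["NumTasks"]), int(fields["CPUs/Task"])

# The line from the failing job above:
line = "NumNodes=2 NumCPUs=2 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*"
ncpus, ntasks, cpt = cpu_fields(line)
print(ncpus < ntasks * cpt)  # -> True: only 2 CPUs allocated for 8 tasks
```

Here 2 < 8 × 1, matching the "More processors requested than permitted" error from srun.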