Ticket 16050

Summary: sync loop not progressing, holding JobId=J, tried to use N CPUs on node
Product: Slurm Reporter: Kevin Buckley <kevin.buckley>
Component: Scheduling    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek
Version: 22.05.2   
Hardware: Cray Shasta   
OS: Linux   
Site: Pawsey
Attachments: Slurm config

Description Kevin Buckley 2023-02-15 23:35:19 MST
Created attachment 28888 [details]
Slurm config

Am having a hard time understanding why a particular
job isn't being scheduled within our environment.

Probably a lack of understanding on my part, but I just
can't see why.

The job submission asks for these resources

#SBATCH --account=pawsey0001
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --job-name=GS_23933

The state of the debug queue/partition at the time of the 
submission is

PARTITION AVAIL JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES STATE      NODELIST
debug     up    1-4         1:00:00    256 2:64:2       1 allocated  nid001000
debug     up    1-4         1:00:00    256 2:64:2       7 idle       nid[001001-001007]

so 1 node is fully "allocated" and there is NOTHING on the others

The job gets dumped into a JobHeldAdmin state, with the ctld reporting
that

slurmctld: error: sync loop not progressing, holding JobId=924104, tried to use 5 CPUs on node nid001002 core_map:0,64 avoided_sockets:NONE vpus:1

and the job, as seen via scontrol, looks like

   Priority=0 Nice=0 Account=pawsey0001 QOS=normal
   JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2023-02-16T14:14:12 EligibleTime=2023-02-16T14:14:12
   AccrueTime=2023-02-16T14:14:12
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-16T14:14:47 Scheduler=Main
   Partition=debug AllocNode:Sid=setonix-01:47989
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=1,mem=8G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:1:1 CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)


For a while, the "5 CPUs" and "MinCPUsNode=5" was a mystery to me,
but I have since realised that it is a result of the 8G of memory
being requested: 5 is the minimum number of CPUs needed to get
at least 8G of the available memory.

What I fail to get is why the scheduler tells me that it failed 
when it

  tried to use 5 CPUs on node 

given that there was nothing else running on the node.


Just to mark your card, we may have a hard time altering the
debugging level on the ctld at the moment, so hopefully this
will be "obvious", at least to you, from the configuration.

The Slurm config shouldn't have changed from previous tickets
but I've attached it anyway.
Comment 1 Kevin Buckley 2023-02-15 23:41:16 MST
Just to confuse the hell out of me even more, if I
submit this script but ask for our "work" partition,
it gets scheduled and runs.

slurmctld: _slurm_rpc_submit_batch_job: JobId=924194 InitPrio=75018 usec=6224
slurmctld: sched/backfill: _start_job: Started JobId=924194 in work on nid001060
slurmctld: _job_complete: JobId=924194 WEXITSTATUS 0
slurmctld: _job_complete: JobId=924194 done

so no mention of that "5 CPUs" as it is being scheduled?
Comment 2 Marcin Stolarek 2023-02-16 05:20:40 MST
Kevin,

The number of CPUs gets bumped to 5 because of configured MaxMemPerCPU, per man slurm.conf "If a job specifies a memory per CPU limit that exceeds this system limit, that job's count of CPUs per task will try to automatically increase."[1].

At the same time, the fact that the job doesn't use an idle node but instead goes to JobHeldAdmin is a bug. The good news is that it's already fixed by 69406ed1de (Bug 14395) in Slurm 22.05.6.

Let me know if you have any questions. In case of no reply I'll mark the bug report as a duplicate.

cheers,
Marcin
[1]https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerCPU
Comment 3 Kevin Buckley 2023-02-16 19:19:27 MST
> At the same time the fact that the job doesn't use idle node, but just goes to
> JobHeldAdmin is a bug. Good news is that it's already fixed by 69406ed1de (Bug
> 14395) in Slurm 22.05.6.
> 
> Let me know if you have any questions. In case of no reply I'll mark the bug
> report as duplicate.

Not entirely clear to me, from the existing bug commentary, why
a fix for a billing issue means that "the job doesn't use idle
node," but am more than happy to hear that the fix for that fixes
the symptoms I am seeing.

All we need now is for HPE/Cray to allow us to go to the version
with the fix in, and we should be good to go.

Feel free to close as you see fit,
Kevin
Comment 4 Marcin Stolarek 2023-02-17 01:11:53 MST
>Not entirely clear to me, from the existing bug commentary,[...]
The summary of that bug is a little bit misleading. Unfortunately, since summaries quite often come from end users, they don't always refer to the root cause, but to one of the symptoms.

When you look at the commit that came in as the fix, it's clearer:
>commit 69406ed1ded4760422c878f806428c78616fd3be
>Author: Scott Hilton <scott@schedmd.com>
>Date:   Wed Sep 14 16:12:25 2022 -0600
> 
>    Fix the number of allocated cpus for an auto-adjustment case
>    
>    Job requests --ntasks-per-node and --mem (per-node) but the limit is
>    MaxMemPerCPU.
>    
>    Bug 14395
>[...]
>--- a/src/slurmctld/job_mgr.c
>+++ b/src/slurmctld/job_mgr.c
>@@ -8930,6 +8930,11 @@ static bool _valid_pn_min_mem(job_desc_msg_t *job_desc_msg,
>                              "limit", min_cpus);
>                        job_desc_msg->pn_min_cpus = min_cpus;
>                        cpus_per_node = MAX(cpus_per_node, min_cpus);
>+                       if (job_desc_msg->ntasks_per_node)
>+                               job_desc_msg->cpus_per_task =
>+                                       (job_desc_msg->pn_min_cpus +
>+                                        job_desc_msg->ntasks_per_node - 1) /
>+                                       job_desc_msg->ntasks_per_node;
>                }
>                sys_mem_limit *= cpus_per_node;

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 14395 ***