| Summary: | sync loop not progressing, holding JobId=J, tried to use N CPUs on node | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kevin Buckley <kevin.buckley> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek |
| Version: | 22.05.2 | ||
| Hardware: | Cray Shasta | ||
| OS: | Linux | ||
| Site: | Pawsey | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurm config | ||
Just to confuse the hell out of me even more, if I submit this script but ask for our "work" partition, it gets scheduled and runs. slurmctld: _slurm_rpc_submit_batch_job: JobId=924194 InitPrio=75018 usec=6224 slurmctld: sched/backfill: _start_job: Started JobId=924194 in work on nid001060 slurmctld: _job_complete: JobId=924194 WEXITSTATUS 0 slurmctld: _job_complete: JobId=924194 done so no mention of that "5 CPUs" as it is being scheduled ? Kevin, The number of CPUs gets bumped to 5 because of configured MaxMemPerCPU, per man slurm.conf "If a job specifies a memory per CPU limit that exceeds this system limit, that job's count of CPUs per task will try to automatically increase."[1]. At the same time the fact that the job doesn't use idle node, but just goes to JobHeldAdmin is a bug. Good news is that it's already fixed by 69406ed1de (Bug 14395) in Slurm 22.05.6. Let me know if you have any questions. In case of no reply I'll mark the bug report as duplicate. cheers, Marcin [1]https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerCPU > At the same time the fact that the job doesn't use idle node, but just goes to
> JobHeldAdmin is a bug. Good news is that it's already fixed by 69406ed1de (Bug
> 14395) in Slurm 22.05.6.
>
> Let me know if you have any questions. In case of no reply I'll mark the bug
> report as duplicate.
It's not entirely clear to me, from the existing bug commentary, why
a fix for a billing issue means that "the job doesn't use idle
node," but I am more than happy to hear that the fix for that fixes
the symptoms I am seeing.
All we need now is for HPE/Cray to allow us to go to the version
with the fix in, and we should be good to go.
Feel free to close as you see fit,
Kevin
> Not entirely clear to me, from the existing bug commentary, [...]

The summary of that bug is a little misleading. Unfortunately, that happens quite often when a summary comes from end users, which means it's not always describing the root cause but one of the symptoms. When you look at the commit that came in as the fix, it's clearer:

> commit 69406ed1ded4760422c878f806428c78616fd3be
> Author: Scott Hilton <scott@schedmd.com>
> Date:   Wed Sep 14 16:12:25 2022 -0600
>
>     Fix the number of allocated cpus for an auto-adjustment case
>
>     Job requests --ntasks-per-node and --mem (per-node) but the limit is
>     MaxMemPerCPU.
>
>     Bug 14395
> [...]
> --- a/src/slurmctld/job_mgr.c
> +++ b/src/slurmctld/job_mgr.c
> @@ -8930,6 +8930,11 @@ static bool _valid_pn_min_mem(job_desc_msg_t *job_desc_msg,
>  			       "limit", min_cpus);
>  			job_desc_msg->pn_min_cpus = min_cpus;
>  			cpus_per_node = MAX(cpus_per_node, min_cpus);
> +			if (job_desc_msg->ntasks_per_node)
> +				job_desc_msg->cpus_per_task =
> +					(job_desc_msg->pn_min_cpus +
> +					 job_desc_msg->ntasks_per_node - 1) /
> +					job_desc_msg->ntasks_per_node;
>  		}
>  		sys_mem_limit *= cpus_per_node;

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 14395 ***
Created attachment 28888 [details]
Slurm config

Am having a hard time understanding why a particular job isn't being scheduled within our environment. Probably a lack of understanding on my part, but I just can't see why.

The job submission asks for these resources:

```
#SBATCH --account=pawsey0001
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --job-name=GS_23933
```

The state of the debug queue/partition at the time of the submission is:

```
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T  NODES STATE     NODELIST
debug     up    1-4      1:00:00   256  2:64:2 1     allocated nid001000
debug     up    1-4      1:00:00   256  2:64:2 7     idle      nid[001001-001007]
```

so 1 node is fully "allocated" and there is NOTHING on the others.

The job gets dumped into a JobAdminHeld state, with the ctld reporting:

```
slurmctld: error: sync loop not progressing, holding JobId=924104, tried to use 5 CPUs on node nid001002 core_map:0,64 avoided_sockets:NONE vpus:1
```

and the job, as seen via scontrol, looks like:

```
Priority=0 Nice=0 Account=pawsey0001 QOS=normal
JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2023-02-16T14:14:12 EligibleTime=2023-02-16T14:14:12
AccrueTime=2023-02-16T14:14:12
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-16T14:14:47 Scheduler=Main
Partition=debug AllocNode:Sid=setonix-01:47989
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=1,mem=8G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=1:0:1:1 CoreSpec=*
MinCPUsNode=5 MinMemoryNode=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
```

For a while, the "5 CPUs" and "MinCPUsNode=5" were a mystery to me, but I have since realised that they are a result of the 8G of memory being requested, 5 being the minimum number of CPUs needed to get at least 8G of the available memory.

What I fail to get is why the scheduler tells me that it failed when it tried to use 5 CPUs on the node, given that there was nothing else running on it.

Just to mark your card, we may have a hard time altering the debugging level on the ctld at the moment, so hopefully this will be "obvious", at least to you, from the configuration. The Slurm config shouldn't have changed from previous tickets, but I've attached it anyway.