Ticket 17747

Summary: GPU sbatch job is requeued, but the training job seems to have already completed
Product: Slurm Reporter: H.T. Lee <htlee>
Component: slurmd   Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 23.02.5   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description H.T. Lee 2023-09-21 21:27:07 MDT
I am using slurm-23.02.5 with the following component versions:
4 x RTX 4090 GPUs per node
cuda  cuda_11.8.r11.8/compiler.31833905_0
munge/jammy,now 0.5.14-6 amd64 
libmunge-dev/jammy,now 0.5.14-6
libmunge2/jammy,now 0.5.14-6 amd64
NVIDIA-SMI 535.86.05 
hwloc-2.9.3
ebpf: libdbus-1-dev, libbpf-dev

We have two problems that we cannot solve.
The first problem: "Unauthorized credential for client UID=0 GID=0" is logged in /var/log/munge/munged.log whenever slurmd or slurmctld is started, and also when an sbatch job is run.
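A sketch of the checks that can be run to compare the MUNGE setup between the controller and a compute node (assuming the default key path /etc/munge/munge.key and the Ubuntu packaging listed above; <controller> is a placeholder hostname):

# Key must be byte-identical on every node, owned by munge:munge,
# and not group/world readable (typically mode 0400)
md5sum /etc/munge/munge.key
ls -l /etc/munge/munge.key
systemctl status munge                 # munged should run as the "munge" user, not root

# Round-trip a credential locally, then against the controller
munge -n | unmunge
munge -n | ssh <controller> unmunge    # both should print STATUS: Success (0)

This is only to rule out key or permission mismatches between the nodes.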

The second problem: a GPU batch job keeps getting requeued, even though the training loop appears to have completed and the results are generated successfully. The job ends up marked RH, with JobState=REQUEUE_HOLD Reason=launch_failure_limit_exceeded_requeued_held Dependency=(null). If the same training job is switched to CPU only, the batch job completes successfully without being requeued.
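For reference, a minimal sketch of the kind of GPU batch script that hits this behaviour (the job name, GRES request, and training command are illustrative placeholders; per the slurmd.log below, the real job allocates 2 of the 4 GPUs on a single node):

#!/bin/bash
#SBATCH --job-name=gpu_train           # illustrative name
#SBATCH --nodes=1
#SBATCH --gres=gpu:2                   # 2 of the 4 RTX 4090s, matching gres_bit_alloc[0]:0-1 of 4
#SBATCH --output=slurm-%j.out

srun python train.py                   # launches step <jobid>.0, as seen in the log

The equivalent CPU-only script (same training command, no --gres line) completes and is not requeued.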

slurmd.log shows:
[2023-09-21T20:39:03.547] Launching batch job 100 for UID 1001
[2023-09-21T20:39:03.600] [100.batch] gres_job_state gres:gpu(7696487) type:rtx_4090(2410783910) job:100 flags:
[2023-09-21T20:39:03.601] [100.batch]   total_node_cnt:1 (sparsely populated for resource selection)
[2023-09-21T20:39:03.601] [100.batch]   total_gres:2
[2023-09-21T20:39:03.601] [100.batch]   node_cnt:1
[2023-09-21T20:39:03.601] [100.batch]   gres_cnt_node_alloc[0]:2
[2023-09-21T20:39:03.601] [100.batch]   gres_bit_alloc[0]:0-1 of 4
[2023-09-21T20:39:03.606] [100.batch] task/cgroup: _memcg_initialize: job: alloc=1031756MB mem.limit=1031756MB memsw.limit=unlimited
[2023-09-21T20:39:03.607] [100.batch] task/cgroup: _memcg_initialize: step: alloc=1031756MB mem.limit=1031756MB memsw.limit=unlimited
[2023-09-21T20:39:03.866] launch task StepId=100.0 request from UID:1001 GID:1001 HOST:127.0.0.1 PORT:41142
[2023-09-21T20:39:03.911] [100.0] gres_job_state gres:gpu(7696487) type:rtx_4090(2410783910) job:100 flags:
[2023-09-21T20:39:03.912] [100.0]   total_node_cnt:1 (sparsely populated for resource selection)
[2023-09-21T20:39:03.912] [100.0]   total_gres:2
[2023-09-21T20:39:03.912] [100.0]   node_cnt:1
[2023-09-21T20:39:03.912] [100.0]   gres_cnt_node_alloc[0]:2
[2023-09-21T20:39:03.912] [100.0]   gres_bit_alloc[0]:0-1 of 4
[2023-09-21T20:39:03.912] [100.0] gres:gpu type:(null)(0) StepId=100.0 flags: state
[2023-09-21T20:39:03.912] [100.0]   gres_bit_alloc[0]:0-1 of 4
[2023-09-21T20:39:03.916] [100.0] task/cgroup: _memcg_initialize: job: alloc=1031756MB mem.limit=1031756MB memsw.limit=unlimited
[2023-09-21T20:39:03.916] [100.0] task/cgroup: _memcg_initialize: step: alloc=1031756MB mem.limit=1031756MB memsw.limit=unlimited
[2023-09-21T21:04:43.908] _handle_stray_script: Purging vestigial job script /var/spool/slurm/d/job00100/slurm_script
[2023-09-21T21:06:59.015] reissued job credential for job 100
[2023-09-21T21:06:59.016] Launching batch job 100 for UID 1001
[2023-09-21T21:06:59.068] [100.batch] gres_job_state gres:gpu(7696487) type:rtx_4090(2410783910) job:100 flags: