Ticket 10194

Summary: PrologSlurmctld requeues a job endlessly when the script returns 99 the first time
Product: Slurm
Reporter: Taras Shapovalov <taras.shapovalov>
Component: Scheduling
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: jeff, ken.woods
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description Taras Shapovalov 2020-11-11 02:28:24 MST
On Slurm 20, when the slurmctld prolog returns 99, all subsequent exit codes are ignored and the job is requeued endlessly.

Steps to reproduce:

1) Setup a test prolog:

[root@master ~]# scontrol show config | grep prolog -i
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = /tmp/prolog.sh
PrologFlags             = (null)
ResvProlog              = (null)
SrunProlog              = (null)
TaskProlog              = (null)
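
For reference, the scontrol output above corresponds to a slurm.conf along these lines (a sketch; only the prolog-related settings are shown):

```
# slurm.conf excerpt matching the scontrol output above
PrologSlurmctld=/tmp/prolog.sh
PrologEpilogTimeout=65534
```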
[root@master ~]# cat /tmp/prolog.sh
#!/bin/bash
if [ -e /tmp/ok ]; then
  exit 0
fi
exit 99
[root@master ~]#
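
The prolog's exit-code logic can be exercised locally, outside Slurm, by substituting a temporary path for /tmp/ok (the flag path below is a stand-in, not the one from the report):

```shell
# simulate the prolog's flag-file check without touching Slurm
flag=$(mktemp -u)                           # a path that does not exist yet

sh -c "[ -e '$flag' ] && exit 0; exit 99"   # same test as /tmp/prolog.sh
first=$?
echo "before touch: $first"                 # 99: flag file absent

touch "$flag"
sh -c "[ -e '$flag' ] && exit 0; exit 99"
second=$?
echo "after touch: $second"                 # 0: flag file present

rm -f "$flag"
```

This reproduces the two phases of the report: exit 99 until the flag file exists, exit 0 afterwards.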

2) Submit a job and check the logs of slurmctld:

[2020-11-11T10:21:09.681] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
[2020-11-11T10:21:09.681] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:09.681] debug2: sched: JobId=5 allocated resources: NodeList=(null)
[2020-11-11T10:21:09.681] _slurm_rpc_submit_batch_job: JobId=5 InitPrio=4294901756 usec=548
[2020-11-11T10:21:10.430] debug:  sched: Running job scheduler
[2020-11-11T10:21:10.430] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:10.430] sched: Allocate JobId=5 NodeList=node001 #CPUs=1 Partition=defq
[2020-11-11T10:21:10.431] debug2: Performing full system state save
[2020-11-11T10:21:10.433] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:21:10.437] error: _run_script JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] error: prolog_slurmctld JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:21:10.437] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:21:10.437] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:21:10.437] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:21:10.444] debug2: Tree head got back 1
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:21:10.452] debug2: node_did_resp node001
[2020-11-11T10:21:10.452] debug:  sched: Running job scheduler
[2020-11-11T10:21:11.435] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-11-11T10:21:11.435] debug2: _slurm_rpc_dump_partitions, size=193 usec=128
[2020-11-11T10:21:16.185] debug:  backfill: beginning
[2020-11-11T10:21:16.186] debug:  backfill: no jobs to backfill
[2020-11-11T10:21:16.516] debug2: Testing job time limits and checkpoints



3) Touch /tmp/ok so the prolog now exits 0

4) Check the logs after a while; they will contain these messages repeated endlessly:

[2020-11-11T10:25:46.190] debug2: backfill: entering _try_sched for JobId=5.
[2020-11-11T10:25:46.190] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:25:46.190] debug2: select_nodes: calling _get_req_features() for JobId=5 with not NULL job resources
[2020-11-11T10:25:46.191] backfill: Started JobId=5 in defq on node001
[2020-11-11T10:25:46.201] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:25:46.204] debug2: _run_script JobId=5 prolog completed
[2020-11-11T10:25:46.204] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:25:46.204] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:25:46.204] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:25:46.204] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:25:46.212] debug2: Tree head got back 1
[2020-11-11T10:25:46.212] Requeuing JobId=5
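
The requeue loop can be confirmed by counting the "Requeuing" lines in the slurmctld log (the log path varies per site, so the sketch below greps a captured snippet instead):

```shell
# count requeue events for JobId=5 in a captured log snippet
log=$(mktemp)
cat > "$log" <<'EOF'
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:25:46.212] Requeuing JobId=5
EOF
count=$(grep -c 'Requeuing JobId=5' "$log")
echo "requeue events: $count"   # a steadily growing count indicates the loop
rm -f "$log"
```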



Job status remains pending forever:

[root@master ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 5      defq   job.sh cmsuppor PD       0:00      1 (BeginTime) 
[root@master ~]#

The issue is not reproducible on Slurm 19.05.7.