Ticket 10194

Summary: PrologSlurmctld requeues a job endlessly when the script returns 99 the first time
Product: Slurm
Reporter: Taras Shapovalov <taras.shapovalov>
Component: Scheduling
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: jeff, ken.woods
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description Taras Shapovalov 2020-11-11 02:28:24 MST
On Slurm 20, when the slurmctld prolog returns 99, all subsequent exit codes are ignored and the job is requeued endlessly.

Steps to reproduce:

1) Setup a test prolog:

[root@master ~]# scontrol show config | grep prolog -i
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = /tmp/prolog.sh
PrologFlags             = (null)
ResvProlog              = (null)
SrunProlog              = (null)
TaskProlog              = (null)
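
For reference, the scontrol output above corresponds to a slurm.conf along these lines (a sketch; only the prolog-related settings are shown):

```
# slurm.conf excerpt matching the scontrol output above
PrologSlurmctld=/tmp/prolog.sh
PrologEpilogTimeout=65534
```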
[root@master ~]# cat /tmp/prolog.sh
#!/bin/bash
if [ -e /tmp/ok ]; then
  exit 0
fi
exit 99
[root@master ~]#
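
The prolog's exit-code logic can be exercised locally, outside Slurm, by substituting a temporary path for /tmp/ok (the flag path below is a stand-in, not the one from the report):

```shell
# simulate the prolog's flag-file check without touching Slurm
flag=$(mktemp -u)                           # a path that does not exist yet

sh -c "[ -e '$flag' ] && exit 0; exit 99"   # same test as /tmp/prolog.sh
first=$?
echo "before touch: $first"                 # 99: flag file absent

touch "$flag"
sh -c "[ -e '$flag' ] && exit 0; exit 99"
second=$?
echo "after touch: $second"                 # 0: flag file present

rm -f "$flag"
```

This reproduces the two phases of the report: exit 99 until the flag file exists, exit 0 afterwards.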

2) Submit a job and check the logs of slurmctld:

[2020-11-11T10:21:09.681] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
[2020-11-11T10:21:09.681] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:09.681] debug2: sched: JobId=5 allocated resources: NodeList=(null)
[2020-11-11T10:21:09.681] _slurm_rpc_submit_batch_job: JobId=5 InitPrio=4294901756 usec=548
[2020-11-11T10:21:10.430] debug:  sched: Running job scheduler
[2020-11-11T10:21:10.430] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:10.430] sched: Allocate JobId=5 NodeList=node001 #CPUs=1 Partition=defq
[2020-11-11T10:21:10.431] debug2: Performing full system state save
[2020-11-11T10:21:10.433] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:21:10.437] error: _run_script JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] error: prolog_slurmctld JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:21:10.437] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:21:10.437] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:21:10.437] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:21:10.444] debug2: Tree head got back 1
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:21:10.452] debug2: node_did_resp node001
[2020-11-11T10:21:10.452] debug:  sched: Running job scheduler
[2020-11-11T10:21:11.435] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-11-11T10:21:11.435] debug2: _slurm_rpc_dump_partitions, size=193 usec=128
[2020-11-11T10:21:16.185] debug:  backfill: beginning
[2020-11-11T10:21:16.186] debug:  backfill: no jobs to backfill
[2020-11-11T10:21:16.516] debug2: Testing job time limits and checkpoints



3) Touch /tmp/ok so the prolog now exits 0

4) Check the logs after a while; they will contain these messages repeated endlessly:

[2020-11-11T10:25:46.190] debug2: backfill: entering _try_sched for JobId=5.
[2020-11-11T10:25:46.190] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:25:46.190] debug2: select_nodes: calling _get_req_features() for JobId=5 with not NULL job resources
[2020-11-11T10:25:46.191] backfill: Started JobId=5 in defq on node001
[2020-11-11T10:25:46.201] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:25:46.204] debug2: _run_script JobId=5 prolog completed
[2020-11-11T10:25:46.204] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:25:46.204] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:25:46.204] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:25:46.204] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:25:46.212] debug2: Tree head got back 1
[2020-11-11T10:25:46.212] Requeuing JobId=5
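
The requeue loop can be confirmed by counting the "Requeuing" lines in the slurmctld log (the log path varies per site, so the sketch below greps a captured snippet instead):

```shell
# count requeue events for JobId=5 in a captured log snippet
log=$(mktemp)
cat > "$log" <<'EOF'
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:25:46.212] Requeuing JobId=5
EOF
count=$(grep -c 'Requeuing JobId=5' "$log")
echo "requeue events: $count"   # a steadily growing count indicates the loop
rm -f "$log"
```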



Job status remains pending forever:

[root@master ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 5      defq   job.sh cmsuppor PD       0:00      1 (BeginTime) 
[root@master ~]#

The issue is not reproducible on Slurm 19.05.7.