Ticket 10194 - PrologSlurmctld requeues a job endlessly when the script returns 99 first time
Summary: PrologSlurmctld requeues a job endlessly when the script returns 99 first time
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.5
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-11-11 02:28 MST by Taras Shapovalov
Modified: 2020-11-17 04:23 MST

See Also:
Site: -Other-


Description Taras Shapovalov 2020-11-11 02:28:24 MST
On Slurm 20, when the slurmctld prolog returns 99, all following exit codes are ignored and the job is requeued endlessly.

Steps to reproduce:

1) Set up a test prolog:

[root@master ~]# scontrol show config | grep prolog -i
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = /tmp/prolog.sh
PrologFlags             = (null)
ResvProlog              = (null)
SrunProlog              = (null)
TaskProlog              = (null)
[root@master ~]# cat /tmp/prolog.sh
#!/bin/bash
if [ -e /tmp/ok ]; then
  exit 0
fi
exit 99
[root@master ~]#
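The prolog's branch logic can be exercised outside Slurm. The sketch below mirrors the script above, but substitutes a temporary file for /tmp/ok (the temp path and the `prolog_rc` helper are assumptions of this sketch, not part of the ticket):

```shell
# Stand-alone check of the prolog's branch logic, using a temp flag
# file in place of /tmp/ok.
flag=$(mktemp -u)                      # a path that does not exist yet
prolog_rc() { [ -e "$flag" ] && echo 0 || echo 99; }
first=$(prolog_rc)                     # flag absent  -> 99
touch "$flag"
second=$(prolog_rc)                    # flag present -> 0
rm -f "$flag"
echo "first=$first second=$second"
```

The script itself is correct: it returns 99 until the flag file exists and 0 afterwards, so only the first prolog run for the job should fail.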

2) Submit a job and check the logs of slurmctld:

[2020-11-11T10:21:09.681] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
[2020-11-11T10:21:09.681] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:09.681] debug2: sched: JobId=5 allocated resources: NodeList=(null)
[2020-11-11T10:21:09.681] _slurm_rpc_submit_batch_job: JobId=5 InitPrio=4294901756 usec=548
[2020-11-11T10:21:10.430] debug:  sched: Running job scheduler
[2020-11-11T10:21:10.430] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:10.430] sched: Allocate JobId=5 NodeList=node001 #CPUs=1 Partition=defq
[2020-11-11T10:21:10.431] debug2: Performing full system state save
[2020-11-11T10:21:10.433] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:21:10.437] error: _run_script JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] error: prolog_slurmctld JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:21:10.437] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:21:10.437] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:21:10.437] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:21:10.444] debug2: Tree head got back 1
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:21:10.452] debug2: node_did_resp node001
[2020-11-11T10:21:10.452] debug:  sched: Running job scheduler
[2020-11-11T10:21:11.435] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-11-11T10:21:11.435] debug2: _slurm_rpc_dump_partitions, size=193 usec=128
[2020-11-11T10:21:16.185] debug:  backfill: beginning
[2020-11-11T10:21:16.186] debug:  backfill: no jobs to backfill
[2020-11-11T10:21:16.516] debug2: Testing job time limits and checkpoints



3) Touch /tmp/ok

4) Check the logs after a while; they will contain these messages repeated endlessly:

[2020-11-11T10:25:46.190] debug2: backfill: entering _try_sched for JobId=5.
[2020-11-11T10:25:46.190] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:25:46.190] debug2: select_nodes: calling _get_req_features() for JobId=5 with not NULL job resources
[2020-11-11T10:25:46.191] backfill: Started JobId=5 in defq on node001
[2020-11-11T10:25:46.201] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:25:46.204] debug2: _run_script JobId=5 prolog completed
[2020-11-11T10:25:46.204] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:25:46.204] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:25:46.204] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:25:46.204] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:25:46.212] debug2: Tree head got back 1
[2020-11-11T10:25:46.212] Requeuing JobId=5
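Each requeue cycle adds one "Requeuing JobId=5" line to slurmctld's log, so counting those lines gauges how many cycles have occurred. A self-contained sketch using the two excerpted lines (the temp file stands in for the real slurmctld log, whose path is site-specific):

```shell
# Count requeue cycles by counting "Requeuing" lines; a temp file
# stands in for the real slurmctld log in this sketch.
log=$(mktemp)
cat > "$log" <<'EOF'
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:25:46.212] Requeuing JobId=5
EOF
count=$(grep -c 'Requeuing JobId=5' "$log")   # one line per requeue cycle
rm -f "$log"
echo "requeue cycles so far: $count"
```

Against the live log the count keeps growing, even after /tmp/ok exists and the prolog starts exiting 0.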



The job status remains pending forever:

[root@master ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 5      defq   job.sh cmsuppor PD       0:00      1 (BeginTime) 
[root@master ~]#

The issue is not reproducible on Slurm 19.05.7.