Ticket 10194 - PrologSlurmctld requeues a job endlessly when the script returns 99 first time
Summary: PrologSlurmctld requeues a job endlessly when the script returns 99 first time
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.5
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-11-11 02:28 MST by Taras Shapovalov
Modified: 2020-11-17 04:23 MST

See Also:
Site: -Other-


Description Taras Shapovalov 2020-11-11 02:28:24 MST
On Slurm 20, when the slurmctld prolog returns 99, all following exit codes are ignored and the job is requeued endlessly.

Steps to reproduce:

1) Set up a test prolog:

[root@master ~]# scontrol show config | grep prolog -i
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = /tmp/prolog.sh
PrologFlags             = (null)
ResvProlog              = (null)
SrunProlog              = (null)
TaskProlog              = (null)
[root@master ~]# cat /tmp/prolog.sh
#!/bin/bash
if [ -e /tmp/ok ]; then
  exit 0
fi
exit 99
[root@master ~]#
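The prolog's branch logic can be exercised outside Slurm. The sketch below mirrors the script above, but substitutes a temporary file for /tmp/ok (the temp path and the `prolog_rc` helper are assumptions of this sketch, not part of the ticket):

```shell
# Stand-alone check of the prolog's branch logic, using a temp flag
# file in place of /tmp/ok.
flag=$(mktemp -u)                      # a path that does not exist yet
prolog_rc() { [ -e "$flag" ] && echo 0 || echo 99; }
first=$(prolog_rc)                     # flag absent  -> 99
touch "$flag"
second=$(prolog_rc)                    # flag present -> 0
rm -f "$flag"
echo "first=$first second=$second"
```

The script itself is correct: it returns 99 until the flag file exists and 0 afterwards, so only the first prolog run for the job should fail.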

2) Submit a job and check the logs of slurmctld:

[2020-11-11T10:21:09.681] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
[2020-11-11T10:21:09.681] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:09.681] debug2: sched: JobId=5 allocated resources: NodeList=(null)
[2020-11-11T10:21:09.681] _slurm_rpc_submit_batch_job: JobId=5 InitPrio=4294901756 usec=548
[2020-11-11T10:21:10.430] debug:  sched: Running job scheduler
[2020-11-11T10:21:10.430] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:10.430] sched: Allocate JobId=5 NodeList=node001 #CPUs=1 Partition=defq
[2020-11-11T10:21:10.431] debug2: Performing full system state save
[2020-11-11T10:21:10.433] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:21:10.437] error: _run_script JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] error: prolog_slurmctld JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:21:10.437] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:21:10.437] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:21:10.437] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:21:10.444] debug2: Tree head got back 1
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:21:10.452] debug2: node_did_resp node001
[2020-11-11T10:21:10.452] debug:  sched: Running job scheduler
[2020-11-11T10:21:11.435] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-11-11T10:21:11.435] debug2: _slurm_rpc_dump_partitions, size=193 usec=128
[2020-11-11T10:21:16.185] debug:  backfill: beginning
[2020-11-11T10:21:16.186] debug:  backfill: no jobs to backfill
[2020-11-11T10:21:16.516] debug2: Testing job time limits and checkpoints



3) Touch /tmp/ok

4) Check the logs after a while; they will contain these messages repeated endlessly:

[2020-11-11T10:25:46.190] debug2: backfill: entering _try_sched for JobId=5.
[2020-11-11T10:25:46.190] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:25:46.190] debug2: select_nodes: calling _get_req_features() for JobId=5 with not NULL job resources
[2020-11-11T10:25:46.191] backfill: Started JobId=5 in defq on node001
[2020-11-11T10:25:46.201] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:25:46.204] debug2: _run_script JobId=5 prolog completed
[2020-11-11T10:25:46.204] debug:  _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:25:46.204] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:25:46.204] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:25:46.204] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:25:46.212] debug2: Tree head got back 1
[2020-11-11T10:25:46.212] Requeuing JobId=5
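Each requeue cycle adds one "Requeuing JobId=5" line to slurmctld's log, so counting those lines gauges how many cycles have occurred. A self-contained sketch using the two excerpted lines (the temp file stands in for the real slurmctld log, whose path is site-specific):

```shell
# Count requeue cycles by counting "Requeuing" lines; a temp file
# stands in for the real slurmctld log in this sketch.
log=$(mktemp)
cat > "$log" <<'EOF'
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:25:46.212] Requeuing JobId=5
EOF
count=$(grep -c 'Requeuing JobId=5' "$log")   # one line per requeue cycle
rm -f "$log"
echo "requeue cycles so far: $count"
```

Against the live log the count keeps growing, even after /tmp/ok exists and the prolog starts exiting 0.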



The job status remains pending forever:

[root@master ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 5      defq   job.sh cmsuppor PD       0:00      1 (BeginTime) 
[root@master ~]#

The issue is not reproducible on Slurm 19.05.7.