| Summary: | PrologSlurmctld requeues a job endlessly when the script returns 99 the first time | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Taras Shapovalov <taras.shapovalov> |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | jeff, ken.woods |
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
On Slurm 20 (20.02.5), when the slurmctld prolog (PrologSlurmctld) returns exit code 99, all subsequent exit codes are ignored and the job is requeued endlessly.

Steps to reproduce:

1) Set up a test prolog:

```
[root@master ~]# scontrol show config | grep -i prolog
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = /tmp/prolog.sh
PrologFlags             = (null)
ResvProlog              = (null)
SrunProlog              = (null)
TaskProlog              = (null)
[root@master ~]# cat /tmp/prolog.sh
#!/bin/bash
if [ -e /tmp/ok ]; then
    exit 0
fi
exit 99
[root@master ~]#
```

2) Submit a job and check the slurmctld logs:

```
[2020-11-11T10:21:09.681] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=1000
[2020-11-11T10:21:09.681] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:09.681] debug2: sched: JobId=5 allocated resources: NodeList=(null)
[2020-11-11T10:21:09.681] _slurm_rpc_submit_batch_job: JobId=5 InitPrio=4294901756 usec=548
[2020-11-11T10:21:10.430] debug: sched: Running job scheduler
[2020-11-11T10:21:10.430] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:21:10.430] sched: Allocate JobId=5 NodeList=node001 #CPUs=1 Partition=defq
[2020-11-11T10:21:10.431] debug2: Performing full system state save
[2020-11-11T10:21:10.433] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:21:10.437] error: _run_script JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] error: prolog_slurmctld JobId=5 prolog exit status 99:0
[2020-11-11T10:21:10.437] debug: _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:21:10.437] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:21:10.437] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:21:10.437] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:21:10.444] debug2: Tree head got back 1
[2020-11-11T10:21:10.444] Requeuing JobId=5
[2020-11-11T10:21:10.452] debug2: node_did_resp node001
[2020-11-11T10:21:10.452] debug: sched: Running job scheduler
[2020-11-11T10:21:11.435] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2020-11-11T10:21:11.435] debug2: _slurm_rpc_dump_partitions, size=193 usec=128
[2020-11-11T10:21:16.185] debug: backfill: beginning
[2020-11-11T10:21:16.186] debug: backfill: no jobs to backfill
[2020-11-11T10:21:16.516] debug2: Testing job time limits and checkpoints
```

3) Touch /tmp/ok.

4) Check the logs again after a while; they contain these messages, repeated endlessly. Note that the prolog now completes successfully ("prolog completed"), yet the job is still requeued:

```
[2020-11-11T10:25:46.190] debug2: backfill: entering _try_sched for JobId=5.
[2020-11-11T10:25:46.190] debug2: found 2 usable nodes from config containing node[001,002]
[2020-11-11T10:25:46.190] debug2: select_nodes: calling _get_req_features() for JobId=5 with not NULL job resources
[2020-11-11T10:25:46.191] backfill: Started JobId=5 in defq on node001
[2020-11-11T10:25:46.201] debug2: slurmctld_script: creating a new thread for JobId=5
[2020-11-11T10:25:46.204] debug2: _run_script JobId=5 prolog completed
[2020-11-11T10:25:46.204] debug: _job_requeue_op: JobId=5 state 0x8000 reason 0 priority -65540
[2020-11-11T10:25:46.204] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2020-11-11T10:25:46.204] debug2: track_script_remove: thread running script from job removed
[2020-11-11T10:25:46.204] debug2: Tree head got back 0 looking for 1
[2020-11-11T10:25:46.212] debug2: Tree head got back 1
[2020-11-11T10:25:46.212] Requeuing JobId=5
```

The job remains pending forever:

```
[root@master ~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 5      defq   job.sh cmsuppor PD       0:00      1 (BeginTime)
[root@master ~]#
```

The issue is not reproducible on Slurm 19.05.7.
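The two-state behavior of the test prolog can be verified outside of Slurm before wiring it into PrologSlurmctld. The sketch below reimplements the reproduction script as a function (the `prolog` name and the temp-file flag are illustrative, not Slurm conventions) and checks that the exit code flips from 99 to 0 once the flag file exists:

```shell
#!/bin/bash
# Stand-in for /tmp/prolog.sh from the reproduction steps:
# exit 0 once the flag file exists, 99 otherwise.
prolog() {
    if [ -e "$1" ]; then
        exit 0
    fi
    exit 99
}

flag=$(mktemp -u)                 # a path that does not exist yet
( prolog "$flag" )                # run in a subshell so `exit` is contained
echo "before touch: $?"           # expect 99
touch "$flag"
( prolog "$flag" )
echo "after touch: $?"            # expect 0
rm -f "$flag"
```

Running the real `/tmp/prolog.sh` the same way (`/tmp/prolog.sh; echo $?`, before and after `touch /tmp/ok`) confirms the script itself behaves correctly, which points the endless requeue at slurmctld's handling of the earlier exit code 99 rather than at the prolog.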