On a cluster running Slurm 20.11.8, multiple nodes went into the DRAIN state with reason "Duplicate jobid". We were able to reproduce the problem with the following steps:

$ sbatch --wrap='sleep 300s'
Submitted batch job 6684

# Do not wait for the job to be scheduled on a node and for the prolog to start executing

$ srun --jobid=6684 --mpi=none --overlap --pty bash
srun: launch/slurm: launch_p_step_launch: StepId=6684.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
srun: error: se0007: task 0: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=6684.0

The node is now drained and requires admin intervention:

$ sinfo -R
REASON               USER      TIMESTAMP            NODELIST
Duplicate jobid      slurm     2022-01-12T14:08:28  se0007

From the Slurm log:

2022-01-12T14:07:44.520448-08:00 slurm-se01 slurmctld[1995474]: _slurm_rpc_submit_batch_job: JobId=6684 InitPrio=16836 usec=11351
2022-01-12T14:07:44.954341-08:00 slurm-se01 slurmctld[1995474]: sched: Allocate JobId=6684 NodeList=se0007 #CPUs=256 Partition=se
2022-01-12T14:08:28.063360-08:00 slurm-se01 slurmctld[1995474]: _slurm_rpc_requeue: 6684: Requested operation is presently disabled
2022-01-12T14:08:28.064238-08:00 slurm-se01 slurmctld[1995474]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=6684 uid 0
2022-01-12T14:08:28.066024-08:00 slurm-se01 slurmctld[1995474]: drain_nodes: node se0007 state set to DRAIN
2022-01-12T14:08:28.066245-08:00 slurm-se01 slurmctld[1995474]: error: Duplicate jobid on nodes se0007, set to state DRAIN

This only happens when "srun --jobid" is used while the prolog is still being executed on the node; our prologs do a lot of checks and preparation, so they take approximately 60s. I haven't been able to reproduce this issue on a local Slurm installation so far (with all commands and daemons running on the same node).
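In case it helps with a local reproduction: the real prolog is site-specific, but the only thing that seems to matter is that the node stays in the prolog phase long enough to run the "srun --jobid" command. A minimal stand-in could look like the sketch below (the path and contents are illustrative, not our actual prolog), pointed to by Prolog=/etc/slurm/prolog.sh in slurm.conf:

#!/bin/bash
# Hypothetical stand-in for our real prolog: it only simulates the ~60s of
# node checks/preparation by sleeping, which keeps the prolog running long
# enough to issue the "srun --jobid" command against the pending batch job.
sleep 60
exit 0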
Hi Felix,

> On a cluster running Slurm 20.11.8, multiple nodes went into the DRAIN state
> with reason "Duplicate jobid". We were able to reproduce the problem with
> the following steps:
>
> $ sbatch --wrap='sleep 300s'
> Submitted batch job 6684
>
> # Do not wait for the job to be scheduled on a node and for the prolog to
> start executing
>
> $ srun --jobid=6684 --mpi=none --overlap --pty bash
> srun: launch/slurm: launch_p_step_launch: StepId=6684.0 aborted before step
> completely launched.
> srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
> srun: error: se0007: task 0: Terminated
> srun: launch/slurm: _step_signal: Terminating StepId=6684.0
>
> The node is now drained and requires admin intervention:
>
> $ sinfo -R
> REASON               USER      TIMESTAMP            NODELIST
> Duplicate jobid      slurm     2022-01-12T14:08:28  se0007

Thanks for the reproducer.

> I haven't been able to reproduce this issue on a local Slurm installation so
> far (with all commands and daemons running on the same node).

I'm in the same situation for the moment, but I'll keep trying.

> From the Slurm log:
> 2022-01-12T14:07:44.520448-08:00 slurm-se01 slurmctld[1995474]:
> _slurm_rpc_submit_batch_job: JobId=6684 InitPrio=16836 usec=11351
> 2022-01-12T14:07:44.954341-08:00 slurm-se01 slurmctld[1995474]: sched:
> Allocate JobId=6684 NodeList=se0007 #CPUs=256 Partition=se
> 2022-01-12T14:08:28.063360-08:00 slurm-se01 slurmctld[1995474]:
> _slurm_rpc_requeue: 6684: Requested operation is presently disabled
> 2022-01-12T14:08:28.064238-08:00 slurm-se01 slurmctld[1995474]:
> _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=6684 uid 0
> 2022-01-12T14:08:28.066024-08:00 slurm-se01 slurmctld[1995474]: drain_nodes:
> node se0007 state set to DRAIN
> 2022-01-12T14:08:28.066245-08:00 slurm-se01 slurmctld[1995474]: error:
> Duplicate jobid on nodes se0007, set to state DRAIN

It seems that the job got requeued, and this looks important for reproducing the issue. Could you share the full slurmctld and slurmd logs from a reproduction, and also attach your slurm.conf?

> This only happens when "srun --jobid" is used while the prolog is still
> being executed on the node; our prologs do a lot of checks and preparation,
> so they take approximately 60s.

I guess that it's necessary in your case, but 60s is quite long, and you are probably seeing some warnings in the slurmd logs. In any case, the issue shouldn't happen.

Regards,
Albert
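P.S. If it helps while collecting the logs, one way to temporarily raise the verbosity is the following (just a suggestion, adapt it to your site's practice):

$ scontrol setdebug debug2    # raise slurmctld verbosity at runtime
$ scontrol reconfigure        # after setting SlurmdDebug=debug2 in slurm.conf, so the daemons re-read it
$ scontrol setdebug info      # lower the slurmctld level again once you are done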
I'm really sorry, but I realized that we were still carrying the v2 patch from https://bugs.schedmd.com/show_bug.cgi?id=12801, and it seems to be the cause of this bug. After applying that patch to my local installation, I was able to reproduce the bug by setting "PrologFlags=Alloc,Serial,DeferBatch" in my conf. I will make a note about this on the other bug.

We haven't reverted this patch on our cluster yet, but I will close this bug for now.
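For reference, the relevant lines of my local test slurm.conf were roughly the following (the prolog path is illustrative; the flags are exactly the ones above):

Prolog=/etc/slurm/prolog.sh
PrologFlags=Alloc,Serial,DeferBatch

With the bug 12801 v2 patch applied on top, the repro from the first comment then drains the node.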
Re-opening: this now also happens on the master branch, since DeferBatch is upstream (see https://bugs.schedmd.com/show_bug.cgi?id=11635#c43). The repro is the same as in the first comment.
We will work on solving this issue in bug 11635.

*** This ticket has been marked as a duplicate of ticket 11635 ***