Ticket 13187 - Node drained with "duplicate jobid" when using "srun --jobid"
Summary: Node drained with "duplicate jobid" when using "srun --jobid"
Status: RESOLVED DUPLICATE of ticket 11635
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Albert Gil
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-01-12 15:18 MST by Felix Abecassis
Modified: 2022-02-18 06:14 MST
CC List: 3 users

See Also:
Site: NVIDIA (PSLA)
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Felix Abecassis 2022-01-12 15:18:49 MST
On a cluster running Slurm 20.11.8, multiple nodes went into the DRAIN state with reason "Duplicate jobid". We were able to reproduce the problem with the following steps:

$ sbatch --wrap='sleep 300s'
Submitted batch job 6684

# Now wait for the job to be scheduled on a node and for the prolog to start executing

$ srun --jobid=6684 --mpi=none --overlap --pty bash
srun: launch/slurm: launch_p_step_launch: StepId=6684.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
srun: error: se0007: task 0: Terminated
srun: launch/slurm: _step_signal: Terminating StepId=6684.0


The node is now drained, and requires admin intervention:
$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Duplicate jobid      slurm     2022-01-12T14:08:28 se0007
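
For reference, clearing this drain state requires an administrator to resume the node by hand; a minimal sketch of the usual recovery step (assuming the node name from the output above):

$ scontrol update NodeName=se0007 State=RESUME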


From the Slurm log:
2022-01-12T14:07:44.520448-08:00 slurm-se01 slurmctld[1995474]: _slurm_rpc_submit_batch_job: JobId=6684 InitPrio=16836 usec=11351
2022-01-12T14:07:44.954341-08:00 slurm-se01 slurmctld[1995474]: sched: Allocate JobId=6684 NodeList=se0007 #CPUs=256 Partition=se
2022-01-12T14:08:28.063360-08:00 slurm-se01 slurmctld[1995474]: _slurm_rpc_requeue: 6684: Requested operation is presently disabled
2022-01-12T14:08:28.064238-08:00 slurm-se01 slurmctld[1995474]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=6684 uid 0
2022-01-12T14:08:28.066024-08:00 slurm-se01 slurmctld[1995474]: drain_nodes: node se0007 state set to DRAIN
2022-01-12T14:08:28.066245-08:00 slurm-se01 slurmctld[1995474]: error: Duplicate jobid on nodes se0007, set to state DRAIN


This only happens when "srun --jobid" is used while the prolog is still being executed on the node. Our prologs do a lot of checks and preparation, so they take approximately 60s.

I couldn't get a repro of this issue on a local Slurm installation so far (with all commands and daemons running on the same node).
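
On a test installation, a deliberately slow prolog should widen the race window enough to hit this; a minimal sketch (hypothetical script and path, not our actual production prolog):

#!/bin/bash
# /etc/slurm/slow_prolog.sh (hypothetical path; pointed to by Prolog= in slurm.conf)
# Simulate a site prolog that takes about a minute, so that
# "srun --jobid" can be issued while the prolog is still running.
sleep 60
exit 0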
Comment 1 Albert Gil 2022-01-13 07:50:34 MST
Hi Felix,

> On a cluster running Slurm 20.11.8, multiple nodes went into the DRAIN state
> with reason "Duplicate jobid". We were able to reproduce the problem with
> the following steps:
> 
> $ sbatch --wrap='sleep 300s'
> Submitted batch job 6684
> 
> # Now wait for the job to be scheduled on a node and for the prolog to start
> executing
> 
> $ srun --jobid=6684 --mpi=none --overlap --pty bash
> srun: launch/slurm: launch_p_step_launch: StepId=6684.0 aborted before step
> completely launched.
> srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
> srun: error: se0007: task 0: Terminated
> srun: launch/slurm: _step_signal: Terminating StepId=6684.0
> 
> 
> The node is now drained, and requires admin intervention:
> $ sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Duplicate jobid      slurm     2022-01-12T14:08:28 se0007

Thanks for the reproducer.

> I couldn't get a repro of this issue on a local Slurm installation so far
> (with all commands and daemons running on the same node).

I'm in the same situation for the moment, but I'll keep trying.

> From the Slurm log:
> 2022-01-12T14:07:44.520448-08:00 slurm-se01 slurmctld[1995474]:
> _slurm_rpc_submit_batch_job: JobId=6684 InitPrio=16836 usec=11351
> 2022-01-12T14:07:44.954341-08:00 slurm-se01 slurmctld[1995474]: sched:
> Allocate JobId=6684 NodeList=se0007 #CPUs=256 Partition=se
> 2022-01-12T14:08:28.063360-08:00 slurm-se01 slurmctld[1995474]:
> _slurm_rpc_requeue: 6684: Requested operation is presently disabled
> 2022-01-12T14:08:28.064238-08:00 slurm-se01 slurmctld[1995474]:
> _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=6684 uid 0
> 2022-01-12T14:08:28.066024-08:00 slurm-se01 slurmctld[1995474]: drain_nodes:
> node se0007 state set to DRAIN
> 2022-01-12T14:08:28.066245-08:00 slurm-se01 slurmctld[1995474]: error:
> Duplicate jobid on nodes se0007, set to state DRAIN

It seems that the job got requeued.
This seems important for reproducing the issue.

Could you share the full logs of slurmctld and slurmd when reproducing the issue?
Also, could you attach your slurm.conf?
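
If it helps, the log locations and the active configuration file can be listed with something like this (a sketch; parameter names as printed by scontrol show config):

$ scontrol show config | grep -E 'SlurmctldLogFile|SlurmdLogFile|SLURM_CONF'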

> This only happens when "srun --jobid" is used while the prolog is still
> being executed on the node. Our prologs do a lot of checks and preparation,
> so they take approximately 60s.

I guess that it's necessary in your case, but 60s is quite long; you are probably getting some warnings in the slurmd logs.
In any case, this issue shouldn't happen.


Regards,
Albert
Comment 2 Felix Abecassis 2022-01-13 18:34:34 MST
I'm really sorry, but I realized that we were still carrying the v2 patch from https://bugs.schedmd.com/show_bug.cgi?id=12801, and it seems to be the cause of this bug. I applied that patch locally, and I was then able to reproduce the bug after setting "PrologFlags=Alloc,Serial,DeferBatch" in my configuration.

I will make a note about this on the other bug. We haven't reverted this patch on our cluster yet, but I will close this bug for now.
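
For completeness, the relevant part of the configuration used to reproduce this looked roughly like the following excerpt (the Prolog path is just an example):

Prolog=/etc/slurm/prolog.sh
PrologFlags=Alloc,Serial,DeferBatch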
Comment 3 Felix Abecassis 2022-02-17 20:44:26 MST
Re-opening: this is now happening on the master branch, now that DeferBatch is upstream (see https://bugs.schedmd.com/show_bug.cgi?id=11635#c43). Same reproducer as in the first comment.
Comment 4 Dominik Bartkiewicz 2022-02-18 06:14:22 MST
We will work on solving this issue in bug 11635.

*** This ticket has been marked as a duplicate of ticket 11635 ***