Ticket 5639 - Repeated duplicate jobid
Summary: Repeated duplicate jobid
Status: RESOLVED DUPLICATE of ticket 5048
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-08-29 08:58 MDT by GSK-ONYX-SLURM
Modified: 2018-08-31 02:27 MDT

See Also:
Site: GSK
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log from 29 Aug 2018 (4.09 MB, text/plain)
2018-08-29 08:59 MDT, GSK-ONYX-SLURM
slurmd log from 29 Aug 20 (307.26 KB, text/plain)
2018-08-29 09:00 MDT, GSK-ONYX-SLURM

Description GSK-ONYX-SLURM 2018-08-29 08:58:11 MDT
Hi.
We've had two instances of a duplicate jobid draining a server, and we rebooted the server after the first instance.

Here is the sinfo/scontrol output from the two occurrences:

1st occurrence:

us1sxlx00202 (The Lion): sinfo -Nl
Wed Aug 29 04:46:09 2018
NODELIST      NODES       PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
us1salx00635      1         us_hpc*       mixed  144  144:1:1 154781    10230      1   (null) none
us1salx00664      1         us_hpc*       mixed   24   2:12:1 386680     2036      1   (null) none
us1salx00791      1 us_columbus_prd        idle   48   48:1:1 386000     2036      1   (null) none
us1salx00792      1 us_columbus_prd        idle   48   48:1:1 386000     2036      1   (null) none
us1salx00843      1         us_hpc*       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00843      1 us_columbus_prd       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00844      1         us_hpc*       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00844      1 us_columbus_prd       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00928      1         us_hpc*       mixed   48   48:1:1 103000     2036      1   (null) none
us1salx00942      1         us_hpc*        idle   48   48:1:1 257527     2036      1   (null) none
us1salx00943      1         us_hpc*        idle   48   48:1:1 257527     2036      1   (null) none
us1salx00945      1         us_hpc*       mixed  192  192:1:1 206300     2036      1   (null) none
us1salx00946      1         us_hpc*    draining  192  192:1:1 206300     2036      1   (null) Duplicate jobid
us1salx00947      1         us_hpc*       mixed  192  192:1:1 206300     2036      1   (null) none
us1salx00948      1 us_clinical_hpc        idle   88   88:1:1 103100     2036      1   (null) none
us1sxlx00202 (The Lion): scontrol show node us1salx00946
NodeName=us1salx00946 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=8 CPUErr=0 CPUTot=192 CPULoad=8.75
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=us1salx00946 NodeHostName=us1salx00946 Version=17.02
   OS=Linux RealMemory=2063000 AllocMem=455680 FreeMem=1944302 Sockets=192 Boards=1
   MemSpecLimit=1024
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=2036 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=us_hpc
   BootTime=2018-07-22T02:47:01 SlurmdStartTime=2018-07-22T02:47:31
   CfgTRES=cpu=192,mem=2063000M
   AllocTRES=cpu=8,mem=445G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Duplicate jobid [slurm@2018-08-29T03:38:27]

2nd occurrence:

us1sxlx00202 (The Lion): sinfo -Nl
Wed Aug 29 04:46:09 2018
NODELIST      NODES       PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
us1salx00635      1         us_hpc*       mixed  144  144:1:1 154781    10230      1   (null) none
us1salx00664      1         us_hpc*       mixed   24   2:12:1 386680     2036      1   (null) none
us1salx00791      1 us_columbus_prd        idle   48   48:1:1 386000     2036      1   (null) none
us1salx00792      1 us_columbus_prd        idle   48   48:1:1 386000     2036      1   (null) none
us1salx00843      1         us_hpc*       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00843      1 us_columbus_prd       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00844      1         us_hpc*       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00844      1 us_columbus_prd       mixed   32    4:8:1 257527     2036      1   (null) none
us1salx00928      1         us_hpc*       mixed   48   48:1:1 103000     2036      1   (null) none
us1salx00942      1         us_hpc*        idle   48   48:1:1 257527     2036      1   (null) none
us1salx00943      1         us_hpc*        idle   48   48:1:1 257527     2036      1   (null) none
us1salx00945      1         us_hpc*       mixed  192  192:1:1 206300     2036      1   (null) none
us1salx00946      1         us_hpc*    draining  192  192:1:1 206300     2036      1   (null) Duplicate jobid
us1salx00947      1         us_hpc*       mixed  192  192:1:1 206300     2036      1   (null) none
us1salx00948      1 us_clinical_hpc        idle   88   88:1:1 103100     2036      1   (null) none
us1sxlx00202 (The Lion): scontrol show node us1salx00946
NodeName=us1salx00946 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=8 CPUErr=0 CPUTot=192 CPULoad=8.75
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=us1salx00946 NodeHostName=us1salx00946 Version=17.02
   OS=Linux RealMemory=2063000 AllocMem=455680 FreeMem=1944302 Sockets=192 Boards=1
   MemSpecLimit=1024
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=2036 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=us_hpc
   BootTime=2018-07-22T02:47:01 SlurmdStartTime=2018-07-22T02:47:31
   CfgTRES=cpu=192,mem=2063000M
   AllocTRES=cpu=8,mem=445G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Duplicate jobid [slurm@2018-08-29T03:38:27]

I have attached slurmctld and slurmd logs.

It does look like it's the same job id causing the problem:

us1salx00945 (The Lion): sacct -D -j 158736
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
158736       chr11chun+     us_hpc                     1  NODE_FAIL      0:0
158736       chr11chun+     us_hpc                     1  NODE_FAIL      0:0
158736       chr11chun+     us_hpc                     1   REQUEUED      0:0
158736       chr11chun+     us_hpc                     1    PENDING      0:0
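For context, once the offending job is no longer present on the node, a drain with a "Duplicate jobid" reason is normally cleared with scontrol. A minimal sketch, using the node name from the output above (requires Slurm administrator privileges, so it is shown here for reference only):

```shell
# Confirm the drain reason before clearing it
scontrol show node us1salx00946 | grep -i reason

# Return the node to service; RESUME only clears the DRAIN flag,
# it does not affect jobs running on or pending for the node
scontrol update NodeName=us1salx00946 State=RESUME
```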

A new user of Slurm has just ramped up his workload and is submitting many thousands of jobs. Are we just seeing another instance of the race/deadlock/cgroup issues in 17.02.7, or do we have a specific issue with a job id?

Thanks.
Mark.
Comment 1 GSK-ONYX-SLURM 2018-08-29 08:59:25 MDT
Created attachment 7717 [details]
slurmctld log from 29 Aug 2018
Comment 2 GSK-ONYX-SLURM 2018-08-29 09:00:00 MDT
Created attachment 7718 [details]
slurmd log from 29 Aug 20
Comment 4 Dominik Bartkiewicz 2018-08-30 09:03:25 MDT
Hi

In Slurm 17.11 we fixed at least two issues that can generate similar symptoms.
I think this is a duplicate of bug 5048.
https://github.com/SchedMD/slurm/commit/3b02902149637 should fix this; this commit is included in 17.11.6.

Dominik
Comment 5 GSK-ONYX-SLURM 2018-08-30 12:13:25 MDT
Thanks Dominik.

We're working towards 17.11.7.  We are also implementing the ConstrainKmemSpace=No fix under 17.02.7 as an interim measure, because the cgroup issue causes the node failures, which in turn are a precursor to other issues due to the requeuing that follows.
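For anyone finding this ticket later, the interim workaround mentioned above is a setting in Slurm's cgroup.conf. A sketch of the relevant fragment (the surrounding settings will differ per site):

```ini
# /etc/slurm/cgroup.conf (fragment)
# Disable kernel memory (kmem) accounting in the task/cgroup plugin.
# On some kernels, kmem cgroup accounting leaks and can eventually
# fail the node, triggering the requeue behaviour described above.
ConstrainKmemSpace=no
```

This requires the task/cgroup plugin to be in use; the change takes effect for newly launched jobs after slurmd picks up the configuration.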

Please go ahead and close this bug.  We can revisit it if we think we're getting the duplicate id issue for other reasons.

Thanks.
Mark.
Comment 6 Dominik Bartkiewicz 2018-08-31 02:27:43 MDT
Hi

I'm marking this ticket as a duplicate of ticket 5048; please reopen if needed.

Dominik

*** This ticket has been marked as a duplicate of ticket 5048 ***