| Summary: | Repeated duplicate jobid | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmd | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 17.02.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | GSK | ||
| Version Fixed: | 17.11.7 | ||
| Attachments: | slurmctld log from 29 Aug 2018; slurmd log from 29 Aug 20 | ||
Created attachment 7717 [details]
slurmctld log from 29 Aug 2018
Created attachment 7718 [details]
slurmd log from 29 Aug 20
Hi. We've now had two instances of a duplicate jobid draining a server, and we actually rebooted the server after the first instance. Here is the sinfo/scontrol output from the two occurrences:

1st occurrence:

```
us1sxlx00202 (The Lion): sinfo -Nl
Wed Aug 29 04:46:09 2018
NODELIST     NODES PARTITION       STATE    CPUS S:C:T   MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
us1salx00635     1 us_hpc*         mixed     144 144:1:1 154781    10230      1 (null)   none
us1salx00664     1 us_hpc*         mixed      24 2:12:1  386680     2036      1 (null)   none
us1salx00791     1 us_columbus_prd idle       48 48:1:1  386000     2036      1 (null)   none
us1salx00792     1 us_columbus_prd idle       48 48:1:1  386000     2036      1 (null)   none
us1salx00843     1 us_hpc*         mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00843     1 us_columbus_prd mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00844     1 us_hpc*         mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00844     1 us_columbus_prd mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00928     1 us_hpc*         mixed      48 48:1:1  103000     2036      1 (null)   none
us1salx00942     1 us_hpc*         idle       48 48:1:1  257527     2036      1 (null)   none
us1salx00943     1 us_hpc*         idle       48 48:1:1  257527     2036      1 (null)   none
us1salx00945     1 us_hpc*         mixed     192 192:1:1 206300     2036      1 (null)   none
us1salx00946     1 us_hpc*         draining  192 192:1:1 206300     2036      1 (null)   Duplicate jobid
us1salx00947     1 us_hpc*         mixed     192 192:1:1 206300     2036      1 (null)   none
us1salx00948     1 us_clinical_hpc idle       88 88:1:1  103100     2036      1 (null)   none

us1sxlx00202 (The Lion): scontrol show node us1salx00946
NodeName=us1salx00946 Arch=x86_64 CoresPerSocket=1 CPUAlloc=8 CPUErr=0 CPUTot=192 CPULoad=8.75
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=us1salx00946 NodeHostName=us1salx00946 Version=17.02
   OS=Linux RealMemory=2063000 AllocMem=455680 FreeMem=1944302 Sockets=192 Boards=1
   MemSpecLimit=1024
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=2036 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=us_hpc
   BootTime=2018-07-22T02:47:01 SlurmdStartTime=2018-07-22T02:47:31
   CfgTRES=cpu=192,mem=2063000M
   AllocTRES=cpu=8,mem=445G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Duplicate jobid [slurm@2018-08-29T03:38:27]
```

2nd occurrence:

```
us1sxlx00202 (The Lion): sinfo -Nl
Wed Aug 29 04:46:09 2018
NODELIST     NODES PARTITION       STATE    CPUS S:C:T   MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
us1salx00635     1 us_hpc*         mixed     144 144:1:1 154781    10230      1 (null)   none
us1salx00664     1 us_hpc*         mixed      24 2:12:1  386680     2036      1 (null)   none
us1salx00791     1 us_columbus_prd idle       48 48:1:1  386000     2036      1 (null)   none
us1salx00792     1 us_columbus_prd idle       48 48:1:1  386000     2036      1 (null)   none
us1salx00843     1 us_hpc*         mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00843     1 us_columbus_prd mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00844     1 us_hpc*         mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00844     1 us_columbus_prd mixed      32 4:8:1   257527     2036      1 (null)   none
us1salx00928     1 us_hpc*         mixed      48 48:1:1  103000     2036      1 (null)   none
us1salx00942     1 us_hpc*         idle       48 48:1:1  257527     2036      1 (null)   none
us1salx00943     1 us_hpc*         idle       48 48:1:1  257527     2036      1 (null)   none
us1salx00945     1 us_hpc*         mixed     192 192:1:1 206300     2036      1 (null)   none
us1salx00946     1 us_hpc*         draining  192 192:1:1 206300     2036      1 (null)   Duplicate jobid
us1salx00947     1 us_hpc*         mixed     192 192:1:1 206300     2036      1 (null)   none
us1salx00948     1 us_clinical_hpc idle       88 88:1:1  103100     2036      1 (null)   none

us1sxlx00202 (The Lion): scontrol show node us1salx00946
NodeName=us1salx00946 Arch=x86_64 CoresPerSocket=1 CPUAlloc=8 CPUErr=0 CPUTot=192 CPULoad=8.75
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=us1salx00946 NodeHostName=us1salx00946 Version=17.02
   OS=Linux RealMemory=2063000 AllocMem=455680 FreeMem=1944302 Sockets=192 Boards=1
   MemSpecLimit=1024
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=2036 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=us_hpc
   BootTime=2018-07-22T02:47:01 SlurmdStartTime=2018-07-22T02:47:31
   CfgTRES=cpu=192,mem=2063000M
   AllocTRES=cpu=8,mem=445G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Duplicate jobid [slurm@2018-08-29T03:38:27]
```

I have attached the slurmctld and slurmd logs. It does look like it's the same job id causing the problem:

```
us1salx00945 (The Lion): sacct -D -j 158736
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
158736       chr11chun+     us_hpc                     1  NODE_FAIL      0:0
158736       chr11chun+     us_hpc                     1  NODE_FAIL      0:0
158736       chr11chun+     us_hpc                     1   REQUEUED      0:0
158736       chr11chun+     us_hpc                     1    PENDING      0:0
```

A new Slurm user has just ramped up and is submitting many thousands of jobs. Are we just seeing another instance of the race/deadlock/cgroup issues in 17.02.7, or do we have a specific problem with this job id?

Thanks. Mark.
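A note on the output above: sacct's -D/--duplicates flag lists every accounting record sharing a job id, which is why the repeated NODE_FAIL rows for 158736 are visible at all. A node drained with Reason=Duplicate jobid also stays out of service until an operator clears it. A minimal recovery sketch, assuming the node name from the capture above and that nothing from the stray job is still running there:

```
# Confirm why the node is draining
scontrol show node us1salx00946 | grep -i reason

# Return the node to service, but only after verifying that
# no processes from the duplicate jobid remain on the node
scontrol update NodeName=us1salx00946 State=RESUME
```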
Hi,

In Slurm 17.11 we fixed at least two issues that can generate similar symptoms. I think this is a duplicate of bug 5048; https://github.com/SchedMD/slurm/commit/3b02902149637 should fix it, and that commit is included in 17.11.6.

Dominik

Thanks Dominik. We're working towards 17.11.7. We are also applying the ConstrainKmemSpace=No fix under 17.02.7 as an interim measure, because the cgroups issue is the cause of the node failures, which are then a precursor to other problems due to the requeuing that follows. Please go ahead and close this bug; we can revisit it if we think we're getting the duplicate jobid issue for other reasons.

Thanks. Mark.
Hi,

I'm marking this ticket as a duplicate of 5048; please reopen if needed.

Dominik

*** This ticket has been marked as a duplicate of ticket 5048 ***
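To check whether a given Slurm release carries the fix Dominik references, one can ask git which release tags contain the commit. A sketch, assuming a clone of the SchedMD repository, its slurm-&lt;major&gt;-&lt;minor&gt;-&lt;micro&gt; release-tag naming, and that the abbreviated hash from the URL above is unambiguous in the repo:

```
git clone https://github.com/SchedMD/slurm.git
cd slurm
# List the 17.11.x release tags that contain the fix commit
git tag --contains 3b02902149637 | grep '17-11'
```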