Ticket 5048

Summary: sporadic duplicate jobid when jobs are requeued
Product: Slurm
Reporter: Sophie Créno <sophie.creno>
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: slurm-support, tim, vlegrand
Version: 17.11.5
Version Fixed: 17.11.6
Hardware: Linux
OS: Linux
Site: Institut Pasteur
Attachments: current slurm.conf
slurmd of node 442
slurmd of requeued job 30534282
log of slurmd regarding multi-times requeued job 30534282

Description Sophie Créno 2018-04-10 07:39:14 MDT
Created attachment 6581 [details]
current slurm.conf

Hello,

Since the upgrade from 16.05.9 to 17.11.5 we have experienced duplicate
jobids at least once a week. The affected jobs all seem to be automatically
requeued jobs. The requeue is triggered by the script itself exiting
with code 99 (cf. RequeueExit=99 in our slurm.conf). Example:

2018-04-08T12:10:14+02:00 tars-442 slurmd[22751]: error: Job 30534282 already running, do not launch second copy
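
For clarity, the mechanism involved is just an exit-code convention: slurm.conf carries `RequeueExit=99`, and the batch script exits 99 whenever it wants to be run again. A minimal Slurm-free sketch of that convention (the `run_step` function is a placeholder, not the real workload):

```shell
#!/bin/sh
# Sketch of the RequeueExit convention only; nothing here talks to Slurm.
# With RequeueExit=99 in slurm.conf, a batch script that exits with code 99
# is automatically requeued by slurmctld instead of being marked FAILED.
run_step() {
  # placeholder for the real work; return 99 to request a requeue
  return 99
}
rc=0
run_step || rc=$?
echo "step exit code: $rc (99 => slurmctld would requeue the job)"
```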


The above job started for the first time at 2018-03-30T19:17:54. 
Here are the iterations before and after the error message:

       JobID    JobName  Partition        QOS      State              Submit               Start    Elapsed                 End ExitCode        NodeList 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ------------------- -------- --------------- 
30533734_548      ez_50  dedicated       fast   REQUEUED 2018-04-08T11:41:11 2018-04-08T11:43:17   00:24:53 2018-04-08T12:08:10     99:0        tars-429 
30533734_54+      batch                           FAILED 2018-04-08T11:43:17 2018-04-08T11:43:17   00:24:53 2018-04-08T12:08:10     99:0        tars-429 
30533734_54+      mcell                        CANCELLED 2018-04-08T11:43:18 2018-04-08T11:43:18   00:24:54 2018-04-08T12:08:12     0:15        tars-429 
30533734_548      ez_50  dedicated       fast   REQUEUED 2018-04-08T12:08:12 2018-04-08T12:10:13   00:00:01 2018-04-08T12:10:14      0:0        tars-442 
30533734_548      ez_50  dedicated       fast   REQUEUED 2018-04-08T12:10:14 2018-04-08T12:12:19   00:24:52 2018-04-08T12:37:11     99:0        tars-430 
30533734_54+      batch                           FAILED 2018-04-08T12:12:19 2018-04-08T12:12:19   00:24:52 2018-04-08T12:37:11     99:0        tars-430 
30533734_54+      mcell                        CANCELLED 2018-04-08T12:12:20 2018-04-08T12:12:20   00:24:52 2018-04-08T12:37:12     0:15        tars-430 
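
A table like the one above can be produced with a sacct query along these lines (the field list is inferred from the column headers; the exact format string used may have differed):

```shell
# Query accounting records for the affected array task; all fields below
# are standard sacct format fields.
sacct -j 30533734_548 \
      --format=JobID,JobName,Partition,QOS,State,Submit,Start,Elapsed,End,ExitCode,NodeList
```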


We don't understand that message, since only one instance of the job
appears in the sacct output at that time.
The annoying part is that the node is then systematically drained
by Slurm.
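
For reference, once the cause on the node has been checked, a drained node can be returned to service with a standard scontrol invocation (node name taken from the log above; the command itself is stock Slurm administration):

```shell
# Return the drained node to service after investigating the duplicate-job
# error; Slurm will schedule new jobs on it again once resumed.
scontrol update NodeName=tars-442 State=RESUME
```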

Thanks in advance,
Comment 1 Dominik Bartkiewicz 2018-04-26 09:32:03 MDT
Hi

Could you send me slurmd.log from affected node?

Dominik
Comment 2 Sophie Créno 2018-04-27 02:57:42 MDT
Created attachment 6700 [details]
slurmd of node 442

Hello,

I am attaching two files:
* the slurmd log of node 442, which was drained due to duplicate job 30534282
* the concatenated slurmd output regarding the multiple requeues/submissions
of job 30534282.

Thanks
Comment 3 Sophie Créno 2018-04-27 02:58:41 MDT
Created attachment 6701 [details]
slurmd of requeued job 30534282
Comment 4 Dominik Bartkiewicz 2018-05-02 06:25:31 MDT
Hi

I still can't find the reason for this behaviour.
I didn't find any of the log messages needed to confirm my previous suspicions.
Could you send me an older slurmd.log covering the beginning of job 30534282?


Dominik
Comment 5 Sophie Créno 2018-05-02 06:51:47 MDT
Created attachment 6747 [details]
log of slurmd regarding multi-times requeued job 30534282

Since it was requeued for more than a month, I am only attaching the lines referring to that job. If you need more, please tell me what I should add.

Thanks,
Comment 6 Dominik Bartkiewicz 2018-05-02 08:50:26 MDT
Hi

Another idea.
Did you restart node 442 after this issue?

Could you check whether you see a slurmstepd process that shouldn't be there, for example for job 30534282? Could you generate a backtrace from it?
Which distribution do you use?
And which version of glibc do you currently have?

Dominik
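
In practice, such a check on the compute node might look like this (standard Linux/gdb tooling, not Slurm-specific; the PID is a placeholder to be taken from the ps output):

```shell
# Look for a leftover slurmstepd still claiming job 30534282; slurmstepd
# sets its process title to "slurmstepd: [<jobid>.<stepid>]".
ps -eo pid,args | grep '[s]lurmstepd'

# If an unexpected one is found, capture a backtrace of all its threads
# (replace 12345 with the PID from the ps output; needs gdb and, ideally,
# slurm debug symbols installed):
gdb -p 12345 -batch -ex 'thread apply all bt'
```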
Comment 7 Sophie Créno 2018-05-02 09:11:53 MDT
> Did you restart node 442 after this issue?
No. 
$ uptime
 17:03:55 up 54 days, 16:30,  1 user,  load average: 11.74, 8.32, 7.76


> Could you check if you see slurmstepd process that shouldn't be there, for
> example for job 30534282? Could you generate the backtrace from it?
Everything I see with pstree corresponds to the 2 jobs supposed to be running on it:

JOBID     NAME       CPUS  NODES START_TIME            TIME        NODELIST                      
33693596  bt2ST34    6     1     2018-05-02T17:01:09   1:16        tars-442                      
33693531  bt2ST34    6     1     2018-05-02T17:00:46   1:39        tars-442


> Which distribution do you use?
$ cat /etc/centos-release 
CentOS release 6.9 (Final)


> And which version of glibc you currently have?
GNU C Library stable release version 2.12

Does it help?
Comment 8 Dominik Bartkiewicz 2018-05-03 10:01:52 MDT
Hi

Yes, it helps: I verified that one of my ideas was wrong :)
We have some problems with slurmstepd deadlocking, but they occur only on CentOS 7.

Dominik
Comment 9 Dominik Bartkiewicz 2018-05-08 09:50:51 MDT
Hi

I can reproduce this and I will let you know when we fix this.

Dominik
Comment 11 Dominik Bartkiewicz 2018-05-10 05:34:35 MDT
Hi

https://github.com/SchedMD/slurm/commit/3b02902149637 should fix this; the commit is included in 17.11.6.
There is still a hypothetical race that will be fixed in 17.11.7.

Dominik
Comment 12 Véronique Legrand 2018-05-14 02:36:23 MDT
Hello Dominik,

Thank you for indicating the fix. What do you mean by "There is still a hypothetical race"? Can you tell us a little bit more? I am wondering whether it is worth applying a patch right now, or whether we would do better to wait until the whole problem is fixed.

Regards,

Véronique
Comment 13 Dominik Bartkiewicz 2018-05-14 03:30:38 MDT
(In reply to Véronique Legrand from comment #12)
> Hello Dominik,
> 
> Thank you for indicating the fix. What do you mean "There is still
> hypothetical race"? Can you tell us a little bit more? I am wondering if it
> is worth applying a patch right now or if we wouldn't better wait until the
> whole problem is fixed.
> 
> Regards,
> 
> Véronique
Hi

I recommend updating to 17.11.6; 17.11.6 contains a lot of other fixes too.
Hitting this issue now requires a strange sequence of 3 RPCs, and even with code hacks (adding sleeps in a few places) it is hard to trigger.
I don't think you can hit this issue anymore on 17.11.6.

Dominik
Comment 14 Véronique Legrand 2018-05-14 03:36:22 MDT
Thank you for your advice. I will discuss version upgrade with my colleague Sophie when she is back.

Regards,

Véronique
Comment 19 Dominik Bartkiewicz 2018-07-19 03:04:23 MDT
Hi

I'm marking this ticket as resolved/fixed.
As always, please feel free to reopen if you have
additional questions.

Dominik
Comment 20 Dominik Bartkiewicz 2018-08-31 02:27:43 MDT
*** Ticket 5639 has been marked as a duplicate of this ticket. ***