Ticket 5048

Summary: sporadic duplicate jobid when jobs are requeued
Product: Slurm
Reporter: Sophie Créno <sophie.creno>
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: slurm-support, tim, vlegrand
Version: 17.11.5
Version Fixed: 17.11.6
Hardware: Linux
OS: Linux
Site: Institut Pasteur
Attachments: current slurm.conf
slurmd of node 442
slurmd of requeued job 30534282
log of slurmd regarding multi-times requeued job 30534282

Description Sophie Créno 2018-04-10 07:39:14 MDT
Created attachment 6581 [details]
current slurm.conf

Hello,

Since the upgrade from 16.05.9 to 17.11.5 we have experienced duplicate
jobids at least once a week. The affected jobs all seem to be automatically
requeued jobs. The requeue is triggered by the script itself exiting
with code 99 (cf. RequeueExit=99 in our slurm.conf). Example:

2018-04-08T12:10:14+02:00 tars-442 slurmd[22751]: error: Job 30534282 already running, do not launch second copy
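
For clarity, the mechanism involved is just an exit-code convention: slurm.conf carries `RequeueExit=99`, and the batch script exits 99 whenever it wants to be run again. A minimal Slurm-free sketch of that convention (the `run_step` function is a placeholder, not the real workload):

```shell
#!/bin/sh
# Sketch of the RequeueExit convention only; nothing here talks to Slurm.
# With RequeueExit=99 in slurm.conf, a batch script that exits with code 99
# is automatically requeued by slurmctld instead of being marked FAILED.
run_step() {
  # placeholder for the real work; return 99 to request a requeue
  return 99
}
rc=0
run_step || rc=$?
echo "step exit code: $rc (99 => slurmctld would requeue the job)"
```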


The above job started for the first time at 2018-03-30T19:17:54. 
Here are the iterations before and after the error message:

       JobID    JobName  Partition        QOS      State              Submit               Start    Elapsed                 End ExitCode        NodeList 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ------------------- -------- --------------- 
30533734_548      ez_50  dedicated       fast   REQUEUED 2018-04-08T11:41:11 2018-04-08T11:43:17   00:24:53 2018-04-08T12:08:10     99:0        tars-429 
30533734_54+      batch                           FAILED 2018-04-08T11:43:17 2018-04-08T11:43:17   00:24:53 2018-04-08T12:08:10     99:0        tars-429 
30533734_54+      mcell                        CANCELLED 2018-04-08T11:43:18 2018-04-08T11:43:18   00:24:54 2018-04-08T12:08:12     0:15        tars-429 
30533734_548      ez_50  dedicated       fast   REQUEUED 2018-04-08T12:08:12 2018-04-08T12:10:13   00:00:01 2018-04-08T12:10:14      0:0        tars-442 
30533734_548      ez_50  dedicated       fast   REQUEUED 2018-04-08T12:10:14 2018-04-08T12:12:19   00:24:52 2018-04-08T12:37:11     99:0        tars-430 
30533734_54+      batch                           FAILED 2018-04-08T12:12:19 2018-04-08T12:12:19   00:24:52 2018-04-08T12:37:11     99:0        tars-430 
30533734_54+      mcell                        CANCELLED 2018-04-08T12:12:20 2018-04-08T12:12:20   00:24:52 2018-04-08T12:37:12     0:15        tars-430 
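
A table like the one above can be produced with a sacct query along these lines (the field list is inferred from the column headers; the exact format string used may have differed):

```shell
# Query accounting records for the affected array task; all fields below
# are standard sacct format fields.
sacct -j 30533734_548 \
      --format=JobID,JobName,Partition,QOS,State,Submit,Start,Elapsed,End,ExitCode,NodeList
```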


We don't understand that message, since only one instance of the job
appears in the sacct output at that time.
The annoying part is that the node is then systematically drained
by Slurm.
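
For reference, once the cause on the node has been checked, a drained node can be returned to service with a standard scontrol invocation (node name taken from the log above; the command itself is stock Slurm administration):

```shell
# Return the drained node to service after investigating the duplicate-job
# error; Slurm will schedule new jobs on it again once resumed.
scontrol update NodeName=tars-442 State=RESUME
```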

Thanks in advance,
Comment 1 Dominik Bartkiewicz 2018-04-26 09:32:03 MDT
Hi

Could you send me slurmd.log from affected node?

Dominik
Comment 2 Sophie Créno 2018-04-27 02:57:42 MDT
Created attachment 6700 [details]
slurmd of node 442

Hello,

I am attaching two files:
* the slurmd log of node 442, which was drained due to duplicate job 30534282
* the concatenated slurmd output regarding the multiple requeues/submissions
of job 30534282.

Thanks
Comment 3 Sophie Créno 2018-04-27 02:58:41 MDT
Created attachment 6701 [details]
slurmd of requeued job 30534282
Comment 4 Dominik Bartkiewicz 2018-05-02 06:25:31 MDT
Hi

I still can't find the reason for this behaviour.
I didn't find any of the log messages needed to confirm my previous suspicions.
Could you send me an older slurmd.log covering the beginning of job 30534282?


Dominik
Comment 5 Sophie Créno 2018-05-02 06:51:47 MDT
Created attachment 6747 [details]
log of slurmd regarding multi-times requeued job 30534282

Since it was requeued for more than a month, I am only attaching the lines referring to that job. If you need more, please tell me what I should add.

Thanks,
Comment 6 Dominik Bartkiewicz 2018-05-02 08:50:26 MDT
Hi

Another idea.
Did you restart node 442 after this issue?

Could you check whether you see a slurmstepd process that shouldn't be there, for example for job 30534282? Could you generate a backtrace from it?
Which distribution do you use?
And which version of glibc do you currently have?

Dominik
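
In practice, such a check on the compute node might look like this (standard Linux/gdb tooling, not Slurm-specific; the PID is a placeholder to be taken from the ps output):

```shell
# Look for a leftover slurmstepd still claiming job 30534282; slurmstepd
# sets its process title to "slurmstepd: [<jobid>.<stepid>]".
ps -eo pid,args | grep '[s]lurmstepd'

# If an unexpected one is found, capture a backtrace of all its threads
# (replace 12345 with the PID from the ps output; needs gdb and, ideally,
# slurm debug symbols installed):
gdb -p 12345 -batch -ex 'thread apply all bt'
```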
Comment 7 Sophie Créno 2018-05-02 09:11:53 MDT
> Did you restart node 442 after this issue?
No. 
$ uptime
 17:03:55 up 54 days, 16:30,  1 user,  load average: 11.74, 8.32, 7.76


> Could you check if you see slurmstepd process that shouldn't be there, for
> example for job 30534282? Could you generate the backtrace from it?
Everything I see with pstree corresponds to the 2 jobs supposed to be running on it:

JOBID     NAME       CPUS  NODES START_TIME            TIME        NODELIST                      
33693596  bt2ST34    6     1     2018-05-02T17:01:09   1:16        tars-442                      
33693531  bt2ST34    6     1     2018-05-02T17:00:46   1:39        tars-442


> Which distribution do you use?
$ cat /etc/centos-release 
CentOS release 6.9 (Final)


> And which version of glibc you currently have?
GNU C Library stable release version 2.12

Does it help?
Comment 8 Dominik Bartkiewicz 2018-05-03 10:01:52 MDT
Hi

Yes, it helps: I verified that one of my ideas was wrong :)
We have some problems with slurmstepd deadlocking, but they occur only on CentOS 7.

Dominik
Comment 9 Dominik Bartkiewicz 2018-05-08 09:50:51 MDT
Hi

I can reproduce this and I will let you know when we fix this.

Dominik
Comment 11 Dominik Bartkiewicz 2018-05-10 05:34:35 MDT
Hi

https://github.com/SchedMD/slurm/commit/3b02902149637 should fix this; the commit is included in 17.11.6.
There is still a hypothetical race that will be fixed in 17.11.7.

Dominik
Comment 12 Véronique Legrand 2018-05-14 02:36:23 MDT
Hello Dominik,

Thank you for indicating the fix. What do you mean by "There is still a hypothetical race"? Can you tell us a little bit more? I am wondering whether it is worth applying a patch right now, or whether we would do better to wait until the whole problem is fixed.

Regards,

Véronique
Comment 13 Dominik Bartkiewicz 2018-05-14 03:30:38 MDT
(In reply to Véronique Legrand from comment #12)
> Hello Dominik,
> 
> Thank you for indicating the fix. What do you mean "There is still
> hypothetical race"? Can you tell us a little bit more? I am wondering if it
> is worth applying a patch right now or if we wouldn't better wait until the
> whole problem is fixed.
> 
> Regards,
> 
> Véronique
Hi

I recommend updating to 17.11.6; 17.11.6 contains a lot of other fixes too.
Hitting this issue now requires a strange sequence of 3 RPCs, and even with code hacks (adding sleeps in a few places) it is hard to trigger.
I don't think you can hit this issue anymore on 17.11.6.

Dominik
Comment 14 Véronique Legrand 2018-05-14 03:36:22 MDT
Thank you for your advice. I will discuss version upgrade with my colleague Sophie when she is back.

Regards,

Véronique
Comment 19 Dominik Bartkiewicz 2018-07-19 03:04:23 MDT
Hi

I'm marking this ticket as resolved/fixed.
As always, please feel free to reopen if you have
additional questions.

Dominik
Comment 20 Dominik Bartkiewicz 2018-08-31 02:27:43 MDT
*** Ticket 5639 has been marked as a duplicate of this ticket. ***