| Summary: | sporadic duplicate jobid when jobs are requeued | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sophie Créno <sophie.creno> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | slurm-support, tim, vlegrand |
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Institut Pasteur | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 17.11.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | current slurm.conf; slurmd of node 442; slurmd of requeued job 30534282; log of slurmd regarding multi-times requeued job 30534282 | | |
Hi,
Could you send me the slurmd.log from the affected node?
Dominik

Created attachment 6700 [details]
slurmd of node 442
Hello,
I am attaching two files:
* the slurmd log of node 442, which was drained due to duplicate job 30534282
* concatenated slurmd output regarding the multiple requeues/submissions
of job 30534282
Thanks
Created attachment 6701 [details]
slurmd of requeued job 30534282
Hi,
I still can't find the reason for this behaviour. I didn't find the log messages needed to confirm my previous suspicions. Could you send me an older slurmd.log covering the start of job 30534282?
Dominik

Created attachment 6747 [details]
log of slurmd regarding multi-times requeued job 30534282
Since the job was being requeued for more than a month, I am only attaching the lines referring to that job. If you need more, please tell me what to add.
Thanks,
Hi,
Another idea: did you restart node 442 after this issue? Could you check whether you see a slurmstepd process that shouldn't be there, for example for job 30534282? If so, could you generate a backtrace from it? Also, which distribution do you use, and which version of glibc do you currently have?
Dominik

> Did you restart node 442 after this issue?

No.

$ uptime
17:03:55 up 54 days, 16:30, 1 user, load average: 11.74, 8.32, 7.76

> Could you check if you see slurmstepd process that shouldn't be there, for
> example for job 30534282? Could you generate the backtrace from it?

All the processes I see with pstree correspond to the 2 jobs supposed to run on that node:

JOBID     NAME     CPUS  NODES  START_TIME           TIME  NODELIST
33693596  bt2ST34  6     1      2018-05-02T17:01:09  1:16  tars-442
33693531  bt2ST34  6     1      2018-05-02T17:00:46  1:39  tars-442

> Which distribution do you use?

$ cat /etc/centos-release
CentOS release 6.9 (Final)

> And which version of glibc you currently have?

GNU C Library stable release version 2.12

Does it help?

Hi,
Yes, it let me verify that one of my ideas was wrong :) We have some problems with slurmstepd deadlocking, but those occur only on CentOS 7.
Dominik

Hi,
I can reproduce this, and I will let you know when we fix it.
Dominik

Hi,
https://github.com/SchedMD/slurm/commit/3b02902149637 should fix this; the commit is included in 17.11.6. There is still a hypothetical race that will be fixed in 17.11.7.
Dominik

Hello Dominik,
Thank you for indicating the fix. What do you mean by "there is still a hypothetical race"? Can you tell us a little bit more? I am wondering whether it is worth applying the patch right now, or whether we would do better to wait until the whole problem is fixed.
Regards,
Véronique

(In reply to Véronique Legrand from comment #12)
> Thank you for indicating the fix. What do you mean "There is still
> hypothetical race"? Can you tell us a little bit more? I am wondering if it
> is worth applying a patch right now or if we wouldn't better wait until the
> whole problem is fixed.

Hi,
I recommend updating to 17.11.6; it contains a lot of other fixes too. Hitting this issue now requires a strange sequence of 3 RPCs, and even with code hacks (adding sleeps in a few places) it is hard to trigger. I don't think you can hit this issue anymore on 17.11.6.
Dominik

Thank you for your advice. I will discuss the version upgrade with my colleague Sophie when she is back.
Regards,
Véronique

Hi,
I'm marking this ticket as resolved/fixed. As always, please feel free to reopen if you have additional questions.
Dominik

*** Ticket 5639 has been marked as a duplicate of this ticket. ***
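For anyone hitting a similar drain, the stray-slurmstepd check Dominik asks for above can be scripted. The sketch below uses made-up file names and the job IDs quoted in this ticket purely as illustration; on a live node the two input files would come from `ps` and `squeue` output rather than `printf`:

```shell
# Find job IDs that have a slurmstepd process on the node but are no
# longer known to the controller. A leftover slurmstepd for a finished
# job is exactly what would make slurmd report "already running" and
# refuse to launch the requeued copy.
find_orphan_stepds() {
    # $1: file of job IDs with a slurmstepd process on the node
    # $2: file of job IDs squeue reports as running on that node
    grep -F -v -x -f "$2" "$1"
}

# Example inputs (illustrative job IDs):
#   stepd_jobs:  e.g. extracted from  ps -eo args | grep 'slurmstepd:'
#   squeue_jobs: e.g. from  squeue -h -w tars-442 -o '%A'
printf '%s\n' 30534282 33693531 33693596 > stepd_jobs
printf '%s\n' 33693531 33693596 > squeue_jobs

find_orphan_stepds stepd_jobs squeue_jobs    # prints 30534282

# A backtrace of a stuck slurmstepd (PID found via pstree/ps) can then
# be captured with:  gdb -p <pid> -batch -ex 'thread apply all bt'
```

The `grep -F -v -x -f` combination prints every line of the first file that does not exactly match a line of the second, which is all the set difference needed here.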
Created attachment 6581 [details]
current slurm.conf

Hello,

Since the upgrade from 16.05.9 to 17.11.5 we have experienced duplicate job IDs at least once a week. These seem to be only automatically requeued jobs. The requeue is triggered by the script itself when it exits with code 99 (cf. RequeueExit=99 in our slurm.conf).

Example:

2018-04-08T12:10:14+02:00 tars-442 slurmd[22751]: error: Job 30534282 already running, do not launch second copy

The above job started for the first time at 2018-03-30T19:17:54. Here are the iterations before and after the error message:

JobID         JobName  Partition  QOS   State      Submit               Start                Elapsed   End                  ExitCode  NodeList
30533734_548  ez_50    dedicated  fast  REQUEUED   2018-04-08T11:41:11  2018-04-08T11:43:17  00:24:53  2018-04-08T12:08:10  99:0      tars-429
30533734_54+  batch                     FAILED     2018-04-08T11:43:17  2018-04-08T11:43:17  00:24:53  2018-04-08T12:08:10  99:0      tars-429
30533734_54+  mcell                     CANCELLED  2018-04-08T11:43:18  2018-04-08T11:43:18  00:24:54  2018-04-08T12:08:12  0:15      tars-429
30533734_548  ez_50    dedicated  fast  REQUEUED   2018-04-08T12:08:12  2018-04-08T12:10:13  00:00:01  2018-04-08T12:10:14  0:0       tars-442
30533734_548  ez_50    dedicated  fast  REQUEUED   2018-04-08T12:10:14  2018-04-08T12:12:19  00:24:52  2018-04-08T12:37:11  99:0      tars-430
30533734_54+  batch                     FAILED     2018-04-08T12:12:19  2018-04-08T12:12:19  00:24:52  2018-04-08T12:37:11  99:0      tars-430
30533734_54+  mcell                     CANCELLED  2018-04-08T12:12:20  2018-04-08T12:12:20  00:24:52  2018-04-08T12:37:12  0:15      tars-430

We don't understand that message, since only one instance appears in the sacct output for that job at that time. The annoying thing is that the node is systematically drained by Slurm.

Thanks in advance,
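For reference, the RequeueExit mechanism the report relies on can be exercised with a minimal job script. This is an illustrative sketch, not the site's actual job: the script name and retry logic are invented, and outside a real cluster we simulate SLURM_RESTART_COUNT (the counter Slurm sets on requeued jobs) by exporting it ourselves:

```shell
# slurm.conf fragment (as described in the report):
#   RequeueExit=99    # jobs exiting with code 99 are automatically requeued

# Minimal job script: exit 99 until the third restart so Slurm keeps
# requeuing it, then exit 0.
cat > requeue_demo.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=requeue_demo
if [ "${SLURM_RESTART_COUNT:-0}" -lt 3 ]; then
    exit 99    # matches RequeueExit=99 -> slurmctld requeues the job
fi
exit 0
EOF
chmod +x requeue_demo.sh

# On a real cluster:  sbatch requeue_demo.sh
# Simulated here, faking the restart counter:
SLURM_RESTART_COUNT=0 ./requeue_demo.sh || echo "attempt 1 exits $?"
SLURM_RESTART_COUNT=3 ./requeue_demo.sh && echo "attempt 4 exits 0"
```

Any job whose script can return 99 (deliberately or by accident) enters this requeue loop, which is why the duplicate-jobid error above always involved requeued jobs.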