| Summary: | Array tasks never start | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmd | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | hrz, pedmon, tim |
| Version: | 17.02.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3998 | | |
| Site: | Stanford | | |
| Version Fixed: | 17.02.7 | Target Release: | --- |
| Attachments: | slurmd log, Full slurmd log, Possible fix for bug 3977., slurmd log with patch | | |
Hi,

Could you attach the full slurmd.log? Did you notice any unusual messages in dmesg/syslog? Which kernel/systemd version do you use?

Dominik

Created attachment 4906 [details]
Full slurmd log
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #1)
> Could you attach full slurmd.log?

It's huge, but sure, here it is.

> Did you notice any unusual messages in dmesg/syslog?

No, syslog and dmesg are completely devoid of any warning/error.

> Which kernel/systemd version do you use?

Latest CentOS 7.3, so that's:

-- 8< -------------------------------------------------------------------------------
[root@sh-102-44 ~]# uname -r
3.10.0-514.26.2.el7.x86_64
[root@sh-102-44 ~]# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
-- 8< -------------------------------------------------------------------------------

Thanks!
--
Kilian

Hi,

Maybe this is not related, but in some respects this looks similar to bug 3466. Which sssd version are you using?

Dominik

Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #6)
> Maybe this is not related but in some points this looks similar to bug 3466.
> Which sssd version are you using?

That's interesting, but I don't think it's related either: we don't see any resource-exhaustion message in any log, and the issue can be reproduced with just a simple 4-task "sleep" job array on an otherwise idle 20-core machine.

We actually don't use sssd. That's by design: we need full table enumeration for our management tools, which sssd no longer supports. We use nscld instead.

Cheers,
--
Kilian

*** Ticket 3978 has been marked as a duplicate of this ticket. ***

Created attachment 4915 [details]
Possible fix for bug 3977.

Hey guys -

I've merged bug 3978 into this one; I suspect both of you are seeing the same issue.

Can you try out the attached patch to 17.02.6 and see if this improves things? I believe this is a regression from 17.02.5; the patch restores the prior behavior and I believe it may mitigate this launch issue.

cheers,
- Tim

Hi Tim,

(In reply to Tim Wickberg from comment #9)
> Can you try out the attached patch to 17.02.6 and see if this improves
> things? I believe this is a regression from 17.02.5, and this restores the
> prior behavior and I believe may mitigate this launch issue.

Thanks for the patch, but unfortunately it doesn't seem to fix the problem: I still see "ghost" jobs with a simple:

-- 8< -------------------------------------------------------------------------------
$ sbatch -p test -w sh-101-59 --array=1-8 --wrap='echo $SLURM_ARRAY_TASK_ID'
-- 8< -------------------------------------------------------------------------------

The patched slurmd's full log is attached, with SlurmdDebug=debug3, on a fresh node start, with no other job running. There are 2 ghost jobs out of 8:

-- 8< -------------------------------------------------------------------------------
$ squeue -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          485277_6      test     wrap   kilian  R       2:34      1 sh-101-59
          485277_8      test     wrap   kilian  R       2:34      1 sh-101-59
-- 8< -------------------------------------------------------------------------------

Created attachment 4916 [details]
slurmd log with patch
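The lingering tasks above can be spotted mechanically: a trivial `echo` wrap job should finish within seconds, so any array task that `squeue` still reports RUNNING after a short grace period is stuck. A minimal sketch, using the `squeue` output from this ticket as embedded sample data — on a live cluster you would substitute a real call such as `squeue -h -j 485277 -o '%i %T %M'` (the job ID and field layout here are taken from the report, not prescribed by Slurm):

```shell
#!/bin/sh
# Sketch: count array tasks of job 485277 that squeue still reports RUNNING.
# Sample lines are copied from the ticket; on a live cluster, replace the
# variable with output from:  squeue -h -j 485277 -o '%i %T %M'
squeue_sample='485277_6 RUNNING 2:34
485277_8 RUNNING 2:34'

# Field 2 is the job state; collect the IDs of tasks still RUNNING.
ghosts=$(printf '%s\n' "$squeue_sample" | awk '$2 == "RUNNING" { print $1 }')
count=$(printf '%s\n' "$ghosts" | grep -c .)

echo "stuck array tasks: $count"
printf '%s\n' "$ghosts"
```

With the sample data this prints `stuck array tasks: 2` followed by the two task IDs; anything nonzero minutes after submission of a sub-second job is a candidate ghost.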
Just an update on this from our end: we downgraded to 17.02.5 on our production cluster in order to minimize the impact on users. Our test cluster is still on 17.02.6 and can reproduce the problem on a single node. Let me know if you need any other info from us.

Hi,

I can finally recreate this. I will try to provide a patch as fast as possible.

Dominik

Hi,

Commit https://github.com/SchedMD/slurm/commit/b40bd8d35ef85 should fix this issue. It will be included in the next 17.02 release. Could you confirm whether this solves the problem in your environment?

Dominik

Hi Dominik,
> Commit https://github.com/SchedMD/slurm/commit/b40bd8d35ef85 should fix this
> issue.
> It will be included in next 17.02 release.
> Could you confirm if this solved problem on your environment?
Just tested the patch, and I can confirm that it fixes the issue: all the jobs start and terminate correctly, and there are no ghost jobs anymore.
Great work, and thank you!
Cheers,
--
Kilian
I can also verify that it fixed the problem on our end as well. Well done!

-Paul Edmon-

Hi,

I am glad to hear that. I am closing this ticket.

Dominik

*** Ticket 3998 has been marked as a duplicate of this ticket. ***
Created attachment 4897 [details]
slurmd log

Hi!

We're facing a weird issue with array jobs in 17.02.6. The symptom is that some of the tasks of an array don't seem to actually start any of the job's processes, and just linger around until reaching the time limit. That is, the job array task shows up in squeue, but no process is actually started by slurmd on the node. Thus none exits, and the job array task never ends.

This seems to happen only when multiple tasks of the same array are started at the same time on the same node. It is quite reproducible on {CentOS,RHEL}7 with cgroups enabled (I "feel" like this may be a race issue with cgroups, but don't have any solid evidence).

Simple reproducer:
-- 8< -------------------------------------------------------------------------------
$ sbatch -w sh-102-44 --array=1-8 --wrap='echo $SLURM_ARRAY_TASK_ID'
-- 8< -------------------------------------------------------------------------------

Resulting squeue and sacct output when most tasks have normally exited:
-- 8< -------------------------------------------------------------------------------
$ squeue -u kilian
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          459887_2    owners     wrap   kilian  R       1:22      1 sh-102-44

$ sacct -j 459887 -X
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
459887_8           wrap     owners      ruthm          1  COMPLETED      0:0
459887_1           wrap     owners      ruthm          1  COMPLETED      0:0
459887_2           wrap     owners      ruthm          1    RUNNING      0:0
459887_3           wrap     owners      ruthm          1  COMPLETED      0:0
459887_4           wrap     owners      ruthm          1  COMPLETED      0:0
459887_5           wrap     owners      ruthm          1  COMPLETED      0:0
459887_6           wrap     owners      ruthm          1  COMPLETED      0:0
459887_7           wrap     owners      ruthm          1  COMPLETED      0:0
-- 8< -------------------------------------------------------------------------------

The debug3 slurmd log from the node is attached, for the period when the job was launched.

I noticed these messages, but I don't know whether they're related:
-- 8< -------------------------------------------------------------------------------
slurmstepd[180578]: debug3: xcgroup_set_param: parameter 'cpuset.cpus' set to '0,2,8,10' for '/sys/fs/cgroup/cpuset/slurm/uid_215845/job_459889'
slurmstepd[180599]: debug2: _file_read_content: unable to open '/sys/fs/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
slurmstepd[180599]: debug2: xcgroup_get_param: unable to get parameter 'cpus' for '/sys/fs/cgroup/cpuset/slurm'
[...]
slurmstepd[180661]: error: User requested launch of zero tasks!
[...]
slurmd[178796]: error: _remove_starting_step: step 459887.4294967295 not found
slurmd[178796]: error: Error cleaning up starting_step list
-- 8< -------------------------------------------------------------------------------

Anyway, if you have any way to reproduce this and get an idea of what's going on, it would be much appreciated.

Thanks!
--
Kilian
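The same check can also be made from accounting data: sibling tasks of the array complete almost immediately, so a task that `sacct` still shows as RUNNING long afterwards is a candidate ghost. A small sketch using the `sacct` rows from the report above as embedded sample data — on a live cluster you would instead pipe in something like `sacct -n -X -j 459887 --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode` (the job ID and column order are just those shown in this ticket):

```shell
#!/bin/sh
# Sketch: filter sacct-style output for array tasks stuck in RUNNING state.
# Sample rows are copied from the ticket; on a real cluster, replace the
# variable with the (headerless) output of sacct -n -X -j 459887.
sacct_sample='459887_8 wrap owners ruthm 1 COMPLETED 0:0
459887_1 wrap owners ruthm 1 COMPLETED 0:0
459887_2 wrap owners ruthm 1 RUNNING 0:0
459887_3 wrap owners ruthm 1 COMPLETED 0:0
459887_4 wrap owners ruthm 1 COMPLETED 0:0'

# Field 6 is the job state; print the IDs of tasks that never finished.
stuck=$(printf '%s\n' "$sacct_sample" | awk '$6 == "RUNNING" { print $1 }')
echo "still running: $stuck"
```

With the sample data this prints `still running: 459887_2`, matching the ghost task visible in the squeue output above.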