Summary: | Jobs in CG state block nodes | | |
---|---|---|---|
Product: | Slurm | Reporter: | Alex <ab2080> |
Component: | Scheduling | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | gsg8, jv2575 |
Version: | 17.11.2 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Columbia University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Alex 2018-05-15 11:29:56 MDT
Comment 1 (Marshall Garey)

Yes, 4733 has been resolved as a duplicate of bug 5103, which has been fixed and is in 17.11.6. You may upgrade or apply the referenced commit.

Can you upload a backtrace of all threads of a stuck slurmstepd so I can verify it's the same cause?

    gdb attach <stepd pid>
    thread apply all bt

Comment 2 (Alex)

Is this sufficient?:

    (gdb) thread apply all bt

    Thread 2 (Thread 0x2aaaaf87b700 (LWP 6108)):
    #0  0x00002aaaabdff0fc in __lll_lock_wait_private () from /lib64/libc.so.6
    #1  0x00002aaaabe5fb5d in _L_lock_183 () from /lib64/libc.so.6
    #2  0x00002aaaabe5fb0d in arena_thread_freeres () from /lib64/libc.so.6
    #3  0x00002aaaabe5fbb2 in __libc_thread_freeres () from /lib64/libc.so.6
    #4  0x00002aaaabae5dd8 in start_thread () from /lib64/libpthread.so.0
    #5  0x00002aaaabdf173d in clone () from /lib64/libc.so.6

    Thread 1 (Thread 0x2aaaaab0b500 (LWP 6099)):
    #0  0x00002aaaabae6ef7 in pthread_join () from /lib64/libpthread.so.0
    #1  0x000000000040aa6d in stepd_cleanup (msg=msg@entry=0x63fbd0,
        job=job@entry=0x659630, cli=cli@entry=0x654eb0, self=self@entry=0x0,
        rc=0, only_mem=only_mem@entry=false) at slurmstepd.c:184
    #2  0x000000000040c133 in main (argc=1, argv=0x7fffffffedb8) at slurmstepd.c:169
    (gdb)

Comment 3 (Marshall Garey)

(In reply to Alex from comment #2)
> Is this sufficient?

Yes, that's the same bug. It's been fixed in 17.11.6 - see bug 5103. You can upgrade or apply the patches from there.

There's a related bug (it also causes deadlock) in the slurmstepd that isn't in 17.11.6, but the fix will be in 17.11.7. We hope to release .7 in just a few weeks. If you'd like that patch before .7 is released, please let us know. It would need to be applied on top of the patches in bug 5103.

Does that answer your question? If you want the patch for the related slurmstepd bug, I can upload it here for you.

Comment 4 (Marshall Garey)

(In reply to Marshall Garey from comment #3)
> There's a related bug (it also causes deadlock) in the slurmstepd that isn't
> in 17.11.6, but the fix will be in 17.11.7.

I realized this sentence is a bit confusing. To clarify: the related bug is present in 17.11.6; the fix for it will be in 17.11.7.

I'm closing as resolved/duplicate. Please let us know if you'd like the patch before .7 is released.

*** This ticket has been marked as a duplicate of ticket 5103 ***
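The interactive gdb session requested above can also be driven non-interactively, which is convenient when several nodes have stuck slurmstepd processes. A minimal sketch; the PID and output filename are placeholders, not values from this ticket:

```shell
#!/bin/sh
# Build the batch-mode equivalent of the interactive session
#   gdb attach <stepd pid>
#   thread apply all bt
# "-batch" makes gdb run the -ex commands and exit, so the same line can be
# scripted across nodes. The function only prints the command; run it
# yourself (as root, or with ptrace permission) to capture the trace.
stepd_bt_cmd() {
    echo "gdb -p $1 -batch -ex 'thread apply all bt'"
}

stepd_bt_cmd 6099
# prints: gdb -p 6099 -batch -ex 'thread apply all bt'
```

To actually save a trace for upload, something like `eval "$(stepd_bt_cmd 6099)" > slurmstepd-6099.bt 2>&1` on the affected node should work, assuming gdb is installed there.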
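Once a trace has been saved, the signature confirmed in comment #3 (one thread stuck in `__lll_lock_wait_private` under `arena_thread_freeres`, while the main thread waits in `pthread_join` from `stepd_cleanup`) can be checked for mechanically before filing. A small triage sketch; the helper name and file path are hypothetical:

```shell
#!/bin/sh
# Report whether a saved "thread apply all bt" dump contains the two
# glibc-internal frames seen in this ticket's backtrace. This is only a
# pattern match on the text of the trace; it is a triage aid, not a
# substitute for uploading the full dump for review.
check_stepd_bt() {
    if grep -q '__lll_lock_wait_private' "$1" 2>/dev/null &&
       grep -q 'arena_thread_freeres' "$1" 2>/dev/null; then
        echo "matches the bug 5103 signature"
    else
        echo "different backtrace: upload it for review"
    fi
}

# Example: check_stepd_bt slurmstepd-6099.bt
```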