| Summary: | Multitask GPU jobs fail on Slurm 23.02 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Taras Shapovalov <taras.shapovalov> |
| Component: | GPU | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | PI CUDA calculation | ||
|
Description
Taras Shapovalov
2023-08-04 06:56:39 MDT
A support agreement needs to be put in place before SchedMD can assign an engineer to this. It turned out the problem is in the jobacct_gather plugins (tried both linux and cgroup). Because of them, slurmstepd crashes:
```
Stack trace of thread 20151:
#0  0x0000155553761acf raise (libc.so.6)
#1  0x0000155553734ea5 abort (libc.so.6)
#2  0x00001555537a2cd7 __libc_message (libc.so.6)
#3  0x00001555537a9fdc malloc_printerr (libc.so.6)
#4  0x00001555537ad204 _int_malloc (libc.so.6)
#5  0x00001555537af646 __libc_calloc (libc.so.6)
#6  0x0000155555074df9 slurm_xcalloc (libslurmfull.so)
#7  0x00001555550755f9 _xstrdup_vprintf (libslurmfull.so)
#8  0x00001555550759aa _xstrfmtcat (libslurmfull.so)
#9  0x000015554fe922f8 _handle_stats (jobacct_gather_linux.so)
#10 0x000015554fe9272c jag_common_poll_data (jobacct_gather_linux.so)
#11 0x000015554fe9158b jobacct_gather_p_poll_data (jobacct_gather_linux.so)
#12 0x000015555509cc5c _poll_data (libslurmfull.so)
#13 0x000015555509ce5c _watch_tasks (libslurmfull.so)
#14 0x00001555544861ca start_thread (libpthread.so.0)
#15 0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20154:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000041bc3a _io_thr (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20153:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000042b1c7 _msg_thr_internal (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20152:
#0  0x000015555448c7aa pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000015555507c2e8 _timer_thread (libslurmfull.so)
#2  0x00001555544861ca start_thread (libpthread.so.0)
#3  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20150:
#0  0x000015555374bdde wait4 (libc.so.6)
#1  0x0000000000415df1 _wait_for_any_task (slurmstepd)
#2  0x000000000041775a _wait_for_all_tasks (slurmstepd)
#3  0x00000000004127e6 main (slurmstepd)
#4  0x000015555374dd85 __libc_start_main (libc.so.6)
#5  0x000000000040d2de _start (slurmstepd)
```
Workaround: use jobacct_gather/none.
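The workaround above amounts to disabling the accounting-gather plugin in slurm.conf. A minimal sketch (the rest of the configuration is site-specific and omitted here):

```
# slurm.conf — switch off job accounting gathering to avoid the
# crash in jobacct_gather/linux; no per-task usage stats will be
# collected while this is in effect.
JobAcctGatherType=jobacct_gather/none
```

After changing the setting, propagate it with `scontrol reconfigure` or by restarting the daemons. Note this only sidesteps the crash; it does not fix the underlying heap corruption.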
The issue relates to a CUDA problem that is already described in https://bugs.schedmd.com/show_bug.cgi?id=17102

*** This ticket has been marked as a duplicate of ticket 17102 ***