Created attachment 31597 [details]
PI CUDA calculation

Reproduced on 23.02.2 and 23.02.4, both with a simple CUDA pi job and with more complex OpenMPI4/CUDA/NCCL tests.

To reproduce the issue we used 2 nodes with 2 GPUs each (NVIDIA V100). The NCCL tests were cloned from https://github.com/NVIDIA/nccl-tests (several versions were tested) and compiled with openmpi4, cuda12.1, nccl2, and gcc11.

When we run the tests manually via mpirun, they finish fine:

[cmsupport@ts-tr-v100-gpus ~]$ mpirun -np 4 -H node001:2,node002:2 -np 4 /home/cmsupport/nccl-tests/build/all_reduce_perf -b 1M -e 128M -f 2 -g 1 -n 100
# nThread 1 nGpus 1 minBytes 1048576 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  22092 on node001 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  1 Group  0 Pid  22093 on node001 device  1 [0x00] Tesla V100-SXM3-32GB
#  Rank  2 Group  0 Pid  22336 on node002 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  3 Group  0 Pid  22337 on node002 device  1 [0x00] Tesla V100-SXM3-32GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1   2857.2    0.37    0.55      0   2527.8    0.41    0.62      0
     2097152        524288     float     sum      -1   3678.6    0.57    0.86      0   3809.3    0.55    0.83      0
     4194304       1048576     float     sum      -1   6769.8    0.62    0.93      0   6798.0    0.62    0.93      0
     8388608       2097152     float     sum      -1    11100    0.76    1.13      0    11359    0.74    1.11      0
    16777216       4194304     float     sum      -1    33257    0.50    0.76      0    30115    0.56    0.84      0
    33554432       8388608     float     sum      -1    69484    0.48    0.72      0    67956    0.49    0.74      0
    67108864      16777216     float     sum      -1    89091    0.75    1.13      0    97271    0.69    1.03      0
   134217728      33554432     float     sum      -1   175642    0.76    1.15      0   186022    0.72    1.08      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.900032
#

But when we run them via srun, the job gets stuck forever after size 32M (and nvidia-smi shows no GPU usage):

[cmsupport@ts-tr-v100-gpus ~]$ srun --export=ALL,NCCL_DEBUG=TRACE --ntasks=4 --mpi=pmix -N 2 --gres=gpu:v100:2 /home/cmsupport/nccl-tests/build/all_reduce_perf -b 1M -e 128M -f 2 -g 1 -n 100
# nThread 1 nGpus 1 minBytes 1048576 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  23252 on node001 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  1 Group  0 Pid  23253 on node001 device  1 [0x00] Tesla V100-SXM3-32GB
#  Rank  2 Group  0 Pid  23453 on node002 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  3 Group  0 Pid  23454 on node002 device  1 [0x00] Tesla V100-SXM3-32GB
[...]
node002:23454:23480 [1] NCCL INFO comm 0x5668d80 rank 3 nranks 4 cudaDev 1 busId 70 commId 0x63119ee87dc368a4 - Init COMPLETE
node002:23453:23481 [0] NCCL INFO comm 0x5660860 rank 2 nranks 4 cudaDev 0 busId 60 commId 0x63119ee87dc368a4 - Init COMPLETE
node001:23253:23280 [1] NCCL INFO comm 0x5660680 rank 1 nranks 4 cudaDev 1 busId 70 commId 0x63119ee87dc368a4 - Init COMPLETE
node001:23252:23279 [0] NCCL INFO comm 0x566a6c0 rank 0 nranks 4 cudaDev 0 busId 60 commId 0x63119ee87dc368a4 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1   2477.9    0.42    0.63      0   3280.4    0.32    0.48      0
     2097152        524288     float     sum      -1   3745.8    0.56    0.84      0   5053.6    0.41    0.62      0
     4194304       1048576     float     sum      -1   9576.0    0.44    0.66      0   6886.6    0.61    0.91      0
     8388608       2097152     float     sum      -1    11821    0.71    1.06      0   9707.7    0.86    1.30      0
    16777216       4194304     float     sum      -1    34853    0.48    0.72      0    33461    0.50    0.75      0
    33554432       8388608     float     sum      -1    61860    0.54    0.81      0    71392    0.47    0.71      0
^C

When we run it via sbatch, a memory error is printed after size 32M (malloc(): invalid size (unsorted)), the job script exits successfully, but the Slurm job hangs in RUNNING state forever (no slurmstepd is running on the node after the job script exits):

[cmsupport@ts-tr-v100-gpus ~]$ cat nccl-job.sh
#!/bin/sh
module load openmpi4 cuda11.8/toolkit nccl2-cuda12.1-gcc11
mpirun /home/cmsupport/nccl-tests/build/all_reduce_perf -b 1M -e 128M -f 2 -g 1 -n 100
[cmsupport@ts-tr-v100-gpus ~]$
[cmsupport@ts-tr-v100-gpus ~]$ tail -f slurm-34.out
node002:29615:29635 [1] NCCL INFO comm 0x5672e40 rank 3 nranks 4 cudaDev 1 busId 70 commId 0x78c32bcb8e410e6 - Init COMPLETE
node002:29614:29636 [0] NCCL INFO comm 0x566a5c0 rank 2 nranks 4 cudaDev 0 busId 60 commId 0x78c32bcb8e410e6 - Init COMPLETE
node001:29814:29831 [0] NCCL INFO comm 0x5673620 rank 0 nranks 4 cudaDev 0 busId 60 commId 0x78c32bcb8e410e6 - Init COMPLETE
node001:29815:29832 [1] NCCL INFO comm 0x566a420 rank 1 nranks 4 cudaDev 1 busId 70 commId 0x78c32bcb8e410e6 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1   2468.6    0.42    0.64      0   2676.2    0.39    0.59      0
     2097152        524288     float     sum      -1   4284.8    0.49    0.73      0   4677.5    0.45    0.67      0
     4194304       1048576     float     sum      -1   8769.7    0.48    0.72      0   7837.0    0.54    0.80      0
     8388608       2097152     float     sum      -1    10130    0.83    1.24      0    10612    0.79    1.19      0
    16777216       4194304     float     sum      -1    33209    0.51    0.76      0    31680    0.53    0.79      0
    33554432       8388608     float     sum      -1    61937    0.54    0.81      0    63101    0.53    0.80      0
malloc(): invalid size (unsorted)
    67108864      16777216     float     sum      -1    85079    0.79    1.18      0    81763    0.82    1.23      0
   134217728      33554432     float     sum      -1   156199    0.86    1.29      0   155185    0.86    1.30      0
node002:29615:29615 [1] NCCL INFO comm 0x5672e40 rank 3 nranks 4 cudaDev 1 busId 70 - Destroy COMPLETE
node001:29815:29815 [1] NCCL INFO comm 0x566a420 rank 1 nranks 4 cudaDev 1 busId 70 - Destroy COMPLETE
node002:29614:29614 [0] NCCL INFO comm 0x566a5c0 rank 2 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
node001:29814:29814 [0] NCCL INFO comm 0x5673620 rank 0 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.921416
#

[cmsupport@ts-tr-v100-gpus ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                34      defq nccl-job cmsuppor  R       3:21      2 node[001-002]
[cmsupport@ts-tr-v100-gpus ~]$

Sometimes Slurm requeues the job, with errors in the slurmctld log:

[2023-08-03T15:23:38.041] Batch JobId=35 missing from batch node node001 (not found BatchStartTime after startup), Requeuing job
[2023-08-03T15:23:38.041] _job_complete: JobId=35 WTERMSIG 126
[2023-08-03T15:23:38.041] _job_complete: JobId=35 cancelled by node failure
[2023-08-03T15:23:38.042] _job_complete: requeue JobId=35 due to node failure
[2023-08-03T15:23:38.046] _job_complete: JobId=35 done
[2023-08-03T15:23:38.415] Requeuing JobId=35

and in the slurmd log:

[2023-08-03T15:23:03.113] Launching batch job 35 for UID 1000
[2023-08-03T15:23:38.035] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.035] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00034/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.410] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:25:54.245] reissued job credential for job 35

This only happens with the NCCL tests when more than one node is used. A similar experiment was performed with a simple CUDA program (the code is attached).
On Slurm 21.03 it works fine:

[cmsupport@ts-92-v100-gpus ~]$ srun --export=ALL --ntasks=4 --tasks-per-node=2 -N 2 --gres=gpu:v100:2 ./pi
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.135157 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.157336 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.164840 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.143957 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
[cmsupport@ts-92-v100-gpus ~]$

On Slurm 23.02 (note the "free()" error from the main task):

[cmsupport@ts-tr-v100-gpus ~]$ srun --export=ALL --ntasks=4 --tasks-per-node=2 -N 2 --gres=gpu:v100:2 ./pi
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.166167 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.161374 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
free(): invalid next size (fast)
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.148317 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.171998 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
^C
<hangs>

While on the same Slurm 23.02 with 1 task per node it works just fine.
A support agreement needs to be put in place before SchedMD can assign an engineer to this.
It turned out that the problem is in the jobacct_gather plugins (we tried both the linux and cgroup plugins). Because of them, slurmstepd crashes:

Stack trace of thread 20151:
#0  0x0000155553761acf raise (libc.so.6)
#1  0x0000155553734ea5 abort (libc.so.6)
#2  0x00001555537a2cd7 __libc_message (libc.so.6)
#3  0x00001555537a9fdc malloc_printerr (libc.so.6)
#4  0x00001555537ad204 _int_malloc (libc.so.6)
#5  0x00001555537af646 __libc_calloc (libc.so.6)
#6  0x0000155555074df9 slurm_xcalloc (libslurmfull.so)
#7  0x00001555550755f9 _xstrdup_vprintf (libslurmfull.so)
#8  0x00001555550759aa _xstrfmtcat (libslurmfull.so)
#9  0x000015554fe922f8 _handle_stats (jobacct_gather_linux.so)
#10 0x000015554fe9272c jag_common_poll_data (jobacct_gather_linux.so)
#11 0x000015554fe9158b jobacct_gather_p_poll_data (jobacct_gather_linux.so)
#12 0x000015555509cc5c _poll_data (libslurmfull.so)
#13 0x000015555509ce5c _watch_tasks (libslurmfull.so)
#14 0x00001555544861ca start_thread (libpthread.so.0)
#15 0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20154:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000041bc3a _io_thr (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20153:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000042b1c7 _msg_thr_internal (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20152:
#0  0x000015555448c7aa pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000015555507c2e8 _timer_thread (libslurmfull.so)
#2  0x00001555544861ca start_thread (libpthread.so.0)
#3  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20150:
#0  0x000015555374bdde wait4 (libc.so.6)
#1  0x0000000000415df1 _wait_for_any_task (slurmstepd)
#2  0x000000000041775a _wait_for_all_tasks (slurmstepd)
#3  0x00000000004127e6 main (slurmstepd)
#4  0x000015555374dd85 __libc_start_main (libc.so.6)
#5  0x000000000040d2de _start (slurmstepd)

Workaround: use jobacct_gather/none.
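The workaround above corresponds to the following slurm.conf setting (assuming per-step accounting data can be sacrificed until the underlying bug is fixed; slurmd/slurmctld need to pick up the change, e.g. via scontrol reconfigure):

```
# slurm.conf: disable the crashing accounting-gather plugin as a temporary workaround
JobAcctGatherType=jobacct_gather/none
```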
The issue is related to a CUDA problem that is already described in https://bugs.schedmd.com/show_bug.cgi?id=17102

*** This ticket has been marked as a duplicate of ticket 17102 ***