| Summary: | Multitask GPU jobs fail on Slurm 23.02 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Taras Shapovalov <taras.shapovalov> |
| Component: | GPU | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | PI CUDA calculation | ||
|
Description
Taras Shapovalov
2023-08-04 06:56:39 MDT
A support agreement needs to be put in place before SchedMD can assign an engineer to this. It turned out the problem is in the jobacct_gather plugins (tried both linux and cgroup). Because of them, slurmstepd crashes:
```
Stack trace of thread 20151:
#0  0x0000155553761acf raise (libc.so.6)
#1  0x0000155553734ea5 abort (libc.so.6)
#2  0x00001555537a2cd7 __libc_message (libc.so.6)
#3  0x00001555537a9fdc malloc_printerr (libc.so.6)
#4  0x00001555537ad204 _int_malloc (libc.so.6)
#5  0x00001555537af646 __libc_calloc (libc.so.6)
#6  0x0000155555074df9 slurm_xcalloc (libslurmfull.so)
#7  0x00001555550755f9 _xstrdup_vprintf (libslurmfull.so)
#8  0x00001555550759aa _xstrfmtcat (libslurmfull.so)
#9  0x000015554fe922f8 _handle_stats (jobacct_gather_linux.so)
#10 0x000015554fe9272c jag_common_poll_data (jobacct_gather_linux.so)
#11 0x000015554fe9158b jobacct_gather_p_poll_data (jobacct_gather_linux.so)
#12 0x000015555509cc5c _poll_data (libslurmfull.so)
#13 0x000015555509ce5c _watch_tasks (libslurmfull.so)
#14 0x00001555544861ca start_thread (libpthread.so.0)
#15 0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20154:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000041bc3a _io_thr (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20153:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000042b1c7 _msg_thr_internal (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20152:
#0  0x000015555448c7aa pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000015555507c2e8 _timer_thread (libslurmfull.so)
#2  0x00001555544861ca start_thread (libpthread.so.0)
#3  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20150:
#0  0x000015555374bdde wait4 (libc.so.6)
#1  0x0000000000415df1 _wait_for_any_task (slurmstepd)
#2  0x000000000041775a _wait_for_all_tasks (slurmstepd)
#3  0x00000000004127e6 main (slurmstepd)
#4  0x000015555374dd85 __libc_start_main (libc.so.6)
#5  0x000000000040d2de _start (slurmstepd)
```
Workaround: use jobacct_gather/none.
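The workaround above amounts to disabling the accounting-gather plugin in slurm.conf. A minimal sketch (the rest of the configuration is site-specific and omitted here):

```
# slurm.conf — switch off job accounting gathering to avoid the
# crash in jobacct_gather/linux; no per-task usage stats will be
# collected while this is in effect.
JobAcctGatherType=jobacct_gather/none
```

After changing the setting, propagate it with `scontrol reconfigure` or by restarting the daemons. Note this only sidesteps the crash; it does not fix the underlying heap corruption.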
The issue relates to a CUDA problem that is already described in https://bugs.schedmd.com/show_bug.cgi?id=17102

*** This ticket has been marked as a duplicate of ticket 17102 ***