Created attachment 31597 [details]
PI CUDA calculation

Reproduced on 23.02.2 and 23.02.4, both with a simple CUDA pi job and with more complex OpenMPI4/CUDA/NCCL tests.

To reproduce the issue we used 2 nodes with 2 GPUs each (NVIDIA V100). The NCCL tests were cloned from https://github.com/NVIDIA/nccl-tests (several versions were tested) and compiled with openmpi4, cuda12.1, nccl2, and gcc11.

When we run the tests manually via mpirun, they finish fine:

[cmsupport@ts-tr-v100-gpus ~]$ mpirun -np 4 -H node001:2,node002:2 -np 4 /home/cmsupport/nccl-tests/build/all_reduce_perf -b 1M -e 128M -f 2 -g 1 -n 100
# nThread 1 nGpus 1 minBytes 1048576 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  22092 on node001 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  1 Group  0 Pid  22093 on node001 device  1 [0x00] Tesla V100-SXM3-32GB
#  Rank  2 Group  0 Pid  22336 on node002 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  3 Group  0 Pid  22337 on node002 device  1 [0x00] Tesla V100-SXM3-32GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1   2857.2    0.37    0.55      0   2527.8    0.41    0.62      0
     2097152        524288     float     sum      -1   3678.6    0.57    0.86      0   3809.3    0.55    0.83      0
     4194304       1048576     float     sum      -1   6769.8    0.62    0.93      0   6798.0    0.62    0.93      0
     8388608       2097152     float     sum      -1    11100    0.76    1.13      0    11359    0.74    1.11      0
    16777216       4194304     float     sum      -1    33257    0.50    0.76      0    30115    0.56    0.84      0
    33554432       8388608     float     sum      -1    69484    0.48    0.72      0    67956    0.49    0.74      0
    67108864      16777216     float     sum      -1    89091    0.75    1.13      0    97271    0.69    1.03      0
   134217728      33554432     float     sum      -1   175642    0.76    1.15      0   186022    0.72    1.08      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.900032
#

But when we run them via srun, the job gets stuck forever after size 32M (and nvidia-smi shows no GPU usage):

[cmsupport@ts-tr-v100-gpus ~]$ srun --export=ALL,NCCL_DEBUG=TRACE --ntasks=4 --mpi=pmix -N 2 --gres=gpu:v100:2 /home/cmsupport/nccl-tests/build/all_reduce_perf -b 1M -e 128M -f 2 -g 1 -n 100
# nThread 1 nGpus 1 minBytes 1048576 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  23252 on node001 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  1 Group  0 Pid  23253 on node001 device  1 [0x00] Tesla V100-SXM3-32GB
#  Rank  2 Group  0 Pid  23453 on node002 device  0 [0x00] Tesla V100-SXM3-32GB
#  Rank  3 Group  0 Pid  23454 on node002 device  1 [0x00] Tesla V100-SXM3-32GB
[...]
node002:23454:23480 [1] NCCL INFO comm 0x5668d80 rank 3 nranks 4 cudaDev 1 busId 70 commId 0x63119ee87dc368a4 - Init COMPLETE
node002:23453:23481 [0] NCCL INFO comm 0x5660860 rank 2 nranks 4 cudaDev 0 busId 60 commId 0x63119ee87dc368a4 - Init COMPLETE
node001:23253:23280 [1] NCCL INFO comm 0x5660680 rank 1 nranks 4 cudaDev 1 busId 70 commId 0x63119ee87dc368a4 - Init COMPLETE
node001:23252:23279 [0] NCCL INFO comm 0x566a6c0 rank 0 nranks 4 cudaDev 0 busId 60 commId 0x63119ee87dc368a4 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1   2477.9    0.42    0.63      0   3280.4    0.32    0.48      0
     2097152        524288     float     sum      -1   3745.8    0.56    0.84      0   5053.6    0.41    0.62      0
     4194304       1048576     float     sum      -1   9576.0    0.44    0.66      0   6886.6    0.61    0.91      0
     8388608       2097152     float     sum      -1    11821    0.71    1.06      0   9707.7    0.86    1.30      0
    16777216       4194304     float     sum      -1    34853    0.48    0.72      0    33461    0.50    0.75      0
    33554432       8388608     float     sum      -1    61860    0.54    0.81      0    71392    0.47    0.71      0
^C

When we run it via sbatch, a memory error is printed after size 32M (malloc(): invalid size (unsorted)), the job script exits successfully, but the Slurm job hangs in RUNNING state forever (no slurmstepd is running on the node after the job script exits):

[cmsupport@ts-tr-v100-gpus ~]$ cat nccl-job.sh
#!/bin/sh
module load openmpi4 cuda11.8/toolkit nccl2-cuda12.1-gcc11
mpirun /home/cmsupport/nccl-tests/build/all_reduce_perf -b 1M -e 128M -f 2 -g 1 -n 100
[cmsupport@ts-tr-v100-gpus ~]$
[cmsupport@ts-tr-v100-gpus ~]$ tail -f slurm-34.out
node002:29615:29635 [1] NCCL INFO comm 0x5672e40 rank 3 nranks 4 cudaDev 1 busId 70 commId 0x78c32bcb8e410e6 - Init COMPLETE
node002:29614:29636 [0] NCCL INFO comm 0x566a5c0 rank 2 nranks 4 cudaDev 0 busId 60 commId 0x78c32bcb8e410e6 - Init COMPLETE
node001:29814:29831 [0] NCCL INFO comm 0x5673620 rank 0 nranks 4 cudaDev 0 busId 60 commId 0x78c32bcb8e410e6 - Init COMPLETE
node001:29815:29832 [1] NCCL INFO comm 0x566a420 rank 1 nranks 4 cudaDev 1 busId 70 commId 0x78c32bcb8e410e6 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1   2468.6    0.42    0.64      0   2676.2    0.39    0.59      0
     2097152        524288     float     sum      -1   4284.8    0.49    0.73      0   4677.5    0.45    0.67      0
     4194304       1048576     float     sum      -1   8769.7    0.48    0.72      0   7837.0    0.54    0.80      0
     8388608       2097152     float     sum      -1    10130    0.83    1.24      0    10612    0.79    1.19      0
    16777216       4194304     float     sum      -1    33209    0.51    0.76      0    31680    0.53    0.79      0
    33554432       8388608     float     sum      -1    61937    0.54    0.81      0    63101    0.53    0.80      0
malloc(): invalid size (unsorted)
    67108864      16777216     float     sum      -1    85079    0.79    1.18      0    81763    0.82    1.23      0
   134217728      33554432     float     sum      -1   156199    0.86    1.29      0   155185    0.86    1.30      0
node002:29615:29615 [1] NCCL INFO comm 0x5672e40 rank 3 nranks 4 cudaDev 1 busId 70 - Destroy COMPLETE
node001:29815:29815 [1] NCCL INFO comm 0x566a420 rank 1 nranks 4 cudaDev 1 busId 70 - Destroy COMPLETE
node002:29614:29614 [0] NCCL INFO comm 0x566a5c0 rank 2 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
node001:29814:29814 [0] NCCL INFO comm 0x5673620 rank 0 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.921416
#

[cmsupport@ts-tr-v100-gpus ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                34      defq nccl-job cmsuppor  R       3:21      2 node[001-002]
[cmsupport@ts-tr-v100-gpus ~]$

Sometimes Slurm requeues the job, with errors in the slurmctld log:

[2023-08-03T15:23:38.041] Batch JobId=35 missing from batch node node001 (not found BatchStartTime after startup), Requeuing job
[2023-08-03T15:23:38.041] _job_complete: JobId=35 WTERMSIG 126
[2023-08-03T15:23:38.041] _job_complete: JobId=35 cancelled by node failure
[2023-08-03T15:23:38.042] _job_complete: requeue JobId=35 due to node failure
[2023-08-03T15:23:38.046] _job_complete: JobId=35 done
[2023-08-03T15:23:38.415] Requeuing JobId=35

and in the slurmd log:

[2023-08-03T15:23:03.113] Launching batch job 35 for UID 1000
[2023-08-03T15:23:38.035] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.035] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00034/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.061] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:23:38.410] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job00035/slurm_script
[2023-08-03T15:25:54.245] reissued job credential for job 35

This only happens with the NCCL tests when more than one node is used. A similar experiment was performed with a simple CUDA program (the code is attached).
On Slurm 21.03 it works fine:

[cmsupport@ts-92-v100-gpus ~]$ srun --export=ALL --ntasks=4 --tasks-per-node=2 -N 2 --gres=gpu:v100:2 ./pi
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.135157 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.157336 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.164840 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.143957 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
[cmsupport@ts-92-v100-gpus ~]$

On Slurm 23.02 (note the "free()" error from the main task):

[cmsupport@ts-tr-v100-gpus ~]$ srun --export=ALL --ntasks=4 --tasks-per-node=2 -N 2 --gres=gpu:v100:2 ./pi
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.166167 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.161374 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
free(): invalid next size (fast)
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.148317 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
# of trials per thread = 4096, # of blocks = 256, # of threads/block = 256
PI calculated in : 0.171998 s.
estimated PI = 3.141592653589822870  [error of 0.000000000000029754]
^C
<hangs>

While on the same Slurm 23.02 with 1 task per node it works just fine.
A support agreement needs to be put in place before SchedMD can assign an engineer to this.
It turned out that the problem is in the jobacct_gather plugins (we tried both the linux and cgroup plugins). Because of them, slurmstepd crashes:

Stack trace of thread 20151:
#0  0x0000155553761acf raise (libc.so.6)
#1  0x0000155553734ea5 abort (libc.so.6)
#2  0x00001555537a2cd7 __libc_message (libc.so.6)
#3  0x00001555537a9fdc malloc_printerr (libc.so.6)
#4  0x00001555537ad204 _int_malloc (libc.so.6)
#5  0x00001555537af646 __libc_calloc (libc.so.6)
#6  0x0000155555074df9 slurm_xcalloc (libslurmfull.so)
#7  0x00001555550755f9 _xstrdup_vprintf (libslurmfull.so)
#8  0x00001555550759aa _xstrfmtcat (libslurmfull.so)
#9  0x000015554fe922f8 _handle_stats (jobacct_gather_linux.so)
#10 0x000015554fe9272c jag_common_poll_data (jobacct_gather_linux.so)
#11 0x000015554fe9158b jobacct_gather_p_poll_data (jobacct_gather_linux.so)
#12 0x000015555509cc5c _poll_data (libslurmfull.so)
#13 0x000015555509ce5c _watch_tasks (libslurmfull.so)
#14 0x00001555544861ca start_thread (libpthread.so.0)
#15 0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20154:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000041bc3a _io_thr (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20153:
#0  0x0000155553836f41 __poll (libc.so.6)
#1  0x0000155554fa9dc9 poll (libslurmfull.so)
#2  0x000000000042b1c7 _msg_thr_internal (slurmstepd)
#3  0x00001555544861ca start_thread (libpthread.so.0)
#4  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20152:
#0  0x000015555448c7aa pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000015555507c2e8 _timer_thread (libslurmfull.so)
#2  0x00001555544861ca start_thread (libpthread.so.0)
#3  0x000015555374ce73 __clone (libc.so.6)

Stack trace of thread 20150:
#0  0x000015555374bdde wait4 (libc.so.6)
#1  0x0000000000415df1 _wait_for_any_task (slurmstepd)
#2  0x000000000041775a _wait_for_all_tasks (slurmstepd)
#3  0x00000000004127e6 main (slurmstepd)
#4  0x000015555374dd85 __libc_start_main (libc.so.6)
#5  0x000000000040d2de _start (slurmstepd)

Workaround: use jobacct_gather/none.
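The workaround above corresponds to the following slurm.conf setting (assuming per-step accounting data can be sacrificed until the underlying bug is fixed; slurmd/slurmctld need to pick up the change, e.g. via scontrol reconfigure):

```
# slurm.conf: disable the crashing accounting-gather plugin as a temporary workaround
JobAcctGatherType=jobacct_gather/none
```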
The issue is related to a CUDA problem that is already described in https://bugs.schedmd.com/show_bug.cgi?id=17102

*** This ticket has been marked as a duplicate of ticket 17102 ***