Created attachment 31064 [details]
Output of 'thread apply all bt full'

Hi,

We just upgraded Sherlock to 23.02.3, and are now seeing waves of slurmstepd segfaults (more precisely, aborts), which lead to failed jobs and leftover processes on compute nodes. They generate core files in the compute nodes' slurmd spool dir, and the trace looks like this:

(gdb) bt
#0  0x00002b398fddb387 in raise () from /usr/lib64/libc.so.6
#1  0x00002b398fddca78 in abort () from /usr/lib64/libc.so.6
#2  0x00002b398fe1df67 in __libc_message () from /usr/lib64/libc.so.6
#3  0x00002b398fe26329 in _int_free () from /usr/lib64/libc.so.6
#4  0x00002b398e75ed48 in slurm_xfree (item=item@entry=0x2b398e50c7e0) at xmalloc.c:213
#5  0x00002b3994481018 in gpu_p_usage_read (pid=10348, data=0x2b399c002600) at gpu_nvml.c:1818
#6  0x00002b398e770dc3 in gpu_g_usage_read (pid=pid@entry=10348, data=<optimized out>) at gpu.c:249
#7  0x00002b3992c8ed19 in _handle_stats (tres_count=13, callbacks=<optimized out>, pid=10348) at common_jag.c:585
#8  _get_precs (task_list=<optimized out>, cont_id=<optimized out>, callbacks=<optimized out>) at common_jag.c:642
#9  0x00002b3992c8f53d in jag_common_poll_data (task_list=0x2451cb0, cont_id=10282, callbacks=callbacks@entry=0x2b3992e92340 <callbacks.15582>, profile=true) at common_jag.c:901
#10 0x00002b3992c8e2ee in jobacct_gather_p_poll_data (task_list=<optimized out>, cont_id=<optimized out>, profile=<optimized out>) at jobacct_gather_linux.c:271
#11 0x00002b398e785aa5 in _poll_data (profile=profile@entry=true) at jobacct_gather.c:323
#12 0x00002b398e785cf1 in _watch_tasks (arg=<optimized out>) at jobacct_gather.c:361
#13 0x00002b398f268ea5 in start_thread () from /usr/lib64/libpthread.so.0
#14 0x00002b398fea3b0d in clone () from /usr/lib64/libc.so.6

Output of 'thread apply all bt full' is attached.

Thanks!
-- Kilian
Hi Kilian,

I am looking into this now. The code has changed substantially from 23.02 to master. I will come back as soon as I have something.

It seems the issue may come from nvmlDeviceGetGraphicsRunningProcesses() and nvmlDeviceGetComputeRunningProcesses() returning 0.

The fix might be as easy as:

diff --git a/src/plugins/gpu/nvml/gpu_nvml.c b/src/plugins/gpu/nvml/gpu_nvml.c
index af05ad2782..3015415b3d 100644
--- a/src/plugins/gpu/nvml/gpu_nvml.c
+++ b/src/plugins/gpu/nvml/gpu_nvml.c
@@ -1815,7 +1815,9 @@ extern int gpu_p_usage_read(pid_t pid, acct_gather_data_t *data)
 				proc_info[j].usedGpuMemory;
 			break;
 		}
-	xfree(proc_info);
+
+	if (proc_info)
+		xfree(proc_info);
 
 	log_flag(JAG, "pid %d has GPUUtil=%lu and MemMB=%lu",
 		 pid,

but before confirming that, let me understand why gcnt and ccnt can be 0.
Hi Felip,

(In reply to Felip Moll from comment #1)
> I am looking into that now. The code has changed substantially from 23.02
> to master. I will come back as soon as I have something.
>
> It seems the issue may come from nvmlDeviceGetGraphicsRunningProcesses()
> and nvmlDeviceGetComputeRunningProcesses() returning 0.
>
> The fix might be as easy as:
>
> diff --git a/src/plugins/gpu/nvml/gpu_nvml.c
> b/src/plugins/gpu/nvml/gpu_nvml.c
> index af05ad2782..3015415b3d 100644
> --- a/src/plugins/gpu/nvml/gpu_nvml.c
> +++ b/src/plugins/gpu/nvml/gpu_nvml.c
> @@ -1815,7 +1815,9 @@ extern int gpu_p_usage_read(pid_t pid,
> acct_gather_data_t *data)
> 				proc_info[j].usedGpuMemory;
> 			break;
> 		}
> -	xfree(proc_info);
> +
> +	if (proc_info)
> +		xfree(proc_info);
>
> 	log_flag(JAG, "pid %d has GPUUtil=%lu and MemMB=%lu",
> 		 pid,

Thanks! We may have to apply that patch as a stopgap measure, because all our GPU jobs that last longer than JobAcctGatherFrequency are failing :(

Or is there a way to disable the new GPU usage gathering?

> but before confirming that let me understand why gcnt and ccnt can be 0.

Thank you!

Cheers,
-- Kilian
Kilian, my first assumptions were incorrect. The patch does not work.

I am looking for a new workaround for you.

It is strange, and must be related to how nvmlDeviceGetComputeRunningProcesses() handles the proc_info structure, because the fault is in libc's free, as if some internal bytes were used and locked in the nvml library, or similar.

I have reproduced it and am trying to debug the issue at the moment.

I will report back ASAP.
(In reply to Felip Moll from comment #4)
> Kilian, my first assumptions were incorrect. The patch does not work.
>
> I am looking for a new workaround for you.
>
> It is strange, and must be related to how
> nvmlDeviceGetComputeRunningProcesses() handles the proc_info structure,
> because the fault is in libc's free, as if some internal bytes were used
> and locked in the nvml library, or similar.
>
> I have reproduced it and am trying to debug the issue at the moment.
>
> I will report back ASAP.

Much appreciated, thanks Felip!

Cheers,
-- Kilian
Kilian,

At the moment I am finding what looks like memory corruption after calling nvmlDeviceGetGraphicsRunningProcesses(). I still don't know if it's our fault or a bug in nvidia's nvml.

I extracted the code that gets the process information into a separate program outside of Slurm, and in my tests I can see how it correctly detects 3 processes but fills the structs incorrectly. Note for example how usedGpuMemory contains the pid of one process instead of the used memory.

]$ ./main
Found 3 processes for device index 0
------ Process:1/3
Pid: 3342
usedGpuMemory: 144392192
------ Process:2/3
Pid: 0
usedGpuMemory: 3545
------ Process:3/3
Pid: 4294967295
usedGpuMemory: 0

]$ nvidia-smi
Wed Jul  5 17:39:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GT 1030         Off | 00000000:07:00.0  On |                  N/A |
| 35%   43C    P0              N/A /  30W |    346MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3342      G   /usr/libexec/Xorg                           137MiB |
|    0   N/A  N/A      3545      G   /usr/bin/gnome-shell                         53MiB |
|    0   N/A  N/A      5010      G   /usr/lib64/firefox/firefox                  152MiB |
+---------------------------------------------------------------------------------------+
Question: do your nvidia dev library versions match the installed driver version? I doubt nvmlProcessInfo_t itself changed, but a mismatch between the installed libraries, driver, and headers could be a factor.
(In reply to Felip Moll from comment #9)
> Question: do your nvidia dev library versions match the installed driver
> version? I doubt nvmlProcessInfo_t itself changed, but a mismatch between
> the installed libraries, driver, and headers could be a factor.

That's a good point and I had the same intuition, so I started looking at that, but it looks like things should work: we compiled Slurm 23.02.3 with the CUDA 12.1 headers, and our GPU nodes are using NVIDIA driver 535.54.03 (which reports "CUDA Version: 12.2").

I'd be surprised if there were a major struct change between 12.1.1 and 12.2, especially if you're seeing the same thing on your end. I can try to recompile slurmd/slurmstepd with CUDA 12.2 to see if it changes anything.

Cheers,
-- Kilian
Created attachment 31088 [details]
nvml_test.c

I am actually testing outside of Slurm with this test program. My cuda version is 11.8. I will now install the latest 12.2 to see if it fixes my issues. My driver version is 535.54.03 like yours, which reports cuda 12.2.

You can try my test program; under valgrind I get several invalid-write errors inside nvmlDeviceGetGraphicsRunningProcesses_v3:

]$ gcc -o nvml_test -ggdb -lnvidia-ml -I/usr/local/cuda-11.8/targets/x86_64-linux/include/ nvml_test.c

Will come back after I have tested with 12.2.
After upgrading to cuda12.2, check the difference... now it is working.

[lipi@llit Escriptori]$ nvidia-smi
Wed Jul  5 18:38:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GT 1030         Off | 00000000:07:00.0  On |                  N/A |
| 35%   44C    P0              N/A /  30W |    282MiB /  2048MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3342      G   /usr/libexec/Xorg                           137MiB |
|    0   N/A  N/A      3545      G   /usr/bin/gnome-shell                         25MiB |
|    0   N/A  N/A      5010      G   /usr/lib64/firefox/firefox                  116MiB |
+---------------------------------------------------------------------------------------+

[lipi@llit Escriptori]$ gcc -o nvml_test -ggdb -lnvidia-ml -I/usr/local/cuda-12.2/include/ nvml_test.c
[lipi@llit Escriptori]$ ./nvml_test
Found 1 nvidia devices
Got handle 9b1ff0 for nvidia0
###### Device nvidia0 processes ######
Found 3 processes for device nvidia0
------ Process:1/3
Pid: 3342
usedGpuMemory: 144392192
gpuInstanceId: 4294967295
computeInstanceId: 4294967295
------ Process:2/3
Pid: 3545
usedGpuMemory: 26263552
gpuInstanceId: 4294967295
computeInstanceId: 4294967295
------ Process:3/3
Pid: 5010
usedGpuMemory: 157405184
gpuInstanceId: 4294967295
computeInstanceId: 4294967295

[lipi@llit Escriptori]$ gcc -o nvml_test -ggdb -lnvidia-ml -I/usr/local/cuda-11.8/targets/x86_64-linux/include/ nvml_test.c
[lipi@llit Escriptori]$ ./nvml_test
Found 1 nvidia devices
Got handle 569ff0 for nvidia0
###### Device nvidia0 processes ######
Found 3 processes for device nvidia0
------ Process:1/3
Pid: 3342
usedGpuMemory: 136003584
gpuInstanceId: 4294967295
computeInstanceId: 4294967295
------ Process:2/3
Pid: 0
usedGpuMemory: 3545
gpuInstanceId: 27836416
computeInstanceId: 0
------ Process:3/3
Pid: 4294967295
usedGpuMemory: 0
gpuInstanceId: 5010
computeInstanceId: 0
(In reply to Felip Moll from comment #13)
> After upgrading to cuda12.2, check the difference... now it is working.

Ah, that's great! Well, sort of, because it also means that there's a stronger-than-expected dependency of Slurm (more specifically, of the CUDA version it's been compiled with) on the NVIDIA driver used on the GPU nodes. Meaning that the next NVIDIA driver update on compute nodes can potentially break Slurm. :\

As a quick fix for this issue, I will recompile Slurm with CUDA 12.2, but going forward, two questions:

1. is there a way to make the NVML code in Slurm do more checks on the API and structs, to avoid segfaulting when things don't work as it expects? Disabling GPU utilization and memory accounting would be much better than slurmstepd crashing and making jobs fail.

2. would it be possible to introduce a setting to completely disable the GPU utilization accounting feature, to avoid compatibility issues?

Thanks!
Cheers,
-- Kilian
(In reply to Kilian Cavalotti from comment #14)
> (In reply to Felip Moll from comment #13)
> > After upgrading to cuda12.2, check the difference... now it is working.
>
> Ah, that's great! Well, sort of, because it also means that there's a
> stronger-than-expected dependency of Slurm (more specifically, of the CUDA
> version it's been compiled with) on the NVIDIA driver used on the GPU
> nodes. Meaning that the next NVIDIA driver update on compute nodes can
> potentially break Slurm. :\

Well, it breaks Slurm, yes, but because we're using the nvml library, which is what actually segfaults. It's like saying we're using libc, and if libc crashes then obviously Slurm crashes too. Upgrading the nvidia driver can break Slurm if you're using its libraries, the same way as upgrading pmix, hdf, or any other library we use. It is not great, I agree, since the difference is that this probably requires a recompile.

> As a quick fix for this issue, I will recompile Slurm with CUDA 12.2

I am doing some tests to see if just installing the latest cuda could be enough (which I doubt).

> going forward, two questions:
>
> 1. is there a way to make the NVML code in Slurm do more checks on the API
> and structs, to avoid segfaulting when things don't work as it expects?
> Disabling GPU utilization and memory accounting would be much better than
> slurmstepd crashing and making jobs fail.

I will discuss that internally and see if we can add some checks.

> 2. would it be possible to introduce a setting to completely disable the
> GPU utilization accounting feature, to avoid compatibility issues?

I will discuss that too.
(In reply to Felip Moll from comment #15)
> Well, it breaks Slurm, yes, but because we're using the nvml library, which
> is what actually segfaults. It's like saying we're using libc, and if libc
> crashes then obviously Slurm crashes too. Upgrading the nvidia driver can
> break Slurm if you're using its libraries, the same way as upgrading pmix,
> hdf, or any other library we use. It is not great, I agree, since the
> difference is that this probably requires a recompile.

Yes, I agree, the same kind of issue could also happen with other libraries Slurm depends on, that's true.

> > As a quick fix for this issue, I will recompile Slurm with CUDA 12.2
>
> I am doing some tests to see if just installing the latest cuda could be
> enough (which I doubt).

I can confirm that recompiling Slurm with the same CUDA version as used on the GPU nodes resolves the problem. Hopefully the next CUDA release won't break things again. :\ Is that something you can bring up with NVIDIA?

> > 1. is there a way to make the NVML code in Slurm do more checks on the
> > API and structs to avoid segfaulting if things are not working as it
> > expects? Disabling GPU utilization and memory accounting would be much
> > better than slurmstepd crashing and making jobs fail.
>
> I will discuss that internally and see if we can add some checks.
>
> > 2. would it be possible to introduce a setting to completely disable the
> > GPU utilization accounting feature, to avoid compatibility issues?
>
> I will discuss that too.

Thank you!

Cheers,
-- Kilian
Hi Felip,

(In reply to Kilian Cavalotti from comment #16)
> > > 1. is there a way to make the NVML code in Slurm do more checks on the
> > > API and structs to avoid segfaulting if things are not working as it
> > > expects? Disabling GPU utilization and memory accounting would be much
> > > better than slurmstepd crashing and making jobs fail.
> >
> > I will discuss that internally and see if we can add some checks.

I've tried to track down the issue more closely, and I believe the same issue has been reported here:
https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999

Would the suggestion of defining NVML_NO_UNVERSIONED_FUNC_DEFS and using versioned functions help? (see https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999/5)

Cheers,
-- Kilian
(In reply to Kilian Cavalotti from comment #18)
> I've tried to track down the issue more closely, and I believe the same
> issue has been reported here:
> https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999
>
> Would the suggestion of defining NVML_NO_UNVERSIONED_FUNC_DEFS and using
> versioned functions help? (see
> https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999/5)

I really dislike the idea of using 'versioned' calls to functions. That comment says we should use nvmlDeviceGetComputeRunningProcesses_v2 or similar instead of nvmlDeviceGetComputeRunningProcesses, but how are we supposed to know about nvidia's renaming conventions? Breaking compatibility without any warning is poor API management on nvidia's part.

I am looking at whether I can warn the user when the detected driver version differs from the one Slurm was compiled against; I think that is the only thing we can do if we want to support a weak API.
(In reply to Felip Moll from comment #19)
> (In reply to Kilian Cavalotti from comment #18)
> I really dislike the idea of using 'versioned' calls to functions. That
> comment says we should use nvmlDeviceGetComputeRunningProcesses_v2 or
> similar instead of nvmlDeviceGetComputeRunningProcesses, but how are we
> supposed to know about nvidia's renaming conventions? Breaking
> compatibility without any warning is poor API management on nvidia's part.

Agreed. Plus, the next comment in that thread seems to indicate that the versioned functions are not compatible across driver versions either.

> I am looking at whether I can warn the user when the detected driver
> version differs from the one Slurm was compiled against; I think that is
> the only thing we can do if we want to support a weak API.

A warning couldn't hurt, but I think the more important thing here is to avoid segfaults and aborts. If the check just logs a warning and doesn't prevent slurmd/slurmstepd from crashing, it will be only marginally useful. But if the check prevents slurmd from starting, that won't be a great solution either: the NVIDIA driver's release cadence is much higher than Slurm's, and asking users to recompile Slurm each time a new NVIDIA driver is released is not really sustainable.

So the best solution is likely additional checks to catch possible segfault-type situations, in addition to logging a warning for the admin when an API mismatch is detected. Although this is the first time it's happened in many, many CUDA versions, so maybe we can just hope it was a one-off?

But given the level of partnership between SchedMD and NVIDIA, this is probably something worth discussing with your direct contacts there, isn't it?

Thanks!
-- Kilian
Created attachment 31331 [details]
nvml_test.c
Hi there!

Quick update: after acknowledging the ABI breakage (https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999/9), NVIDIA released a new driver (535.86.10), which apparently breaks things again :(

After a few minutes, any GPU job running on a system with the 535.86.10 driver under a slurmstepd compiled with CUDA 12.2 now fails with:

slurmstepd: [26057335.interactive]: symbol lookup error: /usr/lib64/slurm/gpu_nvml.so: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses_v3

There's still no way to disable GPU accounting on the compute nodes, right? :\

Thanks,
-- Kilian
Sooo, further investigation reveals that this latest driver doesn't seem to provide nvmlDeviceGetGraphicsRunningProcesses_v3 anymore (despite the function still being available in nvml.h from the recommended CUDA version).

In 535.54.03, we could see the 3 versioned functions:

# strings /usr/lib64/libnvidia-ml.so.535.54.03 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2
nvmlDeviceGetGraphicsRunningProcesses_v3

But in 535.86.10, there are just 2 left:

# strings /usr/lib64/libnvidia-ml.so.535.86.10 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2

So the temporary "fix" we deployed was to re-recompile Slurm with the last CUDA version that did *not* ship nvmlDeviceGetGraphicsRunningProcesses_v3 (for us, that's CUDA 11.5), and restart slurmd on all our GPU nodes.

This is pretty annoying though, and I still believe that having a switch to turn GPU metrics collection off in slurmstepd would be useful.

Thanks!
-- Kilian
Felip is on vacation, but I'm watching responses while he is out.

> There's still no way to disable GPU accounting on the compute nodes, right? :\

You are correct. For now, you can obviously recompile Slurm with the latest CUDA version. If you want to disable it, you can also comment out the one call to gpu_g_usage_read() in src/plugins/jobacct_gather/common/common_jag.c, though that obviously requires recompiling Slurm again.

> This is pretty annoying though, and I still believe that having a switch to
> turn GPU metrics collection off in slurmstepd would be useful.

Would you want this to be a job-specific option, a configuration flag, or both? Are you thinking of something similar to how --acctg-freq=<type>=0 can disable sampling of that type? Or maybe a flag in JobAcctGatherParams could disable it cluster-wide.
Hi Kilian, I am back!

I certainly see the ability to disable GPU accounting at the cluster level as a good thing, especially for performance reasons if somebody wanted to avoid the calls to nvml, but I don't see it as a good reason to have this feature just to 'avoid' a faulty driver/ABI or a compatibility issue.

In any case, if you want, I can look into adding this option. Would something like this work for you?

JobAcctGatherParams=NoGPUAcct

Marshall proposed adding a setting like JobAcctGatherFrequency=gpu=0. The problem with that is that 'gpu' is not a jobacctgather plugin per se: the accounting of gpu stats is done from the acctgather energy plugin and from the common code (task=). So in that case we would have to make 'gpu=0' a special flag which acts over all the other jobacctgather plugins. And then, if we set some value greater than 0, e.g. gpu=10, would that mean having a different frequency for the gpu gathering in all plugins?

I think it would be more convenient to have a JobAcctGatherParams=nogpuacct which disables gpu accounting completely, everywhere. We will discuss it internally.
Hi Felip!

(In reply to Felip Moll from comment #31)
> I certainly see the ability to disable GPU accounting at the cluster level
> as a good thing, especially for performance reasons if somebody wanted to
> avoid the calls to nvml, but I don't see it as a good reason to have this
> feature just to 'avoid' a faulty driver/ABI or a compatibility issue.

Agreed! Performance is an even better reason (as calls to NVML are far from free), but having a sort of on/off switch would be good whatever the underlying motivation.

> In any case, if you want, I can look into adding this option. Would
> something like this work for you?
>
> JobAcctGatherParams=NoGPUAcct

Yes, that would definitely work.

> Marshall proposed adding a setting like JobAcctGatherFrequency=gpu=0. The
> problem with that is that 'gpu' is not a jobacctgather plugin per se: the
> accounting of gpu stats is done from the acctgather energy plugin and from
> the common code (task=). So in that case we would have to make 'gpu=0' a
> special flag which acts over all the other jobacctgather plugins. And then,
> if we set some value greater than 0, e.g. gpu=10, would that mean having a
> different frequency for the gpu gathering in all plugins?

Yes, understood, and it makes complete sense to me.

> I think it would be more convenient to have a JobAcctGatherParams=nogpuacct
> which disables gpu accounting completely, everywhere. We will discuss it
> internally.

Sounds great, thanks a lot for taking this into consideration!

Cheers,
-- Kilian
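For reference, a slurm.conf fragment using the knob discussed above might look like the sketch below. Note that NoGPUAcct is the name proposed in this thread; the option that eventually ships may be spelled differently, so check the slurm.conf man page of your release before using it.

```
# slurm.conf (sketch; option name as proposed in this thread)
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
JobAcctGatherParams=NoGPUAcct    # disable GPU usage gathering cluster-wide
```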
*** Ticket 17355 has been marked as a duplicate of this ticket. ***
*** Ticket 17637 has been marked as a duplicate of this ticket. ***
*** Ticket 17398 has been marked as a duplicate of this ticket. ***
*** Ticket 17915 has been marked as a duplicate of this ticket. ***
*** Ticket 19018 has been marked as a duplicate of this ticket. ***
Kilian,

I guess you're already aware of the option merged into Slurm 23.02.7[1] (I just took over the case, since Felip is on PTO).

We're still working on this, trying to figure out the best way to make Slurm more resistant to unexpected ABI changes on the NVML side. I'll keep you posted on the progress.

Are you OK with lowering the case severity to 4?

cheers,
Marcin
Hi Marcin,

(In reply to Marcin Stolarek from comment #72)
> I guess you're already aware of the option merged into Slurm 23.02.7[1] (I
> just took over the case, since Felip is on PTO).
>
> We're still working on this, trying to figure out the best way to make
> Slurm more resistant to unexpected ABI changes on the NVML side. I'll keep
> you posted on the progress.

Thanks!

> Are you OK with lowering the case severity to 4?

Yes, that sounds right.

Cheers,
-- Kilian
Kilian,

We've discussed a few ways we could further improve Slurm's resistance to changes in NVML, and it looks like there is nothing more we can do, since there are no rules on Nvidia's side that we can rely on here.

Slurm 24.11 comes with a new gpu/nvidia plugin that gathers information from the Linux pseudo-filesystems (i.e. /sys, /proc), which doesn't require linking against nvidia-ml. Unfortunately, its functionality is limited today. Hopefully more information will be exposed by Nvidia's kernel drivers over time, and long term gpu/nvidia will be able to replace gpu/nvml.

I'll go ahead and close the ticket as "won't fix". Should you have any questions, please reopen.

cheers,
Marcin