Created attachment 31064 [details]
Output of 'thread apply all bt full'

Hi,

We just upgraded Sherlock to 23.02.3, and are now seeing waves of slurmstepd segfaults (more precisely, aborts), which lead to failed jobs and leftover processes on compute nodes. They generate core files in the compute nodes' slurmd spool dir, and the trace looks like this:

(gdb) bt
#0  0x00002b398fddb387 in raise () from /usr/lib64/libc.so.6
#1  0x00002b398fddca78 in abort () from /usr/lib64/libc.so.6
#2  0x00002b398fe1df67 in __libc_message () from /usr/lib64/libc.so.6
#3  0x00002b398fe26329 in _int_free () from /usr/lib64/libc.so.6
#4  0x00002b398e75ed48 in slurm_xfree (item=item@entry=0x2b398e50c7e0) at xmalloc.c:213
#5  0x00002b3994481018 in gpu_p_usage_read (pid=10348, data=0x2b399c002600) at gpu_nvml.c:1818
#6  0x00002b398e770dc3 in gpu_g_usage_read (pid=pid@entry=10348, data=<optimized out>) at gpu.c:249
#7  0x00002b3992c8ed19 in _handle_stats (tres_count=13, callbacks=<optimized out>, pid=10348) at common_jag.c:585
#8  _get_precs (task_list=<optimized out>, cont_id=<optimized out>, callbacks=<optimized out>) at common_jag.c:642
#9  0x00002b3992c8f53d in jag_common_poll_data (task_list=0x2451cb0, cont_id=10282, callbacks=callbacks@entry=0x2b3992e92340 <callbacks.15582>, profile=true) at common_jag.c:901
#10 0x00002b3992c8e2ee in jobacct_gather_p_poll_data (task_list=<optimized out>, cont_id=<optimized out>, profile=<optimized out>) at jobacct_gather_linux.c:271
#11 0x00002b398e785aa5 in _poll_data (profile=profile@entry=true) at jobacct_gather.c:323
#12 0x00002b398e785cf1 in _watch_tasks (arg=<optimized out>) at jobacct_gather.c:361
#13 0x00002b398f268ea5 in start_thread () from /usr/lib64/libpthread.so.0
#14 0x00002b398fea3b0d in clone () from /usr/lib64/libc.so.6

Output of 'thread apply all bt full' is attached.

Thanks!
-- Kilian
Hi Kilian,

I am looking into this now. The code has changed substantially from 23.02 to master. I will come back as soon as I have something.

It seems the issue may come from nvmlDeviceGetGraphicsRunningProcesses() and nvmlDeviceGetComputeRunningProcesses() returning 0.

The fix might be as easy as:

diff --git a/src/plugins/gpu/nvml/gpu_nvml.c b/src/plugins/gpu/nvml/gpu_nvml.c
index af05ad2782..3015415b3d 100644
--- a/src/plugins/gpu/nvml/gpu_nvml.c
+++ b/src/plugins/gpu/nvml/gpu_nvml.c
@@ -1815,7 +1815,9 @@ extern int gpu_p_usage_read(pid_t pid, acct_gather_data_t *data)
 				proc_info[j].usedGpuMemory;
 			break;
 		}
-	xfree(proc_info);
+
+	if (proc_info)
+		xfree(proc_info);
 
 	log_flag(JAG, "pid %d has GPUUtil=%lu and MemMB=%lu",
 		 pid,

but before confirming that, let me understand why gcnt and ccnt can be 0.
Hi Felip,

(In reply to Felip Moll from comment #1)
> I am looking into that now. The code has changed substantially from 23.02
> to master. I will come back as soon as I have something.
>
> It seems the issue may come from nvmlDeviceGetGraphicsRunningProcesses()
> and nvmlDeviceGetComputeRunningProcesses() returning 0.
>
> The fix might be as easy as:
>
> diff --git a/src/plugins/gpu/nvml/gpu_nvml.c
> b/src/plugins/gpu/nvml/gpu_nvml.c
> index af05ad2782..3015415b3d 100644
> --- a/src/plugins/gpu/nvml/gpu_nvml.c
> +++ b/src/plugins/gpu/nvml/gpu_nvml.c
> @@ -1815,7 +1815,9 @@ extern int gpu_p_usage_read(pid_t pid,
> acct_gather_data_t *data)
> 				proc_info[j].usedGpuMemory;
> 			break;
> 		}
> -	xfree(proc_info);
> +
> +	if (proc_info)
> +		xfree(proc_info);
>
> 	log_flag(JAG, "pid %d has GPUUtil=%lu and MemMB=%lu",
> 		 pid,

Thanks! We may have to apply that patch as a stopgap measure, because all our GPU jobs that last longer than JobAcctGatherFrequency are failing :(

Or is there a way to disable the new GPU usage gathering?

> but before confirming that let me understand why gcnt and ccnt can be 0.

Thank you!

Cheers,
-- Kilian
Kilian, my first assumptions were incorrect. The patch does not work.

I am looking for a new workaround for you.

It is strange, and must be related to how nvmlDeviceGetComputeRunningProcesses() handles the proc_info structure, because the fault is in libc's free, as if some internal bytes were used and locked in the nvml library, or similar.

I have reproduced it and am trying to debug the issue at the moment.

I will report back ASAP.
(In reply to Felip Moll from comment #4)
> Kilian, my first assumptions were incorrect. The patch does not work.
>
> I am looking for a new workaround for you.
>
> It is strange, and must be related to how
> nvmlDeviceGetComputeRunningProcesses() handles the proc_info structure,
> because the fault is in libc's free, as if some internal bytes were used
> and locked in the nvml library, or similar.
>
> I have reproduced it and am trying to debug the issue at the moment.
>
> I will report back ASAP.

Much appreciated, thanks Felip!

Cheers,
-- Kilian
Kilian,

At the moment I am finding what looks like memory corruption after calling nvmlDeviceGetGraphicsRunningProcesses(). I still don't know if it's our fault or a bug in nvidia's nvml.

I extracted the code that gets the process information into a separate program outside of Slurm, and in my tests I can see how it correctly detects 3 processes but fills the structs incorrectly. Note for example how usedGpuMemory contains the pid of one process instead of the used memory.

]$ ./main
Found 3 processes for device index 0
------ Process:1/3
Pid: 3342
usedGpuMemory: 144392192
------ Process:2/3
Pid: 0
usedGpuMemory: 3545
------ Process:3/3
Pid: 4294967295
usedGpuMemory: 0

]$ nvidia-smi
Wed Jul  5 17:39:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GT 1030         Off | 00000000:07:00.0  On |                  N/A |
| 35%   43C    P0              N/A /  30W |    346MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3342      G   /usr/libexec/Xorg                           137MiB |
|    0   N/A  N/A      3545      G   /usr/bin/gnome-shell                         53MiB |
|    0   N/A  N/A      5010      G   /usr/lib64/firefox/firefox                  152MiB |
+---------------------------------------------------------------------------------------+
Question: do your nvidia dev library versions match the installed driver version? I doubt nvmlProcessInfo_t itself changed, but a mismatch between the installed libraries, driver, and headers could be a factor.
(In reply to Felip Moll from comment #9)
> Question: do your nvidia dev library versions match the installed driver
> version? I doubt nvmlProcessInfo_t itself changed, but a mismatch between
> the installed libraries, driver, and headers could be a factor.

That's a good point and I had the same intuition, so I started looking at that, but it looks like things should work: we compiled Slurm 23.02.3 with the CUDA 12.1 headers, and our GPU nodes are using NVIDIA driver 535.54.03 (which reports "CUDA Version: 12.2").

I'd be surprised if there were a major struct change between 12.1.1 and 12.2, especially if you're seeing the same thing on your end. I can try to recompile slurmd/slurmstepd with CUDA 12.2 to see if it changes anything.

Cheers,
-- Kilian
Created attachment 31088 [details]
nvml_test.c

I am actually testing outside of Slurm with this test program. My cuda version is 11.8. I will now install the latest 12.2 to see if it fixes my issues. My driver version is 535.54.03 like yours, which reports cuda 12.2.

You can try my test program; under valgrind I get several invalid-write errors inside nvmlDeviceGetGraphicsRunningProcesses_v3:

]$ gcc -o nvml_test -ggdb -lnvidia-ml -I/usr/local/cuda-11.8/targets/x86_64-linux/include/ nvml_test.c

Will come back after I have tested with 12.2.
After upgrading to cuda12.2, check the difference... now it is working.

[lipi@llit Escriptori]$ nvidia-smi
Wed Jul  5 18:38:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GT 1030         Off | 00000000:07:00.0  On |                  N/A |
| 35%   44C    P0              N/A /  30W |    282MiB /  2048MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3342      G   /usr/libexec/Xorg                           137MiB |
|    0   N/A  N/A      3545      G   /usr/bin/gnome-shell                         25MiB |
|    0   N/A  N/A      5010      G   /usr/lib64/firefox/firefox                  116MiB |
+---------------------------------------------------------------------------------------+

[lipi@llit Escriptori]$ gcc -o nvml_test -ggdb -lnvidia-ml -I/usr/local/cuda-12.2/include/ nvml_test.c
[lipi@llit Escriptori]$ ./nvml_test
Found 1 nvidia devices
Got handle 9b1ff0 for nvidia0
###### Device nvidia0 processes ######
Found 3 processes for device nvidia0
------ Process:1/3
Pid: 3342
usedGpuMemory: 144392192
gpuInstanceId: 4294967295
computeInstanceId: 4294967295
------ Process:2/3
Pid: 3545
usedGpuMemory: 26263552
gpuInstanceId: 4294967295
computeInstanceId: 4294967295
------ Process:3/3
Pid: 5010
usedGpuMemory: 157405184
gpuInstanceId: 4294967295
computeInstanceId: 4294967295

[lipi@llit Escriptori]$ gcc -o nvml_test -ggdb -lnvidia-ml -I/usr/local/cuda-11.8/targets/x86_64-linux/include/ nvml_test.c
[lipi@llit Escriptori]$ ./nvml_test
Found 1 nvidia devices
Got handle 569ff0 for nvidia0
###### Device nvidia0 processes ######
Found 3 processes for device nvidia0
------ Process:1/3
Pid: 3342
usedGpuMemory: 136003584
gpuInstanceId: 4294967295
computeInstanceId: 4294967295
------ Process:2/3
Pid: 0
usedGpuMemory: 3545
gpuInstanceId: 27836416
computeInstanceId: 0
------ Process:3/3
Pid: 4294967295
usedGpuMemory: 0
gpuInstanceId: 5010
computeInstanceId: 0
(In reply to Felip Moll from comment #13)
> After upgrading to cuda12.2, check the difference... now it is working.

Ah, that's great! Well, sort of, because it also means that there's a stronger-than-expected dependency of Slurm (more specifically, of the CUDA version it's been compiled with) on the NVIDIA driver used on the GPU nodes. Meaning that the next NVIDIA driver update on compute nodes can potentially break Slurm. :\

As a quick fix for this issue, I will recompile Slurm with CUDA 12.2, but going forward, two questions:

1. is there a way to make the NVML code in Slurm do more checks on the API and structs, to avoid segfaulting when things don't work as it expects? Disabling GPU utilization and memory accounting would be much better than slurmstepd crashing and making jobs fail.

2. would it be possible to introduce a setting to completely disable the GPU utilization accounting feature, to avoid compatibility issues?

Thanks!
Cheers,
-- Kilian
(In reply to Kilian Cavalotti from comment #14)
> (In reply to Felip Moll from comment #13)
> > After upgrading to cuda12.2, check the difference... now it is working.
>
> Ah, that's great! Well, sort of, because it also means that there's a
> stronger-than-expected dependency of Slurm (more specifically, of the CUDA
> version it's been compiled with) on the NVIDIA driver used on the GPU
> nodes. Meaning that the next NVIDIA driver update on compute nodes can
> potentially break Slurm. :\

Well, it breaks Slurm, yes, but because we're using the nvml library, which is what actually segfaults. It's like saying we're using libc, and if libc crashes then obviously Slurm crashes too. Upgrading the nvidia driver can break Slurm if you're using its libraries, the same way as upgrading pmix, hdf, or any other library we use. It is not great, I agree, since the difference is that this probably requires a recompile.

> As a quick fix for this issue, I will recompile Slurm with CUDA 12.2

I am doing some tests to see if just installing the latest cuda could be enough (which I doubt).

> going forward, two questions:
>
> 1. is there a way to make the NVML code in Slurm do more checks on the API
> and structs, to avoid segfaulting when things don't work as it expects?
> Disabling GPU utilization and memory accounting would be much better than
> slurmstepd crashing and making jobs fail.

I will discuss that internally and see if we can add some checks.

> 2. would it be possible to introduce a setting to completely disable the
> GPU utilization accounting feature, to avoid compatibility issues?

I will discuss that too.
(In reply to Felip Moll from comment #15)
> Well, it breaks Slurm, yes, but because we're using the nvml library, which
> is what actually segfaults. It's like saying we're using libc, and if libc
> crashes then obviously Slurm crashes too. Upgrading the nvidia driver can
> break Slurm if you're using its libraries, the same way as upgrading pmix,
> hdf, or any other library we use. It is not great, I agree, since the
> difference is that this probably requires a recompile.

Yes, I agree, the same kind of issue could also happen with other libraries Slurm depends on, that's true.

> > As a quick fix for this issue, I will recompile Slurm with CUDA 12.2
>
> I am doing some tests to see if just installing the latest cuda could be
> enough (which I doubt).

I can confirm that recompiling Slurm with the same CUDA version as used on the GPU nodes resolves the problem. Hopefully the next CUDA release won't break things again. :\ Is that something you can bring up with NVIDIA?

> > 1. is there a way to make the NVML code in Slurm do more checks on the
> > API and structs to avoid segfaulting if things are not working as it
> > expects? Disabling GPU utilization and memory accounting would be much
> > better than slurmstepd crashing and making jobs fail.
>
> I will discuss that internally and see if we can add some checks.
>
> > 2. would it be possible to introduce a setting to completely disable the
> > GPU utilization accounting feature, to avoid compatibility issues?
>
> I will discuss that too.

Thank you!

Cheers,
-- Kilian
Hi Felip,

(In reply to Kilian Cavalotti from comment #16)
> > > 1. is there a way to make the NVML code in Slurm do more checks on the
> > > API and structs to avoid segfaulting if things are not working as it
> > > expects? Disabling GPU utilization and memory accounting would be much
> > > better than slurmstepd crashing and making jobs fail.
> >
> > I will discuss that internally and see if we can add some checks.

I've tried to track down the issue more closely, and I believe the same issue has been reported here:
https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999

Would the suggestion of defining NVML_NO_UNVERSIONED_FUNC_DEFS and using versioned functions help? (see https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999/5)

Cheers,
-- Kilian
(In reply to Kilian Cavalotti from comment #18)
> I've tried to track down the issue more closely, and I believe the same
> issue has been reported here:
> https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999
>
> Would the suggestion of defining NVML_NO_UNVERSIONED_FUNC_DEFS and using
> versioned functions help? (see
> https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999/5)

I really dislike the idea of using 'versioned' calls to functions. That comment says we should use nvmlDeviceGetComputeRunningProcesses_v2 or similar instead of nvmlDeviceGetComputeRunningProcesses, but how are we supposed to know about nvidia's renaming conventions? Breaking compatibility without any warning is poor API management on nvidia's part.

I am looking at whether I can warn the user when the detected driver version differs from the one Slurm was compiled against; I think that is the only thing we can do if we want to support a weak API.
(In reply to Felip Moll from comment #19)
> (In reply to Kilian Cavalotti from comment #18)
> I really dislike the idea of using 'versioned' calls to functions. That
> comment says we should use nvmlDeviceGetComputeRunningProcesses_v2 or
> similar instead of nvmlDeviceGetComputeRunningProcesses, but how are we
> supposed to know about nvidia's renaming conventions? Breaking
> compatibility without any warning is poor API management on nvidia's part.

Agreed. Plus, the next comment in that thread seems to indicate that the versioned functions are not compatible across driver versions either.

> I am looking at whether I can warn the user when the detected driver
> version differs from the one Slurm was compiled against; I think that is
> the only thing we can do if we want to support a weak API.

A warning couldn't hurt, but I think the more important thing here is to avoid segfaults and aborts. If the check just logs a warning and doesn't prevent slurmd/slurmstepd from crashing, it will be only marginally useful. But if the check prevents slurmd from starting, that won't be a great solution either: the NVIDIA driver's release cadence is much higher than Slurm's, and asking users to recompile Slurm each time a new NVIDIA driver is released is not really sustainable.

So the best solution is likely additional checks to catch possible segfault-type situations, in addition to logging a warning for the admin when an API mismatch is detected. Although this is the first time it's happened in many, many CUDA versions, so maybe we can just hope it was a one-off?

But given the level of partnership between SchedMD and NVIDIA, this is probably something worth discussing with your direct contacts there, isn't it?

Thanks!
-- Kilian
Created attachment 31331 [details]
nvml_test.c
Hi there!

Quick update: after acknowledging the ABI breakage (https://forums.developer.nvidia.com/t/nvml-12-535-43-02-breaks-backwards-compatibility/254999/9), NVIDIA released a new driver (535.86.10), which apparently breaks things again :(

After a few minutes, any GPU job running on a system with the 535.86.10 driver under a slurmstepd compiled with CUDA 12.2 now fails with:

slurmstepd: [26057335.interactive]: symbol lookup error: /usr/lib64/slurm/gpu_nvml.so: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses_v3

There's still no way to disable GPU accounting on the compute nodes, right? :\

Thanks,
-- Kilian
Sooo, further investigation reveals that this latest driver doesn't seem to provide nvmlDeviceGetGraphicsRunningProcesses_v3 anymore (despite the function still being available in nvml.h from the recommended CUDA version).

In 535.54.03, we could see the 3 versioned functions:

# strings /usr/lib64/libnvidia-ml.so.535.54.03 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2
nvmlDeviceGetGraphicsRunningProcesses_v3

But in 535.86.10, there are just 2 left:

# strings /usr/lib64/libnvidia-ml.so.535.86.10 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2

So the temporary "fix" we deployed was to re-recompile Slurm with the last CUDA version that did *not* ship nvmlDeviceGetGraphicsRunningProcesses_v3 (for us, that's CUDA 11.5), and restart slurmd on all our GPU nodes.

This is pretty annoying though, and I still believe that having a switch to turn GPU metrics collection off in slurmstepd would be useful.

Thanks!
-- Kilian
Felip is on vacation, but I'm watching responses while he is out.

> There's still no way to disable GPU accounting on the compute nodes, right? :\

You are correct. For now, you can obviously recompile Slurm with the latest CUDA version. If you want to disable it, you can also comment out the one call to gpu_g_usage_read() in src/plugins/jobacct_gather/common/common_jag.c, though that obviously requires recompiling Slurm again.

> This is pretty annoying though, and I still believe that having a switch to
> turn GPU metrics collection off in slurmstepd would be useful.

Would you want this to be a job-specific option, a configuration flag, or both? Are you thinking of something similar to how --acctg-freq=<type>=0 can disable sampling of that type? Or maybe a flag in JobAcctGatherParams could disable it cluster-wide.
Hi Kilian, I am back!

I certainly see the ability to disable GPU accounting at the cluster level as a good thing, especially for performance reasons if somebody wanted to avoid the calls to nvml, but I don't see it as a good reason to have this feature just to 'avoid' a faulty driver/ABI or a compatibility issue.

In any case, if you want, I can look into adding this option. Would something like this work for you?

JobAcctGatherParams=NoGPUAcct

Marshall proposed adding a setting like JobAcctGatherFrequency=gpu=0. The problem with that is that 'gpu' is not a jobacctgather plugin per se: the accounting of gpu stats is done from the acctgather energy plugin and from the common code (task=). So in that case we would have to make 'gpu=0' a special flag which acts over all the other jobacctgather plugins. And then, if we set some value greater than 0, e.g. gpu=10, would that mean having a different frequency for the gpu gathering in all plugins?

I think it would be more convenient to have a JobAcctGatherParams=nogpuacct which disables gpu accounting completely, everywhere. We will discuss it internally.
Hi Felip!

(In reply to Felip Moll from comment #31)
> I certainly see the ability to disable GPU accounting at the cluster level
> as a good thing, especially for performance reasons if somebody wanted to
> avoid the calls to nvml, but I don't see it as a good reason to have this
> feature just to 'avoid' a faulty driver/ABI or a compatibility issue.

Agreed! Performance is an even better reason (as calls to NVML are far from free), but having a sort of on/off switch would be good whatever the underlying motivation.

> In any case, if you want, I can look into adding this option. Would
> something like this work for you?
>
> JobAcctGatherParams=NoGPUAcct

Yes, that would definitely work.

> Marshall proposed adding a setting like JobAcctGatherFrequency=gpu=0. The
> problem with that is that 'gpu' is not a jobacctgather plugin per se: the
> accounting of gpu stats is done from the acctgather energy plugin and from
> the common code (task=). So in that case we would have to make 'gpu=0' a
> special flag which acts over all the other jobacctgather plugins. And then,
> if we set some value greater than 0, e.g. gpu=10, would that mean having a
> different frequency for the gpu gathering in all plugins?

Yes, understood, and it makes complete sense to me.

> I think it would be more convenient to have a JobAcctGatherParams=nogpuacct
> which disables gpu accounting completely, everywhere. We will discuss it
> internally.

Sounds great, thanks a lot for taking this into consideration!

Cheers,
-- Kilian
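For reference, a slurm.conf fragment using the knob discussed above might look like the sketch below. Note that NoGPUAcct is the name proposed in this thread; the option that eventually ships may be spelled differently, so check the slurm.conf man page of your release before using it.

```
# slurm.conf (sketch; option name as proposed in this thread)
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
JobAcctGatherParams=NoGPUAcct    # disable GPU usage gathering cluster-wide
```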
*** Ticket 17355 has been marked as a duplicate of this ticket. ***
*** Ticket 17637 has been marked as a duplicate of this ticket. ***
*** Ticket 17398 has been marked as a duplicate of this ticket. ***
*** Ticket 17915 has been marked as a duplicate of this ticket. ***
*** Ticket 19018 has been marked as a duplicate of this ticket. ***
Kilian,

I guess you're already aware of the option merged into Slurm 23.02.7[1] (I just took over the case, since Felip is on PTO).

We're still working on this, trying to figure out the best way to make Slurm more resistant to unexpected ABI changes on the NVML side. I'll keep you posted on the progress.

Are you OK with lowering the case severity to 4?

cheers,
Marcin
Hi Marcin,

(In reply to Marcin Stolarek from comment #72)
> I guess you're already aware of the option merged into Slurm 23.02.7[1] (I
> just took over the case, since Felip is on PTO).
>
> We're still working on this, trying to figure out the best way to make
> Slurm more resistant to unexpected ABI changes on the NVML side. I'll keep
> you posted on the progress.

Thanks!

> Are you OK with lowering the case severity to 4?

Yes, that sounds right.

Cheers,
-- Kilian
Kilian,

We've discussed a few ways we could further improve Slurm's resistance to changes in NVML, and it looks like there is nothing more we can do, since there are no rules on Nvidia's side that we can rely on here.

Slurm 24.11 comes with a new gpu/nvidia plugin that gathers information from the Linux pseudo-filesystems (i.e. /sys, /proc), which doesn't require linking against nvidia-ml. Unfortunately, its functionality is limited today. Hopefully more information will be exposed by Nvidia's kernel drivers over time, and long term gpu/nvidia will be able to replace gpu/nvml.

I'll go ahead and close the ticket as "won't fix". Should you have any questions, please reopen.

cheers,
Marcin