Created attachment 40338 [details] cgroup.conf

I want to emphasize that this problem started after the upgrade to slurm 24.11.0. Also, it only occurs when requesting a node via "srun --x11..." or "salloc --x11". It does NOT occur when I ssh directly to a node via "ssh -Y..." or "ssh -X".

When using Matlab 2023b in X11 GUI mode on a Linux system, when we attempt a plot, or even when we simply do "opengl info", we get:

com.jogamp.nativewindow.NativeWindowException: X11Util.Display: Unable to create a display(localhost:36.0) connection. Thread AWT-EventQueue-0-SharedResourceRunner
    at jogamp.nativewindow.x11.X11Util.openDisplay(X11Util.java:453)
    at jogamp.opengl.x11.glx.X11GLXDrawableFactory$SharedResourceImplementation.createSharedResource(X11GLXDrawableFactory.java:266)
    at jogamp.opengl.SharedResourceRunner.run(SharedResourceRunner.java:297)
    at java.lang.Thread.run(Thread.java:748)
MATLAB has experienced a low-level graphics error, and may not have drawn correctly.
Read about what you can do to prevent this issue at Resolving Low-Level Graphics Issues then restart MATLAB.
To share details of this issue with MathWorks technical support, please include this file with your service request.

These articles have been somewhat helpful:
https://www.mathworks.com/matlabcentral/answers/1468426-cannot-enable-hardware-opengl-r2021b-ubuntu-20-04
https://www.mathworks.com/help/matlab/matlab_env/java-opts-file.html?searchHighlight=java.opts

Also see
https://www.mathworks.com/matlabcentral/answers/1934060-what-are-the-downside-of-setting-djogl-disable-openglarbcontext-1-in-linux
and
https://github.com/robotology/robotology-superbuild/issues/953

Using

    export JAVA_TOOL_OPTIONS="-Djogl.disable.openglarbcontext=1"
    matlab

sometimes helps. Using

    MESA_LOADER_DRIVER_OVERRIDE=i965 matlab

usually helps, but not always.

We are on a Linux cluster using slurm, and this behavior just started after upgrading to 24.11.0. Multiple users have reported this issue, and I am able to reproduce it.
We occasionally see the message: salloc: error: _half_duplex: wrote -1 of 1928 when starting up Matlab.
Created attachment 40339 [details] slurm.conf
Created attachment 40340 [details] nodes.conf (included by slurm.conf)
Created attachment 40341 [details] gres.conf
What version of slurm were you at before the update? -Connor
23.02.5
Additional info. On the nodes where these jobs are landing, I'm seeing a lot of messages in our slurmd.log from the Matlab jobs that look like:

[2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32
[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: wrote -1 of 64
[2024-12-09T12:24:44.773] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T13:00:08.453] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T16:05:56.487] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T16:05:56.583] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-10T13:23:46.449] [54540210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-11T09:03:47.693] [54541947.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-11T09:56:54.825] [54541947.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-12T13:19:21.225] [54543975.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-12T13:19:21.370] [54543975.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-15T19:10:52.373] [54547840.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-16T12:46:16.675] [54549056.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-17T19:56:54.664] [54549861.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-18T13:12:15.609] [54550739.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-18T16:13:11.154] [54550870.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-26T10:35:54.699] [54557926.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-28T17:18:19.158] [54559695.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-30T14:10:51.757] [54560772.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T11:39:46.005] [54560878.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T11:39:46.010] [54560878.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T17:50:16.064] [54560894.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-02T17:10:36.345] [54561023.extern] error: _half_duplex: wrote -1 of 2328
[2025-01-03T09:43:04.699] [54561083.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-05T17:45:11.708] [54561231.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-07T15:38:42.882] [54568414.extern] error: _half_duplex: wrote -1 of 2328
[2025-01-07T16:43:40.220] [54571462.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-07T16:43:40.227] [54571462.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-08T10:24:57.304] [54579037.extern] error: _half_duplex: wrote -1 of 2132
[2025-01-09T12:05:38.145] [54592342.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-09T12:53:51.887] [54595892.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-09T12:53:51.928] [54595892.extern] error: _half_duplex: wrote -1 of 44
[2025-01-09T13:02:14.880] [54592370.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-10T10:49:48.517] [54651002.extern] error: _half_duplex: read error -1 Connection reset by peer
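
For anyone who wants to gauge how often these errors hit each job, here is a minimal Python sketch that tallies the `_half_duplex` messages per job ID and per direction (read vs. write). It assumes one log entry per line in the format shown above. The function name `summarize_half_duplex` and the three sample lines are only for illustration and are not part of Slurm.

```python
import re
from collections import Counter

# Matches slurmd.log entries of the form quoted in this ticket, e.g.
# [2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32
HALF_DUPLEX = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<job>\d+)\.extern\] "
    r"error: _half_duplex: (?P<msg>.+)$"
)


def summarize_half_duplex(lines):
    """Count _half_duplex errors per job ID and per direction (read/write)."""
    per_job = Counter()
    per_kind = Counter()
    for line in lines:
        match = HALF_DUPLEX.match(line.strip())
        if match is None:
            continue  # not a _half_duplex error from an extern step
        per_job[match.group("job")] += 1
        # "wrote -1 of N" is a failed write; anything else here is a read error
        kind = "write" if match.group("msg").startswith("wrote") else "read"
        per_kind[kind] += 1
    return per_job, per_kind


# Three entries copied from the log excerpt above, used as sample input.
sample = [
    "[2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32",
    "[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer",
    "[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: wrote -1 of 64",
]
per_job, per_kind = summarize_half_duplex(sample)
```

To run it against a real log, pass an open file handle instead of `sample`; the location of slurmd.log depends on the `SlurmdLogFile` setting in slurm.conf, so no path is assumed here.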
This behavior is tracking with an ongoing issue that stems from a change in commit https://github.com/SchedMD/slurm/commit/ecfc7f6ff7. We are actively working on a fix that solves both issues. Short of reverting the change in the commit linked above, we don't have a workaround at this time. -Connor
OK, should we keep this ticket open until a fix is available? Do you have an ETA?
Yes, you can leave this ticket open, and I'll be sure to respond here when we land on a fix. We're actively working on it, so our goal is to get it out asap, but no release has a fix lined up yet. Thanks, Connor
Just checking to see if there is a fix / release plan for this?
Hey Jeff, Sorry, it's still a work in progress at the moment. However, there is a patch in the pipeline going through the review process now. I appreciate your patience and hope we can push it out soon. -Connor