Ticket 21783 - MATLAB Error - Low Level Graphics Issue
Summary: MATLAB Error - Low Level Graphics Issue
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 24.11.0
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Connor
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-01-09 13:41 MST by Jeff Haferman
Modified: 2025-03-19 12:22 MDT

See Also:
Site: NPS HPC
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Attachments
cgroup.conf (218 bytes, text/x-matlab)
2025-01-09 13:41 MST, Jeff Haferman
slurm.conf (3.08 KB, text/plain)
2025-01-09 13:41 MST, Jeff Haferman
nodes.conf (included by slurm.conf) (6.58 KB, text/x-matlab)
2025-01-09 13:42 MST, Jeff Haferman
gres.conf (1.46 KB, text/plain)
2025-01-09 13:42 MST, Jeff Haferman

Description Jeff Haferman 2025-01-09 13:41:19 MST
Created attachment 40338
cgroup.conf

I want to emphasize that this problem started after upgrading to Slurm 24.11.0. It occurs only when requesting a node via "srun --x11..." or "salloc --x11".

It does NOT occur when I ssh directly to a node via "ssh -Y..." or "ssh -X".

When running MATLAB R2023b in X11 GUI mode on a Linux system, any attempt to plot, or even just running "opengl info", produces:

com.jogamp.nativewindow.NativeWindowException: X11Util.Display: Unable to create a display(localhost:36.0) connection. Thread AWT-EventQueue-0-SharedResourceRunner
at jogamp.nativewindow.x11.X11Util.openDisplay(X11Util.java:453)
at jogamp.opengl.x11.glx.X11GLXDrawableFactory$SharedResourceImplementation.createSharedResource(X11GLXDrawableFactory.java:266)
at jogamp.opengl.SharedResourceRunner.run(SharedResourceRunner.java:297)
at java.lang.Thread.run(Thread.java:748)
MATLAB has experienced a low-level graphics error, and may not have drawn correctly.
Read about what you can do to prevent this issue at Resolving Low-Level Graphics Issues then restart MATLAB.
To share details of this issue with MathWorks technical support,
please include this file with your service request.
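
The exception names the display localhost:36.0, so one MATLAB-independent check is whether anything accepts connections on the matching TCP port from inside the "srun --x11" or "salloc --x11" session. The following is a minimal sketch, assuming the usual X11 convention that display N on a named host maps to TCP port 6000 + N. The function names are illustrative and are not part of Slurm or MATLAB.

```python
import socket


def parse_display(display):
    """Split an X11 DISPLAY string 'host:N[.S]' into (host, display_number)."""
    host, _, rest = display.rpartition(":")
    number = int(rest.split(".")[0])
    return host, number


def x11_tcp_reachable(display, timeout=2.0):
    """Return True if a TCP connection to the display's port can be opened.

    Assumption: display N is served on TCP port 6000 + N of the named host.
    """
    host, number = parse_display(display)
    if not host:
        # A DISPLAY such as ':0' refers to a local socket, not a TCP endpoint.
        raise ValueError("DISPLAY has no host part; nothing to probe over TCP")
    try:
        with socket.create_connection((host, 6000 + number), timeout=timeout):
            return True
    except OSError:
        return False
```

Inside the allocation, calling x11_tcp_reachable(os.environ["DISPLAY"]) (after "import os") probes the forwarded display. A successful connect only shows that a listener exists; it does not prove that the tunnel carries X11 traffic end to end.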

These articles have been somewhat helpful:
https://www.mathworks.com/matlabcentral/answers/1468426-cannot-enable-hardware-opengl-r2021b-ubuntu-20-04
https://www.mathworks.com/help/matlab/matlab_env/java-opts-file.html?searchHighlight=java.opts

Also see
https://www.mathworks.com/matlabcentral/answers/1934060-what-are-the-downside-of-setting-djogl-disable-openglarbcontext-1-in-linux
AND
https://github.com/robotology/robotology-superbuild/issues/953

Using
export JAVA_TOOL_OPTIONS="-Djogl.disable.openglarbcontext=1"
matlab

sometimes helps. Using
MESA_LOADER_DRIVER_OVERRIDE=i965 matlab

usually helps, but not always. We are on a Linux cluster running Slurm, and this behavior started only after upgrading to 24.11.0.
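
For sites that want to hand users a single launcher, the two workarounds above can be applied together. The sketch below is hypothetical: combining both settings is an assumption that was not tested in this ticket, the helper names are made up for illustration, and the i965 value only makes sense on nodes where that Mesa driver applies.

```python
import os
import subprocess

# Values copied from the ticket; neither is reported to help every time.
JOGL_FLAG = "-Djogl.disable.openglarbcontext=1"
MESA_DRIVER = "i965"


def workaround_env(base_env, mesa_override=True):
    """Return a copy of base_env with the ticket's workaround variables set."""
    env = dict(base_env)
    existing = env.get("JAVA_TOOL_OPTIONS", "")
    # Append rather than overwrite, so a user's own JVM options survive.
    if JOGL_FLAG not in existing.split():
        env["JAVA_TOOL_OPTIONS"] = (existing + " " + JOGL_FLAG).strip()
    if mesa_override:
        env["MESA_LOADER_DRIVER_OVERRIDE"] = MESA_DRIVER
    return env


def launch_matlab(matlab_cmd="matlab"):
    """Start MATLAB with the workaround environment; returns its exit code."""
    return subprocess.run([matlab_cmd], env=workaround_env(os.environ)).returncode
```

Passing mesa_override=False leaves the Mesa driver selection alone, which makes it easier to tell which of the two settings is actually having an effect.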

Multiple users have reported this issue, and I am able to reproduce it. We occasionally see the message:
salloc: error: _half_duplex: wrote -1 of 1928

when starting up MATLAB.
Comment 1 Jeff Haferman 2025-01-09 13:41:51 MST
Created attachment 40339
slurm.conf
Comment 2 Jeff Haferman 2025-01-09 13:42:21 MST
Created attachment 40340
nodes.conf (included by slurm.conf)
Comment 3 Jeff Haferman 2025-01-09 13:42:44 MST
Created attachment 40341
gres.conf
Comment 5 Connor 2025-01-10 07:32:55 MST
What version of slurm were you at before the update?


-Connor
Comment 6 Jeff Haferman 2025-01-10 09:03:59 MST
23.02.5
Comment 7 Jeff Haferman 2025-01-10 12:49:44 MST
Additional info: on the nodes where these jobs land, slurmd.log shows many messages from the MATLAB jobs that look like:

[2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32
[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: wrote -1 of 64
[2024-12-09T12:24:44.773] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T13:00:08.453] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T16:05:56.487] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T16:05:56.583] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-10T13:23:46.449] [54540210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-11T09:03:47.693] [54541947.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-11T09:56:54.825] [54541947.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-12T13:19:21.225] [54543975.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-12T13:19:21.370] [54543975.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-15T19:10:52.373] [54547840.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-16T12:46:16.675] [54549056.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-17T19:56:54.664] [54549861.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-18T13:12:15.609] [54550739.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-18T16:13:11.154] [54550870.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-26T10:35:54.699] [54557926.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-28T17:18:19.158] [54559695.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-30T14:10:51.757] [54560772.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T11:39:46.005] [54560878.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T11:39:46.010] [54560878.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T17:50:16.064] [54560894.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-02T17:10:36.345] [54561023.extern] error: _half_duplex: wrote -1 of 2328
[2025-01-03T09:43:04.699] [54561083.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-05T17:45:11.708] [54561231.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-07T15:38:42.882] [54568414.extern] error: _half_duplex: wrote -1 of 2328
[2025-01-07T16:43:40.220] [54571462.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-07T16:43:40.227] [54571462.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-08T10:24:57.304] [54579037.extern] error: _half_duplex: wrote -1 of 2132
[2025-01-09T12:05:38.145] [54592342.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-09T12:53:51.887] [54595892.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-09T12:53:51.928] [54595892.extern] error: _half_duplex: wrote -1 of 44
[2025-01-09T13:02:14.880] [54592370.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-10T10:49:48.517] [54651002.extern] error: _half_duplex: read error -1 Connection reset by peer
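
To see which jobs are affected and whether the failures are mostly reads or writes, log lines like those above can be tallied per job. This is a minimal sketch that assumes the exact line layout shown in this comment; the function name is illustrative and is not a Slurm tool.

```python
import re
from collections import Counter

# Matches lines such as:
# [2025-01-09T12:53:51.928] [54595892.extern] error: _half_duplex: wrote -1 of 44
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<job>\d+)\.extern\] error: _half_duplex: (?P<msg>.*)$"
)


def summarize_half_duplex(lines):
    """Count _half_duplex errors per job ID and per kind ('read' or 'write')."""
    per_job = Counter()
    per_kind = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a _half_duplex line from an extern step
        per_job[match.group("job")] += 1
        kind = "write" if match.group("msg").startswith("wrote") else "read"
        per_kind[kind] += 1
    return per_job, per_kind
```

Feeding it an open slurmd.log file object (files iterate line by line) returns the two counters; per_job.most_common(5) then lists the jobs with the most errors.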
Comment 8 Connor 2025-01-10 15:30:11 MST
This behavior matches an ongoing issue that stems from a change in commit https://github.com/SchedMD/slurm/commit/ecfc7f6ff7.

We are actively trying to find a fix to solve both issues.

Short of reverting the fix in the commit linked above, we don't have a workaround at this time.


-Connor
Comment 9 Jeff Haferman 2025-01-12 12:01:28 MST
OK, should we keep this ticket open until a fix is available? Do you have an ETA?
Comment 10 Connor 2025-01-13 09:03:18 MST
Yes, you can leave this ticket open, and I'll be sure to respond here when we land on a fix.

We're actively working on it, and our goal is to get it out ASAP, but no release has a fix lined up yet.


Thanks,
Connor
Comment 11 Jeff Haferman 2025-03-19 11:49:00 MDT
Just checking to see whether there is a fix or release plan for this.
Comment 12 Connor 2025-03-19 12:22:10 MDT
Hey Jeff,

Sorry, it's still a work in progress at the moment.

However, there is a patch in the pipeline going through the review process now. I appreciate your patience and hope we can push it out soon. 


-Connor