Summary: | MATLAB Error - Low Level Graphics Issue | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jeff Haferman <jlhaferm> |
Component: | User Commands | Assignee: | Connor <connor> |
Status: | OPEN --- | QA Contact: | |
Severity: | 2 - High Impact | ||
Priority: | --- | CC: | connor, kilian |
Version: | 24.11.0 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | NPS HPC | Alineos Sites: | --- |
Attachments: | cgroup.conf, slurm.conf, nodes.conf (included by slurm.conf), gres.conf |
Description
Jeff Haferman
2025-01-09 13:41:19 MST
Created attachment 40339: slurm.conf
Created attachment 40340: nodes.conf (included by slurm.conf)
Created attachment 40341: gres.conf
What version of Slurm were you at before the update?
-Connor

23.02.5

Additional info. On the nodes where these jobs are landing, I'm seeing a lot of messages in our slurmd.log from the Matlab jobs that look like:

[2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32
[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: wrote -1 of 64
[2024-12-09T12:24:44.773] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T13:00:08.453] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T16:05:56.487] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-09T16:05:56.583] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-10T13:23:46.449] [54540210.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-11T09:03:47.693] [54541947.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-11T09:56:54.825] [54541947.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-12T13:19:21.225] [54543975.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-12T13:19:21.370] [54543975.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-15T19:10:52.373] [54547840.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-16T12:46:16.675] [54549056.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-17T19:56:54.664] [54549861.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-18T13:12:15.609] [54550739.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-18T16:13:11.154] [54550870.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-26T10:35:54.699] [54557926.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-28T17:18:19.158] [54559695.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-30T14:10:51.757] [54560772.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T11:39:46.005] [54560878.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T11:39:46.010] [54560878.extern] error: _half_duplex: read error -1 Connection reset by peer
[2024-12-31T17:50:16.064] [54560894.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-02T17:10:36.345] [54561023.extern] error: _half_duplex: wrote -1 of 2328
[2025-01-03T09:43:04.699] [54561083.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-05T17:45:11.708] [54561231.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-07T15:38:42.882] [54568414.extern] error: _half_duplex: wrote -1 of 2328
[2025-01-07T16:43:40.220] [54571462.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-07T16:43:40.227] [54571462.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-08T10:24:57.304] [54579037.extern] error: _half_duplex: wrote -1 of 2132
[2025-01-09T12:05:38.145] [54592342.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-09T12:53:51.887] [54595892.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-09T12:53:51.928] [54595892.extern] error: _half_duplex: wrote -1 of 44
[2025-01-09T13:02:14.880] [54592370.extern] error: _half_duplex: read error -1 Connection reset by peer
[2025-01-10T10:49:48.517] [54651002.extern] error: _half_duplex: read error -1 Connection reset by peer

This behavior tracks with an ongoing issue that stems from a change in commit https://github.com/SchedMD/slurm/commit/ecfc7f6ff7. We are actively trying to find a fix that solves both issues. Short of reverting the fix from the link above, we don't have a workaround at this time.
-Connor

OK, should we keep this ticket open until a fix is available? Do you have an ETA?

Yes, you can leave this ticket open, and I'll be sure to respond here when we land on a fix. We're actively working on it, so our goal is to get it out ASAP, but no release has a fix lined up yet.
Thanks,
Connor

Just checking to see if there is a fix / release plan for this?

Hey Jeff,
Sorry, it's still a work in progress at the moment. However, there is a patch in the pipeline going through the review process now. I appreciate your patience and hope we can push it out soon.
-Connor |
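For anyone triaging the same symptom, the `_half_duplex` messages quoted in this ticket can be tallied per job to see which jobs are affected and whether the failures are mostly reads or writes. The following is only a minimal sketch: the regular expression is inferred from the log lines pasted above, and the function name `tally_half_duplex` and the `sample` list are illustrative, not part of Slurm.

```python
import re
from collections import Counter

# Pattern inferred from the slurmd.log lines quoted in this ticket, e.g.
# [2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32
# It is an assumption that other sites' logs use the same layout.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+\[(?P<job>\d+)\.(?P<step>\w+)\]\s+"
    r"error: _half_duplex: (?P<msg>.*)"
)


def tally_half_duplex(lines):
    """Count _half_duplex errors per job ID and per kind (read/write)."""
    per_job = Counter()
    per_kind = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip unrelated log lines
        per_job[m.group("job")] += 1
        kind = "write" if m.group("msg").startswith("wrote") else "read"
        per_kind[kind] += 1
    return per_job, per_kind


# Three lines copied from the log excerpt in this ticket.
sample = [
    "[2024-12-09T12:22:54.290] [54539143.extern] error: _half_duplex: wrote -1 of 32",
    "[2024-12-09T12:23:47.963] [54539143.extern] error: _half_duplex: read error -1 Connection reset by peer",
    "[2024-12-09T16:05:56.487] [54539210.extern] error: _half_duplex: read error -1 Connection reset by peer",
]

per_job, per_kind = tally_half_duplex(sample)
print(per_job.most_common())
print(dict(per_kind))
```

To run this against a real node, pass an open file handle instead of `sample`. The log location is site-specific (it is whatever `SlurmdLogFile` is set to in slurm.conf), so no path is assumed here.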