Ticket 22034 - X11 Forwarding Not Working
Summary: X11 Forwarding Not Working
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 24.11.1
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Connor
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-02-10 06:02 MST by Aaron Jezghani
Modified: 2025-04-08 13:38 MDT

See Also:
Site: GA Tech
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Aaron Jezghani 2025-02-10 06:02:49 MST
Following our recent adoption of the job_container/tmpfs plugin and upgrade to Slurm 24.11.x (we're on 24.11.1 now, but the problem was observed in 24.11.0 as well), we're seeing issues with X11 forwarding.

In the configuration, we have both `PrologFlags=Alloc,Contain,X11` and `X11Parameters=home_xauthority` set. Direct SSH to the node allows X11 forwarding as expected, but running under an interactive job fails.
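For reference, the relevant configuration is roughly the following excerpt (the job_container.conf BasePath shown is illustrative, not our actual path):

    # slurm.conf (relevant lines)
    JobContainerType=job_container/tmpfs
    PrologFlags=Alloc,Contain,X11
    X11Parameters=home_xauthority

    # job_container.conf (BasePath value is illustrative)
    BasePath=/var/spool/slurm/containers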

Checking things like $DISPLAY and `xauth list` shows identical information between the manual SSH session and the interactive job launch, so as far as I can tell, the X11 credentials look to be landing properly.
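For anyone reproducing this, the comparison was along these lines (the node name is illustrative):

    # Direct SSH to the node:
    ssh -X node001
    echo $DISPLAY
    xauth list

    # Versus an interactive job with forwarding requested:
    salloc --x11 -w node001
    echo $DISPLAY
    xauth list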

I have noticed that occasionally an error pops up when trying to launch an X11-based application: 'salloc: error: _half_duplex: wrote -1 of 1748'. Unsure whether this is telling or a red herring, though.

Please let me know if additional information is required.
Comment 1 Connor 2025-02-11 10:40:05 MST
This looks similar to an ongoing issue we have. We are actively working on a patch to handle X11 connections better.

Aside from reverting the change in ecfc7f6ff7, we don't have a workaround at this time, unfortunately.
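If you do want to test a reverted build, a rough sketch would look like the following (standard Slurm tag naming shown; the install prefix is illustrative, so adapt both to your local build process):

    # Sketch only: rebuild 24.11.1 with the referenced commit reverted
    git clone https://github.com/SchedMD/slurm.git
    cd slurm
    git checkout slurm-24-11-1-1
    git revert ecfc7f6ff7
    ./configure --prefix=/opt/slurm    # prefix is illustrative
    make -j && make install

After installing, the slurmd service on the compute nodes would need a restart to pick up the new binaries.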

-Connor
Comment 2 Aaron Jezghani 2025-02-12 17:26:17 MST
To test this change, which nodes need the update? E.g., just the compute nodes or the head node, or is the library part of the entire pipeline?
Comment 3 Connor 2025-02-13 07:03:01 MST
It would be the compute nodes and wherever the Slurm commands are run: "salloc" and "srun" will pull in this library, as will the slurmd/slurmstepd services.
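A quick way to confirm the rebuilt code is in place on each class of host (these version flags are standard):

    # On the submit host, the client commands:
    salloc --version
    srun --version

    # On each compute node (slurmstepd is installed alongside slurmd):
    slurmd --version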


-Connor
Comment 4 Connor 2025-04-08 13:20:06 MDT
Hey Aaron,


Just providing you with an update: we have a patch that is currently in review. Hopefully we'll have it queued up for a release soon.

-Connor
Comment 5 Aaron Jezghani 2025-04-08 13:38:08 MDT
Connor - 

Thanks for the update! Looking forward to it.

Aaron