Following our recent adoption of the job_container/tmpfs plugin and upgrade to Slurm 24.11.x (we're on 24.11.1 now, but the problem was observed in 24.11.0 as well), we're seeing issues with X11 forwarding. In the configuration, we have both `PrologFlags=Alloc,Contain,X11` and `X11Parameters=home_xauthority` set. Direct SSH to the node allows X11 forwarding as expected, but running under an interactive job fails. `$DISPLAY` and `xauth list` show identical information for the manual SSH session and the interactive job launch, so as far as I can tell, the X11 settings are landing properly. I have noticed that occasionally an error pops up when trying to launch an X11-based application: 'salloc: error: _half_duplex: wrote -1 of 1748'. I'm unsure whether this is telling or a red herring. Please let me know if additional information is required.
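For anyone wanting to repeat the `$DISPLAY` / `xauth list` comparison, here is a minimal sketch. It assumes the output of those commands has been captured as text from each session (direct SSH and interactive job); the function name and the sample values are hypothetical and only illustrate the check.

```python
import difflib


def diff_sessions(ssh_capture: str, job_capture: str) -> list[str]:
    """Return unified-diff lines between two captured outputs.

    An empty list means both sessions reported the same information.
    """
    # Normalize trailing whitespace so only real differences show up.
    ssh_lines = [line.rstrip() for line in ssh_capture.strip().splitlines()]
    job_lines = [line.rstrip() for line in job_capture.strip().splitlines()]
    return list(
        difflib.unified_diff(
            ssh_lines, job_lines, fromfile="ssh", tofile="job", lineterm=""
        )
    )


# Hypothetical captures, not real output from the affected cluster.
ssh_out = "DISPLAY=localhost:10.0\nnode01/unix:10  MIT-MAGIC-COOKIE-1  abc123\n"
job_out = "DISPLAY=localhost:10.0\nnode01/unix:10  MIT-MAGIC-COOKIE-1  abc123\n"

for line in diff_sessions(ssh_out, job_out):
    print(line)
```

If nothing is printed, the two sessions agree on both values, which matches what was observed here and suggests the failure lies elsewhere than in the display or cookie setup.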
This looks similar to an ongoing issue we have. We are actively working on a patch to handle X11 connections better. Aside from reverting the change in ecfc7f6ff7, we don't have a workaround at this time, unfortunately. -Connor
To test this change, which nodes need the update - e.g., just compute or head node, or is the library part of the entire pipeline?
It would be the compute nodes and wherever the Slurm commands are run. "salloc" and "srun" pull in this library, as do the slurmd/slurmstepd services. -Connor
Hey Aaron, Just providing you with an update: we have a patch that is currently in review. Hopefully we'll have it queued for a release soon. -Connor
Connor - Thanks for the update! Looking forward to it. Aaron