Ticket 18492 - X11 forwarding not working
Summary: X11 forwarding not working
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 23.11.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Connor
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-12-15 12:15 MST by Paul Edmon
Modified: 2025-04-08 13:20 MDT
CC: 3 users

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Log of X11 test commands (123.61 KB, text/x-log), 2023-12-15 13:27 MST, Paul Edmon
Current slurm.conf (65.73 KB, text/x-matlab), 2024-08-06 06:50 MDT, Paul Edmon
Current topology.conf (4.68 KB, text/x-matlab), 2024-08-06 06:50 MDT, Paul Edmon

Description Paul Edmon 2023-12-15 12:15:56 MST
We upgraded to 23.02.7 yesterday and we've hit a problem with X11 forwarding:

pedmon@orion:~$ ssh -Y holylogin03

[pedmon@holylogin03 ~]$ salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
salloc: Pending job allocation 12780042
salloc: job 12780042 queued and waiting for resources
salloc: job 12780042 has been allocated resources
salloc: Granted job allocation 12780042
salloc: Waiting for resource configuration
salloc: Nodes holy7c24102 are ready for job
[pedmon@holy7c24102 ~]$ emacs
Display localhost:71.0 can’t be opened
Comment 1 Tyler Connel 2023-12-15 12:28:37 MST
Hello Paul,

I've seen a similar issue via the linked bug, and I suspect this may be a duplicate. I'll be conferring with the team to address this.

Best,
Tyler Connel
Comment 5 Tyler Connel 2023-12-15 12:42:40 MST
Hi Paul,

Which version are you upgrading from? Would it happen to be 21.08?

Best,
TC
Comment 6 Paul Edmon 2023-12-15 12:43:18 MST
No, it was from 23.02.6.

-Paul Edmon-

Comment 11 Tyler Connel 2023-12-15 13:22:31 MST
Hi Paul,

X11 issues are notoriously difficult to reproduce on our end. Would you mind sharing the input/output of the following commands so we can get an idea of your stepd/job submission environment?

> echo $DISPLAY
> xauth list
> salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
> echo $DISPLAY
> echo $XAUTHORITY
> xauth list
> netstat -ant
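
If it helps, the whole sequence can be captured to a file in one pass. This is only a rough sketch; the x11-debug.log filename is an arbitrary example:

# On the login node; x11-debug.log is an arbitrary example filename.
( echo "login DISPLAY=$DISPLAY"; xauth list ) 2>&1 | tee ~/x11-debug.log
salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
# Then, from the shell salloc opens on the compute node:
( echo "job DISPLAY=$DISPLAY XAUTHORITY=$XAUTHORITY"; xauth list; netstat -ant ) 2>&1 | tee -a ~/x11-debug.log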

Best,
Tyler Connel
Comment 12 Paul Edmon 2023-12-15 13:27:51 MST
Created attachment 33742 [details]
Log of X11 test commands
Comment 13 Paul Edmon 2023-12-15 13:28:24 MST
Yup, I'm with you on that.

I've attached a log of the results of those commands.

-Paul Edmon-

Comment 14 Tyler Connel 2023-12-15 14:08:21 MST
Does the display launch when you try an X application through ssh with X forwarding (-X)?

$ ssh -X holy7c24202 emacs
Comment 16 Paul Edmon 2023-12-16 08:33:03 MST
You mean sshing directly to a node rather than getting a session via salloc?

If I ssh directly from my host to a node with -X, X11 works as expected. It's only the salloc step that breaks things.

-Paul Edmon-

Comment 17 Paul Edmon 2023-12-16 08:34:10 MST
Also, I will note that if I have a salloc on a node and then ssh with -X from the login node to that node, X11 also breaks. This is likely because the ssh session is shunted into the existing job but does not get the X11 settings right when that happens.
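
A quick way to see the difference, reusing the node names from the earlier test (just a sketch, nothing definitive):

# Inside the salloc shell on the compute node:
echo $DISPLAY; echo $XAUTHORITY
# From the login node into the same node (this session gets adopted into the existing job):
ssh -X holy7c24102 'echo $DISPLAY; echo $XAUTHORITY'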

-Paul Edmon-

Comment 19 Tyler Connel 2024-01-18 19:36:02 MST
Hi Paul,

Apologies for the delay on this ticket. I now have a test setup that replicates something like your environment, and I've encountered some issues with X11 forwarding with srun as well. At the moment, I suspect they're unrelated, but I'm hoping that tomorrow I'll have reproduced the particular issue that you're experiencing.

Best,
Tyler Connel
Comment 20 Paul Edmon 2024-01-19 07:47:16 MST
Thanks for the update.

-Paul Edmon-

Comment 21 Tyler Connel 2024-01-19 16:00:22 MST
Hello Paul,

I think I have found a commit in 23.02.7 where this issue might have been introduced. Would you mind sharing your slurmd logs from some time before and after X11 forwarding fails to be established? If you don't mind running slurmd with extra verbosity during the test (-vvv), that would also be helpful.
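
For example, something along these lines on the test node (just a sketch; the log path is an arbitrary example, and running slurmd in the foreground requires root):

# Stop the slurmd service on the node first, then run slurmd in the
# foreground with extra verbosity while reproducing the X11 failure.
# /tmp/slurmd-x11.log is an arbitrary example path.
slurmd -D -vvv 2>&1 | tee /tmp/slurmd-x11.log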

Best,
Tyler Connel
Comment 22 Paul Edmon 2024-01-22 11:35:06 MST
Yup, this is what I got:

[pedmon@builds01 ~]$ salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p rc-testing -w holy2a24201
salloc: Granted job allocation 5720
salloc: Nodes holy2a24201 are ready for job
[pedmon@holy2a24201 ~]$ emacs
Display localhost:33.0 can’t be opened
[pedmon@holy2a24201 ~]$ salloc: error: _half_duplex: wrote -1 of 4096

[pedmon@holy2a24201 ~]$

slurmd: CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=257451 TmpDisk=233649 Uptime=1562472 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 12 TRES.
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug:  Checking credential with 892 bytes of sig data
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task StepId=5720.interactive request from UID:56483 GID:40273 HOST:10.31.128.251 PORT:51072
slurmd: task/affinity: lllp_distribution: JobId=5720 manual binding: mask_cpu,one_thread
slurmd: debug:  Waiting for job 5720's prolog to complete
slurmd: debug:  Finished wait for job 5720's prolog to complete
slurmd: debug:  Leaving stepd_get_x11_display
slurmd: debug2: _setup_x11_display: setting DISPLAY=localhost:33:0 for job 5720 step 4294967290
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE

-Paul Edmon-

Comment 24 Tyler Connel 2024-02-08 18:24:31 MST
Hi Paul,

Do you use home_xauthority in your X11Parameters in slurm.conf?
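
(For reference, that would look something like the following line in slurm.conf; this is only an illustration, not taken from your config:)

X11Parameters=home_xauthority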

-TC
Comment 25 Paul Edmon 2024-02-12 08:10:30 MST
No, the only X11 settings we have are:

PrologFlags=Contain,X11

-Paul Edmon-

(Tyler Connel changed ticket 18492: added See Also https://bugs.schedmd.com/show_bug.cgi?id=18139)
Comment 26 Paul Edmon 2024-03-08 07:52:17 MST
FYI, we upgraded to 23.11.4 on Monday and this issue still exists; I just wanted to confirm that it occurs in the latest version as well.
Comment 27 Tyler Connel 2024-03-13 14:47:01 MDT
Thanks Paul, and apologies for the delay on this ticket. I'll get another chance to look into this soon.
Comment 29 Tim McMullan 2024-07-31 07:02:51 MDT
Hi Paul,

I'm picking this ticket up from Tyler. While I'm getting caught up on the current state of things, can you confirm that you are still seeing the issue and that you are still on 23.11.4?

Thanks!
--Tim
Comment 30 Paul Edmon 2024-07-31 07:36:28 MDT
We recently updated to 23.11.8 and the error is still occurring:

pedmon@DESKTOP-5GBIA4B:~$ ssh -Y holylogin03
[pedmon@holylogin03 ~]$ salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
salloc: Granted job allocation 42026829
salloc: Nodes holy8a24301 are ready for job
[pedmon@holy8a24301 ~]$ emacs
Display localhost:30.0 can’t be opened

-Paul Edmon-

Comment 31 Tim McMullan 2024-07-31 09:47:45 MDT
OK, thanks for the update! I've updated the version field on the ticket to reflect that this is still an issue on 23.11.8.
Comment 32 Tim McMullan 2024-08-05 12:57:41 MDT
Hi Paul,

Would you be able to provide the full slurm.conf for the current system?

Thanks!
Comment 33 Paul Edmon 2024-08-06 06:50:30 MDT
Created attachment 38188 [details]
Current slurm.conf
Comment 34 Paul Edmon 2024-08-06 06:50:45 MDT
Created attachment 38189 [details]
Current topology.conf
Comment 35 Paul Edmon 2024-08-06 06:51:00 MDT
Yup. I've posted them.

-Paul Edmon-

Comment 36 Paul Edmon 2024-09-12 08:41:19 MDT
FYI this is still happening in 24.05.3:

pedmon@orion:~$ ssh -Y login.rc.fas.harvard.edu
(pedmon@login.rc.fas.harvard.edu) Password: 
(pedmon@login.rc.fas.harvard.edu) VerificationCode: 
Last login: Thu Sep 12 10:36:07 2024 from 10.255.12.55
!!!!!!!!!!!!!!!!!!!!!!! Cannon Cluster !!!!!!!!!!!!!!!!!!!!!!!!!!
Cannon is a general HPC resource for Harvard's research community
hosted by the Faculty of Arts and Sciences Research Computing.

+-------------------- NEWS & UPDATES ---------------------------+
+ Status and maintenance page: https://fasrc.instatus.com       +
+ Office Hours: Wednesdays noon-3pm, see website for details:   +
+    https://www.rc.fas.harvard.edu/training/office-hours       +
+ Training: https://www.rc.fas.harvard.edu/upcoming-training    +
+---------------------------------------------------------------+

+------------------- HELPFUL DOCUMENTATION ---------------------+
+ https://docs.rc.fas.harvard.edu/kb/quickstart-guide           +
+ https://docs.rc.fas.harvard.edu/kb/running-jobs               +
+ https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands  +
+---------------------------------------------------------------+

NEXT MAINTENANCE: OCTOBER 7TH 7AM-11AM

+---------------- Slurm Stats for Sep 11 -----------------------+
|                  End of Day Fairshare                         |
|                 kempner_lab: 0.000000                         |
|                   kuang_lab: 0.092327                         |
|                    rc_admin: 0.999719                         |
+---------------No jobs completed on Sep 11 --------------------+
| https://docs.rc.fas.harvard.edu/kb/slurm-stats                |
+---------------------------------------------------------------+
[pedmon@holylogin01 ~]$ salloc --x11=all -c 12 -N 1 --mem=80G -t 0-10:00 -p test
salloc: Pending job allocation 46669386
salloc: job 46669386 queued and waiting for resources
salloc: job 46669386 has been allocated resources
salloc: Granted job allocation 46669386
salloc: Nodes holy8a24302 are ready for job
[pedmon@holy8a24302 ~]$ xrdb
xrdb: Connection refused
xrdb: Can't open display 'localhost:20.0'
[pedmon@holy8a24302 ~]$ 


What is the status of this? This bug has been open for well over 9 months now.
Comment 37 Connor 2024-09-13 08:35:18 MDT
Hey Paul,


Sorry for the delay; this ticket has switched hands a couple of times. We're still working on a resolution at the moment.

We've found that the problem seems to be application dependent: in our testing, xterm, for example, works without issue, while emacs never seems to work.

We're leaning toward a race condition of some sort, but we haven't narrowed down the cause (a crude probe of that theory is sketched below). I'll keep you posted when we find a solution.
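
Sketch only, not a fix; the 5-second delay is arbitrary:

# Inside an allocation obtained with --x11, compare an application that
# works with one that fails, and check whether a short delay changes anything.
xterm             # reported to work in testing
emacs             # reported to fail
sleep 5 && emacs  # arbitrary delay, purely to probe the race-condition theory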


Thanks,
-Connor
Comment 38 Connor 2025-04-08 13:16:25 MDT
Hey Paul,

Just providing you with an update. We have a patch that is currently in review. Hoping to get it queued for a release soon.

-Connor
Comment 39 Paul Edmon 2025-04-08 13:20:45 MDT
Great. Thanks for the update.

-Paul Edmon-
