We upgraded to 23.02.7 yesterday and we've hit a problem with X11 forwarding:

pedmon@orion:~$ ssh -Y holylogin03
[pedmon@holylogin03 ~]$ salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
salloc: Pending job allocation 12780042
salloc: job 12780042 queued and waiting for resources
salloc: job 12780042 has been allocated resources
salloc: Granted job allocation 12780042
salloc: Waiting for resource configuration
salloc: Nodes holy7c24102 are ready for job
[pedmon@holy7c24102 ~]$ emacs
Display localhost:71.0 can't be opened
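As a quick sanity check in a situation like this, one can confirm inside the salloc session whether anything is listening on the port that DISPLAY points at; the commands below are only a sketch and assume ss and xauth are available on the compute node. An X display of the form localhost:N.0 tunneled over TCP corresponds to port 6000+N, so localhost:71.0 implies 127.0.0.1:6071.

    # inside the salloc session on the compute node
    echo "$DISPLAY"                       # e.g. localhost:71.0
    disp=${DISPLAY#localhost:}; disp=${disp%.*}
    ss -tln | grep ":$((6000 + disp))"    # is anything listening on the forwarded port?
    xauth list "$DISPLAY"                 # is there a cookie for this display?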
Hello Paul,

I've seen a similar issue via the linked bug, and I suspect this may be a duplicate. I'll confer with the team now to confirm and address this.

Best,
Tyler Connel
Hi Paul, Which version are you upgrading from? Would it happen to be 21.08? Best, TC
No it was from 23.02.6.

-Paul Edmon-
Hi Paul,

X11 issues are notoriously difficult to reproduce on our end. Would you mind sharing the input/output of the following commands so we can get an idea of your stepd/job-submit environment?

echo $DISPLAY
xauth list
salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
echo $DISPLAY
echo $XAUTHORITY
xauth list
netstat -ant

Best,
Tyler Connel
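To capture everything the previous comment asks for in one place, a throwaway helper along these lines can be used; this is only a sketch, the output path is arbitrary, and it assumes the home directory is shared between the login node and the compute node:

    # stage 1: on the login node, before salloc
    { echo "== login node =="; echo "DISPLAY=$DISPLAY"; xauth list; } > ~/x11-debug.txt

    # stage 2: inside the salloc session, on the compute node
    { echo "== compute node =="; echo "DISPLAY=$DISPLAY"; echo "XAUTHORITY=${XAUTHORITY:-unset}";
      xauth list; netstat -ant; } >> ~/x11-debug.txt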
Created attachment 33742 [details] Log of X11 test commands
Yup, I'm with you on that. I've attached a log of the results of those commands.

-Paul Edmon-
Does the display launch when you try an X application through ssh with X forwarding (-X)?

$ ssh -X holy7c24202 emacs
You mean sshing directly to a node rather than getting a session via salloc? If I ssh directly from my host to a node with -X, X11 works as expected. It's only the salloc step that breaks things.

-Paul Edmon-
Also, I will note that if I have a salloc on a node and then ssh from the login node to that node with -X, it also breaks. This is likely due to the ssh session being shunted into the existing job but not getting the X11 settings right when that happens.

-Paul Edmon-
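One way to see the mismatch being described here, as a sketch only, is to compare the X11-related environment in the two shells on the same compute node; whether the adopted ssh session keeps its own sshd-forwarded display or picks up the job's settings is exactly what is unclear, so no expected output is assumed:

    # shell 1: the interactive salloc step on the compute node
    echo "salloc step:  DISPLAY=$DISPLAY  XAUTHORITY=${XAUTHORITY:-unset}"
    xauth list

    # shell 2: ssh -X from the login node to the same node (adopted into the job)
    echo "adopted ssh:  DISPLAY=$DISPLAY  XAUTHORITY=${XAUTHORITY:-unset}"
    xauth list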
Hi Paul,

Apologies for the delay on this ticket. I now have a test setup that replicates something like your environment, and I've encountered some issues with X11 forwarding over srun as well. At the moment I suspect they're unrelated, but I'm hoping to have reproduced the particular issue that you're experiencing by tomorrow.

Best,
Tyler Connel
Thanks for the update.

-Paul Edmon-
Hello Paul,

I think I have found a commit in 23.02.7 where this issue might have been introduced. Would you mind sharing your slurmd logs from some time before and after X11 forwarding fails to be established? If you don't mind running slurmd with extra verbosity (-vvv) during the test, that would also be helpful.

Best,
Tyler Connel
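For capturing those logs on a single test node, one minimal approach is to stop the regular daemon and run slurmd in the foreground with extra verbosity while the failure is reproduced. This is only a sketch: the systemd unit name and log path are assumptions, and draining the node first is site policy.

    # on the compute node, as root
    systemctl stop slurmd
    slurmd -D -vvv 2>&1 | tee /tmp/slurmd-x11-debug.log
    # ...then reproduce from the login node: salloc --x11=all ... and run an X client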
Yup, this is what I got:

[pedmon@builds01 ~]$ salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p rc-testing -w holy2a24201
salloc: Granted job allocation 5720
salloc: Nodes holy2a24201 are ready for job
[pedmon@holy2a24201 ~]$ emacs
Display localhost:33.0 can't be opened
[pedmon@holy2a24201 ~]$ salloc: error: _half_duplex: wrote -1 of 4096
[pedmon@holy2a24201 ~]$

slurmd: CPUs=36 Boards=1 Sockets=2 Cores=18 Threads=1 Memory=257451 TmpDisk=233649 Uptime=1562472 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 12 TRES.
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug: Checking credential with 892 bytes of sig data
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task StepId=5720.interactive request from UID:56483 GID:40273 HOST:10.31.128.251 PORT:51072
slurmd: task/affinity: lllp_distribution: JobId=5720 manual binding: mask_cpu,one_thread
slurmd: debug: Waiting for job 5720's prolog to complete
slurmd: debug: Finished wait for job 5720's prolog to complete
slurmd: debug: Leaving stepd_get_x11_display
slurmd: debug2: _setup_x11_display: setting DISPLAY=localhost:33:0 for job 5720 step 4294967290
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
slurmd: debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE

-Paul Edmon-
Hi Paul, Do you use home_xauthority in your X11Parameters in slurm.conf? -TC
No, the only X11 settings we have are:

PrologFlags=Contain,X11

-Paul Edmon-
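For context, a minimal slurm.conf excerpt showing the setting the site actually uses alongside the optional knob Tyler asked about; the commented line is only a sketch of what enabling home_xauthority would look like, not something configured on this cluster:

    # slurm.conf (excerpt)
    PrologFlags=Contain,X11           # Contain = job container at allocation (pam_slurm_adopt); X11 = Slurm's built-in X11 forwarding
    #X11Parameters=home_xauthority    # optional: write the xauth cookie to ~/.Xauthority instead of a per-step file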
Just FYI, we upgraded to 23.11.4 on Monday and this issue still exists. I wanted to confirm that it occurs in the latest version as well.
Thanks Paul, and apologies for the delay on this ticket. I'll get another chance to look into this soon.
Hi Paul,

I'm picking this ticket up from Tyler. While I get caught up on the current state of things, can you confirm that you are still seeing the issue and that you are still on 23.11.4?

Thanks!
--Tim
We recently updated to 23.11.8 and the error is still occurring:

pedmon@DESKTOP-5GBIA4B:~$ ssh -Y holylogin03
[pedmon@holylogin03 ~]$ salloc --x11=all -c 10 -N 1 --mem=30000 -t 0-10:00 -p test
salloc: Granted job allocation 42026829
salloc: Nodes holy8a24301 are ready for job
[pedmon@holy8a24301 ~]$ emacs
Display localhost:30.0 can't be opened

-Paul Edmon-
Ok, thanks for the update! I've updated the version on the ticket to reflect that this is still an issue on the new release.
Hi Paul, Would you be able to provide the full slurm.conf for the current system? Thanks!
Created attachment 38188 [details] Current slurm.conf
Created attachment 38189 [details] Current topology.conf
Yup. I've posted them.

-Paul Edmon-
FYI this is still happening in 24.05.3:

pedmon@orion:~$ ssh -Y login.rc.fas.harvard.edu
(pedmon@login.rc.fas.harvard.edu) Password:
(pedmon@login.rc.fas.harvard.edu) VerificationCode:
Last login: Thu Sep 12 10:36:07 2024 from 10.255.12.55
[Cannon Cluster login banner and Slurm stats trimmed]
[pedmon@holylogin01 ~]$ salloc --x11=all -c 12 -N 1 --mem=80G -t 0-10:00 -p test
salloc: Pending job allocation 46669386
salloc: job 46669386 queued and waiting for resources
salloc: job 46669386 has been allocated resources
salloc: Granted job allocation 46669386
salloc: Nodes holy8a24302 are ready for job
[pedmon@holy8a24302 ~]$ xrdb
xrdb: Connection refused
xrdb: Can't open display 'localhost:20.0'
[pedmon@holy8a24302 ~]$

What is the status of this? This bug has been open for well over 9 months now.
Hey Paul,

Sorry for the delay; this ticket has switched hands a couple of times. We're still working on a resolution. We've found that the behavior seems to be application-dependent: in our testing, xterm, for example, works without issue, while emacs seems to never work. We're leaning toward a race condition of some sort, but we haven't narrowed down the cause yet. I'll keep you posted when we find a solution.

Thanks,
-Connor
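One low-effort way to probe the race-condition theory from the user side (a sketch, not an official workaround; the ten-second window is arbitrary and xdpyinfo is assumed to be installed) is to poll the display inside the allocation and see whether it ever becomes reachable:

    # inside the salloc session on the compute node
    for i in $(seq 1 10); do
        if xdpyinfo >/dev/null 2>&1; then
            echo "display $DISPLAY became reachable after ~${i}s"
            break
        fi
        sleep 1
    done
    # if it never succeeds, the tunnel itself is broken rather than just slow to come up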
Hey Paul, Just providing you with an update. We have a patch that is currently in review. Hoping to get it queued for a release soon. -Connor
Great. Thanks for the update.

-Paul Edmon-