Summary: | Error parsing DISPLAY environment variable. Cannot use X11 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Daniel P Davis <daniel.p.davis> |
Component: | Scheduling | Assignee: | Nate Rini <nate> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | ||
Version: | 18.08.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6558 https://bugs.schedmd.com/show_bug.cgi?id=6543 | ||
Site: | EM | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 19.05.0pre2, 18.08.6 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | Xorg log |
Description
Daniel P Davis
2019-02-15 12:34:58 MST
(In reply to Daniel P Davis from comment #0)
> $ srun -p interactive --pty --x11 bash
> srun: error: Error parsing DISPLAY environment variable. Cannot use X11
> forwarding.
> $ echo $DISPLAY
> 172.20.0.13:65

The DISPLAY variable lacks a screen number:
> 172.20.0.13:65.0

Does putting ".0" at the end work as a functional workaround?
--Nate

Adding .0 to the end lets me land on a node. (Thanks!) However, I get an error when launching an app now.

$ xcalc
No protocol specified
Error: Can't open display: localhost:60.0
$ echo $DISPLAY
localhost:60.0

If I ssh -Y to this same node from another terminal I can launch the app. The display variable looks similar, just with a different port.

localhost:10.0

(In reply to Daniel P Davis from comment #0)
> Getting X11 working natively has been on our backlog for a long time. We
> recently upgraded to 18.08.4 and tested again.

Please note that X11 support is under active development for 19.05 and that work is being tracked through https://bugs.schedmd.com/show_bug.cgi?id=3647.

(In reply to Nate Rini from comment #2)
> (In reply to Daniel P Davis from comment #0)
> > $ srun -p interactive --pty --x11 bash
> > srun: error: Error parsing DISPLAY environment variable. Cannot use X11
> > forwarding.
> > $ echo $DISPLAY
> > 172.20.0.13:65
>
> The DISPLAY variable lacks a screen number:
> > 172.20.0.13:65.0

Can you please provide the output of "xauth list" on the source node and from inside the job? Please censor (replace with XXXX) the magic cookie hex values.

Please also provide the Xorg.log of the forwarded display. Please make sure to XXXX any keys or other private information. That error is likely an incompatibility between the clients. Please also provide ldd of xclock or xterm on all the hosts.

> If I ssh -Y to this same node from another terminal I can launch the app.

Are you connecting to the calling node using -X or -Y in ssh? Are there any errors in the SSH log on the client?
--Nate

$ echo $DISPLAY
172.20.0.13:67
$ export DISPLAY=172.20.0.13:67.0
$ xauth list | cut -f1,3 -d' '
login3.descartes:65 MIT-MAGIC-COOKIE-1
$ srun -p test --pty --x11 bash
[SLURM]$ xauth list | cut -d' ' -f1
159.70.70.203:2
rambo.na.xom.com/unix:2
159.70.70.203:1
rambo.na.xom.com/unix:1
login2-eth1.descartes:11
login1-eth1.descartes:11
login1-eth1.descartes:16
login2-eth1.descartes:12
login3-eth1.descartes:27
login1-eth1.descartes:10
login1-eth1.descartes:1
clnhpc01/unix:1
login1-eth1.descartes:4
clnhpc01/unix:4
login1-eth1.descartes:5
clnhpc01/unix:5
login2-eth1.descartes:6
clnhpc02/unix:6
login4-eth1.descartes:1
clnhpc04/unix:1
clnhpc03/unix:26
clnxcat01/unix:12
159.70.88.217:1
clndnode25.na.xom.com/unix:1
clnhpc04/unix:23
clnhpc03/unix:12
clnxcat02/unix:10
clnhpc02/unix:12
clnhpc01/unix:10
159.70.71.201:1
clndnode15.na.xom.com/unix:1
clnsand02.na.xom.com:1
clnsand02.na.xom.com/unix:1
clnhpc02/unix:16
clnhpc02/unix:11
clnhpc02/unix:17
clnhpc03/unix:17
clndnode02.na.xom.com/unix:10
clnhpc01/unix:19
clnhpc02/unix:15
n0106/unix:10
clnhpc02/unix:18
clndnode11.na.xom.com/unix:10
login2.descartes:2
clnhpc02/unix:2
login2.descartes:4
clnhpc02/unix:4
login2.descartes:6
login4.descartes:1
n0101/unix:10
n0103/unix:10
gpu1.descartes/unix:11
gpu1.descartes/unix:10
clnxcat01/unix:14
login2-eth1.descartes:5
clnhpc02/unix:5
clnsand01.na.xom.com/unix:10
clndnode01.na.xom.com/unix:10
clnxcat01/unix:18
clnxcat01/unix:19
clnxcat01/unix:22
clnxcat01/unix:20
clnhpc03/unix:15
n0103/unix:84
n0102/unix:60
n0102.descartes/unix:10
n0102/unix:56
n0102/unix:48
n0102/unix:91
n0102/unix:49
[SLURM]$ xcalc
No protocol specified
Error: Can't open display: localhost:49.0
[SLURM]$ ldd /usr/bin/xcalc
linux-vdso.so.1 => (0x00007ffc3bd91000)
libXaw.so.7 => /lib64/libXaw.so.7 (0x00007ffb27977000)
libXt.so.6 => /lib64/libXt.so.6 (0x00007ffb2770f000)
libX11.so.6 => /lib64/libX11.so.6 (0x00007ffb273d1000)
libm.so.6 => /lib64/libm.so.6 (0x00007ffb270cf000)
libc.so.6 => /lib64/libc.so.6 (0x00007ffb26d01000)
libXext.so.6 => /lib64/libXext.so.6 (0x00007ffb26aef000)
libXmu.so.6 => /lib64/libXmu.so.6 (0x00007ffb268d4000)
libXpm.so.4 => /lib64/libXpm.so.4 (0x00007ffb266c1000)
libSM.so.6 => /lib64/libSM.so.6 (0x00007ffb264b9000)
libICE.so.6 => /lib64/libICE.so.6 (0x00007ffb2629d000)
libxcb.so.1 => /lib64/libxcb.so.1 (0x00007ffb26074000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007ffb25e70000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffb27beb000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007ffb25c6b000)
libXau.so.6 => /lib64/libXau.so.6 (0x00007ffb25a66000)

FYI, my xauth list on the client has a ton of old entries. Not sure how that would be persisting, as these are stateless nodes.

Created attachment 9211: Xorg log
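
Since the request above was to replace the magic cookie hex values with XXXX before posting `xauth list` output, a small filter can make that harder to forget. This is only a sketch: the function name `redact_xauth` is made up for illustration, and it assumes the usual three-column `xauth list` layout (display name, protocol, hex key).

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or xauth): replace the hex cookie,
# assumed to be the third whitespace-separated field, with XXXX.
# It reads xauth-style lines on stdin, so it can be tried without an X server.
redact_xauth() {
    # Lines with fewer than three fields are passed through unchanged.
    awk 'NF >= 3 { $3 = "XXXX" } { print }'
}

# Intended use on the login node or inside the job:
#   xauth list | redact_xauth
```

Unlike `cut -d' ' -f1`, this keeps the protocol column visible, which may help when comparing entries between hosts.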
Trying again after clearing xauth entries on both login and client node.

$ echo $DISPLAY
172.20.0.13:65
$ xauth list | cut -d' ' -f1
login3.descartes:65
$ export DISPLAY=172.20.0.13:65.0
$ srun -p test --x11 --pty bash
srun: job 31850 queued and waiting for resources
srun: job 31850 has been allocated resources
[SLURM]$ xauth list
n0102/unix:97 MIT-MAGIC-COOKIE-1 dd778dc51948e78de6c3aa994b93a67c
[SLURM]$ xcalc
No protocol specified
Error: Can't open display: localhost:97.0

Forgot to remove my magic cookie from the log, but I have cleared it on my side now.

(In reply to Daniel P Davis from comment #8)
> $ echo $DISPLAY
> 172.20.0.13:65

Are you calling 'ssh -X' into the login node to generate this DISPLAY? I would expect it to point to localhost.
--Nate

I use ssh -Y

Here is a look at the ssh -Y approach that works:

$ echo $DISPLAY
172.20.0.13:65
$ ssh -Y n0101
[n0101]$ echo $DISPLAY
localhost:10.0
[n0101]$ xcalc

Runs as expected. Also, I did not need to update my DISPLAY with a screen number.

ssh -X also works:

$ ssh -X n0101
Last login: Thu Feb 21 08:34:19 2019 from login3.descartes
[n0101]$ echo $DISPLAY
localhost:11.0
[n0101]$ xcalc

I do notice that my magic cookies are fqdn for the ssh -Y/X cases:

[n0101]$ xauth list | cut -d' ' -f1
n0101/unix:88
n0101/unix:57
n0101.descartes/unix:10
n0101.descartes/unix:11

OK, adding the fqdn entry to Xauthority fixes this issue:

[SLURM]$ xauth list | cut -d' ' -f1
n0101/unix:88
n0101/unix:57
n0101.descartes/unix:10
n0101.descartes/unix:11
***n0101.descartes/unix:57***

I can run xcalc after adding that entry.

Sooo... where does this leave us?

(In reply to Daniel P Davis from comment #14)
> Sooo... where does this leave us?

A code review to determine how best to move forward with handling FQDN.
--Nate

Daniel,
Are both hostnames resolvable?
> getent hosts n0101
> getent hosts n0101.descartes
--Nate
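
The manual workaround reported earlier in the thread (adding an `n0101.descartes/unix:57` entry next to the existing `n0101/unix:57` one) could be scripted roughly as below. This is a sketch under assumptions: the function names are hypothetical, and it assumes the FQDN entry should carry the same cookie as the short-hostname entry, which the thread does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: build the xauth name for a host's unix-socket display,
# e.g. "n0101.descartes" and "57" give "n0101.descartes/unix:57".
xauth_unix_name() {
    printf '%s/unix:%s\n' "$1" "$2"
}

# Hypothetical helper: copy the cookie stored under the short hostname to an
# entry under the FQDN. $1 is the display number (57 in the thread's example).
# Assumption: both entries should share the same MIT-MAGIC-COOKIE-1 value.
add_fqdn_cookie() {
    disp=$1
    short_host=$(hostname -s)
    fqdn_host=$(hostname -f)
    # `xauth list NAME` prints "NAME  PROTOCOL  HEXKEY"; keep the hex key.
    cookie=$(xauth list "$(xauth_unix_name "$short_host" "$disp")" |
        awk 'NR == 1 { print $3 }')
    # Stop if no entry exists for the short hostname.
    [ -n "$cookie" ] || return 1
    xauth add "$(xauth_unix_name "$fqdn_host" "$disp")" MIT-MAGIC-COOKIE-1 "$cookie"
}
```

Inside the job this would be called as `add_fqdn_cookie 57`, with the display number taken from `$DISPLAY`.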
$ getent hosts n0101
172.20.1.1 n0101.descartes
$ getent hosts n0101.descartes
172.20.1.1 n0101.descartes

Can we also determine why I need to add the screen number (.0) to the DISPLAY variable? My regular ssh -X/Y tests do not require this.

(In reply to Daniel P Davis from comment #21)
> Can we also determine why I need to add the screen number (.0) to the
> DISPLAY variable? My regular ssh -X/Y tests do not require this.

Yes, Xorg does not require a screen number to be present. Doing QA on patch now.
--Nate

Daniel,

These commits should fix the issues:
https://github.com/SchedMD/slurm/commit/db14d9472eeddd925a25cb58afb13ec514ba891c
https://github.com/SchedMD/slurm/commit/55d1927e97f57165fca803c4a98af8e15ce8e719

Please add this to your slurm.conf:
> X11Parameters=use_raw_hostname

Please reply to this ticket if you have any more issues or questions.
--Nate
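
For a site still running a release older than the versions listed under Version Fixed, the ".0" workaround from earlier in the thread could be wrapped in a small function. This is a sketch only; `normalize_display` is a made-up name, and it assumes the screen number is whatever follows a dot after the last colon in DISPLAY.

```shell
#!/bin/sh
# Hypothetical helper: append ".0" to a DISPLAY value that has no screen
# number, and leave values that already have one untouched.
normalize_display() {
    # ${1##*:} is the part after the last colon: "65" or "10.0".
    case "${1##*:}" in
        *.*) printf '%s\n' "$1" ;;      # screen number already present
        *)   printf '%s.0\n' "$1" ;;    # assume the default screen 0
    esac
}

# Intended use before calling srun --x11:
#   export DISPLAY=$(normalize_display "$DISPLAY")
```

With the values from this ticket, `172.20.0.13:65` becomes `172.20.0.13:65.0`, while `localhost:10.0` is returned as is.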