We're seeing exactly the same behavior as bug 4691: after upgrading to 17.11.7 via the OpenHPC packages, --x11 doesn't work:

    babbott@amarel4:~$ srun --reservation=slurm_upgrade --x11 xterm
    X11 connection rejected because of wrong authentication.
    /usr/bin/xterm: Xt error: Can't open display: localhost:46.0
    srun: error: slepner010: task 0: Exited with exit code 1

The RSA keys are fine, slurm.conf has PrologFlags=x11, and the previous SPANK rpm was removed. Bug 4691 was closed as "Info given", but was the underlying issue ever resolved?
Hi Bill,

I will look into this and see what I can find. What version did you upgrade from?

Best regards,
Jason
16.05.10
Hi Bill,

Starting with 17.11 we use a built-in X11 forwarding feature based on libssh2:

https://slurm.schedmd.com/faq.html#x11

If you attach a copy of the slurmd logs, that might give us some additional detail on what is going on. It is also possible to go back to the SPANK mode, but you would have to build with the "--disable-x11" option, as mentioned in the FAQ link above.

Best regards,
Jason
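For reference, going back to the SPANK plugin means rebuilding Slurm with the built-in X11 code disabled. A rough sketch of a source build (the tarball name and install steps are illustrative; OpenHPC users would instead rebuild the RPMs with this configure option):

```shell
# Rebuild Slurm without the built-in X11 support so the external
# SPANK X11 plugin can be used again (per the FAQ linked above).
tar xjf slurm-17.11.7.tar.bz2
cd slurm-17.11.7
./configure --disable-x11
make
make install   # then restart slurmd/slurmctld on the affected nodes
```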
Created attachment 7485 [details] x11 slurmctld log
Created attachment 7486 [details] x11 slurmd log
I've attached the relevant slurmd and slurmctld logs at debug level 5. From the login node it looks like this:

    babbott@nixon:~$ ssh -X perceval1.hpc.rutgers.edu
    babbott@perceval1:~$ srun --reservation=slurm_upgrade --x11 xterm
    X11 connection rejected because of wrong authentication.
    /usr/bin/xterm: Xt error: Can't open display: localhost:49.0
    srun: error: node131: task 0: Exited with exit code 1
    babbott@perceval1:~$ srun --reservation=slurm_upgrade --x11 --pty bash -i
    babbott@node131:~$ xterm
    X11 connection rejected because of wrong authentication.
    xterm: Xt error: Can't open display: localhost:87.0
    babbott@node131:~$ exit
    exit
    srun: error: node131: task 0: Exited with exit code 1
    babbott@perceval1:~$ exit
    logout
    Connection to perceval1.hpc.rutgers.edu closed.

Running xterm from the login node (perceval1) works fine, and the RSA keys seem to be set up correctly. I set StrictHostKeyChecking=no in ssh_config; no change. libssh2 is installed on all nodes. We didn't compile this ourselves; this is via the OpenHPC rpms. The slurm.conf file has PrologFlags=x11.
Hi Bill,

The error "X11 connection rejected because of wrong authentication." is not generated by Slurm itself. The issue seems tied to the ".Xauthority" file, as outlined by the following two sites:

https://www.cyberciti.biz/faq/x11-connection-rejected-because-of-wrong-authentication/
https://access.redhat.com/solutions/1473133

Best regards,
Jason
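The checks those two articles suggest can be sketched as a quick script. This is only a sketch: "xauthority_ok" is an illustrative helper name, not an existing tool, and the exact remedy depends on what you find.

```shell
#!/bin/bash
# Quick sanity check of ~/.Xauthority, along the lines of the two
# articles above: the file must exist, be owned by the user, and be
# mode 600. ("xauthority_ok" is just an illustrative helper name.)
xauthority_ok() {
    local f="$1"
    [ -f "$f" ] || return 1                                # must exist
    [ "$(stat -c '%U' "$f")" = "$(id -un)" ] || return 1   # owned by the user
    [ "$(stat -c '%a' "$f")" = "600" ] || return 1         # not group/world readable
}

if ! xauthority_ok "$HOME/.Xauthority"; then
    echo "suspect ~/.Xauthority: check ownership and permissions" >&2
fi
```

If the file is stale or root-owned, the usual remedy described in those articles is to remove it and log in again so it is recreated with a fresh cookie; `xauth list` should then show a MIT-MAGIC-COOKIE-1 entry for the current display.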
I'll investigate, thanks.
Hi Bill,

Were you able to look into the ".Xauthority" file, and did that help resolve the "X11 connection rejected because of wrong authentication." issue?

Best regards,
Jason
Hi Jason,

I haven't been able to do testing on this. Please set the importance to minor until we can.

Bill
I can confirm we're seeing the same thing here. We had a test OpenHPC 1.3.3 cluster set up with the OHPC Slurm 17.02.9 package and a custom-compiled SPANK X11 plugin; everything "just worked" on that setup with regard to X11 forwarding. We encountered the bug described here while building the production server using OpenHPC 1.3.5 with OHPC Slurm 17.11.7.

Launching an X11 program on the compute node via "srun --x11 --pty xterm" results in the "X11 connection rejected because of wrong authentication." error. Opening a compute node shell using "srun --x11 --pty /bin/bash" shows the following:

    $DISPLAY=localhost:99.0

xauth listing:

    headnode.full.domain.name/unix:10  MIT-MAGIC-COOKIE-1  AABBCCDDEE
    compute-node/unix:99  MIT-MAGIC-COOKIE-1  AABCCDDEE

Running "xterm" from the prompt gives the authentication rejection error. I have found that manually adding an xauth cookie for localhost:99 on the compute node gets things working, i.e.:

    xauth add localhost:99 MIT-MAGIC-COOKIE-1 AABBCCDDEE

but attempting to do that automatically via, say, a TaskProlog script causes an xauth allocation timeout and a downed compute node. Starting a non-X11 srun job and then doing "ssh -X cluster-node" to the node allows X forwarding to work fine from the SSH session.

Both head and compute nodes are CentOS 7.5.1804 with the latest updates. We are using "X11Forwarding yes" and "X11UseLocalhost no" on both the head node and the compute nodes, along with public-key authentication and RSA keys. The .Xauthority files are at $HOME/.Xauthority, which is on an NFS mount shared between the head node and the compute nodes; we have not encountered xauth locking issues. Our head node has two DNS names, an FQDN for our campus network and a cluster-specific name for the compute nodes, if that has any bearing on the issue. The /etc/hosts file created and distributed by Warewulf has the cluster-side IP/DNS names set up properly.

If there is a way to script around this until Slurm gets updated by OHPC, that would be fantastic.
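The manual workaround described above can be sketched as a small script that copies the cookie from the node's hostname entry to a matching localhost entry. This is only a sketch: "cookie_for" is a hypothetical helper, not part of Slurm or xauth, and the display-number parsing assumes the localhost:<n>.<m> form shown above.

```shell
#!/bin/bash
# Sketch of the manual workaround: copy the xauth cookie stored under
# the node's hostname entry to a "localhost:<n>" entry matching
# $DISPLAY. ("cookie_for" is a hypothetical helper, not a real tool.)
cookie_for() {
    # From `xauth list` output ($1), print the cookie of the entry
    # whose display name ends in ":<n>" ($2).
    awk -v d=":$2" '$1 ~ d "$" { print $3; exit }' <<<"$1"
}

if [ -n "${DISPLAY:-}" ] && command -v xauth >/dev/null 2>&1; then
    disp="${DISPLAY#*:}"    # e.g. "99.0"
    disp="${disp%%.*}"      # -> "99"
    cookie="$(cookie_for "$(xauth list)" "$disp")"
    if [ -n "$cookie" ]; then
        xauth add "localhost:$disp" MIT-MAGIC-COOKIE-1 "$cookie"
    fi
fi
```

As noted above, running something like this from a TaskProlog reportedly hangs on xauth locking, so it would likely have to be run by the user in the job shell (or with xauth's lock-breaking option, `xauth -b`).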
Hey Bill -

I'm closing this as a duplicate of the original X11 forwarding plugin bug, and will be updating that as additional configuration entries are added to support, for example, different hostname patterns in the xauth file. These changes will only happen on the newer 18.08 release, however. We do not expect to make any further 17.11 maintenance releases at this time, and that release is also missing the X11Parameters configuration option, which will give you control over these settings.

If you have further questions, please add them over on bug 3647.

thanks,
- Tim

*** This ticket has been marked as a duplicate of ticket 3647 ***