Ticket 8266

Summary: slurm cannot use x11 ssh public key authentication failure Callback returned error
Product: Slurm
Reporter: zhaof17
Component: Scheduling
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract
Priority: ---
Version: 18.08.6
Hardware: Linux
OS: Linux
Site: -Other-

Description zhaof17 2019-12-21 18:36:48 MST
The management node and one compute node both run Ubuntu 19.04.
Public keys have been exchanged between these two machines, so ssh between them is passwordless.

The xterm command works on both machines.

However, when I run "srun --pty --x11 xterm", the error is:

/usr/bin/xterm: Xt error: Can't open display: localhost:10.0
srun: error: zhaofengLapTop: task 0: Exited with exit code 1

The slurmd log on the compute node is:

[2019-12-22T09:15:17.877] _run_prolog: run job script took usec=11949
[2019-12-22T09:15:17.877] _run_prolog: prolog with lock for job 16 ran for 0 seconds
[2019-12-22T09:15:18.074] [16.extern] error: ssh public key authentication failure: Callback returned error
[2019-12-22T09:15:18.075] [16.extern] error: x11 port forwarding setup failed
[2019-12-22T09:15:18.076] [16.extern] error: _spawn_job_container: failed retrieving x11 display value: No such file or directory
[2019-12-22T09:15:18.076] [16.extern] error: _spawn_job_container: failed retrieving x11 authority value: No such file or directory
[2019-12-22T09:15:18.080] [16.extern] done with job
[2019-12-22T09:15:18.178] launch task 16.0 request from UID:1010 GID:1010 HOST:10.8.15.136 PORT:38114
[2019-12-22T09:15:18.183] error: could not get x11 forwarding display for job 16 step 0, x11 forwarding disabled

The content of slurm.conf is:

SlurmctldHost=zhiyuanWorkstation(10.8.15.136)
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/affinity
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=zhiyuanWorkstation
ClusterName=test-cluster
JobAcctGatherType=jobacct_gather/linux
PrologFlags=x11
SlurmdLogFile=/var/log/slurm-llnl/log.txt
# COMPUTE NODES
NodeName=zhaofengLapTop NodeAddr=10.8.15.92 CPUs=4 State=UNKNOWN
NodeName=raspberrypi NodeAddr=10.8.15.88 CPUs=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=raspberrypi2 NodeAddr=10.8.15.87 CPUs=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP OverSubscribe=YES

Comment 1 zhaof17 2020-01-04 07:48:32 MST
It seems the authentication key has to be generated in PEM format (ssh-keygen -m pem).
libssh does not support the newer OpenSSH private key format.
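
One way to check which format an existing private key uses is to look at its first line. The sketch below is only an illustration: key_format is a hypothetical helper name (not part of Slurm or OpenSSH), and it only recognises the OpenSSH header and the traditional RSA/EC/DSA PEM headers; anything else is reported as "unknown".

```shell
#!/bin/sh
# Hypothetical helper: report the on-disk format of a private key file
# by inspecting its first line. Prints "openssh", "pem" or "unknown".
key_format() {
    first_line=$(head -n 1 "$1" 2>/dev/null)
    case "$first_line" in
        "-----BEGIN OPENSSH PRIVATE KEY-----")
            echo "openssh" ;;
        "-----BEGIN RSA PRIVATE KEY-----"|"-----BEGIN EC PRIVATE KEY-----"|"-----BEGIN DSA PRIVATE KEY-----")
            echo "pem" ;;
        *)
            echo "unknown" ;;
    esac
}

# Example (assumes the key lives at the default RSA path):
#   key_format "$HOME/.ssh/id_rsa"
#
# If this prints "openssh", the key can be rewritten in PEM format.
# Back the file up first, since this rewrites it in place:
#   ssh-keygen -p -m PEM -f "$HOME/.ssh/id_rsa"
#
# A fresh PEM key can also be generated directly:
#   ssh-keygen -t rsa -m PEM -f "$HOME/.ssh/id_rsa"
```

Rewriting the private key with -p does not change the public key, so the authorized_keys entries already copied between the machines should not need updating.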