Ticket 709

Summary: running parallel jobs is failing without munge
Product: Slurm
Reporter: Stuart Midgley <stuartm>
Component: slurmd
Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: da
Version: 14.03.0
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions

Description Stuart Midgley 2014-04-15 01:55:35 MDT
We have auth/none configured in our slurm.conf, but when we try to run parallel jobs they fail, and on the client we see messages about not being able to talk to munge.
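As a rough sketch of the setup being described (illustrative values only, not the site's actual file), the relevant slurm.conf lines would be:

AuthType=auth/none
# CryptoType is not set here, so it falls back to its default,
# crypto/munge, which still signs job step credentials through munged.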

20140415214557 bud30:tomo2_control> salloc -N 4 -p teambm
salloc: Pending job allocation 702224
salloc: job 702224 queued and waiting for resources
salloc: error: Lookup failed: Unknown host
salloc: job 702224 has been allocated resources
salloc: Granted job allocation 702224
20140415214839 bud30:tomo2_control> srun hostname
srun: error: Task launch for 702224.0 failed on node clus419: Invalid job credential
srun: error: Task launch for 702224.0 failed on node clus418: Invalid job credential
srun: error: Task launch for 702224.0 failed on node clus420: Invalid job credential
srun: error: Task launch for 702224.0 failed on node clus421: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


And from one of the hosts:


Apr 15 21:46:38 clus419 slurmstepd[8331]: Null authentication plugin loaded
Apr 15 21:46:38 clus419 slurmstepd[8331]: Handling REQUEST_SIGNAL_CONTAINER
Apr 15 21:46:38 clus419 slurmstepd[8331]: _handle_signal_container for step=702025.4294967294 uid=0 signal=18
Apr 15 21:46:38 clus419 slurmstepd[8331]: Sent signal 18 to 702025.4294967294
Apr 15 21:46:38 clus419 slurmstepd[8331]: Handling REQUEST_SIGNAL_CONTAINER
Apr 15 21:46:38 clus419 slurmstepd[8331]: _handle_signal_container for step=702025.4294967294 uid=0 signal=15
Apr 15 21:46:38 clus419 slurmstepd[8331]: error: *** JOB 702025 CANCELLED AT 2014-04-15T21:46:38 ***
Apr 15 21:46:38 clus419 slurmstepd[8331]: Sent signal 15 to 702025.4294967294
Apr 15 21:46:38 clus419 slurmstepd[8331]: Handling REQUEST_STATE
Apr 15 21:46:38 clus419 slurmstepd[8331]: Job 702025 memory used:0 limit:8134656 KB
Apr 15 21:46:38 clus419 slurmstepd[8331]: task 0 (8335) exited. Killed by signal 15.
Apr 15 21:46:38 clus419 slurmstepd[8331]: task_p_post_term: 702025.4294967294, task 0
Apr 15 21:46:39 clus419 slurmstepd[8331]: Handling REQUEST_STATE
Apr 15 21:46:39 clus419 slurmstepd[8331]: cpu_freq_reset: #cpus reset = 0
Apr 15 21:46:39 clus419 slurmstepd[8331]: Message thread exited
Apr 15 21:46:39 clus419 slurmstepd[8331]: get_exit_code task 0 killed by cmd
Apr 15 21:46:39 clus419 slurmstepd[8331]: job 702025 completed with slurm_rc = 0, job_rc = -2
Apr 15 21:46:39 clus419 slurmstepd[8331]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
Apr 15 21:46:39 clus419 slurmstepd[8331]: done with job
Apr 15 21:46:40 clus419 slurmd[28628]: debug:  Waiting for job 702025's prolog to complete
Apr 15 21:46:40 clus419 slurmd[28628]: debug:  Finished wait for job 702025's prolog to complete
Apr 15 21:46:40 clus419 slurmd[28628]: debug:  Calling /d/sw/slurm/20140415/sbin/slurmstepd spank epilog
Apr 15 21:46:40 clus419 spank-epilog[14703]: Reading slurm.conf file: /d/sw/slurm/20140415/etc/slurm.conf
Apr 15 21:46:40 clus419 spank-epilog[14703]: Running spank/epilog for jobid [702025] uid [1226]
Apr 15 21:46:40 clus419 spank-epilog[14703]: spank: opening plugin stack /d/sw/slurm/20140415/etc/plugstack.conf
Apr 15 21:46:40 clus419 slurmd[28628]: debug:  [job 702025] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
Apr 15 21:46:41 clus419 slurmd[28628]: debug:  completed epilog for jobid 702025
Apr 15 21:46:41 clus419 slurmd[28628]: debug:  Job 702025: sent epilog complete msg: rc = 0
Apr 15 21:46:51 clus419 nrpe[14720]: Error: Could not complete SSL handshake. 5
Apr 15 21:47:33 clus419 monitoring: systemp=27
Apr 15 21:47:58 clus419 sshd[14744]: SSH: Server;Ltype: Version;Remote: 172.16.255.10-58257;Protocol: 2.0;Client: check_ssh_1.4.15
Apr 15 21:47:58 clus419 sshd[14744]: Connection closed by 172.16.255.10 [preauth]
Apr 15 21:48:13 clus419 monitoring: systemp=27
Apr 15 21:48:54 clus419 slurmd[28628]: debug:  task_p_slurmd_launch_request: 702224.0 1
Apr 15 21:48:54 clus419 slurmd[28628]: launch task 702224.0 request from 3005.2000@172.16.251.30 (port 36744)
Apr 15 21:48:54 clus419 slurmd[28628]: debug:  Checking credential with 300 bytes of sig data
Apr 15 21:48:54 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:54 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: debug:  Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
Apr 15 21:48:55 clus419 slurmd[28628]: error: If munged is up, restart with --num-threads=10
Apr 15 21:48:55 clus419 slurmd[28628]: error: Credential signature check: Socket communication error
Apr 15 21:48:55 clus419 slurmd[28628]: error: Invalid job credential from 3005@172.16.251.30: Invalid job credential
Comment 1 Moe Jette 2014-04-15 04:58:14 MDT
If you do not have munge installed, then you need to at least have OpenSSL and set the CryptoType parameter accordingly. The information below is from the slurm.conf man page:

CryptoType
The cryptographic signature tool to be used in the creation of job step credentials. The slurmctld daemon must be restarted for a change in CryptoType to take effect. Acceptable values at present include "crypto/munge" and "crypto/openssl". The default value is "crypto/munge".
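As a rough illustration of this suggestion (paths are placeholders, not the site's actual configuration), the slurm.conf changes and key generation would look roughly like:

# slurm.conf on the controller and all compute nodes
AuthType=auth/none
CryptoType=crypto/openssl
JobCredentialPrivateKey=/etc/slurm/slurm.key
JobCredentialPublicCertificate=/etc/slurm/slurm.cert

# Generate the key pair once with standard OpenSSL commands:
openssl genrsa -out /etc/slurm/slurm.key 2048
openssl rsa -in /etc/slurm/slurm.key -pubout -out /etc/slurm/slurm.cert

The private key is only needed by slurmctld and should be readable only by the SlurmUser account, while the public certificate needs to be readable by slurmd on every compute node; the daemons must then be restarted for the CryptoType change to take effect, per the man page text above.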