Created attachment 16747 [details]
Node slurm logs, grepped to only the job in question

Initial user report:

JOB 1886595 ON mrcd53 CANCELLED AT 2020-11-17T20:28:57 DUE TO NODE FAILURE

This looks to have originated with an authentication error between the node and the control server. Below you can find information from the submission, the slurmctld log entries pertaining to the job and node, as well as the slurm log from the node, where you see the error "credential for job 1886595 revoked" right before the job is cancelled.

How can you tell what interface or credential is being used? This might be similar to Bug 9876 - Node failures on the campus-wide Cluster. An scontrol reconfigure was done for queue rebalancing a few days before the node failed; afterwards, nodes 53 and 68 were no longer in those two queues. Not sure if that was the issue.

scontrol show job 1886595
slurm_load_jobs error: Invalid job id specified

The submission was around 1:00 am on 11/11/2020. The job then started running on mrcd[53,68] at 4:37 am the same day.

#!/bin/bash
#SBATCH --job-name=stest
#SBATCH --partition=long.q,amartini.q
#SBATCH --time=336:00:00
#SBATCH --export=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

echo $SLURM_JOB_NODELIST > nodelist.out
module load lammps/lammps16
export OMP_NUM_THREADS=1
mpirun -n 48 lmp_mpi -in Input.in > output_sidefix_move.txt

[root@mercedhead ~]# cat /var/log/Slurm/Slurmctld.log | grep 1886595
[2020-11-11T00:35:35.295] _slurm_rpc_submit_batch_job: JobId=1886595 InitPrio=1 usec=960
[2020-11-11T00:35:35.550] debug: sched: JobId=1886595 unable to schedule in Partition=long.q,amartini.q (per _failed_partition()). State=PENDING. Previous-Reason=None. Previous-Desc=(null). New-Reason=Priority. Priority=1.
[2020-11-11T04:40:28.205] backfill: Started JobId=1886595 in long.q on mrcd[53,68]
[2020-11-11T04:40:28.208] prolog_running_decr: Configuration for JobId=1886595 is complete
[2020-11-11T04:40:28.485] debug: _slurm_rpc_het_job_alloc_info: JobId=1886595 NodeList=mrcd[53,68] usec=2
[2020-11-12T19:37:24.207] recovered JobId=1886595 StepId=Batch
[2020-11-12T19:37:24.207] recovered JobId=1886595 StepId=0
[2020-11-12T19:37:24.207] Recovered JobId=1886595 Assoc=52
[2020-11-17T20:28:55.643] Killing JobId=1886595 on failed node mrcd53

Node Errors from Slurmctld.log:

[2020-11-12T19:37:28.581] debug: validate_node_specs: node mrcd53 registered with 1 jobs
[2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T19:49:27.257] Node mrcd53 now responding
[2020-11-15T20:01:07.202] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T20:02:47.362] Node mrcd53 now responding
[2020-11-15T20:37:47.208] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T20:39:27.334] Node mrcd53 now responding
[2020-11-15T21:14:27.289] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T21:16:07.406] Node mrcd53 now responding
[2020-11-15T21:37:47.268] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T21:39:27.394] Node mrcd53 now responding
[2020-11-15T22:14:27.399] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T22:16:07.623] Node mrcd53 now responding
[2020-11-15T22:41:07.603] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T22:42:48.188] Node mrcd53 now responding
[2020-11-15T22:46:07.110] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T22:47:47.243] Node mrcd53 now responding
[2020-11-15T23:14:27.397] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T23:16:07.549] Node mrcd53 now responding
[2020-11-15T23:47:47.275] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T23:49:27.421] Node mrcd53 now responding
[2020-11-17T10:37:04.087] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T10:38:44.264] Node mrcd53 now responding
[2020-11-17T11:30:24.096] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T11:32:04.206] Node mrcd53 now responding
[2020-11-17T11:53:44.257] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T11:55:24.401] Node mrcd53 now responding
[2020-11-17T12:27:04.347] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T12:28:44.520] Node mrcd53 now responding
[2020-11-17T14:07:04.187] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T14:08:44.371] Node mrcd53 now responding
[2020-11-17T18:18:55.067] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T18:20:35.516] Node mrcd53 now responding
[2020-11-17T18:52:15.237] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T18:53:55.742] Node mrcd53 now responding
[2020-11-17T19:08:55.270] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:10:35.678] Node mrcd53 now responding
[2020-11-17T19:20:35.162] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:22:15.346] Node mrcd53 now responding
[2020-11-17T19:38:55.322] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:40:35.684] Node mrcd53 now responding
[2020-11-17T19:47:15.162] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:48:55.352] Node mrcd53 now responding
[2020-11-17T20:25:35.354] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T20:28:55.643] Killing JobId=1886595 on failed node mrcd53
[2020-11-17T20:28:57.529] Node mrcd53 now responding
[2020-11-17T20:28:57.529] node_did_resp: node mrcd53 returned to service
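Regarding the question above about which credential is in use: Slurm reports its authentication and credential plugins via scontrol show config. A minimal sketch; since the real command needs a live cluster, the output is simulated here with printf using values matching this site's auth/munge setup:

```shell
# On a live cluster one would run:
#   scontrol show config | grep -E 'AuthType|CredType'
# Simulated here with representative output lines:
printf 'AuthType                = auth/munge\nCredType                = cred/munge\n' \
  | grep -E 'AuthType|CredType'
```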
Please attach your slurm.conf (& friends). Please also call:
> slurmctld -V
[root@mercedhead ~]# slurmctld -V
slurm 20.02.4
Created attachment 16748 [details] slurmdbd.conf
Created attachment 16749 [details]
slurm.conf (IP address removed)
Created attachment 16750 [details] gres.conf
Created attachment 16751 [details] cgroup.conf
Everything has been uploaded. Please let me know if anything else is needed.
(In reply to Robert Romero from comment #3)
> Created attachment 16748 [details]
> slurmdbd.conf

Please make sure you have changed your password.
> StoragePass=s************
(In reply to Robert Romero from comment #0)
> [2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING :
> Protocol authentication error

Were the Slurm binaries updated or changed recently?

> AuthType=auth/munge

Has the munge key (default: /etc/munge/munge.key) been synced to all the nodes, specifically mrcd[53,68]? Please call this on mrcd[53,68]:
> remunge

Please do not attach the munge key to this bug.

Please also call the following on mrcd[53,68]:
> pgrep slurmd | xargs -i grep . -nH /proc/{}/maps
> pgrep slurmstepd | xargs -i grep . -nH /proc/{}/maps

Please also call the following on the host running slurmctld (controller):
> pgrep slurmctld | xargs -i grep . -nH /proc/{}/maps
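One way to act on the munge-key question above is to compare the key's checksum on each node against the controller's. The sketch below uses a hypothetical helper (check_sums, the node names, and the example checksums are illustrative, not from this site); the real per-node checksums would be gathered with something like `ssh mrcd53 md5sum /etc/munge/munge.key`:

```shell
#!/bin/sh
# Hypothetical helper: compare each node's reported checksum against the
# controller's reference checksum and flag mismatches.
check_sums() {
    ref=$1; shift
    for entry in "$@"; do          # entries look like "node:checksum"
        node=${entry%%:*}
        sum=${entry#*:}
        if [ "$sum" = "$ref" ]; then
            echo "$node: key matches controller"
        else
            echo "$node: KEY MISMATCH"
        fi
    done
}

# Illustrative values only; real ones come from md5sum /etc/munge/munge.key:
check_sums deadbeef "mrcd53:deadbeef" "mrcd68:deadbeef"
```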
Created attachment 16758 [details] controller-slurmctld.txt
Created attachment 16759 [details] mrcd53-slurmd
Created attachment 16760 [details] mrcd53-slurmstepd
Created attachment 16762 [details] mrcd68-slurmd
Created attachment 16763 [details] mrcd68-slurmstepd
(In reply to Nate Rini from comment #9)
> (In reply to Robert Romero from comment #0)
> > [2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING :
> > Protocol authentication error
>
> Were the Slurm binaries updated or changed recently?

No, not recently; the last upgrade was over 4 months ago. The only thing done recently was "scontrol reconfigure", which took place on 11/12/2020.

> > AuthType=auth/munge
> Has the munge key (default: /etc/munge/munge.key) been synced to all the
> nodes, specifically mrcd[53,68]? Please call this on mrcd[53,68]:
> > remunge
>
> Please do not attach the munge key to this bug.

Yes, this would have been part of the upgrade process performed 4 months ago. The node failure was only on 53 and not seen on 68, with 53 immediately returning to normal after the single job was terminated.

> Please also call the following on mrcd[53,68]:
> > pgrep slurmd | xargs -i grep . -nH /proc/{}/maps
> > pgrep slurmstepd | xargs -i grep . -nH /proc/{}/maps
> Please also call the following on host running slurmctld (controller):
> > pgrep slurmctld | xargs -i grep . -nH /proc/{}/maps

Output files have been uploaded.
remunge performed.

[root@mrcd53 ~]# remunge
2020-11-20 11:21:09 Spawning 1 thread for encoding
2020-11-20 11:21:09 Processing credentials for 1 second
2020-11-20 11:21:10 Processed 25379 credentials in 1.000s (25374 creds/sec)

[root@mrcd68 ~]# remunge
2020-11-20 11:22:09 Spawning 1 thread for encoding
2020-11-20 11:22:09 Processing credentials for 1 second
2020-11-20 11:22:10 Processed 23888 credentials in 1.000s (23884 creds/sec)
@nate I'm noticing nodes reporting mismatched slurm.conf files, with the error "appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf." being reported.

Could this still be from the scontrol reconfigure? We added and removed nodes from queues without restarting slurmctld and the slurmd daemons, as we understood that the slurmd daemons would re-read the file once we ran scontrol reconfigure.

Could this be the reason? We are continuing to see node failures.
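To see whether a node's slurm.conf actually diverges (rather than just silencing the warning with NO_CONF_HASH), one can fetch the node's copy and compare it against the controller's. A sketch; the remote fetch (e.g. `scp mrcd53:/etc/slurm/slurm.conf /tmp/`) is site-specific, so stand-in temp files simulate the two copies here:

```shell
#!/bin/sh
# Stand-ins for the controller's slurm.conf and a node's (possibly stale) copy;
# the differing content below is invented purely for illustration.
ctl_conf=$(mktemp) node_conf=$(mktemp)
printf 'NodeName=mrcd53 CPUs=24\n' > "$ctl_conf"
printf 'NodeName=mrcd53 CPUs=48\n' > "$node_conf"

if cmp -s "$ctl_conf" "$node_conf"; then
    echo "slurm.conf in sync"
else
    echo "slurm.conf DIFFERS"
    diff -u "$ctl_conf" "$node_conf" || true
fi
rm -f "$ctl_conf" "$node_conf"
```

Note that slurmd compares a hash of the config, so even whitespace-only differences can trigger the warning; a byte-for-byte comparison like cmp is the conservative check.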
(In reply to Robert Romero from comment #17)
> We added and removed nodes from queues without restarting slurmctld and the
> slurmd daemons, as we understood that the slurmd daemons would re-read the
> file once we ran scontrol reconfigure.
>
> Could this be the reason? We are continuing to see node failures.

Have there been any major changes to the slurm.conf? I would suggest just syncing the file and restarting all of the slurmd daemons. It would be rather surprising for a slurm.conf sync issue to cause an issue with munge, but slurmd daemons not getting updated could cause it.
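The sync-and-restart suggestion above can be sketched as a short procedure. It is printed as a dry run here, since the node names, config path, and use of systemd are assumptions about this site rather than confirmed details:

```shell
#!/bin/sh
# Dry run: print the commands that would push the controller's slurm.conf
# to each node and restart its slurmd. Nothing is executed remotely.
sync_and_restart() {
    for node in "$@"; do
        echo "scp /etc/slurm/slurm.conf ${node}:/etc/slurm/slurm.conf"
        echo "ssh ${node} systemctl restart slurmd"
    done
}

sync_and_restart mrcd53 mrcd68
```

To actually execute the commands instead of printing them, the echo lines would be replaced by the scp/ssh calls themselves.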
Robert, Timing this ticket out. Please respond if you have any more questions and we can continue from there. Thanks, --Nate