Ticket 10260

Summary: Node Failure - Possible bug when running scontrol reconfigure with changes to nodes in queues a job is currently running on.
Product: Slurm    Reporter: Robert Romero <rromero39>
Component: slurmctld    Assignee: Nate Rini <nate>
Status: RESOLVED TIMEDOUT    QA Contact:
Severity: 3 - Medium Impact
Priority: ---
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: UC Merced
Attachments: controller-slurmctld.txt
mrcd53-slurmd
mrcd53-slurmstepd
mrcd68-slurmd
mrcd68-slurmstepd

Description Robert Romero 2020-11-19 15:16:06 MST
Created attachment 16747 [details]
Node Slurm logs, grepped to only the job in question

The initial user report: "JOB 1886595 ON mrcd53 CANCELLED AT 2020-11-17T20:28:57 DUE TO NODE FAILURE", which looks to have originated with an auth error between the node and the control server. Below you can find information from the submission, the slurmctld log entries pertaining to the job and node, and the slurm.log from the node, where you see the error "credential for job 1886595 revoked" right before the job is cancelled. How can you tell what interface or credential is being used? This might be similar to Bug 9876 - Node failures on the campus-wide Cluster. An scontrol reconfigure was done for queue rebalancing a few days before the node failed; afterwards, nodes 53 and 68 were no longer in those two queues. Not sure if that was the issue.


scontrol show job 1886595
slurm_load_jobs error: Invalid job id specified

The job was submitted around 1:00 am on 11/11/2020 and started running on mrcd[53,68] at 4:37 am the same day.

#!/bin/bash
#SBATCH --job-name=stest
#SBATCH --partition=long.q,amartini.q
#SBATCH --time=336:00:00
#SBATCH --export=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
echo $SLURM_JOB_NODELIST > nodelist.out
module load lammps/lammps16
export OMP_NUM_THREADS=1
mpirun -n 48 lmp_mpi -in Input.in > output_sidefix_move.txt

[root@mercedhead ~]# cat /var/log/Slurm/Slurmctld.log | grep 1886595
[2020-11-11T00:35:35.295] _slurm_rpc_submit_batch_job: JobId=1886595 InitPrio=1 usec=960
[2020-11-11T00:35:35.550] debug:  sched: JobId=1886595 unable to schedule in Partition=long.q,amartini.q (per _failed_partition()). State=PENDING. Previous-Reason=None. Previous-Desc=(null). New-Reason=Priority. Priority=1.
[2020-11-11T04:40:28.205] backfill: Started JobId=1886595 in long.q on mrcd[53,68]
[2020-11-11T04:40:28.208] prolog_running_decr: Configuration for JobId=1886595 is complete
[2020-11-11T04:40:28.485] debug:  _slurm_rpc_het_job_alloc_info: JobId=1886595 NodeList=mrcd[53,68] usec=2
[2020-11-12T19:37:24.207] recovered JobId=1886595 StepId=Batch
[2020-11-12T19:37:24.207] recovered JobId=1886595 StepId=0
[2020-11-12T19:37:24.207] Recovered JobId=1886595 Assoc=52
[2020-11-17T20:28:55.643] Killing JobId=1886595 on failed node mrcd53


Node Errors from Slurmctld.log:
[2020-11-12T19:37:28.581] debug:  validate_node_specs: node mrcd53 registered with 1 jobs
[2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T19:49:27.257] Node mrcd53 now responding
[2020-11-15T20:01:07.202] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T20:02:47.362] Node mrcd53 now responding
[2020-11-15T20:37:47.208] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T20:39:27.334] Node mrcd53 now responding
[2020-11-15T21:14:27.289] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T21:16:07.406] Node mrcd53 now responding
[2020-11-15T21:37:47.268] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T21:39:27.394] Node mrcd53 now responding
[2020-11-15T22:14:27.399] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T22:16:07.623] Node mrcd53 now responding
[2020-11-15T22:41:07.603] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T22:42:48.188] Node mrcd53 now responding
[2020-11-15T22:46:07.110] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T22:47:47.243] Node mrcd53 now responding
[2020-11-15T23:14:27.397] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T23:16:07.549] Node mrcd53 now responding
[2020-11-15T23:47:47.275] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T23:49:27.421] Node mrcd53 now responding
[2020-11-17T10:37:04.087] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T10:38:44.264] Node mrcd53 now responding
[2020-11-17T11:30:24.096] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T11:32:04.206] Node mrcd53 now responding
[2020-11-17T11:53:44.257] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T11:55:24.401] Node mrcd53 now responding
[2020-11-17T12:27:04.347] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T12:28:44.520] Node mrcd53 now responding
[2020-11-17T14:07:04.187] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T14:08:44.371] Node mrcd53 now responding
[2020-11-17T18:18:55.067] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T18:20:35.516] Node mrcd53 now responding
[2020-11-17T18:52:15.237] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T18:53:55.742] Node mrcd53 now responding
[2020-11-17T19:08:55.270] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:10:35.678] Node mrcd53 now responding
[2020-11-17T19:20:35.162] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:22:15.346] Node mrcd53 now responding
[2020-11-17T19:38:55.322] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:40:35.684] Node mrcd53 now responding
[2020-11-17T19:47:15.162] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T19:48:55.352] Node mrcd53 now responding
[2020-11-17T20:25:35.354] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-17T20:28:55.643] Killing JobId=1886595 on failed node mrcd53
[2020-11-17T20:28:57.529] Node mrcd53 now responding
[2020-11-17T20:28:57.529] node_did_resp: node mrcd53 returned to service
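The errors above recur roughly every 15-40 minutes, which suggests the controller's periodic REQUEST_PING traffic is intermittently failing authentication rather than job traffic itself. A minimal sketch for pulling that cadence out of the log — the sample lines are copied from this ticket, and GNU date is assumed for the timestamp conversion; on the real controller, point LOG at /var/log/slurm/slurmctld.log instead:

```shell
# Hedged sketch: measure the gap between consecutive auth errors to see the
# failure cadence. Sample lines below are taken from the log in this ticket.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T19:49:27.257] Node mrcd53 now responding
[2020-11-15T20:01:07.202] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
[2020-11-15T20:37:47.208] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
EOF
# The timestamp is characters 2-20 of each line ("YYYY-MM-DDTHH:MM:SS");
# convert to epoch seconds, then print minutes between consecutive failures.
gaps=$(grep 'Protocol authentication error' "$LOG" \
  | cut -c2-20 \
  | while read -r ts; do date -d "$ts" +%s; done \
  | awk 'NR>1 {printf "%.0f min\n", ($1-prev)/60} {prev=$1}')
echo "$gaps"
rm -f "$LOG"
```

On these sample lines the gaps come out to 13 and 37 minutes; run over the full log this makes it easy to see whether the failures line up with a ping interval or with some external periodic event (cron, backups, network maintenance).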
Comment 1 Nate Rini 2020-11-19 15:48:21 MST
Please attach your slurm.conf (& friends).

Please also call:
> slurmctld -V
Comment 2 Robert Romero 2020-11-19 15:50:06 MST
[root@mercedhead ~]# slurmctld -V
slurm 20.02.4
Comment 3 Robert Romero 2020-11-19 15:56:35 MST
Created attachment 16748 [details]
slurmdbd.conf
Comment 4 Robert Romero 2020-11-19 15:59:57 MST
Created attachment 16749 [details]
slurm.conf

removed ip address.
Comment 5 Robert Romero 2020-11-19 16:00:30 MST
Created attachment 16750 [details]
gres.conf
Comment 6 Robert Romero 2020-11-19 16:00:59 MST
Created attachment 16751 [details]
cgroup.conf
Comment 7 Robert Romero 2020-11-19 16:02:13 MST
Everything has been uploaded. Please let me know if there is anything else needed.
Comment 8 Nate Rini 2020-11-19 19:14:55 MST
(In reply to Robert Romero from comment #3)
> Created attachment 16748 [details]
> slurmdbd.conf

Please make sure you have changed your password.
> StoragePass=s************
Comment 9 Nate Rini 2020-11-19 19:30:58 MST
(In reply to Robert Romero from comment #0)
> [2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING :
> Protocol authentication error

Were the Slurm binaries updated or changed recently?

> AuthType=auth/munge
Has the munge key (default: /etc/munge/munge.key) been synced to all the nodes, specifically mrcd[53,68]? Please call this on mrcd[53,68]:
> remunge

Please do not attach the munge key to this bug.

Please also call the following on mrcd[53,68]:
> pgrep slurmd | xargs -i grep . -nH /proc/{}/maps
> pgrep slurmstepd | xargs -i grep . -nH /proc/{}/maps

Please also call the following on the host running slurmctld (controller):
> pgrep slurmctld | xargs -i grep . -nH /proc/{}/maps
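The key-sync question can be answered without ever copying the key off a host: compare checksums on each node against the controller's. A minimal sketch, simulated with local stand-in files so it is self-contained — on the real cluster the per-host checksum would come from `ssh "$host" sha256sum /etc/munge/munge.key`, and the hostnames are the ones from this ticket:

```shell
# Hedged sketch: detect a munge key mismatch by comparing checksums, never
# the key itself. Local stand-in files simulate the controller and two nodes.
keydir=$(mktemp -d)
echo "shared-secret" > "$keydir/controller.key"    # stand-in for the controller's key
cp "$keydir/controller.key" "$keydir/mrcd53.key"   # node with a synced key
echo "stale-secret" > "$keydir/mrcd68.key"         # node with a drifted key
ref=$(sha256sum "$keydir/controller.key" | cut -d' ' -f1)
result=""
for host in mrcd53 mrcd68; do
    # Real check: sum=$(ssh "$host" sha256sum /etc/munge/munge.key | cut -d' ' -f1)
    sum=$(sha256sum "$keydir/$host.key" | cut -d' ' -f1)
    if [ "$sum" = "$ref" ]; then
        result="$result$host: key matches\n"
    else
        result="$result$host: KEY MISMATCH\n"
    fi
done
printf "%b" "$result"
rm -rf "$keydir"
```

Note that a successful local `remunge` only proves the local munged daemon round-trips its own key; it does not prove the key matches the controller's, which is what the checksum comparison covers.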
Comment 10 Robert Romero 2020-11-20 12:01:31 MST
Created attachment 16758 [details]
controller-slurmctld.txt
Comment 11 Robert Romero 2020-11-20 12:01:58 MST
Created attachment 16759 [details]
mrcd53-slurmd
Comment 12 Robert Romero 2020-11-20 12:02:21 MST
Created attachment 16760 [details]
mrcd53-slurmstepd
Comment 13 Robert Romero 2020-11-20 12:02:57 MST
Created attachment 16762 [details]
mrcd68-slurmd
Comment 14 Robert Romero 2020-11-20 12:03:20 MST
Created attachment 16763 [details]
mrcd68-slurmstepd
Comment 15 Robert Romero 2020-11-20 12:17:17 MST
(In reply to Nate Rini from comment #9)
> (In reply to Robert Romero from comment #0)
> > [2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING :
> > Protocol authentication error
> 
> Were the Slurm binaries updated or changed recently?

No, not recently; the last upgrade was over 4 months ago. The only thing done recently was an "scontrol reconfigure", which took place on 11/12/2020.

> > AuthType=auth/munge
> Has the munge key (default: /etc/munge/munge.key) been synced to all the
> nodes, specifically mrcd[53,68]? Please call this on mrcd[53,68]:
> > remunge
>
> Please do not attach the munge key to this bug.

Yes, this would have been part of the upgrade process performed 4 months ago. The node failure was only on 53 and not seen on 68, with 53 immediately returning to normal after the single job was terminated.

> Please also call the following on mrcd[53,68]:
> > pgrep slurmd | xargs -i grep . -nH /proc/{}/maps
> > pgrep slurmstepd | xargs -i grep . -nH /proc/{}/maps
> 
> Please also call the following on the host running slurmctld (controller):
> > pgrep slurmctld | xargs -i grep . -nH /proc/{}/maps

Output files have been uploaded.
Comment 16 Robert Romero 2020-11-20 12:22:57 MST
remunge performed.

[root@mrcd53 ~]# remunge
2020-11-20 11:21:09 Spawning 1 thread for encoding
2020-11-20 11:21:09 Processing credentials for 1 second
2020-11-20 11:21:10 Processed 25379 credentials in 1.000s (25374 creds/sec)

[root@mrcd68 ~]# remunge
2020-11-20 11:22:09 Spawning 1 thread for encoding
2020-11-20 11:22:09 Processing credentials for 1 second
2020-11-20 11:22:10 Processed 23888 credentials in 1.000s (23884 creds/sec)
Comment 17 Robert Romero 2020-11-25 10:17:08 MST
@nate I'm noticing nodes reporting mismatched slurm.conf files, with the error " appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf." being reported.

Could this still be from the scontrol reconfigure? We did add and remove nodes from queues without restarting slurmctld and the slurmd daemons, as we interpreted that the slurmctld daemon would cause the slurmd daemons to re-read the file once we ran scontrol reconfigure.

Could this be the reason? We are continuing to see node failures.
Comment 18 Nate Rini 2020-11-30 15:08:34 MST
(In reply to Robert Romero from comment #17)
> we did
> add and remove nodes from queues without restart slurmctld and the slurmd
> daemons as we interrupted that the slurmctld deamon would the slurmd daemons
> to re-read the file once re ran scontrol reconfigure.
> 
> could this be the reason? as we are continuing to seeing node failures
Have there been any major changes to the slurm.conf? I would suggest just syncing the file and restarting all of the slurmd daemons.

It would be rather surprising for a slurm.conf sync issue to cause a problem with munge, but slurmd daemons not getting updated could cause it.
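The sync-and-restart pass Nate suggests can be sketched as follows — a hedged illustration, with the scp/systemctl remediation shown as comments (those need cluster access) and the drift check simulated on local stand-in copies; the real files would be /etc/slurm/slurm.conf on the controller and on each node:

```shell
# Hedged sketch: detect slurm.conf drift between the controller's copy and a
# node's copy, then (commented) push the file and restart slurmd. The two
# stand-in files here are invented content purely to make the sketch runnable.
master=$(mktemp) && node=$(mktemp)
printf 'PartitionName=long.q Nodes=mrcd[1-52]\n' > "$master"  # controller copy (stand-in)
printf 'PartitionName=long.q Nodes=mrcd[1-68]\n' > "$node"    # node's stale copy (stand-in)
if cmp -s "$master" "$node"; then
    status="in sync"
else
    status="drifted"
    # Real remediation, per node:
    #   scp /etc/slurm/slurm.conf mrcd53:/etc/slurm/slurm.conf
    #   ssh mrcd53 systemctl restart slurmd
fi
echo "slurm.conf: $status"
rm -f "$master" "$node"
```

A restart of slurmd (rather than relying on scontrol reconfigure alone) guarantees every daemon re-reads the synced file, which is what clears the "different slurm.conf than the slurmctld" hash warning.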
Comment 19 Nate Rini 2020-12-07 13:47:00 MST
Robert,

Timing this ticket out. Please respond if you have any more questions and we can continue from there.

Thanks,
--Nate