| Summary: | nodes down with expired credentials | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Nicholas Labello <nicholas.labello> |
| Component: | slurmd | Assignee: | Director of Support <support> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 21.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Pfizer | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sdiag.log, slurm.conf, slurmctld-20220601.gz | | |
Description
Nicholas Labello, 2022-06-01 18:37:43 MDT
Comment 1 from Jason Booth:

> Munge decode failed: Expired credential

These errors can occur for a few different reasons:

1. Nodes are out of sync with their time. MUNGE requires the nodes' clocks to be in sync in order for credentials to be valid.
2. You could be experiencing timeouts. This could be due to how busy the cluster is. We can make adjustments if needed once we gather some additional information.
3. You may need to increase the munge thread count.

Please send us the slurm.conf, the full slurmctld.log covering this time period, and the full slurmd.log from one of the nodes. What I will be looking for are any errors that center around TCP timeouts and munge errors in the slurmctld.log. Please also send us the output of sdiag run 5 times, 1 minute apart.

Regarding the munge threads, you can make this change now by editing the service file, adding the additional threads, and reloading/restarting the service:
https://slurm.schedmd.com/high_throughput.html#munge_config

Comment 2 from Nicholas Labello:

Created attachment 25342 [details]
sdiag.log

Thanks Jason. Attached. Unfortunately the nodes have been rebooted, wiping the slurmd logs. I may have to wait for the next occurrence to collect them.

I noticed while collecting these logs that we do not have memory enforcement enabled. I think this must have been accidentally lost during a recent Bright upgrade, which included jumping 2 major versions of Slurm. Given that the node failures occurred while a very resource-hungry user job was running on them, I am wondering if it ran the nodes out of memory.

Based on our slurm.conf, is ConstrainRAMSpace=yes all we need to do to enable memory enforcement?
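The munge thread change pointed to in comment 1 can be sketched as a systemd drop-in. This is an illustrative sketch, not the site's actual config: the drop-in path is written under /tmp here so the example is harmless to run, and the thread count of 10 is just an example (`--num-threads` is the munged option the high-throughput guide refers to).

```shell
# Sketch: raise munged's thread count via a systemd drop-in.
# On a real system the drop-in lives in /etc/systemd/system/munge.service.d/;
# /tmp is used here only so the example runs without root.
mkdir -p /tmp/munge.service.d
cat > /tmp/munge.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/sbin/munged --num-threads=10
EOF
# On a real system, follow with:
#   systemctl daemon-reload && systemctl restart munge
```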
Created attachment 25343 [details]
slurm.conf
Created attachment 25344 [details]
slurmctld-20220601.gz
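As an aside, the sdiag series requested in comment 1 (5 samples, one minute apart) can be collected with a small script; the script and log paths below are illustrative.

```shell
# Sketch: sample sdiag 5 times, one minute apart, into a single log file.
# Paths are illustrative; run the generated script on the slurmctld host.
cat > /tmp/collect_sdiag.sh <<'EOF'
#!/bin/sh
for i in 1 2 3 4 5; do
    date     # timestamp each sample
    sdiag
    sleep 60
done > /tmp/sdiag.log 2>&1
EOF
chmod +x /tmp/collect_sdiag.sh
```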
Hi Nicholas,

(In reply to Nicholas Labello from comment #2)
> Thanks Jason. Attached. Unfortunately the nodes have been rebooted wiping
> slurmd logs. I may have to wait for the next occurrence to collect them.

Were those nodes out of sync with real time? That can cause this type of expired credential error.

> I noticed while collecting these logs that we do not have memory enforcement
> enabled. I think this must have been accidentally lost during a recent
> Bright upgrade which included jumping 2 major versions of Slurm. Given that
> the node failures occurred when a very resource-hungry user job was running
> on them I am wondering if it ran the nodes out of memory.

That's possible. Is there anything in the system logs that would indicate that?

> Based on our slurm.conf is ConstrainRAMSpace=yes all we need to do to enable
> memory enforcement?

I think so. You already have the task/cgroup plugin enabled.

Thanks,
-Michael

Hi Nicholas, any updates?

Thanks,
-Michael

Feel free to reopen if you need more assistance. Thanks!
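For reference, the memory enforcement discussed above is split across two files: the task/cgroup plugin is enabled in slurm.conf (already the case here, per Michael's reply), and the constraint itself goes in cgroup.conf. A minimal illustrative fragment, not this site's actual configuration:

```
# slurm.conf (already present on this cluster)
TaskPlugin=task/cgroup

# cgroup.conf (illustrative)
ConstrainRAMSpace=yes
# Optionally also constrain swap usage:
#ConstrainSwapSpace=yes
```

With ConstrainRAMSpace=yes, a job that exceeds its allocated memory is confined by the cgroup limit rather than being free to exhaust node memory, which matches the failure mode Nicholas suspected.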