| Summary: | Node Failure - Possible bug when running scontrol reconfigure with changes to nodes in queues a job is currently running on. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Robert Romero <rromero39> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | UC Merced | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | controller-slurmctld.txt, mrcd53-slurmd, mrcd53-slurmstepd, mrcd68-slurmd, mrcd68-slurmstepd | | |
Description
Robert Romero
2020-11-19 15:16:06 MST
Please attach your slurm.conf (& friends).
Please also call:
> slurmctld -V
[root@mercedhead ~]# slurmctld -V
slurm 20.02.4

Created attachment 16748 [details]
slurmdbd.conf
Created attachment 16749 [details]
slurm.conf
Removed IP address.
Created attachment 16750 [details]
gres.conf
Created attachment 16751 [details]
cgroup.conf
Everything has been uploaded. Please let me know if there is anything else needed.

(In reply to Robert Romero from comment #3)
> Created attachment 16748 [details]
> slurmdbd.conf

Please make sure you have changed your password.
> StoragePass=s************

(In reply to Robert Romero from comment #0)
> [2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error

Were the Slurm binaries updated or changed recently?

> AuthType=auth/munge

Has the munge key (default: /etc/munge/munge.key) been synced to all the nodes, specifically mrcd[53,68]? Please call this on mrcd[53,68]:
> remunge

Please do not attach the munge key to this bug.

Please also call the following on mrcd[53,68]:
> pgrep slurmd | xargs -i grep . -nH /proc/{}/maps
> pgrep slurmstepd | xargs -i grep . -nH /proc/{}/maps

Please also call the following on the host running slurmctld (the controller):
> pgrep slurmctld | xargs -i grep . -nH /proc/{}/maps

Created attachment 16758 [details]
controller-slurmctld.txt
Created attachment 16759 [details]
mrcd53-slurmd
Created attachment 16760 [details]
mrcd53-slurmstepd
Created attachment 16762 [details]
mrcd68-slurmd
Created attachment 16763 [details]
mrcd68-slurmstepd
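For context, the `pgrep ... /proc/{}/maps` commands requested earlier can be wrapped up as below. This is a sketch, not from the ticket: the `check_maps` helper name is mine, and the point of the check is that a `(deleted)` suffix on a mapped shared object means the file was replaced on disk while the daemon kept the old copy mapped, i.e. the Slurm binaries changed underneath a running process.

```shell
#!/bin/sh
# Hedged sketch of the /proc/<pid>/maps inspection. "slurmd", "slurmstepd",
# and "slurmctld" are the daemon names from this ticket.

check_maps() {  # $1 = process name; flag any PID with stale mappings
    for pid in $(pgrep "$1"); do
        # "(deleted)" in maps means the mapped file was replaced on disk
        # while this process kept the old copy mapped.
        if grep -q '(deleted)' "/proc/$pid/maps" 2>/dev/null; then
            echo "$1 pid $pid: stale (deleted) mappings - restart it"
        else
            echo "$1 pid $pid: mappings look current"
        fi
    done
}

# On mrcd[53,68]:    check_maps slurmd; check_maps slurmstepd
# On the controller: check_maps slurmctld
```

If any daemon shows deleted mappings, restarting it picks up the on-disk binaries and would explain authentication or protocol mismatches between old and new processes.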
(In reply to Nate Rini from comment #9)
> (In reply to Robert Romero from comment #0)
> > [2020-11-15T19:47:47.138] agent/is_node_resp: node:mrcd53 RPC:REQUEST_PING : Protocol authentication error
>
> Were the Slurm binaries updated or changed recently?

No, not recently; the last upgrade was over 4 months ago. The only thing done recently was "scontrol reconfigure", which took place on 11/12/2020.

> > AuthType=auth/munge
> Has the munge key (default: /etc/munge/munge.key) been synced to all the nodes, specifically mrcd[53,68]? Please call this on mrcd[53,68]:
> > remunge
> Please do not attach the munge key to this bug.

Yes, this would have been part of the upgrade process performed 4 months ago. The node failure was only on mrcd53 and was not seen on mrcd68, with mrcd53 immediately returning to normal after the single job was terminated.

> Please also call the following on mrcd[53,68]:
> > pgrep slurmd | xargs -i grep . -nH /proc/{}/maps
> > pgrep slurmstepd | xargs -i grep . -nH /proc/{}/maps
> Please also call the following on the host running slurmctld (controller):
> > pgrep slurmctld | xargs -i grep . -nH /proc/{}/maps

Output files have been uploaded. remunge performed:

[root@mrcd53 ~]# remunge
2020-11-20 11:21:09 Spawning 1 thread for encoding
2020-11-20 11:21:09 Processing credentials for 1 second
2020-11-20 11:21:10 Processed 25379 credentials in 1.000s (25374 creds/sec)

[root@mrcd68 ~]# remunge
2020-11-20 11:22:09 Spawning 1 thread for encoding
2020-11-20 11:22:09 Processing credentials for 1 second
2020-11-20 11:22:10 Processed 23888 credentials in 1.000s (23884 creds/sec)

@nate I'm noticing nodes reporting mismatched slurm.conf files, with the error " appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf." being reported.
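That warning is slurmd reporting that a hash of its local slurm.conf differs from the controller's. A quick way to spot which nodes have drifted is to compare file checksums from the controller. This is a sketch of mine, not from the ticket: the `conf_sum`/`check_host` helpers are hypothetical names, the config path is the common default, and mrcd53/mrcd68 are the nodes mentioned here.

```shell
#!/bin/sh
# Hedged sketch: compare a node's slurm.conf checksum against the
# controller's copy. CONF is the usual default path; adjust as needed.
CONF=${CONF:-/etc/slurm/slurm.conf}

conf_sum() {  # checksum of a file, first field only
    md5sum "$1" | awk '{print $1}'
}

check_host() {  # $1 = hostname; compare its copy of $CONF to the local one
    local_sum=$(conf_sum "$CONF")
    remote_sum=$(ssh "$1" "md5sum '$CONF'" | awk '{print $1}')
    if [ "$remote_sum" = "$local_sum" ]; then
        echo "$1: slurm.conf matches controller"
    else
        echo "$1: MISMATCH - resync and restart slurmd"
    fi
}

# Usage on this ticket's nodes (run from the controller):
#   for h in mrcd53 mrcd68; do check_host "$h"; done
```

Any node reporting a mismatch has not picked up the edited configuration, which matches the NO_CONF_HASH warning text quoted above.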
Could this still be from the scontrol reconfigure? We did add and remove nodes from queues without restarting slurmctld and the slurmd daemons, as we interpreted that the slurmctld daemon would have the slurmd daemons re-read the file once we ran scontrol reconfigure. Could this be the reason? We are continuing to see node failures.

(In reply to Robert Romero from comment #17)
> We did add and remove nodes from queues without restarting slurmctld and the slurmd daemons, as we interpreted that the slurmctld daemon would have the slurmd daemons re-read the file once we ran scontrol reconfigure.
>
> Could this be the reason? We are continuing to see node failures.

Have there been any major changes to the slurm.conf? I would suggest just syncing the file and restarting all of the slurmd daemons. It would be rather surprising for a slurm.conf sync issue to cause an issue with munge, but having slurmd daemons not getting updated could cause it.

Robert,

Timing this ticket out. Please respond if you have any more questions and we can continue from there.

Thanks,
--Nate
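Nate's suggested remediation (sync the file, then restart every slurmd) can be sketched as below. This is an assumption-laden outline, not a procedure from the ticket: the `sync_and_restart` helper is a hypothetical name, scp/systemd are assumed tooling, and a parallel-shell or configuration-management tool (pdsh, clush, ansible) would normally replace the plain loop.

```shell
#!/bin/sh
# Hedged sketch: push the controller's slurm.conf to a node and restart
# its slurmd so the daemon picks up the new file.
CONF=/etc/slurm/slurm.conf

sync_and_restart() {  # $1 = hostname
    scp "$CONF" "$1:$CONF" && ssh "$1" 'systemctl restart slurmd'
}

# Run from the controller for each node (mrcd53/mrcd68 are this ticket's
# nodes; extend to the full cluster), then reconfigure the controller:
#   for h in mrcd53 mrcd68; do sync_and_restart "$h"; done
#   scontrol reconfigure
```

Restarting slurmd rather than relying only on `scontrol reconfigure` guarantees each node daemon re-reads the same file the controller has, which addresses the hash-mismatch warning seen in this ticket.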