A little background on our setup: we have 3 Cray systems as well as some "external" nodes that don't live within any of the Cray systems. Each Cray system has a slurmctld that runs inside the Cray, plus a backup slurmctld on a VM outside the Cray. Each Cray node has a hostname of the form "nid#####", and these "nid" names are duplicated among the systems. Since we need interactive jobs to work from the shared login nodes, we created hostname aliases of the form "cluster-nid#####". Inside the Cray, the hosts file has entries for both forms; outside the Cray, nodes only have the "cluster-nid#####" entries.

On c4, for example, we use:

NodeName=nid0[0000-0015,0020-0075,0080-0143,0148-0459,0464-0527,0532-0843,0848-0911,0916-1295,1300-2687] NodeHostname=c4-nid0[0000-0015,0020-0075,0080-0143,0148-0459,0464-0527,0532-0843,0848-0911,0916-1295,1300-2687] <rest of the parameters>

This seems to work fine for both batch and interactive jobs.

Problem: last week our primary controller on c4 crashed. The backup on the VM took over, but quickly marked all the nodes as unresponsive and requeued all the running jobs, with messages like:

[2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host
[2020-02-26T12:29:18.041] requeue job JobId=268605023 due to failure of node nid01982

# scontrol --local show node=nid00035 | grep Node
NodeName=nid00035 Arch=x86_64 CoresPerSocket=18
   NodeAddr=c4-nid00035 NodeHostName=c4-nid00035 Version=19.05.5

Based on my reading of the docs for NodeName, NodeHostname, and NodeAddr, I would have expected this to use the c4-nid##### hostname instead of the nid##### hostname. That didn't seem to be the case... can you recommend how we should set this up?
Matt,

Can you please provide the following from each cluster:
> getent hosts nid00035
> getent hosts c4-nid00035

Can you please also attach your slurm.conf for each cluster.

Thanks,
--Nate
Sorry, we are having some connectivity issues after the tornado in Nashville severed one of the lab's fiber connections. Here's what I'm able to grab now; I can hopefully get a more complete response tomorrow.

# Primary controller for c4, inside the c4 Cray
c4-sys0:~ # hostname
c4-sys0
c4-sys0:~ # getent hosts nid00035
172.25.32.36    nid00035 c4-nid00035 c0-0c0s8n3
c4-sys0:~ # getent hosts c4-nid00035
172.25.32.36    nid00035 c4-nid00035 c0-0c0s8n3

# Backup controller for c4, outside the Cray
[root@c4-slurm-backup c4]# hostname
c4-slurm-backup.ncrc.gov
[root@c4-slurm-backup c4]# getent hosts nid00035
[root@c4-slurm-backup c4]# getent hosts c4-nid00035
172.25.32.36    c4-nid00035

As you can see, the backup controller doesn't know the 'NodeName' hostname but can resolve the 'NodeHostname' hostname. I suppose I could add the 'NodeName' aliases on the backup controllers (since they only talk to nodes in their own cluster), but I can't do that on the login nodes because the 'NodeName' hostnames are ambiguous/duplicated.
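The asymmetry in the two getent transcripts above is plain files-based NSS resolution. As a minimal Python sketch (the hosts-file contents below are samples modeled on the output above, not pulled from the real systems), this mimics a `getent hosts` lookup and reproduces the "Unknown host" difference between the two controllers:

```python
# Simulate a files-based "getent hosts <name>" lookup against two hosts
# files: the primary (inside the Cray) lists every alias, while the
# backup only lists the cluster-prefixed form.  Addresses and names are
# illustrative samples based on the getent output above.

PRIMARY_HOSTS = "172.25.32.36 nid00035 c4-nid00035 c0-0c0s8n3\n"
BACKUP_HOSTS = "172.25.32.36 c4-nid00035\n"

def getent_hosts(hosts_text, name):
    """Return (addr, [names]) for the first line whose canonical name or
    any alias matches, like an nss 'files' lookup; None if unresolvable."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, tokenize
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0], fields[1:]
    return None

# Primary resolves both forms; backup only the prefixed one.
assert getent_hosts(PRIMARY_HOSTS, "nid00035") is not None
assert getent_hosts(PRIMARY_HOSTS, "c4-nid00035") is not None
assert getent_hosts(BACKUP_HOSTS, "nid00035") is None           # "Unknown host"
assert getent_hosts(BACKUP_HOSTS, "c4-nid00035") is not None
```

The login-node constraint is the same shape: once two clusters both define "nid00035", a flat alias list can no longer disambiguate, which is why only the prefixed names can be shared.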
Matt,

Can you please provide a few of the lines surrounding this one:
> [2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host

Thanks,
--Nate
(In reply to Nate Rini from comment #4)
> Matt
>
> Can you please provide the a few of the lines above:
> > [2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host

Please also run:
> scontrol show fed

Are you using the overridden host names in the database to make "external" clusters in 19.05?

Can you please also attach your slurm.conf for each cluster. I assume this is one of the open systems.

Thanks,
--Nate
(In reply to Nate Rini from comment #4)
> Can you please provide the a few of the lines above:
> > [2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host

[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604657
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604659
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604661
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604663
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604665
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604667
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604669
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604672
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604674
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604676
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604678
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604680
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604682
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604684
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604686
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604688
[2020-02-26T12:27:39.266] _job_complete: JobId=268605057 WEXITSTATUS 1
[2020-02-26T12:27:39.267] email msg to wesley.ebisuzaki@noaa.gov: Slurm Job_id=268605057 Name=C128_convonly Failed, Run time 00:43:10, FAILED, ExitCode 1
[2020-02-26T12:27:39.267] _job_complete: JobId=268605057 done
[2020-02-26T12:28:50.859] error: _pick_step_nodes: JobId=268605038 StepId=Batch has no step_node_bitmap
[2020-02-26T12:28:51.967] _job_complete: JobId=268605038 WEXITSTATUS 0
[2020-02-26T12:28:51.968] _job_complete: JobId=268605038 done
[2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host
[2020-02-26T12:29:18.041] requeue job JobId=268605023 due to failure of node nid01982
[2020-02-26T12:29:18.042] error: Error connecting, bad data: family = 0, port = 0
[2020-02-26T12:29:18.044] error: Unable to resolve "nid00035": Unknown host
[2020-02-26T12:29:18.044] email msg to Bill.Hurlin: Slurm Job_id=268605023 Name=CM4_c192L33_am4p0_piControl_new Failed, Run time 09:05:36, NODE_FAIL, ExitCode 0
[2020-02-26T12:29:18.044] error: Error connecting, bad data: family = 0, port = 0
[2020-02-26T12:29:18.126] error: Unable to resolve "nid00047": Unknown host
[2020-02-26T12:29:18.126] requeue job JobId=268604868 due to failure of node nid01983
[2020-02-26T12:29:18.127] error: Error connecting, bad data: family = 0, port = 0
[2020-02-26T12:29:18.129] error: Unable to resolve "nid00047": Unknown host

[root@c4-slurm-backup c4]# scontrol show fed
Federation: gaea
Self:    c4:192.188.179.96:6817 ID:4 FedState:ACTIVE Features:
Sibling: c3:192.188.179.71:6817 ID:3 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:Yes
Sibling: es:192.188.179.88:6817 ID:1 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:Yes

The "external" cluster is "gfdl". It is not in the federation. We don't do any hostname trickery for that, just manually set the IP address in the database.
Created attachment 13268 [details] slurm.conf from c4-sys0
Created attachment 13270 [details] slurm.conf from c3-sys0
Created attachment 13271 [details] slurm.conf from es-slurm
(In reply to Matt Ezell from comment #7)
> The "external" cluster is "gfdl". It is not in the federation. We don't do
> any hostname trickery for that, just manually set the IP address in the
> database.

Any plans (in the near future) to upgrade the supported external clusters to 20.02?
(In reply to Matt Ezell from comment #7)
> [2020-02-26T12:28:50.859] error: _pick_step_nodes: JobId=268605038
> StepId=Batch has no step_node_bitmap

This looks like bug#7499 comment#73 and should be fully fixed in 20.02.
(In reply to Nate Rini from comment #11)
> Any plans (in near future) to upgrade to 20.02 for the supported external
> clusters?

It is on our to-do list, but it's behind upgrading Cray CLE, so we are probably several months out at the earliest.
(In reply to Nate Rini from comment #12)
> This look like bug#7499 comment#73 and should be fully fixed in 20.02.

Is this what caused all the running jobs to abort, or is it just a related problem? That bug is private and I cannot see the details. Is the fix backportable to 19.05, or does it depend on more fundamental changes?
(In reply to Matt Ezell from comment #15)
> Is this what caused all the running jobs to abort, or is this just a related
> problem?

This appears to be a symptom but not the cause of this issue. Is slurmctld SEGFAULTing?

> That bug is private and I cannot see the details.

NERSC agreed to open most of the ticket to the public. It should be open now.

> Is the fix backportable to 19.05, or is it based in more fundamental changes?

The patches change the RPC layer, so backporting would require all binaries to be updated at once to avoid RPC corruption. I don't advise it. Please see bug #7499 comment#82 for the patch list.
(In reply to Nate Rini from comment #16)
> This appears to be a symptom but not the cause of this issue. Is slurmctld
> SEGFAULTing?

slurmctld is SIGABRT'ing as described in bug #8584. When the secondary takes over, we are seeing the behavior described in this bug. I've been treating them as separate issues, but they could be related by more than chronology.
(In reply to Matt Ezell from comment #17)
> slurmctld is SIGABRT'ing as described in bug #8584. When the secondary
> takes over, we are seeing the behavior described in this bug. I've been
> treating them as separate issues, but they could be related by more than
> chronology.

Your backtrace looks different from the one in bug#7499 (kill_step_on_node() vs _attempt_backfill()), but I'll defer to Dominic since he already owns bug#8584. I will continue to pursue these independently.
Matt,

Still working on recreating your issue. It looks like shutting down (or SIGKILL-ing) slurmctld is not sufficient to trigger it.

--Nate
Matt,

Is it possible to get this output?
> sacct -j 268605038 -p -a

Thanks,
--Nate
(In reply to Matt Ezell from comment #0)
> # scontrol --local show node=nid00035 | grep Node
> NodeName=nid00035 Arch=x86_64 CoresPerSocket=18
>    NodeAddr=c4-nid00035 NodeHostName=c4-nid00035 Version=19.05.5
>
> Based on my reading of the docs for NodeName, NodeHostname, and NodeAddr, I
> would have expected this to use the c4-nid##### hostname instead of the
> nid##### hostname. That didn't seem to be the case... can you recommend how
> we should set this up?

It is providing the NodeName as configured for the given cluster.

The scontrol command is talking directly to the primary slurmctld on each cluster:
> [root@mgmtc2s2 ~]# scontrol -M cluster1 show node=node01|grep Node
>    NodeName=node01 Arch=x86_64 CoresPerSocket=6
>    NodeAddr=c1node01 NodeHostName=c1node01 Version=19.05.5
> [root@mgmtc2s2 ~]# scontrol -M cluster2 show node=node01|grep Node
>    NodeName=node01 Arch=x86_64 CoresPerSocket=6
>    NodeAddr=c2node01 NodeHostName=c2node01 Version=19.05.5
> [root@mgmtc2s2 ~]# scontrol -M cluster3 show node=node01|grep Node
>    NodeName=node01 Arch=x86_64 CoresPerSocket=6
>    NodeAddr=c3node01 NodeHostName=c3node01 Version=19.05.5

I admit this is quite confusing, but having Slurm return a more universal node name crosses into RFE territory.
(In reply to Matt Ezell from comment #7)
> [2020-02-26T12:29:18.127] error: Error connecting, bad data: family = 0,
> port = 0
> [2020-02-26T12:29:18.129] error: Unable to resolve "nid00047": Unknown host

Is it possible to get a few pages' worth of logs following these errors?
(In reply to Nate Rini from comment #21)
> It is providing the nodename as configured for the given cluster.
>
> The scontrol command is talking directly to the primary slurmctld on each
> cluster:
> > [root@mgmtc2s2 ~]# scontrol -M cluster1 show node=node01|grep Node
> >    NodeName=node01 Arch=x86_64 CoresPerSocket=6
> >    NodeAddr=c1node01 NodeHostName=c1node01 Version=19.05.5
> > [root@mgmtc2s2 ~]# scontrol -M cluster2 show node=node01|grep Node
> >    NodeName=node01 Arch=x86_64 CoresPerSocket=6
> >    NodeAddr=c2node01 NodeHostName=c2node01 Version=19.05.5
> > [root@mgmtc2s2 ~]# scontrol -M cluster3 show node=node01|grep Node
> >    NodeName=node01 Arch=x86_64 CoresPerSocket=6
> >    NodeAddr=c3node01 NodeHostName=c3node01 Version=19.05.5
>
> I admit this is quite confusing but it crosses into RFE territory to have
> Slurm return a more universal node name.

Sorry if I was unclear here. The 'scontrol' output is as I would expect. It's that the controller tried to reach out to the NodeName hostname (as indicated in the logs) instead of the NodeHostname or NodeAddr hostname that seems wrong.
(In reply to Matt Ezell from comment #23)
> It's that the controller tried to reach out to the NodeName hostname (as
> indicated in the logs) instead of the NodeHostname or NodeAddr hostname that
> seems wrong.

Agreed.

Getting more logs per comment #22 would be helpful. So far I have not been able to recreate this issue locally. I suspect there is a fun race condition here that your site managed to hit.
Created attachment 13335 [details] slurmctld log from c4-slurm-backup
(In reply to Nate Rini from comment #24)
> Agreed.
>
> Getting more logs per comment #22 would be helpful. So far I have not been
> able to recreate this issue locally. I suspect there is a fun race condition
> here that your site managed to hit.

I think attachment 13335 [details] should have 2 full occurrences from the backup controller.
Can you please confirm you're not running in front end mode?
(In reply to Nate Rini from comment #28) > Can you please confirm your not running in front end mode? We are not in frontend mode.
(In reply to Matt Ezell from comment #26)
> I think attachment 13335 [details] should have 2 full occurrences from the
> backup controller.

One of the most confusing parts of reading the logs is that output from all the threads is interleaved. I suggest setting this in your slurm.conf for future issues:
> LogTimeFormat=thread_id
Matt,

Can you please provide the topology.conf for each cluster?

Thanks,
--Nate
(In reply to Nate Rini from comment #31)
> Can you please provide the topology.conf for each cluster?

Interesting angle. Those use the NodeName:

# c3
SwitchName=s0 Nodes=nid0[0004-0067,0072-0127,0132-0383]
SwitchName=s1 Nodes=nid0[0388-0451,0456-0511,0516-0767]
SwitchName=s2 Nodes=nid0[0772-0835,0840-1151]
SwitchName=s3 Nodes=nid0[1156-1219,1224-1535]
SwitchName=root Switches=s[0-3]

# c4
SwitchName=s0 Nodes=nid0[0000-0015,0020-0075,0080-0143,0148-0383]
SwitchName=s1 Nodes=nid0[0384-0459,0464-0527,0532-0767]
SwitchName=s2 Nodes=nid0[0768-0843,0848-0911,0916-1151]
SwitchName=s3 Nodes=nid0[1152-1295,1300-1535]
SwitchName=s4 Nodes=nid0[1536-1919]
SwitchName=s5 Nodes=nid0[1920-2303]
SwitchName=s6 Nodes=nid0[2304-2687]
SwitchName=root Switches=s[0-6]

# es
# no topology.conf file exists
Matt,

How is DNS handled on the controllers? I was not able to trigger the issue even after setting up topology.

Thanks,
--Nate
(In reply to Nate Rini from comment #34)
> How is DNS handled on the controllers?

The compute nodes are in /etc/hosts and not in DNS. On the primary controller, both hostnames are in /etc/hosts, like:

# grep nid00008 /etc/hosts
172.25.32.9     nid00008 c4-nid00008 c0-0c0s2n0

Whereas on the backup controllers, only the c4-nid##### hostnames resolve:

# grep nid00008 /etc/hosts
172.25.32.9     c4-nid00008

If a host is not found in /etc/hosts, the lookup goes out to our center-wide DNS servers per nsswitch:

# grep hosts /etc/nsswitch.conf
# hosts defined in: Class[Dns::Client]
hosts: files dns myhostname

As a workaround, I suppose I could add the "bare" nid hostnames to the backup controllers.
(In reply to Matt Ezell from comment #35)
> Whereas on the backup controllers, only the c4-nid##### hostnames resolve:
>
> As a workaround, I suppose I could add the "bare" nid hostnames to the
> backup controllers.

Okay, now this makes sense. I assumed /etc/hosts (and/or DNS) would be the same within any given cluster. Yes, you need the host names to be resolvable (via getent) on the backup controllers, the same as on the primary controller. The backup controllers are just normal primary controllers when they are active and have the same requirements.

Please tell me if you have any more questions.

Thanks,
--Nate
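That requirement — every NodeHostname resolving on the backup before it can take over — is easy to audit ahead of a failover. Here's a sketch that assumes only the simple single-bracket hostlist ranges used in this ticket (Slurm's real hostlist syntax is richer, and `scontrol show hostnames` is the authoritative expander):

```python
import re
import socket

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'c4-nid0[0000-0002,0005]'.
    Handles a single bracketed range group only (enough for the configs
    in this ticket); zero-padding of the low bound is preserved."""
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\](.*)", expr)
    if not m:
        return [expr]  # plain hostname, nothing to expand
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # single value like "0005"
        for i in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{str(i).zfill(len(lo))}{suffix}")
    return hosts

def unresolvable(hostlist_expr):
    """Names this machine could NOT resolve -- run on the backup
    controller against the NodeHostname list before trusting failover."""
    bad = []
    for host in expand_hostlist(hostlist_expr):
        try:
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            bad.append(host)
    return bad

print(expand_hostlist("c4-nid0[0000-0002,0005]"))
# ['c4-nid00000', 'c4-nid00001', 'c4-nid00002', 'c4-nid00005']
```

Running `unresolvable(...)` on c4-slurm-backup against the NodeHostname expression from slurm.conf would have flagged nothing (those names are in its /etc/hosts), while the same call against the NodeName expression would have returned every node — matching the failure mode in the logs.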
(In reply to Nate Rini from comment #36)
> Okay, now this makes sense. I assumed /etc/hosts (and/or DNS) would be same
> within any given full cluster. Yes, you need to have the host names
> resolvable (via getent) on the backup controllers same as the primary
> controller. The backup controllers are just normal primary controllers when
> they are active and have the same requirements.

Why is it using NodeName? Isn't the point of NodeHostname that everything would use that instead?
(In reply to Matt Ezell from comment #35)
> Whereas on the backup controllers, only the c4-nid##### hostnames resolve:

Are all of the long host names from every cluster included here?

> If a host is not found in /etc/hosts, it goes out to our center-wide DNS
> servers per the nsswitch:

Is there any chance that the center-wide DNS was failing at the time?

(In reply to Matt Ezell from comment #37)
> (In reply to Nate Rini from comment #36)
> > Okay, now this makes sense. I assumed /etc/hosts (and/or DNS) would be same
> > within any given full cluster. Yes, you need to have the host names

Should have said NodeHostname.

> > resolvable (via getent) on the backup controllers same as the primary
> > controller.
>
> Why is it using NodeName?

It should not be using NodeName for any of the networking, but the CLI commands will for user/admin interaction (which can get really confusing). Slurm internally uses NodeName everywhere until the networking code, where it is converted to NodeHostname.

> Isn't the point of NodeHostname that everything would use that instead?

For networking only. On my test fed cluster, I don't have any DNS provider for the NodeName, only the NodeHostname, and everything works as expected (including failover with down nodes).

Did this issue only happen when bug#8584 caused a failover? Is it possible to try to trigger it again with some extra logging, since the current logs don't provide enough detail to find where it is failing?
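The NodeName/NodeHostname split described above can be sketched in a few lines. This is a hypothetical model of the lookup (the table is an illustrative fragment in the shape of the c4 config, not Slurm's actual data structures):

```python
# Sketch of the split: scheduling state is keyed by NodeName, and only
# the communication layer swaps in NodeAddr/NodeHostname before calling
# the resolver.  The table entries below are illustrative.

NODE_TABLE = {
    # NodeName -> comms names, as from slurm.conf NodeName=/NodeHostname= pairs
    "nid00035": {"node_hostname": "c4-nid00035", "node_addr": "c4-nid00035"},
}

def comms_target(node_name):
    """Name the controller should hand to the resolver: NodeAddr if set,
    else NodeHostname, else (last resort) the bare NodeName.  The last
    case is exactly what a backup controller without the bare aliases
    cannot resolve -- the "Unknown host" errors in comment #0."""
    rec = NODE_TABLE.get(node_name, {})
    return rec.get("node_addr") or rec.get("node_hostname") or node_name

assert comms_target("nid00035") == "c4-nid00035"  # normal path
assert comms_target("nid01982") == "nid01982"     # fallback: unresolvable on the backup
```

Under this model, the logged `Unable to resolve "nid00035"` implies the bare NodeName somehow reached the resolver, i.e. the conversion step was skipped or its mapping was missing at the time.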
(In reply to Nate Rini from comment #38)
> (In reply to Matt Ezell from comment #35)
> > Whereas on the backup controllers, only the c4-nid##### hostnames resolve:
> Are all of the long host names from every cluster included here?

No, only the local cluster.

> > If a host is not found in /etc/hosts, it goes out to our center-wide DNS
> > servers per the nsswitch:
> Is there any chance that the center-wide DNS was failing at the time?

None of the compute nodes are in center-wide DNS.

> Did issue only happened with bug#8584 causing a failover? Is it possible to
> try to trigger it again with some extra logging to find where it is failing
> since the logs don't provide enough details currently.

I can try to reproduce this on t4 (the test Cray machine), although the environment is slightly different (SLE15 instead of SLES12). I'll try SIGKILL'ing the primary controller and get back to you after I've had a chance to try it.
(In reply to Matt Ezell from comment #39)
> I can try to reproduce this on t4 (the test Cray machine) although the
> environment is slightly different (SLE15 instead of SLES12). I'll try
> SIGKILL'ing the primary controller and get back to you after I've had a
> chance to try it.

Please set SlurmctldDebug=debug4 and debugflags=agent,network.
(In reply to Nate Rini from comment #40)
> Please set SlurmctldDebug=debug4 and debugflags=agent,network.

# slurmctld
slurmctld: error: Invalid DebugFlag: network
slurmctld: error: DebugFlags invalid: agent,network
slurmctld: fatal: Unable to process configuration file
Created attachment 13390 [details] slurmctld log from t4-slurm-backup
(In reply to Matt Ezell from comment #41)
> # slurmctld
> slurmctld: error: Invalid DebugFlag: network
> slurmctld: error: DebugFlags invalid: agent,network
> slurmctld: fatal: Unable to process configuration file

The network flag was added in 20.02; please just use agent for now:
> debugflags=agent

Please also set:
> LogTimeFormat=thread_id

Thanks,
--Nate
(In reply to Nate Rini from comment #40)
> > I can try to reproduce this on t4 (the test Cray machine) although the
> > environment is slightly different (SLE15 instead of SLES12). I'll try

(In reply to Matt Ezell from comment #42)
> Created attachment 13390 [details]
> slurmctld log from t4-slurm-backup

Were any nodes set to down while logging?
(In reply to Nate Rini from comment #44)
> Were any nodes set to down while logging?

Sorry, it looks like the log I uploaded did not have all the relevant messages in it. I'll re-upload momentarily.

[2020-03-16T23:52:49.588] requeue job JobId=134262404 due to failure of node nid00008
[2020-03-16T23:52:49.588] debug3: select/cons_res: _rm_job_from_res: JobId=134262404 action 0
[2020-03-16T23:52:49.588] debug3: select/cons_res: removed JobId=134262404 from part batch row 0
[2020-03-16T23:52:49.588] debug3: make_node_comp: Node nid00008 being left DOWN
[2020-03-16T23:52:49.588] agent_trigger: pending_wait_time=65534->999 mail_too=F->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=1
[2020-03-16T23:52:49.588] debug: Spawning registration agent for nid[00024-00027] 4 hosts
[2020-03-16T23:52:49.588] agent_trigger: pending_wait_time=999->999 mail_too=F->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=2
[2020-03-16T23:52:49.588] error: Nodes nid[00008-00023] not responding, setting DOWN
Created attachment 13391 [details] slurmctld log from t4-slurm-backup
(In reply to Matt Ezell from comment #46)
> Created attachment 13391 [details]
> slurmctld log from t4-slurm-backup

There were no resolution errors:
> $ grep -c 'Unable to resolve' slurmctld.log.t4agentdebug
> 0

Were all the hosts added per comment #35 before the test?
(In reply to Nate Rini from comment #47)
> There were no resolution errors:
> > $ grep -c 'Unable to resolve' slurmctld.log.t4agentdebug
> > 0
>
> Were all the hosts added per comment #35 before the test?

No, not on t4:

# grep nid00008 /etc/hosts
172.25.56.9     t4-nid00008
# ping nid00008
ping: nid00008: Name or service not known

So maybe this didn't reproduce in the same way?
(In reply to Matt Ezell from comment #48)
> So maybe this didn't reproduce in the same way?

I would have expected errors like this one for DNS issues:
> error: Unable to resolve "nid00035": Unknown host

This issue looks like slurmds not responding in a timely manner:
> debug2: Error connecting slurm stream socket at 172.25.56.9:6818: Connection timed out

Was slurmd restarted at the same time? The kernel refused the first connection attempts, which suggests slurmd wasn't even listening:
> debug2: Error connecting slurm stream socket at 192.188.179.91:6817: Connection refused
(In reply to Nate Rini from comment #49)
> Was slurmd restarted at the same time? Kernel refused the first connection

No, but we do have some weirdness due to the way we are doing our routing:

# ping -c1 t4-nid00008
PING t4-nid00008 (172.25.56.9) 56(84) bytes of data.
From ncrc-rtr1-v405.ncrc.gov (192.188.179.66) icmp_seq=1 Redirect Host(New nexthop: t4-batch1.ncrc.gov (192.188.179.93))

--- t4-nid00008 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping -c2 t4-nid00008
PING t4-nid00008 (172.25.56.9) 56(84) bytes of data.
From ncrc-rtr1-v405.ncrc.gov (192.188.179.66) icmp_seq=1 Redirect Host(New nexthop: t4-batch1.ncrc.gov (192.188.179.93))
From ncrc-rtr1-v405.ncrc.gov (192.188.179.66): icmp_seq=1 Redirect Host(New nexthop: t4-batch1.ncrc.gov (192.188.179.93))
64 bytes from t4-nid00008 (172.25.56.9): icmp_seq=1 ttl=63 time=0.885 ms

--- t4-nid00008 ping statistics ---
1 packets transmitted, 1 received, +1 errors, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.885/0.885/0.885/0.000 ms

Maybe I should set up manual routes in the routing table instead of expecting the default gateway to take care of it.
(In reply to Matt Ezell from comment #50)
> Maybe I should setup manual routes in the routing table instead of expecting
> the default gateway to take care of it.

I was hoping to see a replication of the issue in comment #0. So far, none of my internal tests have shown the issue, so I must be missing something. Since your hosts are in /etc/hosts, the routing *shouldn't* matter, but I've been surprised before by libnss.
Matt,

Have you seen any errors like these?
> slurmctld: error: pack_msg: Invalid message version=0, type:4502
> slurmctld: error: _queue_rpc: failed to pack msg_type:4502

Your slurm.conf didn't have any lines with MsgAggregationParams, but I wanted to make sure you didn't somehow have it set up:
> scontrol show config | grep -i -e window -e agg

In bug#8697, we found that some cloud nodes might try to resolve early on startup when message aggregation is active.

I also set up my test cluster to only have the local long-form node names in DNS (at all), and I still don't see the resolution errors.

--Nate
(In reply to Nate Rini from comment #52)
> Have you seen any errors like these?
> > slurmctld: error: pack_msg: Invalid message version=0, type:4502
> > slurmctld: error: _queue_rpc: failed to pack msg_type:4502

No, not on the primary or backup.

> Your slurm.conf didn't have any lines with MsgAggregationParams but I wanted
> to make sure you didn't somehow have it setup:
> > scontrol show config | grep -i -e window -e agg

Nope:

# scontrol show config | grep -i -e window -e agg
MsgAggregationParams    = (null)

> I also setup my test cluster to only have the local long form nodes names in
> DNS (at all) and I still don't see the resolution errors.

Thanks for all your hard work trying to reproduce this - it seems that we don't have a reliable reproducer other than the referenced bug (which I can't reliably reproduce).
(In reply to Matt Ezell from comment #53)
> Thanks for all your hard work trying to reproduce this - it seems that we
> don't have a reliable reproducer other than the referenced bug (which I
> can't reliably reproduce).

At this point, the only way I can reproduce the issue is to remove the entries for the given nodes (NodeHostname) from /etc/hosts, and that doesn't look like the issue observed in comment #0.

With that, I'll close this ticket. Please reply to have it reopened and we can continue from there. If you see this happening again, please call gcore to grab a core from the slurmctld process; that should give us a starting point.