Felipe asked to open a separate bug for our case; logs and configs are attached to this bug report: https://bugs.schedmd.com/show_bug.cgi?id=6502
The problem cannot be in the slurm_prolog script itself. When the prolog fails I do not see the "start slurm_prolog" string in syslog, and the script starts with:

#!/bin/sh
logger "start slurm_prolog"
Tommi, I need a slurmd log that covers the full lifetime of a job; I need to see everything from the start, or even before it. As you noticed, r11c54's slurmd log starts at:

[2019-10-08T14:59:18.209] debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG

while the problematic job, according to slurmctld, was allocated at:

[2019-10-08T14:58:33.579] sched: Allocate JobId=293312 NodeList=r11c54 #CPUs=10 Partition=fmi

A dmesg from the node and an 'scontrol show job' would also be useful. Also, please set DebugLevel=debug2 and SlurmdDebug=debug2, and ensure the DebugFlags Agent, Protocol and Route are set. As you know this implies some performance penalty, so set the logs back to info or debug when you're done collecting them. I have a slurm.conf from March; it would be good to have an updated one, plus the topology.conf.
Hi,

Yesterday I enabled debug2 on the compute nodes and enabled the debugflags on slurmctld, but the problem disappeared :( So the problem may be some kind of race condition which is hard to catch?

I'm not able to access the "see also" ticket: You are not authorized to access bug #6493

Recent configs can be found on this ticket: https://bugs.schedmd.com/show_bug.cgi?id=7908

I checked from the logs: the general failure rate was about 4-5/day when debug2 was not enabled. I'll have to disable the debug2-level logging during the next weekend; we do not have disks on the compute nodes, and jobs with a lot of steps can fill the /var tmpfs filesystem quite fast. But I'll continue next week.

slurmctld.log-20191009.gz:[2019-10-08T13:39:46.912] error: validate_node_specs: Prolog or job env setup failure on node r07c25, draining the node
slurmctld.log-20191009.gz:[2019-10-08T14:59:22.939] error: validate_node_specs: Prolog or job env setup failure on node r11c54, draining the node
slurmctld.log-20191009.gz:[2019-10-08T22:25:59.511] error: validate_node_specs: Prolog or job env setup failure on node r05c48, draining the node
slurmctld.log-20191010.gz:[2019-10-09T09:30:29.592] error: validate_node_specs: Prolog or job env setup failure on node r11c65, draining the node
slurmctld.log-20191010.gz:[2019-10-09T12:47:37.447] error: validate_node_specs: Prolog or job env setup failure on node r02g03, draining the node
slurmctld.log-20191010.gz:[2019-10-09T15:41:20.668] error: validate_node_specs: Prolog or job env setup failure on node r07c05, draining the node
slurmctld.log-20191010.gz:[2019-10-09T16:29:43.271] error: validate_node_specs: Prolog or job env setup failure on node r11c53, draining the node
slurmctld.log-20191010.gz:[2019-10-09T16:43:05.134] error: validate_node_specs: Prolog or job env setup failure on node r01c38, draining the node
slurmctld.log-20191010.gz:[2019-10-09T17:08:35.986] error: validate_node_specs: Prolog or job env setup failure on node r01g08, draining the node
slurmctld.log-20191010.gz:[2019-10-09T18:14:04.076] error: validate_node_specs: Prolog or job env setup failure on node r02c01, draining the node
slurmctld.log-20191010.gz:[2019-10-09T19:31:03.604] error: validate_node_specs: Prolog or job env setup failure on node r16c10, draining the node
slurmctld.log-20191010.gz:[2019-10-10T00:22:13.406] error: validate_node_specs: Prolog or job env setup failure on node r11c71, draining the node
slurmctld.log-20191011.gz:[2019-10-10T09:57:16.019] error: validate_node_specs: Prolog or job env setup failure on node r15c14, draining the node
slurmctld.log-20191011.gz:[2019-10-10T10:59:20.881] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node
slurmctld.log-20191011.gz:[2019-10-10T13:11:59.753] error: validate_node_specs: Prolog or job env setup failure on node r04c10, draining the node
slurmctld.log-20191011.gz:[2019-10-10T14:32:46.488] error: validate_node_specs: Prolog or job env setup failure on node r03c23, draining the node
slurmctld.log-20191011.gz:[2019-10-10T17:05:56.299] error: validate_node_specs: Prolog or job env setup failure on node r14c19, draining the node
slurmctld.log-20191011.gz:[2019-10-11T02:55:01.292] error: validate_node_specs: Prolog or job env setup failure on node r05c50, draining the node
slurmctld.log-20191013.gz:[2019-10-11T07:59:40.250] error: validate_node_specs: Prolog or job env setup failure on node r11c54, draining the node
slurmctld.log-20191013.gz:[2019-10-11T09:53:09.073] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node
slurmctld.log-20191013.gz:[2019-10-11T13:43:13.716] error: validate_node_specs: Prolog or job env setup failure on node r02g01, draining the node
slurmctld.log-20191013.gz:[2019-10-11T18:00:59.428] error: validate_node_specs: Prolog or job env setup failure on node r15c01, draining the node
slurmctld.log-20191013.gz:[2019-10-11T21:59:13.154] error: validate_node_specs: Prolog or job env setup failure on node r07c19, draining the node
slurmctld.log-20191013.gz:[2019-10-12T00:27:37.752] error: validate_node_specs: Prolog or job env setup failure on node r17c13, draining the node
slurmctld.log-20191013.gz:[2019-10-12T01:21:43.237] error: validate_node_specs: Prolog or job env setup failure on node r05c44, draining the node
slurmctld.log-20191015.gz:[2019-10-14T04:22:33.650] error: validate_node_specs: Prolog or job env setup failure on node r01c16, draining the node
slurmctld.log-20191015.gz:[2019-10-14T06:19:10.071] error: validate_node_specs: Prolog or job env setup failure on node r06c15, draining the node
slurmctld.log-20191015.gz:[2019-10-14T15:19:13.857] error: validate_node_specs: Prolog or job env setup failure on node r05c34, draining the node
slurmctld.log-20191017.gz:[2019-10-15T03:24:39.769] error: validate_node_specs: Prolog or job env setup failure on node r06c04, draining the node
slurmctld.log-20191017.gz:[2019-10-15T08:39:50.923] error: validate_node_specs: Prolog or job env setup failure on node r03c08, draining the node
slurmctld.log-20191017.gz:[2019-10-15T09:38:58.090] error: validate_node_specs: Prolog or job env setup failure on node r03c08, draining the node
slurmctld.log-20191017.gz:[2019-10-15T12:02:08.873] error: validate_node_specs: Prolog or job env setup failure on node r04c46, draining the node
slurmctld.log-20191017.gz:[2019-10-15T13:36:50.139] error: validate_node_specs: Prolog or job env setup failure on node r17g05, draining the node
slurmctld.log-20191017.gz:[2019-10-15T15:36:21.258] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node
slurmctld.log-20191017.gz:[2019-10-16T11:15:04.401] error: validate_node_specs: Prolog or job env setup failure on node r02g02, draining the node
slurmctld.log-20191017.gz:[2019-10-16T12:22:36.146] error: validate_node_specs: Prolog or job env setup failure on node r12c11, draining the node
slurmctld.log-20191017.gz:[2019-10-16T16:32:24.111] error: validate_node_specs: Prolog or job env setup failure on node r01g02, draining the node
slurmctld.log-20191017.gz:[2019-10-16T17:23:00.217] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node
slurmctld.log-20191017.gz:[2019-10-17T00:27:28.460] error: validate_node_specs: Prolog or job env setup failure on node r16c25, draining the node
slurmctld.log-20191018.gz:[2019-10-17T10:48:58.984] error: validate_node_specs: Prolog or job env setup failure on node r15g04, draining the node
slurmctld.log-20191018.gz:[2019-10-17T11:56:51.536] error: validate_node_specs: Prolog or job env setup failure on node r12c08, draining the node
slurmctld.log-20191018.gz:[2019-10-17T12:11:58.865] error: validate_node_specs: Prolog or job env setup failure on node r12c03, draining the node
(In reply to Tommi Tervo from comment #4)
> Hi,
>
> Yesterday I enabled debug2 on compute nodes and enabled debugflags on
> slurmctld but problem disappeared :(

Typical case: when you look at the problem, it just fixes itself :P

> So problem may be some kind of race condition which is hard to catch?

I think so, but right now I cannot say if it is an RPC being processed in an incorrect order, or a missing one, or just some issue with the nodes... I cannot say anything without more info.

> I'm not able to access "see also"-ticket:
> You are not authorized to access bug #6493

There's no relevant information there; it is an internal unresolved ticket with a similar issue to yours, in a specific environment, when running the testsuite.

> Recent configs can be found on this ticket:
> https://bugs.schedmd.com/show_bug.cgi?id=7908

Good.

> I checked from the logs, general failure rate was about 4-5/day when debug2
> was not enabled. I've to disable debug2-level log during the next weekend,
> we do not have disks on the compute nodes and jobs with lot of steps can
> fill /var tmpfs filesystem quite fast. But I'll continue on next week.

It's fine, let's see what happens next week.
Created attachment 12021 [details] slurmctld debug logs
Created attachment 12022 [details] compute node debug log

Now we have the prolog error logs at the requested debug level.
What does nhc.log say? It exited with status 0x0100; can you attach your nhc script too?

[2019-10-21T13:45:27.479] debug2: Start processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug: attempting to run health_check [/usr/sbin/nhc]
[2019-10-21T13:45:30.753] error: /usr/sbin/nhc: exited with status 0x0100
[2019-10-21T13:45:30.753] debug2: Finish processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:46:06.209] error: Waiting for JobId=356502 prolog has failed, giving up after 50 sec
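As an aside, the 0x0100 value above is a raw wait(2)-style status, where the child's exit code sits in the high byte. A quick, generic way to decode it (a shell sketch, not a Slurm tool):

```shell
# Decode a raw wait() status: exit code is in bits 8-15, termination signal in bits 0-6
status=0x0100
printf 'exit code %d, signal %d\n' $(( status >> 8 & 0xff )) $(( status & 0x7f ))
```

So 0x0100 simply means the health check script exited with code 1.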
Created attachment 12023 [details] NHC logs and scripts: lbnl-nhc-1.4.2-1.el7.noarch
I tracked down an error showing up in your nhc.log, and I suspect there may be a relation between these lines:

/etc/nhc/scripts/common.nhc: line 381: slurm.r12c05.log-20191018.gz: syntax error: invalid arithmetic operator (error token is ".r12c05.log-20191018.gz")
NHC watchdog timer 164451 (60 secs) has expired. Signaling NHC: kill -s ALRM -- -164450
/usr/sbin/nhc: line 485: kill: (-164450) - No such process

and these checks:

@cn@ || check_ps_loadavg 400
@cn@ || check_ps_service -S -u root auditd
@cn@ || check_ps_service -r rpcbind
@cn@ || check_ps_service -S ntpd
@cn@ || check_ps_service -S -u root sshd
@cn@ || check_ps_service -S -d rsyslogd -u root rsyslog
@cn@ || check_ps_service -S -u root crond

This could be the cause of:

[2019-10-21T13:45:27.479] debug2: Start processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug: attempting to run health_check [/usr/sbin/nhc]
[2019-10-21T13:45:30.753] error: /usr/sbin/nhc: exited with status 0x0100
[2019-10-21T13:45:30.753] debug2: Finish processing RPC: REQUEST_HEALTH_CHECK

In fact, after that I found: https://github.com/mej/nhc/issues/25

It could be related to your issue, so I will ask you to fix it before I continue looking. Could you please fix it?
Thanks, I'll patch nhc tomorrow. I just noticed another failure where nhc is not interfering. Unfortunately there were no debug2-level logs, but that 2^32 stepd looks a bit suspicious?

[2019-10-21T17:09:24.319] error: Waiting for JobId=356826 prolog has failed, giving up after 50 sec
[2019-10-21T17:09:24.365] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:24.371] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:25.366] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:25.372] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:26.367] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:26.372] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:27.368] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:27.373] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:28.369] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:28.374] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:31.507] error: /usr/sbin/nhc: exited with status 0x0100
(In reply to Tommi Tervo from comment #11)
> Thanks, I'll patch nhc tomorrow. I just noticed another failure where nhc is
> not interfering. Unfortunately there was no debug2 level logs but that 2^32
> stepd looks a bit suspicious?
>
> [2019-10-21T17:09:24.319] error: Waiting for JobId=356826 prolog has failed,
> giving up after 50 sec
> [2019-10-21T17:09:24.365] error: stepd_connect to 356826.4294967295 failed:
> No such file or directory
> ......
> [2019-10-21T17:09:31.507] error: /usr/sbin/nhc: exited with status 0x0100

Well, I see an NHC message too, so it can interfere. Don't worry about the 2³²; it is the ID of the extern step. I'd need a debug2 log to say more, and a 'ps aux' on the node, but it seems slurmd/stepd was trying to connect to the extern step and it wasn't there anymore.

I'd suggest commenting out the process checks in NHC right now if you can, and later applying the patch and fixing the bug.

Aside from that, I have to admit I now see behavior similar to that in 6502: the ctld is sending REQUEST_LAUNCH_PROLOG, but it never gets to the node due to "Communication connection failure":

[2019-10-21T13:45:16.200] retry_list retry_list_size:2 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,0,0,0
[2019-10-21T13:45:16.200] retry_list retry_list_size:1 msg_type=REQUEST_BATCH_JOB_LAUNCH,0,0,0,0
[2019-10-21T13:45:16.200] agent: Agent_cnt=1 agent_thread_cnt=12 with msg_type=REQUEST_BATCH_JOB_LAUNCH retry_list_size=0
[2019-10-21T13:45:16.200] agent: Agent_cnt=1 agent_thread_cnt=12 with msg_type=REQUEST_LAUNCH_PROLOG retry_list_size=0
[2019-10-21T13:45:16.200] _make_agent_info: sending msg_type REQUEST_BATCH_JOB_LAUNCH to nodes r12c05
[2019-10-21T13:45:16.200] _make_agent_info: sending msg_type REQUEST_LAUNCH_PROLOG to nodes r12c05
[2019-10-21T13:45:16.200] agent: New agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704
[2019-10-21T13:45:16.200] agent: New agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_LAUNCH_PROLOG protocol_version:8704
[2019-10-21T13:45:16.200] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r12c05
[2019-10-21T13:45:16.200] _thread_per_group_rpc: sending REQUEST_LAUNCH_PROLOG to r12c05
[2019-10-21T13:45:16.216] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704
[2019-10-21T13:45:17.681] retry_list retry_list_size:1 msg_type=REQUEST_LAUNCH_PROLOG,0,0,0,0
..
[2019-10-21T13:46:16.256] agent/is_node_resp: node:r12c05 RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure

So it is understandable that the node doesn't run the prolog and fails with an error. Interestingly, it does receive the REQUEST_BATCH_JOB_LAUNCH:

[2019-10-21T13:45:16.201] debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-10-21T13:45:16.201] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-10-21T13:45:16.202] debug2: _group_cache_lookup_internal: no entry found for westerl2
[2019-10-21T13:45:27.479] debug2: Start processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug: attempting to run health_check [/usr/sbin/nhc]
[2019-10-21T13:45:30.753] error: /usr/sbin/nhc: exited with status 0x0100
[2019-10-21T13:45:30.753] debug2: Finish processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:46:06.209] error: Waiting for JobId=356502 prolog has failed, giving up after 50 sec

I must investigate what happens when it receives REQUEST_BATCH_JOB_LAUNCH, and whether there is some condition that makes the node get stuck, especially after _group_cache_lookup_internal. Working on it while I wait for the nhc fix.
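On why the extern step prints as 4294967295: the logs show the step ID in decimal, and that value is just the all-ones 32-bit sentinel Slurm uses for the extern step. A quick sanity check in the shell:

```shell
# 356826.4294967295 is job 356826 plus a sentinel step id, not a real step number
printf '%#x\n' 4294967295   # prints 0xffffffff, i.e. (2^32)-1
```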
Though I don't have more logs from surendra, I see that you both use NHC:

Apr 24 10:14:41 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_HEALTH_CHECK
Apr 24 10:14:41 r5i0n10 slurmd[3441]: debug: attempting to run health_check [/usr/sbin/nhc]
Apr 24 10:15:49 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:15:49 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:15:50 r5i0n10 slurmd[3441]: debug2: got this type of message 1008
Apr 24 10:16:19 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:16:19 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:16:35 r5i0n10 slurmd[3441]: debug2: got this type of message 4005
Apr 24 10:16:35 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
Apr 24 10:16:35 r5i0n10 slurmd[3441]: debug2: _group_cache_lookup_internal: no entry found for jacl0659
Apr 24 10:16:35 r5i0n10 slurmd[3441]: task_p_slurmd_batch_request: 689894
Apr 24 10:16:35 r5i0n10 slurmd[3441]: task/affinity: job 689894 CPU input mask for node: 0xFFFFFFFFF
Apr 24 10:16:35 r5i0n10 slurmd[3441]: task/affinity: job 689894 CPU final HW mask for node: 0xFFFFFFFFF
Apr 24 10:16:49 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:16:49 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:17:19 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:17:19 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:17:25 r5i0n10 slurmd[3441]: error: Waiting for JobId=689894 prolog has failed, giving up after 50 sec

In both situations I also see _group_cache_lookup_internal messages.

It would be interesting to see whether there are any NSS/SSSD logs on the node.

Note that you have CacheGroups=0 set in your slurm.conf from bug 7908. This parameter has been deprecated since commit 48a4cdf, back in 2016. See the GroupUpdate* parameters and RELEASE_NOTES. You're probably receiving a warning when starting slurmctld.
> In both situations I also see _group_cache_lookup_internal messages.
>
> It would be interesting to see also if there are any NSS/SSSD logs in the
> node.

We have no NSS/SSSD on the nodes, just plain /etc/passwd and /etc/group files.

> Note that you have CacheGroups=0 set in your slurm.conf from bug 7908. This
> parameter was deprecated since commit 48a4cdf back in 2016. See GroupUpdate*
> parameters and RELEASE_NOTES. Probably you're receiving a warning when
> starting slurmctld.

I'll disable cachegroups on the next slurm.conf update.

I patched NHC but it still fails miserably if slurmd runs it:

/etc/nhc/scripts/common.nhc: line 381: slurm.r07c03.log-20191018.gz: syntax error: invalid arithmetic operator (error token is ".r07c03.log-20191018.gz")
NHC watchdog timer 32444 (60 secs) has expired. Signaling NHC: kill -s ALRM -- -32443
/usr/sbin/nhc: line 491: kill: (-32443) - No such process

But from the command line and cron it works as expected:

r07c03
20191023 15:40:12 [slurm] /usr/libexec/nhc/node-mark-online r07c03
/usr/libexec/nhc/node-mark-online: Node r07c03 is already online.
/usr/libexec/nhc/node-mark-online: Skipping alloc node r07c03 (none

I'll configure NHC to run from cron on all nodes tomorrow and see if it makes any difference.
(In reply to Tommi Tervo from comment #14)
> > In both situations I also see _group_cache_lookup_internal messages.
> >
> > It would be interesting to see also if there are any NSS/SSSD logs in the
> > node.
>
> We have no NSS/SSSD on nodes, plain /etc/passwd and /etc/group files
>
> > Note that you have CacheGroups=0 set in your slurm.conf from bug 7908. This
> > parameter was deprecated since commit 48a4cdf back in 2016. See GroupUpdate*
> > parameters and RELEASE_NOTES. Probably you're receiving a warning when
> > starting slurmctld.
>
> I'll disable cachegroups on next slurm conf update.
>
> I patched NHC but it still fails miserably if slurmd runs it:
>
> /etc/nhc/scripts/common.nhc: line 381: slurm.r07c03.log-20191018.gz: syntax
> error: invalid arithmetic operator (error token is ".r07c03.log-20191018.gz")
> NHC watchdog timer 32444 (60 secs) has expired. Signaling NHC: kill -s
> ALRM -- -32443
> /usr/sbin/nhc: line 491: kill: (-32443) - No such process
>
> But from command line and cron it works as expected:
>
> r07c03
> 20191023 15:40:12 [slurm] /usr/libexec/nhc/node-mark-online r07c03
> /usr/libexec/nhc/node-mark-online: Node r07c03 is already online.
> /usr/libexec/nhc/node-mark-online: Skipping alloc node r07c03 (none
>
> I'll configure NHC to run from cron on all nodes on tomorrow and see if it
> makes any difference.

It really *shouldn't* make a difference. The only difference you may have found is which processes were running on the node at the moment of execution; maybe with the cron run there were no process names with dots. I would suggest instead removing the process checks from the nhc config.
Hi Tommi,

Do you have any news for that issue?

Did the modifications to NHC help?

Thanks
(In reply to Felip Moll from comment #16)
> Hi Tommi,
>
> Do you have any news for that issue?
>
> Did the modifications to NHC help?

Hi,

Nope, separating nhc from slurm did not help. We are seeing another variant of this problem: the prolog is somehow stuck. The first thing it should do is log the jobid to syslog, but it does not do that:

/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.930] error: Waiting for JobId=415346 prolog has failed, giving up after 50 sec
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.979] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:26.980] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:27.980] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:28.981] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:29.982] error: stepd_connect to 415346.4294967295 failed: No such file or directory

[root@r05c27 ~]# grep 415346 /var/log/syslog
/var/log/syslog:1572942865 2019 Nov 5 10:34:25 r05c27 user notice root start slurm_epilog for job: 415346
/var/log/syslog:1572942866 2019 Nov 5 10:34:26 r05c27 user notice root ending_2 slurm_epilog for job: 415346.

[root@r05c27 ~]# head /etc/slurm/slurm_prolog
#!/bin/sh
logger "start slurm_prolog for job: ${SLURM_JOB_ID}"
(In reply to Tommi Tervo from comment #17)
> (In reply to Felip Moll from comment #16)
> > Hi Tommi,
> >
> > Do you have any news for that issue?
> >
> > Did the modifications to NHC help?
>
> Hi,
>
> Nope, separating nhc from slurm did not help. We are seeing another variant
> of this problem, prolog is somehow stuck. First thing what is should do is
> to log jobid to syslog but it does not do that:

But did you just separate it, or did you also remove the 'ps' checks? Even if you separate it, the same 'ps' command can still exhaust the resources on the machine, so it could still have some impact. I really would like you to also disable the process checks in nhc.conf to completely rule out NHC issues.

> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.930] error: Waiting for
> JobId=415346 prolog has failed, giving up after 50 sec
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.979] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:26.980] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:27.980] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:28.981] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:29.982] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
>
> [root@r05c27 ~]# grep 415346 /var/log/syslog
> /var/log/syslog:1572942865 2019 Nov 5 10:34:25 r05c27 user notice root
> start slurm_epilog for job: 415346
> /var/log/syslog:1572942866 2019 Nov 5 10:34:26 r05c27 user notice root
> ending_2 slurm_epilog for job: 415346.
>
> [root@r05c27 ~]# head /etc/slurm/slurm_prolog
> #!/bin/sh
> logger "start slurm_prolog for job: ${SLURM_JOB_ID}"

The thread that is waiting for the prolog to run is not the same thread that executes it. So even though you see "Waiting for JobId=415346 prolog has failed, giving up after 50 sec", that doesn't mean the prolog ever started at all.

After you see the issue happening again, please attach fresh logs. I expect not to see the NHC error anymore, and I am interested in seeing the new workflow. It would also be good to try to catch the error while it is happening, but I understand you don't have a specific pattern to reproduce it; is that right?
> But did you just separate it or removed the 'ps' checks? Even if you
> separate it, the same 'ps' command can still exhaust the resources in the
> machine so it could still have some impact.
> I really would like you to also disable the process checks in nhc.conf to
> completely discard NHC issues.

I disabled all but the loadavg check when I did the nhc separation:

#######################################################################
###
### Process checks
###
@cn@ || check_ps_loadavg 400
# @cn@ || check_ps_service -S -u root auditd
# @cn@ || check_ps_service -r rpcbind
# @cn@ || check_ps_service -S ntpd
# @cn@ || check_ps_service -S -u root sshd
# @cn@ || check_ps_service -S -d rsyslogd -u root rsyslog
# @cn@ || check_ps_service -S -u root crond

> The thread who is waiting for the prolog to run is not the same thread than
> executes it. So despite you see "Waiting for JobId=415346 prolog has failed,
> giving up after 50 sec", doesn't mean the prolog has started at all.

Ah, this makes sense.

> After you see the issue happening again, please attach fresh logs. I expect
> not to see the NHC error anymore and I am interested in seeing the new
> workflow.
> It would be also good to try to catch the error when it is happening but I
> understand you don't have a specific pattern to reproduce, is that right?

I'll enable debug logging tomorrow if the day is calm enough. I haven't figured out the pattern for when this problem occurs; if I resume the failed-prolog nodes, they run jobs normally.

Best Regards,
Tommi
Created attachment 12272 [details] New logs without NHC
Created attachment 12273 [details] logs from slurmctld
Tommi, the slurmd log starts too late. From slurmctld I can see how the messages are queued:

[2019-11-08T13:33:29.228] backfill: Started JobId=427942 in gpu on r13g04
..
[2019-11-08T13:33:29.299] _make_agent_info: sending msg_type REQUEST_LAUNCH_PROLOG to nodes r13g04
[2019-11-08T13:33:29.299] _make_agent_info: sending msg_type REQUEST_BATCH_JOB_LAUNCH to nodes r13g04
..
[2019-11-08T13:33:29.299] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r13g04
[2019-11-08T13:33:29.299] _thread_per_group_rpc: sending REQUEST_LAUNCH_PROLOG to r13g04
[2019-11-08T13:33:29.300] agent_trigger: pending_wait_time=999->999 mail_too=F->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=6
[2019-11-08T13:33:29.300] agent_trigger: pending_wait_time=999->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=6
[2019-11-08T13:33:29.300] retry_list retry_list_size:6 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,SRUN_PING ...

You can see two launch prolog and two batch job launch messages in the list, probably because other jobs were also starting (e.g. 427943), but I cannot say for sure because I don't have earlier logs.

On the slurmd side, this is the first entry I see related to this job, which means the RPCs were received before the first entry in the log, which happens to be at [2019-11-08T13:33:33.046] (later than 13:33:29):

[2019-11-08T13:34:19.307] error: Waiting for JobId=427942 prolog has failed, giving up after 50 sec

Please, can you send me the full logs? The debug level and agent queue traces are being helpful. I encourage you to always send full logs to facilitate our analysis.

Thanks, and sorry for the delay in my response.
Created attachment 12347 [details] full slurmctl log
Created attachment 12348 [details] slurmd log from failed job 471596
It is interesting that in most of the logs I've seen, even though the two RPCs are sent from the ctld, I often see a gap:

[2019-11-19T10:53:39.335] debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-11-19T10:53:39.335] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-11-19T10:53:39.336] debug2: _group_cache_lookup_internal: no entry found for huiskone
[2019-11-19T10:53:50.603] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE

On this occasion it happens between the _group_cache_lookup_internal and the next operation: about 11 seconds.

Having removed NHC from the picture at least rules out one problem. Does dmesg show anything? I don't think I've seen a dmesg from one of your nodes before; one from r04c14 would be nice to have (dmesg -T).

I'll keep looking. Sorry for the delay this time, I was busy at SC19.
Created attachment 12434 [details] Dmesg

Hi,

dmesg does not show any issues for Nov 19th, and neither does syslog. And we have flat /etc/passwd files on the compute and slurmctld nodes.

Best Regards,
Tommi Tervo
Tommi,

Do you happen to have the configure line used when building Slurm?
Created attachment 12437 [details] rpmbuild config.log rpmbuild config log attached
I am working on this issue as my first priority this week. One more question:

[2019-11-19T10:53:03.013] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38356
[2019-11-19T10:53:03.013] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38356
[2019-11-19T10:53:03.015] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38358
[2019-11-19T10:53:03.015] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38358
[2019-11-19T10:53:03.017] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38360
[2019-11-19T10:53:03.017] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38360
[2019-11-19T10:53:03.018] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38362
[2019-11-19T10:53:03.019] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38362
[2019-11-19T10:53:03.020] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38364

$ grep -r REQUEST_BUILD | wc -l
30071

Can you please investigate who is issuing this specific RPC so aggressively? These commands emit this kind of RPC:

scontrol show config
sview
sprio

An 'sdiag' can help, specifically these fields:

Remote Procedure Call statistics by message type
Remote Procedure Call statistics by user

Studying the code, I have some suspects about this.
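To narrow down the source while waiting for sdiag output, a rough one-liner over the slurmctld log can rank client addresses by how many REQUEST_BUILD_INFO RPCs they sent (a sketch; the log path is an assumption, adjust it for your setup):

```shell
# Count REQUEST_BUILD_INFO RPCs per source IP seen by slurmctld,
# most aggressive client first
grep 'received opcode REQUEST_BUILD_INFO' /var/log/slurm/slurmctld.log \
    | sed 's/.* from \([0-9.]*\):.*/\1/' \
    | sort | uniq -c | sort -rn | head
```

The source port changes on every connection, so grouping by IP (dropping the `:port` suffix) is what reveals a single chatty host.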
> Can you please investigate who is issuing this specific RPC so aggressively? Hi, We have maintenance break ongoing so I'm busy today. But hopefully tomorrow everything is running smoothly after OS/IB/Slurm/adminplane upgrade so I can investigate this issue which is really annoying for us and our users. Thanks, Tommi
I've found something. In slurmctld there's the _slurmctld_background thread, which calls agent_trigger() every second. This call is logged like this:

[2019-11-19T10:53:35.317] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:53:36.319] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:53:37.321] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:53:38.322] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0

The important fields here are agent_thread_cnt and retry_list_size. The first is a static variable which globally indicates how many threads have been spawned by all the agents in the system, and retry_list_size counts how many new RPCs, plus how many old ones needing a retry, are in the system (other RPCs reside in the defer_list).

To expand a bit: an agent() is a processor for the RPCs in the retry_list and the defer_list. Agents are spawned thanks to a broadcast signal sent from agent_trigger() to the main agent() thread in slurmctld. As said before, this can happen every second from the _slurmctld_background thread, but that is not the only way: it can also be triggered by a specific request, such as a launch_prolog() call. You can generally tell where agent_trigger() was called from by looking at pending_wait_time: the background thread calls it with RPC_RETRY_INTERVAL=60, while a launch_prolog() ends up with 999, e.g.:

... pending_wait_time=65534->60 ...
... pending_wait_time=65534->999 ...

Think of it as a producer-consumer model. The consumer is the main agent() in slurmctld, and the producer is anyone who calls agent_trigger().
Well, when agent_trigger() broadcasts, the main agent() processes the retry_list and logs things like:

[2019-11-19T10:53:39.333] retry_list retry_list_size:2 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,0,0,0
[2019-11-19T10:53:39.333] retry_list retry_list_size:1 msg_type=REQUEST_BATCH_JOB_LAUNCH,0,0,0,0

and it basically picks up ONE SINGLE RPC from the retry_list: normally the newest one which has never been sent, otherwise the oldest one. What does it do with this RPC? It just spawns a new agent() thread to process it, which can in fact target one or more nodes, so you will see messages like:

[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_LAUNCH_PROLOG retry_list_size=1
[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_BATCH_JOB_LAUNCH retry_list_size=0

In this example, which is the one from your logs, you can see that the retry_list previously had two msg_types. It processed the first one (REQUEST_LAUNCH_PROLOG) and spawned an agent. The next iteration processed the second one and spawned another agent. The two agents then emit their own log lines, which even show the global retry_list_size, so a human-readable interleaving looks like:

[2019-11-19T10:53:39.333] retry_list retry_list_size:2 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,0,0,0   <-- agent processes first RPC
[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_LAUNCH_PROLOG retry_list_size=1   <-- this is the first spawned thread
[2019-11-19T10:53:39.333] retry_list retry_list_size:1 msg_type=REQUEST_BATCH_JOB_LAUNCH,0,0,0,0   <-- agent processes second RPC
[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_BATCH_JOB_LAUNCH retry_list_size=0   <-- this is the second spawned thread

Great: after that we have two agents, one for each RPC, and moreover the retry_list is already empty.
The new agents then take care of sending the messages to the respective nodes. When one of these secondary agents finishes its work, you will see a message like:

[2019-11-19T10:53:39.349] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704

The thread_count here indicates how many threads are still remaining in that agent: 1, which is the agent itself, so the other threads launched to serve the RPCs are already gone. Good. You can also see the get_reply:T flag, which indicates that after sending the RPC we waited for a response and received it. Good.

How do the spawned agents send the RPC? First of all, the agent registers globally the number of threads it will use (rpc_thread_cnt), which will generally be 2 + the number of nodes the RPC goes to, capped at AGENT_THREAD_COUNT=10 sender threads when the node count exceeds that limit; so a single agent cannot spawn more than 10 sender threads on its own. In your example, for these two RPCs each spawned agent uses 3 threads: one for sending to the single node r04c14, one for the agent itself, and one for the watchdog. After that it sets the parameters for the new threads (_make_agent_info), creates the watchdog thread (which will send a SIGUSR1 to threads which have been active for too long), and creates the designated sender threads, _thread_per_group_rpc():

[2019-11-19T10:53:39.334] _thread_per_group_rpc: sending REQUEST_LAUNCH_PROLOG to r04c14
[2019-11-19T10:53:39.334] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r04c14

This new thread splits and groups the list of nodes using the route plugin and the tree width parameter, and calls the network functions to send the RPC across the network. If specified in the agent info, it waits for a response. When all the work is finished, the thread dies, the spawned agent terminates, the log message

[2019-11-19T10:53:39.349] agent: end agent ...

is logged, and everybody is happy.
------------

But well, your situation is not happy. Look at what happens after _thread_per_group_rpc:

[2019-11-19T10:53:39.334] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r04c14
[2019-11-19T10:53:39.349] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704
[2019-11-19T10:53:39.349] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:53:40.576] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:53:41.578] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:53:42.579] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
...

The agent_thread_cnt is 3 and does not decrease. This means the agent spawned to send the REQUEST_LAUNCH_PROLOG is waiting for _thread_per_group_rpc, which is probably stuck in some network function.

Twenty seconds later, still there: one agent, three threads (itself, the watchdog, and _thread_per_group_rpc):

[2019-11-19T10:53:59.616] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:00.617] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0

Later, among the storm of REQUEST_BUILD_INFO which I asked you to investigate in comment 31, still locked:

...
[2019-11-19T10:54:02.620] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:39842
[2019-11-19T10:54:02.620] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:02.621] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:39844
...

We even see new RPCs being served, but our three threads and one agent are still there:

...
[2019-11-19T10:54:33.968] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:SRUN_PING protocol_version:8704
[2019-11-19T10:54:33.968] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=0

... and we finally see this:

[2019-11-19T10:54:37.962] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=0
[2019-11-19T10:54:38.964] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=0
[2019-11-19T10:54:39.394] agent/is_node_resp: node:r04c14 RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure <----------------------------------
[2019-11-19T10:54:39.619] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_LAUNCH_PROLOG protocol_version:8704 <-----
[2019-11-19T10:54:39.619] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:39.965] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:40.967] agent_trigger: pending_wait_time=65534->999 mail_too=F->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=1
[2019-11-19T10:54:40.967] agent_trigger: pending_wait_time=999->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=1

Our agent has terminated due to a communication connection failure.

-----

Added to that, I can see the following: eventually there is a communication problem with another node, and this time it is handled correctly, going from 15 to 12 threads and destroying that agent and its thread count:

..
[2019-11-19T10:55:27.735] agent/is_node_resp: node:r07c42 RPC:REQUEST_BATCH_JOB_LAUNCH : Communication connection failure <------------ AGAIN
..
[2019-11-19T10:55:27.959] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=15 retry_list_size=0
[2019-11-19T10:55:27.959] agent: New agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_TERMINATE_JOB protocol_version:8704
[2019-11-19T10:55:27.959] _thread_per_group_rpc: sending REQUEST_TERMINATE_JOB to r07c42
[2019-11-19T10:55:27.965] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_TERMINATE_JOB protocol_version:8704
[2019-11-19T10:55:27.965] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
..

After a while, we observe:

[2019-11-19T10:55:40.647] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
[2019-11-19T10:55:41.648] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
[2019-11-19T10:55:42.650] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
[2019-11-19T10:55:43.056] _slurmctld_rpc_mgr: accept() connection from 10.140.0.11:56602
[2019-11-19T10:55:43.059] slurmctld_req: received opcode MESSAGE_NODE_REGISTRATION_STATUS from 10.140.0.11:56602
[2019-11-19T10:55:43.257] _wdog: agent maximum delay 61 seconds
[2019-11-19T10:55:43.258] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_NODE_REGISTRATION_STATUS protocol_version:8704
[2019-11-19T10:55:43.258] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:55:43.652] agent_trigger: pending_wait_time=65534->999 mail_too=F->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=1

The watchdog came in and finally terminated this agent:

[2019-11-19T10:55:43.257] _wdog: agent maximum delay 61 seconds

I see many _wdog messages coming in, which indicates your system is suffering some issue.
I see many RPC storms in your logs, which makes me think you're saturating the IB cards or similar: not only REQUEST_BUILD_INFO, but also REQUEST_PARTITION_INFO and REQUEST_NODE_INFO. If you want I could provide a patch to log _thread_per_group_rpc() and deeper calls, but given this:

]$ grep -c REQUEST_BUILD_INFO slurmctld_filtered_email_lines.log
30071
]$ grep -c REQUEST_PARTITION_INFO slurmctld_filtered_email_lines.log
5045
]$ grep -c REQUEST_NODE_INFO slurmctld_filtered_email_lines.log
5641

and the fact that all of these RPCs happen within seconds or less, I am strongly inclined not to keep digging into the code or applying patches to your production system until you find the cause of these massive RPC volumes. Note: it is not only the IB card that can be saturated; the max processes on the server or a network switch can be as well.

I would like your opinion on this.

Thanks,
Felip
> [2019-11-19T10:53:03.020] _slurmctld_rpc_mgr: accept() connection from
> 10.140.0.8:38364
>
> ]$ grep -r REQUEST_BUILD|wc -l
> 30071
>
> Can you please investigate who is issuing this specific RPC so aggressively?

These requests came from a login node, but the "storm" is gone now. Users sometimes have "interesting" scripts/applications which monitor queues and submit jobs, and we have no way to prevent users from using e.g. the squeue command without crippling functionality, which reminds me of one old enhancement BR: https://bugs.schedmd.com/show_bug.cgi?id=5080

Another issue is that slurmctld may have a lot of connections in TIME_WAIT state. One example we just saw: a user submits jobs with a lot of steps but has a misconfigured path. The following generates a lot of connections to slurmctld, and we do not see any way to prevent this:

for i in {1..32000}; do
    srun /typoed/path/gmx input_$i
done

I'll attach sdiag and netstat.
Created attachment 12468 [details] sdiag
Created attachment 12470 [details] netstat -naptu
*** Ticket 8061 has been marked as a duplicate of this ticket. ***
Hello Tommi,

I haven't provided further feedback on this issue until now. The current status is that I prepared a patch improving how the agent queue is managed, which will probably mitigate what you experienced before. As you know, the big explanation is in comment 33. My proposed patch is pending review by our QA team, so I am keeping this bug open until it is reviewed and accepted. Thanks for your understanding.
Hi Tommi,

We have been reviewing the issue with the QA team, but my proposed patch may not directly address your issue: it is a performance improvement in the agent queue, and your logs didn't show a high number for the retry list. Now I'd like to know the status of your issue:

- Are you still seeing the "error: validate_node_specs: Prolog or job env setup failure on node xxx, draining the node" too often?

- Do you still see "RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure"?

- After comment 36, where you said the RPC storm ceased, have you experienced the very same issues without an evident storm?

PS. I still strongly believe there was an issue with the communication layer, as explained in comment 33.
> - Are you still seeing the "error: validate_node_specs: Prolog or job env
> setup failure on node xxx, draining the node" too often?
>
> - Do you still see "RPC:REQUEST_LAUNCH_PROLOG : Communication connection
> failure" ?

Yes, the problem has not disappeared. We had to write a script which runs in a loop and releases jobs that are held due to this bug.

> - After comment 36 where you said the "storm RPC" ceased, have you
> experienced the very same issues without an evident storm?

Yes, the problem is constant. One example: we have a 200-job submit limit per user, so a user invented a nice workaround:

parallel --progress -j 200 sbatch -W --options

The problem is that every sbatch command generates 3 RPCs at a 10-second interval, and if one multiplies that by 200x...

We see more and more "scripting suites/frameworks" which monitor the queue and submit jobs automatically, so this kind of frequent polling is going to increase in the future.

The overall situation is really bad. We cannot modify more than ~20 accounts with sacctmgr at a time, otherwise slurmctld may crawl for almost 30 minutes and the logs fill with ugly errors. Slurmctld leaks memory, but we cannot valgrind it due to performance problems (the version is the latest from the 19.05 branch):

4680 slurm  20  0  13,7g  12,6g  4112 S  12,5  20,1  6212:56 slurmctld
> Slurmctld leaks memory but we cannot valgrind it due to performance problems
> (version is latest from 19.05 branch)
> 4680 slurm 20 0 13,7g 12,6g 4112 S 12,5 20,1 6212:56 slurmctld

Your situation is really bad, and that memory consumption is quite high.

Can you gdb attach to slurmctld and send me the output of "thread apply all bt"?

Send me your latest slurmctld and slurmd logs from a node where you've seen problems. Also I want your latest slurm.conf, plus sdiag output captured every 10 seconds for 5 minutes. If you are able to make it fail during this period, that would be great, for example with the sacctmgr operation you mentioned. The slurmctld log after this whole test is what I want.
> Yes, problem is constant. One example, we have 200 job submit limit/user so
> user invented nice workaround:
>
> parallel --progress -j 200 sbatch -W --options
>
> Problem is that every sbatch command generates 3 RPCs in 10s interval and if
> one multiplies that with 200x...
>
> We see more and more "scripting suites/frameworks" which are monitoring the
> queue and submit jobs automatically so this kind of frequent polling is
> going to be increase in future.

You are right... in my case this generates 2 RPCs x 200 every 10 s:

REQUEST_FED_INFO
REQUEST_JOB_INFO_SINGLE

And as you know, that's the polling mechanism of the -W option.

How do you have the OS configured? I'm quite sure you already tuned it, but just in case, here is the high-throughput link: https://slurm.schedmd.com/high_throughput.html

Besides that, and depending on your current config, I am thinking of modifying the backfill parameters to let the RPCs be processed more quickly, with the counterpart of scheduling jobs less often. I will wait for your slurm.conf before making any suggestion. In the meantime I am trying to reproduce your scenarios.
I worked more on the code path which has to be triggered to raise your issue, and I strongly believe it is a communication problem caused by an overwhelmed system due to a high number of RPCs and connections. In fact I created a synthetic reproducer which produces the exact problem you're seeing. The key was to force a sleep and timeout in the function which does the connect() syscall to a socket for the specific RPCs; exactly the same errors you have seen appeared.

The "Prolog or job env setup failure" error and the drained nodes are only the most visible face of this issue; it has other implications. For example, in your logs I checked how many RPCs carry the "Communication connection failure" message, and I got this:

]$ grep -ri "communication connection" slurmctld_filtered_email_lines.log|cut -d" " -f 4|sort |uniq -c
     77 RPC:REQUEST_ACCT_GATHER_UPDATE
      8 RPC:REQUEST_BATCH_JOB_LAUNCH
     20 RPC:REQUEST_LAUNCH_PROLOG
      1 RPC:REQUEST_TERMINATE_JOB

So this is affecting not only your submissions, but also your accounting, job termination, and other RPCs. You *really* want to tune your system to accept more connections at the OS level, but the final answer here is that you want to change your users' workflow to avoid querying the ctld with so many RPCs. I am afraid there won't be any other solution...

In any case, send me the info I requested in my previous comments to see if there's any chance for the scheduler to release a bit of work in order to serve RPCs more often, but I cannot guarantee that is a solution. Your users are DDoS'ing your ctld.

Tell me what you think.
Created attachment 13307 [details]
current slurm.conf

Hi,

Sorry for the delay, I was on holidays last week. We have verified that all the recommended HTC settings are set on slurmctld.

-Tommi
(In reply to Tommi Tervo from comment #56)
> Created attachment 13307 [details]
> current slurm.conf
>
> Hi,
>
> Sorry for the delay, I was on holidays last week. We have verified that all
> recommended settings are set on slurmctld.
>
> -Tommi

Note I referred mainly to the OS configuration, not to the ctld, i.e. system/daemon/net limits...

Would you be able to apply a debug patch if I create it?
> Note I referred mainly to OS configuration, no to ctld, i.e.
> system/daemon/net limits...

Sorry for being unclear, I meant that we have checked those OS configurations/tunings which are on the HTC page.

> Would you be able to apply a debug patch if I create it?

Of course, unless it can be harmful for production.

-Tommi
Created attachment 13359 [details]
dbg7928_errno.patch

That's a pretty simple debug patch restricted to LAUNCH_PROLOG RPCs (so it shouldn't cause any performance impact) which will let us see the real errno after the connect() syscall, because right now we overwrite it in certain situations and replace it with SLURM_COMMUNICATIONS_CONNECTION_ERROR.

Can you apply this patch, run at debug level, and send me the logs when you see the issue again? Hopefully it will give us a lead and an idea of where to look next. Thanks!!
(In reply to Felip Moll from comment #59)
> Created attachment 13359 [details]
> dbg7928_errno.patch
>
> That's a pretty simple debug patch restricted to LAUNCH_PROLOG RPCs (so
> shouldn't cause any performance impact) which will let us see the real errno
> after the connect() syscall, because now we're overwriting this in certain
> situations and replace with SLURM_COMMUNICATIONS_ERROR.
>
> Can you apply this patch, run in debuglevel debug, and send me the logs when
> you see the issue again?

Hi,

I applied the patch to our debug build, which should also catch memory leak errors. It ran for about 5 hours today without a hit. I'll continue on Monday; have a nice weekend.

-Tommi
Tommi, I've seen your logs in bug 8597, and in this situation I can only see an issue if I grep "update assoc" in the logfile, where you will see there are a lot of updates.

On our specific issue, *all* prologs were running fine. You can see that by grepping the logs:

]$ grep -ri bug7928 slurmctld.log

We can also see that there isn't any retry; this is always 0:

..retries slurm_open_msg_conn=0..

So the issue wasn't triggered that time for us. I think you are facing other issues, and these affect bug 8597, bug 7928, and your performance.

Let's keep this debug level, and if you catch it just send me the logs again.
> I think you are facing other issues and these affects bug 8597 and 7928 and
> your performance.
>
> Let's keep with this debug level and if you catch it just send me the logs
> again.

Hi,

I recompiled slurm without the address sanitizer flags so I can keep debug logging on for a longer time. I'll post results when the prolog error happens.
Created attachment 13393 [details] prolog failure debug log
(In reply to Tommi Tervo from comment #64)
> Created attachment 13393 [details]
> prolog failure debug log

We need more log! Look, at 11:19:09 there's the Allocate for jobid=1663914 on node r03g03.

(I guess r03g03 has ip addr 10.140.0.183, can you confirm that?)

Immediately we see how the REQUEST_LAUNCH_PROLOG is issued:

[2020-03-17T11:19:09.382] debug: BUG7928 slurm_send_addr_recv_msgs: pid 11849 going to connect to 10.140.0.183:6818 to send REQUEST_LAUNCH_PROLOG

but... we don't see the continuation of this connect. The loop looks like this:

    if (msg->msg_type == REQUEST_LAUNCH_PROLOG) {
        slurm_print_slurm_addr(&msg->address, addr_str, sizeof(addr_str));
        debug("BUG7928 %s: pid %d going to connect to %s to send %s",      <---- Message seen
              __func__, getpid(), addr_str, rpc_num2string(msg->msg_type));
    }
    /* This connect retry logic permits Slurm hierarchical communications
     * to better survive slurmd restarts */
    for (i = 0; i <= conn_timeout; i++) {
        if (i)
            sleep(1);
        fd = slurm_open_msg_conn(&msg->address);      <---- That's a "connect()" followed by a "poll()"
        if ((fd >= 0) || (errno != ECONNREFUSED))
            break;
        if (i == 0)
            debug3("connect refused, retrying");
    }
    if (msg->msg_type == REQUEST_LAUNCH_PROLOG)
        debug("BUG7928 %s:retries slurm_open_msg_conn=%d, conn_timeout=%d, fd=%d, errno: %s",   <---- I expect this
              __func__, i, conn_timeout, fd, slurm_strerror(errno));

So you must agree that after the first message we expect a "BUG7928 %s:retries slurm_open_msg_conn=%d, conn_timeout=%d, fd=%d, errno: %s" line. The last timestamp in the logs is [2020-03-17T11:20:02.501], so after 53 seconds we still haven't seen it. The prolog then fails at 50 s because of the watchdog, so we can conclude that the connect() syscall to 10.140.0.183:6818 is taking a very long time.
The poll() is not responsible, because I haven't seen any "slurm_connect poll failed" or "slurm_connect poll timeout" in your logs before, so this path is ruled out.

Please send me more log, up to the point where you see the missing counterpart: "BUG7928 %s: retries.....".

Question: Do you have syncookies enabled on the server? Is the MTU/MSS of the server interfaces the same as on the nodes?

The issue you are experiencing seems not directly related to the Slurm code, but to your system being overwhelmed, a misconfiguration in the network, a failing card, a switch, a firewall, or something similar. Please respond to my questions here; I am curious about what's going on.
Also I'd like to know your current kernel version: are you still on 3.10.0-957.2?

What are your net.ipv4.tcp_max_syn_backlog and net.core.somaxconn?

When it is happening it would be nice to catch the output of:

]$ ss -n state syn-recv dport = :6818
]$ ss -plnt dport = :6818
Created attachment 13407 [details] Longer log with connection timeout mgs
> Look, at 11:19:09 there's the Allocate for jobid=1663914 on node r03g03.
>
> (I guess r03g03 has ip addr 10.140.0.183, can you confirm that?)

[root@r03g03 ~]# ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.140.0.183  netmask 255.255.240.0  broadcast 10.140.15.255
        txqueuelen 16384  (InfiniBand)
        RX packets 8155441  bytes 10155560555 (9.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3320658  bytes 1054565732 (1005.7 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

mslurm1:
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet6 fe80::a00:3800:1:1  prefixlen 64  scopeid 0x20<link>
        txqueuelen 16384  (InfiniBand)
        RX packets 1075417805  bytes 141156091229 (131.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4152895038  bytes 6952160941567 (6.3 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ib0:23: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.140.13.24  netmask 255.255.240.0  broadcast 10.140.15.255
        txqueuelen 16384  (InfiniBand)

> Immediately we see how the REQUEST_LAUNCH_PROLOG is issued:
>
> [2020-03-17T11:19:09.382] debug: BUG7928 slurm_send_addr_recv_msgs: pid
> 11849 going to connect to 10.140.0.183:6818 to send REQUEST_LAUNCH_PROLOG
>
> but... we don't see the continuation of this connect.
>
> So you must agree that after the message we are expecting a "BUG7928
> %s:retries slurm_open_msg_conn=%d, conn_timeout=%d, fd=%d, errno: %s". The
> last timestamp in logs is [2020-03-17T11:20:02.501], so after 53 seconds we
> still haven't seen it. Then, the prolog fails at 50s because of the watchdog
> therefore we can conclude here that the "connect()" syscall to
> 10.140.0.183:6818 is taking a long time.
>
> The poll() is not responsible, because I haven't seen any "slurm_connect
> poll failed" or "slurm_connect poll timeout" in your logs before, so this
> path is discarded.
>
> Please, send me back more log, until you see the missing counterpart:
> "BUG7928 %s: retries.....".
Seems that it will time out after one minute:

[2020-03-17T11:20:09.442] debug: BUG7928 slurm_send_addr_recv_msgs:retries slurm_open_msg_conn=0, conn_timeout=10, fd=-1, errno: Connection timed out
[2020-03-17T11:20:10.075] server_thread_count over limit (256), waiting
[2020-03-17T11:20:10.128] backfill: continuing to yield locks, 256 RPCs pending

> Question: Do you have syncookies enabled on the server? Is the MTU/MSS of
> the server interfaces the same than on the nodes?

Syncookies were not enabled; I tested them a long time ago but the bug occurred with them too. I've enabled them again and am trying to catch logs with ss logging.

> The issue you are experiencing seems not directly related to Slurm code, but
> to your system being overwhelmed, a misconfiguration on the network, a
> failing card, a switch, a firewall, or something similar.

It's really puzzling, because I do not see any dropped/error packets and the problem occurs on nodes all over the cluster.

--- r03g03.bullx ping statistics ---
15 packets transmitted, 15 received, 0% packet loss, time 14000ms
rtt min/avg/max/mdev = 0.061/0.121/0.153/0.030 ms

[root@mslurm1 ~ (PUHTI) ]# uname -a
Linux mslurm1.bullx 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

ibdiagnet report summary:

-I- Stage                     Warnings   Errors   Comment
-I- Discovery                 0          0
-I- Lids Check                0          0
-I- Links Check               0          0
-I- Subnet Manager            0          0
-I- Port Counters             0          0
-I- Nodes Information         0          0
-I- Speed / Width checks      0          0
-I- Alias GUIDs               0          0
-I- Virtualization            0          0
-I- Partition Keys            0          0
-I- Temperature Sensing       0          0
-I- Routing                   0          0
(In reply to Felip Moll from comment #66)
> Also I'd like to know your current kernel versions, are you still at
> 3.10.0-957.2?

The cluster OS stack was updated in December to RHEL 7.7:

3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

> What are your sysctl_max_syn_backlog and net.core.somaxconn?

[root@mslurm1 ~ (PUHTI) ]# cat /proc/sys/net/ipv4/tcp_synack_retries
5
[root@mslurm1 ~ (PUHTI) ]# cat /proc/sys/net/ipv4/tcp_syncookies
1
[root@mslurm1 ~ (PUHTI) ]# sysctl -a 2>/dev/null|grep net.core.somaxconn
net.core.somaxconn = 16384
[root@mslurm1 ~ (PUHTI) ]# sysctl -a 2>/dev/null|grep max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 16384

> When it is happening it would be nice to catch output of:
>
> ]$ ss -n state syn-recv dport = :6818
> ]$ ss -plnt dport = :6818

I set up a loop which logs the output of these commands at a 10-second interval.
Created attachment 13410 [details]
Debug logs with +Protocol

Compute node journal:

-- Unit user-0.slice has finished shutting down.
Mar 18 12:40:14 r07c06.bullx root[41382]: start slurm_prolog for job: 1676643
Mar 18 12:40:14 r07c06.bullx root[41392]: end slurm_prolog for job: 1676643
Mar 18 12:40:15 r07c06.bullx root[41897]: start slurm_epilog for job: 1676643
Mar 18 12:40:15 r07c06.bullx root[41934]: ending_2 slurm_epilog for job: 1676643.
Mar 18 12:42:39 r07c06.bullx root[41967]: start slurm_prolog for job: 1676714
Mar 18 12:42:39 r07c06.bullx root[41977]: end slurm_prolog for job: 1676714
Mar 18 12:42:42 r07c06.bullx root[42602]: start slurm_prolog for job: 1676715
Mar 18 12:42:42 r07c06.bullx root[42612]: end slurm_prolog for job: 1676715
Mar 18 12:42:43 r07c06.bullx root[43413]: start slurm_epilog for job: 1676715
Mar 18 12:42:44 r07c06.bullx root[43450]: ending_2 slurm_epilog for job: 1676715.
Mar 18 12:43:05 r07c06.bullx root[43462]: start slurm_epilog for job: 1676712
Mar 18 12:43:05 r07c06.bullx root[43501]: ending_2 slurm_epilog for job: 1676712.
Mar 18 12:43:41 r07c06.bullx root[43594]: start slurm_epilog for job: 1676356
Mar 18 12:43:41 r07c06.bullx root[43631]: ending_2 slurm_epilog for job: 1676356.
Mar 18 12:45:01 r07c06.bullx systemd[1]: Created slice User Slice of root.
-- Subject: Unit user-0.slice has finished start-up

I had that ss loop running, but it does not report anything interesting; both commands print only their headers, with no entries:

12:42:03 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
State Recv-Q Send-Q Local Address:Port Peer Address:Port
12:42:13 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
State Recv-Q Send-Q Local Address:Port Peer Address:Port
12:42:23 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
State Recv-Q Send-Q Local Address:Port Peer Address:Port
12:42:33 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
State Recv-Q Send-Q Local Address:Port Peer Address:Port
12:42:43 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
State Recv-Q Send-Q Local Address:Port Peer Address:Port
12:42:53 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
State Recv-Q Send-Q Local Address:Port Peer Address:Port
12:43:03 :
Netid Recv-Q Send-Q Local Address:Port Peer Address:Port
TCP syncookies won't help if the SYN queue is not full, and if ss doesn't show anything I wouldn't set syncookies.

Which model of IB switches and HCAs do you have? Diagnosing switch issues is out of my scope and knowledge, but it would be worth looking at; I have seen weird issues like this one in past jobs where, in the end, there was simply a failing card in a switch.

The basic facts here are:

1. We do have timeouts in the connect() syscall.
2. It doesn't happen just for REQUEST_LAUNCH_PROLOG, but this case is the one that gives noticeable symptoms (comment 52).
3. You use TCP over IB, not a GbE network for management. I am not saying this is an issue, but we need to keep it in mind.
4. Bug 8597 derived from this bug. I cannot say it is related, but it is something to take into consideration.
5. I think the Slurm code is good, because of 2.

What do you really think?
I want to make a clarification. In _slurm_connect() we set the socket to O_NONBLOCK:

fcntl(__fd, F_SETFL, flags | O_NONBLOCK);

This makes connect() return immediately, and then we really wait in poll() with the TCPTimeout. If debug2 were enabled we should see:

error("slurm_connect poll failed: %m");

with errno being ETIMEDOUT. That doesn't change anything, but...

Marcin, who is working on a very similar bug 8698, suggested that maybe we could retry on timeout. Your TCPTimeout is 60, which would make a second retry take too long if it doesn't work, and could accumulate many threads. A derived idea we could try is to retry the wait once more after the first timeout, but with a lower timeout. I am studying this idea now and will let you know.
> Marcin, who is working on a very similar bug 8698, suggested that maybe we
> could retry on a timeout.
>
> Your TCPTimeout is 60, so a second retry with the same timeout would take
> too long if it doesn't work and could accumulate many threads.
>
> A derived idea we could try is to retry the wait once more after the first
> timeout, but with a lower timeout. I am studying this idea now and will
> let you know.

Hi,

I was going to propose a similar idea, but I wanted to discuss it first with my colleague, who has experience building his own RTOS; unfortunately he's sick at the moment. Btw, bug 8698 does not look related?

-Tommi
> I was going to propose a similar idea, but I wanted to discuss it first
> with my colleague, who has experience building his own RTOS; unfortunately
> he's sick at the moment. Btw, bug 8698 does not look related?
>
> -Tommi

Sorry for the typo, it was bug 8398.
Tommi, in bug 8398 (which is private) we are trying a patch based on the same concept I explained in comment 73. If you agree, we will wait to see how it works there and, depending on the results, I will propose applying the same patch or a variant to your situation. Results will probably come this week or next, depending on the time needed to implement it and reproduce the issue.
I was going to suggest that you apply the mentioned patch, but after seeing the progress in bug 8597 I prefer to wait, in order not to mix both issues. Your system seems to be under quite some RPC pressure, so I will work with my other colleagues to find out how we can help in 8597, and will put this bug on hold. I suspect it will end up being a duplicate of the other one.
Hi Tommi, I am wondering whether, after changing the way you run sacctmgr, you are still facing the issue described in this bug. Sorry for my long delay in responding.
(In reply to Felip Moll from comment #79)
> Hi Tommi,
>
> I am wondering whether, after changing the way you run sacctmgr, you are
> still facing the issue described in this bug.

Hi,

The problem is still there; we have a script which resumes a drained node when this problem occurs.

-Tommi
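The site's actual resume script is not shown in the thread; a minimal sketch of that kind of workaround could look like the following. The drain-reason substring and function names are assumptions based on the "Prolog or job env setup failure" messages in the slurmctld logs:

```python
import subprocess

# Substring of slurmctld's drain reason for this failure mode (an
# assumption based on the log messages quoted earlier in the thread).
PROLOG_REASON = "prolog"

def nodes_to_resume(sinfo_output, reason_substr=PROLOG_REASON):
    """Parse `sinfo -h -t drain -o "%n|%E"` output (one "node|reason" per
    line) and return the nodes drained for a prolog failure."""
    nodes = []
    for line in sinfo_output.splitlines():
        node, _, reason = line.partition("|")
        if node and reason_substr in reason.lower():
            nodes.append(node)
    return nodes

def resume_prolog_drained_nodes(dry_run=True):
    """Resume matching nodes with scontrol; a sketch, not the site's
    actual script."""
    out = subprocess.run(["sinfo", "-h", "-t", "drain", "-o", "%n|%E"],
                         capture_output=True, text=True, check=True).stdout
    for node in nodes_to_resume(out):
        cmd = ["scontrol", "update", f"NodeName={node}", "State=RESUME"]
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Matching on the reason string keeps the script from resuming nodes drained for unrelated failures (e.g. "Kill task failed"), which an operator would still want to inspect by hand.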
Created attachment 15588 [details]
bug8398_1905_v4.patch

Tommi,

The attached patch has proven useful at Stanford, where they faced similar issues. It has not yet been reviewed and accepted by our QA team, but it seems safe.

Is it possible for you to apply it and give it a try?

I will be out of the office next week, so I will ping my colleague Marcin to keep track of the bug.

Let me know what you see after applying it.

Thanks
(In reply to Felip Moll from comment #81)
> Created attachment 15588 [details]
> bug8398_1905_v4.patch
>
> Tommi,
>
> The attached patch has proven useful at Stanford, where they faced similar
> issues. It has not yet been reviewed and accepted by our QA team, but it
> seems safe.
>
> Is it possible for you to apply it and give it a try?
>
> I will be out of the office next week, so I will ping my colleague Marcin
> to keep track of the bug.
>
> Let me know what you see after applying it.
>
> Thanks

Hi Tommi,

Was there any chance to try this out?

Thanks
> Was there any chance to try this out?

Hi,

I applied the patch to our Slurm build, but we had no time to install it. Next we have the 19.05 -> 20.02.x upgrade ahead; the actual date is not yet decided. We may first try without the patch, even though it seems to apply to the 20.02.x series:

slurm-20.02.5-3b1897f]$ patch -p1 < ../puhti-slurm-patches/7928.patch
patching file src/common/slurm_protocol_api.c
Hunk #1 succeeded at 4928 (offset -18 lines).
Hunk #2 succeeded at 4937 (offset -18 lines).
patching file src/slurmd/slurmd/req.c
Hunk #1 succeeded at 2396 (offset -20 lines).
patching file src/common/slurm_protocol_api.c
Hunk #1 succeeded at 4938 (offset -18 lines).
patching file src/slurmd/slurmd/slurmd.c
Hunk #1 succeeded at 461 (offset 7 lines).
(In reply to Tommi Tervo from comment #83)
> Hi,
>
> I applied the patch to our Slurm build, but we had no time to install it.
> Next we have the 19.05 -> 20.02.x upgrade ahead; the actual date is not
> yet decided. We may first try without the patch, even though it seems to
> apply to the 20.02.x series:
>
> slurm-20.02.5-3b1897f]$ patch -p1 < ../puhti-slurm-patches/7928.patch
> patching file src/common/slurm_protocol_api.c
> Hunk #1 succeeded at 4928 (offset -18 lines).
> Hunk #2 succeeded at 4937 (offset -18 lines).
> patching file src/slurmd/slurmd/req.c
> Hunk #1 succeeded at 2396 (offset -20 lines).
> patching file src/common/slurm_protocol_api.c
> Hunk #1 succeeded at 4938 (offset -18 lines).
> patching file src/slurmd/slurmd/slurmd.c
> Hunk #1 succeeded at 461 (offset 7 lines).

It is not yet committed into 20.02, but I think your plan is good. First try 20.02 without the patch and see what happens; if it is still occurring, just apply it.

Can we close this issue and reopen it when you have more feedback for me?

Thanks
Ok, I'll reopen if this still happens with 20.02 + this patch.
This is to inform you that the final patch has been committed. The main part of the patch (solving the issue) is now merged into the slurm-20.02 branch and will be released in Slurm 20.02.6, commits a4e65900d69..04eb33f1d11. Additional logging went to our master branch and is scheduled for Slurm 20.11, commit 37b46a351ac83f. Starting with the mentioned releases, there is no need to apply the patch manually anymore. Thanks