Ticket 7928 - Prolog or job env setup failure - due to Communication connection failure and others
Summary: Prolog or job env setup failure - due to Communication connection failure and...
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 19.05.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
: 8061
Depends on:
Blocks:
 
Reported: 2019-10-15 05:51 MDT by CSC sysadmins
Modified: 2020-10-27 06:00 MDT
5 users

See Also:
Site: CSC - IT Center for Science
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld debug logs (18.72 KB, application/gzip)
2019-10-21 06:15 MDT, CSC sysadmins
Details
compute node debug log (5.25 KB, text/x-log)
2019-10-21 06:16 MDT, CSC sysadmins
Details
NHC logs and scripts: lbnl-nhc-1.4.2-1.el7.noarch (343.38 KB, application/x-bzip)
2019-10-21 06:46 MDT, CSC sysadmins
Details
New logs without NHC (23.57 KB, text/x-log)
2019-11-08 05:27 MST, CSC sysadmins
Details
logs from slurmctld (251.97 KB, text/plain)
2019-11-08 05:28 MST, CSC sysadmins
Details
full slurmctl log (532.55 KB, application/x-bzip)
2019-11-19 02:20 MST, CSC sysadmins
Details
slurmd log from failed job 471596 (23.61 KB, text/x-log)
2019-11-19 02:21 MST, CSC sysadmins
Details
Dmesg (386.49 KB, text/x-log)
2019-11-28 06:41 MST, CSC sysadmins
Details
rpmbuild config.log (225.73 KB, text/x-log)
2019-11-29 00:15 MST, CSC sysadmins
Details
sdiag (15.62 KB, text/x-log)
2019-12-04 04:24 MST, CSC sysadmins
Details
netstat -naptu (143.34 KB, text/x-log)
2019-12-04 05:33 MST, CSC sysadmins
Details
current slurm.conf (8.24 KB, text/plain)
2020-03-09 09:28 MDT, CSC sysadmins
Details
dbg7928_errno.patch (1.40 KB, patch)
2020-03-12 11:55 MDT, Felip Moll
Details | Diff
prolog failure debug log (6.61 KB, application/gzip)
2020-03-17 04:18 MDT, CSC sysadmins
Details
Longer log with connection timeout msgs (463.62 KB, text/x-log)
2020-03-18 02:11 MDT, CSC sysadmins
Details
Debug logs with +Protocol (71.65 KB, application/gzip)
2020-03-18 06:48 MDT, CSC sysadmins
Details
bug8398_1905_v4.patch (6.46 KB, patch)
2020-08-25 03:17 MDT, Felip Moll
Details | Diff

Description CSC sysadmins 2019-10-15 05:51:51 MDT
Felip asked us to open a separate bug for our case; logs and configs are attached to this bug report: https://bugs.schedmd.com/show_bug.cgi?id=6502
Comment 2 CSC sysadmins 2019-10-17 01:04:22 MDT
The problem cannot be in the slurm_prolog script: if the prolog fails, I do not see the
"start slurm_prolog" string in the syslog.

#!/bin/sh 
logger "start slurm_prolog"
Comment 3 Felip Moll 2019-10-17 02:18:03 MDT
Tommi,

I need a log of slurmd that includes the full process of a job. I need to see everything from the start or even before it.

As you noticed, r11c54's slurmd log starts at
[2019-10-08T14:59:18.209] debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG

while the problematic job was allocated, according to slurmctld, at:
[2019-10-08T14:58:33.579] sched: Allocate JobId=293312 NodeList=r11c54 #CPUs=10 Partition=fmi

A dmesg from the node and an 'scontrol show job' would also be useful.

Also, please set DebugLevel=debug2 and SlurmdDebug=debug2, and ensure the DebugFlags Agent, Protocol, and Route are set.

As you know this implies some performance penalty, so set the log levels back to info or debug once you are done capturing them.
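For reference, those settings map onto slurm.conf keys, or can be toggled at runtime with scontrol (a sketch; verify key and flag names against your version's slurm.conf(5) and scontrol(1) man pages):

```conf
# slurm.conf (restart or 'scontrol reconfigure' to apply):
SlurmctldDebug=debug2
SlurmdDebug=debug2
DebugFlags=Agent,Protocol,Route

# Or at runtime on the controller, without editing slurm.conf:
#   scontrol setdebug debug2
#   scontrol setdebugflags +protocol   # repeat per flag (+agent, +route)
```

Remember to revert with 'scontrol setdebug info' once the logs are captured.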


I have a slurm.conf from March; it would be good to have an updated one, along with the topology.conf.
Comment 4 CSC sysadmins 2019-10-18 01:35:53 MDT
Hi,

Yesterday I enabled debug2 on the compute nodes and enabled debugflags on slurmctld, but the problem disappeared :(

So the problem may be some kind of race condition that is hard to catch?

I'm not able to access "see also"-ticket:
You are not authorized to access bug #6493

Recent configs can be found on this ticket:
https://bugs.schedmd.com/show_bug.cgi?id=7908

I checked the logs: the general failure rate was about 4-5/day when debug2 was not enabled. I'll have to disable debug2-level logging over the weekend; we have no disks on the compute nodes, and jobs with a lot of steps can fill the /var tmpfs filesystem quite fast. But I'll continue next week.

slurmctld.log-20191009.gz:[2019-10-08T13:39:46.912] error: validate_node_specs: Prolog or job env setup failure on node r07c25, draining the node
slurmctld.log-20191009.gz:[2019-10-08T14:59:22.939] error: validate_node_specs: Prolog or job env setup failure on node r11c54, draining the node                                                                                                                                                                            
slurmctld.log-20191009.gz:[2019-10-08T22:25:59.511] error: validate_node_specs: Prolog or job env setup failure on node r05c48, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T09:30:29.592] error: validate_node_specs: Prolog or job env setup failure on node r11c65, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T12:47:37.447] error: validate_node_specs: Prolog or job env setup failure on node r02g03, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T15:41:20.668] error: validate_node_specs: Prolog or job env setup failure on node r07c05, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T16:29:43.271] error: validate_node_specs: Prolog or job env setup failure on node r11c53, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T16:43:05.134] error: validate_node_specs: Prolog or job env setup failure on node r01c38, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T17:08:35.986] error: validate_node_specs: Prolog or job env setup failure on node r01g08, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T18:14:04.076] error: validate_node_specs: Prolog or job env setup failure on node r02c01, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-09T19:31:03.604] error: validate_node_specs: Prolog or job env setup failure on node r16c10, draining the node                                                                                                                                                                            
slurmctld.log-20191010.gz:[2019-10-10T00:22:13.406] error: validate_node_specs: Prolog or job env setup failure on node r11c71, draining the node                                                                                                                                                                            
slurmctld.log-20191011.gz:[2019-10-10T09:57:16.019] error: validate_node_specs: Prolog or job env setup failure on node r15c14, draining the node                                                                                                                                                                            
slurmctld.log-20191011.gz:[2019-10-10T10:59:20.881] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node                                                                                                                                                                            
slurmctld.log-20191011.gz:[2019-10-10T13:11:59.753] error: validate_node_specs: Prolog or job env setup failure on node r04c10, draining the node                                                                                                                                                                            
slurmctld.log-20191011.gz:[2019-10-10T14:32:46.488] error: validate_node_specs: Prolog or job env setup failure on node r03c23, draining the node                                                                                                                                                                            
slurmctld.log-20191011.gz:[2019-10-10T17:05:56.299] error: validate_node_specs: Prolog or job env setup failure on node r14c19, draining the node                                                                                                                                                                            
slurmctld.log-20191011.gz:[2019-10-11T02:55:01.292] error: validate_node_specs: Prolog or job env setup failure on node r05c50, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-11T07:59:40.250] error: validate_node_specs: Prolog or job env setup failure on node r11c54, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-11T09:53:09.073] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-11T13:43:13.716] error: validate_node_specs: Prolog or job env setup failure on node r02g01, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-11T18:00:59.428] error: validate_node_specs: Prolog or job env setup failure on node r15c01, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-11T21:59:13.154] error: validate_node_specs: Prolog or job env setup failure on node r07c19, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-12T00:27:37.752] error: validate_node_specs: Prolog or job env setup failure on node r17c13, draining the node                                                                                                                                                                            
slurmctld.log-20191013.gz:[2019-10-12T01:21:43.237] error: validate_node_specs: Prolog or job env setup failure on node r05c44, draining the node                                                                                                                                                                            
slurmctld.log-20191015.gz:[2019-10-14T04:22:33.650] error: validate_node_specs: Prolog or job env setup failure on node r01c16, draining the node                                                                                                                                                                            
slurmctld.log-20191015.gz:[2019-10-14T06:19:10.071] error: validate_node_specs: Prolog or job env setup failure on node r06c15, draining the node                                                                                                                                                                            
slurmctld.log-20191015.gz:[2019-10-14T15:19:13.857] error: validate_node_specs: Prolog or job env setup failure on node r05c34, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-15T03:24:39.769] error: validate_node_specs: Prolog or job env setup failure on node r06c04, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-15T08:39:50.923] error: validate_node_specs: Prolog or job env setup failure on node r03c08, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-15T09:38:58.090] error: validate_node_specs: Prolog or job env setup failure on node r03c08, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-15T12:02:08.873] error: validate_node_specs: Prolog or job env setup failure on node r04c46, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-15T13:36:50.139] error: validate_node_specs: Prolog or job env setup failure on node r17g05, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-15T15:36:21.258] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-16T11:15:04.401] error: validate_node_specs: Prolog or job env setup failure on node r02g02, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-16T12:22:36.146] error: validate_node_specs: Prolog or job env setup failure on node r12c11, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-16T16:32:24.111] error: validate_node_specs: Prolog or job env setup failure on node r01g02, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-16T17:23:00.217] error: validate_node_specs: Prolog or job env setup failure on node r01g01, draining the node                                                                                                                                                                            
slurmctld.log-20191017.gz:[2019-10-17T00:27:28.460] error: validate_node_specs: Prolog or job env setup failure on node r16c25, draining the node                                                                                                                                                                            
slurmctld.log-20191018.gz:[2019-10-17T10:48:58.984] error: validate_node_specs: Prolog or job env setup failure on node r15g04, draining the node                                                                                                                                                                            
slurmctld.log-20191018.gz:[2019-10-17T11:56:51.536] error: validate_node_specs: Prolog or job env setup failure on node r12c08, draining the node                                                                                                                                                                            
slurmctld.log-20191018.gz:[2019-10-17T12:11:58.865] error: validate_node_specs: Prolog or job env setup failure on node r12c03, draining the node
Comment 5 Felip Moll 2019-10-18 02:25:31 MDT
(In reply to Tommi Tervo from comment #4)
> Hi,
> 
> Yesterday I enabled debug2 on compute nodes and enabled debugflags on
> slurmctld but problem disappeared :( 

Typical case: when you look at the problem, it just fixes itself :P


> So problem may be some kind of race condition which is hard to catch?

I think so, but right now I cannot tell whether it is an RPC being processed in the wrong order, a missing RPC, or just some issue with the nodes... I cannot say more without additional info.


> I'm not able to access "see also"-ticket:
> You are not authorized to access bug #6493

There's no relevant information there; it is an internal unresolved ticket with an issue similar to yours, seen in a specific environment when running the testsuite.


> Recent configs can be found on this ticket:
> https://bugs.schedmd.com/show_bug.cgi?id=7908

Good.

> I checked from the logs, general failure rate was about 4-5/day when debug2
> was not enabled. I've to disable debug2-level log during the next weekend,
> we do not have disks on the compute nodes and jobs with lot of steps can
> fill /var tmpfs filesystem quite fast. But I'll continue on next week.

It's fine, let's see what happens next week.
Comment 6 CSC sysadmins 2019-10-21 06:15:42 MDT
Created attachment 12021 [details]
slurmctld debug logs
Comment 7 CSC sysadmins 2019-10-21 06:16:58 MDT
Created attachment 12022 [details]
compute node debug log

Now we have prolog error logs at the requested debug level.
Comment 8 Felip Moll 2019-10-21 06:34:10 MDT
What does nhc.log say? 

It exited with status 0x0100; can you attach your NHC script too?

[2019-10-21T13:45:27.479] debug2: Start processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug:  attempting to run health_check [/usr/sbin/nhc]
[2019-10-21T13:45:30.753] error: /usr/sbin/nhc: exited with status 0x0100

[2019-10-21T13:45:30.753] debug2: Finish processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:46:06.209] error: Waiting for JobId=356502 prolog has failed, giving up after 50 sec
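As an aside, the 0x0100 in that error is a raw wait(2) status word, not NHC's exit code itself: the exit code sits in the high byte, so NHC exited with code 1. Such values can be decoded with Python's standard os helpers:

```python
import os

# 0x0100 is a raw wait(2) status word: the exit code lives in the
# high byte, the terminating signal (if any) in the low byte.
status = 0x0100
print(os.WIFEXITED(status))    # True -> the process exited normally
print(os.WEXITSTATUS(status))  # 1    -> its exit code was 1
```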
Comment 9 CSC sysadmins 2019-10-21 06:46:19 MDT
Created attachment 12023 [details]
NHC logs and scripts: lbnl-nhc-1.4.2-1.el7.noarch
Comment 10 Felip Moll 2019-10-21 08:38:16 MDT
I tracked down an error that is showing up in your nhc.log, and I suspect there may be a relation between these lines:

/etc/nhc/scripts/common.nhc: line 381: slurm.r12c05.log-20191018.gz: syntax error: invalid arithmetic operator (error token is ".r12c05.log-20191018.gz")
NHC watchdog timer 164451 (60 secs) has expired.  Signaling NHC:  kill -s ALRM -- -164450
/usr/sbin/nhc: line 485: kill: (-164450) - No such process


And these checks:

 @cn@ || check_ps_loadavg 400
 @cn@ || check_ps_service -S -u root auditd
 @cn@ || check_ps_service -r rpcbind
 @cn@ || check_ps_service -S ntpd
 @cn@ || check_ps_service -S -u root sshd
 @cn@ || check_ps_service -S -d rsyslogd -u root rsyslog
 @cn@ || check_ps_service -S -u root crond


This could be the cause of:

[2019-10-21T13:45:27.479] debug2: Start processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug:  attempting to run health_check [/usr/sbin/nhc]
[2019-10-21T13:45:30.753] error: /usr/sbin/nhc: exited with status 0x0100
[2019-10-21T13:45:30.753] debug2: Finish processing RPC: REQUEST_HEALTH_CHECK

In fact, after that I've found:

https://github.com/mej/nhc/issues/25

It could be related to your issue, so I will ask you to fix it before I continue looking.
Could you please fix it?
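The bash failure mode behind that common.nhc error can be reproduced in isolation (a sketch — the dotted string is a stand-in for a value that NHC's 'ps' checks fed into an arithmetic test):

```shell
#!/usr/bin/env bash
# A value containing dots is not a valid bash arithmetic expression, which
# yields the same "invalid arithmetic operator" error seen in common.nhc.
val="slurm.r12c05.log-20191018.gz"
if ! (( val > 0 )) 2>/dev/null; then
    echo "arithmetic evaluation failed as expected"
fi
```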
Comment 11 CSC sysadmins 2019-10-21 09:07:25 MDT
Thanks, I'll patch NHC tomorrow. I just noticed another failure where NHC is not interfering. Unfortunately there were no debug2-level logs, but that 2^32 stepd looks a bit suspicious?

[2019-10-21T17:09:24.319] error: Waiting for JobId=356826 prolog has failed, giving up after 50 sec
[2019-10-21T17:09:24.365] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:24.371] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:25.366] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:25.372] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:26.367] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:26.372] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:27.368] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:27.373] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:28.369] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:28.374] error: stepd_connect to 356826.4294967295 failed: No such file or directory
[2019-10-21T17:09:31.507] error: /usr/sbin/nhc: exited with status 0x0100
Comment 12 Felip Moll 2019-10-21 09:26:31 MDT
(In reply to Tommi Tervo from comment #11)
> Thanks, I'll patch nhc tomorrow. I just noticed another failure where nhc is
> not interfering. Unfortunately there was no debug2 level logs but that 2^32
> stepd looks a bit suspicious?
> 
> [2019-10-21T17:09:24.319] error: Waiting for JobId=356826 prolog has failed,
> giving up after 50 sec
> [2019-10-21T17:09:24.365] error: stepd_connect to 356826.4294967295 failed:
> No such file or directory
> ......
> [2019-10-21T17:09:31.507] error: /usr/sbin/nhc: exited with status 0x0100

Well, I see an NHC message too, so it can still interfere. Don't worry about the 2³²; it is the ID of the extern step. I'd need a debug2 log and a 'ps aux' from the node to say more, but it seems slurmd/stepd was trying to connect to the extern step and it wasn't there anymore.
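For context, the 4294967295 in those stepd_connect errors is just the all-ones 32-bit value, which Slurm uses as a sentinel step ID (here, the extern step) — a quick sanity check:

```python
# 4294967295 is 2**32 - 1, i.e. a uint32 with every bit set
# ((uint32_t)-1 in C). Slurm uses it as a sentinel step ID for
# the extern step rather than a real numbered step.
extern_step_id = 2**32 - 1
print(extern_step_id)       # 4294967295
print(hex(extern_step_id))  # 0xffffffff
```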

I'd suggest commenting out the process checks in NHC right now if you can, and later applying the patch to fix the bug.


Aside from that, I have to admit I now see behavior similar to that in 6502:

CTLD is sending REQUEST_LAUNCH_PROLOG, but it never gets to the node due to "Communication connection failure"

[2019-10-21T13:45:16.200]    retry_list retry_list_size:2 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,0,0,0
[2019-10-21T13:45:16.200]    retry_list retry_list_size:1 msg_type=REQUEST_BATCH_JOB_LAUNCH,0,0,0,0
[2019-10-21T13:45:16.200] agent: Agent_cnt=1 agent_thread_cnt=12 with msg_type=REQUEST_BATCH_JOB_LAUNCH retry_list_size=0
[2019-10-21T13:45:16.200] agent: Agent_cnt=1 agent_thread_cnt=12 with msg_type=REQUEST_LAUNCH_PROLOG retry_list_size=0
[2019-10-21T13:45:16.200] _make_agent_info: sending msg_type REQUEST_BATCH_JOB_LAUNCH to nodes r12c05
[2019-10-21T13:45:16.200] _make_agent_info: sending msg_type REQUEST_LAUNCH_PROLOG to nodes r12c05
[2019-10-21T13:45:16.200] agent: New agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704
[2019-10-21T13:45:16.200] agent: New agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_LAUNCH_PROLOG protocol_version:8704
[2019-10-21T13:45:16.200] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r12c05
[2019-10-21T13:45:16.200] _thread_per_group_rpc: sending REQUEST_LAUNCH_PROLOG to r12c05
[2019-10-21T13:45:16.216] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704
[2019-10-21T13:45:17.681]    retry_list retry_list_size:1 msg_type=REQUEST_LAUNCH_PROLOG,0,0,0,0

..
[2019-10-21T13:46:16.256] agent/is_node_resp: node:r12c05 RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure

So it is understandable that the node doesn't run the prolog and fails with an error. Interestingly, it does receive the REQUEST_BATCH_JOB_LAUNCH:

[2019-10-21T13:45:16.201] debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-10-21T13:45:16.201] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-10-21T13:45:16.202] debug2: _group_cache_lookup_internal: no entry found for westerl2
[2019-10-21T13:45:27.479] debug2: Start processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug2: Processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:45:27.479] debug:  attempting to run health_check [/usr/sbin/nhc]
[2019-10-21T13:45:30.753] error: /usr/sbin/nhc: exited with status 0x0100

[2019-10-21T13:45:30.753] debug2: Finish processing RPC: REQUEST_HEALTH_CHECK
[2019-10-21T13:46:06.209] error: Waiting for JobId=356502 prolog has failed, giving up after 50 sec


I must investigate what happens when it receives REQUEST_BATCH_JOB_LAUNCH, and whether there's some condition that leaves the node stuck, especially after _group_cache_lookup_internal.


Working on it while I wait for the NHC fix.
Comment 13 Felip Moll 2019-10-21 10:51:53 MDT
Though I don't have more logs from Surendra, I see that you both use NHC:

Apr 24 10:14:41 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_HEALTH_CHECK
Apr 24 10:14:41 r5i0n10 slurmd[3441]: debug:  attempting to run health_check [/usr/sbin/nhc]
Apr 24 10:15:49 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:15:49 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:15:50 r5i0n10 slurmd[3441]: debug2: got this type of message 1008
Apr 24 10:16:19 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:16:19 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:16:35 r5i0n10 slurmd[3441]: debug2: got this type of message 4005
Apr 24 10:16:35 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
Apr 24 10:16:35 r5i0n10 slurmd[3441]: debug2: _group_cache_lookup_internal: no entry found for jacl0659
Apr 24 10:16:35 r5i0n10 slurmd[3441]: task_p_slurmd_batch_request: 689894
Apr 24 10:16:35 r5i0n10 slurmd[3441]: task/affinity: job 689894 CPU input mask for node: 0xFFFFFFFFF
Apr 24 10:16:35 r5i0n10 slurmd[3441]: task/affinity: job 689894 CPU final HW mask for node: 0xFFFFFFFFF
Apr 24 10:16:49 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:16:49 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:17:19 r5i0n10 slurmd[3441]: debug2: got this type of message 1017
Apr 24 10:17:19 r5i0n10 slurmd[3441]: debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
Apr 24 10:17:25 r5i0n10 slurmd[3441]: error: Waiting for JobId=689894 prolog has failed, giving up after 50 sec

In both situations I also see _group_cache_lookup_internal messages.

It would also be interesting to see whether there are any NSS/SSSD logs on the node.

Note that you have CacheGroups=0 set in your slurm.conf from bug 7908. This parameter has been deprecated since commit 48a4cdf back in 2016. See the GroupUpdate* parameters and RELEASE_NOTES. You are probably receiving a warning when starting slurmctld.
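A sketch of the corresponding slurm.conf change (the values shown are illustrative, not a recommendation — check slurm.conf(5) for your version):

```conf
# Remove the deprecated parameter:
#   CacheGroups=0
# Its successors control periodic group-membership refreshes:
GroupUpdateTime=600    # seconds between group membership updates
GroupUpdateForce=1     # refresh even if /etc/group has not changed
```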
Comment 14 CSC sysadmins 2019-10-23 06:47:16 MDT
> In both situations I also see _group_cache_lookup_internal messages.
> 
> It would be interesting to see also if there are any NSS/SSSD logs in the
> node.

We have no NSS/SSSD on the nodes, just plain /etc/passwd and /etc/group files.

 
> Note that you have CacheGroups=0 set in your slurm.conf from bug 7908. This
> parameter was deprecated since commit 48a4cdf back in 2016. See GroupUpdate*
> parameters and RELEASE_NOTES. Probably you're receiving a warning when
> starting slurmctld.

I'll remove CacheGroups in the next slurm.conf update.

I patched NHC but it still fails miserably if slurmd runs it:

/etc/nhc/scripts/common.nhc: line 381: slurm.r07c03.log-20191018.gz: syntax error: invalid arithmetic operator (error token is ".r07c03.log-20191018.gz")
NHC watchdog timer 32444 (60 secs) has expired.  Signaling NHC:  kill -s ALRM -- -32443
/usr/sbin/nhc: line 491: kill: (-32443) - No such process

But from the command line and from cron it works as expected:

r07c03
20191023 15:40:12 [slurm] /usr/libexec/nhc/node-mark-online r07c03
/usr/libexec/nhc/node-mark-online:  Node r07c03 is already online.
/usr/libexec/nhc/node-mark-online:  Skipping alloc node r07c03 (none 

I'll configure NHC to run from cron on all nodes tomorrow and see if it makes any difference.
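For illustration, a cron stanza for running NHC outside slurmd might look like the following (the interval, log path, and use of /etc/cron.d are assumptions, not the site's actual setup):

```conf
# /etc/cron.d/nhc -- run NHC every 5 minutes, independent of slurmd
*/5 * * * * root /usr/sbin/nhc >> /var/log/nhc-cron.log 2>&1
```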
Comment 15 Felip Moll 2019-10-23 07:09:08 MDT
(In reply to Tommi Tervo from comment #14)
> > In both situations I also see _group_cache_lookup_internal messages.
> > 
> > It would be interesting to see also if there are any NSS/SSSD logs in the
> > node.
> 
> We have no NSS/SSSD on nodes, plain /etc/passwd and /etc/group files
> 
>  
> > Note that you have CacheGroups=0 set in your slurm.conf from bug 7908. This
> > parameter was deprecated since commit 48a4cdf back in 2016. See GroupUpdate*
> > parameters and RELEASE_NOTES. Probably you're receiving a warning when
> > starting slurmctld.
> 
> I'll disable cachegroups on next slurm conf update.
> 
> I patched NHC but it still fails miserably if slurmd runs it:
> 
> /etc/nhc/scripts/common.nhc: line 381: slurm.r07c03.log-20191018.gz: syntax
> error: invalid arithmetic operator (error token is ".r07c03.log-20191018.gz")
> NHC watchdog timer 32444 (60 secs) has expired.  Signaling NHC:  kill -s
> ALRM -- -32443
> /usr/sbin/nhc: line 491: kill: (-32443) - No such process
> 
> But from command line and cron it works as expected:
> 
> r07c03
> 20191023 15:40:12 [slurm] /usr/libexec/nhc/node-mark-online r07c03
> /usr/libexec/nhc/node-mark-online:  Node r07c03 is already online.
> /usr/libexec/nhc/node-mark-online:  Skipping alloc node r07c03 (none 
> 
> I'll configure NHC to run from cron on all nodes on tomorrow and see if it
> makes any difference.

It really *shouldn't* make a difference. The only difference may be which processes were running on the node at the moment of execution; maybe with cron there were no processes with dots in their names at that run.

I would suggest instead removing the process checks from the NHC config.
Comment 16 Felip Moll 2019-11-04 08:34:15 MST
Hi Tommi,

Do you have any news for that issue?

Did the modifications to NHC help?

Thanks
Comment 17 CSC sysadmins 2019-11-05 07:58:41 MST
(In reply to Felip Moll from comment #16)
> Hi Tommi,
> 
> Do you have any news for that issue?
> 
> Did the modifications to NHC help?

Hi,

Nope, separating NHC from Slurm did not help. We are seeing another variant of this problem: the prolog is somehow stuck. The first thing it should do is log the jobid to syslog, but it does not:



/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.930] error: Waiting for JobId=415346 prolog has failed, giving up after 50 sec
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.979] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:26.980] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:27.980] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:28.981] error: stepd_connect to 415346.4294967295 failed: No such file or directory
/var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:29.982] error: stepd_connect to 415346.4294967295 failed: No such file or directory

[root@r05c27 ~]# grep 415346  /var/log/syslog 
/var/log/syslog:1572942865 2019 Nov  5 10:34:25 r05c27 user notice root  start slurm_epilog for job: 415346
/var/log/syslog:1572942866 2019 Nov  5 10:34:26 r05c27 user notice root  ending_2 slurm_epilog for job: 415346.



[root@r05c27 ~]# head /etc/slurm/slurm_prolog 
#!/bin/sh 
logger "start slurm_prolog for job: ${SLURM_JOB_ID}"
Comment 18 Felip Moll 2019-11-05 08:27:34 MST
(In reply to Tommi Tervo from comment #17)
> (In reply to Felip Moll from comment #16)
> > Hi Tommi,
> > 
> > Do you have any news for that issue?
> > 
> > Did the modifications to NHC help?
> 
> Hi,
> 
> Nope, separating nhc from slurm did not help. We are seeing another variant
> of this problem, prolog is somehow stuck. First thing what is should do is
> to log jobid to syslog but it does not do that:

But did you just separate it, or did you also remove the 'ps' checks? Even if you separate it, the same 'ps' command can still exhaust the resources on the machine, so it could still have some impact.
I really would like you to also disable the process checks in nhc.conf to completely rule out NHC issues.

> 
> 
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.930] error: Waiting for
> JobId=415346 prolog has failed, giving up after 50 sec
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:25.979] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:26.980] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:27.980] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:28.981] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> /var/log/slurm/slurm.r05c27.log:[2019-11-05T10:34:29.982] error:
> stepd_connect to 415346.4294967295 failed: No such file or directory
> 
> [root@r05c27 ~]# grep 415346  /var/log/syslog 
> /var/log/syslog:1572942865 2019 Nov  5 10:34:25 r05c27 user notice root 
> start slurm_epilog for job: 415346
> /var/log/syslog:1572942866 2019 Nov  5 10:34:26 r05c27 user notice root 
> ending_2 slurm_epilog for job: 415346.
> 
> 
> 
> [root@r05c27 ~]# head /etc/slurm/slurm_prolog 
> #!/bin/sh 
> logger "start slurm_prolog for job: ${SLURM_JOB_ID}"


The thread that waits for the prolog to run is not the same thread that executes it. So even though you see "Waiting for JobId=415346 prolog has failed, giving up after 50 sec", it doesn't mean the prolog ever started at all.

After you see the issue happening again, please attach fresh logs. I expect not to see the NHC error anymore, and I am interested in seeing the new workflow.
It would also be good to try to catch the error while it is happening, but I understand you don't have a specific pattern to reproduce it, is that right?
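To catch the error in the act, a small log-watcher could snapshot each hit for later diagnosis. This is only a sketch; the log path and trigger string are assumptions based on the messages quoted above:

```shell
#!/bin/bash
# Hedged sketch: scan incoming log lines for the prolog-failure message
# and timestamp each hit into a file. In production this would be fed by
# `tail -Fn0` on the node's slurmd log.
watch_for() {
    local pattern=$1 outfile=$2
    while IFS= read -r line; do
        case $line in
            *"$pattern"*)
                printf '%s %s\n' "$(date '+%F %T')" "$line" >> "$outfile"
                ;;
        esac
    done
}
# usage: tail -Fn0 /var/log/slurm/slurm.r05c27.log | \
#            watch_for "prolog has failed" /tmp/prolog_hits.log
```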
Comment 19 CSC sysadmins 2019-11-06 07:05:50 MST
> But did you just separate it or removed the 'ps' checks? Even if you
> separate it, the same 'ps' command can still exhaust the resources in the
> machine so it could still have some impact.
> I really would like you to also disable the process checks in nhc.conf to
> completely discard NHC issues.

I disabled all but the loadavg check when I did the nhc separation:

#######################################################################
###
### Process checks
###
 @cn@ || check_ps_loadavg 400
# @cn@ || check_ps_service -S -u root auditd
# @cn@ || check_ps_service -r rpcbind
# @cn@ || check_ps_service -S ntpd
# @cn@ || check_ps_service -S -u root sshd
# @cn@ || check_ps_service -S -d rsyslogd -u root rsyslog
# @cn@ || check_ps_service -S -u root crond

> The thread who is waiting for the prolog to run is not the same thread than
> executes it. So despite you see "Waiting for JobId=415346 prolog has failed,
> giving up after 50 sec", doesn't mean the prolog has started at all.

Ah, this makes sense.


> After you see the issue happening again, please attach fresh logs. I expect
> not to see the NHC error anymore and I am interested in seeing the new
> workflow.
> It would be also good to try to catch the error when it is happening but I
> understand you don't have a specific pattern to reproduce, is that right?

I'll enable debug logging tomorrow if the day is calm enough. I've not figured out the pattern of when this problem occurs. If I put nodes with failed prologs back into resume state, they run jobs normally.

Best Regards,
Tommi
Comment 20 CSC sysadmins 2019-11-08 05:27:08 MST
Created attachment 12272 [details]
New logs without NHC
Comment 21 CSC sysadmins 2019-11-08 05:28:34 MST
Created attachment 12273 [details]
logs from slurmctld
Comment 22 Felip Moll 2019-11-14 13:52:50 MST
Tommi, 

The slurmd log starts too late to be useful.

From the slurmctld log I can see how the messages are queued:

[2019-11-08T13:33:29.228] backfill: Started JobId=427942 in gpu on r13g04
..
[2019-11-08T13:33:29.299] _make_agent_info: sending msg_type REQUEST_LAUNCH_PROLOG to nodes r13g04
[2019-11-08T13:33:29.299] _make_agent_info: sending msg_type REQUEST_BATCH_JOB_LAUNCH to nodes r13g04
..
[2019-11-08T13:33:29.299] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r13g04
[2019-11-08T13:33:29.299] _thread_per_group_rpc: sending REQUEST_LAUNCH_PROLOG to r13g04
[2019-11-08T13:33:29.300] agent_trigger: pending_wait_time=999->999 mail_too=F->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=6
[2019-11-08T13:33:29.300] agent_trigger: pending_wait_time=999->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=6
[2019-11-08T13:33:29.300]    retry_list retry_list_size:6 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,SRUN_PING
...

You can see two launch prolog and two batch job launch messages in the list, probably because other jobs were also starting (e.g. 427943), but I cannot say for sure because I don't have earlier logs.

On the slurmd side, the following is the first entry I see related to this job, which means the RPCs were received before the first entry in the log, which happens to be at [2019-11-08T13:33:33.046] (later than 13:33:29):

[2019-11-08T13:34:19.307] error: Waiting for JobId=427942 prolog has failed, giving up after 50 sec 


Please, can you send me the full logs? The current debug level and agent queue traces are being helpful.
I encourage you to always send full logs to facilitate our analysis.

Thanks and sorry for the delay on my response.
Comment 23 CSC sysadmins 2019-11-19 02:20:09 MST
Created attachment 12347 [details]
full slurmctl log
Comment 24 CSC sysadmins 2019-11-19 02:21:15 MST
Created attachment 12348 [details]
slurmd log from failed job 471596
Comment 25 Felip Moll 2019-11-28 06:29:41 MST
It is interesting that in most of the logs I've seen, even when the two RPCs are sent from the ctld, I often see a gap:

[2019-11-19T10:53:39.335] debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-11-19T10:53:39.335] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2019-11-19T10:53:39.336] debug2: _group_cache_lookup_internal: no entry found for huiskone
[2019-11-19T10:53:50.603] debug2: Start processing RPC: REQUEST_ACCT_GATHER_UPDATE

On this occasion it happens between the _group_cache_lookup_internal entry and the next operation: a gap of about 11 seconds.
Having taken NHC out of the way at least rules out one problem.

Does dmesg show anything? I don't think I've seen one from your nodes before. One from r04c14 would be nice to have (dmesg -T).

I'll keep looking. Sorry for the delay this time, I was busy at SC19.
Comment 26 CSC sysadmins 2019-11-28 06:41:03 MST
Created attachment 12434 [details]
Dmesg

Hi,

dmesg does not show any issues for Nov 19th, neither syslog.
And we have flat /etc/passwd files on compute and slurmctld nodes.

Best Regards,
Tommi Tervo
Comment 28 Felip Moll 2019-11-28 09:46:57 MST
Tommi,

Do you happen to have the configure line used during building Slurm?
Comment 29 CSC sysadmins 2019-11-29 00:15:06 MST
Created attachment 12437 [details]
rpmbuild config.log

rpmbuild config log attached
Comment 31 Felip Moll 2019-12-02 14:13:45 MST
I am working with this issue as my first priority this week.

One more question:
[2019-11-19T10:53:03.013] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38356
[2019-11-19T10:53:03.013] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38356
[2019-11-19T10:53:03.015] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38358
[2019-11-19T10:53:03.015] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38358
[2019-11-19T10:53:03.017] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38360
[2019-11-19T10:53:03.017] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38360
[2019-11-19T10:53:03.018] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38362
[2019-11-19T10:53:03.019] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:38362
[2019-11-19T10:53:03.020] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:38364

]$ grep -r REQUEST_BUILD|wc -l
30071

Can you please investigate who is issuing this specific RPC so aggressively?

These commands:
scontrol show config
sview
sprio

do emit this kind of RPC.

an 'sdiag' can help, specifically fields:
Remote Procedure Call statistics by message type
Remote Procedure Call statistics by user

Studying the code, I have some suspicions about this.
Comment 32 CSC sysadmins 2019-12-02 23:00:23 MST
> Can you please investigate who is issuing this specific RPC so aggressively?

Hi,

We have a maintenance break ongoing, so I'm busy today. But hopefully everything will be running smoothly tomorrow after the OS/IB/Slurm/adminplane upgrade, so I can investigate this issue, which is really annoying for us and our users.

Thanks,
Tommi
Comment 33 Felip Moll 2019-12-03 08:28:39 MST
I've found something.

In slurmctld there's the _slurmctld_background process that every 1 second calls agent_trigger(). This call is logged like this:

[2019-11-19T10:53:35.317] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:53:36.319] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:53:37.321] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:53:38.322] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0

The important fields here are agent_thread_cnt and retry_list_size. The former is a static variable that globally indicates how many threads have been spawned by all the agents in the system, and retry_list_size holds the count of new RPCs plus old RPCs that need a retry (other RPCs reside in the defer_list).

To expand a bit more: an agent() is a processor for the RPCs in the retry_list and the defer_list. Agents are spawned by a broadcast signal sent from agent_trigger() to the main agent() process in slurmctld, and as said before this can happen every 1 second from _slurmctld_background(); it is not the only way, though: it can also be triggered from a specific request, such as a launch_prolog() call. You can generally tell where agent_trigger() was called from by looking at pending_wait_time: the background thread calls it with RPC_RETRY_INTERVAL=60, and a launch_prolog() will end up with 999:

e.g.:
... pending_wait_time=65534->60  ...
... pending_wait_time=65534->999 ...

Think of it as a producer-consumer model. The consumer is the main agent() from slurmctld, and the producer is anyone who calls agent_trigger().

Well, once agent_trigger() has broadcast to the main agent(), the latter will process the retry_list and log things like:

[2019-11-19T10:53:39.333]    retry_list retry_list_size:2 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,0,0,0
[2019-11-19T10:53:39.333]    retry_list retry_list_size:1 msg_type=REQUEST_BATCH_JOB_LAUNCH,0,0,0,0

and it will basically pick up ONE SINGLE RPC from the retry_list: normally the newest one found that has never been sent, and otherwise the oldest one.
What will it do with this RPC? It will just spawn a new agent() thread to process it, which can in fact target one or more nodes, so you will see messages like:

[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_LAUNCH_PROLOG retry_list_size=1
[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_BATCH_JOB_LAUNCH retry_list_size=0

In this example, which is taken from your logs, you can see that beforehand the retry_list held two msg_types. It processed the first one (REQUEST_LAUNCH_PROLOG) and spawned an agent. The next iteration then processed the second one and spawned another agent.

The two agents then emit their own log entries, and they even show the global retry_list_size, so a human-readable pairing would look like:

[2019-11-19T10:53:39.333]    retry_list retry_list_size:2 msg_type=REQUEST_LAUNCH_PROLOG,REQUEST_BATCH_JOB_LAUNCH,0,0,0  <-- agent processes first RPC
[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_LAUNCH_PROLOG retry_list_size=1  <-- this is the first spawned thread

[2019-11-19T10:53:39.333]    retry_list retry_list_size:1 msg_type=REQUEST_BATCH_JOB_LAUNCH,0,0,0,0  <-- agent processes second RPC
[2019-11-19T10:53:39.333] agent: Agent_cnt=0 agent_thread_cnt=0 with msg_type=REQUEST_BATCH_JOB_LAUNCH retry_list_size=0 <-- this is the second spawned thread

Great, after that we have two agents, one for each RPC, and moreover the retry_list is already empty.
The new agents will next take care of sending the messages to the respective nodes.

When one of these secondary agents finishes its work, you will see a message like:

[2019-11-19T10:53:39.349] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704

The thread_count here indicates how many threads are still remaining in that agent, which is 1, i.e. the agent itself, so the other threads launched to serve the RPCs are already gone. Good. You can also see the get_reply:T flag, which indicates that we waited for a response after sending the RPC and received it. Good.


How do the spawned agents send the RPC?
First of all, it registers globally the number of rpc_thread_cnt that will be used, which will generally be 2 + the number of nodes the RPC is sent to, or 10 if that number is > AGENT_THREAD_COUNT. So a single agent cannot spawn more than 10 threads. In your examples, for these two RPCs each spawned agent will use 3 threads: one for sending to the single node r04c14, one for the agent itself, and one for the watchdog.

After that, it sets the parameters for the new threads (_make_agent_info), creates the watchdog thread (which will send a SIGUSR1 to threads that have been active for too long), and creates the designated new threads: _thread_per_group_rpc().

[2019-11-19T10:53:39.334] _thread_per_group_rpc: sending REQUEST_LAUNCH_PROLOG to r04c14
[2019-11-19T10:53:39.334] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r04c14

This new thread will split and group the list of nodes using the route plugin and the tree width parameter, and will call the network functions to send the RPC across the network. If specified in the agent info, it will wait for a response. When all the work is finished, the thread will die, the spawned agent will terminate, the log message:

[2019-11-19T10:53:39.349] agent: end agent ...

will be logged, and everybody will be happy. 

------------

But well, your situation is not happy.

Look at what happens after _thread_per_group_rpc:

[2019-11-19T10:53:39.334] _thread_per_group_rpc: sending REQUEST_BATCH_JOB_LAUNCH to r04c14
[2019-11-19T10:53:39.349] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_BATCH_JOB_LAUNCH protocol_version:8704
[2019-11-19T10:53:39.349] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:53:40.576] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:53:41.578] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:53:42.579] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
...

The agent_thread_cnt is 3 and does not decrease; this means the spawned agent for sending the REQUEST_LAUNCH_PROLOG is waiting for _thread_per_group_rpc, which is probably stuck in some network function.

Seconds later it is still there: one agent, three threads (itself, the watchdog and thread_per_group_rpc):

[2019-11-19T10:53:59.616] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:00.617] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0

Later, amid the storm of REQUEST_BUILD_INFO that I asked you to investigate in comment 31, it is still locked:

...
[2019-11-19T10:54:02.620] slurmctld_req: received opcode REQUEST_BUILD_INFO from 10.140.0.8:39842
[2019-11-19T10:54:02.620] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:02.621] _slurmctld_rpc_mgr: accept() connection from 10.140.0.8:39844
...

we even see new RPCs being served, but our three threads and 1 agent are still there:
...
[2019-11-19T10:54:33.968] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:SRUN_PING protocol_version:8704
[2019-11-19T10:54:33.968] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=0
...

and we finally see that:
[2019-11-19T10:54:37.962] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=0
[2019-11-19T10:54:38.964] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=6 retry_list_size=0
[2019-11-19T10:54:39.394] agent/is_node_resp: node:r04c14 RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure <----------------------------------
[2019-11-19T10:54:39.619] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_LAUNCH_PROLOG protocol_version:8704 <-----
[2019-11-19T10:54:39.619] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:39.965] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=0
[2019-11-19T10:54:40.967] agent_trigger: pending_wait_time=65534->999 mail_too=F->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=1
[2019-11-19T10:54:40.967] agent_trigger: pending_wait_time=999->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=3 retry_list_size=1

Our agent has terminated due to a communication connection failure.

----- Added to that, I can see the following:

We eventually see a communication problem with a node, and it is handled correctly: the count drops from 15 to 12 threads as that agent and its thread_cnt are destroyed:
..
[2019-11-19T10:55:27.735] agent/is_node_resp: node:r07c42 RPC:REQUEST_BATCH_JOB_LAUNCH : Communication connection failure <------------ AGAIN
..
[2019-11-19T10:55:27.959] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=2 agent_thread_cnt=15 retry_list_size=0
[2019-11-19T10:55:27.959] agent: New agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_TERMINATE_JOB protocol_version:8704
[2019-11-19T10:55:27.959] _thread_per_group_rpc: sending REQUEST_TERMINATE_JOB to r07c42
[2019-11-19T10:55:27.965] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_TERMINATE_JOB protocol_version:8704
[2019-11-19T10:55:27.965] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
..

After a while, we observe:

[2019-11-19T10:55:40.647] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
[2019-11-19T10:55:41.648] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
[2019-11-19T10:55:42.650] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=1 agent_thread_cnt=12 retry_list_size=0
[2019-11-19T10:55:43.056] _slurmctld_rpc_mgr: accept() connection from 10.140.0.11:56602
[2019-11-19T10:55:43.059] slurmctld_req: received opcode MESSAGE_NODE_REGISTRATION_STATUS from 10.140.0.11:56602
[2019-11-19T10:55:43.257] _wdog: agent maximum delay 61 seconds
[2019-11-19T10:55:43.258] agent: end agent thread_count:1 threads_active:0 retry:F get_reply:T msg_type:REQUEST_NODE_REGISTRATION_STATUS protocol_version:8704
[2019-11-19T10:55:43.258] agent_trigger: pending_wait_time=65534->60 mail_too=T->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=0
[2019-11-19T10:55:43.652] agent_trigger: pending_wait_time=65534->999 mail_too=F->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=1

The watchdog came in and finally terminated this agent:

[2019-11-19T10:55:43.257] _wdog: agent maximum delay 61 seconds


I see many _wdog entries coming in, which indicates your system is suffering some issue. I see many RPC storms in your logs, which makes me think you're saturating the IB cards or similar. Not only REQUEST_BUILD_INFO but also REQUEST_PARTITION_INFO and REQUEST_NODE_INFO.

If you want, I could provide a patch to log the _thread_per_group_rpc() and deeper calls, but given this:
]$ grep -c REQUEST_BUILD_INFO slurmctld_filtered_email_lines.log
30071
]$ grep -c REQUEST_PARTITION_INFO slurmctld_filtered_email_lines.log
5045
]$ grep -c REQUEST_NODE_INFO slurmctld_filtered_email_lines.log
5641

and that all of these RPCs happen in seconds or less, I am strongly inclined not to continue looking deeper into the code or applying patches to your production system until you find the cause of these massive RPC volumes.

Note: not only the IB card can be saturated; so can the max processes on the server, or a network switch.


I would like your opinion about this.

Thanks
Felip
Comment 36 CSC sysadmins 2019-12-04 03:42:24 MST
> [2019-11-19T10:53:03.020] _slurmctld_rpc_mgr: accept() connection from
> 10.140.0.8:38364
> 
> ]$ grep -r REQUEST_BUILD|wc -l
> 30071
> 
> Can you please investigate who is issuing this specific RPC so aggressively?

These requests came from a login node, but the "storm" is gone now. Users sometimes have "interesting" scripts/applications that monitor the queues and submit jobs, and we have no way to prevent users from using e.g. the squeue command without crippling functionality, which reminds me of an old enhancement request:

https://bugs.schedmd.com/show_bug.cgi?id=5080

Another issue is that slurmctld may have a lot of connections in TIME_WAIT state. One example we just saw: a user submits a job with a lot of steps but has a misconfigured path. The following example generates a lot of connections to slurmctld, and we do not see any way to prevent this:

for i in {1..32000};
do
   srun /typoed/path/gmx input_$i
done
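A defensive variant of the loop above (a sketch on the user-script side, not a fix in Slurm) would validate the binary once before submitting, so a typoed path fails fast instead of producing tens of thousands of connections:

```shell
#!/bin/bash
# Hedged sketch: check the executable once up front; only then run the
# 32000 submissions. The function name is illustrative.
submit_all() {
    local bin=$1
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "error: $bin not found, refusing to submit" >&2
        return 1
    fi
    for i in {1..32000}; do
        srun "$bin" "input_$i"
    done
}
# usage: submit_all /typoed/path/gmx
```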

I'll attach sdiag and netstat
Comment 37 CSC sysadmins 2019-12-04 04:24:56 MST
Created attachment 12468 [details]
sdiag
Comment 38 CSC sysadmins 2019-12-04 05:33:16 MST
Created attachment 12470 [details]
netstat -naptu
Comment 41 Nate Rini 2019-12-04 09:10:25 MST
*** Ticket 8061 has been marked as a duplicate of this ticket. ***
Comment 42 Felip Moll 2020-01-10 07:11:16 MST
Hello Tommi,

I haven't provided further feedback on this issue until now.
The current situation is that I have prepared a patch to improve how the agent queue is managed, which will probably mitigate what you experienced before.

As you know, the full explanation is in comment 33. My proposed patch is pending review by our QA team.

For this reason I am keeping this bug open until it is reviewed and accepted.

Thanks for your understanding.
Comment 47 Felip Moll 2020-02-26 11:08:51 MST
Hi Tommi,

We have been reviewing the issue with the QA team, but my proposed patch may not directly address your problem, since it is a performance improvement in the agent queue and your logs didn't show a high retry-list count. Now I'd like to know the status of your issue.

- Are you still seeing the "error: validate_node_specs: Prolog or job env setup failure on node xxx, draining the node" too often?

- Do you still see "RPC:REQUEST_LAUNCH_PROLOG : Communication connection failure" ?

- After comment 36, where you said the "RPC storm" ceased, have you experienced the very same issues without an evident storm?


PS: I still strongly believe there was an issue in the communication layer, as explained in comment 33.
Comment 48 CSC sysadmins 2020-02-27 00:55:08 MST
> - Are you still seeing the "error: validate_node_specs: Prolog or job env
> setup failure on node xxx, draining the node" too often?
> 
> - Do you still see "RPC:REQUEST_LAUNCH_PROLOG : Communication connection
> failure" ?

Yes, the problem has not disappeared. We had to write a script that runs in a loop and releases jobs which are held due to this bug.

 
> - After comment 36 where you said the "storm RPC" ceased, have you
> experienced the very same issues without an evident storm?

Yes, the problem is constant. One example: we have a 200-job submit limit per user, so a user invented a nice workaround:

parallel --progress -j 200 sbatch -W --options

The problem is that every sbatch command generates 3 RPCs at a 10 s interval, and if one multiplies that by 200x...

We see more and more "scripting suites/frameworks" that monitor the queue and submit jobs automatically, so this kind of frequent polling is going to increase in the future.


The overall situation is really bad; we cannot modify more than ~20 accounts with sacctmgr at a time, otherwise slurmctld may crawl for almost 30 minutes and the logs fill up with ugly errors.

Slurmctld leaks memory, but we cannot run it under valgrind due to performance problems
(version is latest from 19.05 branch)
4680 slurm     20   0   13,7g  12,6g   4112 S  12,5 20,1   6212:56 slurmctld
Comment 49 Felip Moll 2020-02-28 02:35:09 MST
> Slurmctld leaks memory but we cannot valgrind it due to performance problems
> (version is latest from 19.05 branch)
> 4680 slurm     20   0   13,7g  12,6g   4112 S  12,5 20,1   6212:56 slurmctld

Your situation is really bad, and that memory consumption is quite high.

Can you gdb attach to slurmctld and send me the output of "thread apply all bt" ?

Send me your latest slurmctld log and the slurmd log of a node where you've seen problems.

Also I want your latest slurm.conf and sdiag output captured every 10 seconds for 5 minutes. If you are able to make it fail during this period, that would be great, for example with the sacctmgr operation you mentioned. The slurmctld log covering this test is what I want in the end.
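The periodic sdiag capture can be scripted; this is a small generic sketch (the helper name and output format are illustrative):

```shell
#!/bin/bash
# Hedged sketch: run a command every <interval> seconds, <n> times,
# timestamping each sample, e.g. `capture sdiag 30 10 sdiag.log`
# for the 10-second / 5-minute capture requested above.
capture() {
    local cmd=$1 n=$2 interval=$3 outfile=$4
    for ((i = 1; i <= n; i++)); do
        { date '+%F %T'; $cmd; echo; } >> "$outfile"
        sleep "$interval"
    done
}
```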
Comment 50 Felip Moll 2020-02-28 03:44:57 MST
> Yes, problem is constant. One example, we have 200 job submit limit/user so
> user invented nice workaround:
> 
> parallel --progress -j 200 sbatch -W --options
> 
> Problem is that every sbatch command generates 3 RPCs in 10s interval and if
> one multiplies that with 200x...
> 
> We see more and more "scripting suites/frameworks" which are monitoring the
> queue and submit jobs automatically so this kind of frequent polling is
> going to be increase in future.

You are right... in my case this generates 2 RPCs x 200 every 10s:

REQUEST_FED_INFO
REQUEST_JOB_INFO_SINGLE

And as you know that's the polling mechanism of -W option.

How do you have the OS configured? I'm quite sure you already tuned it, but just in case, here is the HTC link: https://slurm.schedmd.com/high_throughput.html

Besides that, and depending on your current config, I am thinking of modifying the backfill parameters to let the RPCs be processed more quickly, with the trade-off of scheduling jobs less often. I will wait for your slurm.conf before making any suggestions.

In the meantime I am trying to reproduce your scenarios.
Comment 52 Felip Moll 2020-02-28 09:26:52 MST
I worked more on the code path that has to be triggered to raise your issue, and I strongly believe it is a communication problem caused by a system overwhelmed by a high number of RPCs and connections. In fact I created a synthetic reproducer which gives the exact problem you're seeing: the key was to force a sleep and timeout in the function that does the connect() syscall to a socket for the specific RPCs... exactly the same errors you have seen appeared.

The error of "Prolog or job env setup failure" and nodes drained is only the most visible face of this issue, but it has other implications, for example, in your logs I checked how many RPCs have "Communication Problem" message, and I got this:

]$ grep -ri "communication connection" slurmctld_filtered_email_lines.log|cut -d" " -f 4|sort |uniq -c
     77 RPC:REQUEST_ACCT_GATHER_UPDATE
      8 RPC:REQUEST_BATCH_JOB_LAUNCH
     20 RPC:REQUEST_LAUNCH_PROLOG
      1 RPC:REQUEST_TERMINATE_JOB


So this is not only affecting your submissions, but also your accounting, termination, and other RPCs.

You *really* want to tune your system to accept more connections at the OS level, but the real answer here is that you want to change your users' workflow to avoid querying the ctld with so many RPCs.

I am afraid there won't be any other solution...

In any case, send me the info I requested in my previous comments to see if there's any chance for the scheduler to shed a bit of work in order to serve RPCs more often, but I cannot guarantee it is a solution.

Your users are DDoS'ing your ctld.

Tell me what you think.
Comment 56 CSC sysadmins 2020-03-09 09:28:12 MDT
Created attachment 13307 [details]
current slurm.conf

Hi,

Sorry for the delay, I was on holiday last week. We have verified that all recommended HTC settings are set on slurmctld.

-Tommi
Comment 57 Felip Moll 2020-03-09 09:53:25 MDT
(In reply to Tommi Tervo from comment #56)
> Created attachment 13307 [details]
> current slurm.conf
> 
> Hi,
> 
> Sorry for the delay, I was on holidays last week. We have verified that all
> recommended HPC settings are set on slurmctld. 
> 
> -Tommi

Note I referred mainly to the OS configuration, not to the ctld, i.e. system/daemon/net limits...
Would you be able to apply a debug patch if I create it?
Comment 58 CSC sysadmins 2020-03-10 00:04:44 MDT
> Note I referred mainly to OS configuration, no to ctld, i.e.
> system/daemon/net limits...

Sorry for being unclear, I meant that we have checked those OS configurations/tunings which are on the HTC page.

> Would you be able to apply a debug patch if I create it?

Of course unless it can be harmful for production.

-Tommi
Comment 59 Felip Moll 2020-03-12 11:55:16 MDT
Created attachment 13359 [details]
dbg7928_errno.patch

That's a pretty simple debug patch restricted to LAUNCH_PROLOG RPCs (so it shouldn't cause any performance impact) which will let us see the real errno after the connect() syscall, because right now we overwrite it in certain situations and replace it with SLURM_COMMUNICATIONS_CONNECTION_ERROR.

Can you apply this patch, run at the debug log level, and send me the logs when you see the issue again?

Hopefully it will give us a lead and an idea on where to look next.

Thanks!!
Comment 60 CSC sysadmins 2020-03-13 08:38:30 MDT
(In reply to Felip Moll from comment #59)
> Created attachment 13359 [details]
> dbg7928_errno.patch
> 
> That's a pretty simple debug patch restricted to LAUNCH_PROLOG RPCs (so
> shouldn't cause any performance impact) which will let us see the real errno
> after the connect() syscall, because now we're overwriting this in certain
> situations and replace with SLURM_COMMUNICATIONS_ERROR.
> 
> Can you apply this patch, run in debuglevel debug, and send me the logs when
> you see the issue again?

Hi,

I applied the patch to our debug build, which should also catch memory leak errors. It was running for about 5 hours today without a hit. I'll continue on Monday, have a nice weekend.

-Tommi
Comment 62 Felip Moll 2020-03-16 10:35:53 MDT
Tommi, I've seen your logs in bug 8597, and there the only issue I can see is when I grep "update assoc" in the logfile, where you will see there are a lot of updates.

On our specific issue, *all* prologs were running fine. You can see that by grepping the logs:

]$ grep -ri bug7928 slurmctld.log


We can also see that there isn't any retry; this is always 0:

..retries slurm_open_msg_conn=0..

so the issue wasn't triggered that time for us.

I think you are facing other issues, and these affect bugs 8597 and 7928 as well as your performance.

Let's keep this debug level, and if you catch it just send me the logs again.
Comment 63 CSC sysadmins 2020-03-17 02:53:47 MDT
> I think you are facing other issues and these affects bug 8597 and 7928 and
> your performance.
> 
> Let's keep with this debug level and if you catch it just send me the logs
> again.

Hi,

I recompiled slurm without address sanitizer flags so I can keep debug logging on for a longer time. I'll post results when the prolog error happens.
Comment 64 CSC sysadmins 2020-03-17 04:18:43 MDT
Created attachment 13393 [details]
prolog failure debug log
Comment 65 Felip Moll 2020-03-17 06:53:40 MDT
(In reply to Tommi Tervo from comment #64)
> Created attachment 13393 [details]
> prolog failure debug log

We need more logs!

Look, at 11:19:09 there's the Allocate for jobid=1663914 on node r03g03.

(I guess r03g03 has ip addr 10.140.0.183, can you confirm that?)

Immediately we see how the REQUEST_LAUNCH_PROLOG is issued:

[2020-03-17T11:19:09.382] debug:  BUG7928 slurm_send_addr_recv_msgs: pid 11849 going to connect to 10.140.0.183:6818 to send REQUEST_LAUNCH_PROLOG

but... we don't see the continuation of this connect.

The loop looks like this:

	if (msg->msg_type == REQUEST_LAUNCH_PROLOG) {
		slurm_print_slurm_addr(&msg->address, addr_str,
				       sizeof(addr_str));
		debug("BUG7928 %s: pid %d going to connect to %s to send %s", <--------------------------------------------Message seen
		      __func__, getpid(), addr_str,
		      rpc_num2string(msg->msg_type));
	}
	/* This connect retry logic permits Slurm hierarchical communications
	 * to better survive slurmd restarts */
	for (i = 0; i <= conn_timeout; i++) {
		if (i)
			sleep(1);
		fd = slurm_open_msg_conn(&msg->address);  <---------------------------------- That's a "connect()" followed by a "poll()"
		if ((fd >= 0) || (errno != ECONNREFUSED))
			break;
		if (i == 0)
			debug3("connect refused, retrying");
	}
	if (msg->msg_type == REQUEST_LAUNCH_PROLOG)
		debug("BUG7928 %s:retries slurm_open_msg_conn=%d, conn_timeout=%d, fd=%d, errno: %s", <-----------------------I expect this
		      __func__, i, conn_timeout, fd, slurm_strerror(errno));



So you must agree that after that message we expect a "BUG7928 %s:retries slurm_open_msg_conn=%d, conn_timeout=%d, fd=%d, errno: %s" line. The last timestamp in the logs is [2020-03-17T11:20:02.501], so after 53 seconds we still haven't seen it. The prolog then fails at 50 s because of the watchdog, so we can conclude that the connect() syscall to 10.140.0.183:6818 is taking a long time.

The poll() is not responsible, because I haven't seen any "slurm_connect poll failed" or "slurm_connect poll timeout" in your logs before, so this path is discarded.

Please send me more of the log, up to where you see the missing counterpart: "BUG7928 %s: retries.....".

Question: Do you have syncookies enabled on the server? Is the MTU/MSS of the server interfaces the same as on the nodes?


The issue you are experiencing seems not directly related to the Slurm code, but to your system being overwhelmed, a network misconfiguration, a failing card, a switch, a firewall, or something similar.

Please respond to my questions here; I am curious about what's going on.
Comment 66 Felip Moll 2020-03-17 11:28:28 MDT
Also, I'd like to know your current kernel version: are you still at 3.10.0-957.2?

What are your net.ipv4.tcp_max_syn_backlog and net.core.somaxconn values?

When it is happening, it would be nice to catch the output of:

]$ ss -n state syn-recv dport = :6818
]$ ss -plnt dport = :6818
Comment 68 CSC sysadmins 2020-03-18 02:11:19 MDT
Created attachment 13407 [details]
Longer log with connection timeout msgs
Comment 69 CSC sysadmins 2020-03-18 02:27:49 MDT
> Look, at 11:19:09 there's the Allocate for jobid=1663914 on node r03g03.
> 
> (I guess r03g03 has ip addr 10.140.0.183, can you confirm that?)

[root@r03g03 ~]# ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
inet 10.140.0.183  netmask 255.255.240.0  broadcast 10.140.15.255 txqueuelen 16384  (InfiniBand)
RX packets 8155441  bytes 10155560555 (9.4 GiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 3320658  bytes 1054565732 (1005.7 MiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

mslurm1: 
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
inet6 fe80::a00:3800:1:1  prefixlen 64  scopeid 0x20<link>
 txqueuelen 16384  (InfiniBand)
RX packets 1075417805  bytes 141156091229 (131.4 GiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 4152895038  bytes 6952160941567 (6.3 TiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0:23: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
inet 10.140.13.24  netmask 255.255.240.0  broadcast 10.140.15.255 txqueuelen 16384  (InfiniBand)


> Immediately we see how the REQUEST_LAUNCH_PROLOG is issued:
> 
> [2020-03-17T11:19:09.382] debug:  BUG7928 slurm_send_addr_recv_msgs: pid
> 11849 going to connect to 10.140.0.183:6818 to send REQUEST_LAUNCH_PROLOG
> 
> but... we don't see the continuation of this connect.

> 
> So you must agree that after the message we are expecting a "BUG7928
> %s:retries slurm_open_msg_conn=%d, conn_timeout=%d, fd=%d, errno: %s". The
> last timestamp in logs is [2020-03-17T11:20:02.501], so after 53 seconds we
> still haven't seen it. Then, the prolog fails at 50s because of the watchdog
> therefore we can conclude here that the "connect()" syscall to
> 10.140.0.183:6818 is taking a long time.
> 
> The poll() is not responsible, because I haven't seen any "slurm_connect
> poll failed" or "slurm_connect poll timeout" in your logs before, so this
> path is discarded.
> 
> Please, send me back more log, until you see the missing counterpart:
> "BUG7928 %s: retries.....".

It seems that it times out after one minute:

[2020-03-17T11:20:09.442] debug:  BUG7928 slurm_send_addr_recv_msgs:retries slurm_open_msg_conn=0, conn_timeout=10, fd=-1, errno: Connection timed out
[2020-03-17T11:20:10.075] server_thread_count over limit (256), waiting
[2020-03-17T11:20:10.128] backfill: continuing to yield locks, 256 RPCs pending
 
> Question: Do you have syncookies enabled on the server? Is the MTU/MSS of
> the server interfaces the same than on the nodes?

Syncookies were not enabled; I tested that a long time ago but the bug occurred with them too. I've enabled them again and am trying to catch the logs with ss logging.

> The issue you are experiencing seems not directly related to Slurm code, but
> to your system being overwhelmed, a misconfiguration on the network, a
> failing card, a switch, a firewall, or something similar.

It's really puzzling, because I do not see any dropped/error packets and the problem is occurring on nodes all over the cluster.

--- r03g03.bullx ping statistics ---
15 packets transmitted, 15 received, 0% packet loss, time 14000ms
rtt min/avg/max/mdev = 0.061/0.121/0.153/0.030 ms
[root@mslurm1 ~ (PUHTI) ]# uname -a
Linux mslurm1.bullx 3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

ibdiagnet report:

Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                 0          0
-I- Lids Check                0          0
-I- Links Check               0          0
-I- Subnet Manager            0          0
-I- Port Counters             0          0
-I- Nodes Information         0          0
-I- Speed / Width checks      0          0
-I- Alias GUIDs               0          0
-I- Virtualization            0          0
-I- Partition Keys            0          0
-I- Temperature Sensing       0          0
-I- Routing                   0          0
Comment 70 CSC sysadmins 2020-03-18 02:32:15 MDT
(In reply to Felip Moll from comment #66)
> Also I'd like to know your current kernel versions, are you still at
> 3.10.0-957.2?

The cluster OS stack was updated to RHEL 7.7 in December.

3.10.0-1062.7.1.el7.x86_64 #1 SMP Wed Nov 13 08:44:42 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

> What are your your sysctl_max_syn_backlog and net.core.somaxconn?

[root@mslurm1 ~ (PUHTI) ]# cat /proc/sys/net/ipv4/tcp_synack_retries
5
[root@mslurm1 ~ (PUHTI) ]# cat /proc/sys/net/ipv4/tcp_syn
tcp_syn_retries     tcp_synack_retries  tcp_syncookies
[root@mslurm1 ~ (PUHTI) ]# cat /proc/sys/net/ipv4/tcp_syncookies
1
[root@mslurm1 ~ (PUHTI) ]# sysctl -a 2>/dev/null|grep net.core.somaxconn
net.core.somaxconn = 16384
[root@mslurm1 ~ (PUHTI) ]# sysctl -a 2>/dev/null|grep max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 16384

> When it is happening it would be nice to catch output of:
> 
> ]$ ss -n state syn-recv dport = :6818
> ]$ ss -plnt dport = :6818

I set up a loop which logs the output of these commands at a 10 s interval.
Comment 71 CSC sysadmins 2020-03-18 06:48:30 MDT
Created attachment 13410 [details]
Debug logs with +Protocol

Compute node journal:

-- Unit user-0.slice has finished shutting down.
Mar 18 12:40:14 r07c06.bullx root[41382]: start slurm_prolog for job: 1676643
Mar 18 12:40:14 r07c06.bullx root[41392]: end slurm_prolog for job: 1676643
Mar 18 12:40:15 r07c06.bullx root[41897]: start slurm_epilog for job: 1676643
Mar 18 12:40:15 r07c06.bullx root[41934]: ending_2 slurm_epilog for job: 1676643.
Mar 18 12:42:39 r07c06.bullx root[41967]: start slurm_prolog for job: 1676714
Mar 18 12:42:39 r07c06.bullx root[41977]: end slurm_prolog for job: 1676714
Mar 18 12:42:42 r07c06.bullx root[42602]: start slurm_prolog for job: 1676715
Mar 18 12:42:42 r07c06.bullx root[42612]: end slurm_prolog for job: 1676715
Mar 18 12:42:43 r07c06.bullx root[43413]: start slurm_epilog for job: 1676715
Mar 18 12:42:44 r07c06.bullx root[43450]: ending_2 slurm_epilog for job: 1676715.
Mar 18 12:43:05 r07c06.bullx root[43462]: start slurm_epilog for job: 1676712
Mar 18 12:43:05 r07c06.bullx root[43501]: ending_2 slurm_epilog for job: 1676712.
Mar 18 12:43:41 r07c06.bullx root[43594]: start slurm_epilog for job: 1676356
Mar 18 12:43:41 r07c06.bullx root[43631]: ending_2 slurm_epilog for job: 1676356.
Mar 18 12:45:01 r07c06.bullx systemd[1]: Created slice User Slice of root.
-- Subject: Unit user-0.slice has finished start-up


I had that ss loop running, but it did not report anything interesting:
12:42:03 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
12:42:13 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
12:42:23 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
12:42:33 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
12:42:43 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
12:42:53 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
State      Recv-Q Send-Q Local Address:Port               Peer Address:Port
12:43:03 :
Netid  Recv-Q Send-Q Local Address:Port               Peer Address:Port
Comment 72 Felip Moll 2020-03-18 07:06:36 MDT
TCP syncookies won't help if the SYN queue is not full, and if ss doesn't show anything I wouldn't set syncookies.

Which model of IB switches and MCA do you have?
Diagnosing switch issues is out of my scope and knowledge, but it would be worth looking at; in past jobs I have seen weird issues like this one that in the end came down to a failing card in a switch.

The basic facts here are:

1. We do have timeouts in connect() syscall.
2. It doesn't happen only for REQUEST_LAUNCH_PROLOG, but this is the case that gives noticeable symptoms (comment 52).
3. You use TCP over IB rather than a Gb network for management; I am not saying this is an issue, but we need to keep it in mind.
4. Bug 8597 derived from this bug. I cannot say it is related, but it is something to take into consideration.
5. I think the Slurm code is fine, because of point 2.

What do you think?
Comment 73 Felip Moll 2020-03-19 06:43:45 MDT
I want to make a clarification.

In _slurm_connect we are setting the socket to be O_NONBLOCK.

fcntl(__fd, F_SETFL, flags | O_NONBLOCK).

This makes connect() return immediately, and then we really wait in poll() with the TCPTimeout.

If debug2 were enabled we should see:
 error("slurm_connect poll failed: %m");  with errno being ETIMEDOUT.

That doesn't change anything.. but...

Marcin, who is working on a very similar bug 8698, suggested that maybe we could retry on timeout.

Your TCPTimeout is 60, which would make a second retry take too long if it doesn't work, and could accumulate many threads.

A derived idea we could try is to retry the wait once more after the first timeout, but with a lower timeout. I am studying this idea now and will let you know.
Comment 74 CSC sysadmins 2020-03-19 07:01:50 MDT
> Marcin which is working on a very similar bug 8698 suggested that maybe we
> could try a retry on timedout.
> 
> Your tcptimeout is 60, that would make a second retry to be too long if it
> doesn't work and could accumulate many threads.
> 
> a derived idea we could try is to retry the wait once more after first
> timeout but with a lower timeout. I am studying now this idea and will let
> you know.

Hi,

I was going to propose a similar idea, but I wanted to discuss it first with my colleague who has experience writing his own RTOS; unfortunately he's sick at the moment. Btw, bug 8698 does not look related??

-Tommi
Comment 75 Felip Moll 2020-03-19 09:03:58 MDT
> I was going to propose similar idea but I wanted to discuss first with my
> colleague who has experience of making own RTOS but he's unfortunately sick
> atm. Btw, bug 8698 does not look related??
> 
> -Tommi

Sorry for the typo, it was bug 8398
Comment 77 Felip Moll 2020-03-23 03:52:19 MDT
Tommi, in bug 8398 (which is private) we are trying a patch with a concept similar to the one I explained in comment 73.

If you agree, we will wait to see how it works there, and depending on the results I will propose applying the same patch or a variant to your situation.

Results will come probably this week or the other, depending on the time to implement that and reproduce the issue.
Comment 78 Felip Moll 2020-04-02 10:08:53 MDT
I was going to suggest that you apply the mentioned patch, but after seeing the progress in bug 8597 I prefer to wait so as not to mix both issues.

Your system seems to be under quite a lot of RPC pressure, so I will work with my colleagues to find out how we can help in 8597, and will put this bug on hold.

I suspect it will end up being a dup of the other one.
Comment 79 Felip Moll 2020-08-21 09:06:23 MDT
Hi Tommi,

I am wondering whether, after changing the way you use sacctmgr, you're still facing the issue described in this bug.

Sorry for my long delay in giving a response.
Comment 80 CSC sysadmins 2020-08-23 23:56:57 MDT
 (In reply to Felip Moll from comment #79)
> Hi Tommi,
> 
> I am wondering if after changing the way you do sacctmgr you're still facing
> the issue described in this bug.

Hi,

The problem is still there; we have a script which resumes a drained node when this problem occurs.

-Tommi
Comment 81 Felip Moll 2020-08-25 03:17:48 MDT
Created attachment 15588 [details]
bug8398_1905_v4.patch

Tommi,

The attached patch has proven useful at Stanford, where they faced similar issues. It has not yet been reviewed and accepted by our QA team, but it seems safe.

Is it possible for you to apply it and give it a try?

I will be out of office this next week so I will ping my colleague Marcin to keep track of the bug.

Let me know what you see after applying it.

Thanks
Comment 82 Felip Moll 2020-10-12 04:38:38 MDT
(In reply to Felip Moll from comment #81)
> Created attachment 15588 [details]
> bug8398_1905_v4.patch
> 
> Tommi,
> 
> The attached patch has proven to be useful for Stanford, where they faced
> similar issues. It is not yet reviewed and accepted by our QA team but it
> seems safe.
> 
> Is it possible for you to apply it and give a try?
> 
> I will be out of office this next week so I will ping my colleague Marcin to
> keep track of the bug.
> 
> Let me know what you see after applying it.
> 
> Thanks

Hi Tommi,

Was there any chance to try this out?

Thanks
Comment 83 CSC sysadmins 2020-10-13 06:27:47 MDT
> There was any chance to try this out?

Hi,

I applied the patch to our Slurm build, but we had no time to install it. Next we have a 19.05 -> 20.02.x upgrade ahead; the actual date is not yet decided. We may first try without the patch, even though it seems to apply to the 20.02.x series:

slurm-20.02.5-3b1897f]$ patch -p1  < ../puhti-slurm-patches/7928.patch 
patching file src/common/slurm_protocol_api.c
Hunk #1 succeeded at 4928 (offset -18 lines).
Hunk #2 succeeded at 4937 (offset -18 lines).
patching file src/slurmd/slurmd/req.c
Hunk #1 succeeded at 2396 (offset -20 lines).
patching file src/common/slurm_protocol_api.c
Hunk #1 succeeded at 4938 (offset -18 lines).
patching file src/slurmd/slurmd/slurmd.c
Hunk #1 succeeded at 461 (offset 7 lines).
Comment 84 Felip Moll 2020-10-13 08:56:41 MDT
(In reply to Tommi Tervo from comment #83)
> > There was any chance to try this out?
> 
> Hi,
> 
> I applied the patch to our Slurm build but we had no time to install it.
> Next we have 19.05 -> 20.02.x upgrade ahead, actual date is not yet decided.
> We may first try without the patch even it seems to apply to 20.02.x series:
> 
> slurm-20.02.5-3b1897f]$ patch -p1  < ../puhti-slurm-patches/7928.patch 
> patching file src/common/slurm_protocol_api.c
> Hunk #1 succeeded at 4928 (offset -18 lines).
> Hunk #2 succeeded at 4937 (offset -18 lines).
> patching file src/slurmd/slurmd/req.c
> Hunk #1 succeeded at 2396 (offset -20 lines).
> patching file src/common/slurm_protocol_api.c
> Hunk #1 succeeded at 4938 (offset -18 lines).
> patching file src/slurmd/slurmd/slurmd.c
> Hunk #1 succeeded at 461 (offset 7 lines).

It is not yet committed into 20.02, but I think your plan is good. First try 20.02 without the patch and see what happens. If it is still occurring, just apply it.

Can we close this issue, and you can reopen it when you have more feedback for me?

Thanks
Comment 85 CSC sysadmins 2020-10-14 07:40:22 MDT
Ok, I'll reopen if this still happens with 20.05 + this patch.
Comment 86 Felip Moll 2020-10-27 06:00:18 MDT
This is to inform you that the final patch has been committed:

The main part of the patch (solving the issue) is now merged into the slurm-20.02 branch and will be released in Slurm 20.02.6. Commits a4e65900d69..04eb33f1d11.

Additional logging went to our master branch and is scheduled for Slurm 20.11, commit 37b46a351ac83f.

There's no need to apply it manually anymore starting from the mentioned release.

Thanks