Ticket 10380

Summary: Extern step out of memory
Product: Slurm    Reporter: lhuang
Component: slurmctld    Assignee: Albert Gil <albert.gil>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Priority: ---
Version: 20.11.0
Hardware: Linux
OS: Linux
Site: NY Genome
Attachments: slurmctld logs
slurm conf
slurmd log
cgroup conf

Description lhuang 2020-12-07 09:27:13 MST
Created attachment 16995 [details]
slurmctld logs

Nearly 99% of the jobs end up with OUT_OF_MEMORY in the job's extern step.
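
For reference, the per-step states can be inspected with sacct (a minimal sketch; the job ID is hypothetical):

  # list the state of each step; the extern step appears as <jobid>.extern
  sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS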
Comment 1 lhuang 2020-12-07 09:27:34 MST
Created attachment 16996 [details]
slurm conf
Comment 3 Albert Gil 2020-12-08 04:24:33 MST
Hi,

> Nearly 99% of the jobs ends up with OUT_OF_MEMORY in job extern step.

I'm not yet sure why you are getting OUT_OF_MEMORY, but it seems that you could be facing several issues.
For further investigation I would need the log of one slurmd where you get those OUT_OF_MEMORY errors.

Besides the OUT_OF_MEMORY issue, the slurmctld logs you already provided suggest that you are also having some kind of network problem, for example:

[2020-11-28T14:21:19.518] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2020-12-01T10:46:54.202] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-01T10:46:54.202] error: slurm_set_addr: Unable to resolve "pe2cc2-005"
[2020-12-01T10:46:54.202] error: slurm_get_port: Address family '0' not supported
[2020-12-01T10:46:54.202] error: _set_slurmd_addr: failure on pe2cc2-005
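
The "Unable to resolve" and "Address family '0' not supported" errors indicate that slurmctld cannot look up that hostname. A quick sketch of how this could be checked from the slurmctld host, using the node name from the log above:

  # can the controller resolve the node name at all?
  getent hosts pe2cc2-005
  # what address does Slurm currently have for the node?
  scontrol show node pe2cc2-005 | grep -i addr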


Could you enable DebugFlags=Network to get more insight into this? (See the sketch after these questions.)
Are you only getting them on the .extern step?
Do you have some kind of clock synchronization system like NTP?
Could you run "sdiag" and attach the output?
Could you attach your cgroup.conf?
And as mentioned above, could you attach some slurmd log?
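
A minimal sketch of enabling the Network debug flag, either persistently in slurm.conf or temporarily at runtime (both are standard mechanisms, shown here only as an illustration):

  # persistent: add to slurm.conf on the controller, then reconfigure
  DebugFlags=Network
  scontrol reconfigure

  # or toggle it at runtime without editing slurm.conf
  scontrol setdebugflags +network
  # revert once the logs have been collected
  scontrol setdebugflags -network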

Regards,
Albert
Comment 4 lhuang 2020-12-08 11:05:25 MST
Created attachment 17036 [details]
slurmd log

I've attached the slurmd log. 

Please ignore the nodes that are missing: they are dead nodes that we have not yet removed from the Slurm cluster. We add nodes as regex-style ranges, so the broken ones have not been taken out of the configuration yet.

We have NTP servers and all HPC nodes are in sync.

[root@pe2-slurm01 ~]# sdiag
*******************************************************
sdiag output at Tue Dec 08 13:05:01 2020 (1607450701)
Data since      Tue Dec 08 12:56:48 2020 (1607450208)
*******************************************************
Server thread count:  16
Agent queue size:     0
Agent count:          2
Agent thread count:   6
DBD Agent queue size: 0

Jobs submitted: 3505
Jobs started:   716
Jobs completed: 222
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Tue Dec 08 13:04:31 2020 (1607450671)
Jobs pending:   4748
Jobs running:   1214

Main schedule statistics (microseconds):
	Last cycle:   100851
	Max cycle:    254151
	Total cycles: 225
	Mean cycle:   102266
	Mean depth cycle:  101
	Cycles per minute: 28
	Last queue length: 3114

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
	Total backfilled jobs (since last slurm start): 112
	Total backfilled jobs (since last stats cycle start): 112
	Total backfilled heterogeneous job components: 0
	Total cycles: 14
	Last cycle when: Tue Dec 08 13:04:58 2020 (1607450698)
	Last cycle: 4370455
	Max cycle:  5456671
	Mean cycle: 2630629
	Last depth cycle: 1179
	Last depth cycle (try sched): 255
	Depth Mean: 3073
	Depth Mean (try depth): 236
	Last queue length: 3115
	Queue length mean: 3262
	Last table size: 15
	Mean table size: 16

Latency for 1000 calls to gettimeofday(): 709 microseconds

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO                  ( 2009) count:1349   ave_time:767    total_time:1035750
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:1180   ave_time:1055   total_time:1245748
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:715    ave_time:68760  total_time:49163698
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:220    ave_time:200554 total_time:44122077
	REQUEST_STEP_COMPLETE                   ( 5016) count:215    ave_time:64322  total_time:13829258
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:211    ave_time:228675 total_time:48250460
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:197    ave_time:3958   total_time:779790
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:160    ave_time:3353   total_time:536592
	REQUEST_NODE_INFO                       ( 2007) count:102    ave_time:62817  total_time:6407388
	REQUEST_STATS_INFO                      ( 2035) count:50     ave_time:504    total_time:25202
	REQUEST_JOB_INFO                        ( 2003) count:50     ave_time:96940  total_time:4847029
	REQUEST_JOB_USER_INFO                   ( 2039) count:17     ave_time:676682 total_time:11503595
	REQUEST_FED_INFO                        ( 2049) count:17     ave_time:1069   total_time:18186
	REQUEST_JOB_READY                       ( 4019) count:4      ave_time:230    total_time:920
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:4      ave_time:2653   total_time:10615
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:3      ave_time:6000   total_time:18001
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:2      ave_time:815    total_time:1630
	REQUEST_JOB_ALLOCATION_INFO             ( 4014) count:2      ave_time:430    total_time:861
	REQUEST_RECONFIGURE                     ( 1003) count:1      ave_time:610380 total_time:610380

Remote Procedure Call statistics by user
	root            (       0) count:4273   ave_time:39857  total_time:170312291
	eflynn          (   50321) count:163    ave_time:3377   total_time:550574
	wliao           (    5042) count:48     ave_time:206041 total_time:9889975
	ahawkins        (   50631) count:8      ave_time:1813   total_time:14506
	rraviram        (   50946) count:3      ave_time:389    total_time:1167
	jrahman         (   50867) count:3      ave_time:545431 total_time:1636295
	mbyrska-bishop  (   20158) count:1      ave_time:2372   total_time:2372

Pending RPC statistics
Comment 5 lhuang 2020-12-08 11:05:51 MST
Created attachment 17037 [details]
cgroup conf
Comment 8 Albert Gil 2020-12-09 07:45:07 MST
Hi,

With the slurmd logs I can confirm that your issue is a duplicate of bug 10255 comment 21. Related details can also be seen in bug 10336.

The fix is already on GitHub and will be part of 20.11.1, which will be released very soon:

- https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1

I'm marking this bug as a duplicate of bug 10255.

*** This ticket has been marked as a duplicate of ticket 10255 ***