| Summary: | Extern step out of memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | lhuang |
| Component: | slurmctld | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NY Genome | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurmctld logs, slurm conf, slurmd log, cgroup conf | | |
Created attachment 16996 [details]
slurm conf
Hi,
> Nearly 99% of the jobs end up with OUT_OF_MEMORY in the job extern step.
I'm not yet sure why you are getting OUT_OF_MEMORY, but it seems that you could be facing several issues.
I would need the logs from one slurmd where you get those OUT_OF_MEMORY errors for further investigation.
Besides the OUT_OF_MEMORY issue, the slurmctld logs you already provided suggest that you also have some kind of network problem, for example:
[2020-11-28T14:21:19.518] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2020-12-01T10:46:54.202] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-01T10:46:54.202] error: slurm_set_addr: Unable to resolve "pe2cc2-005"
[2020-12-01T10:46:54.202] error: slurm_get_port: Address family '0' not supported
[2020-12-01T10:46:54.202] error: _set_slurmd_addr: failure on pe2cc2-005
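For example, name resolution for that node could be checked from the slurmctld host with standard tools (a minimal sketch; pe2cc2-005 is just the node taken from the log lines above):
getent hosts pe2cc2-005                                # does the hostname resolve on the controller?
scontrol show node pe2cc2-005 | grep -i nodeaddr       # what address slurmctld has configured for it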
Could you enable DebugFlags=Network to get more insights about this? (A short sketch of how to enable it follows these questions.)
Are you only getting them on the .extern step?
Do you have some kind of clock synchronization system like NTP?
Could you run "sdiag" and attach the output?
Could you attach your cgroup.conf?
And as mentioned above, could you attach some slurmd log?
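For reference, a minimal sketch of enabling that debug flag (two common routes; adjust to your setup):
# In slurm.conf, then apply with "scontrol reconfigure":
DebugFlags=Network
# Or at runtime (affects the running slurmctld only), without editing slurm.conf:
scontrol setdebugflags +Network
scontrol show config | grep -i debugflags   # verify the active flags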
Regards,
Albert
Created attachment 17036 [details]
slurmd log
I've attached the slurmd log.
Please ignore the nodes that are missing; they are just dead nodes that we have not yet removed from the Slurm cluster. We add node ranges using the regex format, so we haven't taken the broken ones out yet.
We have NTP servers and all HPC nodes are in sync.
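For reference, sync can be spot-checked on a node with standard tools, depending on the NTP client in use (nothing Slurm-specific here):
chronyc tracking      # if the node runs chrony: shows the current offset from the NTP source
timedatectl status    # on systemd systems: reports whether the system clock is NTP-synchronized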
[root@pe2-slurm01 ~]# sdiag
*******************************************************
sdiag output at Tue Dec 08 13:05:01 2020 (1607450701)
Data since Tue Dec 08 12:56:48 2020 (1607450208)
*******************************************************
Server thread count: 16
Agent queue size: 0
Agent count: 2
Agent thread count: 6
DBD Agent queue size: 0
Jobs submitted: 3505
Jobs started: 716
Jobs completed: 222
Jobs canceled: 0
Jobs failed: 0
Job states ts: Tue Dec 08 13:04:31 2020 (1607450671)
Jobs pending: 4748
Jobs running: 1214
Main schedule statistics (microseconds):
Last cycle: 100851
Max cycle: 254151
Total cycles: 225
Mean cycle: 102266
Mean depth cycle: 101
Cycles per minute: 28
Last queue length: 3114
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 112
Total backfilled jobs (since last stats cycle start): 112
Total backfilled heterogeneous job components: 0
Total cycles: 14
Last cycle when: Tue Dec 08 13:04:58 2020 (1607450698)
Last cycle: 4370455
Max cycle: 5456671
Mean cycle: 2630629
Last depth cycle: 1179
Last depth cycle (try sched): 255
Depth Mean: 3073
Depth Mean (try depth): 236
Last queue length: 3115
Queue length mean: 3262
Last table size: 15
Mean table size: 16
Latency for 1000 calls to gettimeofday(): 709 microseconds
Remote Procedure Call statistics by message type
REQUEST_PARTITION_INFO ( 2009) count:1349 ave_time:767 total_time:1035750
REQUEST_NODE_INFO_SINGLE ( 2040) count:1180 ave_time:1055 total_time:1245748
REQUEST_COMPLETE_PROLOG ( 6018) count:715 ave_time:68760 total_time:49163698
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:220 ave_time:200554 total_time:44122077
REQUEST_STEP_COMPLETE ( 5016) count:215 ave_time:64322 total_time:13829258
MESSAGE_EPILOG_COMPLETE ( 6012) count:211 ave_time:228675 total_time:48250460
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:197 ave_time:3958 total_time:779790
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:160 ave_time:3353 total_time:536592
REQUEST_NODE_INFO ( 2007) count:102 ave_time:62817 total_time:6407388
REQUEST_STATS_INFO ( 2035) count:50 ave_time:504 total_time:25202
REQUEST_JOB_INFO ( 2003) count:50 ave_time:96940 total_time:4847029
REQUEST_JOB_USER_INFO ( 2039) count:17 ave_time:676682 total_time:11503595
REQUEST_FED_INFO ( 2049) count:17 ave_time:1069 total_time:18186
REQUEST_JOB_READY ( 4019) count:4 ave_time:230 total_time:920
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:4 ave_time:2653 total_time:10615
REQUEST_RESOURCE_ALLOCATION ( 4001) count:3 ave_time:6000 total_time:18001
REQUEST_JOB_STEP_CREATE ( 5001) count:2 ave_time:815 total_time:1630
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:2 ave_time:430 total_time:861
REQUEST_RECONFIGURE ( 1003) count:1 ave_time:610380 total_time:610380
Remote Procedure Call statistics by user
root ( 0) count:4273 ave_time:39857 total_time:170312291
eflynn ( 50321) count:163 ave_time:3377 total_time:550574
wliao ( 5042) count:48 ave_time:206041 total_time:9889975
ahawkins ( 50631) count:8 ave_time:1813 total_time:14506
rraviram ( 50946) count:3 ave_time:389 total_time:1167
jrahman ( 50867) count:3 ave_time:545431 total_time:1636295
mbyrska-bishop ( 20158) count:1 ave_time:2372 total_time:2372
Pending RPC statistics
Created attachment 17037 [details]
cgroup conf
Hi,
With the logs from slurmd I can confirm that your issue is a duplicate of bug 10255 comment 21. Related details can also be seen in bug 10336.
The fix is already on GitHub and will be part of 20.11.1, which will be released very soon:
- https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1
I'm marking this bug as a duplicate of bug 10255.
*** This ticket has been marked as a duplicate of ticket 10255 ***
Created attachment 16995 [details]
slurmctld logs
Nearly 99% of the jobs end up with OUT_OF_MEMORY in the job extern step.
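For context, a minimal sketch of how this shows up per job in the accounting records (the job ID below is hypothetical; the field list is just one reasonable choice):
sacct -j 1234567 --format=JobID,JobName,State,ExitCode,MaxRSS
# The 1234567.extern step is the one reported with State=OUT_OF_MEMORY.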