Created attachment 16995 [details]
slurmctld logs

Nearly 99% of the jobs end up with OUT_OF_MEMORY in the job extern step.
Created attachment 16996 [details]
slurm conf
Hi,

> Nearly 99% of the jobs end up with OUT_OF_MEMORY in the job extern step.

I'm not yet sure why you are getting OUT_OF_MEMORY, but it seems that you could be facing several issues. I would need the logs of one slurmd where you get those OUT_OF_MEMORY errors for further investigation.

Besides the OUT_OF_MEMORY issue, the slurmctld logs you already provided point to some kind of network problem, for example:

[2020-11-28T14:21:19.518] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2020-12-01T10:46:54.202] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-01T10:46:54.202] error: slurm_set_addr: Unable to resolve "pe2cc2-005"
[2020-12-01T10:46:54.202] error: slurm_get_port: Address family '0' not supported
[2020-12-01T10:46:54.202] error: _set_slurmd_addr: failure on pe2cc2-005

Could you enable DebugFlags=Network to get more insight into this?
Are you only getting OUT_OF_MEMORY on the .extern step?
Do you have some kind of clock synchronization system like NTP?
Could you run "sdiag" and attach the output?
Could you attach your cgroup.conf?
And, as mentioned above, could you attach a slurmd log?

Regards,
Albert
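For reference, a minimal sketch of enabling the Network debug flag (assuming the flag is set in the controller's slurm.conf; flag and subcommand names should be checked against the scontrol and slurm.conf man pages for your Slurm version):

    # Persistent: add the flag to slurm.conf on the controller, then push the change
    #   DebugFlags=Network
    scontrol reconfigure

    # Or toggle it at runtime without editing slurm.conf
    scontrol setdebugflags +Network

    # The resolution failures for pe2cc2-005 seen above can be cross-checked
    # from the controller with:
    getent hosts pe2cc2-005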
Created attachment 17036 [details]
slurmd log

I've attached the slurmd log. Please ignore the nodes that show up as missing; they are just dead nodes we have not yet removed from the Slurm cluster. We add nodes as ranges using the regex/hostlist format, so the broken ones have not been taken out of the configuration yet.

We have NTP servers and all HPC nodes are in sync.

[root@pe2-slurm01 ~]# sdiag
*******************************************************
sdiag output at Tue Dec 08 13:05:01 2020 (1607450701)
Data since      Tue Dec 08 12:56:48 2020 (1607450208)
*******************************************************
Server thread count:  16
Agent queue size:     0
Agent count:          2
Agent thread count:   6
DBD Agent queue size: 0

Jobs submitted: 3505
Jobs started:   716
Jobs completed: 222
Jobs canceled:  0
Jobs failed:    0

Job states ts: Tue Dec 08 13:04:31 2020 (1607450671)
Jobs pending:  4748
Jobs running:  1214

Main schedule statistics (microseconds):
        Last cycle:        100851
        Max cycle:         254151
        Total cycles:      225
        Mean cycle:        102266
        Mean depth cycle:  101
        Cycles per minute: 28
        Last queue length: 3114

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 112
        Total backfilled jobs (since last stats cycle start): 112
        Total backfilled heterogeneous job components: 0
        Total cycles: 14
        Last cycle when: Tue Dec 08 13:04:58 2020 (1607450698)
        Last cycle: 4370455
        Max cycle:  5456671
        Mean cycle: 2630629
        Last depth cycle: 1179
        Last depth cycle (try sched): 255
        Depth Mean: 3073
        Depth Mean (try depth): 236
        Last queue length: 3115
        Queue length mean: 3262
        Last table size: 15
        Mean table size: 16

Latency for 1000 calls to gettimeofday(): 709 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO            ( 2009) count:1349  ave_time:767     total_time:1035750
        REQUEST_NODE_INFO_SINGLE          ( 2040) count:1180  ave_time:1055    total_time:1245748
        REQUEST_COMPLETE_PROLOG           ( 6018) count:715   ave_time:68760   total_time:49163698
        REQUEST_COMPLETE_BATCH_SCRIPT     ( 5018) count:220   ave_time:200554  total_time:44122077
        REQUEST_STEP_COMPLETE             ( 5016) count:215   ave_time:64322   total_time:13829258
        MESSAGE_EPILOG_COMPLETE           ( 6012) count:211   ave_time:228675  total_time:48250460
        MESSAGE_NODE_REGISTRATION_STATUS  ( 1002) count:197   ave_time:3958    total_time:779790
        REQUEST_SUBMIT_BATCH_JOB          ( 4003) count:160   ave_time:3353    total_time:536592
        REQUEST_NODE_INFO                 ( 2007) count:102   ave_time:62817   total_time:6407388
        REQUEST_STATS_INFO                ( 2035) count:50    ave_time:504     total_time:25202
        REQUEST_JOB_INFO                  ( 2003) count:50    ave_time:96940   total_time:4847029
        REQUEST_JOB_USER_INFO             ( 2039) count:17    ave_time:676682  total_time:11503595
        REQUEST_FED_INFO                  ( 2049) count:17    ave_time:1069    total_time:18186
        REQUEST_JOB_READY                 ( 4019) count:4     ave_time:230     total_time:920
        REQUEST_COMPLETE_JOB_ALLOCATION   ( 5017) count:4     ave_time:2653    total_time:10615
        REQUEST_RESOURCE_ALLOCATION       ( 4001) count:3     ave_time:6000    total_time:18001
        REQUEST_JOB_STEP_CREATE           ( 5001) count:2     ave_time:815     total_time:1630
        REQUEST_JOB_ALLOCATION_INFO       ( 4014) count:2     ave_time:430     total_time:861
        REQUEST_RECONFIGURE               ( 1003) count:1     ave_time:610380  total_time:610380

Remote Procedure Call statistics by user
        root            (     0) count:4273  ave_time:39857   total_time:170312291
        eflynn          ( 50321) count:163   ave_time:3377    total_time:550574
        wliao           (  5042) count:48    ave_time:206041  total_time:9889975
        ahawkins        ( 50631) count:8     ave_time:1813    total_time:14506
        rraviram        ( 50946) count:3     ave_time:389     total_time:1167
        jrahman         ( 50867) count:3     ave_time:545431  total_time:1636295
        mbyrska-bishop  ( 20158) count:1     ave_time:2372    total_time:2372

Pending RPC statistics
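If it helps, the clock-sync statement above can be spot-checked on any node with something along these lines (a sketch; which command applies depends on whether chrony or classic ntpd is the NTP client on these hosts):

    timedatectl | grep -i synchronized   # systemd view of NTP sync state
    chronyc tracking                     # offset/stratum details when chrony is in use
    ntpstat                              # equivalent check when ntpd is in use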
Created attachment 17037 [details]
cgroup conf
Hi,

With the logs from slurmd I can confirm that your issue is a duplicate of bug 10255 comment 21. Related details can also be seen in bug 10336.

The fix is already on GitHub and will be part of 20.11.1, which will be released very soon:
- https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1

I'm marking this bug as a duplicate of bug 10255.

*** This ticket has been marked as a duplicate of ticket 10255 ***
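Once 20.11.1 is installed, a quick way to confirm the behavior is gone (a sketch; it assumes job accounting is enabled and <jobid> stands for any recently completed job):

    # Confirm the running version includes the 20.11.1 fix
    scontrol --version

    # Per-step states of a completed job; with the fix in place the .extern
    # step should no longer be reported as OUT_OF_MEMORY
    sacct -j <jobid> --format=JobID,JobName,State,ExitCode,MaxRSS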