| Summary: | Frequent reboot or hang issue on GPU nodes. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bom <bom.singiali> |
| Component: | GPU | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | HMGU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | cgroup.conf, slurm.conf, syslog, slurmd log | | |
Created attachment 26273 [details]
slurm.conf
Created attachment 26274 [details]
syslog
Created attachment 26275 [details]
slurmd log
Hi,

Could we have an update on this, please? We have 3-4 GPU servers impacted by this, and it is getting urgent to resolve the issue.

Thanks

Hello Bom,

I do not see anything in the slurmd logs that could explain the issues you describe. Could you point me to a specific reboot, so that it is easier to identify the issue in the syslog?

Thanks.

Hi Oriol,

Usually these GPU nodes end up in a hung state and we have to reset/reboot to recover them. The reboots occur at random. In slurmctld.log, we get:

[2022-08-10T17:14:31.183] error: Nodes supergpu02pxe not responding, setting DOWN
[2022-08-11T00:09:39.170] error: Nodes supergpu02pxe not responding
[2022-08-11T00:10:14.700] error: Nodes supergpu02pxe not responding, setting DOWN

Thanks

Could you review cgroup.conf and slurm.conf for these GPU nodes and check whether they look good?

> Could you review cgroup.conf and slurm.conf for these GPU nodes and check
> whether they look good?

Yes, that is the first thing I did; they look OK.
I found something interesting in the syslog: it looks like lustre is provoking a kernel dump, so some of your network mounts are probably affected, making the system unstable:
Aug 10 20:28:25 supergpu02pxe kernel: WARNING: CPU: 84 PID: 5588 at /tmp/rpmbuild-lustre-root-Sr6hC9RG/BUILD/lustre-2.12.6_ddn42/lustre/llite/rw.c:103 ll_ra_count_get.isra.29+0x1bb/0x1d0 [lustre]
Also, I saw that a yum update ran shortly before that; you may want to verify that the lustre kernel module is in proper shape for the running kernel.
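As an illustration, one could count how often such lustre kernel warnings show up in the syslog, to see whether they line up with the hangs. A minimal sketch (the message pattern below is an assumption modeled on the warning line quoted above; adjust it for your logs):

```python
import re

# Kernel WARNING lines attributed to the lustre module (assumed format,
# modeled on the syslog line quoted above).
LUSTRE_WARNING = re.compile(r"kernel: WARNING:.*\[lustre\]")

def count_lustre_warnings(syslog_lines):
    """Count kernel WARNING lines that name the lustre module."""
    return sum(1 for line in syslog_lines if LUSTRE_WARNING.search(line))

# Hypothetical usage (log path is site-specific):
# with open("/var/log/messages") as f:
#     print(count_lustre_warnings(f))
```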
Regards.
Thanks Oriol,

I am planning to update the kernel, MLNX_OFED, and the lustre client on one of the GPU servers, then run some tests and make observations.

Will let you know how the system behaves afterwards.

Hi Bom,

> I am planning to update the kernel, MLNX_OFED, and the lustre client on one
> of the GPU servers, then run some tests and make observations.
>
> Will let you know how the system behaves afterwards.
I'll lower the severity to 4, since this is most probably not a Slurm issue, while I wait for your test results.
Regards.
The GPU nodes still hang, but the frequency is reduced.

We do not have swap enabled on the GPU nodes (for performance reasons); however, do you still recommend enabling 4GB-100GB swap partitions? These nodes have 1-3 TB of RAM.

Regarding swap (not enabled yet), do you think enabling these options will help? For example:

=========
slurm.conf:

SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,CR_Memory,CR_CPU

SchedulerParameters=max_rpc_cnt=250,\
sched_min_interval=2000000,\
batch_sched_delay=20,\
bf_resolution=800,\
bf_min_prio_reserve=2000,\
bf_window=1440,\
bf_continue,\
bf_min_age_reserve=600,\
Ignore_NUMA

=========
cgroup.conf

# Slurm cgroup support configuration file

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes

ConstrainCores=yes       # default *no*
ConstrainDevices=yes     # default *no*
ConstrainKmemSpace=no    # default *no*
ConstrainRAMSpace=yes    # default
ConstrainSwapSpace=yes   # default *no*

MemorySwappiness=0       ## <== new value

TaskAffinity=no          # default

=========

These were the error logs from today, when a GPU node was in a hung state:

==
Aug 13 02:46:09 supergpu05 kernel: Node 7 Normal: 1773*4kB (UM) 427*8kB (M) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10524kB
Aug 13 02:46:09 supergpu05 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: 596860 total pagecache pages
Aug 13 02:46:09 supergpu05 kernel: 0 pages in swap cache
Aug 13 02:46:09 supergpu05 kernel: Swap cache stats: add 0, delete 0, find 0/0
Aug 13 02:46:09 supergpu05 kernel: Free swap  = 0kB
Aug 13 02:46:09 supergpu05 kernel: Total swap = 0kB
Aug 13 02:46:09 supergpu05 kernel: kworker/u512:2: page allocation failure: order:0, mode:0x8020
Aug 13 02:46:09 supergpu05 kernel: CPU: 94 PID: 30145 Comm: kworker/u512:2 Kdump: loaded Tainted: P OEL ------------ 3.10.0-1160.71.1.el7.x86_64 #1
Aug 13 02:46:09 supergpu05 kernel: Hardware name: NVIDIA DGXA100 920-23687-2530-000/DGXA100, BIOS 1.13 03/21/2022
Aug 13 02:46:09 supergpu05 kernel: Workqueue: ib_addr process_one_req [ib_core]
Aug 13 02:46:09 supergpu05 kernel: Call Trace:
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff861865c9>] dump_stack+0x19/0x1b
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc4c20>] warn_alloc_failed+0x110/0x180
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85ad3363>] ? __wake_up+0x13/0x20
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc97bf>] __alloc_pages_nodemask+0x9df/0xbe0
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85ad7615>] ? ttwu_do_wakeup+0xb5/0xe0
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85abbd95>] ? insert_work+0x65/0xa0
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85c193d8>] alloc_pages_current+0x98/0x110
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc399e>] __get_free_pages+0xe/0x40
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc39e6>] get_zeroed_page+0x16/0x20
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff86000fb2>] iommu_map_page+0x182/0x4c0
Aug 13 02:46:09 supergpu05 kernel: lowmem_reserve[]: 0 0 0 0
Aug 13 02:46:09 supergpu05 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
====

As per my understanding, enabling swap will ONLY delay the OOM event, and the OOM is expected behavior (it kills jobs/apps) and should put the node in a hung state?

*** Correction: the OOM is expected behavior (it kills jobs/apps) and should NOT put the node in a hung state?

(In reply to Bom from comment #12)
> Regarding swap (not enabled yet), do you think enabling these options will
> help?

In slurm.conf, I rolled back to the original/previous config:

SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

Hi Bom,

I'll reply to your comments:

(In reply to Bom from comment #11)
> We do not have swap enabled on the GPU nodes (for performance reasons);
> however, do you still recommend enabling 4GB-100GB swap partitions?
>
> These nodes have 1-3 TB of RAM.

Having swap enabled could help you determine which part of your system is using more memory than expected, and for that purpose 4GB should be sufficient. As you said, for performance reasons it is not good to use swap on your compute nodes, but it is a useful tool to help you establish the memory needs of your system.

What I would also do is adjust the RealMemory of those nodes to ensure that the system has sufficient memory. The procedure I like to use is to measure memory usage while the system is idle and then add 50% to it, to be on the safe side. Since you also have lustre, I would measure the usage while doing some writes to the lustre filesystem, so that the memory used to serve those is also taken into account.

> Regarding swap (not enabled yet), do you think enabling these options will
> help?
> =========
> cgroup.conf
> MemorySwappiness=0       ## <== new value

This has no effect until you have swap in your system, but once you do, it is good to set it like this: your Slurm jobs will not use the swap, so their performance will not be affected. I will not comment on the slurm.conf changes, as you rolled them back in a later comment.

> *** Correction: the OOM is expected behavior (it kills jobs/apps) and should
> NOT put the node in a hung state?

That statement is right, but only when the OOM happens inside the Slurm job, i.e. when your job asks for 100MB and uses 200MB. If the issue is a system-level OOM (as in your syslog and the log fragment in Comment 13), then the process that gets killed might be a system process, which can leave the system hung. That is why, in my first reply, I insisted on making really sure that the RealMemory configured in Slurm is small enough that the system does not suffer an OOM.

Regards

Thanks for your suggestions Oriol,

Right now, after patching, the GPU nodes seem to behave normally. Is it OK to keep this bug open for the next 2-3 days? If I do not report back, we can close this bug.

Many thanks
Bom Singiali

Hello Bom,

I'll keep it open until Friday afternoon and then close it as per your last comment; either way, you will be able to reopen it later if something fails.

Regards.

Hi Bom,

I'm closing this bug as infogiven, as agreed in the last comment.

Regards
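The RealMemory sizing rule discussed in this thread (measure memory used while the node is idle, add a 50% safety margin, and advertise the rest to Slurm) can be sketched as follows; the helper function and the example numbers are illustrative, not taken from the ticket:

```python
def suggested_real_memory_mb(total_mb, idle_used_mb, headroom=0.5):
    """Suggest a Slurm RealMemory value (in MB): reserve the memory the
    system uses while idle plus a safety margin (default 50%) for the OS,
    lustre client, etc., and advertise the remainder to Slurm."""
    reserved_mb = int(idle_used_mb * (1 + headroom))
    return max(total_mb - reserved_mb, 0)

# Illustrative numbers: a 1 TB node using 64 GB while idle (with some lustre
# writes running) reserves 96 GB and advertises the rest to jobs.
print(suggested_real_memory_mb(1_048_576, 65_536))  # prints 950272
```

The `max(..., 0)` clamp just guards against a reservation larger than the node's total memory; the point is that jobs can never be allowed to push the OS itself into an out-of-memory condition.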
Created attachment 26272 [details]
cgroup.conf

Dear Slurm Support,

For the last two weeks, we have been observing frequent hangs (the system needs a reset/reboot to recover) and random reboot issues on our GPU nodes. I have attached slurm.conf, the syslog, and the slurmd logs for your reference.

Thanks for your help.

Regards
Bom Singiali