Ticket 14730

Summary: Frequent reboot or hang issue on GPU nodes.
Product: Slurm Reporter: Bom <bom.singiali>
Component: GPU    Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: HMGU Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: cgroup.conf
slurm.conf
syslog
slurmd log

Description Bom 2022-08-11 04:01:05 MDT
Created attachment 26272 [details]
cgroup.conf

Dear Slurm Support,

For the last two weeks, we have been observing frequent hangs (the system needs a reset/reboot to recover) and random reboots on GPU nodes.

I have attached slurm.conf, syslog and slurmd logs for your reference.

Thanks for your help.

Regards
Bom Singiali
Comment 1 Bom 2022-08-11 04:01:28 MDT
Created attachment 26273 [details]
slurm.conf
Comment 2 Bom 2022-08-11 04:02:06 MDT
Created attachment 26274 [details]
syslog
Comment 3 Bom 2022-08-11 04:02:24 MDT
Created attachment 26275 [details]
slurmd log
Comment 4 Bom 2022-08-12 03:52:48 MDT
Hi,

Could we have an update on this, please?
We have 3-4 GPU servers impacted by this, and it is getting urgent to resolve the issue.

Thanks
Comment 5 Oriol Vilarrubi 2022-08-12 05:25:00 MDT
Hello Bom,

I do not see anything in the slurmd logs that could explain the issues you describe. Could you point me to a specific reboot so that it is easier to identify the issue in the syslog?

Thanks.
Comment 6 Bom 2022-08-12 06:38:32 MDT
Hi Oriol,

Usually these GPU nodes end up in a hung state and we have to reset/reboot them to recover. The reboots occur at random.

In slurmctld.log, we see:

[2022-08-10T17:14:31.183] error: Nodes supergpu02pxe not responding, setting DOWN
[2022-08-11T00:09:39.170] error: Nodes supergpu02pxe not responding
[2022-08-11T00:10:14.700] error: Nodes supergpu02pxe not responding, setting DOWN

Thanks
Comment 7 Bom 2022-08-12 06:45:26 MDT
Could you review cgroup.conf and slurm.conf for these GPU nodes and confirm whether they look good?
Comment 8 Oriol Vilarrubi 2022-08-12 09:20:51 MDT
> Could you review cgroup.conf and slurm.conf for these GPU nodes and confirm
> whether they look good?

Yes, that is the first thing I did; it looks OK.

I found something interesting in the syslog: it looks like Lustre is triggering a kernel warning, so some of your network mounts are probably affected, which could be making the system unstable:

Aug 10 20:28:25 supergpu02pxe kernel: WARNING: CPU: 84 PID: 5588 at /tmp/rpmbuild-lustre-root-Sr6hC9RG/BUILD/lustre-2.12.6_ddn42/lustre/llite/rw.c:103 ll_ra_count_get.isra.29+0x1bb/0x1d0 [lustre]

I also saw that a yum update ran before that; you may want to verify that the Lustre kernel module is in proper shape.

Regards.
Comment 9 Bom 2022-08-12 10:30:30 MDT
Thanks Oriol,

I am planning to update the kernel, MLNX_OFED and the Lustre client on one of the GPU servers, then run some tests and observe.

I will let you know how the system behaves afterwards.
Comment 10 Oriol Vilarrubi 2022-08-12 10:59:54 MDT
Hi Bom,

> I am planning to update the kernel, MLNX_OFED and the Lustre client on one
> of the GPU servers, then run some tests and observe.
> 
> I will let you know how the system behaves afterwards.

I'll lower the severity to 4, since this is most probably not a Slurm issue, while I wait for your test results.

Regards.
Comment 11 Bom 2022-08-13 06:11:21 MDT
The GPU nodes still hang, but the frequency is reduced.

We do not have swap enabled on the GPU nodes (for performance reasons); however, do you still recommend enabling 4 GB-100 GB swap partitions?

These nodes have 1-3 TB of RAM.
Comment 12 Bom 2022-08-13 06:13:59 MDT
Since swap is not enabled yet, do you think enabling these options will help?
For example:

=========
slurm.conf:

SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,CR_Memory,CR_CPU

SchedulerParameters=max_rpc_cnt=250,\
sched_min_interval=2000000,\
batch_sched_delay=20,\
bf_resolution=800,\
bf_min_prio_reserve=2000,\
bf_window=1440,\
bf_continue,\
bf_min_age_reserve=600,\
Ignore_NUMA


=========
cgroup.conf

# Slurm cgroup support configuration file

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes

ConstrainCores=yes # default *no*
ConstrainDevices=yes # default *no*
ConstrainKmemSpace=no # default *no*
ConstrainRAMSpace=yes # default
ConstrainSwapSpace=yes # default *no*

MemorySwappiness=0 ## <== new value

TaskAffinity=no # default


=========
Comment 13 Bom 2022-08-13 06:16:06 MDT
These were error logs from today, when GPU node was in hung state:

==
Aug 13 02:46:09 supergpu05 kernel: Node 7 Normal: 1773*4kB (UM) 427*8kB (M) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10524kB
Aug 13 02:46:09 supergpu05 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 13 02:46:09 supergpu05 kernel: Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 13 02:46:09 supergpu05 kernel: 596860 total pagecache pages
Aug 13 02:46:09 supergpu05 kernel: 0 pages in swap cache
Aug 13 02:46:09 supergpu05 kernel: Swap cache stats: add 0, delete 0, find 0/0
Aug 13 02:46:09 supergpu05 kernel: Free swap  = 0kB
Aug 13 02:46:09 supergpu05 kernel: Total swap = 0kB
Aug 13 02:46:09 supergpu05 kernel: kworker/u512:2: page allocation failure: order:0, mode:0x8020
Aug 13 02:46:09 supergpu05 kernel: CPU: 94 PID: 30145 Comm: kworker/u512:2 Kdump: loaded Tainted: P           OEL ------------   3.10.0-1160.71.1.el7.x86_64 #1
Aug 13 02:46:09 supergpu05 kernel: Hardware name: NVIDIA DGXA100 920-23687-2530-000/DGXA100, BIOS 1.13 03/21/2022
Aug 13 02:46:09 supergpu05 kernel: Workqueue: ib_addr process_one_req [ib_core]
Aug 13 02:46:09 supergpu05 kernel: Call Trace:
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff861865c9>] dump_stack+0x19/0x1b
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc4c20>] warn_alloc_failed+0x110/0x180
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85ad3363>] ? __wake_up+0x13/0x20
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc97bf>] __alloc_pages_nodemask+0x9df/0xbe0
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85ad7615>] ? ttwu_do_wakeup+0xb5/0xe0
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85abbd95>] ? insert_work+0x65/0xa0
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85c193d8>] alloc_pages_current+0x98/0x110
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc399e>] __get_free_pages+0xe/0x40
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff85bc39e6>] get_zeroed_page+0x16/0x20
Aug 13 02:46:09 supergpu05 kernel: [<ffffffff86000fb2>] iommu_map_page+0x182/0x4c0
Aug 13 02:46:09 supergpu05 kernel: lowmem_reserve[]: 0 0 0 0
Aug 13 02:46:09 supergpu05 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
====
Comment 14 Bom 2022-08-13 06:18:53 MDT
As per my understanding, enabling swap will ONLY delay the OOM event,
and OOM is expected behavior (it kills jobs/apps) and should put the node in a hang state?
Comment 15 Bom 2022-08-13 06:20:29 MDT
*** Correction: OOM is expected behavior (it kills jobs/apps) and should NOT put the node in a hang state?
Comment 16 Bom 2022-08-13 08:11:37 MDT
(In reply to Bom from comment #12)
> Due to swap (not enabled yet), do you think, enabling these options will
> help?
> For e.g
> 
> =========
> cgroup.conf
> 
> # Slurm cgroup support configuration file
> 
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=yes
> 
> ConstrainCores=yes # default *no*
> ConstrainDevices=yes # default *no*
> ConstrainKmemSpace=no # default *no*
> ConstrainRAMSpace=yes # default
> ConstrainSwapSpace=yes # default *no*
> 
> MemorySwappiness=0 ## <== new value
> 
> TaskAffinity=no # default
> 
> 
> =========
Comment 17 Bom 2022-08-13 08:12:34 MDT
(In reply to Bom from comment #12)
> Due to swap (not enabled yet), do you think, enabling these options will
> help?
> For e.g
> 
> =========
> slurm.conf:
> 
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,CR_Memory,CR_CPU
> 
> SchedulerParameters=max_rpc_cnt=250,\
> sched_min_interval=2000000,\
> batch_sched_delay=20,\
> bf_resolution=800,\
> bf_min_prio_reserve=2000,\
> bf_window=1440,\
> bf_continue,\
> bf_min_age_reserve=600,\
> Ignore_NUMA
> 
> 
> =========
> cgroup.conf
> 
> # Slurm cgroup support configuration file
> 
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=yes
> 
> ConstrainCores=yes # default *no*
> ConstrainDevices=yes # default *no*
> ConstrainKmemSpace=no # default *no*
> ConstrainRAMSpace=yes # default
> ConstrainSwapSpace=yes # default *no*
> 
> MemorySwappiness=0 ## <== new value
> 
> TaskAffinity=no # default
> 
> 
> =========

In slurm.conf, I rolled back to the original/previous config:
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
Comment 18 Oriol Vilarrubi 2022-08-15 08:26:28 MDT
Hi Bom,

I'll reply to your comments:

(In reply to Bom from comment #11)
> We do not have swap enabled on GPU nodes (due to performance reason),
> however do you still recommend enabling 4GB-100GB swap partitions ? 
> 
> These nodes have 1-3 TB of RAM.
Having swap enabled could help you determine which part of your system is using more memory than expected, and for that purpose 4 GB should be sufficient. As you said, for performance reasons it is not good to use swap on compute nodes, but it is a useful tool to help you establish the memory needs of your system.

What I would also do is adjust the RealMemory of those nodes to ensure that the system itself has sufficient memory. The procedure I like to use is to measure memory usage on an idle system and then add 50% to that as a safety margin. Since you also have Lustre, I would measure usage while performing some writes to the Lustre filesystem, so that the memory used to serve those is also taken into account.
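As a minimal sketch of that sizing rule (the node name is illustrative, and the 50% margin is the rule of thumb above, not a Slurm default):

```shell
#!/bin/sh
# Sketch: measure memory in use on an otherwise idle node, reserve that
# plus a 50% safety margin for the OS (and Lustre client), and grant the
# remainder to Slurm as RealMemory (values are in MB, as slurm.conf expects).
total_mb=$(free -m | awk '/^Mem:/ {print $2}')
used_mb=$(free -m | awk '/^Mem:/ {print $3}')
reserved_mb=$(( used_mb + used_mb / 2 ))   # idle usage + 50% margin
realmemory=$(( total_mb - reserved_mb ))   # value for the NodeName line
echo "NodeName=supergpu02pxe RealMemory=${realmemory}"
```

The resulting value would go on the corresponding NodeName line in slurm.conf; ideally the idle measurement is repeated while writing to Lustre, as described above.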

>Due to swap (not enabled yet), do you think, enabling these options will help?
>For e.g
>=========
> cgroup.conf
> MemorySwappiness=0 ## <== new value

This has no effect until you have swap on your system, but once you do, it is good to have it set like this: your Slurm jobs will not use swap and their performance will not be affected.
I will not comment on the slurm.conf changes, since you rolled them back in a later comment.

> *** OOM is expected behavior (kills job/apps) and should NOT put node in hang state?

That statement is right, but only when the OOM happens inside the Slurm job, i.e. when your job asks for 100 MB and uses 200 MB. If it is a system-wide OOM (as your syslog and the log fragment in comment 13 suggest), then the process that is killed might be a system process, which can make the node hang. That is why, as I said in my first reply, you should make really sure that the RealMemory configured in Slurm is small enough that the system itself does not suffer an OOM.
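To illustrate the job-level case (hypothetical batch script; this assumes ConstrainRAMSpace=yes as in your cgroup.conf, so the limit is enforced by the job's cgroup rather than the system-wide OOM killer):

```shell
#!/bin/bash
#SBATCH --job-name=oom-demo
#SBATCH --mem=100M        # job requests 100 MB

# Tries to allocate ~200 MB; with ConstrainRAMSpace=yes this exceeds the
# job's cgroup memory limit, so only this task is OOM-killed and the job
# fails cleanly while the node itself stays healthy.
python3 -c 'x = bytearray(200 * 1024 * 1024)'
```

This is the benign, expected OOM behavior; a hang points instead at the system running out of memory outside any job cgroup.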

Regards
Comment 19 Bom 2022-08-17 05:21:02 MDT
Thanks for your suggestion Oriol,

Right now, after patching, the GPU nodes seem to behave normally.
Is it OK to keep this bug open for the next 2-3 days?

If I do not report back, we can close this bug.

Many Thanks
Bom Singiali
Comment 20 Oriol Vilarrubi 2022-08-17 10:38:41 MDT
Hello Bom,

I'll keep it open until Friday afternoon and then close it as per your last comment. Either way, you will be able to reopen it later if something fails.

Regards.
Comment 21 Oriol Vilarrubi 2022-08-19 15:30:16 MDT
Hi Bom,

I'm closing this bug as infogiven, as agreed in the last comment.

Regards