We are seeing what appears to be the same thing as bug 3694. I know we are running an unsupported version (we are planning to upgrade very soon to 17.11.x), but I see in this bug that this problem is reported in 17.02 as well. Here's some example output. Let me know what else you might want to know:

[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea0001278f00 objects=31 used=1 fp=0xffff880049e3c528 flags=0x1fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81328989>] ? free_cpumask_var+0x9/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810fd85d>] ? on_each_cpu_cond+0xcd/0x190
[Sat Apr 7 14:21:38 2018] [<ffffffff811e2350>] ? kmem_cache_alloc_bulk+0x140/0x140
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880049e3c738 @offset=1848
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea0000162900 objects=31 used=1 fp=0xffff8800058a4738 flags=0x1fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8118fef0>] ? __free_memcg_kmem_pages+0x40/0x50
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff8800058a4b58 @offset=2904
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea00015fc300 objects=31 used=2 fp=0xffff880057f0d8c0 flags=0x1fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Sat Apr 7 14:21:38 2018] [<ffffffff816a87cb>] ? printk+0x60/0x77
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880057f0c630 @offset=1584
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880057f0c738 @offset=1848
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea000185d500 objects=31 used=3 fp=0xffff8800617555a8 flags=0x1fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Sat Apr 7 14:21:38 2018] [<ffffffff816a87cb>] ? printk+0x60/0x77
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880061754840 @offset=2112
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880061754948 @offset=2376
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880061754a50 @offset=2640
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea0011d87780 objects=31 used=1 fp=0xffff8804761df6b0 flags=0x2fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8118fef0>] ? __free_memcg_kmem_pages+0x40/0x50
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff8804761de738 @offset=1848
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea0000160f00 objects=31 used=1 fp=0xffff88000583c420 flags=0x1fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Sat Apr 7 14:21:38 2018] [<ffffffff816a87cb>] ? printk+0x60/0x77
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff88000583c738 @offset=1848
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea002d5ca200 objects=31 used=1 fp=0xffff880b57289ef0 flags=0x2fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Sat Apr 7 14:21:38 2018] [<ffffffff816a87cb>] ? printk+0x60/0x77
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880b57289de8 @offset=7656
[Sat Apr 7 14:21:38 2018] =============================================================================
[Sat Apr 7 14:21:38 2018] BUG numa_policy(2262:step_0) (Tainted: P B OE ------------ T): Objects remaining in numa_policy(2262:step_0) on kmem_cache_close()
[Sat Apr 7 14:21:38 2018] -----------------------------------------------------------------------------
[Sat Apr 7 14:21:38 2018] INFO: Slab 0xffffea000185d080 objects=31 used=1 fp=0xffff880061742a50 flags=0x1fffff00004080
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811e0904>] slab_err+0xb4/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff8108dfe9>] ? vprintk_default+0x29/0x40
[Sat Apr 7 14:21:38 2018] [<ffffffff816a87cb>] ? printk+0x60/0x77
[Sat Apr 7 14:21:38 2018] [<ffffffff811e4feb>] ? __kmalloc+0x1eb/0x230
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62d7>] ? kmem_cache_close+0x127/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e62f9>] kmem_cache_close+0x149/0x2e0
[Sat Apr 7 14:21:38 2018] [<ffffffff811e64a4>] __kmem_cache_shutdown+0x14/0x80
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf64>] kmem_cache_destroy+0x44/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[Sat Apr 7 14:21:38 2018] INFO: Object 0xffff880061742840 @offset=2112
[Sat Apr 7 14:21:38 2018] kmem_cache_destroy numa_policy(2262:step_0): Slab cache still has objects
[Sat Apr 7 14:21:38 2018] CPU: 3 PID: 32105 Comm: python Tainted: P B OE ------------ T 3.10.0-693.21.1.el7.x86_64 #1
[Sat Apr 7 14:21:38 2018] Hardware name: LENOVO Lenovo NeXtScale nx360 M5: -[5465AC1]-/00YE752, BIOS -[THE132H-2.50]- 10/13/2017
[Sat Apr 7 14:21:38 2018] Call Trace:
[Sat Apr 7 14:21:38 2018] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[Sat Apr 7 14:21:38 2018] [<ffffffff811ab000>] kmem_cache_destroy+0xe0/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffff811faa69>] kmem_cache_destroy_memcg_children+0x89/0xb0
[Sat Apr 7 14:21:38 2018] [<ffffffff811aaf39>] kmem_cache_destroy+0x19/0xf0
[Sat Apr 7 14:21:38 2018] [<ffffffffc1827d77>] deinit_chunk_split_cache+0x77/0xa0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc182990e>] uvm_pmm_gpu_deinit+0x3e/0x70 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fddd0>] remove_gpu+0x220/0x2f0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17fe031>] uvm_gpu_release_locked+0x21/0x30 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc18017b8>] uvm_va_space_destroy+0x348/0x3b0 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffffc17f7611>] uvm_release+0x11/0x20 [nvidia_uvm]
[Sat Apr 7 14:21:38 2018] [<ffffffff8120791c>] __fput+0xec/0x260
[Sat Apr 7 14:21:38 2018] [<ffffffff81207b7e>] ____fput+0xe/0x10
[Sat Apr 7 14:21:38 2018] [<ffffffff810b087b>] task_work_run+0xbb/0xe0
[Sat Apr 7 14:21:38 2018] [<ffffffff81090ed1>] do_exit+0x2d1/0xa40
[Sat Apr 7 14:21:38 2018] [<ffffffff810c7c60>] ? wake_up_state+0x10/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff8109f20e>] ? signal_wake_up_state+0x1e/0x30
[Sat Apr 7 14:21:38 2018] [<ffffffff810a0602>] ? zap_other_threads+0x92/0xc0
[Sat Apr 7 14:21:38 2018] [<ffffffff810916bf>] do_group_exit+0x3f/0xa0
[Sat Apr 7 14:21:38 2018] [<ffffffff81091734>] SyS_exit_group+0x14/0x20
[Sat Apr 7 14:21:38 2018] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
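
For triage, output like the above can be condensed with a short script. The following is only a sketch: the function name `summarize` and the two regular expressions are illustrative assumptions based on the message format shown in this log, not part of Slurm or the kernel, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Header of each slab report, e.g.
# "BUG numa_policy(2262:step_0) (Tainted: ...): Objects remaining in ..."
# (pattern assumed from the log in this report)
BUG_RE = re.compile(r"BUG (\S+) \(.*?\): Objects remaining in")

# Per-slab summary, e.g. "INFO: Slab 0x... objects=31 used=1"
SLAB_RE = re.compile(r"INFO: Slab (0x[0-9a-f]+) objects=(\d+) used=(\d+)")


def summarize(log_text):
    """Return (report count per cache name, total objects still in use)."""
    caches = Counter(BUG_RE.findall(log_text))
    in_use = sum(int(used) for _addr, _objs, used in SLAB_RE.findall(log_text))
    return caches, in_use
```

Feeding it saved `dmesg -T` output (for example, text read from a file) gives one count per cache name plus the number of objects the kernel reported as still allocated, which makes it easier to compare runs before and after an upgrade.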
Hey Ryan -

It's a kernel bug at heart. Nothing user-space does should ever be able to cause that type of crash, so there's nothing for us to chase down here. If you have a RHEL support contract I'd suggest getting them in the loop on this.

It's possible that changes in how we manage the various cgroups (which don't appear to be directly implicated here, but have usually been the root cause of some other issues) mean Slurm's behavior in 17.11 will avoid triggering this. But you'd have to test to narrow that down.

- Tim
Marking as resolved/infogiven. Please reopen if there's anything further I can address. - Tim