Created attachment 6557 [details] slurm.conf We recently upgraded to Slurm 17.11.2. Since the update some of our users have had trouble submitting single-node jobs beyond a certain size (depending on the particular compute node allocated, the trouble begins around 16 threads). Some jobs produce an error, but do run: slurmstepd-oat01: error: task/cgroup: task[0] infinite loop broken while trying to provision compute elements using unknown (bitmap:0x3f0f3f0f) Larger jobs just seem to crash (see below). Looking over some of the suggestions in other issues reported here and elsewhere, I verified the node configuration using slurmd -C and discovered RealMemory was incorrect, which has since been fixed. CPU counts are correct. slurm.conf, cgroup.conf, and allowed_devices.conf attached. slurmd -C output for each node type: NodeName=wheat01 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=96669 UpTime=5-05:14:31 NodeName=wheat06 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=773830 UpTime=5-12:32:13 NodeName=wheat08 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128883 UpTime=5-12:32:46 NodeName=oat01 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128821 UpTime=0-01:09:16 Typical slurmstepd error log follows: *** Error in `slurmstepd: [228282.batch]': free(): invalid next size (fast): 0x000055b00cef5320 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x145d5d38c7e5] /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x145d5d39537a] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x145d5d39953c] /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so(slurm_xfree+0x1d)[0x145d5e29612d] /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so(task_cgroup_cpuset_set_task_affinity+0x14ce)[0x145d5a834f1e] /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so(task_p_pre_launch+0x21)[0x145d5a832001] slurmstepd: 
[228282.batch](task_g_pre_launch+0x6c)[0x55b00bd6aaac] slurmstepd: [228282.batch](exec_task+0x480)[0x55b00bd559f0] slurmstepd: [228282.batch](+0xbab3)[0x55b00bd4fab3] slurmstepd: [228282.batch](job_manager+0x374)[0x55b00bd54364] slurmstepd: [228282.batch](main+0xc04)[0x55b00bd51234] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x145d5d335830] slurmstepd: [228282.batch](_start+0x29)[0x55b00bd51e19] ======= Memory map: ======== 145d54000000-145d54021000 rw-p 00000000 00:00 0 145d54021000-145d58000000 ---p 00000000 00:00 0 145d59e0a000-145d59e20000 r-xp 00000000 fd:00 262295 /lib/x86_64-linux-gnu/libgcc_s.so.1 145d59e20000-145d5a01f000 ---p 00016000 fd:00 262295 /lib/x86_64-linux-gnu/libgcc_s.so.1 145d5a01f000-145d5a020000 rw-p 00015000 fd:00 262295 /lib/x86_64-linux-gnu/libgcc_s.so.1 145d5a020000-145d5a021000 r-xp 00000000 fd:00 814213 /usr/lib/x86_64-linux-gnu/slurm-wlm/job_container_none.so 145d5a021000-145d5a220000 ---p 00001000 fd:00 814213 /usr/lib/x86_64-linux-gnu/slurm-wlm/job_container_none.so 145d5a220000-145d5a221000 r--p 00000000 fd:00 814213 /usr/lib/x86_64-linux-gnu/slurm-wlm/job_container_none.so 145d5a221000-145d5a222000 rw-p 00001000 fd:00 814213 /usr/lib/x86_64-linux-gnu/slurm-wlm/job_container_none.so 145d5a222000-145d5a224000 r-xp 00000000 fd:00 814262 /usr/lib/x86_64-linux-gnu/slurm-wlm/crypto_munge.so 145d5a224000-145d5a423000 ---p 00002000 fd:00 814262 /usr/lib/x86_64-linux-gnu/slurm-wlm/crypto_munge.so 145d5a423000-145d5a424000 r--p 00001000 fd:00 814262 /usr/lib/x86_64-linux-gnu/slurm-wlm/crypto_munge.so 145d5a424000-145d5a425000 rw-p 00002000 fd:00 814262 /usr/lib/x86_64-linux-gnu/slurm-wlm/crypto_munge.so 145d5a425000-145d5a426000 r-xp 00000000 fd:00 814207 /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so 145d5a426000-145d5a626000 ---p 00001000 fd:00 814207 /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so 145d5a626000-145d5a627000 r--p 00001000 fd:00 814207 /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so 
145d5a627000-145d5a628000 rw-p 00002000 fd:00 814207 /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so 145d5a628000-145d5a62b000 r-xp 00000000 fd:00 814208 /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so 145d5a62b000-145d5a82a000 ---p 00003000 fd:00 814208 /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so 145d5a82a000-145d5a82b000 r--p 00002000 fd:00 814208 /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so 145d5a82b000-145d5a82c000 rw-p 00003000 fd:00 814208 /usr/lib/x86_64-linux-gnu/slurm-wlm/proctrack_cgroup.so 145d5a82c000-145d5a82f000 rw-p 00000000 00:00 0 145d5a82f000-145d5a83a000 r-xp 00000000 fd:00 814269 /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so 145d5a83a000-145d5aa39000 ---p 0000b000 fd:00 814269 /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so 145d5aa39000-145d5aa3a000 r--p 0000a000 fd:00 814269 /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so 145d5aa3a000-145d5aa3b000 rw-p 0000b000 fd:00 814269 /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so 145d5aa3b000-145d5aa45000 rw-p 00000000 00:00 0 145d5aa45000-145d5aa46000 r-xp 00000000 fd:00 814248 /usr/lib/x86_64-linux-gnu/slurm-wlm/core_spec_none.so 145d5aa46000-145d5ac45000 ---p 00001000 fd:00 814248 /usr/lib/x86_64-linux-gnu/slurm-wlm/core_spec_none.so 145d5ac45000-145d5ac46000 r--p 00000000 fd:00 814248 /usr/lib/x86_64-linux-gnu/slurm-wlm/core_spec_none.so 145d5ac46000-145d5ac47000 rw-p 00001000 fd:00 814248 /usr/lib/x86_64-linux-gnu/slurm-wlm/core_spec_none.so 145d5ac47000-145d5ac48000 ---p 00000000 00:00 0 145d5ac48000-145d5ad48000 rw-p 00000000 00:00 0 145d5ad48000-145d5ad49000 ---p 00000000 00:00 0 145d5ad49000-145d5ae49000 rw-p 00000000 00:00 0 145d5ae49000-145d5ae51000 r-xp 00000000 fd:00 814231 /usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.s 145d5ae51000-145d5b050000 ---p 00008000 fd:00 814231 
/usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.s 145d5b050000-145d5b051000 r--p 00007000 fd:00 814231 /usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.s 145d5b051000-145d5b052000 rw-p 00008000 fd:00 814231 /usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.s 145d5b052000-145d5b05a000 rw-p 00000000 00:00 0 145d5b05a000-145d5b05b000 r-xp 00000000 fd:00 814237 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_ 145d5b05b000-145d5b25a000 ---p 00001000 fd:00 814237 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_ 145d5b25a000-145d5b25b000 r--p 00000000 fd:00 814237 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_ 145d5b25b000-145d5b25c000 rw-p 00001000 fd:00 814237 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_ 145d5b25c000-145d5b25d000 r-xp 00000000 fd:00 814254 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnec 145d5b25d000-145d5b45c000 ---p 00001000 fd:00 814254 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnec 145d5b45c000-145d5b45d000 r--p 00000000 fd:00 814254 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnec 145d5b45d000-145d5b45e000 rw-p 00001000 fd:00 814254 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnec 145d5b45e000-145d5b45f000 r-xp 00000000 fd:00 814250 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none 145d5b45f000-145d5b65e000 ---p 00001000 fd:00 814250 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none 145d5b65e000-145d5b65f000 r--p 00000000 fd:00 814250 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none 145d5b65f000-145d5b660000 rw-p 00001000 fd:00 814250 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none 145d5b660000-145d5b661000 r-xp 00000000 fd:00 814281 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_non 145d5b661000-145d5b861000 ---p 00001000 fd:00 814281 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_non 145d5b861000-145d5b862000 r--p 00001000 fd:00 814281 
/usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_non 145d5b862000-145d5b863000 rw-p 00002000 fd:00 814281 /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_non 145d5b863000-145d5b868000 r-xp 00000000 fd:00 814267 /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so 145d5b868000-145d5ba67000 ---p 00005000 fd:00 814267 /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so 145d5ba67000-145d5ba68000 r--p 00004000 fd:00 814267 /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so 145d5ba68000-145d5ba69000 rw-p 00005000 fd:00 814267 /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so 145d5ba69000-145d5ba6b000 r-xp 00000000 fd:00 814282 /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so 145d5ba6b000-145d5bc6a000 ---p 00002000 fd:00 814282 /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so 145d5bc6a000-145d5bc6b000 r--p 00001000 fd:00 814282 /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so 145d5bc6b000-145d5bc6c000 rw-p 00002000 fd:00 814282 /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so 145d5bc6c000-145d5bc75000 r-xp 00000000 fd:00 803370 /usr/lib/x86_64-linux-gnu/libmunge.so.2.0.0 145d5bc75000-145d5be74000 ---p 00009000 fd:00 803370 /usr/lib/x86_64-linux-gnu/libmunge.so.2.0.0 145d5be74000-145d5be75000 r--p 00008000 fd:00 803370 /usr/lib/x86_64-linux-gnu/libmunge.so.2.0.0 145d5be75000-145d5be76000 rw-p 00009000 fd:00 803370 /usr/lib/x86_64-linux-gnu/libmunge.so.2.0.0 145d5be76000-145d5be79000 r-xp 00000000 fd:00 814198 /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so 145d5be79000-145d5c078000 ---p 00003000 fd:00 814198 /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so 145d5c078000-145d5c079000 r--p 00002000 fd:00 814198 /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so 145d5c079000-145d5c07a000 rw-p 00003000 fd:00 814198 /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so 145d5c07a000-145d5c094000 r-xp 00000000 fd:00 814286 /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so 145d5c094000-145d5c293000 ---p 0001a000 fd:00 814286 
/usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so 145d5c293000-145d5c294000 r--p 00019000 fd:00 814286 /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so 145d5c294000-145d5c295000 rw-p 0001a000 fd:00 814286 /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so 145d5c295000-145d5c2a0000 r-xp 00000000 fd:00 262533 /lib/x86_64-linux-gnu/libnss_files-2.23.so 145d5c2a0000-145d5c49f000 ---p 0000b000 fd:00 262533 /lib/x86_64-linux-gnu/libnss_files-2.23.so 145d5c49f000-145d5c4a0000 r--p 0000a000 fd:00 262533 /lib/x86_64-linux-gnu/libnss_files-2.23.so 145d5c4a0000-145d5c4a1000 rw-p 0000b000 fd:00 262533 /lib/x86_64-linux-gnu/libnss_files-2.23.so 145d5c4a1000-145d5c4a7000 rw-p 00000000 00:00 0 145d5c4a7000-145d5c4b2000 r-xp 00000000 fd:00 262555 /lib/x86_64-linux-gnu/libnss_nis-2.23.so 145d5c4b2000-145d5c6b1000 ---p 0000b000 fd:00 262555 /lib/x86_64-linux-gnu/libnss_nis-2.23.so 145d5c6b1000-145d5c6b2000 r--p 0000a000 fd:00 262555 /lib/x86_64-linux-gnu/libnss_nis-2.23.so 145d5c6b2000-145d5c6b3000 rw-p 0000b000 fd:00 262555 /lib/x86_64-linux-gnu/libnss_nis-2.23.so 145d5c6b3000-145d5c6c9000 r-xp 00000000 fd:00 262258 /lib/x86_64-linux-gnu/libnsl-2.23.so 145d5c6c9000-145d5c8c8000 ---p 00016000 fd:00 262258 /lib/x86_64-linux-gnu/libnsl-2.23.so 145d5c8c8000-145d5c8c9000 r--p 00015000 fd:00 262258 /lib/x86_64-linux-gnu/libnsl-2.23.so 145d5c8c9000-145d5c8ca000 rw-p 00016000 fd:00 262258 /lib/x86_64-linux-gnu/libnsl-2.23.so 145d5c8ca000-145d5c8cc000 rw-p 00000000 00:00 0 145d5c8cc000-145d5c8d4000 r-xp 00000000 fd:00 262486 /lib/x86_64-linux-gnu/libnss_compat-2.23.so 145d5c8d4000-145d5cad3000 ---p 00008000 fd:00 262486 /lib/x86_64-linux-gnu/libnss_compat-2.23.so 145d5cad3000-145d5cad4000 r--p 00007000 fd:00 262486 /lib/x86_64-linux-gnu/libnss_compat-2.23.so 145d5cad4000-145d5cad5000 rw-p 00008000 fd:00 262486 /lib/x86_64-linux-gnu/libnss_compat-2.23.so 145d5cad5000-145d5caf1000 r-xp 00000000 fd:00 262410 /lib/x86_64-linux-gnu/libaudit.so.1.0.0 145d5caf1000-145d5ccf0000 ---p 
0001c000 fd:00 262410 /lib/x86_64-linux-gnu/libaudit.so.1.0.0 145d5ccf0000-145d5ccf1000 r--p 0001b000 fd:00 262410 /lib/x86_64-linux-gnu/libaudit.so.1.0.0 145d5ccf1000-145d5ccf2000 rw-p 0001c000 fd:00 262410 /lib/x86_64-linux-gnu/libaudit.so.1.0.0 145d5ccf2000-145d5ccfc000 rw-p 00000000 00:00 0 145d5ccfc000-145d5cd05000 r-xp 00000000 fd:00 795131 /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1 145d5cd05000-145d5cf04000 ---p 00009000 fd:00 795131 /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1 145d5cf04000-145d5cf05000 r--p 00008000 fd:00 795131 /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1 145d5cf05000-145d5cf06000 rw-p 00009000 fd:00 795131 /usr/lib/x86_64-linux-gnu/libltdl.so.7.3.1 145d5cf06000-145d5cf10000 r-xp 00000000 fd:00 789402 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0 145d5cf10000-145d5d10f000 ---p 0000a000 fd:00 789402 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0 145d5d10f000-145d5d110000 r--p 00009000 fd:00 789402 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0 145d5d110000-145d5d111000 rw-p 0000a000 fd:00 789402 /usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0 145d5d111000-145d5d114000 r-xp 00000000 fd:00 262273 /lib/x86_64-linux-gnu/libdl-2.23.so 145d5d114000-145d5d313000 ---p 00003000 fd:00 262273 /lib/x86_64-linux-gnu/libdl-2.23.so 145d5d313000-145d5d314000 r--p 00002000 fd:00 262273 /lib/x86_64-linux-gnu/libdl-2.23.so 145d5d314000-145d5d315000 rw-p 00003000 fd:00 262273 /lib/x86_64-linux-gnu/libdl-2.23.so 145d5d315000-145d5d4d5000 r-xp 00000000 fd:00 262270 /lib/x86_64-linux-gnu/libc-2.23.so 145d5d4d5000-145d5d6d5000 ---p 001c0000 fd:00 262270 /lib/x86_64-linux-gnu/libc-2.23.so 145d5d6d5000-145d5d6d9000 r--p 001c0000 fd:00 262270 /lib/x86_64-linux-gnu/libc-2.23.so 145d5d6d9000-145d5d6db000 rw-p 001c4000 fd:00 262270 /lib/x86_64-linux-gnu/libc-2.23.so 145d5d6db000-145d5d6df000 rw-p 00000000 00:00 0 145d5d6df000-145d5d6f7000 r-xp 00000000 fd:00 262269 /lib/x86_64-linux-gnu/libpthread-2.23.so 145d5d6f7000-145d5d8f6000 ---p 00018000 fd:00 262269 
/lib/x86_64-linux-gnu/libpthread-2.23.so 145d5d8f6000-145d5d8f7000 r--p 00017000 fd:00 262269 /lib/x86_64-linux-gnu/libpthread-2.23.so 145d5d8f7000-145d5d8f8000 rw-p 00018000 fd:00 262269 /lib/x86_64-linux-gnu/libpthread-2.23.so 145d5d8f8000-145d5d8fc000 rw-p 00000000 00:00 0 145d5d8fc000-145d5d8fe000 r-xp 00000000 fd:00 262483 /lib/x86_64-linux-gnu/libutil-2.23.so 145d5d8fe000-145d5dafd000 ---p 00002000 fd:00 262483 /lib/x86_64-linux-gnu/libutil-2.23.so 145d5dafd000-145d5dafe000 r--p 00001000 fd:00 262483 /lib/x86_64-linux-gnu/libutil-2.23.so 145d5dafe000-145d5daff000 rw-p 00002000 fd:00 262483 /lib/x86_64-linux-gnu/libutil-2.23.so 145d5daff000-145d5db02000 r-xp 00000000 fd:00 262156 /lib/x86_64-linux-gnu/libpam_misc.so.0.82.0 145d5db02000-145d5dd01000 ---p 00003000 fd:00 262156 /lib/x86_64-linux-gnu/libpam_misc.so.0.82.0 145d5dd01000-145d5dd02000 r--p 00002000 fd:00 262156 /lib/x86_64-linux-gnu/libpam_misc.so.0.82.0 145d5dd02000-145d5dd03000 rw-p 00003000 fd:00 262156 /lib/x86_64-linux-gnu/libpam_misc.so.0.82.0 145d5dd03000-145d5dd10000 r-xp 00000000 fd:00 262277 /lib/x86_64-linux-gnu/libpam.so.0.83.1 145d5dd10000-145d5df0f000 ---p 0000d000 fd:00 262277 /lib/x86_64-linux-gnu/libpam.so.0.83.1 145d5df0f000-145d5df10000 r--p 0000c000 fd:00 262277 /lib/x86_64-linux-gnu/libpam.so.0.83.1 145d5df10000-145d5df11000 rw-p 0000d000 fd:00 262277 /lib/x86_64-linux-gnu/libpam.so.0.83.1 145d5df11000-145d5df4a000 r-xp 00000000 fd:00 795136 /usr/lib/x86_64-linux-gnu/libhwloc.so.5.6.8 145d5df4a000-145d5e149000 ---p 00039000 fd:00 795136 /usr/lib/x86_64-linux-gnu/libhwloc.so.5.6.8 145d5e149000-145d5e14a000 r--p 00038000 fd:00 795136 /usr/lib/x86_64-linux-gnu/libhwloc.so.5.6.8 145d5e14a000-145d5e14b000 rw-p 00039000 fd:00 795136 /usr/lib/x86_64-linux-gnu/libhwloc.so.5.6.8 145d5e14b000-145d5e2f7000 r-xp 00000000 fd:00 814274 /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so 145d5e2f7000-145d5e4f6000 ---p 001ac000 fd:00 814274 /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so 
145d5e4f6000-145d5e4f8000 r--p 001ab000 fd:00 814274 /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so 145d5e4f8000-145d5e4fd000 rw-p 001ad000 fd:00 814274 /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so 145d5e4fd000-145d5e502000 rw-p 00000000 00:00 0 145d5e502000-145d5e528000 r-xp 00000000 fd:00 262263 /lib/x86_64-linux-gnu/ld-2.23.so 145d5e5f5000-145d5e5f6000 ---p 00000000 00:00 0 145d5e5f6000-145d5e6fd000 rw-p 00000000 00:00 0 145d5e725000-145d5e726000 rw-p 00000000 00:00 0 145d5e726000-145d5e727000 rw-p 00000000 00:00 0 145d5e727000-145d5e728000 r--p 00025000 fd:00 262263 /lib/x86_64-linux-gnu/ld-2.23.so 145d5e728000-145d5e729000 rw-p 00026000 fd:00 262263 /lib/x86_64-linux-gnu/ld-2.23.so 145d5e729000-145d5e72a000 rw-p 00000000 00:00 0 55b00bd44000-55b00bd7b000 r-xp 00000000 fd:00 922621 /usr/sbin/slurmstepd-wlm 55b00bf7a000-55b00bf7b000 r--p 00036000 fd:00 922621 /usr/sbin/slurmstepd-wlm 55b00bf7b000-55b00bf7c000 rw-p 00037000 fd:00 922621 /usr/sbin/slurmstepd-wlm 55b00bf7c000-55b00bf7f000 rw-p 00000000 00:00 0 55b00cec0000-55b00cf4b000 rw-p 00000000 00:00 0 [heap] 7ffe58f8c000-7ffe58fad000 rw-p 00000000 00:00 0 [stack] 7ffe58ff7000-7ffe58ffa000 r--p 00000000 00:00 0 [vvar] 7ffe58ffa000-7ffe58ffc000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] slurmstepd-oat01: error: get_exit_code task 0 died by signal
Created attachment 6558 [details] cgroup.conf
Created attachment 6559 [details] cgroup allowed_devices.conf
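For anyone reading the report above, the bitmap in the task/cgroup error and the slurmd -C lines can be sanity-checked with a few lines of Python. This is only an illustrative sketch: the helper names cpus_in_bitmap and expected_cpus are made up here, and it assumes the bitmap is a plain hex mask in which bit i stands for CPU index i, which the report does not confirm.

```python
import re

def cpus_in_bitmap(mask_hex: str) -> list:
    """Return the indices of the set bits in a hex mask such as '0x3f0f3f0f'.

    Assumption (not confirmed by the report): bit i corresponds to CPU index i.
    """
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

def expected_cpus(slurmd_c_line: str) -> int:
    """Multiply Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore."""
    fields = dict(re.findall(r"(\w+)=(\S+)", slurmd_c_line))
    return (int(fields["Boards"]) * int(fields["SocketsPerBoard"])
            * int(fields["CoresPerSocket"]) * int(fields["ThreadsPerCore"]))

# slurmd -C line for oat01, copied from the report (UpTime omitted).
line = ("NodeName=oat01 CPUs=32 Boards=1 SocketsPerBoard=2 "
        "CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128821")

cpus = cpus_in_bitmap("0x3f0f3f0f")
print(len(cpus), cpus)      # number of set bits, then their indices
print(expected_cpus(line))  # should agree with the CPUs= field
```

Under that assumption the mask has 20 bits set, and the topology fields for oat01 multiply out to the reported CPUs=32.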
Hi, do you have a core dump file? If so, could you attach the output of gdb -batch -ex "thread apply all bt full" <core file>? Also, how exactly was this job submitted? Dominik
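If the backtrace has to be collected on several nodes, the gdb invocation requested above can be wrapped in a small script. The sketch below is an assumption-laden illustration, not part of the ticket: the function names and the example paths are hypothetical, and it relies only on the fact that gdb conventionally takes the executable before the core file.

```python
import subprocess

def gdb_backtrace_cmd(executable: str, core_file: str) -> list:
    """Build the argv for a non-interactive full backtrace of every thread."""
    return ["gdb", "-batch", "-ex", "thread apply all bt full",
            executable, core_file]

def collect_backtrace(executable: str, core_file: str) -> str:
    """Run gdb and return its combined stdout/stderr as text."""
    result = subprocess.run(gdb_backtrace_cmd(executable, core_file),
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True, check=False)
    return result.stdout

# Hypothetical paths, for illustration only; this just prints the command.
print(" ".join(gdb_backtrace_cmd("/usr/sbin/slurmstepd", "core.12345")))
```

Passing the arguments as a list avoids shell quoting problems with the "thread apply all bt full" string.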
Ubuntu's apport captured the crash, but I extracted the core dump and ran `gdb -batch -ex "thread apply all bt full" /usr/sbin/slurmstepd-wlm CoreDump`. (gdb warns that the dump might not be from slurmstepd-wlm, but that's the executable specified in the crash file.) warning: core file may not match specified executable file. [New LWP 37907] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `slurmstepd: [231224.b'. Program terminated with signal SIGABRT, Aborted. #0 0x0000152422aac428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. Thread 1 (Thread 0x152423e5c700 (LWP 37907)): #0 0x0000152422aac428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 resultvar = 0 pid = 37907 selftid = 37907 #1 0x0000152422aae02a in __GI_abort () at abort.c:89 save_stage = 2 act = {__sigaction_handler = {sa_handler = 0x3939656666372d30, sa_sigaction = 0x3939656666372d30}, sa_mask = {__val = {8223625903107040868, 3472328295963438381, 4192904167887482928, 2314885531086893104, 2314885530818453536, 2314885530818453536, 8528445641706184736, 7378645557150114166, 3472386803145193829, 4123438424310361392, 8223625903103553637, 3472328295963457581, 4192904167887482928, 2314885531086893104, 2314885530818453536, 2314885530818453536}}, sa_flags = 538976288, sa_restorer = 0x64} sigs = {__val = {32, 0 <repeats 15 times>}} #2 0x0000152422aee7ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x152422c07ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175 ap = <error reading variable ap (Attempt to dereference a generic pointer.)> fd = 2 on_2 = <optimized out> list = <optimized out> nlist = <optimized out> cp = <optimized out> written = <optimized out> #3 0x0000152422af737a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x152422c07f50 
"free(): invalid next size (fast)", action=3) at malloc.c:5006 buf = "0000563f6f739150" cp = <optimized out> ar_ptr = <optimized out> str = 0x152422c07f50 "free(): invalid next size (fast)" action = 3 #4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867 size = <optimized out> fb = <optimized out> nextchunk = <optimized out> nextsize = <optimized out> nextinuse = <optimized out> prevsize = <optimized out> bck = <optimized out> fwd = <optimized out> errstr = <optimized out> locked = <optimized out> #5 0x0000152422afb53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968 ar_ptr = <optimized out> p = <optimized out> hook = <optimized out> #6 0x00001524239f812d in slurm_xfree () from /usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so No symbol table info available. #7 0x000015241bdeff1e in task_cgroup_cpuset_set_task_affinity () from /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so No symbol table info available. #8 0x000015241bded001 in task_p_pre_launch () from /usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so No symbol table info available. #9 0x0000563f6eb3eaac in task_g_pre_launch () No symbol table info available. #10 0x0000563f6eb299f0 in exec_task () No symbol table info available. #11 0x0000563f6eb23ab3 in ?? () No symbol table info available. #12 0x0000563f6eb28364 in job_manager () No symbol table info available. #13 0x0000563f6eb25234 in main () No symbol table info available.
Hi, I noticed that the backtrace has no debug symbols. We compile all binaries with debug symbols by default. We also don't have an xfree() call inside task_cgroup_cpuset_set_task_affinity(). How exactly did you compile Slurm? Did you apply any local patches? Dominik
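A quick way to see which frames lack symbols is to scan the gdb output for the "No symbol table info available." marker. The following is a rough sketch under stated assumptions: the function name frames_without_symbols is invented for illustration, and the regular expression only targets the frame layout seen in the backtrace pasted in this ticket.

```python
import re

# Matches '#N 0xADDR in func (' as well as '#N func (' (no address).
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(")

def frames_without_symbols(bt_text: str) -> list:
    """Return (frame_number, function) pairs for which gdb had no symbols."""
    missing = []
    # Split just before each '#N ' marker so that each chunk holds one frame.
    for chunk in re.split(r"(?=#\d+\s)", bt_text):
        m = FRAME_RE.match(chunk)
        if m and "No symbol table info available." in chunk:
            missing.append((int(m.group(1)), m.group(2)))
    return missing

# Three frames copied from the backtrace in this ticket.
sample = (
    "#5 0x0000152422afb53c in __GI___libc_free (mem=<optimized out>) "
    "at malloc.c:2968 ar_ptr = <optimized out> "
    "#6 0x00001524239f812d in slurm_xfree () from "
    "/usr/lib/x86_64-linux-gnu/slurm-wlm/libslurmfull.so "
    "No symbol table info available. "
    "#7 0x000015241bdeff1e in task_cgroup_cpuset_set_task_affinity () from "
    "/usr/lib/x86_64-linux-gnu/slurm-wlm/task_cgroup.so "
    "No symbol table info available."
)
print(frames_without_symbols(sample))
```

On the sample this reports frames 6 and 7, the Slurm frames, while the libc frame is left out because it carries symbol information.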
We're built from the Debian package maintainer's version tagged 17.11.2-1, on Ubuntu 16.04.4, no local patches. https://anonscm.debian.org/cgit/collab-maint/slurm-llnl.git
Any update or advice on this issue? Is there any further information we can provide?
Hi, I am still working on this. Without debug symbols it is much more difficult to find the cause of the problem. I have checked the code in the Debian repository and it looks fine. I am not sure how it behaves when compiled without -O0 and with all the Debian hardening options. We recommend using the code and build configuration from our official SchedMD repository. Dominik
Thanks for the update, I appreciate it! I understand the recommendation to use the official repository and build configuration; it's something we're looking at, but isn't something we can do in the near term. If there's anything we can do to help, or any more information we can provide, please let me know.
Hi, could you install the debug symbols for slurmd, libslurm, and the plugins, and generate the backtrace one more time? I think these packages will provide the necessary symbols: libslurm32-dbgsym, slurm-dbgsym, slurmd-dbgsym, slurm-wlm-basic-plugins-dbgsym. Dominik
Sorry for the delay... It was a little more difficult than just installing the packages, as it seems the -dbgsym packages weren't built and are missing. I rebuilt and disabled stripping, and I think the new trace will be more useful. Some information is still optimized out, but there's info now for every step in the trace. warning: core file may not match specified executable file. [New LWP 27023] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `slurmstepd: [246331.b'. Program terminated with signal SIGABRT, Aborted. #0 0x000015535c5da428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. Thread 1 (Thread 0x15535d987700 (LWP 27023)): #0 0x000015535c5da428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 resultvar = 0 pid = 27023 selftid = 27023 #1 0x000015535c5dc02a in __GI_abort () at abort.c:89 save_stage = 2 act = {__sigaction_handler = {sa_handler = 0x6539646666372d30, sa_sigaction = 0x6539646666372d30}, sa_mask = {__val = {8223625903107040614, 3472328295963438381, 4192904167887482928, 2314885531086893104, 2314885530818453536, 2314885530818453536, 8528445641706184736, 7378645557150114166, 3472386798886664548, 7293971462467562800, 8223625903103567462, 3472328295963457581, 4192904167887482928, 2314885531086893104, 2314885530818453536, 2314885530818453536}}, sa_flags = 538976288, sa_restorer = 0x64} sigs = {__val = {32, 0 <repeats 15 times>}} #2 0x000015535c61c7ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x15535c735ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175 ap = <error reading variable ap (Attempt to dereference a generic pointer.)> fd = 2 on_2 = <optimized out> list = <optimized out> nlist = <optimized out> cp = <optimized out> written = <optimized out> #3 0x000015535c62537a in malloc_printerr 
(ar_ptr=<optimized out>, ptr=<optimized out>, str=0x15535c735f50 "free(): invalid next size (fast)", action=3) at malloc.c:5006 buf = "0000559606606860" cp = <optimized out> ar_ptr = <optimized out> str = 0x15535c735f50 "free(): invalid next size (fast)" action = 3 #4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867 size = <optimized out> fb = <optimized out> nextchunk = <optimized out> nextsize = <optimized out> nextinuse = <optimized out> prevsize = <optimized out> bck = <optimized out> fwd = <optimized out> errstr = <optimized out> locked = <optimized out> #5 0x000015535c62953c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968 ar_ptr = <optimized out> p = <optimized out> hook = <optimized out> #6 0x000015535d5261bd in slurm_xfree (item=item@entry=0x7ffd9efba478, file=file@entry=0x155359ac73c8 "../../../../../src/plugins/task/cgroup/task_cgroup_cpuset.c", line=line@entry=673, func=func@entry=0x155359ac8080 <__func__.17105> "_task_cgroup_cpuset_dist_cyclic") at ../../../src/common/xmalloc.c:241 p = <optimized out> #7 0x0000155359ac4e5e in _task_cgroup_cpuset_dist_cyclic (cpuset=0x5596065f5570, bind_verbose=0, job=0x5596065e7060, req_hwtype=HWLOC_OBJ_CORE, hwtype=HWLOC_OBJ_PU, topology=0x5596065f0360) at ../../../../../src/plugins/task/cgroup/task_cgroup_cpuset.c:673 npus = 32 obj_idxs = {1, 15, 0} tpc = 1 sock_loop = 33 s_ix = 0 nthreads = 2 spec_thread_cnt = <optimized out> t_ix = 0x559606606870 i = 0 npdist = 32 sock_fcyclic = <optimized out> cps = 4 j = 22 ntskip = 0 c_ixn = 0x5596065f2560 nboards = 1 ncores = 8 nsockets = 2 taskid = 0 core_cyclic = false hwloc_success = true c_ixc = 0x5596065f2620 spec_threads = 0x0 core_fcyclic = false #8 task_cgroup_cpuset_set_task_affinity (job=0x5596065e7060) at ../../../../../src/plugins/task/cgroup/task_cgroup_cpuset.c:1523 str = 0x559606606870 "\002" fstatus = -1 mstr = 
"P\000\000\000\000\000\000\000J\000\000\000\000\000\000\000\220\246\373\236\375\177\000\000%\000\000\000\000\000\000\000p\246\373\236\375\177\000\000p\246\373\236\375\177\000\000\060\360^\006\226U\000\000P\356R]S\025\000\000\360\247\373\236\375\177\000\000\001\000\000\000\000\000\000\000c\000\000\000\000\000\000\000@\275y]S\025\000\000\001\200\255\373n\000\000\000\275\002\000\000\000\000\000\000\000`\233]S\025\000\000(\004>]S\025\000\000@\025?]S\025\000\000K\306y]S\025\000\000\275\002\000\000\000\000\000\000@\025?]S\025\000\000\000`\233]S\025\000\000X\247\373\236\375\177\000\000T\247\373\236\375\177\000\000\000\000\000\000\000\000\000\000p\002\000\000\000\000\000\000"... bind_type = <optimized out> ts = {__bits = {64424509441, 140724603453440, 94102733455360, 140724603453472, 140724603453440, 23446226468864, 140727270745408, 0, 23447779842904, 140727270745360, 23447779842928, 18446603346438806257, 140727270745359, 4294967296, 140727270745488, 0}} socket_or_node = <optimized out> topology = 0x5596065f0360 cpuset = 0x5596065f5570 hwtype = HWLOC_OBJ_PU req_hwtype = HWLOC_OBJ_CORE bind_verbose = 0 rc = 0 pid = 27023 tssize = <optimized out> nobj = <optimized out> taskid = 0 jntasks = <optimized out> jnpus = <optimized out> spec_threads = <optimized out> #9 0x0000155359ac1fa1 in task_p_pre_launch (job=<optimized out>) at ../../../../../src/plugins/task/cgroup/task_cgroup.c:277 No locals. 
#10 0x0000559605fc898c in task_g_pre_launch (job=job@entry=0x5596065e7060) at ../../../../src/slurmd/common/task_plugin.c:450 i = 0 rc = 0 __func__ = "task_g_pre_launch" #11 0x0000559605fb3990 in exec_task (job=job@entry=0x5596065e7060, i=i@entry=0) at ../../../../src/slurmd/slurmstepd/task.c:429 gtids = 0x0 j = <optimized out> task = 0x5596065eedf0 tmp_env = <optimized out> saved_errno = <optimized out> node_offset = 0 task_offset = <optimized out> __func__ = "exec_task" #12 0x0000559605fada44 in _fork_all_tasks (job=job@entry=0x5596065e7060, io_initialized=io_initialized@entry=0x7ffd9efbcf0b) at ../../../../src/slurmd/slurmstepd/mgr.c:1823 time_stamp = "slurmstepd\000\000\000\000\000\000\000\377\377\377\377\377\377\377\000\377\377\377\377\377\377\377", '\000' <repeats 48 times>, "\377\377\377\377\377\377\377\377\000\377\377\377\377\377\377\377", '\000' <repeats 29 times>, "\377\377\377", '\000' <repeats 16 times>, "\377\377\377\377\377\377\377\377\377\377\000\377\377\377\377\377", '\000' <repeats 32 times>, "\003\000\000\000\000\000\000\000"... 
ei = 0x5596065fc6a0 rc = 0 i = 0 sprivs = {saved_uid = 0, saved_gid = 0, gid_list = 0x0, ngids = 4, saved_cwd = "/home/epoch\000lurmd\000\000\000\002", '\000' <repeats 15 times>, "\060\236R]S\025\000\000!R\004a\002\200\377\377߭\373\236\375\177\000\000\340\305_\006\226U\000\000\003\000\000\000\060", '\000' <repeats 27 times>, "P\230^\006\226U\000\000\000\000\000\000\000\000\000\000w\000\000\000|", '\000' <repeats 43 times>, " \233\226\\S\025\000\000\"\000\000\000\000\000\000\000\320n\000\000\000\000\000\000\060\341^\006\226U\000\000\004", '\000' <repeats 15 times>...} jobacct_id = {taskid = 0, nodeid = 0, job = 0x0} oom_value = <optimized out> exec_wait_list = 0x0 esc = <optimized out> tv1 = {tv_sec = 1524088074, tv_usec = 80593} tv2 = {tv_sec = 0, tv_usec = 0} tv_str = '\000' <repeats 19 times> delta_t = 0 __func__ = "_fork_all_tasks" #13 0x0000559605fb2304 in job_manager (job=0x5596065e7060) at ../../../../src/slurmd/slurmstepd/mgr.c:1322 rc = 0 io_initialized = true ckpt_type = 0x5596065f1140 "checkpoint/none" err_msg = 0x0 __func__ = "job_manager" #14 0x0000559605faf1d4 in main (argc=1, argv=0x7ffd9efbd198) at ../../../../src/slurmd/slurmstepd/slurmstepd.c:167 cli = 0x5596065dd310 self = 0x0 msg = 0x5596065dd010 rc = 0 launch_params = 0x0 __func__ = "main"
Hi Thanks for this bt. Could you send me lstopo output from this node? Dominik
Machine (126GB total) NUMANode L#0 (P#0 63GB) Package L#0 + L3 L#0 (20MB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#16) L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 PU L#2 (P#1) PU L#3 (P#17) L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 PU L#4 (P#2) PU L#5 (P#18) L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 PU L#6 (P#3) PU L#7 (P#19) L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 PU L#8 (P#4) PU L#9 (P#20) L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 PU L#10 (P#5) PU L#11 (P#21) L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 PU L#12 (P#6) PU L#13 (P#22) L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 PU L#14 (P#7) PU L#15 (P#23) HostBridge L#0 PCIBridge PCI 8086:1521 Net L#0 "ens1f0" PCI 8086:1521 Net L#1 "ens1f1" PCIBridge PCI 10de:1023 GPU L#2 "card1" GPU L#3 "renderD128" PCI 8086:8d62 Block(Disk) L#4 "sda" PCIBridge PCIBridge PCI 1a03:2000 GPU L#5 "card0" GPU L#6 "controlD64" PCI 8086:8d02 NUMANode L#1 (P#1 63GB) + Package L#1 + L3 L#1 (20MB) L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 PU L#16 (P#8) PU L#17 (P#24) L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 PU L#18 (P#9) PU L#19 (P#25) L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 PU L#20 (P#10) PU L#21 (P#26) L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 PU L#22 (P#11) PU L#23 (P#27) L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 PU L#24 (P#12) PU L#25 (P#28) L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 PU L#26 (P#13) PU L#27 (P#29) L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 PU L#28 (P#14) PU L#29 (P#30) L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 PU L#30 (P#15) PU L#31 (P#31)
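The lstopo output above shows that logical PU numbers (L#) and OS numbers (P#) differ on this node: the two hardware threads of Core L#0 are P#0 and P#16. A small parser makes that mapping explicit. This is a sketch for illustration only; the function name pu_logical_to_physical is an assumption, and it relies solely on the "PU L#n (P#m)" pattern visible in the pasted text.

```python
import re

PU_RE = re.compile(r"PU L#(\d+) \(P#(\d+)\)")

def pu_logical_to_physical(lstopo_text: str) -> dict:
    """Map each logical PU index (L#) to its OS index (P#)."""
    return {int(logical): int(physical)
            for logical, physical in PU_RE.findall(lstopo_text)}

# Fragment copied from the lstopo output in this ticket.
sample = ("Core L#0 PU L#0 (P#0) PU L#1 (P#16) "
          "Core L#1 PU L#2 (P#1) PU L#3 (P#17)")

mapping = pu_logical_to_physical(sample)
print(mapping)
```

Run against the full output, the same function should yield 32 entries, one per PU, which is a convenient cross-check against the CPU count that slurmd -C reports.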
Created attachment 6920 [details] patch proposal Hi, I still can't reproduce this issue. This patch fixes one suspicious place in the code that is close to where the crash occurs. Could you apply it and check whether it helps? Dominik
This seems to have solved the problem in my tests!
Hi, this patch is included in 17.11.7: https://github.com/SchedMD/slurm/commit/83c92a4d1003a3 I will mark this ticket as resolved/fixed. Dominik