Ticket 4125

Summary: slurmstepd core dump on job start
Product: Slurm Reporter: Dan Barker <danbarke>
Component: slurmstepd Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 17.11.x   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Core dump

Description Dan Barker 2017-08-31 12:38:12 MDT
Created attachment 5185
Core dump

Overview:
Submitting jobs via sbatch or srun causes slurmstepd to core dump on slurm-17.11.0-0pre2.el7.centos.x86_64. The problem does not occur on slurm-17.02.6.

Steps to reproduce:
srun --pty -p standard -A hpcstaff -N 1 --mem=2500m /bin/bash -l

Result:
An allocation is made for the job on the compute node, but the user does not get a shell. On further investigation I found that a core file is generated for every job I tried to run.

Running a backtrace on the dump gives:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `slurmstepd: [156465]'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
318     task_cgroup_devices.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-17.11.0-0pre2.el7.centos.x86_64
(gdb) bt
#0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
#1  0x00002b79f586f2df in task_p_pre_setuid (job=0x1fb3c20) at task_cgroup.c:234
#2  0x000000000044e7c2 in task_g_pre_setuid (job=0x1fb3c20) at task_plugin.c:372
#3  0x00000000004301db in _fork_all_tasks (job=0x1fb3c20, io_initialized=0x7ffc69ba0fbb) at mgr.c:1606
#4  0x000000000042f885 in job_manager (job=0x1fb3c20) at mgr.c:1275
#5  0x000000000042a854 in main (argc=1, argv=0x7ffc69ba1178) at slurmstepd.c:183
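
For triage, the frames above can be reduced to function, source file, and line number. The following is only a minimal sketch: the regular expression, the `parse_frames` helper, and the `BACKTRACE` constant are illustrative names of my own, and the pattern assumes every frame has the `#N  0xADDR in func (args) at file:line` shape seen in this backtrace (gdb can omit the address on some frames, which this pattern would skip).

```python
import re

# Frame lines copied verbatim from the gdb `bt` output above.
BACKTRACE = """\
#0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
#1  0x00002b79f586f2df in task_p_pre_setuid (job=0x1fb3c20) at task_cgroup.c:234
#2  0x000000000044e7c2 in task_g_pre_setuid (job=0x1fb3c20) at task_plugin.c:372
#3  0x00000000004301db in _fork_all_tasks (job=0x1fb3c20, io_initialized=0x7ffc69ba0fbb) at mgr.c:1606
#4  0x000000000042f885 in job_manager (job=0x1fb3c20) at mgr.c:1275
#5  0x000000000042a854 in main (argc=1, argv=0x7ffc69ba1178) at slurmstepd.c:183
"""

# Assumed frame shape: "#N  0xADDR in FUNC (ARGS) at FILE:LINE".
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+0x[0-9a-f]+ in (?P<func>\S+) \(.*\) "
    r"at (?P<file>[^:\s]+):(?P<line>\d+)$"
)


def parse_frames(text):
    """Return (index, function, file, line) tuples for each matching frame."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:  # lines that do not look like a frame are ignored
            frames.append(
                (
                    int(match["index"]),
                    match["func"],
                    match["file"],
                    int(match["line"]),
                )
            )
    return frames


frames = parse_frames(BACKTRACE)
for index, func, source, line in frames:
    print(f"#{index} {func} ({source}:{line})")
```

The innermost frame (index 0) is the crash site, here `task_cgroup_devices_create` at `task_cgroup_devices.c:318`, reached through the task/cgroup plugin's `task_p_pre_setuid`.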


The core file is attached.

Dan Barker
University of Michigan