Ticket 4125 - slurmstepd core dump on job start
Summary: slurmstepd core dump on job start
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.x
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2017-08-31 12:38 MDT by Dan Barker
Modified: 2017-08-31 12:38 MDT
Site: -Other-


Attachments
Core dump (6.45 MB, application/x-core)
2017-08-31 12:38 MDT, Dan Barker

Description Dan Barker 2017-08-31 12:38:12 MDT
Created attachment 5185
Core dump

Overview:
Submitting jobs via sbatch or srun causes slurmstepd to core dump on slurm-17.11.0-0pre2.el7.centos.x86_64. The problem does not occur on slurm-17.02.6.

Steps to reproduce:
srun --pty -p standard -A hpcstaff -N 1 --mem=2500m /bin/bash -l

Result:
An allocation is made for the job on the compute node, but the user does not get a shell. On further investigation, I found that a core file was generated for every job I tried to run.

Running a backtrace on the dump gives:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `slurmstepd: [156465]'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
318     task_cgroup_devices.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-17.11.0-0pre2.el7.centos.x86_64
(gdb) bt
#0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
#1  0x00002b79f586f2df in task_p_pre_setuid (job=0x1fb3c20) at task_cgroup.c:234
#2  0x000000000044e7c2 in task_g_pre_setuid (job=0x1fb3c20) at task_plugin.c:372
#3  0x00000000004301db in _fork_all_tasks (job=0x1fb3c20, io_initialized=0x7ffc69ba0fbb) at mgr.c:1606
#4  0x000000000042f885 in job_manager (job=0x1fb3c20) at mgr.c:1275
#5  0x000000000042a854 in main (argc=1, argv=0x7ffc69ba1178) at slurmstepd.c:183
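
To make the call path in the backtrace easier to read, the following is a minimal sketch that parses the frame lines above and prints them from outermost caller to crash site. The frame text is copied from this ticket; the helper names (parse_frames, call_chain) and the regular expression are illustrative assumptions, not part of Slurm or gdb, and the pattern only covers frames in the "function (args) at file:line" form seen here.

```python
import re

# Frame lines copied verbatim from the gdb backtrace in this ticket.
BACKTRACE = """\
#0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
#1  0x00002b79f586f2df in task_p_pre_setuid (job=0x1fb3c20) at task_cgroup.c:234
#2  0x000000000044e7c2 in task_g_pre_setuid (job=0x1fb3c20) at task_plugin.c:372
#3  0x00000000004301db in _fork_all_tasks (job=0x1fb3c20, io_initialized=0x7ffc69ba0fbb) at mgr.c:1606
#4  0x000000000042f885 in job_manager (job=0x1fb3c20) at mgr.c:1275
#5  0x000000000042a854 in main (argc=1, argv=0x7ffc69ba1178) at slurmstepd.c:183
"""

# Hypothetical pattern: frame index, optional address, function, file and line.
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+"
    r"(?:0x[0-9a-f]+ in )?"
    r"(?P<func>\w+) \(.*\) at (?P<file>[^:\s]+):(?P<line>\d+)$"
)


def parse_frames(text):
    """Return (index, function, file, line) tuples for each matching frame line."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            frames.append(
                (int(match["index"]), match["func"], match["file"], int(match["line"]))
            )
    return frames


def call_chain(frames):
    """Join function names from the outermost caller down to the crash site."""
    ordered = sorted(frames, reverse=True)  # highest frame index is the outermost caller
    return " -> ".join(func for _, func, _, _ in ordered)


if __name__ == "__main__":
    frames = parse_frames(BACKTRACE)
    print(call_chain(frames))
    _, func, source_file, line = frames[0]  # frame #0 is where the segfault occurred
    print(f"crash site: {func} at {source_file}:{line}")
```

Read this way, the segfault occurs in the task/cgroup plugin's devices code, reached through the pre-setuid task plugin hook while slurmstepd forks the job's tasks.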


The core file is attached.

Dan Barker
University of Michigan