| Summary: | slurmstepd core dump on job start | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Dan Barker <danbarke> |
| Component: | slurmstepd | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 17.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Core dump | ||

Created attachment 5185: Core dump

Overview:

Submitting jobs via sbatch or srun causes slurmstepd to core dump on slurm-17.11.0-0pre2.el7.centos.x86_64. This is not an issue on slurm-17.02.6.

Steps to reproduce:

    srun --pty -p standard -A hpcstaff -N 1 --mem=2500m /bin/bash -l

Result:

An allocation is made for the job on the compute node, but the user does not get a shell. Upon further investigation I noticed a core file generated for every job I tried to run. Running a backtrace on the dump gives:

    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/usr/lib64/libthread_db.so.1".
    Core was generated by `slurmstepd: [156465]'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
    318     task_cgroup_devices.c: No such file or directory.
    Missing separate debuginfos, use: debuginfo-install slurm-17.11.0-0pre2.el7.centos.x86_64
    (gdb) bt
    #0  0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
    #1  0x00002b79f586f2df in task_p_pre_setuid (job=0x1fb3c20) at task_cgroup.c:234
    #2  0x000000000044e7c2 in task_g_pre_setuid (job=0x1fb3c20) at task_plugin.c:372
    #3  0x00000000004301db in _fork_all_tasks (job=0x1fb3c20, io_initialized=0x7ffc69ba0fbb) at mgr.c:1606
    #4  0x000000000042f885 in job_manager (job=0x1fb3c20) at mgr.c:1275
    #5  0x000000000042a854 in main (argc=1, argv=0x7ffc69ba1178) at slurmstepd.c:183

The core file is attached.

Dan Barker
University of Michigan
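
As a reading aid, the frames of the backtrace in this report can be turned into a call chain that ends at the crashing function. The following is only a minimal sketch and not part of the original report: `FRAME_RE` and `parse_frames` are hypothetical names, and the regular expression assumes the frame layout shown in this report (it is not a general gdb output parser).

```python
import re

# Frame lines copied from the gdb `bt` output in this report.
BACKTRACE = """\
#0 0x00002b79f58755ef in task_cgroup_devices_create (job=0x1fb3c20) at task_cgroup_devices.c:318
#1 0x00002b79f586f2df in task_p_pre_setuid (job=0x1fb3c20) at task_cgroup.c:234
#2 0x000000000044e7c2 in task_g_pre_setuid (job=0x1fb3c20) at task_plugin.c:372
#3 0x00000000004301db in _fork_all_tasks (job=0x1fb3c20, io_initialized=0x7ffc69ba0fbb) at mgr.c:1606
#4 0x000000000042f885 in job_manager (job=0x1fb3c20) at mgr.c:1275
#5 0x000000000042a854 in main (argc=1, argv=0x7ffc69ba1178) at slurmstepd.c:183
"""

# Hypothetical pattern: "#N [0xADDR in] func (args) at file:line".
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+(?:0x[0-9a-f]+\s+in\s+)?"
    r"(?P<func>\w+)\s+\(.*\)\s+at\s+(?P<file>[^\s:]+):(?P<line>\d+)$"
)


def parse_frames(text):
    """Return (index, function, file, line) tuples for each matching frame line."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            frames.append(
                (
                    int(match["index"]),
                    match["func"],
                    match["file"],
                    int(match["line"]),
                )
            )
    return frames


if __name__ == "__main__":
    frames = parse_frames(BACKTRACE)
    # Print outermost caller first, so the crashing function comes last.
    for index, func, source, line in reversed(frames):
        print(f"frame {index}: {func} ({source}:{line})")
```

Read from the outermost frame inward, the chain runs from `main` through `job_manager`, `_fork_all_tasks`, `task_g_pre_setuid`, and `task_p_pre_setuid` to `task_cgroup_devices_create`, where the segmentation fault is reported.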