| Summary: | slurmctld crashes when scontrol is run on members of a job array. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | John Hanks <john.hanks> |
| Component: | slurmctld | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 14.11.3 15.08.0pre2 |

This is fixed in the following commit:
https://github.com/SchedMD/slurm/commit/db98d6242db48b44cfe54d05e54ae01dcf596796

Please re-open if it doesn't solve it for you.

Thanks,
Brian

We are seeing reproducible crashes when modifying job array elements with scontrol update.

To reproduce:

1. Submit a job array.
2. Modify it with something like:

    for i in {1..50}; do
        scontrol update jobid=3095958_$i reservation=kapferc_9
    done

Backtrace from core dump is:

    Program terminated with signal 6, Aborted.
    #0 0x0000003558832635 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6_5.4.x86_64 libgcc-4.4.7-4.el6.x86_64 munge-libs-0.5.10-1.el6.x86_64 sssd-client-1.11.6-30.el6.x86_64
    (gdb) bt
    #0 0x0000003558832635 in raise () from /lib64/libc.so.6
    #1 0x0000003558833e15 in abort () from /lib64/libc.so.6
    #2 0x0000003558870547 in __libc_message () from /lib64/libc.so.6
    #3 0x0000003558875e76 in malloc_printerr () from /lib64/libc.so.6
    #4 0x00000035588789b3 in _int_free () from /lib64/libc.so.6
    #5 0x00000000004a622d in slurm_xfree (item=0x2b867e99d470, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at xmalloc.c:238
    #6 0x00000000004692fa in select_nodes (job_ptr=<value optimized out>, test_only=false, select_node_bitmap=<value optimized out>, err_msg=0x0) at node_scheduler.c:1814
    #7 0x000000000045ba4b in schedule (job_limit=100) at job_scheduler.c:1274
    #8 0x0000000000432558 in _slurmctld_background (no_data=<value optimized out>) at controller.c:1635
    #9 0x000000000043507f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561
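
For anyone retesting against a fixed build, the update loop from the report could be wrapped in a small helper. This is only a sketch, assuming bash; `gen_update_cmds` is a hypothetical name, not part of Slurm. It only prints the `scontrol update` commands, so they can be reviewed before anything is sent to the controller.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): print one "scontrol update"
# command per job array element, mirroring the loop in the report.
# Arguments: <array_job_id> <first_index> <last_index> <reservation>
gen_update_cmds() {
    local job_id=$1 first=$2 last=$3 resv=$4
    local i
    for ((i = first; i <= last; i++)); do
        # Same command form as in the report: jobid=<array_job_id>_<index>
        printf 'scontrol update jobid=%s_%d reservation=%s\n' \
            "$job_id" "$i" "$resv"
    done
}
```

With the values from the report, `gen_update_cmds 3095958 1 50 kapferc_9` prints the fifty commands; piping that output to `sh` would run them. Do that only on a test system, since on an affected version this is the sequence that preceded the slurmctld abort.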