| Summary: | backup slurmctld abort() in select/cray | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Sorry, it would seem that I failed to finish my timeline:

- at 8:53 the primary crashed
- at 8:55 the backup took over
- at 9:06 the backup crashed
- at 9:16 the primary was restored

Note that because of the frequent crashes of the primary, this instance of the backup has taken over several times.

I am investigating this, thanks for the information. Did this happen only once, or is it happening whenever the backup controller takes over?

Just the one time; we usually don't fail over while in operations.

Do you still have this core file? There is a global blade_cnt that might be 0, which means either that select_p_node_init was never called in the backup controller or it got the wrong blade count. So could you get the value of blade_cnt from the core?

I've looked into this some more but have not been able to make progress. We need a reproduction to get more information and a core file. Timing out. Thanks
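For anyone revisiting this, pulling blade_cnt out of the core is straightforward in gdb; the binary and core paths below are placeholders, and since blade_cnt lives in the select/cray plugin, the plugin's .so symbols need to be loadable for the symbol to resolve:

```gdb
$ gdb /usr/sbin/slurmctld /var/spool/slurmctld/core.12345
(gdb) print blade_cnt
(gdb) frame 5
(gdb) print *jobinfo
```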
Hello,

Owing to continuing issues with slurmctld abort()ing due to the pthread issue being discussed in another bug, our backup is getting some exercise. The backup is presently configured to be non-scheduling, though I'm thinking of changing that given the relative instability we are experiencing with the primary controller. The sequence of events was that at 8:53am the primary ctld abort()ed:

```
#0  0x00007f460069cf67 in raise () from /lib64/libc.so.6
#1  0x00007f460069e33a in abort () from /lib64/libc.so.6
#2  0x00007f4600695d66 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f4600695e12 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f4600e98e31 in bit_test (b=<optimized out>, bit=<optimized out>) at bitstring.c:229
#5  0x00007f45ff81327d in _set_job_running (job_ptr=0x7f452c489680) at select_cray.c:938
#6  select_p_job_begin (job_ptr=0x7f452c489680) at select_cray.c:1920
#7  0x00007f4600ed6d9f in select_g_job_begin (job_ptr=job_ptr@entry=0x7f452c489680) at node_select.c:542
#8  0x000000000047af79 in select_nodes (job_ptr=<optimized out>, test_only=test_only@entry=false, select_node_bitmap=select_node_bitmap@entry=0x0, err_msg=err_msg@entry=0x7f44a56d59e0, submission=submission@entry=true) at node_scheduler.c:2870
#9  0x00000000004554ea in _select_nodes_parts (err_msg=0x7f44a56d59e0, select_node_bitmap=0x0, test_only=false, job_ptr=0x7f452c489680) at job_mgr.c:4683
#10 job_allocate (job_specs=job_specs@entry=0x7f452c02a9d0, immediate=immediate@entry=0, will_run=will_run@entry=0, resp=resp@entry=0x0, allocate=allocate@entry=1, submit_uid=submit_uid@entry=77419, job_pptr=job_pptr@entry=0x7f44a56d59d8, err_msg=err_msg@entry=0x7f44a56d59e0, protocol_version=8448) at job_mgr.c:4905
#11 0x000000000049214a in _slurm_rpc_allocate_resources (msg=msg@entry=0x7f44a56d5ec0) at proc_req.c:1648
#12 0x0000000000493703 in slurmctld_req (msg=msg@entry=0x7f44a56d5ec0, arg=arg@entry=0x7f45cc014cf0) at proc_req.c:328
#13 0x000000000042a541 in _service_connection (arg=0x7f45cc014cf0) at controller.c:1274
#14 0x00007f4600c19724 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f4600754e8d in clone () from /lib64/libc.so.6
```

Inspecting frame 5 in the core:

```
(gdb) frame 5
#5  0x00007f45ff81327d in _set_job_running (job_ptr=0x7f452c489680) at select_cray.c:938
938     in select_cray.c
(gdb) print (jobinfo->blade_map[0])
$12 = 1111704645
(gdb) print (jobinfo->blade_map[1])
$13 = 0
(gdb) print nodeinfo->blade_id
$14 = 0
(gdb)
```

It would seem that the _assert_bit_valid is failing because the length of the map is 0. I have not taken the time to trace the code, but I imagine this is because the blade_map was not initialized in the case of a backup controller, or a non-scheduling backup controller (I haven't looked).

Thanks,
Doug