On upgrade on RHEL6 from 14.11.6 to 15.08.4, our slurmctld daemon fails to start with:

[2015-12-09T18:17:51.538+00:00] error: Could not open job state file /project/ukmo/slurm/spool/slurmctld-test/sicp_state: No such file or directory

That directory exists and is under an NFS mount for our primary/backup controllers, and it contains the normal 14.11 state files, except for this new one. The sicp_state file (and the SICP mechanics) didn't exist in 14.11, but it seems to be required for the 15.08.4 slurmctld daemon to start. We've followed the Quick Start Admin Upgrade notes (ordering of shutdown, upgrade, and start for the various daemons), and don't know what's wrong. We've raised this as Medium Impact because the upgrade is highly desirable for our production cluster - at the moment it's only our test cluster that is out of action due to this effect.
Hi Ben, slurmctld should just log that error message and keep running. If I remove the file I get the error, but the controller keeps operating; indeed, it simply recreates the file by itself. I suspect something else is wrong. Could you attach your slurmctld log file and slurm.conf? David
Sorry, we had a mid-air collision on ticket editing here - I was about to post that we think it was actually a slurmctld segfault, and the 'error' was just the last line of the logs each time. We had around 1800 jobs queued and 200 running, and we'll repeat the upgrade to see whether that was a cause.
(we *now* think)
Running `slurmctld -c` worked fine.
OK, I understand. Did you check whether there is a core file in the log directory? Also, please make sure you don't limit the core file size. It is also possible to run the controller with the -D option, in which case it will not daemonize but will run in the foreground, so you can get the core dump in the working directory. David
Another mid-air collision :-). If you run with -c then all state is cleared and all jobs are lost... is that what happened? David
Yes, `-c` nuked all the running and queued jobs - not a problem on the test cluster, but not something we want to repeat!

gdb /usr/sbin/slurmctld /var/log/slurm/core.NNNNN gives:

....
Reading symbols from /usr/lib64/slurm//route_default.so...done.
Loaded symbols for /usr/lib64/slurm//route_default.so
Reading symbols from /usr/lib64/slurm//priority_multifactor.so...done.
Loaded symbols for /usr/lib64/slurm//priority_multifactor.so
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fe4749cd514 in _apply_new_usage (job_ptr=0x20d1fb0, start_period=1449683430, end_period=1449743924, adjust_for_end=false)
    at priority_multifactor.c:1076
1076    priority_multifactor.c: No such file or directory.
        in priority_multifactor.c
Missing separate debuginfos, use: debuginfo-install slurm-15.08.4-1.el6.x86_64
Well, the running jobs just disappeared from slurmctld's memory, but they are still running; if you search for slurmstepd processes you should see them around. Can you please print the stack using the gdb command 'where'? It looks like it core dumped in the multifactor plugin. David
They were very rapid jobs (~20 seconds long), so they wouldn't still be around after the upgrade.

(gdb) where
#0  0x00007f8e3c280514 in _apply_new_usage (job_ptr=0x1424fb0, start_period=1449683430, end_period=1449743875, adjust_for_end=false) at priority_multifactor.c:1076
#1  0x00007f8e3c2822f3 in decay_apply_new_usage (job_ptr=0x1424fb0, start_time_ptr=0x7f8e3c279db0) at priority_multifactor.c:1880
#2  0x00007f8e3c28342b in _ft_decay_apply_new_usage (job=0x1424fb0, start=0x7f8e3c279db0) at fair_tree.c:97
#3  0x00000000004f3ff3 in list_for_each (l=0x1396070, f=0x7f8e3c283408 <_ft_decay_apply_new_usage>, arg=0x7f8e3c279db0) at list.c:527
#4  0x00007f8e3c283302 in fair_tree_decay (jobs=0x1396070, start=1449743875) at fair_tree.c:61
#5  0x00007f8e3c280dd7 in _decay_thread (no_data=0x0) at priority_multifactor.c:1338
#6  0x000000359be07a51 in start_thread () from /lib64/libpthread.so.0
#7  0x000000359b6e893d in clone () from /lib64/libc.so.6
It core dumps while trying to access the job's TRES. From the core file, in frame 0, can you please print the dereferenced job pointer?

(gdb) frame 0
(gdb) print * job_ptr

David
(gdb) frame 0
#0  0x00007f8e3c280514 in _apply_new_usage (job_ptr=0x1424fb0, start_period=1449683430, end_period=1449743875, adjust_for_end=false)
    at priority_multifactor.c:1076
1076    in priority_multifactor.c
(gdb) print * job_ptr
$1 = {account = 0x14254f0 "normalexpress", alias_list = 0x0, alloc_node = 0x14254d0 "eld001", alloc_resp_port = 0,
  alloc_sid = 20824, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 8, assoc_ptr = 0x13b0c70,
  batch_flag = 1, batch_host = 0x0, bit_flags = 0, burst_buffer = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0,
  comment = 0x0, cpu_cnt = 1, billable_tres = 4294967294, cr_enabled = 0, db_index = 12953, derived_ec = 0,
  details = 0x14252e0, direct_set_prio = 0, end_time = 0, end_time_exp = 0, epilog_running = false, exit_code = 0,
  front_end_ptr = 0x0, gres = 0x1425520 "tmp:30", gres_list = 0x141fb50, gres_alloc = 0x0, gres_req = 0x0, gres_used = 0x0,
  group_id = 1008, job_id = 12931, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0,
  job_state = 0, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x1424f90},
  mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0x1425480 "slurm-submit.sh", network = 0x0, next_step_id = 0,
  nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 1,
  nodes_completing = 0x0, other_port = 0, partition = 0x1425460 "normal", part_ptr_list = 0x0, part_nodes_missing = false,
  part_ptr = 0x13b7a40, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
  priority = 21666, priority_array = 0x0, prio_factors = 0x1424da0, profile = 0, qos_id = 1, qos_ptr = 0x13ae260,
  reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295,
  resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x1425540, sicp_mode = 0 '\000', spank_job_env = 0x0,
  spank_job_env_size = 0, start_protocol_ver = 7168, start_time = 1449683430, state_desc = 0x0, state_reason = 3,
  state_reason_prev = 0, step_list = 0x141fb00, suspend_time = 0, time_last_active = 1449743875, time_limit = 10,
  time_min = 0, tot_sus_time = 0, total_cpus = 1, total_nodes = 0, tres_req_cnt = 0x1424e10,
  tres_req_str = 0x1425840 "1=1,2=30,4=1", tres_fmt_req_str = 0x14258a0 "cpu=1,mem=30,node=1", tres_alloc_cnt = 0x0,
  tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 811, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0,
  warn_time = 0, wckey = 0x14254b0 "*", req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
This is the problem:

    tres_alloc_cnt = 0x0

The code tries to dereference a NULL address and gets a segmentation fault. Were there any 15.08 jobs? Could you also print slurmctld_tres_cnt?

(gdb) print slurmctld_tres_cnt

David
(gdb) print slurmctld_tres_cnt
$2 = 4
I don't think there were any 15.08 jobs! We hadn't made any changes to the configuration for TRES yet...
Ben, we have a fix to prevent the controller from getting a SIGSEGV. It is commit 6c045965bfb, and it will be officially available in the 15.08.5 release, which should happen this week. However, you can get the diffs and apply the patch locally. Do you want me to send you the diffs directly? David
That's great! We'll apply the fix directly and retest on the test cluster - if that goes well we'll use 15.08.5 when it comes out. We'll go through the upgrade process again with the patched 15.08.4 and let you know how it goes.
Created attachment 2497 [details] slurmctld patch Here is the patch to prevent slurmctld from core dumping. David
Please wait - that's the wrong fix!
Ben, sorry for the confusion. The problem was actually already fixed last Friday. The latest commit is just a further check, so it is harmless; however, the diffs I appended will not, by themselves, fix the problem - you need commit 9f98610d3fc3981. My suggestion would be to wait for 15.08.5 to be released which, as I said, should happen any day now. David
Thanks David, no worries - we'll wait for 15.08.5 and then retest.
Changing status to closed, as this is already fixed. David