Ticket 2237 - 14.11.6 to 15.08.4 upgrade: sicp_state file not found
Summary: 14.11.6 to 15.08.4 upgrade: sicp_state file not found
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 15.08.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-12-09 20:22 MST by Ben Fitzpatrick
Modified: 2015-12-09 23:02 MST

See Also:
Site: Met Office


Attachments
slurmctld patch (796 bytes, patch)
2015-12-09 22:05 MST, David Bigagli

Description Ben Fitzpatrick 2015-12-09 20:22:21 MST
On upgrade on RHEL6 from 14.11.6 to 15.08.4, our slurmctld daemon fails to start with:

[2015-12-09T18:17:51.538+00:00] error: Could not open job state file /project/ukmo/slurm/spool/slurmctld-test/sicp_state: No such file or directory

That directory exists and is under an NFS mount for our primary/backup controllers, and it contains the normal 14.11 state files, except for this new one.

The sicp_state file (and SICP mechanics) didn't exist at 14.11, but seems to be required for the 15.08.4 slurmctld daemon to start. We've followed the Quick Start Admin Upgrade notes (ordering of shutdown, upgrade, and start for the various daemons), and don't know what's wrong.

We've raised this as Medium Impact because the upgrade is highly desirable for our production cluster - at the moment, it's just our test cluster that is out of action due to this issue.
Comment 1 David Bigagli 2015-12-09 20:41:18 MST
Hi Ben,
       the slurmctld should just log that error message and keep running. If I
remove the file I get the same error, but the controller keeps operating; it
just recreates the file by itself. I suspect something else is wrong. Could you
attach your slurmctld log file and slurm.conf?

David
Comment 2 Ben Fitzpatrick 2015-12-09 20:50:31 MST
Sorry, we had a mid-air collision on ticket editing here - I was about to post that we think it was actually a slurmctld segfault, and the 'error' was just the last line of the logs each time. We had around 1800 jobs queued and 200 running; we'll repeat the upgrade and see if this was a cause.
Comment 3 Ben Fitzpatrick 2015-12-09 20:51:41 MST
(we *now* think)
Comment 4 Ben Fitzpatrick 2015-12-09 20:55:35 MST
Running `slurmctld -c` worked fine.
Comment 5 David Bigagli 2015-12-09 20:56:45 MST
Ok, I understand. Did you check whether there is a core file in the log directory?
Also please make sure you don't limit the core file size.
It is also possible to run the controller with the -D option, in which case
it will not daemonize but will run in the foreground, so you can get the core
dump in the working directory.

David
Comment 6 David Bigagli 2015-12-09 20:57:31 MST
Another mid-air collision :-). If you run with -c then all state is cleared and
all jobs are lost... is that what happened?

David
Comment 7 Ben Fitzpatrick 2015-12-09 21:10:14 MST
Yes, `-c` nuked all the running and queued jobs - not a problem on the test cluster, but not something we want to repeat!

gdb /usr/sbin/slurmctld /var/log/slurm/core.NNNNN gives:

....
Reading symbols from /usr/lib64/slurm//route_default.so...done.
Loaded symbols for /usr/lib64/slurm//route_default.so
Reading symbols from /usr/lib64/slurm//priority_multifactor.so...done.
Loaded symbols for /usr/lib64/slurm//priority_multifactor.so
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fe4749cd514 in _apply_new_usage (job_ptr=0x20d1fb0, 
    start_period=1449683430, end_period=1449743924, adjust_for_end=false)
    at priority_multifactor.c:1076
1076	priority_multifactor.c: No such file or directory.
	in priority_multifactor.c
Missing separate debuginfos, use: debuginfo-install slurm-15.08.4-1.el6.x86_64
Comment 8 David Bigagli 2015-12-09 21:13:36 MST
Well, the running jobs just disappeared from slurmctld's memory, but they are
still running; if you search for slurmstepd processes you should see them around.

Can you please print the stack using the `where` command? It looks like it
core dumped in the multifactor plugin.

David
Comment 9 Ben Fitzpatrick 2015-12-09 21:20:45 MST
They were very rapid jobs (~20 seconds long), so they wouldn't still be around after the upgrade.

(gdb) where
#0  0x00007f8e3c280514 in _apply_new_usage (job_ptr=0x1424fb0, start_period=1449683430, end_period=1449743875, adjust_for_end=false)
    at priority_multifactor.c:1076
#1  0x00007f8e3c2822f3 in decay_apply_new_usage (job_ptr=0x1424fb0, start_time_ptr=0x7f8e3c279db0) at priority_multifactor.c:1880
#2  0x00007f8e3c28342b in _ft_decay_apply_new_usage (job=0x1424fb0, start=0x7f8e3c279db0) at fair_tree.c:97
#3  0x00000000004f3ff3 in list_for_each (l=0x1396070, f=0x7f8e3c283408 <_ft_decay_apply_new_usage>, arg=0x7f8e3c279db0) at list.c:527
#4  0x00007f8e3c283302 in fair_tree_decay (jobs=0x1396070, start=1449743875) at fair_tree.c:61
#5  0x00007f8e3c280dd7 in _decay_thread (no_data=0x0) at priority_multifactor.c:1338
#6  0x000000359be07a51 in start_thread () from /lib64/libpthread.so.0
#7  0x000000359b6e893d in clone () from /lib64/libc.so.6
Comment 10 David Bigagli 2015-12-09 21:21:08 MST
It core dumps while trying to access the job's TRES. From the core file, at
frame 0, can you please print the dereferenced job pointer?

(gdb) frame 0
(gdb) print * job_ptr


David
Comment 11 Ben Fitzpatrick 2015-12-09 21:23:17 MST
(gdb) frame 0
#0  0x00007f8e3c280514 in _apply_new_usage (job_ptr=0x1424fb0, start_period=1449683430, end_period=1449743875, adjust_for_end=false)
    at priority_multifactor.c:1076
1076	in priority_multifactor.c
(gdb) print * job_ptr
$1 = {account = 0x14254f0 "normalexpress", alias_list = 0x0, alloc_node = 0x14254d0 "eld001", alloc_resp_port = 0, alloc_sid = 20824, 
  array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 8, assoc_ptr = 0x13b0c70, batch_flag = 1, batch_host = 0x0, 
  bit_flags = 0, burst_buffer = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 1, 
  billable_tres = 4294967294, cr_enabled = 0, db_index = 12953, derived_ec = 0, details = 0x14252e0, direct_set_prio = 0, end_time = 0, 
  end_time_exp = 0, epilog_running = false, exit_code = 0, front_end_ptr = 0x0, gres = 0x1425520 "tmp:30", gres_list = 0x141fb50, 
  gres_alloc = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 1008, job_id = 12931, job_next = 0x0, job_array_next_j = 0x0, 
  job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 0, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, 
    time = 0, tres = 0x1424f90}, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0x1425480 "slurm-submit.sh", network = 0x0, 
  next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 1, 
  nodes_completing = 0x0, other_port = 0, partition = 0x1425460 "normal", part_ptr_list = 0x0, part_nodes_missing = false, 
  part_ptr = 0x13b7a40, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 21666, 
  priority_array = 0x0, prio_factors = 0x1424da0, profile = 0, qos_id = 1, qos_ptr = 0x13ae260, reboot = 0 '\000', restart_cnt = 0, 
  resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, 
  select_jobinfo = 0x1425540, sicp_mode = 0 '\000', spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 7168, 
  start_time = 1449683430, state_desc = 0x0, state_reason = 3, state_reason_prev = 0, step_list = 0x141fb00, suspend_time = 0, 
  time_last_active = 1449743875, time_limit = 10, time_min = 0, tot_sus_time = 0, total_cpus = 1, total_nodes = 0, tres_req_cnt = 0x1424e10, 
  tres_req_str = 0x1425840 "1=1,2=30,4=1", tres_fmt_req_str = 0x14258a0 "cpu=1,mem=30,node=1", tres_alloc_cnt = 0x0, tres_alloc_str = 0x0, 
  tres_fmt_alloc_str = 0x0, user_id = 811, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x14254b0 "*", 
  req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
Comment 12 David Bigagli 2015-12-09 21:28:19 MST
This is the problem:

tres_alloc_cnt = 0x0

the code tries to dereference a NULL pointer and gets a segmentation fault.
Were there any 15.08 jobs?

Could you also print slurmctld_tres_cnt?

(gdb) print slurmctld_tres_cnt

David
Comment 13 Ben Fitzpatrick 2015-12-09 21:50:36 MST
(gdb) print slurmctld_tres_cnt
$2 = 4
Comment 14 Ben Fitzpatrick 2015-12-09 21:51:21 MST
I don't think there were any 15.08 jobs! We hadn't made any changes to the configuration for TRES yet...
Comment 15 David Bigagli 2015-12-09 21:53:55 MST
Ben,
    we have a fix to prevent the controller from getting SIGSEGV.
It is commit 6c045965bfb, and it will be officially available in the
15.08.5 release, which should happen this week. In the meantime you can get
the diffs and apply the patch locally.

Do you want me to send you the diffs directly?

David
Comment 16 Ben Fitzpatrick 2015-12-09 22:03:23 MST
That's great! We'll apply the fix directly and retest on the test cluster - if that goes well we'll use 15.08.5 when it comes out. We'll go through the upgrade process again with the patched 15.08.4 and let you know how it goes.
Comment 17 David Bigagli 2015-12-09 22:05:41 MST
Created attachment 2497 [details]
slurmctld patch


Here is the patch to prevent slurmctld from core dumping.

David
Comment 18 David Bigagli 2015-12-09 22:07:02 MST
Wait please that's a wrong fix!
Comment 19 David Bigagli 2015-12-09 22:16:28 MST
Ben,
    sorry for the confusion. The problem was actually already fixed last Friday.
The latest commit is just a further check, so it is harmless; however, the
diffs I attached will not, by themselves, fix the problem - you need commit
9f98610d3fc3981. My suggestion would be to wait for 15.08.5 to be released,
which, as I said, should happen any day now.

David
Comment 20 Ben Fitzpatrick 2015-12-09 22:49:16 MST
Thanks David, no worries - we'll wait for 15.08.5 and then retest.
Comment 21 David Bigagli 2015-12-09 23:02:16 MST
Changing status to closed, as this is already fixed.

David