Ticket 300 - crash of slurmctld
Summary: crash of slurmctld
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 2.6.x
Hardware: Linux
Severity: 2 - High Impact
Assignee: Moe Jette
 
Reported: 2013-05-24 03:01 MDT by Yiannis Georgiou
Modified: 2013-06-21 08:31 MDT

See Also:
Site: Universitat Dresden (Germany)


Attachments
slurm config file (5.14 KB, application/octet-stream)
2013-06-12 10:25 MDT, Yiannis Georgiou
valgrind results for array jobs execution (718.56 KB, application/octet-stream)
2013-06-14 01:43 MDT, Yiannis Georgiou
patch for slurmctld core dump (692 bytes, patch)
2013-06-14 06:17 MDT, David Bigagli

Description Yiannis Georgiou 2013-05-24 03:01:40 MDT
Hello,

at the Dresden site, slurmctld crashed with the following core file and no particular message in the log:


#0  0x0000003234c328a5 in raise () from /lib64/libc.so.6
#1  0x0000003234c34085 in abort () from /lib64/libc.so.6
#2  0x0000003234c6fa37 in __libc_message () from /lib64/libc.so.6
#3  0x0000003234c75366 in malloc_printerr () from /lib64/libc.so.6
#4  0x000000000049535c in slurm_xfree (item=0x7fa2893278e8, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at xmalloc.c:270
#5  0x000000000043c045 in _list_delete_job (job_entry=<value optimized out>) at job_mgr.c:5568
#6  0x0000000000499bde in list_delete_all (l=0x1ef67d8, f=0x438700 <_list_find_job_old>, key=0x55cdfa) at list.c:478
#7  0x000000000044025e in purge_old_job () at job_mgr.c:6306
#8  0x000000000042f4bf in _slurmctld_background (no_data=<value optimized out>) at controller.c:1561
#9  0x0000000000431e5f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:580


From what I have figured out, there were quite a few array jobs in the queue, but I don't have more details than that. The problem seems to be related to the cleanup of array jobs. The core shows that it was at this line of slurmctld/job_mgr.c:

xfree(job_ptr->priority_array);

Do you have any idea why this happened, given the above information, or do you need something more?
The site is running a patched version of 2.6.0-pre3. Do you think this problem is corrected in the newer versions?

Thanks
Yiannis
Comment 1 Moe Jette 2013-05-24 03:15:31 MDT
They are rather bold running v2.6-pre3.

David Bigagli has spent much of the past month testing v2.6 and discovered several memory management errors which might be responsible for this error. The commits are identified below:

https://github.com/SchedMD/slurm/commit/486e0233b71998f9d291fba6c4099ca5a5c11d6f
https://github.com/SchedMD/slurm/commit/ff2ee1b126d4a62fe8fcd77a8d0932af0f3c7546

Either of these bugs could have been responsible for the abort.

We plan to tag v2.6-pre4 or 2.6-rc1 very soon with quite a few bug fixes plus the sensor code.
Comment 2 Yiannis Georgiou 2013-06-12 03:09:50 MDT
Hello,

The Dresden cluster is now running 2.6.0-RC1.

The following array job crashed the controller:

---------------------------------------------
[bull@tauruslogin1]$ cat job.slurm
#!/bin/bash
#SBATCH --time=0:02:00
#SBATCH -J Slurm_20000
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --acctg-freq=0
#SBATCH -p mpi,mpi2
#SBATCH --output=/dev/null
srun --acctg-freq=0 --reservation=bull_49 /bin/sleep 30
-------------------------------------------------

The submission was made with the following command:

[bull@tauruslogin1]$ sbatch --reservation=bull_49 --array=1-2 ./job.slurm

The analysis of the core gave the following result:

-------------------------------------
#0  0x00000034b8c75485 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00000034b8c77e28 in _int_free () from /lib64/libc.so.6
#2  0x0000000000495efc in slurm_xfree (item=0x7f22ec009e40, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>)
    at xmalloc.c:267
#3  0x000000000051cdbb in free_job_resources (job_resrcs_pptr=0x7f22ec001c40) at job_resources.c:417
#4  0x000000000043c3d7 in _list_delete_job (job_entry=<value optimized out>) at job_mgr.c:5631
#5  0x000000000049a6be in list_delete_all (l=0x1a62b98, f=0x438a70 <_list_find_job_old>, key=0x560a67) at list.c:475
#6  0x000000000044062e in purge_old_job () at job_mgr.c:6365
#7  0x000000000042f81f in _slurmctld_background (no_data=<value optimized out>) at controller.c:1562
#8  0x00000000004321cf in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:576
-----------------------------------

It is a very strange error. It is certainly related to array jobs, but I'm wondering whether there is a connection with the reservation or with a particular parameter in slurm.conf.

Do you have any ideas?

Thanks,
Yiannis
Comment 3 Danny Auble 2013-06-12 03:30:32 MDT
Yiannis, 

In the core, could you print out

job_resrcs_ptr from the free_job_resources function,
as well as from the calling function one frame up?

If it crashed on

xfree(job_resrcs_ptr->nodes);

that would make me suspect this value had some memory corruption. I'll look at it here and see what I can find out.
Comment 4 Danny Auble 2013-06-12 04:06:36 MDT
Yiannis, could you also send the reservation definition?
Comment 5 Yiannis Georgiou 2013-06-12 05:32:50 MDT
Here are the details of the reservation... It was exactly the same as this one:

ReservationName=bull_50 StartTime=2013-06-12T18:00:00 EndTime=2013-06-13T08:00:00 Duration=14:00:00
   Nodes=taurusi[1001-1270,3001-3180],taurussmp[1-2] NodeCnt=452 CoreCnt=6544 Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   Users=bull Accounts=(null) Licenses=(null) State=ACTIVE

How can I print the values you ask for? Do I need to add a breakpoint somewhere?
Comment 6 Danny Auble 2013-06-12 07:23:15 MDT
In gdb on the core file you can go to the function by typing 

up

until you get to the function you want to be in, then type

print *job_resrcs_ptr

Send the output of that.  Are you saying you are able to reproduce this issue easily?
Comment 7 Yiannis Georgiou 2013-06-12 07:42:39 MDT
here you are:

 print *job_resrcs_ptr
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, 
  cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 64000, 
  nodes = 0x7f22ec012198 "taurusi3002", ncpus = 1, sock_core_rep_count = 0x7f22ec006238, sockets_per_node = 0x7f22ec006a18}


I can reproduce it every time, actually. There is one more important detail here: you need to run scancel on the job ID of one of the array tasks, and then slurmctld crashes:

scancel 685032_1
Comment 8 Danny Auble 2013-06-12 07:46:31 MDT
It is excellent that you are able to reproduce this.

Could you go up one more frame in the stack and give me the output of *job_ptr and
*job_ptr->job_resrcs.

We will see if we can do the same.
Comment 9 Yiannis Georgiou 2013-06-12 07:59:02 MDT
Here you are:

(gdb) print *job_ptr
$1 = {account = 0x0, alias_list = 0x0, alloc_node = 0x0, alloc_resp_port = 0, alloc_sid = 16198, array_job_id = 643426, array_task_id = 1, 
  assoc_id = 879, assoc_ptr = 0x1aa9178, batch_flag = 1, batch_host = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, 
  cpu_cnt = 0, cr_enabled = 1, db_index = 894740, derived_ec = 15, details = 0x0, direct_set_prio = 0, end_time = 1371002221, exit_code = 0, 
  front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 200026, job_id = 643427, 
  job_next = 0x0, job_resrcs = 0x7f22ec009dd8, job_state = 4, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, 
  limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, 
  mail_type = 0, mail_user = 0x0, magic = 0, name = 0x0, network = 0x0, next_step_id = 1, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, 
  node_bitmap_cg = 0x0, node_cnt = 0, nodes_completing = 0x0, other_port = 0, partition = 0x0, part_ptr_list = 0x0, part_nodes_missing = false, 
  part_ptr = 0x1aeadf8, pre_sus_time = 0, preempt_time = 0, priority = 1, priority_array = 0x0, prio_factors = 0x7f22ec001a78, profile = 0, 
  qos_id = 1, qos_ptr = 0x1a66f58, restart_cnt = 0, resize_time = 0, resv_id = 49, resv_name = 0x0, resv_ptr = 0x1b92ba8, resv_flags = 32832, 
  requid = 2054944, resp_host = 0x0, select_jobinfo = 0x7f22ec002028, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 1371002195, 
  state_desc = 0x0, state_reason = 0, step_list = 0x1ab0e78, suspend_time = 0, time_last_active = 1371002221, time_limit = 2, time_min = 0, 
  tot_sus_time = 0, total_cpus = 12, total_nodes = 1, user_id = 2054944, wait_all_nodes = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, 
  req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}


(gdb) print *job_ptr->job_resrcs
$2 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, 
  cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 64000, 
  nodes = 0x7f22ec012198 "taurusi3002", ncpus = 1, sock_core_rep_count = 0x7f22ec006238, sockets_per_node = 0x7f22ec006a18}
Comment 10 David Bigagli 2013-06-12 08:19:03 MDT
Hi Yiannis, I am trying to reproduce this problem now. Could you please send me
the slurm configuration files? Thanks.

David
Comment 11 Danny Auble 2013-06-12 09:16:34 MDT
Yiannis, could you also try to reproduce this without a reservation? At the moment I am guessing that something in the slurm.conf file is causing this; having the file will most likely shed some light on the subject. I am not able to reproduce this either.
Comment 12 Yiannis Georgiou 2013-06-12 10:24:15 MDT
Hi David,

I'll try tomorrow without the reservation ... here is the slurm.conf file

Yiannis
Comment 13 Danny Auble 2013-06-12 10:25:07 MDT
Did you forget to attach the file?
Comment 14 Yiannis Georgiou 2013-06-12 10:25:45 MDT
Created attachment 284 [details]
slurm config file
Comment 15 Yiannis Georgiou 2013-06-12 10:28:47 MDT
How did you notice that so fast! :)
Comment 16 David Bigagli 2013-06-13 05:53:39 MDT
Hello, we are still trying to reproduce this, also running valgrind to check for any memory errors, but so far no luck. I suspect we are not running the exact sequence of commands that you did. Could you please provide me with the sequence and syntax of the commands as you ran them.

For example:

1) scontrol create reservation=bull_50 nodes=dario,perseo,prometeo,sofia,spartaco users=david flags=IGNORE_JOBS starttime=now endtime=now+3600
2) scontrol show reservation
3) squeue; scontrol show job
4) sbatch --reservation=bull_50 --array=1-2 ./job.slurm
5) squeue; scontrol show job
6) scancel arrayid_elementid
7) squeue; scontrol show job

The squeue and scontrol output will help us see the states the jobs are in.
Perhaps you can increase the runtime in job.slurm so that you have time to
execute these commands.

In addition, another idea, although I am not sure if it is possible, is to start
slurmctld under valgrind control before running the above test:
o) valgrind ./slurmctld -Dvvv
This should tell us if there are any memory errors. However, it will slow
down the system considerably, so it is not a good idea if the system is in production.

Thanks.

David
Comment 17 Yiannis Georgiou 2013-06-14 01:43:04 MDT
Created attachment 286 [details]
valgrind results for array jobs execution
Comment 18 Yiannis Georgiou 2013-06-14 01:48:15 MDT
Hi David,

here is the valgrind log you asked for. Let me know if this is not enough and I need to run with different parameters.
I launched 2 array jobs and cancelled them, and that was the result.
I'm just starting to study it, so let me know if you find something and I'll do the same.

Yiannis
Comment 19 David Bigagli 2013-06-14 04:45:20 MDT
Thanks I am looking at the log now.
Comment 20 David Bigagli 2013-06-14 04:52:54 MDT
Yiannis, please send me the exact sequence of commands you ran.
Comment 21 Yiannis Georgiou 2013-06-14 05:08:59 MDT
this is the sequence I have followed with valgrind:

sinfo
squeue
sbatch --reservation=bull_56 --array=1-2 ./job.slurm
scontrol show job jobid_1
scancel jobid_1
scancel jobid_2
squeue

And actually, yesterday I saw slurmctld hang even without doing the scancel... just an array job finishing makes slurmctld hang.
Comment 22 David Bigagli 2013-06-14 05:23:01 MDT
Yiannis did you also try without the reservation?
Comment 23 Yiannis Georgiou 2013-06-14 05:27:32 MDT
David, no I haven't.
Comment 24 David Bigagli 2013-06-14 06:07:43 MDT
Hello, we found some suspicious code in

src/plugins/priority/multifactor/priority_multifactor.c

where the priority_array gets allocated.
Could you please apply this patch to the file and rebuild. The patch
adds space for the NULL termination of the priority_array.

Let us know how it goes.

Thanks.

david@prometeo /opt/slurm/26/slurm/src/plugins/priority/multifactor>git diff
diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 3e5fbe3..d36840a 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -734,7 +734,7 @@ static uint32_t _get_priority_internal(time_t start_time,
 
                if (!job_ptr->priority_array) {
                        job_ptr->priority_array = xmalloc(sizeof(uint32_t) *
-                                       list_count(job_ptr->part_ptr_list));
+                                                         list_count(job_ptr->part_ptr_list) + 1);
                }
                part_iterator = list_iterator_create(job_ptr->part_ptr_list);
                while ((part_ptr = (struct part_record *)
Comment 25 David Bigagli 2013-06-14 06:17:36 MDT
Created attachment 287 [details]
patch for slurmctld core dump

Sorry, I should have attached the patch instead of cutting and pasting it.
Comment 26 David Bigagli 2013-06-19 11:08:50 MDT
Hi, did you get a chance to try the patch?

 David
Comment 27 Yiannis Georgiou 2013-06-19 12:44:43 MDT
Hi David,

I've just tested it and everything seems to work fine now! Great job!
I'll let you know if there are any issues at all, but for now everything seems fine...

Thanks a lot 
Yiannis
Comment 28 Danny Auble 2013-06-21 08:31:41 MDT
It appears this problem is fixed, please reopen if it shows up again.