Hello, at the Dresden site slurmctld crashed with the following core file and no particular message in the log:

#0  0x0000003234c328a5 in raise () from /lib64/libc.so.6
#1  0x0000003234c34085 in abort () from /lib64/libc.so.6
#2  0x0000003234c6fa37 in __libc_message () from /lib64/libc.so.6
#3  0x0000003234c75366 in malloc_printerr () from /lib64/libc.so.6
#4  0x000000000049535c in slurm_xfree (item=0x7fa2893278e8, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at xmalloc.c:270
#5  0x000000000043c045 in _list_delete_job (job_entry=<value optimized out>) at job_mgr.c:5568
#6  0x0000000000499bde in list_delete_all (l=0x1ef67d8, f=0x438700 <_list_find_job_old>, key=0x55cdfa) at list.c:478
#7  0x000000000044025e in purge_old_job () at job_mgr.c:6306
#8  0x000000000042f4bf in _slurmctld_background (no_data=<value optimized out>) at controller.c:1561
#9  0x0000000000431e5f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:580

From what I have figured out, there were quite a few array jobs in the queue, but I don't have more details than that. The problem seems to be related to the cleanup of array jobs. The core shows that it was on this line of slurmctld/job_mgr.c:

xfree(job_ptr->priority_array);

Do you have any idea why this happened from the above information, or do you need something more? The site is running a patched version of 2.6.0-pre3. Do you think this problem is corrected in the newer versions?

Thanks,
Yiannis
They are rather bold running v2.6-pre3. David Bigagli has spent much of the past month testing v2.6 and discovered several memory management errors which might be responsible for this error. The commits are identified below:

https://github.com/SchedMD/slurm/commit/486e0233b71998f9d291fba6c4099ca5a5c11d6f
https://github.com/SchedMD/slurm/commit/ff2ee1b126d4a62fe8fcd77a8d0932af0f3c7546

Either one of these bugs could have been responsible for the assert. We plan to tag v2.6-pre4 or 2.6-rc1 very soon with quite a few bug fixes plus the sensor code.
Hello, the Dresden cluster is now running 2.6.0-RC1. The following array job crashed the controller:
---------------------------------------------
[bull@tauruslogin1]$ cat job.slurm
#!/bin/bash
#SBATCH --time=0:02:00
#SBATCH -J Slurm_20000
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --acctg-freq=0
#SBATCH -p mpi,mpi2
#SBATCH --output=/dev/null
srun --acctg-freq=0 --reservation=bull_49 /bin/sleep 30
-------------------------------------------------
The submission was made with the following command:

[bull@tauruslogin1]$ sbatch --reservation=bull_49 --array=1-2 ./job.slurm

The analysis of the core gave the following result:
-------------------------------------
#0  0x00000034b8c75485 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00000034b8c77e28 in _int_free () from /lib64/libc.so.6
#2  0x0000000000495efc in slurm_xfree (item=0x7f22ec009e40, file=<value optimized out>, line=<value optimized out>, func=<value optimized out>) at xmalloc.c:267
#3  0x000000000051cdbb in free_job_resources (job_resrcs_pptr=0x7f22ec001c40) at job_resources.c:417
#4  0x000000000043c3d7 in _list_delete_job (job_entry=<value optimized out>) at job_mgr.c:5631
#5  0x000000000049a6be in list_delete_all (l=0x1a62b98, f=0x438a70 <_list_find_job_old>, key=0x560a67) at list.c:475
#6  0x000000000044062e in purge_old_job () at job_mgr.c:6365
#7  0x000000000042f81f in _slurmctld_background (no_data=<value optimized out>) at controller.c:1562
#8  0x00000000004321cf in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:576
-----------------------------------
It is a very strange error. It is certainly related to array jobs, but I'm wondering whether there is a connection with the reservation or with a particular parameter in slurm.conf. Do you have any ideas?

Thanks,
Yiannis
Yiannis,

In the core, could you print out job_resrcs_ptr from the free_job_resources function, as well as from the calling function one level up? If it crashed on

xfree(job_resrcs_ptr->nodes);

that would make me think this value had some memory corruption. I'll look at it here and see what I can find out.
Yiannis, could you also send the reservation definition?
Here are the details of the reservation... It was exactly the same with this one:

ReservationName=bull_50 StartTime=2013-06-12T18:00:00 EndTime=2013-06-13T08:00:00 Duration=14:00:00
   Nodes=taurusi[1001-1270,3001-3180],taurussmp[1-2] NodeCnt=452 CoreCnt=6544 Features=(null) PartitionName=(null)
   Flags=IGNORE_JOBS,SPEC_NODES Users=bull Accounts=(null) Licenses=(null) State=ACTIVE

How can I print the values you asked for? Do I need to add a breakpoint somewhere?
In gdb on the core file you can move to the function by typing "up" until you reach the frame you want to be in, then type:

print *job_resrcs_ptr

Send the output of that. Are you saying you are able to reproduce this issue easily?
Here you are:

print *job_resrcs_ptr
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 64000, nodes = 0x7f22ec012198 "taurusi3002", ncpus = 1, sock_core_rep_count = 0x7f22ec006238, sockets_per_node = 0x7f22ec006a18}

I can actually reproduce it every time. There is one more important detail here: you need to scancel one of the tasks of the array job, and then slurmctld crashes:

scancel 685032_1
It is excellent that you can reproduce it. Could you go up one more in the stack and give me the output of *job_ptr and *job_ptr->job_resrcs? We will see if we can do the same here.
Here you are:

(gdb) print *job_ptr
$1 = {account = 0x0, alias_list = 0x0, alloc_node = 0x0, alloc_resp_port = 0, alloc_sid = 16198, array_job_id = 643426, array_task_id = 1, assoc_id = 879, assoc_ptr = 0x1aa9178, batch_flag = 1, batch_host = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 0, cr_enabled = 1, db_index = 894740, derived_ec = 15, details = 0x0, direct_set_prio = 0, end_time = 1371002221, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 200026, job_id = 643427, job_next = 0x0, job_resrcs = 0x7f22ec009dd8, job_state = 4, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 0, name = 0x0, network = 0x0, next_step_id = 1, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, nodes_completing = 0x0, other_port = 0, partition = 0x0, part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x1aeadf8, pre_sus_time = 0, preempt_time = 0, priority = 1, priority_array = 0x0, prio_factors = 0x7f22ec001a78, profile = 0, qos_id = 1, qos_ptr = 0x1a66f58, restart_cnt = 0, resize_time = 0, resv_id = 49, resv_name = 0x0, resv_ptr = 0x1b92ba8, resv_flags = 32832, requid = 2054944, resp_host = 0x0, select_jobinfo = 0x7f22ec002028, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 1371002195, state_desc = 0x0, state_reason = 0, step_list = 0x1ab0e78, suspend_time = 0, time_last_active = 1371002221, time_limit = 2, time_min = 0, tot_sus_time = 0, total_cpus = 12, total_nodes = 1, user_id = 2054944, wait_all_nodes = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

(gdb) print *job_ptr->job_resrcs
$2 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 64000, nodes = 0x7f22ec012198 "taurusi3002", ncpus = 1, sock_core_rep_count = 0x7f22ec006238, sockets_per_node = 0x7f22ec006a18}
Hi Yiannis,

I am trying to reproduce this problem now. Could you please send me the slurm configuration files?

Thanks,
David
Yiannis, could you also try to reproduce this without a reservation? At the moment I am guessing it is something in the slurm.conf file that is causing this; having it will most likely shed some light on the subject. I am not able to reproduce this either.
Hi David, I'll try tomorrow without the reservation ... here is the slurm.conf file Yiannis
Did you forget to attach the file?
Created attachment 284 [details] slurm config file
How did you notice that so fast! :)
Hello, we are still trying to reproduce this, also running valgrind to check for any memory errors, but so far no luck. I suspect we are not running the exact sequence of commands as you. Could you please provide me with the sequence and syntax of the commands as you ran them? For example:

1) scontrol create reservation=bull_50 nodes=dario,perseo,prometeo,sofia,spartaco users=david flags=IGNORE_JOBS starttime=now endtime=now+3600
2) scontrol show reservation
3) squeue; scontrol show job
4) sbatch --reservation=bull_50 --array=1-2 ./job.slurm
5) squeue; scontrol show job
6) scancel arrayid_elementid
7) squeue; scontrol show job

The squeue and scontrol output will help us to see the states the jobs are in. Perhaps you can increase the runtime in job.slurm so you have time to execute these commands.

Another idea, although I am not sure if it is possible on your system, is to start slurmctld under valgrind control before running the above test:

valgrind ./slurmctld -Dvvv

This should tell us if there are any memory errors. However, this will slow down the system very much, so it is not a good idea if the system is in production.

Thanks,
David
Created attachment 286 [details] valgrind results for array jobs execution
Hi David, here is the valgrind log you asked for. Let me know if this is not enough and I need to run with different parameters. I launched 2 array jobs and cancelled them, and that was the result. I'm just starting to study it, so let me know if you find something and I'll do the same.

Yiannis
Thanks I am looking at the log now.
Yiannis, please send me the exact sequence of commands you ran.
This is the sequence I followed with valgrind:

sinfo
squeue
sbatch --reservation=bull_56 --array=1-2 ./job.slurm
scontrol show job jobid_1
scancel jobid_1
scancel jobid_2
squeue

And actually, yesterday I saw slurmctld hang even without doing the scancel... just an array job finishing makes slurmctld hang.
Yiannis did you also try without the reservation?
David, no I haven't.
Hello, we found some suspicious code in src/plugins/priority/multifactor/priority_multifactor.c where the priority_array gets allocated. Could you please apply this patch to the file and rebuild. The patch adds space for the NULL termination of the priority_array array. Let us know how it goes.

Thanks.

david@prometeo /opt/slurm/26/slurm/src/plugins/priority/multifactor>git diff
diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 3e5fbe3..d36840a 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -734,7 +734,7 @@ static uint32_t _get_priority_internal(time_t start_time,
 	if (!job_ptr->priority_array) {
 		job_ptr->priority_array = xmalloc(sizeof(uint32_t) *
-			list_count(job_ptr->part_ptr_list));
+			list_count(job_ptr->part_ptr_list) + 1);
 	}
 	part_iterator = list_iterator_create(job_ptr->part_ptr_list);
 	while ((part_ptr = (struct part_record *)
Created attachment 287 [details] patch for slurmctld core dump

Sorry, I should have attached the patch instead of cutting and pasting it.
Hi, did you get a chance to try the patch?

David
Hi David, I've just tested it and everything seems to work fine now!! Great job! I'll let you know if there are any issues at all, but for now everything looks fine.

Thanks a lot,
Yiannis
It appears this problem is fixed, please reopen if it shows up again.