After updating slurmdbd and slurmctld, slurmctld segfaults on startup. Here is the output of "thread apply all bt":

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d68cdef38 in _job_alloc (core_bitmap=0x2c93b40, node_name=0x28f4790 "nvl-002", job_id=290382, gres_name=0x229ccb0 "gpu", node_offset=0, node_cnt=1, node_gres_data=0x2a2ef20, job_gres_data=0x2b0fa70) at gres.c:5856
5856    gres.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-18.08.1-1.el7.centos.x86_64

(gdb) thread apply all bt

Thread 5 (Thread 0x7f0d65bc7700 (LWP 21578)):
#0  0x00007f0d68560f0d in poll () from /lib64/libc.so.6
#1  0x00007f0d68d2bb76 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f0d65bc6da0) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0x22c04c0) at slurm_persist_conn.c:138
#3  0x00007f0d68d2d039 in slurm_persist_recv_msg (persist_conn=0x22c04c0) at slurm_persist_conn.c:882
#4  0x00007f0d65ed730f in _get_return_code () at slurmdbd_agent.c:148
#5  _agent (x=<optimized out>) at slurmdbd_agent.c:668
#6  0x00007f0d68841e25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0d6856bbad in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0d69233700 (LWP 21572)):
#0  0x00007f0d68845d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000041fd26 in _agent_init (arg=<optimized out>) at agent.c:1377
#2  0x00007f0d68841e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0d6856bbad in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0d654b4700 (LWP 21588)):
#0  0x00007f0d68845d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0d654b8e06 in _my_sleep (usec=30000000) at backfill.c:590
#2  0x00007f0d654bf4f7 in backfill_agent (args=<optimized out>) at backfill.c:926
#3  0x00007f0d68841e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0d6856bbad in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0d65ecc700 (LWP 21575)):
#0  0x00007f0d68845d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0d65ed1e90 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
#2  0x00007f0d68841e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0d6856bbad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f0d69234740 (LWP 21567)):
#0  0x00007f0d68cdef38 in _job_alloc (core_bitmap=0x2c93b40, node_name=0x28f4790 "nvl-002", job_id=290382, gres_name=0x229ccb0 "gpu", node_offset=0, node_cnt=1, node_gres_data=0x2a2ef20, job_gres_data=0x2b0fa70) at gres.c:5856
#1  gres_plugin_job_alloc (job_gres_list=<optimized out>, node_gres_list=node_gres_list@entry=0x282a7d0, node_cnt=1, node_offset=node_offset@entry=0, job_id=290382, node_name=0x28f4790 "nvl-002", core_bitmap=core_bitmap@entry=0x2c93b40) at gres.c:6128
#2  0x00007f0d67c4419c in _add_job_to_res (job_ptr=job_ptr@entry=0x2b0f070, action=action@entry=0) at select_cons_res.c:855
#3  0x00007f0d67c48108 in select_p_select_nodeinfo_set (job_ptr=0x2b0f070) at select_cons_res.c:2697
#4  0x00007f0d68d01061 in select_g_select_nodeinfo_set (job_ptr=job_ptr@entry=0x2b0f070) at node_select.c:857
#5  0x000000000044cef8 in reset_job_bitmaps () at job_mgr.c:10919
#6  0x000000000049a57f in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1326
#7  0x000000000042b2d0 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:663
Hi,

Could you send me the output from these gdb commands?

t 1
f 5
p *job_ptr

Dominik
(gdb) t 1
[Switching to thread 1 (Thread 0x7f0d69234740 (LWP 21567))]
#0  0x00007f0d68cdef38 in _job_alloc (core_bitmap=0x2c93b40, node_name=0x28f4790 "nvl-002", job_id=290382, gres_name=0x229ccb0 "gpu", node_offset=0, node_cnt=1, node_gres_data=0x2a2ef20, job_gres_data=0x2b0fa70) at gres.c:5856
5856    in gres.c
(gdb) f 5
#5  0x000000000044cef8 in reset_job_bitmaps () at job_mgr.c:10919
10919   job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x2b0f6d0 "general", admin_comment = 0x0, alias_list = 0x0,
  alloc_node = 0x2b0f6a0 "dev-intel18", alloc_resp_port = 52491, alloc_sid = 29888, array_job_id = 0,
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 2762, assoc_ptr = 0x2403620, batch_features = 0x0,
  batch_flag = 0, batch_host = 0x2ac7f80 "nvl-002", billable_tres = 8, bit_flags = 32768, burst_buffer = 0x0,
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0,
  comment = 0x2b0f6f0 "stdout=/mnt/home/dunan/slurm-290382.out", cpu_cnt = 8, cpus_per_tres = 0x0, cr_enabled = 0,
  db_index = 2958410, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x2b0f470, direct_set_prio = 0,
  end_time = 1539559241, end_time_exp = 1539559241, epilog_running = false, exit_code = 0, fed_details = 0x0,
  front_end_ptr = 0x0, gids = 0x0, gres_list = 0x2804ec0, gres_alloc = 0x2b0f750 "7696487:1", gres_detail_cnt = 1,
  gres_detail_str = 0x2b114f0, gres_req = 0x2b0f780 "gpu:1", gres_used = 0x0, group_id = 2003, job_id = 290382,
  job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x2b0f7f0, job_state = 1,
  kill_on_node_fail = 1, last_sched_eval = 1538954441, licenses = 0x0, license_list = 0x0,
  limit_set = {qos = 0, time = 0, tres = 0x2b0efe0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0,
  mcs_label = 0x0, name = 0x2b0efc0 "sh", network = 0x0, next_step_id = 1, ngids = 0, nodes = 0x2b0ef50 "nvl-002",
  node_addr = 0x2c85ea0, node_bitmap = 0x2c85bd0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 0,
  nodes_completing = 0x0, origin_cluster = 0x2b11290 "msuhpcc", other_port = 52490, pack_job_id = 0,
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
  partition = 0x2b0ef70 "general-long-18,general-long-14,general-long-16", part_ptr_list = 0x26cf950,
  part_nodes_missing = false, part_ptr = 0x28f5e90, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0,
  preempt_in_progress = false, priority = 3389, priority_array = 0x0, prio_factors = 0x2b0ee90, profile = 0,
  qos_id = 1, qos_ptr = 0x22c17a0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0,
  resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x2b0ef20 "192.168.9.200",
  sched_nodes = 0x0, select_jobinfo = 0x2b0f7c0, spank_job_env = 0x0, spank_job_env_size = 0,
  start_protocol_ver = 8192, start_time = 1538954441, state_desc = 0x0, state_reason = 0, state_reason_prev = 0,
  step_list = 0x26cf9a0, suspend_time = 0, system_comment = 0x0, time_last_active = 1539005708, time_limit = 10080,
  time_min = 0, tot_sus_time = 0, total_cpus = 8, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0,
  tres_per_job = 0x0, tres_per_node = 0x2b0f730 "gpu:1", tres_per_socket = 0x0, tres_per_task = 0x0,
  tres_req_cnt = 0x2b11340, tres_req_str = 0x2b112b0 "1=8,2=6000,4=1,1001=1",
  tres_fmt_req_str = 0x2b11250 "cpu=8,mem=6000M,node=1,gres/gpu=1", tres_alloc_cnt = 0x2b112e0,
  tres_alloc_str = 0x2b113a0 "1=8,2=6000,3=18446744073709551614,4=1,5=8,1001=1",
  tres_fmt_alloc_str = 0x2b113f0 "cpu=8,mem=6000M,node=1,billing=8,gres/gpu=1", user_id = 721898,
  user_name = 0x2b0f680 "dunan", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0,
  req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
Dominik, I reverted our production server back to 17.11.7 for now. I'm going to change this to a severity 2.
Hi,

Good to know that you are "up" now. Could you also send job_gres_ptr?

t 1
f 0
p *job_gres_ptr

Dominik
Hi,

I can recreate this issue. I will let you know when I have fixed it.

Dominik
Hi,

I have a preliminary patch now, but it still needs some review; let me know if you need it. I suggest staying on the 17.11 slurmctld until the fix is officially in the git repo.

Dominik
Dominik, That's good news! We are aiming to be on 18.08.1 by next Monday. Do you think this internal review will be done by then? Thanks, Steve
Hi,

If it weren't a US holiday, it would be done already. Because of Columbus Day it will take a little longer; it should be in the repo tomorrow.

Dominik
Hey Dominik, I'm hoping to do this upgrade later this evening. Will this patch be available today? Thanks, Steve
Comment on attachment 7994 [details]
Patch to create structures outside of if

Steve, please use this patch; it should get you what you need. It will be in 18.08.2, which will probably be tagged in a few weeks (commit 33f8f22d).
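For readers following along: the patch title and the backtrace (a NULL dereference in _job_alloc while reset_job_bitmaps replays saved job state at startup) suggest the familiar pattern where a structure is only created inside one branch of an if and then dereferenced on every path. The fragment below is only a minimal sketch of that pattern under that assumption; the type, field, and function names (job_gres_state_t, gres_cnt_node_alloc, job_alloc_sketch) are hypothetical and are not Slurm's actual gres.c code.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
            uint64_t *gres_cnt_node_alloc;  /* per-node allocated GRES counts (hypothetical name) */
    } job_gres_state_t;

    static void job_alloc_sketch(job_gres_state_t *job, int node_cnt,
                                 int node_offset, uint64_t gres_cnt)
    {
            /*
             * Buggy shape: this array was only created on the branch taken for
             * newly scheduled jobs, so a job recovered from an older release's
             * state file reaches the increment below with a NULL pointer and
             * the controller segfaults on startup.
             *
             * "Creating the structure outside of the if" means making the
             * allocation unconditional, before any path dereferences it:
             */
            if (!job->gres_cnt_node_alloc)
                    job->gres_cnt_node_alloc = calloc(node_cnt, sizeof(uint64_t));

            job->gres_cnt_node_alloc[node_offset] += gres_cnt;
    }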
Danny,

We updated using this patch and slurmctld runs; however, we're seeing a lot of the following errors in the log:

[2018-10-10T21:28:54.125] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.125] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.127] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326800 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.128] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.155] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.155] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.157] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.157] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.160] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.160] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326800 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.177] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
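For context on reading these messages: the two numbers in parentheses appear to be the GRES count the node currently has recorded versus the count the job is trying to release, so "(0 4)" likely means the per-node/per-type counter is already at 0 while the deallocation wants to subtract 4. The fragment below is only a minimal sketch of that kind of clamp-and-log guard, assuming that interpretation; the names (gres_dealloc_sketch, alloc_cnt, release_cnt) are hypothetical and this is not Slurm's actual gres.c logic.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static void gres_dealloc_sketch(uint64_t *alloc_cnt, uint64_t release_cnt,
                                    uint32_t job_id, const char *node,
                                    const char *type)
    {
            if (*alloc_cnt >= release_cnt) {
                    *alloc_cnt -= release_cnt;      /* normal deallocation */
            } else {
                    /* Counter holds less than the job is releasing: log the
                     * "(have want)" pair and clamp instead of wrapping around. */
                    fprintf(stderr,
                            "error: gres/gpu: job %" PRIu32 " dealloc node %s type %s "
                            "gres count underflow (%" PRIu64 " %" PRIu64 ")\n",
                            job_id, node, type, *alloc_cnt, release_cnt);
                    *alloc_cnt = 0;
            }
    }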
Hi,

We expect this is only transitional and it should go away once all 17.11 jobs are gone. I will let you know when I find out exactly what caused this problem.

Dominik
Hi,

Could you send me the full slurmctld.log and the output from "scontrol show job <one of the affected job IDs, e.g. 326800>", if such jobs still exist in the system?

Dominik
# scontrol show job 326800
JobId=326800 JobName=4zh9_long_200ns_MD
   UserId=foouser(*) GroupId=hpccclass(*) MCS_label=N/A
   Priority=20007 Nice=0 Account=general QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-14:46:46 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-10-10T18:39:43 EligibleTime=2018-10-10T18:39:43
   AccrueTime=Unknown
   StartTime=2018-10-10T18:39:43 EndTime=2018-10-17T18:39:43 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2018-10-10T18:39:43
   Partition=general-long-18 AllocNode:Sid=lac-249:2131
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nvl-006 BatchHost=nvl-006
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=12G,node=1,billing=1,gres/gpu=4
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=12G MinTmpDiskNode=0
   Features=intel16|intel18 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb
   WorkDir=/mnt/home/foouser/MD/CRBP/longRun
   Comment=stdout=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.o%A
   StdErr=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.e326800
   StdIn=/dev/null
   StdOut=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.o326800
   Power=
   TresPerNode=gpu:4
Hi,

Does this error still occur? I am still working on this, but I can't reproduce it.

Dominik
Hi,

Can we drop the severity to 3 now, as the patch is already in the git repo?

Dominik
Dominik,

This is fine to drop to sev 3. I am still seeing this error in the logs. I'm now seeing it for jobs submitted after the update. For example, job 672084:

[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 gres count underflow
[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 topo gres count underflow (3 4)
[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 type k80 gres count underflow (3 4)

JobId=672084 JobName=bash
   UserId=foouser(*) GroupId=group(*) MCS_label=N/A
   Priority=24787 Nice=0 Account=general QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=130:0
   RunTime=00:14:28 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2018-10-25T12:09:16 EligibleTime=2018-10-25T12:09:16
   StartTime=2018-10-25T12:09:16 EndTime=2018-10-25T12:23:44 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2018-10-25T12:09:16
   Partition=general-long-16 AllocNode:Sid=dev-intel16-k80:18489
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=lac-196 BatchHost=lac-196
   NumNodes=1 NumCPUs=8 NumTasks=4 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=2G,node=1,billing=8,gres/gpu=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
   Features=gpgpu DelayBoot=00:00:00
   Gres=gpu:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/mnt/home/foouser/Documents/ShearWaveImaging/CUDA_GFS_ForHPCC_Release
   Comment=stdout=/mnt/home/foouser/Documents/ShearWaveImaging/CUDA_GFS_ForHPCC_Release/slurm-672084.out
   Power=

Thanks,
Steve
Created attachment 8105 [details] Slurmctld log
Hi,

It seems that you can reproduce this easily, whereas I still can't. Could you generate a slurmctld.log with the SelectType DebugFlag enabled? E.g.:

scontrol setdebugflags +SelectType
<submit job>
<wait for log>
scontrol setdebugflags -SelectType

Could you also send me your slurm.conf and the exact command you used to submit this job?

Dominik
Dominik,

I have not been able to submit a job that produces these errors; I am only noticing these jobs after the fact. I'll attach the slurm log with the SelectType flag set. Perhaps the information from that flag for the jobs that are already throwing this error will be helpful.

Best,
Steve
Created attachment 8202 [details] Slurmctld log +SelectType, compressed with xz
Hi,

It looks like the logic in the select plugin is working well, so the problem must be somewhere else. Could you generate a slurmctld.log with the TraceJobs DebugFlag enabled?

Dominik
Created attachment 9233 [details]
Slurmctld log +TraceJobs

Dominik,

Here is the log output with TraceJobs enabled.

Best,
Steve
Hi,

These two commits should fix the "gres count underflow" issue:

https://github.com/SchedMD/slurm/commit/6f8cd92e1091e
https://github.com/SchedMD/slurm/commit/69d78159c3305

Both patches will be included in 18.08.6. In a separate bug, we are handling a wrongly set debug level that produces a false-positive message like this: "select_nodes: calling _get_req_features() for JobId=10129368 with not NULL job resources".

Let me know if we can close this bug now.

Dominik
Hi,

I will close this ticket as resolved/fixed and set "Version Fixed" to 18.08.2 (the version where the original bug was fixed). Please open a new ticket if you notice more "count underflow" messages on 18.08.6 or later.

Dominik