Summary: | Slurmctld segfault after update from 17.11.7 to 18.08.1 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Steve Ford <fordste5> |
Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | bart |
Version: | 18.08.1 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7826 | ||
Site: | MSU | Version Fixed: | 18.08.2
Attachments: |
Patch to create structures outside of if
Slurmctld log
Slurmctld log +SelectType, compressed with xz
Slurmctld log +TraceJobs
Description
Steve Ford
2018-10-08 07:42:21 MDT
Hi,

Could you send me the output from these gdb commands?

t 1
f 5
p *job_ptr

Dominik

(gdb) t 1
[Switching to thread 1 (Thread 0x7f0d69234740 (LWP 21567))]
#0  0x00007f0d68cdef38 in _job_alloc (core_bitmap=0x2c93b40, node_name=0x28f4790 "nvl-002", job_id=290382, gres_name=0x229ccb0 "gpu", node_offset=0, node_cnt=1, node_gres_data=0x2a2ef20, job_gres_data=0x2b0fa70) at gres.c:5856
5856    in gres.c
(gdb) f 5
#5  0x000000000044cef8 in reset_job_bitmaps () at job_mgr.c:10919
10919   job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x2b0f6d0 "general", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x2b0f6a0 "dev-intel18", alloc_resp_port = 52491, alloc_sid = 29888, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 2762, assoc_ptr = 0x2403620, batch_features = 0x0, batch_flag = 0, batch_host = 0x2ac7f80 "nvl-002", billable_tres = 8, bit_flags = 32768, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x2b0f6f0 "stdout=/mnt/home/dunan/slurm-290382.out", cpu_cnt = 8, cpus_per_tres = 0x0, cr_enabled = 0, db_index = 2958410, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x2b0f470, direct_set_prio = 0, end_time = 1539559241, end_time_exp = 1539559241, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres_list = 0x2804ec0, gres_alloc = 0x2b0f750 "7696487:1", gres_detail_cnt = 1, gres_detail_str = 0x2b114f0, gres_req = 0x2b0f780 "gpu:1", gres_used = 0x0, group_id = 2003, job_id = 290382, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x2b0f7f0, job_state = 1, kill_on_node_fail = 1, last_sched_eval = 1538954441, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x2b0efe0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x2b0efc0 "sh", network = 0x0, next_step_id = 1, ngids = 0, nodes = 0x2b0ef50 "nvl-002", node_addr = 0x2c85ea0, node_bitmap = 0x2c85bd0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 0, nodes_completing = 0x0, origin_cluster = 0x2b11290 "msuhpcc", other_port = 52490, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x2b0ef70 "general-long-18,general-long-14,general-long-16", part_ptr_list = 0x26cf950, part_nodes_missing = false, part_ptr = 0x28f5e90, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 3389, priority_array = 0x0, prio_factors = 0x2b0ee90, profile = 0, qos_id = 1, qos_ptr = 0x22c17a0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x2b0ef20 "192.168.9.200", sched_nodes = 0x0, select_jobinfo = 0x2b0f7c0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1538954441, state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x26cf9a0, suspend_time = 0, system_comment = 0x0, time_last_active = 1539005708, time_limit = 10080, time_min = 0, tot_sus_time = 0, total_cpus = 8, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x2b0f730 "gpu:1", tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x2b11340, tres_req_str = 0x2b112b0 "1=8,2=6000,4=1,1001=1", tres_fmt_req_str = 0x2b11250 "cpu=8,mem=6000M,node=1,gres/gpu=1", tres_alloc_cnt = 0x2b112e0, tres_alloc_str = 0x2b113a0 "1=8,2=6000,3=18446744073709551614,4=1,5=8,1001=1", tres_fmt_alloc_str = 0x2b113f0 "cpu=8,mem=6000M,node=1,billing=8,gres/gpu=1", user_id = 721898, user_name = 0x2b0f680 "dunan", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

Dominik,

I reverted our production server back to 17.11.7 for now. I'm going to change this to a severity 2.

Hi,

Good to know that you are "up" now. Could you also send job_gres_ptr?

t 1
f 0
p *job_gres_ptr

Dominik

Hi,

I can recreate this issue. I will let you know when I have a fix.

Dominik

Hi,

I have a preliminary patch now, but it still needs some review; let me know if you need it. I suggest staying on the 17.11 slurmctld until it is officially in the git repo.

Dominik

Dominik,

That's good news! We are aiming to be on 18.08.1 by next Monday. Do you think this internal review will be done by then?

Thanks,
Steve

Hi,

If it weren't a US holiday, it would be done already. Because of Columbus Day it will take a little longer; it should be in the repo tomorrow.

Dominik

Hey Dominik,

I'm hoping to do this upgrade later this evening. Will this patch be available today?

Thanks,
Steve

Comment on attachment 7994 [details]
Patch to create structures outside of if
Steve, please use this patch; it should get you what you need. It will be in 18.08.2, which will probably be tagged in a few weeks (commit 33f8f22d).
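As context for the attachment title above, the following is a minimal, hypothetical C sketch of the "create structures outside of if" pattern; the type and function names are invented for illustration and this is not the attached patch. The idea is that the bookkeeping structure is allocated unconditionally, so a later code path such as the reset/dealloc logic seen in the backtrace can never dereference a pointer that only existed when the guarding condition held.

```c
/*
 * Hypothetical sketch only -- NOT the attached SchedMD patch. It shows the
 * general "create structures outside of the if" pattern: allocate the
 * bookkeeping structure unconditionally so that later code never
 * dereferences a pointer that was only created inside a conditional branch.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Invented, simplified stand-in for a per-job GRES state record. */
typedef struct {
	int node_cnt;
	uint64_t *gres_cnt_node_alloc;	/* per-node allocation counts */
} job_gres_state_t;

/* Buggy shape: the per-node array is only built when a bitmap is present,
 * so any caller that later assumes it exists crashes on a NULL deref. */
static job_gres_state_t *build_state_buggy(int node_cnt, int have_bitmap)
{
	job_gres_state_t *st = calloc(1, sizeof(*st));
	st->node_cnt = node_cnt;
	if (have_bitmap)
		st->gres_cnt_node_alloc = calloc(node_cnt, sizeof(uint64_t));
	return st;	/* gres_cnt_node_alloc may still be NULL here */
}

/* Fixed shape: the structure is fully created outside of the if; the
 * conditional only fills in optional detail. */
static job_gres_state_t *build_state_fixed(int node_cnt, int have_bitmap)
{
	job_gres_state_t *st = calloc(1, sizeof(*st));
	st->node_cnt = node_cnt;
	st->gres_cnt_node_alloc = calloc(node_cnt, sizeof(uint64_t));
	if (have_bitmap)
		st->gres_cnt_node_alloc[0] = 1;	/* e.g. derive from bitmap */
	return st;
}

int main(void)
{
	job_gres_state_t *bad = build_state_buggy(1, 0);
	if (!bad->gres_cnt_node_alloc)
		puts("buggy build: per-node array is NULL; a later deref would segfault");
	free(bad->gres_cnt_node_alloc);
	free(bad);

	job_gres_state_t *ok = build_state_fixed(1, 0);
	printf("fixed build: node 0 alloc = %llu\n",
	       (unsigned long long)ok->gres_cnt_node_alloc[0]);
	free(ok->gres_cnt_node_alloc);
	free(ok);
	return 0;
}
```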
Danny,

We updated using this patch and slurmctld runs; however, we're seeing a lot of the following errors in the log:

[2018-10-10T21:28:54.125] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.125] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.127] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326800 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.128] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.155] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.155] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.157] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.157] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.160] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.160] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326800 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.177] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)

Hi,

We expect this is only a transitional thing and it should go away after all 17.11 jobs are gone. I will let you know when I find out exactly what caused this problem.

Dominik

Hi,

Could you send me the full slurmctld.log and the output from "scontrol show job <one of the affected job_ids, e.g. 326800>", if such jobs still exist in the system?

Dominik
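For readers parsing the messages above, here is a small, hypothetical C sketch of the kind of defensive check that emits a "gres count underflow" error; the function and variable names are invented and this is not the Slurm source. It assumes the two numbers in parentheses are the counter's current value and the amount the job is trying to release; when the node-side counter is already smaller (for example, because of stale pre-upgrade job state), the code logs both values and clamps to zero instead of letting the unsigned counter wrap.

```c
/*
 * Hypothetical sketch only -- NOT the Slurm source. It illustrates a
 * defensive dealloc-side check of the kind that a "gres count underflow"
 * error implies: before subtracting a job's GRES count from a node
 * counter, verify the counter is large enough; if it is not, log the two
 * values and clamp to zero rather than wrapping the unsigned counter.
 */
#include <stdio.h>
#include <stdint.h>

static void dealloc_gres(const char *gres_name, uint32_t job_id,
			 const char *node_name,
			 uint64_t *node_alloc_cnt, uint64_t job_cnt)
{
	if (*node_alloc_cnt < job_cnt) {
		/* Assumption: "(x y)" in the real message is
		 * (current counter value, amount being released). */
		fprintf(stderr,
			"error: gres/%s: job %u dealloc node %s "
			"gres count underflow (%llu %llu)\n",
			gres_name, job_id, node_name,
			(unsigned long long)*node_alloc_cnt,
			(unsigned long long)job_cnt);
		*node_alloc_cnt = 0;	/* clamp rather than wrap */
	} else {
		*node_alloc_cnt -= job_cnt;
	}
}

int main(void)
{
	/* Node-side counter says 0 GPUs allocated, but the (stale) job
	 * record still claims 4 -- reproduces the "(0 4)" shape above. */
	uint64_t node_alloc = 0;
	dealloc_gres("gpu", 326801, "nvl-006", &node_alloc, 4);
	return 0;
}
```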
# scontrol show job 326800
JobId=326800 JobName=4zh9_long_200ns_MD
   UserId=foouser(*) GroupId=hpccclass(*) MCS_label=N/A
   Priority=20007 Nice=0 Account=general QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-14:46:46 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-10-10T18:39:43 EligibleTime=2018-10-10T18:39:43
   AccrueTime=Unknown
   StartTime=2018-10-10T18:39:43 EndTime=2018-10-17T18:39:43 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-10T18:39:43
   Partition=general-long-18 AllocNode:Sid=lac-249:2131
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nvl-006
   BatchHost=nvl-006
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=12G,node=1,billing=1,gres/gpu=4
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=12G MinTmpDiskNode=0
   Features=intel16|intel18 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb
   WorkDir=/mnt/home/foouser/MD/CRBP/longRun
   Comment=stdout=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.o%A
   StdErr=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.e326800
   StdIn=/dev/null
   StdOut=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.o326800
   Power=
   TresPerNode=gpu:4

Hi,

Does this error still occur? I am still working on this, but I can't reproduce it.

Dominik

Hi,

Can we drop the severity to 3 now, as the patch is already in the git repo?

Dominik

Dominik,

This is fine to drop to sev 3. I am still seeing this error in the logs, and I'm now seeing it for jobs submitted after the update. For example, job 672084:

[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 gres count underflow
[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 topo gres count underflow (3 4)
[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 type k80 gres count underflow (3 4)

JobId=672084 JobName=bash
   UserId=foouser(*) GroupId=group(*) MCS_label=N/A
   Priority=24787 Nice=0 Account=general QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=130:0
   RunTime=00:14:28 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2018-10-25T12:09:16 EligibleTime=2018-10-25T12:09:16
   StartTime=2018-10-25T12:09:16 EndTime=2018-10-25T12:23:44 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-25T12:09:16
   Partition=general-long-16 AllocNode:Sid=dev-intel16-k80:18489
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=lac-196
   BatchHost=lac-196
   NumNodes=1 NumCPUs=8 NumTasks=4 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=2G,node=1,billing=8,gres/gpu=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
   Features=gpgpu DelayBoot=00:00:00
   Gres=gpu:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/mnt/home/foouser/Documents/ShearWaveImaging/CUDA_GFS_ForHPCC_Release
   Comment=stdout=/mnt/home/foouser/Documents/ShearWaveImaging/CUDA_GFS_ForHPCC_Release/slurm-672084.out
   Power=

Thanks,
Steve

Created attachment 8105 [details]
Slurmctld log
Hi,

It seems that you can reproduce this easily, while I still can't. Could you generate slurmctld.log with the SelectType DebugFlag enabled? e.g.:

scontrol setdebugflags +SelectType
(submit the job and wait for the log)
scontrol setdebugflags -SelectType

Could you also send me your slurm.conf and the exact command you used to submit this job?

Dominik

Dominik,

I have not been able to submit a job that produces these errors; I am only noticing these jobs after the fact. I'll attach the slurm log with the SelectType flag set. Perhaps the information from that flag for the jobs that are already throwing this error will be helpful.

Best,
Steve

Created attachment 8202 [details]
Slurmctld log +SelectType, compressed with xz
Hi,

It looks like the logic in the select plugin is working well; the problem must be somewhere else. Could you generate slurmctld.log with the TraceJobs DebugFlag enabled?

Dominik

Created attachment 9233 [details]
Slurmctld log +TraceJobs
Dominik,
Here is the log output with TraceJobs enabled.
Best,
Steve
Hi,

These two commits should fix the "gres count underflow" issue:

https://github.com/SchedMD/slurm/commit/6f8cd92e1091e
https://github.com/SchedMD/slurm/commit/69d78159c3305

Both patches will be included in 18.08.6. In a separate bug we are handling a wrongly set debug level, which produces false-positive messages like this: "select_nodes: calling _get_req_features() for JobId=10129368 with not NULL job resources". Let me know if we can close this bug now.

Dominik

Hi,

I will close this ticket as resolved/fixed and set "Version Fixed" to 18.08.2 (the version where the original bug was fixed). Please open a new ticket if you notice more "count underflow" messages on versions later than 18.08.6.

Dominik