Description
Michael Ver Haag
2017-05-03 06:22:05 MDT
Hello Michael,

I haven't seen this problem before 17.02.2. Probably the best thing to start with is for you to pick a recent, problematic job and send it to me with the slurmctld log. I'll trace through the log, follow the job's life-cycle, and see what I can find.

Just to verify: only some jobs are having this problem, correct? What percentage of your jobs are having this problem? Are jobs at least running to completion correctly, just not being purged? Also, if you could attach your slurm.conf file, that would be helpful.

Thanks,
Tim

Created attachment 4488 [details]
slurmconf
Created attachment 4489 [details]
slurmctldlog1
Created attachment 4490 [details]
slurmctldlog2
Created attachment 4491 [details]
slurmctldlog3
Created attachment 4492 [details]
List of unremoved jobs
This is a current snapshot of the jobs sitting in the running queue
squeue -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=t,p,i --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,RESIZING,SPECIAL_EXIT > unremovedjobs
Approximate time stamp: Wed May 3 12:47
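The snapshot above lists jobs in terminal states; the interesting ones are those whose end time is already more than MinJobAge before the snapshot time. A minimal sketch of that filter is below. The sample rows and the MinJobAge value of 300 seconds (the Slurm default) are assumptions for illustration; in practice the rows would be parsed from the squeue output file and the real MinJobAge comes from the site's slurm.conf.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows (job id, state, end time) in the format of the
# squeue snapshot above; real data would be parsed from the squeue output.
SAMPLE_ROWS = [
    ("1544634", "CANCELLED", "2017-05-02T23:03:37"),
    ("1545475", "TIMEOUT",   "2017-05-02T23:52:23"),
    ("1546000", "CANCELLED", "2017-05-03T12:46:00"),  # ended just before the snapshot
]

# MinJobAge defaults to 300 seconds; the actual value is site-specific.
MIN_JOB_AGE = timedelta(seconds=300)

def jobs_past_min_job_age(rows, now):
    """Return the ids of jobs whose end time is more than MinJobAge before 'now'."""
    stuck = []
    for job_id, state, end in rows:
        end_time = datetime.strptime(end, "%Y-%m-%dT%H:%M:%S")
        if now - end_time > MIN_JOB_AGE:
            stuck.append(job_id)
    return stuck

# Snapshot taken around Wed May 3 12:47, per the time stamp above.
now = datetime(2017, 5, 3, 12, 47)
print(jobs_past_min_job_age(SAMPLE_ROWS, now))  # the first two jobs qualify
```

Only the first two sample jobs are reported: the third ended a minute before the snapshot, well inside the 300-second window.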
Jobs seem to be making it to completion, but it looks like jobs with some part getting cancelled don't get purged. It's not all of the jobs that are going through the queue; it's hard to say what percentage of them are getting stuck.

I'm looking at job 1526417 in the log to try to find out why this is happening. So far I don't see a lot of clues other than that it was cancelled by the user. Could you also attach the slurmd log from node compute0253, which is where I believe job 1526417 was running? I want to see if I can find out more about it there. Thanks.

Created attachment 4508 [details]
slurmd compute0253
It may be worth noting that that particular job has moved out of the queue now, but others still persist for longer than MinJobAge.

Still chasing leads here. Could you also send me the output of "scontrol show job" for one of the jobs that have passed their MinJobAge? Thanks.

Here are a couple of them from this morning.

scontrol show job 1544634
JobId=1544634 JobName=callsCGG.sh
UserId=yhe23@jhu.edu(1959) GroupId=abattle4(1206) MCS_label=N/A
Priority=107446 Nice=0 Account=abattle4 QOS=normal
JobState=CANCELLED Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
RunTime=00:07:39 TimeLimit=4-04:00:00 TimeMin=N/A
SubmitTime=2017-05-02T22:55:57 EligibleTime=2017-05-02T22:55:58
StartTime=2017-05-02T22:55:58 EndTime=2017-05-02T23:03:37 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=login-node03:173822
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute0019 BatchHost=compute0019
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=5G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic/callsCGG.sh 7 stochastic_7
WorkDir=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic
StdErr=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic/slurm-1544634.out
StdIn=/dev/null
StdOut=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic/slurm-1544634.out
Power=

scontrol show job 1545475
JobId=1545475 JobName=interact
UserId=aleong3@jhu.edu(2322) GroupId=thufnag1(1108) MCS_label=N/A
Priority=152158 Nice=0 Account=thufnag1 QOS=normal
JobState=TIMEOUT Reason=TimeLimit Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:31:00 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2017-05-02T23:21:23 EligibleTime=2017-05-02T23:21:23
StartTime=2017-05-02T23:21:23 EndTime=2017-05-02T23:52:23 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=login-node02:81506
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute0096 BatchHost=compute0096
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=5G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/data/thufnag1/Matlab_codes
Power=

Michael,

After some investigation, I still can't be certain why jobs are getting stuck with CANCELLED or TIMEOUT status. I suspect there is some important info on the job that isn't surfaced by a simple "scontrol show job". That means I'll need to see the job structure in memory, which can be done with gdb. Using gdb, can you print out the job structure of a job that is currently past its MinJobAge? In case you're not familiar with gdb, here's how:

1) As root, attach gdb to the slurmctld:
gdb attach `pgrep slurmctld`
or just look up the pid for slurmctld:
gdb attach 1242

2) Run the following gdb commands (replace "XXXX" with the job id):
set print pretty on
p *job_hash[XXXX % slurmctld_conf.max_job_cnt]

That should print the entire job structure of the job in memory. To get out, just type "quit". If you could paste the output into this bug, that would be helpful. I appreciate your effort and patience while I continue to investigate this problem.

Thanks,
Tim

Here are two examples of the issue, with the requested information.
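The gdb expression above finds a job by indexing slurmctld's in-memory hash table with the job id modulo the table size. A minimal sketch of that bucket calculation follows; the MAX_JOB_CNT value of 10000 is an assumed example, since the real value comes from slurmctld_conf.max_job_cnt (MaxJobCount in slurm.conf).

```python
# Sketch of the bucket index computed by the gdb expression
# job_hash[XXXX % slurmctld_conf.max_job_cnt].
# MAX_JOB_CNT is an assumed example value, not the site's real setting.
MAX_JOB_CNT = 10000

def job_hash_index(job_id, max_job_cnt=MAX_JOB_CNT):
    # Jobs whose ids differ by a multiple of max_job_cnt land in the same
    # bucket, which is why the modulo appears inside the gdb expression.
    return job_id % max_job_cnt

print(job_hash_index(1600311))  # 311
print(job_hash_index(1601720))  # 1720
```

With a table size of 10000 the two jobs examined below land in buckets 311 and 1720 respectively.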
# scontrol show job 1600311 JobId=1600311 JobName=WHe UserId=qli39@jhu.edu(1224) GroupId=ema4(1030) MCS_label=N/A Priority=51182 Nice=0 Account=ema4 QOS=normal JobState=CANCELLED Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A SubmitTime=2017-05-03T14:15:38 EligibleTime=2017-05-03T14:15:38 StartTime=2017-05-03T14:16:09 EndTime=2017-05-03T14:16:09 Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=parallel AllocNode:Sid=login-node02:95851 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=48,mem=245760,node=2 Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=* MinCPUsNode=24 MinMemoryCPU=5G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null) Command=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/sub.sh WorkDir=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110 StdErr=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1600311.out StdIn=/dev/null StdOut=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1600311.out Power= # gdb attach `pgrep slurmctld` GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... attach: No such file or directory. Attaching to process 147156 Reading symbols from /cm/shared/apps/slurm/17.02.2/sbin/slurmctld...done. Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [New LWP 77498] [New LWP 77421] [New LWP 77413] [New LWP 77348] [New LWP 568] [New LWP 567] [New LWP 566] [New LWP 565] [New LWP 564] [New LWP 147197] [New LWP 147165] [New LWP 147164] [New LWP 147163] [New LWP 147162] [Thread debugging using libthread_db enabled] Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_ldap.so.2 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/crypto_munge.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/crypto_munge.so Reading symbols from /usr/lib64/libmunge.so.2...done. Loaded symbols for /usr/lib64/libmunge.so.2 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/gres_gpu.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/gres_gpu.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/select_cons_res.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/select_cons_res.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/preempt_partition_prio.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/preempt_partition_prio.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/checkpoint_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/checkpoint_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_energy_none.so...done. 
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_energy_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_profile_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_profile_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_infiniband_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_infiniband_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_filesystem_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_filesystem_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/jobacct_gather_linux.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/jobacct_gather_linux.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/ext_sensors_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/ext_sensors_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/switch_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/switch_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/accounting_storage_slurmdbd.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/accounting_storage_slurmdbd.so Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_dns.so.2 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/auth_munge.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/auth_munge.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/topology_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/topology_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/jobcomp_none.so...done. 
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/jobcomp_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/sched_backfill.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/sched_backfill.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/route_default.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/route_default.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/priority_multifactor.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/priority_multifactor.so Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/mcs_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/mcs_none.so 0x00002aaaab19ba7d in nanosleep () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.3.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 (gdb) set print pretty on (gdb) p *job_hash[1600311 % slurmctld_conf.max_job_cnt] $1 = { account = 0xad6b220 "ema4", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0xad6b1f0 "login-node02", alloc_resp_port = 0, alloc_sid = 95851, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 326, assoc_ptr = 0xa18d80, batch_flag = 1, batch_host = 0x0, billable_tres = 4294967294, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0, cr_enabled = 0, db_index = 290578455, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0xad6b020, direct_set_prio = 0, end_time = 1493835369, end_time_exp = 0, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x0, gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 1030, job_id = 1600311, job_next = 0x0, 
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 4, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = { qos = 0, time = 0, tres = 0xad6ac80 }, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0xad6b1d0 "WHe", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 0, nodes_completing = 0x0, other_port = 0, pack_leader = 0, ---Type <return> to continue, or q <return> to quit--- partition = 0xad6b1a0 "parallel", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x2aafa433c600, pelog_env = 0x0, pelog_env_size = 0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 51182, priority_array = 0x0, prio_factors = 0xad6aae0, profile = 0, qos_id = 1, qos_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0xad6b240, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 7936, start_time = 1493835369, state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0xad317f0, suspend_time = 0, time_last_active = 1494065651, time_limit = 30, time_min = 0, tot_sus_time = 0, total_cpus = 0, total_nodes = 2, tres_req_cnt = 0xad6b450, tres_req_str = 0xad6b330 "1=48,2=245760,4=2", tres_fmt_req_str = 0xad6b360 "cpu=48,mem=245760,node=2", tres_alloc_cnt = 0xad6b410, tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 1224, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0 } # scontrol show job 1600311 JobId=1600311 JobName=WHe UserId=qli39@jhu.edu(1224) GroupId=ema4(1030) MCS_label=N/A Priority=51182 Nice=0 Account=ema4 QOS=normal JobState=CANCELLED Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 
ExitCode=0:0 RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A SubmitTime=2017-05-03T14:15:38 EligibleTime=2017-05-03T14:15:38 StartTime=2017-05-03T14:16:09 EndTime=2017-05-03T14:16:09 Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=parallel AllocNode:Sid=login-node02:95851 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=48,mem=245760,node=2 Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=* MinCPUsNode=24 MinMemoryCPU=5G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null) Command=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/sub.sh WorkDir=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110 StdErr=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1600311.out StdIn=/dev/null StdOut=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1600311.out Power= # gdb attach `pgrep slurmctld` GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... attach: No such file or directory. Attaching to process 147156 Reading symbols from /cm/shared/apps/slurm/17.02.2/sbin/slurmctld...done. Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libdl.so.2 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. 
[New LWP 568] [New LWP 567] [New LWP 566] [New LWP 565] [New LWP 564] [New LWP 147197] [New LWP 147165] [New LWP 147164] [New LWP 147163] [New LWP 147162] [Thread debugging using libthread_db enabled] Loaded symbols for /lib64/libpthread.so.0 Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_files.so.2 Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_ldap.so.2 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/crypto_munge.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/crypto_munge.so Reading symbols from /usr/lib64/libmunge.so.2...done. Loaded symbols for /usr/lib64/libmunge.so.2 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/gres_gpu.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/gres_gpu.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/select_cons_res.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/select_cons_res.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/preempt_partition_prio.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/preempt_partition_prio.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/checkpoint_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/checkpoint_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_energy_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_energy_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_profile_none.so...done. 
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_profile_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_infiniband_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_infiniband_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_filesystem_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_filesystem_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/jobacct_gather_linux.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/jobacct_gather_linux.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/ext_sensors_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/ext_sensors_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/switch_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/switch_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/accounting_storage_slurmdbd.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/accounting_storage_slurmdbd.so Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libnss_dns.so.2 Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done. Loaded symbols for /lib64/libresolv.so.2 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/auth_munge.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/auth_munge.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/topology_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/topology_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/jobcomp_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/jobcomp_none.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/sched_backfill.so...done. 
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/sched_backfill.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/route_default.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/route_default.so Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/priority_multifactor.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/priority_multifactor.so Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done. Loaded symbols for /lib64/libm.so.6 Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/mcs_none.so...done. Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/mcs_none.so 0x00002aaaab19ba7d in nanosleep () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.3.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 (gdb) set print pretty on (gdb) p *job_hash[1601720 % slurmctld_conf.max_job_cnt] $1 = { account = 0xb29ed10 "abattle4", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0xb29ece0 "login-node03", alloc_resp_port = 53651, alloc_sid = 188027, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1302, assoc_ptr = 0x9d7450, batch_flag = 0, batch_host = 0xb29ed80 "compute0210", billable_tres = 1, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0, cr_enabled = 0, db_index = 290585115, deadline = 0, delay_boot = 0, derived_ec = 9, details = 0xb29eab0, direct_set_prio = 0, end_time = 1493837480, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0xb29ed40 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0xb29ed60 "", gres_used = 0x0, group_id = 1206, job_id = 1601720, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0xb29ede0, job_state = 6, kill_on_node_fail = 1, 
licenses = 0x0, license_list = 0x0, limit_set = { qos = 0, time = 0, tres = 0xb29e710 }, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0xb29ecb0 "interact", network = 0x0, next_step_id = 1, nodes = 0xb29ec60 "compute0210", node_addr = 0x0, node_bitmap = 0x2ab0081224a0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 0, nodes_completing = 0x0, other_port = 51738, pack_leader = 0, ---Type <return> to continue, or q <return> to quit--- partition = 0xb29ec90 "shared", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x2aafa433c400, pelog_env = 0x0, pelog_env_size = 0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 108250, priority_array = 0x0, prio_factors = 0xb29e530, profile = 0, qos_id = 1, qos_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0xb29ec30 "172.16.5.5", sched_nodes = 0x0, select_jobinfo = 0xb29edb0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 7936, start_time = 1493835596, state_desc = 0x0, state_reason = 24, state_reason_prev = 24, step_list = 0xb2659a0, suspend_time = 0, time_last_active = 1494065652, time_limit = 30, time_min = 0, tot_sus_time = 54, total_cpus = 1, total_nodes = 1, tres_req_cnt = 0xb29f240, tres_req_str = 0xb29f130 "1=1,2=5120,4=1", tres_fmt_req_str = 0xb29f160 "cpu=1,mem=5120,node=1", tres_alloc_cnt = 0xb29f200, tres_alloc_str = 0xb29f0d0 "1=1,2=5120,4=1", tres_fmt_alloc_str = 0xb29f100 "cpu=1,mem=5G,node=1", user_id = 1959, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0 } Created attachment 4536 [details]
Extra logging patch
Thank you for the output you provided. It has been helpful and has surfaced some leads to pursue. I have some theories now as to why this is happening, but I need a little more information to nail down the problem.
First, I'm attaching a logging patch that will give a little more information around the area of the code that removes jobs from the queue. Please apply the patch and turn your logging up to the debug level.
SlurmctldDebug=debug
Second, please run "scontrol show step <job_id>" on one of the stuck jobs and paste the output in this bug. I'm wondering if these stuck jobs still have job steps that haven't finished. Then, using the same job id, connect to slurmctld using gdb:
gdb attach `pgrep slurmctld`
Then run:
set print pretty on
p job_hash[XXXX % slurmctld_conf.max_job_cnt]
Notice this print command doesn't have the leading "*" that the command I sent you before did, but "XXXX" is still where the job id goes. That should return something like:
$2 = (struct job_record *) 0x7fcbd40056f0
Then, using the hex address given by gdb (For example, "0x7fcbd40056f0" is what mine was), run the following command with your hex address:
p *((struct job_record *) 0x7fcbd40056f0)->details
That should print out the job details object. Please paste that into the bug.
Finally, you said that some jobs are eventually leaving the queue. I need to go through the debug logs and find the entries that show a job actually leaving the queue. The extra logging patch I provided should help with that: the debug log message "purging job" should show it taking place. If you can find a job that was stuck but has now been purged with that message, please attach that log to this ticket.
I know this is quite a bit of work for you, so feel free to send me the first few items if you need time to do the last one.
Let me know if you have any further questions. Thanks again for your effort.
Regards.
Tim
I will work on implementing the patch in the morning and see if I can get any debug purge statements.
I tried running "scontrol show step <jobid>" for several jobs but could not get any output from any of them. They all said something like:
# scontrol show step 1800325
Job step 1800325.4294967294 not found
# scontrol show step 1661238
Job step 1661238.4294967294 not found
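The ".4294967294" in the errors above is not a real step id: 4294967294 is the 32-bit sentinel 0xFFFFFFFE (2**32 - 2), which, as far as I can tell from the Slurm headers, is the NO_VAL value used when no step id was supplied. So scontrol is simply reporting that these jobs have no steps left to show. A quick check of the arithmetic:

```python
# 4294967294 decodes to the 32-bit sentinel 0xFFFFFFFE, i.e. 2**32 - 2.
# In the Slurm headers this is the NO_VAL value (my reading of the source),
# meaning "no step id was given", which is why scontrol prints
# "<jobid>.4294967294 not found" for a job with no remaining steps.
sentinel = 4294967294
print(hex(sentinel))          # 0xfffffffe
print(sentinel == 2**32 - 2)  # True
```

The same sentinel shows up as array_task_id and requid in the gdb dumps above, for the same reason: those fields were never set.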
I ran a plain "scontrol show step" and it only showed about 70 lines, all for State=RUNNING jobs, with this as the current load summary:
Running Pending Other Total Users Groups
Total 3397 1910 5766 11073 168 98
scavenger 6 11 343 360 9 5
lrgmem 93 32 821 946 23 20
gpuscav 0 0 2 2 1 1
shared,lr 0 12 0 12 1 1
bw-gpu 12 0 59 71 11 10
memscav 0 0 11 11 1 1
unlimited 12 0 5 17 3 3
bw-parall 8 2 166 176 13 12
gpu 60 46 249 355 28 21
debug 1 0 103 104 32 20
shared 3096 1793 3268 8157 78 59
parallel 109 14 739 862 56 38
I also noticed while doing this that
# squeue -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.30R" --sort=t,p,i --states=FAILED |wc -l
53318
So there seem to be about 50k more jobs in the queue, in addition to the 5900 currently past MinJobAge that I had previously seen.
Created attachment 4538 [details]
Extra logging patch
I decided to add a few more logging statements to the patch to better see exactly why slurmctld won't purge the jobs. In case you've already downloaded it, please download it again.
Thanks
Tim
Created attachment 4539 [details]
Extra logging patch
One minor adjustment made to logging patch.
Sorry for the delay.
I updated us to 17.02.3, since I had to recompile anyway.
The issue persists.
I have seen a few purge messages in the log, but nothing that seems relevant yet.
# squeue -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=e --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,COMPLETING,CONFIGURING,RESIZING,SPECIAL_EXIT |head -n 10
JOBID PARTITION USER STATE TIME_LIMI NOD ACCOUNT PRIORITY START_TIME END_TIME NODELIST(REASON)
1680496 parallel qli39@jhu.edu CANCELLED 1-06:30:00 30 ema4 52897 2017-05-04T23:23:03 2017-05-04T23:23:03
# grep 1680496 slurmctld
[2017-05-15T09:25:14.613] Recovered JobID=1680496 State=0x4 NodeCnt=0 Assoc=326
# scontrol show job 1680496
JobId=1680496 JobName=WHe
UserId=qli39@jhu.edu(1224) GroupId=ema4(1030) MCS_label=N/A
Priority=52897 Nice=0 Account=ema4 QOS=normal
JobState=CANCELLED Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-06:30:00 TimeMin=N/A
SubmitTime=2017-05-04T23:04:36 EligibleTime=2017-05-04T23:04:37
StartTime=2017-05-04T23:23:03 EndTime=2017-05-04T23:23:03 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=parallel AllocNode:Sid=login-node03:166934
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=30 NumCPUs=720 NumTasks=720 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=720,mem=3686400,node=30
Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
MinCPUsNode=24 MinMemoryCPU=5G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/sub.sh
WorkDir=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110
StdErr=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1680496.out
StdIn=/dev/null
StdOut=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1680496.out
Power=
# scontrol show step 1680496
Job step 1680496.4294967294 not found
(gdb) p job_hash[1680496 % slurmctld_conf.max_job_cnt]
$1 = (struct job_record *) 0x2048340
(gdb) p *((struct job_record *) 0x2048340)->details
$2 = {acctg_freq = 0x0, argc = 1, argv = 0x2048990, begin_time = 1493953477, ckpt_dir = 0x2048930 "/var/slurm/checkpoint", contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294,
cpu_freq_gov = 4294967294, cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294,
max_nodes = 30, mc_ptr = 0x2048960, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 720, min_nodes = 30, nice = 2147483648, ntasks_per_node = 24, num_tasks = 720, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 24,
pn_min_memory = 9223372036854780928, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, share_res = 0 '\000', std_err = 0x0,
std_in = 0x2048900 "/dev/null", std_out = 0x0, submit_time = 1493953476, task_dist = 8192, usable_nodes = 0, whole_node = 1 '\001', work_dir = 0x20482c0 "/scratch/groups/ema4/qjli/screw-vac-H/WHe-110"}
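The strange pn_min_memory value in the details dump above decodes cleanly if the top bit is treated as a per-CPU flag. To my understanding of the Slurm source, memory limits with the high bit (MEM_PER_CPU, 0x8000000000000000) set are per-CPU rather than per-node values, in MB; treat the exact flag name as an assumption here. A sketch of the decode:

```python
# pn_min_memory as printed in the gdb dump above.
PN_MIN_MEMORY = 9223372036854780928
# High-bit flag marking a per-CPU (rather than per-node) memory limit;
# 0x8000000000000000 matches Slurm's MEM_PER_CPU as I read the source.
MEM_PER_CPU = 0x8000000000000000

per_cpu = bool(PN_MIN_MEMORY & MEM_PER_CPU)  # flag set: value is per-CPU
mem_mb = PN_MIN_MEMORY & ~MEM_PER_CPU        # strip the flag to get MB

print(per_cpu)  # True
print(mem_mb)   # 5120
```

The decoded 5120 MB per CPU matches the MinMemoryCPU=5G shown by scontrol for the same job, which suggests the details structure itself is intact.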
Created attachment 4553 [details]
Recent slurmctld log
Created attachment 4557 [details]
Extra logging patch 2
I have a working theory as to why this is happening but I still don't have enough information to be 100% positive. The logging you provided with my last patch supported my theory and I've attached a new patch (Extra logging patch 2) with a bit more logging that should verify my theory. If you could apply it, let slurmctld run for a while and then give me the log with the job ID of another stuck job so I can track it down in the log, that would be great.
I believe we are close to discovering the core problem here. Thanks for your effort.
Regards
Tim
# squeue -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=e,t,p,i --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,COMPLETING,CONFIGURING,RESIZING,SPECIAL_EXIT |head -n 10
JOBID PARTITION USER STATE TIME_LIMI NOD ACCOUNT PRIORITY START_TIME END_TIME NODELIST(REASON)
1740411 shared gweber5@jhu.edu CANCELLED 30:00 1 sghosh20 51512 2017-05-06T04:24:54 2017-05-06T04:24:54
1740457 shared gweber5@jhu.edu CANCELLED 30:00 1 sghosh20 51516 2017-05-06T04:26:13 2017-05-06T04:26:13
1740507 parallel gweber5@jhu.edu CANCELLED 30:00 1 sghosh20 51755 2017-05-06T04:51:22 2017-05-06T04:51:22
1744516 parallel asiera CANCELLED 12:00:00 1 jcombar1 1020315 2017-05-06T06:18:18 2017-05-06T06:23:21 compute0273
1749580 shared qli39@jhu.edu CANCELLED 1-06:30:00 25 ema4 52476 2017-05-06T08:47:13 2017-05-06T08:47:13
1747436 shared ppivete1@jhu.edu CANCELLED 10:00:00 1 ppivete1 302446 2017-05-06T07:46:39 2017-05-06T09:00:35 compute0018
1727855 bw-gpu cchen184@jhu.edu CANCELLED 1-00:00:00 2 mulmsch1 51113 2017-05-05T22:49:39 2017-05-06T09:12:05 gpu[062-063]
1750724 parallel cchen184@jhu.edu CANCELLED 4-04:00:00 4 mulmsch1 51246 2017-05-06T09:15:10 2017-05-06T09:15:10
1750935 parallel cchen184@jhu.edu CANCELLED 4-04:00:00 4 mulmsch1 51226 2017-05-06T09:17:11 2017-05-06T09:17:11
Created attachment 4563 [details]
slurmctldlog after second logging patch
Michael, I just wanted to give you a quick update on this bug. Using your last debugging log I was able to confirm my theory about what is happening. Right now, the controller is hitting the job purge timeout of 1 second after purging only a few jobs. Something is causing the controller to take longer than normal to purge each job, so it doesn't get through very many before timing out and moving on.

There could be environmental factors causing this. A few possibilities include running the controller in a VM, a slow or under-performing file system, and/or the batch script/environment removal taking too long. Exactly why any of these would occur is uncertain; you might be able to lend some insight there.

In any case, I'm working on a patch to optimize and speed up the purging of jobs, and I'm considering several different approaches. It may take a little while to determine the best course of action, and it will need extensive testing. In the meantime, you might look into the possibilities above to see if any of them have merit. I'll keep you up to date on how it's going.

Regards
Tim

Thanks for the insight. For reference, the performance of the purge process appeared to change when we implemented 17.02; that was the only parameter that changed before we experienced this issue. Based on your latest note, I made a change in Slurm yesterday. Our Slurm environment runs off a relatively slow NFS share, with the state save on that share. I migrated the StateSaveLocation directory to a local disk (yes, I know this causes an issue with the backup controller). After implementing the change, the purge process seemed better able to catch up; after about 4 hours the system had caught up. I was monitoring the log during this process. As the size of the queue went down, the purging sped up, so it appears there is some loading time involved that should not count against the timeout.
We are under a relatively light compute load at this time, with exams and the end of the semester. The machine slurmctld runs on is a bare-metal 24-core system with 128G of RAM. It seems like the purging should have been able to catch up relatively quickly if disk wait were the primary problem; the fact that it took so long suggests the slow disk was a smaller component that compounded the problem. Thank you for your work on this. I look forward to your changes, to see whether we can roll back to our shared file space.
Created attachment 4650 [details]
consolidated patch to address performance issues with purge_old_job
Hey Michael -
I'm attaching a preliminary version of a patch that addresses this. I believe the main bottleneck in the purge_old_job code is the speed at which the batch script + environment files can be removed from StateSaveLocation; the attached patch defers that work to a separate thread that removes the files at its own pace, letting the rest of the slurmctld processing continue on after it.
It should apply to a stock 17.02.3 source tree - you can remove the other test patches Tim had given you before.
If you do test it out, let me know how it goes. I'll update this bug when we have a final version of these committed, which should happen before 17.02.4 is released.
- Tim
Hey Michael - A version of the patch I'd sent has been committed as 843e5d38a5 + ce2cd1b237 + b9719be2a3, plus a few extra cleanup patches afterwards. Those are functionally equivalent to what I'd provided before, so no need to switch over if that's been working for you thus far. That will all be in 17.02.4 when released, likely next week. Please reopen if you have any further concerns. - Tim |