Ticket 3763

Summary: Cancelled jobs staying in queue beyond MinJobAge
Product: Slurm
Reporter: Michael Ver Haag <verhaag>
Component: slurmctld
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 17.02.2
Hardware: Linux
OS: Linux
Site: Johns Hopkins University
Version Fixed: 17.02.4, 17.11-pre0
Attachments: slurmconf
slurmctldlog1
slurmctldlog2
slurmctldlog3
List of unremoved jobs
slurmd compute0253
Extra logging patch
Extra logging patch
Extra logging patch
Recent slurmctld log
Extra logging patch 2
slurmctldlog after second logging patch
consolidated patch to address performance issues with purge_old_job

Description Michael Ver Haag 2017-05-03 06:22:05 MDT
Many of our jobs seem to be stuck in the running queue and are not obeying MinJobAge, even after a reconfigure and a slurmctld restart.  It is occurring across all of our partitions.  Every stuck job seems to be in state CANCELLED or TIMEOUT.

[root@mgmt1 ~]# date
Wed May  3 08:14:01 EDT 2017
[root@mgmt1 ~]# scontrol show config |grep Min
MinJobAge               = 301 sec
[root@mgmt1 ~]# squeue  -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=t,p,i --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,COMPLETING,CONFIGURING,RESIZING,SPECIAL_EXIT |grep -v "2017-05-03" |wc -l
261
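For reference, slurmctld keeps a finished job's record in memory until the job is at least MinJobAge seconds past its end time, and only then purges it.  A minimal sketch of that eligibility test (illustrative names and types, not Slurm's actual internals):

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative stand-ins -- not Slurm's real types or purge code. */
#define MIN_JOB_AGE 301   /* seconds, per "scontrol show config" above */

struct job_rec {
    unsigned job_id;
    time_t   end_time;     /* when the job reached a terminal state */
    bool     completing;   /* still tearing down on its compute nodes */
};

/* A finished job may be purged once it is MIN_JOB_AGE seconds past
 * end_time and is no longer in a COMPLETING phase. */
bool purge_eligible(const struct job_rec *job, time_t now)
{
    if (job->completing)
        return false;
    return (now - job->end_time) >= MIN_JOB_AGE;
}
```

By that rule, every job counted by the squeue command above (all ended before 2017-05-03) should have been purged well before this snapshot was taken.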
Comment 2 Tim Shaw 2017-05-03 10:02:44 MDT
Hello Michael,

I haven't seen this problem before 17.02.2.  Probably the best thing to start with is for you to pick a recent, problematic job and send it to me with the slurmctld log.  I'll trace through the log and follow the job's life-cycle and see what I can find.

Just to verify, only some jobs are having this problem, correct?  What percentage of your jobs are having this problem?  Are jobs at least running to completion correctly, just not being purged?

Also, if you could attach your slurm.conf file, that would be helpful.

Thanks

Tim
Comment 3 Michael Ver Haag 2017-05-03 10:42:16 MDT
Created attachment 4488 [details]
slurmconf
Comment 4 Michael Ver Haag 2017-05-03 10:43:02 MDT
Created attachment 4489 [details]
slurmctldlog1
Comment 5 Michael Ver Haag 2017-05-03 10:43:24 MDT
Created attachment 4490 [details]
slurmctldlog2
Comment 6 Michael Ver Haag 2017-05-03 10:43:47 MDT
Created attachment 4491 [details]
slurmctldlog3
Comment 7 Michael Ver Haag 2017-05-03 10:49:12 MDT
Created attachment 4492 [details]
List of unremoved jobs

This is a current snapshot of the jobs sitting in the running queue
squeue  -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=t,p,i --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,RESIZING,SPECIAL_EXIT > unremovedjobs

approximate time stamp
Wed May  3 12:47
Comment 8 Michael Ver Haag 2017-05-03 10:53:12 MDT
Jobs seem to be making it to completion, but jobs where some part gets cancelled don't appear to get purged.  It's not all of the jobs going through the queue, and it's hard to say what percentage of them are getting stuck.
Comment 9 Tim Shaw 2017-05-04 10:49:45 MDT
I'm looking at job 1526417 in the log to try to find out why this is happening.  So far, I don't see a lot of clues other than it was canceled by the user.  Could you also attach the slurmd log from node compute0253 which is where I believe job 1526417 was running?  I want to see if I can find out more about it there.

Thanks.
Comment 10 Michael Ver Haag 2017-05-04 10:56:42 MDT
Created attachment 4508 [details]
slurmd compute0253
Comment 11 Michael Ver Haag 2017-05-04 11:00:38 MDT
It may be worth noting that that particular job has now moved out of the queue, but others still persist for longer than MinJobAge.
Comment 12 Tim Shaw 2017-05-04 11:48:45 MDT
Still chasing leads here.  Could you also send me the output of "scontrol show job" for one of the jobs that have passed their MinJobAge?

Thanks.
Comment 13 Michael Ver Haag 2017-05-05 04:56:19 MDT
Here are a couple of them from this morning.

scontrol show job 1544634
JobId=1544634 JobName=callsCGG.sh
   UserId=yhe23@jhu.edu(1959) GroupId=abattle4(1206) MCS_label=N/A
   Priority=107446 Nice=0 Account=abattle4 QOS=normal
   JobState=CANCELLED Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
   RunTime=00:07:39 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2017-05-02T22:55:57 EligibleTime=2017-05-02T22:55:58
   StartTime=2017-05-02T22:55:58 EndTime=2017-05-02T23:03:37 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=shared AllocNode:Sid=login-node03:173822
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute0019
   BatchHost=compute0019
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=5G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic/callsCGG.sh 7 stochastic_7
   WorkDir=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic
   StdErr=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic/slurm-1544634.out
   StdIn=/dev/null
   StdOut=/scratch/groups/abattle4/yuan/code/sCGG/genome_wide_stochastic/slurm-1544634.out
   Power=


scontrol show job 1545475
JobId=1545475 JobName=interact
   UserId=aleong3@jhu.edu(2322) GroupId=thufnag1(1108) MCS_label=N/A
   Priority=152158 Nice=0 Account=thufnag1 QOS=normal
   JobState=TIMEOUT Reason=TimeLimit Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:31:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2017-05-02T23:21:23 EligibleTime=2017-05-02T23:21:23
   StartTime=2017-05-02T23:21:23 EndTime=2017-05-02T23:52:23 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=shared AllocNode:Sid=login-node02:81506
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute0096
   BatchHost=compute0096
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=5G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/data/thufnag1/Matlab_codes
   Power=
Comment 14 Tim Shaw 2017-05-08 12:03:37 MDT
Michael,

After some investigation, I still can't be certain why jobs are getting stuck in CANCELLED or TIMEOUT status.  I'm thinking there may be some important info on the job that isn't surfaced by a simple "scontrol show job".  That means I'll need to see the job structure in memory, which can be done with gdb.  Using gdb, can you print out the job structure of a job that is currently past its MinJobAge?  In case you're not familiar with gdb, here's how:

1) As root, attach gdb to the slurmctld:

gdb attach `pgrep slurmctld`

or just look up the pid for slurmctld:

gdb attach 1242

2) Run the following gdb commands (replace "XXXX" with the job id):

set print pretty on
p *job_hash[XXXX % slurmctld_conf.max_job_cnt]

That should print the entire job structure of the job in memory.
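A note on why that expression works: slurmctld buckets job records by `job_id % max_job_cnt`, and records that collide in a bucket are chained through their job_next pointer, so if the printed record shows a different JobId you can follow job_next in gdb to reach the one you want.  A sketch of that walk, using a simplified stand-in type rather than Slurm's actual job_record:

```c
#include <stddef.h>

/* Simplified stand-in for slurmctld's job_record; only the two
 * fields relevant to the hash walk are kept. */
struct job_record {
    unsigned job_id;
    struct job_record *job_next;   /* next record in the same bucket */
};

/* Index the bucket by job_id % max_job_cnt (as the gdb expression
 * does), then follow the job_next chain until the ids match or the
 * chain ends. */
struct job_record *find_job(struct job_record **job_hash,
                            unsigned max_job_cnt, unsigned job_id)
{
    struct job_record *rec = job_hash[job_id % max_job_cnt];
    while (rec && rec->job_id != job_id)
        rec = rec->job_next;
    return rec;
}
```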

To get out, just type "quit".

If you could paste the output into this bug, that would be helpful.

I appreciate your effort and patience while I continue to investigate this problem.

Thanks.

Tim
Comment 15 Michael Ver Haag 2017-05-09 06:06:28 MDT
Here are two examples of the issue, with the requested information.

# scontrol show job  1600311
JobId=1600311 JobName=WHe
   UserId=qli39@jhu.edu(1224) GroupId=ema4(1030) MCS_label=N/A
   Priority=51182 Nice=0 Account=ema4 QOS=normal
   JobState=CANCELLED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2017-05-03T14:15:38 EligibleTime=2017-05-03T14:15:38
   StartTime=2017-05-03T14:16:09 EndTime=2017-05-03T14:16:09 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=parallel AllocNode:Sid=login-node02:95851
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=245760,node=2
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/sub.sh
   WorkDir=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110
   StdErr=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1600311.out
   StdIn=/dev/null
   StdOut=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1600311.out
   Power=
# gdb attach `pgrep slurmctld`
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 147156
Reading symbols from /cm/shared/apps/slurm/17.02.2/sbin/slurmctld...done.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 77498]
[New LWP 77421]
[New LWP 77413]
[New LWP 77348]
[New LWP 568]
[New LWP 567]
[New LWP 566]
[New LWP 565]
[New LWP 564]
[New LWP 147197]
[New LWP 147165]
[New LWP 147164]
[New LWP 147163]
[New LWP 147162]
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_ldap.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_ldap.so.2
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/crypto_munge.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/crypto_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/gres_gpu.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/gres_gpu.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/select_cons_res.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/select_cons_res.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/preempt_partition_prio.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/preempt_partition_prio.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/checkpoint_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_energy_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_profile_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_infiniband_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/acct_gather_filesystem_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/jobacct_gather_linux.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/ext_sensors_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/ext_sensors_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/switch_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/switch_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/accounting_storage_slurmdbd.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/accounting_storage_slurmdbd.so
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/auth_munge.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/auth_munge.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/topology_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/topology_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/jobcomp_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/jobcomp_none.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/sched_backfill.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/sched_backfill.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/route_default.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/route_default.so
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/priority_multifactor.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/priority_multifactor.so
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /cm/shared/apps/slurm/17.02.2/lib/slurm/mcs_none.so...done.
Loaded symbols for /cm/shared/apps/slurm/17.02.2/lib/slurm/mcs_none.so
0x00002aaaab19ba7d in nanosleep () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.166.el6_7.3.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64
(gdb) set print pretty on
(gdb) p *job_hash[1600311 % slurmctld_conf.max_job_cnt]
$1 = {
  account = 0xad6b220 "ema4", 
  admin_comment = 0x0, 
  alias_list = 0x0, 
  alloc_node = 0xad6b1f0 "login-node02", 
  alloc_resp_port = 0, 
  alloc_sid = 95851, 
  array_job_id = 0, 
  array_task_id = 4294967294, 
  array_recs = 0x0, 
  assoc_id = 326, 
  assoc_ptr = 0xa18d80, 
  batch_flag = 1, 
  batch_host = 0x0, 
  billable_tres = 4294967294, 
  bit_flags = 0, 
  burst_buffer = 0x0, 
  burst_buffer_state = 0x0, 
  check_job = 0x0, 
  ckpt_interval = 0, 
  ckpt_time = 0, 
  clusters = 0x0, 
  comment = 0x0, 
  cpu_cnt = 0, 
  cr_enabled = 0, 
  db_index = 290578455, 
  deadline = 0, 
  delay_boot = 0, 
  derived_ec = 0, 
  details = 0xad6b020, 
  direct_set_prio = 0, 
  end_time = 1493835369, 
  end_time_exp = 0, 
  epilog_running = false, 
  exit_code = 0, 
  fed_details = 0x0, 
  front_end_ptr = 0x0, 
  gres = 0x0, 
  gres_list = 0x0, 
  gres_alloc = 0x0, 
  gres_detail_cnt = 0, 
  gres_detail_str = 0x0, 
  gres_req = 0x0, 
  gres_used = 0x0, 
  group_id = 1030, 
  job_id = 1600311, 
  job_next = 0x0, 
  job_array_next_j = 0x0, 
  job_array_next_t = 0x0, 
  job_resrcs = 0x0, 
  job_state = 4, 
  kill_on_node_fail = 1, 
  licenses = 0x0, 
  license_list = 0x0, 
  limit_set = {
    qos = 0, 
    time = 0, 
    tres = 0xad6ac80
  }, 
  mail_type = 0, 
  mail_user = 0x0, 
  magic = 4038539564, 
  mcs_label = 0x0, 
  name = 0xad6b1d0 "WHe", 
  network = 0x0, 
  next_step_id = 0, 
  nodes = 0x0, 
  node_addr = 0x0, 
  node_bitmap = 0x0, 
  node_bitmap_cg = 0x0, 
  node_cnt = 0, 
  node_cnt_wag = 0, 
  nodes_completing = 0x0, 
  other_port = 0, 
  pack_leader = 0, 
  partition = 0xad6b1a0 "parallel", 
  part_ptr_list = 0x0, 
  part_nodes_missing = false, 
  part_ptr = 0x2aafa433c600, 
  pelog_env = 0x0, 
  pelog_env_size = 0, 
  power_flags = 0 '\000', 
  pre_sus_time = 0, 
  preempt_time = 0, 
  preempt_in_progress = false, 
  priority = 51182, 
  priority_array = 0x0, 
  prio_factors = 0xad6aae0, 
  profile = 0, 
  qos_id = 1, 
  qos_ptr = 0x0, 
  reboot = 0 '\000', 
  restart_cnt = 0, 
  resize_time = 0, 
  resv_id = 0, 
  resv_name = 0x0, 
  resv_ptr = 0x0, 
  requid = 4294967295, 
  resp_host = 0x0, 
  sched_nodes = 0x0, 
  select_jobinfo = 0xad6b240, 
  spank_job_env = 0x0, 
  spank_job_env_size = 0, 
  start_protocol_ver = 7936, 
  start_time = 1493835369, 
  state_desc = 0x0, 
  state_reason = 0, 
  state_reason_prev = 0, 
  step_list = 0xad317f0, 
  suspend_time = 0, 
  time_last_active = 1494065651, 
  time_limit = 30, 
  time_min = 0, 
  tot_sus_time = 0, 
  total_cpus = 0, 
  total_nodes = 2, 
  tres_req_cnt = 0xad6b450, 
  tres_req_str = 0xad6b330 "1=48,2=245760,4=2", 
  tres_fmt_req_str = 0xad6b360 "cpu=48,mem=245760,node=2", 
  tres_alloc_cnt = 0xad6b410, 
  tres_alloc_str = 0x0, 
  tres_fmt_alloc_str = 0x0, 
  user_id = 1224, 
  wait_all_nodes = 0, 
  warn_flags = 0, 
  warn_signal = 0, 
  warn_time = 0, 
  wckey = 0x0, 
  req_switch = 0, 
  wait4switch = 0, 
  best_switch = true, 
  wait4switch_start = 0
}
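Two fields in this dump are worth decoding.  job_state = 4 is JOB_CANCELLED in Slurm's job_states enum, matching the scontrol output above, and end_time = 1493835369 is 2017-05-03T14:16:09 EDT while time_last_active = 1494065651 is roughly 230,000 seconds later -- so this record is far past MinJobAge and should long since have been purged.  A small helper for reading such dumps (enum values copied from slurm.h as of the 17.02 era; verify them against your Slurm version):

```c
/* Base job states from Slurm's job_states enum (slurm.h); the high
 * bits of job_state carry flag bits such as COMPLETING, so mask
 * down to the low byte (JOB_STATE_BASE) first. */
enum job_states {
    JOB_PENDING   = 0,
    JOB_RUNNING   = 1,
    JOB_SUSPENDED = 2,
    JOB_COMPLETE  = 3,
    JOB_CANCELLED = 4,
    JOB_TIMEOUT   = 5,
    JOB_FAILED    = 6,
};

const char *state_name(int job_state)
{
    switch (job_state & 0xff) {   /* JOB_STATE_BASE mask */
    case JOB_PENDING:   return "PENDING";
    case JOB_RUNNING:   return "RUNNING";
    case JOB_SUSPENDED: return "SUSPENDED";
    case JOB_COMPLETE:  return "COMPLETED";
    case JOB_CANCELLED: return "CANCELLED";
    case JOB_TIMEOUT:   return "TIMEOUT";
    case JOB_FAILED:    return "FAILED";
    default:            return "UNKNOWN";
    }
}
```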

# gdb attach `pgrep slurmctld`
(gdb) set print pretty on
(gdb) p *job_hash[1601720 % slurmctld_conf.max_job_cnt]
$1 = {
  account = 0xb29ed10 "abattle4", 
  admin_comment = 0x0, 
  alias_list = 0x0, 
  alloc_node = 0xb29ece0 "login-node03", 
  alloc_resp_port = 53651, 
  alloc_sid = 188027, 
  array_job_id = 0, 
  array_task_id = 4294967294, 
  array_recs = 0x0, 
  assoc_id = 1302, 
  assoc_ptr = 0x9d7450, 
  batch_flag = 0, 
  batch_host = 0xb29ed80 "compute0210", 
  billable_tres = 1, 
  bit_flags = 0, 
  burst_buffer = 0x0, 
  burst_buffer_state = 0x0, 
  check_job = 0x0, 
  ckpt_interval = 0, 
  ckpt_time = 0, 
  clusters = 0x0, 
  comment = 0x0, 
  cpu_cnt = 0, 
  cr_enabled = 0, 
  db_index = 290585115, 
  deadline = 0, 
  delay_boot = 0, 
  derived_ec = 9, 
  details = 0xb29eab0, 
  direct_set_prio = 0, 
  end_time = 1493837480, 
  end_time_exp = 4294967294, 
  epilog_running = false, 
  exit_code = 0, 
  fed_details = 0x0, 
  front_end_ptr = 0x0, 
  gres = 0x0, 
  gres_list = 0x0, 
  gres_alloc = 0xb29ed40 "", 
  gres_detail_cnt = 0, 
  gres_detail_str = 0x0, 
  gres_req = 0xb29ed60 "", 
  gres_used = 0x0, 
  group_id = 1206, 
  job_id = 1601720, 
  job_next = 0x0, 
  job_array_next_j = 0x0, 
  job_array_next_t = 0x0, 
  job_resrcs = 0xb29ede0, 
  job_state = 6, 
  kill_on_node_fail = 1, 
  licenses = 0x0, 
  license_list = 0x0, 
  limit_set = {
    qos = 0, 
    time = 0, 
    tres = 0xb29e710
  }, 
  mail_type = 0, 
  mail_user = 0x0, 
  magic = 4038539564, 
  mcs_label = 0x0, 
  name = 0xb29ecb0 "interact", 
  network = 0x0, 
  next_step_id = 1, 
  nodes = 0xb29ec60 "compute0210", 
  node_addr = 0x0, 
  node_bitmap = 0x2ab0081224a0, 
  node_bitmap_cg = 0x0, 
  node_cnt = 0, 
  node_cnt_wag = 0, 
  nodes_completing = 0x0, 
  other_port = 51738, 
  pack_leader = 0, 
  partition = 0xb29ec90 "shared", 
  part_ptr_list = 0x0, 
  part_nodes_missing = false, 
  part_ptr = 0x2aafa433c400, 
  pelog_env = 0x0, 
  pelog_env_size = 0, 
  power_flags = 0 '\000', 
  pre_sus_time = 0, 
  preempt_time = 0, 
  preempt_in_progress = false, 
  priority = 108250, 
  priority_array = 0x0, 
  prio_factors = 0xb29e530, 
  profile = 0, 
  qos_id = 1, 
  qos_ptr = 0x0, 
  reboot = 0 '\000', 
  restart_cnt = 0, 
  resize_time = 0, 
  resv_id = 0, 
  resv_name = 0x0, 
  resv_ptr = 0x0, 
  requid = 4294967295, 
  resp_host = 0xb29ec30 "172.16.5.5", 
  sched_nodes = 0x0, 
  select_jobinfo = 0xb29edb0, 
  spank_job_env = 0x0, 
  spank_job_env_size = 0, 
  start_protocol_ver = 7936, 
  start_time = 1493835596, 
  state_desc = 0x0, 
  state_reason = 24, 
  state_reason_prev = 24, 
  step_list = 0xb2659a0, 
  suspend_time = 0, 
  time_last_active = 1494065652, 
  time_limit = 30, 
  time_min = 0, 
  tot_sus_time = 54, 
  total_cpus = 1, 
  total_nodes = 1, 
  tres_req_cnt = 0xb29f240, 
  tres_req_str = 0xb29f130 "1=1,2=5120,4=1", 
  tres_fmt_req_str = 0xb29f160 "cpu=1,mem=5120,node=1", 
  tres_alloc_cnt = 0xb29f200, 
  tres_alloc_str = 0xb29f0d0 "1=1,2=5120,4=1", 
  tres_fmt_alloc_str = 0xb29f100 "cpu=1,mem=5G,node=1", 
  user_id = 1959, 
  wait_all_nodes = 0, 
  warn_flags = 0, 
  warn_signal = 0, 
  warn_time = 0, 
  wckey = 0x0, 
  req_switch = 0, 
  wait4switch = 0, 
  best_switch = true, 
  wait4switch_start = 0
}
Comment 16 Tim Shaw 2017-05-09 11:09:36 MDT
Created attachment 4536 [details]
Extra logging patch

Thank you for the output you provided.  It has been helpful and has surfaced some leads to pursue.  I have some theories now as to why this is happening, but I need a little more information to nail down the problem.

First, I'm attaching a logging patch that will give a little more information around the area of the code that removes jobs from the queue.  Please apply the patch and turn your logging up to the debug level.

SlurmctldDebug=debug

Second, please run "scontrol show step <job_id>" on one of the stuck jobs and paste the output into this ticket.  I'm wondering if these stuck jobs still have job steps that haven't finished.  Then, using the same job ID, connect to slurmctld using gdb:

gdb attach `pgrep slurmctld`

Then run:

set print pretty on
p job_hash[XXXX % slurmctld_conf.max_job_cnt]

Note that this print command doesn't have the leading "*" like the command I sent you before, but XXXX is still where the job ID goes.  That should return something like:

$2 = (struct job_record *) 0x7fcbd40056f0

Then, using the hex address given by gdb (for example, mine was "0x7fcbd40056f0"), run the following command with your hex address:

p *((struct job_record *) 0x7fcbd40056f0)->details

That should print out the job details object.  Please paste that into the bug.
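For context on why `job_hash[XXXX % slurmctld_conf.max_job_cnt]` finds the right record, here is a minimal sketch of a chained hash lookup in the style of slurmctld's job_hash. This is not Slurm source; field names are simplified and `MAX_JOB_CNT` is a hypothetical stand-in for `slurmctld_conf.max_job_cnt` (the `job_next` chain pointer does appear in the struct dump above):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_JOB_CNT 10000   /* stand-in for slurmctld_conf.max_job_cnt */

struct job_record {
    uint32_t job_id;
    struct job_record *job_next;   /* collision chain within a bucket */
};

static struct job_record *job_hash_table[MAX_JOB_CNT];

/* Index the bucket with job_id % table size, then walk the chain,
 * since several job IDs can hash to the same bucket. */
static struct job_record *find_job_record(uint32_t job_id)
{
    struct job_record *j = job_hash_table[job_id % MAX_JOB_CNT];
    while (j && j->job_id != job_id)
        j = j->job_next;
    return j;   /* NULL if the job is not in the table */
}
```

This is also why the gdb expression can return the bucket head rather than the exact job when IDs collide.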

Finally, you said that some jobs are eventually leaving the queue.  I need to go through the debug logs and find the entries that show a job actually leaving the queue; the extra logging patch I provided should help with that.  The debug log message "purging job" marks where that takes place.  If you can find a job that was stuck but has now been purged with that message, please attach that log to this ticket.

I know this is quite a bit of work for you, so feel free to send me the first few items if you need more time for the last one.

Let me know if you have any further questions.  Thanks again for your effort.

Regards.

Tim
Comment 17 Michael Ver Haag 2017-05-09 11:59:55 MDT
I will work on implementing the patch in the morning and see if I can get any debug purge statements.

I tried to get scontrol show step <jobid> output, but I could not get anything from any of the jobs. They all said something like

# scontrol show step 1800325
Job step 1800325.4294967294 not found
# scontrol show step 1661238
Job step 1661238.4294967294 not found

I ran a "scontrol show step" and it only showed about 70ish lines for State=RUNNING jobs. 
with this as the current load summary.
           Running Pending   Other   Total   Users  Groups 
Total         3397    1910    5766   11073     168      98 
scavenger        6      11     343     360       9       5 
lrgmem          93      32     821     946      23      20 
gpuscav          0       0       2       2       1       1 
shared,lr        0      12       0      12       1       1 
bw-gpu          12       0      59      71      11      10 
memscav          0       0      11      11       1       1 
unlimited       12       0       5      17       3       3 
bw-parall        8       2     166     176      13      12 
gpu             60      46     249     355      28      21 
debug            1       0     103     104      32      20 
shared        3096    1793    3268    8157      78      59 
parallel       109      14     739     862      56      38 

I also noticed while doing this that 
# squeue  -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.30R" --sort=t,p,i --states=FAILED |wc -l
53318
So there seem to be about 50k more jobs in the queue, in addition to the 5900 currently outside MinJobAge that I had previously seen.
Comment 18 Tim Shaw 2017-05-09 13:46:29 MDT
Created attachment 4538 [details]
Extra logging patch

I decided to add a few more logging statements to the patch to better see exactly why slurmctld won't purge the jobs.  If you've already downloaded it, please download it again.

Thanks

Tim
Comment 19 Tim Shaw 2017-05-09 13:49:57 MDT
Created attachment 4539 [details]
Extra logging patch

One minor adjustment made to logging patch.
Comment 20 Michael Ver Haag 2017-05-15 08:19:14 MDT
Sorry for the delay.
I updated us to 17.02.3 since I had to recompile anyway.
The issue persists.
I have seen a few purge messages in the log, but nothing that seems relevant yet.

# squeue  -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=e --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,COMPLETING,CONFIGURING,RESIZING,SPECIAL_EXIT |head -n 10
          JOBID       PARTITION              USER STATE TIME_LIMI NOD    ACCOUNT   PRIORITY           START_TIME             END_TIME               NODELIST(REASON)
        1680496        parallel     qli39@jhu.edu CANCELLED 1-06:30:00  30       ema4      52897  2017-05-04T23:23:03  2017-05-04T23:23:03                               

# grep 1680496 slurmctld
[2017-05-15T09:25:14.613] Recovered JobID=1680496 State=0x4 NodeCnt=0 Assoc=326

# scontrol show job 1680496
JobId=1680496 JobName=WHe
   UserId=qli39@jhu.edu(1224) GroupId=ema4(1030) MCS_label=N/A
   Priority=52897 Nice=0 Account=ema4 QOS=normal
   JobState=CANCELLED Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-06:30:00 TimeMin=N/A
   SubmitTime=2017-05-04T23:04:36 EligibleTime=2017-05-04T23:04:37
   StartTime=2017-05-04T23:23:03 EndTime=2017-05-04T23:23:03 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=parallel AllocNode:Sid=login-node03:166934
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=30 NumCPUs=720 NumTasks=720 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=720,mem=3686400,node=30
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/sub.sh
   WorkDir=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110
   StdErr=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1680496.out
   StdIn=/dev/null
   StdOut=/scratch/groups/ema4/qjli/screw-vac-H/WHe-110/slurm-1680496.out
   Power=

# scontrol show step 1680496
Job step 1680496.4294967294 not found

(gdb) p job_hash[1680496 % slurmctld_conf.max_job_cnt]
$1 = (struct job_record *) 0x2048340
(gdb) p *((struct job_record *) 0x2048340)->details
$2 = {acctg_freq = 0x0, argc = 1, argv = 0x2048990, begin_time = 1493953477, ckpt_dir = 0x2048930 "/var/slurm/checkpoint", contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, 
  cpu_freq_gov = 4294967294, cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, feature_list = 0x0, features = 0x0, magic = 0, max_cpus = 4294967294, 
  max_nodes = 30, mc_ptr = 0x2048960, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 720, min_nodes = 30, nice = 2147483648, ntasks_per_node = 24, num_tasks = 720, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 24, 
  pn_min_memory = 9223372036854780928, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, share_res = 0 '\000', std_err = 0x0, 
  std_in = 0x2048900 "/dev/null", std_out = 0x0, submit_time = 1493953476, task_dist = 8192, usable_nodes = 0, whole_node = 1 '\001', work_dir = 0x20482c0 "/scratch/groups/ema4/qjli/screw-vac-H/WHe-110"}
Comment 21 Michael Ver Haag 2017-05-15 08:24:34 MDT
Created attachment 4553 [details]
Recent slurmctld log
Comment 22 Tim Shaw 2017-05-15 11:24:58 MDT
Created attachment 4557 [details]
Extra logging patch 2

I have a working theory as to why this is happening, but I still don't have enough information to be 100% positive.  The logging you provided with my last patch supported my theory, and I've attached a new patch (Extra logging patch 2) with a bit more logging that should verify it.  If you could apply it, let slurmctld run for a while, and then send me the log along with the job ID of another stuck job so I can track it down, that would be great.

I believe we are close to discovering the core problem here.  Thanks for your effort.

Regards

Tim
Comment 24 Michael Ver Haag 2017-05-16 05:36:03 MDT
# squeue  -o "%.15i %.15P %.17u %T %.9l %.3D %.10a %.10Q %.20S %.20e %.30R" --sort=e,t,p,i --states=SUSPENDED,CANCELLED,TIMEOUT,NODE_FAIL,PREEMPTED,BOOT_FAIL,COMPLETING,CONFIGURING,RESIZING,SPECIAL_EXIT  |head -n 10
          JOBID       PARTITION              USER STATE TIME_LIMI NOD    ACCOUNT   PRIORITY           START_TIME             END_TIME               NODELIST(REASON)                                                                                                           
        1740411          shared   gweber5@jhu.edu CANCELLED     30:00   1   sghosh20      51512  2017-05-06T04:24:54  2017-05-06T04:24:54                                                                                                                                      
        1740457          shared   gweber5@jhu.edu CANCELLED     30:00   1   sghosh20      51516  2017-05-06T04:26:13  2017-05-06T04:26:13                                                                                                                                      
        1740507        parallel   gweber5@jhu.edu CANCELLED     30:00   1   sghosh20      51755  2017-05-06T04:51:22  2017-05-06T04:51:22                                                                                                                                      
        1744516        parallel            asiera CANCELLED  12:00:00   1   jcombar1    1020315  2017-05-06T06:18:18  2017-05-06T06:23:21                    compute0273                                                                                                       
        1749580          shared     qli39@jhu.edu CANCELLED 1-06:30:00  25       ema4      52476  2017-05-06T08:47:13  2017-05-06T08:47:13                                                                                                                                     
        1747436          shared  ppivete1@jhu.edu CANCELLED  10:00:00   1   ppivete1     302446  2017-05-06T07:46:39  2017-05-06T09:00:35                    compute0018                                                                                                       
        1727855          bw-gpu  cchen184@jhu.edu CANCELLED 1-00:00:00   2   mulmsch1      51113  2017-05-05T22:49:39  2017-05-06T09:12:05                   gpu[062-063]                                                                                                      
        1750724        parallel  cchen184@jhu.edu CANCELLED 4-04:00:00   4   mulmsch1      51246  2017-05-06T09:15:10  2017-05-06T09:15:10                                                                                                                                     
        1750935        parallel  cchen184@jhu.edu CANCELLED 4-04:00:00   4   mulmsch1      51226  2017-05-06T09:17:11  2017-05-06T09:17:11
Comment 25 Michael Ver Haag 2017-05-16 05:37:23 MDT
Created attachment 4563 [details]
slurmctldlog after second logging patch
Comment 30 Tim Shaw 2017-05-17 14:23:25 MDT
Michael,

I just wanted to give you a quick update on this bug.  Using your last debugging log I was able to confirm my theory as to what is happening.  Right now, the controller is hitting the job purge timeout of 1 second after purging only a few jobs.  Something is causing the controller to take longer than normal to purge each job, so it doesn't get through very many before timing out and moving on.  There could be environmental factors causing this; a few possibilities include running the controller in a VM, a slow or under-performing file system, and/or the script/environment removal taking too long.  Exactly why is uncertain, and you might be able to lend some insight there.  In any case, I'm working on a patch to optimize and speed up the purging of jobs and am considering several different approaches.  It may take a little while to determine the best course of action, and it will need extensive testing.  In the meantime, you might look into the possibilities above to see if any of them apply.
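To illustrate the failure mode described here, below is a minimal sketch (not Slurm source; the names, struct layout, and budget constant are assumptions) of a time-bounded purge pass.  When per-job cleanup is slow, the pass exits on the wall-clock budget after handling only a few jobs, and the rest linger past MinJobAge:

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define PURGE_BUDGET_SEC 1   /* analogous to the 1-second limit described */

struct job {
    bool finished;       /* completed / cancelled / timed out */
    time_t end_time;
    struct job *next;
};

/* Stand-in for per-job cleanup; in the real controller this includes
 * removing the batch script and environment files from StateSaveLocation,
 * which is where a slow filesystem makes each iteration expensive. */
static void cleanup_job(struct job *j) { (void)j; }

/* One purge pass: clean up jobs older than min_job_age, but stop once
 * the wall-clock budget is spent.  Returns the number of jobs purged. */
static int purge_pass(struct job *list, time_t now, int min_job_age)
{
    time_t start = time(NULL);
    int purged = 0;
    for (struct job *j = list; j != NULL; j = j->next) {
        if (time(NULL) - start >= PURGE_BUDGET_SEC)
            break;              /* budget exhausted; remaining jobs wait */
        if (j->finished && (now - j->end_time) > min_job_age) {
            cleanup_job(j);
            purged++;
        }
    }
    return purged;
}
```

With a fast `cleanup_job` the whole list is scanned within the budget; with a slow one, each pass purges only what fits into one second.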

I'll keep you up to date on how it's going.

Regards.

Tim
Comment 35 Michael Ver Haag 2017-05-19 04:59:26 MDT
Thanks for the insight. 
For reference, the performance of the purge process appeared to change when we implemented 17.02; that was the only change in our environment before we experienced this issue.
Based on your latest note I made a change in Slurm yesterday.
Our Slurm environment runs off a relatively slow NFS share, with the state save on that share. I migrated the StateSaveLocation to a local disk (yes, I know this causes an issue with the backup controller).
After implementing the change the purge process was better able to catch up; after about 4 hours the system had caught up entirely. I was monitoring the log during this process. As the size of the queue went down, the purging sped up, so it appears there is some time involved in loading that should not count against the timeout.
We are under a relatively light compute load at this time, with exams and the end of the semester. The machine slurmctld runs on is a bare-metal 24-core system with 128G of RAM. It seems like the purging should have been able to catch up relatively quickly if disk wait were the primary problem. The fact that it took so long suggests the slow disk was a smaller component that compounded the problem.
Thank you for your work on this.
I look forward to your changes to see if we can roll back to our shared file space.
Comment 42 Tim Wickberg 2017-05-25 14:44:08 MDT
Created attachment 4650 [details]
consolidated patch to address performance issues with purge_old_job

Hey Michael -

I'm attaching a preliminary version of a patch that addresses this. I believe the main issue in the purge_old_job code is the speed at which the batch script + environment files can be removed from StateSaveLocation; the attached patch defers that work to a separate thread that works through them at its own pace, letting the rest of slurmctld continue on.
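As a rough illustration of the deferral idea (an assumption-laden sketch, not the committed patch), the purge pass would only enqueue paths, while a background worker does the slow filesystem calls off the controller's critical path:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Queue of state files awaiting deletion (hypothetical names). */
struct purge_item {
    char path[256];
    struct purge_item *next;
};

static struct purge_item *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from the purge pass: O(1) and no filesystem I/O, so a slow
 * StateSaveLocation no longer eats into the purge loop's time budget. */
static void defer_removal(const char *path)
{
    struct purge_item *it = malloc(sizeof(*it));
    snprintf(it->path, sizeof(it->path), "%s", path);
    pthread_mutex_lock(&queue_lock);
    it->next = queue_head;
    queue_head = it;
    pthread_mutex_unlock(&queue_lock);
}

/* One drain by the background worker; a real daemon would run this in a
 * detached thread in a loop, blocking on a condition variable when the
 * queue is empty.  Returns the number of paths processed. */
static int purge_worker_drain(void)
{
    pthread_mutex_lock(&queue_lock);
    struct purge_item *it = queue_head;
    queue_head = NULL;
    pthread_mutex_unlock(&queue_lock);

    int removed = 0;
    while (it) {
        struct purge_item *next = it->next;
        unlink(it->path);   /* the slow call, now off the main thread */
        free(it);
        removed++;
        it = next;
    }
    return removed;
}
```

The trade-off is that file removal is now eventually consistent with the in-memory job table, which is acceptable for cleanup work like this.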

It should apply to a stock 17.02.3 source tree - you can remove the other test patches Tim had given you before.

If you do test it out, let me know how it goes. I'll update this bug when we have a final version of these committed, which should happen before 17.02.4 is released.

- Tim
Comment 45 Tim Wickberg 2017-06-01 14:05:22 MDT
Hey Michael -

A version of the patch I'd sent has been committed as 843e5d38a5 + ce2cd1b237 + b9719be2a3, plus a few extra cleanup patches afterwards. Those are functionally equivalent to what I'd provided before, so no need to switch over if that's been working for you thus far.

That will all be in 17.02.4 when released, likely next week.

Please reopen if you have any further concerns.

- Tim