Ticket 3978 - Jobs Scheduled But Not Starting
Summary: Jobs Scheduled But Not Starting
Status: RESOLVED DUPLICATE of ticket 3977
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-07-11 08:41 MDT by Paul Edmon
Modified: 2017-07-13 14:56 MDT

See Also:
Site: Harvard University


Description Paul Edmon 2017-07-11 08:41:35 MDT
So we are seeing a variation of bug 3676 (https://bugs.schedmd.com/show_bug.cgi?id=3676) after our upgrade to 17.02.6 on Monday.  Jobs get scheduled and show up in slurmd.log as having run their prolog and started their cgroup, but then the job never actually starts.  For instance, on this node there are 50 jobs from this user:

[root@xie01 log]# /usr/bin/squeue -w xie01 | grep ltan | wc -l
50

All of them run python, but when you count how many are actually running you get:

[root@xie01 log]# ps aux | grep python | grep ltan | wc -l
23
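The mismatch above can be checked in one shot. A hedged sketch, assuming the node and user names from this ticket and that each job runs exactly one python process:

```shell
# Count what Slurm thinks is running on this node for the user,
# versus the python processes actually present (names from this ticket).
node=xie01
user=ltan
scheduled=$(squeue -h -w "$node" -u "$user" | wc -l)
running=$(ps -u "$user" -o comm= | grep -c '^python')
echo "scheduled=$scheduled actually_running=$running"
```

Any node where the two counts diverge is a candidate for the stuck-job symptom described below.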

Some jobs:
[root@xie01 log]# scontrol -dd show job 24116603
JobId=24116603 JobName=assign.sh
   UserId=ltan(57234) GroupId=xie_lab(402013) MCS_label=N/A
   Priority=7399128 Nice=0 Account=xie_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=01:02:18 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2017-07-11T09:36:16 EligibleTime=2017-07-11T09:36:16
   StartTime=2017-07-11T09:36:19 EndTime=2017-07-18T09:36:21 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xie AllocNode:Sid=holynx01:58732
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=xie01
   BatchHost=xie01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=xie01 CPU_IDs=37 Mem=8000 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./assign.sh processed/1CDX1-285/impute 10 processed/1CDX1-285/clean.txt processed/1CDX1-285/assign
   WorkDir=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC
   StdErr=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-285-10-24116603.err
   StdIn=/dev/null
   StdOut=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-285-10-24116603.out
   Power=
   BatchScript=
#!/bin/bash
#SBATCH -n 1
#SBATCH -p general
#SBATCH --mem-per-cpu=8000
#SBATCH -t 7-00:00

prefix="$1"
replicate_name="$2"
raw_contact_file="$3"
output_prefix="$4"

module load python
source activate hic

stdbuf -o0 -e0 ~/research3/nuc_dynamics/nuc_dynamics ${prefix}.ncc -o ${prefix}.${replicate_name}.3dg -temps 50 -iso 10000 -ran ${replicate_name}
python ~/research/dianti-code/dianti-meta/clean_3dg.py ${prefix}.${replicate_name}.3dg ${prefix}.txt ${prefix}.${replicate_name}c.3dg
python ~/research/dianti-code/dianti-meta/assign.py ${raw_contact_file} ${prefix}.${replicate_name}.3dg ${output_prefix}.${replicate_name}.txt

For this job the batch script is at least running:
[root@xie01 log]# ps aux | grep 24116603
root     23054  0.0  0.0 310956  3880 ?        Sl   09:36   0:00 slurmstepd: [24116603]
root     23059  0.0  0.0 293868  3364 ?        Sl   09:36   0:00 slurmstepd: [24116603.4294967295]
ltan     23099  0.0  0.0 106224  1384 ?        S    09:36   0:00 /bin/bash /var/slurmd/spool/slurmd/job24116603/slurm_script processed/1CDX1-285/impute 10 processed/1CDX1-285/clean.txt processed/1CDX1-285/assign
root     28919  0.0  0.0 103248   856 pts/6    S+   10:37   0:00 grep 24116603

but for other jobs, grepping for the jobid shows only the stepd:

[root@xie01 log]# ps aux | grep 24116633
root     24325  0.0  0.0 293868  3364 ?        Sl   09:36   0:00 slurmstepd: [24116633.4294967295]
root     28795  0.0  0.0 103244   856 pts/6    S+   10:36   0:00 grep 24116633

even though they should have started:

[root@xie01 log]# scontrol -dd show job 24116633
JobId=24116633 JobName=assign.sh
   UserId=ltan(57234) GroupId=xie_lab(402013) MCS_label=N/A
   Priority=7399128 Nice=0 Account=xie_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=01:01:49 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2017-07-11T09:36:18 EligibleTime=2017-07-11T09:36:18
   StartTime=2017-07-11T09:36:29 EndTime=2017-07-18T09:36:29 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xie AllocNode:Sid=holynx01:58732
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=xie01
   BatchHost=xie01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=xie01 CPU_IDs=19 Mem=8000 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./assign.sh processed/1CDX1-413/impute 10 processed/1CDX1-413/clean.txt processed/1CDX1-413/assign
   WorkDir=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC
   StdErr=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-413-10-24116633.err
   StdIn=/dev/null
   StdOut=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-413-10-24116633.out
   Power=
   BatchScript=
#!/bin/bash
#SBATCH -n 1
#SBATCH -p general
#SBATCH --mem-per-cpu=8000
#SBATCH -t 7-00:00

prefix="$1"
replicate_name="$2"
raw_contact_file="$3"
output_prefix="$4"

module load python
source activate hic

stdbuf -o0 -e0 ~/research3/nuc_dynamics/nuc_dynamics ${prefix}.ncc -o ${prefix}.${replicate_name}.3dg -temps 50 -iso 10000 -ran ${replicate_name}
python ~/research/dianti-code/dianti-meta/clean_3dg.py ${prefix}.${replicate_name}.3dg ${prefix}.txt ${prefix}.${replicate_name}c.3dg
python ~/research/dianti-code/dianti-meta/assign.py ${raw_contact_file} ${prefix}.${replicate_name}.3dg ${output_prefix}.${replicate_name}.txt


You can see that the jobs are basically identical.  The storage it's writing to is fine: it has plenty of space and inodes and is not overloaded.

Any insight to this?  I know you couldn't r
Comment 1 Paul Edmon 2017-07-11 08:47:36 MDT
Sorry, pulled the trigger too early.  Any insight on this?  I know you couldn't reproduce it on your end last time.

All told we are getting reports across our cluster of people's jobs starting but not writing any data.  We originally thought it was the filesystem, but it's too widespread for that, and it only started occurring after the Slurm upgrade we did on Monday morning.
Comment 2 Paul Edmon 2017-07-11 09:11:42 MDT
Thinking about this more, this could be some sort of open-file limit that the slurm user or root is hitting.  Any thoughts on that?
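One way to check that theory, using standard /proc interfaces (not something from this ticket): compare the open-file limit actually applied to the running slurmd against the descriptors it currently holds.

```shell
# Inspect slurmd's open-file limit and its current fd count.
pid=$(pidof slurmd)
grep 'Max open files' /proc/"$pid"/limits
ls /proc/"$pid"/fd | wc -l    # descriptors currently in use
```

If the in-use count sits near the limit, a descriptor leak would be a plausible culprit; here it was not, as later comments show.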
Comment 3 Tim Wickberg 2017-07-11 09:15:12 MDT
Do you have any of these currently stuck jobs still?

I'd love to see what that stepd process is currently doing - attaching to it with 'gdb -p <pid>', and running 'thread apply all bt full' would help considerably.
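The same capture can be taken non-interactively, which is handy when grabbing backtraces from several stuck stepds. This uses standard gdb batch flags; the PID is whichever slurmstepd is stuck:

```shell
# Batch-mode full backtrace of a slurmstepd; output goes to a file.
PID=24325    # example: a stuck stepd's PID
gdb -p "$PID" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' > "stepd-$PID-bt.txt" 2>&1
```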
Comment 4 Paul Edmon 2017-07-11 09:18:07 MDT
[root@xie01 ~]# ps aux | grep 24116633
root     24325  0.0  0.0 293868  3364 ?        Sl   09:36   0:00 slurmstepd: [24116633.4294967295]
root     30404  0.0  0.0 103248   856 pts/6    S+   11:16   0:00 grep 24116633
[root@xie01 ~]# gdb -p 24325
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 24325
Reading symbols from /usr/sbin/slurmstepd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpam.so.0
Reading symbols from /lib64/libpam_misc.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpam_misc.so.0
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 24332]
[New LWP 24330]
[New LWP 24329]
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libpci.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpci.so.3
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libaudit.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libaudit.so.1
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
Reading symbols from /usr/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/lib64/slurm/select_cons_res.so
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/lib64/slurm/switch_none.so
Reading symbols from /usr/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/lib64/slurm/gres_gpu.so
Reading symbols from /usr/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_energy_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_infiniband_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_filesystem_none.so
Reading symbols from /usr/lib64/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /usr/lib64/slurm/jobacct_gather_linux.so
Reading symbols from /usr/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/lib64/slurm/core_spec_none.so
Reading symbols from /usr/lib64/slurm/task_affinity.so...done.
Loaded symbols for /usr/lib64/slurm/task_affinity.so
Reading symbols from /usr/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/lib64/slurm/task_cgroup.so
Reading symbols from /usr/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/lib64/slurm/crypto_munge.so
Reading symbols from /usr/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/lib64/slurm/job_container_none.so
Reading symbols from /usr/lib64/slurm/mpi_none.so...done.
Loaded symbols for /usr/lib64/slurm/mpi_none.so
Reading symbols from /usr/lib64/slurm/x11.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/slurm/x11.so
Reading symbols from /usr/lib64/slurm/libspunnel.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/slurm/libspunnel.so
0x00000030ab2ac6ca in wait4 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-17.02.6-1fasrc01.el6.x86_64
(gdb) thread apply all bt full

Thread 4 (Thread 0x2ba03d5e3700 (LWP 24329)):
#0  0x00000030aba0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000005436af in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:244
         err = 0
         type = 1
         __func__ = "_watch_tasks"
#2  0x00000030aba079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00000030ab2e88fd in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2ba03d6e4700 (LWP 24330)):
#0  0x00000030ab2aca3d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000030ab2e1be4 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000547894 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:189
         i = 4
         now = 1499786227
         diff = 1499786227
         __func__ = "_timer_thread"
         tv1 = {tv_sec = 1499786227, tv_usec = 968531}
         tv2 = {tv_sec = 1499786227, tv_usec = 968532}
         tv_str = "usec=1", '\000' <repeats 13 times>
         delta_t = 1
#3  0x00000030aba079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00000030ab2e88fd in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2ba03d7e5700 (LWP 24332)):
#0  0x00000030ab2df0d3 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000000000453816 in _poll_internal (pfds=0x2ba0400008d0, nfds=2, shutdown_time=0) at eio.c:362
         n = 16777217
         timeout = -1
#2  0x00000000004535ec in eio_handle_mainloop (eio=0xdce5d0) at eio.c:326
         retval = 0
         pollfds = 0x2ba0400008d0
         map = 0x2ba040000900
         maxnfds = 1
         nfds = 2
         n = 1
         shutdown_time = 0
         __func__ = "eio_handle_mainloop"
#3  0x000000000043dd79 in _msg_thr_internal (job_arg=0xdc7440) at req.c:242
         job = 0xdc7440
#4  0x00000030aba079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00000030ab2e88fd in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2ba03b788860 (LWP 24325)):
#0  0x00000030ab2ac6ca in wait4 () from /lib64/libc.so.6
No symbol table info available.
#1  0x000000000042fefd in _spawn_job_container (job=0xdc7440) at mgr.c:1059
         jobacct = 0x0
         rusage = {ru_utime = {tv_sec = 140728798717296, tv_usec = 14482240}, ru_stime = {tv_sec = 140728798716720, tv_usec = 4526872}, ru_maxrss = 1,
           ru_ixrss = 8490230596, ru_idrss = 140728798716740, ru_isrss = 47967235126592, ru_minflt = 0, ru_majflt = 14487952, ru_nswap = 140728798716752,
           ru_inblock = 47967233011193, ru_oublock = 104475079499525, ru_msgsnd = 24325, ru_msgrcv = 140728798716816, ru_nsignals = 47967233012171,
           ru_nvcsw = 0, ru_nivcsw = 14447680}
         jobacct_id = {taskid = 0, nodeid = 0, job = 0xdc7440}
         status = 0
         pid = 24348
         rc = 0
         __func__ = "_spawn_job_container"
#2  0x0000000000430246 in job_manager (job=0xdc7440) at mgr.c:1167
         rc = 0
         io_initialized = false
         ckpt_type = 0xdcd1d0 "checkpoint/none"
         err_msg = 0x0
         __func__ = "job_manager"
#3  0x000000000042b89c in main (argc=1, argv=0x7ffdfa0ea578) at slurmstepd.c:183
         cli = 0xdd9870
         self = 0xde9b60
         msg = 0xdc77a0
         job = 0xdc7440
         ngids = 16
         gids = 0xdec910
         rc = 0
         launch_params = 0x0
         __func__ = "main"


Comment 5 Paul Edmon 2017-07-12 08:01:24 MDT
Another data point for this.  I just launched a series of data transfers this morning but only a few are actually running:

[root@holyitc01 liu]# /usr/bin/squeue -u root
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          24160047   general fscull-r     root  R       3:13      7 holy2a[03202-03208]
          24160015   general runscrip     root  R       4:16      1 holyseas02
          24160016   general runscrip     root  R       4:16      1 holyseas02
          24160008   general runscrip     root  R       4:47      1 holyseas02
          24160009   general runscrip     root  R       4:47      1 holyseas02
          24160010   general runscrip     root  R       4:47      1 holyseas01
          24160011   general runscrip     root  R       4:47      1 holyseas01
          24160012   general runscrip     root  R       4:47      1 holyseas01
          24160013   general runscrip     root  R       4:47      1 holy2a13108
          24160014   general runscrip     root  R       4:47      1 holy2a13108

[root@holyitc01 liu]# ls -ltr
total 68
-rwxr-xr-x 1 root root   211 Jul 10 11:03 migrate.sh
-rw-r--r-- 1 root root   195 Jul 12 09:48 runscript-migrate
-rw-r--r-- 1 root root   122 Jul 12 09:49 slurm-24160007.out
-rw-r--r-- 1 root root   121 Jul 12 09:49 slurm-24160006.out
-rw-r--r-- 1 root root 10343 Jul 12 09:53 slurm-24160015.out
-rw-r--r-- 1 root root 10303 Jul 12 09:53 slurm-24160016.out
-rw-r--r-- 1 root root 13339 Jul 12 09:53 slurm-24160011.out
-rw-r--r-- 1 root root 12107 Jul 12 09:53 slurm-24160014.out

It looks like when multiple jobs land on the same node, some of them stall out.  The job script I'm running is here:

[root@holyitc01 liu]# cat runscript-migrate 
#!/bin/bash
#SBATCH -n 1
#SBATCH -p general
#SBATCH -t 7-00:00:00
#SBATCH --mem=4000

echo $FOLDER

srun rsync -av --progress /n/regal/liu/$FOLDER/ /n/holylfs/ATTIC/2017-07-10-regal-liu/$FOLDER/

Normally I wouldn't use srun to launch rsync, but I thought it might solve the problem.  Clearly it did not.
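For reference, the transfers were presumably submitted one job per folder with FOLDER passed through the environment; the exact submission loop isn't shown in the ticket, so this is an assumed reconstruction:

```shell
# Hypothetical submission loop (not from the ticket): one migration job
# per source folder, with $FOLDER exported into the job via --export.
for d in /n/regal/liu/*/; do
    sbatch --export=ALL,FOLDER="$(basename "$d")" runscript-migrate
done
```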

This is an ongoing problem for us so any insight or ideas are appreciated.
Comment 6 Paul Edmon 2017-07-12 08:26:15 MDT
One of our users, Longzhi Tan, found the following:

"I notice that by doing "scontrol requeue <job id>" on each "stuck" job will lead to successful running (at least for the 17 jobs that I tested). So this is a temporary solution; but still I hope the bug will be fixed soon."

I tested this myself and it does work.  Hopefully that gives some hint toward a proper fix.  In the meantime we are going to notify our users of this temporary workaround.
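The workaround can be scripted; a hedged sketch based only on the symptom seen above (a stuck job's stepd exists but its batch script never spawned, so there is no `job<id>/slurm_script` process in the spool path pattern seen in the earlier ps output):

```shell
# Requeue running jobs on this node whose slurm_script process is missing.
# Assumption: every healthy batch job here shows a bash process running
# .../spool/slurmd/job<id>/slurm_script, as in the ps output above.
for jobid in $(squeue -h -w "$(hostname -s)" -t R -o '%A'); do
    if ! pgrep -f "job${jobid}/slurm_script" >/dev/null; then
        echo "requeueing stuck job $jobid"
        scontrol requeue "$jobid"
    fi
done
```

This would requeue healthy non-batch steps too if any existed on the node, so it is a diagnostic aid rather than something to run blindly.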
Comment 7 Paul Edmon 2017-07-13 13:01:47 MDT
Just curious whether there is any update on this.  Do you need any more data from my end?
Comment 8 Tim Wickberg 2017-07-13 13:26:40 MDT
I have a few loose theories, but nothing concrete unfortunately.

Do you happen to have any SPANK plugins enabled?
Comment 9 Paul Edmon 2017-07-13 13:29:44 MDT
Yes, we have 2.

https://github.com/hautreux/slurm-spank-x11
https://github.com/harvardinformatics/spunnel

-Paul Edmon-


Comment 10 Tim Wickberg 2017-07-13 13:47:15 MDT
Actually, I think SPANK isn't implicated here. The backtrace from the stepd looks fine; that's all that the extern step should be doing here.

I'm realizing that this is likely a duplicate of bug 3977, which Stanford reported immediately before this. Do you mind if we tag this as a duplicate and keep track of progress over there?
Comment 11 Paul Edmon 2017-07-13 13:50:36 MDT
Sure, that's perfectly fine.  It's good to see that we aren't the only ones who saw this.  We were afraid that something super unique in our infrastructure was causing this, as we figured this would have been seen in testing.

-Paul Edmon-


Comment 12 Tim Wickberg 2017-07-13 13:54:54 MDT
It's unusual; we're still tracking down the root cause unfortunately, but it does not appear that you're alone on this one.

I'm combing through the changes from 17.02.5 to see if I can isolate a potential cause.  Both you and Stanford are seeing this on 17.02.6, which narrows down when it may have started; I would have expected to see this long ago if it were an issue in older maintenance releases.

*** This ticket has been marked as a duplicate of ticket 3977 ***