So we are seeing a variation of bug 3676 (https://bugs.schedmd.com/show_bug.cgi?id=3676) after our upgrade to 17.02.6 on Monday. Jobs get scheduled and show up in slurmd.log as having run their prolog and started their cgroup, but then the job never actually starts. For instance, on this node there are 50 jobs from this user:

[root@xie01 log]# /usr/bin/squeue -w xie01 | grep ltan | wc -l
50

All of which run python, but when you count how many are actually running you get:

[root@xie01 log]# ps aux | grep python | grep ltan | wc -l
23

Some jobs:

[root@xie01 log]# scontrol -dd show job 24116603
JobId=24116603 JobName=assign.sh
   UserId=ltan(57234) GroupId=xie_lab(402013) MCS_label=N/A
   Priority=7399128 Nice=0 Account=xie_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=01:02:18 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2017-07-11T09:36:16 EligibleTime=2017-07-11T09:36:16
   StartTime=2017-07-11T09:36:19 EndTime=2017-07-18T09:36:21 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xie AllocNode:Sid=holynx01:58732
   ReqNodeList=(null) ExcNodeList=(null) NodeList=xie01 BatchHost=xie01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=xie01 CPU_IDs=37 Mem=8000 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./assign.sh processed/1CDX1-285/impute 10 processed/1CDX1-285/clean.txt processed/1CDX1-285/assign
   WorkDir=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC
   StdErr=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-285-10-24116603.err
   StdIn=/dev/null
   StdOut=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-285-10-24116603.out
   Power=
   BatchScript=
#!/bin/bash
#SBATCH -n 1
#SBATCH -p general
#SBATCH --mem-per-cpu=8000
#SBATCH -t 7-00:00
prefix="$1"
replicate_name="$2"
raw_contact_file="$3"
output_prefix="$4"
module load python
source activate hic
stdbuf -o0 -e0 ~/research3/nuc_dynamics/nuc_dynamics ${prefix}.ncc -o ${prefix}.${replicate_name}.3dg -temps 50 -iso 10000 -ran ${replicate_name}
python ~/research/dianti-code/dianti-meta/clean_3dg.py ${prefix}.${replicate_name}.3dg ${prefix}.txt ${prefix}.${replicate_name}c.3dg
python ~/research/dianti-code/dianti-meta/assign.py ${raw_contact_file} ${prefix}.${replicate_name}.3dg ${output_prefix}.${replicate_name}.txt

show:

[root@xie01 log]# ps aux | grep 24116603
root     23054  0.0  0.0 310956  3880 ?      Sl   09:36   0:00 slurmstepd: [24116603]
root     23059  0.0  0.0 293868  3364 ?      Sl   09:36   0:00 slurmstepd: [24116603.4294967295]
ltan     23099  0.0  0.0 106224  1384 ?      S    09:36   0:00 /bin/bash /var/slurmd/spool/slurmd/job24116603/slurm_script processed/1CDX1-285/impute 10 processed/1CDX1-285/clean.txt processed/1CDX1-285/assign
root     28919  0.0  0.0 103248   856 pts/6  S+   10:37   0:00 grep 24116603

but others, if I grep for their jobid, show:

[root@xie01 log]# ps aux | grep 24116633
root     24325  0.0  0.0 293868  3364 ?      Sl   09:36   0:00 slurmstepd: [24116633.4294967295]
root     28795  0.0  0.0 103244   856 pts/6  S+   10:36   0:00 grep 24116633

even though they should have started:

[root@xie01 log]# scontrol -dd show job 24116633
JobId=24116633 JobName=assign.sh
   UserId=ltan(57234) GroupId=xie_lab(402013) MCS_label=N/A
   Priority=7399128 Nice=0 Account=xie_lab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=01:01:49 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2017-07-11T09:36:18 EligibleTime=2017-07-11T09:36:18
   StartTime=2017-07-11T09:36:29 EndTime=2017-07-18T09:36:29 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=xie AllocNode:Sid=holynx01:58732
   ReqNodeList=(null) ExcNodeList=(null) NodeList=xie01 BatchHost=xie01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=8000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=xie01 CPU_IDs=19 Mem=8000 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=8000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=./assign.sh processed/1CDX1-413/impute 10 processed/1CDX1-413/clean.txt processed/1CDX1-413/assign
   WorkDir=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC
   StdErr=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-413-10-24116633.err
   StdIn=/dev/null
   StdOut=/net/xiefs3/srv/export/xiefs3/share_root/Users/ltan/research/Nagano2017HiC/logs/assign-1CDX1-413-10-24116633.out
   Power=
   BatchScript=
#!/bin/bash
#SBATCH -n 1
#SBATCH -p general
#SBATCH --mem-per-cpu=8000
#SBATCH -t 7-00:00
prefix="$1"
replicate_name="$2"
raw_contact_file="$3"
output_prefix="$4"
module load python
source activate hic
stdbuf -o0 -e0 ~/research3/nuc_dynamics/nuc_dynamics ${prefix}.ncc -o ${prefix}.${replicate_name}.3dg -temps 50 -iso 10000 -ran ${replicate_name}
python ~/research/dianti-code/dianti-meta/clean_3dg.py ${prefix}.${replicate_name}.3dg ${prefix}.txt ${prefix}.${replicate_name}c.3dg
python ~/research/dianti-code/dianti-meta/assign.py ${raw_contact_file} ${prefix}.${replicate_name}.3dg ${output_prefix}.${replicate_name}.txt

You can see that the jobs are basically identical. The storage it's writing to is fine, has plenty of space and inodes, and is not overloaded. Any insight to this? I know you couldn't r
Sorry, pulled the trigger too early. Any insight on this? I know you couldn't reproduce it on your end last time. All told, we are getting reports across our cluster of people's jobs starting but not writing any data. We originally thought it was the filesystem, but it's too widespread for that, and it only started occurring after the Slurm upgrade we did on Monday morning.
Thinking about this more, this could be some sort of file limit, or a similar limit that the slurm user or root is hitting. Any thoughts on that?
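For example, the sort of thing I have in mind checking on an affected node looks like this (just a sketch of the idea, not something we've actually run yet):

# per-process open-file limit and current fd usage for slurmd
SLURMD_PID=$(pgrep -o -x slurmd)
grep -i 'open files' /proc/${SLURMD_PID}/limits
ls /proc/${SLURMD_PID}/fd | wc -l
# system-wide file handle usage: allocated, unused, and maximum
sysctl fs.file-nr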
Do you have any of these currently stuck jobs still? I'd love to see what that stepd process is currently doing - attaching to it with 'gdb -p <pid>', and running 'thread apply all bt full' would help considerably.
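If it's easier to capture, the same backtrace can be grabbed non-interactively; something along these lines should work (standard gdb batch options, untested on your exact host):

gdb -p <pid> --batch -ex 'set pagination off' -ex 'thread apply all bt full' > stepd-bt.txt 2>&1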
[root@xie01 ~]# ps aux | grep 24116633
root     24325  0.0  0.0 293868  3364 ?      Sl   09:36   0:00 slurmstepd: [24116633.4294967295]
root     30404  0.0  0.0 103248   856 pts/6  S+   11:16   0:00 grep 24116633

[root@xie01 ~]# gdb -p 24325
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6_4.1)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 24325
Reading symbols from /usr/sbin/slurmstepd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpam.so.0
Reading symbols from /lib64/libpam_misc.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpam_misc.so.0
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 24332]
[New LWP 24330]
[New LWP 24329]
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libpci.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libpci.so.3
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libaudit.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libaudit.so.1
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /usr/lib64/libfreebl3.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libfreebl3.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_sss.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_sss.so.2
Reading symbols from /usr/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/lib64/slurm/select_cons_res.so
Reading symbols from /usr/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/lib64/slurm/switch_none.so
Reading symbols from /usr/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/lib64/slurm/gres_gpu.so
Reading symbols from /usr/lib64/slurm/acct_gather_profile_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_profile_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_energy_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_energy_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_infiniband_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_infiniband_none.so
Reading symbols from /usr/lib64/slurm/acct_gather_filesystem_none.so...done.
Loaded symbols for /usr/lib64/slurm/acct_gather_filesystem_none.so
Reading symbols from /usr/lib64/slurm/jobacct_gather_linux.so...done.
Loaded symbols for /usr/lib64/slurm/jobacct_gather_linux.so
Reading symbols from /usr/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/lib64/slurm/core_spec_none.so
Reading symbols from /usr/lib64/slurm/task_affinity.so...done.
Loaded symbols for /usr/lib64/slurm/task_affinity.so
Reading symbols from /usr/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/lib64/slurm/task_cgroup.so
Reading symbols from /usr/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/lib64/slurm/crypto_munge.so
Reading symbols from /usr/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/lib64/slurm/job_container_none.so
Reading symbols from /usr/lib64/slurm/mpi_none.so...done.
Loaded symbols for /usr/lib64/slurm/mpi_none.so
Reading symbols from /usr/lib64/slurm/x11.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/slurm/x11.so
Reading symbols from /usr/lib64/slurm/libspunnel.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/slurm/libspunnel.so
0x00000030ab2ac6ca in wait4 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install slurm-17.02.6-1fasrc01.el6.x86_64

(gdb) thread apply all bt full

Thread 4 (Thread 0x2ba03d5e3700 (LWP 24329)):
#0  0x00000030aba0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000005436af in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:244
        err = 0
        type = 1
        __func__ = "_watch_tasks"
#2  0x00000030aba079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00000030ab2e88fd in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x2ba03d6e4700 (LWP 24330)):
#0  0x00000030ab2aca3d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000030ab2e1be4 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000547894 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:189
        i = 4
        now = 1499786227
        diff = 1499786227
        __func__ = "_timer_thread"
        tv1 = {tv_sec = 1499786227, tv_usec = 968531}
        tv2 = {tv_sec = 1499786227, tv_usec = 968532}
        tv_str = "usec=1", '\000' <repeats 13 times>
        delta_t = 1
#3  0x00000030aba079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00000030ab2e88fd in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x2ba03d7e5700 (LWP 24332)):
#0  0x00000030ab2df0d3 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000000000453816 in _poll_internal (pfds=0x2ba0400008d0, nfds=2, shutdown_time=0) at eio.c:362
        n = 16777217
        timeout = -1
#2  0x00000000004535ec in eio_handle_mainloop (eio=0xdce5d0) at eio.c:326
        retval = 0
        pollfds = 0x2ba0400008d0
        map = 0x2ba040000900
        maxnfds = 1
        nfds = 2
        n = 1
        shutdown_time = 0
        __func__ = "eio_handle_mainloop"
#3  0x000000000043dd79 in _msg_thr_internal (job_arg=0xdc7440) at req.c:242
        job = 0xdc7440
#4  0x00000030aba079d1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x00000030ab2e88fd in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x2ba03b788860 (LWP 24325)):
#0  0x00000030ab2ac6ca in wait4 () from /lib64/libc.so.6
No symbol table info available.
#1  0x000000000042fefd in _spawn_job_container (job=0xdc7440) at mgr.c:1059
        jobacct = 0x0
        rusage = {ru_utime = {tv_sec = 140728798717296, tv_usec = 14482240}, ru_stime = {tv_sec = 140728798716720, tv_usec = 4526872}, ru_maxrss = 1, ru_ixrss = 8490230596, ru_idrss = 140728798716740, ru_isrss = 47967235126592, ru_minflt = 0, ru_majflt = 14487952, ru_nswap = 140728798716752, ru_inblock = 47967233011193, ru_oublock = 104475079499525, ru_msgsnd = 24325, ru_msgrcv = 140728798716816, ru_nsignals = 47967233012171, ru_nvcsw = 0, ru_nivcsw = 14447680}
        jobacct_id = {taskid = 0, nodeid = 0, job = 0xdc7440}
        status = 0
        pid = 24348
        rc = 0
        __func__ = "_spawn_job_container"
#2  0x0000000000430246 in job_manager (job=0xdc7440) at mgr.c:1167
        rc = 0
        io_initialized = false
        ckpt_type = 0xdcd1d0 "checkpoint/none"
        err_msg = 0x0
        __func__ = "job_manager"
#3  0x000000000042b89c in main (argc=1, argv=0x7ffdfa0ea578) at slurmstepd.c:183
        cli = 0xdd9870
        self = 0xde9b60
        msg = 0xdc77a0
        job = 0xdc7440
        ngids = 16
        gids = 0xdec910
        rc = 0
        launch_params = 0x0
        __func__ = "main"
Another data point for this. I just launched a series of data transfers this morning but only a few are actually running:

[root@holyitc01 liu]# /usr/bin/squeue -u root
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          24160047   general fscull-r     root  R       3:13      7 holy2a[03202-03208]
          24160015   general runscrip     root  R       4:16      1 holyseas02
          24160016   general runscrip     root  R       4:16      1 holyseas02
          24160008   general runscrip     root  R       4:47      1 holyseas02
          24160009   general runscrip     root  R       4:47      1 holyseas02
          24160010   general runscrip     root  R       4:47      1 holyseas01
          24160011   general runscrip     root  R       4:47      1 holyseas01
          24160012   general runscrip     root  R       4:47      1 holyseas01
          24160013   general runscrip     root  R       4:47      1 holy2a13108
          24160014   general runscrip     root  R       4:47      1 holy2a13108

[root@holyitc01 liu]# ls -ltr
total 68
-rwxr-xr-x 1 root root   211 Jul 10 11:03 migrate.sh
-rw-r--r-- 1 root root   195 Jul 12 09:48 runscript-migrate
-rw-r--r-- 1 root root   122 Jul 12 09:49 slurm-24160007.out
-rw-r--r-- 1 root root   121 Jul 12 09:49 slurm-24160006.out
-rw-r--r-- 1 root root 10343 Jul 12 09:53 slurm-24160015.out
-rw-r--r-- 1 root root 10303 Jul 12 09:53 slurm-24160016.out
-rw-r--r-- 1 root root 13339 Jul 12 09:53 slurm-24160011.out
-rw-r--r-- 1 root root 12107 Jul 12 09:53 slurm-24160014.out

It looks like if you have multiple jobs per node some of the jobs stall out. The job script I'm running is here:

[root@holyitc01 liu]# cat runscript-migrate
#!/bin/bash
#SBATCH -n 1
#SBATCH -p general
#SBATCH -t 7-00:00:00
#SBATCH --mem=4000
echo $FOLDER
srun rsync -av --progress /n/regal/liu/$FOLDER/ /n/holylfs/ATTIC/2017-07-10-regal-liu/$FOLDER/

Normally I wouldn't use srun to launch rsync but I thought maybe that might solve the problem. Clearly though that did not. This is an ongoing problem for us so any insight or ideas are appreciated.
One of our users, Longzhi Tan, found the following: "I notice that doing "scontrol requeue <job id>" on each "stuck" job will lead to successful running (at least for the 17 jobs that I tested). So this is a temporary solution; but still I hope the bug will be fixed soon." I tested this myself and it does work. Hopefully that will give some hint as to a proper solution for this. In the meantime we are going to notify our users of this temporary fix.
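If it's useful, the workaround could be scripted per node along these lines. This is purely an illustrative sketch (it has to run on the compute node itself since it checks the local process table, and the "stuck" test is simply "no slurm_script process exists for the job"), not something we've put into production:

#!/bin/bash
# Requeue RUNNING batch jobs on this node whose batch script never actually started.
NODE=$(hostname -s)
for jobid in $(squeue -h -t RUNNING -w "$NODE" -o %i); do
    # a healthy batch job has a .../job<jobid>/slurm_script process on the node
    if ! pgrep -f "job${jobid}/slurm_script" > /dev/null; then
        echo "requeueing apparently stuck job ${jobid}"
        scontrol requeue "${jobid}"
    fi
done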
Just curious whether there is any update on this. Do you need any more data from my end?
I have a few loose theories, but nothing concrete unfortunately. Do you happen to have any SPANK plugins enabled?
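Anything listed in your plugstack.conf would count. As a generic illustration of the format only (the plugin path here is hypothetical):

# format: "required" or "optional", then the SPANK plugin path, then any plugin arguments
optional /usr/lib64/slurm/some_spank_plugin.so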
Yes, we have 2:

https://github.com/hautreux/slurm-spank-x11
https://github.com/harvardinformatics/spunnel

-Paul Edmon-
Actually, I think SPANK isn't implicated here. The backtrace from the stepd looks fine; that's all that the extern step should be doing here. I'm realizing that this is likely a duplicate of bug 3977, which Stanford reported immediately before this. Do you mind if we tag this as a duplicate and keep track of progress over there?
Sure, that's perfectly fine. It's good to see that we aren't the only ones who saw this. We were afraid that something super unique in our infrastructure was causing it, as we figured a general problem would have been caught in testing.

-Paul Edmon-
It's unusual; we're still tracking down the root cause, unfortunately, but it does not appear that you're alone on this one. I'm combing through the changes since 17.02.5 to see if I can isolate a potential cause. Both you and Stanford are seeing this on 17.02.6, which helps narrow down when it may have started, and I would have expected to see it long ago if it were an issue in older maintenance releases.

*** This ticket has been marked as a duplicate of ticket 3977 ***