Ticket 942

Summary: slurm_epilog failing to run leaves job in R state
Product: Slurm Reporter: Stuart Midgley <stuartm>
Component: slurmdAssignee: David Bigagli <david>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: da, dylanj
Version: 14.03.4   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 14.03.6 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Stuart Midgley 2014-07-06 13:35:23 MDT
We are at 487bfd4785fdb7841f22edc57bb0f6eb3e004665 + DUG patches


We have a job stuck in R state for 14+ hours...


20140707083004 bud30:Downloads> squeue -j 4007996
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON JOBID
teambm      900        dp_RMO                 bjornm  R   14:26:35      1 clus553         4007996_1502
teambm      900        dp_RMO                 bjornm  R   14:26:35      1 clus553         4007996_1503
teambm      900        dp_RMO                 bjornm  R   14:26:35      1 clus553         4007996_1504
teambm      900        dp_RMO                 bjornm  R   14:26:35      1 clus553         4007996_1505
teambm      900        dp_RMO                 bjornm  R   14:26:35      1 clus553         4007996_1506

20140707083014 bud30:Downloads> scontrol show job=4007996_1502
JobId=4008001 ArrayJobId=4007996 ArrayTaskId=1502 Name=dp_RMO
   UserId=bjornm(1007) GroupId=teambm(2102)
   Priority=900 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=14:26:44 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2014-07-06T17:50:33 EligibleTime=2014-07-06T17:50:43
   StartTime=2014-07-06T18:03:34 EndTime=2014-07-06T19:03:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=teambm AllocNode:Sid=pud33:39960
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=clus553
   BatchHost=clus553
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryNode=16048M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=1 Contiguous=0 Licenses=(null) Network=(null)
   Command=/p5/apache/harPr_117/seiTimeProc/prod/000scratch/04_joy_radon_RMO_TRIMS2_gathers.dugio/workflow_2014.07.06_12.15.07_WST/rj.dp_RMO.JHLIc4
   WorkDir=/p5/apache/harPr_117/seiTimeProc/prod/000scratch/04_joy_radon_RMO_TRIMS2_gathers.dugio/workflow_2014.07.06_12.15.07_WST
   Comment=/p5/apache/harPr_117/seiTimeProc/prod/000scratch/04_joy_radon_RMO_TRIMS2_gathers.dugio/workflow_2014.07.06_12.15.07_WST/production.job
   StdErr=/p5/apache/harPr_117/seiTimeProc/prod/000scratch/04_joy_radon_RMO_TRIMS2_gathers.dugio/workflow_2014.07.06_12.15.07_WST/logs_cluster/dp_RMO/dp_RMO.o4007996.1502
   StdIn=/dev/null
   StdOut=/p5/apache/harPr_117/seiTimeProc/prod/000scratch/04_joy_radon_RMO_TRIMS2_gathers.dugio/workflow_2014.07.06_12.15.07_WST/logs_cluster/dp_RMO/dp_RMO.o4007996.1502


From the slurmctld

[2014-07-06T13:51:24.291] _slurm_rpc_submit_batch_job JobId=4007996 usec=4983
[2014-07-06T13:53:52.240] sched: update_job: setting priority to 900 for job_id 4007996
[2014-07-06T13:53:52.242] _slurm_rpc_update_job complete JobId=4007996 uid=1007 usec=2000
[2014-07-06T13:53:52.401] sched: update_job: setting priority to 900 for job_id 4008001
[2014-07-06T13:53:52.403] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=1929
[2014-07-06T15:17:03.477] sched: Allocate JobId=4007996 NodeList=clus555 #CPUs=1
[2014-07-06T15:18:24.041] sched: Allocate JobId=4008001 NodeList=clus553 #CPUs=1
[2014-07-06T17:49:03.043] completing job 4007996 status 0
[2014-07-06T17:49:03.064] sched: job_complete for JobId=4007996 successful, exit code=0
[2014-07-06T17:50:33.186] requeue job 4008001 due to failure of node clus553
[2014-07-06T17:50:33.205] requeue batch job 4008001
[2014-07-06T18:03:32.268] error: Registered PENDING job 4008001.4294967294 on node clus553
[2014-07-06T18:03:34.548] completing job 4008001 status 15
[2014-07-06T18:03:34.548] Job 4008001 cancelled from interactive user
[2014-07-06T18:03:34.791] sched: Allocate JobId=4008001 NodeList=clus553 #CPUs=1
[2014-07-06T18:15:33.319] error: Registered job 4008001.4294967294 on wrong node clus553



From the cluster node

2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: task_p_pre_launch: 4008001.4294967294, task 0
2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: [job 4008001] attempting to run slurm task_prolog [/d/sw/slurm/etc/slurm_task_prolog.sh]
2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: export name:USP_PRINT_PREFIX:val:DUG_4008001_:
2014-07-06T18:04:00+08:00 clus553 slurmstepd[59735]: export name:TMPDIR:val:/localData/queue-teambm/job4008001.rjNM:
2014-07-06T18:04:00+08:00 clus553 slurmstepd[59735]: export name:CWP_TMPDIR:val:/localData/queue-teambm/job4008001.rjNM:
2014-07-06T18:04:00+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:1728 limit:16433152 KB
2014-07-06T18:04:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:554148 limit:16433152 KB
2014-07-06T18:04:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:1431984 limit:16433152 KB
2014-07-06T18:05:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:2431792 limit:16433152 KB
2014-07-06T18:05:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:3604300 limit:16433152 KB
2014-07-06T18:06:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4054328 limit:16433152 KB
2014-07-06T18:06:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4531008 limit:16433152 KB
2014-07-06T18:07:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4565540 limit:16433152 KB
2014-07-06T18:07:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4939828 limit:16433152 KB
2014-07-06T18:08:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:5171604 limit:16433152 KB
2014-07-06T18:08:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:5481316 limit:16433152 KB
2014-07-06T18:09:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:6218312 limit:16433152 KB
2014-07-06T18:09:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:6429708 limit:16433152 KB
2014-07-06T18:10:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:6725540 limit:16433152 KB
2014-07-06T18:10:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7015368 limit:16433152 KB
2014-07-06T18:11:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7248472 limit:16433152 KB
2014-07-06T18:11:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7249768 limit:16433152 KB
2014-07-06T18:12:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7397772 limit:16433152 KB
2014-07-06T18:12:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7465652 limit:16433152 KB
2014-07-06T18:13:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7614284 limit:16433152 KB
2014-07-06T18:13:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7723384 limit:16433152 KB
2014-07-06T18:14:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7764312 limit:16433152 KB
2014-07-06T18:14:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7809788 limit:16433152 KB
2014-07-06T18:15:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7845108 limit:16433152 KB
2014-07-06T18:15:33+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T18:15:33+08:00 clus553 slurmd[27527]: debug:  task_p_slurmd_release_resources: 4008001
2014-07-06T18:15:33+08:00 clus553 slurmd[27527]: debug:  credential for job 4008001 revoked
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: _handle_signal_container for step=4008001.4294967294 uid=0 signal=997
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: Sent signal 9 to 4008001.4294967294
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:0 limit:16433152 KB
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: task_p_post_term: 4008001.4294967294, task 0
2014-07-06T18:15:34+08:00 clus553 slurmstepd[56232]: step 4008001.4294967294 abort completed
2014-07-06T18:15:34+08:00 clus553 slurmd[27527]: debug:  Waiting for job 4008001's prolog to complete
2014-07-06T18:15:34+08:00 clus553 slurmd[27527]: debug:  Finished wait for job 4008001's prolog to complete
2014-07-06T18:15:36+08:00 clus553 spank-epilog[16459]: Running spank/epilog for jobid [4008001] uid [0]
2014-07-06T18:15:36+08:00 clus553 slurmd[27527]: debug:  [job 4008001] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]

which was the last mention of job 4008001 on clus553.

From our slurm.conf

Epilog=/d/sw/slurm/etc/slurm_epilog.sh


Now, a few questions...

Why isn't the job in CG state?  The node had nothing running on it (no slurmstepd's, no user processes, only slurmd).  So the jobs were completely gone.

Why wasn't the slurm_epilog tried again?  What is the  result of the epilog failing?  The job was just left hanging in R state and their is no evidence that it was ever going to recover.  

We rebooted the node and the jobs were still in R state... we ended up scontrol requeue the job's to get them back on.
Comment 1 David Bigagli 2014-07-07 11:23:02 MDT
Let me try to answer your question and tell you what I see from the logs.

1) Why the job was not in CG state.

Since the controller thinks the job is still running while there is neither
slurmstepd nor the job itself on the execution hosts, this may indicate
that at some point during the job ran we lost the slurmstepd.
This is easy to arrange, submit a job like sleep 30 and then kill -SEGV the
slurmstepd, then the job will run until you force the controller to operate
on it as you did by requeueing it.

2) If the epilog that is ran by the slurmd at the end of the job
fails then the node is set to drain by the controller.

So it seems the slurmstepd failed somehow. Is there any core file in the log directory?
Also can you send us the output of sacct for the job 4008001:

sacct -D  --format="JobID,JobName,Partition,Account,AllocCPUS,
State,Submit,Start,Elapsed,ExitCode,NodeList" -j 4008001

also from the syslog it would be helpful to see all records related to jobid 4008001.

David
Comment 2 Stuart Midgley 2014-07-07 13:51:02 MDT
sacct -D  --format="JobID,JobName,Partition,Account,AllocCPUS,State,Submit,Start,Elapsed,ExitCode,NodeList" -j 4008001
       JobID    JobName  Partition    Account  AllocCPUS      State              Submit               Start    Elapsed ExitCode        NodeList 
------------ ---------- ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- -------- --------------- 
4008001          dp_RMO     teambm                     1  COMPLETED 2014-07-06T17:50:33 2014-07-07T09:11:25   00:46:27      0:0         clus553 
4008001.bat+      batch                                1  COMPLETED 2014-07-06T18:03:04 2014-07-06T18:03:04   15:54:48      0:0         clus553
Comment 3 Stuart Midgley 2014-07-07 13:57:38 MDT
(In reply to David Bigagli from comment #1)
> So it seems the slurmstepd failed somehow. Is there any core file in the log
> directory?


No core files.

No OOM's in syslog.

From clus553

2014-07-06T15:18:24+08:00 clus553 slurmd[27527]: debug:  task_p_slurmd_batch_request: 4008001
2014-07-06T15:18:24+08:00 clus553 spank-prolog[36849]: Running spank/prolog for jobid [4008001] uid [1007]
2014-07-06T15:18:24+08:00 clus553 slurmd[27527]: debug:  [job 4008001] attempting to run prolog [/d/sw/slurm/etc/slurm_prolog.sh]
2014-07-06T15:18:27+08:00 clus553 slurmd[27527]: Launching batch job 4008001 for UID 1007
2014-07-06T15:18:27+08:00 clus553 slurmstepd[38318]: task_p_pre_launch_priv: 4008001.4294967294
2014-07-06T15:18:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:1556 limit:16433152 KB
2014-07-06T15:18:27+08:00 clus553 slurmstepd[38318]: task_p_pre_launch: 4008001.4294967294, task 0
2014-07-06T15:18:27+08:00 clus553 slurmstepd[38318]: [job 4008001] attempting to run slurm task_prolog [/d/sw/slurm/etc/slurm_task_prolog.sh]
2014-07-06T15:18:27+08:00 clus553 slurmstepd[38318]: export name:USP_PRINT_PREFIX:val:DUG_4008001_:
2014-07-06T15:18:27+08:00 clus553 slurmstepd[38318]: export name:TMPDIR:val:/localData/queue-teambm/job4008001.6Q9N:
2014-07-06T15:18:27+08:00 clus553 slurmstepd[38318]: export name:CWP_TMPDIR:val:/localData/queue-teambm/job4008001.6Q9N:
2014-07-06T15:18:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:3240 limit:16433152 KB
2014-07-06T15:18:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:1298068 limit:16433152 KB
2014-07-06T15:19:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:3873712 limit:16433152 KB
2014-07-06T15:19:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:3858848 limit:16433152 KB
2014-07-06T15:20:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:3858680 limit:16433152 KB
2014-07-06T15:20:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:3870932 limit:16433152 KB
2014-07-06T15:21:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:3871816 limit:16433152 KB
2014-07-06T15:21:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4005596 limit:16433152 KB
2014-07-06T15:22:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4040408 limit:16433152 KB
2014-07-06T15:22:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4058884 limit:16433152 KB
2014-07-06T15:23:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4089660 limit:16433152 KB
2014-07-06T15:23:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4089656 limit:16433152 KB
2014-07-06T15:24:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4099896 limit:16433152 KB
2014-07-06T15:24:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168792 limit:16433152 KB
2014-07-06T15:25:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168792 limit:16433152 KB
2014-07-06T15:25:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168616 limit:16433152 KB
2014-07-06T15:26:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168636 limit:16433152 KB
2014-07-06T15:26:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168616 limit:16433152 KB
2014-07-06T15:27:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168620 limit:16433152 KB
2014-07-06T15:27:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4168592 limit:16433152 KB
2014-07-06T15:28:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174368 limit:16433152 KB
2014-07-06T15:28:54+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T15:28:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174228 limit:16433152 KB
2014-07-06T15:29:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174160 limit:16433152 KB
2014-07-06T15:29:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174160 limit:16433152 KB
2014-07-06T15:30:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174160 limit:16433152 KB
2014-07-06T15:30:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174096 limit:16433152 KB
2014-07-06T15:31:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174096 limit:16433152 KB
2014-07-06T15:31:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4174096 limit:16433152 KB
2014-07-06T15:32:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4188672 limit:16433152 KB
2014-07-06T15:32:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4196572 limit:16433152 KB
2014-07-06T15:33:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4223072 limit:16433152 KB
2014-07-06T15:33:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4269524 limit:16433152 KB
2014-07-06T15:34:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4287948 limit:16433152 KB
2014-07-06T15:34:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296160 limit:16433152 KB
2014-07-06T15:35:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296160 limit:16433152 KB
2014-07-06T15:35:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296172 limit:16433152 KB
2014-07-06T15:36:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296172 limit:16433152 KB
2014-07-06T15:36:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296112 limit:16433152 KB
2014-07-06T15:37:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296056 limit:16433152 KB
2014-07-06T15:37:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4296108 limit:16433152 KB
2014-07-06T15:38:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4303796 limit:16433152 KB
2014-07-06T15:38:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306556 limit:16433152 KB
2014-07-06T15:39:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:39:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:40:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:40:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:41:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:41:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:42:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:42:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:43:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:43:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:44:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:44:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:45:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306540 limit:16433152 KB
2014-07-06T15:45:57+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306408 limit:16433152 KB
2014-07-06T15:46:27+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4306468 limit:16433152 KB
2014-07-06T15:46:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308440 limit:16433152 KB
2014-07-06T15:47:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:47:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:48:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:48:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:49:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:49:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:50:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:50:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:51:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:51:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:52:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:52:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:53:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:53:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:54:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:54:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:55:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:55:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:56:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:56:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:57:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:57:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:58:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:58:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:59:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T15:59:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T16:00:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T16:00:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308468 limit:16433152 KB
2014-07-06T16:01:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4308384 limit:16433152 KB
2014-07-06T16:01:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4790748 limit:16433152 KB
2014-07-06T16:02:13+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T16:02:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:4985956 limit:16433152 KB
2014-07-06T16:02:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:5072032 limit:16433152 KB
2014-07-06T16:03:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:5425220 limit:16433152 KB
2014-07-06T16:03:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:6377516 limit:16433152 KB
2014-07-06T16:04:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:7515748 limit:16433152 KB
2014-07-06T16:04:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:8413096 limit:16433152 KB
2014-07-06T16:05:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:9651252 limit:16433152 KB
2014-07-06T16:05:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:9808220 limit:16433152 KB
2014-07-06T16:06:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:9860112 limit:16433152 KB
2014-07-06T16:06:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10257376 limit:16433152 KB
2014-07-06T16:07:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10433076 limit:16433152 KB
2014-07-06T16:07:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10471844 limit:16433152 KB
2014-07-06T16:08:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473388 limit:16433152 KB
2014-07-06T16:08:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473328 limit:16433152 KB
2014-07-06T16:09:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473316 limit:16433152 KB
2014-07-06T16:09:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473324 limit:16433152 KB
2014-07-06T16:10:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473208 limit:16433152 KB
2014-07-06T16:10:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473164 limit:16433152 KB
2014-07-06T16:11:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10473016 limit:16433152 KB
2014-07-06T16:11:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472560 limit:16433152 KB
2014-07-06T16:12:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472528 limit:16433152 KB
2014-07-06T16:12:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472540 limit:16433152 KB
2014-07-06T16:13:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472476 limit:16433152 KB
2014-07-06T16:13:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472476 limit:16433152 KB
2014-07-06T16:14:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472480 limit:16433152 KB
2014-07-06T16:14:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10472480 limit:16433152 KB
2014-07-06T16:15:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474612 limit:16433152 KB
2014-07-06T16:15:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474420 limit:16433152 KB
2014-07-06T16:16:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474216 limit:16433152 KB
2014-07-06T16:16:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474220 limit:16433152 KB
2014-07-06T16:17:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474232 limit:16433152 KB
2014-07-06T16:17:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474232 limit:16433152 KB
2014-07-06T16:18:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474220 limit:16433152 KB
2014-07-06T16:18:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474204 limit:16433152 KB
2014-07-06T16:19:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474204 limit:16433152 KB
2014-07-06T16:19:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10474116 limit:16433152 KB
2014-07-06T16:20:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477072 limit:16433152 KB
2014-07-06T16:20:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477028 limit:16433152 KB
2014-07-06T16:21:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477016 limit:16433152 KB
2014-07-06T16:21:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477184 limit:16433152 KB
2014-07-06T16:22:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10476912 limit:16433152 KB
2014-07-06T16:22:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10476904 limit:16433152 KB
2014-07-06T16:23:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477412 limit:16433152 KB
2014-07-06T16:23:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477408 limit:16433152 KB
2014-07-06T16:24:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477408 limit:16433152 KB
2014-07-06T16:24:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477416 limit:16433152 KB
2014-07-06T16:25:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477416 limit:16433152 KB
2014-07-06T16:25:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477416 limit:16433152 KB
2014-07-06T16:26:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10477412 limit:16433152 KB
2014-07-06T16:26:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514152 limit:16433152 KB
2014-07-06T16:27:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514152 limit:16433152 KB
2014-07-06T16:27:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514160 limit:16433152 KB
2014-07-06T16:28:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514160 limit:16433152 KB
2014-07-06T16:28:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514152 limit:16433152 KB
2014-07-06T16:29:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514160 limit:16433152 KB
2014-07-06T16:29:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514152 limit:16433152 KB
2014-07-06T16:30:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514152 limit:16433152 KB
2014-07-06T16:30:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10514160 limit:16433152 KB
2014-07-06T16:31:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513968 limit:16433152 KB
2014-07-06T16:31:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513968 limit:16433152 KB
2014-07-06T16:32:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513972 limit:16433152 KB
2014-07-06T16:32:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513972 limit:16433152 KB
2014-07-06T16:33:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513924 limit:16433152 KB
2014-07-06T16:33:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513932 limit:16433152 KB
2014-07-06T16:34:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513924 limit:16433152 KB
2014-07-06T16:34:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513924 limit:16433152 KB
2014-07-06T16:35:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513892 limit:16433152 KB
2014-07-06T16:35:34+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T16:35:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513920 limit:16433152 KB
2014-07-06T16:36:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513888 limit:16433152 KB
2014-07-06T16:36:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513860 limit:16433152 KB
2014-07-06T16:37:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513848 limit:16433152 KB
2014-07-06T16:37:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513560 limit:16433152 KB
2014-07-06T16:38:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513560 limit:16433152 KB
2014-07-06T16:38:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513480 limit:16433152 KB
2014-07-06T16:39:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10513480 limit:16433152 KB
2014-07-06T16:39:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517824 limit:16433152 KB
2014-07-06T16:40:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517620 limit:16433152 KB
2014-07-06T16:40:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517588 limit:16433152 KB
2014-07-06T16:41:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517596 limit:16433152 KB
2014-07-06T16:41:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517588 limit:16433152 KB
2014-07-06T16:42:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517584 limit:16433152 KB
2014-07-06T16:42:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517580 limit:16433152 KB
2014-07-06T16:43:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517584 limit:16433152 KB
2014-07-06T16:43:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517584 limit:16433152 KB
2014-07-06T16:44:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10517584 limit:16433152 KB
2014-07-06T16:44:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10528020 limit:16433152 KB
2014-07-06T16:45:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10528012 limit:16433152 KB
2014-07-06T16:45:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10527864 limit:16433152 KB
2014-07-06T16:46:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10527552 limit:16433152 KB
2014-07-06T16:46:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10527532 limit:16433152 KB
2014-07-06T16:47:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10527532 limit:16433152 KB
2014-07-06T16:47:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10527532 limit:16433152 KB
2014-07-06T16:48:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10561280 limit:16433152 KB
2014-07-06T16:48:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10561216 limit:16433152 KB
2014-07-06T16:49:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10561216 limit:16433152 KB
2014-07-06T16:49:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10561216 limit:16433152 KB
2014-07-06T16:50:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10560996 limit:16433152 KB
2014-07-06T16:50:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10560948 limit:16433152 KB
2014-07-06T16:51:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10560804 limit:16433152 KB
2014-07-06T16:51:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10560692 limit:16433152 KB
2014-07-06T16:52:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559288 limit:16433152 KB
2014-07-06T16:52:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559288 limit:16433152 KB
2014-07-06T16:53:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559296 limit:16433152 KB
2014-07-06T16:53:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559296 limit:16433152 KB
2014-07-06T16:54:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559296 limit:16433152 KB
2014-07-06T16:54:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559296 limit:16433152 KB
2014-07-06T16:55:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10559296 limit:16433152 KB
2014-07-06T16:55:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558540 limit:16433152 KB
2014-07-06T16:56:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558224 limit:16433152 KB
2014-07-06T16:56:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558224 limit:16433152 KB
2014-07-06T16:57:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558224 limit:16433152 KB
2014-07-06T16:57:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558224 limit:16433152 KB
2014-07-06T16:58:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558224 limit:16433152 KB
2014-07-06T16:58:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558216 limit:16433152 KB
2014-07-06T16:59:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558184 limit:16433152 KB
2014-07-06T16:59:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:00:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:00:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:01:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:01:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:02:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:02:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:03:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:03:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:04:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:04:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:05:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:05:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558188 limit:16433152 KB
2014-07-06T17:06:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558152 limit:16433152 KB
2014-07-06T17:06:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558152 limit:16433152 KB
2014-07-06T17:07:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558152 limit:16433152 KB
2014-07-06T17:07:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558096 limit:16433152 KB
2014-07-06T17:08:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:08:54+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T17:08:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:09:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:09:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:10:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:10:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:11:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:11:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:12:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:12:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:13:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558108 limit:16433152 KB
2014-07-06T17:13:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558064 limit:16433152 KB
2014-07-06T17:14:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558064 limit:16433152 KB
2014-07-06T17:14:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558064 limit:16433152 KB
2014-07-06T17:15:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558064 limit:16433152 KB
2014-07-06T17:15:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10558064 limit:16433152 KB
2014-07-06T17:16:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10566864 limit:16433152 KB
2014-07-06T17:16:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10566864 limit:16433152 KB
2014-07-06T17:17:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10764912 limit:16433152 KB
2014-07-06T17:17:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10812468 limit:16433152 KB
2014-07-06T17:18:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10855848 limit:16433152 KB
2014-07-06T17:18:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10855988 limit:16433152 KB
2014-07-06T17:19:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10856792 limit:16433152 KB
2014-07-06T17:19:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10860332 limit:16433152 KB
2014-07-06T17:20:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10861864 limit:16433152 KB
2014-07-06T17:20:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10864696 limit:16433152 KB
2014-07-06T17:21:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10867500 limit:16433152 KB
2014-07-06T17:21:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10868580 limit:16433152 KB
2014-07-06T17:22:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10868896 limit:16433152 KB
2014-07-06T17:22:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10869200 limit:16433152 KB
2014-07-06T17:23:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10869548 limit:16433152 KB
2014-07-06T17:23:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10869908 limit:16433152 KB
2014-07-06T17:24:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10870244 limit:16433152 KB
2014-07-06T17:24:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10870256 limit:16433152 KB
2014-07-06T17:25:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10870260 limit:16433152 KB
2014-07-06T17:25:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10870260 limit:16433152 KB
2014-07-06T17:26:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10891904 limit:16433152 KB
2014-07-06T17:26:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10891880 limit:16433152 KB
2014-07-06T17:27:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10891844 limit:16433152 KB
2014-07-06T17:27:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10891804 limit:16433152 KB
2014-07-06T17:28:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10893176 limit:16433152 KB
2014-07-06T17:28:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10912068 limit:16433152 KB
2014-07-06T17:29:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10908084 limit:16433152 KB
2014-07-06T17:29:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10908152 limit:16433152 KB
2014-07-06T17:30:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10908132 limit:16433152 KB
2014-07-06T17:30:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10908132 limit:16433152 KB
2014-07-06T17:31:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10908140 limit:16433152 KB
2014-07-06T17:31:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10908120 limit:16433152 KB
2014-07-06T17:32:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10927416 limit:16433152 KB
2014-07-06T17:32:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10927456 limit:16433152 KB
2014-07-06T17:33:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10927456 limit:16433152 KB
2014-07-06T17:33:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10927456 limit:16433152 KB
2014-07-06T17:34:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10927444 limit:16433152 KB
2014-07-06T17:34:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934948 limit:16433152 KB
2014-07-06T17:35:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935008 limit:16433152 KB
2014-07-06T17:35:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934864 limit:16433152 KB
2014-07-06T17:36:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934920 limit:16433152 KB
2014-07-06T17:36:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934920 limit:16433152 KB
2014-07-06T17:37:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934920 limit:16433152 KB
2014-07-06T17:37:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934920 limit:16433152 KB
2014-07-06T17:38:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934920 limit:16433152 KB
2014-07-06T17:38:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934920 limit:16433152 KB
2014-07-06T17:39:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:39:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:40:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:40:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:41:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:41:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:42:13+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T17:42:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:42:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:43:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10934800 limit:16433152 KB
2014-07-06T17:43:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935312 limit:16433152 KB
2014-07-06T17:44:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935308 limit:16433152 KB
2014-07-06T17:44:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935308 limit:16433152 KB
2014-07-06T17:45:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935304 limit:16433152 KB
2014-07-06T17:45:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935304 limit:16433152 KB
2014-07-06T17:46:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935304 limit:16433152 KB
2014-07-06T17:46:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935280 limit:16433152 KB
2014-07-06T17:47:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935280 limit:16433152 KB
2014-07-06T17:47:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10935800 limit:16433152 KB
2014-07-06T17:48:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937848 limit:16433152 KB
2014-07-06T17:48:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937848 limit:16433152 KB
2014-07-06T17:49:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937848 limit:16433152 KB
2014-07-06T17:49:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937848 limit:16433152 KB
2014-07-06T17:50:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937796 limit:16433152 KB
2014-07-06T17:50:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937796 limit:16433152 KB
2014-07-06T17:51:29+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937796 limit:16433152 KB
2014-07-06T17:51:59+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937796 limit:16433152 KB
2014-07-06T17:52:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937796 limit:16433152 KB
2014-07-06T17:52:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10937796 limit:16433152 KB
2014-07-06T17:53:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10938276 limit:16433152 KB
2014-07-06T17:53:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10939292 limit:16433152 KB
2014-07-06T17:54:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10939800 limit:16433152 KB
2014-07-06T17:54:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943204 limit:16433152 KB
2014-07-06T17:55:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943232 limit:16433152 KB
2014-07-06T17:55:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943060 limit:16433152 KB
2014-07-06T17:56:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:56:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:57:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:57:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:58:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:58:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:59:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T17:59:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:00:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:00:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:01:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:01:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:02:28+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:02:58+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:10943568 limit:16433152 KB
2014-07-06T18:03:30+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:11061704 limit:16433152 KB
2014-07-06T18:03:32+08:00 clus553 slurmd[27527]: debug:  task_p_slurmd_release_resources: 4008001
2014-07-06T18:03:32+08:00 clus553 slurmd[27527]: debug:  task_p_slurmd_release_resources: 4008001
2014-07-06T18:03:32+08:00 clus553 slurmd[27527]: debug:  job 4008001 requeued, but started no tasks
2014-07-06T18:03:33+08:00 clus553 slurmd[27527]: debug:  credential for job 4008001 revoked
2014-07-06T18:03:33+08:00 clus553 slurmstepd[37386]: _handle_signal_container for step=4008001.4294967294 uid=0 signal=998
2014-07-06T18:03:33+08:00 clus553 slurmstepd[37386]: error: *** JOB 4008001 CANCELLED AT 2014-07-06T18:03:33 DUE TO NODE FAILURE ***
2014-07-06T18:03:33+08:00 clus553 slurmstepd[37386]: _handle_signal_container for step=4008001.4294967294 uid=0 signal=18
2014-07-06T18:03:33+08:00 clus553 slurmstepd[37386]: Sent signal 18 to 4008001.4294967294
2014-07-06T18:03:33+08:00 clus553 slurmstepd[37386]: _handle_signal_container for step=4008001.4294967294 uid=0 signal=15
2014-07-06T18:03:33+08:00 clus553 slurmstepd[37386]: Sent signal 15 to 4008001.4294967294
2014-07-06T18:03:34+08:00 clus553 slurmstepd[37386]: Job 4008001 memory used:0 limit:16433152 KB
2014-07-06T18:03:34+08:00 clus553 slurmstepd[37386]: task_p_post_term: 4008001.4294967294, task 0
2014-07-06T18:03:34+08:00 clus553 slurmstepd[37386]: error: _slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_1007/job_4008001/step_batch: Device or resource busy
2014-07-06T18:03:34+08:00 clus553 slurmstepd[37386]: job 4008001 completed with slurm_rc = 0, job_rc = -2
2014-07-06T18:03:34+08:00 clus553 slurmd[27527]: reissued job credential for job 4008001
2014-07-06T18:03:34+08:00 clus553 slurmd[27527]: debug:  task_p_slurmd_batch_request: 4008001
2014-07-06T18:03:35+08:00 clus553 slurmd[27527]: debug:  Waiting for job 4008001's prolog to complete
2014-07-06T18:03:35+08:00 clus553 slurmd[27527]: debug:  Finished wait for job 4008001's prolog to complete
2014-07-06T18:03:35+08:00 clus553 slurmd[27527]: debug:  Not running epilog for jobid 4008001: File exists
2014-07-06T18:03:58+08:00 clus553 slurmd[27527]: Launching batch job 4008001 for UID 1007
2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: task_p_pre_launch_priv: 4008001.4294967294
2014-07-06T18:03:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:1548 limit:16433152 KB
2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: task_p_pre_launch: 4008001.4294967294, task 0
2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: [job 4008001] attempting to run slurm task_prolog [/d/sw/slurm/etc/slurm_task_prolog.sh]
2014-07-06T18:03:59+08:00 clus553 slurmstepd[59735]: export name:USP_PRINT_PREFIX:val:DUG_4008001_:
2014-07-06T18:04:00+08:00 clus553 slurmstepd[59735]: export name:TMPDIR:val:/localData/queue-teambm/job4008001.rjNM:
2014-07-06T18:04:00+08:00 clus553 slurmstepd[59735]: export name:CWP_TMPDIR:val:/localData/queue-teambm/job4008001.rjNM:
2014-07-06T18:04:00+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:1728 limit:16433152 KB
2014-07-06T18:04:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:554148 limit:16433152 KB
2014-07-06T18:04:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:1431984 limit:16433152 KB
2014-07-06T18:05:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:2431792 limit:16433152 KB
2014-07-06T18:05:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:3604300 limit:16433152 KB
2014-07-06T18:06:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4054328 limit:16433152 KB
2014-07-06T18:06:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4531008 limit:16433152 KB
2014-07-06T18:07:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4565540 limit:16433152 KB
2014-07-06T18:07:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:4939828 limit:16433152 KB
2014-07-06T18:08:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:5171604 limit:16433152 KB
2014-07-06T18:08:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:5481316 limit:16433152 KB
2014-07-06T18:09:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:6218312 limit:16433152 KB
2014-07-06T18:09:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:6429708 limit:16433152 KB
2014-07-06T18:10:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:6725540 limit:16433152 KB
2014-07-06T18:10:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7015368 limit:16433152 KB
2014-07-06T18:11:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7248472 limit:16433152 KB
2014-07-06T18:11:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7249768 limit:16433152 KB
2014-07-06T18:12:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7397772 limit:16433152 KB
2014-07-06T18:12:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7465652 limit:16433152 KB
2014-07-06T18:13:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7614284 limit:16433152 KB
2014-07-06T18:13:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7723384 limit:16433152 KB
2014-07-06T18:14:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7764312 limit:16433152 KB
2014-07-06T18:14:59+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7809788 limit:16433152 KB
2014-07-06T18:15:29+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:7845108 limit:16433152 KB
2014-07-06T18:15:33+08:00 clus553 slurmd[27527]: debug:  found apparently running job 4008001
2014-07-06T18:15:33+08:00 clus553 slurmd[27527]: debug:  task_p_slurmd_release_resources: 4008001
2014-07-06T18:15:33+08:00 clus553 slurmd[27527]: debug:  credential for job 4008001 revoked
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: _handle_signal_container for step=4008001.4294967294 uid=0 signal=997
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: Sent signal 9 to 4008001.4294967294
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: Job 4008001 memory used:0 limit:16433152 KB
2014-07-06T18:15:33+08:00 clus553 slurmstepd[56232]: task_p_post_term: 4008001.4294967294, task 0
2014-07-06T18:15:34+08:00 clus553 slurmstepd[56232]: step 4008001.4294967294 abort completed
2014-07-06T18:15:34+08:00 clus553 slurmd[27527]: debug:  Waiting for job 4008001's prolog to complete
2014-07-06T18:15:34+08:00 clus553 slurmd[27527]: debug:  Finished wait for job 4008001's prolog to complete
2014-07-06T18:15:36+08:00 clus553 spank-epilog[16459]: Running spank/epilog for jobid [4008001] uid [0]
2014-07-06T18:15:36+08:00 clus553 slurmd[27527]: debug:  [job 4008001] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2014-07-07T09:11:25+08:00 clus553 slurmd[3010]: debug:  task_p_slurmd_batch_request: 4008001
2014-07-07T09:11:25+08:00 clus553 spank-prolog[4331]: Running spank/prolog for jobid [4008001] uid [1007]
2014-07-07T09:11:25+08:00 clus553 slurmd[3010]: debug:  [job 4008001] attempting to run prolog [/d/sw/slurm/etc/slurm_prolog.sh]
2014-07-07T09:11:26+08:00 clus553 slurmd[3010]: Launching batch job 4008001 for UID 1007
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4382]: task_p_pre_launch_priv: 4008001.4294967294
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:1548 limit:16433152 KB
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4382]: task_p_pre_launch: 4008001.4294967294, task 0
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4382]: [job 4008001] attempting to run slurm task_prolog [/d/sw/slurm/etc/slurm_task_prolog.sh]
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4382]: export name:USP_PRINT_PREFIX:val:DUG_4008001_:
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4382]: export name:TMPDIR:val:/localData/queue-teambm/job4008001.DqEz:
2014-07-07T09:11:26+08:00 clus553 slurmstepd[4382]: export name:CWP_TMPDIR:val:/localData/queue-teambm/job4008001.DqEz:
2014-07-07T09:11:27+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8476 limit:16433152 KB
2014-07-07T09:11:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:1445952 limit:16433152 KB
2014-07-07T09:12:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:2817136 limit:16433152 KB
2014-07-07T09:12:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:4038588 limit:16433152 KB
2014-07-07T09:13:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:4234728 limit:16433152 KB
2014-07-07T09:13:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:5331308 limit:16433152 KB
2014-07-07T09:14:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:6199720 limit:16433152 KB
2014-07-07T09:14:53+08:00 clus553 slurmd[3010]: debug:  found apparently running job 4008001
2014-07-07T09:14:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:6804556 limit:16433152 KB
2014-07-07T09:15:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:7338016 limit:16433152 KB
2014-07-07T09:15:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:7577224 limit:16433152 KB
2014-07-07T09:16:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:7863016 limit:16433152 KB
2014-07-07T09:16:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8092392 limit:16433152 KB
2014-07-07T09:17:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8294124 limit:16433152 KB
2014-07-07T09:17:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8364272 limit:16433152 KB
2014-07-07T09:18:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8368428 limit:16433152 KB
2014-07-07T09:18:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8379940 limit:16433152 KB
2014-07-07T09:19:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8393528 limit:16433152 KB
2014-07-07T09:19:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8405948 limit:16433152 KB
2014-07-07T09:20:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8430140 limit:16433152 KB
2014-07-07T09:20:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8438224 limit:16433152 KB
2014-07-07T09:21:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8466312 limit:16433152 KB
2014-07-07T09:21:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8481048 limit:16433152 KB
2014-07-07T09:22:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8491392 limit:16433152 KB
2014-07-07T09:22:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8496840 limit:16433152 KB
2014-07-07T09:23:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8499952 limit:16433152 KB
2014-07-07T09:23:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8507708 limit:16433152 KB
2014-07-07T09:24:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8512108 limit:16433152 KB
2014-07-07T09:24:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8522660 limit:16433152 KB
2014-07-07T09:25:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8524452 limit:16433152 KB
2014-07-07T09:25:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8524452 limit:16433152 KB
2014-07-07T09:26:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8524452 limit:16433152 KB
2014-07-07T09:26:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8526752 limit:16433152 KB
2014-07-07T09:27:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8527392 limit:16433152 KB
2014-07-07T09:27:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8534464 limit:16433152 KB
2014-07-07T09:28:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8534464 limit:16433152 KB
2014-07-07T09:28:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8550744 limit:16433152 KB
2014-07-07T09:29:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8555520 limit:16433152 KB
2014-07-07T09:29:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8568572 limit:16433152 KB
2014-07-07T09:30:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8572560 limit:16433152 KB
2014-07-07T09:30:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8577876 limit:16433152 KB
2014-07-07T09:31:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8584660 limit:16433152 KB
2014-07-07T09:31:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8588084 limit:16433152 KB
2014-07-07T09:32:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8600160 limit:16433152 KB
2014-07-07T09:32:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8604008 limit:16433152 KB
2014-07-07T09:33:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8608388 limit:16433152 KB
2014-07-07T09:33:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8609872 limit:16433152 KB
2014-07-07T09:34:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8629596 limit:16433152 KB
2014-07-07T09:34:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8631408 limit:16433152 KB
2014-07-07T09:35:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8633236 limit:16433152 KB
2014-07-07T09:35:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8637052 limit:16433152 KB
2014-07-07T09:36:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8647532 limit:16433152 KB
2014-07-07T09:36:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8664560 limit:16433152 KB
2014-07-07T09:37:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8683284 limit:16433152 KB
2014-07-07T09:37:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8686636 limit:16433152 KB
2014-07-07T09:38:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8689376 limit:16433152 KB
2014-07-07T09:38:57+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8689980 limit:16433152 KB
2014-07-07T09:39:27+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8691668 limit:16433152 KB
2014-07-07T09:39:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8693620 limit:16433152 KB
2014-07-07T09:40:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8694476 limit:16433152 KB
2014-07-07T09:40:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8697008 limit:16433152 KB
2014-07-07T09:41:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8697716 limit:16433152 KB
2014-07-07T09:41:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8702000 limit:16433152 KB
2014-07-07T09:42:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8722960 limit:16433152 KB
2014-07-07T09:42:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8724120 limit:16433152 KB
2014-07-07T09:43:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8725236 limit:16433152 KB
2014-07-07T09:43:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8725700 limit:16433152 KB
2014-07-07T09:44:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8725904 limit:16433152 KB
2014-07-07T09:44:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8726148 limit:16433152 KB
2014-07-07T09:45:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8743872 limit:16433152 KB
2014-07-07T09:45:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8746552 limit:16433152 KB
2014-07-07T09:46:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8748224 limit:16433152 KB
2014-07-07T09:46:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8749484 limit:16433152 KB
2014-07-07T09:47:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8754704 limit:16433152 KB
2014-07-07T09:47:44+08:00 clus553 slurmd[3010]: debug:  found apparently running job 4008001
2014-07-07T09:47:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8755628 limit:16433152 KB
2014-07-07T09:48:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8756432 limit:16433152 KB
2014-07-07T09:48:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8757576 limit:16433152 KB
2014-07-07T09:49:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8758568 limit:16433152 KB
2014-07-07T09:49:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8759220 limit:16433152 KB
2014-07-07T09:50:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8759632 limit:16433152 KB
2014-07-07T09:50:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8764104 limit:16433152 KB
2014-07-07T09:51:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8765320 limit:16433152 KB
2014-07-07T09:51:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8766336 limit:16433152 KB
2014-07-07T09:52:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8766348 limit:16433152 KB
2014-07-07T09:52:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8767004 limit:16433152 KB
2014-07-07T09:53:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8767060 limit:16433152 KB
2014-07-07T09:53:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:8767732 limit:16433152 KB
2014-07-07T09:54:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:3222596 limit:16433152 KB
2014-07-07T09:54:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:3576256 limit:16433152 KB
2014-07-07T09:55:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:3795432 limit:16433152 KB
2014-07-07T09:55:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:4014760 limit:16433152 KB
2014-07-07T09:56:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:4768900 limit:16433152 KB
2014-07-07T09:56:56+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:1064484 limit:16433152 KB
2014-07-07T09:57:26+08:00 clus553 slurmstepd[4378]: Job 4008001 memory used:721268 limit:16433152 KB
2014-07-07T09:57:52+08:00 clus553 slurmstepd[4378]: task_p_post_term: 4008001.4294967294, task 0
2014-07-07T09:57:52+08:00 clus553 slurmstepd[4378]: job 4008001 completed with slurm_rc = 0, job_rc = 0
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  task_p_slurmd_release_resources: 4008001
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  credential for job 4008001 revoked
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  Waiting for job 4008001's prolog to complete
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  Finished wait for job 4008001's prolog to complete
2014-07-07T09:57:52+08:00 clus553 spank-epilog[17729]: Running spank/epilog for jobid [4008001] uid [1007]
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  [job 4008001] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  completed epilog for jobid 4008001
2014-07-07T09:57:52+08:00 clus553 slurmd[3010]: debug:  Job 4008001: sent epilog complete msg: rc = 0



and from the slurmctld

[2014-07-06T13:53:52.401] sched: update_job: setting priority to 900 for job_id 4008001
[2014-07-06T13:53:52.403] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=1929
[2014-07-06T15:18:24.041] sched: Allocate JobId=4008001 NodeList=clus553 #CPUs=1
[2014-07-06T17:50:33.186] requeue job 4008001 due to failure of node clus553
[2014-07-06T17:50:33.205] requeue batch job 4008001
[2014-07-06T18:03:32.268] error: Registered PENDING job 4008001.4294967294 on node clus553 
[2014-07-06T18:03:34.548] completing job 4008001 status 15
[2014-07-06T18:03:34.548] Job 4008001 cancelled from interactive user
[2014-07-06T18:03:34.791] sched: Allocate JobId=4008001 NodeList=clus553 #CPUs=1
[2014-07-06T18:15:33.319] error: Registered job 4008001.4294967294 on wrong node clus553 
[2014-07-07T08:49:10.451] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=36
[2014-07-07T08:58:14.634] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=56
[2014-07-07T08:59:47.295] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=177
[2014-07-07T09:10:30.312] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=39
[2014-07-07T09:11:18.066] error: Job 4008001 allocated no nodes to be killed on
[2014-07-07T09:11:18.080] _slurm_rpc_requeue: 4008001: usec=21693
[2014-07-07T09:11:25.635] sched: Allocate JobId=4008001 NodeList=clus553 #CPUs=1
[2014-07-07T09:13:24.574] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=131
[2014-07-07T09:32:54.219] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=4540
[2014-07-07T09:42:10.475] _slurm_rpc_update_job complete JobId=4008001 uid=1007 usec=121
[2014-07-07T09:57:52.162] completing job 4008001 status 0
[2014-07-07T09:57:52.169] sched: job_complete for JobId=4008001 successful, exit code=0
Comment 4 David Bigagli 2014-07-08 11:46:48 MDT
Hi Stuart,
         now we have a better idea of the sequence of events although it is not
clear why things went the way they did. I am still trying to reproduce it.
To have a better picture I downloaded the slurmctld.log-20140708. Do you think
I could access to the machine that concentrate the logs from all hosts?

The sequence of events shows that:

1) job 4008001, part of a larger array, started at 15:18:24.041 on clus553.
2) At 17:50:33 the job is requeued since apparently node clus553 failed,
   although there is no evidence of that as the job was requeued ok so the
   daemons were responding. Is it possible that something ran scontrol
   command setting the node down?
3) At 18:03:34 the job starts again on clus553.
4) At 18:15:33 the controller gives this error:

 error: Registered job 4008001.4294967294 on wrong node clus553

We cannot explain this error at the moment. Since Slurm thinks the job is
running on some other host than clus553 it aborts the job on clus553.
Interestingly enough this happens to all jobs running on clus553.

There are the following things to understand:

1) Why Slurm thought clus553 was down and why it thought the host clus553
was a wrong host.
2) What happens to slurmstepd on clus553 as there are no syslog entries
between:

2014-07-06T18:15:36+08:00
2014-07-07T09:11:25+08:00

which is ~15hours the job ran.

3) From the syslog output we are missing an important message from
slurmstepd that says: done with job. This is the last print instruction
before the stepd returns from  main(). It is consistently missing
from syslog output you sent us. Where is it?

By the way we noticed that you don't use the slurmdbd but log directly
in the database. We recommend to use the  slurmdbd, you can have a look at
this document: http://slurm.schedmd.com/accounting.html which outlines
the differences.
We also noticed that you log both the job records and the job completion
records, this seems to be duplicated since the accounting includes the
job completion.

David
Comment 5 Dylan 2014-07-08 16:41:05 MDT
You should now have access to pmon0001 where all the nodes log to. They're kept in /localData/logs/. You can ssh into pmon0001 once on pque0001.
Comment 6 Stuart Midgley 2014-07-08 19:13:34 MDT
We never saw much need to use slurmdbd.  It only appears to provide an absctraction layer between slurm and mysql... which can provide extra security... and we are not that worried security for our queue.


> By the way we noticed that you don't use the slurmdbd but log directly
> in the database. We recommend to use the  slurmdbd, you can have a look at
> this document: http://slurm.schedmd.com/accounting.html which outlines
> the differences.
> We also noticed that you log both the job records and the job completion
> records, this seems to be duplicated since the accounting includes the
> job completion.
> 
> David
Comment 7 David Bigagli 2014-07-09 04:36:31 MDT
Thanks

On 07/08/2014 09:41 PM, bugs@schedmd.com wrote:
> Dylan <mailto:dylanj@dugeo.com> changed bug 942
> <http://bugs.schedmd.com/show_bug.cgi?id=942>
> What 	Removed 	Added
> CC 		dylanj@dugeo.com
>
> *Comment # 5 <http://bugs.schedmd.com/show_bug.cgi?id=942#c5> on bug 942
> <http://bugs.schedmd.com/show_bug.cgi?id=942> from Dylan
> <mailto:dylanj@dugeo.com> *
>
> You should now have access to pmon0001 where all the nodes log to. They're kept
> in /localData/logs/. You can ssh into pmon0001 once on pque0001.
>
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You are on the CC list for the bug.
>   * You are the assignee for the bug.
>   * You are watching someone on the CC list of the bug.
>   * You are watching the assignee of the bug.
>
Comment 8 David Bigagli 2014-07-09 09:16:31 MDT
Which one of the 7687 files I should look at? :-)

I would like to have the syslog from midnight July 6 till midnight of
July 7. I saw this file: syslogd.log-20140706.gz but it seems to be too 
small, 20 bytes, and is read protected.

David
Comment 9 David Bigagli 2014-07-09 11:48:40 MDT
The security of the database is just one feature of slurmdbd there are others
like rollup, archiving and purging. If you use sreport or sacct having 
the rollup will make the processing faster.

More information can be found here:

http://slurm.schedmd.com/slurmdbd.conf.html

David
Comment 10 Dylan 2014-07-09 17:27:20 MDT
Unfortunately that time period is split over two logs.

pmon0001:/localData/logs/clus553.log-20140706.gz
pmon0001:/localData/logs/clus553.log

Our log rotate occurred at 2014-07-06T03:30:04+08:00 splitting it between those two files.
Comment 11 David Bigagli 2014-07-10 12:00:09 MDT
Thanks I got the files, I see the messages are indeed split between
them but that's ok I just merged them and the good thing is that
we see the missing message:

 slurmstepd[37386]: done with job

they are there!

Lets' look at the job 4008001 which started at 2014-07-06T15:18:24.041
on node clus553. Until 2014-07-06T17:50:33.186 it ran fine however
at this time the controller believe clus553 is not responding while
actually clus553 is running just fine. The controller requeues the
job but the request is processed by clus553 only at: 2014-07-06T18:03:33
which is 13 minutes later. This indicates that the periodic ping from
the controller to the slurmds times out as well and that's the reason
the controller is fooled into believing the node is down while it is not.
Either the network was congested, in error, or the controller was
congested not being to send out messages. Do you have any way to tell
what was happening?

This is only half of the story of course and we are still investigating
the second part in which the controller wrongly thought the job
was not running on clus553 after it dispatched it again on it and
what happened to the slurmstepd. For this I will prepare an instrumented
controller with more debug info.

Should not the controller timeout with the slurmd this issue would
probably never happen. For know I would suggest to increase your
SlurmdTimeout from 300 to 600.

Another thing I noticed is that it seems you run your master in virtualbox:

root@pque0001 /home/davidb>virt-what
virtualbox

Can the VM guarantee the performance you need? I remember we talked
about this while back.

David
Comment 12 David Bigagli 2014-07-14 05:11:21 MDT
I have made small code change to print more information should we see
the problem of controller printing jobs on the wrong node. I also saw
there were some issues accessing the unix sockets on the execution host,
so print more details in this case as well. The commit is 86de0fbfcca24.

David
Comment 13 Dylan 2014-07-14 18:05:57 MDT
We've never seen any performance problems with the Virtual Machine. I've increased the slurmd time-out like you suggested but we saw something very strange today that I believe is the same or a related problem. From clus297's syslog on pmon0001.

2014-07-15T09:05:27+08:00 clus297 slurmd[21742]: debug:  task_p_slurmd_batch_request: 4354861
2014-07-15T09:05:27+08:00 clus297 slurmd[21742]: debug:  Calling /d/sw/slurm/20140710/sbin/slurmstepd spank prolog
2014-07-15T09:05:27+08:00 clus297 spank-prolog[14920]: Reading slurm.conf file: /d/sw/slurm/20140710/etc/slurm.conf
2014-07-15T09:05:27+08:00 clus297 spank-prolog[14920]: Running spank/prolog for jobid [4354861] uid [1195]
2014-07-15T09:05:27+08:00 clus297 spank-prolog[14920]: spank: opening plugin stack /d/sw/slurm/20140710/etc/plugstack.conf
2014-07-15T09:05:27+08:00 clus297 slurmd[21742]: debug:  [job 4354861] attempting to run prolog [/d/sw/slurm/etc/slurm_prolog.sh]
2014-07-15T09:05:27+08:00 clus297 slurmd[21742]: get env for user mohyink here
2014-07-15T09:05:27+08:00 clus297 su: pam_unix(su-l:session): session opened for user mohyink by (uid=0)
2014-07-15T09:05:28+08:00 clus297 su: pam_unix(su-l:session): session closed for user mohyink
2014-07-15T09:05:28+08:00 clus297 slurmd[21742]: Launching batch job 4354861 for UID 1195
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: switch NONE plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: Received cpu frequency information for 32 cpus
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: AcctGatherProfile NONE plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: AcctGatherEnergy NONE plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: AcctGatherInfiniband NONE plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: AcctGatherFilesystem NONE plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: Job accounting gather LINUX plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: scaling CPU count by factor of 2 (32/(16-0)
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: laying out the 1 tasks on 1 hosts clus297 dist 2
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: Message thread started pid = 14959
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: task NONE plugin loaded
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: Reading cgroup.conf file /d/sw/slurm/20140710/etc/cgroup.conf
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: Checkpoint plugin loaded: checkpoint/none
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: mpi type = (null)
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: spank: opening plugin stack /d/sw/slurm/20140710/etc/plugstack.conf
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: mpi type = (null)
2014-07-15T09:05:28+08:00 clus297 slurmstepd[14959]: mpi/none: slurmstepd prefork
2014-07-15T09:05:29+08:00 clus297 slurmstepd[14959]: cont_id hasn't been set yet not running poll
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14959]: debug level = 2
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14959]: task 0 (14963) started 2014-07-15T09:05:30
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: task_p_pre_launch_priv: 4354861.4294967294
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14959]: Job 4354861 memory used:1512 limit:14864384 KB
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14959]: job_container none plugin loaded
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: task_p_pre_launch: 4354861.4294967294, task 0
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: [job 4354861] attempting to run slurm task_prolog [/d/sw/slurm/etc/slurm_task_prolog.sh]
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: export name:DUG_LICENSE:val:PER2-PUG1-L38W-Y8PG-WNXT-TL:
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: export name:DUG_LICENSE_SERVER:val:dugy:8080:
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: export name:COI_SEP_DISABLE:val:TRUE:
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: export name:USP_PRINT_PREFIX:val:DUG_4354861_:
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: export name:TMPDIR:val:/localData/queue-teambm/job4354861.Oukh:
2014-07-15T09:05:30+08:00 clus297 slurmstepd[14963]: export name:CWP_TMPDIR:val:/localData/queue-teambm/job4354861.Oukh:
2014-07-15T09:05:58+08:00 clus297 slurmstepd[14959]: Job 4354861 memory used:15176876 limit:14864384 KB
2014-07-15T09:06:03+08:00 clus297 monitoring: systemp=28 phitemp=45,49
2014-07-15T09:06:28+08:00 clus297 slurmstepd[14959]: Job 4354861 memory used:25565032 limit:14864384 KB
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: task 0 (14963) exited with exit code 0.
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: task_p_post_term: 4354861.4294967294, task 0
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: cpu_freq_reset: #cpus reset = 0
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: Message thread exited
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: job 4354861 completed with slurm_rc = 0, job_rc = 0
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
2014-07-15T09:08:57+08:00 clus297 slurmstepd[14959]: Null authentication plugin loaded
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  _rpc_terminate_job, uid = 601
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  task_p_slurmd_release_resources: 4354861
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  credential for job 4354861 revoked
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  Waiting for job 4354861's prolog to complete
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  Finished wait for job 4354861's prolog to complete
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  Calling /d/sw/slurm/20140710/sbin/slurmstepd spank epilog
2014-07-15T09:08:57+08:00 clus297 spank-epilog[15673]: Reading slurm.conf file: /d/sw/slurm/20140710/etc/slurm.conf
2014-07-15T09:08:57+08:00 clus297 spank-epilog[15673]: Running spank/epilog for jobid [4354861] uid [1195]
2014-07-15T09:08:57+08:00 clus297 spank-epilog[15673]: spank: opening plugin stack /d/sw/slurm/20140710/etc/plugstack.conf
2014-07-15T09:08:57+08:00 clus297 slurmd[21742]: debug:  [job 4354861] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2014-07-15T09:08:58+08:00 clus297 slurmstepd[14959]: done with job
2014-07-15T09:08:58+08:00 clus297 slurmd[21742]: debug:  completed epilog for jobid 4354861
2014-07-15T09:08:58+08:00 clus297 slurmd[21742]: debug:  Job 4354861: sent epilog complete msg: rc = 0
2014-07-15T09:09:09+08:00 clus297 slurmd[21742]: debug:  task_p_slurmd_batch_request: 4355149

From the slurmctld log on pque0001

[2014-07-15T09:05:27.169] sched: Allocate JobId=4354861 NodeList=clus297 #CPUs=32

[2014-07-15T09:08:57.805] Batch JobId=4354861 missing from node 0
[2014-07-15T09:08:57.805] completing job 4354861 status -2
[2014-07-15T09:08:57.805] Job 4354861 cancelled from interactive user or node failure
[2014-07-15T09:08:57.834] sched: job_complete for JobId=4354861 successful, exit code=4294967294
[2014-07-15T09:08:57.907] completing job 4354862 status 0
[2014-07-15T09:08:57.913] sched: job_complete for JobId=4354862 successful, exit code=0
[2014-07-15T09:08:57.976] licenses: request_license=iop5 total=1 used=0
[2014-07-15T09:08:57.984] _slurm_rpc_submit_batch_job JobId=4386986 usec=8261
[2014-07-15T09:08:57.987] completing job 4384884 status 0
[2014-07-15T09:08:58.008] sched: job_complete for JobId=4384884 successful, exit code=0
[2014-07-15T09:08:58.019] completing job 4354861 status 0
[2014-07-15T09:08:58.029] completing job 4384897 status 0
[2014-07-15T09:08:58.033] sched: job_complete for JobId=4384897 successful, exit code=0
[2014-07-15T09:08:58.069] completing job 4385053 status 0
[2014-07-15T09:08:58.074] sched: job_complete for JobId=4385053 successful, exit code=0
[2014-07-15T09:08:58.098] completing job 4385049 status 0
[2014-07-15T09:08:58.103] sched: job_complete for JobId=4385049 successful, exit code=0
[2014-07-15T09:08:58.122] completing job 4384891 status 0
[2014-07-15T09:08:58.128] sched: job_complete for JobId=4384891 successful, exit code=0
[2014-07-15T09:08:58.172] _slurm_rpc_requeue: Processing RPC: REQUEST_REQUEUE from uid=601
[2014-07-15T09:08:58.173] _slurm_rpc_requeue: 4354861: usec=753
[2014-07-15T09:08:58.575] requeue batch job 4354861

[2014-07-15T09:30:09.463] sched: update_job: releasing hold for job_id 4354861, new priority is 100
[2014-07-15T09:30:09.463] _slurm_rpc_update_job complete JobId=4354861 uid=601 usec=215
[2014-07-15T09:30:11.311] sched: Allocate JobId=4354861 NodeList=clus082 #CPUs=40

[2014-07-15T09:33:13.894] completing job 4354861 status 0
[2014-07-15T09:33:13.914] sched: job_complete for JobId=4354861 successful, exit code=0


It appears that the job finished the first time on clus297, reported completion back to slurmctld only to have slurmctld think the job had disappeared at the same time. Then reschedule the job on another node.
Comment 14 David Bigagli 2014-07-15 09:47:03 MDT
We think this is a rare race condition between slurmstepd terminating 
and slurmd registering with slurmctld. This is the sequence of events
we can read from the logs and actually reproduce.

1) The job is finished. slurmstepd removes the unix socket in its spool
directory, then it sends an message to the controller the job is finished.

2) At the very same time the slurmd is registering with the controller.
The unix sockets tell slurmd how many jobs are running on its hosts, since
the socket was just removed slurmd reports no job.

3) The controller which has not yet received the job finished message 
requeue the job.

We have developed a patch to address this race condition. It is in 
commit 6bb0548336d60.

David
Comment 15 David Bigagli 2014-08-01 07:04:47 MDT
Fixed.