| Summary: | Job accounting values turn wrong for long runs or high resource utilization | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Yiannis Georgiou <yiannis.georgiou> |
| Component: | Accounting | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | Highest | ||
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Universitat Dresden (Germany) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | error log for jobid 705800; slurmctld log of 253198; slurmd log of 253194 | ||
Yiannis, make sure the job isn't requesting --acctg_freq=0 either, like they are doing in bug 300. That will turn off polling and you may get what you are getting. I am guessing this may be the situation. If it isn't, can you run the bad test with debug2 on the slurmstepd? You will get the list of memory and CPU per poll. I am guessing at some time during the run things will just go to 0. If it doesn't go bad there but it does in the database, then the problem lies there. Running the slurmdbd with debug3 will give us the SQL we use to put the values there.

Well, I'm pretty sure they haven't used --acctg-freq=0 for that job. Profiling is activated and we get correct results in the HDF5 files. In bug 300 I was using --acctg-freq=0 to improve throughput because I was launching >20K jobs to test the throughput. I'll increase the debugging, run a Linpack to get more info, and let you know.

OK, so here you are. I've launched some Linpacks and the problem appears only if profiling is activated: 3 out of 11 Linpacks with profiling activated gave accounting results = 0. I show you one with the error (705800) and one without (705801):
$ sacct -u bull -S 2013-06-13T22:40:00 --format=jobid,MaxVMSize,MaxRSS,AveRSS,MaxPages,Nnodes,AveCPUFreq,ReqCPUFreq,ConsumedEnergy,MaxDiskRead,Elapsed
JobID MaxVMSize MaxRSS AveRSS MaxPages NNodes AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead Elapsed
------------ ---------- ---------- ---------- -------- -------- ---------- ---------- -------------- ------------ ----------
705800 270 00:00:39
705800.batc+ 1641180K 56808K 8648K 19K 1 2901000 0 7457 240M 00:00:39
705800.0 0 0 0 0 270 0 2901000 0 0 00:00:37
705801 270 00:00:38
705801.batc+ 3009412K 147460K 8620K 19K 1 2901000 0 7152 240M 00:00:38
705801.0 1115884K 923216K 355977K 46K 270 226558 2901000 1611560 3M 00:00:36
705802 270 00:00:38
Here are the logs of the first compute node for the job that gave correct results:
[2013-06-13T22:55:31.297] [705801.0] uid = 0
[2013-06-13T22:55:31.297] [705801.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:55:31.297] [705801.0] Entering _handle_request
[2013-06-13T22:55:31.297] [705801.0] Leaving _handle_accept
[2013-06-13T22:55:31.297] [705801.0] Rank 0 waiting for 57 (of 269) children
[2013-06-13T22:55:31.375] debug3: in the service_connection
[2013-06-13T22:55:31.376] debug2: got this type of message 5016
[2013-06-13T22:55:31.376] debug3: Entering _rpc_step_complete
[2013-06-13T22:55:31.376] [705801.0] Called _msg_socket_accept
[2013-06-13T22:55:31.376] [705801.0] Leaving _msg_socket_accept
[2013-06-13T22:55:31.376] [705801.0] Called _msg_socket_readable
[2013-06-13T22:55:31.376] [705801.0] Entering _handle_accept (new thread)
[2013-06-13T22:55:31.376] [705801.0] Identity: uid=0, gid=0
[2013-06-13T22:55:31.376] [705801.0] Entering _handle_request
[2013-06-13T22:55:31.377] debug: Entering stepd_completion, range_first = 172, range_last = 228
[2013-06-13T22:55:31.377] [705801.0] Got request 18
[2013-06-13T22:55:31.377] [705801.0] Handling REQUEST_STEP_COMPLETION_V2
[2013-06-13T22:55:31.377] [705801.0] _handle_completion for job 705801.0
[2013-06-13T22:55:31.377] [705801.0] uid = 0
[2013-06-13T22:55:31.377] [705801.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:55:31.377] [705801.0] Entering _handle_request
[2013-06-13T22:55:31.377] [705801.0] Rank 0 got all children completions
[2013-06-13T22:55:31.377] [705801.0] Leaving _handle_accept
[2013-06-13T22:55:31.377] [705801.0] _one_step_complete_msg: first=0, last=269
[2013-06-13T22:55:31.377] [705801.0] Rank 0 sending complete to slurmctld, range 0 to 269
[2013-06-13T22:55:31.381] [705801.0] Called _msg_socket_readable
[2013-06-13T22:55:31.381] [705801.0] false, shutdown
[2013-06-13T22:55:31.381] [705801.0] Message thread exited
[2013-06-13T22:55:31.381] [705801.0] done with job
[2013-06-13T22:55:31.385] [705801] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:55:31.396] debug3: in the service_connection
[2013-06-13T22:55:31.396] debug2: got this type of message 1019
[2013-06-13T22:55:31.396] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2013-06-13T22:55:31.396] _get_joules_node_ipmi = consumed 4804842 Joules
[2013-06-13T22:55:31.397] [705801] _get_joules_task: consumed 7152 Joules (received 2646(300 watts) from slurmd)
[2013-06-13T22:55:31.397] [705801] getjoules_task energy = 7152
[2013-06-13T22:55:31.397] [705801] removing task 0 pid 56142 from jobacct
[2013-06-13T22:55:31.397] [705801] task 0 (56142) exited with exit code 0.
[2013-06-13T22:55:31.397] [705801] affinity task_post_term: 705801.4294967294, task 0
[2013-06-13T22:55:31.397] [705801] Aggregated 1 task exit messages
[2013-06-13T22:55:31.397] [705801] sending task exit msg for 1 tasks
[2013-06-13T22:55:31.397] [705801] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:55:31.408] [705801] cpu_freq_reset: #cpus reset = 0
[2013-06-13T22:55:31.409] [705801] Before call to spank_fini()
[2013-06-13T22:55:31.409] [705801] After call to spank_fini()
[2013-06-13T22:55:31.409] [705801] Called _msg_socket_readable
[2013-06-13T22:55:31.409] [705801] false, shutdown
[2013-06-13T22:55:31.409] [705801] Message thread exited
[2013-06-13T22:55:31.410] [705801] job 705801 completed with slurm_rc = 0, job_rc = 0
[2013-06-13T22:55:31.410] [705801] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-06-13T22:55:31.413] [705801] done with job
We see that the function _one_step_complete_msg is called once, and the step finishes before the job.
And here are the logs of the job that gave wrong values in accounting:
[2013-06-13T22:53:28.849] [705800.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:53:28.849] [705800.0] Entering _handle_request
[2013-06-13T22:53:28.849] [705800.0] Rank 0 waiting for 155 (of 269) children
[2013-06-13T22:53:28.849] [705800.0] Leaving _handle_accept
[2013-06-13T22:53:28.877] debug3: in the service_connection
[2013-06-13T22:53:28.877] debug2: got this type of message 5016
[2013-06-13T22:53:28.877] debug3: Entering _rpc_step_complete
[2013-06-13T22:53:28.878] [705800.0] Called _msg_socket_accept
[2013-06-13T22:53:28.878] [705800.0] Leaving _msg_socket_accept
[2013-06-13T22:53:28.878] [705800.0] Called _msg_socket_readable
[2013-06-13T22:53:28.878] [705800.0] Entering _handle_accept (new thread)
[2013-06-13T22:53:28.878] [705800.0] Identity: uid=0, gid=0
[2013-06-13T22:53:28.878] [705800.0] Entering _handle_request
[2013-06-13T22:53:28.878] debug: Entering stepd_completion, range_first = 229, range_last = 269
[2013-06-13T22:53:28.878] [705800.0] Got request 18
[2013-06-13T22:53:28.878] [705800.0] Handling REQUEST_STEP_COMPLETION_V2
[2013-06-13T22:53:28.878] [705800.0] _handle_completion for job 705800.0
[2013-06-13T22:53:28.878] [705800.0] uid = 0
[2013-06-13T22:53:28.878] [705800.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:53:28.878] [705800.0] Entering _handle_request
[2013-06-13T22:53:28.878] [705800.0] Rank 0 waiting for 114 (of 269) children
[2013-06-13T22:53:28.878] [705800.0] Leaving _handle_accept
[2013-06-13T22:53:28.969] [705800] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:53:28.980] debug3: in the service_connection
[2013-06-13T22:53:28.980] debug2: got this type of message 1019
[2013-06-13T22:53:28.980] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2013-06-13T22:53:28.981] _get_joules_node_ipmi = consumed 4789995 Joules
[2013-06-13T22:53:28.981] [705800] _get_joules_task: consumed 7457 Joules (received 2664(298 watts) from slurmd)
[2013-06-13T22:53:28.981] [705800] getjoules_task energy = 7457
[2013-06-13T22:53:28.981] [705800] removing task 0 pid 55111 from jobacct
[2013-06-13T22:53:28.981] [705800] task 0 (55111) exited with exit code 0.
[2013-06-13T22:53:28.981] [705800] affinity task_post_term: 705800.4294967294, task 0
[2013-06-13T22:53:28.981] [705800] Aggregated 1 task exit messages
[2013-06-13T22:53:28.981] [705800] sending task exit msg for 1 tasks
[2013-06-13T22:53:28.982] [705800] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:53:28.991] [705800] cpu_freq_reset: #cpus reset = 0
[2013-06-13T22:53:28.992] [705800] Before call to spank_fini()
[2013-06-13T22:53:28.992] [705800] After call to spank_fini()
[2013-06-13T22:53:28.992] [705800] Called _msg_socket_readable
[2013-06-13T22:53:28.992] [705800] false, shutdown
[2013-06-13T22:53:28.992] [705800] Message thread exited
[2013-06-13T22:53:28.993] [705800] job 705800 completed with slurm_rc = 0, job_rc = 0
[2013-06-13T22:53:28.993] [705800] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-06-13T22:53:28.996] [705800] done with job
.......
[2013-06-13T22:54:34.004] debug2: got this type of message 5016
[2013-06-13T22:54:34.004] debug3: Entering _rpc_step_complete
[2013-06-13T22:54:34.004] debug2: got this type of message 5016
[2013-06-13T22:54:34.004] debug3: Entering _rpc_step_complete
[2013-06-13T22:54:36.125] ipmi-thread = 3 sec, current 76 Watts, consumed 4796398 Joules 228 new
[2013-06-13T22:54:36.278] [705800.0] profile signalling type 0
[2013-06-13T22:54:36.279] debug3: in the service_connection
[2013-06-13T22:54:36.279] debug2: got this type of message 1019
[2013-06-13T22:54:36.279] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2013-06-13T22:54:36.279] _get_joules_node_ipmi = consumed 4796398 Joules
[2013-06-13T22:54:36.280] [705800.0] _get_joules_task: consumed 13539 Joules (received 228(76 watts) from slurmd)
[2013-06-13T22:54:36.280] [705800.0] _ipmi_send_profile: consumed 76 watts
[2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data Group-Energy Type=Energy
[2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data, HDF5 file not open
[2013-06-13T22:54:37.000] [705800.0] Rank 0 timed out waiting for 114 (of 269) children
[2013-06-13T22:54:37.000] [705800.0] _one_step_complete_msg: first=58, last=114
[2013-06-13T22:54:37.000] [705800.0] Rank 0 sending complete to slurmctld, range 58 to 114
[2013-06-13T22:54:37.003] [705800.0] _one_step_complete_msg: first=172, last=269
[2013-06-13T22:54:37.003] [705800.0] Rank 0 sending complete to slurmctld, range 172 to 269
[2013-06-13T22:54:37.006] [705800.0] _one_step_complete_msg: first=0, last=0
[2013-06-13T22:54:37.006] [705800.0] Rank 0 sending complete to slurmctld, range 0 to 0
[2013-06-13T22:54:37.008] [705800.0] done with job
[2013-06-13T22:54:37.010] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
[2013-06-13T22:54:37.010] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
[2013-06-13T22:54:37.010] debug3: state for jobid 705780: ctime:1371155585 revoked:1371155886 expires:1371155886
[2013-06-13T22:54:37.010] debug3: state for jobid 705786: ctime:1371156071 revoked:1371156110 expires:1371156110
[2013-06-13T22:54:37.010] debug3: state for jobid 705787: ctime:1371156110 revoked:1371156148 expires:1371156148
[2013-06-13T22:54:37.010] debug3: state for jobid 705788: ctime:1371156149 revoked:1371156188 expires:1371156188
[2013-06-13T22:54:37.010] debug3: state for jobid 705789: ctime:1371156188 revoked:1371156225 expires:1371156225
[2013-06-13T22:54:37.010] debug3: state for jobid 705790: ctime:1371156226 revoked:1371156263 expires:1371156263
[2013-06-13T22:54:37.010] debug3: state for jobid 705791: ctime:1371156264 revoked:1371156304 expires:1371156304
[2013-06-13T22:54:37.010] debug3: state for jobid 705792: ctime:1371156305 revoked:1371156341 expires:1371156341
[2013-06-13T22:54:37.010] debug3: state for jobid 705793: ctime:1371156342 revoked:1371156381 expires:1371156381
[2013-06-13T22:54:37.010] debug3: state for jobid 705794: ctime:1371156382 revoked:1371156421 expires:1371156421
[2013-06-13T22:54:37.011] debug3: state for jobid 705795: ctime:1371156421 revoked:1371156467 expires:1371156467
[2013-06-13T22:54:37.011] debug3: state for jobid 705796: ctime:1371156468 revoked:1371156506 expires:1371156506
[2013-06-13T22:54:37.011] debug3: state for jobid 705797: ctime:1371156507 revoked:1371156544 expires:1371156544
[2013-06-13T22:54:37.011] debug3: state for jobid 705798: ctime:1371156691 revoked:1371156729 expires:1371156729
[2013-06-13T22:54:37.011] debug3: state for jobid 705799: ctime:1371156730 revoked:1371156769 expires:1371156769
[2013-06-13T22:54:37.011] debug3: state for jobid 705800: ctime:1371156769 revoked:1371156808 expires:1371156808
[2013-06-13T22:54:37.011] debug2: set revoke expiration for jobid 705800 to 1371158077 UTS
[2013-06-13T22:54:37.011] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
[2013-06-13T22:54:37.011] error: stepd_connect to 705800.0 failed: Connection reset by peer
[2013-06-13T22:54:37.011] error: stepd_connect to 705800.0 failed: Connection reset by peer
[2013-06-13T22:54:37.012] debug: Waiting for job 705800's prolog to complete
[2013-06-13T22:54:37.012] debug: Finished wait for job 705800's prolog to complete
[2013-06-13T22:54:37.012] debug: Calling /usr/sbin/slurmstepd spank epilog
[2013-06-13T22:54:37.016] Reading slurm.conf file: /etc/slurm/slurm.conf
[2013-06-13T22:54:37.020] Running spank/epilog for jobid [705800] uid [2054944]
[2013-06-13T22:54:37.020] spank: opening plugin stack /etc/slurm/plugstack.conf
[2013-06-13T22:54:37.023] debug: [job 705800] attempting to run epilog [/etc/slurm/epilog.sh]
[2013-06-13T22:54:37.273] debug: completed epilog for jobid 705800
....
[2013-06-13T22:54:38.014] error: stepd_connect to 705800.0 failed: Connection refused
[2013-06-13T22:54:38.014] error: stepd_connect to 705800.0 failed: Connection refused
We see that the job finishes at some point without the step having ended, and about 1 minute later the _one_step_complete_msg function is called 3 times.
I think there must be an error somewhere in _send_step_complete_msgs or in the batch_finish function of src/slurmd/slurmstepd/mgr.c.
Even if we have to wait for the other nodes to complete, this shouldn't allow the job to finish before the step.
Yiannis
Yiannis, could you send me the sbatch script and the sbatch line?

Danny,
Here is the batch command used for both jobs (705800 and 705801):
export PPN=16 ; export NODE=270 ; sbatch --reservation=bull_56 -N $NODE -n $(( $PPN * $NODE )) -p mpi2 --profile=Energy --acctg-freq=Energy=3 BULL_run.sh
and here is the script used:
=========================================================
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=16
#SBATCH -p newscaall
####SBATCH -L lustre10k*1
#SBATCH --threads-per-core=1
#SBATCH -J DRESDE
#SBATCH --time=24:00:00
#SBATCH --exclusive
###SBATCH --begin=20:00:00 # run begin at 20:00
###################################
# CPUFREQ #
###################################
export FREQ=2901000
#export FREQ=2900000
#Available frequencies : 2701000(T) 2700000 2400000 2200000 2000000 1800000 1600000 1400000 1200000
###################################
# COMPILER #
###################################
source $SLURM_SUBMIT_DIR/env_bench
###################################
# MPDBOOT #
###################################
NNODES=$SLURM_NNODES
cores=$SLURM_NTASKS
###################################
# OMP_NUM_THREADS #
###################################
export OMP_NUM_THREADS=1
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
export FORT_BUFFERED=true
export decfort_dump_flag=true
unset OMP_NUM_THREADS FORT_BUFFERED decfort_dump_flag MALLOC_MMAP_MAX_ MALLOC_TRIM_THRESHOLD_
###################################
# WORKDIR #
###################################
workdir=$SLURM_SUBMIT_DIR/DIR_$FREQ\_$cores\_$OMP_NUM_THREADS\_$SLURM_JOBID
mkdir $workdir
cd $workdir
#-
$HOME/bin/ExpandNodeList -r -p 1 $SLURM_NODELIST > mpd.hosts
$HOME/bin/ExpandNodeList -r -p $SLURM_NTASKS_PER_NODE $SLURM_NODELIST > one_host_per_core
#-
cp ../BULL_run.sh .
#-
echo '----------------------------------------------------'
ulimit -a
echo '----------------------------------------------------'
pwd
echo '----------------------------------------------------'
echo ' NODE USED = '$SLURM_NODELIST
echo ' SLURM_JOBID = '$SLURM_JOBID
echo ' OMP_NUM_THREADS = '$OMP_NUM_THREADS
echo ' CORES = '$cores
echo '----------------------------------------------------'
# COPY EXE FROM BIN DIRECTORY
echo 'cp '$bin_dir'/'$bin' .'
cp $bin_dir/$bin .
cp $bin_dir/compile.sh .
echo '----------------------------------------------------'
echo 'ldd '$bin
ldd $bin
echo '----------------------------------------------------'
date
echo '----------------------------------------------------'
clush -bw root@$SLURM_NODELIST ' /bin/echo 3 > /proc/sys/vm/drop_caches '
#clush -bw root@$SLURM_NODELIST ' pwrb -N '
#clush -bw root@$SLURM_NODELIST ' pwrb -P '
#clush -bw root@$SLURM_NODELIST ' pwrb -f $FREQ '
#export LD_PRELOAD=/home/bull/Cyril/LIB/NO_BARRIER/COMPILATION_bullxmpi/no_barrier/libnobarrier.so
#time srun --cpu-freq=$FREQ -m cyclic --cpu_bind=map_cpu:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 --resv-port \
time srun --cpu-freq=$FREQ -m cyclic --cpu_bind=map_cpu:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 --resv-port \
-n $cores ./$bin $input_file > out_$cores
mv results.txt results.txt_$cores
echo '----------------------------------------------------'
date
echo '----------------------------------------------------'
==============================================================
I've also found these errors in slurm-705800.out, whereas slurm-705801.out did not have any:
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5Gdeprec.c line 820 in H5Gget_num_objs(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5Gdeprec.c line 820 in H5Gget_num_objs(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5Gdeprec.c line 1066 in H5Gget_objname_by_idx(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
Perhaps the problem is related to the synchronization function between the nodes, which is active only when profiling is activated.
Yiannis
Yiannis, perhaps unrelated, perhaps related: this code has changed quite a bit in the last few days. All the HDF5 functions you reference here have been deprecated and replaced with more modern calls. I would be interested to know whether this still happens with the latest code. I am guessing there is still a problem, though, since the batch step shouldn't exit before the other steps.

Yiannis, could you also set DebugFlags=Profile and send a run with the error again?

Danny,
I had DebugFlags=NO_CONF_HASH,Energy,Profile in the slurm.conf. What are the logs that you would like to see? I think I have reported most of the relevant parts of the logs.
Yiannis

Danny,
Do you think we should try the latest git version with the more modern HDF5 calls directly, or should we wait to see if we find anything more? Which function do you have in mind?

(In reply to Yiannis Georgiou from comment #8)
> Danny,
>
> I had DebugFlags=NO_CONF_HASH,Energy,Profile in the slurm.conf .
> What are the logs that you would like to see ? I think I have reported most
> of the relevant parts of the logs.
>
> Yiannis

I think my problem is I was expecting certain output which wasn't showing up.

[2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data, HDF5 file not open

seems to be at the root of the cause. I will see if I can force this scenario and try to reproduce what you are seeing. If you could send the first few hundred lines from a failed run, that would be helpful.

For what it is worth, I have run thousands of jobs over the weekend without a single failure.

On a completely unrelated note, I believe the reason you have NO_CONF_HASH there is that you have rapl on some nodes and ipmi on others. I would suggest putting that one line in an include file so you can remove that debug flag.

(In reply to Yiannis Georgiou from comment #9)
> Danny,
>
> do you think we should try directly the latest git version with the more
> modern calls for HDF5 or should we wait to see if we find anything more?
> Which function do you have in mind?

While I do not think it will really matter, it is worth a shot if you can.

Created attachment 290 [details]
error log for jobid 705800
(In reply to Danny Auble from comment #10)
> I think my problem is I was expecting certain output which wasn't showing up.
>
> [2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data, HDF5 file not
> open
>
> Seems to be at the root of the cause. I will see if I can force this
> scenario and try to reproduce what you are seeing. If you could send the
> first few 100 lines from a failed run that would be helpful.

I have attached an error log with a grep on the jobid. I think the log line you mention arrives after the problem has occurred; see line 1172 of the log I attached... This error arrives after the job has finished, and we continue to get log output from the steps when we shouldn't. I think if we prevent this from happening and force the job to wait until all the steps are finished, then it would be fine. I'm wondering, though, why we don't see this every time, and why we see it only when profiling is activated.

Yiannis, could you please verify 77180.0 is still 0? My feeling on the matter is that you just didn't wait long enough for the step to end. This step had an issue with IO. I can generate this state, but it always fills in the correct info after things completely end. I do not think the batch script ending before the step is related to this either, nor do I think the batch script should wait for its steps. The allocation is still being enforced by the slurmctld. One thing I could do, though, is wrap up the jobacct/acct_gather stuff earlier to avoid the gather threads running after the step was supposed to end. Let me know what you think. I am hoping 77180.0 is not zero anymore, thus verifying my hypothesis.

Danny,
The situation for that job hasn't changed:
$ sacct -j 705800 -o jobid,state,maxdiskread,aveRSS,consumedenergy
JobID State MaxDiskRead AveRSS ConsumedEnergy
------------ ---------- ------------ ---------- --------------
705800 COMPLETED
705800.batc+ COMPLETED 240M 8648K 7457
705800.0 COMPLETED 0 0 0
From what I see on the other, correct jobs, the batch job always ends after all the steps, but perhaps there can be exceptions. In all 3 exceptional cases in our tests (sbatch ending before some steps), the result was a 0 value in consumed energy and some other accounting fields.
If we wrap up the jobacct/acct_gather stuff earlier, we might miss the latest values for consumed energy, but missing one last measure from some nodes is probably better than getting 0 once in a while.
I've seen the errors only with 270 nodes, never with fewer. It's strange that this happens only above a certain node count.
That is interesting about the node count. I'll post a proposed patch in a bit and see if you can try again with it. I see no reason to keep polling on things when we are just waiting for IO to finish. As for the step finishing after the batch step, I still don't think it is related. I am guessing it is the IO hang that really causes the scenario. I'll make a bigger cluster and see if that matters. Could you see if the other 3 failed jobs had issues with IO as well?

Please try the git master now. The commit most interesting for this bug is a8f56cb2005a5db03601e11f44d73bf881f7a7cd, which does a bit of cleanup to avoid having the threads run longer than they needed to. Looking at this code, though, it is almost impossible to believe things were zeroed out; I just don't see how it was possible. Do you happen to see any core files lying around?

Have you seen this happen again on the new code?

Hi Danny, I haven't managed to test this yet. I hope I will have the chance to do it tomorrow. I'll let you know.
Yiannis

Hi Danny, you are right that it doesn't matter that the step finishes after the job. On the other hand, I've done some testing with the latest git version and we still have the problem, but I'm now pretty sure it is related to a particular cpuset error. That could explain why you have never seen it. Do you run with cpusets on your platform?

Check this out: the problem comes from a defunct sh process, left over from a cpuset manipulation error, that stalls the termination of the step. As you can see below, process 3071 provokes the problem that we see in the logs (search for the line "Defunct process skipped:"). I noticed that every time we get 0 values in the accounting tables, a defunct process exists on the first compute node and we are obliged to wait until the message "Abandoning IO 60 secs after job shutdown initiated" before the step is finished....

Here are some messages in the log:
--------------------------------------------------
[2013-06-27T13:34:29.639] [253140] Message thread exited
[2013-06-27T13:34:29.639] [253140] job 253140 completed with slurm_rc = 0, job_rc = 0
[2013-06-27T13:34:29.639] [253140] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-06-27T13:34:29.642] [253140] done with job
............
[2013-06-27T13:34:29.662] [253140.0] Defunct process skipped: command=(sh) state=Z pid=3071 ppid=2521
...........
.......
[2013-06-27T13:35:20.875] [253140.0] Defunct process skipped: command=(sh) state=Z pid=3071 ppid=2521
[2013-06-27T13:35:20.878] [253140.0] Sent SIGKILL signal to 253140.0
[2013-06-27T13:35:20.878] [253140.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-27T13:35:20.878] [253140.0] Entering _handle_request
[2013-06-27T13:35:20.878] [253140.0] Leaving _handle_accept
[2013-06-27T13:35:20.878] debug3: Step from other job: jobid=253138 (this jobid=253140)
[2013-06-27T13:35:21.866] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:21.866] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:22.127] ipmi-thread = 3 sec, current 76 Watts, consumed 653962 Joules 228 new
[2013-06-27T13:35:22.867] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:22.867] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:23.868] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:23.868] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:24.869] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:24.870] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:25.134] ipmi-thread = 3 sec, current 76 Watts, consumed 654190 Joules 228 new
[2013-06-27T13:35:25.871] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:25.871] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:26.872] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:26.872] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:27.873] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:27.873] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:28.141] ipmi-thread = 3 sec, current 76 Watts, consumed 654418 Joules 228 new
[2013-06-27T13:35:28.874] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:28.874] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:29.875] [253140.0] Abandoning IO 60 secs after job shutdown initiated
[2013-06-27T13:35:29.875] [253140.0] Message thread exited
............
.......
[2013-06-27T13:35:38.000] [253140.0] Rank 0 timed out waiting for 41 (of 269) children
[2013-06-27T13:35:38.000] [253140.0] _one_step_complete_msg: first=0, last=228
[2013-06-27T13:35:38.000] [253140.0] Rank 0 sending complete to slurmctld, range 0 to 228
[2013-06-27T13:35:38.002] [253140.0] done with job
[2013-06-27T13:35:38.004] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
----------------------------------------

And here is the output of ps during the same time:
-----------------------------------
root@taurusi1001:~> ps -aux |grep 13:
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1602 0.0 0.0 0 0 ? S 13:33 0:00 [flush-0:19]
root 2521 0.5 0.0 472692 7020 ? Sl 13:33 0:00 slurmstepd: [253140.0]
root 2939 0.0 0.0 11292 1344 ? S 13:34 0:00 /bin/bash /sbin/cpuset_release_agent /slurm253140/slurm253140.0_15
root 3063 0.0 0.0 4076 488 ? D 13:34 0:00 rmdir /dev/cpuset//slurm253140/slurm253140.0_15
root 3064 0.0 0.0 11292 1344 ? S 13:34 0:00 /bin/bash /sbin/cpuset_release_agent /slurm253140/slurm253140.0_10
root 3069 0.0 0.0 4076 488 ? D 13:34 0:00 rmdir /dev/cpuset//slurm253140/slurm253140.0_10
root 3071 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
root 3076 0.0 0.0 11292 1340 ? S 13:34 0:00 /bin/bash /sbin/cpuset_release_agent /slurm253140/slurm253140.4294967294_0
root 3080 0.0 0.0 4076 484 ? D 13:34 0:00 rmdir /dev/cpuset//slurm253140/slurm253140.4294967294_0
root 3140 2.0 0.0 110224 1196 pts/2 R+ 13:34 0:00 ps -aux
root 3141 0.0 0.0 103232 900 pts/2 S+ 13:34 0:00 grep 13:
root 116560 0.0 0.0 96508 3812 ? Ss 13:18 0:00 sshd: root@pts/2
root 116562 0.0 0.0 108428 2032 pts/2 Ss 13:18 0:00 -bash
root 120048 0.0 0.0 0 0 ? S 13:25 0:00 [flush-lustre-2]
root 121182 0.0 0.0 100928 648 pts/1 S+ 13:27 0:00 tail -f /usr/local/slurm26/var/log/slurm.log
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1602 0.0 0.0 0 0 ? S 13:33 0:00 [flush-0:19]
root 2521 0.3 0.0 472692 7020 ? Sl 13:33 0:00 slurmstepd: [253140.0]
root 3071 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
root 3222 2.0 0.0 110224 1196 pts/2 R+ 13:34 0:00 ps -aux
root 3223 0.0 0.0 103232 900 pts/2 S+ 13:34 0:00 grep 13:
root 116560 0.0 0.0 96508 3812 ? Ss 13:18 0:00 sshd: root@pts/2
root 116562 0.0 0.0 108428 2032 pts/2 Ss 13:18 0:00 -bash
root 120048 0.0 0.0 0 0 ? S 13:25 0:00 [flush-lustre-2]
root 121182 0.0 0.0 100928 648 pts/1 S+ 13:27 0:00 tail -f /usr/local/slurm26/var/log/slurm.log
-----------------------------------------

I don't know what exactly made the shell script go defunct, but I suppose it was something related to cpusets, because sh processes are the only children of slurmstepd that may appear at that point. I think that if we move to cgroups we will not have this problem anymore, and since we were going to do that anyway, it's fine with us. On the other hand, I imagine there should be a way to fix the cpuset version as well. What do you think?

Yiannis

Here is some more info:
there is no DBD_STEP_COMPLETE for the step as you can see from the log of slurmdbd
--------------------------
[2013-06-28T09:36:54.173] debug2: DBD_JOB_START: START CALL ID:253174 NAME:DRESDE INX:0
[2013-06-28T09:36:54.173] debug2: as_mysql_slurmdb_job_start() called
[2013-06-28T09:36:54.173] debug3: 10(as_mysql_job.c:509) query
insert into "taur_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, mem_req, account, partition, node_inx) values (253174, 4, 1, 0, 2054944, 200026, 'taurusi[1001-1270]', 0, 1440, 1372405012, 1372405012, 1372405012, 'DRESDE', 0, 1, 29347, 4320, 4320, 270, 0, 'bull', 'mpi2', '0-269') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=2054944, id_group=200026, nodelist='taurusi[1001-1270]', id_resv=0, timelimit=1440, time_submit=1372405012, time_start=1372405012, job_name='DRESDE', track_steps=0, id_qos=1, state=greatest(state, 1), priority=29347, cpus_req=4320, cpus_alloc=4320, nodes_alloc=270, mem_req=0, account='bull', partition='mpi2', node_inx='0-269'
[2013-06-28T09:36:56.743] debug2: DBD_STEP_START: ID:253174.0 NAME:memory_bandwidth SUBMIT:1372405012
[2013-06-28T09:36:56.743] debug3: 10(as_mysql_job.c:946) query
insert into "taur_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq) values (273452, 0, 1372405016, 'memory_bandwidth', 1, 4320, 270, 4320, 'taurusi[1001-1270]', '0-269', 1, 2901000) on duplicate key update cpus_alloc=4320, nodes_alloc=270, task_cnt=4320, time_end=0, state=1, nodelist='taurusi[1001-1270]', node_inx='0-269', task_dist=1, req_cpufreq=2901000
[2013-06-28T09:36:59.213] debug2: DBD_JOB_START: ELIGIBLE CALL ID:253175 NAME:DRESDE
[2013-06-28T09:36:59.213] debug2: as_mysql_slurmdb_job_start() called
[2013-06-28T09:36:59.213] debug3: 10(as_mysql_job.c:509) query
insert into "taur_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, mem_req, account, partition) values (253175, 4, 1, 0, 2054944, 200026, 'None assigned', 0, 1440, 1372405014, 1372405014, 0, 'DRESDE', 0, 0, 29347, 4320, 0, 0, 0, 'bull', 'mpi2') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=2054944, id_group=200026, nodelist='None assigned', id_resv=0, timelimit=1440, time_submit=1372405014, time_start=0, job_name='DRESDE', track_steps=0, id_qos=1, state=greatest(state, 0), priority=29347, cpus_req=4320, cpus_alloc=0, nodes_alloc=0, mem_req=0, account='bull', partition='mpi2'
[2013-06-28T09:37:56.073] debug2: DBD_STEP_START: ID:253174.4294967294 NAME:batch SUBMIT:1372405012
[2013-06-28T09:37:56.073] debug3: 10(as_mysql_job.c:946) query
insert into "taur_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq) values (273452, -2, 1372405012, 'batch', 1, 1, 1, 1, 'taurusi1001', '0', 0, 0) on duplicate key update cpus_alloc=1, nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='taurusi1001', node_inx='0', task_dist=0, req_cpufreq=0
[2013-06-28T09:37:56.113] debug2: DBD_STEP_COMPLETE: ID:253174.4294967294 SUBMIT:1372405012
[2013-06-28T09:37:56.113] debug3: 10(as_mysql_job.c:1127) query
update "taur_step_table" set time_end=1372405076, state=3, kill_requid=-1, exit_code=0, user_sec=1, user_usec=780000, sys_sec=2, sys_usec=80000, max_disk_read=135.214800, max_disk_read_task=0, max_disk_read_node=0, ave_disk_read=135.214800, max_disk_write=0.723989, max_disk_write_task=0, max_disk_write_node=0, ave_disk_write=0.723989, max_vsize=4060804, max_vsize_task=0, max_vsize_node=0, ave_vsize=1636840.000000, max_rss=214788, max_rss_task=0, max_rss_node=0, ave_rss=8512.000000, max_pages=19, max_pages_task=0, max_pages_node=0, ave_pages=19.000000, min_cpu=0, min_cpu_task=0, min_cpu_node=0, ave_cpu=0.000000, act_cpufreq=2901000, consumed_energy=14302 where job_db_inx=273452 and id_step=-2
[2013-06-28T09:37:56.113] debug2: DBD_JOB_COMPLETE: ID:253174
[2013-06-28T09:37:56.114] debug2: as_mysql_slurmdb_job_complete() called
[2013-06-28T09:37:56.114] debug3: 10(as_mysql_job.c:799) query
update "taur_job_table" set time_end=1372405076, state=3, nodelist='taurusi[1001-1270]', derived_ec=0, exit_code=0, kill_requid=-1 where job_db_inx=273452;
--------------------------------------
and slurmstepd is frozen for 60 sec waiting for the other nodes to send terminate (during this time those nodes are in the Completing state)
-----------------------------------------
root@taurusi1001:~> ps -aux|grep 09:
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 71809 0.0 0.0 96508 3804 ? Ss 09:05 0:00 sshd: root@pts/0
root 71811 0.0 0.0 108428 2076 pts/0 Ss 09:05 0:00 -bash
root 71956 0.0 0.0 95736 3844 ? Ss 09:05 0:00 sshd: root@pts/1
root 71958 0.0 0.0 108428 2068 pts/1 Ss 09:05 0:00 -bash
root 72632 0.3 0.0 3586316 12468 ? Sl 09:07 0:05 /usr/local/slurm26/sbin/slurmd
root 83747 0.0 0.1 151176 33780 pts/1 S+ 09:22 0:00 vi /usr/local/slurm26/var/log/slurm.log
root 85943 0.0 0.0 0 0 ? S 09:28 0:00 [flush-lustre-2]
root 88365 0.0 0.0 0 0 ? S 09:34 0:00 [flush-0:19]
root 90270 0.2 0.0 472824 7140 ? Sl 09:36 0:00 slurmstepd: [253174.0]
root 90765 0.0 0.0 0 0 ? Z 09:37 0:00 [sh] <defunct>
root 91003 3.0 0.0 110220 1196 pts/0 R+ 09:38 0:00 ps -aux
root 91004 0.0 0.0 103232 900 pts/0 S+ 09:38 0:00 grep 09:
root@taurusi1001:~> gstack 90270
Thread 3 (Thread 0x2b6f2d29e700 (LWP 90277)):
#0 0x00002b6f29a9efc3 in poll () from /lib64/libc.so.6
#1 0x0000000000433705 in eio_handle_mainloop ()
#2 0x0000000000432884 in _msg_thr_internal ()
#3 0x00002b6f297ab851 in start_thread () from /lib64/libpthread.so.0
#4 0x00002b6f29aa867d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2b6f2d09c700 (LWP 90755)):
#0 0x00002b6f29a6c91d in nanosleep () from /lib64/libc.so.6
#1 0x00002b6f29a6c790 in sleep () from /lib64/libc.so.6
#2 0x0000000000426697 in _kill_thr ()
#3 0x00002b6f297ab851 in start_thread () from /lib64/libpthread.so.0
#4 0x00002b6f29aa867d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b6f29f5db20 (LWP 90270)):
#0 0x00002b6f297af7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000042678b in _wait_for_children_slurmstepd ()
#2 0x00000000004275f4 in job_manager ()
#3 0x000000000042421d in main ()
----------------------------------------
Once the 60-sec timeout is reached, it should normally follow the usual procedure up to the DBD_STEP_COMPLETE and also finish the writes to the HDF5 files, but it doesn't. As a result we don't get the aggregated jobacctinfo in the DB and the HDF5 files are incomplete... we also get a bunch of the following errors in the slurm-253174.out file:
---------------------------
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5G.c line 612 in H5Gget_info(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5L.c line 1093 in H5Lget_name_by_idx(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5L.c line 1093 in H5Lget_name_by_idx(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5L.c line 1093 in H5Lget_name_by_idx(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
...........
......
-----------------------
I'm wondering whether the _kill_thr() that I see in the gstack is killing the wrong threads or something like that. I'll continue to investigate the code.
Let me know if you have any ideas
Yiannis, well this is good information. Do you happen to have the slurmctld and slurmd logs for this job?
Here is the output of some of the jobs I've run: from what we see, 253194 and 253198 had the problem (I print only some of the values for the example)
--------------------------------------
root@taurusadmin1:~> /usr/local/slurm26/bin/sacct -S 2013-06-28T09:00:00 --format=jobid,MaxPages,Nnodes,ConsumedEnergy,MaxDiskRead,Elapsed,State
JobID MaxPages NNodes ConsumedEnergy MaxDiskRead Elapsed State
------------ -------- -------- -------------- ------------ ---------- ----------
253194 270 00:01:03 COMPLETED
253194.batc+ 19K 1 14388 135M 00:01:03 COMPLETED
253194.0 0 270 0 0 00:01:00 COMPLETED
253195 270 00:01:11 COMPLETED
253195.batc+ 40K 1 14592 135M 00:01:11 COMPLETED
253195.0 49K 270 3518593 3M 00:01:07 COMPLETED
253196 270 00:01:12 COMPLETED
253196.batc+ 86K 1 16055 135M 00:01:12 COMPLETED
253196.0 46K 270 3600626 3M 00:01:08 COMPLETED
253197 270 00:01:05 COMPLETED
253197.batc+ 19K 1 14628 135M 00:01:05 COMPLETED
253197.0 46K 270 3600831 3M 00:01:01 COMPLETED
253198 270 00:01:03 COMPLETED
253198.batc+ 57K 1 15049 135M 00:01:03 COMPLETED
253198.0 0 270 0 0 00:00:59 COMPLETED
253199 270 00:01:03 COMPLETED
253199.batc+ 50K 1 14416 135M 00:01:03 COMPLETED
253199.0 46K 270 3398440 3M 00:00:59 COMPLETED
--------------------------------------
I'm attaching the slurmd log of 253194 and the slurmctld log of 253198, because I don't have the slurmctld log for 253194 (but it should be the same thing). It is interesting to see in the slurmd log that we continue to get messages for all the problematic steps a long time after the job and the step have finished and other jobs have been executed on the node...
all the jobids that you see here are jobs that had the problem of 0 values in accounting:
[2013-06-28T10:40:37.876] debug4: found jobid = 253192, stepid = 0
[2013-06-28T10:40:37.876] debug4: found jobid = 253186, stepid = 0
[2013-06-28T10:40:37.876] debug4: found jobid = 253194, stepid = 0
[2013-06-28T10:40:37.876] debug: Cleaned up stray socket /usr/local/slurm26/var/log/slurm/spool_slurmd//taurusi1001_253192.0
[2013-06-28T10:40:37.876] debug: Cleaned up stray socket /usr/local/slurm26/var/log/slurm/spool_slurmd//taurusi1001_253186.0
[2013-06-28T10:40:37.877] debug: Cleaned up stray socket /usr/local/slurm26/var/log/slurm/spool_slurmd//taurusi1001_253194.0
Created attachment 320 [details]
slurmcltd log of 253198
Created attachment 321 [details]
slurmd log of 253194
It appears I have reproduced this! Looks like it is related to the job being aborted. The patch (hack) below will make every job be aborted and not handle the sending of information. I'll see what I can do for aborted jobs like this.
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -1106,7 +1106,7 @@ fail1:
_send_launch_resp(job, rc);
}
- if (job->aborted)
+ if (!job->aborted)
info("job_manager exiting with aborted job");
else if (!job->batch && (step_complete.rank > -1)) {
_wait_for_children_slurmstepd(job);
OK, try commit ee125a47250478df25d6b7c6a00027bb62ba0140 or just pull from the slurm-2.6 branch. If what I found is the problem, this should fix it. Let me know what you find.
Thanks Danny! I'll test this Monday morning and let you know
Yiannis
Danny, we have applied the new version with this patch, but unfortunately the bug still appears on the Dresden cluster... We have confirmed that the error appears only with profiling activated, with the following variations:
Jobs 1824196 – 1824199 were run with “--profile=all --acctg-freq=3”
Job 1824206 was run with “--profile=all --acctg-freq=energy=3”
$ sacct -u bull --name=mem_bw --format=jobid,jobname,nnodes,consumedenergy,avecpu,averss,elapsed,state | grep -e "\.0"
1824196.0 memory_ba+ 270 0 00:03:55 64984K 00:04:14 COMPLETED
1824198.0 memory_ba+ 270 72 00:03:56 60801K 00:04:14 COMPLETED
1824199.0 memory_ba+ 270 0 00:03:54 62055K 00:04:12 COMPLETED
1824206.0 memory_ba+ 270 0 00:00:00 0 00:04:16 COMPLETED
We see that the error is directly related to energy profiling... If we profile only task and not energy, then we get results for all accounting info except energy (jobs 1824196 – 1824199), except for the strange 72!
On the other hand, if we profile task and energy, we get no results for any of the job accounting fields. Here is a small log (verbosity 3) from the first compute node that shows jobs 1824199 and 1824206; no error appeared for 1824199, only for 1824206:
[2013-07-03T10:55:42.015] Launching batch job 1824199 for UID 2054944
[2013-07-03T10:55:42.026] init: Gres GPU plugin loaded
[2013-07-03T10:55:42.026] Received cpu frequency information for 32 cpus
[2013-07-03T10:55:42.538] launch task 1824199.0 request from 2054944.200026@172.26.72.190 (port 33436)
[2013-07-03T10:55:42.550] init: Gres GPU plugin loaded
[2013-07-03T10:55:42.550] Received cpu frequency information for 32 cpus
[2013-07-03T10:59:54.594] [1824199] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-07-03T10:59:54.599] [1824199] done with job
[2013-07-03T10:59:54.741] [1824199.0] done with job
[2013-07-03T11:00:05.336] Launching batch job 1824206 for UID 2054944
[2013-07-03T11:00:05.347] init: Gres GPU plugin loaded
[2013-07-03T11:00:05.347] Received cpu frequency information for 32 cpus
[2013-07-03T11:00:05.863] launch task 1824206.0 request from 2054944.200026@172.26.72.190 (port 64671)
[2013-07-03T11:00:05.874] init: Gres GPU plugin loaded
[2013-07-03T11:00:05.874] Received cpu frequency information for 32 cpus
[2013-07-03T11:04:21.045] [1824206] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-07-03T11:04:21.049] [1824206] done with job
[2013-07-03T11:05:19.123] [1824206.0] Abandoning IO 60 secs after job shutdown initiated
[2013-07-03T11:05:28.008] [1824206.0] done with job
[2013-07-03T11:05:28.010] error: stepd_connect to 1824206.0 failed: Connection reset by peer
[2013-07-03T11:05:28.010] error: stepd_connect to 1824206.0 failed: Connection reset by peer
[2013-07-03T11:05:28.010] error: stepd_connect to 1824206.0 failed: Connection reset by peer
[2013-07-03T11:05:29.013] error: stepd_connect to 1824206.0 failed: Connection refused
[2013-07-03T11:05:29.013] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:29.013] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:30.023] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:30.023] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:30.023] error: stepd_connect to 1824206.0 failed: No such file or directory
Let me know if you find anything, and I'll try to make other tests with higher verbosity in the logs.
Thanks
Yiannis
Yiannis, I was concerned that what I found wasn't the real problem, since your logs never show the warning message I saw. Could you send the sacct for the batch step of 1824206 as well? It just seems so strange that the problem would be related to profiling.
Here you are, I included it for all of them:
root@taurusadmin1:~> sacct -j 1824196,1824197,1824198,1824199,1824206 --format=jobid,jobname,nnodes,consumedenergy,avecpu,averss,elapsed,state
JobID JobName NNodes ConsumedEnergy AveCPU AveRSS Elapsed State
------------ ---------- -------- -------------- ---------- ---------- ---------- ----------
1824196 mem_bw 270 00:04:14 COMPLETED
1824196.bat+ batch 1 6.08M 00:00:00 8684K 00:04:14 COMPLETED
1824196.0 memory_ba+ 270 0 00:03:55 64984K 00:04:14 COMPLETED
1824197 J2ant2thi+ 24 00:00:02 COMPLETED
1824197.bat+ batch 1 25.34M 00:00:00 8084K 00:00:02 COMPLETED
1824197.0 gpaw-pyth+ 24 530.87M 00:00:00 872K 00:00:02 FAILED
1824198 mem_bw 270 00:04:14 COMPLETED
1824198.bat+ batch 1 6.13M 00:00:00 8716K 00:04:14 COMPLETED
1824198.0 memory_ba+ 270 72 00:03:56 60801K 00:04:14 COMPLETED
1824199 mem_bw 270 00:04:12 COMPLETED
1824199.bat+ batch 1 6.19M 00:00:00 8712K 00:04:12 COMPLETED
1824199.0 memory_ba+ 270 0 00:03:54 62055K 00:04:12 COMPLETED
1824206 mem_bw 270 00:04:16 COMPLETED
1824206.bat+ batch 1 6.25M 00:00:00 8672K 00:04:16 COMPLETED
1824206.0 memory_ba+ 270 0 00:00:00 0 00:04:16 COMPLETED
Yiannis, I believe I am finding a common symptom here, and perhaps the cause of the bug. One thing common to all the steps you have sent us logs for is a timeout waiting for a rank, something like this in the slurmd log:
[2013-06-13T22:54:37.000] [705800.0] Rank 0 timed out waiting for 114 (of 269) children
The message for step completion is only sent to the database when a step fully completes. My current theory is that this timeout leaves a hole in the step-completion logic, so the step never fully completes and is thus lost for accounting purposes. I will try to recreate this and report back what I find, but it seems like at least a plausible scenario for causing this. What would be interesting is looking at the node that timed out to see what it is doing, and what happened to the timed-out ranks, or whether they tried to send the message but were rejected for some reason. You should see messages like "sending complete to slurmctld instead" in the slurmd log of the timing-out node, but perhaps something more is hosed and they never finish.
Well, I do find these kinds of messages in the slurmd log that was attached some days ago:
[2013-06-28T10:35:25.000] [253194.0] Rank 0 timed out waiting for 171 (of 269) children
[2013-06-28T10:35:25.000] [253194.0] _one_step_complete_msg: first=58, last=114
[2013-06-28T10:35:25.000] [253194.0] Rank 0 sending complete to slurmctld, range 58 to 114
[2013-06-28T10:35:25.002] [253194.0] _one_step_complete_msg: first=229, last=269
[2013-06-28T10:35:25.002] [253194.0] Rank 0 sending complete to slurmctld, range 229 to 269
[2013-06-28T10:35:25.004] [253194.0] _one_step_complete_msg: first=0, last=0
[2013-06-28T10:35:25.004] [253194.0] Rank 0 sending complete to slurmctld, range 0 to 0
[2013-06-28T10:35:25.006] [253194.0] done with job
but I'm not sure if this is the message you are talking about...?
Exactly, I was referencing the attached log files when I wrote my last response.
Seeing the slurmd log file from the other node that timed out would help show what is going on. So what I am requesting is this (old logs are fine, even preferred), at "debug" level at least, though "debug3" would probably be the most help: the slurmd log from the head node, the slurmd log from the timeout node, and the slurmctld log for the job. If you had the other slurmd logs for 253194 that would help, but I really would like a clear picture, and it doesn't sound like you have the slurmctld logs for that job.
Yiannis, have you been able to get those log files?
We have been able to reproduce this problem with a carefully placed sleep in the slurmstepd code (see below):
diff --git a/src/slurmd/slurmstepd/mgr.c b/src/slurmd/slurmstepd/mgr.c
index dd9f8ce..7c63c61 100644
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -741,13 +741,14 @@ _one_step_complete_msg(slurmd_job_t *job, int first, int l
* can be built with out the hostlist from the credential.
*/
if (step_complete.parent_rank != -1) {
- debug3("Rank %d sending complete to rank %d, range %d to %d",
+ info("Rank %d sending complete to rank %d, range %d to %d",
step_complete.rank, step_complete.parent_rank,
first, last);
/* On error, pause then try sending to parent again.
* The parent slurmstepd may just not have started yet, because
* of the way that the launch message forwarding works.
*/
+sleep(300);
for (i = 0; i < REVERSE_TREE_PARENT_RETRY; i++) {
if (i)
sleep(1);
A bit more detail: the accounting record will be filled in after the sleep, unless the slurmstepd on the sleeping node (anything other than node zero) is killed. That means when the slurmctld asks for the status of the job on that node, the slurmd says the job is done, but the slurmstepd never sent the accounting record for its ranks, so the step accounting record is never filled in. Since the slurmctld has confirmed there are no remnants of the job on any node, the job and step records are deleted. The best solution is probably to record the step accounting information for the ranks we do have information about: a partial accounting record.
The commit below should fix the problem. It will be included in version 2.6.0. Please re-open this trouble ticket if the problem continues.
https://github.com/SchedMD/slurm/commit/26a8d2e6ba590cf1d8b9215e543a6726508b3ae4
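The "partial accounting record" idea can be sketched as follows. This is illustrative C only, not the actual Slurm jobacctinfo code; the struct fields and function name are invented. The point is to fold in whatever rank records did arrive instead of losing the whole step when some ranks never report:

```c
/* Hypothetical sketch of partial step-accounting aggregation.
 * Not Slurm source; all identifiers are invented for illustration. */
#include <assert.h>
#include <stdint.h>

struct rank_acct {
	int      reported;   /* 0 if this rank's record never arrived */
	uint64_t max_rss;    /* KB */
	uint64_t energy;     /* Joules */
};

struct step_acct {
	int      ranks_seen; /* how many ranks contributed to the record */
	uint64_t max_rss;    /* max over reporting ranks */
	uint64_t tot_energy; /* sum over reporting ranks */
};

/* Aggregate only the records we have; missing ranks are skipped
 * rather than zeroing out the entire step record. */
static void aggregate_partial(struct step_acct *step,
			      const struct rank_acct *ranks, int n)
{
	for (int i = 0; i < n; i++) {
		if (!ranks[i].reported)
			continue;
		step->ranks_seen++;
		if (ranks[i].max_rss > step->max_rss)
			step->max_rss = ranks[i].max_rss;
		step->tot_energy += ranks[i].energy;
	}
}
```

With this approach the step record is merely incomplete (fewer ranks seen) rather than absent, matching the behavior described in the fix.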
I confirm that your patch corrects the problem on a large configuration.
In the sacct output that follows, the patch was applied before the start of job 46, and as you can see there are no 0 values after that job.
Thanks a lot, that was a difficult catch!!
JobID JobName NNodes ConsumedEnergy AveRSS Elapsed State
------------ ---------- -------- -------------- ---------- ---------- ----------
37 DRESDE 2 00:00:00 CANCELLED+
38 DRESDE 270 00:10:17 COMPLETED
38.batch batch 1 54895 17508K 00:10:17 COMPLETED
38.0 memory_ba+ 270 15201640 22445K 00:10:17 COMPLETED
39 DRESDE 270 00:10:23 COMPLETED
39.batch batch 1 55685 17436K 00:10:23 COMPLETED
39.0 memory_ba+ 270 15483430 22153K 00:10:23 COMPLETED
40 DRESDE 270 00:10:16 COMPLETED
40.batch batch 1 54794 17452K 00:10:16 COMPLETED
40.0 memory_ba+ 270 0 0 00:10:15 COMPLETED
41 DRESDE 270 00:10:22 COMPLETED
41.batch batch 1 55844 17460K 00:10:22 COMPLETED
41.0 memory_ba+ 270 15390683 21775K 00:10:21 COMPLETED
42 DRESDE 270 00:10:17 COMPLETED
42.batch batch 1 54895 17484K 00:10:17 COMPLETED
42.0 memory_ba+ 270 0 0 00:10:17 COMPLETED
43 DRESDE 270 00:10:19 COMPLETED
43.batch batch 1 55413 17492K 00:10:19 COMPLETED
43.0 memory_ba+ 270 0 0 00:10:19 COMPLETED
44 DRESDE 270 00:10:18 COMPLETED
44.batch batch 1 61690 17348K 00:10:18 COMPLETED
44.0 memory_ba+ 270 0 0 00:10:18 COMPLETED
45 DRESDE 300 05:56:21 CANCELLED+
45.batch batch 1 3952902 19504K 05:56:21 CANCELLED
45.0 memory_ba+ 300 0 0 05:56:21 CANCELLED
46 DRESDE 270 00:10:20 COMPLETED
46.batch batch 1 55439 17464K 00:10:20 COMPLETED
46.0 memory_ba+ 270 15304438 21841K 00:11:40 COMPLETED
47 DRESDE 270 00:10:20 COMPLETED
47.batch batch 1 61217 17492K 00:10:20 COMPLETED
47.0 memory_ba+ 270 15371691 20261K 00:10:20 COMPLETED
48 DRESDE 270 00:10:17 COMPLETED
48.batch batch 1 54798 17600K 00:10:17 COMPLETED
48.0 memory_ba+ 270 15052102 20535K 00:11:41 COMPLETED
49 DRESDE 270 00:10:17 COMPLETED
49.batch batch 1 55284 17312K 00:10:17 COMPLETED
49.0 memory_ba+ 270 15152236 22634K 00:11:41 COMPLETED
50 DRESDE 270 00:10:17 COMPLETED
50.batch batch 1 56720 17372K 00:10:17 COMPLETED
50.0 memory_ba+ 270 15039470 21981K 00:11:41 COMPLETED
51 DRESDE 270 00:10:19 COMPLETED
51.batch batch 1 55105 17488K 00:10:19 COMPLETED
51.0 memory_ba+ 270 15149472 18748K 00:11:43 COMPLETED
52 DRESDE 270 00:10:19 COMPLETED
52.batch batch 1 60902 17464K 00:10:19 COMPLETED
52.0 memory_ba+ 270 15307016 21117K 00:10:19 COMPLETED
53 DRESDE 270 00:10:17 COMPLETED
53.batch batch 1 54660 17468K 00:10:17 COMPLETED
53.0 memory_ba+ 270 15233380 18782K 00:10:17 COMPLETED
54 DRESDE 270 00:10:17 COMPLETED
54.batch batch 1 54944 17368K 00:10:17 COMPLETED
54.0 memory_ba+ 270 15190992 20565K 00:10:15 COMPLETED
55 DRESDE 270 00:10:18 COMPLETED
55.batch batch 1 54746 17480K 00:10:18 COMPLETED
55.0 memory_ba+ 270 14729504 21241K 00:11:45 COMPLETED
56 DRESDE 270 00:10:18 COMPLETED
56.batch batch 1 60888 17472K 00:10:18 COMPLETED
56.0 memory_ba+ 270 15237551 21755K 00:11:42 COMPLETED
57 DRESDE 270 00:10:17 COMPLETED
57.batch batch 1 54959 17428K 00:10:17 COMPLETED
57.0 memory_ba+ 270 15153619 21780K 00:11:29 COMPLETED
58 DRESDE 270 00:10:17 COMPLETED
58.batch batch 1 60446 17480K 00:10:17 COMPLETED
58.0 memory_ba+ 270 15030168 20568K 00:11:42 COMPLETED
59 DRESDE 270 00:10:17 COMPLETED
59.batch batch 1 55045 17484K 00:10:17 COMPLETED
59.0 memory_ba+ 270 14802597 20895K 00:11:46 COMPLETED
60 DRESDE 270 00:10:19 COMPLETED
60.batch batch 1 61642 17492K 00:10:19 COMPLETED
60.0 memory_ba+ 270 15294558 21427K 00:11:43 COMPLETED
61 DRESDE 270 00:10:17 COMPLETED
61.batch batch 1 54876 17476K 00:10:17 COMPLETED
61.0 memory_ba+ 270 15194811 19843K 00:11:41 COMPLETED
62 DRESDE 270 00:10:23 COMPLETED
62.batch batch 1 55578 17452K 00:10:23 COMPLETED
62.0 memory_ba+ 270 15454512 18950K 00:10:22 COMPLETED
63 DRESDE 270 00:10:17 COMPLETED
63.batch batch 1 54756 17540K 00:10:17 COMPLETED
63.0 memory_ba+ 270 15159040 20682K 00:11:41 COMPLETED
64 DRESDE 270 00:10:20 COMPLETED
64.batch batch 1 61315 17328K 00:10:20 COMPLETED
64.0 memory_ba+ 270 15252129 20788K 00:11:44 COMPLETED
65 DRESDE 270 00:10:19 COMPLETED
65.batch batch 1 55187 17504K 00:10:19 COMPLETED
65.0 memory_ba+ 270 15327549 19719K 00:10:19 COMPLETED
Hello,
Execution of Linpack for a small run gives the following results in accounting, which are correct:
root@taurusadmin1:~> sacct -j 492396 --format=jobid,jobname,MaxVMSize,MaxRSS,AveRSS,MaxPages,Nnodes,AveCPUFreq,ReqCPUFreq,ConsumedEnergy,MaxDiskRead,Elapsed
JobID JobName MaxVMSize MaxRSS AveRSS MaxPages NNodes AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead Elapsed
------------ ---------- ---------- ---------- ---------- -------- -------- ---------- ---------- -------------- ------------ ----------
492396 hpl 18 00:28:00
492396.batc+ batch 1508276K 7348K 7348K 0 1 2901000 0 626114 10M 00:28:00
492396.0 xhpl 27982672K 27500272K 27427238K 2K 18 78417 2901000 10560917 0.14M 00:28:00
Whereas for a long run we get memory, cpu frequency, disk, and consumed energy = 0 for the step:
root@taurusadmin1:~> sacct -j 492196 --format=jobid,jobname,MaxVMSize,MaxRSS,AveRSS,MaxPages,Nnodes,AveCPUFreq,ReqCPUFreq,ConsumedEnergy,MaxDiskRead,Elapsed
JobID JobName MaxVMSize MaxRSS AveRSS MaxPages NNodes AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead Elapsed
------------ ---------- ---------- ---------- ---------- -------- -------- ---------- ---------- -------------- ------------ ----------
492196 hpl 270 01:51:14
492196.batc+ batch 4346892K 67912K 10764K 21K 1 2901000 0 2.42M 554M 01:51:14
492196.0 xhpl 0 0 0 0 270 0 2901000 0 0 01:51:10
I'm wondering if the problem comes from the fact that, since the values would be much larger, there was an issue during a conversion. I don't think they overflowed the uint32 in which we initially store all those values, but there could be something related to this, because I couldn't find any other similarity between those values. I need to do more testing to be certain of the conditions, but I wanted to report this in case you have ever seen it before.
Thanks,
Yiannis
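For reference, the kind of 32-bit wrap-around being suspected here looks like the sketch below. The node counts, wattages, and durations are illustrative only, not taken from these jobs:

```c
/* Illustration of how a joules total that fits in uint64_t silently
 * wraps modulo 2^32 when accumulated in a 32-bit counter. */
#include <assert.h>
#include <stdint.h>

/* Total energy for `nodes` nodes drawing `watts` each for `secs` seconds,
 * computed in 32-bit arithmetic (wraps modulo 2^32). */
static uint32_t energy_u32(uint32_t nodes, uint32_t watts, uint32_t secs)
{
	return nodes * watts * secs;
}

/* The same total computed in 64-bit arithmetic (no wrap for these sizes). */
static uint64_t energy_u64(uint64_t nodes, uint64_t watts, uint64_t secs)
{
	return nodes * watts * secs;
}
```

For example, 270 nodes at 300 W for 60000 s is 4,860,000,000 J, which exceeds 2^32 (about 4.29e9), so the 32-bit result wraps to a much smaller number while the 64-bit result is correct.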