| Summary: | Job accounting values turn wrong for long runs or high resource utilization | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Yiannis Georgiou <yiannis.georgiou> |
| Component: | Accounting | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | Highest | ||
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Universitat Dresden (Germany) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | error log for jobid 705800; slurmctld log of 253198; slurmd log of 253194 | ||
Yiannis, make sure the job isn't requesting --acctg_freq=0 either, like they are doing in bug 300. That will turn off polling and you may get what you are getting. I am guessing this may be the situation. If it isn't, can you run the bad test with debug2 on the slurmstepd? You will get the list of memory and CPU per poll. I am guessing at some time during the run things will just go to 0. If it doesn't go bad there but it does in the database, then the problem lies there. Running the slurmdbd with debug3 will give us the SQL we use to put the values there.

Well, I'm pretty sure they haven't used --acctg-freq=0 for that job. Profiling is activated and we get correct results in the HDF5 files. In bug 300 I was using --acctg-freq=0 to improve throughput because I was launching >20K jobs to test the throughput. I'll increase the debugging, run a Linpack to get more info, and let you know.

OK, so here you are. I've launched some Linpacks and the problem appears only if profiling is activated: 3 out of 11 Linpacks with profiling activated gave accounting results = 0. I show you one with the error (705800) and one without (705801):
$ sacct -u bull -S 2013-06-13T22:40:00 --format=jobid,MaxVMSize,MaxRSS,AveRSS,MaxPages,Nnodes,AveCPUFreq,ReqCPUFreq,ConsumedEnergy,MaxDiskRead,Elapsed
JobID MaxVMSize MaxRSS AveRSS MaxPages NNodes AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead Elapsed
------------ ---------- ---------- ---------- -------- -------- ---------- ---------- -------------- ------------ ----------
705800 270 00:00:39
705800.batc+ 1641180K 56808K 8648K 19K 1 2901000 0 7457 240M 00:00:39
705800.0 0 0 0 0 270 0 2901000 0 0 00:00:37
705801 270 00:00:38
705801.batc+ 3009412K 147460K 8620K 19K 1 2901000 0 7152 240M 00:00:38
705801.0 1115884K 923216K 355977K 46K 270 226558 2901000 1611560 3M 00:00:36
705802 270 00:00:38
Here are the logs of the first compute node for the job that gave correct results:
[2013-06-13T22:55:31.297] [705801.0] uid = 0
[2013-06-13T22:55:31.297] [705801.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:55:31.297] [705801.0] Entering _handle_request
[2013-06-13T22:55:31.297] [705801.0] Leaving _handle_accept
[2013-06-13T22:55:31.297] [705801.0] Rank 0 waiting for 57 (of 269) children
[2013-06-13T22:55:31.375] debug3: in the service_connection
[2013-06-13T22:55:31.376] debug2: got this type of message 5016
[2013-06-13T22:55:31.376] debug3: Entering _rpc_step_complete
[2013-06-13T22:55:31.376] [705801.0] Called _msg_socket_accept
[2013-06-13T22:55:31.376] [705801.0] Leaving _msg_socket_accept
[2013-06-13T22:55:31.376] [705801.0] Called _msg_socket_readable
[2013-06-13T22:55:31.376] [705801.0] Entering _handle_accept (new thread)
[2013-06-13T22:55:31.376] [705801.0] Identity: uid=0, gid=0
[2013-06-13T22:55:31.376] [705801.0] Entering _handle_request
[2013-06-13T22:55:31.377] debug: Entering stepd_completion, range_first = 172, range_last = 228
[2013-06-13T22:55:31.377] [705801.0] Got request 18
[2013-06-13T22:55:31.377] [705801.0] Handling REQUEST_STEP_COMPLETION_V2
[2013-06-13T22:55:31.377] [705801.0] _handle_completion for job 705801.0
[2013-06-13T22:55:31.377] [705801.0] uid = 0
[2013-06-13T22:55:31.377] [705801.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:55:31.377] [705801.0] Entering _handle_request
[2013-06-13T22:55:31.377] [705801.0] Rank 0 got all children completions
[2013-06-13T22:55:31.377] [705801.0] Leaving _handle_accept
[2013-06-13T22:55:31.377] [705801.0] _one_step_complete_msg: first=0, last=269
[2013-06-13T22:55:31.377] [705801.0] Rank 0 sending complete to slurmctld, range 0 to 269
[2013-06-13T22:55:31.381] [705801.0] Called _msg_socket_readable
[2013-06-13T22:55:31.381] [705801.0] false, shutdown
[2013-06-13T22:55:31.381] [705801.0] Message thread exited
[2013-06-13T22:55:31.381] [705801.0] done with job
[2013-06-13T22:55:31.385] [705801] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:55:31.396] debug3: in the service_connection
[2013-06-13T22:55:31.396] debug2: got this type of message 1019
[2013-06-13T22:55:31.396] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2013-06-13T22:55:31.396] _get_joules_node_ipmi = consumed 4804842 Joules
[2013-06-13T22:55:31.397] [705801] _get_joules_task: consumed 7152 Joules (received 2646(300 watts) from slurmd)
[2013-06-13T22:55:31.397] [705801] getjoules_task energy = 7152
[2013-06-13T22:55:31.397] [705801] removing task 0 pid 56142 from jobacct
[2013-06-13T22:55:31.397] [705801] task 0 (56142) exited with exit code 0.
[2013-06-13T22:55:31.397] [705801] affinity task_post_term: 705801.4294967294, task 0
[2013-06-13T22:55:31.397] [705801] Aggregated 1 task exit messages
[2013-06-13T22:55:31.397] [705801] sending task exit msg for 1 tasks
[2013-06-13T22:55:31.397] [705801] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:55:31.408] [705801] cpu_freq_reset: #cpus reset = 0
[2013-06-13T22:55:31.409] [705801] Before call to spank_fini()
[2013-06-13T22:55:31.409] [705801] After call to spank_fini()
[2013-06-13T22:55:31.409] [705801] Called _msg_socket_readable
[2013-06-13T22:55:31.409] [705801] false, shutdown
[2013-06-13T22:55:31.409] [705801] Message thread exited
[2013-06-13T22:55:31.410] [705801] job 705801 completed with slurm_rc = 0, job_rc = 0
[2013-06-13T22:55:31.410] [705801] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-06-13T22:55:31.413] [705801] done with job
We see that the function _one_step_complete_msg is called once, and the step finishes before the job.
And here are the logs of the job that gave wrong values in accounting:
[2013-06-13T22:53:28.849] [705800.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:53:28.849] [705800.0] Entering _handle_request
[2013-06-13T22:53:28.849] [705800.0] Rank 0 waiting for 155 (of 269) children
[2013-06-13T22:53:28.849] [705800.0] Leaving _handle_accept
[2013-06-13T22:53:28.877] debug3: in the service_connection
[2013-06-13T22:53:28.877] debug2: got this type of message 5016
[2013-06-13T22:53:28.877] debug3: Entering _rpc_step_complete
[2013-06-13T22:53:28.878] [705800.0] Called _msg_socket_accept
[2013-06-13T22:53:28.878] [705800.0] Leaving _msg_socket_accept
[2013-06-13T22:53:28.878] [705800.0] Called _msg_socket_readable
[2013-06-13T22:53:28.878] [705800.0] Entering _handle_accept (new thread)
[2013-06-13T22:53:28.878] [705800.0] Identity: uid=0, gid=0
[2013-06-13T22:53:28.878] [705800.0] Entering _handle_request
[2013-06-13T22:53:28.878] debug: Entering stepd_completion, range_first = 229, range_last = 269
[2013-06-13T22:53:28.878] [705800.0] Got request 18
[2013-06-13T22:53:28.878] [705800.0] Handling REQUEST_STEP_COMPLETION_V2
[2013-06-13T22:53:28.878] [705800.0] _handle_completion for job 705800.0
[2013-06-13T22:53:28.878] [705800.0] uid = 0
[2013-06-13T22:53:28.878] [705800.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-13T22:53:28.878] [705800.0] Entering _handle_request
[2013-06-13T22:53:28.878] [705800.0] Rank 0 waiting for 114 (of 269) children
[2013-06-13T22:53:28.878] [705800.0] Leaving _handle_accept
[2013-06-13T22:53:28.969] [705800] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:53:28.980] debug3: in the service_connection
[2013-06-13T22:53:28.980] debug2: got this type of message 1019
[2013-06-13T22:53:28.980] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2013-06-13T22:53:28.981] _get_joules_node_ipmi = consumed 4789995 Joules
[2013-06-13T22:53:28.981] [705800] _get_joules_task: consumed 7457 Joules (received 2664(298 watts) from slurmd)
[2013-06-13T22:53:28.981] [705800] getjoules_task energy = 7457
[2013-06-13T22:53:28.981] [705800] removing task 0 pid 55111 from jobacct
[2013-06-13T22:53:28.981] [705800] task 0 (55111) exited with exit code 0.
[2013-06-13T22:53:28.981] [705800] affinity task_post_term: 705800.4294967294, task 0
[2013-06-13T22:53:28.981] [705800] Aggregated 1 task exit messages
[2013-06-13T22:53:28.981] [705800] sending task exit msg for 1 tasks
[2013-06-13T22:53:28.982] [705800] Myname in build_hashtbl: (slurmstepd)
[2013-06-13T22:53:28.991] [705800] cpu_freq_reset: #cpus reset = 0
[2013-06-13T22:53:28.992] [705800] Before call to spank_fini()
[2013-06-13T22:53:28.992] [705800] After call to spank_fini()
[2013-06-13T22:53:28.992] [705800] Called _msg_socket_readable
[2013-06-13T22:53:28.992] [705800] false, shutdown
[2013-06-13T22:53:28.992] [705800] Message thread exited
[2013-06-13T22:53:28.993] [705800] job 705800 completed with slurm_rc = 0, job_rc = 0
[2013-06-13T22:53:28.993] [705800] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-06-13T22:53:28.996] [705800] done with job
.......
[2013-06-13T22:54:34.004] debug2: got this type of message 5016
[2013-06-13T22:54:34.004] debug3: Entering _rpc_step_complete
[2013-06-13T22:54:34.004] debug2: got this type of message 5016
[2013-06-13T22:54:34.004] debug3: Entering _rpc_step_complete
[2013-06-13T22:54:36.125] ipmi-thread = 3 sec, current 76 Watts, consumed 4796398 Joules 228 new
[2013-06-13T22:54:36.278] [705800.0] profile signalling type 0
[2013-06-13T22:54:36.279] debug3: in the service_connection
[2013-06-13T22:54:36.279] debug2: got this type of message 1019
[2013-06-13T22:54:36.279] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2013-06-13T22:54:36.279] _get_joules_node_ipmi = consumed 4796398 Joules
[2013-06-13T22:54:36.280] [705800.0] _get_joules_task: consumed 13539 Joules (received 228(76 watts) from slurmd)
[2013-06-13T22:54:36.280] [705800.0] _ipmi_send_profile: consumed 76 watts
[2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data Group-Energy Type=Energy
[2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data, HDF5 file not open
[2013-06-13T22:54:37.000] [705800.0] Rank 0 timed out waiting for 114 (of 269) children
[2013-06-13T22:54:37.000] [705800.0] _one_step_complete_msg: first=58, last=114
[2013-06-13T22:54:37.000] [705800.0] Rank 0 sending complete to slurmctld, range 58 to 114
[2013-06-13T22:54:37.003] [705800.0] _one_step_complete_msg: first=172, last=269
[2013-06-13T22:54:37.003] [705800.0] Rank 0 sending complete to slurmctld, range 172 to 269
[2013-06-13T22:54:37.006] [705800.0] _one_step_complete_msg: first=0, last=0
[2013-06-13T22:54:37.006] [705800.0] Rank 0 sending complete to slurmctld, range 0 to 0
[2013-06-13T22:54:37.008] [705800.0] done with job
[2013-06-13T22:54:37.010] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
[2013-06-13T22:54:37.010] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
[2013-06-13T22:54:37.010] debug3: state for jobid 705780: ctime:1371155585 revoked:1371155886 expires:1371155886
[2013-06-13T22:54:37.010] debug3: state for jobid 705786: ctime:1371156071 revoked:1371156110 expires:1371156110
[2013-06-13T22:54:37.010] debug3: state for jobid 705787: ctime:1371156110 revoked:1371156148 expires:1371156148
[2013-06-13T22:54:37.010] debug3: state for jobid 705788: ctime:1371156149 revoked:1371156188 expires:1371156188
[2013-06-13T22:54:37.010] debug3: state for jobid 705789: ctime:1371156188 revoked:1371156225 expires:1371156225
[2013-06-13T22:54:37.010] debug3: state for jobid 705790: ctime:1371156226 revoked:1371156263 expires:1371156263
[2013-06-13T22:54:37.010] debug3: state for jobid 705791: ctime:1371156264 revoked:1371156304 expires:1371156304
[2013-06-13T22:54:37.010] debug3: state for jobid 705792: ctime:1371156305 revoked:1371156341 expires:1371156341
[2013-06-13T22:54:37.010] debug3: state for jobid 705793: ctime:1371156342 revoked:1371156381 expires:1371156381
[2013-06-13T22:54:37.010] debug3: state for jobid 705794: ctime:1371156382 revoked:1371156421 expires:1371156421
[2013-06-13T22:54:37.011] debug3: state for jobid 705795: ctime:1371156421 revoked:1371156467 expires:1371156467
[2013-06-13T22:54:37.011] debug3: state for jobid 705796: ctime:1371156468 revoked:1371156506 expires:1371156506
[2013-06-13T22:54:37.011] debug3: state for jobid 705797: ctime:1371156507 revoked:1371156544 expires:1371156544
[2013-06-13T22:54:37.011] debug3: state for jobid 705798: ctime:1371156691 revoked:1371156729 expires:1371156729
[2013-06-13T22:54:37.011] debug3: state for jobid 705799: ctime:1371156730 revoked:1371156769 expires:1371156769
[2013-06-13T22:54:37.011] debug3: state for jobid 705800: ctime:1371156769 revoked:1371156808 expires:1371156808
[2013-06-13T22:54:37.011] debug2: set revoke expiration for jobid 705800 to 1371158077 UTS
[2013-06-13T22:54:37.011] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
[2013-06-13T22:54:37.011] error: stepd_connect to 705800.0 failed: Connection reset by peer
[2013-06-13T22:54:37.011] error: stepd_connect to 705800.0 failed: Connection reset by peer
[2013-06-13T22:54:37.012] debug: Waiting for job 705800's prolog to complete
[2013-06-13T22:54:37.012] debug: Finished wait for job 705800's prolog to complete
[2013-06-13T22:54:37.012] debug: Calling /usr/sbin/slurmstepd spank epilog
[2013-06-13T22:54:37.016] Reading slurm.conf file: /etc/slurm/slurm.conf
[2013-06-13T22:54:37.020] Running spank/epilog for jobid [705800] uid [2054944]
[2013-06-13T22:54:37.020] spank: opening plugin stack /etc/slurm/plugstack.conf
[2013-06-13T22:54:37.023] debug: [job 705800] attempting to run epilog [/etc/slurm/epilog.sh]
[2013-06-13T22:54:37.273] debug: completed epilog for jobid 705800
....
[2013-06-13T22:54:38.014] error: stepd_connect to 705800.0 failed: Connection refused
[2013-06-13T22:54:38.014] error: stepd_connect to 705800.0 failed: Connection refused
We see that the job finishes at some point without the step having ended, and about 1 minute later the _one_step_complete_msg function is called 3 times.
I think there must be an error somewhere in _send_step_complete_msgs or in the batch_finish function of src/slurmd/slurmstepd/mgr.c.
Even if we have to wait for the other nodes to complete, this shouldn't allow the job to finish before the step.
Yiannis
Yiannis, could you send me the sbatch script and the sbatch line?

Danny,
Here is the batch command used for both jobs (705800 and 705801):
export PPN=16 ; export NODE=270 ; sbatch --reservation=bull_56 -N $NODE -n $(( $PPN * $NODE )) -p mpi2 --profile=Energy --acctg-freq=Energy=3 BULL_run.sh
and here is the script used:
=========================================================
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=16
#SBATCH -p newscaall
####SBATCH -L lustre10k*1
#SBATCH --threads-per-core=1
#SBATCH -J DRESDE
#SBATCH --time=24:00:00
#SBATCH --exclusive
###SBATCH --begin=20:00:00 # run begin at 20:00
###################################
# CPUFREQ #
###################################
export FREQ=2901000
#export FREQ=2900000
#Available frequencies : 2701000(T) 2700000 2400000 2200000 2000000 1800000 1600000 1400000 1200000
###################################
# COMPILER #
###################################
source $SLURM_SUBMIT_DIR/env_bench
###################################
# MPDBOOT #
###################################
NNODES=$SLURM_NNODES
cores=$SLURM_NTASKS
###################################
# OMP_NUM_THREADS #
###################################
export OMP_NUM_THREADS=1
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
export FORT_BUFFERED=true
export decfort_dump_flag=true
unset OMP_NUM_THREADS FORT_BUFFERED decfort_dump_flag MALLOC_MMAP_MAX_ MALLOC_TRIM_THRESHOLD_
###################################
# WORKDIR #
###################################
workdir=$SLURM_SUBMIT_DIR/DIR_$FREQ\_$cores\_$OMP_NUM_THREADS\_$SLURM_JOBID
mkdir $workdir
cd $workdir
#-
$HOME/bin/ExpandNodeList -r -p 1 $SLURM_NODELIST > mpd.hosts
$HOME/bin/ExpandNodeList -r -p $SLURM_NTASKS_PER_NODE $SLURM_NODELIST > one_host_per_core
#-
cp ../BULL_run.sh .
#-
echo '----------------------------------------------------'
ulimit -a
echo '----------------------------------------------------'
pwd
echo '----------------------------------------------------'
echo ' NODE USED = '$SLURM_NODELIST
echo ' SLURM_JOBID = '$SLURM_JOBID
echo ' OMP_NUM_THREADS = '$OMP_NUM_THREADS
echo ' CORES = '$cores
echo '----------------------------------------------------'
# COPY EXE FROM BIN DIRECTORY
echo 'cp '$bin_dir'/'$bin' .'
cp $bin_dir/$bin .
cp $bin_dir/compile.sh .
echo '----------------------------------------------------'
echo 'ldd '$bin
ldd $bin
echo '----------------------------------------------------'
date
echo '----------------------------------------------------'
clush -bw root@$SLURM_NODELIST ' /bin/echo 3 > /proc/sys/vm/drop_caches '
#clush -bw root@$SLURM_NODELIST ' pwrb -N '
#clush -bw root@$SLURM_NODELIST ' pwrb -P '
#clush -bw root@$SLURM_NODELIST ' pwrb -f $FREQ '
#export LD_PRELOAD=/home/bull/Cyril/LIB/NO_BARRIER/COMPILATION_bullxmpi/no_barrier/libnobarrier.so
#time srun --cpu-freq=$FREQ -m cyclic --cpu_bind=map_cpu:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 --resv-port \
time srun --cpu-freq=$FREQ -m cyclic --cpu_bind=map_cpu:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 --resv-port \
-n $cores ./$bin $input_file > out_$cores
mv results.txt results.txt_$cores
echo '----------------------------------------------------'
date
echo '----------------------------------------------------'
==============================================================
I've also found these errors in slurm-705800.out, whereas slurm-705801.out did not have any:
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5Gdeprec.c line 820 in H5Gget_num_objs(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5Gdeprec.c line 820 in H5Gget_num_objs(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5Gdeprec.c line 1066 in H5Gget_objname_by_idx(): not a location ID
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
Perhaps the problem is related to the synchronization function between the nodes, which is active only when profiling is activated.
Yiannis
Yiannis, perhaps unrelated, perhaps related: this code has changed quite a bit in the last few days. All the HDF5 functions you reference here have been deprecated and replaced with more modern calls. I would be interested to know whether this still happens with the latest code. I am guessing there is still a problem, though, since the batch step shouldn't exit before the other steps.

Yiannis, could you also set DebugFlags=Profile and send a run with the error again?

Danny,
I had DebugFlags=NO_CONF_HASH,Energy,Profile in the slurm.conf. What are the logs that you would like to see? I think I have reported most of the relevant parts of the logs.
Yiannis

Danny,
Do you think we should try the latest git version with the more modern HDF5 calls directly, or should we wait to see if we find anything more? Which function do you have in mind?

(In reply to Yiannis Georgiou from comment #8)
> Danny,
>
> I had DebugFlags=NO_CONF_HASH,Energy,Profile in the slurm.conf .
> What are the logs that you would like to see ? I think I have reported most
> of the relevant parts of the logs.
>
> Yiannis

I think my problem is I was expecting certain output which wasn't showing up.

[2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data, HDF5 file not open

seems to be at the root of the cause. I will see if I can force this scenario and try to reproduce what you are seeing. If you could send the first few hundred lines from a failed run, that would be helpful.

For what it is worth, I have run thousands of jobs over the weekend without a single failure.

On a completely unrelated note, I believe the reason you have NO_CONF_HASH there is that you have rapl on some nodes and ipmi on others. I would suggest putting that one line in an include file so you can remove that debug flag.

(In reply to Yiannis Georgiou from comment #9)
> Danny,
>
> do you think we should try directly the latest git version with the more
> modern calls for HDF5 or should we wait to see if we find anything more?
> Which function do you have in mind?

While I do not think it will really matter, it is worth a shot if you can.

Created attachment 290 [details]
error log for jobid 705800
(In reply to Danny Auble from comment #10)
> I think my problem is I was expecting certain output which wasn't showing up.
>
> [2013-06-13T22:54:36.280] [705800.0] PROFILE: add_sample_data, HDF5 file not
> open
>
> Seems to be at the root of the cause. I will see if I can force this
> scenario and try to reproduce what you are seeing. If you could send the
> first few 100 lines from a failed run that would be helpful.

I have attached an error log with a grep on the jobid. I think the log line you mention arrives after the problem has occurred; see line 1172 of the log I attached... This error arrives after the job has finished, and we continue to get log output from the steps when we shouldn't. I think if we prevent this from happening and force the job to wait until all the steps are finished, then it would be fine. I'm wondering, though, why we don't see this every time, and why we see it only when profiling is activated.

Yiannis, could you please verify 77180.0 is still 0? My feeling on the matter is that you just didn't wait long enough for the step to end. This step had an issue with IO. I can generate this state, but it always fills in the correct info after things completely end. I do not think the batch script ending before the step is related to this either, nor do I think the batch script should wait for its steps. The allocation is still being enforced by the slurmctld. One thing I could do, though, is wrap up the jobacct/acct_gather stuff earlier to avoid the gather threads running after the step was supposed to end. Let me know what you think. I am hoping 77180.0 is not zero anymore, thus verifying my hypothesis.

Danny,
The situation for that job hasn't changed:
$ sacct -j 705800 -o jobid,state,maxdiskread,aveRSS,consumedenergy
JobID State MaxDiskRead AveRSS ConsumedEnergy
------------ ---------- ------------ ---------- --------------
705800 COMPLETED
705800.batc+ COMPLETED 240M 8648K 7457
705800.0 COMPLETED 0 0 0
From what I see on the other, correct jobs, the batch job always ends after all the steps, but perhaps there can be exceptions. In all 3 exceptional cases in our tests (sbatch ending before some steps), the result was a 0 value in consumed energy and some other accounting fields.
If we wrap up the jobacct/acct_gather stuff earlier, we might miss the latest values for consumed energy, but missing one last measure from some nodes is probably better than getting 0 once in a while.
I've seen the errors only with 270 nodes, never with fewer. It's strange that this happens only above a certain node count.
That is interesting about the node count. I'll post a proposed patch in a bit and see if you can try again with it. I see no reason to keep polling on things when we are just waiting for IO to finish. As for the step finishing after the batch step, I still don't think it is related. I am guessing it is the IO hang that really causes the scenario. I'll make a bigger cluster and see if that matters. Could you see if the other 3 failed jobs had issues with IO as well?

Please try the git master now. The commit most interesting for this bug is a8f56cb2005a5db03601e11f44d73bf881f7a7cd, which does a bit of cleanup to avoid having the threads run longer than they needed to. Looking at this code, though, it is almost impossible to believe things were zeroed out; I just don't see how it was possible. Do you happen to see any core files lying around?

Have you seen this happen again on the new code?

Hi Danny, I haven't managed to test this yet. I hope I will have the chance to do it tomorrow. I'll let you know.
Yiannis

Hi Danny, you are right that it doesn't matter that the step finishes after the job. On the other hand, I've done some testing with the latest git version and we still have the problem, but I'm now pretty sure it is related to a particular cpuset error. That could explain why you have never seen it. Do you run with cpusets on your platform?

Check this out: the problem comes from a defunct sh process, left over from a cpuset manipulation error, that stalls the termination of the step. As you can see below, process 3071 provokes the problem that we see in the logs (search for the line "Defunct process skipped:"). I noticed that every time we get 0 values in the accounting tables, a defunct process exists on the first compute node and we are obliged to wait until the message "Abandoning IO 60 secs after job shutdown initiated" before the step is finished....

Here are some messages in the log:
--------------------------------------------------
[2013-06-27T13:34:29.639] [253140] Message thread exited
[2013-06-27T13:34:29.639] [253140] job 253140 completed with slurm_rc = 0, job_rc = 0
[2013-06-27T13:34:29.639] [253140] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-06-27T13:34:29.642] [253140] done with job
............
[2013-06-27T13:34:29.662] [253140.0] Defunct process skipped: command=(sh) state=Z pid=3071 ppid=2521
...........
.......
[2013-06-27T13:35:20.875] [253140.0] Defunct process skipped: command=(sh) state=Z pid=3071 ppid=2521
[2013-06-27T13:35:20.878] [253140.0] Sent SIGKILL signal to 253140.0
[2013-06-27T13:35:20.878] [253140.0] Leaving _handle_request: SLURM_SUCCESS
[2013-06-27T13:35:20.878] [253140.0] Entering _handle_request
[2013-06-27T13:35:20.878] [253140.0] Leaving _handle_accept
[2013-06-27T13:35:20.878] debug3: Step from other job: jobid=253138 (this jobid=253140)
[2013-06-27T13:35:21.866] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:21.866] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:22.127] ipmi-thread = 3 sec, current 76 Watts, consumed 653962 Joules 228 new
[2013-06-27T13:35:22.867] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:22.867] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:23.868] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:23.868] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:24.869] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:24.870] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:25.134] ipmi-thread = 3 sec, current 76 Watts, consumed 654190 Joules 228 new
[2013-06-27T13:35:25.871] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:25.871] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:26.872] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:26.872] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:27.873] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:27.873] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:28.141] ipmi-thread = 3 sec, current 76 Watts, consumed 654418 Joules 228 new
[2013-06-27T13:35:28.874] [253140.0] eio: handling events for 1 objects
[2013-06-27T13:35:28.874] [253140.0] Called _msg_socket_readable
[2013-06-27T13:35:29.875] [253140.0] Abandoning IO 60 secs after job shutdown initiated
[2013-06-27T13:35:29.875] [253140.0] Message thread exited
............
.......
[2013-06-27T13:35:38.000] [253140.0] Rank 0 timed out waiting for 41 (of 269) children
[2013-06-27T13:35:38.000] [253140.0] _one_step_complete_msg: first=0, last=228
[2013-06-27T13:35:38.000] [253140.0] Rank 0 sending complete to slurmctld, range 0 to 228
[2013-06-27T13:35:38.002] [253140.0] done with job
[2013-06-27T13:35:38.004] debug: stepd_api.c:264: : safe_read (4 of 4) failed: Connection reset by peer
----------------------------------------

And here is the output of ps during the same time:
-----------------------------------
root@taurusi1001:~> ps -aux |grep 13:
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1602 0.0 0.0 0 0 ? S 13:33 0:00 [flush-0:19]
root 2521 0.5 0.0 472692 7020 ? Sl 13:33 0:00 slurmstepd: [253140.0]
root 2939 0.0 0.0 11292 1344 ? S 13:34 0:00 /bin/bash /sbin/cpuset_release_agent /slurm253140/slurm253140.0_15
root 3063 0.0 0.0 4076 488 ? D 13:34 0:00 rmdir /dev/cpuset//slurm253140/slurm253140.0_15
root 3064 0.0 0.0 11292 1344 ? S 13:34 0:00 /bin/bash /sbin/cpuset_release_agent /slurm253140/slurm253140.0_10
root 3069 0.0 0.0 4076 488 ? D 13:34 0:00 rmdir /dev/cpuset//slurm253140/slurm253140.0_10
root 3071 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
root 3076 0.0 0.0 11292 1340 ? S 13:34 0:00 /bin/bash /sbin/cpuset_release_agent /slurm253140/slurm253140.4294967294_0
root 3080 0.0 0.0 4076 484 ? D 13:34 0:00 rmdir /dev/cpuset//slurm253140/slurm253140.4294967294_0
root 3140 2.0 0.0 110224 1196 pts/2 R+ 13:34 0:00 ps -aux
root 3141 0.0 0.0 103232 900 pts/2 S+ 13:34 0:00 grep 13:
root 116560 0.0 0.0 96508 3812 ? Ss 13:18 0:00 sshd: root@pts/2
root 116562 0.0 0.0 108428 2032 pts/2 Ss 13:18 0:00 -bash
root 120048 0.0 0.0 0 0 ? S 13:25 0:00 [flush-lustre-2]
root 121182 0.0 0.0 100928 648 pts/1 S+ 13:27 0:00 tail -f /usr/local/slurm26/var/log/slurm.log
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1602 0.0 0.0 0 0 ? S 13:33 0:00 [flush-0:19]
root 2521 0.3 0.0 472692 7020 ? Sl 13:33 0:00 slurmstepd: [253140.0]
root 3071 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
root 3222 2.0 0.0 110224 1196 pts/2 R+ 13:34 0:00 ps -aux
root 3223 0.0 0.0 103232 900 pts/2 S+ 13:34 0:00 grep 13:
root 116560 0.0 0.0 96508 3812 ? Ss 13:18 0:00 sshd: root@pts/2
root 116562 0.0 0.0 108428 2032 pts/2 Ss 13:18 0:00 -bash
root 120048 0.0 0.0 0 0 ? S 13:25 0:00 [flush-lustre-2]
root 121182 0.0 0.0 100928 648 pts/1 S+ 13:27 0:00 tail -f /usr/local/slurm26/var/log/slurm.log
-----------------------------------------

I don't know what exactly made the shell script go defunct, but I suppose it was something related to cpusets, because sh processes are the only children of slurmstepd that may appear at that point. I think that if we move to cgroups we will not have this problem anymore, and since we were going to do that anyway, it's fine with us. On the other hand, I imagine there should be a way to fix the cpuset version as well. What do you think?

Yiannis

Here is some more info:
there is no DBD_STEP_COMPLETE for the step as you can see from the log of slurmdbd
--------------------------
[2013-06-28T09:36:54.173] debug2: DBD_JOB_START: START CALL ID:253174 NAME:DRESDE INX:0
[2013-06-28T09:36:54.173] debug2: as_mysql_slurmdb_job_start() called
[2013-06-28T09:36:54.173] debug3: 10(as_mysql_job.c:509) query
insert into "taur_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, mem_req, account, partition, node_inx) values (253174, 4, 1, 0, 2054944, 200026, 'taurusi[1001-1270]', 0, 1440, 1372405012, 1372405012, 1372405012, 'DRESDE', 0, 1, 29347, 4320, 4320, 270, 0, 'bull', 'mpi2', '0-269') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=2054944, id_group=200026, nodelist='taurusi[1001-1270]', id_resv=0, timelimit=1440, time_submit=1372405012, time_start=1372405012, job_name='DRESDE', track_steps=0, id_qos=1, state=greatest(state, 1), priority=29347, cpus_req=4320, cpus_alloc=4320, nodes_alloc=270, mem_req=0, account='bull', partition='mpi2', node_inx='0-269'
[2013-06-28T09:36:56.743] debug2: DBD_STEP_START: ID:253174.0 NAME:memory_bandwidth SUBMIT:1372405012
[2013-06-28T09:36:56.743] debug3: 10(as_mysql_job.c:946) query
insert into "taur_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq) values (273452, 0, 1372405016, 'memory_bandwidth', 1, 4320, 270, 4320, 'taurusi[1001-1270]', '0-269', 1, 2901000) on duplicate key update cpus_alloc=4320, nodes_alloc=270, task_cnt=4320, time_end=0, state=1, nodelist='taurusi[1001-1270]', node_inx='0-269', task_dist=1, req_cpufreq=2901000
[2013-06-28T09:36:59.213] debug2: DBD_JOB_START: ELIGIBLE CALL ID:253175 NAME:DRESDE
[2013-06-28T09:36:59.213] debug2: as_mysql_slurmdb_job_start() called
[2013-06-28T09:36:59.213] debug3: 10(as_mysql_job.c:509) query
insert into "taur_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, mem_req, account, partition) values (253175, 4, 1, 0, 2054944, 200026, 'None assigned', 0, 1440, 1372405014, 1372405014, 0, 'DRESDE', 0, 0, 29347, 4320, 0, 0, 0, 'bull', 'mpi2') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=2054944, id_group=200026, nodelist='None assigned', id_resv=0, timelimit=1440, time_submit=1372405014, time_start=0, job_name='DRESDE', track_steps=0, id_qos=1, state=greatest(state, 0), priority=29347, cpus_req=4320, cpus_alloc=0, nodes_alloc=0, mem_req=0, account='bull', partition='mpi2'
[2013-06-28T09:37:56.073] debug2: DBD_STEP_START: ID:253174.4294967294 NAME:batch SUBMIT:1372405012
[2013-06-28T09:37:56.073] debug3: 10(as_mysql_job.c:946) query
insert into "taur_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq) values (273452, -2, 1372405012, 'batch', 1, 1, 1, 1, 'taurusi1001', '0', 0, 0) on duplicate key update cpus_alloc=1, nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='taurusi1001', node_inx='0', task_dist=0, req_cpufreq=0
[2013-06-28T09:37:56.113] debug2: DBD_STEP_COMPLETE: ID:253174.4294967294 SUBMIT:1372405012
[2013-06-28T09:37:56.113] debug3: 10(as_mysql_job.c:1127) query
update "taur_step_table" set time_end=1372405076, state=3, kill_requid=-1, exit_code=0, user_sec=1, user_usec=780000, sys_sec=2, sys_usec=80000, max_disk_read=135.214800, max_disk_read_task=0, max_disk_read_node=0, ave_disk_read=135.214800, max_disk_write=0.723989, max_disk_write_task=0, max_disk_write_node=0, ave_disk_write=0.723989, max_vsize=4060804, max_vsize_task=0, max_vsize_node=0, ave_vsize=1636840.000000, max_rss=214788, max_rss_task=0, max_rss_node=0, ave_rss=8512.000000, max_pages=19, max_pages_task=0, max_pages_node=0, ave_pages=19.000000, min_cpu=0, min_cpu_task=0, min_cpu_node=0, ave_cpu=0.000000, act_cpufreq=2901000, consumed_energy=14302 where job_db_inx=273452 and id_step=-2
[2013-06-28T09:37:56.113] debug2: DBD_JOB_COMPLETE: ID:253174
[2013-06-28T09:37:56.114] debug2: as_mysql_slurmdb_job_complete() called
[2013-06-28T09:37:56.114] debug3: 10(as_mysql_job.c:799) query
update "taur_job_table" set time_end=1372405076, state=3, nodelist='taurusi[1001-1270]', derived_ec=0, exit_code=0, kill_requid=-1 where job_db_inx=273452;
--------------------------------------
and slurmstepd is frozen for 60 sec waiting for the other nodes to send terminate (during this time those nodes are in the Completing state)
-----------------------------------------
root@taurusi1001:~> ps -aux|grep 09:
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 71809 0.0 0.0 96508 3804 ? Ss 09:05 0:00 sshd: root@pts/0
root 71811 0.0 0.0 108428 2076 pts/0 Ss 09:05 0:00 -bash
root 71956 0.0 0.0 95736 3844 ? Ss 09:05 0:00 sshd: root@pts/1
root 71958 0.0 0.0 108428 2068 pts/1 Ss 09:05 0:00 -bash
root 72632 0.3 0.0 3586316 12468 ? Sl 09:07 0:05 /usr/local/slurm26/sbin/slurmd
root 83747 0.0 0.1 151176 33780 pts/1 S+ 09:22 0:00 vi /usr/local/slurm26/var/log/slurm.log
root 85943 0.0 0.0 0 0 ? S 09:28 0:00 [flush-lustre-2]
root 88365 0.0 0.0 0 0 ? S 09:34 0:00 [flush-0:19]
root 90270 0.2 0.0 472824 7140 ? Sl 09:36 0:00 slurmstepd: [253174.0]
root 90765 0.0 0.0 0 0 ? Z 09:37 0:00 [sh] <defunct>
root 91003 3.0 0.0 110220 1196 pts/0 R+ 09:38 0:00 ps -aux
root 91004 0.0 0.0 103232 900 pts/0 S+ 09:38 0:00 grep 09:
root@taurusi1001:~> gstack 90270
Thread 3 (Thread 0x2b6f2d29e700 (LWP 90277)):
#0 0x00002b6f29a9efc3 in poll () from /lib64/libc.so.6
#1 0x0000000000433705 in eio_handle_mainloop ()
#2 0x0000000000432884 in _msg_thr_internal ()
#3 0x00002b6f297ab851 in start_thread () from /lib64/libpthread.so.0
#4 0x00002b6f29aa867d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2b6f2d09c700 (LWP 90755)):
#0 0x00002b6f29a6c91d in nanosleep () from /lib64/libc.so.6
#1 0x00002b6f29a6c790 in sleep () from /lib64/libc.so.6
#2 0x0000000000426697 in _kill_thr ()
#3 0x00002b6f297ab851 in start_thread () from /lib64/libpthread.so.0
#4 0x00002b6f29aa867d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b6f29f5db20 (LWP 90270)):
#0 0x00002b6f297af7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000042678b in _wait_for_children_slurmstepd ()
#2 0x00000000004275f4 in job_manager ()
#3 0x000000000042421d in main ()
----------------------------------------
Once the 60-sec timeout is reached, it should normally follow the usual procedure up to the DBD_STEP_COMPLETE and also finish the writes to the HDF5 files, but it doesn't. As a result we don't get the aggregated jobacctinfo in the DB and the HDF5 files are incomplete... we also get a bunch of the following errors in the slurm-253174.out file:
---------------------------
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5G.c line 612 in H5Gget_info(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5L.c line 1093 in H5Lget_name_by_idx(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5L.c line 1093 in H5Lget_name_by_idx(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
#001: H5Gloc.c line 195 in H5G_loc(): invalid group ID
major: Invalid arguments to routine
minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: H5L.c line 1093 in H5Lget_name_by_idx(): not a location
major: Invalid arguments to routine
minor: Inappropriate type
...........
......
-----------------------
I'm wondering whether the _kill_thr() that I see in the gstack is killing the wrong threads or something like that. I'll continue to investigate the code.
Let me know if you have any ideas
Yiannis, well this is good information. Do you happen to have the slurmctld and slurmd logs for this job?
Here is the output of some of the jobs I've run: from what we see, 253194 and 253198 had the problem (I print only some of the values for the example)
--------------------------------------
root@taurusadmin1:~> /usr/local/slurm26/bin/sacct -S 2013-06-28T09:00:00 --format=jobid,MaxPages,Nnodes,ConsumedEnergy,MaxDiskRead,Elapsed,State
JobID MaxPages NNodes ConsumedEnergy MaxDiskRead Elapsed State
------------ -------- -------- -------------- ------------ ---------- ----------
253194 270 00:01:03 COMPLETED
253194.batc+ 19K 1 14388 135M 00:01:03 COMPLETED
253194.0 0 270 0 0 00:01:00 COMPLETED
253195 270 00:01:11 COMPLETED
253195.batc+ 40K 1 14592 135M 00:01:11 COMPLETED
253195.0 49K 270 3518593 3M 00:01:07 COMPLETED
253196 270 00:01:12 COMPLETED
253196.batc+ 86K 1 16055 135M 00:01:12 COMPLETED
253196.0 46K 270 3600626 3M 00:01:08 COMPLETED
253197 270 00:01:05 COMPLETED
253197.batc+ 19K 1 14628 135M 00:01:05 COMPLETED
253197.0 46K 270 3600831 3M 00:01:01 COMPLETED
253198 270 00:01:03 COMPLETED
253198.batc+ 57K 1 15049 135M 00:01:03 COMPLETED
253198.0 0 270 0 0 00:00:59 COMPLETED
253199 270 00:01:03 COMPLETED
253199.batc+ 50K 1 14416 135M 00:01:03 COMPLETED
253199.0 46K 270 3398440 3M 00:00:59 COMPLETED
--------------------------------------
I'm attaching the slurmd log of 253194 and the slurmctld log of 253198, because I don't have the slurmctld log for 253194 (but it should be the same thing). It is interesting to see in the slurmd log that we continue to get messages for all the problematic steps a long time after the job and the step have finished and other jobs have been executed on the node...
all the jobids that you see here are jobs that had the problem of 0 values in accounting:
[2013-06-28T10:40:37.876] debug4: found jobid = 253192, stepid = 0
[2013-06-28T10:40:37.876] debug4: found jobid = 253186, stepid = 0
[2013-06-28T10:40:37.876] debug4: found jobid = 253194, stepid = 0
[2013-06-28T10:40:37.876] debug: Cleaned up stray socket /usr/local/slurm26/var/log/slurm/spool_slurmd//taurusi1001_253192.0
[2013-06-28T10:40:37.876] debug: Cleaned up stray socket /usr/local/slurm26/var/log/slurm/spool_slurmd//taurusi1001_253186.0
[2013-06-28T10:40:37.877] debug: Cleaned up stray socket /usr/local/slurm26/var/log/slurm/spool_slurmd//taurusi1001_253194.0
Created attachment 320 [details]
slurmcltd log of 253198
Created attachment 321 [details]
slurmd log of 253194
It appears I have reproduced this! Looks like it is related to the job being aborted. The patch (hack) below will make every job be aborted and not handle the sending of information. I'll see what I can do for aborted jobs like this.
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -1106,7 +1106,7 @@ fail1:
_send_launch_resp(job, rc);
}
- if (job->aborted)
+ if (!job->aborted)
info("job_manager exiting with aborted job");
else if (!job->batch && (step_complete.rank > -1)) {
_wait_for_children_slurmstepd(job);
OK, try commit ee125a47250478df25d6b7c6a00027bb62ba0140 or just pull from the slurm-2.6 branch. If what I found is the problem, this should fix it. Let me know what you find.
Thanks Danny! I'll test this Monday morning and let you know
Yiannis
Danny, we have applied the new version with this patch, but unfortunately the bug still appears on the Dresden cluster... We have confirmed that the error appears only with profiling activated, with the following variations:
Jobs 1824196 – 1824199 were run with “--profile=all --acctg-freq=3”
Job 1824206 was run with “--profile=all --acctg-freq=energy=3”
$ sacct -u bull --name=mem_bw --format=jobid,jobname,nnodes,consumedenergy,avecpu,averss,elapsed,state | grep -e "\.0"
1824196.0 memory_ba+ 270 0 00:03:55 64984K 00:04:14 COMPLETED
1824198.0 memory_ba+ 270 72 00:03:56 60801K 00:04:14 COMPLETED
1824199.0 memory_ba+ 270 0 00:03:54 62055K 00:04:12 COMPLETED
1824206.0 memory_ba+ 270 0 00:00:00 0 00:04:16 COMPLETED
We see that the error is directly related to energy profiling... If we profile only task and not energy, then we get results for all accounting info except energy (jobs 1824196 – 1824199), except for the strange 72!
On the other hand, if we profile task and energy, we get no results for any of the job accounting fields. Here is a small log (verbosity 3) from the first compute node that shows jobs 1824199 and 1824206; no error appeared for 1824199, only for 1824206:
[2013-07-03T10:55:42.015] Launching batch job 1824199 for UID 2054944
[2013-07-03T10:55:42.026] init: Gres GPU plugin loaded
[2013-07-03T10:55:42.026] Received cpu frequency information for 32 cpus
[2013-07-03T10:55:42.538] launch task 1824199.0 request from 2054944.200026@172.26.72.190 (port 33436)
[2013-07-03T10:55:42.550] init: Gres GPU plugin loaded
[2013-07-03T10:55:42.550] Received cpu frequency information for 32 cpus
[2013-07-03T10:59:54.594] [1824199] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-07-03T10:59:54.599] [1824199] done with job
[2013-07-03T10:59:54.741] [1824199.0] done with job
[2013-07-03T11:00:05.336] Launching batch job 1824206 for UID 2054944
[2013-07-03T11:00:05.347] init: Gres GPU plugin loaded
[2013-07-03T11:00:05.347] Received cpu frequency information for 32 cpus
[2013-07-03T11:00:05.863] launch task 1824206.0 request from 2054944.200026@172.26.72.190 (port 64671)
[2013-07-03T11:00:05.874] init: Gres GPU plugin loaded
[2013-07-03T11:00:05.874] Received cpu frequency information for 32 cpus
[2013-07-03T11:04:21.045] [1824206] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-07-03T11:04:21.049] [1824206] done with job
[2013-07-03T11:05:19.123] [1824206.0] Abandoning IO 60 secs after job shutdown initiated
[2013-07-03T11:05:28.008] [1824206.0] done with job
[2013-07-03T11:05:28.010] error: stepd_connect to 1824206.0 failed: Connection reset by peer
[2013-07-03T11:05:28.010] error: stepd_connect to 1824206.0 failed: Connection reset by peer
[2013-07-03T11:05:28.010] error: stepd_connect to 1824206.0 failed: Connection reset by peer
[2013-07-03T11:05:29.013] error: stepd_connect to 1824206.0 failed: Connection refused
[2013-07-03T11:05:29.013] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:29.013] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:30.023] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:30.023] error: stepd_connect to 1824206.0 failed: No such file or directory
[2013-07-03T11:05:30.023] error: stepd_connect to 1824206.0 failed: No such file or directory
Let me know if you find anything, and I'll try to make other tests with higher verbosity in the logs.
Thanks
Yiannis
Yiannis, I was concerned that what I found wasn't the real problem, since your logs never show the warning message I saw. Could you send the sacct for the batch step of 1824206 as well? It just seems so strange that the problem would be related to profiling.
Here you are, I included it for all of them:
root@taurusadmin1:~> sacct -j 1824196,1824197,1824198,1824199,1824206 --format=jobid,jobname,nnodes,consumedenergy,avecpu,averss,elapsed,state
JobID JobName NNodes ConsumedEnergy AveCPU AveRSS Elapsed State
------------ ---------- -------- -------------- ---------- ---------- ---------- ----------
1824196 mem_bw 270 00:04:14 COMPLETED
1824196.bat+ batch 1 6.08M 00:00:00 8684K 00:04:14 COMPLETED
1824196.0 memory_ba+ 270 0 00:03:55 64984K 00:04:14 COMPLETED
1824197 J2ant2thi+ 24 00:00:02 COMPLETED
1824197.bat+ batch 1 25.34M 00:00:00 8084K 00:00:02 COMPLETED
1824197.0 gpaw-pyth+ 24 530.87M 00:00:00 872K 00:00:02 FAILED
1824198 mem_bw 270 00:04:14 COMPLETED
1824198.bat+ batch 1 6.13M 00:00:00 8716K 00:04:14 COMPLETED
1824198.0 memory_ba+ 270 72 00:03:56 60801K 00:04:14 COMPLETED
1824199 mem_bw 270 00:04:12 COMPLETED
1824199.bat+ batch 1 6.19M 00:00:00 8712K 00:04:12 COMPLETED
1824199.0 memory_ba+ 270 0 00:03:54 62055K 00:04:12 COMPLETED
1824206 mem_bw 270 00:04:16 COMPLETED
1824206.bat+ batch 1 6.25M 00:00:00 8672K 00:04:16 COMPLETED
1824206.0 memory_ba+ 270 0 00:00:00 0 00:04:16 COMPLETED
Yiannis, I believe I am finding a common symptom here, and perhaps the cause of the bug. One thing common to all the steps you have sent us logs for is a timeout waiting for a rank, something like this in the slurmd log:
[2013-06-13T22:54:37.000] [705800.0] Rank 0 timed out waiting for 114 (of 269) children
The message for step completion is only sent to the database when a step fully completes. My current theory is that this timeout leaves a hole in the step-completion logic, so the step never fully completes and is thus lost for accounting purposes. I will try to recreate this and report back what I find, but it seems like at least a plausible scenario for causing this. What would be interesting is looking at the node that timed out to see what it is doing, and what happened to the timed-out ranks, or whether they tried to send the message but were rejected for some reason. You should see messages like "sending complete to slurmctld instead" in the slurmd log of the timing-out node, but perhaps something more is hosed and they never finish.
Well, I do find these kinds of messages in the slurmd log that was attached some days ago:
[2013-06-28T10:35:25.000] [253194.0] Rank 0 timed out waiting for 171 (of 269) children
[2013-06-28T10:35:25.000] [253194.0] _one_step_complete_msg: first=58, last=114
[2013-06-28T10:35:25.000] [253194.0] Rank 0 sending complete to slurmctld, range 58 to 114
[2013-06-28T10:35:25.002] [253194.0] _one_step_complete_msg: first=229, last=269
[2013-06-28T10:35:25.002] [253194.0] Rank 0 sending complete to slurmctld, range 229 to 269
[2013-06-28T10:35:25.004] [253194.0] _one_step_complete_msg: first=0, last=0
[2013-06-28T10:35:25.004] [253194.0] Rank 0 sending complete to slurmctld, range 0 to 0
[2013-06-28T10:35:25.006] [253194.0] done with job
but I'm not sure if this is the message you are talking about...?
Exactly, I was referencing the attached log files when I wrote my last response.
Seeing the slurmd log file from the other node that timed out would help show what is going on. So what I am requesting is this (old logs are fine, even preferred), at "debug" level at least, though "debug3" would probably be the most help: the slurmd log from the head node, the slurmd log from the timeout node, and the slurmctld log for the job. If you had the other slurmd logs for 253194 that would help, but I really would like a clear picture, and it doesn't sound like you have the slurmctld logs for that job.
Yiannis, have you been able to get those log files?
We have been able to reproduce this problem with a carefully placed sleep in the slurmstepd code (see below):
diff --git a/src/slurmd/slurmstepd/mgr.c b/src/slurmd/slurmstepd/mgr.c
index dd9f8ce..7c63c61 100644
--- a/src/slurmd/slurmstepd/mgr.c
+++ b/src/slurmd/slurmstepd/mgr.c
@@ -741,13 +741,14 @@ _one_step_complete_msg(slurmd_job_t *job, int first, int l
* can be built with out the hostlist from the credential.
*/
if (step_complete.parent_rank != -1) {
- debug3("Rank %d sending complete to rank %d, range %d to %d",
+ info("Rank %d sending complete to rank %d, range %d to %d",
step_complete.rank, step_complete.parent_rank,
first, last);
/* On error, pause then try sending to parent again.
* The parent slurmstepd may just not have started yet, because
* of the way that the launch message forwarding works.
*/
+sleep(300);
for (i = 0; i < REVERSE_TREE_PARENT_RETRY; i++) {
if (i)
sleep(1);
A bit more detail: the accounting record will be filled in after the sleep, unless the slurmstepd on the sleeping node (anything other than node zero) is killed. That means when the slurmctld asks for the status of the job on that node, the slurmd says the job is done, but the slurmstepd never sent the accounting record for its ranks, so the step accounting record is never filled in. Since the slurmctld has confirmed there are no remnants of the job on any node, the job and step records are deleted. The best solution is probably to record the step accounting information for the ranks we do have information about: a partial accounting record.
The commit below should fix the problem. It will be included in version 2.6.0. Please re-open this trouble ticket if the problem continues.
https://github.com/SchedMD/slurm/commit/26a8d2e6ba590cf1d8b9215e543a6726508b3ae4
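The "partial accounting record" idea can be sketched as follows. This is illustrative C only, not the actual Slurm jobacctinfo code; the struct fields and function name are invented. The point is to fold in whatever rank records did arrive instead of losing the whole step when some ranks never report:

```c
/* Hypothetical sketch of partial step-accounting aggregation.
 * Not Slurm source; all identifiers are invented for illustration. */
#include <assert.h>
#include <stdint.h>

struct rank_acct {
	int      reported;   /* 0 if this rank's record never arrived */
	uint64_t max_rss;    /* KB */
	uint64_t energy;     /* Joules */
};

struct step_acct {
	int      ranks_seen; /* how many ranks contributed to the record */
	uint64_t max_rss;    /* max over reporting ranks */
	uint64_t tot_energy; /* sum over reporting ranks */
};

/* Aggregate only the records we have; missing ranks are skipped
 * rather than zeroing out the entire step record. */
static void aggregate_partial(struct step_acct *step,
			      const struct rank_acct *ranks, int n)
{
	for (int i = 0; i < n; i++) {
		if (!ranks[i].reported)
			continue;
		step->ranks_seen++;
		if (ranks[i].max_rss > step->max_rss)
			step->max_rss = ranks[i].max_rss;
		step->tot_energy += ranks[i].energy;
	}
}
```

With this approach the step record is merely incomplete (fewer ranks seen) rather than absent, matching the behavior described in the fix.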
I confirm that your patch corrects the problem on a large configuration.
In the sacct output that follows, the patch was applied before the start of job 46, and as you can see there are no 0 values after that job.
Thanks a lot, that was a difficult catch!!
JobID JobName NNodes ConsumedEnergy AveRSS Elapsed State
------------ ---------- -------- -------------- ---------- ---------- ----------
37 DRESDE 2 00:00:00 CANCELLED+
38 DRESDE 270 00:10:17 COMPLETED
38.batch batch 1 54895 17508K 00:10:17 COMPLETED
38.0 memory_ba+ 270 15201640 22445K 00:10:17 COMPLETED
39 DRESDE 270 00:10:23 COMPLETED
39.batch batch 1 55685 17436K 00:10:23 COMPLETED
39.0 memory_ba+ 270 15483430 22153K 00:10:23 COMPLETED
40 DRESDE 270 00:10:16 COMPLETED
40.batch batch 1 54794 17452K 00:10:16 COMPLETED
40.0 memory_ba+ 270 0 0 00:10:15 COMPLETED
41 DRESDE 270 00:10:22 COMPLETED
41.batch batch 1 55844 17460K 00:10:22 COMPLETED
41.0 memory_ba+ 270 15390683 21775K 00:10:21 COMPLETED
42 DRESDE 270 00:10:17 COMPLETED
42.batch batch 1 54895 17484K 00:10:17 COMPLETED
42.0 memory_ba+ 270 0 0 00:10:17 COMPLETED
43 DRESDE 270 00:10:19 COMPLETED
43.batch batch 1 55413 17492K 00:10:19 COMPLETED
43.0 memory_ba+ 270 0 0 00:10:19 COMPLETED
44 DRESDE 270 00:10:18 COMPLETED
44.batch batch 1 61690 17348K 00:10:18 COMPLETED
44.0 memory_ba+ 270 0 0 00:10:18 COMPLETED
45 DRESDE 300 05:56:21 CANCELLED+
45.batch batch 1 3952902 19504K 05:56:21 CANCELLED
45.0 memory_ba+ 300 0 0 05:56:21 CANCELLED
46 DRESDE 270 00:10:20 COMPLETED
46.batch batch 1 55439 17464K 00:10:20 COMPLETED
46.0 memory_ba+ 270 15304438 21841K 00:11:40 COMPLETED
47 DRESDE 270 00:10:20 COMPLETED
47.batch batch 1 61217 17492K 00:10:20 COMPLETED
47.0 memory_ba+ 270 15371691 20261K 00:10:20 COMPLETED
48 DRESDE 270 00:10:17 COMPLETED
48.batch batch 1 54798 17600K 00:10:17 COMPLETED
48.0 memory_ba+ 270 15052102 20535K 00:11:41 COMPLETED
49 DRESDE 270 00:10:17 COMPLETED
49.batch batch 1 55284 17312K 00:10:17 COMPLETED
49.0 memory_ba+ 270 15152236 22634K 00:11:41 COMPLETED
50 DRESDE 270 00:10:17 COMPLETED
50.batch batch 1 56720 17372K 00:10:17 COMPLETED
50.0 memory_ba+ 270 15039470 21981K 00:11:41 COMPLETED
51 DRESDE 270 00:10:19 COMPLETED
51.batch batch 1 55105 17488K 00:10:19 COMPLETED
51.0 memory_ba+ 270 15149472 18748K 00:11:43 COMPLETED
52 DRESDE 270 00:10:19 COMPLETED
52.batch batch 1 60902 17464K 00:10:19 COMPLETED
52.0 memory_ba+ 270 15307016 21117K 00:10:19 COMPLETED
53 DRESDE 270 00:10:17 COMPLETED
53.batch batch 1 54660 17468K 00:10:17 COMPLETED
53.0 memory_ba+ 270 15233380 18782K 00:10:17 COMPLETED
54 DRESDE 270 00:10:17 COMPLETED
54.batch batch 1 54944 17368K 00:10:17 COMPLETED
54.0 memory_ba+ 270 15190992 20565K 00:10:15 COMPLETED
55 DRESDE 270 00:10:18 COMPLETED
55.batch batch 1 54746 17480K 00:10:18 COMPLETED
55.0 memory_ba+ 270 14729504 21241K 00:11:45 COMPLETED
56 DRESDE 270 00:10:18 COMPLETED
56.batch batch 1 60888 17472K 00:10:18 COMPLETED
56.0 memory_ba+ 270 15237551 21755K 00:11:42 COMPLETED
57 DRESDE 270 00:10:17 COMPLETED
57.batch batch 1 54959 17428K 00:10:17 COMPLETED
57.0 memory_ba+ 270 15153619 21780K 00:11:29 COMPLETED
58 DRESDE 270 00:10:17 COMPLETED
58.batch batch 1 60446 17480K 00:10:17 COMPLETED
58.0 memory_ba+ 270 15030168 20568K 00:11:42 COMPLETED
59 DRESDE 270 00:10:17 COMPLETED
59.batch batch 1 55045 17484K 00:10:17 COMPLETED
59.0 memory_ba+ 270 14802597 20895K 00:11:46 COMPLETED
60 DRESDE 270 00:10:19 COMPLETED
60.batch batch 1 61642 17492K 00:10:19 COMPLETED
60.0 memory_ba+ 270 15294558 21427K 00:11:43 COMPLETED
61 DRESDE 270 00:10:17 COMPLETED
61.batch batch 1 54876 17476K 00:10:17 COMPLETED
61.0 memory_ba+ 270 15194811 19843K 00:11:41 COMPLETED
62 DRESDE 270 00:10:23 COMPLETED
62.batch batch 1 55578 17452K 00:10:23 COMPLETED
62.0 memory_ba+ 270 15454512 18950K 00:10:22 COMPLETED
63 DRESDE 270 00:10:17 COMPLETED
63.batch batch 1 54756 17540K 00:10:17 COMPLETED
63.0 memory_ba+ 270 15159040 20682K 00:11:41 COMPLETED
64 DRESDE 270 00:10:20 COMPLETED
64.batch batch 1 61315 17328K 00:10:20 COMPLETED
64.0 memory_ba+ 270 15252129 20788K 00:11:44 COMPLETED
65 DRESDE 270 00:10:19 COMPLETED
65.batch batch 1 55187 17504K 00:10:19 COMPLETED
65.0 memory_ba+ 270 15327549 19719K 00:10:19 COMPLETED
Hello,
Execution of Linpack for a small run gives the following results in accounting, which are correct:
root@taurusadmin1:~> sacct -j 492396 --format=jobid,jobname,MaxVMSize,MaxRSS,AveRSS,MaxPages,Nnodes,AveCPUFreq,ReqCPUFreq,ConsumedEnergy,MaxDiskRead,Elapsed
JobID JobName MaxVMSize MaxRSS AveRSS MaxPages NNodes AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead Elapsed
------------ ---------- ---------- ---------- ---------- -------- -------- ---------- ---------- -------------- ------------ ----------
492396 hpl 18 00:28:00
492396.batc+ batch 1508276K 7348K 7348K 0 1 2901000 0 626114 10M 00:28:00
492396.0 xhpl 27982672K 27500272K 27427238K 2K 18 78417 2901000 10560917 0.14M 00:28:00
Whereas for a long run we get memory, cpu frequency, disk, and consumed energy = 0 for the step:
root@taurusadmin1:~> sacct -j 492196 --format=jobid,jobname,MaxVMSize,MaxRSS,AveRSS,MaxPages,Nnodes,AveCPUFreq,ReqCPUFreq,ConsumedEnergy,MaxDiskRead,Elapsed
JobID JobName MaxVMSize MaxRSS AveRSS MaxPages NNodes AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead Elapsed
------------ ---------- ---------- ---------- ---------- -------- -------- ---------- ---------- -------------- ------------ ----------
492196 hpl 270 01:51:14
492196.batc+ batch 4346892K 67912K 10764K 21K 1 2901000 0 2.42M 554M 01:51:14
492196.0 xhpl 0 0 0 0 270 0 2901000 0 0 01:51:10
I'm wondering if the problem comes from the fact that, since the values would be much larger, there was an issue during a conversion. I don't think they overflowed the uint32 in which we initially store all those values, but there could be something related to this, because I couldn't find any other similarity between those values. I need to do more testing to be certain of the conditions, but I wanted to report this in case you have ever seen it before.
Thanks,
Yiannis
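For reference, the kind of 32-bit wrap-around being suspected here looks like the sketch below. The node counts, wattages, and durations are illustrative only, not taken from these jobs:

```c
/* Illustration of how a joules total that fits in uint64_t silently
 * wraps modulo 2^32 when accumulated in a 32-bit counter. */
#include <assert.h>
#include <stdint.h>

/* Total energy for `nodes` nodes drawing `watts` each for `secs` seconds,
 * computed in 32-bit arithmetic (wraps modulo 2^32). */
static uint32_t energy_u32(uint32_t nodes, uint32_t watts, uint32_t secs)
{
	return nodes * watts * secs;
}

/* The same total computed in 64-bit arithmetic (no wrap for these sizes). */
static uint64_t energy_u64(uint64_t nodes, uint64_t watts, uint64_t secs)
{
	return nodes * watts * secs;
}
```

For example, 270 nodes at 300 W for 60000 s is 4,860,000,000 J, which exceeds 2^32 (about 4.29e9), so the 32-bit result wraps to a much smaller number while the 64-bit result is correct.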