Created attachment 31623 [details] /var/log/messages

Hi Team,

Multiple nodes have been drained due to a Prolog error. Please help us analyse these logs and resume the nodes in Slurm.

NodeName=sdfmilan128 Arch=x86_64 CoresPerSocket=64 Reason=Prolog error [root@2023-08-03T15:34:48]

Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Finished wait for job 21155407's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155407's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155398's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155364
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155364 CPU input mask for node: 0x00000000000020000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155364 CPU final HW mask for node: 0x00000000000020000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155364's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Finished wait for job 21155329's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155329's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155366
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155366 CPU input mask for node: 0x00000000000080000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155366 CPU final HW mask for node: 0x00000000000080000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155366's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155387
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155387 CPU input mask for node: 0x00000010000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155387 CPU final HW mask for node: 0x00000010000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155387's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155322
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155322 CPU input mask for node: 0x00000000000000000000000010000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155322 CPU final HW mask for node: 0x00000000000000000000000010000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155361
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155361 CPU input mask for node: 0x00000000000000800000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155361 CPU final HW mask for node: 0x00000000000000800000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155322's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155361's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: error: Waiting for JobId=21155408 REQUEST_LAUNCH_PROLOG notification failed, giving up after 60 sec
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155385
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155385 CPU input mask for node: 0x00000004000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155385 CPU final HW mask for node: 0x00000004000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155385's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: error: Waiting for JobId=21155304 REQUEST_LAUNCH_PROLOG notification failed, giving up after 60 sec
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Finished wait for job 21155313's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155393
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155313's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155393 CPU input mask for node: 0x00000400000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155393 CPU final HW mask for node: 0x00000400000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155393's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155324
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155324 CPU input mask for node: 0x00000000000000000000000080000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155324 CPU final HW mask for node: 0x00000000000000000000000080000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155411
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155411 CPU input mask for node: 0x10000000000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155411 CPU final HW mask for node: 0x10000000000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155324's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155411's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Finished wait for job 21155309's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155309's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Finished wait for job 21155397's prolog launch request
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155397's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155399
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155399 CPU input mask for node: 0x00010000000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155399 CPU final HW mask for node: 0x00010000000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155386
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155386 CPU input mask for node: 0x00000008000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155386 CPU final HW mask for node: 0x00000008000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155399's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155312
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155312 CPU input mask for node: 0x00000000000000000000000000020000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155312 CPU final HW mask for node: 0x00000000000000000000000000020000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155386's prolog to complete
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155400
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155400 CPU input mask for node: 0x00020000000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155400 CPU final HW mask for node: 0x00020000000000000000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155356
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155356 CPU input mask for node: 0x00000000000000040000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: batch_bind: job 21155356 CPU final HW mask for node: 0x00000000000000040000000000000000
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 21155371
Aug 3 15:34:41 sdfmilan205 slurmd[1739767]: slurmd: debug: Waiting for job 21155305's prolog to complete
Aug 3 15:34:42 sdfmilan205 slurmd[1739767]: slurmd: debug: slurm_recv_timeout: Socket POLLERR: Connection reset by peer
Aug 3 15:34:42 sdfmilan205 slurmd[1739767]: slurmd: debug: slurm_recv_timeout: Socket POLLERR: Connection reset by peer
Aug 3 15:34:43 sdfmilan205 slurmd[1739767]: slurmd: Could not launch job 21155347 and not able to requeue it, cancelling job
Aug 3 15:34:43 sdfmilan205 slurmd[1739767]: slurmd: Could not launch job 21155402 and not able to requeue it, cancelling job
Aug 3 15:34:44 sdfmilan205 slurmd[1739767]: slurmd: debug: slurm_recv_timeout: Socket POLLERR: Connection reset by peer
Aug 3 15:34:46 sdfmilan205 slurmd[1739767]: slurmd: Could not launch job 21155351 and not able to requeue it, cancelling job
Aug 3 15:34:47 sdfmilan205 slurmd[1739767]: slurmd: debug: slurm_recv_timeout: Socket POLLERR: Connection reset by peer
Aug 3 15:34:48 sdfmilan205 slurmd[1739767]: slurmd: Could not launch job 21155328 and not able to requeue it, cancelling job
Aug 3 15:34:48 sdfmilan205 slurmd[1739767]: slurmd: debug: slurm_recv_timeout: Socket POLLERR: Connection reset by peer
Aug 3 15:34:49 sdfmilan205 slurmd[1739767]: slurmd: Could not launch job 21155304 and not able to requeue it, cancelling job
Aug 3 15:34:50 sdfmilan205 slurmd[1739767]: slurmd: debug: slurm_recv_timeout at 0 of 4, recv zero bytes

Thanks
Ramya
Created attachment 31624 [details] slurmd logs
Please also attach your slurmctld.log.
Created attachment 31627 [details] slurmctld logs
What does your prolog script look like?

Have you been having network issues? There are lots of errors such as these in your slurmd log:

[2023-08-04T03:44:48.469] error: slurm_receive_msg_and_forward: [[localhost]:60210] failed: Zero Bytes were transmitted or received
[2023-08-04T03:44:48.479] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2023-08-04T03:44:48.479] debug: _service_connection: incomplete message

Slurm can't seem to find your health_check script either:

[2023-08-04T03:42:21.711] error: run_command: health_check can not be executed (/usr/sbin/nhc_wrapper.sh) No such file or directory
[2023-08-04T03:42:21.711] error: health_check didn't run: status:127 reason:Run command failed - configuration error

Some jobs seem to be using a lot of memory as well:

[2023-08-03T16:56:11.764] [21155514.batch] task/cgroup: task_cgroup_memory_check_oom: StepId=21155514.batch hit memory+swap limit at least once during execution. This may or may not result in some failure.

Every 30 seconds I see one of these in your slurmctld.log:

[2023-08-07T15:41:07.326] error: slurm_receive_msg [127.0.0.1:57788]: Zero Bytes were transmitted or received

I assume that is some kind of script on a cron job. What can you tell me about these issues?

Caden
Hi,

We are still seeing these prolog errors.

slurmctld logs:

[2023-08-22T21:50:02.628] error: validate_node_specs: Prolog or job env setup failure on node sdfmilan029, draining the node
[2023-08-22T21:50:02.628] drain_nodes: node sdfmilan029 state set to DRAIN

/var/log/messages:

[2023-08-22T21:50:04.139] [24971932.batch] debug: Handling REQUEST_SIGNAL_CONTAINER
[2023-08-22T21:50:04.139] [24971932.batch] debug: _handle_signal_container for StepId=24971932.batch uid=16924 signal=998
[2023-08-22T21:50:04.139] [24971932.batch] error: *** JOB 24971932 ON sdfmilan029 CANCELLED AT 2023-08-22T21:50:04 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
[2023-08-22T21:50:04.139] debug: _rpc_terminate_job: uid = 16924 JobId=24971947
[2023-08-22T21:50:04.140] debug: credential for job 24971947 revoked
[2023-08-22T21:50:04.140] debug: _rpc_terminate_job: sent SUCCESS for 24971947, waiting for prolog to finish
[2023-08-22T21:50:04.140] debug: Waiting for job 24971947's prolog to complete

(In reply to Caden Ellis from comment #5)
> What does your prolog script look like?
>
> Have you been having network issues? There are lots of errors such as these
> in your slurmd log:
>
> [2023-08-04T03:44:48.469] error: slurm_receive_msg_and_forward:
> [[localhost]:60210] failed: Zero Bytes were transmitted or received
> [2023-08-04T03:44:48.479] error: service_connection: slurm_receive_msg: Zero
> Bytes were transmitted or received
> [2023-08-04T03:44:48.479] debug: _service_connection: incomplete message

We are still unable to find the reason for these errors.

> Slurm can't seem to find your health_check script either.
>
> [2023-08-04T03:42:21.711] error: run_command: health_check can not be
> executed (/usr/sbin/nhc_wrapper.sh) No such file or directory
> [2023-08-04T03:42:21.711] error: health_check didn't run: status:127
> reason:Run command failed - configuration error

We fixed this issue by removing the health_check script [/usr/sbin/nhc_wrapper.sh] from slurm.conf.

> Some jobs seem to be using a lot of memory as well
>
> [2023-08-03T16:56:11.764] [21155514.batch] task/cgroup:
> task_cgroup_memory_check_oom: StepId=21155514.batch hit memory+swap limit at
> least once during execution. This may or may not result in some failure.

We already discussed this in other bugs that I raised.

> Every 30 seconds I see one of these in your slurmctld.log
>
> [2023-08-07T15:41:07.326] error: slurm_receive_msg [127.0.0.1:57788]: Zero
> Bytes were transmitted or received
>
> I assume that is some kind of script on a cron job.
> What can you tell me about these issues?

We already discussed this in other bugs that I raised.

> Caden

Thank you
Ramya
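For reference, that health-check change amounts to dropping the health-check entries from slurm.conf. A minimal sketch of the commented-out lines (HealthCheckProgram and HealthCheckInterval are the standard slurm.conf parameters; the interval value shown here is only an example, not taken from the attached config):

    # Health check disabled until the wrapper exists at a valid path again
    #HealthCheckProgram=/usr/sbin/nhc_wrapper.sh
    #HealthCheckInterval=300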
Created attachment 31911 [details] slurmctld logs
Created attachment 31912 [details] slurm logs
Created attachment 31913 [details] slurm conf
Can I see your prolog script?
[reranna@sdfmilan216 ~]$ sudo cat /var/spool/slurmd/conf-cache/slurm.conf | grep prolog
#Prolog=/etc/slurm/prolog.d/*
Prolog=/etc/slurm/prolog.d/*
TaskProlog=/etc/slurm/tasks/prolog.sh
[reranna@sdfmilan216 ~]$

[reranna@sdfmilan216 ~]$ sudo cat /etc/slurm/tasks/prolog.sh
#!/bin/sh
# This will set LSCRATCH for sbatch jobs since /etc/profile.d/scratch.sh is
# only executed for login shells
if [ -d /lscratch/ ]; then
    if [ -n "$SLURM_JOB_USER" -a -n "$SLURM_JOB_ID" ]; then
        echo export LSCRATCH="/lscratch/$SLURM_JOB_USER/slurm_job_id_$SLURM_JOB_ID"
    fi
fi
[reranna@sdfmilan216 ~]$

[reranna@sdfmilan216 ~]$ sudo cat /etc/slurm/prolog.d/50-prolog
#!/bin/bash
#
# This script adds timestamps to slurm logs

logfile_name() {
    local parent_dir=$(basename $(dirname $(readlink -m $0)))
    local log_dir=/var/log/slurm
    local logfile="$log_dir/$parent_dir.log"
    if [ -n "$1" ]; then
        local logfile0=$logfile
        logfile=$1
        echo logfile: $logfile0 >> $logfile
    fi
    echo $logfile
}

logfile=$(logfile_name $1)

echo "Start: ${SLURM_JOB_ID:-000000} $(date '+%Y%m%dT%H%M%S') $(date '+%s')" >> $logfile

# Log printenv out:
#echo printenv: >> $logfile
#printenv >> $logfile
#echo >> $logfile

# Log Slurm job details:
for v in CUDA_MPS_ACTIVE_THREAD_PERCENTAGE CUDA_VISIBLE_DEVICES \
    SLURM_ARRAY_JOB_ID SLURM_ARRAY_TASK_COUNT SLURM_ARRAY_TASK_ID \
    SLURM_ARRAY_TASK_MAX SLURM_ARRAY_TASK_MIN SLURM_ARRAY_TASK_STEP \
    SLURM_CLUSTER_NAME SLURM_CONF SLURM_CPUS_ON_NODE SLURM_DISTRIBUTION \
    SLURMD_NODENAME SLURM_GPUS SLURM_GTID SLURM_JOB_CPUS_PER_NODE \
    SLURM_JOB_ACCOUNT SLURM_JOB_CONSTRAINTS SLURM_JOB_DERIVED_EC \
    SLURM_JOB_EXIT_CODE SLURM_JOB_EXIT_CODE2 SLURM_JOB_GID \
    SLURM_JOB_GPUS SLURM_JOB_GROUP SLURM_JOB_ID SLURM_JOBID \
    SLURM_JOB_NAME SLURM_JOB_NODELIST SLURM_JOB_NUM_NODES \
    SLURM_JOB_PARTITION SLURM_JOB_QOS SLURM_JOB_UID SLURM_JOB_USER \
    SLURM_LOCAL_GLOBALS_FILE SLURM_LOCALID SLURM_NNODES \
    SLURM_NODE_ALIASES SLURM_NODEID SLURM_PRIO_PROCESS SLURM_PROCID \
    SLURM_RLIMIT_AS SLURM_RLIMIT_CORE SLURM_RLIMIT_CPU SLURM_RLIMIT_DATA \
    SLURM_RLIMIT_FSIZE SLURM_RLIMIT_MEMLOCK SLURM_RLIMIT_NOFILE \
    SLURM_RLIMIT_NPROC SLURM_RLIMIT_RSS SLURM_RLIMIT_STACK \
    SLURM_SCRIPT_CONTEXT SLURM_STEP_ID SLURM_STEPID SLURM_SUBMIT_DIR \
    SLURM_SUBMIT_HOST SLURM_TASK_PID SLURM_TASKS_PER_NODE \
    SLURM_TOPOLOGY_ADDR SLURM_TOPOLOGY_ADDR_PATTERN SLURM_WCKEY \
    SLURM_WORKING_CLUSTER; do
    if [ -n "${!v}" ]; then
        echo $v: ${!v} >> $logfile
    fi
done

# Do stuff here:
echo Get path to slurm commands >> $logfile
slurm_path=$(timeout -k 20s 10s /etc/slurm/scripts/get_slurm_path.sh "$SLURM_CONF" 2>> $logfile)
if [ $? -ne 0 ]; then
    echo $slurm_path >> $logfile
    slurm_path=""
fi
echo slurm_path: $slurm_path >> $logfile

echo Start enable-linger to create /run/user/$SLURM_JOB_UID >> $logfile
ls -l /run/user/ >> $logfile
if [ -d "$slurm_path" ]; then
    squeue_path=$slurm_path/squeue
    if [ ! -f $squeue_path ]; then
        echo "Cannot find squeue command: skipping" >> $logfile
    else
        num_jobs=$($squeue_path -hw $SLURMD_NODENAME -u $SLURM_JOB_USER | wc -l)
        if [ -n "$num_jobs" ] && [ "$num_jobs" -ne 1 ] 2>> $logfile; then
            echo num_jobs: $num_jobs >> $logfile
            $squeue_path -hw $SLURMD_NODENAME >> $logfile
        else
            echo Run: loginctl enable-linger $SLURM_JOB_USER >> $logfile
            loginctl enable-linger $SLURM_JOB_USER &>> $logfile
        fi
    fi
fi
echo Done enable-linger to create /run/user/$SLURM_JOB_UID >> $logfile

echo Start removing user files from under /lscratch/ >> $logfile
if [ -d "$slurm_path" ]; then
    squeue_path=$slurm_path/squeue
    if [ ! -f $squeue_path ]; then
        echo "Cannot find squeue command: skipping" >> $logfile
    else
        num_jobs=$($squeue_path -hw $SLURMD_NODENAME | wc -l)
        if [ -z "$num_jobs" ] || [ "$num_jobs" -ne 1 ] 2>> $logfile; then
            $squeue_path -hw $SLURMD_NODENAME >> $logfile
            echo num_jobs: $num_jobs >> $logfile
        else
            echo Run: /etc/slurm/scripts/rm_lscratch_files.sh >> $logfile
            timeout -k 80s 60s /etc/slurm/scripts/rm_lscratch_files.sh &>> $logfile
        fi
    fi
fi
echo Done removing user files from under /lscratch/ >> $logfile

echo "Start creating user LSCRATCH" >> $logfile
# Done here and in /etc/profile because sbatch does not execute /etc/profile
if [[ ! -d "/lscratch " ]]; then
    echo "Warning: /lscratch/ does not exist" >> "$logfile"
else
    echo mkdir -p "/lscratch/${SLURM_JOB_USER}/slurm_job_id_${SLURM_JOB_ID}" &>> "$logfile"
    mkdir -p "/lscratch/${SLURM_JOB_USER}/slurm_job_id_${SLURM_JOB_ID}" &>> "$logfile"
    echo chown "${SLURM_JOB_USER}" "/lscratch/${SLURM_JOB_USER}" "${SLURM_JOB_USER}/slurm_job_id_${SLURM_JOB_ID}" &>> "$logfile"
    chown "${SLURM_JOB_USER}" "/lscratch/${SLURM_JOB_USER}" "${SLURM_JOB_USER}/slurm_job_id_${SLURM_JOB_ID}" &>> "$logfile"
fi
echo "Done creating user LSCRATCH" >> $logfile

echo "Done: ${SLURM_JOB_ID:-000000} $(date '+%Y%m%dT%H%M%S') $(date '+%s')" >> $logfile
echo >> $logfile
[reranna@sdfmilan216 ~]$
Can I also see the output of sdiag?

It looks like your prolog script is timing out occasionally. There is a lot going on in that script. Writing to files and calling squeue a lot could be contributing, as well as the long timeouts within the script.

[2023-08-22T16:13:46.003] error: Waiting for JobId=24917189 REQUEST_LAUNCH_PROLOG notification failed, giving up after 60 sec

So it is only waiting 60 seconds for the prolog, and when your system is busy the prolog could be hitting the timeout. I think your MessageTimeout=30 is a good value, so you need to look for ways to speed up that prolog. For example:

    if [ ! -f $squeue_path ]; then
        echo "Cannot find squeue command: skipping" >> $logfile
    else
        num_jobs=$($squeue_path -hw $SLURMD_NODENAME | wc -l)
        if [ -z "$num_jobs" ] || [ "$num_jobs" -ne 1 ] 2>> $logfile; then
            $squeue_path -hw $SLURMD_NODENAME >> $logfile
            echo num_jobs: $num_jobs >> $logfile
        else
            echo Run: /etc/slurm/scripts/rm_lscratch_files.sh >> $logfile
            timeout -k 80s 60s /etc/slurm/scripts/rm_lscratch_files.sh &>> $logfile
        fi
    fi

Here squeue is called twice when you can get away with one call. We recommend running as few client commands in scripts as possible to keep things fast. The timeout on your rm_lscratch_files.sh is also 60s with a kill at 80s, at which point the controller has already given up on the prolog.

Can you see if speeding up the script works?

Caden
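To illustrate the idea (this is only a sketch, not a drop-in replacement; it reuses $squeue_path, $SLURMD_NODENAME, and $logfile from the posted 50-prolog, and the shorter timeout values are just examples), the node-wide squeue listing can be captured once and reused for both the count and the log output:

    # Sketch: one squeue call, output reused for both the count and the log dump
    node_jobs=$($squeue_path -hw $SLURMD_NODENAME)         # single client command
    num_jobs=$(printf '%s\n' "$node_jobs" | grep -c .)     # count non-empty lines
    if [ "$num_jobs" -ne 1 ]; then
        # Other jobs are still on the node: log what we saw and skip the cleanup
        echo num_jobs: $num_jobs >> $logfile
        printf '%s\n' "$node_jobs" >> $logfile
    else
        echo Run: /etc/slurm/scripts/rm_lscratch_files.sh >> $logfile
        # Keep the cleanup timeout well below the 60 sec REQUEST_LAUNCH_PROLOG wait
        timeout -k 15s 10s /etc/slurm/scripts/rm_lscratch_files.sh &>> $logfile
    fi

Whether 10 seconds is enough for the lscratch cleanup is something you would have to verify on your nodes; the point is that the prolog as a whole has to finish well inside the controller's 60-second wait.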
Do you have an update on this?
Hi,

It looks like we had some bugs in the script. We fixed them, and since then we have not seen any prolog errors. We would like to wait through this week and monitor whether that actually fixed the issue.

Thank you
Ramya
How did this last week go? Are we good to close this? Caden
Yes, we are good to close. Thank you Ramya
Closing