Description
Yang Liu
2021-05-19 12:17:18 MDT
One user had two jobs that showed the same status, but the user confirmed that these two jobs failed. The user did not get any Slurm standard output file in the directory, but did get the Slurm notification email saying "node fail". The user always uses the --email option when submitting jobs, and that is the only thing the user knows about these two oddly terminated jobs.

Hi Yang,
> How can we tell those jobs were cancelled by users/admins or due to node
> failures?
First of all, let me note that when a *job* is terminated due to a NODE_FAIL, Slurm tries its best to cancel all the *steps* of that job.
So, in your case both *jobs* 1055978_12 and 1351015 were terminated due to NODE_FAIL, and some of their *steps* got cancelled.
If you use -X,--allocations you'll see only the *job* information, not the *steps*.
Note that the step information may still be useful for the user. For example, imagine this job with 3 steps:
$ sbatch --wrap "srun sleep 3; srun sleep 5; srun sleep 300"
Submitted batch job 10558
After about 3 seconds of running you see something like this:
$ sacct -j 10558 --format jobid,jobname,state
JobID JobName State
------------ ---------- ----------
10558 wrap RUNNING
10558.batch batch RUNNING
10558.extern extern RUNNING
10558.0 sleep COMPLETED
10558.1 sleep RUNNING
And some seconds later:
$ sacct -j 10558 --format jobid,jobname,state
JobID JobName State
------------ ---------- ----------
10558 wrap RUNNING
10558.batch batch RUNNING
10558.extern extern RUNNING
10558.0 sleep COMPLETED
10558.1 sleep COMPLETED
10558.2 sleep RUNNING
If the node now fails, you will end up with something like:
$ sacct -j 10558 --format jobid,jobname,state -X
JobID JobName State
------------ ---------- ----------
10558 wrap NODE_FAIL
$ sacct -j 10558 --format jobid,jobname,state
JobID JobName State
------------ ---------- ----------
10558 wrap NODE_FAIL
10558.batch batch COMPLETED
10558.extern extern CANCELLED
10558.0 sleep COMPLETED
10558.1 sleep COMPLETED
10558.2 sleep CANCELLED
Note that the user may be interested in knowing that steps .0 and .1 were able to complete.
Also note that the JobRequeue option controls whether batch jobs hit by a NODE_FAIL are cancelled or requeued; see the sketch below.
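For reference, this behaviour is governed by the JobRequeue parameter in slurm.conf and can be overridden per job at submission time. A minimal sketch, only to illustrate the two knobs (the values shown are an illustration, not a recommendation for this site):

# slurm.conf: 1 = allow batch jobs to be requeued (e.g. after a NODE_FAIL), which is the default; 0 = do not requeue by default
JobRequeue=1

# Per-job overrides at submission time
$ sbatch --no-requeue --wrap "srun sleep 300"   # never requeue this job
$ sbatch --requeue --wrap "srun sleep 300"      # allow requeue even if JobRequeue=0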
Finally, imagine this case:
$ sbatch --wrap "srun sleep 3; srun sleep 50; srun sleep 3"
Submitted batch job 10559
$ sacct -j 10559 --format jobid,jobname,state
JobID JobName State
------------ ---------- ----------
10559 wrap RUNNING
10559.batch batch RUNNING
10559.extern extern RUNNING
10559.0 sleep COMPLETED
10559.1 sleep RUNNING
$ scancel 10559.1
$ sacct -j 10559 --format jobid,jobname,state%20
JobID JobName State
------------ ---------- --------------------
10559 wrap RUNNING
10559.batch batch RUNNING
10559.extern extern RUNNING
10559.0 sleep COMPLETED
10559.1 sleep CANCELLED by 1000
10559.2 sleep RUNNING
And finally:
$ sacct -j 10559 --format jobid,jobname,state%20
JobID JobName State
------------ ---------- --------------------
10559 wrap COMPLETED
10559.batch batch COMPLETED
10559.extern extern COMPLETED
10559.0 sleep COMPLETED
10559.1 sleep CANCELLED by 1000
10559.2 sleep COMPLETED
Note that running scancel on a step does not cancel the whole job, and that the state of a step (or job) cancelled by a user carries a "by UID" suffix, which distinguishes manual scancels issued by users from cancellations performed by Slurm itself.
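For completeness, that numeric suffix can be mapped back to a login name with standard tools; a small illustration using the UID from the example above (1000 is just the value that happened to appear there):

$ sacct -j 10559 --format jobid,jobname,state%20
$ getent passwd 1000          # resolves the UID shown in "CANCELLED by 1000" to a username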
Is this answering your question?
Regards,
Albert
Hi Albert,

Thank you for the clear and useful explanations. Our next question is how to get more details on why/how Slurm marks a node failure. Our system admins checked logs and did find useful information on a node failure event. Meanwhile, other jobs ran on the same node and completed successfully while the node failure was reported by Slurm for another job. For example, job 1 was reported as a node failure, while job 2 ran to completion on the same node as job 1. During the execution of job 2, the node failure was reported for job 1. How can we find more details on why the node failure was reported only for job 1 and not for the other jobs?

Thanks,
Yang

Hi Yang,

> Thank you for the clear and useful explanations.

Glad it helped.

> Our next question is how to get more details on why/how Slurm marks a node
> failure. Our system admins checked logs and did find useful information on a
> node failure event.

The logs may provide useful information. Could you attach the slurmctld and slurmd logs of the nodes for the periods in which you got those node fails (e.g. job 1351015)?

> Meanwhile, other jobs ran on the same node and completed successfully while
> the node failure was reported by Slurm for another job. For example, job 1
> was reported as a node failure, while job 2 ran to completion on the same
> node as job 1. During the execution of job 2, the node failure was reported
> for job 1. How can we find more details on why the node failure was reported
> only for job 1 and not for the other jobs?

This seems strange. I would need the slurmctld and slurmd logs for the period of those job 1 and job 2 to investigate it further, but first we can try a much simpler check to see if it explains the issue: can you run "sacct --format=JobId,State,Reason" on the failed job? Maybe the Reason gives us more information without digging further into the logs.

Regards,
Albert

Hi Albert,

Here are the logs for the two jobs that failed on 2021-05-27:

[yliu385@node947 yliu385]$ sacct -j 1407294 --format=JobId,State,Reason,NodeList,Start,End
       JobID      State                 Reason        NodeList               Start                 End
------------ ---------- ---------------------- --------------- ------------------- -------------------
1407294       NODE_FAIL                   None        node1307 2021-05-27T11:32:45 2021-05-27T11:37:54
1407294.bat+     FAILED                               node1307 2021-05-27T11:32:45 2021-05-27T11:40:29
1407294.ext+  CANCELLED                               node1307 2021-05-27T11:32:45 2021-05-27T11:40:29

[yliu385@node947 yliu385]$ sacct -j 1405353_32 --format=JobId,State,Reason,NodeList,Start,End
       JobID      State                 Reason        NodeList               Start                 End
------------ ---------- ---------------------- --------------- ------------------- -------------------
1405353_32    NODE_FAIL                   None        node1328 2021-05-27T09:25:33 2021-05-27T09:27:52
1405353_32.+     FAILED                               node1328 2021-05-27T09:25:33 2021-05-27T09:28:27
1405353_32.+  CANCELLED                               node1328 2021-05-27T09:25:33 2021-05-27T09:28:27

Slurmctld log for 2021-05-27 when the two jobs failed/ended:

-bash-4.2$ grep 2021-05-27 slurmctld
[2021-05-27T11:38:04.248] error: _find_node_record(766): lookup failure for 1333
[2021-05-27T11:47:21.560] error: slurm_receive_msg [172.20.205.51:56346]: Zero Bytes were transmitted or received
[2021-05-27T13:57:42.235] error: _find_node_record(766): lookup failure for l
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES cpu grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES mem grp_used_tres_run_secs underflow, tried to remove 437551104 seconds when only 437518336 remained.
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES node grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES billing grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES cpu grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES mem grp_used_tres_run_secs underflow, tried to remove 437551104 seconds when only 437518336 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES node grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES billing grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1200 seconds when only 724 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES mem grp_used_tres_run_secs underflow, tried to remove 7680000 seconds when only 4633600 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES node grp_used_tres_run_secs underflow, tried to remove 300 seconds when only 181 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES billing grp_used_tres_run_secs underflow, tried to remove 1200 seconds when only 724 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 300 seconds when only 181 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES cpu grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES mem grp_used_tres_run_secs underflow, tried to remove 2201600 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES node grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES billing grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES cpu grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES mem grp_used_tres_run_secs underflow, tried to remove 2201600 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES node grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES billing grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
-bash-4.2$

Slurmd log:

[yliu385@node1328 log]$ sudo cat slurm/slurmd|grep 2021-05-27
[2021-05-27T12:33:03.625] [1391743.batch] error: *** JOB 1391743 ON node1328 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***
[2021-05-27T12:33:03.626] [1391744.batch] error: *** JOB 1391744 ON node1328 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***

[yliu385@node1307 log]$ sudo cat slurm/slurmd|grep 2021-05-27
[2021-05-27T02:59:36.600] [1403501.batch] error: *** JOB 1403501 ON node1307 CANCELLED AT 2021-05-27T02:59:36 DUE TO TIME LIMIT ***
[2021-05-27T12:33:03.656] [1391745.batch] error: *** JOB 1391745 ON node1307 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***
[2021-05-27T12:33:03.660] [1391746.batch] error: *** JOB 1391746 ON node1307 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***
[2021-05-27T15:40:15.761] [1408962.batch] error: Detected 1 oom-kill event(s) in step 1408962.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-05-27T15:41:58.186] [1409110.batch] error: Detected 1 oom-kill event(s) in step 1409110.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-05-27T19:22:10.863] [1403667.batch] error: *** JOB 1403667 ON node1307 CANCELLED AT 2021-05-27T19:22:10 ***

Yang

Hi Yang,

> Here are the logs for the two jobs that failed on 2021-05-27:

Thanks. Although I'm investigating one lead that your logs provided, I would need more information from them. I assume that you are using SlurmctldDebug=error and SlurmdDebug=error, right? Could you increase the debug level to:

DebugFlags=Agent,Network
SlurmctldDebug=debug
SlurmdDebug=debug

Logs will be more verbose at this level. If you reproduce the issue while this debug level is enabled, please attach the new logs and then restore your previous level.

> Slurmctld log for 2021-05-27 when the two jobs failed/ended:
> -bash-4.2$ grep 2021-05-27 slurmctld
> [2021-05-27T11:38:04.248] error: _find_node_record(766): lookup failure for
> 1333
> [2021-05-27T11:47:21.560] error: slurm_receive_msg [172.20.205.51:56346]:
> Zero Bytes were transmitted or received
> [2021-05-27T13:57:42.235] error: _find_node_record(766): lookup failure for l

These lines above are the main lead I'm investigating. They could explain the NODE_FAIL, but I need to look at it more closely.

> [2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS
> pri-jfranck TRES billing grp_used_tres_run_secs underflow, tried to remove
> 13353 seconds when only 13352 remained.
> [2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862:
> assoc 18783 TRES cpu grp_used_tres_run_secs underflow, tried to remove 13353
> seconds when only 13352 remained.

These errors seem unrelated to this issue, but they are most probably related to bug 7375.

Regards,
Albert

Yang,

Could you also attach your slurm.conf?

Thanks,
Albert

Created attachment 20049 [details]
slurm.conf
Hi Albert,
The slurm.conf is attached.
What are the commands to change the log levels, other than modifying slurm.conf directly? For example, do 'scontrol setDebugLevel debug', 'scontrol setDebugFlag +agent', and 'scontrol setDebugFlag +network' work? Is there any command for "SlurmdDebug=debug"?
DebugFlags=Agent,Network
SlurmctldDebug=debug
SlurmdDebug=debug
Yang
Hi Yang,

> The slurm.conf is attached.

Thanks!

> What are the commands to change the log levels, other than modifying
> slurm.conf directly? For example, do 'scontrol setDebugLevel debug',
> 'scontrol setDebugFlag +agent', and 'scontrol setDebugFlag +network' work?

Yes, these commands are correct. You can see the current levels and flags with a command like this:

$ scontrol show config | grep Debug

> Is there any command for "SlurmdDebug=debug"?

Unfortunately we don't have such a command. You need to update slurm.conf and run "scontrol reconfigure". Please note that any changes done with "scontrol setDebugLevel/setDebugFlag" will be lost if you reconfigure or restart slurmctld.

> DebugFlags=Agent,Network
> SlurmctldDebug=debug
> SlurmdDebug=debug

If you can reproduce the issue with these log levels, please attach the logs. That should help us.

Thanks,
Albert
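Pulling the above together, a possible end-to-end sequence for raising and checking the debug levels might look like the following. This is only a sketch: setdebug and setdebugflags are the canonical scontrol subcommand spellings, and the slurm.conf edit for SlurmdDebug is site-specific.

$ scontrol show config | grep Debug      # check the current levels and flags
$ scontrol setdebug debug                # raise SlurmctldDebug at runtime
$ scontrol setdebugflags +Agent          # add DebugFlags at runtime
$ scontrol setdebugflags +Network
# SlurmdDebug has no runtime command: set SlurmdDebug=debug in slurm.conf, then
$ scontrol reconfigure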
Hi Yang,

> If you can reproduce the issue with these log levels, please attach the logs.
> That should help us.
Have you been able to reproduce the issue with the higher log levels?
Regards,
Albert
Hi Yang,

If this is OK with you, I'm closing this ticket as infogiven based on comment 5. If you still want to investigate further the case that you described in comment 6, please don't hesitate to reopen this bug.

Regards,
Albert

Hi Albert,

Jobs still failed in our cluster. For example:

$ sacct -j 1740055_67 --format="start,end,timelimit,state%30,reason,jobid,jobidraw"
              Start                 End  Timelimit                          State                 Reason        JobID     JobIDRaw
------------------- ------------------- ---------- ------------------------------ ---------------------- ------------ ------------
2021-07-12T08:43:24 2021-07-12T09:23:11   03:00:00                      NODE_FAIL                   None   1740055_67      1740123
2021-07-12T08:43:24 2021-07-12T09:35:44                                    FAILED                         1740055_67.+ 1740123.bat+
2021-07-12T08:43:24 2021-07-12T09:58:21                                 CANCELLED                         1740055_67.+ 1740123.ext+

===============slurmctld log==================
$ grep 1740055_67 ~/slurmctld.2021-07-12
[2021-07-12T08:43:24.972] sched: Allocate JobId=1740055_67(1740123) NodeList=node1311 #CPUs=1 Partition=batch
[2021-07-12T08:43:32.747] prolog_running_decr: Configuration for JobId=1740055_67(1740123) is complete
[2021-07-12T08:43:32.747] Extending JobId=1740055_67(1740123) time limit by 8 secs for configuration
[2021-07-12T09:23:11.183] Batch JobId=1740055_67(1740123) missing from batch node node1311 (not found BatchStartTime after startup)
[2021-07-12T09:23:11.183] _job_complete: JobId=1740055_67(1740123) WTERMSIG 126
[2021-07-12T09:23:11.183] _job_complete: JobId=1740055_67(1740123) cancelled by node failure
[2021-07-12T09:23:11.183] _job_complete: JobId=1740055_67(1740123) done
[2021-07-12T09:26:17.325] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:29:27.948] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:32:37.369] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:35:47.399] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:38:57.531] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:42:07.926] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:45:17.945] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:48:27.277] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:51:32.983] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:54:42.496] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:57:44.810] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:58:21.485] cleanup_completing: JobId=1740055_67(1740123) completion process took 2110 seconds

===============slurmd log==================
[yliu385@node1311 log]$ head -n 20 ~/slurmd-node1311-2021-07-12
[2021-07-12T06:47:40.826] [1739857.0] error: Detected 1 oom-kill event(s) in step 1739857.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-12T06:47:40.842] [1739857.batch] error: Detected 1 oom-kill event(s) in step 1739857.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-12T06:47:40.864] [1739857.extern] error: Detected 1 oom-kill event(s) in step 1739857.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-12T10:39:04.473] debug: Log file re-opened
[2021-07-12T10:39:04.705] debug: CPUs:24 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:1
[2021-07-12T10:39:04.705] Message aggregation disabled
[2021-07-12T10:39:04.705] debug: Reading the topology.conf file
[2021-07-12T10:39:04.708] error: _find_node_record(766): lookup failure for gpu2101
[2021-07-12T10:39:04.708] error: _find_node_record(778): lookup failure for gpu2101 alias gpu2101

======dmesg -T==============
[Mon Jul 12 06:47:27 2021] Memory cgroup stats for /slurm/uid_140272125/job_1739857/step_extern: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Jul 12 06:47:27 2021] Memory cgroup stats for /slurm/uid_140272125/job_1739857/step_batch: cache:0KB rss:3876KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:3876KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Jul 12 06:47:27 2021] Memory cgroup stats for /slurm/uid_140272125/job_1739857/step_0: cache:0KB rss:10236124KB rss_huge:235520KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:10236124KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Jul 12 06:47:27 2021] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Mon Jul 12 06:47:27 2021] [185088] 140272125 185088 2394 364 10 0 0 slurm_script
[Mon Jul 12 06:47:27 2021] [185399] 140272125 185399 62093 1287 33 0 0 srun
[Mon Jul 12 06:47:27 2021] [185400] 140272125 185400 10871 448 25 0 0 srun
[Mon Jul 12 06:47:27 2021] [185415] 140272125 185415 2681405 2559233 5097 0 0 htseq-count
[Mon Jul 12 06:47:27 2021] Memory cgroup out of memory: Kill process 185415 (htseq-count) score 1001 or sacrifice child
[Mon Jul 12 06:47:27 2021] Killed process 185415 (htseq-count), UID 140272125, total-vm:10725620kB, anon-rss:10235892kB, file-rss:4kB, shmem-rss:1036kB

Thanks,
Yang

What does the 'credential for job 1756506 revoked' in the slurmd log mean?
Yang

$ sacct -j 1756506 --format='jobid,state,reason,start,end,timelimit,MaxRSS,ReqMem,nodeList'
       JobID      State                 Reason               Start                 End  Timelimit     MaxRSS     ReqMem        NodeList
------------ ---------- ---------------------- ------------------- ------------------- ---------- ---------- ---------- ---------------
1756506       NODE_FAIL                   None 2021-07-14T09:48:06 2021-07-14T09:50:58 2-12:00:00                  16Gn        node1306
1756506.bat+     FAILED                        2021-07-14T09:48:06 2021-07-14T09:55:58                             16Gn        node1306
1756506.ext+  CANCELLED                        2021-07-14T09:48:06 2021-07-14T09:55:58                             16Gn        node1306
[yliu385@node950 yliu385]$

============slurmctld log=========================
$ grep 1756506 ~/slurmctld.2021-07-14
[2021-07-14T09:48:04.611] _slurm_rpc_submit_batch_job: JobId=1756506 InitPrio=47834 usec=1295
[2021-07-14T09:48:06.525] sched: Allocate JobId=1756506 NodeList=node1306 #CPUs=1 Partition=batch
[2021-07-14T09:48:07.424] prolog_running_decr: Configuration for JobId=1756506 is complete
[2021-07-14T09:48:07.424] Extending JobId=1756506 time limit by 1 secs for configuration
[2021-07-14T09:50:58.970] Batch JobId=1756506 missing from batch node node1306 (not found BatchStartTime after startup)
[2021-07-14T09:50:58.970] _job_complete: JobId=1756506 WTERMSIG 126
[2021-07-14T09:50:58.970] _job_complete: JobId=1756506 cancelled by node failure
[2021-07-14T09:50:58.970] _job_complete: JobId=1756506 done
[2021-07-14T09:54:00.491] Resending TERMINATE_JOB request JobId=1756506 Nodelist=node1306
[2021-07-14T09:55:58.878] slurmd error running JobId=1756506 from node(s)=node1306: Job credential revoked
[2021-07-14T09:55:58.884] cleanup_completing: JobId=1756506 completion process took 300 seconds
[2021-07-14T14:17:54.948] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1756506 uid 140114611
[2021-07-14T14:17:54.949] job_str_signal(3): invalid JobId=1756506
[2021-07-14T14:17:54.949] _slurm_rpc_kill_job: job_str_signal() JobId=1756506 sig 9 returned Invalid job id specified
[2021-07-14T14:18:06.592] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1756506 uid 140114611
[2021-07-14T14:18:06.592] job_str_signal(3): invalid JobId=1756506
[2021-07-14T14:18:06.592] _slurm_rpc_kill_job: job_str_signal() JobId=1756506 sig 9 returned Invalid job id specified

============slurmd log========================
$ grep 1756506 ~/slurmd-node1306.2021-07-14
[2021-07-14T09:48:07.475] debug: [job 1756506] attempting to run prolog [/usr/local/etc/slurm/prolog]
[2021-07-14T09:48:07.477] debug: Waiting for job 1756506's prolog to complete
[2021-07-14T09:50:58.971] debug: credential for job 1756506 revoked
[2021-07-14T09:50:58.971] debug: _rpc_terminate_job: sent SUCCESS for 1756506, waiting for prolog to finish
[2021-07-14T09:50:58.971] debug: Waiting for job 1756506's prolog to complete
[2021-07-14T09:55:58.877] _run_prolog: prolog with lock for job 1756506 ran for 471 seconds
[2021-07-14T09:55:58.877] prolog for job 1756506 ran for 471 seconds
[2021-07-14T09:55:58.877] Job 1756506 already killed, do not launch extern step
[2021-07-14T09:55:58.877] debug: Finished wait for job 1756506's prolog to complete
[2021-07-14T09:55:58.877] debug: Finished wait for job 1756506's prolog to complete
[2021-07-14T09:55:58.878] Job 1756506 already killed, do not launch batch job
[2021-07-14T09:55:58.878] debug: Waiting for job 1756506's prolog to complete
[2021-07-14T09:55:58.878] debug: Finished wait for job 1756506's prolog to complete
[2021-07-14T09:55:58.878] debug: [job 1756506] attempting to run epilog [/usr/local/etc/slurm/epilog]
[2021-07-14T09:55:58.883] debug: completed epilog for jobid 1756506
[2021-07-14T09:55:58.883] debug: Job 1756506: sent epilog complete msg: rc = 0

Hi Yang,

> Jobs still failed in our cluster. For example:
>
> [2021-07-12T10:39:04.708] error: _find_node_record(766): lookup failure for
> gpu2101
> [2021-07-12T10:39:04.708] error: _find_node_record(778): lookup failure for
> gpu2101 alias gpu2101

Yes, this "lookup failure" seems to be the main reason why Slurm is marking the node as failing. Could you attach the full logs of slurmd on node1311 and of slurmctld? Were you able to set up DebugFlags=Agent,Network as mentioned in comment 10?

> ======dmesg -T==============

Yes, if you include the syslogs of node1311 for 2021-07-12, that would help too.

> What does the 'credential for job 1756506 revoked' in the slurmd log mean?

For every submission, Slurm verifies that it can be converted and launched as a job in the system, and then it provides job credentials that allow the job on a node to send messages to other nodes, daemons, etc. There are several cases where Slurm will revoke those credentials (they need to expire), after which they are no longer valid.

I would still need to see the complete logs, but based on what you've already attached, I think that you have some kind of network or time synchronization issue in the system (or on some nodes). Can you verify the time synchronization in your system?

Regards,
Albert

Created attachment 20528 [details]
Slurmctld Log for 2021-07-23
Created attachment 20529 [details]
Slurmd Log for node1303 on 2021-07-23
Created attachment 20530 [details]
dmesg Log for node1303 on 2021-07-23
Hi Albert,

We have changed the "DebugFlags" to "Network,Agent":

$ scontrol show config|grep -i debug
DebugFlags              = Agent,Network
SlurmctldDebug          = debug
SlurmctldSyslogDebug    = unknown
SlurmdDebug             = debug
SlurmdSyslogDebug       = unknown

After the change, another job failed:

$ sacct -j 1832609_15 --format='jobid,state,reason,start,end,timelimit,MaxRSS,ReqMem,nodeList,jobIdRaw,partition'
       JobID      State                 Reason               Start                 End  Timelimit     MaxRSS     ReqMem        NodeList     JobIDRaw  Partition
------------ ---------- ---------------------- ------------------- ------------------- ---------- ---------- ---------- --------------- ------------ ----------
1832609_15    NODE_FAIL                   None 2021-07-23T18:52:41 2021-07-23T18:59:19   08:00:00                   4Gn        node1303      1832624      batch
1832609_15.+     FAILED                        2021-07-23T18:52:41 2021-07-23T19:04:32                              4Gn        node1303 1832624.bat+
1832609_15.+  CANCELLED                        2021-07-23T18:52:41 2021-07-23T19:07:30                              4Gn        node1303 1832624.ext+

I am not sure how to check time synchronization. Below is what I tried:

[yliu385@node1303 ~]$ ntpstat
synchronised to NTP server (172.20.0.6) at stratum 3
   time correct to within 5 ms
   polling server every 512 s
[yliu385@node1303 ~]$ echo $?
0
[yliu385@node1303 ~]$ ntpq -pn
::1: timed out, nothing received
***Request timed out
[yliu385@node1303 ~]$ timedatectl status
      Local time: Sun 2021-07-25 08:10:32 EDT
  Universal time: Sun 2021-07-25 12:10:32 UTC
        RTC time: Sun 2021-07-25 12:10:32
       Time zone: n/a (EDT, -0400)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2021-03-14 01:59:59 EST
                  Sun 2021-03-14 03:00:00 EDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2021-11-07 01:59:59 EDT
                  Sun 2021-11-07 01:00:00 EST

The slurmctld, slurmd, and dmesg logs for node1303 on 2021-07-23 are attached.

Thanks,
Yang

Hi Yang,

> We have changed the "DebugFlags" to "Network,Agent":
> $ scontrol show config|grep -i debug
> DebugFlags              = Agent,Network
> SlurmctldDebug          = debug
> SlurmctldSyslogDebug    = unknown
> SlurmdDebug             = debug
> SlurmdSyslogDebug       = unknown
>
> After the change, another job failed:
> $ sacct -j 1832609_15

Thanks for the logs. Looking at them, I can confirm that for this particular job 1832609_15 (1832624) we had some problem while trying to run the Prolog. For some reason it took 11 minutes to end (actually, it seems that it took 11 minutes to start the prolog), so Slurm decided that the node was facing issues, cancelled the job, and marked the node to drain.

In general this is expected behavior and I would suggest looking at what could be happening on the nodes and in the prolog, but in your case I'm investigating it further, because this ordering and gap in the logs seem wrong:

[2021-07-23T18:52:51.520] debug: Waiting for job 1832624's prolog to complete
[2021-07-23T19:03:01.368] debug: [job 1832624] attempting to run prolog [/usr/local/etc/slurm/prolog]

So maybe Slurm is not starting the prolog properly and is incorrectly blaming the node. Can you also attach your plugstack.conf and your Prolog=/usr/local/etc/slurm/prolog files?

> I am not sure how to check time synchronization.

Right now I'm discarding this line of investigation and focusing on the prolog one.
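One quick way to confirm where those files live is to read them back from the running configuration; an illustrative one-liner (the grep pattern is only an example):

$ scontrol show config | grep -iE 'prolog|plugstack'   # shows the Prolog, Epilog and PlugStackConfig settings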
Regards,
Albert

Hi Albert,

It seems that we don't have PlugStack.conf:

$ scontrol show config|grep -i plugs
PlugStackConfig         = (null)
SlurmctldPlugstack      = (null)

$ cat prolog
#!/bin/bash
#echo "## SLURM PROLOG ###############################################################"
#echo "## Job ID : $SLURM_JOB_ID"
#if [ -n "$SLURM_JOB_NAME" ]; then
#echo "## Job Name : $SLURM_JOB_NAME"
#else
#echo "## Job Name : [none]"
#fi
#echo "## Nodelist : $SLURM_JOB_NODELIST"
#echo "## CPUs : $SLURM_NPROCS"
#if [ -n "$SLURM_MEM_PER_CPU" ]; then
#echo "## Mem/CPU : $SLURM_MEM_PER_CPU MB"
#fi
#if [ -n "$SLURM_MEM_PER_NODE" ]; then
#echo "## Mem/Node : $SLURM_MEM_PER_NODE MB"
#fi
#echo "## Directory : $SLURM_SUBMIT_DIR"
#echo "## Started : `date`"
#echo "###############################################################################"

/gpfs/runtime/sbin/scrub_tmp >/root/prolog.out 2>&1

## Create directory to catch $XDG_RUNTIME_DIR
/bin/mkdir -p /run/user/$SLURM_JOB_UID
/bin/chown $SLURM_JOB_USER /run/user/$SLURM_JOB_UID
/bin/chmod 700 /run/user/$SLURM_JOB_UID

##SUPREMM job info collector
/usr/local/etc/slurm/supremm_prolog.pl

if [[ $SLURM_JOB_PARTITION == dbgap ]]
then
/usr/local/etc/slurm/dbgap_prolog
fi

exit 0

Yang

Hi Yang,

> It seems that we don't have PlugStack.conf:
> $ scontrol show config|grep -i plugs
> PlugStackConfig         = (null)
> SlurmctldPlugstack      = (null)

Ok.

> $ cat prolog
>
> /gpfs/runtime/sbin/scrub_tmp >/root/prolog.out 2>&1

I assume that this does some basic storage setup for the job.

> ##SUPREMM job info collector
> /usr/local/etc/slurm/supremm_prolog.pl

I'm curious about this supremm_prolog.pl; could you explain a bit what it does?

> if [[ $SLURM_JOB_PARTITION == dbgap ]]
> then
> /usr/local/etc/slurm/dbgap_prolog
> fi

Also, what about this dbgap_prolog?

Thanks,
Albert

Created attachment 20582 [details]
cleans out /tmp/, /var/tmp, and /dev/shm.
Created attachment 20583 [details]
sets up security for using protected dbGAP sequence data.
Created attachment 20584 [details]
sets up pcp data collection for use with XDMOD
Hi Albert,

The three files are attached:
* scrub_tmp safely cleans out /tmp/, /var/tmp, and /dev/shm.
* dbgap_prolog sets up security for using protected dbGAP sequence data.
* supremm_prolog sets up pcp data collection for use with XDMOD.

Yang

Hi Yang,

I found that the issue is related to how long your prologs run and to the fact that the prologs are running one at a time. The logs show many jobs being started at the same time. Since your prolog takes about 75 seconds, some jobs were not able to finish (or even start) their prolog until after they timed out.

Are you still getting issues like this? How common is it?

-Scott

Yang,

Prologs should run in parallel, but there was a bug that caused them to run one at a time. We found and fixed the bug with commits 4146467106 through 308d06f686. It will be released in 21.08.3. This should solve the issue, so I am, therefore, closing the ticket. If you have any questions, please mark this bug as "open".

-Scott
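As a general follow-up to Scott's diagnosis about prolog runtime, one way a site could measure how long each prolog invocation actually takes is to log timestamps from inside the prolog itself. This is only a hedged sketch, not part of the site's real prolog; the "slurm-prolog" syslog tag is arbitrary and the existing prolog work is elided:

#!/bin/bash
# Hypothetical timing instrumentation wrapped around the existing prolog steps.
start=$(date +%s)
logger -t slurm-prolog "job ${SLURM_JOB_ID:-unknown} prolog start on $(hostname -s)"

# ... existing prolog work (scrub_tmp, XDG_RUNTIME_DIR setup, supremm_prolog.pl, ...) ...

end=$(date +%s)
logger -t slurm-prolog "job ${SLURM_JOB_ID:-unknown} prolog took $((end - start)) seconds"
exit 0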