Ticket 11656

Summary: Inconsistent Stats for Job Steps
Product: Slurm Reporter: Yang Liu <yang_liu2>
Component: Accounting    Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: Brown Univ
Version Fixed: 21.08.3 22.05.0pre1
Attachments: slurm.conf
Slurmctld Log for 2021-07-23
Slurmd Log for node1303 on 2021-07-23
dmesg Log for node1303 on 2021-07-23
cleans out /tmp/, /var/tmp, and /dev/shm.
sets up security for using protected dbGAP sequence data.
sets up pcp data collection for use with XDMOD

Description Yang Liu 2021-05-19 12:17:18 MDT
Greetings,
In our cluster, some jobs show a 'CANCELLED' state for jobid.extern but a 'NODE_FAIL' state for the jobid itself. For example,

$ sacct -j 1055978_12 -o"jobid%30,start,end,state,ExitCode,AllocCPUs"
                         JobID               Start                 End      State ExitCode  AllocCPUS 
------------------------------ ------------------- ------------------- ---------- -------- ---------- 
                    1055978_12 2021-04-14T16:35:03 2021-04-14T16:37:32  NODE_FAIL      1:0          1 
              1055978_12.batch 2021-04-14T16:35:03 2021-04-14T16:37:36     FAILED      0:0          1 
             1055978_12.extern 2021-04-14T16:35:03 2021-04-14T16:42:46  CANCELLED                   1 

$ sacct -j 1351015 -o"jobid%30,start,end,state,ExitCode,AllocCPUs"
                         JobID               Start                 End      State ExitCode  AllocCPUS 
------------------------------ ------------------- ------------------- ---------- -------- ---------- 
                       1351015 2021-05-18T13:39:36 2021-05-18T13:42:07  NODE_FAIL      1:0          1 
                 1351015.batch 2021-05-18T13:39:36 2021-05-18T13:43:02     FAILED      0:0          1 
                1351015.extern 2021-05-18T13:39:36 2021-05-18T13:44:44  CANCELLED                   1 

How can we tell whether those jobs were cancelled by users/admins or due to node failures? Thanks.

Yang
Comment 2 Yang Liu 2021-05-24 14:18:03 MDT
One user had two jobs with the same status, but the user confirmed that these two jobs failed. The user actually didn't get any Slurm standard output file in the directory, but did get the Slurm notification email saying "node fail". The user always uses the --email option when submitting jobs, and that's the only information the user has about these two strangely terminated jobs.
Comment 5 Albert Gil 2021-05-26 09:59:33 MDT
Hi Yang,

> How can we tell those jobs were cancelled by users/admins or due to node
> failures?

First of all, let me note that when a *job* is terminated due to a NODE_FAIL, Slurm tries its best to cancel all the *steps* of that job.
So, in your case both *jobs* 1055978_12 and 1351015 were terminated due to NODE_FAIL, and some of their *steps* got cancelled.

If you use -X (--allocations) you'll see only the *job* information, not the *steps*.
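For example, running it on one of your jobs above should print only the 1351015 job line, without the .batch and .extern rows (same options as in your original command):

$ sacct -X -j 1351015 -o"jobid%30,start,end,state,ExitCode,AllocCPUs"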

Note that the information on the steps may be useful to the user; for example, imagine this job with 3 steps:

$ sbatch --wrap "srun sleep 3; srun sleep 5; srun sleep 300"                                                                                                                             
Submitted batch job 10558

After running for 3 seconds you see something like this:

$ sacct -j 10558 --format jobid,jobname,state
       JobID    JobName      State
------------ ---------- ----------
10558              wrap    RUNNING
10558.batch       batch    RUNNING
10558.extern     extern    RUNNING
10558.0           sleep  COMPLETED
10558.1           sleep    RUNNING

And some seconds later:

$ sacct -j 10558 --format jobid,jobname,state
       JobID    JobName      State
------------ ---------- ----------
10558              wrap    RUNNING
10558.batch       batch    RUNNING
10558.extern     extern    RUNNING
10558.0           sleep  COMPLETED
10558.1           sleep  COMPLETED
10558.2           sleep    RUNNING

If now the node fails, you will end up with something like:

$ sacct -j 10558 --format jobid,jobname,state -X
       JobID    JobName      State
------------ ---------- ----------
10558              wrap  NODE_FAIL

$ sacct -j 10558 --format jobid,jobname,state
       JobID    JobName      State
------------ ---------- ----------
10558              wrap  NODE_FAIL
10558.batch       batch  COMPLETED
10558.extern     extern  CANCELLED
10558.0           sleep  COMPLETED
10558.1           sleep  COMPLETED
10558.2           sleep  CANCELLED

Note that the user may be interested in knowing that steps .0 and .1 were able to complete.
Also note that the JobRequeue option controls whether those batch jobs should be cancelled or requeued on a NODE_FAIL.
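As a reference, here is a minimal sketch of where that behavior is usually controlled (assuming the standard slurm.conf parameter and sbatch options; job.sh is just a placeholder script, so please double-check against your version's man pages):

# slurm.conf: default requeue behavior for batch jobs (e.g. after a node failure)
JobRequeue=1

# Per-job overrides at submission time:
$ sbatch --requeue    job.sh   # allow this job to be requeued
$ sbatch --no-requeue job.sh   # never requeue; the job just ends in NODE_FAIL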

Finally, imagine this case:

$ sbatch --wrap "srun sleep 3; srun sleep 50; srun sleep 3"                                                                                                                              
Submitted batch job 10559

$ sacct -j 10559 --format jobid,jobname,state
       JobID    JobName      State 
------------ ---------- ---------- 
10559              wrap    RUNNING 
10559.batch       batch    RUNNING 
10559.extern     extern    RUNNING 
10559.0           sleep  COMPLETED 
10559.1           sleep    RUNNING 

$ scancel 10559.1

$ sacct -j 10559 --format jobid,jobname,state%20
       JobID    JobName                State 
------------ ---------- -------------------- 
10559              wrap              RUNNING 
10559.batch       batch              RUNNING 
10559.extern     extern              RUNNING 
10559.0           sleep            COMPLETED 
10559.1           sleep    CANCELLED by 1000 
10559.2           sleep              RUNNING 

And finally:

$ sacct -j 10559 --format jobid,jobname,state%20
       JobID    JobName                State 
------------ ---------- -------------------- 
10559              wrap            COMPLETED 
10559.batch       batch            COMPLETED 
10559.extern     extern            COMPLETED 
10559.0           sleep            COMPLETED 
10559.1           sleep    CANCELLED by 1000 
10559.2           sleep            COMPLETED 

Note that cancelling a step with scancel does not cancel the job, and that the state of a step (or job) cancelled by a user carries a "by UID" suffix to distinguish manual scancels by users from cancellations done by Slurm itself.
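As an aside, if you ever need to map that UID back to a username (1000 is just the UID from the example above), something like this works on most Linux systems:

$ getent passwd 1000 | cut -d: -f1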

Is this answering your question?

Regards,
Albert
Comment 6 Yang Liu 2021-06-01 07:21:06 MDT
Hi Albert,
Thank you for the clear and useful explanations. Our next question is how to get more details on why/how Slurm marks a node failure. Our system admins checked the logs and didn't find useful information on a node failure event. Meanwhile, other jobs ran on the same node and completed successfully while the node failure was being reported by Slurm for another job. For example, job 1 was reported as a node failure, while job 2 ran to completion on the same node as job 1. During the execution of job 2, the node failure was reported for job 1. How can we find out why a node failure was reported only for job 1 and not for the other jobs?

Thanks,
Yang
Comment 8 Albert Gil 2021-06-01 09:07:38 MDT
Hi Yang,

> Thank you for the clear and useful explanations.

Glad it helped.

> Our next question is how to
> get more details on why/how Slurm marks a node failure. Our system admins
> checked the logs and didn't find useful information on a node failure event.

The logs may provide useful information.
Could you attach the slurmctld and slurmd logs of the nodes for the periods when you got those node fails (e.g. job 1351015)?

> Meanwhile, other jobs ran on the same node and completed successfully while
> the node failure was being reported by Slurm for another job. For example,
> job 1 was reported as a node failure, while job 2 ran to completion on the
> same node as job 1. During the execution of job 2, the node failure was
> reported for job 1. How can we find out why a node failure was reported only
> for job 1 and not for the other jobs?

This seems strange.
I would need the slurmctld and slurmd logs for the period of those jobs 1 and 2 to investigate further, but first we can try a much simpler check to see if it explains the issue: can you run "sacct --format=JobId,State,Reason" on the failed job?
Maybe the Reason field gives us more info without digging further into the logs.

Regards,
Albert
Comment 9 Yang Liu 2021-06-04 06:41:44 MDT
Hi Albert,

Here are the logs for the two jobs that failed on 2021-05-27:

[yliu385@node947 yliu385]$ sacct -j 1407294 --format=JobId,State,Reason,NodeList,Start,End
       JobID      State                 Reason        NodeList               Start                 End 
------------ ---------- ---------------------- --------------- ------------------- ------------------- 
1407294       NODE_FAIL                   None        node1307 2021-05-27T11:32:45 2021-05-27T11:37:54 
1407294.bat+     FAILED                               node1307 2021-05-27T11:32:45 2021-05-27T11:40:29 
1407294.ext+  CANCELLED                               node1307 2021-05-27T11:32:45 2021-05-27T11:40:29 
[yliu385@node947 yliu385]$ sacct -j 1405353_32 --format=JobId,State,Reason,NodeList,Start,End
       JobID      State                 Reason        NodeList               Start                 End 
------------ ---------- ---------------------- --------------- ------------------- ------------------- 
1405353_32    NODE_FAIL                   None        node1328 2021-05-27T09:25:33 2021-05-27T09:27:52 
1405353_32.+     FAILED                               node1328 2021-05-27T09:25:33 2021-05-27T09:28:27 
1405353_32.+  CANCELLED                               node1328 2021-05-27T09:25:33 2021-05-27T09:28:27 


Slurmctld log for 2021-05-27 when the two jobs failed/ended:
-bash-4.2$ grep 2021-05-27 slurmctld 
[2021-05-27T11:38:04.248] error: _find_node_record(766): lookup failure for 1333
[2021-05-27T11:47:21.560] error: slurm_receive_msg [172.20.205.51:56346]: Zero Bytes were transmitted or received
[2021-05-27T13:57:42.235] error: _find_node_record(766): lookup failure for l
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES cpu grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES mem grp_used_tres_run_secs underflow, tried to remove 437551104 seconds when only 437518336 remained.
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES node grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS pri-jfranck TRES billing grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES cpu grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES mem grp_used_tres_run_secs underflow, tried to remove 437551104 seconds when only 437518336 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES node grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862: assoc 18783 TRES billing grp_used_tres_run_secs underflow, tried to remove 13353 seconds when only 13352 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1200 seconds when only 724 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES mem grp_used_tres_run_secs underflow, tried to remove 7680000 seconds when only 4633600 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES node grp_used_tres_run_secs underflow, tried to remove 300 seconds when only 181 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES billing grp_used_tres_run_secs underflow, tried to remove 1200 seconds when only 724 remained.
[2021-05-27T20:06:04.126] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 300 seconds when only 181 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES cpu grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES mem grp_used_tres_run_secs underflow, tried to remove 2201600 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES node grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES billing grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379690: assoc 36577 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES cpu grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES mem grp_used_tres_run_secs underflow, tried to remove 2201600 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES node grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES billing grp_used_tres_run_secs underflow, tried to remove 344 seconds when only 0 remained.
[2021-05-27T20:07:51.924] error: _handle_assoc_tres_run_secs: job 1379691: assoc 36577 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 86 seconds when only 0 remained.
-bash-4.2$ 

Slurmd log:
[yliu385@node1328 log]$ sudo cat slurm/slurmd|grep 2021-05-27
[2021-05-27T12:33:03.625] [1391743.batch] error: *** JOB 1391743 ON node1328 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***
[2021-05-27T12:33:03.626] [1391744.batch] error: *** JOB 1391744 ON node1328 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***

[yliu385@node1307 log]$ sudo cat slurm/slurmd|grep 2021-05-27
[2021-05-27T02:59:36.600] [1403501.batch] error: *** JOB 1403501 ON node1307 CANCELLED AT 2021-05-27T02:59:36 DUE TO TIME LIMIT ***
[2021-05-27T12:33:03.656] [1391745.batch] error: *** JOB 1391745 ON node1307 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***
[2021-05-27T12:33:03.660] [1391746.batch] error: *** JOB 1391746 ON node1307 CANCELLED AT 2021-05-27T12:33:03 DUE TO TIME LIMIT ***
[2021-05-27T15:40:15.761] [1408962.batch] error: Detected 1 oom-kill event(s) in step 1408962.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-05-27T15:41:58.186] [1409110.batch] error: Detected 1 oom-kill event(s) in step 1409110.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-05-27T19:22:10.863] [1403667.batch] error: *** JOB 1403667 ON node1307 CANCELLED AT 2021-05-27T19:22:10 ***


Yang
Comment 10 Albert Gil 2021-06-11 08:33:30 MDT
Hi Yang,

> Here are the logs for the two jobs failed on 2021-05-27:

Thanks. Although I'm investigating one lead that your logs provided, I still need more information from them.
I assume that you are using SlurmctldDebug=error and SlurmdDebug=error, right?
Could you increase the debug level to:

DebugFlags=Agent,Network
SlurmctldDebug=debug
SlurmdDebug=debug

Logs will be more verbose with this level.
If you reproduce the issue while this debug level is enabled, please attach the new logs and restore your previous level.

> Slurmctld log for 2021-05-27 when the two jobs failed/ended:
> -bash-4.2$ grep 2021-05-27 slurmctld 
> [2021-05-27T11:38:04.248] error: _find_node_record(766): lookup failure for
> 1333
> [2021-05-27T11:47:21.560] error: slurm_receive_msg [172.20.205.51:56346]:
> Zero Bytes were transmitted or received
> [2021-05-27T13:57:42.235] error: _find_node_record(766): lookup failure for l

These lines above are the main lead I'm investigating.
They could explain the NODE_FAIL, but I need to look into it more closely.

> [2021-05-27T18:23:51.617] error: _handle_qos_tres_run_secs: job 1405862: QOS
> pri-jfranck TRES billing grp_used_tres_run_secs underflow, tried to remove
> 13353 seconds when only 13352 remained.
> [2021-05-27T18:23:51.617] error: _handle_assoc_tres_run_secs: job 1405862:
> assoc 18783 TRES cpu grp_used_tres_run_secs underflow, tried to remove 13353
> seconds when only 13352 remained.

These errors seem unrelated to this issue; they are most probably related to bug 7375.

Regards,
Albert
Comment 11 Albert Gil 2021-06-11 08:41:22 MDT
Yang,

Could you also attach your slurm.conf?

Thanks,
Albert
Comment 14 Yang Liu 2021-06-22 08:05:42 MDT
Created attachment 20049 [details]
slurm.conf

Hi Albert,
The slurm.conf is attached. 

What are the commands to change the log levels, other than modifying slurm.conf directly? For example, would 'scontrol setDebugLevel debug', 'scontrol setDebugFlag +agent', and 'scontrol setDebugFlag +network' work? Is there any command for "SlurmdDebug=debug"?

DebugFlags=Agent,Network
SlurmctldDebug=debug
SlurmdDebug=debug

Yang
Comment 15 Albert Gil 2021-06-22 10:39:56 MDT
Hi Yang,

> The slurm.conf is attached. 

Thanks!

> What are the commands to change the configurations for the log levels, other
> than modifying slurm.conf directly? For example, does 'scontrol
> setDebugLevel debug', 'scontrol setDebugFlag +agent', 'scontrol setDebugFlag
> +network'.

Yes, those commands are correct.
You can see the current levels and flags with a command like this:

$ scontrol show config | grep Debug

> Any command for "SlurmdDebug=debug"?

Unfortunately we don't have such a command.
You need to update slurm.conf and run "scontrol reconfigure".
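Putting the two approaches together, a rough sketch (illustrative only; the exact steps depend on your setup):

# Runtime-only changes for slurmctld (lost on reconfigure/restart):
$ scontrol setdebuglevel debug
$ scontrol setdebugflags +Agent
$ scontrol setdebugflags +Network

# Persistent changes, and the only way to change SlurmdDebug:
# add the lines below to slurm.conf ...
#   DebugFlags=Agent,Network
#   SlurmctldDebug=debug
#   SlurmdDebug=debug
# ... then apply them with:
$ scontrol reconfigure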

Please note that any changes done with "scontrol setDebugLevel/setDebugFlag" will be lost if you reconfigure or restart slurmctld.

> DebugFlags=Agent,Network
> SlurmctldDebug=debug
> SlurmdDebug=debug

If you can reproduce the issue with these log levels, please attach the logs.
That should help us.

Thanks,
Albert
Comment 16 Albert Gil 2021-06-30 07:24:16 MDT
Hi Yang,

> If you can reproduce the issue with these log levels, please attach the logs.
> That should help us.

Have you been able to reproduce the issue with the higher log levels?

Regards,
Albert
Comment 17 Albert Gil 2021-07-09 08:35:02 MDT
Hi Yang,

If this is OK for you, I'm closing this ticket as infogiven based on comment 5.
If you still want to investigate further the case you described in comment 6, please don't hesitate to reopen this bug.

Regards,
Albert
Comment 18 Yang Liu 2021-07-13 06:16:56 MDT
Hi Albert,
Jobs are still failing in our cluster. For example:
$ sacct -j 1740055_67  --format="start,end,timelimit,state%30,reason,jobid,jobidraw"
              Start                 End  Timelimit                          State                 Reason        JobID     JobIDRaw 
------------------- ------------------- ---------- ------------------------------ ---------------------- ------------ ------------ 
2021-07-12T08:43:24 2021-07-12T09:23:11   03:00:00                      NODE_FAIL                   None 1740055_67   1740123      
2021-07-12T08:43:24 2021-07-12T09:35:44                                    FAILED                        1740055_67.+ 1740123.bat+ 
2021-07-12T08:43:24 2021-07-12T09:58:21                                 CANCELLED                        1740055_67.+ 1740123.ext+ 

===============slurmctld log==================
$ grep  1740055_67 ~/slurmctld.2021-07-12
[2021-07-12T08:43:24.972] sched: Allocate JobId=1740055_67(1740123) NodeList=node1311 #CPUs=1 Partition=batch
[2021-07-12T08:43:32.747] prolog_running_decr: Configuration for JobId=1740055_67(1740123) is complete
[2021-07-12T08:43:32.747] Extending JobId=1740055_67(1740123) time limit by 8 secs for configuration
[2021-07-12T09:23:11.183] Batch JobId=1740055_67(1740123) missing from batch node node1311 (not found BatchStartTime after startup)
[2021-07-12T09:23:11.183] _job_complete: JobId=1740055_67(1740123) WTERMSIG 126
[2021-07-12T09:23:11.183] _job_complete: JobId=1740055_67(1740123) cancelled by node failure
[2021-07-12T09:23:11.183] _job_complete: JobId=1740055_67(1740123) done
[2021-07-12T09:26:17.325] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:29:27.948] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:32:37.369] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:35:47.399] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:38:57.531] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:42:07.926] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:45:17.945] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:48:27.277] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:51:32.983] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:54:42.496] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:57:44.810] Resending TERMINATE_JOB request JobId=1740055_67(1740123) Nodelist=node1311
[2021-07-12T09:58:21.485] cleanup_completing: JobId=1740055_67(1740123) completion process took 2110 seconds

===============slurmd log==================
[yliu385@node1311 log]$ head -n 20 ~/slurmd-node1311-2021-07-12 
[2021-07-12T06:47:40.826] [1739857.0] error: Detected 1 oom-kill event(s) in step 1739857.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-12T06:47:40.842] [1739857.batch] error: Detected 1 oom-kill event(s) in step 1739857.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-12T06:47:40.864] [1739857.extern] error: Detected 1 oom-kill event(s) in step 1739857.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2021-07-12T10:39:04.473] debug:  Log file re-opened
[2021-07-12T10:39:04.705] debug:  CPUs:24 Boards:1 Sockets:2 CoresPerSocket:12 ThreadsPerCore:1
[2021-07-12T10:39:04.705] Message aggregation disabled
[2021-07-12T10:39:04.705] debug:  Reading the topology.conf file
[2021-07-12T10:39:04.708] error: _find_node_record(766): lookup failure for gpu2101
[2021-07-12T10:39:04.708] error: _find_node_record(778): lookup failure for gpu2101 alias gpu2101


======dmesg -T==============
[Mon Jul 12 06:47:27 2021] Memory cgroup stats for /slurm/uid_140272125/job_1739857/step_extern: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Jul 12 06:47:27 2021] Memory cgroup stats for /slurm/uid_140272125/job_1739857/step_batch: cache:0KB rss:3876KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:3876KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Jul 12 06:47:27 2021] Memory cgroup stats for /slurm/uid_140272125/job_1739857/step_0: cache:0KB rss:10236124KB rss_huge:235520KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:10236124KB inactive_file:0KB active_file:0KB unevictable:0KB
[Mon Jul 12 06:47:27 2021] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Mon Jul 12 06:47:27 2021] [185088] 140272125 185088     2394      364      10        0             0 slurm_script
[Mon Jul 12 06:47:27 2021] [185399] 140272125 185399    62093     1287      33        0             0 srun
[Mon Jul 12 06:47:27 2021] [185400] 140272125 185400    10871      448      25        0             0 srun
[Mon Jul 12 06:47:27 2021] [185415] 140272125 185415  2681405  2559233    5097        0             0 htseq-count
[Mon Jul 12 06:47:27 2021] Memory cgroup out of memory: Kill process 185415 (htseq-count) score 1001 or sacrifice child
[Mon Jul 12 06:47:27 2021] Killed process 185415 (htseq-count), UID 140272125, total-vm:10725620kB, anon-rss:10235892kB, file-rss:4kB, shmem-rss:1036kB

Thanks,
Yang
Comment 19 Yang Liu 2021-07-15 12:03:51 MDT
What does the 'credential for job 1756506 revoked' message in the slurmd log mean?

Yang

$ sacct -j 1756506 --format='jobid,state,reason,start,end,timelimit,MaxRSS,ReqMem,nodeList'
       JobID      State                 Reason               Start                 End  Timelimit     MaxRSS     ReqMem        NodeList 
------------ ---------- ---------------------- ------------------- ------------------- ---------- ---------- ---------- --------------- 
1756506       NODE_FAIL                   None 2021-07-14T09:48:06 2021-07-14T09:50:58 2-12:00:00                  16Gn        node1306 
1756506.bat+     FAILED                        2021-07-14T09:48:06 2021-07-14T09:55:58                             16Gn        node1306 
1756506.ext+  CANCELLED                        2021-07-14T09:48:06 2021-07-14T09:55:58                             16Gn        node1306 
[yliu385@node950 yliu385]$ 


============slurmctld log=========================
$ grep 1756506 ~/slurmctld.2021-07-14
[2021-07-14T09:48:04.611] _slurm_rpc_submit_batch_job: JobId=1756506 InitPrio=47834 usec=1295
[2021-07-14T09:48:06.525] sched: Allocate JobId=1756506 NodeList=node1306 #CPUs=1 Partition=batch
[2021-07-14T09:48:07.424] prolog_running_decr: Configuration for JobId=1756506 is complete
[2021-07-14T09:48:07.424] Extending JobId=1756506 time limit by 1 secs for configuration
[2021-07-14T09:50:58.970] Batch JobId=1756506 missing from batch node node1306 (not found BatchStartTime after startup)
[2021-07-14T09:50:58.970] _job_complete: JobId=1756506 WTERMSIG 126
[2021-07-14T09:50:58.970] _job_complete: JobId=1756506 cancelled by node failure
[2021-07-14T09:50:58.970] _job_complete: JobId=1756506 done
[2021-07-14T09:54:00.491] Resending TERMINATE_JOB request JobId=1756506 Nodelist=node1306
[2021-07-14T09:55:58.878] slurmd error running JobId=1756506 from node(s)=node1306: Job credential revoked
[2021-07-14T09:55:58.884] cleanup_completing: JobId=1756506 completion process took 300 seconds
[2021-07-14T14:17:54.948] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1756506 uid 140114611
[2021-07-14T14:17:54.949] job_str_signal(3): invalid JobId=1756506
[2021-07-14T14:17:54.949] _slurm_rpc_kill_job: job_str_signal() JobId=1756506 sig 9 returned Invalid job id specified
[2021-07-14T14:18:06.592] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=1756506 uid 140114611
[2021-07-14T14:18:06.592] job_str_signal(3): invalid JobId=1756506
[2021-07-14T14:18:06.592] _slurm_rpc_kill_job: job_str_signal() JobId=1756506 sig 9 returned Invalid job id specified


============slurmd log========================
$ grep 1756506 ~/slurmd-node1306.2021-07-14 
[2021-07-14T09:48:07.475] debug:  [job 1756506] attempting to run prolog [/usr/local/etc/slurm/prolog]
[2021-07-14T09:48:07.477] debug:  Waiting for job 1756506's prolog to complete
[2021-07-14T09:50:58.971] debug:  credential for job 1756506 revoked
[2021-07-14T09:50:58.971] debug:  _rpc_terminate_job: sent SUCCESS for 1756506, waiting for prolog to finish
[2021-07-14T09:50:58.971] debug:  Waiting for job 1756506's prolog to complete
[2021-07-14T09:55:58.877] _run_prolog: prolog with lock for job 1756506 ran for 471 seconds
[2021-07-14T09:55:58.877] prolog for job 1756506 ran for 471 seconds
[2021-07-14T09:55:58.877] Job 1756506 already killed, do not launch extern step
[2021-07-14T09:55:58.877] debug:  Finished wait for job 1756506's prolog to complete
[2021-07-14T09:55:58.877] debug:  Finished wait for job 1756506's prolog to complete
[2021-07-14T09:55:58.878] Job 1756506 already killed, do not launch batch job
[2021-07-14T09:55:58.878] debug:  Waiting for job 1756506's prolog to complete
[2021-07-14T09:55:58.878] debug:  Finished wait for job 1756506's prolog to complete
[2021-07-14T09:55:58.878] debug:  [job 1756506] attempting to run epilog [/usr/local/etc/slurm/epilog]
[2021-07-14T09:55:58.883] debug:  completed epilog for jobid 1756506
[2021-07-14T09:55:58.883] debug:  Job 1756506: sent epilog complete msg: rc = 0
Comment 20 Albert Gil 2021-07-16 01:10:40 MDT
Hi Yang,

> Jobs still failed in our cluster. For example:
>
> [2021-07-12T10:39:04.708] error: _find_node_record(766): lookup failure for
> gpu2101
> [2021-07-12T10:39:04.708] error: _find_node_record(778): lookup failure for
> gpu2101 alias gpu2101

Yes, this "lookup failure" seems the main reason why Slurm is marking the node as failing.
Could you attach the full log of slurmd on node1311 and slurmctld?
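If you want to check how frequent these lookup failures are before attaching the logs, a quick grep over the controller log should surface them (just a sketch; adjust the path to wherever your slurmctld log actually lives):

$ grep "_find_node_record" /var/log/slurm/slurmctld | tail -n 20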

Were you able to set up DebugFlags=Agent,Network as mentioned in comment 10?

> ======dmesg -T==============

Yes, if you include the syslogs of node1311 for 2021-07-12, that would help too.

> What does the 'credential for job 1756506 revoked' in the slurmd log mean?

For every submission Slurm verifies that it can be converted and launched as a job in the system, and then it issues job credentials that allow the job on a node to send messages to other nodes, daemons, etc.
There are several cases where Slurm will revoke those credentials (and they are also designed to expire), after which they are no longer valid.

I would still need to see the complete logs, but based on what you've already attached, I think you have some kind of network or time-synchronization issue in the system (or on some nodes).
Can you verify the time synchronization in your system?


Regards,
Albert
Comment 21 Yang Liu 2021-07-25 06:13:49 MDT
Created attachment 20528 [details]
Slurmctld Log for 2021-07-23
Comment 22 Yang Liu 2021-07-25 06:14:49 MDT
Created attachment 20529 [details]
Slurmd Log for node1303 on 2021-07-23
Comment 23 Yang Liu 2021-07-25 06:16:15 MDT
Created attachment 20530 [details]
dmesg Log for node1303 on 2021-07-23
Comment 24 Yang Liu 2021-07-25 06:19:46 MDT
Hi Albert,
We have changed the "DebugFlags" to "Network,Agent":
$ scontrol show config|grep -i debug
DebugFlags              = Agent,Network
SlurmctldDebug          = debug
SlurmctldSyslogDebug    = unknown
SlurmdDebug             = debug
SlurmdSyslogDebug       = unknown

After the change, another job failed:
$ sacct -j 1832609_15 --format='jobid,state,reason,start,end,timelimit,MaxRSS,ReqMem,nodeList,jobIdRaw,partition'
       JobID      State                 Reason               Start                 End  Timelimit     MaxRSS     ReqMem        NodeList     JobIDRaw  Partition 
------------ ---------- ---------------------- ------------------- ------------------- ---------- ---------- ---------- --------------- ------------ ---------- 
1832609_15    NODE_FAIL                   None 2021-07-23T18:52:41 2021-07-23T18:59:19   08:00:00                   4Gn        node1303 1832624           batch 
1832609_15.+     FAILED                        2021-07-23T18:52:41 2021-07-23T19:04:32                              4Gn        node1303 1832624.bat+            
1832609_15.+  CANCELLED                        2021-07-23T18:52:41 2021-07-23T19:07:30                              4Gn        node1303 1832624.ext+            

I am not sure how to check time synchronization. Below is what I tried:
[yliu385@node1303 ~]$ ntpstat 
synchronised to NTP server (172.20.0.6) at stratum 3
   time correct to within 5 ms
   polling server every 512 s
[yliu385@node1303 ~]$ echo $?
0
[yliu385@node1303 ~]$ ntpq -pn
::1: timed out, nothing received
***Request timed out
[yliu385@node1303 ~]$  timedatectl status
      Local time: Sun 2021-07-25 08:10:32 EDT
  Universal time: Sun 2021-07-25 12:10:32 UTC
        RTC time: Sun 2021-07-25 12:10:32
       Time zone: n/a (EDT, -0400)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2021-03-14 01:59:59 EST
                  Sun 2021-03-14 03:00:00 EDT
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2021-11-07 01:59:59 EDT
                  Sun 2021-11-07 01:00:00 EST

The slurmctld, slurmd, and dmesg logs are attached for node1303 on 2021-07-23.

Thanks,
Yang
Comment 27 Albert Gil 2021-07-28 04:47:16 MDT
Hi Yang,

> We have changed the "DebugFlags" to "Network,Agent":
> $ scontrol show config|grep -i debug
> DebugFlags              = Agent,Network
> SlurmctldDebug          = debug
> SlurmctldSyslogDebug    = unknown
> SlurmdDebug             = debug
> SlurmdSyslogDebug       = unknown
> 
> After the change, another job failed:
> $ sacct -j 1832609_15

Thanks for the logs.
Looking at them I can confirm that for this particular job 1832609_15 (1832624) there was a problem while trying to run the Prolog. For some reason it took 11 minutes to finish (actually, it seems that it took 11 minutes just to start the prolog), so Slurm decided that the node was facing issues, cancelled the job, and marked the node to drain.

In general this is expected behavior and I would suggest looking at what could be happening on the nodes and in the prolog, but in your case I'm investigating further because this ordering and gap in the logs seems wrong:

[2021-07-23T18:52:51.520] debug:  Waiting for job 1832624's prolog to complete
[2021-07-23T19:03:01.368] debug:  [job 1832624] attempting to run prolog [/usr/local/etc/slurm/prolog]

So, maybe Slurm is not starting the prolog properly and is incorrectly blaming the node.
Can you also attach your plugstack.conf and your Prolog=/usr/local/etc/slurm/prolog files?

> I am not sure how to check time synchronization.

Right now I'm discarding this line of investigation and focusing on the prolog one.


Regards,
Albert
Comment 28 Yang Liu 2021-07-28 06:48:24 MDT
Hi Albert,

It seems that we don't have PlugStack.conf:
$ scontrol show config|grep -i plugs
PlugStackConfig         = (null)
SlurmctldPlugstack      = (null)


$cat prolog 
#!/bin/bash
#echo "## SLURM PROLOG ###############################################################"
#echo "##    Job ID : $SLURM_JOB_ID"
#if [ -n "$SLURM_JOB_NAME" ]; then
#echo "##  Job Name : $SLURM_JOB_NAME"
#else
#echo "##  Job Name : [none]"
#fi
#echo "##  Nodelist : $SLURM_JOB_NODELIST"
#echo "##      CPUs : $SLURM_NPROCS"
#if [ -n "$SLURM_MEM_PER_CPU" ]; then
#echo "##   Mem/CPU : $SLURM_MEM_PER_CPU MB"
#fi
#if [ -n "$SLURM_MEM_PER_NODE" ]; then
#echo "##  Mem/Node : $SLURM_MEM_PER_NODE MB"
#fi
#echo "## Directory : $SLURM_SUBMIT_DIR"
#echo "##   Started : `date`"
#echo "###############################################################################"

/gpfs/runtime/sbin/scrub_tmp >/root/prolog.out 2>&1


## Create directory to catch $XDG_RUNTIME_DIR
/bin/mkdir -p /run/user/$SLURM_JOB_UID 
/bin/chown $SLURM_JOB_USER /run/user/$SLURM_JOB_UID
/bin/chmod 700 /run/user/$SLURM_JOB_UID


##SUPREMM job info collector
/usr/local/etc/slurm/supremm_prolog.pl

if [[ $SLURM_JOB_PARTITION == dbgap ]]
then
        /usr/local/etc/slurm/dbgap_prolog
fi


exit 0


Yang
Comment 29 Albert Gil 2021-07-28 07:58:22 MDT
Hi Yang,

> It seems that we don't have PlugStack.conf:
> $ scontrol show config|grep -i plugs
> PlugStackConfig         = (null)
> SlurmctldPlugstack      = (null)

Ok.


> $cat prolog 
>
> /gpfs/runtime/sbin/scrub_tmp >/root/prolog.out 2>&1

I assume that this does some basic storage setup for the job.

> ##SUPREMM job info collector
> /usr/local/etc/slurm/supremm_prolog.pl

I'm curious about this supremm_prolog.pl; could you explain a bit about what it does?

> if [[ $SLURM_JOB_PARTITION == dbgap ]]
> then
>         /usr/local/etc/slurm/dbgap_prolog
> fi

And could you explain this dbgap_prolog too?


Thanks,
Albert
Comment 30 Yang Liu 2021-07-28 12:40:19 MDT
Created attachment 20582 [details]
cleans out /tmp/, /var/tmp, and /dev/shm.
Comment 31 Yang Liu 2021-07-28 12:41:06 MDT
Created attachment 20583 [details]
sets up security for using protected dbGAP sequence data.
Comment 32 Yang Liu 2021-07-28 12:41:29 MDT
Created attachment 20584 [details]
sets up pcp data collection for use with XDMOD
Comment 33 Yang Liu 2021-07-28 12:42:27 MDT
Hi Albert, 
the three files are attached:

*scrub_tmp safely cleans out /tmp/, /var/tmp, and /dev/shm.

*dbgap_prolog sets up security for using protected dbGAP sequence data.

*supremm_prolog sets up pcp data collection for use with XDMOD.

Yang
Comment 36 Scott Hilton 2021-10-25 15:45:58 MDT
Hi Yang,

I found that the issue is related to how long your prologs take to run and the fact that they are running one at a time.

The logs show many jobs being started at the same time. Since your prolog takes about 75 seconds, some jobs were not able to finish (or even start) their prolog until after they had already timed out.
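If you want to keep an eye on this on your side, the slurmd messages already quoted in this ticket ("prolog for job N ran for X seconds") make slow runs easy to spot; a sketch, assuming the same slurmd log path you used earlier:

$ grep "prolog" /var/log/slurm/slurmd | grep "ran for"

Durations much longer than the ~75 seconds a single run needs usually mean prologs are queuing up behind each other.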

Are you still getting issues like this? How common is it? 

-Scott
Comment 41 Scott Hilton 2021-10-27 14:40:26 MDT
Yang,

Prologs should run in parallel but there is a bug that was causing them to run one at a time.

We found and fixed the bug in commits 4146467106 through 308d06f686. The fix will be released in 21.08.3.

This should solve the issue, so I am closing the ticket. If you have any questions, please mark this bug as "open" again.

-Scott