Ticket 11747

Summary: Prolog error
Product: Slurm Reporter: Torkil Svensgaard <torkil>
Component: slurmd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: cinek, rkv, tripiana
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8398
Site: DRCMR Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Where add attachment link is
Log from bigger7
Bigger7 slurm.conf
Slurm.conf from controller
Slurmctld.log
Bigger13 slurmd.log

Description Torkil Svensgaard 2021-06-02 01:02:58 MDT
Hi

"
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
HPC*         up   infinite      2  down* big22,small19
HPC*         up   infinite      1  drain bigger7
HPC*         up   infinite      1    mix bigger9
HPC*         up   infinite     29  alloc big[20-21,27-28],bigger[2-4,10-13],drakkisath,fenrir,gojira,ix1,rivendare,small[1-2,23-25,29-35],smaug
# bigger7/root ~ 
# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2021-05-27T09:25:35 big22
Not responding       slurm     2021-05-27T18:14:46 small19
Prolog error         slurm     2021-06-02T08:05:53 bigger7
"

How do I determine what the prolog error was? I can't find anything in /var/log/slurmd.log on bigger7.
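For reference, the filtering that eventually surfaces the cause looks like this; the log file below is a stand-in excerpt (the real file in this ticket is /var/log/slurmd.log on bigger7), and the sample error line is the message identified later in the ticket:

```shell
# Stand-in slurmd log excerpt; in practice you would grep the real
# /var/log/slurmd.log on the drained node.
cat > /tmp/slurmd_sample.log <<'EOF'
[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
[2021-06-02T08:05:53.100] error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
EOF
# Case-insensitive search for prolog-related messages.
grep -i 'prolog' /tmp/slurmd_sample.log
```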

Mvh.

Torkil
Comment 1 Torkil Svensgaard 2021-06-02 01:05:56 MDT
On the controller there's only this:

[2021-06-02T08:05:53.398] error: validate_node_specs: Prolog or job env setup failure on node bigger7, draining the node
[2021-06-02T08:05:53.398] drain_nodes: node bigger7 state set to DRAIN

Mvh.

Torkil
Comment 2 Marcin Stolarek 2021-06-02 04:15:57 MDT
Generally, the slurmd log should contain some information about the cause of the error.
What is SlurmdDebug set to? Did you confirm that slurmd logs to the /var/log/slurmd.log file?

Could you please share the log(slurmd side) with us?

cheers,
Marcin
Comment 3 Torkil Svensgaard 2021-06-03 02:31:45 MDT
Hi Marcin

How do I go about sending you guys a large log file?

From the conf file:

SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log

Mvh.

Torkil
Comment 4 Marcin Stolarek 2021-06-03 06:30:36 MDT
Created attachment 19775 [details]
Where add attachment link is

>How do I go about sending you guys a large log file?
It's usually fine to attach a compressed file to the case. (I use the "Add an attachment" link; screenshot attached.)

>SlurmdDebug=debug
>SlurmdLogFile=/var/log/slurmd.log
Looks good. Did you check whether new logs are appearing there?

cheers,
Marcin
Comment 5 Torkil Svensgaard 2021-06-04 00:35:28 MDT
Created attachment 19793 [details]
Log from bigger7
Comment 6 Torkil Svensgaard 2021-06-04 00:35:58 MDT
Hi Marcin

Log uploaded. Log is active, yes.

Mvh.

Torkil
Comment 7 Marcin Stolarek 2021-06-04 01:52:55 MDT
Torkil,

Looking at the log file I see that, starting from 2 June, you're getting step-level messages like:
>[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'

From the configuration you shared with us some time ago I see that the node may be running configless. Could you please check the content of /run/slurm/conf/slurm.conf on the node?
Was there any configuration change around that date?
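A minimal sketch of that check, using stand-in files; the node-side path is taken from the ticket (/run/slurm/conf/slurm.conf), while the controller-side path and the sample contents are assumptions for illustration:

```shell
# Stand-ins for the configless cache on the node and the controller's copy.
mkdir -p /tmp/cfg/node /tmp/cfg/ctld
printf 'SlurmdDebug=debug\nSlurmdLogFile=/var/log/slurmd.log\n' > /tmp/cfg/node/slurm.conf
printf 'SlurmdDebug=debug\nSlurmdLogFile=/var/log/slurmd.log\n' > /tmp/cfg/ctld/slurm.conf
# In practice: diff /run/slurm/conf/slurm.conf against the controller's slurm.conf.
if diff -q /tmp/cfg/node/slurm.conf /tmp/cfg/ctld/slurm.conf >/dev/null; then
  echo "configless cache matches controller"
fi
```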

cheers,
Marcin
Comment 8 Torkil Svensgaard 2021-06-04 02:33:58 MDT
Created attachment 19796 [details]
Bigger7 slurm.conf
Comment 9 Torkil Svensgaard 2021-06-04 02:34:16 MDT
Created attachment 19797 [details]
Slurm.conf from controller
Comment 10 Torkil Svensgaard 2021-06-04 02:38:45 MDT
(In reply to Marcin Stolarek from comment #7)

> Looking at the log file I see that, starting from 2 June, you're getting
> step-level messages like:
> >[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
> 
> From the configuration you shared with us some time ago I see that it may be
> running config less, could you please check the content of
> /run/slurm/conf/slurm.conf on the node? 

Seems to be up to date.

> Was there any configuration change around that date?

Yes, though not specifically on bigger7: we added GPUs to other nodes and thus had to change slurm.conf (and run reconfigure).

The cluster was fully loaded on that day, and the prolog file is sitting on NFS. Could that be a problem if it is getting hammered, even though it is cached in a lot of places? One other node also had this prolog error right after I created the ticket.

Mvh.

Torkil
Comment 11 Marcin Stolarek 2021-06-04 05:14:58 MDT
Torkil,

Looking at the sources I realized that the message:
>[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
is just misleading and doesn't have any impact on batch-step logging. It logs the level used for stderr, which is only meaningful for srun-created steps, since those are connected to the stderr of slurmstepd; a batch step logs to a file only.

I've actually found the error message related to the node being drained; it's:
>error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

This error means that the request to run the batch step timed out waiting for the corresponding launch-prolog RPC to start processing on the slurmd. Unfortunately, we can't say more about the reason for that to happen. Did you notice any connection errors to the bigger7 node logged on the slurmctld side?
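A sketch of that check against a stand-in excerpt of slurmctld.log (the two sample lines are taken from comment 1; the grep pattern is an assumption about how such errors would be worded):

```shell
# Stand-in slurmctld log excerpt; in practice you would scan the real
# slurmctld.log on the controller.
cat > /tmp/slurmctld_sample.log <<'EOF'
[2021-06-02T08:05:53.398] error: validate_node_specs: Prolog or job env setup failure on node bigger7, draining the node
[2021-06-02T08:05:53.398] drain_nodes: node bigger7 state set to DRAIN
EOF
# Show only error-level lines mentioning the node in question.
grep -E 'error.*bigger7' /tmp/slurmctld_sample.log
```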

There was a very rare condition where this happened on versions prior to 20.02.6. I'm adding that bug as See Also, since you may find a few hypotheses there on when this can happen.

cheers,
Marcin
Comment 12 Torkil Svensgaard 2021-06-04 05:20:55 MDT
(In reply to Marcin Stolarek from comment #11)
 
> I've actually found the error message related to the node being drained,
> it's:
> >error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

I just checked bigger13, which also had the same problem at around the same time; same thing:

"
[2021-06-02T13:53:48.689] error: Waiting for JobId=217868 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2021-06-02T13:53:48.689] error: Waiting for JobId=217871 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2021-06-02T13:53:48.699] error: Waiting for JobId=217870 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2021-06-02T13:53:48.774] error: Waiting for JobId=217872 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
"

Uploading full log from that node.
 
> This error means that the request to run batch step timed-out waiting for
> appropriate launch prolog RPC start processing by the slurmd. Unfortunately,
> we can't say more about the reason for that to happen. Did you notice any
> connection errors logged on slurmctld side to bigger7 node?

I didn't at the time. Uploading slurmctld.log.

> There was a very rare condition where this happened on versions prior to
> 20.02.6. I'm adding that bug as See Also, since you may find a few
> hypotheses there on when this can happen.

Thanks, 

Torkil
Comment 13 Torkil Svensgaard 2021-06-04 05:23:32 MDT
Created attachment 19808 [details]
Slurmctld.log
Comment 14 Torkil Svensgaard 2021-06-04 05:24:01 MDT
Created attachment 19809 [details]
Bigger13 slurmd.log
Comment 16 Marcin Stolarek 2021-06-08 03:27:15 MDT
Torkil,

Looking at the slurmctld logs I see that it was under high pressure at the time the issue occurred:
>[2021-06-02T08:05:53.756] sched: 176 pending RPCs at cycle end, consider configuring max_rpc_cnt
>[2021-06-02T13:53:48.326] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt

I'd recommend setting MessageTimeout=30; you could also consider setting SchedulerParameters=max_rpc_cnt=100.

Looking at your configuration, you're running with the default MessageTimeout=10. The prolog error:
> REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
shows 20s, which is 2xMessageTimeout. This value may be too low for a high-throughput environment and result in errors during bursts of RPCs.
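As a sketch, the recommended changes would look like this in slurm.conf; if SchedulerParameters already carries other options, max_rpc_cnt would be appended to the existing comma-separated list, and the change is applied with `scontrol reconfigure`:

```conf
# slurm.conf (controller; propagated to configless nodes on reconfigure)
MessageTimeout=30
SchedulerParameters=max_rpc_cnt=100
```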

cheers,
Marcin
Comment 17 Torkil Svensgaard 2021-06-08 06:30:39 MDT
(In reply to Marcin Stolarek from comment #16)
>  Torkil,

Hi Marcin

> Looking at the slurmctld logs I see that it was under high pressure at the
> time the issue occurred:
> >[2021-06-02T08:05:53.756] sched: 176 pending RPCs at cycle end, consider configuring max_rpc_cnt
> >[2021-06-02T13:53:48.326] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
> 
> I'd recommend setting MessageTimeout=30; you could also consider setting
> SchedulerParameters=max_rpc_cnt=100.

Thanks, I've changed the configuration. Feel free to close the ticket, I'll get back to you if it happens again.

Mvh.

Torkil