Ticket 11747

Summary: Prolog error
Product: Slurm Reporter: Torkil Svensgaard <torkil>
Component: slurmd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: cinek, rkv, tripiana
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8398
Site: DRCMR Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Where add attachment link is
Log from bigger7
Bigger7 slurm.conf
Slurm.conf from controller
Slurmctld.log
Bigger13 slurmd.log

Description Torkil Svensgaard 2021-06-02 01:02:58 MDT
Hi

"
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
HPC*         up   infinite      2  down* big22,small19
HPC*         up   infinite      1  drain bigger7
HPC*         up   infinite      1    mix bigger9
HPC*         up   infinite     29  alloc big[20-21,27-28],bigger[2-4,10-13],drakkisath,fenrir,gojira,ix1,rivendare,small[1-2,23-25,29-35],smaug
# bigger7/root ~ 
# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2021-05-27T09:25:35 big22
Not responding       slurm     2021-05-27T18:14:46 small19
Prolog error         slurm     2021-06-02T08:05:53 bigger7
"

How do I determine what the prolog error was? I can't find anything in /var/log/slurmd.log on bigger7.
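For reference, the filtering that eventually surfaces the cause looks like this; the log file below is a stand-in excerpt (the real file in this ticket is /var/log/slurmd.log on bigger7), and the sample error line is the message identified later in the ticket:

```shell
# Stand-in slurmd log excerpt; in practice you would grep the real
# /var/log/slurmd.log on the drained node.
cat > /tmp/slurmd_sample.log <<'EOF'
[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
[2021-06-02T08:05:53.100] error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
EOF
# Case-insensitive search for prolog-related messages.
grep -i 'prolog' /tmp/slurmd_sample.log
```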

Mvh.

Torkil
Comment 1 Torkil Svensgaard 2021-06-02 01:05:56 MDT
On the controller there's only this:

[2021-06-02T08:05:53.398] error: validate_node_specs: Prolog or job env setup failure on node bigger7, draining the node
[2021-06-02T08:05:53.398] drain_nodes: node bigger7 state set to DRAIN

Mvh.

Torkil
Comment 2 Marcin Stolarek 2021-06-02 04:15:57 MDT
Generally, the slurmd log should contain some information about the cause of the error.
What is SlurmdDebug set to? Did you confirm that slurmd logs to the /var/log/slurmd.log file?

Could you please share the log(slurmd side) with us?

cheers,
Marcin
Comment 3 Torkil Svensgaard 2021-06-03 02:31:45 MDT
Hi Marcin

How do I go about sending you guys a large log file?

From the conf file:

SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd.log

Mvh.

Torkil
Comment 4 Marcin Stolarek 2021-06-03 06:30:36 MDT
Created attachment 19775 [details]
Where add attachment link is

>How do I go about sending you guys a large log file?
It's usually fine to attach a compressed file to the case. (I use the "Add an attachment" link; screenshot attached.)

>SlurmdDebug=debug
>SlurmdLogFile=/var/log/slurmd.log
Looks good. Did you check whether new logs are appearing there?

cheers,
Marcin
Comment 5 Torkil Svensgaard 2021-06-04 00:35:28 MDT
Created attachment 19793 [details]
Log from bigger7
Comment 6 Torkil Svensgaard 2021-06-04 00:35:58 MDT
Hi Marcin

Log uploaded. Log is active, yes.

Mvh.

Torkil
Comment 7 Marcin Stolarek 2021-06-04 01:52:55 MDT
Torkil,

Looking at the log file I see that, starting from 2 June, you're getting step-level messages like:
>[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'

From the configuration you shared with us some time ago I see that the node may be running configless. Could you please check the content of /run/slurm/conf/slurm.conf on the node?
Was there any configuration change around that date?
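A minimal sketch of that check, using stand-in files; the node-side path is taken from the ticket (/run/slurm/conf/slurm.conf), while the controller-side path and the sample contents are assumptions for illustration:

```shell
# Stand-ins for the configless cache on the node and the controller's copy.
mkdir -p /tmp/cfg/node /tmp/cfg/ctld
printf 'SlurmdDebug=debug\nSlurmdLogFile=/var/log/slurmd.log\n' > /tmp/cfg/node/slurm.conf
printf 'SlurmdDebug=debug\nSlurmdLogFile=/var/log/slurmd.log\n' > /tmp/cfg/ctld/slurm.conf
# In practice: diff /run/slurm/conf/slurm.conf against the controller's slurm.conf.
if diff -q /tmp/cfg/node/slurm.conf /tmp/cfg/ctld/slurm.conf >/dev/null; then
  echo "configless cache matches controller"
fi
```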

cheers,
Marcin
Comment 8 Torkil Svensgaard 2021-06-04 02:33:58 MDT
Created attachment 19796 [details]
Bigger7 slurm.conf
Comment 9 Torkil Svensgaard 2021-06-04 02:34:16 MDT
Created attachment 19797 [details]
Slurm.conf from controller
Comment 10 Torkil Svensgaard 2021-06-04 02:38:45 MDT
(In reply to Marcin Stolarek from comment #7)

> Looking at the log file I see that, starting from 2 June, you're getting
> step-level messages like:
> >[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
> 
> From the configuration you shared with us some time ago I see that it may be
> running config less, could you please check the content of
> /run/slurm/conf/slurm.conf on the node? 

Seems to be up to date.

> Was there any configuration change around that date?

Yes, though not specifically on bigger7: we added GPUs to other nodes and thus had to change slurm.conf (and run reconfigure).

The cluster was fully loaded on that day, and the prolog file is sitting on NFS. Could that be a problem if it is getting hammered, even though it is cached in a lot of places? One other node also had this prolog error right after I created the ticket.

Mvh.

Torkil
Comment 11 Marcin Stolarek 2021-06-04 05:14:58 MDT
Torkil,

Looking at the sources I realized that the message:
>[2021-06-02T08:05:49.645] [215133.batch] debug level is 'error'
is just misleading and doesn't have any impact on batch-step logging. It logs the level used for stderr, which is only meaningful for srun-created steps, since those are connected to the stderr of slurmstepd; a batch step logs to a file only.

I've actually found the error message related to the node being drained; it's:
>error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

This error means that the request to run the batch step timed out waiting for the corresponding launch-prolog RPC to start processing on the slurmd. Unfortunately, we can't say more about the reason for that to happen. Did you notice any connection errors to the bigger7 node logged on the slurmctld side?
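A sketch of that check against a stand-in excerpt of slurmctld.log (the two sample lines are taken from comment 1; the grep pattern is an assumption about how such errors would be worded):

```shell
# Stand-in slurmctld log excerpt; in practice you would scan the real
# slurmctld.log on the controller.
cat > /tmp/slurmctld_sample.log <<'EOF'
[2021-06-02T08:05:53.398] error: validate_node_specs: Prolog or job env setup failure on node bigger7, draining the node
[2021-06-02T08:05:53.398] drain_nodes: node bigger7 state set to DRAIN
EOF
# Show only error-level lines mentioning the node in question.
grep -E 'error.*bigger7' /tmp/slurmctld_sample.log
```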

There was a very rare condition where this happened on versions prior to 20.02.6. I'm adding that bug as See Also, since you may find a few hypotheses there on when this can happen.

cheers,
Marcin
Comment 12 Torkil Svensgaard 2021-06-04 05:20:55 MDT
(In reply to Marcin Stolarek from comment #11)
 
> I've actually found the error message related to the node being drained,
> it's:
> >error: Waiting for JobId=215583 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

I just checked bigger13, which also had the same problem at around the same time; same thing:

"
[2021-06-02T13:53:48.689] error: Waiting for JobId=217868 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2021-06-02T13:53:48.689] error: Waiting for JobId=217871 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2021-06-02T13:53:48.699] error: Waiting for JobId=217870 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2021-06-02T13:53:48.774] error: Waiting for JobId=217872 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
"

Uploading full log from that node.
 
> This error means that the request to run batch step timed-out waiting for
> appropriate launch prolog RPC start processing by the slurmd. Unfortunately,
> we can't say more about the reason for that to happen. Did you notice any
> connection errors logged on slurmctld side to bigger7 node?

I didn't at the time. Uploading slurmctld.log.

> There was a very rare condition where this happened on versions prior to
> 20.02.6. I'm adding that bug as See Also, since you may find a few
> hypotheses there on when this can happen.

Thanks, 

Torkil
Comment 13 Torkil Svensgaard 2021-06-04 05:23:32 MDT
Created attachment 19808 [details]
Slurmctld.log
Comment 14 Torkil Svensgaard 2021-06-04 05:24:01 MDT
Created attachment 19809 [details]
Bigger13 slurmd.log
Comment 16 Marcin Stolarek 2021-06-08 03:27:15 MDT
Torkil,

Looking at the slurmctld logs I see that it was under high pressure at the time the issue occurred:
>[2021-06-02T08:05:53.756] sched: 176 pending RPCs at cycle end, consider configuring max_rpc_cnt
>[2021-06-02T13:53:48.326] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt

I'd recommend setting MessageTimeout=30; you could also consider setting SchedulerParameters=max_rpc_cnt=100.

Looking at your configuration, you're running with the default MessageTimeout=10. The prolog error:
> REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
shows 20s, which is 2xMessageTimeout. This value may be too low for a high-throughput environment and result in errors during bursts of RPCs.
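As a sketch, the recommended changes would look like this in slurm.conf; if SchedulerParameters already carries other options, max_rpc_cnt would be appended to the existing comma-separated list, and the change is applied with `scontrol reconfigure`:

```conf
# slurm.conf (controller; propagated to configless nodes on reconfigure)
MessageTimeout=30
SchedulerParameters=max_rpc_cnt=100
```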

cheers,
Marcin
Comment 17 Torkil Svensgaard 2021-06-08 06:30:39 MDT
(In reply to Marcin Stolarek from comment #16)
>  Torkil,

Hi Marcin

> Looking at the slurmctld logs I see that it was under high pressure at the
> time the issue occurred:
> >[2021-06-02T08:05:53.756] sched: 176 pending RPCs at cycle end, consider configuring max_rpc_cnt
> >[2021-06-02T13:53:48.326] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
> 
> I'd recommend setting MessageTimeout=30; you could also consider setting
> SchedulerParameters=max_rpc_cnt=100.

Thanks, I've changed the configuration. Feel free to close the ticket, I'll get back to you if it happens again.

Mvh.

Torkil