Hi Team,

We are not able to execute any jobs and are getting the below errors in slurmctld.log:
> error: slurm_receive_msgs: Zero Bytes were transmitted or received
> error: slurm_auth_get_host: Lookup failed for 0.0.0.0

Please let us know what caused these errors.

Regards,
Debajit Dutta
Please attach slurm.conf and related files from the cluster. When did these errors start? What has changed recently in the cluster?
Created attachment 29211 [details] Slurm configuration file from the slurm controller
Hi Nate,

We have attached the slurm.conf file. We are getting these from today itself. Also, there was no change in the slurm.conf file.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #3)
> We are getting these from today itself.
> Also, there was no change in the slurm.conf file.

Generally, when we see these errors:
> error: slurm_receive_msgs: Zero Bytes were transmitted or received
> error: slurm_auth_get_host: Lookup failed for 0.0.0.0

it means there is an authentication error. What is the release version of the cluster? Please call:
> slurmctld -V
Hi Nate,

We are randomly facing this issue, we are not able to execute jobs. Can we have a call now on this?

Regards,
Debajit Dutta
Hi Nate,

Also, the output of slurmctld -V is below:
> slurm 20.11.8

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #6)
> We are randomly facing this issue, we are not able to execute jobs.

Are all jobs not able to execute or is this just a specific type of job?

Please provide the following output for one of the jobs that are not executing:
> sacct -o all -p -j $FAILED_JOB_ID

Please attach the slurmctld log and the slurmd log from one of the nodes where jobs are not executing. Please also attach logs from the job above.
Hi Nate,

We are randomly getting these errors, sometimes the jobs are getting dispatched but sometimes it is getting out with these errors. Also, in addition to the previous errors we are getting the following as well:
> srun: error: Unable to allocate resources: Socket timed out on send/recv operation

We will be adding the requested logs.

Regards,
Debajit Dutta
Please also provide the output of the following:
> sdiag
> scontrol ping
> scontrol show nodes
> scontrol show partitions
(In reply to Openfive Support from comment #9)
> We are randomly getting these errors, sometimes the jobs are getting
> dispatched but sometimes it is getting out with these errors.

So there are jobs running but some jobs are failing?

> srun: error: Unable to allocate resources: Socket timed out on send/recv
> operation

Please make sure to add '-vvvv' to the srun call as an argument to get more verbose logs of this.
Created attachment 29213 [details] sdiag from the slurm controller
Created attachment 29214 [details] scontrol show nodes from slurm controller
Created attachment 29215 [details] scontrol show partitions from slurm controller
(In reply to Nate Rini from comment #10)

Hi Nate,

> Please also provide the output of the following:
> > sdiag

We have attached the file for this.

> > scontrol ping

Below is the output from the slurm controller:
> [root@hpcmaster Documents]# scontrol ping
> Slurmctld(primary) at hpcmaster is UP
> Slurmctld(backup) at hpcslave is UP

> > scontrol show nodes

We have attached the file for this.

> > scontrol show partitions

We have attached the file for this.

Please check and let us know.

Regards,
Debajit Dutta
Reviewing the logs now
(In reply to Openfive Support from comment #13)
> Created attachment 29214 [details]
> scontrol show nodes from slurm controller

NodeName=slurm-dashboard needs to be upgraded to the current running version of 20.11.8. Please attach the slurmd logs from this node.
(In reply to Nate Rini from comment #8)
> (In reply to Openfive Support from comment #6)
> > We are randomly facing this issue, we are not able to execute jobs.
>
> Are all jobs not able to execute or is this just a specific type of job?
>
> Please provide the following output for one of the jobs that are not
> executing:
> > sacct -o all -p -j $FAILED_JOB_ID
>
> Please attach slurmctld log and slurmd log from one of the nodes where jobs
> are not executing. Please also attach logs from the job above.

Please also attach the logs requested in comment#8. A zip or tarball of the logs is generally preferred to avoid having to do multiple attachments.
(In reply to Nate Rini from comment #18)
> Please also attach the logs requested in comment#8. A zip or tarball of the
> logs is generally preferred to avoid having to do multiple attachments.

The last day of logs is sufficient. No need to attach all logs timestamped from before that.
I'm reducing ticket severity to SEV2. The logs from sdiag show jobs are running, and we require a site to respond proactively to a ticket to maintain SEV1 status. We take SEV1 tickets very seriously, and a lack of response to requested information causes wasted resources on our part.

Please provide the logs requested in comment#8. We currently lack sufficient data to diagnose the issue.
(In reply to Nate Rini from comment #8)

Hi Nate,

> > We are randomly facing this issue, we are not able to execute jobs.
>
> Are all jobs not able to execute or is this just a specific type of job?

A few jobs are running, but this is happening very intermittently; the jobs are not getting dispatched to any nodes and are exiting back to the prompt with these errors. For example, see the below job:
> [vishalkrishnat@osvnc002 ~]$ srun -p normal --pty /bin/tcsh
> srun: error: Unable to allocate resources: Socket timed out on send/recv operation

Here, we did not get any job ID and the job exited with the error.

> Please provide the following output for one of the jobs that are not
> executing:
> > sacct -o all -p -j $FAILED_JOB_ID

We are not getting job IDs; the jobs are simply exiting with errors.

> Please attach slurmctld log and slurmd log from one of the nodes where jobs
> are not executing. Please also attach logs from the job above.

We will attach the logs.

Regards,
Debajit Dutta
Created attachment 29219 [details] slurmctld.log from the slurm controller
Created attachment 29220 [details] slurm all nodes slurmd.log
(In reply to Openfive Support from comment #21)
> We will attach the logs.

While attaching logs, please verify the following:

1. munged is running on all nodes:
> systemctl status munge

2. munge is using the same key on all nodes:
> # sha1sum /etc/munge/munge.key
> xxxxxxxxxxdc3d8f1629e3dfef7a31 /etc/munge/munge.key

Please do not send us or share your munge key or the sha1sum output on this ticket. Verify they are all exactly the same.
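The per-node comparison can be scripted. A minimal sketch, assuming the checksums have already been collected (e.g. via ssh or Ansible) into `host checksum` pairs; the hostnames and checksum values below are illustrative placeholders, not real values:

```shell
# check_keys reads "host checksum" pairs on stdin and reports whether every
# checksum matches the first one seen. Feed it the output of
# "sha1sum /etc/munge/munge.key" gathered from each node.
check_keys() {
  ref=''
  while read -r host sum; do
    [ -z "$ref" ] && ref="$sum"
    if [ "$sum" != "$ref" ]; then
      echo "MISMATCH on $host"
      return 1
    fi
  done
  echo "all keys match"
}

# Illustrative input only; real checksums must never be pasted into this ticket.
printf '%s\n' 'hpcmaster aaa111' 'hpcslave aaa111' 'osvnc007 aaa111' | check_keys
```

With the illustrative input above this prints `all keys match`; a node with a different key is reported by name instead.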
Hi Nate,

Since this issue is coming from more than one node, and also occurs before the jobs get dispatched to any node, we have used Ansible to gather the logs from all the nodes. We have attached the same here.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #23)
> Created attachment 29220 [details]
> slurm all nodes slurmd.log
>
> error: Node configuration differs from hardware: CPUs=32:32(hw) Boards=1:1(hw) SocketsPerBoard=32:2(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=1:2(hw)

Any node dumping this error needs to be reconfigured. Please use 'slurmd -C' to get the detected configuration to correct slurm.conf.
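The detected values can be lifted straight from the `slurmd -C` output line. A small sketch using an assumed sample line (the real line must come from running `slurmd -C` on the affected node; the hostname and RealMemory value here are placeholders):

```shell
# Extract the topology fields from a "slurmd -C"-style NodeName line so they
# can be pasted into the node's slurm.conf entry. The sample line below is an
# assumption for illustration, not captured from this cluster.
line='NodeName=osvnc007 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000'

for field in CPUs Boards SocketsPerBoard CoresPerSocket ThreadsPerCore; do
  printf '%s ' "$(printf '%s\n' "$line" | grep -o "${field}=[0-9]*")"
done
echo
```

For the sample line this prints `CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2`, which matches the `(hw)` side of the error above.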
(In reply to Openfive Support from comment #22)
> Created attachment 29219 [details]
> slurmctld.log from the slurm controller
>
> error: slurm_auth_get_host: Lookup failed for 0.0.0.0

(In reply to Openfive Support from comment #23)
> Created attachment 29220 [details]
> slurm all nodes slurmd.log
>
> debug2: Error connecting slurm stream socket at 192.168.2.127:6817: Connection timed out

Please provide the output of the following from osvnc007, hpcmaster, and hpcslave:
> getent hosts osvnc007
> getent hosts hpcmaster
> getent hosts 192.168.2.127
> getent hosts hpcslave
> getent hosts 192.168.2.107
(In reply to Nate Rini from comment #27)

Hi Nate,

Below is the output:

From osvnc007:
> [root@osvnc007 ~]# getent hosts osvnc007
> 192.168.2.96    osvnc007.open-silicon.com osvnc007
> [root@osvnc007 ~]# getent hosts hpcmaster
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@osvnc007 ~]# getent hosts 192.168.2.127
> 192.168.2.127   osncmaster.open-silicon.com osncmaster
> [root@osvnc007 ~]# getent hosts hpcslave
> 192.168.2.107   hpcslave.open-silicon.com hpcslave
> [root@osvnc007 ~]# getent hosts 192.168.2.107
> 192.168.2.107   hpcslave.open-silicon.com hpcslave

From hpcmaster:
> [root@hpcmaster Documents]# getent hosts osvnc007
> 192.168.2.96    osvnc007.open-silicon.com osvnc007
> [root@hpcmaster Documents]# getent hosts hpcmaster
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@hpcmaster Documents]# getent hosts 192.168.2.127
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@hpcmaster Documents]# getent hosts hpcslave
> 192.168.2.107   hpcslave.open-silicon.com hpcslave
> [root@hpcmaster Documents]# getent hosts 192.168.2.107
> 192.168.2.107   hpcslave.open-silicon.com hpcslave

From hpcslave:
> [root@hpcslave ~]# getent hosts osvnc007
> 192.168.2.96    osvnc007.open-silicon.com osvnc007
> [root@hpcslave ~]# getent hosts hpcmaster
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@hpcslave ~]# getent hosts 192.168.2.127
> 192.168.2.127   osncmaster.open-silicon.com osncmaster
> [root@hpcslave ~]# getent hosts hpcslave
> 192.168.2.107   hpcslave.open-silicon.com hpcslave
> [root@hpcslave ~]# getent hosts 192.168.2.107
> 192.168.2.107   hpcslave.open-silicon.com hpcslave

Regards,
Debajit Dutta
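As an aside, the forward-and-reverse consistency being probed here can also be scripted so each host checks itself. A minimal sketch, not something requested in this ticket; `localhost` is used purely as a runnable example in place of the cluster hostnames:

```shell
# For each name given, resolve it to an address with getent, then
# reverse-resolve that address and print the chain, so a reverse record whose
# canonical name differs from the name it was reached by stands out at a glance.
check_dns() {
  for name in "$@"; do
    addr=$(getent hosts "$name" | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
      echo "$name: no forward record"
      continue
    fi
    back=$(getent hosts "$addr" | awk '{print $2; exit}')
    echo "$name -> $addr -> $back"
  done
}

check_dns localhost
```

On the cluster this would be invoked with the real hostnames, e.g. `check_dns osvnc007 hpcmaster hpcslave`, on each of the three hosts.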
Please provide the output of the following from osvnc007, hpcmaster, and hpcslave:
> getent ahostsv6 osvnc007
> getent ahostsv6 hpcmaster
> getent ahostsv6 192.168.2.127
> getent ahostsv6 hpcslave
> getent ahostsv6 192.168.2.107

Was the IPv6 configuration of the cluster changed recently?
Please also provide the output from osvnc007, hpcmaster, and hpcslave of this command:
> systemctl status munge
(In reply to Nate Rini from comment #30)

Hi Nate,

Below are the output details:

For osvnc007:
> [root@osvnc007 ~]# getent ahostsv6 osvnc007
> ::ffff:192.168.2.96 STREAM osvnc007
> ::ffff:192.168.2.96 DGRAM
> ::ffff:192.168.2.96 RAW
> [root@osvnc007 ~]# getent ahostsv6 hpcmaster
> ::ffff:192.168.2.127 STREAM hpcmaster
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@osvnc007 ~]# getent ahostsv6 192.168.2.127
> ::ffff:192.168.2.127 STREAM 192.168.2.127
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@osvnc007 ~]# getent ahostsv6 hpcslave
> ::ffff:192.168.2.107 STREAM hpcslave
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW
> [root@osvnc007 ~]# getent ahostsv6 192.168.2.107
> ::ffff:192.168.2.107 STREAM 192.168.2.107
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW

For hpcmaster:
> [root@hpcmaster Documents]# getent ahostsv6 osvnc007
> ::ffff:192.168.2.96 STREAM osvnc007
> ::ffff:192.168.2.96 DGRAM
> ::ffff:192.168.2.96 RAW
> [root@hpcmaster Documents]# getent ahostsv6 hpcmaster
> ::ffff:192.168.2.127 STREAM hpcmaster.open-silicon.com
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcmaster Documents]# getent ahostsv6 192.168.2.127
> ::ffff:192.168.2.127 STREAM 192.168.2.127
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcmaster Documents]# getent ahostsv6 hpcslave
> ::ffff:192.168.2.107 STREAM hpcslave.open-silicon.com
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW
> [root@hpcmaster Documents]# getent ahostsv6 192.168.2.107
> ::ffff:192.168.2.107 STREAM 192.168.2.107
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW

For hpcslave:
> [root@hpcslave ~]# getent ahostsv6 osvnc007
> ::ffff:192.168.2.96 STREAM osvnc007
> ::ffff:192.168.2.96 DGRAM
> ::ffff:192.168.2.96 RAW
> [root@hpcslave ~]# getent ahostsv6 hpcmaster
> ::ffff:192.168.2.127 STREAM hpcmaster
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcslave ~]# getent ahostsv6 192.168.2.127
> ::ffff:192.168.2.127 STREAM 192.168.2.127
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcslave ~]# getent ahostsv6 hpcslave
> ::ffff:192.168.2.107 STREAM hpcslave
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW
> [root@hpcslave ~]# getent ahostsv6 192.168.2.107
> ::ffff:192.168.2.107 STREAM 192.168.2.107
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW

> Was the IPv6 configuration of the cluster changed recently?

No.

Regards,
Debajit Dutta
(In reply to Nate Rini from comment #31)

Hi Nate,

Below are the output details:

> Please also provide the output from osvnc007, hpcmaster, and hpcslave of
> this command:
> > systemctl status munge

For osvnc007:
> [root@osvnc007 ~]# systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Sun 2022-07-31 11:00:06 IST; 7 months 7 days ago
>      Docs: man:munged(8)
>  Main PID: 855 (munged)
>     Tasks: 4
>    Memory: 892.0K
>    CGroup: /system.slice/munge.service
>            └─855 /usr/sbin/munged
>
> Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

For hpcmaster:
> [root@hpcmaster Documents]# systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Fri 2022-07-29 23:55:32 IST; 7 months 9 days ago
>      Docs: man:munged(8)
>   Process: 1375 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 1381 (munged)
>     Tasks: 4
>    CGroup: /system.slice/munge.service
>            └─1381 /usr/sbin/munged
>
> Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

For hpcslave:
> [root@hpcslave ~]# systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Wed 2022-12-07 15:54:51 IST; 3 months 0 days ago
>      Docs: man:munged(8)
>   Process: 1198 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 1220 (munged)
>     Tasks: 4
>    CGroup: /system.slice/munge.service
>            └─1220 /usr/sbin/munged
>
> Dec 07 15:54:51 hpcslave systemd[1]: Starting MUNGE authentication service...
> Dec 07 15:54:51 hpcslave systemd[1]: Started MUNGE authentication service.

Regards,
Debajit Dutta
(In reply to Nate Rini from comment #24)

Hi Nate,

> While attaching logs, please verify the following:
>
> 1. munged is running on all nodes
> > systemctl status munge

Yes, I have verified munge service is active and running in all slurm nodes.

> 2. munge is using the same key on all nodes
>
> Please do not send us or share your munge key or the sha1sum output on this
> ticket. Verify they are all exactly the same.

Yes, I have verified all slurm nodes have the exact same munge key.

Regards,
Debajit Dutta
Hi Nate,

Please review the information and logs I have sent. Also, I have replied to all questions, however, let me know in case I have missed any.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #34)
> Yes, I have verified munge service is active and running in all slurm nodes.

Are there any errors in the munge logs?

Please restart all munge daemons on all nodes. Once done, start slurmd in foreground mode on a node that is marked down:
> slurmd -Dvvvvvvv

Please attach the log from slurmd and then slurmctld while it is starting up.
(In reply to Openfive Support from comment #35)
> Also, I have replied to all questions, however, let me know in case I have
> missed any.

The error we are seeing in the logs is consistent with a munge authentication issue. Munge does not log verbosely, so we are going to have to take a few extra steps to determine the cause.

The cluster doesn't have a large number of nodes, but it is possible munge is getting overwhelmed. Setting an increased number of threads is suggested:
> https://slurm.schedmd.com/high_throughput.html#munge_config
(In reply to Nate Rini from comment #36)

Hi Nate,

> Are there any errors in the munge logs?
>
> Please restart all munge daemons on all nodes. Once done, start slurmd in
> foreground mode on a node that is marked down.

I didn't get what you meant by "marked down"? If this is about the node state, there are only two nodes which are marked as down, but those are not reachable remotely as of now.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #38)
> I didn't get what you meant by "marked down"? If this is about the node
> state, there are only two nodes which are marked as down, but those are not
> reachable remotely as of now.

Please provide the output of this command:
> scontrol show nodes
Created attachment 29245 [details] slurmd_all_nodes_details_09-03-2023
(In reply to Nate Rini from comment #39)

Hi Nate,

> Please provide the output of this command:
> > scontrol show nodes

I have attached the output of the above command. File name: slurmd_all_nodes_details_09-03-2023

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #38)
> I didn't get what you meant by "marked down"? If this is about the node
> state, there are only two nodes which are marked as down, but those are not
> reachable remotely as of now.

I was referring to this state:
> State=DOWN*

In this state, slurmctld is unable to talk to slurmd.

(In reply to Openfive Support from comment #40)
> Created attachment 29245 [details]
> slurmd_all_nodes_details_09-03-2023
>
> NodeName=osxon002 Arch=x86_64 CoresPerSocket=1
> State=IDLE

Since the 2 DOWN* nodes are expected to be in that state, can we try starting slurmd in foreground mode on this node since it is idle?
(In reply to Nate Rini from comment #37)
> The cluster doesn't have a large number of nodes but it is possible munge is
> getting overwhelmed. Setting an increased number of threads is suggested:
> > https://slurm.schedmd.com/high_throughput.html#munge_config

Please tell me once this config change is implemented for munge.
(In reply to Nate Rini from comment #43)

Hi Nate,

> Please tell me once this config change is implemented for munge.

So, should I first:

1. Set osxon002 as DOWN for our testing purpose
2. Implement this config: https://slurm.schedmd.com/high_throughput.html#munge_config
3. Restart the munge service on all nodes
4. Start slurmd in foreground mode on osxon002 as the node will be marked down:
> slurmd -Dvvvvvvv
5. Attach the log from slurmd and then slurmctld while it is starting up

Please let me know if the sequence in the above steps is correct.

Regards,
Debajit Dutta
> 1. Set osxon002 as DOWN for our testing purpose

I suggest setting it to drain to avoid killing a job that may start on it.

> Please let me know if the sequence in the above steps is correct.

Restarting munge could be done at any time. Please make sure to add the extra threads to munge before restarting it.
(In reply to Nate Rini from comment #45)
> I suggest setting it to drain to avoid killing a job that may start on it.

OK, I have set the osxon002 state to down.

> Restarting munge could be done at any time. Please make sure to add the
> extra threads to munge before restarting it.

How to check what the current number of threads munge is configured to? Also, do I need to add extra threads to munge only on the slurm master server or on all nodes?
(In reply to Openfive Support from comment #46)
> How to check what the current number of threads munge is configured to?

Munge doesn't have a configuration file, which means all options are passed as arguments at invocation. I used `systemctl status munge` to get this:
> Process: 1375 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)

It shows that munge is getting started without any arguments.

> Also, do I need to add extra threads to munge only on slurm master server or
> on all nodes?

Most likely the config change is only needed on the controllers, but it won't hurt to have it on all nodes.
Hi Nate,

While executing the command for munge, I am getting the below error:

> [root@osxon002 ~]# munged --num-threads 10
> munged: Error: Logfile is insecure: "/var/log/munge/munged.log" should be owned by UID 0
> [root@osxon002 ~]# ll /var/log/munge/munged.log
> -rw-r----- 1 munge munge 0 Feb 17 03:40 /var/log/munge/munged.log

Should I change the owner of the above log file from munge to user root? Please let us know.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #48)
> munged: Error: Logfile is insecure: "/var/log/munge/munged.log" should be owned by UID 0
>
> Should I change the owner of the above log file from munge to user root ?

No. The systemd unit file (or a drop-in) is required to be modified to set munge's arguments at startup. Changing the ownership would only break munge when munge is started by systemd as the `munge` user.
(In reply to Nate Rini from comment #49)
> No. The systemd unit file (or a drop-in) is required to be modified to set
> munge's arguments at startup. Changing the ownership would only break munge
> when munge is started by systemd as the `munge` user.

Are instructions on how to do this needed?
(In reply to Nate Rini from comment #50)

Hi Nate,

> Are instructions on how to do this needed?

Yes, it would be really great if you could provide me with the instructions on how to do this.

Regards,
Debajit Dutta
Please attach /usr/lib/systemd/system/munge.service
(In reply to Nate Rini from comment #52)

Hi Nate,

> Please attach /usr/lib/systemd/system/munge.service

Below is the content of the file:

> [Unit]
> Description=MUNGE authentication service
> Documentation=man:munged(8)
> After=network.target
> After=syslog.target
> After=time-sync.target
>
> [Service]
> Type=forking
> ExecStart=/usr/sbin/munged
> PIDFile=/var/run/munge/munged.pid
> User=munge
> Group=munge
> Restart=on-abort
>
> [Install]
> WantedBy=multi-user.target

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #53)
> (In reply to Nate Rini from comment #52)
> > Please attach /usr/lib/systemd/system/munge.service

How was munge installed? Using RPMs?
Follow this procedure:
> mkdir -p /usr/lib/systemd/system/munge.service.d

Populate file: /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> ExecStart=/usr/sbin/munged -M --num-threads 10

Reload and restart:
> systemctl daemon-reload
> systemctl restart munge
(In reply to Nate Rini from comment #54)

Hi Nate,

> How was munge installed? using rpms?

Yes, munge was installed using rpms. Below are the munge rpms that were installed:
> munge-libs-0.5.11-3.el7.x86_64.rpm
> munge-0.5.11-3.el7.x86_64.rpm
> munge-devel-0.5.11-3.el7.x86_64.rpm

Regards,
Debajit Dutta
Once the changes in comment#55 are applied, please call this and attach the log:
> $ echo test | munge | unmunge
Hi Nate,

After the restart of munge.service we are getting the below errors:

> [root@oslab002 system]# systemctl restart munge.service
> Failed to restart munge.service: Unit is not loaded properly: Invalid argument.
> See system logs and 'systemctl status munge.service' for details.
> [root@oslab002 system]# systemctl status munge.service
> ● munge.service - MUNGE authentication service
>    Loaded: error (Reason: Invalid argument)
>   Drop-In: /usr/lib/systemd/system/munge.service.d
>            └─local.conf
>    Active: active (running) since Fri 2022-11-18 14:27:51 IST; 3 months 21 days ago
>      Docs: man:munged(8)
>  Main PID: 1345 (munged)
>    CGroup: /system.slice/munge.service
>            └─1345 /usr/sbin/munged
>
> Nov 18 14:27:50 oslab002 systemd[1]: Starting MUNGE authentication service...
> Nov 18 14:27:51 oslab002 systemd[1]: Started MUNGE authentication service.
> Mar 10 22:42:59 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.

Regards,
Debajit Dutta
Please swap to this and follow the procedure in comment#55.

Populate file: /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> Type=simple
> ExecStart=/usr/sbin/munged -M --num-threads 10 -F
(In reply to Nate Rini from comment #59)

Hi Nate,

> Please swap to this and follow procedure in comment#55

We are still getting the same error:

> [root@oslab002 system]# cat /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> Type=simple
> ExecStart=/usr/sbin/munged -M --num-threads 10 -F
> [root@oslab002 system]# systemctl daemon-reload
> [root@oslab002 system]# systemctl restart munge.service
> Failed to restart munge.service: Unit is not loaded properly: Invalid argument.
> See system logs and 'systemctl status munge.service' for details.
> [root@oslab002 system]# systemctl status munge.service
> ● munge.service - MUNGE authentication service
>    Loaded: error (Reason: Invalid argument)
>   Drop-In: /usr/lib/systemd/system/munge.service.d
>            └─local.conf
>    Active: active (running) since Fri 2022-11-18 14:27:51 IST; 3 months 21 days ago
>      Docs: man:munged(8)
>  Main PID: 1345 (munged)
>    CGroup: /system.slice/munge.service
>            └─1345 /usr/sbin/munged
>
> Nov 18 14:27:50 oslab002 systemd[1]: Starting MUNGE authentication service...
> Nov 18 14:27:51 oslab002 systemd[1]: Started MUNGE authentication service.
> Mar 10 22:42:59 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.
> Mar 10 23:22:57 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.

Regards,
Debajit Dutta
Please revert the changes for now and provide this:
> $ systemd --version
(In reply to Nate Rini from comment #61)
> Please revert the changes for now and provide this:
> > $ systemd --version

> [root@oslab002 system]# systemd --version
> bash: systemd: command not found...
Please call:
> cat /etc/os-release
> lsb_release -a
> ps -ef|grep systemd
(In reply to Nate Rini from comment #63)
> Please call:
> > cat /etc/os-release
> > lsb_release -a
> > ps -ef|grep systemd

> [root@oslab002 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
> ID="centos"
> ID_LIKE="rhel fedora"
> VERSION_ID="7"
> PRETTY_NAME="CentOS Linux 7 (Core)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:centos:centos:7"
> HOME_URL="https://www.centos.org/"
> BUG_REPORT_URL="https://bugs.centos.org/"
> CENTOS_MANTISBT_PROJECT="CentOS-7"
> CENTOS_MANTISBT_PROJECT_VERSION="7"
> REDHAT_SUPPORT_PRODUCT="centos"
> REDHAT_SUPPORT_PRODUCT_VERSION="7"
> [root@oslab002 ~]# lsb_release -a
> LSB Version:    :core-4.1-amd64:core-4.1-ia32:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-ia32:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-ia32:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
> Distributor ID: CentOS
> Description:    CentOS Linux release 7.9.2009 (Core)
> Release:        7.9.2009
> Codename:       Core
> [root@oslab002 ~]# ps -ef | grep systemd
> root      1      0  0 2022  ?     00:33:27 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
> root      611    1  0 2022  ?     00:00:47 /usr/lib/systemd/systemd-journald
> root      658    1  0 2022  ?     00:00:00 /usr/lib/systemd/systemd-udevd
> dbus      868    1  0 2022  ?     00:02:42 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
> root      1449   1  0 2022  ?     00:00:57 /usr/lib/systemd/systemd-logind
> root      1453   1  0 2022  ?     00:11:57 /usr/sbin/automount --systemd-service --dont-check-daemon
> root      12117  12011 0 23:53 pts/0 00:00:00 grep --color=auto systemd
Please call:
> /usr/lib/systemd/systemd --version
(In reply to Nate Rini from comment #65)
> Please call:
> > /usr/lib/systemd/systemd --version

> [root@oslab002 ~]# /usr/lib/systemd/systemd --version
> systemd 219
> +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
Please swap to this and follow the procedure in comment#55.

Populate file: /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> Type=simple
> ExecStart=
> ExecStart=/usr/sbin/munged -M --num-threads 10 -F

This testing can be done on any node. Please don't test on the controllers.
(In reply to Nate Rini from comment #67)
> This testing can be done on any node. Please don't test on the controllers.

I have updated the content of the file as above. This time there are no errors, but after daemon-reload and a munge service restart, the munge service is failing. Below is the output of systemctl status munge after the restart:

> [root@oslab002 ~]# systemctl restart munge.service
> [root@oslab002 ~]# systemctl status munge.service
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>   Drop-In: /usr/lib/systemd/system/munge.service.d
>            └─local.conf
>    Active: failed (Result: exit-code) since Sat 2023-03-11 00:41:29 IST; 2s ago
>      Docs: man:munged(8)
>   Process: 15433 ExecStart=/usr/sbin/munged -M --num-threads 10 -F (code=exited, status=1/FAILURE)
>  Main PID: 15433 (code=exited, status=1/FAILURE)
>
> Mar 11 00:41:29 oslab002 systemd[1]: Started MUNGE authentication service.
> Mar 11 00:41:29 oslab002 munged[15433]: munged: Notice: Running on "oslab002.open-silicon.com" (172.16.24.83)
> Mar 11 00:41:29 oslab002 systemd[1]: munge.service: main process exited, code=exited, status=1/FAILURE
> Mar 11 00:41:29 oslab002 systemd[1]: Unit munge.service entered failed state.
> Mar 11 00:41:29 oslab002 systemd[1]: munge.service failed.
Please provide: > sudo journalctl --unit munge
(In reply to Nate Rini from comment #69)
> Please provide:
> > sudo journalctl --unit munge

[root@oslab002 ~]# sudo journalctl --unit munge
-- Logs begin at Fri 2022-11-18 14:27:26 IST, end at Sat 2023-03-11 00:55:21 IST. --
Nov 18 14:27:50 oslab002 systemd[1]: Starting MUNGE authentication service...
Nov 18 14:27:51 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 10 22:42:59 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.
Mar 10 23:22:57 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.
Mar 10 23:32:33 oslab002 systemd[1]: Stopping MUNGE authentication service...
Mar 10 23:32:33 oslab002 systemd[1]: Stopped MUNGE authentication service.
Mar 10 23:32:33 oslab002 systemd[1]: Starting MUNGE authentication service...
Mar 10 23:32:34 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 11 00:41:22 oslab002 systemd[1]: Stopping MUNGE authentication service...
Mar 11 00:41:22 oslab002 systemd[1]: Stopped MUNGE authentication service.
Mar 11 00:41:22 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 11 00:41:22 oslab002 systemd[1]: munge.service: main process exited, code=exited, status=1/FAILURE
Mar 11 00:41:22 oslab002 systemd[1]: Unit munge.service entered failed state.
Mar 11 00:41:22 oslab002 systemd[1]: munge.service failed.
Mar 11 00:41:29 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 11 00:41:29 oslab002 munged[15433]: munged: Notice: Running on "oslab002.open-silicon.com" (172.16.24.83)
Mar 11 00:41:29 oslab002 systemd[1]: munge.service: main process exited, code=exited, status=1/FAILURE
Mar 11 00:41:29 oslab002 systemd[1]: Unit munge.service entered failed state.
Mar 11 00:41:29 oslab002 systemd[1]: munge.service failed.
[root@oslab002 ~]#
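For context on the "more than one ExecStart= setting" refusals in the journal above: systemd accumulates ExecStart= lines from the unit file and all of its drop-ins, and for any service type other than Type=oneshot only a single ExecStart= may remain. A drop-in therefore has to clear the inherited value with an empty assignment before setting its own. A minimal sketch of the pattern (the munged path matches this ticket; flags omitted here for illustration):

```ini
# /usr/lib/systemd/system/munge.service.d/local.conf (sketch)
[Service]
# An empty assignment resets the ExecStart= list inherited from munge.service...
ExecStart=
# ...so this replacement is then the only ExecStart= that remains.
ExecStart=/usr/sbin/munged
```

Without the empty `ExecStart=` line, the drop-in's entry is appended to the original one and systemd refuses to start the unit, which is exactly the "Refusing." message seen in the journal.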
Please call: > munged --version
(In reply to Nate Rini from comment #71)
> Please call:
> > munged --version

[root@oslab002 ~]# munged --version
munge-0.5.11 (2013-08-27)
[root@oslab002 ~]#
Please try:

/usr/lib/systemd/system/munge.service.d/local.conf:
> [Service]
> ExecStart=
> ExecStart=/usr/sbin/munged --num-threads 10

Please plan to upgrade the cluster. Many of these issues are caused by running older and generally deprecated versions.
(In reply to Nate Rini from comment #74)
> Please try:
>
> /usr/lib/systemd/system/munge.service.d/local.conf:
> > [Service]
> > ExecStart=
> > ExecStart=/usr/sbin/munged --num-threads 10

Yes, now it is running:

[root@oslab002 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/munge.service.d
           └─local.conf
   Active: active (running) since Sat 2023-03-11 01:13:34 IST; 8s ago
     Docs: man:munged(8)
  Process: 17756 ExecStart=/usr/sbin/munged --num-threads 10 (code=exited, status=0/SUCCESS)
 Main PID: 17759 (munged)
    Tasks: 12
   Memory: 892.0K
   CGroup: /system.slice/munge.service
           └─17759 /usr/sbin/munged --num-threads 10

Mar 11 01:13:34 oslab002 systemd[1]: Starting MUNGE authentication service...
Mar 11 01:13:34 oslab002 systemd[1]: Started MUNGE authentication service.
[root@oslab002 ~]#

> Please plan to upgrade the cluster. Many of these issues are caused by
> running older and generally deprecated versions.

Sure, will do this.
Please follow comment#57
(In reply to Nate Rini from comment #76)
> Please follow comment#57

[root@oslab002 ~]# echo test | munge | unmunge
STATUS: Success (0)
ENCODE_HOST: oslab002.open-silicon.com (172.16.24.83)
ENCODE_TIME: 2023-03-11 01:38:34 +0530 (1678478914)
DECODE_TIME: 2023-03-11 01:38:34 +0530 (1678478914)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 5
test
[root@oslab002 ~]#
(In reply to Openfive Support from comment #77)
> (In reply to Nate Rini from comment #76)
> > Please follow comment#57
> [root@oslab002 ~]# echo test | munge | unmunge
> STATUS: Success (0)

Please provide the output of the following:
> sdiag
> scontrol show nodes
> scontrol show jobs
Created attachment 29273 [details] scontrol_nodes_jobs_sdiag_11-03-2023
(In reply to Nate Rini from comment #78)
> (In reply to Openfive Support from comment #77)
> > (In reply to Nate Rini from comment #76)
> > > Please follow comment#57
> > [root@oslab002 ~]# echo test | munge | unmunge
> > STATUS: Success (0)
>
> Please provide the output of the following
> > sdiag
> > scontrol show nodes
> > scontrol show jobs

I have uploaded the above command outputs in a zip file: scontrol_nodes_jobs_sdiag_11-03-2023.zip

Also, I ran the same munge test on osxon002; below is the output:

[root@osxon002 ~]# echo test | munge | unmunge
STATUS: Success (0)
ENCODE_HOST: osxon002.open-silicon.com (192.168.2.61)
ENCODE_TIME: 2023-03-11 02:04:39 +0530 (1678480479)
DECODE_TIME: 2023-03-11 02:04:39 +0530 (1678480479)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 5
test
[root@osxon002 ~]#
(In reply to Openfive Support from comment #79)
> Created attachment 29273 [details]
> scontrol_nodes_jobs_sdiag_11-03-2023
>
> Remote Procedure Call statistics by user
> vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782
> krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909
> santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359
> radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749
> root ( 0) count:26450 ave_time:630897 total_time:16687241811

These users are querying Slurm more than root. Every one of these queries requires a munge connection, which may be the source of munge getting overloaded. Please work with these users to see why they are querying Slurm so much. This is usually due to a while() or for() loop that constantly runs one of the Slurm commands such as squeue.

Please note that the Slurm-23.02 release has new features to help with users causing this issue:
> https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable
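For reference once the cluster reaches 23.02 or later (it currently runs 20.11.8, which does not have this feature), the linked RPC rate limiting is turned on through SlurmctldParameters in slurm.conf. A minimal sketch; see the SlurmctldParameters entry in the slurm.conf man page for the companion rl_* tuning options:

```ini
# slurm.conf fragment (Slurm >= 23.02 only): enable per-user RPC rate limiting
# so looping squeue/sacct scripts are throttled instead of overloading munge.
SlurmctldParameters=rl_enable
```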
(In reply to Openfive Support from comment #79) > Created attachment 29273 [details] > scontrol_nodes_jobs_sdiag_11-03-2023 There are still a good number of idle nodes. Please verify that test jobs can start on them: > srun -w osvnc007 uptime
(In reply to Nate Rini from comment #83)
> (In reply to Openfive Support from comment #79)
> > Created attachment 29273 [details]
> > scontrol_nodes_jobs_sdiag_11-03-2023
>
> There are still a good number of idle nodes. Please verify that test jobs
> can start on them:
> > srun -w osvnc007 uptime

Well, many nodes are not added for computing purposes; osvnc007, for example, is only used for running users' VNCs. Is it required to add a server to Slurm as a node if I want to execute the srun command from that server?
(In reply to Nate Rini from comment #82) > (In reply to Openfive Support from comment #79) > > Created attachment 29273 [details] > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > Remote Procedure Call statistics by user > > vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782 > > krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909 > > santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359 > > radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749 > > root ( 0) count:26450 ave_time:630897 total_time:16687241811 > > These users are quering Slurm more than root. Every one of these queries > requires a munge connection which may be the source of munge getting > overloaded. Please work with these users to see why they are quering Slurm > soo much. This is usually due to a while() or for() loop that are constantly > running one of the Slurm commands such as squeue. > > Please note that the Slurm-23.02 release has new features to help with users > with this issue: > > https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable Sure, we will check with these users and will inform other users as well. Also, we will plan to upgrade the cluster.
(In reply to Openfive Support from comment #84) > (In reply to Nate Rini from comment #83) > > (In reply to Openfive Support from comment #79) > > > Created attachment 29273 [details] > > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > There are still a good number of idle nodes. Please verify that test jobs > > can start on them: > > > srun -w osvnc007 uptime > > Well, many nodes are not added for computing purposes like osvnc007 is only > for running users' VNCs. For purposes of this ticket, I want to verify that a job can start on all online nodes as that was the issue in comment#0. Please attach the slurmctld log after at least a single job has been tested on every node. > Is it required to add a server to slurm as a node if I want to execute the > srun command from the server? No: munge, Munge configuration, Slurm binaries, and Slurm configuration are the only things required beyond IP connectivity.
(In reply to Nate Rini from comment #82) > (In reply to Openfive Support from comment #79) > > Created attachment 29273 [details] > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > Remote Procedure Call statistics by user > > vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782 > > krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909 > > santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359 > > radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749 > > root ( 0) count:26450 ave_time:630897 total_time:16687241811 > > These users are quering Slurm more than root. Every one of these queries > requires a munge connection which may be the source of munge getting > overloaded. Please work with these users to see why they are quering Slurm > soo much. This is usually due to a while() or for() loop that are constantly > running one of the Slurm commands such as squeue. > > Please note that the Slurm-23.02 release has new features to help with users > with this issue: > > https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable Hi Nate, Can you please let us know, how we can get the above information from our side? We would like to do a periodic check on the same and follow up with the users. Please let us know Regards, Debajit Dutta
(In reply to Openfive Support from comment #87) > (In reply to Nate Rini from comment #82) > > (In reply to Openfive Support from comment #79) > > > Created attachment 29273 [details] > > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > > > Remote Procedure Call statistics by user > > > vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782 > > > krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909 > > > santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359 > > > radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749 > > > root ( 0) count:26450 ave_time:630897 total_time:16687241811 > > > > These users are quering Slurm more than root. Every one of these queries > > requires a munge connection which may be the source of munge getting > > overloaded. Please work with these users to see why they are quering Slurm > > soo much. This is usually due to a while() or for() loop that are constantly > > running one of the Slurm commands such as squeue. > > > > Please note that the Slurm-23.02 release has new features to help with users > > with this issue: > > > https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable > > > Hi Nate, > > > Can you please let us know, how we can get the above information from our > side? > > We would like to do a periodic check on the same and follow up with the > users. > > Please let us know > > > Regards, > Debajit Dutta ok got it, this data we get from the sdiag command.
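To make that periodic check easier to script, the "Remote Procedure Call statistics by user" section of sdiag output can be parsed and ranked. This is a hypothetical helper, not part of Slurm; the line format assumed below matches the sdiag excerpt quoted earlier in this ticket:

```python
# Parse the per-user RPC statistics from `sdiag` output and rank users
# by call count, so heavy Slurm queriers can be spotted in a cron job.
import re
import subprocess

USER_RE = re.compile(
    r"^\s*(\S+)\s+\(\s*(\d+)\)\s+count:(\d+)\s+ave_time:(\d+)\s+total_time:(\d+)"
)

def parse_user_rpcs(sdiag_text):
    """Return [(user, count, total_time)] sorted by count, descending."""
    rows = []
    in_section = False
    for line in sdiag_text.splitlines():
        if "statistics by user" in line:
            in_section = True
            continue
        if in_section:
            m = USER_RE.match(line)
            if m:
                rows.append((m.group(1), int(m.group(3)), int(m.group(5))))
            elif rows:  # first non-matching line after data ends the section
                break
    return sorted(rows, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    import shutil
    if shutil.which("sdiag"):  # only run on a host with Slurm client tools
        out = subprocess.run(["sdiag"], capture_output=True, text=True).stdout
        for user, count, total in parse_user_rpcs(out)[:10]:
            print(f"{user:12s} count={count:8d} total_time={total}")
```

Running this from cron and diffing the top entries over time would show whether the follow-up with users actually reduced their RPC counts.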
Please provide a status update
(In reply to Nate Rini from comment #89) > Please provide a status update Hi Nate, Can we have a call to resolve this issue? Regards, Debajit Dutta
Hi Debajit Dutta, Nate asked me to reply to you regarding your request for a call. I do not see the purpose or value in a call right now, especially since Nate has requested the following and a status update. I would not want to have an engineer sitting on a call while they review these logs. > For purposes of this ticket, I want to verify that a job can start on all online > nodes as that was the issue in comment#0. > Please attach the slurmctld log after at least a single job has been tested on > every node. Based on your other updates, it seems that jobs are running and users are able to submit. Please let Nate know if this is not the case. We can re-evaluate a call if needed once we have confirmation on the status, and answers to the above questions.
(In reply to Jason Booth from comment #91)

Hi Jason,

> > Please attach the slurmctld log after at least a single job has been tested on
> > every node.

I have executed the below command on around 12 running nodes and have attached the slurmctld log:

srun -p normal -w osxon047 uptime

You will find in the log file that the below error appears very frequently:

[2023-03-13T23:54:35.368] error: slurm_auth_get_host: Lookup failed for 0.0.0.0

What is the meaning of this error message? Why are we getting it?

> Based on your other updates, it seems that jobs are running and users are
> able to
> submit. Please let Nate know if this is not the case.

Yes, users are able to invoke jobs; however, we want to know what caused the error and how we can prevent it in the future.

Regards,
Debajit Dutta
Created attachment 29299 [details] slurmctld.log_13-03-2023
(In reply to Openfive Support from comment #93)
> Created attachment 29299 [details]
> slurmctld.log_13-03-2023
>
> [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619

It looks like there are a number of jobs that reference a now-defunct node "osxon092s". I suggest scancelling all of these jobs, as they can never run.

> [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0

This error is caused by munge either rejecting a packet or otherwise being too busy to make a munge token. I will note that in these logs, slurmctld ran without any of these errors for around 8 minutes (2023-03-13T22:05:07.754 -> 2023-03-13T22:13:23.031). This suggests that this is a load issue. munge is a cryptographic service which has very clear scaling limits based on the number of CPU cores on the host. We can try adding more threads to the munge daemon, but that can actually cause munge to go slower once munge has more threads than there are physical cores on the host.

Have the users listed in comment#79 been contacted to verify they are no longer hammering slurmctld with requests?
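As a quick worked check of the quiet window between the two quoted log timestamps, the gap can be computed directly (assuming, as here, both stamps come from the same log in the same timezone):

```python
# Difference between the two slurmctld log timestamps quoted above.
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"
start = datetime.strptime("2023-03-13T22:05:07.754", FMT)
end = datetime.strptime("2023-03-13T22:13:23.031", FMT)
window = end - start
print(window)  # 0:08:15.277000
```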
> > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0

Please also note that this error no longer even exists in the currently supported releases of Slurm. It has been replaced by substantially improved error logging.
(In reply to Nate Rini from comment #94)
> (In reply to Openfive Support from comment #93)
> > Created attachment 29299 [details]
> > slurmctld.log_13-03-2023
> >
> > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619
> It looks like there is number of jobs that reference a now defunct node
> "osxon092s". I suggest scancelling all of these jobs as they can never run.
>
> > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
> This error is caused my munge either rejecting a packet or otherwise being
> too busy to make a munge token. I will note that in these logs, that
> slurmctld ran with any of these errors for around 6 minutes
> (2023-03-13T22:05:07.754 -> 2023-03-13T22:13:23.031). This suggests that
> this a load issue. munge is a crytographic service which has very clear
> scaling limits based on the number of CPU cores on the host. We can try
> adding more threads to the munge daemon but can actually cause munge to go
> slower once munge has more threads than there are physical cores on the host.

What should we do now? Should I increase the munge thread count on the master server and restart the munge service on all nodes?

> Have the users listed in comment#79 been contacted to verify they are no
> longer hammering slurmctld with requests?

Yes, we did contact the users and also followed up with them on this.
(In reply to Nate Rini from comment #95)
> > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
>
> Please also note that this error no longer even exists in the currently
> supported releases of Slurm. It has been replaced by substantially improved
> error logging.

Are you recommending that upgrading to the latest version of Slurm will resolve all these issues?
(In reply to Nate Rini from comment #94)

Hi Nate,

> (In reply to Openfive Support from comment #93)
> > Created attachment 29299 [details]
> > slurmctld.log_13-03-2023
> >
> > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619
> It looks like there is number of jobs that reference a now defunct node
> "osxon092s". I suggest scancelling all of these jobs as they can never run.

What I find from the below errors:

[2023-03-13T22:05:08.136] error: _find_node_record(763): lookup failure for osxon092s
[2023-03-13T22:05:08.136] error: node_name2bitmap: invalid node specified osxon092s
[2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619

is that job ID 1190619 had already completed on 2023-03-08 at 22:35:32, as you can see from the scontrol output below:

[root@hpcmaster 13-03-2023]# scontrol show job 1190619
JobId=1190619 JobName=osi_hbmc_protocol_controller_wrap_falcon_FUNC_FFm40_rcworst_CCworstm40_SI_ENABLED_true_HOLD_ONLY
   UserId=renishd(1232) GroupId=technodebm(1023) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=COMPLETED Reason=NodeDown Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:08:53 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-08T22:26:17 EligibleTime=2023-03-08T22:26:17
   AccrueTime=2023-03-08T22:26:17
   StartTime=2023-03-08T22:26:39 EndTime=2023-03-08T22:35:32 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-08T22:26:39
   Partition=normal AllocNode:Sid=0.0.0.0:13928
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=osxon092s BatchHost=osxon092s
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20000M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20000M MinTmpDiskNode=0

Why am I getting this error message now, i.e. on 2023-03-13 at 22:05:08.136?

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #96) > (In reply to Nate Rini from comment #94) > > (In reply to Openfive Support from comment #93) > > > Created attachment 29299 [details] > > > slurmctld.log_13-03-2023 > > > > > > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619 > > It looks like there is number of jobs that reference a now defunct node > > "osxon092s". I suggest scancelling all of these jobs as they can never run. > > > > > > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0 > > This error is caused my munge either rejecting a packet or otherwise being > > too busy to make a munge token. I will note that in these logs, that > > slurmctld ran with any of these errors for around 6 minutes > > (2023-03-13T22:05:07.754 -> > > 2023-03-13T22:13:23.031). This suggests that this a load issue. munge is a > > crytographic service which has very clear scaling limits based on the number > > of CPU cores on the host. We can try adding more threads to the munge daemon > > but can actually cause munge to go slower once munge has more threads than > > there are physical cores on the host. > > > > What to do now? should I increase the munge thread in the master server and > restart the munge service in all nodes? We need to verify the number of cores on the host. Please call: > lscpu > > Have the users listed in comment#79 been contacted to verify they are no > > longer hammering slurmctld with requests? > > Yes, we did contacted the users and also followed up with them on this. Please call the following: > sdiag -r > sdiag > sleep 15m > sdiag Please upload the output. (In reply to Openfive Support from comment #97) > (In reply to Nate Rini from comment #95) > > > > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0 > > > > Please also note that this error no longer even exists in the currently > > supported releases of Slurm. 
> > It has been replaced by substantially improved error logging.
>
> Are you recommending that upgrading to the latest version of the slurm

The cluster is running a no longer supported version. We always suggest a site upgrade to a supported version, as we are limited in our options for fixing issues.

> resolves all these issues?

No, there is no guarantee of that, but we will have the ability to get better logs and provide corrective patches (if needed) on supported releases.

(In reply to Openfive Support from comment #98)
> (In reply to Nate Rini from comment #94)
> > (In reply to Openfive Support from comment #93)
> > > Created attachment 29299 [details]
> > > slurmctld.log_13-03-2023
> > >
> > > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619
> > It looks like there is number of jobs that reference a now defunct node
> > "osxon092s". I suggest scancelling all of these jobs as they can never run.
>
> What I find is that from the below error:-
>
> [2023-03-13T22:05:08.136] error: _find_node_record(763): lookup failure for osxon092s
> [2023-03-13T22:05:08.136] error: node_name2bitmap: invalid node specified osxon092s
>
> Why am I getting this error message now i.e. on 2023-03-13 at 22:05:08.136?

These errors all happened at the start of the log. Was slurmctld restarted prior to providing the log?
(In reply to Nate Rini from comment #99) > (In reply to Openfive Support from comment #96) > > (In reply to Nate Rini from comment #94) > > > (In reply to Openfive Support from comment #93) > > > > What to do now? should I increase the munge thread in the master server and > > restart the munge service in all nodes? > > We need to verify the number of cores on the host. Please call: > > lscpu > Below is the output:- [root@hpcmaster 13-03-2023]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz Stepping: 4 CPU MHz: 2100.000 BogoMIPS: 4200.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 11264K NUMA node0 CPU(s): 0-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities [root@hpcmaster 13-03-2023]# > > > Have the users listed in comment#79 been contacted to verify they are no > > > longer hammering slurmctld with requests? 
> > > > Yes, we did contacted the users and also followed up with them on this. > > Please call the following: > > sdiag -r > > sdiag > > sleep 15m > > sdiag > > Please upload the output. > Will attach a zip file for the same in few minutes. > (In reply to Openfive Support from comment #98) > > (In reply to Nate Rini from comment #94) > > > (In reply to Openfive Support from comment #93) > > > > Created attachment 29299 [details] > > > > slurmctld.log_13-03-2023 > > > > > > > > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619 > > > It looks like there is number of jobs that reference a now defunct node > > > "osxon092s". I suggest scancelling all of these jobs as they can never run. > > > > > > > > What I find is that from the below error:- > > > > [2023-03-13T22:05:08.136] error: _find_node_record(763): lookup failure for > > osxon092s > > [2023-03-13T22:05:08.136] error: node_name2bitmap: invalid node specified > > osxon092s > > Why am I getting this error message now i.e. on 2023-03-13 at 22:05:08.136? > > These errors happened all at the start of the log. Was slurmctld restarted > prior to providing the log? Yes, today it was restarted for some changes that were made in the slurm.conf file.
(In reply to Openfive Support from comment #100) > (In reply to Nate Rini from comment #99) > > (In reply to Openfive Support from comment #96) > > > (In reply to Nate Rini from comment #94) > > > > (In reply to Openfive Support from comment #93) > > > > > > > What to do now? should I increase the munge thread in the master server and > > > restart the munge service in all nodes? > > > > We need to verify the number of cores on the host. Please call: > > > lscpu > > > > Below is the output:- > Core(s) per socket: 8 The max number of threads for munge should be 8 on this host to match the number of cores. This host is likely underpowered to run Slurm on anything but a small cluster. Please see slides 18-20: > https://slurm.schedmd.com/SLUG22/Field_Notes_6.pdf I understand this is not something that can change immediately, but please consider providing faster server hardware for the Slurm controllers. > > (In reply to Openfive Support from comment #98) > Yes, today it was restarted for some changes that were made in the > slurm.conf file. Then these jobs were likely kept in the StateSaveLocation. If these errors don't happen on the next cycle of slurmctld, we can safely ignore these errors.
Created attachment 29301 [details] sdiag.zip
(In reply to Openfive Support from comment #102) > Created attachment 29301 [details] > sdiag.zip > > krutikak ( 3591) count:2830 ave_time:67513 total_time:191064547 This user is still doing more RPCs than root.
(In reply to Nate Rini from comment #101) > The max number of threads for munge should be 8 on this host to match the > number of cores. This host is likely underpowered to run Slurm on anything > but a small cluster. Is the controller a physical machine or is it a VM?
(In reply to Nate Rini from comment #104)
> (In reply to Nate Rini from comment #101)
> > The max number of threads for munge should be 8 on this host to match the
> > number of cores. This host is likely underpowered to run Slurm on anything
> > but a small cluster.
>
> Is the controller a physical machine or is it a VM?

It is a physical server, not a VM.
(In reply to Nate Rini from comment #103) > (In reply to Openfive Support from comment #102) > > Created attachment 29301 [details] > > sdiag.zip > > > > krutikak ( 3591) count:2830 ave_time:67513 total_time:191064547 > > This user is still doing more RPCs than root. Has the RPC count/total_time for this user been reduced? Please provide a new sdiag output: Please call the following: > sdiag -r > uptime > sdiag > sleep 15m > uptime > sdiag > sleep 15m > uptime > sdiag
(In reply to Nate Rini from comment #106)

Hi Nate,

> (In reply to Nate Rini from comment #103)
> > (In reply to Openfive Support from comment #102)
> > > Created attachment 29301 [details]
> > > sdiag.zip
> > >
> > > krutikak ( 3591) count:2830 ave_time:67513 total_time:191064547
> >
> > This user is still doing more RPCs than root.
>
> Has the RPC count/total_time for this user been reduced? Please provide a
> new sdiag output:

We are following up with the users; this will take some time.

> Please call the following:
> > sdiag -r
> > uptime
> > sdiag
> > sleep 15m
> > uptime
> > sdiag
> > sleep 15m
> > uptime
> > sdiag

Is this to verify the RPC count/total_time for the users?

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #107) > (In reply to Nate Rini from comment #106) > We are following up with the users, we will take some time on this. I will reduce this ticket to SEV4 while we wait. > Is this to verify the RPC count/total_time for the users? Yes but I'm also looking to verify that jobs are starting. (In reply to Openfive Support from comment #105) > No, it is a physical server. The CPU on this server is very likely just too slow to be a Slurm controller for the cluster. I strongly suggest looking into getting a faster server. Many of these issues will likely just be resolved by that.
Hi Nate,

Cores are available, but we are still seeing the wait below:

[debajitd@osvnc001 ~]$ srun -p normal --pty /bin/tcsh
srun: job 1257513 queued and waiting for resources

[root@hpcmaster 13-03-2023]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal up 15-00:00:0 2 down* osxon[030,060]
normal up 15-00:00:0 7 drng osxon[010,019,032,047,050,055,080]
normal up 15-00:00:0 1 resv osxon038
normal up 15-00:00:0 28 mix osxon[001,004-006,009,013,015,024,031,033,036-037,041-042,045,056,059,063,065-066,069,073-075,079,082,091,094]
normal up 15-00:00:0 26 alloc osxon[007-008,018,020-021,023,028-029,035,039,043-044,046,048-049,052-054,058,064,067,070-071,078,087,090]
long up 60-00:00:0 2 down* osxon[030,060]
long up 60-00:00:0 7 drng osxon[010,019,032,047,050,055,080]
long up 60-00:00:0 1 resv osxon038
long up 60-00:00:0 28 mix osxon[001,004-006,009,013,015,024,031,033,036-037,041-042,045,056,059,063,065-066,069,073-075,079,082,091,094]
long up 60-00:00:0 26 alloc osxon[007-008,018,020-021,023,028-029,035,039,043-044,046,048-049,052-054,058,064,067,070-071,078,087,090]
short up 6:00:00 1 down* osxon060
short up 6:00:00 4 mix osxon[059,065-066,081]
short up 6:00:00 1 alloc osxon067
prio up 15-00:00:0 1 down* osxon060
prio up 15-00:00:0 8 mix osxon[002,004,006,031,065-066,081,088]
prio up 15-00:00:0 3 alloc osxon[035,054,067]
prio up 15-00:00:0 1 idle osxon068
sms-license up 15-00:00:0 1 mix osxon081
regression up 15-00:00:0 3 alloc osxon[061,072,077]
guest up 15-00:00:0 1 drain osxon034
guest up 15-00:00:0 1 idle osxon003
eda up 15-00:00:0 1 idle osxon095d
vnc up infinite 13 maint guest-ausdia,guestvnc001,osvnc[001-004,007-013]
vnc up infinite 2 down* guest-ansys,guest-mentor
guest-vnc* up infinite 1 maint ofindcon
[root@hpcmaster 13-03-2023]#

Please help us here.

Regards,
Debajit Dutta
We are not able to execute any jobs right now. This problem is happening again. Can we please get on a call to resolve this?
Please call: > scontrol show job 1257513
(In reply to Nate Rini from comment #111)
> Please call:
> > scontrol show job 1257513

[root@hpcmaster 13-03-2023]# scontrol show job 1257513
JobId=1257513 JobName=tcsh
   UserId=debajitd(3403) GroupId=engr(500) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T00:18:14 EligibleTime=2023-03-21T00:18:14
   AccrueTime=2023-03-21T00:18:14
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T00:42:18
   Partition=normal AllocNode:Sid=osvnc001:24466
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/debajitd
   Power=
   NtasksPerTRES:0
[root@hpcmaster 13-03-2023]#
Please call:
> sprio
(In reply to Nate Rini from comment #113) > Please call: > > sprio [root@hpcmaster 13-03-2023]# sprio JOBID PARTITION PRIORITY SITE PARTITION 1256323 normal 5000 0 5000 1256578 normal 5000 0 5000 1256626 normal 5000 0 5000 1256755 normal 5000 0 5000 1256756 normal 5000 0 5000 1256757 normal 5000 0 5000 1256758 normal 5000 0 5000 1256759 normal 5000 0 5000 1256760 normal 5000 0 5000 1256761 normal 5000 0 5000 1256762 normal 5000 0 5000 1256763 normal 5000 0 5000 1256764 normal 5000 0 5000 1256765 normal 5000 0 5000 1256816 normal 5000 0 5000 1256852 normal 5000 0 5000 1257068 normal 5000 0 5000 1257103 regressio 6250 0 6250 1257104 regressio 6250 0 6250 1257106 regressio 6250 0 6250 1257107 regressio 6250 0 6250 1257128 normal 5000 0 5000 1257133 normal 5000 0 5000 1257142 regressio 6250 0 6250 1257150 normal 5000 0 5000 1257151 normal 5000 0 5000 1257152 normal 5000 0 5000 1257153 normal 5000 0 5000 1257154 normal 5000 0 5000 1257155 normal 5000 0 5000 1257156 normal 5000 0 5000 1257157 normal 5000 0 5000 1257158 normal 5000 0 5000 1257159 normal 5000 0 5000 1257160 normal 5000 0 5000 1257161 normal 5000 0 5000 1257162 normal 5000 0 5000 1257163 normal 5000 0 5000 1257164 normal 5000 0 5000 1257165 normal 5000 0 5000 1257166 normal 5000 0 5000 1257167 normal 5000 0 5000 1257206 normal 5000 0 5000 1257207 normal 5000 0 5000 1257208 normal 5000 0 5000 1257209 normal 5000 0 5000 1257210 normal 5000 0 5000 1257211 normal 5000 0 5000 1257212 normal 5000 0 5000 1257213 normal 5000 0 5000 1257214 normal 5000 0 5000 1257215 normal 5000 0 5000 1257216 normal 5000 0 5000 1257217 normal 5000 0 5000 1257218 normal 5000 0 5000 1257219 normal 5000 0 5000 1257220 normal 5000 0 5000 1257221 normal 5000 0 5000 1257222 normal 5000 0 5000 1257223 normal 5000 0 5000 1257224 normal 5000 0 5000 1257225 normal 5000 0 5000 1257226 normal 5000 0 5000 1257227 normal 5000 0 5000 1257228 normal 5000 0 5000 1257229 normal 5000 0 5000 1257230 normal 5000 0 5000 1257231 normal 5000 0 5000 
1257232 normal 5000 0 5000 1257233 normal 5000 0 5000 1257234 normal 5000 0 5000 1257235 normal 5000 0 5000 1257236 normal 5000 0 5000 1257237 normal 5000 0 5000 1257238 normal 5000 0 5000 1257239 normal 5000 0 5000 1257240 normal 5000 0 5000 1257241 normal 5000 0 5000 1257242 normal 5000 0 5000 1257243 normal 5000 0 5000 1257244 normal 5000 0 5000 1257245 normal 5000 0 5000 1257246 normal 5000 0 5000 1257247 normal 5000 0 5000 1257251 long 3750 0 3750 1257263 normal 5000 0 5000 1257269 normal 5000 0 5000 1257270 normal 5000 0 5000 1257325 regressio 6250 0 6250 1257327 regressio 6250 0 6250 1257333 regressio 6250 0 6250 1257356 normal 5000 0 5000 1257366 normal 5000 0 5000 1257390 normal 5000 0 5000 1257398 long 3750 0 3750 1257422 normal 5000 0 5000 1257423 normal 5000 0 5000 1257424 regressio 6250 0 6250 1257436 regressio 6250 0 6250 1257464 normal 5000 0 5000 1257465 normal 5000 0 5000 1257466 normal 5000 0 5000 1257470 normal 5000 0 5000 1257488 normal 5000 0 5000 1257489 normal 5000 0 5000 1257490 normal 5000 0 5000 1257491 normal 5000 0 5000 1257492 normal 5000 0 5000 1257493 normal 5000 0 5000 1257494 normal 5000 0 5000 1257495 normal 5000 0 5000 1257496 normal 5000 0 5000 1257497 normal 5000 0 5000 1257498 normal 5000 0 5000 1257499 normal 5000 0 5000 1257500 normal 5000 0 5000 1257505 normal 5000 0 5000 1257512 regressio 6250 0 6250 1257514 regressio 6250 0 6250 1257522 normal 5000 0 5000 1257523 normal 5000 0 5000 1257528 normal 5000 0 5000 1257536 normal 5000 0 5000 1257543 normal 5000 0 5000 1257547 normal 5000 0 5000 1257548 normal 5000 0 5000 1257553 normal 5000 0 5000 1257554 normal 5000 0 5000 1257555 normal 5000 0 5000 1257556 normal 5000 0 5000 1257557 normal 5000 0 5000 1257558 normal 5000 0 5000 1257559 normal 5000 0 5000 1257560 normal 5000 0 5000 1257561 normal 5000 0 5000 1257562 normal 5000 0 5000 1257563 normal 5000 0 5000 1257564 normal 5000 0 5000 1257565 normal 5000 0 5000 1257570 normal 5000 0 5000 1257571 normal 5000 0 5000 1257572 
regressio 6250 0 6250 1257573 normal 5000 0 5000 1257574 normal 5000 0 5000 1257575 normal 5000 0 5000 1257576 normal 5000 0 5000 1257577 normal 5000 0 5000 1257578 regressio 6250 0 6250 1257579 normal 5000 0 5000 1257580 normal 5000 0 5000 1257581 normal 5000 0 5000 1257582 normal 5000 0 5000 1257583 normal 5000 0 5000 1257584 normal 5000 0 5000 1257585 normal 5000 0 5000 1257586 normal 5000 0 5000 1257587 normal 5000 0 5000 1257588 normal 5000 0 5000 1257589 normal 5000 0 5000 1257590 normal 5000 0 5000 1257591 normal 5000 0 5000 1257592 normal 5000 0 5000 1257593 normal 5000 0 5000 1257595 normal 5000 0 5000 1257596 normal 5000 0 5000 1257597 normal 5000 0 5000 1257598 regressio 6250 0 6250 1257599 normal 5000 0 5000 1257600 normal 5000 0 5000 1257601 normal 5000 0 5000 1257602 normal 5000 0 5000 1257603 normal 5000 0 5000 1257604 regressio 6250 0 6250 1257605 normal 5000 0 5000 1257606 normal 5000 0 5000 1257607 normal 5000 0 5000 1257608 normal 5000 0 5000 1257609 normal 5000 0 5000 1257610 normal 5000 0 5000 1257611 normal 5000 0 5000 1257612 normal 5000 0 5000 [root@hpcmaster 13-03-2023]#
Hi Team,

Please help us here; this is urgent. We are not able to execute any jobs, and jobs are going to the pending state despite available resources. We have also noticed that only the "normal" and "long" partitions have this issue; the remaining partitions are working. Below are the configurations of the two partitions:

[root@hpcmaster 21-03-2023]# grep normal /etc/slurm/slurm.conf
###normal queue####
PartitionName=normal PriorityJobFactor=400 Nodes=osxon[004,005,006,007,019,024,036,082,018,020,021,023,031,038,039,044,090,091,055,059,045,070,053,056,069,033,080,008,066,010,060,073,009,013,028,029,015,047,032,030,050,071,074,049,037,041,058,048,035,075,079,043,065,064,001,067,094,087,046,078,052,054,042,063] Default=YES MaxTime=15-00 State=UP AllowGroups=engr
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# grep long /etc/slurm/slurm.conf
####long queue####
PartitionName=long PriorityJobFactor=300 Nodes=osxon[004,005,006,007,019,024,036,082,018,020,021,023,031,038,039,044,090,091,055,059,045,070,053,056,069,033,080,008,066,010,060,073,009,013,028,029,015,047,032,030,050,071,074,049,037,041,058,048,035,075,079,043,065,064,001,067,094,087,046,078,052,054,042,063] Default=YES MaxTime=60-00 State=UP AllowGroups=engr
[root@hpcmaster 21-03-2023]#

Regards,
Debajit Dutta
The job has Priority=5000 while sprio reports a large number of jobs with the same priority. When the priority is the same, Slurm orders the jobs by submission time. Job 1257513 will not run until it has the highest priority.

To verify, please call:
> scontrol show job 1257513
> scontrol top 1257513
> scontrol show job 1257513
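To illustrate the tie-break (a sketch using made-up submission times, not data from this cluster): pending jobs are effectively ordered by priority, then by submission time, which can be emulated with a plain `sort`.

```shell
# Illustrative only: emulate how equal-priority jobs fall back to submission
# time. Job IDs are from this ticket; the epoch values are invented.
# Fields: jobid|priority|submit_epoch
printf '%s\n' \
  '1257513|5000|1679354294' \
  '1257068|5000|1679340000' \
  '1256323|5000|1679300000' |
  sort -t'|' -k2,2nr -k3,3n
# 1256323 sorts first: with identical priorities, the earliest-submitted job
# is considered first, so 1257513 waits behind the whole equal-priority queue.
```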
(In reply to Nate Rini from comment #116)
> The job has Priority=5000 while sprio reports a large number of jobs with
> the same priority. When the priority is the same, Slurm orders the jobs by
> submission time. Job 1257513 will not run until it has the highest priority.
>
> To verify, please call:
> > scontrol show job 1257513
> > scontrol top 1257513
> > scontrol show job 1257513

[root@hpcmaster 21-03-2023]# scontrol show job 1257861
JobId=1257861 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T02:08:59 EligibleTime=2023-03-21T02:08:59
   AccrueTime=2023-03-21T02:08:59
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T02:08:59
   Partition=normal AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol top 1257861
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol show job 1257861
JobId=1257861 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T02:08:59 EligibleTime=2023-03-21T02:08:59
   AccrueTime=2023-03-21T02:08:59
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T02:09:36
   Partition=normal AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#
Please call:
> scontrol show job 1257861 #I want to see if the state changed
> scontrol setdebug debug3
> scontrol setdebugflags +SelectType
> scontrol top 1257861
> scontrol show job 1257861
> sleep 5m
> scontrol setdebug verbose
> scontrol setdebugflags -SelectType
> scontrol show job 1257861

Please attach the slurmctld log from this testing period.
Please also provide the output of the following on the controller:
> date +"%Z %z"
> date +%s
> date
We need prompt responses to maintain the SEV1 status of a ticket. Please respond to comment#119 and comment#118 so that we can continue to debug.
Hi Nate,

Let me execute a new job and provide you with the data.
(In reply to Nate Rini from comment #118)
> Please call:
> > scontrol show job 1257861 #I want to see if the state changed
> > scontrol setdebug debug3
> > scontrol setdebugflags +SelectType
> > scontrol top 1257861

[root@hpcmaster 21-03-2023]# scontrol show job 1259263
JobId=1259263 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=3750 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:13 TimeLimit=60-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T08:12:10 EligibleTime=2023-03-21T08:12:10
   AccrueTime=2023-03-21T08:12:10
   StartTime=2023-03-21T08:12:14 EndTime=2023-05-20T08:12:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T08:12:13
   Partition=long AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=osxon007 BatchHost=osxon007
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol setdebug debug3
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol setdebugflags +SelectType
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol top 1259263
Job is no longer pending execution for job 1259263
[root@hpcmaster 21-03-2023]#

> > scontrol show job 1257861
> > sleep 5m
> > scontrol setdebug verbose
> > scontrol setdebugflags -SelectType
> > scontrol show job 1257861

[root@hpcmaster 21-03-2023]# sleep 5m
[root@hpcmaster 21-03-2023]# scontrol setdebug verbose
[root@hpcmaster 21-03-2023]# scontrol setdebugflags -SelectType
[root@hpcmaster 21-03-2023]# scontrol show job 1259263
JobId=1259263 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=3750 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:10:50 TimeLimit=60-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T08:12:10 EligibleTime=2023-03-21T08:12:10
   AccrueTime=2023-03-21T08:12:10
   StartTime=2023-03-21T08:12:14 EndTime=2023-05-20T08:12:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T08:12:13
   Partition=long AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=osxon007 BatchHost=osxon007
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#

> > Please attach the slurmctld log from this testing period.

Since 7:00 AM IST this morning we see that jobs are dispatching again as usual and Slurm is working fine. However, we want to know the root cause of this.

Regards,
Debajit Dutta
(In reply to Nate Rini from comment #119)
> Please also provide the output of the following on the controller:
> > date +"%Z %z"
> > date +%s
> > date

[root@hpcmaster 21-03-2023]# date +"%Z %z"
IST +0530
[root@hpcmaster 21-03-2023]# date +%s
1679367218
[root@hpcmaster 21-03-2023]# date
Tue Mar 21 08:23:48 IST 2023
[root@hpcmaster 21-03-2023]#
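As a cross-check (a sketch, not part of the original exchange), the reported epoch value can be converted back to a calendar time with GNU date; printed in UTC here for determinism, it agrees with the controller's IST clock to within the few seconds between the two commands.

```shell
# Convert the controller's reported epoch back to a calendar time (GNU date).
# 1679367218 is the value captured above.
TZ=UTC date -d @1679367218 +"%Y-%m-%dT%H:%M:%S%z"
# → 2023-03-21T02:53:38+0000 (i.e. 08:23:38 IST, consistent with the `date` output)
```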
Created attachment 29430 [details] slurmctld.log from the slurm controller on 21-03-2023 08:32 AM IST
(In reply to Nate Rini from comment #118)
> Please call:
> > scontrol show job 1257861 #I want to see if the state changed
> > scontrol setdebug debug3
> > scontrol setdebugflags +SelectType
> > scontrol top 1257861
> > scontrol show job 1257861
> > sleep 5m
> > scontrol setdebug verbose
> > scontrol setdebugflags -SelectType
> > scontrol show job 1257861
>
> Please attach the slurmctld log from this testing period.

Log attached: slurmctld.log from the slurm controller on 21-03-2023 08:32 AM IST.
Hi Team,

The issue is recurring and causing problems. No jobs are executing and it is impacting our production. Please arrange a call immediately.

[2023-03-21T12:12:26.784] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=5828060 began=12:12:20.956
[2023-03-21T12:12:27.745] Warning: Note very large processing time from _slurm_rpc_allocate_resources: usec=6770181 began=12:12:20.975
[2023-03-21T12:12:27.745] sched: _slurm_rpc_allocate_resources JobId=1261189 NodeList=(null) usec=6770181
[2023-03-21T12:12:27.857] Warning: Note very large processing time from _slurmctld_background: usec=6858277 began=12:12:20.999
[2023-03-21T12:12:27.857] job_signal: 9 of pending JobId=1261148 successful
[2023-03-21T12:12:28.191] Warning: Note very large processing time from dump_all_job_state: usec=4190409 began=12:12:24.001
[2023-03-21T12:12:29.925] sched: _slurm_rpc_allocate_resources JobId=1261190 NodeList=(null) usec=30493
[2023-03-21T12:12:30.224] sched: _slurm_rpc_allocate_resources JobId=1261191 NodeList=(null) usec=123222
[2023-03-21T12:12:34.593] _job_complete: JobId=1261088 WTERMSIG 126
[2023-03-21T12:12:34.593] _job_complete: JobId=1261088 cancelled by interactive user
[2023-03-21T12:12:34.594] _job_complete: JobId=1261088 done
[2023-03-21T12:12:34.594] _slurm_rpc_complete_job_allocation: JobId=1261088 error Job/step already completing or completed
[2023-03-21T12:12:35.441] _slurm_rpc_complete_job_allocation: JobId=1261088 error Job/step already completing or completed
[2023-03-21T12:12:35.466] _slurm_rpc_complete_job_allocation: JobId=1261088 error Job/step already completing or completed
[2023-03-21T12:12:36.193] _job_complete: JobId=1261025 WEXITSTATUS 0
[2023-03-21T12:12:36.193] _job_complete: JobId=1261025 done
[2023-03-21T12:12:36.238] sched: _slurm_rpc_allocate_resources JobId=1261192 NodeList=(null) usec=26488
[2023-03-21T12:12:36.373] sched: _slurm_rpc_allocate_resources JobId=1261193 NodeList=(null) usec=26556
[2023-03-21T12:12:36.611] Time limit exhausted for JobId=1174490
[2023-03-21T12:12:36.826] _slurm_rpc_complete_job_allocation: JobId=1174490 error Job/step already completing or completed
[2023-03-21T12:12:39.894] _job_complete: JobId=1261028 WTERMSIG 126
[2023-03-21T12:12:39.894] _job_complete: JobId=1261028 cancelled by interactive user
[2023-03-21T12:12:39.894] _job_complete: JobId=1261028 done
[2023-03-21T12:12:39.894] _slurm_rpc_complete_job_allocation: JobId=1261028 error Job/step already completing or completed
[2023-03-21T12:12:41.323] _job_complete: JobId=1261023 WEXITSTATUS 0
[2023-03-21T12:12:41.323] _job_complete: JobId=1261023 done
(In reply to Nate Rini from comment #108)
> The CPU on this server is very likely just too slow to be a Slurm controller
> for the cluster. I strongly suggest looking into getting a faster server.
> Many of these issues will likely just be resolved by that.

What are the recommended hardware configurations

1. for the latest version of Slurm?
2. for our current version, i.e. 20.11.8?

Regards,
Debajit Dutta
Hi Nate,

The problem we are facing is that suddenly Slurm stops accepting any jobs: whenever a user executes a job, it goes to the PENDING state even when it requests only 1 CPU. The problem lasts for about an hour, after which Slurm gradually starts picking up the jobs that were pending and works normally again. So far this has happened twice:

1. on 20-03-2023 around 11:17 PM
2. on 21-03-2023 around 1:03 PM

Is this because of the change we made to munge on the Slurm controller last weekend, i.e. adding 8 threads to the munge service? Or is this a new issue?

Also, we still see the errors in the slurmctld log for which we opened this case:

1. error: slurm_receive_msgs: Zero Bytes were transmitted or received
2. error: Invalid nodes (osxon092s) for JobId=1190199
3. error: slurm_auth_get_host: Lookup failed for 0.0.0.0

Please let us know what steps we need to take to resolve these recurring issues.

Regards,
Debajit Dutta
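For reference, the thread-count change described above is typically applied via munged's `--num-threads` option; a sketch assuming a sysconfig-style install (the file path varies by distribution and is an assumption here):

```shell
# /etc/sysconfig/munge (assumed path; e.g. /etc/default/munge on Debian-based systems)
DAEMON_ARGS="--num-threads=8"
```

After editing, the munge service must be restarted for the new thread count to take effect.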
(In reply to Openfive Support from comment #122)
> Now from morning 7:00 AM IST we see that the jobs are dispatching again as
> usual the slurm is working fine. However, we want to know the root cause of
> it.

JobId=1259263 was submitted with a priority that was too low to be scheduled immediately. The request to call `scontrol top 1259263` forced the job to have the highest priority. Slurm then scheduled the job and it was started on osxon007. This appears to be normal operation for Slurm.

It seems like you could greatly benefit from tuning the scheduler, which would help with both job starts and backfill job placement. We see that you previously set SchedulerParameters but then commented it out:

> SchedulerParameters=bf_continue,bf_max_job_test=10000

Can you explain the decision behind this?

Regarding your current situation: if your scheduler is not tuned for your workload, it is likely that backfill never looks deep enough into the queue to reach the jobs you would expect to run.

Slurm has a main scheduler, a quick scheduler that runs at job submission, and a backfill scheduler. Once the main scheduler reaches the first job it cannot run, it stops processing the job queue. This is normally the highest-priority job, but it could be another job further down the queue, depending on job size. The job queue is ordered by computed job priority. Slurm also tries to run a job at submission time if the requested resources are available and starting it would not delay higher-priority jobs.

Backfill runs on a set cadence and will backfill jobs as long as doing so does not delay the start time of your higher-priority jobs. For backfill to work correctly, your jobs need to request a run time and the scheduler needs to be configured for your site's workflow.

1. How long is your longest-running job?
2. Does your site prefer larger or smaller jobs to take priority, or a mix of both?
3. Should your site limit how many jobs one user can start at each backfill iteration, or in total through backfill?
4.
Roughly how many jobs does your site submit and finish in a day, week, or month? The following command can help you understand your mixture of jobs, runtimes, and sizes in order to answer this question:

> $ squeue -o "%.P | %.A | %.u | %.h | %.c | %.C | %.D | %.e | %.l | %.N | %.p | %.S | %.T | %.V"

(In reply to Openfive Support from comment #128)
> Is it because of the changes we made in munge in the slurm controller last
> weekend, i.e. added 8 threads to munge service?
> Or this is a new issue?
>
> Also, we still find the errors in the slurmctl log for which we opened this
> case.
>
> 1. error: slurm_receive_msgs: Zero Bytes were transmitted or received
> 3. error: slurm_auth_get_host: Lookup failed for 0.0.0.0
>
> Please let us know what steps we need to take to resolve these recurring
> issues.

This is all the same issue of munge being overloaded. The controller's hardware is too slow to handle the workload. I suspect this period was induced by a user sending too many RPCs; calling `sdiag` should be able to determine who. We have already configured munge in the best possible way, with one thread per physical core. Munge is a cryptographic service and is entirely reliant on the speed of the core and on whether there is support for more advanced SSE operations. If munge has an issue or is slow, then Slurm has to wait for it during communications.

> 2. error: Invalid nodes (osxon092s) for JobId=1190199

Please provide the entire log message. If this is the entire log message, then please provide at least 5 messages from before and after.

(In reply to Openfive Support from comment #127)
> (In reply to Nate Rini from comment #108)
> > The CPU on this server is very likely just too slow to be a Slurm controller
> > for the cluster. I strongly suggest looking into getting a faster server.
> > Many of these issues will likely just be resolved by that.
>
> What are the recommended hardware configurations
>
> 1. for the latest version of Slurm?
> 2. for our current version i.e. 20.11.8 ?
We generally don't make different suggestions based on the Slurm version installed. The slowest blocking operations (calling out to munge) have actually been unchanged for many years.

We have a list of publications on our website: https://slurm.schedmd.com/publications.html

I suggest watching and reading this one specifically:

> Field Notes 6: From The Frontlines of Slurm Support, Video, Jason Booth, SchedMD
> https://slurm.schedmd.com/SLUG22/Field_Notes_6.pdf
> https://youtu.be/njEgeMUAqMY

While I suggest watching/reading the entire presentation, the relevant slides start at #17. To avoid repeating the same information on this ticket, please watch and read the entire presentation. We will be happy to answer any additional questions or provide clarifications on the content of the slides.
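As an illustration of the kind of backfill tuning discussed above, a slurm.conf fragment might look like the following. The values here are assumptions to be adjusted to the site's workload, not a recommendation for this cluster; only `bf_continue` and `bf_max_job_test` appear in the previously commented-out setting.

```conf
# slurm.conf fragment (illustrative values only; tune per workload)
# bf_continue      - resume backfill scanning after locks are released
# bf_max_job_test  - how many jobs backfill considers per iteration
# bf_max_job_user  - cap on jobs started per user per backfill iteration
# bf_window        - how far ahead (seconds) backfill plans; >= longest job time limit
# bf_interval      - seconds between backfill iterations
SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_max_job_user=50,bf_window=86400,bf_interval=60
```

Note that `bf_window` generally needs to cover the longest job time limit in use, which is why the answer to "how long is your longest-running job?" matters for choosing it.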
*** Ticket 16329 has been marked as a duplicate of this ticket. ***
(In reply to Nate Rini from comment #131)
> 1. How long is your longest running job?
> 2. Does your site prefer larger or smaller jobs to take priority, or a mix
> of both?
> 3. Should your site limit how many jobs one user can start at each backfill
> iteration or in total through backfill?
> 4. Roughly how many jobs does your site submit and finish in a day, week,
> month?
>
> The following command can help you understand your mixture of jobs, runtime
> and size in order to answer this question.
> >$ squeue -o "%.P | %.A | %.u | %.h | %.c | %.C | %.D | %.e | %.l | %.N | %.p | %.S | %.T | %.V"

Please provide the requested information. I know some of it was covered in the concall, but having all of it will be helpful.

After the meeting, I assume that this config has been reactivated:

> SchedulerParameters=bf_continue,bf_max_job_test=10000

Please provide a current slurm.conf and the output of sdiag.
Please provide a status update. We are waiting on a reply to comment#135
(In reply to Nate Rini from comment #139) > Please provide a status update. We are waiting on a reply to comment#135 Please provide a status update. We are waiting on a reply to comment#135
I'm going to time this ticket out. I assume the process of converting to a new machine for the controller is taking a while. Please respond when that is complete and we can continue debugging the issues (if needed).