Hi Team,

We are not able to execute any jobs and are getting the below errors in slurmctld.log:
> error: slurm_receive_msgs: Zero Bytes were transmitted or received
> error: slurm_auth_get_host: Lookup failed for 0.0.0.0

Please let us know what caused these errors.

Regards,
Debajit Dutta
Please attach slurm.conf and related files from the cluster. When did these errors start? What has changed recently in the cluster?
Created attachment 29211 [details] Slurm configuration file from the slurm controller
Hi Nate,

We have attached the slurm.conf file. We are getting these from today itself. Also, there was no change in the slurm.conf file.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #3)
> We are getting these from today itself.
> Also, there was no change in the slurm.conf file.

Generally, when we see these errors:
> error: slurm_receive_msgs: Zero Bytes were transmitted or received
> error: slurm_auth_get_host: Lookup failed for 0.0.0.0

it means there is an authentication error. What is the release version of the cluster? Please call:
> slurmctld -V
Hi Nate,

We are randomly facing this issue, we are not able to execute jobs. Can we have a call now on this?

Regards,
Debajit Dutta
Hi Nate,

Also, the output of slurmctld -V is below:
> slurm 20.11.8

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #6)
> We are randomly facing this issue, we are not able to execute jobs.

Are all jobs not able to execute or is this just a specific type of job?

Please provide the following output for one of the jobs that are not executing:
> sacct -o all -p -j $FAILED_JOB_ID

Please attach the slurmctld log and the slurmd log from one of the nodes where jobs are not executing. Please also attach logs from the job above.
Hi Nate,

We are randomly getting these errors, sometimes the jobs are getting dispatched but sometimes it is getting out with these errors. Also, in addition to the previous errors we are getting the following as well:
> srun: error: Unable to allocate resources: Socket timed out on send/recv operation

We will be adding the requested logs.

Regards,
Debajit Dutta
Please also provide the output of the following:
> sdiag
> scontrol ping
> scontrol show nodes
> scontrol show partitions
(In reply to Openfive Support from comment #9)
> We are randomly getting these errors, sometimes the jobs are getting
> dispatched but sometimes it is getting out with these errors.

So there are jobs running but some jobs are failing?

> srun: error: Unable to allocate resources: Socket timed out on send/recv
> operation

Please make sure to add '-vvvv' to the srun call as an argument to get more verbose logs of this.
Created attachment 29213 [details] sdiag from the slurm controller
Created attachment 29214 [details] scontrol show nodes from slurm controller
Created attachment 29215 [details] scontrol show partitions from slurm controller
(In reply to Nate Rini from comment #10)

Hi Nate,

> Please also provide the output of the following:
> > sdiag

We have attached the file for this.

> > scontrol ping

Below is the output from the slurm controller:
> [root@hpcmaster Documents]# scontrol ping
> Slurmctld(primary) at hpcmaster is UP
> Slurmctld(backup) at hpcslave is UP

> > scontrol show nodes

We have attached the file for this.

> > scontrol show partitions

We have attached the file for this.

Please check and let us know.

Regards,
Debajit Dutta
Reviewing the logs now
(In reply to Openfive Support from comment #13)
> Created attachment 29214 [details]
> scontrol show nodes from slurm controller

NodeName=slurm-dashboard needs to be upgraded to the current running version of 20.11.8. Please attach the slurmd logs from this node.
(In reply to Nate Rini from comment #8)
> (In reply to Openfive Support from comment #6)
> > We are randomly facing this issue, we are not able to execute jobs.
>
> Are all jobs not able to execute or is this just a specific type of job?
>
> Please provide the following output for one of the jobs that are not
> executing:
> > sacct -o all -p -j $FAILED_JOB_ID
>
> Please attach slurmctld log and slurmd log from one of the nodes where jobs
> are not executing. Please also attach logs from the job above.

Please also attach the logs requested in comment#8. A zip or tarball of the logs is generally preferred to avoid having to do multiple attachments.
(In reply to Nate Rini from comment #18)
> Please also attach the logs requested in comment#8. A zip or tarball of the
> logs is generally preferred to avoid having to do multiple attachments.

The last day of logs is sufficient. No need to attach all logs timestamped from before that.
I'm reducing ticket severity to SEV2. The logs from sdiag show jobs are running, and we require a site to respond proactively to a ticket to maintain SEV1 status. We take SEV1 tickets very seriously, and a lack of response to requested information causes wasted resources on our part.

Please provide the logs requested in comment#8. We currently lack sufficient data to diagnose the issue.
(In reply to Nate Rini from comment #8)

Hi Nate,

> > We are randomly facing this issue, we are not able to execute jobs.
>
> Are all jobs not able to execute or is this just a specific type of job?

A few jobs are running, but this is happening very intermittently; the jobs are not getting dispatched to any nodes and are exiting back to the prompt with these errors. For example, see the below job:
> [vishalkrishnat@osvnc002 ~]$ srun -p normal --pty /bin/tcsh
> srun: error: Unable to allocate resources: Socket timed out on send/recv operation

Here, we did not get any job ID and the job exited with the error.

> Please provide the following output for one of the jobs that are not
> executing:
> > sacct -o all -p -j $FAILED_JOB_ID

We are not getting job IDs; the jobs are simply exiting with errors.

> Please attach slurmctld log and slurmd log from one of the nodes where jobs
> are not executing. Please also attach logs from the job above.

We will attach the logs.

Regards,
Debajit Dutta
Created attachment 29219 [details] slurmctld.log from the slurm controller
Created attachment 29220 [details] slurm all nodes slurmd.log
(In reply to Openfive Support from comment #21)
> We will attach the logs.

While attaching logs, please verify the following:

1. munged is running on all nodes:
> systemctl status munge

2. munge is using the same key on all nodes:
> # sha1sum /etc/munge/munge.key
> xxxxxxxxxxdc3d8f1629e3dfef7a31 /etc/munge/munge.key

Please do not send us or share your munge key or the sha1sum output on this ticket. Verify they are all exactly the same.
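The per-node comparison can be scripted. A minimal sketch, assuming the checksums have already been collected (e.g. via ssh or Ansible) into `host checksum` pairs; the hostnames and checksum values below are illustrative placeholders, not real values:

```shell
# check_keys reads "host checksum" pairs on stdin and reports whether every
# checksum matches the first one seen. Feed it the output of
# "sha1sum /etc/munge/munge.key" gathered from each node.
check_keys() {
  ref=''
  while read -r host sum; do
    [ -z "$ref" ] && ref="$sum"
    if [ "$sum" != "$ref" ]; then
      echo "MISMATCH on $host"
      return 1
    fi
  done
  echo "all keys match"
}

# Illustrative input only; real checksums must never be pasted into this ticket.
printf '%s\n' 'hpcmaster aaa111' 'hpcslave aaa111' 'osvnc007 aaa111' | check_keys
```

With the illustrative input above this prints `all keys match`; a node with a different key is reported by name instead.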
Hi Nate,

Since this issue is coming from more than one node, and also occurs before the jobs get dispatched to any node, we have used Ansible to gather the logs from all the nodes. We have attached the same here.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #23)
> Created attachment 29220 [details]
> slurm all nodes slurmd.log
>
> error: Node configuration differs from hardware: CPUs=32:32(hw) Boards=1:1(hw) SocketsPerBoard=32:2(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=1:2(hw)

Any node dumping this error needs to be reconfigured. Please use 'slurmd -C' to get the detected configuration to correct slurm.conf.
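The detected values can be lifted straight from the `slurmd -C` output line. A small sketch using an assumed sample line (the real line must come from running `slurmd -C` on the affected node; the hostname and RealMemory value here are placeholders):

```shell
# Extract the topology fields from a "slurmd -C"-style NodeName line so they
# can be pasted into the node's slurm.conf entry. The sample line below is an
# assumption for illustration, not captured from this cluster.
line='NodeName=osvnc007 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000'

for field in CPUs Boards SocketsPerBoard CoresPerSocket ThreadsPerCore; do
  printf '%s ' "$(printf '%s\n' "$line" | grep -o "${field}=[0-9]*")"
done
echo
```

For the sample line this prints `CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2`, which matches the `(hw)` side of the error above.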
(In reply to Openfive Support from comment #22)
> Created attachment 29219 [details]
> slurmctld.log from the slurm controller
>
> error: slurm_auth_get_host: Lookup failed for 0.0.0.0

(In reply to Openfive Support from comment #23)
> Created attachment 29220 [details]
> slurm all nodes slurmd.log
>
> debug2: Error connecting slurm stream socket at 192.168.2.127:6817: Connection timed out

Please provide the output of the following from osvnc007, hpcmaster, and hpcslave:
> getent hosts osvnc007
> getent hosts hpcmaster
> getent hosts 192.168.2.127
> getent hosts hpcslave
> getent hosts 192.168.2.107
(In reply to Nate Rini from comment #27)

Hi Nate,

Below is the output:

From osvnc007:
> [root@osvnc007 ~]# getent hosts osvnc007
> 192.168.2.96    osvnc007.open-silicon.com osvnc007
> [root@osvnc007 ~]# getent hosts hpcmaster
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@osvnc007 ~]# getent hosts 192.168.2.127
> 192.168.2.127   osncmaster.open-silicon.com osncmaster
> [root@osvnc007 ~]# getent hosts hpcslave
> 192.168.2.107   hpcslave.open-silicon.com hpcslave
> [root@osvnc007 ~]# getent hosts 192.168.2.107
> 192.168.2.107   hpcslave.open-silicon.com hpcslave

From hpcmaster:
> [root@hpcmaster Documents]# getent hosts osvnc007
> 192.168.2.96    osvnc007.open-silicon.com osvnc007
> [root@hpcmaster Documents]# getent hosts hpcmaster
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@hpcmaster Documents]# getent hosts 192.168.2.127
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@hpcmaster Documents]# getent hosts hpcslave
> 192.168.2.107   hpcslave.open-silicon.com hpcslave
> [root@hpcmaster Documents]# getent hosts 192.168.2.107
> 192.168.2.107   hpcslave.open-silicon.com hpcslave

From hpcslave:
> [root@hpcslave ~]# getent hosts osvnc007
> 192.168.2.96    osvnc007.open-silicon.com osvnc007
> [root@hpcslave ~]# getent hosts hpcmaster
> 192.168.2.127   hpcmaster.open-silicon.com hpcmaster
> [root@hpcslave ~]# getent hosts 192.168.2.127
> 192.168.2.127   osncmaster.open-silicon.com osncmaster
> [root@hpcslave ~]# getent hosts hpcslave
> 192.168.2.107   hpcslave.open-silicon.com hpcslave
> [root@hpcslave ~]# getent hosts 192.168.2.107
> 192.168.2.107   hpcslave.open-silicon.com hpcslave

Regards,
Debajit Dutta
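As an aside, the forward-and-reverse consistency being probed here can also be scripted so each host checks itself. A minimal sketch, not something requested in this ticket; `localhost` is used purely as a runnable example in place of the cluster hostnames:

```shell
# For each name given, resolve it to an address with getent, then
# reverse-resolve that address and print the chain, so a reverse record whose
# canonical name differs from the name it was reached by stands out at a glance.
check_dns() {
  for name in "$@"; do
    addr=$(getent hosts "$name" | awk '{print $1; exit}')
    if [ -z "$addr" ]; then
      echo "$name: no forward record"
      continue
    fi
    back=$(getent hosts "$addr" | awk '{print $2; exit}')
    echo "$name -> $addr -> $back"
  done
}

check_dns localhost
```

On the cluster this would be invoked with the real hostnames, e.g. `check_dns osvnc007 hpcmaster hpcslave`, on each of the three hosts.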
Please provide the output of the following from osvnc007, hpcmaster, and hpcslave:
> getent ahostsv6 osvnc007
> getent ahostsv6 hpcmaster
> getent ahostsv6 192.168.2.127
> getent ahostsv6 hpcslave
> getent ahostsv6 192.168.2.107

Was the IPv6 configuration of the cluster changed recently?
Please also provide the output from osvnc007, hpcmaster, and hpcslave of this command:
> systemctl status munge
(In reply to Nate Rini from comment #30)

Hi Nate,

Below are the output details:

For osvnc007:
> [root@osvnc007 ~]# getent ahostsv6 osvnc007
> ::ffff:192.168.2.96 STREAM osvnc007
> ::ffff:192.168.2.96 DGRAM
> ::ffff:192.168.2.96 RAW
> [root@osvnc007 ~]# getent ahostsv6 hpcmaster
> ::ffff:192.168.2.127 STREAM hpcmaster
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@osvnc007 ~]# getent ahostsv6 192.168.2.127
> ::ffff:192.168.2.127 STREAM 192.168.2.127
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@osvnc007 ~]# getent ahostsv6 hpcslave
> ::ffff:192.168.2.107 STREAM hpcslave
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW
> [root@osvnc007 ~]# getent ahostsv6 192.168.2.107
> ::ffff:192.168.2.107 STREAM 192.168.2.107
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW

For hpcmaster:
> [root@hpcmaster Documents]# getent ahostsv6 osvnc007
> ::ffff:192.168.2.96 STREAM osvnc007
> ::ffff:192.168.2.96 DGRAM
> ::ffff:192.168.2.96 RAW
> [root@hpcmaster Documents]# getent ahostsv6 hpcmaster
> ::ffff:192.168.2.127 STREAM hpcmaster.open-silicon.com
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcmaster Documents]# getent ahostsv6 192.168.2.127
> ::ffff:192.168.2.127 STREAM 192.168.2.127
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcmaster Documents]# getent ahostsv6 hpcslave
> ::ffff:192.168.2.107 STREAM hpcslave.open-silicon.com
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW
> [root@hpcmaster Documents]# getent ahostsv6 192.168.2.107
> ::ffff:192.168.2.107 STREAM 192.168.2.107
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW

For hpcslave:
> [root@hpcslave ~]# getent ahostsv6 osvnc007
> ::ffff:192.168.2.96 STREAM osvnc007
> ::ffff:192.168.2.96 DGRAM
> ::ffff:192.168.2.96 RAW
> [root@hpcslave ~]# getent ahostsv6 hpcmaster
> ::ffff:192.168.2.127 STREAM hpcmaster
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcslave ~]# getent ahostsv6 192.168.2.127
> ::ffff:192.168.2.127 STREAM 192.168.2.127
> ::ffff:192.168.2.127 DGRAM
> ::ffff:192.168.2.127 RAW
> [root@hpcslave ~]# getent ahostsv6 hpcslave
> ::ffff:192.168.2.107 STREAM hpcslave
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW
> [root@hpcslave ~]# getent ahostsv6 192.168.2.107
> ::ffff:192.168.2.107 STREAM 192.168.2.107
> ::ffff:192.168.2.107 DGRAM
> ::ffff:192.168.2.107 RAW

> Was the IPv6 configuration of the cluster changed recently?

No.

Regards,
Debajit Dutta
(In reply to Nate Rini from comment #31)

Hi Nate,

Below are the output details:

> Please also provide the output from osvnc007, hpcmaster, and hpcslave of
> this command:
> > systemctl status munge

For osvnc007:
> [root@osvnc007 ~]# systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Sun 2022-07-31 11:00:06 IST; 7 months 7 days ago
>      Docs: man:munged(8)
>  Main PID: 855 (munged)
>     Tasks: 4
>    Memory: 892.0K
>    CGroup: /system.slice/munge.service
>            └─855 /usr/sbin/munged
>
> Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

For hpcmaster:
> [root@hpcmaster Documents]# systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Fri 2022-07-29 23:55:32 IST; 7 months 9 days ago
>      Docs: man:munged(8)
>   Process: 1375 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 1381 (munged)
>     Tasks: 4
>    CGroup: /system.slice/munge.service
>            └─1381 /usr/sbin/munged
>
> Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

For hpcslave:
> [root@hpcslave ~]# systemctl status munge
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>    Active: active (running) since Wed 2022-12-07 15:54:51 IST; 3 months 0 days ago
>      Docs: man:munged(8)
>   Process: 1198 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
>  Main PID: 1220 (munged)
>     Tasks: 4
>    CGroup: /system.slice/munge.service
>            └─1220 /usr/sbin/munged
>
> Dec 07 15:54:51 hpcslave systemd[1]: Starting MUNGE authentication service...
> Dec 07 15:54:51 hpcslave systemd[1]: Started MUNGE authentication service.

Regards,
Debajit Dutta
(In reply to Nate Rini from comment #24)

Hi Nate,

> While attaching logs, please verify the following:
>
> 1. munged is running on all nodes
> > systemctl status munge

Yes, I have verified munge service is active and running in all slurm nodes.

> 2. munge is using the same key on all nodes
>
> Please do not send us or share your munge key or the sha1sum output on this
> ticket. Verify they are all exactly the same.

Yes, I have verified all slurm nodes have the exact same munge key.

Regards,
Debajit Dutta
Hi Nate,

Please review the information and logs I have sent. Also, I have replied to all questions, however, let me know in case I have missed any.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #34)
> Yes, I have verified munge service is active and running in all slurm nodes.

Are there any errors in the munge logs?

Please restart all munge daemons on all nodes. Once done, start slurmd in foreground mode on a node that is marked down:
> slurmd -Dvvvvvvv

Please attach the log from slurmd and then slurmctld while it is starting up.
(In reply to Openfive Support from comment #35)
> Also, I have replied to all questions, however, let me know in case I have
> missed any.

The error we are seeing in the logs is consistent with a munge authentication issue. Munge does not log verbosely, so we are going to have to take a few extra steps to determine the cause.

The cluster doesn't have a large number of nodes, but it is possible munge is getting overwhelmed. Setting an increased number of threads is suggested:
> https://slurm.schedmd.com/high_throughput.html#munge_config
(In reply to Nate Rini from comment #36)

Hi Nate,

> Are there any errors in the munge logs?
>
> Please restart all munge daemons on all nodes. Once done, start slurmd in
> foreground mode on a node that is marked down.

I didn't get what you meant by "marked down"? If this is about the node state, there are only two nodes which are marked as down, but those are not reachable remotely as of now.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #38)
> I didn't get what you meant by "marked down"? If this is about the node
> state, there are only two nodes which are marked as down, but those are not
> reachable remotely as of now.

Please provide the output of this command:
> scontrol show nodes
Created attachment 29245 [details] slurmd_all_nodes_details_09-03-2023
(In reply to Nate Rini from comment #39)

Hi Nate,

> Please provide the output of this command:
> > scontrol show nodes

I have attached the output of the above command. File name: slurmd_all_nodes_details_09-03-2023

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #38)
> I didn't get what you meant by "marked down"? If this is about the node
> state, there are only two nodes which are marked as down, but those are not
> reachable remotely as of now.

I was referring to this state:
> State=DOWN*

In this state, slurmctld is unable to talk to slurmd.

(In reply to Openfive Support from comment #40)
> Created attachment 29245 [details]
> slurmd_all_nodes_details_09-03-2023
>
> NodeName=osxon002 Arch=x86_64 CoresPerSocket=1
> State=IDLE

Since the 2 DOWN* nodes are expected to be in that state, can we try starting slurmd in foreground mode on this node since it is idle?
(In reply to Nate Rini from comment #37)
> The cluster doesn't have a large number of nodes but it is possible munge is
> getting overwhelmed. Setting an increased number of threads is suggested:
> > https://slurm.schedmd.com/high_throughput.html#munge_config

Please tell me once this config change is implemented for munge.
(In reply to Nate Rini from comment #43)

Hi Nate,

> Please tell me once this config change is implemented for munge.

So, should I first:

1. Set osxon002 as DOWN for our testing purpose
2. Implement this config: https://slurm.schedmd.com/high_throughput.html#munge_config
3. Restart the munge service on all nodes
4. Start slurmd in foreground mode on osxon002 as the node will be marked down:
> slurmd -Dvvvvvvv
5. Attach the log from slurmd and then slurmctld while it is starting up

Please let me know if the sequence in the above steps is correct.

Regards,
Debajit Dutta
> 1. Set osxon002 as DOWN for our testing purpose

I suggest setting it to drain to avoid killing a job that may start on it.

> Please let me know if the sequence in the above steps is correct.

Restarting munge could be done at any time. Please make sure to add the extra threads to munge before restarting it.
(In reply to Nate Rini from comment #45)
> I suggest setting it to drain to avoid killing a job that may start on it.

OK, I have set the osxon002 state to down.

> Restarting munge could be done at any time. Please make sure to add the
> extra threads to munge before restarting it.

How to check what the current number of threads munge is configured to? Also, do I need to add extra threads to munge only on the slurm master server or on all nodes?
(In reply to Openfive Support from comment #46)
> How to check what the current number of threads munge is configured to?

Munge doesn't have a configuration file, which means all options are passed as arguments at invocation. I used `systemctl status munge` to get this:
> Process: 1375 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)

It shows that munge is getting started without any arguments.

> Also, do I need to add extra threads to munge only on slurm master server or
> on all nodes?

Most likely the config change is only needed on the controllers, but it won't hurt to have it on all nodes.
Hi Nate,

While executing the command for munge, I am getting the below error:

> [root@osxon002 ~]# munged --num-threads 10
> munged: Error: Logfile is insecure: "/var/log/munge/munged.log" should be owned by UID 0
> [root@osxon002 ~]# ll /var/log/munge/munged.log
> -rw-r----- 1 munge munge 0 Feb 17 03:40 /var/log/munge/munged.log

Should I change the owner of the above log file from munge to user root? Please let us know.

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #48)
> munged: Error: Logfile is insecure: "/var/log/munge/munged.log" should be owned by UID 0
>
> Should I change the owner of the above log file from munge to user root ?

No. The systemd unit file (or a drop-in) is required to be modified to set munge's arguments at startup. Changing the ownership would only break munge when munge is started by systemd as the `munge` user.
(In reply to Nate Rini from comment #49)
> No. The systemd unit file (or a drop-in) is required to be modified to set
> munge's arguments at startup. Changing the ownership would only break munge
> when munge is started by systemd as the `munge` user.

Are instructions on how to do this needed?
(In reply to Nate Rini from comment #50)

Hi Nate,

> Are instructions on how to do this needed?

Yes, it would be really great if you could provide me with the instructions on how to do this.

Regards,
Debajit Dutta
Please attach /usr/lib/systemd/system/munge.service
(In reply to Nate Rini from comment #52)

Hi Nate,

> Please attach /usr/lib/systemd/system/munge.service

Below is the content of the file:

> [Unit]
> Description=MUNGE authentication service
> Documentation=man:munged(8)
> After=network.target
> After=syslog.target
> After=time-sync.target
>
> [Service]
> Type=forking
> ExecStart=/usr/sbin/munged
> PIDFile=/var/run/munge/munged.pid
> User=munge
> Group=munge
> Restart=on-abort
>
> [Install]
> WantedBy=multi-user.target

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #53)
> (In reply to Nate Rini from comment #52)
> > Please attach /usr/lib/systemd/system/munge.service

How was munge installed? Using RPMs?
Follow this procedure:
> mkdir -p /usr/lib/systemd/system/munge.service.d

Populate file: /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> ExecStart=/usr/sbin/munged -M --num-threads 10

Reload and restart:
> systemctl daemon-reload
> systemctl restart munge
(In reply to Nate Rini from comment #54)

Hi Nate,

> How was munge installed? using rpms?

Yes, munge was installed using rpms. Below are the munge rpms that were installed:
> munge-libs-0.5.11-3.el7.x86_64.rpm
> munge-0.5.11-3.el7.x86_64.rpm
> munge-devel-0.5.11-3.el7.x86_64.rpm

Regards,
Debajit Dutta
Once the changes in comment#55 are applied, please call this and attach the log:
> $ echo test | munge | unmunge
Hi Nate,

After the restart of munge.service we are getting the below errors:

> [root@oslab002 system]# systemctl restart munge.service
> Failed to restart munge.service: Unit is not loaded properly: Invalid argument.
> See system logs and 'systemctl status munge.service' for details.
> [root@oslab002 system]# systemctl status munge.service
> ● munge.service - MUNGE authentication service
>    Loaded: error (Reason: Invalid argument)
>   Drop-In: /usr/lib/systemd/system/munge.service.d
>            └─local.conf
>    Active: active (running) since Fri 2022-11-18 14:27:51 IST; 3 months 21 days ago
>      Docs: man:munged(8)
>  Main PID: 1345 (munged)
>    CGroup: /system.slice/munge.service
>            └─1345 /usr/sbin/munged
>
> Nov 18 14:27:50 oslab002 systemd[1]: Starting MUNGE authentication service...
> Nov 18 14:27:51 oslab002 systemd[1]: Started MUNGE authentication service.
> Mar 10 22:42:59 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.

Regards,
Debajit Dutta
Please swap to this and follow the procedure in comment#55.

Populate file: /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> Type=simple
> ExecStart=/usr/sbin/munged -M --num-threads 10 -F
(In reply to Nate Rini from comment #59)

Hi Nate,

> Please swap to this and follow procedure in comment#55

We are still getting the same error:

> [root@oslab002 system]# cat /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> Type=simple
> ExecStart=/usr/sbin/munged -M --num-threads 10 -F
> [root@oslab002 system]# systemctl daemon-reload
> [root@oslab002 system]# systemctl restart munge.service
> Failed to restart munge.service: Unit is not loaded properly: Invalid argument.
> See system logs and 'systemctl status munge.service' for details.
> [root@oslab002 system]# systemctl status munge.service
> ● munge.service - MUNGE authentication service
>    Loaded: error (Reason: Invalid argument)
>   Drop-In: /usr/lib/systemd/system/munge.service.d
>            └─local.conf
>    Active: active (running) since Fri 2022-11-18 14:27:51 IST; 3 months 21 days ago
>      Docs: man:munged(8)
>  Main PID: 1345 (munged)
>    CGroup: /system.slice/munge.service
>            └─1345 /usr/sbin/munged
>
> Nov 18 14:27:50 oslab002 systemd[1]: Starting MUNGE authentication service...
> Nov 18 14:27:51 oslab002 systemd[1]: Started MUNGE authentication service.
> Mar 10 22:42:59 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.
> Mar 10 23:22:57 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.

Regards,
Debajit Dutta
Please revert the changes for now and provide this:
> $ systemd --version
(In reply to Nate Rini from comment #61)
> Please revert the changes for now and provide this:
> > $ systemd --version

> [root@oslab002 system]# systemd --version
> bash: systemd: command not found...
Please call:
> cat /etc/os-release
> lsb_release -a
> ps -ef|grep systemd
(In reply to Nate Rini from comment #63)
> Please call:
> > cat /etc/os-release
> > lsb_release -a
> > ps -ef|grep systemd

> [root@oslab002 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
> ID="centos"
> ID_LIKE="rhel fedora"
> VERSION_ID="7"
> PRETTY_NAME="CentOS Linux 7 (Core)"
> ANSI_COLOR="0;31"
> CPE_NAME="cpe:/o:centos:centos:7"
> HOME_URL="https://www.centos.org/"
> BUG_REPORT_URL="https://bugs.centos.org/"
> CENTOS_MANTISBT_PROJECT="CentOS-7"
> CENTOS_MANTISBT_PROJECT_VERSION="7"
> REDHAT_SUPPORT_PRODUCT="centos"
> REDHAT_SUPPORT_PRODUCT_VERSION="7"
> [root@oslab002 ~]# lsb_release -a
> LSB Version:    :core-4.1-amd64:core-4.1-ia32:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-ia32:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-ia32:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
> Distributor ID: CentOS
> Description:    CentOS Linux release 7.9.2009 (Core)
> Release:        7.9.2009
> Codename:       Core
> [root@oslab002 ~]# ps -ef | grep systemd
> root      1      0  0 2022  ?     00:33:27 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
> root      611    1  0 2022  ?     00:00:47 /usr/lib/systemd/systemd-journald
> root      658    1  0 2022  ?     00:00:00 /usr/lib/systemd/systemd-udevd
> dbus      868    1  0 2022  ?     00:02:42 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
> root      1449   1  0 2022  ?     00:00:57 /usr/lib/systemd/systemd-logind
> root      1453   1  0 2022  ?     00:11:57 /usr/sbin/automount --systemd-service --dont-check-daemon
> root      12117  12011 0 23:53 pts/0 00:00:00 grep --color=auto systemd
Please call:
> /usr/lib/systemd/systemd --version
(In reply to Nate Rini from comment #65)
> Please call:
> > /usr/lib/systemd/systemd --version

> [root@oslab002 ~]# /usr/lib/systemd/systemd --version
> systemd 219
> +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
Please swap to this and follow the procedure in comment#55.

Populate file: /usr/lib/systemd/system/munge.service.d/local.conf
> [Service]
> Type=simple
> ExecStart=
> ExecStart=/usr/sbin/munged -M --num-threads 10 -F

This testing can be done on any node. Please don't test on the controllers.
(In reply to Nate Rini from comment #67)
> This testing can be done on any node. Please don't test on the controllers.

I have updated the content of the file as above. This time there are no errors, but after daemon-reload and a munge service restart, the munge service is failing. Below is the output of systemctl status munge after the restart:

> [root@oslab002 ~]# systemctl restart munge.service
> [root@oslab002 ~]# systemctl status munge.service
> ● munge.service - MUNGE authentication service
>    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
>   Drop-In: /usr/lib/systemd/system/munge.service.d
>            └─local.conf
>    Active: failed (Result: exit-code) since Sat 2023-03-11 00:41:29 IST; 2s ago
>      Docs: man:munged(8)
>   Process: 15433 ExecStart=/usr/sbin/munged -M --num-threads 10 -F (code=exited, status=1/FAILURE)
>  Main PID: 15433 (code=exited, status=1/FAILURE)
>
> Mar 11 00:41:29 oslab002 systemd[1]: Started MUNGE authentication service.
> Mar 11 00:41:29 oslab002 munged[15433]: munged: Notice: Running on "oslab002.open-silicon.com" (172.16.24.83)
> Mar 11 00:41:29 oslab002 systemd[1]: munge.service: main process exited, code=exited, status=1/FAILURE
> Mar 11 00:41:29 oslab002 systemd[1]: Unit munge.service entered failed state.
> Mar 11 00:41:29 oslab002 systemd[1]: munge.service failed.
Please provide: > sudo journalctl --unit munge
(In reply to Nate Rini from comment #69)
> Please provide:
> > sudo journalctl --unit munge

[root@oslab002 ~]# sudo journalctl --unit munge
-- Logs begin at Fri 2022-11-18 14:27:26 IST, end at Sat 2023-03-11 00:55:21 IST. --
Nov 18 14:27:50 oslab002 systemd[1]: Starting MUNGE authentication service...
Nov 18 14:27:51 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 10 22:42:59 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.
Mar 10 23:22:57 oslab002 systemd[1]: munge.service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing.
Mar 10 23:32:33 oslab002 systemd[1]: Stopping MUNGE authentication service...
Mar 10 23:32:33 oslab002 systemd[1]: Stopped MUNGE authentication service.
Mar 10 23:32:33 oslab002 systemd[1]: Starting MUNGE authentication service...
Mar 10 23:32:34 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 11 00:41:22 oslab002 systemd[1]: Stopping MUNGE authentication service...
Mar 11 00:41:22 oslab002 systemd[1]: Stopped MUNGE authentication service.
Mar 11 00:41:22 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 11 00:41:22 oslab002 systemd[1]: munge.service: main process exited, code=exited, status=1/FAILURE
Mar 11 00:41:22 oslab002 systemd[1]: Unit munge.service entered failed state.
Mar 11 00:41:22 oslab002 systemd[1]: munge.service failed.
Mar 11 00:41:29 oslab002 systemd[1]: Started MUNGE authentication service.
Mar 11 00:41:29 oslab002 munged[15433]: munged: Notice: Running on "oslab002.open-silicon.com" (172.16.24.83)
Mar 11 00:41:29 oslab002 systemd[1]: munge.service: main process exited, code=exited, status=1/FAILURE
Mar 11 00:41:29 oslab002 systemd[1]: Unit munge.service entered failed state.
Mar 11 00:41:29 oslab002 systemd[1]: munge.service failed.
[root@oslab002 ~]#
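For context on the "more than one ExecStart= setting" refusals in the journal above: systemd accumulates ExecStart= lines from the unit file and all of its drop-ins, and for any service type other than Type=oneshot only a single ExecStart= may remain. A drop-in therefore has to clear the inherited value with an empty assignment before setting its own. A minimal sketch of the pattern (the munged path matches this ticket; flags omitted here for illustration):

```ini
# /usr/lib/systemd/system/munge.service.d/local.conf (sketch)
[Service]
# An empty assignment resets the ExecStart= list inherited from munge.service...
ExecStart=
# ...so this replacement is then the only ExecStart= that remains.
ExecStart=/usr/sbin/munged
```

Without the empty `ExecStart=` line, the drop-in's entry is appended to the original one and systemd refuses to start the unit, which is exactly the "Refusing." message seen in the journal.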
Please call: > munged --version
(In reply to Nate Rini from comment #71)
> Please call:
> > munged --version

[root@oslab002 ~]# munged --version
munge-0.5.11 (2013-08-27)
[root@oslab002 ~]#
Please try:

/usr/lib/systemd/system/munge.service.d/local.conf:
> [Service]
> ExecStart=
> ExecStart=/usr/sbin/munged --num-threads 10

Please plan to upgrade the cluster. Many of these issues are caused by running older and generally deprecated versions.
(In reply to Nate Rini from comment #74)
> Please try:
>
> /usr/lib/systemd/system/munge.service.d/local.conf:
> > [Service]
> > ExecStart=
> > ExecStart=/usr/sbin/munged --num-threads 10

Yes, now it is running:

[root@oslab002 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/munge.service.d
           └─local.conf
   Active: active (running) since Sat 2023-03-11 01:13:34 IST; 8s ago
     Docs: man:munged(8)
  Process: 17756 ExecStart=/usr/sbin/munged --num-threads 10 (code=exited, status=0/SUCCESS)
 Main PID: 17759 (munged)
    Tasks: 12
   Memory: 892.0K
   CGroup: /system.slice/munge.service
           └─17759 /usr/sbin/munged --num-threads 10

Mar 11 01:13:34 oslab002 systemd[1]: Starting MUNGE authentication service...
Mar 11 01:13:34 oslab002 systemd[1]: Started MUNGE authentication service.
[root@oslab002 ~]#

> Please plan to upgrade the cluster. Many of these issues are caused by
> running older and generally deprecated versions.

Sure, will do this.
Please follow comment#57
(In reply to Nate Rini from comment #76)
> Please follow comment#57

[root@oslab002 ~]# echo test | munge | unmunge
STATUS: Success (0)
ENCODE_HOST: oslab002.open-silicon.com (172.16.24.83)
ENCODE_TIME: 2023-03-11 01:38:34 +0530 (1678478914)
DECODE_TIME: 2023-03-11 01:38:34 +0530 (1678478914)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 5
test
[root@oslab002 ~]#
(In reply to Openfive Support from comment #77)
> (In reply to Nate Rini from comment #76)
> > Please follow comment#57
> [root@oslab002 ~]# echo test | munge | unmunge
> STATUS: Success (0)

Please provide the output of the following:
> sdiag
> scontrol show nodes
> scontrol show jobs
Created attachment 29273 [details] scontrol_nodes_jobs_sdiag_11-03-2023
(In reply to Nate Rini from comment #78)
> (In reply to Openfive Support from comment #77)
> > (In reply to Nate Rini from comment #76)
> > > Please follow comment#57
> > [root@oslab002 ~]# echo test | munge | unmunge
> > STATUS: Success (0)
>
> Please provide the output of the following
> > sdiag
> > scontrol show nodes
> > scontrol show jobs

I have uploaded the above command outputs in a zip file: scontrol_nodes_jobs_sdiag_11-03-2023.zip

Also, I ran the same munge test on osxon002; below is the output:

[root@osxon002 ~]# echo test | munge | unmunge
STATUS: Success (0)
ENCODE_HOST: osxon002.open-silicon.com (192.168.2.61)
ENCODE_TIME: 2023-03-11 02:04:39 +0530 (1678480479)
DECODE_TIME: 2023-03-11 02:04:39 +0530 (1678480479)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 5
test
[root@osxon002 ~]#
(In reply to Openfive Support from comment #79)
> Created attachment 29273 [details]
> scontrol_nodes_jobs_sdiag_11-03-2023
>
> Remote Procedure Call statistics by user
> vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782
> krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909
> santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359
> radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749
> root ( 0) count:26450 ave_time:630897 total_time:16687241811

These users are querying Slurm more than root. Every one of these queries requires a munge connection, which may be the source of munge getting overloaded. Please work with these users to see why they are querying Slurm so much. This is usually due to a while() or for() loop that constantly runs one of the Slurm commands such as squeue.

Please note that the Slurm-23.02 release has new features to help with users causing this issue:
> https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable
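For reference once the cluster reaches 23.02 or later (it currently runs 20.11.8, which does not have this feature), the linked RPC rate limiting is turned on through SlurmctldParameters in slurm.conf. A minimal sketch; see the SlurmctldParameters entry in the slurm.conf man page for the companion rl_* tuning options:

```ini
# slurm.conf fragment (Slurm >= 23.02 only): enable per-user RPC rate limiting
# so looping squeue/sacct scripts are throttled instead of overloading munge.
SlurmctldParameters=rl_enable
```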
(In reply to Openfive Support from comment #79) > Created attachment 29273 [details] > scontrol_nodes_jobs_sdiag_11-03-2023 There are still a good number of idle nodes. Please verify that test jobs can start on them: > srun -w osvnc007 uptime
(In reply to Nate Rini from comment #83)
> (In reply to Openfive Support from comment #79)
> > Created attachment 29273 [details]
> > scontrol_nodes_jobs_sdiag_11-03-2023
>
> There are still a good number of idle nodes. Please verify that test jobs
> can start on them:
> > srun -w osvnc007 uptime

Well, many nodes are not added for computing purposes; osvnc007, for example, is only used for running users' VNCs. Is it required to add a server to Slurm as a node if I want to execute the srun command from that server?
(In reply to Nate Rini from comment #82) > (In reply to Openfive Support from comment #79) > > Created attachment 29273 [details] > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > Remote Procedure Call statistics by user > > vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782 > > krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909 > > santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359 > > radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749 > > root ( 0) count:26450 ave_time:630897 total_time:16687241811 > > These users are quering Slurm more than root. Every one of these queries > requires a munge connection which may be the source of munge getting > overloaded. Please work with these users to see why they are quering Slurm > soo much. This is usually due to a while() or for() loop that are constantly > running one of the Slurm commands such as squeue. > > Please note that the Slurm-23.02 release has new features to help with users > with this issue: > > https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable Sure, we will check with these users and will inform other users as well. Also, we will plan to upgrade the cluster.
(In reply to Openfive Support from comment #84) > (In reply to Nate Rini from comment #83) > > (In reply to Openfive Support from comment #79) > > > Created attachment 29273 [details] > > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > There are still a good number of idle nodes. Please verify that test jobs > > can start on them: > > > srun -w osvnc007 uptime > > Well, many nodes are not added for computing purposes like osvnc007 is only > for running users' VNCs. For purposes of this ticket, I want to verify that a job can start on all online nodes as that was the issue in comment#0. Please attach the slurmctld log after at least a single job has been tested on every node. > Is it required to add a server to slurm as a node if I want to execute the > srun command from the server? No: munge, Munge configuration, Slurm binaries, and Slurm configuration are the only things required beyond IP connectivity.
(In reply to Nate Rini from comment #82) > (In reply to Openfive Support from comment #79) > > Created attachment 29273 [details] > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > Remote Procedure Call statistics by user > > vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782 > > krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909 > > santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359 > > radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749 > > root ( 0) count:26450 ave_time:630897 total_time:16687241811 > > These users are quering Slurm more than root. Every one of these queries > requires a munge connection which may be the source of munge getting > overloaded. Please work with these users to see why they are quering Slurm > soo much. This is usually due to a while() or for() loop that are constantly > running one of the Slurm commands such as squeue. > > Please note that the Slurm-23.02 release has new features to help with users > with this issue: > > https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable Hi Nate, Can you please let us know, how we can get the above information from our side? We would like to do a periodic check on the same and follow up with the users. Please let us know Regards, Debajit Dutta
(In reply to Openfive Support from comment #87) > (In reply to Nate Rini from comment #82) > > (In reply to Openfive Support from comment #79) > > > Created attachment 29273 [details] > > > scontrol_nodes_jobs_sdiag_11-03-2023 > > > > > > Remote Procedure Call statistics by user > > > vishalk ( 3093) count:128849 ave_time:40420 total_time:5208098782 > > > krutikak ( 3591) count:35137 ave_time:105909 total_time:3721339909 > > > santhoshb ( 3549) count:35131 ave_time:106051 total_time:3725703359 > > > radhes ( 3582) count:35128 ave_time:106392 total_time:3737340749 > > > root ( 0) count:26450 ave_time:630897 total_time:16687241811 > > > > These users are quering Slurm more than root. Every one of these queries > > requires a munge connection which may be the source of munge getting > > overloaded. Please work with these users to see why they are quering Slurm > > soo much. This is usually due to a while() or for() loop that are constantly > > running one of the Slurm commands such as squeue. > > > > Please note that the Slurm-23.02 release has new features to help with users > > with this issue: > > > https://slurm.schedmd.com/slurm.conf.html#OPT_rl_enable > > > Hi Nate, > > > Can you please let us know, how we can get the above information from our > side? > > We would like to do a periodic check on the same and follow up with the > users. > > Please let us know > > > Regards, > Debajit Dutta ok got it, this data we get from the sdiag command.
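To make that periodic check easier to script, the "Remote Procedure Call statistics by user" section of sdiag output can be parsed and ranked. This is a hypothetical helper, not part of Slurm; the line format assumed below matches the sdiag excerpt quoted earlier in this ticket:

```python
# Parse the per-user RPC statistics from `sdiag` output and rank users
# by call count, so heavy Slurm queriers can be spotted in a cron job.
import re
import subprocess

USER_RE = re.compile(
    r"^\s*(\S+)\s+\(\s*(\d+)\)\s+count:(\d+)\s+ave_time:(\d+)\s+total_time:(\d+)"
)

def parse_user_rpcs(sdiag_text):
    """Return [(user, count, total_time)] sorted by count, descending."""
    rows = []
    in_section = False
    for line in sdiag_text.splitlines():
        if "statistics by user" in line:
            in_section = True
            continue
        if in_section:
            m = USER_RE.match(line)
            if m:
                rows.append((m.group(1), int(m.group(3)), int(m.group(5))))
            elif rows:  # first non-matching line after data ends the section
                break
    return sorted(rows, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    import shutil
    if shutil.which("sdiag"):  # only run on a host with Slurm client tools
        out = subprocess.run(["sdiag"], capture_output=True, text=True).stdout
        for user, count, total in parse_user_rpcs(out)[:10]:
            print(f"{user:12s} count={count:8d} total_time={total}")
```

Running this from cron and diffing the top entries over time would show whether the follow-up with users actually reduced their RPC counts.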
Please provide a status update
(In reply to Nate Rini from comment #89) > Please provide a status update Hi Nate, Can we have a call to resolve this issue? Regards, Debajit Dutta
Hi Debajit Dutta, Nate asked me to reply to you regarding your request for a call. I do not see the purpose or value in a call right now, especially since Nate has requested the following and a status update. I would not want to have an engineer sitting on a call while they review these logs. > For purposes of this ticket, I want to verify that a job can start on all online > nodes as that was the issue in comment#0. > Please attach the slurmctld log after at least a single job has been tested on > every node. Based on your other updates, it seems that jobs are running and users are able to submit. Please let Nate know if this is not the case. We can re-evaluate a call if needed once we have confirmation on the status, and answers to the above questions.
(In reply to Jason Booth from comment #91)

Hi Jason,

> > Please attach the slurmctld log after at least a single job has been tested on
> > every node.

I have executed the below command on around 12 running nodes and have attached the slurmctld log:

srun -p normal -w osxon047 uptime

You will find in the log file that the below error appears very frequently:

[2023-03-13T23:54:35.368] error: slurm_auth_get_host: Lookup failed for 0.0.0.0

What is the meaning of this error message? Why are we getting it?

> Based on your other updates, it seems that jobs are running and users are
> able to
> submit. Please let Nate know if this is not the case.

Yes, users are able to invoke jobs; however, we want to know what caused the error and how we can prevent it in the future.

Regards,
Debajit Dutta
Created attachment 29299 [details] slurmctld.log_13-03-2023
(In reply to Openfive Support from comment #93)
> Created attachment 29299 [details]
> slurmctld.log_13-03-2023
>
> [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619

It looks like there are a number of jobs that reference a now-defunct node "osxon092s". I suggest scancelling all of these jobs, as they can never run.

> [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0

This error is caused by munge either rejecting a packet or otherwise being too busy to make a munge token. I will note that in these logs, slurmctld ran without any of these errors for around 8 minutes (2023-03-13T22:05:07.754 -> 2023-03-13T22:13:23.031). This suggests that this is a load issue. munge is a cryptographic service which has very clear scaling limits based on the number of CPU cores on the host. We can try adding more threads to the munge daemon, but that can actually cause munge to go slower once munge has more threads than there are physical cores on the host.

Have the users listed in comment#79 been contacted to verify they are no longer hammering slurmctld with requests?
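As a quick worked check of the quiet window between the two quoted log timestamps, the gap can be computed directly (assuming, as here, both stamps come from the same log in the same timezone):

```python
# Difference between the two slurmctld log timestamps quoted above.
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"
start = datetime.strptime("2023-03-13T22:05:07.754", FMT)
end = datetime.strptime("2023-03-13T22:13:23.031", FMT)
window = end - start
print(window)  # 0:08:15.277000
```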
> > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0

Please also note that this error no longer even exists in the currently supported releases of Slurm. It has been replaced by substantially improved error logging.
(In reply to Nate Rini from comment #94)
> (In reply to Openfive Support from comment #93)
> > Created attachment 29299 [details]
> > slurmctld.log_13-03-2023
> >
> > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619
> It looks like there is number of jobs that reference a now defunct node
> "osxon092s". I suggest scancelling all of these jobs as they can never run.
>
> > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
> This error is caused my munge either rejecting a packet or otherwise being
> too busy to make a munge token. I will note that in these logs, that
> slurmctld ran with any of these errors for around 6 minutes
> (2023-03-13T22:05:07.754 -> 2023-03-13T22:13:23.031). This suggests that
> this a load issue. munge is a crytographic service which has very clear
> scaling limits based on the number of CPU cores on the host. We can try
> adding more threads to the munge daemon but can actually cause munge to go
> slower once munge has more threads than there are physical cores on the host.

What should we do now? Should I increase the munge thread count on the master server and restart the munge service on all nodes?

> Have the users listed in comment#79 been contacted to verify they are no
> longer hammering slurmctld with requests?

Yes, we did contact the users and also followed up with them on this.
(In reply to Nate Rini from comment #95)
> > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
>
> Please also note that this error no longer even exists in the currently
> supported releases of Slurm. It has been replaced by substantially improved
> error logging.

Are you recommending that upgrading to the latest version of Slurm will resolve all these issues?
(In reply to Nate Rini from comment #94)

Hi Nate,

> (In reply to Openfive Support from comment #93)
> > Created attachment 29299 [details]
> > slurmctld.log_13-03-2023
> >
> > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619
> It looks like there is number of jobs that reference a now defunct node
> "osxon092s". I suggest scancelling all of these jobs as they can never run.

What I find from the below errors:

[2023-03-13T22:05:08.136] error: _find_node_record(763): lookup failure for osxon092s
[2023-03-13T22:05:08.136] error: node_name2bitmap: invalid node specified osxon092s
[2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619

is that job ID 1190619 had already completed on 2023-03-08 at 22:35:32, as you can see from the scontrol output below:

[root@hpcmaster 13-03-2023]# scontrol show job 1190619
JobId=1190619 JobName=osi_hbmc_protocol_controller_wrap_falcon_FUNC_FFm40_rcworst_CCworstm40_SI_ENABLED_true_HOLD_ONLY
   UserId=renishd(1232) GroupId=technodebm(1023) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=COMPLETED Reason=NodeDown Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:08:53 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-08T22:26:17 EligibleTime=2023-03-08T22:26:17
   AccrueTime=2023-03-08T22:26:17
   StartTime=2023-03-08T22:26:39 EndTime=2023-03-08T22:35:32 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-08T22:26:39
   Partition=normal AllocNode:Sid=0.0.0.0:13928
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=osxon092s BatchHost=osxon092s
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20000M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20000M MinTmpDiskNode=0

Why am I getting this error message now, i.e. on 2023-03-13 at 22:05:08.136?

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #96) > (In reply to Nate Rini from comment #94) > > (In reply to Openfive Support from comment #93) > > > Created attachment 29299 [details] > > > slurmctld.log_13-03-2023 > > > > > > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619 > > It looks like there is number of jobs that reference a now defunct node > > "osxon092s". I suggest scancelling all of these jobs as they can never run. > > > > > > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0 > > This error is caused my munge either rejecting a packet or otherwise being > > too busy to make a munge token. I will note that in these logs, that > > slurmctld ran with any of these errors for around 6 minutes > > (2023-03-13T22:05:07.754 -> > > 2023-03-13T22:13:23.031). This suggests that this a load issue. munge is a > > crytographic service which has very clear scaling limits based on the number > > of CPU cores on the host. We can try adding more threads to the munge daemon > > but can actually cause munge to go slower once munge has more threads than > > there are physical cores on the host. > > > > What to do now? should I increase the munge thread in the master server and > restart the munge service in all nodes? We need to verify the number of cores on the host. Please call: > lscpu > > Have the users listed in comment#79 been contacted to verify they are no > > longer hammering slurmctld with requests? > > Yes, we did contacted the users and also followed up with them on this. Please call the following: > sdiag -r > sdiag > sleep 15m > sdiag Please upload the output. (In reply to Openfive Support from comment #97) > (In reply to Nate Rini from comment #95) > > > > [2023-03-13T22:13:23.031] error: slurm_auth_get_host: Lookup failed for 0.0.0.0 > > > > Please also note that this error no longer even exists in the currently > > supported releases of Slurm. 
> > It has been replaced by substantially improved error logging.
>
> Are you recommending that upgrading to the latest version of the slurm

The cluster is running a no longer supported version. We always suggest a site upgrade to a supported version, as we are limited in our options for fixing issues.

> resolves all these issues?

No, there is no guarantee of that, but we will have the ability to get better logs and provide corrective patches (if needed) on supported releases.

(In reply to Openfive Support from comment #98)
> (In reply to Nate Rini from comment #94)
> > (In reply to Openfive Support from comment #93)
> > > Created attachment 29299 [details]
> > > slurmctld.log_13-03-2023
> > >
> > > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619
> > It looks like there is number of jobs that reference a now defunct node
> > "osxon092s". I suggest scancelling all of these jobs as they can never run.
>
> What I find is that from the below error:-
>
> [2023-03-13T22:05:08.136] error: _find_node_record(763): lookup failure for osxon092s
> [2023-03-13T22:05:08.136] error: node_name2bitmap: invalid node specified osxon092s
>
> Why am I getting this error message now i.e. on 2023-03-13 at 22:05:08.136?

These errors all happened at the start of the log. Was slurmctld restarted prior to providing the log?
(In reply to Nate Rini from comment #99) > (In reply to Openfive Support from comment #96) > > (In reply to Nate Rini from comment #94) > > > (In reply to Openfive Support from comment #93) > > > > What to do now? should I increase the munge thread in the master server and > > restart the munge service in all nodes? > > We need to verify the number of cores on the host. Please call: > > lscpu > Below is the output:- [root@hpcmaster 13-03-2023]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz Stepping: 4 CPU MHz: 2100.000 BogoMIPS: 4200.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 11264K NUMA node0 CPU(s): 0-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities [root@hpcmaster 13-03-2023]# > > > Have the users listed in comment#79 been contacted to verify they are no > > > longer hammering slurmctld with requests? 
> > > > Yes, we did contacted the users and also followed up with them on this. > > Please call the following: > > sdiag -r > > sdiag > > sleep 15m > > sdiag > > Please upload the output. > Will attach a zip file for the same in few minutes. > (In reply to Openfive Support from comment #98) > > (In reply to Nate Rini from comment #94) > > > (In reply to Openfive Support from comment #93) > > > > Created attachment 29299 [details] > > > > slurmctld.log_13-03-2023 > > > > > > > > [2023-03-13T22:05:08.136] error: Invalid nodes (osxon092s) for JobId=1190619 > > > It looks like there is number of jobs that reference a now defunct node > > > "osxon092s". I suggest scancelling all of these jobs as they can never run. > > > > > > > > What I find is that from the below error:- > > > > [2023-03-13T22:05:08.136] error: _find_node_record(763): lookup failure for > > osxon092s > > [2023-03-13T22:05:08.136] error: node_name2bitmap: invalid node specified > > osxon092s > > Why am I getting this error message now i.e. on 2023-03-13 at 22:05:08.136? > > These errors happened all at the start of the log. Was slurmctld restarted > prior to providing the log? Yes, today it was restarted for some changes that were made in the slurm.conf file.
(In reply to Openfive Support from comment #100) > (In reply to Nate Rini from comment #99) > > (In reply to Openfive Support from comment #96) > > > (In reply to Nate Rini from comment #94) > > > > (In reply to Openfive Support from comment #93) > > > > > > > What to do now? should I increase the munge thread in the master server and > > > restart the munge service in all nodes? > > > > We need to verify the number of cores on the host. Please call: > > > lscpu > > > > Below is the output:- > Core(s) per socket: 8 The max number of threads for munge should be 8 on this host to match the number of cores. This host is likely underpowered to run Slurm on anything but a small cluster. Please see slides 18-20: > https://slurm.schedmd.com/SLUG22/Field_Notes_6.pdf I understand this is not something that can change immediately, but please consider providing faster server hardware for the Slurm controllers. > > (In reply to Openfive Support from comment #98) > Yes, today it was restarted for some changes that were made in the > slurm.conf file. Then these jobs were likely kept in the StateSaveLocation. If these errors don't happen on the next cycle of slurmctld, we can safely ignore these errors.
Created attachment 29301 [details] sdiag.zip
(In reply to Openfive Support from comment #102) > Created attachment 29301 [details] > sdiag.zip > > krutikak ( 3591) count:2830 ave_time:67513 total_time:191064547 This user is still doing more RPCs than root.
(In reply to Nate Rini from comment #101) > The max number of threads for munge should be 8 on this host to match the > number of cores. This host is likely underpowered to run Slurm on anything > but a small cluster. Is the controller a physical machine or is it a VM?
(In reply to Nate Rini from comment #104)
> (In reply to Nate Rini from comment #101)
> > The max number of threads for munge should be 8 on this host to match the
> > number of cores. This host is likely underpowered to run Slurm on anything
> > but a small cluster.
>
> Is the controller a physical machine or is it a VM?

It is a physical server, not a VM.
(In reply to Nate Rini from comment #103) > (In reply to Openfive Support from comment #102) > > Created attachment 29301 [details] > > sdiag.zip > > > > krutikak ( 3591) count:2830 ave_time:67513 total_time:191064547 > > This user is still doing more RPCs than root. Has the RPC count/total_time for this user been reduced? Please provide a new sdiag output: Please call the following: > sdiag -r > uptime > sdiag > sleep 15m > uptime > sdiag > sleep 15m > uptime > sdiag
(In reply to Nate Rini from comment #106)

Hi Nate,

> (In reply to Nate Rini from comment #103)
> > (In reply to Openfive Support from comment #102)
> > > Created attachment 29301 [details]
> > > sdiag.zip
> > >
> > > krutikak ( 3591) count:2830 ave_time:67513 total_time:191064547
> >
> > This user is still doing more RPCs than root.
>
> Has the RPC count/total_time for this user been reduced? Please provide a
> new sdiag output:

We are following up with the users; this will take some time.

> Please call the following:
> > sdiag -r
> > uptime
> > sdiag
> > sleep 15m
> > uptime
> > sdiag
> > sleep 15m
> > uptime
> > sdiag

Is this to verify the RPC count/total_time for the users?

Regards,
Debajit Dutta
(In reply to Openfive Support from comment #107) > (In reply to Nate Rini from comment #106) > We are following up with the users, we will take some time on this. I will reduce this ticket to SEV4 while we wait. > Is this to verify the RPC count/total_time for the users? Yes but I'm also looking to verify that jobs are starting. (In reply to Openfive Support from comment #105) > No, it is a physical server. The CPU on this server is very likely just too slow to be a Slurm controller for the cluster. I strongly suggest looking into getting a faster server. Many of these issues will likely just be resolved by that.
Hi Nate,

Cores are available, but we are still seeing the wait below:

[debajitd@osvnc001 ~]$ srun -p normal --pty /bin/tcsh
srun: job 1257513 queued and waiting for resources

[root@hpcmaster 13-03-2023]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal up 15-00:00:0 2 down* osxon[030,060]
normal up 15-00:00:0 7 drng osxon[010,019,032,047,050,055,080]
normal up 15-00:00:0 1 resv osxon038
normal up 15-00:00:0 28 mix osxon[001,004-006,009,013,015,024,031,033,036-037,041-042,045,056,059,063,065-066,069,073-075,079,082,091,094]
normal up 15-00:00:0 26 alloc osxon[007-008,018,020-021,023,028-029,035,039,043-044,046,048-049,052-054,058,064,067,070-071,078,087,090]
long up 60-00:00:0 2 down* osxon[030,060]
long up 60-00:00:0 7 drng osxon[010,019,032,047,050,055,080]
long up 60-00:00:0 1 resv osxon038
long up 60-00:00:0 28 mix osxon[001,004-006,009,013,015,024,031,033,036-037,041-042,045,056,059,063,065-066,069,073-075,079,082,091,094]
long up 60-00:00:0 26 alloc osxon[007-008,018,020-021,023,028-029,035,039,043-044,046,048-049,052-054,058,064,067,070-071,078,087,090]
short up 6:00:00 1 down* osxon060
short up 6:00:00 4 mix osxon[059,065-066,081]
short up 6:00:00 1 alloc osxon067
prio up 15-00:00:0 1 down* osxon060
prio up 15-00:00:0 8 mix osxon[002,004,006,031,065-066,081,088]
prio up 15-00:00:0 3 alloc osxon[035,054,067]
prio up 15-00:00:0 1 idle osxon068
sms-license up 15-00:00:0 1 mix osxon081
regression up 15-00:00:0 3 alloc osxon[061,072,077]
guest up 15-00:00:0 1 drain osxon034
guest up 15-00:00:0 1 idle osxon003
eda up 15-00:00:0 1 idle osxon095d
vnc up infinite 13 maint guest-ausdia,guestvnc001,osvnc[001-004,007-013]
vnc up infinite 2 down* guest-ansys,guest-mentor
guest-vnc* up infinite 1 maint ofindcon
[root@hpcmaster 13-03-2023]#

Please help us here.

Regards,
Debajit Dutta
We are not able to execute any jobs right now. This problem is happening again. Can we please get on a call to resolve this?
Please call: > scontrol show job 1257513
(In reply to Nate Rini from comment #111)
> Please call:
> > scontrol show job 1257513

[root@hpcmaster 13-03-2023]# scontrol show job 1257513
JobId=1257513 JobName=tcsh
   UserId=debajitd(3403) GroupId=engr(500) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T00:18:14 EligibleTime=2023-03-21T00:18:14
   AccrueTime=2023-03-21T00:18:14
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T00:42:18
   Partition=normal AllocNode:Sid=osvnc001:24466
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/debajitd
   Power=
   NtasksPerTRES:0
[root@hpcmaster 13-03-2023]#
Please call:
> sprio
(In reply to Nate Rini from comment #113) > Please call: > > sprio [root@hpcmaster 13-03-2023]# sprio JOBID PARTITION PRIORITY SITE PARTITION 1256323 normal 5000 0 5000 1256578 normal 5000 0 5000 1256626 normal 5000 0 5000 1256755 normal 5000 0 5000 1256756 normal 5000 0 5000 1256757 normal 5000 0 5000 1256758 normal 5000 0 5000 1256759 normal 5000 0 5000 1256760 normal 5000 0 5000 1256761 normal 5000 0 5000 1256762 normal 5000 0 5000 1256763 normal 5000 0 5000 1256764 normal 5000 0 5000 1256765 normal 5000 0 5000 1256816 normal 5000 0 5000 1256852 normal 5000 0 5000 1257068 normal 5000 0 5000 1257103 regressio 6250 0 6250 1257104 regressio 6250 0 6250 1257106 regressio 6250 0 6250 1257107 regressio 6250 0 6250 1257128 normal 5000 0 5000 1257133 normal 5000 0 5000 1257142 regressio 6250 0 6250 1257150 normal 5000 0 5000 1257151 normal 5000 0 5000 1257152 normal 5000 0 5000 1257153 normal 5000 0 5000 1257154 normal 5000 0 5000 1257155 normal 5000 0 5000 1257156 normal 5000 0 5000 1257157 normal 5000 0 5000 1257158 normal 5000 0 5000 1257159 normal 5000 0 5000 1257160 normal 5000 0 5000 1257161 normal 5000 0 5000 1257162 normal 5000 0 5000 1257163 normal 5000 0 5000 1257164 normal 5000 0 5000 1257165 normal 5000 0 5000 1257166 normal 5000 0 5000 1257167 normal 5000 0 5000 1257206 normal 5000 0 5000 1257207 normal 5000 0 5000 1257208 normal 5000 0 5000 1257209 normal 5000 0 5000 1257210 normal 5000 0 5000 1257211 normal 5000 0 5000 1257212 normal 5000 0 5000 1257213 normal 5000 0 5000 1257214 normal 5000 0 5000 1257215 normal 5000 0 5000 1257216 normal 5000 0 5000 1257217 normal 5000 0 5000 1257218 normal 5000 0 5000 1257219 normal 5000 0 5000 1257220 normal 5000 0 5000 1257221 normal 5000 0 5000 1257222 normal 5000 0 5000 1257223 normal 5000 0 5000 1257224 normal 5000 0 5000 1257225 normal 5000 0 5000 1257226 normal 5000 0 5000 1257227 normal 5000 0 5000 1257228 normal 5000 0 5000 1257229 normal 5000 0 5000 1257230 normal 5000 0 5000 1257231 normal 5000 0 5000 
1257232 normal 5000 0 5000 1257233 normal 5000 0 5000 1257234 normal 5000 0 5000 1257235 normal 5000 0 5000 1257236 normal 5000 0 5000 1257237 normal 5000 0 5000 1257238 normal 5000 0 5000 1257239 normal 5000 0 5000 1257240 normal 5000 0 5000 1257241 normal 5000 0 5000 1257242 normal 5000 0 5000 1257243 normal 5000 0 5000 1257244 normal 5000 0 5000 1257245 normal 5000 0 5000 1257246 normal 5000 0 5000 1257247 normal 5000 0 5000 1257251 long 3750 0 3750 1257263 normal 5000 0 5000 1257269 normal 5000 0 5000 1257270 normal 5000 0 5000 1257325 regressio 6250 0 6250 1257327 regressio 6250 0 6250 1257333 regressio 6250 0 6250 1257356 normal 5000 0 5000 1257366 normal 5000 0 5000 1257390 normal 5000 0 5000 1257398 long 3750 0 3750 1257422 normal 5000 0 5000 1257423 normal 5000 0 5000 1257424 regressio 6250 0 6250 1257436 regressio 6250 0 6250 1257464 normal 5000 0 5000 1257465 normal 5000 0 5000 1257466 normal 5000 0 5000 1257470 normal 5000 0 5000 1257488 normal 5000 0 5000 1257489 normal 5000 0 5000 1257490 normal 5000 0 5000 1257491 normal 5000 0 5000 1257492 normal 5000 0 5000 1257493 normal 5000 0 5000 1257494 normal 5000 0 5000 1257495 normal 5000 0 5000 1257496 normal 5000 0 5000 1257497 normal 5000 0 5000 1257498 normal 5000 0 5000 1257499 normal 5000 0 5000 1257500 normal 5000 0 5000 1257505 normal 5000 0 5000 1257512 regressio 6250 0 6250 1257514 regressio 6250 0 6250 1257522 normal 5000 0 5000 1257523 normal 5000 0 5000 1257528 normal 5000 0 5000 1257536 normal 5000 0 5000 1257543 normal 5000 0 5000 1257547 normal 5000 0 5000 1257548 normal 5000 0 5000 1257553 normal 5000 0 5000 1257554 normal 5000 0 5000 1257555 normal 5000 0 5000 1257556 normal 5000 0 5000 1257557 normal 5000 0 5000 1257558 normal 5000 0 5000 1257559 normal 5000 0 5000 1257560 normal 5000 0 5000 1257561 normal 5000 0 5000 1257562 normal 5000 0 5000 1257563 normal 5000 0 5000 1257564 normal 5000 0 5000 1257565 normal 5000 0 5000 1257570 normal 5000 0 5000 1257571 normal 5000 0 5000 1257572 
regressio 6250 0 6250 1257573 normal 5000 0 5000 1257574 normal 5000 0 5000 1257575 normal 5000 0 5000 1257576 normal 5000 0 5000 1257577 normal 5000 0 5000 1257578 regressio 6250 0 6250 1257579 normal 5000 0 5000 1257580 normal 5000 0 5000 1257581 normal 5000 0 5000 1257582 normal 5000 0 5000 1257583 normal 5000 0 5000 1257584 normal 5000 0 5000 1257585 normal 5000 0 5000 1257586 normal 5000 0 5000 1257587 normal 5000 0 5000 1257588 normal 5000 0 5000 1257589 normal 5000 0 5000 1257590 normal 5000 0 5000 1257591 normal 5000 0 5000 1257592 normal 5000 0 5000 1257593 normal 5000 0 5000 1257595 normal 5000 0 5000 1257596 normal 5000 0 5000 1257597 normal 5000 0 5000 1257598 regressio 6250 0 6250 1257599 normal 5000 0 5000 1257600 normal 5000 0 5000 1257601 normal 5000 0 5000 1257602 normal 5000 0 5000 1257603 normal 5000 0 5000 1257604 regressio 6250 0 6250 1257605 normal 5000 0 5000 1257606 normal 5000 0 5000 1257607 normal 5000 0 5000 1257608 normal 5000 0 5000 1257609 normal 5000 0 5000 1257610 normal 5000 0 5000 1257611 normal 5000 0 5000 1257612 normal 5000 0 5000 [root@hpcmaster 13-03-2023]#
Hi Team,

Please help us here; this is urgent. We are not able to execute any jobs, and jobs are going to the pending state despite available resources. We have also noticed that only the "normal" and "long" partitions have this issue; the remaining partitions are working. Below are the configurations of the two partitions:

[root@hpcmaster 21-03-2023]# grep normal /etc/slurm/slurm.conf
###normal queue####
PartitionName=normal PriorityJobFactor=400 Nodes=osxon[004,005,006,007,019,024,036,082,018,020,021,023,031,038,039,044,090,091,055,059,045,070,053,056,069,033,080,008,066,010,060,073,009,013,028,029,015,047,032,030,050,071,074,049,037,041,058,048,035,075,079,043,065,064,001,067,094,087,046,078,052,054,042,063] Default=YES MaxTime=15-00 State=UP AllowGroups=engr
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# grep long /etc/slurm/slurm.conf
####long queue####
PartitionName=long PriorityJobFactor=300 Nodes=osxon[004,005,006,007,019,024,036,082,018,020,021,023,031,038,039,044,090,091,055,059,045,070,053,056,069,033,080,008,066,010,060,073,009,013,028,029,015,047,032,030,050,071,074,049,037,041,058,048,035,075,079,043,065,064,001,067,094,087,046,078,052,054,042,063] Default=YES MaxTime=60-00 State=UP AllowGroups=engr
[root@hpcmaster 21-03-2023]#

Regards,
Debajit Dutta
The job has Priority=5000 while sprio reports a large number of jobs with the same priority. When the priority is the same, Slurm orders the jobs by submission time. Job 1257513 will not run until it has the highest priority.

To verify, please call:
> scontrol show job 1257513
> scontrol top 1257513
> scontrol show job 1257513
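To illustrate the tie-break (a sketch using made-up submission times, not data from this cluster): pending jobs are effectively ordered by priority, then by submission time, which can be emulated with a plain `sort`.

```shell
# Illustrative only: emulate how equal-priority jobs fall back to submission
# time. Job IDs are from this ticket; the epoch values are invented.
# Fields: jobid|priority|submit_epoch
printf '%s\n' \
  '1257513|5000|1679354294' \
  '1257068|5000|1679340000' \
  '1256323|5000|1679300000' |
  sort -t'|' -k2,2nr -k3,3n
# 1256323 sorts first: with identical priorities, the earliest-submitted job
# is considered first, so 1257513 waits behind the whole equal-priority queue.
```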
(In reply to Nate Rini from comment #116)
> The job has Priority=5000 while sprio reports a large number of jobs with
> the same priority. When the priority is the same, Slurm orders the jobs by
> submission time. Job 1257513 will not run until it has the highest priority.
>
> To verify, please call:
> > scontrol show job 1257513
> > scontrol top 1257513
> > scontrol show job 1257513

[root@hpcmaster 21-03-2023]# scontrol show job 1257861
JobId=1257861 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T02:08:59 EligibleTime=2023-03-21T02:08:59
   AccrueTime=2023-03-21T02:08:59
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T02:08:59
   Partition=normal AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol top 1257861
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol show job 1257861
JobId=1257861 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=5000 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T02:08:59 EligibleTime=2023-03-21T02:08:59
   AccrueTime=2023-03-21T02:08:59
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T02:09:36
   Partition=normal AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#
Please call:
> scontrol show job 1257861 #I want to see if the state changed
> scontrol setdebug debug3
> scontrol setdebugflags +SelectType
> scontrol top 1257861
> scontrol show job 1257861
> sleep 5m
> scontrol setdebug verbose
> scontrol setdebugflags -SelectType
> scontrol show job 1257861

Please attach the slurmctld log from this testing period.
Please also provide the output of the following on the controller:
> date +"%Z %z"
> date +%s
> date
We need prompt responses to maintain the SEV1 status of a ticket. Please respond to comment#119 and comment#118 so that we can continue to debug.
Hi Nate,

Let me execute a new job and provide you with the data.
(In reply to Nate Rini from comment #118)
> Please call:
> > scontrol show job 1257861 #I want to see if the state changed
> > scontrol setdebug debug3
> > scontrol setdebugflags +SelectType
> > scontrol top 1257861

[root@hpcmaster 21-03-2023]# scontrol show job 1259263
JobId=1259263 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=3750 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:13 TimeLimit=60-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T08:12:10 EligibleTime=2023-03-21T08:12:10
   AccrueTime=2023-03-21T08:12:10
   StartTime=2023-03-21T08:12:14 EndTime=2023-05-20T08:12:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T08:12:13
   Partition=long AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=osxon007 BatchHost=osxon007
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol setdebug debug3
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol setdebugflags +SelectType
[root@hpcmaster 21-03-2023]#
[root@hpcmaster 21-03-2023]# scontrol top 1259263
Job is no longer pending execution for job 1259263
[root@hpcmaster 21-03-2023]#

> > scontrol show job 1257861
> > sleep 5m
> > scontrol setdebug verbose
> > scontrol setdebugflags -SelectType
> > scontrol show job 1257861

[root@hpcmaster 21-03-2023]# sleep 5m
[root@hpcmaster 21-03-2023]# scontrol setdebug verbose
[root@hpcmaster 21-03-2023]# scontrol setdebugflags -SelectType
[root@hpcmaster 21-03-2023]# scontrol show job 1259263
JobId=1259263 JobName=tcsh
   UserId=bahubalir(3350) GroupId=engr(500) MCS_label=N/A
   Priority=3750 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:10:50 TimeLimit=60-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T08:12:10 EligibleTime=2023-03-21T08:12:10
   AccrueTime=2023-03-21T08:12:10
   StartTime=2023-03-21T08:12:14 EndTime=2023-05-20T08:12:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T08:12:13
   Partition=long AllocNode:Sid=osvnc004:25357
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=osxon007 BatchHost=osxon007
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/bahubalir
   Power=
   NtasksPerTRES:0
[root@hpcmaster 21-03-2023]#

> > Please attach the slurmctld log from this testing period.

Since 7:00 AM IST this morning we see that jobs are dispatching again as usual and Slurm is working fine. However, we want to know the root cause of this.

Regards,
Debajit Dutta
(In reply to Nate Rini from comment #119)
> Please also provide the output of the following on the controller:
> > date +"%Z %z"
> > date +%s
> > date

[root@hpcmaster 21-03-2023]# date +"%Z %z"
IST +0530
[root@hpcmaster 21-03-2023]# date +%s
1679367218
[root@hpcmaster 21-03-2023]# date
Tue Mar 21 08:23:48 IST 2023
[root@hpcmaster 21-03-2023]#
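As a cross-check (a sketch, not part of the original exchange), the reported epoch value can be converted back to a calendar time with GNU date; printed in UTC here for determinism, it agrees with the controller's IST clock to within the few seconds between the two commands.

```shell
# Convert the controller's reported epoch back to a calendar time (GNU date).
# 1679367218 is the value captured above.
TZ=UTC date -d @1679367218 +"%Y-%m-%dT%H:%M:%S%z"
# → 2023-03-21T02:53:38+0000 (i.e. 08:23:38 IST, consistent with the `date` output)
```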
Created attachment 29430 [details] slurmctld.log from the slurm controller on 21-03-2023 08:32 AM IST
(In reply to Nate Rini from comment #118)
> Please call:
> > scontrol show job 1257861 #I want to see if the state changed
> > scontrol setdebug debug3
> > scontrol setdebugflags +SelectType
> > scontrol top 1257861
> > scontrol show job 1257861
> > sleep 5m
> > scontrol setdebug verbose
> > scontrol setdebugflags -SelectType
> > scontrol show job 1257861
>
> Please attach the slurmctld log from this testing period.

Log attached: slurmctld.log from the slurm controller on 21-03-2023 08:32 AM IST.
Hi Team,

The issue is recurring and causing problems. No jobs are executing and it is impacting our production. Please arrange a call immediately.

[2023-03-21T12:12:26.784] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=5828060 began=12:12:20.956
[2023-03-21T12:12:27.745] Warning: Note very large processing time from _slurm_rpc_allocate_resources: usec=6770181 began=12:12:20.975
[2023-03-21T12:12:27.745] sched: _slurm_rpc_allocate_resources JobId=1261189 NodeList=(null) usec=6770181
[2023-03-21T12:12:27.857] Warning: Note very large processing time from _slurmctld_background: usec=6858277 began=12:12:20.999
[2023-03-21T12:12:27.857] job_signal: 9 of pending JobId=1261148 successful
[2023-03-21T12:12:28.191] Warning: Note very large processing time from dump_all_job_state: usec=4190409 began=12:12:24.001
[2023-03-21T12:12:29.925] sched: _slurm_rpc_allocate_resources JobId=1261190 NodeList=(null) usec=30493
[2023-03-21T12:12:30.224] sched: _slurm_rpc_allocate_resources JobId=1261191 NodeList=(null) usec=123222
[2023-03-21T12:12:34.593] _job_complete: JobId=1261088 WTERMSIG 126
[2023-03-21T12:12:34.593] _job_complete: JobId=1261088 cancelled by interactive user
[2023-03-21T12:12:34.594] _job_complete: JobId=1261088 done
[2023-03-21T12:12:34.594] _slurm_rpc_complete_job_allocation: JobId=1261088 error Job/step already completing or completed
[2023-03-21T12:12:35.441] _slurm_rpc_complete_job_allocation: JobId=1261088 error Job/step already completing or completed
[2023-03-21T12:12:35.466] _slurm_rpc_complete_job_allocation: JobId=1261088 error Job/step already completing or completed
[2023-03-21T12:12:36.193] _job_complete: JobId=1261025 WEXITSTATUS 0
[2023-03-21T12:12:36.193] _job_complete: JobId=1261025 done
[2023-03-21T12:12:36.238] sched: _slurm_rpc_allocate_resources JobId=1261192 NodeList=(null) usec=26488
[2023-03-21T12:12:36.373] sched: _slurm_rpc_allocate_resources JobId=1261193 NodeList=(null) usec=26556
[2023-03-21T12:12:36.611] Time limit exhausted for JobId=1174490
[2023-03-21T12:12:36.826] _slurm_rpc_complete_job_allocation: JobId=1174490 error Job/step already completing or completed
[2023-03-21T12:12:39.894] _job_complete: JobId=1261028 WTERMSIG 126
[2023-03-21T12:12:39.894] _job_complete: JobId=1261028 cancelled by interactive user
[2023-03-21T12:12:39.894] _job_complete: JobId=1261028 done
[2023-03-21T12:12:39.894] _slurm_rpc_complete_job_allocation: JobId=1261028 error Job/step already completing or completed
[2023-03-21T12:12:41.323] _job_complete: JobId=1261023 WEXITSTATUS 0
[2023-03-21T12:12:41.323] _job_complete: JobId=1261023 done
(In reply to Nate Rini from comment #108)
> The CPU on this server is very likely just too slow to be a Slurm controller
> for the cluster. I strongly suggest looking into getting a faster server.
> Many of these issues will likely just be resolved by that.

What are the recommended hardware configurations

1. for the latest version of Slurm?
2. for our current version, i.e. 20.11.8?

Regards,
Debajit Dutta
Hi Nate,

The problem we are facing is that suddenly Slurm stops accepting any jobs: whenever a user executes a job, it goes to the PENDING state even when it requests only 1 CPU. The problem lasts for about an hour, after which Slurm gradually starts picking up the jobs that were pending and works normally again. So far this has happened twice:

1. on 20-03-2023 around 11:17 PM
2. on 21-03-2023 around 1:03 PM

Is this because of the change we made to munge on the Slurm controller last weekend, i.e. adding 8 threads to the munge service? Or is this a new issue?

Also, we still see the errors in the slurmctld log for which we opened this case:

1. error: slurm_receive_msgs: Zero Bytes were transmitted or received
2. error: Invalid nodes (osxon092s) for JobId=1190199
3. error: slurm_auth_get_host: Lookup failed for 0.0.0.0

Please let us know what steps we need to take to resolve these recurring issues.

Regards,
Debajit Dutta
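For reference, the thread-count change described above is typically applied via munged's `--num-threads` option; a sketch assuming a sysconfig-style install (the file path varies by distribution and is an assumption here):

```shell
# /etc/sysconfig/munge (assumed path; e.g. /etc/default/munge on Debian-based systems)
DAEMON_ARGS="--num-threads=8"
```

After editing, the munge service must be restarted for the new thread count to take effect.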
(In reply to Openfive Support from comment #122)
> Now from morning 7:00 AM IST we see that the jobs are dispatching again as
> usual the slurm is working fine. However, we want to know the root cause of
> it.

JobId=1259263 was submitted with a priority that was too low to be scheduled immediately. The request to call `scontrol top 1259263` forced the job to have the highest priority. Slurm then scheduled the job and it was started on osxon007. This appears to be normal operation for Slurm.

It seems like you could greatly benefit from tuning the scheduler, which would help with both job starts and backfill job placement. We see that you previously set SchedulerParameters but then commented it out:

> SchedulerParameters=bf_continue,bf_max_job_test=10000

Can you explain the decision behind this?

Regarding your current situation: if your scheduler is not tuned for your workload, it is likely that backfill never looks deep enough into the queue to reach the jobs you would expect to run.

Slurm has a main scheduler, a quick scheduler that runs at job submission, and a backfill scheduler. Once the main scheduler reaches the first job it cannot run, it stops processing the job queue. This is normally the highest-priority job, but it could be another job further down the queue, depending on job size. The job queue is ordered by computed job priority. Slurm also tries to run a job at submission time if the requested resources are available and starting it would not delay higher-priority jobs.

Backfill runs on a set cadence and will backfill jobs as long as doing so does not delay the start time of your higher-priority jobs. For backfill to work correctly, your jobs need to request a run time and the scheduler needs to be configured for your site's workflow.

1. How long is your longest-running job?
2. Does your site prefer larger or smaller jobs to take priority, or a mix of both?
3. Should your site limit how many jobs one user can start at each backfill iteration, or in total through backfill?
4.
Roughly how many jobs does your site submit and finish in a day, week, or month? The following command can help you understand your mixture of jobs, runtimes, and sizes in order to answer this question:

> $ squeue -o "%.P | %.A | %.u | %.h | %.c | %.C | %.D | %.e | %.l | %.N | %.p | %.S | %.T | %.V"

(In reply to Openfive Support from comment #128)
> Is it because of the changes we made in munge in the slurm controller last
> weekend, i.e. added 8 threads to munge service?
> Or this is a new issue?
>
> Also, we still find the errors in the slurmctl log for which we opened this
> case.
>
> 1. error: slurm_receive_msgs: Zero Bytes were transmitted or received
> 3. error: slurm_auth_get_host: Lookup failed for 0.0.0.0
>
> Please let us know what steps we need to take to resolve these recurring
> issues.

This is all the same issue of munge being overloaded. The controller's hardware is too slow to handle the workload. I suspect this period was induced by a user sending too many RPCs; calling `sdiag` should be able to determine who. We have already configured munge in the best possible way, with one thread per physical core. Munge is a cryptographic service and is entirely reliant on the speed of the core and on whether there is support for more advanced SSE operations. If munge has an issue or is slow, then Slurm has to wait for it during communications.

> 2. error: Invalid nodes (osxon092s) for JobId=1190199

Please provide the entire log message. If this is the entire log message, then please provide at least 5 messages from before and after.

(In reply to Openfive Support from comment #127)
> (In reply to Nate Rini from comment #108)
> > The CPU on this server is very likely just too slow to be a Slurm controller
> > for the cluster. I strongly suggest looking into getting a faster server.
> > Many of these issues will likely just be resolved by that.
>
> What are the recommended hardware configurations
>
> 1. for the latest version of Slurm?
> 2. for our current version i.e. 20.11.8 ?
We generally don't make different suggestions based on the Slurm version installed. The slowest blocking operations (calling out to munge) have actually been unchanged for many years.

We have a list of publications on our website: https://slurm.schedmd.com/publications.html

I suggest watching and reading this one specifically:

> Field Notes 6: From The Frontlines of Slurm Support, Video, Jason Booth, SchedMD
> https://slurm.schedmd.com/SLUG22/Field_Notes_6.pdf
> https://youtu.be/njEgeMUAqMY

While I suggest watching/reading the entire presentation, the relevant slides start at #17. To avoid repeating the same information on this ticket, please watch and read the entire presentation. We will be happy to answer any additional questions or provide clarifications on the content of the slides.
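As an illustration of the kind of backfill tuning discussed above, a slurm.conf fragment might look like the following. The values here are assumptions to be adjusted to the site's workload, not a recommendation for this cluster; only `bf_continue` and `bf_max_job_test` appear in the previously commented-out setting.

```conf
# slurm.conf fragment (illustrative values only; tune per workload)
# bf_continue      - resume backfill scanning after locks are released
# bf_max_job_test  - how many jobs backfill considers per iteration
# bf_max_job_user  - cap on jobs started per user per backfill iteration
# bf_window        - how far ahead (seconds) backfill plans; >= longest job time limit
# bf_interval      - seconds between backfill iterations
SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_max_job_user=50,bf_window=86400,bf_interval=60
```

Note that `bf_window` generally needs to cover the longest job time limit in use, which is why the answer to "how long is your longest-running job?" matters for choosing it.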
*** Ticket 16329 has been marked as a duplicate of this ticket. ***
(In reply to Nate Rini from comment #131)
> 1. How long is your longest running job?
> 2. Does your site prefer larger or smaller jobs to take priority, or a mix
> of both?
> 3. Should your site limit how many jobs one user can start at each backfill
> iteration or in total through backfill?
> 4. Roughly how many jobs does your site submit and finish in a day, week,
> month?
>
> The following command can help you understand your mixture of jobs, runtime
> and size in order to answer this question.
> >$ squeue -o "%.P | %.A | %.u | %.h | %.c | %.C | %.D | %.e | %.l | %.N | %.p | %.S | %.T | %.V"

Please provide the requested information. I know some of it was covered in the concall, but having all of it will be helpful.

After the meeting, I assume that this config has been reactivated:

> SchedulerParameters=bf_continue,bf_max_job_test=10000

Please provide a current slurm.conf and the output of sdiag.
Please provide a status update. We are waiting on a reply to comment#135
(In reply to Nate Rini from comment #139) > Please provide a status update. We are waiting on a reply to comment#135 Please provide a status update. We are waiting on a reply to comment#135
I'm going to time this ticket out. I assume the process of converting to a new machine for the controller is taking a while. Please respond when that is complete and we can continue debugging the issues (if needed).