Ticket 7906 - Cluster problem with slurmdbd
Summary: Cluster problem with slurmdbd
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 19.05.2
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Jason Booth
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-10-09 17:35 MDT by darrellp
Modified: 2019-10-11 12:31 MDT

See Also:
Site: Allen AI
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (1.39 KB, text/plain) - 2019-10-09 17:35 MDT, darrellp
gres.conf (83 bytes, text/plain) - 2019-10-09 17:36 MDT, darrellp
database.output (1.71 KB, text/plain) - 2019-10-09 17:37 MDT, darrellp
slurmctld-debug.log (11.34 KB, text/x-log) - 2019-10-10 12:03 MDT, darrellp

Description darrellp 2019-10-09 17:35:48 MDT
Created attachment 11891 [details]
slurm.conf

Folks, 
   I am having some trouble getting accounting up and running and would like to ask for some adult guidance here. What I have is a single node cluster with the below stats, reporting to a remote database. 

Slurm version: 19.05.2
OS: Ubuntu 18.04.2
GRES: 1 Nvidia Titan V GPU 

What I am seeing is that sacct can see the cluster I set up via sacctmgr, but slurmdbd returns 'cluster not registered' when slurmctld tries to log data to it.

Here is the command I ran to add the cluster:
   sacctmgr add cluster allennlp
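For reference, slurmctld typically only registers with slurmdbd when it starts up, so a cluster added to the database while slurmctld is already running will usually not show as registered until the daemon is bounced. A rough sketch of the full sequence (service and command names assume a systemd-managed install like this one):

```shell
# Rough sketch: add the cluster to accounting, then bounce slurmctld so it
# (re)registers with slurmdbd. Run as root or via sudo.
add_and_register() {
  sacctmgr -i add cluster "$1" &&    # -i: commit without the confirmation prompt
  systemctl restart slurmctld &&     # forces re-registration with slurmdbd
  sacctmgr list cluster format=Cluster,ControlHost,ControlPort
}
# Usage: add_and_register allennlp
```

If the restart succeeded, the last command should show ControlHost and ControlPort populated for the new cluster.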

When I list the clusters, it appears to be there: 

darrellp@it-test ~ $ sudo sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
  allennlp                            0     0         1                                                                                           normal    


When I run sacct -a, here is what I see in the slurmdbd logs. It appears to be reading from the allennlp tables:

[2019-10-09T16:14:54.670] debug2: Opened connection 10 from 172.16.3.83
[2019-10-09T16:14:54.671] debug:  REQUEST_PERSIST_INIT: CLUSTER:allennlp VERSION:8704 UID:2149 IP:172.16.3.83 CONN:10
[2019-10-09T16:14:54.671] debug2: acct_storage_p_get_connection: request new connection 1
[2019-10-09T16:14:54.671] debug2: Attempting to connect to util02:3306
[2019-10-09T16:14:54.717] debug2: DBD_GET_JOBS_COND: called
[2019-10-09T16:14:54.761] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2019-10-09T16:14:54.762] debug4: got 0 commits
[2019-10-09T16:14:54.762] debug2: persistent connection is closed
[2019-10-09T16:14:54.762] debug2: Closed connection 10 uid(2149)


However, when I run a job, slurmdbd reports 'cluster not registered':

[2019-10-09T16:15:17.825] debug2: DBD_STEP_START: ID:9.0 NAME:bash SUBMIT:1570662917
[2019-10-09T16:15:17.826] debug4: We can't get a db_index for this combo, time_submit=1570662917 and id_job=9.  We must not have heard about the start yet, no big deal, we will get one right after this.
[2019-10-09T16:15:17.827] debug2: as_mysql_job_start: called
[2019-10-09T16:15:17.829] debug3: DBD_STEP_START: cluster not registered
[2019-10-09T16:15:21.937] debug2: DBD_JOB_START: START CALL ID:9 NAME:bash INX:0
[2019-10-09T16:15:21.937] debug2: as_mysql_job_start: called
[2019-10-09T16:15:21.939] debug3: DBD_JOB_START: cluster not registered
[2019-10-09T16:15:36.037] debug2: DBD_JOB_COMPLETE: ID:9
[2019-10-09T16:15:36.037] debug2: as_mysql_job_complete() called
[2019-10-09T16:15:36.039] debug3: DBD_JOB_COMPLETE: cluster not registered
[2019-10-09T16:15:36.085] debug2: DBD_STEP_COMPLETE: ID:9.0 SUBMIT:1570662917
[2019-10-09T16:15:36.087] debug3: DBD_STEP_COMPLETE: cluster not registered
[2019-10-09T16:16:36.861] debug2: DBD_CLUSTER_TRES: called for allennlp(1=12,2=1,3=0,4=1,5=12,6=0,7=0,8=0,1001=1)
[2019-10-09T16:16:36.863] debug3: DBD_CLUSTER_TRES: cluster not registered

When I attempt to point a job directly at the cluster, it also fails, saying that allennlp is not there, despite it showing up in the sacctmgr list cluster command:

$ srun -w it-test -M allennlp -G 1 --pty bash -i 
srun: error: 'allennlp' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.

Is there something obvious that I am missing here? Attaching my configs and db info.

Cheers
D
Comment 1 darrellp 2019-10-09 17:36:12 MDT
Created attachment 11892 [details]
gres.conf
Comment 2 darrellp 2019-10-09 17:36:37 MDT
Created attachment 11893 [details]
slurmdbd.conf
Comment 3 darrellp 2019-10-09 17:37:13 MDT
Created attachment 11894 [details]
database.output
Comment 4 Jason Booth 2019-10-10 09:13:27 MDT
Hi D - 

"sacctmgr list cluster" should list the connection which it does not in your case.

For example, on my cluster you can see the registration as an entry with ControlHost and ControlPort populated:


$ sacctmgr list clusters format=Cluster,ControlHost,ControlPort,RPC,Share,QOS
   Cluster     ControlHost  ControlPort   RPC     Share                  QOS 
---------- --------------- ------------ ----- --------- -------------------- 
   cluster       127.0.0.1         8817  8704         1               normal 




Please check that you are not using SELinux or a firewall that may be preventing connections to the database.
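A quick way to test those paths is a raw TCP connect from the relevant host. Below, util02:3306 comes from the slurmdbd logs in this ticket; 6819 is the stock slurmdbd port and is an assumption here (check DbdPort in slurmdbd.conf and AccountingStoragePort in slurm.conf for the actual values):

```shell
# Minimal reachability check using bash's /dev/tcp; no netcat required.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK: ${host}:${port} reachable"
  else
    echo "FAIL: ${host}:${port} unreachable"
  fi
}

check_port util02 3306    # slurmdbd host -> MySQL
check_port it-test 6819   # submit/controller host -> slurmdbd (assumed port)
```

A FAIL here with the daemon known to be up usually points at a firewall rule or SELinux policy in between.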


Is "AccountingStorageHost=it-test.corp.ai2.in" the same box that runs slurmctld?
Comment 5 darrellp 2019-10-10 10:46:11 MDT
There is nothing blocking the call as far as I can tell. I can see the traffic and the response in tcpdump (from both it-test and the dbhost) when making both sacct and sacctmgr calls, and I can log into the db from a mysql client.

darrellp@it-test ~ $ mysql -h util02 -u slurm -p 
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.7.26-0ubuntu0.16.04.1 (Ubuntu)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 


This is in addition to the fact that 'sacctmgr add cluster allennlp' altered the db. I do not believe that there is anything blocking traffic.

As for AccountingStorageHost, that is correct; however, it is using the FQDN, whereas other parts of the config use the short name. I changed it to the short name and restarted to see if that could be it (I have seen stranger things), but the same problem persists.

Thanks
D
Comment 6 Jason Booth 2019-10-10 11:38:06 MDT
Would you please set debug2 and bounce the slurmctld and then gather the slurmctld.log for us to review?

You should see entries such as:

 debug:  slurmdbd: Sent PersistInit msg
or

debug2: slurm_connect failed: Connection refused
debug2: Error connecting slurm stream socket at 127.0.0.1:7920: Connection refused
error: slurmdbd: Sending PersistInit msg: Connection refused

Please make sure to change back to a lower loglevel when finished.
Comment 7 darrellp 2019-10-10 12:03:04 MDT
Well that is annoying. After adding the debugging flag and restarting, it suddenly decided to work. All I did was add these two lines: 

SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log

Now, I have restarted everything in the past to no avail, but here is what I got just now:


darrellp@it-test /etc/slurm-llnl $ sudo vi slurm.conf 
darrellp@it-test /etc/slurm-llnl $ sudo systemctl restart slurmctld
darrellp@it-test /etc/slurm-llnl $ sudo systemctl restart slurmd
darrellp@it-test /etc/slurm-llnl $ sacct -a
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
darrellp@it-test /etc/slurm-llnl $ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
  allennlp       127.0.0.1         6817  8704         1                                                                                           normal           
darrellp@it-test /etc/slurm-llnl $ srun -w it-test -G 1 --pty bash -i 
darrellp@it-test /etc/slurm-llnl $ ls
gres.conf  slurm.conf  slurmdbd.conf
darrellp@it-test /etc/slurm-llnl $ exit
exit
darrellp@it-test /etc/slurm-llnl $ sacct -a
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
11                 bash   allennlp                     2  COMPLETED      0:0 



Is there any insight that you can derive from this? 

(attaching the debug log)
Comment 8 darrellp 2019-10-10 12:03:34 MDT
Created attachment 11908 [details]
slurmctld-debug.log
Comment 9 Jason Booth 2019-10-10 14:00:35 MDT
>Is there something obvious that I am missing here? Attaching my configs and db info  


Unfortunately, I only have the logs from after the most recent restart, so I cannot comment further on what happened in the past. The logs I have now look completely normal. If restarts did not work in the past but do now, then this may suggest some type of network communication issue between the two. Do you have a network admin who is pushing out security changes that might have affected this?
Comment 10 darrellp 2019-10-10 14:55:51 MDT
Nope. There have been no network changes during the time we have been working this bug (that team reports to me).

Cheers
D
Comment 11 Jason Booth 2019-10-10 15:06:18 MDT
If you have the controller logs from before the debug level was raised, I can review those to see if I can spot the disconnect; otherwise, we can move to close out this case.
Comment 12 darrellp 2019-10-11 12:31:38 MDT
We do, but it was dumping to syslog, so it is buried and likely incomplete. I think I have the config to go forward with in our allennlp cluster, and if I see it there, I will re-open. In the meantime, thank you, Jason.

Feel free to close this now. 

Cheers
D