| Summary: | Cluster problem with slurmdbd | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | darrellp |
| Component: | slurmdbd | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bart |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Allen AI | | |
| Attachments: | slurm.conf, gres.conf, database.output, slurmctld-debug.log | | |
Created attachment 11892 [details]
gres.conf
Created attachment 11893 [details]
slurmdbd.conf
Created attachment 11894 [details]
database.output
Hi D - "sacctmgr list cluster" should list the connection, which it does not in your case. For example, from my cluster you can see the registration as an entry with ControlHost and ControlPort:

$ sacctmgr list clusters format=Cluster,ControlHost,ControlPort,RPC,Share,QOS
   Cluster     ControlHost  ControlPort   RPC     Share                  QOS
---------- --------------- ------------ ----- --------- --------------------
   cluster       127.0.0.1         8817  8704         1               normal

Please check that you are not using SELinux or a firewall that may be preventing connections to the database. Is "AccountingStorageHost=it-test.corp.ai2.in" the same box that runs slurmctld?

There is nothing blocking the call as far as I can tell. I can see the traffic and the response in tcpdump (from both it-test and the dbhost) when making both sacct and sacctmgr calls, and I can log into the db from a mysql client.

darrellp@it-test ~ $ mysql -h util02 -u slurm -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.7.26-0ubuntu0.16.04.1 (Ubuntu)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

This is in addition to the fact that "sacctmgr add cluster allennlp" altered the db. I do not believe that there is anything blocking traffic.

As for AccountingStorageHost, that value is correct; however, it uses the FQDN while other parts of the config use the short name, so I changed it to the short name and restarted to see if that could be it (I have seen stranger things), but the same problem persists.

Thanks
D

Would you please set debug2, bounce slurmctld, and then gather the slurmctld.log for us to review?
You should see entries such as:

debug:  slurmdbd: Sent PersistInit msg

or:

debug2: slurm_connect failed: Connection refused
debug2: Error connecting slurm stream socket at 127.0.0.1:7920: Connection refused
error: slurmdbd: Sending PersistInit msg: Connection refused

Please make sure to change back to a lower log level when finished.

Well, that is annoying. After adding the debugging flag and restarting, it suddenly decided to work. All I did was add these two lines:
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
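As an aside, the controller's log level can also be raised at runtime with `scontrol setdebug`, which avoids a slurmctld restart (and avoids forgetting to lower it again in slurm.conf). A minimal sketch; the DRY_RUN guard is a hypothetical convenience of this wrapper, not a Slurm feature, so the function can be rehearsed on a box without Slurm installed:

```shell
# Raise (or lower) the slurmctld log level at runtime via
# `scontrol setdebug <level>`. With DRY_RUN=1 the command is only
# printed, not executed.
set_ctld_debug() {
  level="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "scontrol setdebug $level"
  else
    scontrol setdebug "$level"
  fi
}

# Typical use: raise, reproduce the problem, then drop back down:
#   set_ctld_debug debug2
#   ... reproduce and collect slurmctld.log ...
#   set_ctld_debug info
```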
Now, I have restarted everything in the past to no avail, but here is what I got just now:
darrellp@it-test /etc/slurm-llnl $ sudo vi slurm.conf
darrellp@it-test /etc/slurm-llnl $ sudo systemctl restart slurmctld
darrellp@it-test /etc/slurm-llnl $ sudo systemctl restart slurmd
darrellp@it-test /etc/slurm-llnl $ sacct -a
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
darrellp@it-test /etc/slurm-llnl $ sacctmgr list cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
allennlp 127.0.0.1 6817 8704 1 normal
darrellp@it-test /etc/slurm-llnl $ srun -w it-test -G 1 --pty bash -i
darrellp@it-test /etc/slurm-llnl $ ls
gres.conf slurm.conf slurmdbd.conf
darrellp@it-test /etc/slurm-llnl $ exit
exit
darrellp@it-test /etc/slurm-llnl $ sacct -a
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
11 bash allennlp 2 COMPLETED 0:0
Is there any insight that you can derive from this?
(attaching the debug log)
Created attachment 11908 [details]
slurmctld-debug.log
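The telling difference between the two listings in this ticket is the ControlHost/ControlPort column: empty (and 0) while the problem persisted, populated once registration succeeded. A small sketch of that check as a script (hypothetical helper; it assumes sacctmgr's documented -n/-P flags for headerless, pipe-delimited output):

```shell
# Reads `sacctmgr -nP list clusters format=Cluster,ControlHost,ControlPort`
# output on stdin and reports each cluster's registration state.
# An empty ControlHost (with ControlPort 0) means slurmctld has never
# registered with slurmdbd -- the symptom seen earlier in this ticket.
check_registered() {
  while IFS='|' read -r cluster host port; do
    if [ -n "$host" ] && [ "$port" != "0" ]; then
      echo "$cluster: registered at $host:$port"
    else
      echo "$cluster: NOT registered"
    fi
  done
}

# Usage:
#   sacctmgr -nP list clusters format=Cluster,ControlHost,ControlPort | check_registered
```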
>Is there something obvious that I am missing here? Attaching my configs and db info
Unfortunately I only have the logs from after the most recent restart so I can not comment further on what happened in the past. The logs I have now look completely normal. If restarts did not work in the past but do now then this may suggest some type of network communication issue between the two. Do you have a network admin that is pushing out security changes that might have affected this?
Nope. There have been no network changes during the time we have been working this bug (that team reports to me).

Cheers
D

If you have the controller logs from before the raise in debug level, then I can review those to see if I can spot the disconnect; otherwise, we can move to close out this case.

We do, but they were dumping to syslog, so they are buried and likely incomplete. I think I have the config to go forward in our allennlp cluster, and if I see the problem there, I will re-open. In the meantime, thank you Jason. Feel free to close this now.

Cheers
D
Created attachment 11891 [details]
slurm.conf

Folks, I am having some trouble getting accounting up and running and would like to ask for some adult guidance here. What I have is a single-node cluster with the stats below, reporting to a remote database.

Slurm version: 19.05.2
OS: Ubuntu 18.04.2
GRES: 1 Nvidia Titan V GPU

What I am seeing is that sacct can see the cluster I set up via sacctmgr, but slurmctld gets "cluster not registered" from slurmdbd when it tries to log data to it.

Here is the command I ran to add the cluster:

sacctmgr add cluster allennlp

When I list the clusters, it appears to be there:

darrellp@it-test ~ $ sudo sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
  allennlp                            0     0         1                                                                                           normal

When I run sacct -a, here is what I see in the slurmdbd logs. It appears to be reading from the allennlp tables:

[2019-10-09T16:14:54.670] debug2: Opened connection 10 from 172.16.3.83
[2019-10-09T16:14:54.671] debug:  REQUEST_PERSIST_INIT: CLUSTER:allennlp VERSION:8704 UID:2149 IP:172.16.3.83 CONN:10
[2019-10-09T16:14:54.671] debug2: acct_storage_p_get_connection: request new connection 1
[2019-10-09T16:14:54.671] debug2: Attempting to connect to util02:3306
[2019-10-09T16:14:54.717] debug2: DBD_GET_JOBS_COND: called
[2019-10-09T16:14:54.761] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2019-10-09T16:14:54.762] debug4: got 0 commits
[2019-10-09T16:14:54.762] debug2: persistent connection is closed
[2019-10-09T16:14:54.762] debug2: Closed connection 10 uid(2149)

However, when I run a job, it reports 'cluster not registered':

[2019-10-09T16:15:17.825] debug2: DBD_STEP_START: ID:9.0 NAME:bash SUBMIT:1570662917
[2019-10-09T16:15:17.826] debug4: We can't get a db_index for this combo, time_submit=1570662917 and id_job=9. We must not have heard about the start yet, no big deal, we will get one right after this.
[2019-10-09T16:15:17.827] debug2: as_mysql_job_start: called
[2019-10-09T16:15:17.829] debug3: DBD_STEP_START: cluster not registered
[2019-10-09T16:15:21.937] debug2: DBD_JOB_START: START CALL ID:9 NAME:bash INX:0
[2019-10-09T16:15:21.937] debug2: as_mysql_job_start: called
[2019-10-09T16:15:21.939] debug3: DBD_JOB_START: cluster not registered
[2019-10-09T16:15:36.037] debug2: DBD_JOB_COMPLETE: ID:9
[2019-10-09T16:15:36.037] debug2: as_mysql_job_complete() called
[2019-10-09T16:15:36.039] debug3: DBD_JOB_COMPLETE: cluster not registered
[2019-10-09T16:15:36.085] debug2: DBD_STEP_COMPLETE: ID:9.0 SUBMIT:1570662917
[2019-10-09T16:15:36.087] debug3: DBD_STEP_COMPLETE: cluster not registered
[2019-10-09T16:16:36.861] debug2: DBD_CLUSTER_TRES: called for allennlp(1=12,2=1,3=0,4=1,5=12,6=0,7=0,8=0,1001=1)
[2019-10-09T16:16:36.863] debug3: DBD_CLUSTER_TRES: cluster not registered

When I attempt to point a job directly at the cluster, it also fails, saying that allennlp is not there, despite it showing up in the sacctmgr list cluster command:

$ srun -w it-test -M allennlp -G 1 --pty bash -i
srun: error: 'allennlp' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.

Is there something obvious that I am missing here? Attaching my configs and db info.

Cheers
D
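For readers landing on this ticket later: one common cause of "cluster not registered" is a mismatch between ClusterName in slurm.conf and the name added with `sacctmgr add cluster`. A minimal sketch of that comparison (hypothetical helper; the config path in the usage note is an assumption):

```shell
# Print the ClusterName from a slurm.conf-style file. slurm.conf keys
# are case-insensitive and values follow "=", so lowercase the key
# before matching. The result can be compared against
# `sacctmgr -nP list clusters format=Cluster`.
cluster_name() {
  awk -F= 'tolower($1) == "clustername" { print $2; exit }' "$1"
}

# Usage:
#   [ "$(cluster_name /etc/slurm-llnl/slurm.conf)" = "allennlp" ] \
#     || echo "name mismatch"
```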