Ticket 14147

Summary: slurmdbd.log - error: problem getting jobs for cluster xxxx
Product: Slurm Reporter: Patrick <phock>
Component: Database Assignee: Ben Glines <ben.glines>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.11.8   
Hardware: Linux   
OS: Linux   
Site: Goodyear

Description Patrick 2022-05-24 01:04:13 MDT
I noticed that slurmdbd.log for our Slurm setup in Luxembourg is filling up with lines like this:
 /var/log/slurm # tail slurmdbd.log
[2022-05-24T08:42:56.208] error: Problem getting jobs for cluster gica
[2022-05-24T08:42:56.321] error: Problem getting jobs for cluster gica
[2022-05-24T08:46:29.240] error: Problem getting jobs for cluster gica
[2022-05-24T08:46:29.359] error: Problem getting jobs for cluster gica
[2022-05-24T08:47:56.674] error: Problem getting jobs for cluster gica
[2022-05-24T08:47:56.793] error: Problem getting jobs for cluster gica
[2022-05-24T08:51:29.000] error: Problem getting jobs for cluster gica
[2022-05-24T08:51:29.111] error: Problem getting jobs for cluster gica

I'm not sure what is triggering these errors: the database contains only one cluster, "gicl", and the configuration files contain no reference to "gica":

 # /usr/local/slurm/bin/sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
      gicl     [ctrld]         6817  9216         1                                                                                           normal

I suspect this is a leftover in the database from the initial setup, when we may have tried to define multiple clusters. Could you please advise what may be triggering these errors in the log and how to correct them?
Thanks
Comment 1 Ben Glines 2022-05-25 13:39:18 MDT
Hi Patrick,

Could you please attach your slurm.conf, gres.conf, and cgroup.conf?

Also could you reply with the output for the following:
1. Output of sinfo
2. Output of sdiag
Comment 2 Ben Glines 2022-05-25 17:00:14 MDT
Based on the timestamps, these errors occur periodically, which suggests that a cron job, or some other script a user has set up, is requesting jobs for that cluster.
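
To make the periodicity concrete, here is a minimal Python sketch that measures the spacing between the errors quoted in the description. The helper names (`parse_timestamp`, `burst_starts`, `gaps`) and the one-second burst window are illustrative assumptions, not part of Slurm.

```python
from datetime import datetime

# Log lines copied from the slurmdbd.log excerpt in the description.
LOG_LINES = """\
[2022-05-24T08:42:56.208] error: Problem getting jobs for cluster gica
[2022-05-24T08:42:56.321] error: Problem getting jobs for cluster gica
[2022-05-24T08:46:29.240] error: Problem getting jobs for cluster gica
[2022-05-24T08:46:29.359] error: Problem getting jobs for cluster gica
[2022-05-24T08:47:56.674] error: Problem getting jobs for cluster gica
[2022-05-24T08:47:56.793] error: Problem getting jobs for cluster gica
[2022-05-24T08:51:29.000] error: Problem getting jobs for cluster gica
[2022-05-24T08:51:29.111] error: Problem getting jobs for cluster gica
""".splitlines()


def parse_timestamp(line):
    """Return the datetime found between the leading '[' and ']'."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")


def burst_starts(lines, window_s=1.0):
    """Keep only the first timestamp of each burst of errors.

    Errors logged within window_s seconds of the current burst start
    are treated as part of the same request (assumed threshold).
    """
    starts = []
    for ts in map(parse_timestamp, lines):
        if not starts or (ts - starts[-1]).total_seconds() > window_s:
            starts.append(ts)
    return starts


def gaps(starts, step=1):
    """Seconds between each burst and the burst `step` positions later."""
    return [round((b - a).total_seconds(), 3)
            for a, b in zip(starts, starts[step:])]


starts = burst_starts(LOG_LINES)
print("bursts:", len(starts))
print("gaps between consecutive bursts:", gaps(starts))
print("gaps between alternate bursts:", gaps(starts, step=2))
```

On this small sample the alternate-burst gaps come out close to 300 seconds, which would be consistent with two jobs that each run every five minutes at different offsets. Eight log lines are too few to prove that, so treat it as a hint about what to look for, not a conclusion.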

If you set DebugLevel=debug in slurmdbd.conf, you should be able to see which IP address the requests are coming from. Could you set that debug level and reply with the logs of these same requests for the "gica" cluster?
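
Once debug logging is on, the log grows quickly, so a small tally of client addresses may help. The following Python sketch counts anything that looks like an IPv4 address on lines containing "debug". The exact wording of slurmdbd's debug messages is not shown in this ticket, so the sample lines are made up for illustration (using the 192.0.2.0/24 documentation range), and `count_client_ips` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches dotted-quad tokens such as 192.0.2.10. It does not validate
# that each octet is <= 255; that is good enough for a quick tally.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def count_client_ips(lines):
    """Count IPv4-looking tokens on lines that mention 'debug'."""
    counts = Counter()
    for line in lines:
        if "debug" not in line:
            continue
        counts.update(IPV4.findall(line))
    return counts


# Made-up sample lines: the real slurmdbd debug format may differ.
SAMPLE = [
    "[2022-05-25T09:00:01.000] debug: hypothetical connection message IP:192.0.2.10",
    "[2022-05-25T09:00:01.113] error: Problem getting jobs for cluster gica",
    "[2022-05-25T09:03:34.000] debug: hypothetical connection message IP:192.0.2.20",
    "[2022-05-25T09:05:01.000] debug: hypothetical connection message IP:192.0.2.10",
]

for ip, n in count_client_ips(SAMPLE).most_common():
    print(ip, n)
```

To run it against the real log, pass an open file instead of `SAMPLE`, for example `count_client_ips(open("/var/log/slurm/slurmdbd.log"))`. An address that shows up shortly before each "gica" error is the host to inspect for cron entries or user scripts.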
Comment 3 Patrick 2022-05-27 02:36:29 MDT
Hello Ben,
Thank you very much for the information.
Using "DebugLevel=debug", I identified the IP address from which the incorrect queries were originating and found the scripts that were causing the error messages.
I will mark this bug as closed.
Thanks for your help.