| Summary: | slurmdbd.log - error: problem getting jobs for cluster xxxx | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Patrick <phock> |
| Component: | Database | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Goodyear | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |

Hi Patrick,

Could you please attach your slurm.conf, gres.conf, and cgroup.conf? Also, could you reply with the output of the following:

1. Output of sinfo
2. Output of sdiag

Based on the timestamps of these errors, this looks like a periodic occurrence, which suggests that a crontab entry, or some script that a user has set up, is trying to get jobs from that database. If you set DebugLevel=debug for slurmdbd, you should be able to see which IP address the requests are coming from. Could you set that debug level and reply with logs of these same requests for the "gica" database?

Hello Ben,

Thank you very much for the information. Using "DebugLevel=debug" I was able to identify the IP address from which the incorrect queries were originating, and I have found the scripts that were causing the error messages. I will mark this bug as closed. Thanks for your help.

I noticed that the slurmdbd.log for our SLURM setup in Luxembourg is logging lots of lines like this:

    /var/log/slurm # tail slurmdbd.log
    [2022-05-24T08:42:56.208] error: Problem getting jobs for cluster gica
    [2022-05-24T08:42:56.321] error: Problem getting jobs for cluster gica
    [2022-05-24T08:46:29.240] error: Problem getting jobs for cluster gica
    [2022-05-24T08:46:29.359] error: Problem getting jobs for cluster gica
    [2022-05-24T08:47:56.674] error: Problem getting jobs for cluster gica
    [2022-05-24T08:47:56.793] error: Problem getting jobs for cluster gica
    [2022-05-24T08:51:29.000] error: Problem getting jobs for cluster gica
    [2022-05-24T08:51:29.111] error: Problem getting jobs for cluster gica

I'm not sure what is triggering these errors as we only have one cluster in the database which is "gicl", and there is no reference to "gica" in the configuration files:

    # /usr/local/slurm/bin/sacctmgr show cluster
       Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
    ---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
          gicl         [ctrld]         6817  9216         1                                                                                           normal

I'm wondering if this may be some leftover in the database from the initial setup where we may have tried to define multiple clusters at some time. Could you please advise what may be triggering these errors in the log and how to correct?

Thanks
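
The observation that these errors look periodic can be checked directly against the log excerpt. The following is a minimal Python sketch, not part of Slurm: it assumes the timestamp and message format shown in the excerpt, and the helper names `error_times` and `burst_starts`, as well as the one-second burst window, are hypothetical choices made for illustration.

```python
import re
from datetime import datetime

# Error lines copied verbatim from the slurmdbd.log excerpt in this report.
LOG = """\
[2022-05-24T08:42:56.208] error: Problem getting jobs for cluster gica
[2022-05-24T08:42:56.321] error: Problem getting jobs for cluster gica
[2022-05-24T08:46:29.240] error: Problem getting jobs for cluster gica
[2022-05-24T08:46:29.359] error: Problem getting jobs for cluster gica
[2022-05-24T08:47:56.674] error: Problem getting jobs for cluster gica
[2022-05-24T08:47:56.793] error: Problem getting jobs for cluster gica
[2022-05-24T08:51:29.000] error: Problem getting jobs for cluster gica
[2022-05-24T08:51:29.111] error: Problem getting jobs for cluster gica
"""

# Matches the "Problem getting jobs" error; captures timestamp and cluster name.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: Problem getting jobs for cluster (?P<cluster>\S+)$"
)


def error_times(text, cluster):
    """Return the timestamps of matching errors for one cluster name."""
    times = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("cluster") == cluster:
            times.append(datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"))
    return times


def burst_starts(times, window_s=1.0):
    """Merge errors logged within window_s of the previous one into one burst."""
    starts = []
    previous = None
    for t in sorted(times):
        if previous is None or (t - previous).total_seconds() > window_s:
            starts.append(t)
        previous = t
    return starts


if __name__ == "__main__":
    starts = burst_starts(error_times(LOG, "gica"))
    # Compare each burst with the burst two positions later.
    for a, b in zip(starts, starts[2:]):
        print(f"{a.time()} -> {b.time()}: {(b - a).total_seconds():.1f} s")
```

On this excerpt the errors arrive in pairs, and every second pair is roughly 300 seconds apart. That would be consistent with requests issued on a five-minute schedule from more than one source, although eight lines are too small a sample to be conclusive.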