| Summary: | DBD_JOB_COMPLETE: cluster not registered | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | slurmdbd | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 17.11.13 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DTU Physics | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Ole.H.Nielsen@fysik.dtu.dk
2019-02-13 13:51:41 MST
Hi Ole,

As you've found, I saw this behavior in bug 5761, though I'm not exactly sure what is causing it. I don't believe it's a serious problem, but I'll need to dive deeper.

How often are you seeing these errors, and do you know how to reliably reproduce it?

Thanks!
Michael

(In reply to Michael Hinton from comment #1)
> As you've found, I saw this behavior in bug 5761, though I'm not exactly
> sure what is causing it. I don't believe it's a serious problem, but I'll
> need to dive deeper.
>
> How often are you seeing these errors, and do you know how to reliably
> reproduce it?

The error message occurs frequently:

$ grep registered /var/log/slurm/slurmdbd.log | tail
[2018-12-11T15:14:21.839] DBD_JOB_COMPLETE: cluster not registered
[2019-01-02T09:16:05.905] DBD_JOB_COMPLETE: cluster not registered
[2019-01-03T08:42:50.165] DBD_JOB_COMPLETE: cluster not registered
[2019-01-16T16:37:52.204] DBD_JOB_COMPLETE: cluster not registered
[2019-01-27T09:36:49.378] DBD_JOB_COMPLETE: cluster not registered
[2019-02-11T09:22:44.341] DBD_JOB_COMPLETE: cluster not registered
[2019-02-11T09:30:54.914] DBD_JOB_COMPLETE: cluster not registered
[2019-02-13T14:40:55.213] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T08:54:16.075] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T09:02:12.743] DBD_JOB_COMPLETE: cluster not registered

It seems to occur simultaneously with restarts of slurmctld:

$ zcat /var/log/slurm/slurmctld.log-20190217.gz | grep 'started on cluster'
[2019-02-11T09:22:43.442] slurmctld version 17.11.13 started on cluster niflheim
[2019-02-11T09:30:53.585] slurmctld version 17.11.13 started on cluster niflheim
[2019-02-13T14:40:54.207] slurmctld version 17.11.13 started on cluster niflheim

$ grep 'started on cluster' /var/log/slurm/slurmctld.log*
/var/log/slurm/slurmctld.log:[2019-02-18T08:54:15.229] slurmctld version 17.11.13 started on cluster niflheim
/var/log/slurm/slurmctld.log:[2019-02-18T09:02:11.900] slurmctld version 17.11.13 started on cluster niflheim

Thanks, that's really helpful. I'll see what I can do to reproduce the issue.

Is it just DBD_JOB_COMPLETE, or do you also see it with DBD_JOB_START, like in bug 5761? What happens when you grep for "cluster not registered"?

(In reply to Michael Hinton from comment #3)
> Thanks, that's really helpful. I'll see what I can do to reproduce the
> issue.
>
> Is it just DBD_JOB_COMPLETE, or do you also see it with DBD_JOB_START, like
> in bug 5761? What happens when you grep for "cluster not registered"?

There's only that single DBD_JOB_COMPLETE line. Here are all lines from a time interval:

[2019-02-15T08:34:16.494] slurmdbd version 17.11.13 started
[2019-02-16T00:01:08.095] Warning: Note very large processing time from daily_rollup for niflheim: usec=67995918 began=00:00:00.099
[2019-02-17T13:00:10.909] Warning: Note very large processing time from daily_rollup for niflheim: usec=6245619 began=13:00:04.663
[2019-02-17T15:00:11.012] Warning: Note very large processing time from daily_rollup for niflheim: usec=6249969 began=15:00:04.762
[2019-02-17T18:00:09.450] Warning: Note very large processing time from daily_rollup for niflheim: usec=5654474 began=18:00:03.796
[2019-02-17T19:00:09.162] Warning: Note very large processing time from daily_rollup for niflheim: usec=5412176 began=19:00:03.750
[2019-02-18T08:54:16.075] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T09:02:12.743] DBD_JOB_COMPLETE: cluster not registered

I did grep for "cluster not registered" as shown above - what more should I do?

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #4)
> There's only that single DBD_JOB_COMPLETE line. Here are all lines from a
> time interval:

Thanks.

> I did grep for "cluster not registered" as shown above - what more should I
> do?

Whoops, you did indeed. My mistake.

I think that's enough for me to start with; if there is anything else I need, I'll let you know.

Thanks!
Michael

Hey Ole,
Here is the relevant code in 17.11, from _job_complete() in src/slurmdbd/proc_req.c:
	if (!slurmdbd_conn->conn->rem_port) {
		info("DBD_JOB_COMPLETE: cluster not registered");
		slurmdbd_conn->conn->rem_port =
			clusteracct_storage_g_register_disconn_ctld(
				slurmdbd_conn->db_conn,
				slurmdbd_conn->conn->rem_host);

		_add_registered_cluster(slurmdbd_conn);
	}
The message happens when slurmdbd senses a disconnected slurmctld (rem_port is null). When this happens, slurmdbd simply reconnects to slurmctld. This explains why you see this after every slurmctld restart.
I think this is normal behavior and is something you can just ignore. In 18.08, this message is actually changed from log level `info` to log level `debug3` so it won't confuse admins further. See commit https://github.com/SchedMD/slurm/commit/1b953033966550308856d0707b0fa159233831bd.
Let me know if that answers your questions.
Thanks,
-Michael
(In reply to Michael Hinton from comment #6)
> Hey Ole,
>
> Here is the relevant code in 17.11, from _job_complete() in
> src/slurmdbd/proc_req.c:
>
> 	if (!slurmdbd_conn->conn->rem_port) {
> 		info("DBD_JOB_COMPLETE: cluster not registered");
> 		slurmdbd_conn->conn->rem_port =
> 			clusteracct_storage_g_register_disconn_ctld(
> 				slurmdbd_conn->db_conn,
> 				slurmdbd_conn->conn->rem_host);
>
> 		_add_registered_cluster(slurmdbd_conn);
> 	}
>
> The message happens when slurmdbd senses a disconnected slurmctld (rem_port
> is null). When this happens, slurmdbd simply reconnects to slurmctld. This
> explains why you see this after every slurmctld restart.
>
> I think this is normal behavior and is something you can just ignore. In
> 18.08, this message is actually changed from log level `info` to log level
> `debug3` so it won't confuse admins further. See commit
> https://github.com/SchedMD/slurm/commit/1b953033966550308856d0707b0fa159233831bd.
>
> Let me know if that answers your questions.

Hi Michael, thanks for a perfect explanation! I'm relieved that this doesn't indicate something bad in our database. The solution of changing the log level sounds like a good idea. Please close this case for now.

Thanks,
Ole

Great! Closing.