Ticket 6514

Summary: DBD_JOB_COMPLETE: cluster not registered
Product: Slurm Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmdbd    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.13   
Hardware: Linux   
OS: Linux   
Site: DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2019-02-13 13:51:41 MST
We're getting these warnings in slurmdbd.log:

[2019-02-11T09:30:53.550] slurmdbd version 17.11.13 started
[2019-02-11T09:30:54.914] DBD_JOB_COMPLETE: cluster not registered
[2019-02-13T14:40:54.093] error: Problem getting jobs for cluster
[2019-02-13T14:40:55.213] DBD_JOB_COMPLETE: cluster not registered

This is similar to bug 1050 and bug 5761 except that our configuration appears to be correct and consistent:

$ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
  niflheim    10.2.128.110         6817  8192         1                                                                                           normal

$ scontrol show config | grep ClusterName
ClusterName             = niflheim

How can the above warning (error?) arise in this case?  Is there some stale data in the database that triggers the message?

Thanks,
Ole
Comment 1 Michael Hinton 2019-02-19 09:11:22 MST
Hi Ole,

As you've found, I saw this behavior in bug 5761, though I'm not exactly sure what is causing it. I don't believe it's a serious problem, but I'll need to dive deeper.

How often are you seeing these errors, and do you know how to reproduce them reliably?

Thanks!
Michael
Comment 2 Ole.H.Nielsen@fysik.dtu.dk 2019-02-19 11:31:51 MST
(In reply to Michael Hinton from comment #1)
> As you've found, I saw this behavior in bug 5761, though I'm not exactly
> sure what is causing it. I don't believe it's a serious problem, but I'll
> need to dive deeper.
> 
> How often are you seeing these errors, and do you know how to reliably
> reproduce it?

The error message occurs frequently:

$ grep registered /var/log/slurm/slurmdbd.log | tail
[2018-12-11T15:14:21.839] DBD_JOB_COMPLETE: cluster not registered
[2019-01-02T09:16:05.905] DBD_JOB_COMPLETE: cluster not registered
[2019-01-03T08:42:50.165] DBD_JOB_COMPLETE: cluster not registered
[2019-01-16T16:37:52.204] DBD_JOB_COMPLETE: cluster not registered
[2019-01-27T09:36:49.378] DBD_JOB_COMPLETE: cluster not registered
[2019-02-11T09:22:44.341] DBD_JOB_COMPLETE: cluster not registered
[2019-02-11T09:30:54.914] DBD_JOB_COMPLETE: cluster not registered
[2019-02-13T14:40:55.213] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T08:54:16.075] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T09:02:12.743] DBD_JOB_COMPLETE: cluster not registered

It seems to occur simultaneously with restarts of slurmctld:

$ zcat /var/log/slurm/slurmctld.log-20190217.gz |  grep 'started on cluster'
[2019-02-11T09:22:43.442] slurmctld version 17.11.13 started on cluster niflheim
[2019-02-11T09:30:53.585] slurmctld version 17.11.13 started on cluster niflheim
[2019-02-13T14:40:54.207] slurmctld version 17.11.13 started on cluster niflheim

$ grep 'started on cluster' /var/log/slurm/slurmctld.log*
/var/log/slurm/slurmctld.log:[2019-02-18T08:54:15.229] slurmctld version 17.11.13 started on cluster niflheim
/var/log/slurm/slurmctld.log:[2019-02-18T09:02:11.900] slurmctld version 17.11.13 started on cluster niflheim
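The correlation can also be checked mechanically. Here is a minimal, self-contained sketch: the sample log lines are copied from above, while the /tmp paths and the match-to-the-minute heuristic are illustrative assumptions of mine, not anything Slurm provides. Point the two variables at the real logs to run it against a live site.

```shell
#!/bin/sh
# Check whether every "cluster not registered" message in slurmdbd.log
# coincides, to the minute, with a slurmctld restart.
# Sample data is embedded below so the sketch runs standalone.

DBD_LOG=/tmp/slurmdbd.sample
CTLD_LOG=/tmp/slurmctld.sample

cat > "$DBD_LOG" <<'EOF'
[2019-02-18T08:54:16.075] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T09:02:12.743] DBD_JOB_COMPLETE: cluster not registered
EOF

cat > "$CTLD_LOG" <<'EOF'
[2019-02-18T08:54:15.229] slurmctld version 17.11.13 started on cluster niflheim
[2019-02-18T09:02:11.900] slurmctld version 17.11.13 started on cluster niflheim
EOF

# Characters 2-17 of each log line hold the timestamp truncated to
# minute precision, e.g. "2019-02-18T08:54".  Intersect the two sets.
grep 'cluster not registered' "$DBD_LOG"  | cut -c2-17 | sort -u > /tmp/dbd.min
grep 'started on cluster'     "$CTLD_LOG" | cut -c2-17 | sort -u > /tmp/ctld.min
comm -12 /tmp/dbd.min /tmp/ctld.min | tee /tmp/matched.min
```

With the sample data, both "not registered" minutes match a restart minute, supporting the observation that the message follows slurmctld restarts.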
Comment 3 Michael Hinton 2019-02-19 11:43:28 MST
Thanks, that's really helpful. I'll see what I can do to reproduce the issue.

Is it just DBD_JOB_COMPLETE, or do you also see it with DBD_JOB_START, like in bug 5761? What happens when you grep for "cluster not registered"?
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2019-02-19 11:47:22 MST
(In reply to Michael Hinton from comment #3)
> Thanks, that's really helpful. I'll see what I can do to reproduce the issue.
> 
> Is it just DBD_JOB_COMPLETE, or do you also see it with DBD_JOB_START, like
> in bug 5761? What happens when you grep for "cluster not registered"?

There's only that single DBD_JOB_COMPLETE line.  Here are all lines from a time interval:

[2019-02-15T08:34:16.494] slurmdbd version 17.11.13 started
[2019-02-16T00:01:08.095] Warning: Note very large processing time from daily_rollup for niflheim: usec=67995918 began=00:00:00.099
[2019-02-17T13:00:10.909] Warning: Note very large processing time from daily_rollup for niflheim: usec=6245619 began=13:00:04.663
[2019-02-17T15:00:11.012] Warning: Note very large processing time from daily_rollup for niflheim: usec=6249969 began=15:00:04.762
[2019-02-17T18:00:09.450] Warning: Note very large processing time from daily_rollup for niflheim: usec=5654474 began=18:00:03.796
[2019-02-17T19:00:09.162] Warning: Note very large processing time from daily_rollup for niflheim: usec=5412176 began=19:00:03.750
[2019-02-18T08:54:16.075] DBD_JOB_COMPLETE: cluster not registered
[2019-02-18T09:02:12.743] DBD_JOB_COMPLETE: cluster not registered

I did grep for "cluster not registered" as shown above - what more should I do?
Comment 5 Michael Hinton 2019-02-19 12:01:50 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #4)
> There's only that single DBD_JOB_COMPLETE line.  Here are all lines from a
> time interval:
> 
> [2019-02-15T08:34:16.494] slurmdbd version 17.11.13 started
> [2019-02-16T00:01:08.095] Warning: Note very large processing time from
> daily_rollup for niflheim: usec=67995918 began=00:00:00.099
> [2019-02-17T13:00:10.909] Warning: Note very large processing time from
> daily_rollup for niflheim: usec=6245619 began=13:00:04.663
> [2019-02-17T15:00:11.012] Warning: Note very large processing time from
> daily_rollup for niflheim: usec=6249969 began=15:00:04.762
> [2019-02-17T18:00:09.450] Warning: Note very large processing time from
> daily_rollup for niflheim: usec=5654474 began=18:00:03.796
> [2019-02-17T19:00:09.162] Warning: Note very large processing time from
> daily_rollup for niflheim: usec=5412176 began=19:00:03.750
> [2019-02-18T08:54:16.075] DBD_JOB_COMPLETE: cluster not registered
> [2019-02-18T09:02:12.743] DBD_JOB_COMPLETE: cluster not registered
Thanks.

> I did grep for "cluster not registered" as shown above - what more should I
> do?
Whoops, you did indeed. My mistake.

I think that's enough for me to start with; if there is anything else I need, I'll let you know.

Thanks!
Michael
Comment 6 Michael Hinton 2019-02-25 11:02:44 MST
Hey Ole,

Here is the relevant code in 17.11, from _job_complete() in src/slurmdbd/proc_req.c:

	if (!slurmdbd_conn->conn->rem_port) {
		info("DBD_JOB_COMPLETE: cluster not registered");
		slurmdbd_conn->conn->rem_port =
			clusteracct_storage_g_register_disconn_ctld(
				slurmdbd_conn->db_conn,
				slurmdbd_conn->conn->rem_host);

		_add_registered_cluster(slurmdbd_conn);
	}

The message is logged when slurmdbd receives a job-completion record on a connection whose slurmctld is not currently registered (rem_port is unset). When this happens, slurmdbd simply re-registers the disconnected slurmctld, as the code above shows. This explains why you see the message after every slurmctld restart.

I think this is normal behavior and is something you can just ignore. In 18.08, this message is actually changed from log level `info` to log level `debug3` so it won't confuse admins further. See commit https://github.com/SchedMD/slurm/commit/1b953033966550308856d0707b0fa159233831bd.

Let me know if that answers your questions.

Thanks,
-Michael
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2019-02-25 11:58:36 MST
(In reply to Michael Hinton from comment #6)
> Hey Ole,
> 
> Here is the relevant code in 17.11, from
> src/slurmdbd/proc_req#_job_complete():
> 
> 	if (!slurmdbd_conn->conn->rem_port) {
> 		info("DBD_JOB_COMPLETE: cluster not registered");
> 		slurmdbd_conn->conn->rem_port =
> 			clusteracct_storage_g_register_disconn_ctld(
> 				slurmdbd_conn->db_conn,
> 				slurmdbd_conn->conn->rem_host);
> 
> 		_add_registered_cluster(slurmdbd_conn);
> 	}
> 
> The message happens when slurmdbd senses a disconnected slurmctld (rem_port
> is null). When this happens, slurmdbd simply reconnects to slurmctld. This
> explains why you see this after every slurmctld restart.
> 
> I think this is normal behavior and is something you can just ignore. In
> 18.08, this message is actually changed from log level `info` to log level
> `debug3` so it won't confuse admins further. See commit
> https://github.com/SchedMD/slurm/commit/
> 1b953033966550308856d0707b0fa159233831bd.
> 
> Let me know if that answers your questions.

Hi Michael, thanks for a perfect explanation!  I'm relieved that this doesn't indicate something bad in our database.  The solution of changing the log level sounds like a good idea.

Please close this case for now.

Thanks,
Ole
Comment 8 Michael Hinton 2019-02-25 12:00:16 MST
Great! Closing.