If there is a problem between federation siblings and they are unable to sync (for example, bug 6783), the regular scheduler and the backfill scheduler will both return without doing any work. With my setup, I was unable to determine that this was the case; all I saw was no jobs starting. 'scontrol show fed' does not note an error that the sibling isn't synced and the local cluster won't schedule.

The only logs for this are at debug level:

[2019-03-31T19:05:24.891] debug: sched: schedule() returning, federation siblings not synced yet

Since the production loglevel recommendation is (IIRC) info, most users would never see this. Please consider moving the debug() statements to a more serious loglevel.

For monitoring, there should be another field to show if the state is synced, like:

# scontrol show fed
Federation: fed1
Self: cluster1:192.168.0.2:6817 ID:1 FedState:ACTIVE Features:
Sibling: cluster2:192.168.0.3:6817 ID:2 FedState:ACTIVE Features:
 PersistConnSend/Recv:Yes/Yes Synced:No
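
As a stopgap for monitoring, the condition can be detected by scanning the slurmctld log for the message quoted above. This is only a rough sketch and it assumes slurmctld is running at debug level, since the message is not emitted otherwise. The helper name `unsynced_timestamps` is made up for illustration and is not part of Slurm; the log location is site-specific and not assumed here.

```python
import re
from typing import Iterable, List

# Message text copied from the debug log line quoted in this report.
# The "debug:" prefix is deliberately not matched, so the check would
# still work if the message were moved to another log level.
NOT_SYNCED = re.compile(r"federation siblings not synced yet")

# Leading "[timestamp]" as seen in the quoted log line.
TIMESTAMP = re.compile(r"^\[(?P<ts>[^\]]+)\]")


def unsynced_timestamps(lines: Iterable[str]) -> List[str]:
    """Return the timestamps of log lines reporting unsynced siblings.

    Hypothetical helper: lines without a leading "[timestamp]" are still
    counted, with an empty string in place of the timestamp.
    """
    hits = []
    for line in lines:
        if NOT_SYNCED.search(line):
            match = TIMESTAMP.match(line)
            hits.append(match.group("ts") if match else "")
    return hits
```

A monitoring check could open the slurmctld log, pass the file object to `unsynced_timestamps`, and alert if the returned list is non-empty and its last entry is recent.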
Hi Matt,

>For monitoring, there should be another field to show if the state is synced,
>like:
># scontrol show fed
>Federation: fed1
>Self: cluster1:192.168.0.2:6817 ID:1 FedState:ACTIVE Features:
>Sibling: cluster2:192.168.0.3:6817 ID:2 FedState:ACTIVE Features:
>PersistConnSend/Recv:Yes/Yes Synced:No

I have converted this over to an enhancement since this will require additional work/code to accommodate this request.

> Since the production loglevel recommendation is (IIRC) info, most users would never see this. Please consider moving the debug() statements to a more serious loglevel.

Agreed. We will look into changing this.
This is added in:
https://github.com/SchedMD/slurm/commit/6d1f02d3b79c68f6d4fdca931d56db57992f2c03

Thanks!