Ticket 6791

Summary: Federation unsynced prevents scheduling without warnings
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: Federation Assignee: Unassigned Developer <dev-unassigned>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: alex
Version: 18.08.6   
Hardware: Linux   
OS: Linux   
Site: NOAA Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: ORNL NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 19.05.0pre4
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---

Description Matt Ezell 2019-04-02 08:08:49 MDT
If there is a problem between federation siblings and they are unable to sync (for example, bug 6783), the regular scheduler and the backfill scheduler will both return without doing any work.  With my setup, I was unable to determine that this was the case; all I saw was no jobs starting.  'scontrol show fed' does not note an error that the sibling isn't synced and the local cluster won't schedule.  The only logs for this are at debug level:

[2019-03-31T19:05:24.891] debug:  sched: schedule() returning, federation siblings not synced yet

Since the production loglevel recommendation is (IIRC) info, most users would never see this.  Please consider moving the debug() statements to a more serious loglevel.
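
Until the log level is raised, one way to spot this condition is to scan a slurmctld log captured at debug level for the message quoted above. The following is a minimal sketch, not part of Slurm: the function name `find_unsynced_messages` is hypothetical, and the match assumes the message wording shown in this ticket, which may differ between Slurm versions.

```python
import re

# Assumption: the message text matches the line quoted in this ticket.
# The level prefix ("debug:") is left inside the message so the pattern
# still matches if the statement is later logged at another level.
NOT_SYNCED = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*federation siblings not synced yet.*)$"
)


def find_unsynced_messages(lines):
    """Return (timestamp, message) pairs for lines reporting unsynced siblings."""
    hits = []
    for line in lines:
        match = NOT_SYNCED.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("msg")))
    return hits


# First line is taken from this ticket; the second is a made-up filler line.
sample_log = [
    "[2019-03-31T19:05:24.891] debug:  sched: schedule() returning, federation siblings not synced yet",
    "[2019-03-31T19:05:25.000] debug:  unrelated message",
]

for timestamp, message in find_unsynced_messages(sample_log):
    print(timestamp, message)
```

In practice the lines would come from the slurmctld log file rather than a list, and this only helps if the controller is already logging at debug level.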

For monitoring, there should be another field to show if the state is synced, like:

# scontrol show fed
Federation: fed1
Self:       cluster1:192.168.0.2:6817 ID:1 FedState:ACTIVE Features:
Sibling:    cluster2:192.168.0.3:6817 ID:2 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:No
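
With a field like this, a monitoring check could be a small parser over the command output. The sketch below assumes the proposed `Synced:Yes/No` token exactly as written in the example above; the field that was eventually implemented may be named or formatted differently. The function names are hypothetical, and the script parses captured text rather than invoking `scontrol` itself.

```python
def parse_fed_line(line):
    """Split a 'Self:' or 'Sibling:' line into (label, cluster_name, fields)."""
    label, _, rest = line.partition(":")
    tokens = rest.split()
    # The first token is name:host:port; the remaining tokens are Key:Value pairs.
    name = tokens[0].split(":")[0] if tokens else ""
    fields = {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        fields[key] = value
    return label.strip(), name, fields


def unsynced_siblings(output):
    """Return names of siblings whose (proposed) Synced field is 'No'."""
    names = []
    for line in output.splitlines():
        if not line.startswith("Sibling:"):
            continue
        _, name, fields = parse_fed_line(line)
        if fields.get("Synced") == "No":
            names.append(name)
    return names


# Sample text copied from the proposed output in this ticket.
SAMPLE = """Federation: fed1
Self:       cluster1:192.168.0.2:6817 ID:1 FedState:ACTIVE Features:
Sibling:    cluster2:192.168.0.3:6817 ID:2 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:No
"""

print(unsynced_siblings(SAMPLE))
```

A sibling line without a `Synced` token is treated as not flagged, so the check stays quiet on versions that do not print the field.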
Comment 3 Jason Booth 2019-04-02 16:10:55 MDT
Hi Matt,

> For monitoring, there should be another field to show if the state is synced, like:
>
> # scontrol show fed
> Federation: fed1
> Self:       cluster1:192.168.0.2:6817 ID:1 FedState:ACTIVE Features:
> Sibling:    cluster2:192.168.0.3:6817 ID:2 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:No

I have converted this over to an enhancement since this will require additional work/code to accommodate this request. 



> Since the production loglevel recommendation is (IIRC) info, most users would never see this.  Please consider moving the debug() statements to a more serious loglevel.

Agreed. We will look into changing this.
Comment 7 Brian Christiansen 2019-04-29 09:59:54 MDT
This is added in:

https://github.com/SchedMD/slurm/commit/6d1f02d3b79c68f6d4fdca931d56db57992f2c03

Thanks!