| Summary: | squeue --steps always calls slurm_load_federation | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Thomas HAMEL <hmlth> |
| Component: | User Commands | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | dwightman, fabecassis, jblomqvist, lyeager |
| Version: | 18.08.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Debian |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Thomas,

Our system couldn't match your email address with a support contract. Could you please let me know which site you work for so we can match this request with a support contract? Once we can match this request with a support contract, the SchedMD support engineers can help you resolve this issue.

Thanks,
Jacob

I'm not sure we still have a direct support contract with SchedMD; I will open this with our provider. But I still think it's a real issue. I also noticed it with custom output and the -u flag:

    $ sdiag |grep -i fed_info
    REQUEST_FED_INFO ( 2049) count:1259 ave_time:428 total_time:538997
    $ squeue -o "%i %P %t %M %N" -u $USER
    JOBID PARTITION ST TIME NODELIST
    $ sdiag |grep -i fed_info
    REQUEST_FED_INFO ( 2049) count:1260 ave_time:428 total_time:539355
    $ squeue -o "%i %P %t %M %N" --local -u $USER
    JOBID PARTITION ST TIME NODELIST
    $ sdiag |grep -i fed_info
    REQUEST_FED_INFO ( 2049) count:1260 ave_time:428 total_time:539355
    [...]
    $ sdiag |grep -i fed_info
    REQUEST_FED_INFO ( 2049) count:1261 ave_time:428 total_time:539814
    $ squeue -o "%i %P %t %M %N"
    JOBID PARTITION ST TIME NODELIST
    $ sdiag |grep -i fed_info
    REQUEST_FED_INFO ( 2049) count:1261 ave_time:428 total_time:539814

Thanks for the detailed bug report, Thomas! At our site, we recently improved our RPC monitoring and noticed this same issue. We put in a hacky workaround (setting "SQUEUE_LOCAL=1" in /etc/environment), but we would love to see this get fixed at some point. It's sad to see how many unnecessary REQUEST_FED_INFO RPCs our scheduler was processing.
"squeue" and "squeue --steps" behaves differently. "squeue --steps" always triggers a REQUEST_FED_INFO RPC, and I don't think that the expected behavior without the "--federation" flag. On a cluster with federation disabled: ``` $ squeue -v ----------------------------- all = false array = false federation = false format = (null) iterate = 0 job_flag = 0 jobs = (null) licenses = (null) local = false names = (null) nodes = partitions = (null) priority = false reservation = (null) sibling = false sort = (null) start_flag = 0 states = (null) step_flag = 0 steps = (null) users = (null) verbose = 1 ----------------------------- Fri Jul 26 17:24:30 2019 last_update_time=1564154670 records=0 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) $ sdiag | grep -i fed; squeue --steps --local ; sdiag |grep -i fed REQUEST_FED_INFO ( 2049) count:31 ave_time:352 total_time:10942 STEPID NAME PARTITION USER TIME NODELIST REQUEST_FED_INFO ( 2049) count:31 ave_time:352 total_time:10942 $ sdiag | grep -i fed; squeue ; sdiag |grep -i fed REQUEST_FED_INFO ( 2049) count:31 ave_time:352 total_time:10942 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) REQUEST_FED_INFO ( 2049) count:31 ave_time:352 total_time:10942 $ sdiag | grep -i fed; squeue --federation ; sdiag |grep -i fed REQUEST_FED_INFO ( 2049) count:31 ave_time:352 total_time:10942 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) REQUEST_FED_INFO ( 2049) count:32 ave_time:350 total_time:11220 $ sdiag | grep -i fed; squeue --steps ; sdiag |grep -i fed REQUEST_FED_INFO ( 2049) count:32 ave_time:350 total_time:11220 STEPID NAME PARTITION USER TIME NODELIST REQUEST_FED_INFO ( 2049) count:33 ave_time:348 total_time:11498 ``` Some of our users have a heavy usage of the "--steps" flag, and it generates a very large number of useless RPC in the end. 
From a quick look at the code, the test seems to be correctly implemented in node_info.c:

```
if ((show_flags & SHOW_FEDERATION) && !(show_flags & SHOW_LOCAL) &&
    (slurm_load_federation(&ptr) == SLURM_SUCCESS) &&
```

https://github.com/SchedMD/slurm/blob/a1dd5f46b9cd9130f8c4db668eba5260fb2788af/src/api/node_info.c#L697

But not in job_step_info.c, where only SHOW_LOCAL is checked before calling slurm_load_federation:

```
if ((show_flags & SHOW_LOCAL) == 0) {
	if (slurm_load_federation(&ptr) ||
	    !cluster_in_federation(ptr, cluster_name)) {
```

https://github.com/scibian/slurm-llnl/blob/c5d5116d75061a3dcf5b549c5c83e80e0cb114a0/src/api/job_step_info.c#L498
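To make the difference concrete, here is a small sketch that models only the flag tests from the two excerpts above, with the actual RPC left out. The function names are hypothetical, and the flag values are placeholders defined locally for illustration; the real definitions live in slurm.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only (assumption); the
 * authoritative definitions are in slurm.h. */
#define SHOW_LOCAL      0x0010
#define SHOW_FEDERATION 0x0040

/* Models the flag test quoted from node_info.c: federation info is
 * requested only when SHOW_FEDERATION is set and SHOW_LOCAL is not. */
bool node_guard_queries_federation(uint16_t show_flags)
{
    return (show_flags & SHOW_FEDERATION) && !(show_flags & SHOW_LOCAL);
}

/* Models the flag test quoted from job_step_info.c: federation info is
 * requested whenever SHOW_LOCAL is not set, whether or not
 * SHOW_FEDERATION was asked for. */
bool step_guard_queries_federation(uint16_t show_flags)
{
    return (show_flags & SHOW_LOCAL) == 0;
}
```

With no flags set, which corresponds to a plain invocation on a cluster without federation, the first test is false and the second is true. That matches the extra REQUEST_FED_INFO seen for "squeue --steps" and its absence with "--local".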