Ticket 10922

Summary: slurmctld stops logging to syslog after "scontrol setdebug <loglevel>"
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: slurmctld
Assignee: Skyler Malinowski <skyler>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 20.11.3
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12249
Site: Stanford
Version Fixed: 21.08

Description Kilian Cavalotti 2021-02-22 11:33:21 MST
Hi,

Splitting this off from #10827:

-- 8< --------------------------------------
I also noticed an odd behavior when trying to change the logging level with "scontrol setdebug": it makes the daemon stop logging to syslog (the systemd journal). I typically use "journalctl -u slurmctld -f" to follow the controller logs, and when the logging level is changed with scontrol, logging stops and nothing is logged through journalctl anymore until the daemon is restarted.
The logging level used doesn't seem to make any difference, and the same thing doesn't happen when enabling or disabling DebugFlags. It also looks like a relatively recent behavior, as I don't recall seeing it in previous versions.
-- 8< --------------------------------------
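
In other words, the sequence that triggers it is basically this (just a sketch restating the commands above, using two terminals; the level passed to setdebug doesn't seem to matter):

-- 8< --------------------------------------
# terminal 1: follow the controller log through journald
journalctl -u slurmctld -f

# terminal 2: change the logging level on the fly
scontrol setdebug verbose
# -> nothing more shows up in terminal 1 from this point on

# setting it back to the slurm.conf default doesn't help
scontrol setdebug info

# only a reconfigure (or a daemon restart) restores syslog logging
scontrol reconfigure
-- 8< --------------------------------------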

I'm not sure how to best illustrate this, but here's the output slurmctld sent to syslog (recorded with journalctl):

-- 8< --------------------------------------
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_139(18853484) done
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_118(18853463) WEXITSTATUS 0
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_118(18853463) done
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_122(18853467) WEXITSTATUS 0
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_122(18853467) done
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: prolog_running_decr: Configuration for JobId=18848792_23(18853995) is complete
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _slurm_rpc_submit_batch_job: JobId=18854018 InitPrio=117471 usec=18755
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_108(18853453) WEXITSTATUS 0
Feb 22 10:23:18 sh03-sl01.int slurmctld[11472]: _job_complete: JobId=18852775_108(18853453) done
Feb 22 10:27:13 sh03-sl01.int slurmctld[11472]: topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
Feb 22 10:27:13 sh03-sl01.int slurmctld[11472]: restoring original state of nodes
Feb 22 10:27:13 sh03-sl01.int slurmctld[11472]: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 141 partitions
Feb 22 10:27:18 sh03-sl01.int slurmctld[11472]: SchedulerParameters=default_queue_depth=5000,max_rpc_cnt=128,max_sched_time=5,partition_job_depth=100,sched_max_job_start=0,sched_min_interval=5000000
-- 8< --------------------------------------


At 10:24, I ran "scontrol setdebug verbose", and the change was not recorded in the log. At 10:25, I ran "scontrol setdebug info" (which is the default loglevel configured in slurm.conf), and that change was not recorded either. Nothing was logged at all between the first "scontrol setdebug" and the "scontrol reconfigure" I ran at 10:27, which restored logging.

Job activity continued as usual during those 3 minutes, but nothing was recorded. I verified that syslog itself was still working by using `logger`, which successfully recorded messages in /var/log/messages.
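
The check itself was nothing fancy, roughly along these lines (the tag is just illustrative):

-- 8< --------------------------------------
# syslog itself is fine: this message does show up in /var/log/messages
logger -t syslog-check "test message while slurmctld is silent"
grep syslog-check /var/log/messages
-- 8< --------------------------------------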

Please let me know if I can provide any additional information.

Thanks!
--
Kilian
Comment 1 Skyler Malinowski 2021-02-23 15:38:26 MST
Hi Kilian,

I was able to reproduce the issue, have tracked down where it is, and am now working on a patch to address it.

Thanks,
Skyler
Comment 2 Kilian Cavalotti 2021-02-23 15:39:40 MST
Hi Skyler, 

(In reply to Skyler Malinowski from comment #1)
> I was able to reproduce the issue, have tracked down where the issue is, and
> now working on a patch to address it.

Excellent to hear, thanks so much for the update!

Cheers,
--
Kilian
Comment 9 Skyler Malinowski 2021-04-05 13:19:17 MDT
Hi Kilian,

Just wanted to give you an update. The patch has been created but is currently waiting for review. Internally, things have been busy around the release of 20.11.5. Now that it has shipped, I will try again to get this fix merged in for the next release window.

Thanks for being patient. Stay awesome! :)

Regards,
Skyler
Comment 10 Kilian Cavalotti 2021-04-05 13:50:55 MDT
Hi Skyler,


On Mon, Apr 5, 2021 at 12:19 PM <bugs@schedmd.com> wrote:
> Just wanted to give you an update. The patch has been created but currently is
> waiting for review. Internally, things had been busy around the release of
> 20.11.5. Now that it has shipped, I will be trying again to get this fix merged
> in for the next release window.

That's great, thanks a lot for the update!

Cheers,
--
Kilian
Comment 11 Skyler Malinowski 2021-05-17 09:11:09 MDT
Hi Kilian,

Quick update: still awaiting patch review.

Thank you for being patient.

-- Skyler
Comment 12 Kilian Cavalotti 2021-05-17 09:49:25 MDT
On Mon, May 17, 2021 at 8:11 AM <bugs@schedmd.com> wrote:
> Quick update: still awaiting patch review.

Thanks for the update!

Cheers,
--
Kilian
Comment 13 Skyler Malinowski 2021-06-14 14:48:39 MDT
Monthly update: still awaiting patch review. There is a backlog of patch reviews, including this one.

Thanks for being patient. As always, stay awesome! :)
Comment 14 Kilian Cavalotti 2021-06-14 14:55:16 MDT
On Mon, Jun 14, 2021 at 1:48 PM <bugs@schedmd.com> wrote:
> Monthly update: still awaiting patch review. There is a backlog of patch
> reviews including this one.
>
> Thanks for being patient. As always, stay awesome! :)

Thanks Skyler! Hope it gets to the review board soon. :)

Cheers,
--
Kilian
Comment 15 Kilian Cavalotti 2021-09-09 08:22:55 MDT
Hi Skyler,

I'm wondering if you have any update on the status of that patch? If it hasn't been reviewed yet, would it still be possible to post it here?

Thanks!
--
Kilian
Comment 28 Skyler Malinowski 2021-09-21 09:18:20 MDT
Hi Kilian,

Thanks for your patience yet again. The good news is that the patch has been reviewed and merged. Thanks for poking the beast (so to speak); it got things moving.

Below is the merge commit:
https://github.com/SchedMD/slurm/commit/26302293ac

Although this merge was for 21.08, the changes should also work with 20.11. Feel free to try it out or at least look forward to it in 21.08.
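
If you do want to try it on 20.11, something like this should work against a git checkout of the sources (just a sketch; I haven't checked whether the commit applies cleanly to the slurm-20.11 branch, so you may need to resolve conflicts by hand):

-- 8< --------------------------------------
# assuming a clone of https://github.com/SchedMD/slurm
git fetch origin
git checkout slurm-20.11
git cherry-pick 26302293ac
# then rebuild and restart slurmctld
-- 8< --------------------------------------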

Best,
Skyler

P.S. Stay awesome! :)
Comment 29 Kilian Cavalotti 2021-09-21 12:17:37 MDT
Hi Skyler, 

> Thanks for having patience yet again. Good news is the patch has been
> reviewed and merged. Thanks for poking the beast (so to speak), it got
> things moving.
> 
> Below is the merge commit:
> https://github.com/SchedMD/slurm/commit/26302293ac
> 
> Although this merge was for 21.08, the changes should also work with 20.11.
> Feel free to try it out or at least look forward to it in 21.08.

Excellent, thanks a lot for the update!
I'll give it a try on 20.11 as well and let you know if I have trouble.

Cheers,
--
Kilian