| Summary: | Updated reservations filling up logs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Broadley <bill.broadley> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | ben.glines |
| Version: | 23.02.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NREL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Bill Broadley
2023-09-18 14:30:35 MDT
You could momentarily change your debug level to debug2, then search for a line in your logs looking something like this:
> slurmctld: debug2: Processing RPC: REQUEST_UPDATE_RESERVATION from UID=1000
Then you'll be able to see what user may be submitting such an RPC. (likely coming from `scontrol update reservation`)
Note that changing to debug2 will greatly increase the amount of logs that the controller will generate. Please make sure that you have enough storage to handle this, and just be careful to turn the debug level back to info once you're done.
1. `scontrol setdebug debug2`
2. Find RPC log for REQUEST_UPDATE_RESERVATION
3. `scontrol setdebug info`
You can also check sdiag to see how much time the controller is spending on these REQUEST_UPDATE_RESERVATION RPCs (I presume it's not that much time and probably not affecting performance)
I ran this:
$ scontrol setdebug debug2; date; sleep 600; scontrol setdebug info; date
Wed Sep 20 11:08:18 MDT 2023
Wed Sep 20 11:18:18 MDT 2023
From the logs for that period:
$ cat /var/log/slurm/slurmctld.log | grep -i reservation| awk ' { print $1 " " $2 " " $3 " " $4 }'
[2023-09-20T11:12:23.592] sched: Updated reservation=End_Of_Year
[2023-09-20T11:12:23.643] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:12:32.101] sched: Updated reservation=End_Of_Year
[2023-09-20T11:12:32.152] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:12:35.212] sched: Updated reservation=End_Of_Year
[2023-09-20T11:12:35.263] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:12:42.339] sched: Updated reservation=End_Of_Year
[2023-09-20T11:12:42.390] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:12:48.462] sched: Updated reservation=End_Of_Year
[2023-09-20T11:12:48.513] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:12:49.569] sched: Updated reservation=End_Of_Year
[2023-09-20T11:12:49.620] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:15:24.640] debug2: Updating reservations
[2023-09-20T11:17:52.618] sched: Updated reservation=End_Of_Year
[2023-09-20T11:17:52.673] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:17:57.749] sched: Updated reservation=End_Of_Year
[2023-09-20T11:17:57.800] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:18:00.862] sched: Updated reservation=End_Of_Year
[2023-09-20T11:18:00.914] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:18:11.255] sched: Updated reservation=End_Of_Year
[2023-09-20T11:18:11.308] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:18:17.385] sched: Updated reservation=End_Of_Year
[2023-09-20T11:18:17.437] sched: Updated reservation=SYSTIME-100423
[2023-09-20T11:18:18.492] sched: Updated reservation=End_Of_Year
[2023-09-20T11:18:18.543] sched: Updated reservation=SYSTIME-100423
So why do I get 12 updates in the logs, with zero scontrol update reservations?
I did find one call, but not for the above period:
$ cat /var/log/slurm/slurmctld.log | grep -i REQUEST_UPDATE_RESERVATION
[2023-09-20T10:53:48.545] debug2: Processing RPC: REQUEST_UPDATE_RESERVATION from UID=0
But not for the period in question. I'm seeing 1900 updates a day and from what I can tell nearly zero of them are from scontrol update reservations.
BTW, we have no automated system to create or edit reservations, it's all manual. We currently have 11 of them in the system. But only ReservationName=SYSTIME-100423 gets updated 1000s of times a day. Does slurm update the reservation every time a node changes state? > Does slurm update the reservation every time a node changes state?
Not normally unless you are using flags to replace the nodes. I have a few questions for you.
1. Regarding these logs:
> [2023-09-18T14:20:02.616] sched: Updated reservation=End_Of_Year
> [2023-09-18T14:20:02.667] sched: Updated reservation=SYSTIME-100423
Is there more to the log entry other than this? I ask because the update should list
more information such as:
> sched: Updated reservation=jason_2 users=jason nodes=n[1-5] cores=30 licenses=(null)
> tres=cpu=60 watts=4294967294 start=2023-09-25T14:57:16 end=2023-09-25T15:57:16
> MaxStartDelay= Comment=
2. Can you share the reservations with us "scontrol show res SYSTIME-100423" "scontrol show res End_Of_Year"?
May I get an update to comment#4? If this is no longer an issue for you, then please let me know, and I can close this issue out accordingly. |