21975 – High Agent Queue Size Stall

Ticket 21975 - High Agent Queue Size Stall

Summary: High Agent Queue Size Stall

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld (show other tickets)
Version:	24.11.1
Hardware:	Linux Linux

Severity:	3 - Medium Impact
Assignee:	Oscar Hernández
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2025-02-03 14:14 MST by Paul Edmon
Modified:	2025-03-11 08:21 MDT (History)
CC List:	1 user (show)

See Also:
Site:	Harvard University
Slinky Site:	---
Alineos Sites:	---
Atos/Eviden Sites:	---
Confidential Site:	---
Coreweave sites:	---
Cray Sites:	---
DS9 clusters:	---
Google sites:	---
HPCnow Sites:	---
HPE Sites:	---
IBM Sites:	---
NOAA SIte:	---
NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---
OCF Sites:	---
Recursion Pharma Sites:	---
SFW Sites:	---
SNIC sites:	---
Tzag Elita Sites:	---
Linux Distro:	---
Machine Name:
CLE Version:
Version Fixed:	24.11.4
Target Release:	---
DevPrio:	---
Emory-Cloud Sites:	---

Attachments
Current slurm.conf (69.32 KB, text/x-matlab) 2025-02-03 14:16 MST, Paul Edmon	Details
Current topology.conf (4.77 KB, text/x-matlab) 2025-02-03 14:16 MST, Paul Edmon	Details
Graph of agent queue and server threads (158.39 KB, image/png) 2025-02-03 14:16 MST, Paul Edmon	Details
Node state graph (157.02 KB, image/png) 2025-02-03 14:17 MST, Paul Edmon	Details
sdiag output during stall (24.22 KB, text/plain) 2025-02-03 14:26 MST, Paul Edmon	Details
Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00 (2.96 MB, application/x-compressed) 2025-02-04 08:26 MST, Paul Edmon	Details
holy7c12406 Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00 (26.91 KB, application/x-compressed) 2025-02-04 10:25 MST, Paul Edmon	Details
holy8a24507 Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00 (152.78 KB, application/x-compressed) 2025-02-04 10:27 MST, Paul Edmon	Details
Feb6_core.112406 (1.59 MB, application/x-gzip) 2025-02-06 11:52 MST, Jason Wells	Details
Log of gcore and gdb commands (14.94 KB, application/x-gzip) 2025-02-06 11:53 MST, Jason Wells	Details
2nd core - zip and then gzip to try to fit within size limit (287.23 MB, application/zip) 2025-02-06 11:55 MST, Jason Wells	Details
Thread count and agent queue size for the last two days, 2/5 and 2/6. (1.17 MB, image/png) 2025-02-07 08:09 MST, Jason Wells	Details
A directory of various logs for the last incident (6.08 MB, application/zip) 2025-02-07 11:13 MST, Jason Wells	Details
Yesterday's full logs (11.57 MB, application/x-gzip) 2025-02-07 11:20 MST, Jason Wells	Details
Plot Allocs per minute (17.10 KB, image/png) 2025-02-11 03:27 MST, Oscar Hernández	Details
Job Stats for 2/11/2025 (170.75 KB, image/png) 2025-02-11 07:46 MST, Paul Edmon	Details
Log for crash at 07:55 on 02-12-25 (75.18 KB, application/x-bzip2) 2025-02-12 09:24 MST, Paul Edmon	Details
Patch for segfault (1005 bytes, patch) 2025-02-12 10:11 MST, Oscar Hernández	Details \| Diff
Jobs started between 2-05-25 to 2-12-25 (164.40 KB, image/png) 2025-02-12 11:13 MST, Paul Edmon	Details
slurm.conf after tuning (69.74 KB, text/x-matlab) 2025-03-05 09:51 MST, Paul Edmon	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this ticket.

Description Paul Edmon 2025-02-03 14:14:55 MST

Similar to https://support.schedmd.com/show_bug.cgi?id=21847 We are seeing a high agent queue size stall. I was hoping the upgrade to 24.11.1 (which we did today) would alleviate this issue but unfortunately it has not. I've captured the sdiag when it was happening and have attached a graph of the agent size. You can also see a pile up in the completing jobs in another graph.

I tried to take a core dump but the signal I sent didn't generate one sadly. That said restarting the scheduler caused it flush so it is some state it is getting in where it stalls in dealing with traffic and then it just keeps piling up without flushing. I suspect this may be triggered when the scheduler hits max_rpc_cnt.

In summary the scheduler is up and running but stalls occasionally with this problem, a restart flushes the queue. If you send me the commands for reliably generating a core dump once it gets into this state again I can force one. I'm also happy to turn on whatever other debugging you feel you need or any suggestions you have for tweaking our config, which I attached.

Comment 1 Paul Edmon 2025-02-03 14:16:18 MST

Created attachment 40627 [details]
Current slurm.conf

Comment 2 Paul Edmon 2025-02-03 14:16:32 MST

Created attachment 40628 [details]
Current topology.conf

Comment 3 Paul Edmon 2025-02-03 14:16:48 MST

Created attachment 40629 [details]
Graph of agent queue and server threads

Comment 4 Paul Edmon 2025-02-03 14:17:47 MST

Created attachment 40630 [details]
Node state graph

Comment 5 Jason Booth 2025-02-03 14:24:15 MST

Would it be possible to also share the full output of 'sdiag' during one of these stacked agent queues?

Comment 6 Paul Edmon 2025-02-03 14:26:55 MST

Created attachment 40631 [details]
sdiag output during stall

Comment 7 Jason Booth 2025-02-03 15:24:55 MST

Next time this occurs please run the following gcore and gdb commands and attach the logs for that day as well.

>gcore $(pidof slurmctld)
>[...]
>warning: Memory read failed for corefile section, 4096 bytes at >0xffffffffff600000.
>Saved corefile core.182997
>[Inferior 1 (process 182997) detached]
>gdb -batch -ex 't a a bt f'  slurmctld ./core.182997

Comment 8 Paul Edmon 2025-02-04 07:39:54 MST

Yup will do. I've attached the sdiag while its in this state last night 
here: https://support.schedmd.com/attachment.cgi?id=40631

On 2/3/2025 5:24 PM, bugs@schedmd.com wrote:
> Jason Booth <mailto:jbooth@schedmd.com> changed ticket 21975 
> <https://support.schedmd.com/show_bug.cgi?id=21975>
> What 	Removed 	Added
> Assignee 	support@schedmd.com 	oscar.hernandez@schedmd.com
>
> *Comment # 7 <https://support.schedmd.com/show_bug.cgi?id=21975#c7> on 
> ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> from 
> Jason Booth <mailto:jbooth@schedmd.com> *
> Next time this occurs please run the following gcore and gdb commands and
> attach the logs for that day as well.
>
> >gcore $(pidof slurmctld) >[...] >warning: Memory read failed for corefile section, 4096 bytes 
> at >0xffffffffff600000. >Saved corefile core.182997 >[Inferior 1 
> (process 182997) detached] >gdb -batch -ex 't a a bt f' slurmctld 
> ./core.182997
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 9 Oscar Hernández 2025-02-04 08:18:10 MST

Hi Paul,

As Jason suggested, the backtrace will be really useful to see what is happening at that particular moment.

In the meantime. I have been looking at the sdiag. It draws my attention the fact that from the pending RPCs, half of them correspond to: SRUN_TIMEOUT.

>Pending RPC statistics
>        SRUN_TIMEOUT                            ( 7002) count:16721
>        REQUEST_TERMINATE_JOB                   ( 6011) count:2822
>        SRUN_JOB_COMPLETE                       ( 7004) count:1789
>        REQUEST_LAUNCH_PROLOG                   ( 6017) count:5100
>        REQUEST_BATCH_JOB_LAUNCH                ( 4005) count:5048
>        REQUEST_KILL_TIMELIMIT                  ( 6009) count:6298
>        REQUEST_KILL_PREEMPTED                  ( 6016) count:54
>        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:3
>        REQUEST_HEALTH_CHECK                    ( 1011) count:1
>        RESPONSE_RESOURCE_ALLOCATION            ( 4002) count:52
>        REQUEST_SIGNAL_TASKS                    ( 6004) count:10
>        SRUN_PING                               ( 7001) count:9
That would explain the peak in CG state jobs observed. Would you be able to send a compressed controller log from the same time you took the sdiag output?

Kind regards,
Oscar

Comment 10 Paul Edmon 2025-02-04 08:26:06 MST

Created attachment 40643 [details]
Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00

The stall in question ran from about 15:00 to 16:15 when the scheduler was restarted.

Comment 11 Paul Edmon 2025-02-04 08:27:02 MST

Yup just including it plus stuff before and after.

On 2/4/2025 10:18 AM, bugs@schedmd.com wrote:
>
> *Comment # 9 <https://support.schedmd.com/show_bug.cgi?id=21975#c9> on 
> ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> from 
> Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hi Paul,
>
> As Jason suggested, the backtrace will be really useful to see what is
> happening at that particular moment.
>
> In the meantime. I have been looking at the sdiag. It draws my attention the
> fact that from the pending RPCs, half of them correspond to: SRUN_TIMEOUT.
>
> >Pending RPC statistics > SRUN_TIMEOUT ( 7002) count:16721 > REQUEST_TERMINATE_JOB ( 6011) 
> count:2822 > SRUN_JOB_COMPLETE ( 7004) count:1789 > 
> REQUEST_LAUNCH_PROLOG ( 6017) count:5100 > REQUEST_BATCH_JOB_LAUNCH ( 
> 4005) count:5048 > REQUEST_KILL_TIMELIMIT ( 6009) count:6298 > 
> REQUEST_KILL_PREEMPTED ( 6016) count:54 > REQUEST_CANCEL_JOB_STEP ( 
> 5005) count:3 > REQUEST_HEALTH_CHECK ( 1011) count:1 > 
> RESPONSE_RESOURCE_ALLOCATION ( 4002) count:52 > REQUEST_SIGNAL_TASKS ( 
> 6004) count:10 > SRUN_PING ( 7001) count:9
> That would explain the peak in CG state jobs observed. Would you be able to
> send a compressed controller log from the same time you took the sdiag output?
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 12 Oscar Hernández 2025-02-04 09:15:55 MST

Thanks for that.

Seems that issues are related to a big jobarray that was submitted at 14:39 (job 1947577) which had many of its array memberst time-out at the same time.

>Feb 03 14:39:43 holy-slurm02.rc.fas.harvard.edu slurmctld[8536]: 
>_slurm_rpc_submit_batch_job: JobId=1947577 InitPrio=14386520 usec=4504
This seems to be a jobarray that has around 4k members, most of them being allocated between 14:39 and 15:00.

>$ grep "1947577" slurm-02-03-25.log | grep "]: sched: Alloca" | wc -l
>4044
Logs show how most of this jobs simultaneously timed-out starting at 15:09. Which seems to correspond with the CG nodes started to peak.

>$ grep "1947577" slurm-02-03-25.log | grep -i "]: Time" | wc -l
>2506
In fact at the same second at 15:09:49, 230 jobs timed out:

>grep "1947577" slurm-02-03-25.log | grep -i "]: Time" | grep "15:09:49" | wc -l
>230 
Will try to figure out what could have caused slurm to block there. But definitely seems related with the arrayjob.

Comment 13 Paul Edmon 2025-02-04 09:19:44 MST

Interesting. Here is the job script that was submitted, just for your 
reference. Let me know if you want me to pull any other job stats.

#!/bin/bash
#SBATCH 
--array=110-254,269-285,291,301,308,322-325,335-337,343-454,460-543,557-658,663-1512,1517-1606,1612-1778,1783,1813,1828-1994,1999-2016,2021-2044,2056-2063,2068-2080,2086-2143,2150-2151,2172,2195-2466,2475-2688,2693-2776,2802-2920,2929-2964,2969-3022,3028-3043,3048-3089,3095-3098,3105-3133,3140-3179,3189-3196,3214-3227,3240-3258,3264-3270,3278-3339,3353,3373-3376,3381-3392,3400-3404,3410,3417-3421,3484
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=00:30:00
#SBATCH -p serial_requeue
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<redacted>
#SBATCH 
--output=/n/holylabs/LABS/tkhanna_lab/Lab/ARD_stagetwo/logs/rerun_slurm_%A_%a.out

# Set Environmental Variables
export LAB_DIR=/n/holylabs/LABS/tkhanna_lab/Lab
export my_sif=/n/holylabs/LABS/tkhanna_lab/Lab/resources/rstudio4.sif
export R_LIBS_USER=/n/home03/halidaee/R/rstudiosif/4.5/

# Set computational parameters
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_THREADING_LAYER=GNU
export OPENBLAS_CORETYPE=Haswell

# Set simulation parameters
export simulations_per_index=20
export indices_per_triple=25

# Define environment variables to pass
vars=("LAB_DIR" "R_LIBS_USER" "OMP_NUM_THREADS" "MKL_NUM_THREADS" 
"OPENBLAS_NUM_THREADS"
       "MKL_THREADING_LAYER" "OPENBLAS_CORETYPE" "SLURM_CPUS_PER_TASK"
       "R_LIBS_USER" "SLURM_ARRAY_TASK_ID"
       "simulations_per_index" "indices_per_triple")

# Initialize ENV_VARS
ENV_VARS=""
for var in "${vars[@]}"; do
     ENV_VARS+="${var}=${!var},"
done
ENV_VARS=${ENV_VARS%,}

# Go to project directory
cd /n/holylabs/LABS/tkhanna_lab/Lab/ARD_stagetwo

# Run the analysis
singularity exec --cleanenv --env "$ENV_VARS" 
/n/holylabs/LABS/tkhanna_lab/Lab/resources/rstudio4.sif /usr/bin/zsh -c 
"source /n/home03/halidaee/.zshrc && Rscript --no-save --no-restore 
--verbose code/regressions/create_regressions_data.R > 
logs/rerun_regressions_job_array_output_${SLURM_ARRAY_TASK_ID}.log 2>&1"


On 2/4/2025 11:15 AM, bugs@schedmd.com wrote:
>
> *Comment # 12 <https://support.schedmd.com/show_bug.cgi?id=21975#c12> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Thanks for that.
>
> Seems that issues are related to a big jobarray that was submitted at 14:39
> (job 1947577) which had many of its array memberst time-out at the same time.
>
> >Feb 03 14:39:43 holy-slurm02.rc.fas.harvard.edu slurmctld[8536]: >_slurm_rpc_submit_batch_job: JobId=1947577 InitPrio=14386520 usec=4504
> This seems to be a jobarray that has around 4k members, most of them being
> allocated between 14:39 and 15:00.
>
> >$ grep "1947577" slurm-02-03-25.log | grep "]: sched: Alloca" | wc -l >4044
> Logs show how most of this jobs simultaneously timed-out starting at 15:09.
> Which seems to correspond with the CG nodes started to peak.
>
> >$ grep "1947577" slurm-02-03-25.log | grep -i "]: Time" | wc -l >2506
> In fact at the same second at 15:09:49, 230 jobs timed out:
>
> >grep "1947577" slurm-02-03-25.log | grep -i "]: Time" | grep "15:09:49" | wc -l >230
> Will try to figure out what could have caused slurm to block there. But
> definitely seems related with the arrayjob.
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 14 Oscar Hernández 2025-02-04 09:53:06 MST

>Interesting. Here is the job script that was submitted, just for your 
>reference. Let me know if you want me to pull any other job stats.
Thanks! It would also help if you could share the slurmd.log of some of the nodes involved in the same time-frame as the controller log shared. For example, node holy7c12406.

Comment 15 Paul Edmon 2025-02-04 10:25:02 MST

Created attachment 40644 [details]
holy7c12406 Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00

Comment 16 Paul Edmon 2025-02-04 10:27:15 MST

Created attachment 40645 [details]
holy8a24507 Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00

Comment 17 Paul Edmon 2025-02-04 10:28:04 MST

Sure. I've put in two logs, holy7c12406 and holy8a24507. Let me know if 
you need more.

-Paul Edmon-

On 2/4/2025 11:53 AM, bugs@schedmd.com wrote:
>
> *Comment # 14 <https://support.schedmd.com/show_bug.cgi?id=21975#c14> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> >Interesting. Here is the job script that was submitted, just for your >reference. Let me know if you want me to pull any other job stats.
> Thanks! It would also help if you could share the slurmd.log of some of the
> nodes involved in the same time-frame as the controller log shared. For
> example, node holy7c12406.
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 18 Oscar Hernández 2025-02-05 12:07:32 MST

Hi Paul,

Many thanks for the different logs and data provided, that helped me confirm my initial suspicion, the culprit seems to be the already mentioned SRUN_TIMEOUT RPC.

I believe I have been able to reproduce a similar behavior like yours in my test environment, by having around 500 jobs timing out at the same time. I can get my controller overloaded by RPCs which keep accumulating. However, eventually, after a few minutes, my controller is able to recover (jobs stay CG for a while though, and nodes appear as CG too).

In your case, things seem to indicate that it was a matter of having a sustained rate of jobs timing out during the problematic hour (between 15:00-16:00), which prevented the controller to get up to date with the RPC processing. In fact, I found another job array with 1k jobs timing out also at that time (1956880). 

#About the RPC that seems to be flooding the queue:

This SRUN_TIMEOUT RPC is issued around 3 times per job, to indicate srun that timelimit for it is approaching (is sent every 30s, starting 1min before hitting timelimit). Your conflicting array job did not have any SRUN inside it though, which was making me wonder why Slurm was issuing that many RPCs.

Turns out that, since you have "SlurmctldParameters=enable_stepmgr" active. The controller will send an RPC to the stepmgr so that it later handles the connection to all hosts (no matter if there is an srun involved or not). But this single connection to the stepmgr, in your unfortunate situation, was sent many, many times...

>        SRUN_TIMEOUT                            ( 7002) count:16721
>        REQUEST_TERMINATE_JOB                   ( 6011) count:2822
>        SRUN_JOB_COMPLETE                       ( 7004) count:1789
>        REQUEST_LAUNCH_PROLOG                   ( 6017) count:5100
>        REQUEST_BATCH_JOB_LAUNCH                ( 4005) count:5048
>        REQUEST_KILL_TIMELIMIT                  ( 6009) count:6298
On the compute node side of things, jobs were never notified of the actual timeout that forces the job kill (REQUEST_KILL_TIMELIMIT). You can see in the count above how the queue did not get to process these ones. With jobs and nodes being left in CG for a long time.

As you mentioned, the restart flushes the queue. So, once the controller is back online, all the SRUN_TIMEOUT are gone and the controller automatically recovers.

#How to prevent SRUN

As mentioned. This RPC is issued as consequence of having "SlurmctldParameters=enable_stepmgr". Without stepmgr enabled, this RPC won't be generated, so you should not suffer from this. 

Did you notice an improvement with the stepmgr option activated? Would it be an option to remove it?

As a note: Stepmgr is mainly beneficial with jobs submitting many steps, but arrayjobs won't see any benefit from activating it. Jobs that may use many steps, could still manually request it with the sbatch option "--stepmgr".

Meanwhile, I will check if I can improve the handling for SRUN_TIMEOUT, to prevent this flooding when so many jobs timeout with stepmgr enabled.

Hope that gives a bit of an insight of what happened.

Kind regards,

Oscar

Comment 19 Paul Edmon 2025-02-05 12:30:04 MST

Very interesting. As you have seen our cluster is very active with jobs 
coming and going constantly and jobs all exiting at once due to various 
issues. We do have a few users that launch tons of steps which is why I 
enabled the stepmgr, but we probably do not have enough to justify 
stepmgr if its creating the RPC load.

I will drop the stepmgr flag and we will see if that helps.

Just as a quick note, I go on vacation tomorrow so my colleague Jason 
Wells (who has cc'd himself on this ticket) will handle communications 
while I am out.

-Paul Edmon-

On 2/5/2025 2:07 PM, bugs@schedmd.com wrote:
>
> *Comment # 18 <https://support.schedmd.com/show_bug.cgi?id=21975#c18> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hi Paul,
>
> Many thanks for the different logs and data provided, that helped me confirm my
> initial suspicion, the culprit seems to be the already mentioned SRUN_TIMEOUT
> RPC.
>
> I believe I have been able to reproduce a similar behavior like yours in my
> test environment, by having around 500 jobs timing out at the same time. I can
> get my controller overloaded by RPCs which keep accumulating. However,
> eventually, after a few minutes, my controller is able to recover (jobs stay CG
> for a while though, and nodes appear as CG too).
>
> In your case, things seem to indicate that it was a matter of having a
> sustained rate of jobs timing out during the problematic hour (between
> 15:00-16:00), which prevented the controller to get up to date with the RPC
> processing. In fact, I found another job array with 1k jobs timing out also at
> that time (1956880).
>
> #About the RPC that seems to be flooding the queue:
>
> This SRUN_TIMEOUT RPC is issued around 3 times per job, to indicate srun that
> timelimit for it is approaching (is sent every 30s, starting 1min before
> hitting timelimit). Your conflicting array job did not have any SRUN inside it
> though, which was making me wonder why Slurm was issuing that many RPCs.
>
> Turns out that, since you have "SlurmctldParameters=enable_stepmgr" active. The
> controller will send an RPC to the stepmgr so that it later handles the
> connection to all hosts (no matter if there is an srun involved or not). But
> this single connection to the stepmgr, in your unfortunate situation, was sent
> many, many times...
>
> >        SRUN_TIMEOUT                            ( 7002) count:16721 > REQUEST_TERMINATE_JOB ( 6011) count:2822 > SRUN_JOB_COMPLETE ( 
> 7004) count:1789 > REQUEST_LAUNCH_PROLOG ( 6017) count:5100 > 
> REQUEST_BATCH_JOB_LAUNCH ( 4005) count:5048 > REQUEST_KILL_TIMELIMIT ( 
> 6009) count:6298
> On the compute node side of things, jobs were never notified of the actual
> timeout that forces the job kill (REQUEST_KILL_TIMELIMIT). You can see in the
> count above how the queue did not get to process these ones. With jobs and
> nodes being left in CG for a long time.
>
> As you mentioned, the restart flushes the queue. So, once the controller is
> back online, all the SRUN_TIMEOUT are gone and the controller automatically
> recovers.
>
> #How to prevent SRUN
>
> As mentioned. This RPC is issued as consequence of having
> "SlurmctldParameters=enable_stepmgr". Without stepmgr enabled, this RPC won't
> be generated, so you should not suffer from this.
>
> Did you notice an improvement with the stepmgr option activated? Would it be an
> option to remove it?
>
> As a note: Stepmgr is mainly beneficial with jobs submitting many steps, but
> arrayjobs won't see any benefit from activating it. Jobs that may use many
> steps, could still manually request it with the sbatch option "--stepmgr".
>
> Meanwhile, I will check if I can improve the handling for SRUN_TIMEOUT, to
> prevent this flooding when so many jobs timeout with stepmgr enabled.
>
> Hope that gives a bit of an insight of what happened.
>
> Kind regards,
>
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 20 Jason Wells 2025-02-06 11:32:43 MST

I'm unsure if the monitoring that we've setup is just giving us new visibility into the behavior of Slurm or if I have the gcore/gdb that you need. How can I send files to you all? I don't see an option.

Comment 21 Jason Wells 2025-02-06 11:52:18 MST

Created attachment 40703 [details]
Feb6_core.112406

1st core

Comment 22 Jason Wells 2025-02-06 11:53:11 MST

Created attachment 40704 [details]
Log of gcore and gdb commands

Termnial log of submitting the gcore and gdb commands

Comment 23 Jason Wells 2025-02-06 11:55:07 MST

Created attachment 40705 [details]
2nd core - zip and then gzip to try to fit within size limit

2nd core. Unzip and then ungzip.

Comment 24 Oscar Hernández 2025-02-07 00:54:46 MST

Hi Jason,

>I'm unsure if the monitoring that we've setup is just giving us new visibility 
>into the behavior of Slurm or if I have the gcore/gdb that you need. How can I 
>send files to you all? I don't see an option.
Apologies, I was already out and did not see the message yesterday. I see you managed to upload files :).

What did you observe when the gcore was generated? Was your cluster again with any jobs/nodes in CG state? Do you have the output of an sdiag run during that time-frame?

I will check the backtrace shared and let you know if I find anything interesting, but knowing the context a bit would help. Looks like you restarted the controller afterwards.

BTW, corfiles are not really useful for us (I cannot get backtraces without your original environment), so ther eis no need to upload them.  We are only interested in the backtrace you can generate from them.

When was the 2nc core generated? Could you also shre the output of:

>gdb -batch -ex 't a a bt f'  slurmctld ./core.182997
For that new core.

Thanks,
Oscar

Comment 25 Jason Wells 2025-02-07 07:18:37 MST

"Log of gcore and gdb commands" has the output of the commands. I just ran gdb on the second core and got the below output:

[root@holy-slurm02 ~]# gdb -batch -ex 't a a bt f'  slurmctld ./core.112406

warning: core file may not match specified executable file.
[New LWP 112406]
[New LWP 112407]
[New LWP 112408]
[New LWP 112409]
[New LWP 112411]
[New LWP 112412]
[New LWP 112413]
[New LWP 112414]
[New LWP 112415]
[New LWP 112416]
[New LWP 112417]
[New LWP 112418]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `slurmctld: slurmscriptd'.
#0  0x000014aab786cac1 in poll () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x14aab8d40740 (LWP 112406))]

Thread 12 (Thread 0x14aab613e700 (LWP 112418)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88d9cc4 <__func__.16181> "watch", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b1d8 <mgr+376>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 1, tv_usec = 22723475648608}
        tv2 = {tv_sec = 22723473029035, tv_usec = 22723473283840}
        tv_str = '\000' <repeats 19 times>
        delta_t = 22723473279423
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b1d8 <mgr+376>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88d9cc4 <__func__.16181> "watch") at events.c:100
No locals.
#3  0x000014aab889c3d7 in watch (arg=<optimized out>) at watch.c:1378
        __func__ = "watch"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 11 (Thread 0x14aab623f700 (LWP 112417)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 16194240, tv_usec = 15898016}
        tv2 = {tv_sec = 22723432804000, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 22723432803976
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf71ac0) at workers.c:226
        work = 0x0
        worker = 0xf71ac0
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 10 (Thread 0x14aab6340700 (LWP 112416)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 0, tv_usec = 15898016}
        tv2 = {tv_sec = 22723433856672, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf62e70) at workers.c:226
        work = 0x0
        worker = 0xf62e70
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 9 (Thread 0x14aab6441700 (LWP 112415)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 16135072, tv_usec = 15898016}
        tv2 = {tv_sec = 22723434909344, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 22723434909320
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf633a0) at workers.c:226
        work = 0x0
        worker = 0xf633a0
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 8 (Thread 0x14aab6542700 (LWP 112414)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 0, tv_usec = 15898016}
        tv2 = {tv_sec = 22723435962016, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf5e340) at workers.c:226
        work = 0x0
        worker = 0xf5e340
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 7 (Thread 0x14aab6643700 (LWP 112413)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 0, tv_usec = 15898016}
        tv2 = {tv_sec = 22723437014688, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf4a1f0) at workers.c:226
        work = 0x0
        worker = 0xf4a1f0
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x14aab6744700 (LWP 112412)):
#0  0x000014aab7879307 in epoll_wait () from /lib64/libc.so.6
No symbol table info available.
#1  0x000014aab8892fcc in _poll (caller=0x14aab88d9c90 <__func__.16117> "_poll_connections") at epoll.c:494
        nfds = -1
        rc = 0
        events_count = 301
        epoll = 8
        fd_count = 2
        events = 0xf2b5a0
        __func__ = "_poll"
#2  0x000014aab8896b85 in pollctl_poll (caller=caller@entry=0x14aab88d9c90 <__func__.16117> "_poll_connections") at polling.c:144
No locals.
#3  0x000014aab889868c in _poll_connections (conmgr_args=..., arg=<optimized out>) at watch.c:1116
        rc = <optimized out>
        __func__ = "_poll_connections"
#4  0x000014aab889cab9 in wrap_work (work=<optimized out>, work@entry=0x14aab0000c30) at work.c:181
        con = 0x0
        __func__ = "wrap_work"
#5  0x000014aab889d536 in _worker (arg=0xf44110) at workers.c:251
        work = 0x14aab0000c30
        worker = 0xf44110
        __func__ = "_worker"
#6  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x14aab6845700 (LWP 112411)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 0, tv_usec = 15898016}
        tv2 = {tv_sec = 22723439120032, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf5a4c0) at workers.c:226
        work = 0x0
        worker = 0xf5a4c0
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x14aab6946700 (LWP 112409)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 16160800, tv_usec = 15898016}
        tv2 = {tv_sec = 22723440172704, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 22723440172680
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf69820) at workers.c:226
        work = 0x0
        worker = 0xf69820
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x14aab6a47700 (LWP 112408)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 0, tv_usec = 15898016}
        tv2 = {tv_sec = 22723441225376, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf54040) at workers.c:226
        work = 0x0
        worker = 0xf54040
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x14aab8d3f700 (LWP 112407)):
#0  0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77
        err = <optimized out>
        tv1 = {tv_sec = 0, tv_usec = 15898016}
        tv2 = {tv_sec = 22723477892768, tv_usec = 15898056}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        __func__ = "_wait"
        err = <optimized out>
        err = <optimized out>
#2  event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100
No locals.
#3  0x000014aab889d4a0 in _worker (arg=0xf7c150) at workers.c:226
        work = 0x0
        worker = 0xf7c150
        __func__ = "_worker"
#4  0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5  0x000014aab77738d3 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x14aab8d40740 (LWP 112406)):
#0  0x000014aab786cac1 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x000014aab876fa8a in poll (__timeout=-1, __nfds=<optimized out>, __fds=0xf62440) at /usr/include/bits/poll2.h:38
No locals.
#2  _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0xf62440) at eio.c:352
        n = <optimized out>
        timeout = -1
        n = <optimized out>
        timeout = <optimized out>
#3  eio_handle_mainloop (eio=0xf618f0) at eio.c:315
        retval = 0
        pollfds = 0xf62440
        map = 0xf57870
        maxnfds = 1
        nfds = 2
        n = <optimized out>
        shutdown_time = <optimized out>
        __func__ = "eio_handle_mainloop"
#4  0x00000000004c27f3 in slurmscriptd_run_slurmscriptd (argc=argc@entry=2, argv=argv@entry=0x7ffd431735f8, binary_path=binary_path@entry=0x721dc0 <binary> "/usr/sbin/slurmctld") at slurmscriptd.c:1177
        i = <optimized out>
        rc = 0
        ack = 0
        __func__ = "slurmscriptd_run_slurmscriptd"
#5  0x00000000004332a4 in main (argc=2, argv=0x7ffd431735f8) at controller.c:619
        error_code = <optimized out>
        start = {tv_sec = 0, tv_usec = 0}
        now = {tv_sec = 22723454820352, tv_usec = 22723475683688}
        stat_buf = {st_dev = 22723475838398, st_ino = 110527148, st_nlink = 157536133, st_mode = 3098747161, st_uid = 5290, st_gid = 3098842544, __pad0 = 5290, st_rdev = 22723454810860, st_size = 0, st_blksize = 22723477909504, st_blocks = 1, st_atim = {tv_sec = 64313170774099024, tv_nsec = 22723477979696}, st_mtim = {tv_sec = 22723477899072, tv_nsec = 22723477979696}, st_ctim = {tv_sec = 64313170775482372, tv_nsec = 140725729047792}, __glibc_reserved = {22723475777776, 22723477979696, 22723475821647}}
        rlim = {rlim_cur = 1736415839, rlim_max = 22723475682544}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        prep_callbacks = {prolog_slurmctld = 0x498105 <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x4983ea <prep_epilog_slurmctld_callback>}
        backup_has_control = false
        slurmscriptd_mode = true
        conf_file = <optimized out>
        stepmgr_ops = {acct_db_conn = 0x0, active_feature_list = 0x0, job_list = 0x0, last_job_update = 0x725ef0 <last_job_update>, up_node_bitmap = 0x0, job_config_fini = 0x451ff8 <job_config_fini>, find_job_record = 0x44e2b1 <find_job_record>, find_job_array_rec = 0x44e510 <find_job_array_rec>, agent_queue_request = 0x42aba3 <agent_queue_request>, find_front_end_record = 0x4430d1 <find_front_end_record>}
        __func__ = "main"

Comment 26 Jason Wells 2025-02-07 07:20:56 MST

And what we are seeing now is the scheduler just pausing for long periods. I put an alert from Grafana in place that queries sdiag and we're getting alerts no about Grafana not finding a data source. That means sdiag is not working, which I have confirmed from the command line. It's happening several times a day.

Comment 27 Jason Wells 2025-02-07 07:22:44 MST

It does not require a restart to start working again though. So the change Paul made sort of improved things I guess. I just don't know if what we are seeing now is new behavior or not. We didn't have this monitoring in place before.

Comment 28 Oscar Hernández 2025-02-07 07:54:32 MST

I see. Having the controller not responding is not a good sign, if it fails responding to sdiag, I am worried it may also fail to respond to other user commands. 

AFAIK, the only change applied was to remove the "SlurmctldParameters=enable_stepmgr" option.

Do you have any sdiag output around the time you gathered the corefile? Or could you give any insights on how the agent queue has behaved since the change. I am thinking of some graphs like the ones Paul shared in comments 3 and 4.

Would you be able to share the controller logs of that same time-frame?

I want to make sure the change suggested did not make things worse, with the controller being more loaded now.

Kind regards,
Oscar

Comment 29 Jason Wells 2025-02-07 08:09:02 MST

Created attachment 40716 [details]
Thread count and agent queue size for the last two days, 2/5 and 2/6.

Agent queue size has largely recovered. We see peaks but they aren't > than a few hundred, rather than the 30,000 I saw before.

Comment 30 Jason Wells 2025-02-07 08:41:09 MST

Slurm logs outright stop when we get these sdiag alerts. It's uncanny.

Feb  7 06:40:08 holy-slurm02 slurmctld[3332960]: _job_complete: JobId=2741401_4383(2746279) done
Feb  7 06:40:08 holy-slurm02 slurmctld[3332960]: slurmctld: _job_complete: JobId=2741401_4871(2746821) WEXITSTATUS 0
Feb  7 06:46:51 holy-slurm02 slurmctld[3332960]: slurmctld: update_node: node holy7c20212 state set to ALLOCATED
Feb  7 06:46:51 holy-slurm02 slurmctld[3332960]: update_node: node holy7c20212 state set to ALLOCATED

6 minute gap.

Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5854117 began=08:30:03.867
Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: slurmctld: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted
Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted
Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: slurmctld: _pick_best_nodes: JobId=2640983 never runnable in partition unrestricted
Feb 7 08:37:06 holy-slurm02 slurmctld[3332960]: slurmctld: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted
Feb 7 08:37:06 holy-slurm02 slurmctld[3332960]: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted

7 minute gap.

Also, we see a lot of these:
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6511985 began=08:25:35.073
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3146523 began=08:25:38.439
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6863042 began=08:25:34.722
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6357665 began=08:25:35.229
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5204941 began=08:25:36.382
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3730445 began=08:25:37.857
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3123205 began=08:25:38.465
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5541292 began=08:25:36.047
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3423677 began=08:25:38.165
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6977701 began=08:25:34.611
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=7190986 began=08:25:34.398
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=7163827 began=08:25:34.425
Feb  7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5989352 began=08:25:35.600

We've increased max_rpc_cnt from 64 to 128 in the hopes that will help.

We also see a lot of these but maybe unrelated: error: Select plugin failed to set job resources, nodes

Comment 31 Oscar Hernández 2025-02-07 09:31:24 MST

I can see the agent queue improved. So seems the change had some impact in the agent queue. But I want to discard it did not have any other negative impact.

I have been studying the backtraces shared. In the first one, the controller has a thread holding the job_write lock in the quick scheduling loop. And I can see some RPCs waiting for the lock to be released. Apparently, things seem as expected, but since the corefile is just a snapshot, I cannot discard this quick scheduling loop taking too long.

According to the trace, the scheduling that holds the lock was started at the timestamp (1738864383). Do you appreciate any gap in the logs similar to the ones shown around this time.

Could you share some full extracts from the logs containing this gaps?

Many thanks for your collaboration with that!

Comment 32 Jason Wells 2025-02-07 11:13:26 MST

Created attachment 40718 [details]
A directory of various logs for the last incident

Full logs for today, two gdb outputs for a 12:47PM incident, and I got sdiag at the very start of that time as well.

Comment 33 Jason Wells 2025-02-07 11:20:04 MST

Created attachment 40719 [details]
Yesterday's full logs

So you can see all of yesterday's logs.

Comment 34 Jason Wells 2025-02-07 11:21:40 MST

For your request for yesterday's logs, look around 12:53:03.

Comment 35 Oscar Hernández 2025-02-07 13:55:33 MST

Thanks a lot. 

From the core shared from 12:47. I can see we are again in the scheduling loop holding the lock, but in a different function. The log shows the controller really busy having trouble to attend petitions... I will further analyze this on monday. 

But seems tuning something like sched_min_interval=1000000 could help alleviating a bit the load.


https://slurm.schedmd.com/slurm.conf.html#OPT_sched_min_interval=#

Comment 36 Oscar Hernández 2025-02-10 12:04:46 MST

Hi jason,

Just updating here a bit.

As mentioned, seems the backtrace is stuck in the fast_scheduling loop. In bt shared I can appreciate how the process holding the job_write lock is:

>Thread 73:
>#11 0x00000000004779f2 in _schedule (full_queue=false) at job_scheduler.c:1851
>sched_start = 1738950062
The timestamp seems to be about the same time we can appreciate a blank in the logs:

>Feb  7 12:41:00 holy-slurm02 slurmctld[2333625]: _job_complete: JobId=2814657_2973(2820545) WEXITSTATUS 0
>Feb  7 12:47:35 holy-slurm02 slurmctld[2333625]: slurmctld: error: Select plugin failed to set job resources, nodes
>Feb  7 12:47:35 holy-slurm02 slurmctld[2333625]: error: Select plugin failed to set job resources, nodes
>Feb  7 12:47:36 holy-slurm02 slurmctld[2333625]: slurmctld: _build_node_list: No nodes satisfy JobId=2808679 
>requirements in partition unrestricted
Do you still have the corefile around? Would you be able to get the following from the JobId from the corefile:

>#open gdb with:
>$ gdb slurmctld corefile
>$ t 73
>$ f 11
>$ p job_ptr->job_id
I would like to discard this job options being the responsible for the issue. Could you get the jobscript for that job id?

>But seems tuning something like sched_min_interval=1000000 could help alleviating a bit the load.
Also, about this. Your logs do show a very active scheduler. I would consider tuning sched_min_interval. thinking it better, the 1000000 (1s) suggested might be too much. But I would consider to start with something like 100000(0.1s) or so. The idea is to allow the controller to process RPCs more often, instead of being continually trying to schedule new jobs. 

Kind regards,
Oscar

Comment 37 Paul Edmon 2025-02-10 12:13:52 MST

Yeah, we have a lot of different users doing a lot of different things 
on the cluster. Things get quite chatty with all of the jobs going in 
and out and users pinging for status from the scheduler.

I will set the sched_min_interval to 100000 and see how it goes. Jason 
is going to pull the information you asked for from the core file and 
from the job and get that to you.

-Paul Edmon-

On 2/10/2025 2:04 PM, bugs@schedmd.com wrote:
>
> *Comment # 36 <https://support.schedmd.com/show_bug.cgi?id=21975#c36> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hi jason,
>
> Just updating here a bit.
>
> As mentioned, seems the backtrace is stuck in the fast_scheduling loop. In bt
> shared I can appreciate how the process holding the job_write lock is:
>
> >Thread 73: >#11 0x00000000004779f2 in _schedule (full_queue=false) at 
> job_scheduler.c:1851 >sched_start = 1738950062
> The timestamp seems to be about the same time we can appreciate a blank in the
> logs:
>
> >Feb  7 12:41:00 holy-slurm02 slurmctld[2333625]: _job_complete: JobId=2814657_2973(2820545) WEXITSTATUS 0 >Feb 7 12:47:35 holy-slurm02 slurmctld[2333625]: slurmctld: error: 
> Select plugin failed to set job resources, nodes >Feb 7 12:47:35 
> holy-slurm02 slurmctld[2333625]: error: Select plugin failed to set 
> job resources, nodes >Feb 7 12:47:36 holy-slurm02 slurmctld[2333625]: 
> slurmctld: _build_node_list: No nodes satisfy JobId=2808679 
> >requirements in partition unrestricted
> Do you still have the corefile around? Would you be able to get the following
> from the JobId from the corefile:
>
> >#open gdb with: >$ gdb slurmctld corefile >$ t 73 >$ f 11 >$ p job_ptr->job_id
> I would like to discard this job options being the responsible for the issue.
> Could you get the jobscript for that job id?
>
> >But seems tuning something like sched_min_interval=1000000 could help alleviating a bit the load.
> Also, about this. Your logs do show a very active scheduler. I would consider
> tuning sched_min_interval. thinking it better, the 1000000 (1s) suggested might
> be too much. But I would consider to start with something like 100000(0.1s) or
> so. The idea is to allow the controller to process RPCs more often, instead of
> being continually trying to schedule new jobs.
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 38 Jason Wells 2025-02-10 12:32:57 MST

Dear Oscar,

I kept the core files around:
(gdb) t 73
[Switching to thread 73 (Thread 0x14b16c265700 (LWP 2333722))]
#0  0x000014b1938908be in bit_size (b=b@entry=0x14abd8975840) at bitstring.c:384
384	bitstring.c: No such file or directory.
(gdb) f 11
#11 0x00000000004779f2 in _schedule (full_queue=false) at job_scheduler.c:1851
1851	job_scheduler.c: No such file or directory.
(gdb) p job_ptr->job_id
$1 = 2810918

# sacct -B -j 2810918
Batch Script for 2810918
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p edwards,shared
#SBATCH -c 16
#SBATCH -t 2-12:00
#SBATCH -o trim_galore_second_round_RNA_outgroup_%j.out
#SBATCH -e trim_galore_second_round_RNA_outgroup_%j.err 
#SBATCH --mem=150000
#SBATCH --mail-type=END


output_directory="<redacted>"
input_directory="<redacted>"

cat filenames_no_extension.txt |
while read LINE;
do 
trim_galore \
--paired \
-j 1 --retain_unpaired \
--phred33 \
--output_dir \
${output_directory} \
--length 35 -q 0 \
--stringency 5 \
${input_directory}/${LINE}_R1_001.fastq.gz ${input_directory}/${LINE}_R2_001.fastq.gz
done;

Comment 39 Oscar Hernández 2025-02-11 03:27:23 MST

Created attachment 40746 [details]
Plot Allocs per minute

Thanks Jason, Paul,

Looking at the job Jason shared, seems a pretty simple job and it was submitted a couple of hours before the controller blank in the logs, so it does not seem it being directly the cause of any slowdown.

So, the thing is that both backtraces we got show the controller currently doing work in fast scheduling cycle. But both of them are in different spot, so I cannot say for sure we get stuck in any function there. This spot is the most frequently run, so chances of catching it are high, but the fact that there is no logging made during that time make me think that either: 1 - We are taking to long to run fast scheduling sometimes 2 - We run this function many times leaving no time to process RPCs in between. 

The suggested parameter of sched_min_interval aims to help in situation 2. About 1, maybe some other backtrace could help. Just to discard things, is there any chance controller could be swapping? That is a long shot but could explain a bit the behavior observed.

Just for analysis. I plotted the submissions/Allocations registered in the shared log per minute (will leave it attached). It is interesting that the periods the controller stops responding, usually match with the peaks in which controller allocates aprox 1k jobs/min.

Also, I do not see any clear relation with the removal of "enable_stepmgr" option and this non-responsive controller periods. Were you aware of these before the changes? How have things gone since the sched_min_interval change, any other relevant blank/nin-responsive moments?

Kind regards,
Oscar

Comment 40 Paul Edmon 2025-02-11 07:43:31 MST

Yeah, that graph matches our internal graphs of sdiag which show the 
same trends.

To start with the last thing first, removing the enable_stepmgr looks to 
alleviate the agent queue build up issue. That has not reoccurred since 
we dropped that. So it appears that we have two different issues. The 
initial presenting issue was the agent queue build up which was what 
caused this ticket.  The second is the scheduler being overwhelmed and 
dropping traffic. The second appears to be new and may be coincident 
with our upgrade from 24.05.4 to 24.11.1. This may be a result of the 
new conmgr set up, we have it all set to defaults and it may be time to 
start tweaking those.

As to what is going on, I think we may be getting into a thrashing 
situation. I have seen it happen before that the scheduler will get in a 
state where jobs exit as fast as they enter. Thus the scheduler ends up 
effectively filling a sinking boat. Previously this happened when users 
submitted jobs that ran for less than 10 minutes. Effectively what 
happened in that situation is that the same job was scheduled and then 
exited immediately thus causing an RPC storm and also the scheduler 
trying to grab a new job to allocate. Previously this was solved by 
either going to defer mode or using max_rpc_cnt. With the new conmgr we 
haven't seen the thread count get above 64 and our max_rpc_cnt is set to 
128 so I don't think we are even hitting the throttler. We also have 
encouraged users to submit jobs for longer than 10 minutes as otherwise 
the scheduler is doing more work than the job, but we haven't set a hard 
limit and even if we did users may lie and send in jobs that are set for 
20 minutes but then exit immediately due to the job actually being short.

I get the feeling we are in this thrashing regime again where the 
scheduler is just getting overwhelmed with traffic and decides to go 
head down (so to speak) and focus on scheduling and ignore other 
requests. Eventually it gets so stuck that it wedges itself.

That's my theory at least given what you have just explained. As to how 
to deal with it, we have set the sched_min_interval to 100000 as you 
have suggested and we haven't seen any issues yet. Perhaps there are 
more tweaks to conmgr to give it more horsepower. Here are the hardware 
details of our slurm master:

[root@holy-slurm02 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
Stepping:            7
CPU MHz:             2900.000
BogoMIPS:            5800.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr 
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts 
rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq 
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid 
dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx 
f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 
invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced 
tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 
avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx 
smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec 
xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm 
ida arat pln pts hwp_epp pku ospke avx512_vnni md_clear flush_l1d 
arch_capabilities

[root@holy-slurm02 ~]# free -g
               total        used        free      shared buff/cache   
available
Mem:            187         117          15           3 54          64
Swap:             0           0           0

You can see we have Swap turned off, so no swapping should be occurring. 
We also have SSD's on the backend for storage. So the system should have 
plenty of horsepower. It may be that we need to tweak the settings to 
allow it to utilize it.

By the way as reference, we have the slurmdbd running on this host as 
well. So latency to the DB shouldn't be an issue.

-Paul Edmon-

On 2/11/2025 5:27 AM, bugs@schedmd.com wrote:
>
> *Comment # 39 <https://support.schedmd.com/show_bug.cgi?id=21975#c39> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Createdattachment 40746 <attachment.cgi?id=40746> [details] 
> <attachment.cgi?id=40746&action=edit>
> Plot Allocs per minute
>
> Thanks Jason, Paul,
>
> Looking at the job Jason shared, seems a pretty simple job and it was submitted
> a couple of hours before the controller blank in the logs, so it does not seem
> it being directly the cause of any slowdown.
>
> So, the thing is that both backtraces we got show the controller currently
> doing work in fast scheduling cycle. But both of them are in different spot, so
> I cannot say for sure we get stuck in any function there. This spot is the most
> frequently run, so chances of catching it are high, but the fact that there is
> no logging made during that time make me think that either: 1 - We are taking
> to long to run fast scheduling sometimes 2 - We run this function many times
> leaving no time to process RPCs in between.
>
> The suggested parameter of sched_min_interval aims to help in situation 2.
> About 1, maybe some other backtrace could help. Just to discard things, is
> there any chance controller could be swapping? That is a long shot but could
> explain a bit the behavior observed.
>
> Just for analysis. I plotted the submissions/Allocations registered in the
> shared log per minute (will leave it attached). It is interesting that the
> periods the controller stops responding, usually match with the peaks in which
> controller allocates aprox 1k jobs/min.
>
> Also, I do not see any clear relation with the removal of "enable_stepmgr"
> option and this non-responsive controller periods. Were you aware of these
> before the changes? How have things gone since the sched_min_interval change,
> any other relevant blank/nin-responsive moments?
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 41 Paul Edmon 2025-02-11 07:46:35 MST

Created attachment 40747 [details]
Job Stats for 2/11/2025

Just adding this for comparison against the plot that Oscar made. This is from sdiag.

Comment 42 Oscar Hernández 2025-02-11 12:15:40 MST

>To start with the last thing first, removing the enable_stepmgr looks to 
>alleviate the agent queue build up issue. That has not reoccurred since 
>we dropped that. So it appears that we have two different issues. The 
>initial presenting issue was the agent queue build up which was what 
>caused this ticket.
Great. Yes, they look like 2 different issues to me. So, good thing we got the agent queue issues sorted out.

Now, about the overwhelmed controller. You are right here, with the introduction of conmgr some relevant changes have been made to the thread handling. Now, the controller will have a fixed pool of threads (instead of continually creating and destroying threads), which will attend petitions. The default value for this pool of threads is 64, configurable via the parameter SlurmCtldParameter=conmgr_threads[1].

>With the new conmgr we haven't seen the thread count get above 64 and our 
>max_rpc_cnt is set to 128 so I don't think we are even hitting the 
>throttler.
So, effectively, max_threads will not go over 64 with the throttler never actually taking place. Though, both backfill and main schedulers should yield locks anyway based on timers. With main scheduling loop never taking more than 2s, and bf yielding locks based on the defaults:

bf_yield_interval[2] -> 2s
bf_yield_sleep[3]    -> 0.5s

With the traces seen, problem seems to be lock contention rather than concurrency. So, for further tuning maybe you could try some value < 64 for max_rpc_cnt, to force the throttling more often. Or, maybe a better option could be to increase conmgr_threads (128?) and set max_rpc_cnt at 64. 

But you will need to test how these options adapt to your cluster. Because the first scenario could lead to a continuous deferral of scheduling to process RPCs, while the second could overload the controller by having way many threads than actual cores. I will appreciate any feedback you can give if you decide to play with this values.

>Just adding this for comparison against the plot that Oscar made. This is 
>from sdiag
In any case, I have seen the graphs and seems a few 1k peaks happened with no issues, which is good news. Seems changes helped a bit, so maybe we can keep an eye with the current params to see if the situation repeats.

>I get the feeling we are in this thrashing regime again where the
>scheduler is just getting overwhelmed with traffic and decides to go 
>head down (so to speak) and focus on scheduling and ignore other 
>requests. Eventually it gets so stuck that it wedges itself.
Kind of the impression I got, hopefully, changes made help alleviate the situation. Changing debuglevel to debug could help to understand if something in the scheduling logic happens during the blanks. But logs will grow pretty fast with your cluster activity.

Kind regards,
Oscar

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_conmgr_threads
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_interval=#
[3]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_sleep=#

Comment 43 Paul Edmon 2025-02-11 12:22:10 MST

I'm inclined to go conmgr threads at 128 and then max_rpc_cnt to 64. 
That way we get the throttling and have enough head room in threads to 
handle additional traffic.  That may also allow us to remove the 
throttle for the sched_min_interval. Worth experimenting a bit to see 
what happens.

I will start with conmgr to 128 and max_rpc_cnt to 64 and watch it for a 
bit to see how that behaves.

-Paul Edmon-

On 2/11/2025 2:15 PM, bugs@schedmd.com wrote:
>
> *Comment # 42 <https://support.schedmd.com/show_bug.cgi?id=21975#c42> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> >To start with the last thing first, removing the enable_stepmgr looks to >alleviate the agent queue build up issue. That has not reoccurred 
> since >we dropped that. So it appears that we have two different 
> issues. The >initial presenting issue was the agent queue build up 
> which was what >caused this ticket.
> Great. Yes, they look like 2 different issues to me. So, good thing we got the
> agent queue issues sorted out.
>
> Now, about the overwhelmed controller. You are right here, with the
> introduction of conmgr some relevant changes have been made to the thread
> handling. Now, the controller will have a fixed pool of threads (instead of
> continually creating and destroying threads), which will attend petitions. The
> default value for this pool of threads is 64, configurable via the parameter
> SlurmCtldParameter=conmgr_threads[1].
>
> >With the new conmgr we haven't seen the thread count get above 64 and our >max_rpc_cnt is set to 128 so I don't think we are even hitting the 
> >throttler.
> So, effectively, max_threads will not go over 64 with the throttler never
> actually taking place. Though, both backfill and main schedulers should yield
> locks anyway based on timers. With main scheduling loop never taking more than
> 2s, and bf yielding locks based on the defaults:
>
> bf_yield_interval[2] -> 2s
> bf_yield_sleep[3]    -> 0.5s
>
> With the traces seen, problem seems to be lock contention rather than
> concurrency. So, for further tuning maybe you could try some value < 64 for
> max_rpc_cnt, to force the throttling more often. Or, maybe a better option
> could be to increase conmgr_threads (128?) and set max_rpc_cnt at 64.
>
> But you will need to test how these options adapt to your cluster. Because the
> first scenario could lead to a continuous deferral of scheduling to process
> RPCs, while the second could overload the controller by having way many threads
> than actual cores. I will appreciate any feedback you can give if you decide to
> play with this values.
>
> >Just adding this for comparison against the plot that Oscar made. This is >from sdiag
> In any case, I have seen the graphs and seems a few 1k peaks happened with no
> issues, which is good news. Seems changes helped a bit, so maybe we can keep an
> eye with the current params to see if the situation repeats.
>
> >I get the feeling we are in this thrashing regime again where the >scheduler is just getting overwhelmed with traffic and decides to go 
> >head down (so to speak) and focus on scheduling and ignore other 
> >requests. Eventually it gets so stuck that it wedges itself.
> Kind of the impression I got, hopefully, changes made help alleviate the
> situation. Changing debuglevel to debug could help to understand if something
> in the scheduling logic happens during the blanks. But logs will grow pretty
> fast with your cluster activity.
>
> Kind regards,
> Oscar
>
> [1]https://slurm.schedmd.com/slurm.conf.html#OPT_conmgr_threads
> [2]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_interval=#
> [3]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_sleep=#
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 44 Paul Edmon 2025-02-12 09:22:50 MST

FYI slurmctld crashed this morning at 7:55. Here is the initial log:

Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu slurmctld[1088629]: 
slurmctld: _slurm_rpc_submit_batch_job: JobId=3582023 InitPrio=8179811 
usec=3995
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu slurmctld[1088629]: 
_slurm_rpc_submit_batch_job: JobId=3582023 InitPrio=8179811 usec=3995
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu kernel: 
worker[17][1088647]: segfault at 40 ip 00001484de861346 sp 
00001484dbed58d0 error 4 in libpthread-2.28.so[1484de854000+1b000]
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu kernel: Code: 24 83 c8 
02 83 f8 03 74 f4 e9 d1 fd ff ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 
56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 90 <8b> 57 >
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu systemd[1]: Started 
Process Core Dump (PID 3005258/UID 0).
Feb 12 07:55:00 holy-slurm02.rc.fas.harvard.edu 
systemd-coredump[3005259]: Process 1088629 (slurmctld) of user 57812 
dumped core.

Stack trace of thread 1088647:
#0  0x00001484de861346 __pthread_rwlock_wrlock (libpthread.so.0)
#1  0x00001484deb345b3 list_append (libslurmfull.so)
#2  0x000000000046f2ce _foreach_add_to_preemptee_job_id (slurmctld)
#3  0x00001484deb34fbf list_for_each_max (libslurmfull.so)
#4  0x00001484deb350af list_for_each (libslurmfull.so)
#5  0x0000000000470996 _foreach_job_start_data_part (slurmctld)
#6  0x0000000000474805 job_start_data (slurmctld)
#7  0x0000000000462a42 job_allocate (slurmctld)
#8  0x000000000049e997 _slurm_rpc_job_will_run (slurmctld)
#9  0x00000000004a8b1c slurmctld_req (slurmctld)
#10 0x0000000000430f57 _service_connection (slurmctld)
#11 0x00001484dec2a94f _wrap_on_extract (libslurmfull.so)
#12 0x00001484dec3aab9 wrap_work (libslurmfull.so)
#13 0x00001484dec3b536 _worker (libslurmfull.so)
#14 0x00001484de85c1ca start_thread (libpthread.so.0)
#15 0x00001484ddb118d3 __clone (libc.so.6)


I will drop the log around this. I also have the core dump so let me 
know what commands you want me to run on it.

-Paul Edmon-

On 2/11/25 2:15 PM, bugs@schedmd.com wrote:
>
> *Comment # 42 <https://support.schedmd.com/show_bug.cgi?id=21975#c42> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> >To start with the last thing first, removing the enable_stepmgr looks to >alleviate the agent queue build up issue. That has not reoccurred 
> since >we dropped that. So it appears that we have two different 
> issues. The >initial presenting issue was the agent queue build up 
> which was what >caused this ticket.
> Great. Yes, they look like 2 different issues to me. So, good thing we got the
> agent queue issues sorted out.
>
> Now, about the overwhelmed controller. You are right here, with the
> introduction of conmgr some relevant changes have been made to the thread
> handling. Now, the controller will have a fixed pool of threads (instead of
> continually creating and destroying threads), which will attend petitions. The
> default value for this pool of threads is 64, configurable via the parameter
> SlurmCtldParameter=conmgr_threads[1].
>
> >With the new conmgr we haven't seen the thread count get above 64 and our >max_rpc_cnt is set to 128 so I don't think we are even hitting the 
> >throttler.
> So, effectively, max_threads will not go over 64 with the throttler never
> actually taking place. Though, both backfill and main schedulers should yield
> locks anyway based on timers. With main scheduling loop never taking more than
> 2s, and bf yielding locks based on the defaults:
>
> bf_yield_interval[2] -> 2s
> bf_yield_sleep[3]    -> 0.5s
>
> With the traces seen, problem seems to be lock contention rather than
> concurrency. So, for further tuning maybe you could try some value < 64 for
> max_rpc_cnt, to force the throttling more often. Or, maybe a better option
> could be to increase conmgr_threads (128?) and set max_rpc_cnt at 64.
>
> But you will need to test how these options adapt to your cluster. Because the
> first scenario could lead to a continuous deferral of scheduling to process
> RPCs, while the second could overload the controller by having way many threads
> than actual cores. I will appreciate any feedback you can give if you decide to
> play with this values.
>
> >Just adding this for comparison against the plot that Oscar made. This is >from sdiag
> In any case, I have seen the graphs and seems a few 1k peaks happened with no
> issues, which is good news. Seems changes helped a bit, so maybe we can keep an
> eye with the current params to see if the situation repeats.
>
> >I get the feeling we are in this thrashing regime again where the >scheduler is just getting overwhelmed with traffic and decides to go 
> >head down (so to speak) and focus on scheduling and ignore other 
> >requests. Eventually it gets so stuck that it wedges itself.
> Kind of the impression I got, hopefully, changes made help alleviate the
> situation. Changing debuglevel to debug could help to understand if something
> in the scheduling logic happens during the blanks. But logs will grow pretty
> fast with your cluster activity.
>
> Kind regards,
> Oscar
>
> [1]https://slurm.schedmd.com/slurm.conf.html#OPT_conmgr_threads
> [2]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_interval=#
> [3]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_sleep=#
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 45 Paul Edmon 2025-02-12 09:24:48 MST

Created attachment 40765 [details]
Log for crash at 07:55 on 02-12-25

Comment 46 Oscar Hernández 2025-02-12 09:48:27 MST

Hey Paul,

Seems unrelated to the previous changes. This segfault seems to be similar to one we fixed a couple of weeks ago:

https://github.com/SchedMD/slurm/commit/6b7ceb0746c7

Would you be able to patch the controller? What is the current status of the controller now?

Kind regards,
Oscar

Comment 47 Paul Edmon 2025-02-12 09:56:59 MST

Ah okay. Good to see its fixed. When will the formal release of that 
come? As if it is soon then we can wait for that. The scheduler is 
working fine right now, it restarted without any issue.

If its going to take a bit for the formal release then we may want to 
emergency patch.

On 2/12/25 11:48 AM, bugs@schedmd.com wrote:
>
> *Comment # 46 <https://support.schedmd.com/show_bug.cgi?id=21975#c46> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hey Paul,
>
> Seems unrelated to the previous changes. This segfault seems to be similar to
> one we fixed a couple of weeks ago:
>
> https://github.com/SchedMD/slurm/commit/6b7ceb0746c7
>
> Would you be able to patch the controller? What is the current status of the
> controller now?
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 48 Paul Edmon 2025-02-12 09:59:31 MST

Also other than that crash this morning, things look stable with the new 
thread arrangement. If it looks good for the next day I may on Thursday 
try dropping the sched_min_interval to see if the higher thread count 
and max_rpc_cnt are sufficient to keep things in check. If not we can 
fail back to sched_min_interval.

On 2/12/25 11:48 AM, bugs@schedmd.com wrote:
>
> *Comment # 46 <https://support.schedmd.com/show_bug.cgi?id=21975#c46> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hey Paul,
>
> Seems unrelated to the previous changes. This segfault seems to be similar to
> one we fixed a couple of weeks ago:
>
> https://github.com/SchedMD/slurm/commit/6b7ceb0746c7
>
> Would you be able to patch the controller? What is the current status of the
> controller now?
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 49 Oscar Hernández 2025-02-12 10:11:14 MST

Created attachment 40767 [details]
Patch for segfault

I am attaching a patch cherrypicking this commit from the current 24.11 branch.

We still don't have a tentative release date for 24.11.2. It should fall around the end of this month. But given that it is a segfault that can be triggered by a particular job submission, I would suggest to patch it as soon as possible.

My colleague Ben explains about the issue in ticket 21997 comment 6.

Btw, did you finally change conmgr_threads value? I will wait for some feedback on the cluster general behavior.

Kind regards,
Oscar

Comment 50 Paul Edmon 2025-02-12 10:14:26 MST

Blegh, that's pretty bad. I will look at patching ASAP.

I made the change to conmgr_threads=128 and max_rpc_cnt=64 yesterday 
afternoon. So far it looks stable and I haven't see any thrashing by the 
scheduler. Let me know if you want more data than that as I'm happy to 
pull stats.

On 2/12/25 12:11 PM, bugs@schedmd.com wrote:
>
> *Comment # 49 <https://support.schedmd.com/show_bug.cgi?id=21975#c49> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Createdattachment 40767 <attachment.cgi?id=40767&action=diff> [details] 
> <attachment.cgi?id=40767&action=edit>
> Patch for segfault
>
> I am attaching a patch cherrypicking this commit from the current 24.11 branch.
>
> We still don't have a tentative release date for 24.11.2. It should fall around
> the end of this month. But given that it is a segfault that can be triggered by
> a particular job submission, I would suggest to patch it as soon as possible.
>
> My colleague Ben explains about the issue inticket 21997 comment 6 <show_bug.cgi?id=21997#c6>.
>
> Btw, did you finally change conmgr_threads value? I will wait for some feedback
> on the cluster general behavior.
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 51 Oscar Hernández 2025-02-12 11:05:25 MST

>Also other than that crash this morning, things look stable with the new 
>thread arrangement. If it looks good for the next day I may on Thursday 
>try dropping the sched_min_interval to see if the higher thread count 
>and max_rpc_cnt are sufficient to keep things in check. If not we can 
>fail back to sched_min_interval.
Sorry, missed that :). Sounds good.

>I made the change to conmgr_threads=128 and max_rpc_cnt=64 yesterday 
>afternoon. So far it looks stable and I haven't see any thrashing by the 
>scheduler. Let me know if you want more data than that as I'm happy to 
>pull stats.
Good. Mainly interested because all that new conmgr logic aims to improve high throughput situations like yours, so feedback is highly appreciated. For now it is just enough to know that the "stuck" controller situations are under control. Well, just one question for curiosity: 

Have you noticed sched_min_interval having an impact in the number of started jobs?

Comment 52 Paul Edmon 2025-02-12 11:12:24 MST

Yes, at least in periods of high demand. I've attached a graph of our 
job's started which includes in yellow the jobs per minute. You can see 
that the peaks are nearly as high after the change on Feb 10th. But on 
average it looks the same.

-Paul Edmon-

On 2/12/25 1:05 PM, bugs@schedmd.com wrote:
>
> *Comment # 51 <https://support.schedmd.com/show_bug.cgi?id=21975#c51> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> >Also other than that crash this morning, things look stable with the new >thread arrangement. If it looks good for the next day I may on 
> Thursday >try dropping the sched_min_interval to see if the higher 
> thread count >and max_rpc_cnt are sufficient to keep things in check. 
> If not we can >fail back to sched_min_interval.
> Sorry, missed that :). Sounds good.
>
> >I made the change to conmgr_threads=128 and max_rpc_cnt=64 yesterday >afternoon. So far it looks stable and I haven't see any thrashing by 
> the >scheduler. Let me know if you want more data than that as I'm 
> happy to >pull stats.
> Good. Mainly interested because all that new conmgr logic aims to improve high
> throughput situations like yours, so feedback is highly appreciated. For now it
> is just enough to know that the "stuck" controller situations are under
> control. Well, just one question for curiosity:
>
> Have you noticed sched_min_interval having an impact in the number of started
> jobs?
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 53 Paul Edmon 2025-02-12 11:13:58 MST

Created attachment 40768 [details]
Jobs started between 2-05-25 to 2-12-25

Comment 54 Oscar Hernández 2025-02-13 04:16:17 MST

>You can see that the peaks are nearly as high after the change on Feb 10th. But on 
>average it looks the same.
I see. That makes sense, since we are reducing the number of times quick scheduling runs per minute. If average throughput is similar though, tuning sched_min_interval seems like a valid approach here to help distribute the load maintaining while keeping the controller responsive.

I will keep an eye on this ticket to see how things evolve during this week.

Thanks for the feedback!

Comment 55 Paul Edmon 2025-02-13 07:20:32 MST

I concur. Right now I'm pretty happy with the behavior of the scheduler 
so I'm not inclined to tweak anything. The throughput looks fine and I 
haven't seen any stalls lately.

On 2/13/2025 6:16 AM, bugs@schedmd.com wrote:
>
> *Comment # 54 <https://support.schedmd.com/show_bug.cgi?id=21975#c54> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> >You can see that the peaks are nearly as high after the change on Feb 10th. But on >average it looks the same.
> I see. That makes sense, since we are reducing the number of times quick
> scheduling runs per minute. If average throughput is similar though, tuning
> sched_min_interval seems like a valid approach here to help distribute the load
> maintaining while keeping the controller responsive.
>
> I will keep an eye on this ticket to see how things evolve during this week.
>
> Thanks for the feedback!
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 56 Paul Edmon 2025-02-13 07:46:34 MST

Just a FYI, as a test I dropped the sched_min_interval. I want to see if 
the max_rpc_cnt limit is sufficient to keep things under control. If 
things get out of hand I will reset sched_min_interval.

Also I added bf_allow_magnetic_slot because its a feature I've wanted to 
add and now seems as good a time as any to turn it on.

On 2/13/2025 6:16 AM, bugs@schedmd.com wrote:
>
> *Comment # 54 <https://support.schedmd.com/show_bug.cgi?id=21975#c54> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> >You can see that the peaks are nearly as high after the change on Feb 10th. But on >average it looks the same.
> I see. That makes sense, since we are reducing the number of times quick
> scheduling runs per minute. If average throughput is similar though, tuning
> sched_min_interval seems like a valid approach here to help distribute the load
> maintaining while keeping the controller responsive.
>
> I will keep an eye on this ticket to see how things evolve during this week.
>
> Thanks for the feedback!
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 57 Oscar Hernández 2025-02-17 09:55:31 MST

Hi Paul,

Just checking in here. Any relevant issue after the changes mentioned?

Cheers,
Oscar

Comment 58 Paul Edmon 2025-02-18 08:12:47 MST

Nope everything is stable and has been behaving much better. I think 
setting max_rpc_cnt lower than the conmgr_threads is what did it as it 
has the scheduler automatically throttle when the thread count is high 
and has sufficient additional head room to then clear out the additional 
requests before it hits conmgr_threads. The current combo of 
conmgr_threads=128 and max_rpc_cnt=64 looks to be working great.

-Paul Edmon-

On 2/17/25 11:55 AM, bugs@schedmd.com wrote:
>
> *Comment # 57 <https://support.schedmd.com/show_bug.cgi?id=21975#c57> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hi Paul,
>
> Just checking in here. Any relevant issue after the changes mentioned?
>
> Cheers,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 59 Oscar Hernández 2025-03-05 03:40:13 MST

Hi Paul,

Just checking again, to make sure things remain stable.

Also to inform that I currently have under review a proposal to mitigate the SRUN_TIMEOUT storm you initially suffered. The idea is to allow you to enable stepmgr again.

Will keep you updated.

Kind regards,
Oscar

Comment 60 Paul Edmon 2025-03-05 07:35:49 MST

Yup everything is working great at this point. We haven't see any stalls 
and we upgraded to 24.11.2 with out any hitches.

At this point you can resolve this ticket. If you want a copy of the 
configuration we ended up with let me know as it may be useful with 
regards to the recommendations for the new conmgr setup.

-Paul Edmon-

On 3/5/2025 5:40 AM, bugs@schedmd.com wrote:
>
> *Comment # 59 <https://support.schedmd.com/show_bug.cgi?id=21975#c59> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hi Paul,
>
> Just checking again, to make sure things remain stable.
>
> Also to inform that I currently have under review a proposal to mitigate the
> SRUN_TIMEOUT storm you initially suffered. The idea is to allow you to enable
> stepmgr again.
>
> Will keep you updated.
>
> Kind regards,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 61 Oscar Hernández 2025-03-05 09:48:47 MST

Hi Paul,

>Yup everything is working great at this point. We haven't see any stalls 
>and we upgraded to 24.11.2 with out any hitches.
Glad to hear that! Yeah, it would be great if you can attach a copy of the current slurm.conf just for completion.

>At this point you can resolve this ticket
Would you mind if I keep this one open a little longer? I am using it as reference to internally track the SRUN_TIMEOUT fix with stepmgr.

Cheers,
Oscar

Comment 62 Paul Edmon 2025-03-05 09:51:30 MST

Created attachment 41014 [details]
slurm.conf after tuning

Comment 63 Paul Edmon 2025-03-05 09:52:00 MST

Yup, I've uploaded it.

Feel free to keep the ticket open as long as you like.

-Paul Edmon-

On 3/5/2025 11:48 AM, bugs@schedmd.com wrote:
>
> *Comment # 61 <https://support.schedmd.com/show_bug.cgi?id=21975#c61> 
> on ticket 21975 <https://support.schedmd.com/show_bug.cgi?id=21975> 
> from Oscar Hernández <mailto:oscar.hernandez@schedmd.com> *
> Hi Paul,
>
> >Yup everything is working great at this point. We haven't see any stalls >and we upgraded to 24.11.2 with out any hitches.
> Glad to hear that! Yeah, it would be great if you can attach a copy of the
> current slurm.conf just for completion.
>
> >At this point you can resolve this ticket
> Would you mind if I keep this one open a little longer? I am using it as
> reference to internally track the SRUN_TIMEOUT fix with stepmgr.
>
> Cheers,
> Oscar
> ------------------------------------------------------------------------
> You are receiving this mail because:
>
>   * You reported the ticket.
>

Comment 64 Oscar Hernández 2025-03-11 08:21:43 MDT

Hi Paul,

Resolving this one. Starting on the next release (24.11.4), the problem of the accumulating agent threads with SRUN_TIMEOUT has been addressed.

Feel free to consider enabling stepmgr again after upgrading.

>    commit 40f1173181f207f7f0a2a4999086fba530ecd7df 
> 
>    Avoid unnecessary SRUN_TIMEOUT forwarding to stepmgr
>    
>    Prevent the controller to send a message of type SRUN_TIMEOUT to the
>    stepmgr when stepmgr won't need to communicate to any srun process.  
Thanks,
Oscar