Similar to https://support.schedmd.com/show_bug.cgi?id=21847, we are seeing stalls where the agent queue size grows very large. I was hoping the upgrade to 24.11.1 (which we did today) would alleviate this issue, but unfortunately it has not. I've captured sdiag output while it was happening and attached a graph of the agent queue size; you can also see a pile-up of completing jobs in another graph. I tried to take a core dump, but the signal I sent sadly didn't generate one. That said, restarting the scheduler caused the queue to flush, so the controller is getting into some state where it stalls on handling traffic and the queue just keeps piling up without flushing. I suspect this may be triggered when the scheduler hits max_rpc_cnt. In summary: the scheduler is up and running but occasionally stalls with this problem, and a restart flushes the queue. If you send me the commands for reliably generating a core dump once it gets into this state again, I can force one. I'm also happy to turn on whatever other debugging you need, and I welcome any suggestions for tweaking our config, which I've attached.
Created attachment 40627 [details] Current slurm.conf
Created attachment 40628 [details] Current topology.conf
Created attachment 40629 [details] Graph of agent queue and server threads
Created attachment 40630 [details] Node state graph
Would it be possible to also share the full output of 'sdiag' during one of these stacked agent queues?
Created attachment 40631 [details] sdiag output during stall
Next time this occurs please run the following gcore and gdb commands and attach the logs for that day as well.

>gcore $(pidof slurmctld)
>[...]
>warning: Memory read failed for corefile section, 4096 bytes at
>0xffffffffff600000.
>Saved corefile core.182997
>[Inferior 1 (process 182997) detached]

>gdb -batch -ex 't a a bt f' slurmctld ./core.182997
Yup, will do. I've attached the sdiag output from while it was in this state last night: https://support.schedmd.com/attachment.cgi?id=40631
Hi Paul,

As Jason suggested, the backtrace will be really useful to see what is happening at that particular moment.

In the meantime, I have been looking at the sdiag. What draws my attention is that, of the pending RPCs, half of them correspond to SRUN_TIMEOUT:

>Pending RPC statistics
> SRUN_TIMEOUT ( 7002) count:16721
> REQUEST_TERMINATE_JOB ( 6011) count:2822
> SRUN_JOB_COMPLETE ( 7004) count:1789
> REQUEST_LAUNCH_PROLOG ( 6017) count:5100
> REQUEST_BATCH_JOB_LAUNCH ( 4005) count:5048
> REQUEST_KILL_TIMELIMIT ( 6009) count:6298
> REQUEST_KILL_PREEMPTED ( 6016) count:54
> REQUEST_CANCEL_JOB_STEP ( 5005) count:3
> REQUEST_HEALTH_CHECK ( 1011) count:1
> RESPONSE_RESOURCE_ALLOCATION ( 4002) count:52
> REQUEST_SIGNAL_TASKS ( 6004) count:10
> SRUN_PING ( 7001) count:9

That would explain the peak in CG state jobs observed. Would you be able to send a compressed controller log from the same time you took the sdiag output?

Kind regards,
Oscar
Created attachment 40643 [details] Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00 The stall in question ran from about 15:00 to 16:15 when the scheduler was restarted.
Yup, just including it plus stuff before and after.
Thanks for that.

It seems the issues are related to a big job array that was submitted at 14:39 (job 1947577), which had many of its array members time out at the same time.

>Feb 03 14:39:43 holy-slurm02.rc.fas.harvard.edu slurmctld[8536]: _slurm_rpc_submit_batch_job: JobId=1947577 InitPrio=14386520 usec=4504

This is a job array with around 4k members, most of them allocated between 14:39 and 15:00.

>$ grep "1947577" slurm-02-03-25.log | grep "]: sched: Alloca" | wc -l
>4044

The logs show most of these jobs timing out simultaneously starting at 15:09, which seems to correspond to when the CG nodes started to peak.

>$ grep "1947577" slurm-02-03-25.log | grep -i "]: Time" | wc -l
>2506

In fact, within a single second, at 15:09:49, 230 jobs timed out:

>grep "1947577" slurm-02-03-25.log | grep -i "]: Time" | grep "15:09:49" | wc -l
>230

I will try to figure out what could have caused Slurm to block there, but it definitely seems related to the array job.
Interesting. Here is the job script that was submitted, just for your reference. Let me know if you want me to pull any other job stats.

#!/bin/bash
#SBATCH --array=110-254,269-285,291,301,308,322-325,335-337,343-454,460-543,557-658,663-1512,1517-1606,1612-1778,1783,1813,1828-1994,1999-2016,2021-2044,2056-2063,2068-2080,2086-2143,2150-2151,2172,2195-2466,2475-2688,2693-2776,2802-2920,2929-2964,2969-3022,3028-3043,3048-3089,3095-3098,3105-3133,3140-3179,3189-3196,3214-3227,3240-3258,3264-3270,3278-3339,3353,3373-3376,3381-3392,3400-3404,3410,3417-3421,3484
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=00:30:00
#SBATCH -p serial_requeue
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<redacted>
#SBATCH --output=/n/holylabs/LABS/tkhanna_lab/Lab/ARD_stagetwo/logs/rerun_slurm_%A_%a.out

# Set Environmental Variables
export LAB_DIR=/n/holylabs/LABS/tkhanna_lab/Lab
export my_sif=/n/holylabs/LABS/tkhanna_lab/Lab/resources/rstudio4.sif
export R_LIBS_USER=/n/home03/halidaee/R/rstudiosif/4.5/

# Set computational parameters
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_THREADING_LAYER=GNU
export OPENBLAS_CORETYPE=Haswell

# Set simulation parameters
export simulations_per_index=20
export indices_per_triple=25

# Define environment variables to pass
vars=("LAB_DIR" "R_LIBS_USER" "OMP_NUM_THREADS" "MKL_NUM_THREADS" "OPENBLAS_NUM_THREADS" "MKL_THREADING_LAYER" "OPENBLAS_CORETYPE" "SLURM_CPUS_PER_TASK" "R_LIBS_USER" "SLURM_ARRAY_TASK_ID" "simulations_per_index" "indices_per_triple")

# Initialize ENV_VARS
ENV_VARS=""
for var in "${vars[@]}"; do
    ENV_VARS+="${var}=${!var},"
done
ENV_VARS=${ENV_VARS%,}

# Go to project directory
cd /n/holylabs/LABS/tkhanna_lab/Lab/ARD_stagetwo

# Run the analysis
singularity exec --cleanenv --env "$ENV_VARS" /n/holylabs/LABS/tkhanna_lab/Lab/resources/rstudio4.sif /usr/bin/zsh -c "source /n/home03/halidaee/.zshrc && Rscript --no-save --no-restore --verbose code/regressions/create_regressions_data.R > logs/rerun_regressions_job_array_output_${SLURM_ARRAY_TASK_ID}.log 2>&1"
>Interesting. Here is the job script that was submitted, just for your
>reference. Let me know if you want me to pull any other job stats.

Thanks! It would also help if you could share the slurmd.log from some of the nodes involved, covering the same time-frame as the controller log you shared. For example, node holy7c12406.
Created attachment 40644 [details] holy7c12406 Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00
Created attachment 40645 [details] holy8a24507 Log from 2025-02-03T14:00:00 to 2025-02-03T19:00:00
Sure. I've put in two logs, holy7c12406 and holy8a24507. Let me know if you need more.

-Paul Edmon-
Hi Paul,

Many thanks for the different logs and data provided; they helped me confirm my initial suspicion that the culprit is the already-mentioned SRUN_TIMEOUT RPC.

I believe I have been able to reproduce behavior similar to yours in my test environment by having around 500 jobs time out at the same time. I can get my controller overloaded by RPCs which keep accumulating. However, after a few minutes, my controller is eventually able to recover (jobs stay CG for a while though, and nodes appear as CG too).

In your case, things seem to indicate that a sustained rate of jobs timing out during the problematic hour (between 15:00-16:00) prevented the controller from catching up with RPC processing. In fact, I found another job array with 1k jobs also timing out at that time (1956880).

#About the RPC that seems to be flooding the queue:

The SRUN_TIMEOUT RPC is issued around 3 times per job, to tell srun that the time limit is approaching (it is sent every 30s, starting 1 min before the time limit is hit). Your conflicting array job did not have any srun inside it though, which made me wonder why Slurm was issuing that many RPCs.

It turns out that, since you have "SlurmctldParameters=enable_stepmgr" active, the controller sends an RPC to the stepmgr so that the stepmgr later handles the connection to all hosts (whether or not an srun is involved). And this single connection to the stepmgr, in your unfortunate situation, was sent many, many times...

> SRUN_TIMEOUT ( 7002) count:16721
> REQUEST_TERMINATE_JOB ( 6011) count:2822
> SRUN_JOB_COMPLETE ( 7004) count:1789
> REQUEST_LAUNCH_PROLOG ( 6017) count:5100
> REQUEST_BATCH_JOB_LAUNCH ( 4005) count:5048
> REQUEST_KILL_TIMELIMIT ( 6009) count:6298

On the compute node side of things, jobs were never notified of the actual timeout that forces the job kill (REQUEST_KILL_TIMELIMIT). You can see in the counts above that the queue never got to process those, leaving jobs and nodes in CG for a long time.

As you mentioned, the restart flushes the queue. So, once the controller is back online, all the SRUN_TIMEOUT RPCs are gone and the controller automatically recovers.

#How to prevent the SRUN_TIMEOUT flood

As mentioned, this RPC is issued as a consequence of having "SlurmctldParameters=enable_stepmgr". Without stepmgr enabled, this RPC won't be generated, so you should not suffer from this. Did you notice an improvement with the stepmgr option activated? Would it be an option to remove it?

As a note: stepmgr is mainly beneficial for jobs submitting many steps, but array jobs won't see any benefit from activating it. Jobs that do use many steps could still manually request it with the sbatch option "--stepmgr" (see the sketch at the end of this comment).

Meanwhile, I will check whether I can improve the handling of SRUN_TIMEOUT, to prevent this flooding when so many jobs time out with stepmgr enabled.

Hope that gives a bit of insight into what happened.

Kind regards,
Oscar
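P.S. For reference, the change is just a matter of dropping enable_stepmgr from SlurmctldParameters in slurm.conf, while any step-heavy workflow can still opt in per job. A rough illustration (assuming nothing else is set on that line in your config):

# slurm.conf: remove or comment out the global flag
#SlurmctldParameters=enable_stepmgr

# In the batch script of a job that genuinely launches many steps:
#SBATCH --stepmgr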
Very interesting. As you have seen, our cluster is very active, with jobs coming and going constantly and jobs all exiting at once due to various issues. We do have a few users that launch tons of steps, which is why I enabled the stepmgr, but we probably do not have enough of them to justify stepmgr if it's creating this RPC load. I will drop the stepmgr flag and we will see if that helps.

Just as a quick note, I go on vacation tomorrow, so my colleague Jason Wells (who has cc'd himself on this ticket) will handle communications while I am out.

-Paul Edmon-
I'm unsure whether the monitoring we've set up is just giving us new visibility into Slurm's existing behavior, or whether I've caught the issue and have the gcore/gdb output that you need. How can I send files to you all? I don't see an option.
Created attachment 40703 [details] Feb6_core.112406 1st core
Created attachment 40704 [details] Log of gcore and gdb commands Terminal log of submitting the gcore and gdb commands
Created attachment 40705 [details] 2nd core - zip and then gzip to try to fit within size limit 2nd core. Gunzip and then unzip it.
Hi Jason,

>I'm unsure if the monitoring that we've setup is just giving us new visibility
>into the behavior of Slurm or if I have the gcore/gdb that you need. How can I
>send files to you all? I don't see an option.

Apologies, I was already out and did not see the message yesterday. I see you managed to upload files :).

What did you observe when the gcore was generated? Was your cluster again left with jobs/nodes in CG state? Do you have the output of an sdiag run from that time-frame? I will check the backtrace shared and let you know if I find anything interesting, but knowing the context a bit would help. It looks like you restarted the controller afterwards.

BTW, corefiles are not really useful to us (I cannot get backtraces without your original environment), so there is no need to upload them. We are only interested in the backtrace you can generate from them. When was the 2nd core generated? Could you also share the output of:

>gdb -batch -ex 't a a bt f' slurmctld ./core.182997

for that new core.

Thanks,
Oscar
"Log of gcore and gdb commands" has the output of the commands. I just ran gdb on the second core and got the below output: [root@holy-slurm02 ~]# gdb -batch -ex 't a a bt f' slurmctld ./core.112406 warning: core file may not match specified executable file. [New LWP 112406] [New LWP 112407] [New LWP 112408] [New LWP 112409] [New LWP 112411] [New LWP 112412] [New LWP 112413] [New LWP 112414] [New LWP 112415] [New LWP 112416] [New LWP 112417] [New LWP 112418] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `slurmctld: slurmscriptd'. #0 0x000014aab786cac1 in poll () from /lib64/libc.so.6 [Current thread is 1 (Thread 0x14aab8d40740 (LWP 112406))] Thread 12 (Thread 0x14aab613e700 (LWP 112418)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88d9cc4 <__func__.16181> "watch", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b1d8 <mgr+376>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 1, tv_usec = 22723475648608} tv2 = {tv_sec = 22723473029035, tv_usec = 22723473283840} tv_str = '\000' <repeats 19 times> delta_t = 22723473279423 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b1d8 <mgr+376>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88d9cc4 <__func__.16181> "watch") at events.c:100 No locals. #3 0x000014aab889c3d7 in watch (arg=<optimized out>) at watch.c:1378 __func__ = "watch" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 11 (Thread 0x14aab623f700 (LWP 112417)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 16194240, tv_usec = 15898016} tv2 = {tv_sec = 22723432804000, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 22723432803976 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf71ac0) at workers.c:226 work = 0x0 worker = 0xf71ac0 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 10 (Thread 0x14aab6340700 (LWP 112416)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 0, tv_usec = 15898016} tv2 = {tv_sec = 22723433856672, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 0 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf62e70) at workers.c:226 work = 0x0 worker = 0xf62e70 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 9 (Thread 0x14aab6441700 (LWP 112415)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 16135072, tv_usec = 15898016} tv2 = {tv_sec = 22723434909344, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 22723434909320 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf633a0) at workers.c:226 work = 0x0 worker = 0xf633a0 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 8 (Thread 0x14aab6542700 (LWP 112414)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 0, tv_usec = 15898016} tv2 = {tv_sec = 22723435962016, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 0 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf5e340) at workers.c:226 work = 0x0 worker = 0xf5e340 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 7 (Thread 0x14aab6643700 (LWP 112413)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 0, tv_usec = 15898016} tv2 = {tv_sec = 22723437014688, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 0 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf4a1f0) at workers.c:226 work = 0x0 worker = 0xf4a1f0 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 6 (Thread 0x14aab6744700 (LWP 112412)): #0 0x000014aab7879307 in epoll_wait () from /lib64/libc.so.6 No symbol table info available. #1 0x000014aab8892fcc in _poll (caller=0x14aab88d9c90 <__func__.16117> "_poll_connections") at epoll.c:494 nfds = -1 rc = 0 events_count = 301 epoll = 8 fd_count = 2 events = 0xf2b5a0 __func__ = "_poll" #2 0x000014aab8896b85 in pollctl_poll (caller=caller@entry=0x14aab88d9c90 <__func__.16117> "_poll_connections") at polling.c:144 No locals. #3 0x000014aab889868c in _poll_connections (conmgr_args=..., arg=<optimized out>) at watch.c:1116 rc = <optimized out> __func__ = "_poll_connections" #4 0x000014aab889cab9 in wrap_work (work=<optimized out>, work@entry=0x14aab0000c30) at work.c:181 con = 0x0 __func__ = "wrap_work" #5 0x000014aab889d536 in _worker (arg=0xf44110) at workers.c:251 work = 0x14aab0000c30 worker = 0xf44110 __func__ = "_worker" #6 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #7 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 5 (Thread 0x14aab6845700 (LWP 112411)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 0, tv_usec = 15898016} tv2 = {tv_sec = 22723439120032, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 0 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf5a4c0) at workers.c:226 work = 0x0 worker = 0xf5a4c0 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x14aab6946700 (LWP 112409)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 16160800, tv_usec = 15898016} tv2 = {tv_sec = 22723440172704, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 22723440172680 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf69820) at workers.c:226 work = 0x0 worker = 0xf69820 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x14aab6a47700 (LWP 112408)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 0, tv_usec = 15898016} tv2 = {tv_sec = 22723441225376, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 0 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf54040) at workers.c:226 work = 0x0 worker = 0xf54040 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x14aab8d3f700 (LWP 112407)): #0 0x000014aab84c448c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x000014aab8893ca5 in _wait (caller=0x14aab88da368 <__func__.14417> "_worker", max_sleep=..., mutex=0x14aab8b1b108 <mgr+168>, event=0x14aab8b1b258 <mgr+504>) at events.c:77 err = <optimized out> tv1 = {tv_sec = 0, tv_usec = 15898016} tv2 = {tv_sec = 22723477892768, tv_usec = 15898056} tv_str = '\000' <repeats 19 times> delta_t = 0 tv1 = <optimized out> tv2 = <optimized out> tv_str = <optimized out> delta_t = <optimized out> __func__ = "_wait" err = <optimized out> err = <optimized out> #2 event_wait_now (event=event@entry=0x14aab8b1b258 <mgr+504>, mutex=mutex@entry=0x14aab8b1b108 <mgr+168>, max_sleep=..., caller=caller@entry=0x14aab88da368 <__func__.14417> "_worker") at events.c:100 No locals. #3 0x000014aab889d4a0 in _worker (arg=0xf7c150) at workers.c:226 work = 0x0 worker = 0xf7c150 __func__ = "_worker" #4 0x000014aab84be1ca in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x000014aab77738d3 in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x14aab8d40740 (LWP 112406)): #0 0x000014aab786cac1 in poll () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000014aab876fa8a in poll (__timeout=-1, __nfds=<optimized out>, __fds=0xf62440) at /usr/include/bits/poll2.h:38 No locals. #2 _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0xf62440) at eio.c:352 n = <optimized out> timeout = -1 n = <optimized out> timeout = <optimized out> #3 eio_handle_mainloop (eio=0xf618f0) at eio.c:315 retval = 0 pollfds = 0xf62440 map = 0xf57870 maxnfds = 1 nfds = 2 n = <optimized out> shutdown_time = <optimized out> __func__ = "eio_handle_mainloop" #4 0x00000000004c27f3 in slurmscriptd_run_slurmscriptd (argc=argc@entry=2, argv=argv@entry=0x7ffd431735f8, binary_path=binary_path@entry=0x721dc0 <binary> "/usr/sbin/slurmctld") at slurmscriptd.c:1177 i = <optimized out> rc = 0 ack = 0 __func__ = "slurmscriptd_run_slurmscriptd" #5 0x00000000004332a4 in main (argc=2, argv=0x7ffd431735f8) at controller.c:619 error_code = <optimized out> start = {tv_sec = 0, tv_usec = 0} now = {tv_sec = 22723454820352, tv_usec = 22723475683688} stat_buf = {st_dev = 22723475838398, st_ino = 110527148, st_nlink = 157536133, st_mode = 3098747161, st_uid = 5290, st_gid = 3098842544, __pad0 = 5290, st_rdev = 22723454810860, st_size = 0, st_blksize = 22723477909504, st_blocks = 1, st_atim = {tv_sec = 64313170774099024, tv_nsec = 22723477979696}, st_mtim = {tv_sec = 22723477899072, tv_nsec = 22723477979696}, st_ctim = {tv_sec = 64313170775482372, tv_nsec = 140725729047792}, __glibc_reserved = {22723475777776, 22723477979696, 22723475821647}} rlim = {rlim_cur = 1736415839, rlim_max = 22723475682544} config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK} prep_callbacks = {prolog_slurmctld = 0x498105 <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x4983ea <prep_epilog_slurmctld_callback>} backup_has_control = false slurmscriptd_mode = true conf_file = <optimized out> stepmgr_ops = {acct_db_conn = 0x0, active_feature_list = 0x0, job_list = 0x0, last_job_update = 0x725ef0 <last_job_update>, up_node_bitmap = 0x0, job_config_fini = 0x451ff8 <job_config_fini>, find_job_record = 0x44e2b1 <find_job_record>, find_job_array_rec = 0x44e510 <find_job_array_rec>, agent_queue_request = 0x42aba3 <agent_queue_request>, find_front_end_record = 0x4430d1 <find_front_end_record>} __func__ = "main"
And what we are seeing now is the scheduler just pausing for long periods. I put a Grafana alert in place that queries sdiag, and we're now getting alerts about Grafana not finding a data source. That means sdiag is not working, which I have confirmed from the command line. It's happening several times a day.
It does not require a restart to start working again though. So the change Paul made sort of improved things I guess. I just don't know if what we are seeing now is new behavior or not. We didn't have this monitoring in place before.
I see. Having the controller not responding is not a good sign; if it fails to respond to sdiag, I am worried it may also fail to respond to other user commands.

AFAIK, the only change applied was removing the "SlurmctldParameters=enable_stepmgr" option. Do you have any sdiag output from around the time you gathered the corefile? Or could you give any insight into how the agent queue has behaved since the change? I am thinking of graphs like the ones Paul shared in comments 3 and 4. Would you be able to share the controller logs for that same time-frame?

I want to make sure the suggested change did not make things worse, with the controller being more loaded now.

Kind regards,
Oscar
Created attachment 40716 [details] Thread count and agent queue size for the last two days, 2/5 and 2/6. The agent queue size has largely recovered. We still see peaks, but they are no more than a few hundred, rather than the 30,000 I saw before.
Slurm logs outright stop when we get these sdiag alerts. It's uncanny.

Feb 7 06:40:08 holy-slurm02 slurmctld[3332960]: _job_complete: JobId=2741401_4383(2746279) done
Feb 7 06:40:08 holy-slurm02 slurmctld[3332960]: slurmctld: _job_complete: JobId=2741401_4871(2746821) WEXITSTATUS 0
Feb 7 06:46:51 holy-slurm02 slurmctld[3332960]: slurmctld: update_node: node holy7c20212 state set to ALLOCATED
Feb 7 06:46:51 holy-slurm02 slurmctld[3332960]: update_node: node holy7c20212 state set to ALLOCATED

6 minute gap.

Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5854117 began=08:30:03.867
Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: slurmctld: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted
Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted
Feb 7 08:30:09 holy-slurm02 slurmctld[3332960]: slurmctld: _pick_best_nodes: JobId=2640983 never runnable in partition unrestricted
Feb 7 08:37:06 holy-slurm02 slurmctld[3332960]: slurmctld: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted
Feb 7 08:37:06 holy-slurm02 slurmctld[3332960]: _build_node_list: No nodes satisfy JobId=2640982 requirements in partition unrestricted

7 minute gap.

Also, we see a lot of these:

Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6511985 began=08:25:35.073
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3146523 began=08:25:38.439
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6863042 began=08:25:34.722
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6357665 began=08:25:35.229
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5204941 began=08:25:36.382
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3730445 began=08:25:37.857
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3123205 began=08:25:38.465
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5541292 began=08:25:36.047
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3423677 began=08:25:38.165
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=6977701 began=08:25:34.611
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=7190986 began=08:25:34.398
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=7163827 began=08:25:34.425
Feb 7 08:25:41 holy-slurm02 slurmctld[3332960]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=5989352 began=08:25:35.600

We've increased max_rpc_cnt from 64 to 128 in the hopes that will help.

We also see a lot of these, but maybe unrelated:

error: Select plugin failed to set job resources, nodes
I can see the agent queue improved, so it seems the change had some impact on the agent queue. But I want to rule out it having had any other negative impact.

I have been studying the backtraces shared. In the first one, the controller has a thread holding the job_write lock in the quick scheduling loop, and I can see some RPCs waiting for the lock to be released. Apparently things are as expected, but since the corefile is just a snapshot, I cannot rule out this quick scheduling loop taking too long.

According to the trace, the scheduling cycle holding the lock started at timestamp 1738864383. Do you see any gaps in the logs, similar to the ones shown, around this time? Could you share some full extracts from the logs containing these gaps? (A quick sketch for spotting them is below.)

Many thanks for your collaboration with that!
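If it helps with locating the gaps, a rough awk sketch like the following flags pauses longer than five minutes between consecutive lines of a single day's syslog-format controller log (the path and threshold are just placeholders):

awk '{
    split($3, t, ":")
    now = t[1] * 3600 + t[2] * 60 + t[3]
    # report any silence longer than 300 seconds since the previous line
    if (prev != "" && now - prev > 300)
        printf "%d second gap before: %s\n", now - prev, $0
    prev = now
}' /var/log/slurm/slurmctld.log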
Created attachment 40718 [details] A directory of various logs for the last incident Full logs for today, two gdb outputs for a 12:47PM incident, and I got sdiag at the very start of that time as well.
Created attachment 40719 [details] Yesterday's full logs So you can see all of yesterday's logs.
For your request for yesterday's logs, look around 12:53:03.
Thanks a lot. From the core shared from 12:47, I can see we are again in the scheduling loop holding the lock, but in a different function. The log shows the controller is really busy and having trouble keeping up with requests... I will analyze this further on Monday, but it seems tuning something like sched_min_interval=1000000 could help alleviate the load a bit. https://slurm.schedmd.com/slurm.conf.html#OPT_sched_min_interval=#
Hi Jason,

Just updating here a bit.

As mentioned, the backtrace seems to be in the fast scheduling loop. In the bt shared, I can see that the thread holding the job_write lock is:

>Thread 73:
>#11 0x00000000004779f2 in _schedule (full_queue=false) at job_scheduler.c:1851
>sched_start = 1738950062

The timestamp is about the same time we can see a blank in the logs:

>Feb 7 12:41:00 holy-slurm02 slurmctld[2333625]: _job_complete: JobId=2814657_2973(2820545) WEXITSTATUS 0
>Feb 7 12:47:35 holy-slurm02 slurmctld[2333625]: slurmctld: error: Select plugin failed to set job resources, nodes
>Feb 7 12:47:35 holy-slurm02 slurmctld[2333625]: error: Select plugin failed to set job resources, nodes
>Feb 7 12:47:36 holy-slurm02 slurmctld[2333625]: slurmctld: _build_node_list: No nodes satisfy JobId=2808679 requirements in partition unrestricted

Do you still have the corefile around? Would you be able to get the JobId from the corefile with the following:

>#open gdb with:
>$ gdb slurmctld corefile
>$ t 73
>$ f 11
>$ p job_ptr->job_id

I would like to rule out that job's options being responsible for the issue. Could you also get the job script for that job id?

>But seems tuning something like sched_min_interval=1000000 could help alleviating a bit the load.

Also, about this: your logs do show a very active scheduler, and I would consider tuning sched_min_interval. Thinking about it more, the 1000000 (1s) suggested might be too much; I would start with something like 100000 (0.1s) instead. The idea is to allow the controller to process RPCs more often, instead of continually trying to schedule new jobs.

Kind regards,
Oscar
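P.S. In slurm.conf this would just mean appending the option to your existing SchedulerParameters line, along these lines (illustrative; keep whatever other options you already set there, with max_rpc_cnt=128 being the value you mentioned earlier):

SchedulerParameters=max_rpc_cnt=128,sched_min_interval=100000

sched_min_interval is in microseconds, so 100000 corresponds to 0.1s between main scheduling cycles.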
Yeah, we have a lot of different users doing a lot of different things on the cluster. Things get quite chatty with all of the jobs going in and out and users pinging the scheduler for status. I will set the sched_min_interval to 100000 and see how it goes.

Jason is going to pull the information you asked for from the core file and from the job and get that to you.

-Paul Edmon-
Dear Oscar,

I kept the core files around:

(gdb) t 73
[Switching to thread 73 (Thread 0x14b16c265700 (LWP 2333722))]
#0  0x000014b1938908be in bit_size (b=b@entry=0x14abd8975840) at bitstring.c:384
384     bitstring.c: No such file or directory.
(gdb) f 11
#11 0x00000000004779f2 in _schedule (full_queue=false) at job_scheduler.c:1851
1851    job_scheduler.c: No such file or directory.
(gdb) p job_ptr->job_id
$1 = 2810918

# sacct -B -j 2810918
Batch Script for 2810918
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -p edwards,shared
#SBATCH -c 16
#SBATCH -t 2-12:00
#SBATCH -o trim_galore_second_round_RNA_outgroup_%j.out
#SBATCH -e trim_galore_second_round_RNA_outgroup_%j.err
#SBATCH --mem=150000
#SBATCH --mail-type=END

output_directory="<redacted>"
input_directory="<redacted>"

cat filenames_no_extension.txt | while read LINE; do
    trim_galore \
        --paired \
        -j 1 --retain_unpaired \
        --phred33 \
        --output_dir \
        ${output_directory} \
        --length 35 -q 0 \
        --stringency 5 \
        ${input_directory}/${LINE}_R1_001.fastq.gz ${input_directory}/${LINE}_R2_001.fastq.gz
done;
Created attachment 40746 [details] Plot Allocs per minute

Thanks Jason, Paul,

Looking at the job Jason shared, it is a pretty simple job and it was submitted a couple of hours before the blank in the controller logs, so it does not seem to be the direct cause of any slowdown.

So, the thing is that both backtraces we got show the controller doing work in the fast scheduling cycle, but each of them is in a different spot, so I cannot say for sure that we get stuck in any particular function there. This spot is the most frequently run, so the chances of catching it are high, but the fact that there is no logging during that time makes me think that either:

1 - We sometimes take too long to run fast scheduling.
2 - We run this function so many times that there is no time to process RPCs in between.

The suggested sched_min_interval parameter aims to help in situation 2. About 1, maybe some other backtrace could help. Just to rule things out, is there any chance the controller could be swapping? That is a long shot, but it could help explain the behavior observed.

Just for analysis, I plotted the submissions/allocations registered in the shared log per minute (attached; see the quick grep/awk sketch at the end of this comment). It is interesting that the periods where the controller stops responding usually match the peaks in which the controller allocates approximately 1k jobs/min.

Also, I do not see any clear relation between the removal of the "enable_stepmgr" option and these non-responsive controller periods. Were you aware of them before the changes? How have things gone since the sched_min_interval change, any other relevant blank/non-responsive moments?

Kind regards,
Oscar
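P.S. If you want to plot the same thing on your side, something along these lines over the syslog-format controller log should give roughly the same per-minute allocation counts (the log name is just an example):

grep ']: sched: Alloca' slurm-02-07-25.log | awk '{ print $1, $2, substr($3, 1, 5) }' | sort | uniq -c

Grepping for '_slurm_rpc_submit_batch_job' instead would give the submissions per minute.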
Yeah, that graph matches our internal graphs of sdiag, which show the same trends.

To start with the last thing first, removing enable_stepmgr looks to have alleviated the agent queue build-up issue; that has not reoccurred since we dropped it. So it appears that we have two different issues. The initial presenting issue was the agent queue build-up, which was what caused this ticket. The second is the scheduler being overwhelmed and dropping traffic. The second appears to be new and may be coincident with our upgrade from 24.05.4 to 24.11.1. This may be a result of the new conmgr setup; we have it all set to defaults and it may be time to start tweaking those.

As to what is going on, I think we may be getting into a thrashing situation. I have seen it happen before that the scheduler gets into a state where jobs exit as fast as they enter, so the scheduler ends up effectively filling a sinking boat. Previously this happened when users submitted jobs that ran for less than 10 minutes: the same job was scheduled and then exited immediately, causing an RPC storm while the scheduler was also trying to grab a new job to allocate. Previously this was solved by either going to defer mode or using max_rpc_cnt.

With the new conmgr we haven't seen the thread count get above 64, and our max_rpc_cnt is set to 128, so I don't think we are even hitting the throttler. We have also encouraged users to submit jobs longer than 10 minutes, as otherwise the scheduler is doing more work than the job, but we haven't set a hard limit, and even if we did, users may lie and send in jobs that are set for 20 minutes but then exit immediately because the job is actually short.

I get the feeling we are in this thrashing regime again, where the scheduler is getting overwhelmed with traffic and decides to go head down (so to speak) and focus on scheduling while ignoring other requests. Eventually it gets so stuck that it wedges itself. That's my theory at least, given what you have just explained.

As to how to deal with it, we have set sched_min_interval to 100000 as you suggested and we haven't seen any issues yet. Perhaps there are more tweaks to conmgr to give it more horsepower.
Here are the hardware details of our slurm master:

[root@holy-slurm02 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz
Stepping:            7
CPU MHz:             2900.000
BogoMIPS:            5800.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

[root@holy-slurm02 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:            187         117          15           3          54          64
Swap:             0           0           0

You can see we have swap turned off, so no swapping should be occurring. We also have SSDs on the backend for storage. So the system should have plenty of horsepower; it may be that we need to tweak the settings to allow it to utilize it.

By the way, for reference, we have slurmdbd running on this host as well, so latency to the DB shouldn't be an issue.

-Paul Edmon-
Created attachment 40747 [details] Job Stats for 2/11/2025 Just adding this for comparison against the plot that Oscar made. This is from sdiag.
>To start with the last thing first, removing the enable_stepmgr looks to
>alleviate the agent queue build up issue. That has not reoccurred since
>we dropped that. So it appears that we have two different issues. The
>initial presenting issue was the agent queue build up which was what
>caused this ticket.

Great. Yes, they look like two different issues to me. So, good thing we got the agent queue issue sorted out.

Now, about the overwhelmed controller. You are right here: with the introduction of conmgr, some relevant changes have been made to thread handling. The controller now has a fixed pool of threads (instead of continually creating and destroying threads) which attend to requests. The default size of this pool is 64, configurable via SlurmctldParameters=conmgr_threads [1].

>With the new conmgr we haven't seen the thread count get above 64 and our
>max_rpc_cnt is set to 128 so I don't think we are even hitting the
>throttler.

So, effectively, the thread count will never go over 64 and the throttler never actually kicks in. Both the backfill and main schedulers should still yield locks based on timers, though, with the main scheduling loop never taking more than 2s and bf yielding locks based on the defaults:

bf_yield_interval [2] -> 2s
bf_yield_sleep [3] -> 0.5s

With the traces seen, the problem seems to be lock contention rather than concurrency. So, for further tuning, you could try some value < 64 for max_rpc_cnt, to force the throttling more often. Or, maybe a better option, increase conmgr_threads (128?) and set max_rpc_cnt to 64 (a sketch of both combinations is at the end of this comment).

But you will need to test how these options adapt to your cluster. The first scenario could lead to a continuous deferral of scheduling in order to process RPCs, while the second could overload the controller by having many more threads than actual cores. I would appreciate any feedback you can give if you decide to play with these values.

>Just adding this for comparison against the plot that Oscar made. This is
>from sdiag

In any case, I have seen the graphs, and it seems a few 1k peaks happened with no issues, which is good news. The changes seem to have helped a bit, so maybe we can keep an eye on things with the current params and see if the situation repeats.

>I get the feeling we are in this thrashing regime again where the
>scheduler is just getting overwhelmed with traffic and decides to go
>head down (so to speak) and focus on scheduling and ignore other
>requests. Eventually it gets so stuck that it wedges itself.

Kind of the impression I got. Hopefully the changes made help alleviate the situation. Changing the debug level to debug could help us understand whether something in the scheduling logic happens during the blanks, but the logs will grow pretty fast with your cluster activity.

Kind regards,
Oscar

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_conmgr_threads
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_interval=#
[3] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_yield_sleep=#
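P.S. For concreteness, the two combinations above would look roughly like this in slurm.conf (the exact numbers are illustrative; keep whatever other options you already have on those lines):

# Option A: keep the default 64 conmgr threads, throttle scheduling earlier
SchedulerParameters=max_rpc_cnt=32,sched_min_interval=100000

# Option B: larger thread pool, throttle at the previous default
SlurmctldParameters=conmgr_threads=128
SchedulerParameters=max_rpc_cnt=64,sched_min_interval=100000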
I'm inclined to go conmgr_threads at 128 and then max_rpc_cnt at 64. That way we get the throttling and have enough headroom in threads to handle additional traffic. That may also allow us to remove the sched_min_interval throttle. Worth experimenting a bit to see what happens.

I will start with conmgr_threads at 128 and max_rpc_cnt at 64 and watch it for a bit to see how that behaves.

-Paul Edmon-
FYI slurmctld crashed this morning at 7:55. Here is the initial log:

Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu slurmctld[1088629]: slurmctld: _slurm_rpc_submit_batch_job: JobId=3582023 InitPrio=8179811 usec=3995
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu slurmctld[1088629]: _slurm_rpc_submit_batch_job: JobId=3582023 InitPrio=8179811 usec=3995
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu kernel: worker[17][1088647]: segfault at 40 ip 00001484de861346 sp 00001484dbed58d0 error 4 in libpthread-2.28.so[1484de854000+1b000]
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu kernel: Code: 24 83 c8 02 83 f8 03 74 f4 e9 d1 fd ff ff 66 0f 1f 44 00 00 f3 0f 1e fa 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 90 <8b> 57
Feb 12 07:54:37 holy-slurm02.rc.fas.harvard.edu systemd[1]: Started Process Core Dump (PID 3005258/UID 0).
Feb 12 07:55:00 holy-slurm02.rc.fas.harvard.edu systemd-coredump[3005259]: Process 1088629 (slurmctld) of user 57812 dumped core.

Stack trace of thread 1088647:
#0  0x00001484de861346 __pthread_rwlock_wrlock (libpthread.so.0)
#1  0x00001484deb345b3 list_append (libslurmfull.so)
#2  0x000000000046f2ce _foreach_add_to_preemptee_job_id (slurmctld)
#3  0x00001484deb34fbf list_for_each_max (libslurmfull.so)
#4  0x00001484deb350af list_for_each (libslurmfull.so)
#5  0x0000000000470996 _foreach_job_start_data_part (slurmctld)
#6  0x0000000000474805 job_start_data (slurmctld)
#7  0x0000000000462a42 job_allocate (slurmctld)
#8  0x000000000049e997 _slurm_rpc_job_will_run (slurmctld)
#9  0x00000000004a8b1c slurmctld_req (slurmctld)
#10 0x0000000000430f57 _service_connection (slurmctld)
#11 0x00001484dec2a94f _wrap_on_extract (libslurmfull.so)
#12 0x00001484dec3aab9 wrap_work (libslurmfull.so)
#13 0x00001484dec3b536 _worker (libslurmfull.so)
#14 0x00001484de85c1ca start_thread (libpthread.so.0)
#15 0x00001484ddb118d3 __clone (libc.so.6)

I will drop the log around this. I also have the core dump, so let me know what commands you want me to run on it.

-Paul Edmon-
Created attachment 40765 [details] Log for crash at 07:55 on 02-12-25
Hey Paul, Seems unrelated to the previous changes. This segfault seems to be similar to one we fixed a couple of weeks ago: https://github.com/SchedMD/slurm/commit/6b7ceb0746c7 Would you be able to patch the controller? What is the current status of the controller now? Kind regards, Oscar
Ah okay. Good to see it's fixed. When will the formal release with that come out? If it is soon, we can wait for it. The scheduler is working fine right now; it restarted without any issue. If it's going to take a bit for the formal release, then we may want to emergency patch.
Also, other than that crash this morning, things look stable with the new thread arrangement. If it still looks good over the next day, I may try dropping the sched_min_interval on Thursday to see if the higher thread count and max_rpc_cnt are sufficient to keep things in check. If not, we can fall back to sched_min_interval.
Created attachment 40767 [details]
Patch for segfault

I am attaching a patch cherry-picking this commit from the current 24.11 branch.

We still don't have a tentative release date for 24.11.2. It should fall around the end of this month. But given that this is a segfault that can be triggered by a particular job submission, I would suggest patching as soon as possible.

My colleague Ben explains the issue in ticket 21997 comment 6.

By the way, did you end up changing the conmgr_threads value? I will wait for some feedback on the cluster's general behavior.

Kind regards,
Oscar
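For reference, applying a cherry-picked patch like this to a 24.11.1 source tree usually looks something like the following. The paths, install prefix, and service name are placeholders rather than details from this ticket, and sites that build RPMs or debs will want to fold the patch into their packaging instead:

  # Sketch only: rebuild slurmctld with the fix applied.
  cd /path/to/slurm-24.11.1
  patch -p1 < /path/to/attachment-40767.patch
  ./configure --prefix=/usr/local/slurm && make -j"$(nproc)" && make install
  systemctl restart slurmctld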
Blegh, that's pretty bad. I will look at patching ASAP.

I made the change to conmgr_threads=128 and max_rpc_cnt=64 yesterday afternoon. So far it looks stable and I haven't seen any thrashing by the scheduler. Let me know if you want more data than that, as I'm happy to pull stats.
>Also, other than that crash this morning, things look stable with the new
>thread arrangement. If it still looks good over the next day, I may try
>dropping the sched_min_interval on Thursday to see if the higher thread count
>and max_rpc_cnt are sufficient to keep things in check. If not, we can
>fall back to sched_min_interval.

Sorry, missed that :). Sounds good.

>I made the change to conmgr_threads=128 and max_rpc_cnt=64 yesterday
>afternoon. So far it looks stable and I haven't seen any thrashing by the
>scheduler. Let me know if you want more data than that, as I'm happy to
>pull stats.

Good. I am mainly interested because all the new conmgr logic aims to improve high-throughput situations like yours, so feedback is highly appreciated. For now it is enough to know that the "stuck" controller situations are under control. Just one question out of curiosity: have you noticed sched_min_interval having an impact on the number of started jobs?
Yes, at least in periods of high demand. I've attached a graph of our jobs started, which includes, in yellow, the jobs started per minute. You can see that the peaks aren't nearly as high after the change on Feb 10th, but on average it looks the same.

-Paul Edmon-
Created attachment 40768 [details] Jobs started between 2-05-25 to 2-12-25
>You can see that the peaks aren't nearly as high after the change on Feb 10th,
>but on average it looks the same.

I see. That makes sense, since we are reducing the number of times the quick scheduler runs per minute. If average throughput is similar, though, tuning sched_min_interval seems like a valid approach here to help distribute the load while keeping the controller responsive.

I will keep an eye on this ticket to see how things evolve during this week.

Thanks for the feedback!
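As a side note on units: sched_min_interval is another SchedulerParameters entry and, per the slurm.conf documentation, is specified in microseconds. A quick way to confirm what the controller is currently running with (the value in the comment below is purely illustrative):

  # Check the active scheduler parameters on the running controller.
  scontrol show config | grep -i SchedulerParameters
  # Expect something like (value illustrative, 2000000 us = 2 s):
  #   SchedulerParameters = ...,max_rpc_cnt=64,sched_min_interval=2000000,...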
I concur. Right now I'm pretty happy with the behavior of the scheduler so I'm not inclined to tweak anything. The throughput looks fine and I haven't seen any stalls lately.
Just an FYI: as a test I dropped the sched_min_interval. I want to see if the max_rpc_cnt limit is sufficient to keep things under control. If things get out of hand, I will reset sched_min_interval. Also, I added bf_allow_magnetic_slot, because it's a feature I've wanted to add and now seems as good a time as any to turn it on.
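Purely as an illustration of that change (the surrounding options are site-specific and not taken from the attached config), the edit amounts to something like the following, applied with a reconfigure:

  # Sketch of the SchedulerParameters change being described:
  #   before: SchedulerParameters=max_rpc_cnt=64,sched_min_interval=...,<other options>
  #   after:  SchedulerParameters=max_rpc_cnt=64,bf_allow_magnetic_slot,<other options>
  scontrol reconfigure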
Hi Paul, Just checking in here. Any relevant issue after the changes mentioned? Cheers, Oscar
Nope, everything is stable and has been behaving much better. I think setting max_rpc_cnt lower than conmgr_threads is what did it: the scheduler automatically throttles when the thread count gets high, and there is enough additional headroom to clear out the extra requests before it hits conmgr_threads. The current combo of conmgr_threads=128 and max_rpc_cnt=64 looks to be working great.

-Paul Edmon-