Created attachment 40652 [details]
output of 'thread apply all bt full'

Hi,

We just moved from 25.05 to 24.11.1 and are facing repeated segfaults in slurmctld (the same issue already happened twice in less than 4 hours). Here's the stack trace:

(gdb) bt
#0  0x00007f307988536b in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00007f3079b4c25b in list_append (l=0x0, x=0x7f30141456a0) at list.c:257
#2  0x000000000046c05e in _foreach_add_to_preemptee_job_id (x=0x7a0ca70, arg=0x7f301453d820) at job_scheduler.c:4311
#3  0x00007f3079b4cc16 in list_for_each_max (l=0x7f3014065460, max=max@entry=0x7f307685f9a4, f=f@entry=0x46c00a <_foreach_add_to_preemptee_job_id>, arg=arg@entry=0x7f301453d820, break_on_fail=break_on_fail@entry=1, write_lock=write_lock@entry=1) at list.c:605
#4  0x00007f3079b4ccfe in list_for_each (l=<optimized out>, f=f@entry=0x46c00a <_foreach_add_to_preemptee_job_id>, arg=arg@entry=0x7f301453d820) at list.c:572
#5  0x000000000046d777 in _foreach_job_start_data_part (x=0x669d5c0, arg=arg@entry=0x7f307685fad0) at job_scheduler.c:4470
#6  0x0000000000471165 in job_start_data (job_ptr=0x7f301405ca40, resp=resp@entry=0x7f307685fc48) at job_scheduler.c:4523
#7  0x0000000000460093 in job_allocate (job_desc=job_desc@entry=0x7f3018315380, immediate=immediate@entry=0, will_run=will_run@entry=1, resp=resp@entry=0x7f307685fc48, allocate=allocate@entry=1, submit_uid=<optimized out>, cron=cron@entry=false, job_pptr=job_pptr@entry=0x7f307685fc40, err_msg=err_msg@entry=0x7f307685fc50, protocol_version=10752) at job_mgr.c:4234
#8  0x0000000000499449 in _slurm_rpc_job_will_run (msg=0x7f301817ff70) at proc_req.c:2425
#9  0x00000000004a369b in slurmctld_req (msg=msg@entry=0x7f301817ff70, this_rpc=this_rpc@entry=0x715f90 <slurmctld_rpcs+4784>) at proc_req.c:6935
#10 0x0000000000430690 in _service_connection (conmgr_args=..., input_fd=17, output_fd=17, arg=0x7f301817ff70) at controller.c:1717
#11 0x00007f3079c3fb07 in _wrap_on_extract (conmgr_args=..., arg=<optimized out>) at con.c:1480
#12 0x00007f3079c4f47a in wrap_work (work=work@entry=0x7f3014004d60) at work.c:181
#13 0x00007f3079c4ffa3 in _worker (arg=0x2240f10) at workers.c:251
#14 0x00007f3079881ea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f3078c82b2d in clone () from /lib64/libc.so.6

Attached is the output of 'thread apply all bt full'.

Thanks!
--
Kilian
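
For context on the top frames: frame #1 shows list_append() being handed l=0x0, and judging from frame #0 the list's embedded rwlock is the first thing list_append() touches, so the NULL pointer is dereferenced inside pthread_rwlock_wrlock(). Below is a stand-alone toy illustration of that failure mode, not Slurm source; the toy_list type and all names are made up, and running it is expected to segfault.

/*
 * Toy illustration (NOT Slurm source) of the failure mode above: appending
 * through a NULL list pointer means the list's embedded rwlock is locked
 * through address 0, so the fault surfaces inside pthread_rwlock_wrlock().
 * This program deliberately crashes when run.
 */
#include <pthread.h>

struct toy_list {
    pthread_rwlock_t lock;    /* embedded lock, as suggested by frames #0/#1 */
    void *head;
};

static void toy_list_append(struct toy_list *l, void *x)
{
    pthread_rwlock_wrlock(&l->lock);    /* l == NULL -> SIGSEGV here */
    /* ... node insertion would go here ... */
    (void) x;
    pthread_rwlock_unlock(&l->lock);
}

int main(void)
{
    struct toy_list *l = NULL;                 /* list was never created */
    toy_list_append(l, (void *) "job_id");     /* mirrors list_append(l=0x0, ...) */
    return 0;
}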
We moved from 24.05 to 24.11.1, not from 25.05, of course :)

Cheers,
--
Kilian
Hi,

Raising this to Sev1 as the segfault happened multiple times overnight and took down our 4 controllers, one after the other, which made the system unavailable for several hours.

Cheers,
--
Kilian
(In reply to Kilian Cavalotti from comment #3)
> Hi,
>
> Raising this to Sev1 as the segfault happened multiple times overnight and
> took down our 4 controllers, one after the other, which made the system
> unavailable for several hours.
>
> Cheers,
> --
> Kilian

Hi Kilian,

I have a reproducer for this, and I'm working to see how to best fix it.
Created attachment 40663 [details]
24.11 v1 test

Looks like this happens when you run a `--test-only` job (sbatch/srun option) that would preempt another job. e.g.:

slurm.conf:
> PreemptType=preempt/qos

qos config:
> ben@xps:~/slurm/24.11/xps$ sacctmgr show qos format=name,preempt
>       Name    Preempt
> ---------- ----------
>     normal
>       high     normal

Low priority job:
> $ sbatch --wrap="sleep 100000" -wn1 --exclusive --qos=normal

Now submit a high priority job with --test-only that would preempt the low priority job:
> $ sbatch --wrap="sleep 10000" -wn1 --exclusive --qos=high --test-only
> Submitted batch job 36
> * slurmctld segfault *

I created a patch that fixes the segfault in my environment. Could you please test it in your environment? Specifically, try to submit a high priority job with --test-only, and see if you get output like this:
> Submitted batch job 55
> sbatch: Job 56 to start at 2025-02-05T09:09:58 using 12 processors on nodes n1 in partition a
> sbatch: Preempts: 55

If you see the "Preempts" line, then the patch worked.
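
For reference, this is the general shape such a guard can take: make sure the destination list exists before the foreach callback appends preemptee job IDs into it, instead of appending into NULL. The snippet below is a self-contained toy sketch, not the actual patch and not Slurm's list API; every name in it is hypothetical.

/*
 * Hypothetical sketch of the guard pattern such a fix could take.
 * Toy types only -- not the actual patch or Slurm's list API.
 */
#include <stdio.h>
#include <stdlib.h>

struct toy_node { unsigned job_id; struct toy_node *next; };
struct toy_list { struct toy_node *head; };

static struct toy_list *toy_list_create(void)
{
    return calloc(1, sizeof(struct toy_list));
}

static void toy_list_append(struct toy_list *l, unsigned job_id)
{
    struct toy_node *n = calloc(1, sizeof(*n));
    n->job_id = job_id;
    n->next = l->head;
    l->head = n;
}

/* Stand-in for the will-run response that collects preemptee job IDs. */
struct will_run_resp {
    struct toy_list *preemptee_job_ids;   /* NULL until first preemptee is seen */
};

/* Foreach-style callback over candidate preemptee jobs. */
static int add_to_preemptee_job_id(unsigned job_id, struct will_run_resp *resp)
{
    /* Guard: lazily create the list instead of appending into NULL,
     * which is effectively what the unpatched --test-only path did. */
    if (!resp->preemptee_job_ids)
        resp->preemptee_job_ids = toy_list_create();
    toy_list_append(resp->preemptee_job_ids, job_id);
    return 0;   /* keep iterating */
}

int main(void)
{
    struct will_run_resp resp = { 0 };
    unsigned preemptees[] = { 55, 54 };

    for (size_t i = 0; i < sizeof(preemptees) / sizeof(preemptees[0]); i++)
        add_to_preemptee_job_id(preemptees[i], &resp);

    /* Roughly mirrors the "sbatch: Preempts: ..." line shown above. */
    printf("Preempts:");
    for (struct toy_node *n = resp.preemptee_job_ids->head; n; n = n->next)
        printf(" %u", n->job_id);
    printf("\n");
    return 0;
}

The real fix may well create the list earlier (e.g., when the will-run response is built) rather than lazily in the callback; the point is only that the append path never sees a NULL list.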
Hi Ben,

(In reply to Ben Glines from comment #6)
> Looks like this happens when you run a `--test-only` job (sbatch/srun
> option) that would preempt another job. e.g.:

Thanks for the quick investigation! I can confirm that the reproducer works: preemption with `--test-only` is fatal to the controller. :\

$ sbatch -p serc --test-only --exclusive --wrap="echo ok"
allocation failure: Zero Bytes were transmitted or received

> I created a patch that fixes the segfault in my environment. Could you please
> test it in your environment? Specifically, try to submit a high priority job
> with --test-only, and see if you get output like this:
> > Submitted batch job 55
> > sbatch: Job 56 to start at 2025-02-05T09:09:58 using 12 processors on nodes n1 in partition a
> > sbatch: Preempts: 55
> If you see the "Preempts" line, then the patch worked.

I've applied the patch and deployed it, and submitting a preempting job with `--test-only` now works:

$ sbatch -p serc --test-only --exclusive --wrap="echo ok"
sbatch: Job 59363366 to start at 2025-02-05T09:12:02 using 24 processors on nodes sh02-10n62 in partition serc
sbatch: Preempts: 59339582,59336519

I'm decreasing back to Sev3, will monitor for a few more hours, and give an update later today.

Thanks a lot!

Cheers,
--
Kilian
(In reply to Kilian Cavalotti from comment #7)
> I've applied the patch and deployed it, and submitting a preempting job
> with `--test-only` now works:

That's great! Thanks for letting us know. I'll update you on what we decide to do upstream.

> I'm decreasing back to Sev3, will monitor for a few more hours, and give an
> update later today.

Sounds good.
(In reply to Ben Glines from comment #8)
> > I'm decreasing back to Sev3, will monitor for a few more hours, and give an
> > update later today.
>
> Sounds good.

No more occurrences of this segfault to report after applying the patch. Thank you!

--
Kilian
Good to hear!

I just noticed that we had actually already found this internally too, and a patch went upstream last week:
https://github.com/SchedMD/slurm/commit/6b7ceb0746c7

The fix should be available in the next 24.11 release. I'm glad we were able to get you a patch sooner, though :)

Thanks for logging this! Closing now.