Ticket 21997 - slurmctld segfault in job_allocate()
Summary: slurmctld segfault in job_allocate()
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 24.11.1
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Ben Glines
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-02-04 23:02 MST by Kilian Cavalotti
Modified: 2025-02-05 16:44 MST
CC List: 3 users

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 24.11.2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
output of 'thread apply all bt full' (100.78 KB, text/x-log)
2025-02-04 23:02 MST, Kilian Cavalotti
24.11 v1 test (819 bytes, patch)
2025-02-05 09:22 MST, Ben Glines

Description Kilian Cavalotti 2025-02-04 23:02:42 MST
Created attachment 40652
output of 'thread apply all bt full'

Hi,

We just moved from 25.05 to 24.11.1 and are facing repeated segfaults in slurmctld (the same issue already happened twice in less than 4 hours).

Here's the stack trace:

(gdb) bt
#0  0x00007f307988536b in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00007f3079b4c25b in list_append (l=0x0, x=0x7f30141456a0) at list.c:257
#2  0x000000000046c05e in _foreach_add_to_preemptee_job_id (x=0x7a0ca70, arg=0x7f301453d820) at job_scheduler.c:4311
#3  0x00007f3079b4cc16 in list_for_each_max (l=0x7f3014065460, max=max@entry=0x7f307685f9a4, f=f@entry=0x46c00a <_foreach_add_to_preemptee_job_id>, arg=arg@entry=0x7f301453d820, break_on_fail=break_on_fail@entry=1, write_lock=write_lock@entry=1) at list.c:605
#4  0x00007f3079b4ccfe in list_for_each (l=<optimized out>, f=f@entry=0x46c00a <_foreach_add_to_preemptee_job_id>, arg=arg@entry=0x7f301453d820) at list.c:572
#5  0x000000000046d777 in _foreach_job_start_data_part (x=0x669d5c0, arg=arg@entry=0x7f307685fad0) at job_scheduler.c:4470
#6  0x0000000000471165 in job_start_data (job_ptr=0x7f301405ca40, resp=resp@entry=0x7f307685fc48) at job_scheduler.c:4523
#7  0x0000000000460093 in job_allocate (job_desc=job_desc@entry=0x7f3018315380, immediate=immediate@entry=0, will_run=will_run@entry=1, resp=resp@entry=0x7f307685fc48, allocate=allocate@entry=1, submit_uid=<optimized out>, cron=cron@entry=false,
    job_pptr=job_pptr@entry=0x7f307685fc40, err_msg=err_msg@entry=0x7f307685fc50, protocol_version=10752) at job_mgr.c:4234
#8  0x0000000000499449 in _slurm_rpc_job_will_run (msg=0x7f301817ff70) at proc_req.c:2425
#9  0x00000000004a369b in slurmctld_req (msg=msg@entry=0x7f301817ff70, this_rpc=this_rpc@entry=0x715f90 <slurmctld_rpcs+4784>) at proc_req.c:6935
#10 0x0000000000430690 in _service_connection (conmgr_args=..., input_fd=17, output_fd=17, arg=0x7f301817ff70) at controller.c:1717
#11 0x00007f3079c3fb07 in _wrap_on_extract (conmgr_args=..., arg=<optimized out>) at con.c:1480
#12 0x00007f3079c4f47a in wrap_work (work=work@entry=0x7f3014004d60) at work.c:181
#13 0x00007f3079c4ffa3 in _worker (arg=0x2240f10) at workers.c:251
#14 0x00007f3079881ea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f3078c82b2d in clone () from /lib64/libc.so.6
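
An observation on the trace: frame #1 shows list_append() being called with l=0x0, so whatever list is meant to collect the preemptee job IDs apparently doesn't exist on this code path, and taking the rwlock on that NULL pointer is what faults in frame #0. A minimal, self-contained illustration of that failure mode (toy types only, not Slurm's actual list implementation; the names below are made up):

/* Illustration only: hypothetical stand-ins for the list API, showing why
 * appending to a NULL list faults inside pthread_rwlock_wrlock(). */
#include <pthread.h>
#include <stdlib.h>

typedef struct toy_list {
    pthread_rwlock_t lock;   /* taken before every mutation, as in frame #0 */
    void **items;
    size_t count;
} toy_list_t;

static void toy_list_append(toy_list_t *l, void *x)
{
    pthread_rwlock_wrlock(&l->lock);   /* l == NULL -> segfault right here */
    l->items = realloc(l->items, (l->count + 1) * sizeof(void *));
    l->items[l->count++] = x;
    pthread_rwlock_unlock(&l->lock);
}

int main(void)
{
    toy_list_t *preemptee_job_id = NULL;    /* never created on this path */
    int job_id = 36;
    toy_list_append(preemptee_job_id, &job_id);   /* crashes as in frames #0/#1 */
    return 0;
}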

Attached is the output of 'thread apply all bt full'.

Thanks!
--
Kilian
Comment 1 Kilian Cavalotti 2025-02-04 23:14:01 MST
We moved from 24.05 to 24.11.1, not from 25.05, of course :)

Cheers,
--
Kilian
Comment 3 Kilian Cavalotti 2025-02-05 08:15:23 MST
Hi,

Raising this to Sev1 as the segfault happened multiple times overnight and took down our 4 controllers, one after the other, which made the system unavailable for several hours.

Cheers,
--
Kilian
Comment 5 Ben Glines 2025-02-05 09:03:16 MST
(In reply to Kilian Cavalotti from comment #3)
> Hi,
> 
> Raising this to Sev1 as the segfault happened multiple times overnight and
> took down our 4 controllers, one after the other, which made the system
> unavailable for several hours.
> 
> Cheers,
> --
> Kilian

Hi Kilian,

I have a reproducer for this and am working out how best to fix it.
Comment 6 Ben Glines 2025-02-05 09:22:16 MST
Created attachment 40663
24.11 v1 test

Looks like this happens when you run a `--test-only` job (an sbatch/srun option) that would preempt another job, e.g.:

slurm.conf:
> PreemptType=preempt/qos
qos config:
> ben@xps:~/slurm/24.11/xps$ sacctmgr show qos format=name,preempt
>       Name    Preempt
> ---------- ----------
>     normal
>       high     normal
Low-priority job:
> $ sbatch --wrap="sleep 100000" -wn1 --exclusive --qos=normal
Now submit a high-priority job with --test-only that would preempt the low-priority job:
> $ sbatch --wrap="sleep 10000" -wn1 --exclusive --qos=high --test-only
> Submitted batch job 36
> * slurmctld segfault *

I created a patch that fixes the segfault in my environment. Could you please test it in your environment? Specifically try to submit a high priority job with --test-only, and see if you see an output like this:
> Submitted batch job 55
> sbatch: Job 56 to start at 2025-02-05T09:09:58 using 12 processors on nodes n1 in partition a
> sbatch:   Preempts: 55
If you see the "Preempts" line, then the patch worked.
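
The attached patch itself isn't reproduced in this ticket, so the following is only a hedged sketch of the general shape such a fix tends to take: make sure the list that collects preemptee job IDs exists before the foreach callback appends to it. All type and function names here are toy stand-ins, not Slurm's real structures.

/* Hedged sketch only -- not the attached patch or the upstream commit.
 * Toy types stand in for the list and will-run response structures. */
#include <stdio.h>
#include <stdlib.h>

typedef struct toy_list {
    void **items;
    size_t count;
} toy_list_t;

static toy_list_t *toy_list_create(void)
{
    return calloc(1, sizeof(toy_list_t));
}

static void toy_list_append(toy_list_t *l, void *x)
{
    l->items = realloc(l->items, (l->count + 1) * sizeof(void *));
    l->items[l->count++] = x;
}

/* Stand-in for the will-run response that collects preemptee job IDs. */
struct will_run_resp {
    toy_list_t *preemptee_job_id;   /* NULL until explicitly created */
};

static int add_to_preemptee_job_id(void *x, void *arg)
{
    struct will_run_resp *resp = arg;
    if (!resp->preemptee_job_id)                    /* the missing guard */
        resp->preemptee_job_id = toy_list_create();
    toy_list_append(resp->preemptee_job_id, x);
    return 0;   /* keep iterating */
}

int main(void)
{
    struct will_run_resp resp = { 0 };
    static int preemptee_job = 55;
    add_to_preemptee_job_id(&preemptee_job, &resp);  /* no longer crashes */
    printf("collected %zu preemptee job id(s)\n", resp.preemptee_job_id->count);
    free(resp.preemptee_job_id->items);
    free(resp.preemptee_job_id);
    return 0;
}

Whether the real fix adds the guard in the callback, in the caller, or creates the list up front when building the will-run response is something only the actual patch and upstream commit answer; the sketch just shows the invariant that has to hold before list_append() is reached.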
Comment 7 Kilian Cavalotti 2025-02-05 09:54:53 MST
Hi Ben,

(In reply to Ben Glines from comment #6)
> Looks like this happens when you run a `--test-only` job (sbatch/srun
> option) that would preempt another job. e.g.:

Thanks for the quick investigation! 

I can confirm that the reproducer works: preemption with `--test-only` seems fatal to the controller. :\

$ sbatch -p serc --test-only --exclusive --wrap="echo ok"
allocation failure: Zero Bytes were transmitted or received


> I created a patch that fixes the segfault in my environment. Could you please
> test it in your environment? Specifically try to submit a high priority job
> with --test-only, and see if you see an output like this:
> > Submitted batch job 55
> > sbatch: Job 56 to start at 2025-02-05T09:09:58 using 12 processors on nodes n1 in partition a
> > sbatch:   Preempts: 55
> If you see the "Preempts" line, then the patch worked.

I've applied and deployed the patch, and submitting a preempting job with `--test-only` now works as expected!

$ sbatch -p serc --test-only --exclusive --wrap="echo ok"
sbatch: Job 59363366 to start at 2025-02-05T09:12:02 using 24 processors on nodes sh02-10n62 in partition serc
sbatch:   Preempts: 59339582,59336519


I'm decreasing back to Sev3, will monitor for a few more hours, and give an update later today.

Thanks a lot!

Cheers,
--
Kilian
Comment 8 Ben Glines 2025-02-05 09:57:09 MST
(In reply to Kilian Cavalotti from comment #7)
> I've applied and deployed the patch, and submitting a preempting job with
> `--test-only` now works as expected!
That's great! Thanks for letting us know. I'll update you on what we decide to do upstream.
> I'm decreasing back to Sev3, will monitor for a few more hours, and give an
> update later today.
Sounds good.
Comment 9 Kilian Cavalotti 2025-02-05 15:28:54 MST
(In reply to Ben Glines from comment #8)
> > I'm decreasing back to Sev3, will monitor for a few more hours, and give an
> > update later today.
> Sounds good.

No more of this segfault to report after applying the patch.

Thank you!
--
Kilian
Comment 10 Ben Glines 2025-02-05 16:44:16 MST
Good to hear!

I just noticed that this had already been found internally as well, and a fix was pushed upstream last week:
https://github.com/SchedMD/slurm/commit/6b7ceb0746c7

The fix should be available in the next 24.11 release. I'm glad we were able to get you a patch sooner though :) Thanks for logging this! Closing now.