Ticket 23658

Summary: Fix FreeBSD bind() and connect() calls: pass correct sockaddr length
Product: Slurm Reporter: Rikka Göring <rikka.goering>
Component: slurmdAssignee: Tim McMullan <mcmullan>
Status: OPEN --- QA Contact:
Severity: C - Contributions    
Priority: --- CC: mcmullan
Version: 26.05.x   
Hardware: Other   
OS: Other   
Site: -Other- Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Patch to fix FreeBSD bind()/connect() calls by passing correct sockaddr lengths (uses _bsd_sockaddr_len_fix() and sets sun_len where needed). No functional changes for Linux.
freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on FreeBSD (bind/connect)
freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on FreeBSD (bind/connect)
0001 build: link sh5util against HDF5 high-level library when required
0002 configure: detect presence of BSD sockaddr length members
0003 common: add sockaddr length fixup helpers (feature-based)
0004 net: use slurm_sockaddr_fixlen() for bind/connect/getnameinfo call sites
0005 common: slurm_resolv: zero-init resolver state before res_ninit()

Description Rikka Göring 2025-09-06 04:52:57 MDT
Created attachment 43024 [details]
Patch to fix FreeBSD bind()/connect() calls by passing correct sockaddr lengths (uses _bsd_sockaddr_len_fix() and sets sun_len where needed). No functional changes for Linux.

On FreeBSD, bind() and connect() require the correct sockaddr length
to be passed, including the sa_len/sun_len fields, otherwise the call
may fail with EINVAL. This patch ensures the correct length is computed
and applied in all affected places.

The changes are minimal and guarded by #if defined(__FreeBSD__), so
there is no impact on Linux or other platforms.

Original patch prepared for the FreeBSD port of Slurm; refined during
review by Vladimir Druzenko to minimize scope and keep diffs clean.
Comment 1 Rikka Göring 2025-09-13 18:30:10 MDT
Thanks for assigning this. I just wanted to follow up and mention that I’m ready to rebase, retest, or adjust the patch if needed to help it along.
Comment 2 Tim McMullan 2025-09-29 09:09:39 MDT
Hello again!

I started to look at this and have realized that master and 23.11 have diverged on this one.  If you are able to rebase this to current master I'd really appreciate it!

Thanks!
--Tim
Comment 3 Rikka Göring 2025-09-29 18:49:17 MDT
I’ve started rebasing this patch onto current master. The process is taking a bit longer than expected because several affected files have been moved, split, or significantly changed since the 23.11 branch. I’m in the middle of translating the fixes into the new layout and will continue working on this rebase. I’ll report back once I have a clean patch ready for review.
Comment 4 Rikka Göring 2025-09-30 13:46:01 MDT
Rebase update:

I’ve rebased the FreeBSD sockaddr length fixes onto current master. Several call
sites moved into helpers (e.g., slurm_open_unix_stream) and conmgr was split, so
I translated the changes to the new layout. The build is green locally.

The patch ensures AF_UNIX uses SUN_LEN() + sun_len and AF_INET/AF_INET6 set
sin_len/sin6_len on FreeBSD, passing the exact namelen to bind()/connect() to
avoid EINVAL. Attaching freebsd-bind-fix-rebased.patch for review.
Comment 5 Rikka Göring 2025-09-30 13:49:07 MDT
Created attachment 43187 [details]
freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on FreeBSD (bind/connect)

Rebased patch against current master. Multiple files were moved or refactored since the
original version (e.g., conmgr split, stepd connect folded into slurm_open_unix_stream),
so the FreeBSD fixes were translated to the new locations.

This patch ensures:
- AF_UNIX: use SUN_LEN() and set sun_len before bind()/connect()
- AF_INET/AF_INET6: set sin_len/sin6_len and pass the correct namelen
- Avoids EINVAL on FreeBSD when binding or connecting sockets

Build tested successfully on FreeBSD. Please review for upstream inclusion.
Comment 6 Tim McMullan 2025-10-06 07:14:49 MDT
(In reply to Rikka Göring from comment #5)
> Created attachment 43187 [details]
> freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on
> FreeBSD (bind/connect)
> 
> Rebased patch against current master. Multiple files were moved or
> refactored since the
> original version (e.g., conmgr split, stepd connect folded into
> slurm_open_unix_stream),
> so the FreeBSD fixes were translated to the new locations.
> 
> This patch ensures:
> - AF_UNIX: use SUN_LEN() and set sun_len before bind()/connect()
> - AF_INET/AF_INET6: set sin_len/sin6_len and pass the correct namelen
> - Avoids EINVAL on FreeBSD when binding or connecting sockets
> 
> Build tested successfully on FreeBSD. Please review for upstream inclusion.

Hey, thank you for the effort here!

I was trying to look at this today and noticed that the file is 64 bytes in size and seems to not actually contain the patch.  I think maybe its just the wrong file?
Comment 7 Rikka Göring 2025-10-06 17:47:18 MDT
(In reply to Tim McMullan from comment #6)
> (In reply to Rikka Göring from comment #5)
> > Created attachment 43187 [details]
> > freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on
> > FreeBSD (bind/connect)
> > 
> > Rebased patch against current master. Multiple files were moved or
> > refactored since the
> > original version (e.g., conmgr split, stepd connect folded into
> > slurm_open_unix_stream),
> > so the FreeBSD fixes were translated to the new locations.
> > 
> > This patch ensures:
> > - AF_UNIX: use SUN_LEN() and set sun_len before bind()/connect()
> > - AF_INET/AF_INET6: set sin_len/sin6_len and pass the correct namelen
> > - Avoids EINVAL on FreeBSD when binding or connecting sockets
> > 
> > Build tested successfully on FreeBSD. Please review for upstream inclusion.
> 
> Hey, thank you for the effort here!
> 
> I was trying to look at this today and noticed that the file is 64 bytes in
> size and seems to not actually contain the patch.  I think maybe its just
> the wrong file?

Hi Tim,

I’m very sorry about the mix-up with the missing patch file earlier — I didn’t have the exported patch anymore, but thankfully I still had my local git branch and was able to recreate it cleanly against current master.

The new attachment should now apply correctly and reflect all the intended FreeBSD bind()/connect() length fixes.
If you notice anything else that needs adjusting, please don’t hesitate to let me know — that’s exactly what I’m here for!

Best,
Rikka
Comment 8 Rikka Göring 2025-10-06 17:48:31 MDT
Created attachment 43245 [details]
freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on FreeBSD (bind/connect)
Comment 9 Tim McMullan 2025-10-07 06:53:30 MDT
(In reply to Rikka Göring from comment #7)
> (In reply to Tim McMullan from comment #6)
> > (In reply to Rikka Göring from comment #5)
> > > Created attachment 43187 [details]
> > > freebsd-bind-fix-rebased.patch: Normalize sockaddr length handling on
> > > FreeBSD (bind/connect)
> > > 
> > > Rebased patch against current master. Multiple files were moved or
> > > refactored since the
> > > original version (e.g., conmgr split, stepd connect folded into
> > > slurm_open_unix_stream),
> > > so the FreeBSD fixes were translated to the new locations.
> > > 
> > > This patch ensures:
> > > - AF_UNIX: use SUN_LEN() and set sun_len before bind()/connect()
> > > - AF_INET/AF_INET6: set sin_len/sin6_len and pass the correct namelen
> > > - Avoids EINVAL on FreeBSD when binding or connecting sockets
> > > 
> > > Build tested successfully on FreeBSD. Please review for upstream inclusion.
> > 
> > Hey, thank you for the effort here!
> > 
> > I was trying to look at this today and noticed that the file is 64 bytes in
> > size and seems to not actually contain the patch.  I think maybe its just
> > the wrong file?
> 
> Hi Tim,
> 
> I’m very sorry about the mix-up with the missing patch file earlier — I
> didn’t have the exported patch anymore, but thankfully I still had my local
> git branch and was able to recreate it cleanly against current master.
> 
> The new attachment should now apply correctly and reflect all the intended
> FreeBSD bind()/connect() length fixes.
> If you notice anything else that needs adjusting, please don’t hesitate to
> let me know — that’s exactly what I’m here for!
> 
> Best,
> Rikka

Hi Rikka!

No worries, I've done the same thing myself in the past!  I'm glad you were able to re-create it cleanly with relative ease.

Thanks again for this, I'm taking a look at this and will let you know how that all goes!
--Tim
Comment 11 Rikka Göring 2026-01-10 14:43:41 MST
Hi Tim,
just a friendly ping on this ticket — have you had a chance to look at attachment 43245 [details]?
If it needs any adjusting or a rebase on current master, just let me know and I’ll update it.
No rush at all; I just wanted to make sure it didn’t fall off the radar.
Thanks!
—Rikka
Comment 12 Tim McMullan 2026-01-13 12:27:20 MST
Hi Rikka!

I'm sorry for the delayed reply on this!  Before SC and the holiday season kicked in I had a couple thoughts on this, but when I went to do some testing I was having trouble getting the cluster to talk to itself at all and I've been poking around that.

Are you able to spin up a cluster that actually runs a job with this patch?

I had to make a couple fixes to build properly on Linux and I think it would be ideal if we just made a common function or two to do this manipulation in a common spot /way.

I'm still getting back in the swing of things (just got back from a longer stint of time off) so will get back to you a little later with some more details, but I wanted to get back to you quickly!

Thanks!
--Tim
Comment 13 Rikka Göring 2026-01-17 03:07:21 MST
Hi Tim — absolutely no worries about the delay! That’s why I was pinging through. I figured SchedMD folks had a lot going on around new year (SC, holidays), so I didn’t expect a quick answer.

On this patch: I initially couldn’t get my cluster to communicate. The attachment was basically a straight rebase of the 23.11 patch, and I clearly missed something during testing and thought it was working when it wasn’t. I’ve since made a few adjustments, and now I can get a FreeBSD controller + FreeBSD worker to talk and run jobs (e.g. `srun -N1 -n1 /bin/hostname` works fine).

I’ll send an updated patch once it’s cleaned up and builds on Linux again.

That said, the updated version still has `#if defined(__FreeBSD__)` guards, and I agree that’s not the ideal long-term solution. Next I’ll rework it into a platform-agnostic approach with a common helper (or two) in a shared spot, as you suggested.

Sorry for breaking the Linux build with the previous patch — that shouldn’t have happened. I’ll add more generalized testing on my side going forward to avoid that.

If it helps, I can send a short list of the call sites I touched, but I’ll hold off attaching work-in-progress until it builds cleanly on Linux as well.

—Rikka
Comment 14 Tim McMullan 2026-01-20 07:21:03 MST
(In reply to Rikka Göring from comment #13)
> Hi Tim — absolutely no worries about the delay! That’s why I was pinging
> through. I figured SchedMD folks had a lot going on around new year (SC,
> holidays), so I didn’t expect a quick answer.

Thank you!  It has certainly been exciting times!

> On this patch: I initially couldn’t get my cluster to communicate. The
> attachment was basically a straight rebase of the 23.11 patch, and I clearly
> missed something during testing and thought it was working when it wasn’t.
> I’ve since made a few adjustments, and now I can get a FreeBSD controller +
> FreeBSD worker to talk and run jobs (e.g. `srun -N1 -n1 /bin/hostname` works
> fine).
> 
> I’ll send an updated patch once it’s cleaned up and builds on Linux again.

Thanks!  That at least makes me feel a bit more sane!  25.05 and beyond have had some substantial network stack changes which does make doing other development in the network stack a little more difficult.  If it helps, my suggestion would be to just try things on the master branch since ultimately that is where I will end up landing the changes first.  We *shouldn't* introduce new build problems for FreeBSD since it is in our build test pipeline.

> That said, the updated version still has `#if defined(__FreeBSD__)` guards,
> and I agree that’s not the ideal long-term solution. Next I’ll rework it
> into a platform-agnostic approach with a common helper (or two) in a shared
> spot, as you suggested.

That sounds great to me!

> Sorry for breaking the Linux build with the previous patch — that shouldn’t
> have happened. I’ll add more generalized testing on my side going forward to
> avoid that.

No worries!  I really appreciate all the work you have put in to this, and fixing the linux build isn't a big deal :)

> If it helps, I can send a short list of the call sites I touched, but I’ll
> hold off attaching work-in-progress until it builds cleanly on Linux as well.

I'm happy to just wait for the new patch since you are already working on it.

Thanks again for your help and work on this!
--Tim
Comment 15 Rikka Göring 2026-01-27 02:03:27 MST
Created attachment 44208 [details]
0001 build: link sh5util against HDF5 high-level library when required
Comment 16 Rikka Göring 2026-01-27 02:06:29 MST
Created attachment 44209 [details]
0002 configure: detect presence of BSD sockaddr length members
Comment 17 Rikka Göring 2026-01-27 02:07:05 MST
Created attachment 44210 [details]
0003 common: add sockaddr length fixup helpers (feature-based)
Comment 18 Rikka Göring 2026-01-27 02:07:38 MST
Created attachment 44211 [details]
0004 net: use slurm_sockaddr_fixlen() for bind/connect/getnameinfo call sites
Comment 19 Rikka Göring 2026-01-27 02:08:25 MST
Created attachment 44212 [details]
0005 common: slurm_resolv: zero-init resolver state before res_ninit()
Comment 20 Rikka Göring 2026-01-27 02:09:38 MST
Hi Tim,

I’ve reworked the BSD sockaddr-length fix into a small series on current master and can now reliably get a FreeBSD controller and FreeBSD worker to communicate again (the original bind/connect EINVAL issue is resolved with this series). Linux builds were unaffected in my testing.

Summary of the series:
 - 2/5: configure: detect presence of BSD-style sockaddr length members (sa_len/sin_len/sin6_len/sun_len)
 - 3/5: common: introduce feature-based helpers in src/common/slurm_sockaddr.[ch] (no OS-specific guards)
 - 4/5: networking: wire the helpers into the affected bind/connect/getnameinfo call sites

While testing, I hit a separate FreeBSD crash in DNS SRV discovery:

 - 5/5: common: slurm_resolv: zero-initialize struct __res_state before res_ninit() (prevents a segfault in resolver teardown on FreeBSD). I kept this as its own commit so it can be reviewed/landed independently if you prefer.

Note: I’m currently debugging a separate “job step doesn’t complete / stepd not staying up” issue on the worker (no listener on /var/spool/slurmd/_<job.step>). I don’t believe it’s caused by the sockaddr-length changes, but I’m continuing to investigate in parallel and will report back once I have a concrete root cause.

If it helps review, feel free to cherry-pick individual commits from the series.

Thanks again,
Rikka
Comment 21 Tim McMullan 2026-01-29 11:37:39 MST
Thank you so much for this!

I'm going to start taking a look at it and testing it and I'll let you know how it goes!

Thanks again, I really appreciate all of this!
--Tim