Ticket 10100

Summary: Allow send_fd_over_pipe() to work with SOCK_STREAM sockets instead of just SOCK_DGRAM
Product: Slurm Reporter: Aditi Gaur <agaur>
Component: OtherAssignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: C - Contributions    
Priority: --- CC: csamuel
Version: 20.11.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8993
Site: NERSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 20.11.5 21.08.0pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: reproducer
fix for the fd issue.

Description Aditi Gaur 2020-10-29 16:03:07 MDT
Created attachment 16420 [details]
reproducer

Hello,

During the upstreaming process of a patch- I encountered a bug in src/common/fd.c in send_fd_over_pipe function as well as receieve_fd_over_pipe.

The bug is in how the message is constructed. According to the unix 7 man page:

"To pass file descriptors or credentials over a SOCK_STREAM, you need to send or receive at least one byte of nonancillary data in the same sendmsg(2) or recvmsg(2) call."

Since both the functions do not add the nonancillary data, They are unable to pass the fd correctly. I have added a reproducer to that effect. You will have to run it manually in order to see that the fd is not passed correctly.

I have also provided a patch that fixes the issue. After the patch, the fd is passed correctly. Please let me know if this is something you can reproduce on your end, hopefully its not just unique to my OS.
Comment 1 Aditi Gaur 2020-10-29 16:04:22 MDT
Created attachment 16421 [details]
fix for the fd issue.

I also modified the comment above the functions, since SCM_RIGHTS is used, the fd's can be passed between any processes, not just parent child.
Comment 2 Tim Wickberg 2020-10-29 16:15:35 MDT
What system are you seeing the problem on?

I can see that note in the man page now, but haven't seen issues with this in practice on most systems, and am curious as to what can actually expose this.

I'll look into reviewing this on 20.02 instead of 20.11, it looks important enough to pull in as a bug fix.

- Tim
Comment 3 Aditi Gaur 2020-10-29 16:25:15 MDT
I am seeing this on my dev system which is ubuntu latest. I encountered this problem during the testing of my patch here: https://bugs.schedmd.com/show_bug.cgi?id=8993

I have not seen this issue on our production systems however.
Comment 4 Aditi Gaur 2020-10-29 22:04:43 MDT
So I looked through the code to see if the bug could have been exposed somewhere else. I didnt find any evidence of that. I do see the `send_fd_over_pipe` function being used in src/slurmd/slurmd/req.c in function `_open_as_other`, but in that case the socketpair created a few lines above is created as SOCK_DGRAM. This particular issue only applies with SOCK_STREAM flag.

So I dont think in existing code this is an issue as long as SOCK_DGRAM is used. Adding an extra byte however would cover both the cases.
Comment 5 Tim Wickberg 2020-11-02 16:37:32 MST
Updating the title.
Comment 6 Danny Auble 2021-03-03 14:21:07 MST
Aditi, I can easily reproduce the error here.  Thanks for the fix.

One question I have is if there a reason you have the msg.msg_iovlen different in the 2 places?

msg.msg_iovlen = sizeof(c);

vs

msg.msg_iovlen = sizeof(iov)/sizeof(iov[0]);

The code looks very similar, but figured I would ask. The values appear to be the same as well.

Thanks!  Outside of this I think this is the correct way to go.
Comment 7 Aditi Gaur 2021-03-03 14:59:33 MST
Hi Danny,

They both do look the same. I think its probably a mistake. Either can be used.
Comment 8 Danny Auble 2021-03-04 09:04:08 MST
Comment on attachment 16421 [details]
fix for the fd issue.

This has been added in 20.11.5 commit 3b7660a2f49f
Comment 9 Danny Auble 2021-03-04 09:04:37 MST
Please reopen if anything further is needed here.