Ticket 18139 - salloc: socket leak with X11 forwarding?
Summary: salloc: socket leak with X11 forwarding?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 23.02.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-11-06 16:47 MST by Kilian Cavalotti
Modified: 2024-02-08 18:24 MST
CC List: 1 user

See Also:
Site: Stanford
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.7, 23.11.0rc2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
reproducer script (75 bytes, text/x-python)
2023-11-06 16:47 MST, Kilian Cavalotti

Description Kilian Cavalotti 2023-11-06 16:47:01 MST
Created attachment 33202
reproducer script

Hi!

We had a user reporting an issue that I believe is the consequence of salloc leaking TCP sockets when X11 forwarding is enabled. I'm attaching a minimal reproducer script that uses Python and matplotlib.
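
For reference, here's roughly what the reproducer looks like (illustrative sketch only, not the verbatim 75-byte attachment; it assumes an X11-capable matplotlib backend such as TkAgg):

# sock_leak_test.py -- illustrative sketch, assuming an X11 backend:
# merely creating and closing a matplotlib figure over the forwarded
# $DISPLAY is enough to make salloc open new forwarding sockets that
# it never releases.
import matplotlib.pyplot as plt

fig = plt.figure()   # connects to the forwarded X11 display
plt.close(fig)       # the window goes away, salloc's sockets do not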

The issue manifests as salloc returning error messages like this after some time, on the submission host (not on the compute nodes where the job is executed):
“salloc: error: Error on msg accept socket: Too many open files”. 

And looking at open files on the submission host, the number of sockets opened by salloc does indeed increase each time the script is executed on the compute node.

login1 $ salloc -p test --x11
salloc: Pending job allocation 35824643
salloc: Nodes sh02-01n60 are ready for job
sh02-01n60 (job 35824643) $ 

Before the first execution of the script, we check the number of sockets open on the login node:
login1 $ lsof -wp $(pidof salloc)  | grep sock | wc -l
0

We start the script on the compute node:
sh02-01n60 (job 35824643) $ python3 sock_leak_test.py
sh02-01n60 (job 35824643) $ 

and check open sockets on the login node:
login1 $ lsof -wp $(pidof salloc)  | grep sock | wc -l
10
login1 $

after the second execution of the script:
sh02-01n60 (job 35824643) $ python3 sock_leak_test.py
sh02-01n60 (job 35824643) $ 

login1 $ lsof -wp $(pidof salloc) | grep sock | wc -l
20
login1 $

and so on... sockets keep being created by salloc. 

After a few executions:
login1 $ lsof -wp $(pidof salloc)  | grep sock
salloc  34870 kilian    6u  sock         0,7      0t0          272262770 protocol: TCP
salloc  34870 kilian    7u  sock         0,7      0t0          272262773 protocol: TCP
salloc  34870 kilian    8u  sock         0,7      0t0          272446102 protocol: TCP
salloc  34870 kilian    9u  sock         0,7      0t0          272446105 protocol: TCP
salloc  34870 kilian   10u  sock         0,7      0t0          270974766 protocol: TCP
salloc  34870 kilian   11u  sock         0,7      0t0          270974769 protocol: TCP
salloc  34870 kilian   12u  sock         0,7      0t0          271960477 protocol: TCP
salloc  34870 kilian   13u  sock         0,7      0t0          271960480 protocol: TCP
salloc  34870 kilian   14u  sock         0,7      0t0          272379311 protocol: TCP
salloc  34870 kilian   15u  sock         0,7      0t0          272379314 protocol: TCP
salloc  34870 kilian   16u  sock         0,7      0t0          273164584 protocol: TCP
salloc  34870 kilian   17u  sock         0,7      0t0          273164587 protocol: TCP
salloc  34870 kilian   18u  sock         0,7      0t0          272943063 protocol: TCP
salloc  34870 kilian   19u  sock         0,7      0t0          273146482 protocol: TCP
salloc  34870 kilian   20u  sock         0,7      0t0          273155142 protocol: TCP
salloc  34870 kilian   21u  sock         0,7      0t0          273155145 protocol: TCP
salloc  34870 kilian   22u  sock         0,7      0t0          271960908 protocol: TCP
salloc  34870 kilian   23u  sock         0,7      0t0          271960911 protocol: TCP
salloc  34870 kilian   24u  sock         0,7      0t0          272943070 protocol: TCP
salloc  34870 kilian   25u  sock         0,7      0t0          273146508 protocol: TCP
[...]
salloc  34870 kilian  146u  sock         0,7      0t0          273038539 protocol: TCP
salloc  34870 kilian  147u  sock         0,7      0t0          273038542 protocol: TCP
salloc  34870 kilian  148u  sock         0,7      0t0          273196151 protocol: TCP
salloc  34870 kilian  149u  sock         0,7      0t0          273396055 protocol: TCP
salloc  34870 kilian  150u  sock         0,7      0t0          273392526 protocol: TCP
salloc  34870 kilian  151u  sock         0,7      0t0          273392529 protocol: TCP
salloc  34870 kilian  152u  sock         0,7      0t0          273392530 protocol: TCP
salloc  34870 kilian  153u  sock         0,7      0t0          273396057 protocol: TCP
salloc  34870 kilian  154u  sock         0,7      0t0          273196154 protocol: TCP
salloc  34870 kilian  155u  sock         0,7      0t0          273196157 protocol: TCP
login1 $


This seems to happen with any X11 program when using Slurm's built-in X11 forwarding (PrologFlags=x11).
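
For completeness, the relevant bit of our configuration (illustrative slurm.conf excerpt, not our full config):

# slurm.conf -- built-in X11 forwarding is enabled by adding x11 to PrologFlags
PrologFlags=x11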

Is this a known issue?

Thanks!
--
Kilian
Comment 2 Tim McMullan 2023-11-08 06:07:57 MST
Hi Kilian,

I've been able to reproduce this and it looks like it may be a behavior that has been around for a while.

I can see why this is happening in the code and I'm taking a look at some possible ways to avoid this.

Thanks!
--Tim
Comment 3 Kilian Cavalotti 2023-11-08 08:37:35 MST
Hi Tim,

(In reply to Tim McMullan from comment #2)
> I've been able to reproduce this and it looks like it may be a behavior that
> has been around for a while.
> 
> I can see why this is happening in the code and I'm taking a look at some
> possible ways to avoid this.

Glad to hear the problem can be reproduced. Thanks for looking into it! 

Cheers,
--
Kilian
Comment 10 Tim McMullan 2023-11-13 08:04:00 MST
Hi Kilian,

I tracked down the issue here.  The fix has been pushed in https://github.com/SchedMD/slurm/commit/ecfc7f6ff7 and should be included in 23.02.7!

Thanks for bringing this to our attention! I'll close this now, but let us know if you have any questions!
--Tim
Comment 11 Kilian Cavalotti 2023-11-13 10:10:06 MST
(In reply to Tim McMullan from comment #10)
> Hi Kilian,
> 
> I tracked down the issue here.  The fix has been pushed in
> https://github.com/SchedMD/slurm/commit/ecfc7f6ff7 and should be included in
> 23.02.7!
> 
> Thanks for bringing this to our attention! I'll close this now, but let us
> know if you have any questions!

Excellent, thanks Tim!

Cheers,
--
Kilian