Created attachment 33202 [details]
reproducer script

Hi!

We had a user reporting an issue that I believe is the consequence of salloc leaking TCP sockets when X11 forwarding is enabled. I'm attaching a minimal reproducer script using Python matplotlib.

The issue manifests itself after some time as salloc returning error messages like this on the submission host (not the compute nodes where the job is executed):

salloc: error: Error on msg accept socket: Too many open files

Looking at open files on the submission host, the number of sockets opened by salloc does indeed increase each time the script is executed on the compute node.

We request an allocation with X11 forwarding:

login1 $ salloc -p test --x11
salloc: Pending job allocation 35824643
salloc: Nodes sh02-01n60 are ready for job
sh02-01n60 (job 35824643) $

Before the first execution of the script, we check the number of sockets open on the login node:

login1 $ lsof -wp $(pidof salloc) | grep sock | wc -l
0

We start the script on the compute node:

sh02-01n60 (job 35824643) $ python3 sock_leak_test.py
sh02-01n60 (job 35824643) $

and check open sockets on the login node:

login1 $ lsof -wp $(pidof salloc) | grep sock | wc -l
10

After the second execution of the script:

sh02-01n60 (job 35824643) $ python3 sock_leak_test.py
sh02-01n60 (job 35824643) $

login1 $ lsof -wp $(pidof salloc) | grep sock | wc -l
20

and so on: sockets keep being created by salloc. After a few executions:

login1 $ lsof -wp $(pidof salloc) | grep sock
salloc 34870 kilian 6u sock 0,7 0t0 272262770 protocol: TCP
salloc 34870 kilian 7u sock 0,7 0t0 272262773 protocol: TCP
salloc 34870 kilian 8u sock 0,7 0t0 272446102 protocol: TCP
salloc 34870 kilian 9u sock 0,7 0t0 272446105 protocol: TCP
salloc 34870 kilian 10u sock 0,7 0t0 270974766 protocol: TCP
salloc 34870 kilian 11u sock 0,7 0t0 270974769 protocol: TCP
salloc 34870 kilian 12u sock 0,7 0t0 271960477 protocol: TCP
salloc 34870 kilian 13u sock 0,7 0t0 271960480 protocol: TCP
salloc 34870 kilian 14u sock 0,7 0t0 272379311 protocol: TCP
salloc 34870 kilian 15u sock 0,7 0t0 272379314 protocol: TCP
salloc 34870 kilian 16u sock 0,7 0t0 273164584 protocol: TCP
salloc 34870 kilian 17u sock 0,7 0t0 273164587 protocol: TCP
salloc 34870 kilian 18u sock 0,7 0t0 272943063 protocol: TCP
salloc 34870 kilian 19u sock 0,7 0t0 273146482 protocol: TCP
salloc 34870 kilian 20u sock 0,7 0t0 273155142 protocol: TCP
salloc 34870 kilian 21u sock 0,7 0t0 273155145 protocol: TCP
salloc 34870 kilian 22u sock 0,7 0t0 271960908 protocol: TCP
salloc 34870 kilian 23u sock 0,7 0t0 271960911 protocol: TCP
salloc 34870 kilian 24u sock 0,7 0t0 272943070 protocol: TCP
salloc 34870 kilian 25u sock 0,7 0t0 273146508 protocol: TCP
[...]
salloc 34870 kilian 146u sock 0,7 0t0 273038539 protocol: TCP
salloc 34870 kilian 147u sock 0,7 0t0 273038542 protocol: TCP
salloc 34870 kilian 148u sock 0,7 0t0 273196151 protocol: TCP
salloc 34870 kilian 149u sock 0,7 0t0 273396055 protocol: TCP
salloc 34870 kilian 150u sock 0,7 0t0 273392526 protocol: TCP
salloc 34870 kilian 151u sock 0,7 0t0 273392529 protocol: TCP
salloc 34870 kilian 152u sock 0,7 0t0 273392530 protocol: TCP
salloc 34870 kilian 153u sock 0,7 0t0 273396057 protocol: TCP
salloc 34870 kilian 154u sock 0,7 0t0 273196154 protocol: TCP
salloc 34870 kilian 155u sock 0,7 0t0 273196157 protocol: TCP
login1 $

This seems to happen with any X11 program when using Slurm's built-in X11 forwarding (PrologFlags=x11).

Is this a known issue?

Thanks!
-- Kilian
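The attached script itself is not inlined above, but a minimal reproducer along these lines, assuming matplotlib with an X11-backed interactive backend (TkAgg here) and a DISPLAY forwarded by salloc --x11, might look something like this sketch (not the actual attachment 33202):

#!/usr/bin/env python3
# Hypothetical reproducer sketch: rendering a matplotlib figure through an
# X11-backed backend opens X connections, which are carried over salloc's
# forwarded X11 channel back to the submission host.
import matplotlib
matplotlib.use("TkAgg")  # assumption: an X11-capable GUI backend is installed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()   # creating the figure opens the Tk/X11 window
ax.plot([0, 1], [0, 1])
fig.canvas.draw()          # force a render, which talks to the X server
plt.close(fig)             # the X connections close when the script exits

Each run opens a handful of X connections through the forwarded channel and closes them on exit; per the lsof output above, the corresponding sockets on the salloc side are never released.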
Hi Kilian,

I've been able to reproduce this, and it looks like it may be a behavior that has been around for a while.

I can see why this is happening in the code, and I'm taking a look at some possible ways to avoid it.

Thanks!
--Tim
Hi Tim,

(In reply to Tim McMullan from comment #2)
> I've been able to reproduce this, and it looks like it may be a behavior
> that has been around for a while.
>
> I can see why this is happening in the code, and I'm taking a look at
> some possible ways to avoid it.

Glad to hear the problem can be reproduced. Thanks for looking into it!

Cheers,
-- Kilian
Hi Kilian,

I tracked down the issue here. The fix has been pushed in https://github.com/SchedMD/slurm/commit/ecfc7f6ff7 and should be included in 23.02.7!

Thanks for bringing this to our attention! I'll close this now, but let us know if you have any questions!

--Tim
(In reply to Tim McMullan from comment #10)
> Hi Kilian,
>
> I tracked down the issue here. The fix has been pushed in
> https://github.com/SchedMD/slurm/commit/ecfc7f6ff7 and should be included
> in 23.02.7!
>
> Thanks for bringing this to our attention! I'll close this now, but let us
> know if you have any questions!

Excellent, thanks Tim!

Cheers,
-- Kilian