Created attachment 8299 [details]
sdiag

We are currently running testharness on the system. We have a cleanup job that gets submitted immediately at the end of a job, and those cleanup job submissions are failing with:

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

Command: sbatch --dependency=afterany:bad_submission submit_cleanup.slurm
sbatch: error: Batch job submission failed: Job dependency problem

I looked at a few other issues that were opened for the same error message, and I tried changing the below in slurm.conf and ran scontrol reconfigure:

SchedulerParameters=max_rpc_cnt=192
SchedulerParameters=MessageTimeout=30

This did not help. I have attached the sdiag output to the ticket; can you please take a look at it and provide us with some suggestions to resolve this issue?
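While the server-side tuning below is sorted out, one stopgap for transient "Socket timed out" failures is to retry the submission a few times with a short delay. This is only a sketch: the `retry` helper and the `flaky_submit` stub are illustrative (in practice the callable would invoke `sbatch` via `subprocess.run`), and the stub's output string and failure count are made up for the example.

```python
# Hedged sketch: retry a submission on transient failure while the
# underlying slurmctld overload is investigated. flaky_submit is a stub
# standing in for a real sbatch call so the example is self-contained.
import time

def retry(fn, attempts=5, delay=0.0):
    """Call fn until it succeeds or attempts are exhausted."""
    for i in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

calls = {"n": 0}

def flaky_submit():
    # Stub for sbatch: fail the first two attempts, then succeed.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Socket timed out on send/recv operation")
    return "submitted"

print(retry(flaky_submit))  # prints "submitted" on the third attempt
```

Note this only papers over the symptom; if slurmctld is genuinely overloaded, the RPC backlog still needs to be addressed.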
Hi Surendra,

You have some errors in your slurm.conf that mean neither of those changes is being applied. First, SchedulerParameters needs to be a single line; if it appears on multiple lines, only the last line will stick, so max_rpc_cnt was being ignored. Second, MessageTimeout is not a SchedulerParameter; it is a standalone parameter that needs to be on its own line, so it was also being ignored. You'll want this instead:

SchedulerParameters=max_rpc_cnt=192
MessageTimeout=30

Again, if you have multiple SchedulerParameters, they need to be comma-separated on the same line, e.g.:

SchedulerParameters=max_rpc_cnt=192,sched_min_interval=500000

As a sanity check, run `scontrol show config` after `scontrol reconfigure` to confirm that `SchedulerParameters` and `MessageTimeout` are what you expect. Let's first see whether that solves the problem before looking at other things we can change. Also, can you send me your slurm.conf?

If the above changes still don't work, you can try increasing MessageTimeout to even higher values; some sites increase it to as much as 100. One thing to note is that user root has a very high RPC count (581,449), and that REQUEST_NODE_INFO (sinfo) is being called quite a lot with an average latency of nearly 1 second (918,641 microseconds); its total_time metric is two orders of magnitude larger than the next RPC's. It's possible that your monitoring scripts are overloading slurmctld. You can also try increasing sched_min_interval, e.g. to 500000 (0.5 seconds).

For other ideas on what to optimize, see https://slurm.schedmd.com/high_throughput.html. Bug 4304 also covers most of what I said above.

Let me know if that helps.

Thanks,
Michael
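The duplicate-line pitfall above (only the last `SchedulerParameters` line sticking) is easy to catch mechanically. A quick sketch of such a check, assuming a simple key=value slurm.conf layout; the function name and the sample text are illustrative, not part of any Slurm tooling:

```python
# Hedged sketch: flag config keys that appear on more than one line of a
# slurm.conf-style file, since for a key like SchedulerParameters only the
# last occurrence takes effect and the earlier ones are silently ignored.
from collections import Counter

def duplicate_keys(conf_text):
    """Return keys defined on more than one non-comment line."""
    keys = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.append(line.split("=", 1)[0].strip().lower())
    return sorted(k for k, n in Counter(keys).items() if n > 1)

sample = """
SchedulerParameters=max_rpc_cnt=192
SchedulerParameters=MessageTimeout=30
"""
print(duplicate_keys(sample))  # ['schedulerparameters']
```

Running this over the misconfigured snippet from the ticket flags `schedulerparameters`, which is exactly the line that needs to be merged into one comma-separated entry.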
Closing out the ticket. Let me know if you have any other questions.

Thanks,
Michael