Ticket 6030 - sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Summary: sbatch: error: Batch job submission failed: Socket timed out on send/recv ope...
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.11.9
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
 
Reported: 2018-11-13 16:56 MST by surendra
Modified: 2018-12-19 13:26 MST

Site: NREL


Attachments
sdiag (4.27 KB, text/plain)
2018-11-13 16:56 MST, surendra
Details

Description surendra 2018-11-13 16:56:45 MST
Created attachment 8299 [details]
sdiag

We are running a test harness on the system. A cleanup job is submitted immediately at the end of each job, and those cleanup job submissions are failing with:

sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

Command: sbatch --dependency=afterany:bad_submission submit_cleanup.slurm
sbatch: error: Batch job submission failed: Job dependency problem

I looked at a few other tickets opened for the same error message, and I tried making the changes below in slurm.conf, followed by an scontrol reconfigure:

SchedulerParameters=max_rpc_cnt=192  
SchedulerParameters=MessageTimeout=30 

This did not help. I have attached the sdiag output to the ticket; can you please take a look at it and suggest how to resolve this issue?
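A cleanup-submission pattern like the one described above can be sketched as follows. This is an illustrative sketch, not from the ticket: the wrapper function, script names, and retry counts are hypothetical. Note that `sbatch --parsable` prints just the job ID, and that passing anything other than a numeric job ID to `--dependency` produces the "Job dependency problem" error seen above.

```shell
#!/bin/sh
# Sketch: submit a main job, capture its ID, then submit a dependent
# cleanup job, retrying on transient failures such as
# "Socket timed out on send/recv operation".
# Script names (submit_main.slurm, submit_cleanup.slurm) are hypothetical.

submit_with_retry() {
    # $@: the full submit command; retries up to 3 times with a short backoff
    attempt=1
    while [ "$attempt" -le 3 ]; do
        if out=$("$@" 2>&1); then
            printf '%s\n' "$out"
            return 0
        fi
        echo "submit failed (attempt $attempt): $out" >&2
        attempt=$((attempt + 1))
        sleep 5
    done
    return 1
}

# Usage (hypothetical job scripts):
#   jobid=$(submit_with_retry sbatch --parsable submit_main.slurm) || exit 1
#   submit_with_retry sbatch --dependency="afterany:${jobid}" submit_cleanup.slurm
```

Without the retry and the `--parsable` job-ID capture, a failed first submission can feed an error string into `--dependency`, which is consistent with the `afterany:bad_submission` failure quoted above.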
Comment 2 Michael Hinton 2018-11-14 16:04:06 MST
Hi Surendra,

You have some errors in your slurm.conf that prevent either of those changes from being applied.

`SchedulerParameters` needs to be a single line, or else only the last line will stick. So `max_rpc_cnt` was being ignored.

Second, MessageTimeout is not a SchedulerParameters option; it is a standalone slurm.conf parameter, so it was also being ignored. It needs to be on its own line.

You’ll want to do this instead:

    SchedulerParameters=max_rpc_cnt=192
    MessageTimeout=30

Again, if you have multiple SchedulerParameters options, they need to be comma-separated on the same line. E.g.:

    SchedulerParameters=max_rpc_cnt=192,sched_min_interval=500000

As a sanity check, run `scontrol show config` after `scontrol reconfigure` to confirm that `SchedulerParameters` and `MessageTimeout` are what you expect.

Let’s first see if that solves the problem before looking at other things we can change.

Also, can you give me your slurm.conf?

If the above changes still don’t work, you can try raising MessageTimeout further; some sites go as high as 100 seconds.

One thing to note is that user `root` has a really high RPC count (581,449), and that REQUEST_NODE_INFO (sinfo) is being called quite a lot with an average latency of nearly 1 second (918,641 microseconds). The total_time metric is two orders of magnitude larger than the next RPC. It’s possible that your monitoring scripts are overloading slurmctld.
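If monitoring scripts turn out to be the source of the REQUEST_NODE_INFO load, one common mitigation (a sketch, not an official Slurm recommendation; the function name and cache path are made up) is to cache `sinfo` output briefly so repeated consumers don't each issue an RPC to slurmctld:

```shell
#!/bin/sh
# Sketch: run a command and cache its output; subsequent calls within
# the max-age window are served from the cache file instead of re-running.
cached_run() {
    # $1: cache file, $2: max age in seconds, remaining args: command to run
    file=$1; maxage=$2; shift 2
    now=$(date +%s)
    if [ -f "$file" ]; then
        # GNU stat first, BSD stat as a fallback
        mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")
        if [ $((now - mtime)) -lt "$maxage" ]; then
            cat "$file"
            return 0
        fi
    fi
    "$@" > "$file.tmp" && mv "$file.tmp" "$file"
    cat "$file"
}

# Usage (hypothetical): serve sinfo output, refreshed at most once per minute
#   cached_run /tmp/sinfo.cache 60 sinfo
```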

You can also try increasing sched_min_interval, e.g. to 500000 (0.5 seconds).
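Putting these suggestions together, the relevant slurm.conf fragment would look something like this (the MessageTimeout value is illustrative; tune it per site):

```
SchedulerParameters=max_rpc_cnt=192,sched_min_interval=500000
MessageTimeout=60
```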

For other ideas on what to optimize, look at https://slurm.schedmd.com/high_throughput.html. Bug 4304 also mentions most of what I said above.

Let me know if that helps.

Thanks,
Michael
Comment 4 Michael Hinton 2018-12-19 13:26:50 MST
Closing out ticket. Let me know if you have any other questions.

Thanks,
Michael