| Summary: | native slurm srun loops in retry when cookie lease is delayed beyond initial 10 second timeout | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | kohnke |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 16.05.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3318 | ||
| Site: | NERSC | Slinky Site: | --- |
Description

kohnke 2017-03-07 10:34:58 MST
Can you please clarify some of the background on this? You've filed it against NERSC; if this is correct I need to add Doug to the bug, as he owns that support contract. They're also running 17.02.1 now, not the 16.05.9 marked on the bug. It looks like this relates to bug 3318? What value do you have set for TCPTimeout on this system? That's presumably what's triggering the srun connection timeouts shown here; NERSC has this increased to ~60 s on Cori.

This bug is a spin-off of NERSC Cray bug 848563 and NERSC SchedMD bug 3318, related to issues hit when Aries network throttle or quiesce activity happens. I have time on our internal system again; I do not have access to the NERSC system. So all of my comments describe the behavior I am seeing on our internal machine running 16.05.9, along with my corrected libalpscomm_sn library, which increases the library's cookie connect timeout from 10 seconds to 120 seconds.

Our slurm.conf file did not have TCPTimeout set, so I set it to 60 seconds. That does not seem to make a difference in the communication between srun and slurmctld when there is a delay in leasing the cookies. I think a 10 second timeout is firing somewhere here, which may be coming from the call to slurm_step_ctx_create_timeout() within launch_common_create_job_step():

srun: debug: _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying

I enabled the TimeCray and Steps DebugFlags to gather more info. Here is an example where TCPTimeout is set to 60 and there was about a 30-35 second delay in leasing cookies. The cookie leasing activity is covered by the switch_p_build_jobinfo log message, which shows it took 35 seconds. Slurmctld notes the long time taken and then reports "Requested nodes are busy".
The delayed leased cookies are received by slurmctld, since those cookie ids are later successfully released by slurmctld after I ctrl-C srun.

[2017-03-08T08:59:59.577] sched: _slurm_rpc_allocate_resources JobId=7 NodeList=nid00032 usec=361
[2017-03-08T08:59:59.579] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T08:59:59.580] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T08:59:59.582] step pick 1-1 nodes, avail:nid00032 idle:nid00032 picked:NONE
[2017-03-08T08:59:59.583] step picked 0 of 1 nodes
[2017-03-08T08:59:59.584] Picked nodes nid00032 when accumulating from nid00032
[2017-03-08T08:59:59.585] debug: laying out the 1 tasks on 1 hosts nid00032 dist 1
[2017-03-08T08:59:59.587] debug: reserved ports 20002 for step 7.0
[2017-03-08T09:00:09.768] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:00:20.126] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:00:30.822] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:00:35.468] switch_p_build_jobinfo (switch_cray.c:217) call took: usec=35879644
[2017-03-08T09:00:35.468] step alloc on job node 0 (nid00032) used 1 of 16 CPUs
[2017-03-08T09:00:35.470] _initialize_event (select_cray.c:646) call took: usec=11
[2017-03-08T09:00:35.472] _update_app (select_cray.c:819) call took: usec=1637
[2017-03-08T09:00:35.473] select_p_step_start (select_cray.c:2200) call took: usec=3212
[2017-03-08T09:00:35.475] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=35896588 began=08:59:59.579
[2017-03-08T09:00:35.477] sched: _slurm_rpc_job_step_create: StepId=7.0 nid00032 usec=35896588
[2017-03-08T09:00:35.478] debug: backfill: beginning
[2017-03-08T09:00:35.480] debug: backfill: no jobs to backfill
[2017-03-08T09:00:35.481] Warning: Note very large processing time from _slurmctld_background: usec=35353663 began=09:00:00.125
[2017-03-08T09:00:35.483] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:00:35.484] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=25715753 began=09:00:09.768
[2017-03-08T09:00:35.485] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:00:35.487] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6744188 began=09:00:28.741
[2017-03-08T09:00:35.488] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:00:35.489] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=15363021 began=09:00:20.126
[2017-03-08T09:00:35.491] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:00:35.492] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:00:35.493] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=4670986 began=09:00:30.822
[2017-03-08T09:00:35.494] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:00:35.549] Warning: Note very large processing time from _aeld_event_loop: took: usec=39736 began=09:00:35.510
[2017-03-08T09:00:36.484] debug: sched: Running job scheduler
[2017-03-08T09:01:05.481] debug: backfill: beginning
[2017-03-08T09:01:05.482] debug: backfill: no jobs to backfill
[2017-03-08T09:01:09.517] debug: Spawning ping agent for nid000[32-35,40-43]
[2017-03-08T09:01:24.386] debug: aeld sync event
[2017-03-08T09:01:24.389] debug: _start_session: Created aeld session fd 7
[2017-03-08T09:01:35.484] debug: backfill: beginning
[2017-03-08T09:01:35.486] debug: backfill: no jobs to backfill
[2017-03-08T09:01:36.543] debug: sched: Running job scheduler
[2017-03-08T09:01:43.564] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:01:43.565] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T09:01:43.567] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:01:44.920] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:01:44.922] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:01:44.923] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:02:05.488] debug: backfill: beginning
[2017-03-08T09:02:05.490] debug: backfill: no jobs to backfill
[2017-03-08T09:02:36.601] debug: sched: Running job scheduler
[2017-03-08T09:02:52.994] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:02:52.995] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T09:02:52.996] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:02:53.493] debug: backfill: beginning
[2017-03-08T09:02:53.495] debug: backfill: no jobs to backfill
[2017-03-08T09:02:55.700] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:02:55.702] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:02:55.704] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:03:23.497] debug: backfill: beginning
[2017-03-08T09:03:23.498] debug: backfill: no jobs to backfill
[2017-03-08T09:03:36.656] debug: sched: Running job scheduler
[2017-03-08T09:04:03.775] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:04:03.775] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T09:04:03.777] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:04:04.501] debug: backfill: beginning
[2017-03-08T09:04:04.503] debug: backfill: no jobs to backfill
[2017-03-08T09:04:08.780] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:04:08.781] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:04:08.783] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:04:26.705] debug: Updating partition uid access list
[2017-03-08T09:04:26.705] debug: cray: select_p_state_save
[2017-03-08T09:04:29.752] debug: Spawning ping agent for nid000[32-35,40-43]
[2017-03-08T09:04:34.505] debug: backfill: beginning
[2017-03-08T09:04:34.505] debug: backfill: no jobs to backfill
[2017-03-08T09:04:36.761] debug: sched: Running job scheduler
[2017-03-08T09:05:04.507] debug: backfill: beginning
[2017-03-08T09:05:04.507] debug: backfill: no jobs to backfill
[2017-03-08T09:05:16.185] job_complete: JobID=7 State=0x1 NodeCnt=1 WTERMSIG 1
[2017-03-08T09:05:16.185] job_complete: JobID=7 State=0x8003 NodeCnt=1 done
[2017-03-08T09:05:16.188] _initialize_event (select_cray.c:646) call took: usec=10
[2017-03-08T09:05:16.191] _update_app (select_cray.c:819) call took: usec=2234
[2017-03-08T09:05:16.192] select_p_step_finish (select_cray.c:2276) call took: usec=3601
[2017-03-08T09:05:16.203] debug: sched: Running job scheduler
[2017-03-08T09:05:16.308] Warning: Note very large processing time from _aeld_event_loop: took: usec=40278 began=09:05:16.268
[2017-03-08T09:05:17.165] _run_nhc jobid 7 and apid 7 completion took: usec=971845
[2017-03-08T09:05:17.167] step dealloc on job node 0 (nid00032) used: 0 of 16 CPUs
[2017-03-08T09:05:17.172] switch_p_job_step_complete (switch_cray.c:934) call took: usec=4177
[2017-03-08T09:05:17.174] switch_p_free_jobinfo (switch_cray.c:273) call took: usec=3
[2017-03-08T09:05:17.175] debug: freed ports 20002 for step 7.0
[2017-03-08T09:05:17.827] _run_nhc jobid 7 and apid 0 completion took: usec=639365

Looking over the code, the timeout fed into slurm_step_ctx_create_timeout should always be at least 60 seconds, unless you're using the -I flag:
    if (opt.immediate) {
        step_wait = MAX(1, opt.immediate -
                           difftime(time(NULL), srun_begin_time)) * 1000;
    } else {
        slurmctld_timeout = MIN(300, MAX(60,
                                         slurm_get_slurmctld_timeout()));
        step_wait = ((getpid() % 10) + slurmctld_timeout) * 1000;
    }
    job->step_ctx = slurm_step_ctx_create_timeout(&job->ctx_params, step_wait);
So I think a different timeout is factoring in here. Is MessageTimeout set? I believe that defaults to 10 seconds. Some of the other settings NERSC is running that may be relevant:
BatchStartTimeout=60
MessageTimeout=60 ## default is 10, units in seconds
SlurmctldTimeout=120 ## failover time
SlurmdTimeout=300 ## timeout before node marked down
I am not explicitly using the srun -I flag, but I am invoking srun outside of a batch job or salloc. I'll try an salloc and see what happens.

With the salloc and MessageTimeout=60, srun and slurmctld can handle a cookie lease delayed by at least 60 seconds. That's good news. I'm losing my dedicated time; I will try this tomorrow and see whether delays between 60 and 120 seconds are handled correctly. Thanks for suggesting the MessageTimeout setting.

My testing last night showed the following behavior with MessageTimeout=60, with my modified libalpscomm_sn default 120 second timeout, within an salloc session, and manually making ncmd unavailable/available for varying amounts of time. I also had some DebugFlags set.
1) ncmd unavailable for 80 seconds: srun periodically makes an RPC call, causing slurmctld to try to lease cookies; the cookie lease succeeds after the 80 seconds, and srun receives the cookies and completes execution.
kohnke@opal-p2:/home/users/kohnke> srun -n 1 hostname
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: Job step created
nid00032
srun: switch_p_free_jobinfo (switch_cray.c:273) call took: usec=1
2) ncmd unavailable for 100 seconds: as in #1, srun receives the cookies and completes execution.
kohnke@opal-p2:/home/users/kohnke> srun -n 1 hostname
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: Job step created
nid00032
srun: switch_p_free_jobinfo (switch_cray.c:273) call took: usec=1
3) ncmd unavailable for 130 seconds: slurmctld receives an error return from the libalpscomm cookie lease call after 120 seconds. Slurmctld later tries another cookie lease for this same step and succeeds; however, srun doesn't appear to be told about the cookies, and it continues making RPC calls until I ctrl-C srun.
kohnke@opal-p2:/home/users/kohnke> srun -n 1 hostname
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create job step: Job/step already completing or completed
Here is an srun backtrace where srun did not receive the cookies and
continued for several minutes before I did a ctrl-C:
(gdb) bt
#0 0x00007f188eb79bfd in poll () from /lib64/libc.so.6
#1 0x000000000057b598 in slurm_step_ctx_create_timeout (step_params=0x813e30,
timeout=60000) at step_ctx.c:266
#2 0x000000000058733c in launch_common_create_job_step (job=0x813d50,
use_all_cpus=true, signal_function=0x591d48 <_signal_while_allocating>,
destroy_job=0x7f1c5c <destroy_job>) at launch.c:319
#3 0x00007f188aa66b83 in launch_p_create_job_step (job=0x813d50,
use_all_cpus=true, signal_function=0x591d48 <_signal_while_allocating>,
destroy_job=0x7f1c5c <destroy_job>) at launch_slurm.c:493
#4 0x00000000005879ca in launch_g_create_job_step (job=0x813d50,
use_all_cpus=true, signal_function=0x591d48 <_signal_while_allocating>,
destroy_job=0x7f1c5c <destroy_job>) at launch.c:504
#5 0x0000000000593802 in create_job_step (job=0x813d50, use_all_cpus=true)
at allocate.c:897
#6 0x000000000058fd4f in create_srun_job (p_job=0x7ecd00 <job>,
got_alloc=0x7fff906aebdf, slurm_started=false, handle_signals=true)
at srun_job.c:632
#7 0x000000000042cd6c in srun (ac=4, av=0x7fff906aedc8) at srun.c:194
#8 0x000000000042dc21 in main (argc=4, argv=0x7fff906aedc8)
at srun.wrapper.c:17
Here are some slurmctld log entries related to #3 above, where there is an
error return from the 120 second timeout in libalpscomm. I only had one
step at a time executing. It looks like the step id changed from 7.0
to 7.1.
[2017-03-09T18:51:25.656] switch_p_build_jobinfo (switch_cray.c:217) call took: usec=120000707
[2017-03-09T18:51:25.658] error: (cookies.c: 236: release_cookies) alpsc_release_cookies failed: src/lib/alpscomm_sn/cookie.c:617 Invalid value for cookie_ids
[2017-03-09T18:51:25.660] switch_p_free_jobinfo (switch_cray.c:273) call took: usec=1
[2017-03-09T18:51:25.661] debug: freed ports 20012 for step 7.0
[2017-03-09T18:51:25.663] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=120017772 began=18:49:25.645
[2017-03-09T18:51:25.664] _slurm_rpc_job_step_create for job 7: Error configuring interconnect
[2017-03-09T18:51:25.666] Warning: Note very large processing time from _slurmctld_background: usec=119931355 began=18:49:25.733
[2017-03-09T18:51:25.667] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-09T18:51:25.668] step pick 1-1 nodes, avail:nid00032 idle:nid00032 picked:NONE
[2017-03-09T18:51:25.670] step picked 0 of 1 nodes
[2017-03-09T18:51:25.671] Picked nodes nid00032 when accumulating from nid00032
[2017-03-09T18:51:25.672] debug: laying out the 1 tasks on 1 hosts nid00032 dist 1
[2017-03-09T18:51:25.674] debug: reserved ports 20013 for step 7.1
[2017-03-09T18:51:26.008] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
Even with a MessageTimeout of 60 seconds, as long as the cookie lease
completes before the libalpscomm 120 second timeout expires, the cookies are
successfully leased by slurmctld and provided to srun. However, if more than
120 seconds elapse and an error is returned to slurmctld, slurmctld can later
lease the cookies successfully, but srun is never told about them and
continues to wait, periodically making an RPC call, until srun is
interrupted with ctrl-C.
Adding Doug, as I'd previously noted is required on NERSC issues. All of this diagnostic info is good to have, but I think it needs to be addressed as part of our response to bug 3318. I don't think chasing all the possible timeout mismatches between the Cray APIs and slurmd/slurmctld is a good use of our time at present. - Tim