Ticket 3542

Summary: native slurm srun loops in retry when cookie lease is delayed beyond initial 10 second timeout
Product: Slurm    Reporter: kohnke
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---    CC: djacobsen
Version: 16.05.9
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3318
Site: NERSC

Description kohnke 2017-03-07 10:34:58 MST
On a native Slurm system using the Cray-provided ncmd cookie daemon,
if there is a greater than 10 second delay in leasing cookies
from ncmd, srun continues in this loop until receiving a ctrl-C:

srun: jobid 10: nodes(1):`nid00035', cpu counts: 16(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 10, user 10320, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name hostname, relative 65534
srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation still disabled, retrying
srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation still disabled, retrying
srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation still disabled, retrying
srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation still disabled, retrying
srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation still disabled, retrying


The slurmctld logfile has these entries:

[2017-03-07T08:40:31.058] debug:  laying out the 1 tasks on 1 hosts nid00035 dist 1
[2017-03-07T08:40:31.061] debug:  reserved ports 20006 for step 10.0
[2017-03-07T08:41:36.698] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=65642590 began=08:40:31.056
[2017-03-07T08:41:36.699] debug:  backfill: beginning
[2017-03-07T08:41:36.701] debug:  backfill: no jobs to backfill
[2017-03-07T08:41:36.702] Warning: Note very large processing time from _slurmctld_background: usec=65286041 began=08:40:31.416
[2017-03-07T08:41:36.704] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=55480347 began=08:40:41.222
[2017-03-07T08:41:36.706] error: The modification time of /var/spool/slurm/job_state moved backwards by 15 seconds
[2017-03-07T08:41:36.707] error: The clock of the file system and this computer appear to not be synchronized
[2017-03-07T08:41:36.708] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=45169014 began=08:40:51.537
[2017-03-07T08:41:36.711] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=24450987 began=08:41:12.255
[2017-03-07T08:41:36.712] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=34572681 began=08:41:02.138
[2017-03-07T08:41:36.714] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=23380641 began=08:41:13.333
[2017-03-07T08:41:36.716] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=11013883 began=08:41:25.702


There is currently a 10 second timeout within the libalpscomm_sn code
related to cookie lease and release.  I have made code changes there
to extend the connect timeout to ncmd from 10 seconds to 120 seconds
to handle delays related to Aries network throttle or quiesce activity.
By extending the connect timeout, the library is able to handle
longer connect delays and provide the requested leased cookies.
However, srun doesn't appear to be told about the receipt of the
delayed leased cookies.

The above information was gathered from a Cray internal system running
Slurm 16.05.9 and seeing greater than 10 second delays in contacting ncmd.
Comment 1 Tim Wickberg 2017-03-07 17:42:58 MST
Can you please clarify some of the background on this?

You've filed it against NERSC - if this is correct I need to add Doug to the bug, as he owns that support contract. They're also running 17.02.1 now, not the 16.05.9 reported on the bug.

It looks like this relates to bug 3318?

What value do you have set for TCPTimeout on this system? That's presumably what's triggering the srun connection timeouts shown here - NERSC has this increased to ~ 60s on Cori.
Comment 2 kohnke 2017-03-08 08:26:32 MST
This bug is a spin off of NERSC Cray bug 848563 and NERSC SchedMD
bug 3318 related to issues hit when Aries network throttle or quiesce
activity happens.  I have time on our internal system again.  I do not have
access to the NERSC system.  So, all of my comments are related to the
behavior I am seeing on our internal machine running 16.05.9 along
with my corrected libalpscomm_sn library to increase library cookie
connect timeouts from 10 seconds to 120 seconds.

Our slurm.conf file did not have TCPTimeout set, so I set that to
60 seconds.  That doesn't seem to make a difference in the communication
between srun and slurmctld when there is a delay in leasing the
cookies.

I think there is a 10 second timeout firing somewhere here, which
may be coming from the call to slurm_step_ctx_create_timeout() within
launch_common_create_job_step().

srun: debug:  _slurm_recv_timeout at 0 of 4, timeout
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying

I enabled TimeCray and Steps DebugFlags to gather more info.  Here is
an example where TCPTimeout is set to 60, and there was about a
30-35 second delay in leasing cookies.  The cookie leasing activity is
covered in the switch_p_build_jobinfo log message and shows that
took 35 seconds.  Slurmctld notes the long time taken and then reports
"Requested nodes are busy".  The delayed leased cookies are received
by slurmctld since those cookie ids are later successfully released
by slurmctld after I ctrl-C srun.


[2017-03-08T08:59:59.577] sched: _slurm_rpc_allocate_resources JobId=7 NodeList=nid00032 usec=361
[2017-03-08T08:59:59.579] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T08:59:59.580] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T08:59:59.582] step pick 1-1 nodes, avail:nid00032 idle:nid00032 picked:NONE
[2017-03-08T08:59:59.583] step picked 0 of 1 nodes
[2017-03-08T08:59:59.584] Picked nodes nid00032 when accumulating from nid00032
[2017-03-08T08:59:59.585] debug:  laying out the 1 tasks on 1 hosts nid00032 dist 1
[2017-03-08T08:59:59.587] debug:  reserved ports 20002 for step 7.0
[2017-03-08T09:00:09.768] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:00:20.126] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:00:30.822] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:00:35.468] switch_p_build_jobinfo (switch_cray.c:217) call took: usec=35879644
[2017-03-08T09:00:35.468] step alloc on job node 0 (nid00032) used 1 of 16 CPUs
[2017-03-08T09:00:35.470] _initialize_event (select_cray.c:646) call took: usec=11
[2017-03-08T09:00:35.472] _update_app (select_cray.c:819) call took: usec=1637
[2017-03-08T09:00:35.473] select_p_step_start (select_cray.c:2200) call took: usec=3212
[2017-03-08T09:00:35.475] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=35896588 began=08:59:59.579
[2017-03-08T09:00:35.477] sched: _slurm_rpc_job_step_create: StepId=7.0 nid00032 usec=35896588
[2017-03-08T09:00:35.478] debug:  backfill: beginning
[2017-03-08T09:00:35.480] debug:  backfill: no jobs to backfill
[2017-03-08T09:00:35.481] Warning: Note very large processing time from _slurmctld_background: usec=35353663 began=09:00:00.125
[2017-03-08T09:00:35.483] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:00:35.484] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=25715753 began=09:00:09.768
[2017-03-08T09:00:35.485] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:00:35.487] Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=6744188 began=09:00:28.741
[2017-03-08T09:00:35.488] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:00:35.489] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=15363021 began=09:00:20.126
[2017-03-08T09:00:35.491] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:00:35.492] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:00:35.493] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=4670986 began=09:00:30.822
[2017-03-08T09:00:35.494] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:00:35.549] Warning: Note very large processing time from _aeld_event_loop: took: usec=39736 began=09:00:35.510
[2017-03-08T09:00:36.484] debug:  sched: Running job scheduler
[2017-03-08T09:01:05.481] debug:  backfill: beginning
[2017-03-08T09:01:05.482] debug:  backfill: no jobs to backfill
[2017-03-08T09:01:09.517] debug:  Spawning ping agent for nid000[32-35,40-43]
[2017-03-08T09:01:24.386] debug:  aeld sync event
[2017-03-08T09:01:24.389] debug:  _start_session: Created aeld session fd 7
[2017-03-08T09:01:35.484] debug:  backfill: beginning
[2017-03-08T09:01:35.486] debug:  backfill: no jobs to backfill
[2017-03-08T09:01:36.543] debug:  sched: Running job scheduler
[2017-03-08T09:01:43.564] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:01:43.565] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T09:01:43.567] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:01:44.920] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:01:44.922] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:01:44.923] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:02:05.488] debug:  backfill: beginning
[2017-03-08T09:02:05.490] debug:  backfill: no jobs to backfill
[2017-03-08T09:02:36.601] debug:  sched: Running job scheduler
[2017-03-08T09:02:52.994] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:02:52.995] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T09:02:52.996] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:02:53.493] debug:  backfill: beginning
[2017-03-08T09:02:53.495] debug:  backfill: no jobs to backfill
[2017-03-08T09:02:55.700] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:02:55.702] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:02:55.704] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:03:23.497] debug:  backfill: beginning
[2017-03-08T09:03:23.498] debug:  backfill: no jobs to backfill
[2017-03-08T09:03:36.656] debug:  sched: Running job scheduler
[2017-03-08T09:04:03.775] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:04:03.775] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=1
[2017-03-08T09:04:03.777] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:04:04.501] debug:  backfill: beginning
[2017-03-08T09:04:04.503] debug:  backfill: no jobs to backfill
[2017-03-08T09:04:08.780] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320
[2017-03-08T09:04:08.781] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-08T09:04:08.783] _slurm_rpc_job_step_create for job 7: Requested nodes are busy
[2017-03-08T09:04:26.705] debug:  Updating partition uid access list
[2017-03-08T09:04:26.705] debug:  cray: select_p_state_save
[2017-03-08T09:04:29.752] debug:  Spawning ping agent for nid000[32-35,40-43]
[2017-03-08T09:04:34.505] debug:  backfill: beginning
[2017-03-08T09:04:34.505] debug:  backfill: no jobs to backfill
[2017-03-08T09:04:36.761] debug:  sched: Running job scheduler
[2017-03-08T09:05:04.507] debug:  backfill: beginning
[2017-03-08T09:05:04.507] debug:  backfill: no jobs to backfill
[2017-03-08T09:05:16.185] job_complete: JobID=7 State=0x1 NodeCnt=1 WTERMSIG 1
[2017-03-08T09:05:16.185] job_complete: JobID=7 State=0x8003 NodeCnt=1 done
[2017-03-08T09:05:16.188] _initialize_event (select_cray.c:646) call took: usec=10
[2017-03-08T09:05:16.191] _update_app (select_cray.c:819) call took: usec=2234
[2017-03-08T09:05:16.192] select_p_step_finish (select_cray.c:2276) call took: usec=3601
[2017-03-08T09:05:16.203] debug:  sched: Running job scheduler
[2017-03-08T09:05:16.308] Warning: Note very large processing time from _aeld_event_loop: took: usec=40278 began=09:05:16.268
[2017-03-08T09:05:17.165] _run_nhc jobid 7 and apid 7 completion took: usec=971845
[2017-03-08T09:05:17.167] step dealloc on job node 0 (nid00032) used: 0 of 16 CPUs
[2017-03-08T09:05:17.172] switch_p_job_step_complete (switch_cray.c:934) call took: usec=4177
[2017-03-08T09:05:17.174] switch_p_free_jobinfo (switch_cray.c:273) call took: usec=3
[2017-03-08T09:05:17.175] debug:  freed ports 20002 for step 7.0
[2017-03-08T09:05:17.827] _run_nhc jobid 7 and apid 0 completion took: usec=639365
Comment 3 Tim Wickberg 2017-03-08 08:46:23 MST
Looking over the code, the timeout fed into slurm_step_ctx_create_timeout should always be at least 60 seconds, unless you're using the -I flag:

if (opt.immediate) {
	step_wait = MAX(1, opt.immediate -
			   difftime(time(NULL),
				    srun_begin_time)) *
		    1000;
} else {
	slurmctld_timeout = MIN(300, MAX(60,
		slurm_get_slurmctld_timeout()));
	step_wait = ((getpid() % 10) +
		     slurmctld_timeout) * 1000;
}
job->step_ctx = slurm_step_ctx_create_timeout(&job->ctx_params, step_wait);

So I think a different timeout is factoring in here. Is MessageTimeout set? I believe that defaults to 10 seconds. Some of the other settings NERSC has running that may be relevant:

BatchStartTimeout=60
MessageTimeout=60 ## default is 10, units in seconds
SlurmctldTimeout=120  ## failover time
SlurmdTimeout=300 ## timeout before node marked down
Comment 4 kohnke 2017-03-08 09:27:52 MST
I am not explicitly using an srun -I flag, but I am invoking srun
outside of a batch job or salloc.  I'll try an salloc and see what
happens.

With an salloc session and MessageTimeout=60 set, srun and slurmctld
can handle a delayed cookie lease of at least 60 seconds.
That's good news.  I'm losing my dedicated time.  I will try
this tomorrow and see if delays between 60 and 120 seconds are
handled correctly.  Thanks for suggesting the MessageTimeout
setting.
Comment 5 kohnke 2017-03-10 10:52:10 MST
My testing last night showed this behavior with MessageTimeout=60,
with my modified libalpscomm_sn default 120 second timeout, within an
salloc session, and manually making ncmd unavailable/available for
varying amounts of time.  I also had some DebugFlags set.

1)  ncmd unavailable for 80 seconds; srun periodically makes an RPC
call, causing slurmctld to try to lease cookies; the cookie lease succeeds
after that 80 seconds; srun receives the cookies and completes execution

kohnke@opal-p2:/home/users/kohnke> srun -n 1 hostname
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: Job step created
nid00032
srun: switch_p_free_jobinfo (switch_cray.c:273) call took: usec=1


2)  ncmd unavailable for 100 seconds; like #1, srun receives cookies
and completes execution

kohnke@opal-p2:/home/users/kohnke> srun -n 1 hostname
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: Job step created
nid00032
srun: switch_p_free_jobinfo (switch_cray.c:273) call took: usec=1


3) ncmd unavailable for 130 seconds, slurmctld receives error return
from libalpscomm cookie lease call after 120 seconds; slurmctld later
tries another cookie lease for this same step and is successful; however,
srun doesn't appear to be told about the cookies and continues making
RPC calls until I did a ctrl-C on srun

kohnke@opal-p2:/home/users/kohnke> srun -n 1 hostname
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: Job step creation temporarily disabled, retrying
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create job step: Job/step already completing or completed

Here is an srun backtrace where srun did not receive the cookies and
continued for several minutes before I did a ctrl-C:

(gdb) bt
#0  0x00007f188eb79bfd in poll () from /lib64/libc.so.6
#1  0x000000000057b598 in slurm_step_ctx_create_timeout (step_params=0x813e30,
    timeout=60000) at step_ctx.c:266
#2  0x000000000058733c in launch_common_create_job_step (job=0x813d50,
    use_all_cpus=true, signal_function=0x591d48 <_signal_while_allocating>,
    destroy_job=0x7f1c5c <destroy_job>) at launch.c:319
#3  0x00007f188aa66b83 in launch_p_create_job_step (job=0x813d50,
    use_all_cpus=true, signal_function=0x591d48 <_signal_while_allocating>,
    destroy_job=0x7f1c5c <destroy_job>) at launch_slurm.c:493
#4  0x00000000005879ca in launch_g_create_job_step (job=0x813d50,
    use_all_cpus=true, signal_function=0x591d48 <_signal_while_allocating>,
    destroy_job=0x7f1c5c <destroy_job>) at launch.c:504
#5  0x0000000000593802 in create_job_step (job=0x813d50, use_all_cpus=true)
    at allocate.c:897
#6  0x000000000058fd4f in create_srun_job (p_job=0x7ecd00 <job>,
    got_alloc=0x7fff906aebdf, slurm_started=false, handle_signals=true)
    at srun_job.c:632
#7  0x000000000042cd6c in srun (ac=4, av=0x7fff906aedc8) at srun.c:194
#8  0x000000000042dc21 in main (argc=4, argv=0x7fff906aedc8)
    at srun.wrapper.c:17


Here are some slurmctld log entries related to #3 above, where there is an
error return from the 120 second timeout in libalpscomm.  I only had one
step at a time executing.  It looks like the step id changed from 7.0
to 7.1.

[2017-03-09T18:51:25.656] switch_p_build_jobinfo (switch_cray.c:217) call took: usec=120000707
[2017-03-09T18:51:25.658] error: (cookies.c: 236: release_cookies) alpsc_release_cookies failed: src/lib/alpscomm_sn/cookie.c:617 Invalid value for cookie_ids
[2017-03-09T18:51:25.660] switch_p_free_jobinfo (switch_cray.c:273) call took: usec=1
[2017-03-09T18:51:25.661] debug:  freed ports 20012 for step 7.0
[2017-03-09T18:51:25.663] Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=120017772 began=18:49:25.645
[2017-03-09T18:51:25.664] _slurm_rpc_job_step_create for job 7: Error configuring interconnect
[2017-03-09T18:51:25.666] Warning: Note very large processing time from _slurmctld_background: usec=119931355 began=18:49:25.733
[2017-03-09T18:51:25.667] select_p_step_pick_nodes (select_cray.c:2153) call took: usec=0
[2017-03-09T18:51:25.668] step pick 1-1 nodes, avail:nid00032 idle:nid00032 picked:NONE
[2017-03-09T18:51:25.670] step picked 0 of 1 nodes
[2017-03-09T18:51:25.671] Picked nodes nid00032 when accumulating from nid00032
[2017-03-09T18:51:25.672] debug:  laying out the 1 tasks on 1 hosts nid00032 dist 1
[2017-03-09T18:51:25.674] debug:  reserved ports 20013 for step 7.1
[2017-03-09T18:51:26.008] Processing RPC: REQUEST_JOB_STEP_CREATE from uid=10320


Even with a MessageTimeout of 60 seconds, as long as a cookie lease can
complete before the libalpscomm 120 second timeout expires, the cookies are
successfully leased by slurmctld and provided to srun.  However, it
appears that if more than 120 seconds elapse and an error is returned
to slurmctld, slurmctld can later successfully get leased cookies, but srun
is never told about those cookies and continues to wait, periodically making
an RPC call, until srun is interrupted with a ctrl-C.
Comment 6 Tim Wickberg 2017-05-03 12:58:12 MDT
Adding Doug, as I'd previously noted is required on NERSC issues.

All of that diagnostic info is good to have, but I think this all needs to be addressed as part of our response to bug 3318. I don't think that chasing all the possible timeout mismatches between the Cray APIs and slurmd/slurmctld is a good use of our time at present.

- Tim
Comment 7 Tim Wickberg 2017-06-26 07:22:45 MDT
I'm going to go ahead and close this out. Bug 3318 continues to track issues around Slurm communication on Aries interconnects, and I believe solutions for that should address this longer term.

- Tim