| Summary: | [825617] - Slurmd slow to CONFIRM ALPS reservation resulting in the Slurm Controller killing jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Coverston <jason.coverston> |
| Component: | Cray ALPS | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian.gilmer, brian, da |
| Version: | 14.11.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 14.11.7, 15.08.0-0pre5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Jason Coverston
2015-04-27 05:31:28 MDT
Jason, commits 26624602504, 97aedaf890f, 01da71b836f, and 5ab69ccb6de should address this issue. Or you can just do a git pull on the 14.11 branch. I think this, along with the commits mentioned in bug 1623, will fix all the KAUST problems other than the speed of job launch, which is due to serializing the reservation requests to ALPS. Unless needed, that will most likely remain the way it is. Please report back on your findings.

I will mention that sched_min_interval=1000000 should be added to SchedulerParameters in your slurm.conf. The time is in microseconds, so the value listed here is 1 second. You can use larger numbers as well if you would like, but this is what I tested with and it appeared to work fine.

Hi Danny,

The patches look very good!

We tested with sched_min_interval=1000000 today on the system. Everything went well until we tried 3 x 10,000 array element jobs. The 3 x 10,000 array element job locked up the controller daemon with this setting, and we saw the "unbound" reservations appear in apstat. Jobs dropped to zero eventually.

We killed off everything and bumped it to sched_min_interval=2000000. The controller daemon eventually locked up again.

We had to end our test session after this, unfortunately, and could not test with a higher setting. If you have any input on what a good setting would be, please advise.

Thanks!!

Jason

Thanks for the feedback, Jason. If this happens again, could you please try to attach to the slurmctld and a slurmd with gdb and send the output of `thread apply all bt`?

Hi Danny,

Will do. What is the impact of setting this sched_min_interval parameter to 2000000, or perhaps to a higher value such as 3000000 or 4000000? Would a higher value be expected to avoid the problems seen with the 3 x 10,000 array element job and allow that test to complete? Also, could the job launch rate be affected the higher we go with the interval setting?

All that parameter means is that there is more time between scheduler runs. At worst it will only delay scheduling a chunk of jobs by that time period. I would have expected 1000000 to solve the problem; since this happens with both 1000000 and 2000000, it really points to something else unknown at the moment. Can you test on Crystal and see if you can reproduce? I am hoping that if we get a backtrace while it is happening, we will be able to see the bottleneck. Since scheduling jobs is linked to job launch rate, it would affect launch rate, but not dramatically unless the number was really high (in the tens of seconds).

Jason, would you be able to get some logs from the slurmctld and slurmd while this issue was happening?

Jason, could you also attach the slurm.conf from the system?

Looks like the current state is acceptable. Please reopen if needed.
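The sched_min_interval recommendation above would look roughly like this in slurm.conf. This is a sketch: sched_min_interval is the parameter named in the thread, while the surrounding lines (SchedulerType and other SchedulerParameters options) are illustrative assumptions, not taken from this site's configuration.

```
# slurm.conf (fragment) -- illustrative sketch
# sched_min_interval is in microseconds: 1000000 = 1 second minimum
# between main-scheduler runs, throttling how often slurmctld walks
# the job queue (useful when each job launch is slow, as with
# serialized ALPS reservation requests).
SchedulerType=sched/backfill
SchedulerParameters=sched_min_interval=1000000
```

Raising the value (e.g. 2000000 for 2 seconds) further spaces out scheduler passes; as noted above, this mainly delays scheduling a batch of jobs by that interval rather than changing per-job behavior.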
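The backtrace requested above can be captured non-interactively with gdb in batch mode. A sketch, assuming gdb is installed on the node and the daemons are running under their usual names (output file names are arbitrary):

```
# Attach to the running slurmctld, dump backtraces for every thread,
# then detach; -batch exits gdb automatically afterward.
gdb -p "$(pidof slurmctld)" -batch -ex 'thread apply all bt' > slurmctld-bt.txt

# Same for a slurmd on an affected node:
gdb -p "$(pidof slurmd)" -batch -ex 'thread apply all bt' > slurmd-bt.txt
```

Capturing this while the controller is locked up is what makes the bottleneck visible; the daemon is only paused for the moment gdb is attached.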