Ticket 484

Summary:	Application NHC not run on srun without prior allocation
Product:	Slurm	Reporter:	David Gloe <david.gloe>
Component:	Cray ALPS	Assignee:	Danny Auble <da>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	da
Version:	14.03.x
Hardware:	Linux
OS:	Linux
Site:	CRAY	Slinky Site:	---

Description David Gloe 2013-10-24 08:50:41 MDT

When first creating an allocation with salloc and then running an srun, both application and reservation NHC are run:
opal-p2:/home/users/c16817 # salloc -n 1
salloc: Granted job allocation 66315
opal-p2:/home/users/c16817 # srun -n 1 hostname
nid00024
opal-p2:/home/users/c16817 # scancel 66315
opal-p2:/home/users/c16817 # salloc: Job allocation 66315 has been revoked.

[2013-10-24T15:39:55.170] sched: _slurm_rpc_allocate_resources JobId=66315 NodeList=nid00024 usec=567
[2013-10-24T15:40:01.812] sched: _slurm_rpc_job_step_create: StepId=66315.0 nid00024 usec=4872
[2013-10-24T15:40:02.527] sched: _slurm_rpc_step_complete StepId=66315.0 usec=4114
[2013-10-24T15:40:02.527] Calling NHC for jobid 66315 and apid 66315 on nodes nid00024(24)
[2013-10-24T15:40:02.854] _run_nhc jobid 66315 and apid 66315 completed took: usec=326722
[2013-10-24T15:40:15.378] build_cg_bitmap: JOB_COMPLETING cleaned state 0x8004
[2013-10-24T15:40:15.379] sched: Cancel of JobId=66315 by UID=0, usec=919
[2013-10-24T15:40:15.380] Calling NHC for jobid 66315 and apid 0 on nodes nid00024(24)
[2013-10-24T15:40:15.705] _run_nhc jobid 66315 and apid 0 completed took: usec=325746

However, when just doing an srun without a prior allocation, no application NHC is run:
opal-p2:/home/users/c16817 # srun -n 1 hostname
nid00024

[2013-10-24T15:37:07.034] sched: _slurm_rpc_allocate_resources JobId=66314 NodeList=nid00024 usec=584
[2013-10-24T15:37:07.042] sched: _slurm_rpc_job_step_create: StepId=66314.0 nid00024 usec=4998
[2013-10-24T15:37:07.386] completing job 66314 status 0
[2013-10-24T15:37:07.386] build_cg_bitmap: JOB_COMPLETING cleaned state 0x8003
[2013-10-24T15:37:07.388] sched: job_complete for JobId=66314 successful, exit code=0
[2013-10-24T15:37:07.389] Calling NHC for jobid 66314 and apid 0 on nodes nid00024(24)
[2013-10-24T15:37:07.715] _run_nhc jobid 66314 and apid 0 completed took: usec=326848
[2013-10-24T15:37:07.739] sched: _slurm_rpc_step_complete StepId=66314.0 usec=5114

It looks like, on a bare srun, the job complete message somehow reaches slurmctld before the step complete message, so it never runs NHC for the step.

Comment 1 Danny Auble 2013-10-24 08:55:12 MDT

I specifically coded it this way.  I didn't think it was needed.  Look at src/plugins/select/cray/select_cray.c select_p_step_finish().

If an extra NHC is really needed just removing the "else if" will make it run.

When I originally coded this it seemed like overkill to have two NHCs running on the same resources.  Perhaps my understanding of exactly what NHC does is flawed.  If it really is needed we can just remove the else if.

Let me know.

Comment 2 David Gloe 2013-10-24 09:15:39 MDT

I just talked with an NHC developer and he cleared a couple things up.

First, we definitely do want to call both reservation and application cleanup in this case, since the user can specify different tests in each mode. So we'll have some tests which are never run if we don't run application NHC.

Second, there could be an issue if the reservation cleanup is started before the application has exited, but it should be OK if the reservation cleanup starts before application cleanup. Is the job_complete message only sent after the application has completed (I assume this is the case, but just making sure)?

In short, I think the else if you mentioned should be removed, and we should always run NHC on job step completion in addition to job completion.

However, I'm slightly curious as to why I had slurmctld debug set to 3 but I didn't see the debug message from that else if:
                debug3("step completion %u.%u was received after job "
                      "allocation is already completing, no extra NHC needed.",
                      step_ptr->job_ptr->job_id, step_ptr->step_id);
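
For anyone reading without the source open, the branch being discussed can be modelled roughly as below. This is not the code in select_cray.c: the struct, the enum and both function names are hypothetical, and the only thing taken from this ticket is the rule itself (skip the application NHC when the step completion arrives after the job is already completing, versus always running it).

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-job state; not a real Slurm type. */
struct job_model {
    unsigned job_id;
    bool completing;    /* job allocation is already JOB_COMPLETING */
};

enum nhc_action { NHC_SKIP, NHC_RUN_APPLICATION };

/* Behaviour as described in Comment 1: a step completion that arrives
 * after the job started completing does not trigger application NHC. */
enum nhc_action step_finish_with_else_if(const struct job_model *job)
{
    if (job->completing)
        return NHC_SKIP;            /* the "else if" branch */
    return NHC_RUN_APPLICATION;
}

/* Behaviour requested in Comment 2: application NHC runs on every
 * step completion, regardless of the job state. */
enum nhc_action step_finish_without_else_if(const struct job_model *job)
{
    (void)job;                      /* job state no longer matters */
    return NHC_RUN_APPLICATION;
}
```

In this model the salloc case (job 66315) has completing == false when the step finishes, so both variants run application NHC. The bare srun case (job 66314) has completing == true, so only the second variant runs it.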

Comment 3 Danny Auble 2013-10-24 10:00:55 MDT

Removing the else if is easy to do, but as it is written today I don't think we can easily guarantee the application NHC will finish before the reservation one is started.  We would have to have a counter and either sleep or pthread_cond_wait on something until all the application NHCs finish before starting the reservation NHC.
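
A minimal sketch of the counter plus pthread_cond_wait idea, in case the ordering does turn out to be required. All names here (nhc_tracker, app_nhc_begin, app_nhc_end, wait_for_app_nhc) are made up for illustration and are not Slurm code; only the pthread calls are real.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-job tracker for in-flight application NHCs. */
struct nhc_tracker {
    pthread_mutex_t lock;
    pthread_cond_t  idle;            /* signalled when the count hits 0 */
    int             app_nhc_running; /* application NHCs not yet finished */
};

void nhc_tracker_init(struct nhc_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->idle, NULL);
    t->app_nhc_running = 0;
}

/* Call just before an application NHC is started. */
void app_nhc_begin(struct nhc_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    t->app_nhc_running++;
    pthread_mutex_unlock(&t->lock);
}

/* Call when an application NHC has finished. */
void app_nhc_end(struct nhc_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    assert(t->app_nhc_running > 0);
    if (--t->app_nhc_running == 0)
        pthread_cond_broadcast(&t->idle);
    pthread_mutex_unlock(&t->lock);
}

/* Call before starting the reservation NHC: blocks until every
 * application NHC that was begun has ended. */
void wait_for_app_nhc(struct nhc_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    while (t->app_nhc_running > 0)   /* loop guards against spurious wakeups */
        pthread_cond_wait(&t->idle, &t->lock);
    pthread_mutex_unlock(&t->lock);
}
```

Each application NHC would bracket its run with app_nhc_begin() and app_nhc_end(), and the reservation NHC thread would call wait_for_app_nhc() first. The wait is where the extra delay would come from.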

Could you please verify whether that ordering is actually required?  The NHC is already slowing things down quite a bit; this would definitely slow things down much more and complicate the code even more.

In either case, neither the application NHC nor the reservation NHC will start until after the application is completely done on the nodes it was running on.

The debug levels are as follows:

0 = quiet
1 = fatal
2 = error
3 = info
4 = verbose
5 = debug
6 = debug2
7 = debug3
8 = debug4
9 = debug5

So a debug level of 3 only gets you info; that debug3 message needs level 7. You can also just put debug2 or debug3 or whatever instead of the old number way.

This probably would have been one of the very rare times debug3 would have given you helpful information ;).
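
To make the numbering concrete, here is a small sketch of the lookup, assuming a message is emitted only when the configured level is at least the message's level. The table simply mirrors the list above; level_from_name and message_is_logged are illustrative names, not Slurm functions.

```c
#include <stdbool.h>
#include <string.h>

/* Level names in the order listed above; the index is the numeric level. */
static const char *const level_names[] = {
    "quiet", "fatal", "error", "info", "verbose",
    "debug", "debug2", "debug3", "debug4", "debug5",
};
enum { LEVEL_COUNT = sizeof(level_names) / sizeof(level_names[0]) };

/* Return the numeric level for a name, or -1 if the name is unknown. */
int level_from_name(const char *name)
{
    for (int i = 0; i < LEVEL_COUNT; i++)
        if (strcmp(level_names[i], name) == 0)
            return i;
    return -1;
}

/* Assumed rule: a message shows up only if the configured numeric
 * level is greater than or equal to the level the message is logged at. */
bool message_is_logged(int configured_level, const char *message_level)
{
    int needed = level_from_name(message_level);
    return needed >= 0 && configured_level >= needed;
}
```

Under that assumption, message_is_logged(3, "debug3") is false and message_is_logged(7, "debug3") is true, which matches the missing message above.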

Comment 4 David Gloe 2013-10-24 10:05:16 MDT

I didn't mean that the application NHC needs to be before the reservation NHC. What I meant is that you shouldn't call the reservation or application NHC before the application itself has exited. I assume that's the case already, but just wanted to make sure.

As long as that's true the application and reservation NHC can run in parallel.

Comment 5 Danny Auble 2013-10-28 04:43:24 MDT

Perfect.  I just removed the else if in commit 148d619f8b76a82ebd988fb00f8a21b73b7c3263.

This should fix it.  Let me know otherwise.