We have updated our TDS (gerty) to CLE 6.0 UP07. UP07 provides an 'improved Staging API' that should prevent timeouts when staging very large amounts of data or a very large number of files. When using a Persistent Reservation with salloc, the Registration remains stuck in teardown, possibly awaiting a signal from Slurm that stage_out has completed, even though no stage_out was requested. Does Slurm 17.11.7 need to be updated to handle this new API? Thanks.

ctlnet1:~ # dwstat most
pool     units quantity     free     gran
wlm_pool bytes 11.64TiB 11.17TiB 20.14GiB

sess state token      creator owner created             expiration nodes
   4 CA--- dw_scratch CLI     15448 2018-08-15T15:42:01 never          0
  23 D---- 1000892    SLURM   15448 2018-08-16T17:08:57 never          0

inst state sess     bytes nodes created             expiration intact label      public confs
   4 CA---    4 483.38GiB     2 2018-08-15T15:42:01 never      intact dw_scratch public     1

conf state inst type    activs
   4 CA---    4 scratch      0

reg state sess conf wait
 20 D--TM   23    4 wait
Similar to bug 5576, the logic that moves the burst buffer allocation (instance) to BB_STATE_COMPLETE appears to rely on the dw_wlm_cli teardown invocation exiting with status 0. Teardown is run at the termination of every job that _might_ have a burst buffer, so errors such as "No matching session" or "token not found" should be fairly common and not indicative of a problem. Perhaps with the upgrade some of these strings changed or new ones appeared? Do you happen to have logs for the afflicted job? Does the job Comment show any relevant information?
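To make the tolerance described above concrete, here is a minimal sketch of how a teardown result could be classified. The benign message strings are the ones quoted in this comment; the function name and structure are illustrative and are not Slurm's actual implementation:

```python
# Messages that indicate there was simply nothing left to tear down.
BENIGN_TEARDOWN_ERRORS = (
    "No matching session",
    "token not found",
)

def teardown_succeeded(exit_status: int, output: str) -> bool:
    """Return True if a dw_wlm_cli teardown result should be treated
    as success, i.e. the instance can move to BB_STATE_COMPLETE."""
    if exit_status == 0:
        return True
    # Non-zero exit: accept it only if the output says the burst
    # buffer was already gone.
    return any(msg in output for msg in BENIGN_TEARDOWN_ERRORS)

print(teardown_succeeded(1, "Error: No matching session for token 1000892"))
```

If the upgraded dw_wlm_cli changed the wording of these messages, string matching of this kind would start treating benign failures as real ones, which matches the stuck-in-teardown symptom reported above.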
Slurm calls dw_wlm_cli to perform a stage-out operation for all jobs using burst buffers, even without an explicit "#DW stage_out ..." directive in the script, because sites such as LANL use the DataWarp API to generate stage-out operations outside the job script. Slurm has no way to determine whether those stage-out operations are complete without using the dw_wlm_cli stage-out operation. As we did with bug 5576, we are going to mark this as invalid. Please reopen if you feel this isn't something to be addressed in Cray.
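The behavior described above amounts to a polling loop: keep asking until stage-out is reported complete, then allow teardown to proceed. A minimal sketch, with the status check stubbed out (Slurm's real code queries dw_wlm_cli with job-specific arguments; all names here are illustrative):

```python
import time

def wait_for_stage_out(check_complete, poll_interval=2.0, timeout=60.0):
    """Poll check_complete() until it reports True or the timeout
    expires.  check_complete stands in for asking dw_wlm_cli whether a
    job's stage-out has finished."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_complete():
            return True   # stage-out done; teardown can finish
        time.sleep(poll_interval)
    return False          # still pending; registration stays in teardown

# Fake status check for illustration: reports complete on the third call.
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for_stage_out(fake_check, poll_interval=0.01, timeout=5.0))
```

If the status query never reports completion (for example because the upgraded API answers differently), the loop never exits successfully, which is consistent with a registration stuck in the D--TM teardown state.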