We have updated our TDS (gerty) to CLE 6.0 UP07. UP07 provides an 'improved Staging API' that should prevent timeouts when staging very large amounts of data or a very large number of files. When using a Persistent Reservation with salloc, the Registration remains stuck in teardown, possibly awaiting a signal from Slurm that stage_out has completed, even though no stage_out was requested. Does Slurm 17.11.7 need to be updated to handle this new API? Thanks.

ctlnet1:~ # dwstat most
pool     units quantity     free     gran
wlm_pool bytes 11.64TiB 11.17TiB 20.14GiB

sess state token      creator owner created             expiration nodes
   4 CA--- dw_scratch CLI     15448 2018-08-15T15:42:01 never          0
  23 D---- 1000892    SLURM   15448 2018-08-16T17:08:57 never          0

inst state sess     bytes nodes created             expiration intact label      public confs
   4 CA---    4 483.38GiB     2 2018-08-15T15:42:01 never      intact dw_scratch public     1

conf state inst type    activs
   4 CA---    4 scratch      0

reg state sess conf wait
 20 D--TM   23    4 wait
Similar to bug 5576, the logic that moves the burst buffer allocation (instance) to BB_STATE_COMPLETE appears to rely on the dw_wlm_cli teardown invocation exiting with status 0. Teardown is run at the termination of every job that _might_ have a burst buffer, so errors such as "No matching session" or "token not found" should be fairly common and not indicative of a problem. Perhaps with the upgrade some of these strings changed or new ones appeared? Do you happen to have logs for the afflicted job? Does the job Comment show any relevant information?
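To make the tolerance described above concrete, here is a minimal sketch of how a teardown result could be classified. The benign message strings are the ones quoted in this comment; the function name and structure are illustrative and are not Slurm's actual implementation:

```python
# Messages that indicate there was simply nothing left to tear down.
BENIGN_TEARDOWN_ERRORS = (
    "No matching session",
    "token not found",
)

def teardown_succeeded(exit_status: int, output: str) -> bool:
    """Return True if a dw_wlm_cli teardown result should be treated
    as success, i.e. the instance can move to BB_STATE_COMPLETE."""
    if exit_status == 0:
        return True
    # Non-zero exit: accept it only if the output says the burst
    # buffer was already gone.
    return any(msg in output for msg in BENIGN_TEARDOWN_ERRORS)

print(teardown_succeeded(1, "Error: No matching session for token 1000892"))
```

If the upgraded dw_wlm_cli changed the wording of these messages, string matching of this kind would start treating benign failures as real ones, which matches the stuck-in-teardown symptom reported above.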
Slurm calls dw_wlm_cli to perform a stage-out operation for all jobs using burst buffers, even without an explicit "#DW stage_out ..." directive in the script, because sites such as LANL use the DataWarp API to generate stage-out operations outside the job script. Slurm has no way to determine whether those stage-out operations are complete without using the dw_wlm_cli stage-out operation. As we did with bug 5576, we are going to mark this as invalid. Please reopen if you feel this isn't something to be addressed in Cray.
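The behavior described above amounts to a polling loop: keep asking until stage-out is reported complete, then allow teardown to proceed. A minimal sketch, with the status check stubbed out (Slurm's real code queries dw_wlm_cli with job-specific arguments; all names here are illustrative):

```python
import time

def wait_for_stage_out(check_complete, poll_interval=2.0, timeout=60.0):
    """Poll check_complete() until it reports True or the timeout
    expires.  check_complete stands in for asking dw_wlm_cli whether a
    job's stage-out has finished."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_complete():
            return True   # stage-out done; teardown can finish
        time.sleep(poll_interval)
    return False          # still pending; registration stays in teardown

# Fake status check for illustration: reports complete on the third call.
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for_stage_out(fake_check, poll_interval=0.01, timeout=5.0))
```

If the status query never reports completion (for example because the upgraded API answers differently), the loop never exits successfully, which is consistent with a registration stuck in the D--TM teardown state.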