Ticket 6395 - synchronize task launch when prolog run time is variable
Summary: synchronize task launch when prolog run time is variable
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration (show other tickets)
Version: 17.11.12
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Broderick Gardner
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-01-23 11:34 MST by Ryan Day
Modified: 2019-01-31 14:38 MST (History)
1 user (show)

See Also:
Site: LLNL
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Ryan Day 2019-01-23 11:34:41 MST
One of our system admins is experimenting with setting up a BeeOND file system (https://www.beegfs.io/wiki/BeeOND) in the job prolog (and tearing it down in the epilog). The trouble that he's running into is that only one node in the allocation sets up the metadata server for the BeeOND file system. The prolog script takes longer to run on that node, and so tasks can get launched on other nodes before the file system is fully set up and available. I was looking for a Slurm configuration option that would prevent any tasks from launching until the prolog has completed on all of the nodes in the allocation. 'PrologFlags=Alloc' appears to do what I want for jobs launched with sbatch or salloc, but if I launch directly with srun, I still see asynchronous launch behavior.

Here's a simple reproducer:

I have a prolog script that sleeps for 60s on the first node in an allocation, and 30s on all other nodes:

[day36@opal186:prolog_test]$ cat waiter.sh
#!/bin/bash

thishost=$(/bin/hostname -s)
firsthost=$(/bin/scontrol show hostnames "${SLURM_NODELIST}" | head -1)

if [[ ${SLURMD_NODENAME} = ${firsthost} ]]; then
	echo "first!" > /g/g0/day36/waitout.${thishost}
	sleep 60
else
	sleep 30
fi

exit 0
[day36@opal186:prolog_test]$

I have a script that prints the host and date:

[day36@opal186:prolog_test]$ cat checker.sh
#!/bin/sh

hostname
date
[day36@opal186:prolog_test]$

and I have a batch script that runs that checker script on two nodes:

[day36@opal186:prolog_test]$ cat check_stuff.sbatch 
#!/bin/sh

#SBATCH -N 2
#SBATCH --reservation=test

srun -N 2 --ntasks-per-node=1 checker.sh
[day36@opal186:prolog_test]$

If I have the waiter.sh prolog script in place and no PrologFlags, one checker.sh task runs 30s before the other one:

[day36@opal186:prolog_test]$ date
Wed Jan 23 09:40:07 PST 2019
[day36@opal186:prolog_test]$ srun -N2 -n2 --reservation=test checker.sh 
opal109
Wed Jan 23 09:40:40 PST 2019
opal108
Wed Jan 23 09:41:08 PST 2019
[day36@opal186:prolog_test]$

If I have the waiter.sh prolog script in place, PrologFlags=Alloc, and run in an sbatch script, the tasks run at the same time:

[day36@opal186:prolog_test]$ sbatch check_stuff.sbatch 
Submitted batch job 46764
[day36@opal186:prolog_test]$ date
Wed Jan 23 10:08:32 PST 2019
…
[day36@opal186:prolog_test]$ cat slurm-46764.out
opal108
Wed Jan 23 10:09:32 PST 2019
opal109
Wed Jan 23 10:09:32 PST 2019
[day36@opal186:prolog_test]$

But, if I have the waiter.sh prolog script in place, PrologFlags=Alloc, and run directly with srun, one task still runs 30s before the other:

[day36@opal186:prolog_test]$ date
Wed Jan 23 09:43:43 PST 2019
[day36@opal186:prolog_test]$ srun -N2 -n2 --reservation=test checker.sh 
opal109
Wed Jan 23 09:44:17 PST 2019
opal108
Wed Jan 23 09:44:44 PST 2019

Is there a way to get task launch to wait for all prologs to complete, or do we have to write that synchronization into the prolog scripts themselves?
Comment 1 Broderick Gardner 2019-01-23 16:59:25 MST
There currently isn't a way to make the prolog on one node wait for the prologs on the other nodes; you would have to build that synchronization into the script itself, for example by watching for a file on a shared filesystem or by waiting on a network socket.
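A prolog-side barrier of the kind described above could be sketched like this. All paths and names here are illustrative; in a real prolog the first node would be found with `scontrol show hostnames "${SLURM_NODELIST}" | head -1`, the flag file would be named with `${SLURM_JOB_ID}`, and it would live on a filesystem visible to every node in the allocation:

```shell
#!/bin/bash
# Sketch of a prolog-side barrier using a flag file on a shared
# filesystem. The first node creates the flag once its slow setup
# (e.g. the BeeOND metadata server) is done; every other node holds
# its prolog open until the flag appears, so no task starts early.

# Block until $1 exists, polling once per second, for at most $2 seconds.
wait_for_flag() {
    local flag=$1 timeout=$2 elapsed=0
    until [[ -e ${flag} ]]; do
        (( ++elapsed > timeout )) && return 1
        sleep 1
    done
    return 0
}

# Demonstration with a local directory standing in for the shared FS:
flagdir=$(mktemp -d)
flag="${flagdir}/ready.12345"     # real name would use ${SLURM_JOB_ID}

# "First node": finish the slow setup, then raise the flag.
( sleep 2; touch "${flag}" ) &

# "Other nodes": hold the prolog here until the flag appears.
if wait_for_flag "${flag}" 30; then
    echo "barrier passed"
else
    echo "barrier timed out" >&2
fi
```

The epilog would remove the flag file so a later job with a recycled job ID cannot see a stale one.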

The reason the tasks start at the same time with PrologFlags=Alloc and sbatch is that the prolog runs on all nodes in the allocation before the batch step starts, so every node has to finish its prolog before the batch step launches any job steps. With srun, the job step is sent to the nodes from the beginning, so each node starts its tasks as soon as its own prolog finishes.
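For reference, the configuration being discussed is just the prolog/epilog settings in slurm.conf; the script paths below are illustrative:

```
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh
PrologFlags=Alloc
```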

Does that answer your question?
Comment 2 Broderick Gardner 2019-01-24 10:11:38 MST
Another option is to create a SPANK plugin that hooks into BeeOND, though that could be a bit more involved than you are looking for. 

Here is a plugin along those lines that sets up a private temporary directory:
https://github.com/hpc2n/spank-private-tmp
Comment 3 Ryan Day 2019-01-24 17:39:54 MST
Okay. That's about what I thought. We'll look into either adding a check to the prolog script to make sure the BeeOND file system is present before the prolog returns, or reworking the setup as a SPANK plugin.
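The "make sure the file system is present" check mentioned above could be as simple as polling /proc/mounts before the prolog returns. The mount point and timeout here are illustrative:

```shell
#!/bin/bash
# Sketch: before the prolog exits, poll /proc/mounts until the
# BeeOND mount point appears, failing the prolog if it never does.

# Return 0 once $1 shows up in /proc/mounts, 1 after $2 seconds.
wait_for_mount() {
    local mnt=$1 timeout=$2
    for (( i = 0; i < timeout; i++ )); do
        grep -q " ${mnt} " /proc/mounts && return 0
        sleep 1
    done
    return 1
}

# In the real prolog this would be something like:
#   wait_for_mount /mnt/beeond 60 || exit 1
# Demonstration against a mount that exists on any Linux box:
wait_for_mount /proc 5 && echo "mount present"
```

Failing the prolog (non-zero exit) on timeout drains the node rather than launching tasks against a missing file system, which is usually the safer behavior.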
Comment 4 Broderick Gardner 2019-01-28 09:17:13 MST
Okay, closing this ticket then.