One of our system admins is experimenting with setting up a BeeOND file system (https://www.beegfs.io/wiki/BeeOND) in the job prolog (and tearing it down in the epilog). The trouble he's running into is that only one node in the allocation sets up the metadata server for the BeeOND file system. The prolog script takes longer to run on that node, so tasks can get launched on other nodes before the file system is fully set up and available.

I was looking for a Slurm configuration option that would prevent any tasks from launching until the prolog has completed on all of the nodes in the allocation. 'PrologFlags=Alloc' appears to do what I want for jobs launched with sbatch or salloc, but if I launch directly with srun, I still see asynchronous launch behavior.

Here's a simple reproducer. I have a prolog script that sleeps for 60s on the first node in an allocation and 30s on all other nodes (note the bash shebang, since the script uses the bash-only `[[ ]]` test):

[day36@opal186:prolog_test]$ cat waiter.sh
#!/bin/bash
thishost=`/bin/hostname -s`
firsthost=`/bin/scontrol show hostnames ${SLURM_NODELIST} | head -1`
if [[ ${SLURMD_NODENAME} = ${firsthost} ]]; then
    echo "first!" > /g/g0/day36/waitout.${thishost}
    sleep 60
else
    sleep 30
fi
exit 0
[day36@opal186:prolog_test]$

I have a script that prints the host and date:

[day36@opal186:prolog_test]$ cat checker.sh
#!/bin/sh
hostname
date
[day36@opal186:prolog_test]$

and I have a batch script that runs that checker script on two nodes:

[day36@opal186:prolog_test]$ cat check_stuff.sbatch
#!/bin/sh
#SBATCH -N 2
#SBATCH --reservation=test
srun -N 2 --ntasks-per-node=1 checker.sh
[day36@opal186:prolog_test]$

With the waiter.sh prolog script in place and no PrologFlags, one checker.sh task runs about 30s before the other:

[day36@opal186:prolog_test]$ date
Wed Jan 23 09:40:07 PST 2019
[day36@opal186:prolog_test]$ srun -N2 -n2 --reservation=test checker.sh
opal109
Wed Jan 23 09:40:40 PST 2019
opal108
Wed Jan 23 09:41:08 PST 2019
[day36@opal186:prolog_test]$

With the waiter.sh prolog script in place and PrologFlags=Alloc, running from an sbatch script, the tasks run at the same time:

[day36@opal186:prolog_test]$ sbatch check_stuff.sbatch
Submitted batch job 46764
[day36@opal186:prolog_test]$ date
Wed Jan 23 10:08:32 PST 2019
…
[day36@opal186:prolog_test]$ cat slurm-46764.out
opal108
Wed Jan 23 10:09:32 PST 2019
opal109
Wed Jan 23 10:09:32 PST 2019
[day36@opal186:prolog_test]$

But with the waiter.sh prolog script in place and PrologFlags=Alloc, running directly with srun, one task still runs 30s before the other:

[day36@opal186:prolog_test]$ date
Wed Jan 23 09:43:43 PST 2019
[day36@opal186:prolog_test]$ srun -N2 -n2 --reservation=test checker.sh
opal109
Wed Jan 23 09:44:17 PST 2019
opal108
Wed Jan 23 09:44:44 PST 2019

Is there a way to get task launch to wait for all prologs to complete, or do we have to write that synchronization into the prolog scripts themselves?
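For reference, the slurm.conf settings involved in this reproducer would look roughly like the following (the prolog path is illustrative; in the real setup waiter.sh stands in for the site's BeeOND setup script):

```
# Illustrative slurm.conf fragment -- path is hypothetical
Prolog=/etc/slurm/waiter.sh
PrologFlags=Alloc
```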
There currently isn't a way to make the prolog on one node wait for the prologs on other nodes; you would have to build that synchronization into the script yourself, for example by watching for a file on a shared filesystem, or by listening on a network socket.

The reason the tasks start at the same time with PrologFlags=Alloc and sbatch is that the prolog is run on all nodes in the allocation before the batch step starts, so every node has to finish its prolog before the batch step launches any job steps. With srun, the job step is sent to the nodes from the beginning, so each node runs it as soon as its own prolog is finished.

Does that answer your question?
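As a sketch of the shared-filesystem approach: the first node's prolog could touch a "ready" file once BeeOND is mounted, and every node's prolog could block until that file appears. Everything here is illustrative, not Slurm-provided: the barrier directory, the FIRSTHOST/BARRIER_DIR overrides (included so the sketch can run outside a job), and the 120s timeout would all need to be adapted to the real site setup.

```shell
#!/bin/sh
# Sketch of a prolog-side barrier on a shared filesystem (illustrative).

# Block until file $1 exists, for up to $2 seconds; return 1 on timeout.
wait_for_file() {
    i=0
    while [ "$i" -lt "${2:-120}" ]; do
        [ -e "$1" ] && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# In the real prolog this would live on a shared path (e.g. NFS):
ready_file="${BARRIER_DIR:-/tmp}/beeond_ready.${SLURM_JOB_ID:-demo}"

# In the real prolog:
#   firsthost=$(scontrol show hostnames "$SLURM_NODELIST" | head -1)
node=${SLURMD_NODENAME:-$(hostname -s)}
firsthost=${FIRSTHOST:-$(hostname -s)}

if [ "$node" = "$firsthost" ]; then
    # ... start BeeOND and verify the mount here, then signal readiness ...
    touch "$ready_file"
fi

# Every node (including the first) waits for the ready file before the
# prolog exits, so no task launches until BeeOND is up everywhere.
wait_for_file "$ready_file" 120 || { echo "BeeOND barrier timed out" >&2; exit 1; }
echo "barrier passed on $node"
```

An epilog counterpart would remove the ready file after tearing BeeOND down, so a stale file from one job can't release the barrier for the next.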
Another option is to create a SPANK plugin that hooks the BeeOND setup into the job, though that could be a bit more involved than you are looking for. Here is a plugin along those lines that sets up a private temp directory: https://github.com/hpc2n/spank-private-tmp
Okay. That's about what I thought. We'll look into either adding something to the prolog script to make sure that the BeeOND file system is present before finishing or reworking it as a SPANK plugin.
Okay, closing this ticket then.