Ticket 1047

Summary: SLURM_CHECKPOINT_DIR doesn't seem to be used
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: Other
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: da
Version: 14.03.6
Hardware: Linux
OS: Linux
Site: Stanford
Version Fixed: 14.03.7

Description Kilian Cavalotti 2014-08-18 07:35:24 MDT
Hi,

I'm trying out BLCR checkpointing, and it seems that the SLURM_CHECKPOINT_DIR variable in the user's environment is not used:

$ echo $SLURM_CHECKPOINT_DIR
/scratch/users/kilian/.chkpnt/

$ sbatch --wrap="sleep 10000"
Submitted batch job 176521
$ scontrol checkpoint able 176521
Yes
$ scontrol checkpoint create 176521
$

But the checkpoint data is not created in /scratch/users/kilian/.chkpnt/; it's created in $HOME/<jobid>:
$ ls -l /scratch/users/kilian/.chkpnt/
total 0
$ ls $HOME/176521
script.ckpt

On the controller logs, I get:
# grep 176521 /var/log/slurm/slurmctld.log
[2014-08-18T12:13:51.689] _slurm_rpc_submit_batch_job JobId=176521 usec=6127
[2014-08-18T12:13:52.299] sched: Allocate JobId=176521 NodeList=sh-5-24 #CPUs=1
[2014-08-18T12:14:06.425] checkpoint_op 0 of 176521.4294967294 complete, rc=0
[2014-08-18T12:14:06.425] _slurm_rpc_checkpoint able for 176521 usec=176
[2014-08-18T12:14:21.762] checkpoint_op 3 of 176521.4294967294 complete, rc=0
[2014-08-18T12:14:21.762] _slurm_rpc_checkpoint create for 176521 usec=17992
[2014-08-18T12:14:31.775] error: checkpoint/blcr: error on checkpoint request 3 to 176521.4294967294: Communication connection failure
[2014-08-18T12:21:41.157] checkpoint_op 3 of 176521.4294967294 complete, rc=0
[2014-08-18T12:21:41.158] _slurm_rpc_checkpoint create for 176521 usec=20114
[2014-08-18T12:21:51.809] checkpoint_op 3 of 176521.4294967294 complete, rc=0
[2014-08-18T12:21:51.809] _slurm_rpc_checkpoint create for 176521 usec=27155

On the compute node:
# grep 176521 /var/log/slurm/slurmd.log
[2014-08-18T12:14:58.994] error: _step_connect: connect() failed dir /var/spool/slurmd node sh-5-24 job 176521 step -2 No such file or directory
[2014-08-18T12:14:58.997] prolog for job 176521 ran for 66 seconds
[2014-08-18T12:14:58.998] Launching batch job 176521 for UID 215845
[2014-08-18T12:14:59.023] [176521] checkpoint/blcr init
[2014-08-18T12:15:02.331] [176521] task/cgroup: /slurm/uid_215845/job_176521: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
[2014-08-18T12:15:02.331] [176521] task/cgroup: /slurm/uid_215845/job_176521/step_4294967294: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB

The ENOENT error is weird because /var/spool/slurmd exists on that node, and everything else works perfectly.

Is there something wrong with my setup (more precisely, where should I look for clues about errors?), or is it a bug? 

Thanks!
Comment 1 Moe Jette 2014-08-18 07:45:58 MDT
The man page doesn't seem to match the code. The code is looking for SBATCH_ env vars. I'll update the man page.
Comment 2 Kilian Cavalotti 2014-08-18 07:50:51 MDT
(In reply to Moe Jette from comment #1)
> The man page doesn't seem to match the code. The code is looking for SBATCH_
> env vars. I'll update the man page.

Right, SBATCH_CHECKPOINT_DIR does the trick.

Thanks!
Comment 3 Moe Jette 2014-08-18 07:52:31 MDT
The documentation will be corrected in the next release:

https://github.com/SchedMD/slurm/commit/cd551396c9705ca00e361a841206974ec3020435
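
For reference, the workaround confirmed in comment 2 can be sketched roughly as follows. This is only an illustration, assuming a Slurm 14.03.x installation with the checkpoint/blcr plugin enabled; the directory is the reporter's path from the description, and the guard around sbatch is an addition so the snippet does nothing harmful on a machine without the Slurm client tools.

```shell
# Sketch of the workaround from comment 2 (assumes Slurm 14.03.x with checkpoint/blcr).
# sbatch reads SBATCH_CHECKPOINT_DIR; SLURM_CHECKPOINT_DIR is ignored at submit time.
# The path is the reporter's directory from this ticket; substitute your own.
export SBATCH_CHECKPOINT_DIR="/scratch/users/kilian/.chkpnt"

# Only submit when the Slurm client tools are actually installed.
if command -v sbatch >/dev/null 2>&1; then
    mkdir -p "$SBATCH_CHECKPOINT_DIR"
    sbatch --wrap="sleep 10000"
    # Then, using the job ID printed by sbatch:
    #   scontrol checkpoint create <jobid>
    #   ls -l "$SBATCH_CHECKPOINT_DIR"
else
    echo "sbatch not found; would submit with SBATCH_CHECKPOINT_DIR=$SBATCH_CHECKPOINT_DIR"
fi
```

In this release the same directory can presumably also be given per job with sbatch's --checkpoint-dir option instead of the environment variable.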