Ticket 1047

Summary:	SLURM_CHECKPOINT_DIR doesn't seem to be used
Product:	Slurm	Reporter:	Kilian Cavalotti <kilian>
Component:	Other	Assignee:	Moe Jette <jette>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	da
Version:	14.03.6
Hardware:	Linux
OS:	Linux
Site:	Stanford	Slinky Site:	---
CLE Version:		Version Fixed:	14.03.7

Description Kilian Cavalotti 2014-08-18 07:35:24 MDT

Hi,

I'm trying out BLCR checkpointing, and it seems that the SLURM_CHECKPOINT_DIR variable set in the user's environment is not used:

$ echo $SLURM_CHECKPOINT_DIR
/scratch/users/kilian/.chkpnt/

$ sbatch --wrap="sleep 10000"
Submitted batch job 176521
$ scontrol checkpoint able 176521
Yes
$ scontrol checkpoint create 176521
$

But the checkpoint data is not created in /scratch/users/kilian/.chkpnt/; it is created in $HOME/<jobid> instead:
$ ls -l /scratch/users/kilian/.chkpnt/
total 0
$ ls $HOME/176521
script.ckpt

In the controller logs, I get:
# grep 176521 /var/log/slurm/slurmctld.log
[2014-08-18T12:13:51.689] _slurm_rpc_submit_batch_job JobId=176521 usec=6127
[2014-08-18T12:13:52.299] sched: Allocate JobId=176521 NodeList=sh-5-24 #CPUs=1
[2014-08-18T12:14:06.425] checkpoint_op 0 of 176521.4294967294 complete, rc=0
[2014-08-18T12:14:06.425] _slurm_rpc_checkpoint able for 176521 usec=176
[2014-08-18T12:14:21.762] checkpoint_op 3 of 176521.4294967294 complete, rc=0
[2014-08-18T12:14:21.762] _slurm_rpc_checkpoint create for 176521 usec=17992
[2014-08-18T12:14:31.775] error: checkpoint/blcr: error on checkpoint request 3 to 176521.4294967294: Communication connection failure
[2014-08-18T12:21:41.157] checkpoint_op 3 of 176521.4294967294 complete, rc=0
[2014-08-18T12:21:41.158] _slurm_rpc_checkpoint create for 176521 usec=20114
[2014-08-18T12:21:51.809] checkpoint_op 3 of 176521.4294967294 complete, rc=0
[2014-08-18T12:21:51.809] _slurm_rpc_checkpoint create for 176521 usec=27155

On the compute node:
# grep 176521 /var/log/slurm/slurmd.log
[2014-08-18T12:14:58.994] error: _step_connect: connect() failed dir /var/spool/slurmd node sh-5-24 job 176521 step -2 No such file or directory
[2014-08-18T12:14:58.997] prolog for job 176521 ran for 66 seconds
[2014-08-18T12:14:58.998] Launching batch job 176521 for UID 215845
[2014-08-18T12:14:59.023] [176521] checkpoint/blcr init
[2014-08-18T12:15:02.331] [176521] task/cgroup: /slurm/uid_215845/job_176521: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB
[2014-08-18T12:15:02.331] [176521] task/cgroup: /slurm/uid_215845/job_176521/step_4294967294: alloc=4000MB mem.limit=4000MB memsw.limit=4000MB

The ENOENT error is weird because /var/spool/slurmd exists on that node, and everything else works perfectly.

Is there something wrong with my setup (more precisely, where should I look for clues about errors?), or is it a bug? 

Thanks!

Comment 1 Moe Jette 2014-08-18 07:45:58 MDT

The man page doesn't seem to match the code. The code is looking for SBATCH_ env vars. I'll update the man page.

Comment 2 Kilian Cavalotti 2014-08-18 07:50:51 MDT

(In reply to Moe Jette from comment #1)
> The man page doesn't seem to match the code. The code is looking for SBATCH_
> env vars. I'll update the man page.

Right, SBATCH_CHECKPOINT_DIR does the trick.

Thanks!
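
For anyone who already exports SLURM_CHECKPOINT_DIR in their shell profile, the workaround confirmed above can be sketched as a small wrapper. This is only an illustration: the function name use_sbatch_checkpoint_dir is hypothetical and not part of Slurm, and it assumes (per comment 1) that sbatch in 14.03.6 reads the SBATCH_-prefixed variable rather than the SLURM_-prefixed one.

```shell
# Hypothetical helper (not part of Slurm): copy SLURM_CHECKPOINT_DIR into
# SBATCH_CHECKPOINT_DIR, the name sbatch looks for according to comment 1.
# An SBATCH_CHECKPOINT_DIR that is already set is left untouched.
use_sbatch_checkpoint_dir() {
    if [ -z "${SBATCH_CHECKPOINT_DIR:-}" ] && [ -n "${SLURM_CHECKPOINT_DIR:-}" ]; then
        SBATCH_CHECKPOINT_DIR=$SLURM_CHECKPOINT_DIR
        export SBATCH_CHECKPOINT_DIR
    fi
}

# Usage, with the directory from the description:
#   export SLURM_CHECKPOINT_DIR=/scratch/users/kilian/.chkpnt/
#   use_sbatch_checkpoint_dir
#   sbatch --wrap="sleep 10000"
```

Exporting SBATCH_CHECKPOINT_DIR directly is simpler if nothing else depends on the SLURM_-prefixed name.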

Comment 3 Moe Jette 2014-08-18 07:52:31 MDT

The documentation will be corrected in the next release:

https://github.com/SchedMD/slurm/commit/cd551396c9705ca00e361a841206974ec3020435