We've been getting errors like these:

    [2013-10-16T17:06:03.995] [229249] Unable to create TMPDIR [/scratch/barcoo/jobs/229222]: Permission denied
    [2013-10-16T17:06:03.995] [229249] Setting TMPDIR to /tmp

on Slurm 2.6.1 x86 and Slurm 2.6.2 BG/Q, and they don't quite make sense.

1. Note the disparity in the JobIds above: stepd is running _make_tmpdir for 229249, but instead of /scratch/barcoo/jobs/229249 it complains about ../229222, a different JobId. That other job is always owned by the same user, though sometimes it's a job that finished a few days earlier.

2. The real TMPDIR value is set by the slurmd prolog, and the directory seems to always exist and be owned by the user. It's just that _make_tmpdir is looking at the wrong TMPDIR value for some strange reason.

3. On closer inspection (and this threw us off for a long time), it doesn't actually affect the job, because the job's process environment shows the correct TMPDIR value anyway.

So it appears to be a complaint about nothing. TMPDIR apparently gets set correctly, so the job picks up the correct value, but it's very odd that the error message comes up as it does. We haven't been able to do an exhaustive check, because /proc/<pid>/environ is of course gone for jobs that have completed. On the BG/Q, however, where the Navigator records the environment, an entire day's records for both running and completed jobs that produced this error message show that the TMPDIR value is actually correct.
Another detail I'd neglected to mention: TMPDIR is set to /scratch/<cluster>/jobs/$SLURM_JOB_ID in the TaskProlog:

    echo export TMPDIR=/scratch/barcoo/jobs/$SLURM_JOB_ID

while the subdirectory is created beforehand by the Slurmd Prolog without referring to the TMPDIR env var:

    JOBDIR=/scratch/barcoo/jobs/$SLURM_JOB_ID
    mkdir -p $JOBDIR
    chown ..
    chmod ..
One last but crucial detail: this error does not come up for each and every job. Just a small percentage, in fact.
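Since only a small share of jobs is affected, it may help to pull out just the log lines where the JobId of the step differs from the JobId in the TMPDIR path. The following is a rough sketch, not a tested site tool: the function name `find_tmpdir_mismatches` is made up for illustration, and the pattern assumes the log lines look exactly like the two quoted earlier in this ticket (timestamp, bracketed JobId, then the message).

```shell
#!/bin/sh
# Hypothetical helper: read slurmd log text on stdin and print
# "<step JobId> <JobId from TMPDIR path>" for every
# "Unable to create TMPDIR" line where the two ids differ.
# Assumes the line layout shown in this ticket; other layouts are ignored.
find_tmpdir_mismatches() {
    sed -n 's|^\[[^]]*\] \[\([0-9][0-9]*\)\] Unable to create TMPDIR \[.*/\([0-9][0-9]*\)\]: .*|\1 \2|p' |
        awk '$1 != $2'
}
```

It would be run as `find_tmpdir_mismatches < slurmd.log`, where `slurmd.log` stands in for wherever the slurmd log lives on the node.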
Hi, do you still see this error? Could you please update the status of this ticket, as it has been open for a long time?

David
Closing, no follow-up.

David
Apologies for bringing this in very late, but I think we have confirmation of what's going on: this pertains to sbatch jobs that launch other sbatch jobs recursively from the same script.

To recap, we see errors coming out of _make_tmpdir in stepd because stepd tries to create $TMPDIR while running as the user, and the filesystem is not world-writable. We get the Prolog script (which runs as root) to `mkdir $TMPDIR` instead. Normally this should avoid the error messages from stepd ("Unable to create TMPDIR .. Permission denied"), since the Prolog script has already created the directory.

Unfortunately, since a newly submitted job inherits environment variables by default, it receives the $TMPDIR associated with the predecessor batch script, e.g.,

    # sbatch myjob.slurm
    Job 1: defaults to TMPDIR=/tmp
      +--> set by TaskProlog: TMPDIR=/scratch/1
      +--> invokes sbatch myjob.slurm recursively before it exits
    Job 2: inherited: TMPDIR=/scratch/1
      +--> TaskProlog sets TMPDIR=/scratch/2
      +--> invokes sbatch myjob.slurm again ..
    Job 3: inherited: TMPDIR=/scratch/2 ..
    etc.

So stepd for Job 2 tries to create /scratch/1, which belongs to the previous job, and that is the mismatched JobId in the error message. Using `export TMPDIR` in the Prolog does not influence the subsequent stepd, so we can't use it to set TMPDIR correctly.

Our workaround is to pass --export=NONE or to `unset TMPDIR` before each recursive `sbatch` call.

I hesitate to call this a bug, but it may be a good idea to document this for users who do call sbatch recursively.
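For anyone hitting the same thing, here is a minimal sketch of the `unset TMPDIR` workaround as it might look inside a self-resubmitting batch script. The wrapper name `resubmit` is purely illustrative and not part of Slurm; it just makes sure the child submission does not see this job's TMPDIR while the current job keeps using its own.

```shell
#!/bin/bash
# Hypothetical wrapper around sbatch for recursive submissions.
# The subshell keeps the unset local, so the running job's own
# TMPDIR (set by the TaskProlog) is left untouched.
resubmit() {
    ( unset TMPDIR; sbatch "$@" )
}

# Alternative mentioned in this ticket: export nothing at all, e.g.
#   sbatch --export=NONE myjob.slurm
```

Inside the batch script, the recursive call then becomes `resubmit myjob.slurm` instead of `sbatch myjob.slurm`. Note that --export=NONE drops the whole submitting environment, not just TMPDIR, so the wrapper is the narrower of the two options.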