Ticket 496 - Unable to create TMPDIR
Summary: Unable to create TMPDIR
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 2.6.x
Hardware: Linux
: 6 - No support contract
Assignee: David Bigagli
Reported: 2013-10-29 19:17 MDT by Jeff Tan
Modified: 2014-08-17 18:49 MDT
Site: -Other-
Description Jeff Tan 2013-10-29 19:17:33 MDT
We've been getting errors like these:

[2013-10-16T17:06:03.995] [229249] Unable to create TMPDIR [/scratch/barcoo/jobs/229222]: Permission denied
[2013-10-16T17:06:03.995] [229249] Setting TMPDIR to /tmp

on Slurm 2.6.1 x86 and Slurm 2.6.2 BG/Q, and it doesn't quite make sense.

1. Note the disparity in the JobId above: stepd is running _make_tmpdir for job 229249, but instead of /scratch/barcoo/jobs/229249 it complains about ../229222, a different JobId. The other job always belongs to the same user, though sometimes it is a job that finished a few days earlier.

2. The real TMPDIR value is set by the slurmd prolog, and it seems like it always exists and is owned by the user. It's just that _make_tmpdir is looking at the wrong TMPDIR value for some strange reason.

3. On closer inspection (and this threw us off for a long time), the error doesn't actually affect the job, because the job's process environment holds the correct TMPDIR value anyway.

So it appears to be a complaint about nothing. TMPDIR apparently gets set correctly, so the job picks up the correct value, but it's very odd that the error message comes up as it does. We haven't been able to do an exhaustive check because /proc/<pid>/environ is gone once a job has completed. On the BG/Q, however, the Navigator records the environment, and an entire day's records for both running and completed jobs that triggered this error message show that the TMPDIR value is actually correct.
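For anyone wanting to repeat the check on a job that is still running, a minimal sketch follows. It assumes a Linux node and that you already know the PID of one of the job's tasks; `job_tmpdir` is a hypothetical helper name, not a Slurm command. Note that /proc/<pid>/environ reflects the environment the process was started with, not later changes made inside it.

```shell
#!/bin/bash
# job_tmpdir PID
# Print the TMPDIR value a process was started with, read from
# /proc/PID/environ (NUL-separated VAR=value entries). Prints nothing
# if TMPDIR was not set. Hypothetical helper for illustration only.
job_tmpdir() {
    tr '\0' '\n' < "/proc/$1/environ" | sed -n 's/^TMPDIR=//p'
}
```

Usage would be along the lines of `job_tmpdir 12345`, run as the job's owner or as root on the node where the task is running.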
Comment 1 Jeff Tan 2013-10-29 19:21:50 MDT
Another detail I'd neglected to mention: TMPDIR is set to /scratch/<cluster>/jobs/$SLURM_JOB_ID in the TaskProlog:

        echo export TMPDIR=/scratch/barcoo/jobs/$SLURM_JOB_ID

while the subdirectory is created beforehand by the Slurmd Prolog without referring to the TMPDIR env var:

        JOBDIR=/scratch/barcoo/jobs/$SLURM_JOB_ID
        mkdir -p $JOBDIR
        chown ..
        chmod ..
Comment 2 Jeff Tan 2013-10-29 19:26:30 MDT
One last but crucial detail: this error does not come up for each and every job. Just a small percentage, in fact.
Comment 3 David Bigagli 2014-03-20 11:32:28 MDT
Hi, do you still see this error? Could you please update the status of
this ticket, as it has been open for a long time?

David
Comment 4 David Bigagli 2014-04-09 05:41:13 MDT
Closing, no follow-up.

David
Comment 5 Jeff Tan 2014-08-17 18:49:50 MDT
Apologies for bringing this in very late, but I think we have confirmation of what's going on:

This pertains to sbatch jobs that launch other sbatch jobs recursively from the same script. To recap, we see errors from _make_tmpdir in stepd because stepd tries to create $TMPDIR while running as the user, and the filesystem is not world-writable. We get the Prolog script (which runs as root) to `mkdir $TMPDIR` instead. Normally this should avoid the error messages from stepd ("Unable to create TMPDIR .. Permission denied"), since the Prolog script has already created the directory. Unfortunately, since a job inherits the submitting environment's variables by default, it receives the $TMPDIR associated with the predecessor batch script, e.g.,

# sbatch myjob.slurm
Job 1: defaults to TMPDIR=/tmp
  +--> set by TaskProlog: TMPDIR=/scratch/1
  +--> invokes sbatch myjob.slurm recursively before it exits
Job 2: inherited: TMPDIR=/scratch/1
  +--> TaskProlog sets TMPDIR=/scratch/2
  +--> invokes sbatch myjob.slurm again ..
Job 3: inherited: TMPDIR=/scratch/2
..
etc.
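
The chain above can be reproduced without Slurm as a toy model. This is only a sketch of the inheritance described here: `run_job` and `simulate_chain` are hypothetical names, the /scratch/<id> paths are the abbreviated ones from the diagram, and the /tmp default stands in for the "Setting TMPDIR to /tmp" fallback seen in the log.

```shell
#!/bin/bash
# Toy model of the recursive-submission chain; no Slurm involved.
# run_job stands in for one batch job (hypothetical name).
run_job() {
    local job_id=$1
    # At launch the job still carries whatever TMPDIR the submitting
    # shell exported, i.e. the previous job's directory.
    echo "Job $job_id: stepd sees TMPDIR=${TMPDIR:-/tmp}"
    # The TaskProlog then exports the per-job value for the tasks.
    export TMPDIR=/scratch/$job_id
    echo "Job $job_id: tasks see TMPDIR=$TMPDIR"
}

# simulate_chain ID... : each job "submits" the next one from its own
# environment. Runs in a subshell so the caller's TMPDIR is untouched.
simulate_chain() {
    (
        unset TMPDIR
        local id
        for id in "$@"; do
            run_job "$id"
        done
    )
}
```

Calling `simulate_chain 1 2 3` shows each job starting with its predecessor's TMPDIR, which matches the mismatched JobId in the error message.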

Using `export TMPDIR` in the Prolog does not influence the subsequent stepd, so we can't use it to set TMPDIR correctly. Our workaround is either to pass --export=NONE to each recursive `sbatch` call or to run `unset TMPDIR` before it. I hesitate to call this a bug, but it may be a good idea to document this for users who do call sbatch recursively.
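
A minimal sketch of the `unset TMPDIR` variant of the workaround, for use inside a self-resubmitting batch script. `resubmit` is a hypothetical helper name; the only assumption is that `sbatch` is on the PATH. The unset happens in a subshell, so the running job keeps its own TMPDIR and only the submitted job starts without one.

```shell
#!/bin/bash
# resubmit ARGS...
# Submit the next job without passing on this job's TMPDIR.
# Hypothetical helper; ARGS are handed to sbatch unchanged.
resubmit() {
    (
        # Drop TMPDIR only inside this subshell, so the current job's
        # own TMPDIR is still set after the call returns.
        unset TMPDIR
        sbatch "$@"
    )
}

# Example call from within the batch script (left commented out here):
# resubmit myjob.slurm
```

The alternative mentioned above, `sbatch --export=NONE myjob.slurm`, drops the submitting environment altogether rather than just TMPDIR, so it suits scripts that do not rely on other inherited variables.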