| Summary: | Unable to create TMPDIR | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jeff Tan <jeffetan> |
| Component: | Other | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | da |
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Slinky Site: | --- |
Description
Jeff Tan
2013-10-29 19:17:33 MDT
Another detail I'd neglected to mention: TMPDIR is set to /scratch/<cluster>/jobs/$SLURM_JOB_ID in the TaskProlog:
echo export TMPDIR=/scratch/barcoo/jobs/$SLURM_JOB_ID
while the subdirectory is created beforehand by the Slurmd Prolog without referring to the TMPDIR env var:
JOBDIR=/scratch/barcoo/jobs/$SLURM_JOB_ID
mkdir -p $JOBDIR
chown ..
chmod ..
One last but crucial detail: this error does not come up for every job, only for a small percentage.

Hi, do you still see this error? Could you please update the status of this ticket, as it has been open for a long time.
David

Closing, no follow-up.
David

Apologies for bringing this in very late, but I think we have confirmation of what's going on:
This pertains to sbatch jobs that recursively submit further sbatch jobs from the same script. To recap, we see errors from _make_tmpdir in stepd because stepd tries to create $TMPDIR while running as the user, and the filesystem is not world-writeable. We get the Prolog script (which runs as root) to `mkdir $TMPDIR` instead. Normally this avoids the error messages from stepd ("Unable to create TMPDIR .. Permission denied"), since the Prolog script has already created the directory. Unfortunately, since a job inherits the submitting environment by default, it receives the $TMPDIR associated with the predecessor batch script, e.g.,
# sbatch myjob.slurm
Job 1: defaults to TMPDIR=/tmp
+--> set by TaskProlog: TMPDIR=/scratch/1
+--> invokes sbatch myjob.slurm recursively before it exits
Job 2: inherited: TMPDIR=/scratch/1
+--> TaskProlog sets TMPDIR=/scratch/2
+--> invokes sbatch myjob.slurm again ..
Job 3: inherited: TMPDIR=/scratch/2
..
etc.
Using `export TMPDIR` in the Prolog does not influence the subsequent stepd, so we can't use it to set TMPDIR correctly. Our workaround is to set --export=NONE or `unset TMPDIR` before each recursive `sbatch` call. I hesitate to call this a bug, but it may be a good idea to document this for users who do call sbatch recursively.
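For anyone hitting the same thing, here is a minimal sketch of the `unset TMPDIR` variant of the workaround. The helper name `submit_without_tmpdir` is made up for illustration and is not part of Slurm; it simply runs whatever command it is given in a subshell with TMPDIR removed, so the current job keeps its own TMPDIR while the submitted job does not inherit it.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): run a command with TMPDIR
# removed from its environment, leaving the caller's TMPDIR untouched.
submit_without_tmpdir() {
    # The parentheses start a subshell, so the unset does not affect
    # the rest of the batch script.
    ( unset TMPDIR; "$@" )
}

# Intended use inside the batch script, before it exits
# (myjob.slurm is the script name from the example above):
#   submit_without_tmpdir sbatch myjob.slurm
```

The other option mentioned above, passing --export=NONE to the recursive `sbatch` call, should have the same effect on TMPDIR but also drops every other inherited variable, so which one fits depends on what else the script expects from its environment.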