Ticket 923

Summary: Variable interpolation ($HOME, for example) in --output/--error causes jobs to time out
Product: Slurm Reporter: John Morrissey <jwm>
Component: slurmd Assignee: Danny Auble <da>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da
Version: 14.03.4   
Hardware: Linux   
OS: Linux   
Site: Harvard University Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description John Morrissey 2014-06-26 02:57:45 MDT
Jobs time out when the user tries to use variable interpolation in --output (and probably --error too, but I haven't checked specifically).

For example:

[jwm@sa01:pts/11 ~> sbatch -p priority -t 00:10 --output='$HOME/foo-out5' --wrap=hostname
Submitted batch job 12080049

[jwm@holy-slurm01:pts/4 ~> sudo fgrep 12080049 /var/slurmd/log/slurmctld.log
[2014-06-24T22:07:12.517] _slurm_rpc_submit_batch_job JobId=12080049 usec=65168
[2014-06-24T22:07:13.231] sched: Allocate JobId=12080049 NodeList=holy2a02207 #CPUs=1
[2014-06-24T22:07:29.454] Job 12080049 launch delayed by 16 secs, updating end_time
[2014-06-24T22:08:33.295] Time limit exhausted for JobId=12080049

[jwm@holy2a02207:pts/3 ~> sudo fgrep 12080049 /var/log/messages
Jun 24 22:07:29 holy2a02207 slurmd[18876]: Launching batch job 12080049 for UID 57920
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: switch NONE plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: AcctGatherProfile NONE plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: AcctGatherEnergy NONE plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: AcctGatherInfiniband NONE plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: AcctGatherFilesystem NONE plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: Job accounting gather LINUX plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: task NONE plugin loaded
Jun 24 22:07:29 holy2a02207 slurmstepd[37240]: Checkpoint plugin loaded: checkpoint/none
Jun 24 22:07:30 holy2a02207 slurmstepd[37240]: error: Could not open stdout file /n/home_rc/jwm/$HOME/foo-out5: No such file or directory
Jun 24 22:07:30 holy2a02207 slurmstepd[37240]: error: IO setup failed: No such file or directory
Jun 24 22:07:30 holy2a02207 kernel: slurmstepd[37240] general protection ip:3b7d407551 sp:7fffc661f300 error:0 in libaudit.so.1.0.0[3b7d400000+17000]
Jun 24 22:08:33 holy2a02207 slurmd[18876]: Purging vestigial job script /var/slurmd/spool/slurmd/job12080049/slurm_script
Jun 24 22:08:33 holy2a02207 slurmd[18876]: Job 12080049: timeout: sent SIGTERM to 0 active steps
Jun 24 22:08:33 holy2a02207 slurmd[18876]: Purging vestigial job script /var/slurmd/spool/slurmd/job12080049/slurm_script
Jun 24 22:08:33 holy2a02207 slurmd[18876]: Purging vestigial job script /var/slurmd/spool/slurmd/job12080049/slurm_script
Jun 24 22:08:33 holy2a02207 slurmd[18876]: Purging vestigial job script /var/slurmd/spool/slurmd/job12080049/slurm_script
Jun 24 22:08:33 holy2a02207 slurm_epilog[37375]: Starting epilog for job 12080049 by jwm...
Jun 24 22:08:37 holy2a02207 slurm_epilog[37385]: Epilog for job 12080049 by jwm complete.


Note the GPF in libaudit. We're running a stock 14.03.4, built in a clean CentOS 6 chroot using mock. audit-libs-2.2-2 is installed at build- and run-time, which AFAICT is the latest el6 version.

The previous version of Slurm we ran (2.6.9 plus a bunch of patches) failed these jobs fast in the I/O setup stage, I'm guessing because its stepd didn't GPF at that point and leave the job hanging.
Comment 1 John Morrissey 2014-06-26 03:00:45 MDT
Also (just to be clear), I'm not asking for variable interpolation to work here, but figured I'd put the GPF on your radar, and having the job time out is a non-obvious failure mode for the submitting user.
Comment 2 Danny Auble 2014-06-26 07:00:45 MDT
Hey John, if you drop the single quotes around the output path, or use double quotes instead, it should work.  Single quotes make the shell pass $HOME through literally.

Either

sbatch -p priority -t 00:10 --output=$HOME/foo-out5 --wrap=hostname
sbatch -p priority -t 00:10 --output="$HOME/foo-out5" --wrap=hostname

should do what you would like.

Please reopen if you find differently or if this doesn't suit your needs.
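A quick way to see the difference, as a generic shell sketch (not Slurm-specific; the file name is just the one from this ticket):

```shell
#!/bin/sh
# Single quotes suppress expansion: sbatch receives the literal
# string "$HOME/foo-out5", which slurmstepd later fails to open.
literal='$HOME/foo-out5'

# Double quotes (or no quotes) let the shell expand $HOME before
# sbatch ever sees the argument.
expanded="$HOME/foo-out5"

echo "$literal"    # prints the literal: $HOME/foo-out5
echo "$expanded"   # prints e.g.: /n/home_rc/jwm/foo-out5
```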
Comment 3 Danny Auble 2014-06-26 07:03:41 MDT
John, on the other point you bring up, in my testing I see the job fail immediately.  I don't see it time out.  Are you sure a timeout is happening?
Comment 4 John Morrissey 2014-06-26 09:41:31 MDT
On Thu, Jun 26, 2014 at 07:03:41PM +0000, bugs@schedmd.com wrote:
> --- Comment #3 from Danny Auble <da@schedmd.com> ---
> John, on the other point you bring up, in my testing I see the job fail
> immediately.  I don't see it time out.  Are you sure a timeout is
> happening?

Yes. Affected jobs finish in the TIMEOUT state. For this sample job:

[jwm@holy-slurm01:pts/4 ~> sacct -j 12080049
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
12080049         sbatch   priority   rc_admin          1    TIMEOUT      0:1 

john
Comment 5 John Morrissey 2014-06-26 09:47:00 MDT
On Thu, Jun 26, 2014 at 07:00:45PM +0000, bugs@schedmd.com wrote:
> --- Comment #2 from Danny Auble <da@schedmd.com> ---
> Hey John, if you drop the single quotes around the output path, or use
> double quotes instead, it should work. Single quotes make the shell
> pass $HOME through literally.
> 
> Either
> 
> sbatch -p priority -t 00:10 --output=$HOME/foo-out5 --wrap=hostname
> sbatch -p priority -t 00:10 --output="$HOME/foo-out5" --wrap=hostname
> 
> should do what you would like.

Right. The original job submission (from our end user) was more complex than
that, so I distilled it down to a simple reproducible case.

In their case, they were using an sbatch script with #SBATCH comments
specifying --output and --error, so there's no way for the shell to do the
interpolation.

It's a mistake or wishful thinking on their part; the single quoting was
just a straightforward way for me to pass a literal $HOME into the job,
the same way their sbatch script would have.
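To illustrate the user's situation with a hypothetical minimal script (the sed extraction below just mimics how the directive value reaches sbatch unexpanded; it is not how sbatch itself parses):

```shell
#!/bin/sh
# Write a minimal batch script containing an #SBATCH directive.
# sbatch parses these comment lines itself; the shell never sees
# them, so $HOME is never expanded.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --output=$HOME/foo-out
hostname
EOF

# The value sbatch would read for --output: the literal string.
value=$(sed -n 's/^#SBATCH --output=//p' "$script")
echo "$value"    # prints: $HOME/foo-out
rm -f "$script"
```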

Make sense?

john
Comment 6 Danny Auble 2014-06-26 10:11:46 MDT
Ah, what you've described here turns out to be a very different scenario :).

If I do what you describe, I indeed get the same result (the job can't open the output file, no matter which quotes I use), but I still can't reproduce the timeout.

Jun 24 22:07:30 holy2a02207 kernel: slurmstepd[37240] general protection ip:3b7d407551 sp:7fffc661f300 error:0 in libaudit.so.1.0.0[3b7d400000+17000]

It appears something bad happened, though.  Do you see any core files or the like?  I am guessing the slurmstepd died, and that is the reason for the timeout.

Does this happen every time or is it hard to reproduce?


On the variable interpolation: there is currently nothing that expands environment variables in #SBATCH options.  That could be a feature request, but I don't see it happening very soon.
Comment 7 John Morrissey 2014-06-30 08:30:17 MDT
On Thu, Jun 26, 2014 at 10:11:46PM +0000, bugs@schedmd.com wrote:
> If I do what you describe, I indeed get the same result (the job can't
> open the output file, no matter which quotes I use), but I still can't
> reproduce the timeout.
> 
> Jun 24 22:07:30 holy2a02207 kernel: slurmstepd[37240] general protection
> ip:3b7d407551 sp:7fffc661f300 error:0 in libaudit.so.1.0.0[3b7d400000+17000]
> 
> It appears something bad happened, though.  Do you see any core files or
> the like?  I am guessing the slurmstepd died, and that is the reason for
> the timeout.
> 
> Does this happen every time or is it hard to reproduce?

I poked at it some more, and it seems some of our compute nodes do this
every time this job is submitted, but others are fine with it (i.e., the job
~immediately becomes FAILED because of the I/O redirection and doesn't
TIMEOUT).

I can't discern any rhyme or reason for it; these compute nodes are all
running an identical, automatically installed OS (a CentOS 6 snapshot from a
few months ago) on more or less identical hardware.

AFAIK, the process should be getting a SIGSEGV and would therefore generate
a core, but I don't see any in SlurmdSpoolDir or in the working directory of
the job.

john
Comment 8 Danny Auble 2014-06-30 08:43:03 MDT
Interesting.  Could you look and see what the ulimit -c is on the problem nodes?  Perhaps it is too low to create a core file.  Did you happen to check the log dir of the slurmd?
Comment 9 Danny Auble 2014-07-03 06:29:49 MDT
John do you have any more on this?
Comment 10 John Morrissey 2014-07-03 07:35:42 MDT
> --- Comment #9 from Danny Auble <da@schedmd.com> ---
> John do you have any more on this?

Haven't had a chance to dig more into this, but I'll try to make time for it
early next week.

john
Comment 11 John Morrissey 2014-07-09 04:26:13 MDT
I played around a bunch with this, and found:

- Red Hat's implementation of LSB's daemon() (in /lib/lsb/init-functions) sets the core dump size to 0 (ulimit -c 0) by default, unless overridden with DAEMON_COREFILE_LIMIT. I'm not sure where the proper place to set that is, but setting it in the environment that calls the init script works fine.

- Even after removing the core dump limit and submitting a job to reproduce the problem on a node manifesting the GPF:

[jwm@holy2a02207:pts/15 /tmp> sudo env DAEMON_COREFILE_LIMIT=unlimited /etc/init.d/slurm start 
starting slurmd:                                           [  OK  ]
[jwm@holy2a02207:pts/15 /tmp> sudo fgrep core /proc/$(pidof slurmd)/limits
Max core file size        unlimited            unlimited            bytes
[jwm@holy2a02207:pts/15 /tmp> sbatch -p general -w holy2a02207 --output '$HOME/foo' --wrap=hostname

I couldn't find cores anywhere (the PWD during job submission, or SlurmdSpoolDir; SlurmdLogFile is unset in our environment).

- I tried attaching gdb and had it trace various combinations of forks and execs, but the most useful output I got was:

[jwm@holy2a02207:pts/15 ~> sudo gdb /usr/sbin/slurmd $(pidof slurmd)
0x0000003b7ac0e84d in accept () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-14.03.4-2fasrc00.el6.x86_64
(gdb) set follow-fork-mode child
(gdb) set follow-exec-mode new
(gdb) cont
Continuing.
[New Thread 0x2b44e5e53700 (LWP 27373)]
[New Thread 0x2b44e6771700 (LWP 27374)]
[Thread 0x2b44e6771700 (LWP 27374) exited]
[New process 27376]
[Thread debugging using libthread_db enabled]
process 27376 is executing new program: /usr/sbin/slurmstepd

Program exited normally.


[jwm@holy2a02207:pts/15 ~> sudo gdb /usr/sbin/slurmd $(pidof slurmd)
0x0000003b7ac0e84d in accept () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-14.03.4-2fasrc00.el6.x86_64
(gdb) set follow-exec-mode same 
(gdb) set follow-fork-mode child
(gdb) cont
Continuing.
[New Thread 0x2b44e5e53700 (LWP 27447)]
[New Thread 0x2b44e6771700 (LWP 27448)]
[Thread 0x2b44e6771700 (LWP 27448) exited]
[New process 27450]
[Thread debugging using libthread_db enabled]
process 27450 is executing new program: /usr/sbin/slurmstepd
[Thread debugging using libthread_db enabled]

Program exited normally.

despite slurmstepd GPFing:

Jul  9 12:07:26 holy2a02207 kernel: slurmstepd[27721] general protection ip:3b7d407551 sp:7fff79d9dc40 error:0 in libaudit.so.1.0.0[3b7d400000+17000]

- I tried starting slurmd in debug mode (slurmd -Dv), and couldn't reproduce the GPF with the triggering job.
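The daemon() behavior from the first point above can be sketched roughly as follows (the ulimit line paraphrases what Red Hat's init functions do; DAEMON_COREFILE_LIMIT is the actual override variable, but the rest is a simplification):

```shell
#!/bin/sh
# Roughly what RHEL's daemon() does before starting a service:
# default the soft core-dump limit to 0 unless DAEMON_COREFILE_LIMIT
# is set in the environment that invokes the init script.
corelimit=${DAEMON_COREFILE_LIMIT:-0}

# Apply it in a subshell so the caller's own limit is untouched.
effective=$( (ulimit -S -c "$corelimit" 2>/dev/null; ulimit -S -c) )
echo "soft core limit the daemon would inherit: $effective"
```

With DAEMON_COREFILE_LIMIT unset, this prints 0, which is why no cores appear even though slurmstepd is dying.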


Danny, unless you want to pursue this further (together, of course) or can think of some way of getting a backtrace when the GPF happens, I'm content to back slowly out of the rabbit hole and close this bug. The job is clearly misbehaving, the failure mode isn't catastrophic, and this hasn't been the most trivial thing to debug.
Comment 12 Danny Auble 2014-07-09 10:55:43 MDT
I would just keep an eye out for cores.

I would expect to handle the core limit in /etc/security/limits.conf, or in your init script.
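For the limits.conf route, a fragment like the following is the usual form (note that limits.conf is applied by pam_limits to login sessions, so a daemon started at boot still needs the init-script override as well):

```
# /etc/security/limits.conf
# <domain>  <type>  <item>  <value>
*           soft    core    unlimited
*           hard    core    unlimited
```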

But in any case I would expect gdb to find it.  Strange it doesn't.  I have never used 

(gdb) set follow-fork-mode child
(gdb) set follow-exec-mode new

I am guessing it will follow the slurmstepd when it is forked.

When you started slurmd with -D, did things just work as normal?

That suggests to me that something in the environment is responsible.

Could you verify the slurmstepd has an unlimited core size?

I am guessing it does, but just to make sure.

Let's try a couple more things; I would at least like to understand what is causing the GPF, since it is easy to reproduce.

If I could log onto a node with you as root that would be helpful.  Is that possible?
Comment 13 Danny Auble 2014-07-16 09:42:00 MDT
John, have you been able to get anything else on this?
Comment 14 Danny Auble 2014-07-21 04:54:39 MDT
Please reopen if you have anything more on the subject.  I would like to know what libaudit is doing, or not doing, in this case (it appears to be outside of Slurm, though; at least, I'm not familiar with libaudit).
Comment 15 Danny Auble 2014-07-21 04:58:31 MDT
Obviously the slurmstepd links against libaudit; I'm just wondering why it doesn't appear to like this particular case.  I am personally using libaudit 2.3.2 on Ubuntu 14.04.
Comment 16 John Morrissey 2014-08-12 03:59:23 MDT
Hi Danny, sorry, I haven't been able to make time for this. I'm leaving Harvard in the next couple of days, but I'll pass this bug along to Paul Edmon.

-john