| Summary: | Variable interpolation ($HOME, for example) in --output/--error causes jobs to time out | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | John Morrissey <jwm> |
| Component: | slurmd | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | da |
| Version: | 14.03.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | | |

Description
John Morrissey
2014-06-26 02:57:45 MDT
Also (just to be clear), I'm not asking for variable interpolation to work here, but figured I'd put the GPF on your radar, and having the job time out is a non-obvious failure mode for the submitting user.

Danny Auble:

Hey John, if you drop the single quotes, or use double quotes, around the output dir it should work. Putting the single quotes there means literal. Either of

```
sbatch -p priority -t 00:10 --output=$HOME/foo-out5 --wrap=hostname
sbatch -p priority -t 00:10 --output="$HOME/foo-out5" --wrap=hostname
```

should do what you would like. Please reopen if you find differently or if this doesn't suit your needs.

Danny Auble:

John, on the other point you bring up, in my testing I see the job fail immediately. I don't see it time out. Are you sure that a timeout is happening?

John Morrissey:

On Thu, Jun 26, 2014 at 07:03:41PM +0000, bugs@schedmd.com wrote:
> --- Comment #3 from Danny Auble <da@schedmd.com> ---
> John, on the other point you bring up, in my testings I see the job fail
> immediately. I don't see it timeout. Are you sure that a timeout is
> happening?

Yes. Affected jobs finish in the TIMEOUT state. For this sample job:

```
[jwm@holy-slurm01:pts/4 ~> sacct -j 12080049
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12080049         sbatch   priority   rc_admin          1    TIMEOUT      0:1
```

john

John Morrissey:

On Thu, Jun 26, 2014 at 07:00:45PM +0000, bugs@schedmd.com wrote:
> --- Comment #2 from Danny Auble <da@schedmd.com> ---
> Hey John, if you drop the single quotes or use double quotes from around
> the output dir it should work. Putting the single quotes there means
> literal.

Right. The original job submission (from our end user) was more complex than that, so I distilled it down to a simple reproducible case.
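The quoting distinction Danny describes can be seen without Slurm at all. A minimal sketch (echo stands in for sbatch here, purely for illustration):

```shell
# Single quotes suppress expansion in the submitting shell, so the
# receiving command sees the literal string '$HOME' and later tries to
# create a directory literally named "$HOME". Double quotes (or no
# quotes) let the shell expand $HOME before the command ever runs.
literal='--output=$HOME/foo-out5'      # what the failing job passed
expanded="--output=$HOME/foo-out5"     # what the working examples pass
echo "$literal"
echo "$expanded"
```

The first echo prints the unexpanded `$HOME` verbatim; the second prints the submitter's real home directory path.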
In their case, they were using an sbatch script with #SBATCH comments specifying --output and --error, so there's no way for the shell to do the interpolation. It's a mistake or wishful thinking on their part, and the single quoting was a straightforward way of passing $HOME into the job in the same way the sbatch script would have in this case. Make sense?

john

Danny Auble:

Ah, in this case what you have stated turns out to be a very different scenario :). If I do what you state, I indeed get the same result of the job not being able to create the directory no matter which quotes I give it, but I still can't get the timeout.

```
Jun 24 22:07:30 holy2a02207 kernel: slurmstepd[37240] general protection ip:3b7d407551 sp:7fffc661f300 error:0 in libaudit.so.1.0.0[3b7d400000+17000]
```

It appears something bad happened, though. Do you see any core files or such on the matter? I am guessing the slurmstepd died and that is the reason for the timeout. Does this happen every time, or is it hard to reproduce?

On the variable interpolation, there is currently nothing in place to make the #SBATCH options look at $ENVVARS. That could be a feature request, but I don't see it happening very soon.

John Morrissey:

On Thu, Jun 26, 2014 at 10:11:46PM +0000, bugs@schedmd.com wrote:
> If I do what you state I indeed get the same results of job not being able to
> create the directory no matter which quotes I give it, but I still can't get
> the timeout.
>
> Appears something bad happened though. Do you see any core files or such on
> the matter? I am guessing the slurmstepd died and that is the reason for the
> timeout.
>
> Does this happen every time or is it hard to reproduce?
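Since #SBATCH directives are parsed by sbatch itself and never pass through a shell, one workaround is Slurm's own filename replacement symbols instead of environment variables. A sketch of such a job script (the path prefix is a made-up example; per the sbatch documentation, %u expands to the user name and %j to the job ID when the files are opened):

```shell
#!/bin/bash
# Hypothetical job script: instead of the unexpandable $HOME, use
# sbatch's filename patterns, which Slurm substitutes itself.
#SBATCH --output=/n/home/%u/job-%j.out   # %u = user name, %j = job id
#SBATCH --error=/n/home/%u/job-%j.err
hostname
```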
I poked at it some more, and it seems some of our compute nodes do this every time this job is submitted, but others are fine with it (i.e., the job ~immediately becomes FAILED because of the I/O redirection and doesn't TIMEOUT). I can't discern any rhyme or reason for it; these compute nodes are all running an identical, automatically installed OS (a CentOS 6 snapshot from a few months ago) on more or less identical hardware.

AFAIK, the process should be getting a SIGSEGV and would therefore generate a core, but I don't see any in SlurmdSpoolDir or in the working directory of the job.

john

Danny Auble:

Interesting. Could you look and see what ulimit -c is on the problem nodes? Perhaps it is too low to create a core file. Did you happen to check the log dir of the slurmd?

Danny Auble:

John, do you have any more on this?

John Morrissey:

> --- Comment #9 from Danny Auble <da@schedmd.com> ---
> John do you have any more on this?
Haven't had a chance to dig more into this, but I'll try to make time for it
early next week.
john
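Danny's core-limit question above is quick to answer on a node. A minimal Linux sketch; note that a daemon's limits can differ from an interactive shell's, so reading /proc for the running process is the more telling check:

```shell
# Soft core-file limit of the current shell:
ulimit -c
# The same limit for a running process, read from /proc (substitute the
# slurmd PID for "self" on a real compute node):
grep 'Max core file size' /proc/self/limits
```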
John Morrissey:

I played around a bunch with this, and found:

- RedHat's implementation of LSB's daemon() (in /lib/lsb/init-functions) sets the core dump size to 0 (ulimit -c 0) by default, unless overridden with DAEMON_COREFILE_LIMIT. I'm not sure where the proper place is to set that, but setting it in the environment calling the init script works fine.

- Even after removing the core dump limit and submitting a job to reproduce the problem on a node manifesting the GPF:

```
[jwm@holy2a02207:pts/15 /tmp> sudo env DAEMON_COREFILE_LIMIT=unlimited /etc/init.d/slurm start
starting slurmd:                                           [  OK  ]
[jwm@holy2a02207:pts/15 /tmp> sudo fgrep core /proc/$(pidof slurmd)/limits
Max core file size        unlimited            unlimited            bytes
[jwm@holy2a02207:pts/15 /tmp> sbatch -p general -w holy2a02207 --output '$HOME/foo' --wrap=hostname
```

  I couldn't find cores anywhere (PWD during job submission or SlurmdSpoolDir; SlurmdLogFile is unset in our environment).

- I tried attaching gdb and had it trace various combinations of forks and execs, but the most useful output I got was:

```
[jwm@holy2a02207:pts/15 ~> sudo gdb /usr/sbin/slurmd $(pidof slurmd)
0x0000003b7ac0e84d in accept () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-14.03.4-2fasrc00.el6.x86_64
(gdb) set follow-fork-mode child
(gdb) set follow-exec-mode new
(gdb) cont
Continuing.
[New Thread 0x2b44e5e53700 (LWP 27373)]
[New Thread 0x2b44e6771700 (LWP 27374)]
[Thread 0x2b44e6771700 (LWP 27374) exited]
[New process 27376]
[Thread debugging using libthread_db enabled]
process 27376 is executing new program: /usr/sbin/slurmstepd
Program exited normally.
```

  and, with follow-exec-mode same:

```
[jwm@holy2a02207:pts/15 ~> sudo gdb /usr/sbin/slurmd $(pidof slurmd)
0x0000003b7ac0e84d in accept () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-14.03.4-2fasrc00.el6.x86_64
(gdb) set follow-exec-mode same
(gdb) set follow-fork-mode child
(gdb) cont
Continuing.
[New Thread 0x2b44e5e53700 (LWP 27447)]
[New Thread 0x2b44e6771700 (LWP 27448)]
[Thread 0x2b44e6771700 (LWP 27448) exited]
[New process 27450]
[Thread debugging using libthread_db enabled]
process 27450 is executing new program: /usr/sbin/slurmstepd
[Thread debugging using libthread_db enabled]
Program exited normally.
```

  despite slurmstepd GPFing:

```
Jul 9 12:07:26 holy2a02207 kernel: slurmstepd[27721] general protection ip:3b7d407551 sp:7fff79d9dc40 error:0 in libaudit.so.1.0.0[3b7d400000+17000]
```

- I tried starting slurmd in debug mode (slurmd -Dv), and couldn't reproduce the GPF with the triggering job.

Danny, unless you want to pursue this further (together, of course) or can think of some way of getting a backtrace when the GPF happens, I'm content to back slowly out of the rabbithole and close this bug. The job is clearly misbehaving, the failure mode isn't catastrophic, and this has not been the most trivial thing to debug.

Danny Auble:

I would just keep an eye out for cores. I would expect to handle the core limit in /etc/security/limits.conf or in your init script. But in any case I would expect gdb to find it; strange that it doesn't. I have never used

```
(gdb) set follow-fork-mode child
(gdb) set follow-exec-mode new
```

but I am guessing it will follow the slurmstepd when it is forked. When you started slurmd -D, things just worked like normal? That suggests to me that something is happening in the environment. Could you verify the slurmstepd has an unlimited core size? I am guessing it does, but just to make sure.

Let's try a couple more things; I would at least like to understand what is causing the GPF, since it is easy to reproduce. If I could log onto a node with you as root, that would be helpful. Is that possible?

Danny Auble:

John, have you been able to get anything else on this?

Danny Auble:

Please reopen if you have anything more on the subject. I would like to know what libaudit is doing, or not doing, in this case (it appears to be outside of Slurm, though; at least I don't know about libaudit).
Obviously the slurmstepd is linking to libaudit; I am just wondering why it doesn't appear to like this particular case. I am personally using libaudit 2.3.2 on Ubuntu 14.04.

John Morrissey:

Hi Danny, sorry, I haven't been able to make time for this. I'm leaving Harvard in the next couple of days, but I'll pass this bug along to Paul Edmon.

-john
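A side note on the vanishing core files discussed above: on CentOS 6 the kernel may be configured to hand cores to a pipe handler (e.g. abrt's ccpp hook) rather than writing them to the working directory, in which case an unlimited ulimit still produces no core file where you'd expect one. A quick check (a sketch; run it on the affected node):

```shell
# If the pattern begins with '|', cores are piped to a handler program
# instead of being written as files; otherwise the pattern shows where
# (and under what name) core files would land.
cat /proc/sys/kernel/core_pattern
```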