Created attachment 4045 [details]
slurmd log

Hello,

We have a "shared" partition on cori where users can request just a single core on our haswell nodes, which means up to 32 jobs can run independently on each node. Since moving to slurm 16.05.9 (though there may have been some of this in 16.05.8 as well), we are seeing a lot of nodes dropping off (maybe 2-3 per day) owing to "batch job completion failure". Looking in the slurmd log, I see that dup2() is failing and slurmstepd is failing to send a message, followed by loss of the starting job (and marking down of the node). In looking at the accounting data it seems that in each of these cases > 20 jobs are trying to start (that is an estimate, not a hard number of job starts).

[2017-02-13T17:34:45.250] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.252] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.425] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.430] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.591] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.596] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.597] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.600] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.640] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.643] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.865] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.867] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:45.870] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.873] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:45.876] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.879] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:45.882] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.884] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:46.977] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:46.981] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error

I'm pretty sure that we should not have exhausted the overall open file limit on these systems, and so dup2() must be getting an invalid file descriptor to work with (i.e., perhaps the pipe for stdin isn't getting set up correctly). I'll attach the slurmd logs.

-Doug
Doug -

A few questions for you based on some discussion with Dominik earlier:

- Is sbcast being used on these nodes? I don't see any evidence of it in the slurmd log from the past day, but there are a few quirks there that we're looking into fixing, although I don't see any evidence they're causing this at the moment.

- Can you attach the slurmctld log from around the time that this dup2 error message hit?

[2017-02-13T17:34:45.425] error: dup2 over STDIN_FILENO: Bad file descriptor

I'm just curious what else is going on with the system at that point in time.
Hi Tim,

I'll send the slurmctld logs after lunch. We are only seeing this on our shared nodes -- i.e., ones that run multiple jobs per node and never across nodes. The issue seems to happen only when the node is relatively idle and is starting 10s of jobs at the same time (i.e., a user submits a large job array). We see this on both cori and edison, perhaps 1-3 times per day.

-Doug
Created attachment 4065 [details]
patch

Hi,

I have some idea of what is going on here, but to be sure could you apply this patch? If it doesn't help, it will at least give us some extra info.

Dominik
Hi Dominik,

I'll be traveling for a few days, but we have a scheduled maintenance next Tuesday. I'll get this in place then and will let you know how it goes.

Thanks,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
OK, this patch is queued up to be installed during the maintenance tomorrow. Will let you know how it goes. -Doug
I've found a few likely sources of this issue while investigating some problems Dominik had discovered; I'm working on a patch for the 17.02 branch now that should be back-portable to 16.05 as well if necessary.

Just FYI, although I'm not sure if you've been testing it, there's a late RPC change that'll be in 17.02 (differing from 17.02-rc1) to support files over 2GB. I'm not sure if you've run into that before, but we wanted to make sure the RPC change happened now rather than needing to wait until 17.11.

- Tim
Great. We've just gotten started with 17.02 on gerty -- I don't think it's really up yet owing to a bug (task plugin ordering between cray/cgroup) that James has filed or will file.

Regarding this bug, I did apply the provided patch but found that we are still having this issue.

-Doug
(In reply to Doug Jacobsen from comment #8)
> Great. We've just got started with 17.02 on gerty -- I don't think it's
> really up yet owing to a bug (task plugin ordering between cray/cgroup)
> that James will or has filed.

This was fixed with commit e5fb93d49b711d; I haven't seen a bug filed from NERSC on this though.

> Regarding this bug, I did apply the provided patch but found that we are
> still having this issue.
> -Doug

You don't happen to have any output from that? The patch was just adding an extra diagnostic message in hopes of narrowing down the issue; there was no functional change there.
Created attachment 4087 [details]
recent slurmd log with debugging output
I've attached the relevant log; here is the relevant-looking data:

[2017-02-21T21:28:14.773] [3823118] Created file /var/opt/cray/alps/spool/status6056184792584719886
[2017-02-21T21:28:14.775] [3823118] task_p_pre_launch: Using sched_affinity for tasks
[2017-02-21T21:28:14.806] Launching batch job 3823108 for UID 53069
[2017-02-21T21:28:14.806] [3823125] core_spec/cray: init
[2017-02-21T21:28:14.809] error: dup2 over STDIN_FILENO to_stepd[0]=0: Bad file descriptor
[2017-02-21T21:28:14.810] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2017-02-21T21:28:14.810] Launching batch job 3823127 for UID 53069
[2017-02-21T21:28:14.812] Could not launch job 3823108 and not able to requeue it, cancelling job
[2017-02-21T21:28:14.812] error: dup2 over STDIN_FILENO to_stepd[0]=0: Bad file descriptor
[2017-02-21T21:28:14.814] [3823125] Created file /var/opt/cray/alps/spool/status6056184792584719893
[2017-02-21T21:28:14.814] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-21T21:28:14.814] Launching batch job 3823129 for UID 53069
[2017-02-21T21:28:14.816] [3823125] task_p_pre_launch: Using sched_affinity for tasks
[2017-02-21T21:28:14.851] Launching batch job 3823122 for UID 53069
Hrm... the more I look through the bcast code, the more convinced I am that it's not the source of the issue. I have a series of patches to clean up the code some, but haven't found any conclusive issue there, and will wait until after the 17.02 release to commit them.

Looking through the log you've provided, I don't see the errors Dominik had added with his patch, and am forced to conclude the issue is somewhere else. Do any of your local patches do anything with the slurmd file descriptors? As best I can tell at the moment, slurmd must be racing something closing out the pipe early -- there's a relatively small window between setting up the pipe and calling dup2(), and none of the other error messages from that section are hitting.

It does look like you have the Shifter SPANK plugin loaded, although quickly skimming through that code I don't believe it should be causing this.
This just happened on another node -- same signature as typical, however this time I finally thought to look at the slurmd's open file descriptors:

nid02274:~ # ls -l /proc/52826/fd
total 0
l-wx------ 1 root root 64 Feb 25 06:26 0 -> /var/spool/slurmd/cred_state.new (deleted)
lrwx------ 1 root root 64 Feb 25 06:26 1 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:26 2 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:26 3 -> socket:[784209]
lrwx------ 1 root root 64 Feb 25 06:26 4 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:26 5 -> socket:[241869]
l-wx------ 1 root root 64 Feb 25 06:26 6 -> /var/run/slurmd.pid
l-wx------ 1 root root 64 Feb 25 06:26 8 -> /var/spool/slurmd/nid02274.log
lr-x------ 1 root root 64 Feb 25 06:26 9 -> /proc/job
nid02274:~ #

As compared to a typical slurmd:

nid02273:~ # ls -l /proc/52884/fd
total 0
lrwx------ 1 root root 64 Feb 25 06:27 0 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 1 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 2 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 3 -> socket:[793545]
lrwx------ 1 root root 64 Feb 25 06:27 4 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 5 -> socket:[228273]
l-wx------ 1 root root 64 Feb 25 06:27 6 -> /var/run/slurmd.pid
l-wx------ 1 root root 64 Feb 25 06:27 7 -> /var/spool/slurmd/nid02273.log
lr-x------ 1 root root 64 Feb 25 06:27 9 -> /proc/job
nid02273:~ #

Perhaps this is related to _forkexec_slurmstepd happening while save_cred_state is running?
I think I see the issue: in slurm 16.05.9, the cleanup segment of save_cred_state() has:

if (cred_fd > 0)
    close(cred_fd);

instead of checking cred_fd >= 0. Looks like this was fixed in slurm 17.02 already (thank you coverity!):

https://github.com/SchedMD/slurm/commit/0d4773a95129e45e8e3d9892e23432fe26fea60b
I suppose the question now is why (rarely) save_cred_state() is getting the new cred state on fd 0. My read of save_cred_state() is that it should run fairly frequently (especially on our shared nodes), but if it happens to open the new cred state as fd 0 (stdin), then with the 16.05 logic it will never be closed, and this will occur on the next fork.
I suppose something must be racing a close of stdin (/dev/null typically) against save_cred_state.
(In reply to Doug Jacobsen from comment #18)
> i suppose something must be racing a close of stdin (/dev/null typically)
> and save_cred_state.

Yeah, that is still the main question from my perspective.

Fixing that "leak" may actually make the problem worse, as it'd then let other files continue to land on fd 0, possibly screwing up all successive jobs on the node. _forkexec_slurmstepd() implicitly assumes that fds 0/1/2 are safe to close; if part of the pipe communicating with slurmd lands on those, then the stepd launch will fail.

Do you mind sending me a current set of the patches you're running on Cori? David had outlined them previously; I'd like to look through those and make sure nothing looks awry. I know there's some internal content in them; if you want to email them to me directly that's fine.

And are there any other SPANK plugins in use besides Shifter? The Shifter plugin itself looked fine, assuming you're running a version in sync with your github repo.
I don't mind sending our patches privately (many come from SchedMD anyway), but this summary might be informative:

[dmj@dmjdev deploy]$ for i in $(cat cori.patches); do echo $i; egrep '^---' ../patches/$i ; echo; echo; done
nersc_job_comp.patch
--- a/src/plugins/jobcomp/nersc/jobcomp_nersc.c 1969-12-31 16:00:00.000000000 -0800
--- a/src/plugins/jobcomp/nersc/Makefile.am 1969-12-31 16:00:00.000000000 -0800

nersc_jobcomp_configure.patch
--- slurm-15.08.0-0rc1/configure.ac 2015-08-20 16:19:40.000000000 -0700
--- slurm-15.08.0-0rc1/src/plugins/jobcomp/Makefile.am 2015-05-21 13:20:09.000000000 -0700

nersc_slurm_errno_16.05.patch
--- a/slurm/slurm_errno.h 2016-05-31 14:42:39.000000000 -0700
--- a/src/common/slurm_errno.c 2016-05-31 14:42:39.000000000 -0700

nhc_tracing_16.05.patch
--- a/slurm/slurm.h.in 2016-05-31 14:42:39.000000000 -0700
--- a/src/common/read_config.c 2016-05-31 14:42:39.000000000 -0700
--- a/src/plugins/select/cray/select_cray.c 2016-05-31 14:42:39.000000000 -0700

set_slurmd_service_limits.patch
--- a/etc/slurmd.service.in 2016-07-13 12:52:39.590940453 -0700

hack_around_right_bracket_2.patch
--- a/src/slurmctld/port_mgr.c 2016-07-13 12:52:41.114274791 -0700

dmj_priority_mgmt.patch
--- a/src/plugins/sched/backfill/backfill.c 2017-01-31 11:55:44.000000000 -0800

srun_start_scalability.patch
--- a/src/srun/libsrun/launch.c

slurm_16.05.6_bug3257_tempfix.patch
--- a/src/common/pack.h

disable_kmem_cgroup_3b5befc9e8565.patch
---
--- a/src/common/xcgroup_read_config.c
--- a/src/common/xcgroup_read_config.h
--- a/src/plugins/task/cgroup/task_cgroup_memory.c

slurm-16.05.9-bug3446-f6d42fdbb293.patch
---
--- a/src/slurmctld/job_mgr.c
--- a/src/slurmctld/job_scheduler.c
--- a/src/slurmctld/node_scheduler.c
--- a/src/slurmctld/power_save.c
--- a/src/slurmctld/slurmctld.h

slurm-16.05.9-bug3461_bd7504fbe098.patch
---
--- a/src/plugins/burst_buffer/cray/burst_buffer_cray.c
--- a/src/slurmctld/job_mgr.c

slurm-16.05.9-bug3466-sharedstartup-dup2-issue.patch
--- a/src/slurmd/slurmd/req.c
[dmj@dmjdev deploy]$

Other than the provided patch for 3466 we don't have anything that should directly patch slurmd, I think.

Regarding SPANK plugins, I have three others loaded, two of which I have the source for:

[dmj@dmjdev deploy]$ cat ../config/ansible/roles/smw_slurm_config/files/cori_plugstack.conf | awk '{print $2}'
/usr/lib64/shifter/shifter_slurm.so
/usr/lib/nersc-slurm-plugins/vtune.so
/opt/cray/pe/atp/libAtpSLaunch.so
/usr/lib/nersc-slurm-plugins/zonesort.so
[dmj@dmjdev deploy]$

Shifter you've seen. zonesort is attached on 3188. I can send you vtune (about to be replaced with a more capable, generalized perf loader plugin). I don't have Cray's libAtpSLaunch source.
However, I'd guess that Cray's ATP plugin isn't to blame:

dmj@cori08:~> objdump -S /opt/cray/pe/atp/libAtpSLaunch.so | grep call | grep open
dmj@cori08:~> objdump -S /opt/cray/pe/atp/libAtpSLaunch.so | grep call | grep close
dmj@cori08:~>
vtune is similarly quiet on the open/close front:

[dmj@dmjdev nerscSpank]$ egrep 'open|close' spank_vtune.c
[dmj@dmjdev nerscSpank]$

spank_zonesort.c does open and close some file descriptors, but all the opens and closes look safe to me.
Thanks for the summary. The only one I'd be curious about at the moment based on that listing is disable_kmem_cgroup_3b5befc9e8565.patch. (Does that patch something now available in 17.02? I vaguely recall some work with KMem limits getting merged at one point.) Although if you're willing to send them all over, I'll add them to our current file for reference. But I'm guessing the fault lies elsewhere, and will have to keep perusing.

One thing that may help, although I'm not sure if you're willing to run it, would be to replace all the close() calls throughout the source with an instrumented version, and have it dump stack traces out whenever fd 0 is closed. If you'd be willing to run with that for a day or so, I can work out the best way to make that happen.

Now that I think about it -- do you have anything exciting in the prolog/epilog at this point? I know at one point you were SIGHUP'ing slurmd to rotate the logs out; if you still have something like that in place, that could help explain the intermittent timing.
[dmj@dmjdev files]$ cat cori_nodeprolog.sh
#!/bin/bash
count=0
while [[ "$count" -le 10 ]]; do
    output=$(id "$SLURM_JOB_USER")
    retcode=$?
    if [[ "$retcode" == 0 ]]; then
        break
    fi
    pid=$(pgrep -f /usr/sbin/sssd)
    if [[ "$pid" -gt 1 ]]; then
        kill -12 $pid ## send sigusr2 to sssd to force it online
        echo "nersc-nodeprolog|$SLURM_JOB_ID|Sending SIGUSR2 to ssd ($pid), try number $count" > /dev/kmsg
    fi
    count=$(( $count + 1 ))
    if [[ "$count" -gt 1 ]]; then
        sleep 1
    fi
done
groups "$SLURM_JOB_USER" >/dev/null
echo "nersc-nodeprolog|$SLURM_JOB_ID|$(hostname)|$(date +%Y-%m-%dT%H:%M:%S)" > /dev/kmsg
exit 0
[dmj@dmjdev files]$

[dmj@dmjdev files]$ cat cori_nodeepilog.sh
#!/bin/bash
HOSTNAME=$(printf "nid%05d" $(cat /proc/cray_xt/nid))
TIMESTAMP=$(date +%Y-%m-%dT%H:%M:%S)

### put any node modification and/or cleanup tasks here
/usr/sbin/unsetupRoot

### send slurmd logs to rabbitmq; does a SIGHUP
# backup slurmd log, rabbitmq_slurmdlog is not working perfectly all the time
#cd /var/spool/slurmd
#cp "$HOSTNAME.log" "$HOSTNAME.$TIMESTAMP.log"
#/usr/lib/nersc-slurm-plugins/rabbitmq_slurmdlog
## temporarily stop sighupping slurmd since we think this might be causing problems
#pkill -1 -f /usr/sbin/slurmd

### exec nhc after ALL other actions
### this is done last because the nhc will detach from the job container
exec /etc/slurm/nhc_slurm_epilog.sh
[dmj@dmjdev files]$

I'm not sure that nhc_slurm_epilog.sh is mine to share since it is primarily authored by Cray. It does close all stdio descriptors -- but that is in the already-forked epilog; it doesn't do anything to change the state of slurmd. unsetupRoot is part of shifter cleanup, and should not impact the state of slurmd either (it unmounts things under /var/udiMount).
The disable kmem cgroup patch is this one (from 17.02):

https://github.com/SchedMD/slurm/commit/3b5befc9e8565

We enabled it to try to prevent some lustre-related crashes in CLE6, but then found that the memory cgroup has many other pitfalls in the 3.12 series kernels even setting the kmem enforcement aside, and so have disabled the memory constraints yet again. I don't see any opens or closes.

Could we use an LD_PRELOAD hack to achieve the instrumented close()? I just need to make sure it _only_ affects slurmd (and possibly slurmstepd), and doesn't infect the environment of the user jobs (since they may be instrumenting close themselves).
Created attachment 4104 [details]
close.c LD_PRELOAD library

> Could we use an LD_PRELOAD hack to achieve the instrumented close()? I just
> need to make sure it _only_ affects slurmd (and possibly slurmstepd), and
> doesn't infect the environment of the user jobs (since they may be
> instrumenting close themselves).

That was my thinking, but I'm actually on a train today and don't have convenient access back to my main test box at the moment...

I'm attaching a "close.c" which does the interception. I think, but have not had a chance to test, that the fork/exec() slurmstepd does should prevent this from leaking into child processes, but that certainly bears some testing.

My trivial test.c used to verify this works:

#include <unistd.h>

int main(int argc, char **argv)
{
    close(0);
}
OK, I reserved a node, built the library in /root/wrap_close.so (and fixed perms so non-root can't see it), then modified /etc/init.d/slurm to preload it. I modified the code to write to /tmp/fd_close.log and serialize access with flock() (in case the other fds are getting impacted somewhere).

Here's the log for slurmd startup (final slurmd running as pid 43749):

nid02274:~ # cat /tmp/fd_close.log
Sat Feb 25 14:46:24 2017 close() of fd 0, pid 43743, return: 0
/root/wrap_close.so(close+0x120)[0x2aaaaaccebe5]
tr[0x4020c4]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaaef1b25]
tr[0x402847]
Sat Feb 25 14:46:24 2017 close() of fd 0, pid 43746, return: 0
/root/wrap_close.so(close+0x120)[0x2aaaaaccebe5]
touch[0x4023ed]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaaef1b25]
touch[0x4028a6]
nid02274:~ #

I started a job on it and it didn't affect my environment. As near as I can tell this issue is caused most frequently when many jobs start simultaneously. I'll write a cron job to send small job arrays to this node for a while and see if I can force it to happen.

-Doug
Hmmm, I'm not convinced that the cred.new on slurmd fd 0 is related. A test program attempting to replicate the asserted cause:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd = creat("/tmp/test1.out", 0600);
    int fd2 = creat("/tmp/test2.out", 0600);
    char buffer[1024];

    printf("pid: %d\n", getpid());
    snprintf(buffer, 1024, "ls -l /proc/%d/fd", getpid());

    close(STDIN_FILENO);
    printf("first dup2: %d\n", dup2(fd, STDIN_FILENO));
    unlink("/tmp/test1.out");
    system(buffer);

    close(STDIN_FILENO);
    printf("second dup2: %d\n", dup2(fd2, STDIN_FILENO));
    system(buffer);

    return 0;
}

Its output:

[dmj@localhost ~]$ ./a.out
pid: 14950
first dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 08:59 0 -> '/tmp/test1.out (deleted)'
lrwx------. 1 dmj dmj 64 Feb 26 08:59 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 08:59 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 08:59 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 08:59 4 -> /tmp/test2.out
second dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 08:59 0 -> /tmp/test2.out
lrwx------. 1 dmj dmj 64 Feb 26 08:59 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 08:59 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 08:59 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 08:59 4 -> /tmp/test2.out
[dmj@localhost ~]$
If, however, I close the source fd for dup2, then I get the same error:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv)
{
    int fd = creat("/tmp/test1.out", 0600);
    int fd2 = creat("/tmp/test2.out", 0600);
    int fd3 = creat("/tmp/test3.out", 0600);
    close(fd3);
    char buffer[1024];

    printf("pid: %d\n", getpid());
    snprintf(buffer, 1024, "ls -l /proc/%d/fd", getpid());

    close(STDIN_FILENO);
    printf("first dup2: %d\n", dup2(fd, STDIN_FILENO));
    unlink("/tmp/test1.out");
    system(buffer);

    close(STDIN_FILENO);
    printf("second dup2: %d\n", dup2(fd2, STDIN_FILENO));
    system(buffer);

    close(STDIN_FILENO);
    int ret = dup2(fd3, STDIN_FILENO);
    printf("third dup2: %d, %s\n", ret, strerror(errno));
    system(buffer);

    return 0;
}

[dmj@localhost ~]$ ./a.out
pid: 15871
first dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 09:07 0 -> '/tmp/test1.out (deleted)'
lrwx------. 1 dmj dmj 64 Feb 26 09:07 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 09:07 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 09:07 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 09:07 4 -> /tmp/test2.out
second dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 09:07 0 -> /tmp/test2.out
lrwx------. 1 dmj dmj 64 Feb 26 09:07 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 09:07 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 09:07 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 09:07 4 -> /tmp/test2.out
third dup2: -1, Bad file descriptor
total 0
lrwx------. 1 dmj dmj 64 Feb 26 09:07 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 09:07 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 09:07 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 09:07 4 -> /tmp/test2.out
[dmj@localhost ~]$
(In reply to Doug Jacobsen from comment #28)
> hmmm, i'm not convinced that the cred.new on slurmd fd 0 is related.
>
> a test program attempting to replicate the asserted cause:

Yes - I believe it's a symptom, but not the cause.

The code producing the error is in _forkexec_slurmstepd, and does this:

(void) close(STDIN_FILENO); /* ignore return */
if (dup2(to_stepd[0], STDIN_FILENO) == -1) {
    error("dup2 over STDIN_FILENO: %m");
    exit(1);
}

If STDIN had been closed at some point, to_stepd[0] could have been set to 0. But the close(STDIN_FILENO) would close it right before attempting to dup2() it as the source (and destination). That's my working theory as to the cause of the problem, although the source of the close(0) is still unknown.

cred.new using fd 0 and staying open indefinitely would actually help prevent this from causing problems for successive job steps, since the pipe could no longer land on fd 0.
Hello,

FYI, this happened again in slurm 17.02.1 on cori. Of course it happened _before_ I set up the close() instrumentation again. That's set up now and we'll just wait for the next occurrence.

-Doug
Alright, standing by hoping you get a usable trace from it. Of course, as soon as you start hunting for it I'm sure that it'll be a bit until you find it. None of the issues we've been working on in the past week involve errant FD handling, so I'm sure this issue is still lurking in the slurmd code somewhere. I'm sending this to Dominik to keep an eye on, as I'll be out on travel starting tomorrow afternoon, and I'm guessing it may take another few days to trip it again. - Tim
OK, this was reproduced in 17.02 with the close() instrumentation running. I'll attach the log, but the first close() of fd 0 by slurmd (pid 33363 in this log) is apparently involved with libnss_sss:

Fri Mar 10 07:23:26 2017 close() of fd 0, pid 33363, return: 0
/root/wrap_close.so(close+0x120)[0x2aaaaaccebe5]
/lib64/libnss_sss.so.2(+0x4b38)[0x2aaaf080cb38]
/lib64/libnss_sss.so.2(+0x52c9)[0x2aaaf080d2c9]
/lib64/libnss_sss.so.2(_nss_sss_getpwuid_r+0x46)[0x2aaaf080af76]
/lib64/libc.so.6(getpwuid_r+0xdc)[0x2aaaabc1b4cc]
/usr/sbin/slurmd(slurm_getpwuid_r+0x38)[0x53fc9b]
/usr/sbin/slurmd[0x4371a6]
/usr/sbin/slurmd[0x437971]
/usr/sbin/slurmd[0x43b090]
/usr/sbin/slurmd[0x43b571]
/usr/sbin/slurmd(slurmd_req+0x348)[0x43574f]
/usr/sbin/slurmd[0x4300b6]
/lib64/libpthread.so.0(+0x80a4)[0x2aaaab9500a4]
/lib64/libc.so.6(clone+0x6d)[0x2aaaabc4a02d]

I'll send a disassembled copy of my slurmd as well to aid in tracing the backtraces. It appears that the stack trace is:

libnss_sss: _nss_sss_endservent_r
libnss_sss: _nss_sss_endservent_r
libnss_sss: _nss_sss_getpwuid_r
libc: getpwuid_r
slurmd: slurm_getpwuid_r
slurmd: _send_slurmstepd_init
slurmd: _forkexec_slurmstepd
slurmd: _spawn_prolog_stepd
slurmd: _rpc_prolog
slurmd: slurmd_req
slurmd: _service_connection
libpthread (something)
libc: clone()

Huh -- I guess it looks like sssd is doing this?
> huh -- i guess it looks like sssd is doing this?

Looks an awful lot like this is the problem:

https://pagure.io/SSSD/sssd/issue/2409

Which version of sssd are you running? That's a pretty nasty bug - closing out our file descriptors from something in the standard library is just mean.
nid02267:~ # rpm -qa | grep sssd
sssd-krb5-common-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-ldap-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-tools-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
python-sssd-config-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-proxy-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
nid02267:~ #

These version numbers are a nightmare.
Reading their bug, it looks like it was fixed with sssd 1.11.7. You appear to be running something based off 1.11.5 (possibly with some extra SuSE/Cray patches applied, judging by the version number soup), so I think it's likely the same underlying bug.
I am going to close this as invalid. Dominik