Created attachment 4045 [details]
slurmd log

Hello,

We have a "shared" partition on cori where users can request just a single core on our haswell nodes, which means up to 32 jobs can run independently on each node. Since moving to slurm 16.05.9 (though there may have been some of this in 16.05.8 as well), we are seeing a lot of nodes dropping off (maybe 2-3 per day) owing to "batch job completion failure". Looking in the slurmd log, I see that dup2() is failing and slurmstepd is failing to send a message, followed by loss of the starting job (and marking down of the node). In looking at the accounting data it seems that in each of these cases > 20 jobs are trying to start (that is an estimate, not a hard number of job starts).

[2017-02-13T17:34:45.250] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.252] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.425] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.430] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.591] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.596] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.597] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.600] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.640] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.643] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-02-13T17:34:45.865] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.867] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:45.870] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.873] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:45.876] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.879] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:45.882] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:45.884] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-13T17:34:46.977] error: dup2 over STDIN_FILENO: Bad file descriptor
[2017-02-13T17:34:46.981] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error

I'm pretty sure that we should not have exhausted the overall open file limit on these systems, and so dup2() must be getting an invalid file descriptor to work with (i.e., perhaps the pipe for stdin isn't getting set up correctly). I'll attach the slurmd logs.

-Doug
Doug -

A few questions for you based on some discussion with Dominik earlier:

- Is sbcast being used on these nodes? I don't see any evidence of it in the slurmd log from the past day, but there are a few quirks there that we're looking into fixing, although I don't see any evidence they're causing this at the moment.

- Can you attach the slurmctld log from around the time that this dup2 error message hit?

[2017-02-13T17:34:45.425] error: dup2 over STDIN_FILENO: Bad file descriptor

I'm just curious what else is going on with the system at that point in time.
Hi Tim,

I'll send the slurmctld logs after lunch. We are only seeing this on our shared nodes -- i.e., ones that run multiple jobs per node and never across nodes. The issue seems to happen only when the node is relatively idle and is starting 10s of jobs at the same time (i.e., a user submits a large job array). We see this on both cori and edison, perhaps 1-3 times per day.

-Doug
Created attachment 4065 [details]
patch

Hi,

I have some idea of what is going on here, but to be sure could you apply this patch? If it doesn't help, it will at least give us some extra info.

Dominik
Hi Dominik,

I'll be traveling for a few days, but we have a scheduled maintenance next Tuesday. I'll get this in place then and will let you know how it goes.

Thanks,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
OK, this patch is queued up to be installed during the maintenance tomorrow. Will let you know how it goes. -Doug
I've found a few likely sources of this issue while investigating some problems Dominik had discovered; I'm working on a patch for the 17.02 branch now that should be back-portable to 16.05 as well if necessary.

Just FYI, although I'm not sure if you've been testing it, there's a late RPC change that'll be in 17.02 (differing from 17.02-rc1) to support files over 2GB. I'm not sure if you've run into that before, but we wanted to make sure the RPC change happened now rather than needing to wait until 17.11.

- Tim
Great. We've just gotten started with 17.02 on gerty -- I don't think it's really up yet owing to a bug (task plugin ordering between cray/cgroup) that James has filed or will file.

Regarding this bug, I did apply the provided patch but found that we are still having this issue.

-Doug
(In reply to Doug Jacobsen from comment #8)
> Great. We've just got started with 17.02 on gerty -- I don't think it's
> really up yet owing to a bug (task plugin ordering between cray/cgroup)
> that James will or has filed.

This was fixed with commit e5fb93d49b711d; I haven't seen a bug filed from NERSC on this though.

> Regarding this bug, I did apply the provided patch but found that we are
> still having this issue.
> -Doug

You don't happen to have any output from that? The patch was just adding an extra diagnostic message in hopes of narrowing down the issue; there was no functional change there.
Created attachment 4087 [details]
recent slurmd log with debugging output
I've attached the relevant log; here is the relevant-looking data:

[2017-02-21T21:28:14.773] [3823118] Created file /var/opt/cray/alps/spool/status6056184792584719886
[2017-02-21T21:28:14.775] [3823118] task_p_pre_launch: Using sched_affinity for tasks
[2017-02-21T21:28:14.806] Launching batch job 3823108 for UID 53069
[2017-02-21T21:28:14.806] [3823125] core_spec/cray: init
[2017-02-21T21:28:14.809] error: dup2 over STDIN_FILENO to_stepd[0]=0: Bad file descriptor
[2017-02-21T21:28:14.810] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2017-02-21T21:28:14.810] Launching batch job 3823127 for UID 53069
[2017-02-21T21:28:14.812] Could not launch job 3823108 and not able to requeue it, cancelling job
[2017-02-21T21:28:14.812] error: dup2 over STDIN_FILENO to_stepd[0]=0: Bad file descriptor
[2017-02-21T21:28:14.814] [3823125] Created file /var/opt/cray/alps/spool/status6056184792584719893
[2017-02-21T21:28:14.814] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2017-02-21T21:28:14.814] Launching batch job 3823129 for UID 53069
[2017-02-21T21:28:14.816] [3823125] task_p_pre_launch: Using sched_affinity for tasks
[2017-02-21T21:28:14.851] Launching batch job 3823122 for UID 53069
Hrm... the more I look through the bcast code, the more convinced I am that it's not the source of the issue. I have a series of patches to clean up the code some, but haven't found any conclusive issue there, and will wait until after the 17.02 release to commit them.

Looking through the log you've provided, I don't see the errors Dominik had added with his patch, and am forced to conclude the issue is somewhere else. Do any of your local patches do anything with the slurmd file descriptors? As best I can tell at the moment, slurmd must be racing something closing out the pipe early -- there's a relatively small window between setting up the pipe and calling dup2(), and none of the other error messages from that section are hitting.

It does look like you have the Shifter SPANK plugin loaded, although quickly skimming through that code I don't believe it should be causing this.
This just happened on another node -- same signature as typical, however this time I finally thought to look at the slurmd's open file descriptors:

nid02274:~ # ls -l /proc/52826/fd
total 0
l-wx------ 1 root root 64 Feb 25 06:26 0 -> /var/spool/slurmd/cred_state.new (deleted)
lrwx------ 1 root root 64 Feb 25 06:26 1 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:26 2 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:26 3 -> socket:[784209]
lrwx------ 1 root root 64 Feb 25 06:26 4 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:26 5 -> socket:[241869]
l-wx------ 1 root root 64 Feb 25 06:26 6 -> /var/run/slurmd.pid
l-wx------ 1 root root 64 Feb 25 06:26 8 -> /var/spool/slurmd/nid02274.log
lr-x------ 1 root root 64 Feb 25 06:26 9 -> /proc/job
nid02274:~ #

As compared to a typical slurmd:

nid02273:~ # ls -l /proc/52884/fd
total 0
lrwx------ 1 root root 64 Feb 25 06:27 0 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 1 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 2 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 3 -> socket:[793545]
lrwx------ 1 root root 64 Feb 25 06:27 4 -> /dev/null
lrwx------ 1 root root 64 Feb 25 06:27 5 -> socket:[228273]
l-wx------ 1 root root 64 Feb 25 06:27 6 -> /var/run/slurmd.pid
l-wx------ 1 root root 64 Feb 25 06:27 7 -> /var/spool/slurmd/nid02273.log
lr-x------ 1 root root 64 Feb 25 06:27 9 -> /proc/job
nid02273:~ #

Perhaps this is related to _forkexec_slurmstepd happening while save_cred_state is running?
I think I see the issue: in slurm 16.05.9, the cleanup segment of save_cred_state() has:

if (cred_fd > 0)
    close(cred_fd);

instead of checking cred_fd >= 0. Looks like this was fixed in slurm 17.02 already (thank you coverity!):

https://github.com/SchedMD/slurm/commit/0d4773a95129e45e8e3d9892e23432fe26fea60b
I suppose the question now is why (rarely) save_cred_state() is getting the new cred state on fd 0. My read of save_cred_state() is that it should run fairly frequently (especially on our shared nodes), but if it happens to open the new cred state as fd 0 (stdin), then with the 16.05 logic it will never be closed, and this will occur on the next fork.
I suppose something must be racing a close of stdin (/dev/null typically) against save_cred_state.
(In reply to Doug Jacobsen from comment #18)
> i suppose something must be racing a close of stdin (/dev/null typically)
> and save_cred_state.

Yeah, that is still the main question from my perspective.

Fixing that "leak" may actually make the problem worse, as it'd then let other files continue to land on fd 0, possibly screwing up all successive jobs on the node. _forkexec_slurmstepd() implicitly assumes that fds 0/1/2 are safe to close; if part of the pipe communicating with slurmd lands on those, then the stepd launch will fail.

Do you mind sending me a current set of the patches you're running on Cori? David had outlined them previously; I'd like to look through those and make sure nothing looks awry. I know there's some internal content in them; if you want to email them to me directly that's fine.

And are there any other SPANK plugins in use besides Shifter? The Shifter plugin itself looked fine, assuming you're running a version in sync with your github repo.
I don't mind sending our patches privately (many come from SchedMD anyway), but this summary might be informative:

[dmj@dmjdev deploy]$ for i in $(cat cori.patches); do echo $i; egrep '^---' ../patches/$i ; echo; echo; done
nersc_job_comp.patch
--- a/src/plugins/jobcomp/nersc/jobcomp_nersc.c 1969-12-31 16:00:00.000000000 -0800
--- a/src/plugins/jobcomp/nersc/Makefile.am 1969-12-31 16:00:00.000000000 -0800

nersc_jobcomp_configure.patch
--- slurm-15.08.0-0rc1/configure.ac 2015-08-20 16:19:40.000000000 -0700
--- slurm-15.08.0-0rc1/src/plugins/jobcomp/Makefile.am 2015-05-21 13:20:09.000000000 -0700

nersc_slurm_errno_16.05.patch
--- a/slurm/slurm_errno.h 2016-05-31 14:42:39.000000000 -0700
--- a/src/common/slurm_errno.c 2016-05-31 14:42:39.000000000 -0700

nhc_tracing_16.05.patch
--- a/slurm/slurm.h.in 2016-05-31 14:42:39.000000000 -0700
--- a/src/common/read_config.c 2016-05-31 14:42:39.000000000 -0700
--- a/src/plugins/select/cray/select_cray.c 2016-05-31 14:42:39.000000000 -0700

set_slurmd_service_limits.patch
--- a/etc/slurmd.service.in 2016-07-13 12:52:39.590940453 -0700

hack_around_right_bracket_2.patch
--- a/src/slurmctld/port_mgr.c 2016-07-13 12:52:41.114274791 -0700

dmj_priority_mgmt.patch
--- a/src/plugins/sched/backfill/backfill.c 2017-01-31 11:55:44.000000000 -0800

srun_start_scalability.patch
--- a/src/srun/libsrun/launch.c

slurm_16.05.6_bug3257_tempfix.patch
--- a/src/common/pack.h

disable_kmem_cgroup_3b5befc9e8565.patch
---
--- a/src/common/xcgroup_read_config.c
--- a/src/common/xcgroup_read_config.h
--- a/src/plugins/task/cgroup/task_cgroup_memory.c

slurm-16.05.9-bug3446-f6d42fdbb293.patch
---
--- a/src/slurmctld/job_mgr.c
--- a/src/slurmctld/job_scheduler.c
--- a/src/slurmctld/node_scheduler.c
--- a/src/slurmctld/power_save.c
--- a/src/slurmctld/slurmctld.h

slurm-16.05.9-bug3461_bd7504fbe098.patch
---
--- a/src/plugins/burst_buffer/cray/burst_buffer_cray.c
--- a/src/slurmctld/job_mgr.c

slurm-16.05.9-bug3466-sharedstartup-dup2-issue.patch
--- a/src/slurmd/slurmd/req.c
[dmj@dmjdev deploy]$

Other than the provided patch for 3466 we don't have anything that should directly patch slurmd, I think.

Regarding SPANK plugins, I have three others loaded, two of which I have the source for:

[dmj@dmjdev deploy]$ cat ../config/ansible/roles/smw_slurm_config/files/cori_plugstack.conf | awk '{print $2}'
/usr/lib64/shifter/shifter_slurm.so
/usr/lib/nersc-slurm-plugins/vtune.so
/opt/cray/pe/atp/libAtpSLaunch.so
/usr/lib/nersc-slurm-plugins/zonesort.so
[dmj@dmjdev deploy]$

Shifter you've seen. zonesort is attached on 3188. I can send you vtune (about to be replaced with a more capable, generalized perf loader plugin). I don't have Cray's libAtpSLaunch source.
However, I'd guess that Cray's ATP plugin isn't to blame:

dmj@cori08:~> objdump -S /opt/cray/pe/atp/libAtpSLaunch.so | grep call | grep open
dmj@cori08:~> objdump -S /opt/cray/pe/atp/libAtpSLaunch.so | grep call | grep close
dmj@cori08:~>
vtune is similarly quiet on the open/close front:

[dmj@dmjdev nerscSpank]$ egrep 'open|close' spank_vtune.c
[dmj@dmjdev nerscSpank]$

spank_zonesort.c does open and close some file descriptors, but all the opens and closes look safe to me.
Thanks for the summary. The only one I'd be curious about at the moment based on that listing is disable_kmem_cgroup_3b5befc9e8565.patch. (Does that patch something now available in 17.02? I vaguely recall some work with KMem limits getting merged at one point.) Although if you're willing to send them all over, I'll add them to our current file for reference. But I'm guessing the fault lies elsewhere, and will have to keep perusing.

One thing that may help, although I'm not sure if you're willing to run it, would be to replace all the close() calls throughout the source with an instrumented version, and have it dump stack traces out whenever fd 0 is closed. If you'd be willing to run with that for a day or so, I can work out the best way to make that happen.

Now that I think about it -- do you have anything exciting in the prolog/epilog at this point? I know at one point you were SIGHUP'ing slurmd to rotate the logs out; if you still have something like that in place, that could help explain the intermittent timing.
[dmj@dmjdev files]$ cat cori_nodeprolog.sh
#!/bin/bash
count=0
while [[ "$count" -le 10 ]]; do
    output=$(id "$SLURM_JOB_USER")
    retcode=$?
    if [[ "$retcode" == 0 ]]; then
        break
    fi
    pid=$(pgrep -f /usr/sbin/sssd)
    if [[ "$pid" -gt 1 ]]; then
        kill -12 $pid ## send sigusr2 to sssd to force it online
        echo "nersc-nodeprolog|$SLURM_JOB_ID|Sending SIGUSR2 to ssd ($pid), try number $count" > /dev/kmsg
    fi
    count=$(( $count + 1 ))
    if [[ "$count" -gt 1 ]]; then
        sleep 1
    fi
done
groups "$SLURM_JOB_USER" >/dev/null
echo "nersc-nodeprolog|$SLURM_JOB_ID|$(hostname)|$(date +%Y-%m-%dT%H:%M:%S)" > /dev/kmsg
exit 0
[dmj@dmjdev files]$

[dmj@dmjdev files]$ cat cori_nodeepilog.sh
#!/bin/bash
HOSTNAME=$(printf "nid%05d" $(cat /proc/cray_xt/nid))
TIMESTAMP=$(date +%Y-%m-%dT%H:%M:%S)

### put any node modification and/or cleanup tasks here
/usr/sbin/unsetupRoot

### send slurmd logs to rabbitmq; does a SIGHUP
# backup slurmd log, rabbitmq_slurmdlog is not working perfectly all the time
#cd /var/spool/slurmd
#cp "$HOSTNAME.log" "$HOSTNAME.$TIMESTAMP.log"
#/usr/lib/nersc-slurm-plugins/rabbitmq_slurmdlog
## temporarily stop sighupping slurmd since we think this might be causing problems
#pkill -1 -f /usr/sbin/slurmd

### exec nhc after ALL other actions
### this is done last because the nhc will detach from the job container
exec /etc/slurm/nhc_slurm_epilog.sh
[dmj@dmjdev files]$

I'm not sure that nhc_slurm_epilog.sh is mine to share since it is primarily authored by Cray. It does close all stdio descriptors -- but that is in the already-forked epilog; it doesn't do anything to change the state of slurmd. unsetupRoot is part of shifter cleanup, and should not impact the state of slurmd either (it unmounts things under /var/udiMount).
The disable kmem cgroup patch is this one (from 17.02):

https://github.com/SchedMD/slurm/commit/3b5befc9e8565

We enabled it to try to prevent some lustre-related crashes in CLE6, but then found that the memory cgroup has many other pitfalls in the 3.12 series kernels even setting the kmem enforcement aside, and so have disabled the memory constraints yet again. I don't see any opens or closes.

Could we use an LD_PRELOAD hack to achieve the instrumented close()? I just need to make sure it _only_ affects slurmd (and possibly slurmstepd), and doesn't infect the environment of the user jobs (since they may be instrumenting close themselves).
Created attachment 4104 [details]
close.c LD_PRELOAD library

> Could we use an LD_PRELOAD hack to achieve the instrumented close()? I just
> need to make sure it _only_ affects slurmd (and possibly slurmstepd), and
> doesn't infect the environment of the user jobs (since they may be
> instrumenting close themselves).

That was my thinking, but I'm actually on a train today and don't have convenient access back to my main test box at the moment...

I'm attaching a "close.c" which does the interception. I think, but have not had a chance to test, that the fork/exec() slurmstepd does should prevent this from leaking into child processes, but that certainly bears some testing.

My trivial test.c used to verify this works:

#include <unistd.h>

int main(int argc, char **argv)
{
    close(0);
}
OK, I reserved a node, built the library in /root/wrap_close.so (and fixed perms so non-root can't see it), then modified /etc/init.d/slurm to preload it. I modified the code to write to /tmp/fd_close.log and serialize access with flock() (in case the other fds are getting impacted somewhere).

Here's the log for slurmd startup (final slurmd running as pid 43749):

nid02274:~ # cat /tmp/fd_close.log
Sat Feb 25 14:46:24 2017 close() of fd 0, pid 43743, return: 0
/root/wrap_close.so(close+0x120)[0x2aaaaaccebe5]
tr[0x4020c4]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaaef1b25]
tr[0x402847]
Sat Feb 25 14:46:24 2017 close() of fd 0, pid 43746, return: 0
/root/wrap_close.so(close+0x120)[0x2aaaaaccebe5]
touch[0x4023ed]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaaaef1b25]
touch[0x4028a6]
nid02274:~ #

I started a job on it and it didn't affect my environment. As near as I can tell this issue is caused most frequently when many jobs start simultaneously. I'll write a cron job to send small job arrays to this node for a while and see if I can force it to happen.

-Doug
Hmmm, I'm not convinced that the cred.new on slurmd fd 0 is related. A test program attempting to replicate the asserted cause:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd = creat("/tmp/test1.out", 0600);
    int fd2 = creat("/tmp/test2.out", 0600);
    char buffer[1024];

    printf("pid: %d\n", getpid());
    snprintf(buffer, 1024, "ls -l /proc/%d/fd", getpid());

    close(STDIN_FILENO);
    printf("first dup2: %d\n", dup2(fd, STDIN_FILENO));
    unlink("/tmp/test1.out");
    system(buffer);

    close(STDIN_FILENO);
    printf("second dup2: %d\n", dup2(fd2, STDIN_FILENO));
    system(buffer);

    return 0;
}

Its output:

[dmj@localhost ~]$ ./a.out
pid: 14950
first dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 08:59 0 -> '/tmp/test1.out (deleted)'
lrwx------. 1 dmj dmj 64 Feb 26 08:59 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 08:59 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 08:59 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 08:59 4 -> /tmp/test2.out
second dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 08:59 0 -> /tmp/test2.out
lrwx------. 1 dmj dmj 64 Feb 26 08:59 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 08:59 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 08:59 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 08:59 4 -> /tmp/test2.out
[dmj@localhost ~]$
If, however, I close the source fd for dup2, then I get the same error:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv)
{
    int fd = creat("/tmp/test1.out", 0600);
    int fd2 = creat("/tmp/test2.out", 0600);
    int fd3 = creat("/tmp/test3.out", 0600);
    close(fd3);
    char buffer[1024];

    printf("pid: %d\n", getpid());
    snprintf(buffer, 1024, "ls -l /proc/%d/fd", getpid());

    close(STDIN_FILENO);
    printf("first dup2: %d\n", dup2(fd, STDIN_FILENO));
    unlink("/tmp/test1.out");
    system(buffer);

    close(STDIN_FILENO);
    printf("second dup2: %d\n", dup2(fd2, STDIN_FILENO));
    system(buffer);

    close(STDIN_FILENO);
    int ret = dup2(fd3, STDIN_FILENO);
    printf("third dup2: %d, %s\n", ret, strerror(errno));
    system(buffer);

    return 0;
}

[dmj@localhost ~]$ ./a.out
pid: 15871
first dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 09:07 0 -> '/tmp/test1.out (deleted)'
lrwx------. 1 dmj dmj 64 Feb 26 09:07 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 09:07 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 09:07 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 09:07 4 -> /tmp/test2.out
second dup2: 0
total 0
l-wx------. 1 dmj dmj 64 Feb 26 09:07 0 -> /tmp/test2.out
lrwx------. 1 dmj dmj 64 Feb 26 09:07 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 09:07 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 09:07 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 09:07 4 -> /tmp/test2.out
third dup2: -1, Bad file descriptor
total 0
lrwx------. 1 dmj dmj 64 Feb 26 09:07 1 -> /dev/pts/3
lrwx------. 1 dmj dmj 64 Feb 26 09:07 2 -> /dev/pts/3
l-wx------. 1 dmj dmj 64 Feb 26 09:07 3 -> '/tmp/test1.out (deleted)'
l-wx------. 1 dmj dmj 64 Feb 26 09:07 4 -> /tmp/test2.out
[dmj@localhost ~]$
(In reply to Doug Jacobsen from comment #28)
> hmmm, i'm not convinced that the cred.new on slurmd fd 0 is related.
>
> a test program attempting to replicate the asserted cause:

Yes - I believe it's a symptom, but not the cause.

The code producing the error is in _forkexec_slurmstepd, and does this:

(void) close(STDIN_FILENO); /* ignore return */
if (dup2(to_stepd[0], STDIN_FILENO) == -1) {
    error("dup2 over STDIN_FILENO: %m");
    exit(1);
}

If STDIN had been closed at some point, to_stepd[0] could have been set to 0. But the close(STDIN_FILENO) would close it right before attempting to dup2() it as the source (and destination). That's my working theory as to the cause of the problem, although the source of the close(0) is still unknown.

cred.new using fd 0 and staying open indefinitely would actually help prevent this from causing problems for successive job steps, since the pipe could no longer land on fd 0.
Hello,

FYI, this happened again in slurm 17.02.1 on cori. Of course it happened _before_ I set up the close() instrumentation again. That's set up now and we'll just wait for the next occurrence.

-Doug
Alright, standing by hoping you get a usable trace from it. Of course, as soon as you start hunting for it I'm sure that it'll be a bit until you find it. None of the issues we've been working on in the past week involve errant FD handling, so I'm sure this issue is still lurking in the slurmd code somewhere. I'm sending this to Dominik to keep an eye on, as I'll be out on travel starting tomorrow afternoon, and I'm guessing it may take another few days to trip it again. - Tim
OK, this was reproduced in 17.02 with the close() instrumentation running. I'll attach the log, but the first close() of fd 0 by slurmd (pid 33363 in this log) is apparently involved with libnss_sss:

Fri Mar 10 07:23:26 2017 close() of fd 0, pid 33363, return: 0
/root/wrap_close.so(close+0x120)[0x2aaaaaccebe5]
/lib64/libnss_sss.so.2(+0x4b38)[0x2aaaf080cb38]
/lib64/libnss_sss.so.2(+0x52c9)[0x2aaaf080d2c9]
/lib64/libnss_sss.so.2(_nss_sss_getpwuid_r+0x46)[0x2aaaf080af76]
/lib64/libc.so.6(getpwuid_r+0xdc)[0x2aaaabc1b4cc]
/usr/sbin/slurmd(slurm_getpwuid_r+0x38)[0x53fc9b]
/usr/sbin/slurmd[0x4371a6]
/usr/sbin/slurmd[0x437971]
/usr/sbin/slurmd[0x43b090]
/usr/sbin/slurmd[0x43b571]
/usr/sbin/slurmd(slurmd_req+0x348)[0x43574f]
/usr/sbin/slurmd[0x4300b6]
/lib64/libpthread.so.0(+0x80a4)[0x2aaaab9500a4]
/lib64/libc.so.6(clone+0x6d)[0x2aaaabc4a02d]

I'll send a disassembled copy of my slurmd as well to aid in tracing the backtraces. It appears that the stack trace is:

libnss_sss: _nss_sss_endservent_r
libnss_sss: _nss_sss_endservent_r
libnss_sss: _nss_sss_getpwuid_r
libc: getpwuid_r
slurmd: slurm_getpwuid_r
slurmd: _send_slurmstepd_init
slurmd: _forkexec_slurmstepd
slurmd: _spawn_prolog_stepd
slurmd: _rpc_prolog
slurmd: slurmd_req
slurmd: _service_connection
libpthread (something)
libc: clone()

Huh -- I guess it looks like sssd is doing this?
> huh -- i guess it looks like sssd is doing this?

Looks an awful lot like this is the problem:

https://pagure.io/SSSD/sssd/issue/2409

Which version of sssd are you running? That's a pretty nasty bug - closing out our file descriptors from something in the standard library is just mean.
nid02267:~ # rpm -qa | grep sssd
sssd-krb5-common-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-ldap-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-tools-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
python-sssd-config-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
sssd-proxy-1.11.5.1-10.12.1.11559.2.PTF.993582.x86_64
nid02267:~ #

These version numbers are a nightmare.
Reading their bug, it looks like it was fixed with sssd 1.11.7. You appear to be running something based off 1.11.5 (possibly with some extra SuSE/Cray patches applied, judging by the version number soup), so I think it's likely the same underlying bug.
I am going to close this as invalid. Dominik