Ticket 4545 - New lustre_no_flush option for native/cray
Summary: New lustre_no_flush option for native/cray
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.0
Hardware: Cray XC Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
Reported: 2017-12-20 06:54 MST by Doug Jacobsen
Modified: 2018-03-14 09:39 MDT

Site: NERSC


Description Doug Jacobsen 2017-12-20 06:54:33 MST
Hello,

I noticed commit 3552b25fa9a8 being added on the slurm-17.11 branch.  Since I can't read bug 4309 but can read the commit message, this seems to be for jobs sharing a node.

Is it possible that this LaunchParameter could be made partition-specific?  This would allow a partition intended for job sharing to avoid cache flushing, while allowing node-exclusive partitions to continue to have the same behavior.

Thanks,
Doug
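
For context, the option from that commit is enabled through LaunchParameters in slurm.conf; a minimal sketch, with the option name taken from the ticket title rather than verified against the 17.11 documentation:

```
# slurm.conf - LaunchParameters is cluster-wide today; the request in this
# ticket is to allow a per-partition equivalent.
LaunchParameters=lustre_no_flush
```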
Comment 1 Moe Jette 2017-12-20 11:12:51 MST
(In reply to Doug Jacobsen from comment #0)
> Hello,
> 
> I noticed commit 3552b25fa9a8 on the slurm-17.11 branch get added.  Since I
> can't read bug 4309, but can read the comments, this seems to be for jobs
> sharing a node.
>
> Is it possible that this LaunchParameter could be made partition-specific? 
> This would allow a partition intended for job sharing to avoid cache
> flushing, while allowing node-exclusive partitions to continue to have the
> same behavior.

The same problem can also occur with exclusive nodes if the job has multiple job steps active at the same time.

Is having this configuration option available on a per-partition basis helpful given that additional information?
Comment 2 Moe Jette 2018-01-02 11:16:36 MST
Information provided. Please re-open if you need more information.
Comment 3 Doug Jacobsen 2018-02-02 17:46:36 MST
Hello,

So I spoke with LANL about this issue and I believe that this fix is not required to prevent srun or slurmstepd from generating a bus error.

At NERSC, we allow the caches to be flushed, even on nodes running multiple jobs, and there is no issue with the flushes causing srun to hit a bus error (though I could imagine it generating issues for jobs accessing the OS in other ways).

I think that the specific issue here is that on Cray CLE 6.0, by default, nodes get the OS, including the Slurm installation and all of its plugins, via a DVS mount of /.

In reality, / is an overlay filesystem whose lower layer is a loop-mounted squashfs image and whose upper layer is tmpfs.

When buffer caches are flushed during a dlopen(), I can imagine a timeout under some conditions while waiting for a Slurm plugin to be re-read over DVS.

The NERSC solution is to localize all files related to slurm or involved in slurmstepd launch into that tmpfs layer at boot time.

This can be done by creating a new netroot preload file:

gertsmw:/var/opt/cray/imps/config/sets/p0/dist # cat compute-preload.nersc
/usr/lib64/libslurm*so*
/usr/lib64/slurm/*.so
/usr/sbin/slurmd
/usr/sbin/slurmstepd
/usr/bin/sbatch
/usr/bin/srun
/usr/bin/sbcast
/usr/bin/numactl
/usr/lib64/libnuma*so*
/lib64/ast/libast.so*
/lib64/ast/libcmd.so*
/lib64/ast/libdll.so*
/lib64/ast/libshell.so*
/lib64/libacl.so*
/lib64/libattr.so*
/lib64/libc.so*
/lib64/libcap.so*
/lib64/libdl.so*
/lib64/libgcc_s.so*
...
...

I generate mine by including everything installed by the Slurm RPMs, then capture the rest by running strace -f against the running slurmd while launching a job step.
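
The strace-harvesting step can be sketched as a small helper; the function name and log path are illustrative, not NERSC's actual tooling:

```shell
# extract_paths.sh - hypothetical helper: pull the unique shared-object
# paths out of an strace log of slurmd so they can be appended to a
# netroot preload file.
# Usage: extract_paths slurmd.strace
extract_paths() {
    # Match the quoted pathname argument of open()/openat() calls,
    # keep only shared objects, and deduplicate.
    grep -oE '"/[^"]+"' "$1" | tr -d '"' | grep '\.so' | sort -u
}
```

The output would be merged with the paths reported by `rpm -ql` for the Slurm packages to form the preload list.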

Once the netroot preload file is generated, it then needs to be included in the cray_netroot_preload_worksheet CLE configuration.

e.g.,
cray_netroot_preload.settings.load.data.label.compute: null
cray_netroot_preload.settings.load.data.compute.targets: []
cray_netroot_preload.settings.load.data.compute.content_lists:
- dist/compute-preload.cray
- dist/compute-preload.nersc
cray_netroot_preload.settings.load.data.compute.size_limit: 0


This is a generally useful technique for preventing remote lookups of commonly accessed files within jobs.

I'm not sure how this should best be included in the Slurm documentation, but I think it probably should be.
Comment 4 Brett Kettering 2018-02-05 08:53:58 MST
Doug:

At LANL, David Shrader (dshrader@lanl.gov) is the person who has been on point for resolving this issue affecting a couple of code teams. He can give you much detail on the ramifications.

As an overview, this has bitten us primarily when our code teams are running tens of concurrent jobs on a single node and thousands of total runs in an allocation. We had a couple of issues: 1) Lustre caches being flushed with each srun; and 2) kernel caches being flushed with each srun.

Issue #2 created a performance problem in which each srun reloaded the same dynamic libraries (ParaView in this case). When the kernel cache is not flushed, the first srun loads the libraries and the rest reuse them; there is no need to reload them thousands of times.

I encourage you to talk to David. You may also want to involve Michael Jennings (mej@lanl.gov), who is very knowledgeable about SLURM and kernel matters.

Hopefully, you all, with SchedMD and Cray, can figure out a solution that works for everyone.

Best Regards,
Brett
Comment 12 Moe Jette 2018-03-06 09:49:04 MST
Any updates on this?

I did add a FAQ item based mostly upon Doug's comment #3:
https://github.com/SchedMD/slurm/commit/9262408fe95ae64f7a8a53f068dc38cb29ed69af
Comment 13 Moe Jette 2018-03-14 09:39:56 MDT
I'm closing this bug. Feel free to reopen if more information becomes available.