Hello,

I noticed that commit 3552b25fa9a8 was added on the slurm-17.11 branch. Since I can't read bug 4309, but can read the commit comments, this appears to address jobs sharing a node.

Is it possible that this LaunchParameter could be made partition-specific? This would allow a partition intended for job sharing to avoid cache flushing, while allowing node-exclusive partitions to keep the current behavior.

Thanks,
Doug
(In reply to Doug Jacobsen from comment #0)
> Hello,
>
> I noticed commit 3552b25fa9a8 on the slurm-17.11 branch get added. Since I
> can't read bug 4309, but can read the comments, this seems to be for jobs
> sharing a node.
>
> Is it possible that this LaunchParameter could be made partition-specific?
> This would allow a partition intended for job sharing to avoid cache
> flushing, while allowing node-exclusive partitions to continue to have the
> same behavior.

The same problem can also occur with exclusive nodes if the job has multiple job steps active at the same time. Is having this configuration option available on a per-partition basis still helpful given that additional information?
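For context, LaunchParameters in slurm.conf is a cluster-wide option today; the sketch below illustrates the per-partition form being requested. The flag name lustre_no_flush and the partition/node names are assumptions for illustration; the per-partition line is hypothetical syntax that Slurm does not currently accept:

```
# slurm.conf today: LaunchParameters applies to the whole cluster
LaunchParameters=lustre_no_flush

# Hypothetical per-partition form proposed in comment #0 (NOT valid syntax):
# PartitionName=shared Nodes=nid00[000-099] LaunchParameters=lustre_no_flush
# PartitionName=excl   Nodes=nid00[100-199]
```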
Information provided. Please re-open if you need more information.
Hello,

So I spoke with LANL about this issue, and I believe that this fix is not required to prevent srun or slurmstepd from generating a bus error. At NERSC, we allow the caches to be flushed, even on nodes running multiple jobs, and there is no issue with the flushes causing srun to bus error (though I could imagine it generating issues for jobs accessing the OS in other ways).

I think the specific issue here is that on Cray CLE6.0, by default, nodes get the OS, including the Slurm installation and all of its plugins, via a DVS mount of /. Really, / is an overlay filesystem where the lower portion is a loop-mounted squashfs layer and the upper layer is tmpfs. When buffer caches are flushed during a dlopen, I can imagine a timeout under some conditions while waiting for a Slurm plugin to be re-resolved over DVS.

The NERSC solution is to localize all files related to Slurm, or involved in slurmstepd launch, into that tmpfs layer at boot time. This is possible by creating a new netroot preload file:

gertsmw:/var/opt/cray/imps/config/sets/p0/dist # cat compute-preload.nersc
/usr/lib64/libslurm*so*
/usr/lib64/slurm/*.so
/usr/sbin/slurmd
/usr/sbin/slurmstepd
/usr/bin/sbatch
/usr/bin/srun
/usr/bin/sbcast
/usr/bin/numactl
/usr/lib64/libnuma*so*
/lib64/ast/libast.so*
/lib64/ast/libcmd.so*
/lib64/ast/libdll.so*
/lib64/ast/libshell.so*
/lib64/libacl.so*
/lib64/libattr.so*
/lib64/libc.so*
/lib64/libcap.so*
/lib64/libdl.so*
/lib64/libgcc_s.so*
...
...

I generate mine by including everything installed by the Slurm RPMs, and then get the rest by running strace -f against the running slurmd while launching a job step. Once the netroot preload file is generated, it then needs to be included in the cray_netroot_preload_worksheet CLE configuration.
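The two-stage list generation described above (RPM contents first, then anything slurmd is observed opening at step launch) can be sketched roughly as follows. This is a simulation with fixed file names, not the exact NERSC procedure; the real rpm/strace invocations are shown only as comments and their package names are assumptions:

```shell
# Stage 1: files owned by the Slurm packages. On a real system this would be
# something like:  rpm -ql slurm slurm-slurmd | grep -E '\.so|/s?bin/' > rpm.list
printf '%s\n' /usr/sbin/slurmstepd /usr/lib64/slurm/task_affinity.so > rpm.list

# Stage 2: extra files observed at runtime. On a real system, attach strace to
# the running slurmd while launching a job step, e.g.:
#   strace -f -e trace=openat -p "$(pidof slurmd)"
# and extract the opened paths. Simulated here:
printf '%s\n' /usr/sbin/slurmstepd /lib64/libc.so.6 > strace.list

# Merge and deduplicate both stages into the preload file.
sort -u rpm.list strace.list > compute-preload.nersc
cat compute-preload.nersc
```

The `sort -u` merge means a path caught by both stages (slurmstepd above) appears only once in the final preload list.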
e.g.,

cray_netroot_preload.settings.load.data.label.compute: null
cray_netroot_preload.settings.load.data.compute.targets: []
cray_netroot_preload.settings.load.data.compute.content_lists:
- dist/compute-preload.cray
- dist/compute-preload.nersc
cray_netroot_preload.settings.load.data.compute.size_limit: 0

This is a generally useful technique for preventing remote lookups of commonly accessed files within jobs. I'm not sure how this should best be included in the Slurm documentation, but I think it probably should be.
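One way to sanity-check that a given file is actually being served from the local tmpfs layer, rather than re-resolved over DVS, is to ask stat for the backing filesystem type. This is a sketch; the path to check and the exact type name reported will vary per system:

```shell
# Print the filesystem type backing a path. On a correctly preloaded CLE6
# compute node you would hope to see something like "tmpfs" or "overlayfs"
# for the Slurm binaries, rather than a DVS filesystem type.
fs_type() {
  stat -f -c '%T' "$1"
}

fs_type /tmp   # /tmp is commonly tmpfs on Linux; shown here as a stand-in
```

On a real node you would run it against the preloaded paths, e.g. `fs_type /usr/sbin/slurmstepd`.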
Doug:

At LANL, David Shrader (dshrader@lanl.gov) is the person who has been on point for resolving this issue affecting a couple of code teams. He can give you much more detail on the ramifications.

As an overview, this has bitten us primarily when our code teams are running tens of concurrent jobs on a single node and thousands of total runs in an allocation. We had a couple of issues: 1) Lustre caches being flushed with each srun; and 2) kernel caches being flushed with each srun. #2 created a performance issue where each srun loaded the same dynamic libraries (ParaView in this case). When the kernel cache is not flushed, the first srun loads the libraries and the rest use them too; there is no need to reload them thousands of times.

I encourage you to talk to David. You may also want to involve Michael Jennings (mej@lanl.gov), who is very knowledgeable about Slurm and kernel matters. Hopefully you all, with SchedMD and Cray, can figure out a solution that works for everyone.

Best Regards,
Brett
Any updates on this? I did add a FAQ item based mostly upon Doug's comment #3: https://github.com/SchedMD/slurm/commit/9262408fe95ae64f7a8a53f068dc38cb29ed69af
I'm closing this bug. Feel free to reopen if more information becomes available.