| Summary: | srun on Cray has limits imposed; reported in /sys/class/gni/kgni0/resources | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | S Senator <sts> |
| Component: | Cray ALPS | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | david.gloe, dwg, peltzpl |
| Version: | 17.02.2 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| Site: | LANL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 17.02.5 17.11.0-pre0 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Patch to remove most pkey limits | | |
Description
S Senator
2017-04-18 15:26:41 MDT
Cray Bugzilla 850439 / SFDC case 166486

(In reply to S Senator from comment #0)
> srun commands such as the following:
> srun -N1 cat /sys/class/gni/kgni0/resources
> show memory limits as below. This appears to be an effect of the cray
> plugins which have been specified. (slurm.conf is below)
>
> 1. Why is this limit imposed when running under slurm?
> 2. How is this being triggered?
> 3. What mechanism is available to force it to be unlimited (-1)?
>
> This is a gating item for our transition to slurm on our Cray platforms,
> which is why it is marked as high impact.
>
> --- output of srun -N1 cat /sys/class/gni/kgni0/resources ---
>
> tt-login1$ srun -N1 cat /sys/class/gni/kgni0/resources
<snip>

Bear with me, I've never seen that set of limits before, and I only vaguely understand what they may be doing. I don't think Slurm is directly triggering them - there are zero references to 'kgni' in our source.

Can you run through a few more tests:

- If you're logged in as a normal user account directly on the node, are the limits set there?
- If you disable the JobContainer plugin, does this still happen? That does make some calls into a Cray API that I could see changing things around like this; disabling it would at least isolate that as the cause.

There are a few other suggestions I have for the config, but those would be best saved for a lower-severity config review bug.

- Tim

> - If you're logged in as a normal user account directly on the node, are the limits set there?

No. A normal user account shows the limits as unlimited when logging into the node as root and su-ing to a normal user account.
We'll definitely take you up on the request (at a much lower priority) to review our slurm.conf and related settings in about 5-7 days.

[16:07][peltz@tt-login1] ~ $ srun -N 1 grep Container /etc/opt/slurm/slurm.conf
#JobContainerType=job_container/cncu #?
JobContainerType=job_container/none
[16:07][peltz@tt-login1] ~ $ srun -N1 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 8
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 52 PKey: 0x1a8 JobId: 0x19 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 921 0
CQ 0 509
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 134217728
PCI-IOMMU 0 -1
CE 0 1
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 53 PKey: 0x1a9 JobId: 0x19 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 921 0
CQ 0 509
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 134217728
PCI-IOMMU 0 -1
CE 0 1
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1

(In reply to S Senator from comment #3)
> > - If you're logged in as a normal user account directly on the node, are the limits set there?
>
> No. A normal user account shows as unlimited - when logging into the node as
> root, su-ing to a normal user account.

Do you mind attaching the "normal" output as well?

[16:32][peltz@nid00024] ~ $ cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 8
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1

Going back to your original questions:

> 1. Why is this limit imposed when running under slurm?

It's not something Slurm itself is explicitly asking for; it appears to be the end result of some calls to the Cray APIs to manage the interconnect. I think alpsc_configure_nic() is what is setting up the protection keys automatically.

> 2. How is this being triggered?

If you disable the switch/cray plugin I suspect this will go away. But I think you do want that enabled in production - I believe inter-node communication requires this plugin to work correctly.

> 3. What mechanism is available to force it to be unlimited (-1)?

Can you try running the step with --mem=0? It looks like that may influence the calculated limits passed in to alpsc_configure_nic(). This might be an unintended side effect of not setting memory amounts or limits on the jobs themselves; I'm trying to dig out more info on that now.

Like this? If so, it looks the same for the MDD of 921.

[16:20][peltz@tt-login1] ~ $ srun -N 1 --mem=0 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 6
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 14 PKey: 0x256 JobId: 0x6 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 921 0
CQ 0 509
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 134217728
PCI-IOMMU 0 -1
CE 0 1
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 15 PKey: 0x257 JobId: 0x6 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 921 0
CQ 0 509
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 134217728
PCI-IOMMU 0 -1
CE 0 1
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1

I'm CC'ing David Gloe from Cray, who wrote most of the associated code here; I'm hoping he may be able to shed some light on how these limits are being calculated and established.

Hey David -

As outlined in the bug, LANL is testing a conversion to Native Slurm mode on one of their XC systems, and has raised some questions around how certain limits are set up and established. As best I can tell, the ptag setup through the switch/cray plugin is what would be triggering this, but it's unclear to me (and I haven't found any docs discussing this directly) how to influence the resulting resource limits.

thanks,
- Tim

We have a Cray bug for this same issue (from the same customer, I believe): http://bugzilla.us.cray.com/show_bug.cgi?id=850439

I'll copy my comment from that bug below:

Resource limits are applied under Moab/TORQUE/ALPS, but they are handled a bit differently for SLURM. By default, ALPS uses exclusive node reservations, so it will allocate all the network resources to the job. However, if ALPS is configured with suspend/resume, it will limit the network resources just as you see Slurm does. Slurm does not have the same concept of exclusive reservations as ALPS does, so we always limit resources as if another job could be launched on the node. The amount of resources given is controlled by the CPUs and memory the job is given. You should be able to increase the given network resources using the srun --mem and --mem-per-cpu options.

As for the 3 ptags, I'm not sure what ptag 1 represents. The other two ptags are the actual ones for the srun job. These resources are set by the alpsc_configure_nic function, which is called by the Slurm switch/cray plugin switch_p_job_init function.

David commented on our Cray bug of the same request, "INFO 850439 - kgni resource limits under slurm". I was able to do this and it does increase the MDD limit.

srun -N 1 --mem=126G cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 54
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 62 PKey: 0x402 JobId: 0x21 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3649 0
CQ 0 509
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 134217728
PCI-IOMMU 0 -1
CE 0 1
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 63 PKey: 0x403 JobId: 0x21 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3649 0
CQ 0 509
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 134217728
PCI-IOMMU 0 -1
CE 0 1
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1

However, we would ideally not want to restrict this at all, and not impose any intranode limits, as we do exclusive node scheduling anyway.

Thanks for the quick response there; I suspected there may be an equivalent Cray bug open to discuss this.

Is there some set of values we can feed to alpsc_configure_nic() that would result in the unlimited values LANL desires? I did notice the cpu and memory scaling factors, but it's not clear to me how they're handled on the remote end. Is there a special value, like -1, that can do this? Or would 100 for both accomplish the "unlimited" result?

There aren't any special values; 100 for the CPU and memory scaling factors should give you all of the resources. The issue is that even if a job gets exclusive access to a node, it's possible the user could launch up to 4 sruns on the same node at a time. So giving more than 25% of the total resources could lead to oversubscription. On homogeneous partitions you could set DefMemPerNode/DefMemPerCPU to take all of the node memory. That should give you all the resources.

Slightly different question - what happens if we set 'exclusive' in the call?
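As a sketch of the DefMemPerNode/DefMemPerCPU suggestion above: on a homogeneous partition, a single slurm.conf line can make every job default to the full node memory, which should scale the network limits up the same way the explicit --mem=126G test did. The 126000 MB figure here is purely illustrative; use the real usable memory of your nodes.

```
# slurm.conf fragment - illustrative only; DefMemPerNode is in MB.
# With ~126 GB usable per node, jobs that omit --mem default to all of it.
DefMemPerNode=126000
```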
I'm inferring that it's the second argument to alpsc_configure_nic() from the comment above mentioning:
/*
* Configure the network
*
* I'm setting exclusive flag to zero for now until we can figure out a
* way to guarantee that the application not only has exclusive access
* to the node but also will not be suspended. This may not happen.
*
* Cray shmem still uses the network, even when it's using only one
* node, so we must always configure the network.
*/
It sounds like the potential over-subscription from running multiple steps alongside each other isn't a huge concern to LANL at the moment.
Assuming that either setting the exclusive flag to one, or setting the cpu_scaling and mem_scaling factors to 100, gets this, I should be able to put together a patch allowing that fairly quickly.
If you set exclusive, it looks like you'll get more CQ and CE resources, and IOMMU will be set to unlimited. Other limits should be the same as if you provided scaling = scalingMem = 100.

Created attachment 4388 [details]
Patch to remove most pkey limits
Steve, Paul -
Are you able to test out the attached patch?
It's obviously not a final version - I'd expect to add some config flag to enable this, assuming there are no negative side-effects uncovered in testing.
David's mentioned a few of the downsides to running in this state - I'm assuming that those aren't critical in your environment when always allocating full-nodes.
- Tim
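When testing the patch, it can help to have a quick way of spotting which kgni resources still carry finite limits. The helper below is an illustrative sketch, not part of Slurm or the attached patch; it only assumes the "Name Used Limit [HWM]" column layout shown in the resource dumps quoted in this bug (with the HWM column present only on MDD rows).

```shell
# check_kgni_limits: read "cat /sys/class/gni/kgni0/resources" output on
# stdin and print every per-ptag resource whose Limit is finite (not -1).
check_kgni_limits() {
    awk '
        /^--- PTag:/ { ptag = $3; next }   # start of a new ptag section
        {
            # MDD rows carry a trailing HWM column, so Limit is next-to-last
            # there; on every other row Limit is the last field.
            limit = ($1 == "MDD") ? $(NF - 1) : $NF
            if (ptag != "" && limit ~ /^-?[0-9]+$/ && limit + 0 != -1)
                printf "PTag %s: %s limit %s\n", ptag, $1, limit
        }'
}

# Example run against a fragment of the limited output reported earlier:
check_kgni_limits <<'EOF'
--- PTag: 52 PKey: 0x1a8 JobId: 0x19 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 921 0
CQ 0 509
RDMA 0 -1
EOF
# prints:
# PTag 52: MDD limit 921
# PTag 52: CQ limit 509
```

On a node this would typically be run as `check_kgni_limits < /sys/class/gni/kgni0/resources`; an empty result for the job's ptags means nothing is finitely limited.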
If you run in this mode you could set the gres craynetwork=1 instead of 4, which should limit to one step using the network on each node. This should definitely be controlled by a config flag, because it won't work correctly if preemption or multiple apps per node is used.

With the patch, limits are increased, but not removed completely.

[10:44][peltz@ga-login1] ~ $ srun -N 1 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 6
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 12 PKey: 0x664 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 13 PKey: 0x665 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1

Those limits are identical to those you get from ALPS in exclusive mode:

dgloe@purie:~> aprun cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 22
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 12 PKey: 0x4b55 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 13 PKey: 0x4b56 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
Application 1158418 resources: utime ~0s, stime ~0s, Rss ~4540, inblocks ~0, outblocks ~0

I'm getting different results than you, but I'm using M/T and not just ALPS.
peltz@tr2-fe1:~> msub -I -l nodes=1
qsub: waiting for job 52908.tr2-drm to start
qsub: job 52908.tr2-drm ready
peltz@tr2-login10:~> aprun cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 38
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 524288 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
Application 3003734 resources: utime ~0s, stime ~0s, Rss ~6600, inblocks ~0, outblocks ~0
Even with explicitly stating exclusive mode:
peltz@tr2-login10:~> apstat -svv | grep access
default node access [exclusive]
peltz@tr2-login10:~> aprun -F exclusive cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 38
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 524288 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
Application 3003735 resources: utime ~0s, stime ~0s, Rss ~6600, inblocks ~0, outblocks ~0
(In reply to Paul Peltz from comment #24)
> I'm getting different results than you, but I'm using M/T and not just ALPS.
<snip>

Note that's for PTag 1, not for your application PTags. There's a relatively new ALPS feature where single-node applications don't reserve any network resources at all. I'm guessing you have noNetwork set to 1 in alps.conf. Try checking the resource limits for a two-node application.

Yes, I'm running with noNetwork = 1.
boot-gadget:~ # ssh sdb grep noNetwork /etc/opt/cray/alps/alps.conf
# - noNetwork: if set, one-node apps will _not_ use network resources
noNetwork 1
Multi-Node job:
[12:52][peltz@ga-fe1] ~ $ srun -N2 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 6
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 18 PKey: 0x66a JobId: 0x8 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 19 PKey: 0x66b JobId: 0x8 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 6
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 2097152 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 8 PKey: 0x66a JobId: 0x3 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 9 PKey: 0x66b JobId: 0x3 RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
Sorry about the delay, we had a system issue. Here is what ALPS output looks like on a multi-node job, which is, as you say, the same as Slurm's.

peltz@tr2-login3:~> aprun -N 1 -n 2 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 47
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 524288 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 84 PKey: 0x240 JobId: 0x2c RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 85 PKey: 0x241 JobId: 0x2c RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name Used Limit HWM
MDD 6 -1 52
CQ 5 -1
FMA 1 -1
SFMA 0 -1
RDMA 0 -1
DIRECT 0 -1
IOMMU 524288 1073741824
PCI-IOMMU 0 -1
CE 0 -1
DLA 0 -1
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 84 PKey: 0x240 JobId: 0x2b RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
--- PTag: 85 PKey: 0x241 JobId: 0x2b RefCount: 1 Suspend: Idle ---
Name Used Limit HWM
MDD 0 3686 0
CQ 0 2037
FMA 0 123
SFMA 0 123
RDMA 0 -1
DIRECT 0 -1
IOMMU 0 1073741824
PCI-IOMMU 0 -1
CE 0 4
DLA 0 15360
non-VMDH 0 -1
SMDD Hold 0 -1
Application 3005122 resources: utime ~0s, stime ~0s, Rss ~7200, inblocks ~0, outblocks ~0

To go back to Tim, you said a more complete patch would be forthcoming with a configuration option, right? We are happy with the reported values now shown on the node in the kgni resources. Are there any other consequences we should be aware of with enabling this change?

Thanks,
Paul

(In reply to Paul Peltz from comment #28)
> To go back to Tim, you said a more complete patch would be forthcoming with
> a configuration option, right? We are happy with the reported values now
> shown on node in the kgni resources. Are there any other consequences we
> should be aware of with enabling this change?

I'm trying to find a convenient spot to hide a config flag for it; that's the only tricky bit of this.

As David indicated previously, this could potentially cause problems if running multiple steps simultaneously on the node - although I believe that if the steps themselves avoid oversubscribing the node, that shouldn't cause a problem, and if it does, it's on the end user to worry about.

I'll attach the patch here when finalized, and/or point you to the upstream commit if we don't hold it back until 17.11.

- Tim

Shifting the priority level down, as a workaround is currently in place.

I should have the final version of this ready shortly, and it'll be included in 17.02.4 when released. The option "cray_net_exclusive" in LaunchParameters turns this on.

I will be out of the office for the holidays until Tuesday January 3rd.

I don't see this in the changelog or in the source in the 17.02.4 release. Did it not get included? I have the patch in place still, but I wanted to verify this before building new RPMs.

Unfortunately the patch isn't in yet; we've run into some issues with testing out the final version of the patch on the Cray dev systems, and need a bit more time to verify it. It should be in before 17.02.5.

The patch you're running is still functionally identical to the final version - the only difference is the addition of a flag to LaunchParameters to turn it on/off at runtime.

- Tim

The final version of this has been committed with 23721c4c9e, and will be in 17.02.5 when released. It adds an option of cray_net_exclusive to LaunchParameters; when set, all jobs will be given exclusive access to the node.
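For reference, enabling the fix on 17.02.5 or later is a one-line slurm.conf change; the option name comes from the resolution above, and any other LaunchParameters values already configured at a site would be appended comma-separated.

```
# slurm.conf (Slurm 17.02.5+): give all jobs exclusive NIC access on Cray.
# Only appropriate with exclusive node scheduling, no preemption, and one
# app per node - see the caveats discussed in this bug.
LaunchParameters=cray_net_exclusive
```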