Ticket 3713

Summary: srun on Cray has limits imposed; reported in /sys/class/gni/kgni0/resources
Product: Slurm Reporter: S Senator <sts>
Component: Cray ALPS    Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: david.gloe, dwg, peltzpl
Version: 17.02.2   
Hardware: Cray XC   
OS: Linux   
Site: LANL
Version Fixed: 17.02.5 17.11.0-pre0
Attachments: Patch to remove most pkey limits

Description S Senator 2017-04-18 15:26:41 MDT
srun commands such as the following:
  srun -N1 cat /sys/class/gni/kgni0/resources
show memory limits as below. This appears to be an effect of the Cray plugins we have specified. (slurm.conf is below)

1. Why is this limit imposed when running under slurm?
2. How is this being triggered?
3. What mechanism is available to force it to be unlimited (-1)?

This is a gating item for our transition to slurm on our Cray platforms, which is why it is marked as high impact.

--- output of srun -N1 cat /sys/class/gni/kgni0/resources ---

tt-login1$ srun -N1 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              6
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      2097152         1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 42 PKey: 0xc6 JobId: 0x14 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               921             0
CQ         0               509
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               134217728
PCI-IOMMU  0               -1
CE         0               1
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 43 PKey: 0xc7 JobId: 0x14 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               921             0
CQ         0               509
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               134217728
PCI-IOMMU  0               -1
CE         0               1
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- output of srun -N1 cat /sys/class/gni/kgni0/resources ---

--- slurm.conf ---
# Ansible managed: /var/ansible/distribution/scripts/roles/slurm/templates/slurm.conf.j2 modified on 2017-04-17 09:45:01 by root on tt-drm

#
# (c) Copyright 2015 Cray Inc.  All Rights Reserved.
#
# This file was generated by /home/crayadm/slurm/17.02.1-2/slurm-17.02.1-2/contribs/cray/csm/slurmconfgen_smw.py on Thu Mar  9 16:53:45 2017.
#
# See the slurm.conf man page for more information.
#
ClusterName=trinitite
ControlMachine=tt-sctld1
ControlAddr=192.168.0.22
BackupController=tt-drm
AuthType=auth/munge
CoreSpecPlugin=cray
CryptoType=crypto/munge
GresTypes=craynetwork,hbm
JobContainerType=job_container/cncu
JobSubmitPlugins=cray
KillOnBadExit=1
MpiParams=ports=20000-32767
ProctrackType=proctrack/cray
# Some programming models require unlimited virtual memory
PropagateResourceLimitsExcept=AS
# ReturnToService 2 will let rebooted nodes come back up immediately
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm
SlurmUser=root
StateSaveLocation=/var/spool/slurm
SwitchType=switch/cray
TaskPlugin=task/cray,task/affinity,task/cgroup
#
# Port Settings
#
SlurmctldPort=60001
SlurmdPort=60001
SrunPortRange=60002-64500
#
# Scaling Settings
#
SlurmctldTimeout=300
TreeWidth=27
#
#
# SCHEDULING
DefMemPerNode=353
MaxMemPerNode=128000
SchedulerType=sched/backfill
SchedulerParameters=no_backup_scheduling
SelectType=select/cray
SelectTypeParameters=CR_SOCKET,OTHER_CONS_RES
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=tt-drm
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=debug3
SlurmctldLogFile=/var/spool/slurm/log/slurmctld.log
SlurmdDebug=debug3
SlurmdLogFile=/var/spool/slurm/%h.log
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
CpuFreqDef=performance
#
#
# Burst Buffer Support
BurstBufferType=burst_buffer/cray
#
#
# KNL Features Support
NodeFeaturesPlugins=knl_cray
LaunchParameters=mem_sort # zone_sort plugin
FastSchedule=1
ResumeProgram=/opt/slurm/default/sbin/capmc_resume
SuspendProgram=/opt/slurm/default/sbin/capmc_suspend
SuspendTime=30000000
ResumeTimeout=1800
#
# Debug Flags
#
#DebugFlags=NodeFeatures,Gres,TraceJobs,BurstBuffer,Protocol
DebugFlags=TraceJobs,NO_CONF_HASH
#
#
# COMPUTE NODES
NodeName=nid00[012-047,076-111,140-147,160-179] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:4 Feature=haswell,compute 
NodeName=nid00[192-291] Sockets=1 CoresPerSocket=68 ThreadsPerCore=4 Gres=craynetwork:4 Feature=knl,compute State=UNKNOWN

#
#
# PARTITIONS
PartitionName=standard Nodes=nid00[012-047,076-111,140-147,160-179,192-291] Shared=EXCLUSIVE Priority=1 Default=YES DefaultTime=60 MaxTime=24:00:00 State=UP OverSubscribe=EXCLUSIVE
PartitionName=dst Nodes=nid00[012-047,076-111,140-147,160-179,192-291] Shared=EXCLUSIVE Priority=1 Default=NO DefaultTime=60 MaxTime=24:00:00 State=DOWN OverSubscribe=EXCLUSIVE
PartitionName=dat Nodes=nid00[012-047,076-111,140-147,160-179,192-291] Shared=EXCLUSIVE Priority=1 Default=NO DefaultTime=60 MaxTime=24:00:00 State=DOWN OverSubscribe=EXCLUSIVE
PartitionName=ccm_queue Nodes=nid00[012-047,076-111,140-147,160-179,192-291] Shared=EXCLUSIVE Priority=1 Default=NO DefaultTime=60 MaxTime=24:00:00 State=UP OverSubscribe=EXCLUSIVE

--- slurm.conf ---

$ rpm -a -q | grep slurm
slurm-17.02.2-SSE.1.x86_64
slurm-munge-17.02.2-SSE.1.x86_64
slurm-sql-17.02.2-SSE.1.x86_64
slurm-plugins-17.02.2-SSE.1.x86_64
slurm-devel-17.02.2-SSE.1.x86_64
slurm-slurmdbd-17.02.2-SSE.1.x86_64

$ uname -a
Linux tt-drm 3.12.60-52.63.1.12215.0.PTF.1017941-default #1 SMP Thu Jan 5 05:33:02 UTC 2017 (afd16ea) x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/SuSE-release
SUSE Linux Enterprise Server 12 (x86_64)
VERSION = 12
PATCHLEVEL = 0

$  cat /etc/os-release
NAME="SLES"
VERSION="12"
VERSION_ID="12"
PRETTY_NAME="SUSE Linux Enterprise Server 12"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12"
Comment 1 S Senator 2017-04-18 15:55:23 MDT
Cray Bugzilla 850439 / SFDC case 166486
Comment 2 Tim Wickberg 2017-04-18 15:58:19 MDT
(In reply to S Senator from comment #0)
> srun commands such as the following:
>   srun -N1 cat /sys/class/gni/kgni0/resources
> show memory limits as below. This appears to be an effect of the cray
> plugins which have been specified. (slurm.conf is below)
> 
> 1. Why is this limit imposed when running under slurm?
> 2. How is this being triggered?
> 3. What mechanism is available to force it to be unlimited (-1)?
> 
> This is a gating item for our transition to slurm on our Cray platforms,
> which is why it is marked as high impact.
> 
> --- output of srun -N1 cat /sys/class/gni/kgni0/resources ---
> 
> tt-login1$ srun -N1 cat /sys/class/gni/kgni0/resources

<snip>


Bear with me, I've never seen that set of limits before, and only vaguely understand what they may be doing.

I don't think Slurm is directly triggering them - there are zero references to 'kgni' in our source.

Can you run through a few more tests:

- If you're logged in as a normal user account directly on the node, are the limits set there?

- If you disable the JobContainer plugin, does this still happen? That does make some calls into a Cray API that I could see changing things around like this; disabling it would at least isolate that as the cause.

There are a few other suggestions I have for the config, but those would be best saved for a lower severity config review bug.

- Tim
Comment 3 S Senator 2017-04-18 16:04:02 MDT
> - If you're logged in as a normal user account directly on the node, are the limits set there?

No. A normal user account shows the limits as unlimited when we log into the node as root and su to a normal user account.
Comment 4 S Senator 2017-04-18 16:04:46 MDT
We'll definitely take you up on the request (at a much lower priority) to review our slurm.conf and related settings in about 5-7 days.
Comment 5 Paul Peltz 2017-04-18 16:09:08 MDT
[16:07][peltz@tt-login1] ~ $ srun -N 1 grep Container /etc/opt/slurm/slurm.conf
#JobContainerType=job_container/cncu #?
JobContainerType=job_container/none
[16:07][peltz@tt-login1] ~ $ srun -N1 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              8
CQ         5               -1             
FMA        1               -1             
SFMA       0               -1             
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      2097152         1073741824     
PCI-IOMMU  0               -1             
CE         0               -1             
DLA        0               -1             
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 52 PKey: 0x1a8 JobId: 0x19 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               921             0              
CQ         0               509            
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               134217728      
PCI-IOMMU  0               -1             
CE         0               1              
DLA        0               15360          
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 53 PKey: 0x1a9 JobId: 0x19 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               921             0
CQ         0               509
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               134217728
PCI-IOMMU  0               -1
CE         0               1
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
Comment 6 Tim Wickberg 2017-04-18 16:15:20 MDT
(In reply to S Senator from comment #3)
> > - If you're logged in as a normal user account directly on the node, are the limits set there?
> 
> No. A normal user account shows as unlimited - when logging into the node as
> root, su-ing to a normal user account.

Do you mind attaching the "normal" output as well?
Comment 7 Paul Peltz 2017-04-19 13:42:49 MDT
[16:32][peltz@nid00024] ~ $ cat /sys/class/gni/kgni0/resources                                                           
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM            
MDD        6               -1              8              
CQ         5               -1             
FMA        1               -1             
SFMA       0               -1             
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      2097152         1073741824     
PCI-IOMMU  0               -1             
CE         0               -1             
DLA        0               -1             
non-VMDH   0               -1             
SMDD Hold  0               -1
Comment 8 Tim Wickberg 2017-04-19 14:01:38 MDT
Going back to your original questions:

> 1. Why is this limit imposed when running under slurm?

It's not something Slurm itself is explicitly asking for; it appears to be the end result of some calls to the Cray APIs that manage the interconnect. I think alpsc_configure_nic() is what is setting up the protection keys automatically.

> 2. How is this being triggered?

If you disable the switch/cray plugin I suspect this will go away.

But I think you do want that enabled in production - I believe inter-node communication requires this plugin to work correctly.

> 3. What mechanism is available to force it to be unlimited (-1)?

Can you try running the step with --mem=0? It looks like that may influence the calculated limits that get passed in to alpsc_configure_nic().

This might be an unintended side-effect of not setting memory amounts or limits on the jobs themselves; I'm trying to dig out more info on that now.
Comment 9 Paul Peltz 2017-04-19 16:21:14 MDT
Like this? If so, it looks the same, with the MDD limit still at 921.

[16:20][peltz@tt-login1] ~ $ srun -N 1 --mem=0 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM            
MDD        6               -1              6              
CQ         5               -1             
FMA        1               -1             
SFMA       0               -1             
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      2097152         1073741824     
PCI-IOMMU  0               -1             
CE         0               -1             
DLA        0               -1             
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 14 PKey: 0x256 JobId: 0x6 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               921             0              
CQ         0               509            
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               134217728      
PCI-IOMMU  0               -1             
CE         0               1              
DLA        0               15360          
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 15 PKey: 0x257 JobId: 0x6 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               921             0              
CQ         0               509            
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               134217728      
PCI-IOMMU  0               -1             
CE         0               1              
DLA        0               15360          
non-VMDH   0               -1             
SMDD Hold  0               -1
Comment 13 Tim Wickberg 2017-04-20 11:43:56 MDT
I'm CC'ing David Gloe from Cray who wrote most of the associated code here; I'm hoping he may be able to shed some light on how these limits are being calculated and established.

Hey David - 

As outlined in the bug, LANL is testing a conversion to Native Slurm mode on one of their XC systems, and has raised some questions around how certain limits are set up and established. As best I can tell, the ptag setup through the switch/cray plugin is what would be triggering this, but it's unclear to me (and I haven't found any docs discussing this directly) how to influence the resulting resource limits.

thanks,
- Tim
Comment 14 David Gloe 2017-04-20 11:55:12 MDT
We have a Cray bug for this same issue (from the same customer I believe):
http://bugzilla.us.cray.com/show_bug.cgi?id=850439

I'll copy my comment from that bug below:

Resource limits are applied under Moab/TORQUE/ALPS, but they are handled a bit differently for SLURM.

By default, ALPS uses exclusive node reservations, so it will allocate all the network resources to the job. However, if ALPS is configured with suspend/resume, it will limit the network resources just as you see Slurm does.

Slurm does not have the same concept of exclusive reservations as ALPS does, so we always limit resources as if another job could be launched on the node. The amount of network resources granted is controlled by the CPUs and memory allocated to the job. You should be able to increase the granted network resources using the srun --mem and --mem-per-cpu options.

As for the 3 ptags, I'm not sure what ptag 1 represents. The other two ptags are the actual ones for the srun job.

These resources are set by the alpsc_configure_nic function, which is called by the Slurm switch/cray plugin switch_p_job_init function.
Comment 15 Paul Peltz 2017-04-20 12:06:57 MDT
David commented on our Cray bug of the same request. "INFO 850439 - kgni resource limits under slurm"

I was able to do this and it does increase the MDD limit.

srun -N 1 --mem=126G cat /sys/class/gni/kgni0/resources

--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM            
MDD        6               -1              54             
CQ         5               -1             
FMA        1               -1             
SFMA       0               -1             
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      2097152         1073741824     
PCI-IOMMU  0               -1             
CE         0               -1             
DLA        0               -1             
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 62 PKey: 0x402 JobId: 0x21 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               3649            0              
CQ         0               509            
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               134217728      
PCI-IOMMU  0               -1             
CE         0               1              
DLA        0               15360          
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 63 PKey: 0x403 JobId: 0x21 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               3649            0              
CQ         0               509            
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               134217728      
PCI-IOMMU  0               -1             
CE         0               1              
DLA        0               15360          
non-VMDH   0               -1
SMDD Hold  0               -1

However, we would ideally prefer not to restrict this at all and to impose no intra-node limits, as we do exclusive node scheduling anyway.
Comment 16 Tim Wickberg 2017-04-20 12:10:47 MDT
Thanks for the quick response there; I suspected there may be an equivalent Cray bug open to discuss this.

Is there some set of values we can feed to alpsc_configure_nic() that would result in the unlimited values LANL desires? I did notice the cpu and memory scaling factors, but it's not clear to me how they're handled on the remote end. Is there a special value, like -1, that can do this? Or would 100 for both accomplish the "unlimited" result?
Comment 17 David Gloe 2017-04-20 12:22:56 MDT
There aren't any special values, 100 for the CPU and memory scaling factors should give you all of the resources.

The issue is that even if a job gets exclusive access to a node, the user could launch up to 4 sruns on the same node at a time. So giving more than 25% of the total resources to each step could lead to oversubscription.

On homogeneous partitions you could set DefMemPerNode/DefMemPerCPU to take all of the node memory. That should give you all the resources.
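David's suggestion could be sketched as a slurm.conf change along the following lines. This is a hypothetical fragment, not from the ticket; the 128000 MB value simply mirrors the MaxMemPerNode already set in the slurm.conf above:

```
# Hypothetical: default jobs to all of a node's memory so the derived
# network-resource scaling factors reach 100% (this system currently has
# DefMemPerNode=353 and MaxMemPerNode=128000).
DefMemPerNode=128000
```

On a homogeneous partition this should make each job's memory allocation, and hence the computed kgni limits, cover the whole node.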
Comment 18 Tim Wickberg 2017-04-20 12:39:08 MDT
Slightly different question - what happens if we set 'exclusive' in the call?

I'm inferring that it's the second argument to alpsc_configure_nic() from the comment above mentioning:

        /*
         * Configure the network
         *
         * I'm setting exclusive flag to zero for now until we can figure out a
         * way to guarantee that the application not only has exclusive access
         * to the node but also will not be suspended.  This may not happen.
         *
         * Cray shmem still uses the network, even when it's using only one
         * node, so we must always configure the network.
         */


It sounds like the potential over-subscription from running multiple steps alongside each other isn't a huge concern to LANL at the moment.

Assuming that either setting the exclusive flag to one, or setting the cpu_scaling and mem_scaling factors to 100, accomplishes this, I should be able to put together a patch allowing that fairly quickly.
Comment 19 David Gloe 2017-04-20 13:04:05 MDT
If you set exclusive it looks like you'll get more CQ and CE resources, and IOMMU will be set to unlimited.

Other limits should be the same as if you provided scaling = scalingMem = 100.
Comment 20 Tim Wickberg 2017-04-20 13:12:02 MDT
Created attachment 4388 [details]
Patch to remove most pkey limits

Steve, Paul -

Are you able to test out the attached patch?

It's obviously not a final version - I'd expect to add some config flag to enable this, assuming there are no negative side-effects uncovered in testing.

David's mentioned a few of the downsides to running in this state - I'm assuming that those aren't critical in your environment when always allocating full-nodes.

- Tim
Comment 21 David Gloe 2017-04-20 13:23:09 MDT
If you run in this mode you could set the gres craynetwork=1 instead of 4, which should limit to one step using the network on each node.

This should definitely be controlled by a config flag, because it won't work correctly if preemption or multiple apps per node is used.
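As a sketch of David's craynetwork suggestion, the node definitions from the slurm.conf above would change along these lines (hypothetical, untested):

```
# Hypothetical: advertise one craynetwork gres per node instead of four,
# limiting the network to one step per node once the pkey limits are removed.
NodeName=nid00[012-047,076-111,140-147,160-179] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=craynetwork:1 Feature=haswell,compute
```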
Comment 22 Paul Peltz 2017-04-21 10:45:46 MDT
With the patch, limits are increased, but not removed completely.

[10:44][peltz@ga-login1] ~ $ srun -N 1 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              6
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      2097152         1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 12 PKey: 0x664 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 13 PKey: 0x665 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
Comment 23 David Gloe 2017-04-21 10:53:34 MDT
Those limits are identical to those you get from ALPS in exclusive mode:

dgloe@purie:~> aprun cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM            
MDD        6               -1              22             
CQ         5               -1             
FMA        1               -1             
SFMA       0               -1             
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      2097152         1073741824     
PCI-IOMMU  0               -1             
CE         0               -1             
DLA        0               -1             
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 12 PKey: 0x4b55 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               3686            0              
CQ         0               2037           
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               1073741824     
PCI-IOMMU  0               -1             
CE         0               4              
DLA        0               15360          
non-VMDH   0               -1             
SMDD Hold  0               -1             
--- PTag: 13 PKey: 0x4b56 JobId: 0x5 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM            
MDD        0               3686            0              
CQ         0               2037           
FMA        0               123            
SFMA       0               123            
RDMA       0               -1             
DIRECT     0               -1             
IOMMU      0               1073741824     
PCI-IOMMU  0               -1             
CE         0               4              
DLA        0               15360          
non-VMDH   0               -1             
SMDD Hold  0               -1             
Application 1158418 resources: utime ~0s, stime ~0s, Rss ~4540, inblocks ~0, outblocks ~0
Comment 24 Paul Peltz 2017-04-21 11:32:38 MDT
I'm getting different results than you, but I'm using Moab/TORQUE and not just ALPS.

peltz@tr2-fe1:~> msub -I -l nodes=1
qsub: waiting for job 52908.tr2-drm to start
qsub: job 52908.tr2-drm ready

peltz@tr2-login10:~> aprun cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              38
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      524288          1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
Application 3003734 resources: utime ~0s, stime ~0s, Rss ~6600, inblocks ~0, outblocks ~0

Even with explicitly stating exclusive mode:

peltz@tr2-login10:~> apstat -svv | grep access
        default node access [exclusive]
peltz@tr2-login10:~> aprun -F exclusive cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              38
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      524288          1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
Application 3003735 resources: utime ~0s, stime ~0s, Rss ~6600, inblocks ~0, outblocks ~0
Comment 25 David Gloe 2017-04-21 11:38:36 MDT
(In reply to Paul Peltz from comment #24)
> I'm getting different results than you, but I'm using M/T and not just ALPS.

<snip>

Note that's for PTag 1, not for your application PTags. There's a relatively new ALPS feature where single node applications don't reserve any network resources at all. I'm guessing you have noNetwork set to 1 in alps.conf. Try checking the resource limits for a two node application.
Comment 26 Paul Peltz 2017-04-21 12:52:56 MDT
Yes, I'm running with noNetwork = 1.

boot-gadget:~ # ssh sdb grep noNetwork /etc/opt/cray/alps/alps.conf
# - noNetwork: if set, one-node apps will _not_ use network resources
        noNetwork       1

Multi-Node job:

[12:52][peltz@ga-fe1] ~ $ srun -N2 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              6
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      2097152         1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 18 PKey: 0x66a JobId: 0x8 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 19 PKey: 0x66b JobId: 0x8 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              6
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      2097152         1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 8 PKey: 0x66a JobId: 0x3 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 9 PKey: 0x66b JobId: 0x3 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
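For anyone comparing these dumps by hand, here is a small parsing sketch (Python; not part of Slurm or CLE, purely an illustration of the column layout shown above) that pulls out the resources with finite limits for a given PTag block:

```python
def finite_limits(text):
    """Return {resource: limit} for rows whose Limit column is not -1.

    Data rows look like 'NAME USED LIMIT [HWM]'; NAME may contain a
    space (e.g. 'SMDD Hold'), so peel numeric tokens off the tail.
    """
    out = {}
    for line in text.splitlines():
        tokens = line.split()
        nums = []
        while tokens and tokens[-1].lstrip('-').isdigit():
            nums.append(int(tokens.pop()))
        # '--- PTag: ...' banners and the 'Name Used Limit HWM' header
        # leave fewer than two trailing numbers; skip them.
        if len(nums) < 2 or not tokens:
            continue
        # nums is tail-first: [HWM, LIMIT, USED] or [LIMIT, USED]
        limit = nums[1] if len(nums) == 3 else nums[0]
        if limit != -1:
            out[' '.join(tokens)] = limit
    return out
```

Applied to the PTag 18 block above, this flags MDD, CQ, FMA, SFMA, IOMMU, CE, and DLA as limited.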
Comment 27 Paul Peltz 2017-04-21 14:36:24 MDT
Sorry about the delay; we had a system issue. Here is what the ALPS output looks like on a multi-node job, which, as you say, is the same as under Slurm.

peltz@tr2-login3:~> aprun -N 1 -n 2 cat /sys/class/gni/kgni0/resources
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              47
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      524288          1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 84 PKey: 0x240 JobId: 0x2c RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 85 PKey: 0x241 JobId: 0x2c RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
Name       Used            Limit           HWM
MDD        6               -1              52
CQ         5               -1
FMA        1               -1
SFMA       0               -1
RDMA       0               -1
DIRECT     0               -1
IOMMU      524288          1073741824
PCI-IOMMU  0               -1
CE         0               -1
DLA        0               -1
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 84 PKey: 0x240 JobId: 0x2b RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
--- PTag: 85 PKey: 0x241 JobId: 0x2b RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               3686            0
CQ         0               2037
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               1073741824
PCI-IOMMU  0               -1
CE         0               4
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
Application 3005122 resources: utime ~0s, stime ~0s, Rss ~7200, inblocks ~0, outblocks ~0
Comment 28 Paul Peltz 2017-04-21 14:39:49 MDT
To go back to Tim, you said a more complete patch would be forthcoming with a configuration option, right?  We are happy with the reported values now shown on the node in the kgni resources.  Are there any other consequences we should be aware of when enabling this change?

Thanks,

Paul
Comment 29 Tim Wickberg 2017-05-03 07:10:05 MDT
(In reply to Paul Peltz from comment #28)
> To go back to Tim, you said a more complete patch would be forthcoming with
> a configuration option, right?  We are happy with the reported values now
> shown on node in the kgni resources.  Are there any other consequences we
> should be aware of with enabling this change?

I'm trying to find a convenient spot to hide a config flag for it, that's the only tricky bit of this.

As David indicated previously, this could potentially cause problems when running multiple steps simultaneously on a node. That said, I believe that as long as the steps themselves avoid oversubscribing the node's network resources this shouldn't cause a problem, and if they do, it's on the end user to manage.

I'll attach the patch here when finalized, and/or point you to the upstream commit if we don't hold it back until 17.11.

- Tim
Comment 33 Tim Wickberg 2017-05-17 10:37:45 MDT
Shifting the priority level down, as a workaround is currently in place.

I should have the final version of this ready shortly, and it'll be included in 17.02.4 when released. Setting "cray_net_exclusive" in LaunchParameters turns this on.
Comment 35 David Gloe 2017-05-19 15:21:02 MDT
I will be out of the office for the holidays until Tuesday January 3rd.
Comment 42 Paul Peltz 2017-06-07 11:16:40 MDT
I don't see this in the changelog or in the source in the 17.02.4 release.  Did it not get included?  I have the patch in place still, but I wanted to verify this before building new RPMs.
Comment 43 Tim Wickberg 2017-06-07 11:25:16 MDT
Unfortunately the patch isn't in yet; we've run into some issues with testing out the final version of the patch on the Cray dev systems, and need a bit more time to verify it. It should be in before 17.02.5.

The patch you're running is still functionally identical to the final version - the only difference is the addition of a flag to LaunchParameters to turn it on/off at runtime.

- Tim
Comment 47 Tim Wickberg 2017-06-13 14:04:54 MDT
The final version of this has been committed with 23721c4c9e, and will be in 17.02.5 when released.

It adds a cray_net_exclusive option to LaunchParameters; when set, all jobs will be given exclusive access to the node.
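For reference, enabling this on 17.02.5 or later would look along these lines in slurm.conf (a minimal sketch; only the cray_net_exclusive flag comes from the commit above — any other LaunchParameters values are site-specific):

```
# slurm.conf (17.02.5+)
# Give every job exclusive access to the node's Aries network resources,
# removing the per-PKey limits reported in /sys/class/gni/kgni0/resources.
LaunchParameters=cray_net_exclusive
```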