Ticket 9002

Summary: Unnecessary error regarding pam_slurm_adopt, which is not enabled in our slurm.conf
Product: Slurm
Reporter: Lawrence Wu <lawrence.wu>
Component: Configuration
Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: cinek
Version: 19.05.5
Hardware: Linux
OS: Linux
Site: FRB

Description Lawrence Wu 2020-05-06 15:49:15 MDT
For some reason, we are getting an unnecessary error from sinfo, slurmd, and slurmctld that does not appear to apply to our slurm.conf configuration.

$ sinfo
sinfo: error: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray_aries is required.  If not using pam_slurm_adopt, please ignore error.
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST

Our slurm.conf is as follows. We compile Slurm 19.05.5 to an NFS share at /mnt/vol_apps.

###################################################################################
# slurm.conf -- General Slurm configuration information for the 'cn-001'
# cluster

######################################
### Partition / Cluster definition ###
######################################
# - Nodes available to the Cluster
# - Assignment of nodes to each partition
# - Sharing
# - Priority/preemption
###

ClusterName=cn-001

# Job Controllers, primary and backup
SlurmctldHost=jc01
SlurmctldHost=jc02

# Node definitions: output can be obtained from `slurmd -C`; append `State=UNKNOWN` at the end for the initial state
#
# CPUs:    Number of logical processors on the node, default to Sockets*CoresPerSocket*ThreadsPerCore
# Sockets: Number of physical processor sockets/chips on the node
# CoresPerSocket: Number of cores in a single physical processor socket
# ThreadsPerCore: Number of logical threads in a single physical core
# RealMemory:     Size of real memory on the node in megabytes
# State:   State of the node with respect to the initiation of user jobs.
#          Acceptable values are "CLOUD", "DOWN", "DRAIN", "FAIL", "FAILING", "FUTURE" and "UNKNOWN"
# Production First Generation Compute Cluster: General purpose VMs with 8 vCPUs and 64G of ram.
NodeName=cn001p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn002p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn003p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn004p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn005p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn006p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn007p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn008p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn009p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn010p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn011p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn012p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn013p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn014p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn015p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn016p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn017p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn018p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn019p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
NodeName=cn020p CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41615
# Production Second Generation Compute Cluster
NodeName=cnp21 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp22 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp23 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp24 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp25 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
# Production Third Generation Compute Cluster
NodeName=cnp26 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp27 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp28 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp29 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
NodeName=cnp30 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=128755 TmpDisk=41675
# Production GPU Compute Cluster: Bare hardware, 768G of RAM and 1 "pascal" series GPU card.
NodeName=c101p CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=770000 TmpDisk=49975
NodeName=c102p CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=770000 TmpDisk=49975
NodeName=c103p CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=770000 TmpDisk=49975
NodeName=c104p CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=770000 TmpDisk=49975
NodeName=c105p CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=770000 TmpDisk=49975

# Partition definitions picking up previously defined nodes. Default=YES selects the default partition.
# Production SLURM compute nodes
PartitionName=none         Nodes=cn001p,cn002p,cn003p,cn004p,cn005p,cn006p,cn007p,cn008p,cn009p,cn010p,cn011p,cn012p,cn013p,cn014p,cn015p,cn016p,cn017p,cn018p,cn019p,cn020p,cnp21,cnp22,cnp23,cnp24,cnp25,cnp26,cnp27,cnp28,cnp29,cnp30 MaxTime=INFINITE State=UP Default=YES
# Production GPU SLURM compute nodes
PartitionName=gpu1         Nodes=c101p,c102p,c103p,c104p,c105p MaxTime=INFINITE State=UP

### Constant Settings

# Authentication method, munge is primarily what is supported
AuthType=auth/munge
CryptoType=crypto/munge

DisableRootJobs=YES
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/log/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/log/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/log/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/mnt/vol_apps/slurm/clusters/cn-001/state
SwitchType=switch/none
TaskPlugin=task/none
TaskPluginParam=Sched
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core

PrologFlags=x11

 # make this into a future slurmdbd-compatible boolean, but for now it is text-file based
### Job Accounting options for text file storage on nfs, maybe sufficient with splunk import
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/mnt/vol_apps/slurm/clusters/cn-001/accounting.log
AccountingStoreJobComment=YES
 # add slurmdbd in the future

### Accounting options for future slurmdbd mysql use
#AccountingStoreJobComment=YES
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageUser=

### Job options
 # make this into a future slurmdbd-compatible boolean, but for now it is text-file based
### Job completion accounting options for text file storage on nfs, maybe sufficient with splunk import
JobCompType=jobcomp/filetxt
JobCompLoc=/mnt/vol_apps/slurm/clusters/cn-001/jobcomp.log
JobAcctGatherType=jobacct_gather/linux     # use linux job accounting gather for resource use per job
 # add slurmdbd in the future
JobAcctGatherFrequency=30
JobCheckpointDir=/mnt/vol_apps/slurm/clusters/cn-001/checkpoint
SlurmctldDebug=6
SlurmctldLogFile=/mnt/vol_apps/slurm/clusters/cn-001/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log



A possibly related issue was reportedly fixed in 19.05.3, but we are running 19.05.5. Could it be the same problem?

https://bugs.schedmd.com/show_bug.cgi?id=6824

Sincerely,
Lawrence Wu
Comment 2 Marshall Garey 2020-05-11 12:05:25 MDT
You're getting this error because of this combination:

ProctrackType=proctrack/pgid
PrologFlags=x11

PrologFlags=x11 implies PrologFlags=contain, so this error actually does apply to your configuration (slurm.conf man page: https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags).
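
For illustration, either of the following slurm.conf changes would make the configuration consistent with that check. This is only a sketch: which option fits depends on whether you actually need X11 forwarding, and note that proctrack/cgroup additionally requires a cgroup.conf on the compute nodes.

# Option 1: keep PrologFlags=x11, switch to a proctrack plugin that supports containment
ProctrackType=proctrack/cgroup
PrologFlags=x11

# Option 2: keep proctrack/pgid, drop X11 forwarding if it is not needed
ProctrackType=proctrack/pgid
#PrologFlags=x11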

However, this error created extra noise in client commands, so in commit 1d33536d8ed3196a we limited it to slurmctld. That commit is only in 20.02, but you can cherry-pick it so the error appears only in slurmctld.log, where you can safely ignore it.
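
If you build from the SchedMD git tree, the backport would look something like this (a sketch; the checkout you start from and the configure options are site-specific, and the install prefix below is only illustrative based on your description):

$ cd slurm                            # your 19.05.5 source checkout
$ git cherry-pick 1d33536d8ed3196a    # limits the pam_slurm_adopt error to slurmctld
$ ./configure --prefix=/mnt/vol_apps && make && make install   # rebuild/reinstall as usual

If you build from a release tarball instead, you can export the same commit as a patch (git format-patch -1 1d33536d8ed3196a) and apply it with patch -p1 before building.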

Does that answer your question?
Comment 3 Marshall Garey 2020-05-19 08:42:33 MDT
Closing as infogiven.