Ticket 7167 - No nodes detected after cluster reboot - nodelist on PartitionConfig pending status
Summary: No nodes detected after cluster reboot - nodelist on PartitionConfig pending status
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 17.11.5
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-05-31 20:16 MDT by pascaloupsu
Modified: 2019-06-03 10:24 MDT

See Also:
Site: -Other-


Description pascaloupsu 2019-05-31 20:16:46 MDT
Hello everyone! 

I just restarted my cluster (head node, then all my compute nodes).
When I now submit a job, it stays pending and squeue shows (PartitionConfig) in the NODELIST(REASON) column:

************************************************************************
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
19947    shortq  paleale pnrouxel PD       0:00      2 (PartitionConfig)
************************************************************************
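
The full pending reason for that job can also be dumped with scontrol (job id taken from the squeue output above):

scontrol show job 19947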


*****************************************************************************
sinfo seems to confirm that the partitions are not set up properly; no nodes are listed in any of them:

[root@main ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
shortq*      up 2-00:00:0      0    n/a
longq        up 5-00:00:0      0    n/a
interq       up   12:00:00     0    n/a
*****************************************************************************
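
A minimal way to check what the controller actually has on record for the partitions and nodes (node name taken from the NodeName line in the slurm.conf below):

scontrol show partition shortq
scontrol show node node001
sinfo -N -l     # node-oriented view with per-node state
sinfo -R        # reasons for any down/drained nodes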


*****************************************************************************
I can extract the nodes' IP addresses and ping them, so the machines themselves are reachable. It looks like something went wrong in the Slurm configuration.
*****************************************************************************
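
To check whether a compute node can actually talk to the controller, and what hardware slurmd detects locally, one can run on a compute node (node001 here):

scontrol ping    # does the controller (ControlMachine=rdfmg) answer?
slurmd -C        # print the node configuration slurmd detects locally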


*****************************************************************************
Here is the status of the daemons:
-----------------------------------------------------------------------------
[root@main ~]# systemctl status -l slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-05-31 21:36:18 EDT; 25min ago
  Process: 11621 ExecStart=/cm/shared/apps/slurm/17.11.12/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 11623 (slurmctld)
   Memory: 5.1M
   CGroup: /system.slice/slurmctld.service
           └─11623 /cm/shared/apps/slurm/17.11.12/sbin/slurmctld
-----------------------------------------------------------------------------
[root@node001 ~]# systemctl status -l slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-05-31 21:33:33 EDT; 30min ago
  Process: 3706 ExecStart=/cm/shared/apps/slurm/17.11.12/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3709 (slurmd)
   Memory: 8.1M
   CGroup: /system.slice/slurmd.service
           └─3709 /cm/shared/apps/slurm/17.11.12/sbin/slurmd
-----------------------------------------------------------------------------
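
The daemon logs should say why the nodes never registered. With the SlurmctldLogFile/SlurmdLogFile paths from the slurm.conf below, a quick look would be:

# on the head node
tail -n 50 /var/log/slurmctld
# on a compute node
tail -n 50 /var/log/slurmd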


*****************************************************************************
1- The slurmctld service is listed as "disabled". Could that be the reason (even though it is reported as active right after)?

2-a Should I restart the daemons?

2-b If yes, how do I restart them?
I tried restarting the slurmctld daemon like this:
systemctl stop slurmctld
systemctl start slurmctld

but that did not fix anything.
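
For reference, the usual order when restarting by hand (using the systemd units shown above) is controller first, then the node daemons, then re-check registration:

# head node
systemctl restart slurmctld
# on each compute node (node001..node014)
systemctl restart slurmd
# back on the head node
sinfo -N
scontrol show node node001

For pure slurm.conf edits, scontrol reconfigure can be used instead of a full restart.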


*****************************************************************************
Here is my slurm.conf file:
*****************************************************************************
#
# See the slurm.conf man page for more information.
#

ClusterName=SLURM_CLUSTER
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
PlugStackConfig=/etc/slurm/plugstack.conf.d/
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd

#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log

#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=rdfmg
ControlAddr=rdfmg
AccountingStorageHost=rdfmg
# Nodes
NodeName=node[001-014]
# Partitions
PartitionName=shortq Default=YES MinNodes=1 DefaultTime=1-00:00:00 MaxTime=10-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP
PartitionName=longq Default=NO MinNodes=1 DefaultTime=2-00:00:00 MaxTime=20-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP
PartitionName=interq Default=NO MinNodes=1 DefaultTime=06:00:00 MaxTime=12:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
MaxTasksPerNode = 64
DefMemPerCPU = 2014
MaxMemPerCPU = 2014
MaxArraySize = 16384
****************************************************************************
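
If the nodes do register but come back DOWN or DRAINED after the reboot, they can be returned to service manually (node range taken from the NodeName line above):

scontrol update NodeName=node[001-014] State=RESUME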



Thank you
Comment 2 pascaloupsu 2019-06-01 07:55:41 MDT
I enabled slurmctld with:
systemctl enable slurmctld

It did not make any difference.
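
Since slurm.conf sets AuthType=auth/munge, one more sanity check is whether munge credentials generated on the head node are accepted on a compute node (assuming ssh access to node001):

munge -n | ssh node001 unmunge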
Comment 3 Jacob Jenson 2019-06-03 10:24:37 MDT
Pascal,

SchedMD has a professional services Slurm support team. This team can help you resolve this issue. 

Before this team can help resolve support issues, we would need to put a Slurm support contract in place for NCSU. Would you like to set up a call to talk about the Slurm support options SchedMD offers?

Thank you,
Jacob