| Summary: | No nodes detected after cluster reboot - nodelist on PartitionConfig pending status | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | pascaloupsu <pnrouxel> |
| Component: | Configuration | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
I enabled slurmctld with:

```
systemctl enable slurmctld
```

It did not make any difference.

---

Pascal,

SchedMD has a professional services Slurm support team that can help you resolve this issue. Before this team can help with support issues, we would need to put a Slurm support contract in place for NCSU. Would you like to set up a call to talk about the Slurm support options SchedMD offers?

Thank you,
Jacob
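As a side note on reading this state: the `(PartitionConfig)` string in the NODELIST(REASON) column is squeue's pending reason. A minimal, self-contained sketch of pulling the job id and reason for pending jobs out of squeue's default output (the sample data is copied from this report, since only the field positions matter; in real use you would pipe `squeue` itself):

```shell
# Minimal sketch: extract pending jobs and their pending reasons from
# squeue's default output. Sample output is taken from this report so the
# pipeline is self-contained; field 5 is the job state (ST), field 8 the
# NODELIST(REASON) column.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
19947 shortq paleale pnrouxel PD 0:00 2 (PartitionConfig)'
printf '%s\n' "$sample" | awk '$5 == "PD" { print $1, $8 }'
# prints: 19947 (PartitionConfig)
```

Against a live cluster the same filter would be `squeue | awk '$5 == "PD" { print $1, $8 }'`.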
Hello everyone!

I just restarted my cluster (head node, then all compute nodes). When I schedule a job, it stays pending with `(PartitionConfig)` in the NODELIST(REASON) column:

```
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
19947    shortq  paleale pnrouxel PD       0:00      2 (PartitionConfig)
```

sinfo seems to show that the partitioning is indeed not properly set up:

```
[root@main ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
shortq*      up 2-00:00:0       0    n/a
longq        up 5-00:00:0       0    n/a
interq       up   12:00:00      0    n/a
```

I managed to extract the nodes' IP addresses and can ping them. It seems like something went wrong in the Slurm configuration.

Here are the statuses of the daemons:

```
[root@main ~]# systemctl status -l slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-05-31 21:36:18 EDT; 25min ago
  Process: 11621 ExecStart=/cm/shared/apps/slurm/17.11.12/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 11623 (slurmctld)
   Memory: 5.1M
   CGroup: /system.slice/slurmctld.service
           └─11623 /cm/shared/apps/slurm/17.11.12/sbin/slurmctld

[root@node001 ~]# systemctl status -l slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-05-31 21:33:33 EDT; 30min ago
  Process: 3706 ExecStart=/cm/shared/apps/slurm/17.11.12/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 3709 (slurmd)
   Memory: 8.1M
   CGroup: /system.slice/slurmd.service
           └─3709 /cm/shared/apps/slurm/17.11.12/sbin/slurmd
```

My questions:

1. The slurmctld unit is disabled; could that be the reason (although the next line says it is active)?
2. a) Should I restart the daemons? b) If yes, how do I restart them? I already tried restarting slurmctld like this:

```
systemctl stop slurmctld
systemctl start slurmctld
```

but this did not fix anything.

Here is my slurm.conf file:

```
#
# See the slurm.conf man page for more information.
#
ClusterName=SLURM_CLUSTER
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
PlugStackConfig=/etc/slurm/plugstack.conf.d/
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=rdfmg
ControlAddr=rdfmg
AccountingStorageHost=rdfmg
# Nodes
NodeName=node[001-014]
# Partitions
PartitionName=shortq Default=YES MinNodes=1 DefaultTime=1-00:00:00 MaxTime=10-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP
PartitionName=longq Default=NO MinNodes=1 DefaultTime=2-00:00:00 MaxTime=20-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP
PartitionName=interq Default=NO MinNodes=1 DefaultTime=06:00:00 MaxTime=12:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION -- DO NOT REMOVE
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
MaxTasksPerNode = 64
DefMemPerCPU = 2014
MaxMemPerCPU = 2014
MaxArraySize = 16384
```

Thank you
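One detail worth noting in the slurm.conf above: `NodeName=node[001-014]` defines the nodes, but none of the three `PartitionName` lines carries a `Nodes=` list. That matches sinfo reporting 0 nodes in every partition and would be consistent with jobs pending on `PartitionConfig`. A hedged sketch of what the partition definitions might need, assuming the `node[001-014]` range from the NodeName line is what all three partitions should contain (on this cmd-managed cluster, the autogenerated section should normally be regenerated by cmd rather than edited by hand; the extra parameters from the original lines are elided here for brevity):

```
# Sketch only -- each partition needs an explicit Nodes= list to contain nodes.
# node[001-014] is taken from the NodeName line in this slurm.conf.
PartitionName=shortq Default=YES Nodes=node[001-014] MinNodes=1 DefaultTime=1-00:00:00 MaxTime=10-00:00:00 State=UP
PartitionName=longq  Default=NO  Nodes=node[001-014] MinNodes=1 DefaultTime=2-00:00:00 MaxTime=20-00:00:00 State=UP
PartitionName=interq Default=NO  Nodes=node[001-014] MinNodes=1 DefaultTime=06:00:00 MaxTime=12:00:00 State=UP
```

After changing slurm.conf, `scontrol reconfigure` on the head node (or restarting slurmctld) makes the controller reread it; `scontrol show partition` then shows which nodes each partition actually contains.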