| Summary: | job is pending | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | pclink |
| Component: | User Commands | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.7 | ||
| Hardware: | Other | ||
| OS: | Linux | ||
| Site: | ERI | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | TestHeadNode01 | CLE Version: | |
| Version Fixed: | 20.2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm configuration, slurmctld | | |
Description
pclink
2021-06-28 05:15:28 MDT
Sorry, the Slurm version is 20.02.7.

Tim McMullan
Would you mind attaching your slurm.conf and the output of "sinfo", "sprio", and "scontrol show job $jobid", where $jobid is the id of the job that is stuck? Thanks!
--Tim

pclink
[root@TestHeadNode01 ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 defq Job29 root PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
[root@TestHeadNode01 ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 0 n/a
[root@TestHeadNode01 ~]# sprio
You are not running a supported priority plugin
(priority/basic).
Only 'priority/multifactor' is supported.
[root@TestHeadNode01 ~]# scontrol show job 7
JobId=7 JobName=Job29
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901755 Nice=0 Account=root QOS=normal
JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2021-06-28T13:58:11 EligibleTime=2021-06-28T13:58:11
AccrueTime=2021-06-28T13:58:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-29T12:54:17
Partition=defq AllocNode:Sid=TestHeadNode01:2579109
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/root//Job29/communicatingJobWrapper.sh
WorkDir=/root
StdErr=/home/root//Job29/Job29.log
StdIn=/dev/null
StdOut=/home/root//Job29/Job29.log
Power=
MailUser=root MailType=NONE
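The sprio error earlier in this session is expected with the default scheduler setup: sprio only works when the multifactor priority plugin is active. A minimal slurm.conf fragment enabling it, as a sketch using standard Slurm parameter names (not something this cluster necessarily needs):

```
# slurm.conf — switch from the default priority/basic to multifactor,
# which is required for sprio to report job priorities
PriorityType=priority/multifactor
```

With priority/basic, job priority is simply derived from submit order, which is why sprio has nothing to report.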
pclink
Created attachment 20146 [details]
slurm configuration

Tim McMullan
Thank you for the info! It looks like something is up with the partition, since it is showing 0 nodes. Can you send the output of "scontrol show node" as well as the slurmctld log? Thanks!
--Tim

pclink
[root@TestHeadNode01 ~]# scontrol show node
No nodes in the system

Created attachment 20148 [details]
slurmctld
Tim McMullan
Thanks for the logs! It appears that your nodes' real configuration and the one expected in slurm.conf don't match, which is preventing the nodes from showing up in the system. On one of the nodes (assuming they are identical), please run "slurmd -C". This prints what the node actually looks like, and its output can be used on the "NodeName=" line to define the node properly. The output should look something like:

> root@debnode0:~# slurmd -C
> NodeName=debnode0 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7977

which can be substituted into the Slurm config to get the nodes to register:

> NodeName=debnode[0-5] CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7977

It also looks like the config differs between the machines, so please make sure the config file is synced, and restart the slurmctld AND slurmd processes so they all load the new file. Let me know if this changes the situation any!
--Tim

pclink
Thanks for your response. Kindly note that our cluster is managed by Bright Cluster Manager 9.0; all compute nodes get their configuration from one shared location, /cm/shared/apps/slurm/var/etc/slurm/slurm.conf. Following your response, I updated my configuration files and will recheck again.

[root@TestComputeNode02 ~]# slurmd -C
NodeName=TestComputeNode02 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=7970 UpTime=2-08:23:22

Thanks and best regards.

Tim McMullan
Sounds good. I do know you are using Bright, but there were a bunch of log entries about configuration sync issues, so I wanted to be sure the configs were getting where they need to go! Let me know how things go after the config updates!
--Tim

Tim McMullan
Hi! I just wanted to check in and see if the issue was resolved after the config changes. Thanks!
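The NodeName substitution Tim describes above can be sketched as a one-liner. The hardware line is taken verbatim from the slurmd -C output in this thread; the bracket range is a hypothetical example for a two-node cluster, so adjust it to the real hostnames:

```shell
# Take the slurmd -C line from one node and widen NodeName to cover the
# whole range of (identical) compute nodes for slurm.conf.
line='NodeName=TestComputeNode02 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=7970'
echo "$line" | sed 's/NodeName=TestComputeNode02/NodeName=TestComputeNode[01-02]/'
```

Note that the UpTime field slurmd -C also prints is informational only and should be dropped before pasting the line into slurm.conf.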
--Tim

pclink
I reinstalled Slurm and recreated the cluster again; now I am getting a different error when I try to run a test:

salloc mpirun hello --allow-run-as-root
salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation 35
salloc: job 35 queued and waiting for resources

Log:
[root@TestHeadNode01 ~]# tail -f /var/log/slurmctld
[2021-07-09T03:09:03.790] restoring original state of nodes
[2021-07-09T03:09:03.790] restoring original partition state
[2021-07-09T03:09:03.790] read_slurm_conf: backup_controller not specified
[2021-07-09T03:09:03.790] No parameter for mcs plugin, default values set
[2021-07-09T03:09:03.790] mcs: MCSParameters = (null). ondemand set.
[2021-07-09T03:09:04.791] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2021-07-09T22:33:59.137] sched: _slurm_rpc_allocate_resources JobId=35 NodeList=(null) usec=202 Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions

Tim McMullan
Can I get the output again of "sinfo" and "scontrol show job" for a job that is getting stuck? I'd also like to suggest that you set the log level for the slurmctld/slurmd to "debug" until this is sorted out. Thanks!
--Tim

Tim McMullan
Just wanted to check in and see if this was still an issue or if you were able to get that output! Thanks,
--Tim

Tim McMullan
It's been a while since I heard from you on this, so I'm going to time this out for now. Let me know if you are still having issues and I'll be happy to keep working on this with you! Thanks,
--Tim

pclink
Thank you very much, the problem is solved. Sorry for the late response.
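Tim's suggestion to raise the log level can be done persistently in slurm.conf or at runtime with scontrol; a sketch using standard Slurm parameter names and subcommands (the log file path is the one shown in this thread):

```
# slurm.conf — persistent debug logging (restart the daemons after editing)
SlurmctldDebug=debug
SlurmdDebug=debug
SlurmctldLogFile=/var/log/slurmctld

# or raise the controller's level at runtime without a restart:
#   scontrol setdebug debug
```

Remember to turn the level back down (e.g. to info) once the issue is diagnosed, since debug logging is verbose.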