Would you please attach your slurm.conf, the output of "slurmd -C" from the compute node, and the output of "scontrol show nodes"?

Created attachment 14853 [details]
requested config info
It seems that there is a mismatch between slurmd -C and slurm.conf.

Hi,

Please find attached a file with the requested info. It seems that the slurm.conf config:

    NodeName=vm[0-3] CoresPerSocket=8 Sockets=8 ThreadsPerCore=1

is interpreted differently by "slurmd -C" on vm0:

    NodeName=vm0 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1

while "scontrol show nodes" gives:

    NodeName=vm0 Arch=x86_64 CoresPerSocket=8 CPUAlloc=0 CPUTot=64 CPULoad=0.01
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=(null)
       NodeAddr=vm0 NodeHostName=vm0 Version=18.08
       OS=Linux 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017
       RealMemory=15200 AllocMem=0 FreeMem=24467 Sockets=8 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=vm

What's the real hardware layout? Can you show me the output of "lstopo" or "lstopo-no-graphics"?

I used a virtual machine for which I could only choose the number of cores per node, not the number of sockets, so my virtual-machine reproducer may not be accurate: the compute nodes of the two production machines showing the problem have 8 cores * 8 sockets * 2 hyperthreads. You will find attached the requested info for the real production machine, plus a reproducer. The controller address, user names, and partition names have been erased or changed for confidentiality.

Created attachment 14877 [details]
production machine infos
Hi,

Given the conf files (including lstopo) of the production machines where the problem appears, and the reproducer I provided, have you been able to reproduce it?

Thanks,
Régine

Hi Regine,
From your attached document: this requests 512 tasks on one single node (-w). The squeue output should print the required node count, i.e. 512 divided by the cores per node; I am investigating why you get 32, because I get 4 on my system. In any case this doesn't seem to affect the allocation itself. The real reason the job is not starting is that you're explicitly requesting 512 tasks on one single node. So your theory in the first comment seems plausible, but I am not seeing it right now. I need a bit more time.
srun -n 512 -w machine4020 -p partition hostname
srun: Required node not available (down, drained or reserved)
srun: job 412735 queued and waiting for resources
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412735 partition hostname user PD 0:00 32 (ReqNodeNotAvail, UnavailableNodes:machine4020)
..
Also, you say:
> On the controller the slurm.conf is the same, but the nodes are declared without hyperthreading:
> batch$ cat /etc/slurm/nodes.conf
> NodeName=machine[4000-6291] CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=1
You need to have the same nodes.conf on the controller and on the nodes; why do they differ? You should be seeing NO_CONF_HASH errors in your slurmctld logs. Please fix this and synchronize nodes.conf.
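As a small illustration of why the configs must match (a hedged sketch only: the file contents are copied from this ticket, and this digest check is not how slurmctld actually computes its config hash), the two nodes.conf variants quoted in this ticket can never hash to the same value:

```python
import hashlib

# nodes.conf as reported on the compute nodes (hyperthreading declared)
node_conf = ("NodeName=machine[4000-6291] Procs=256 CoresPerSocket=16 "
             "RealMemory=240000 Sockets=8 ThreadsPerCore=2\n")
# nodes.conf as reported on the controller (no hyperthreading)
controller_conf = ("NodeName=machine[4000-6291] CoresPerSocket=16 "
                   "RealMemory=240000 Sockets=8 ThreadsPerCore=1\n")

def digest(text):
    """Digest of a config snippet; any byte difference changes it."""
    return hashlib.sha256(text.encode()).hexdigest()

# Different digests -> controller and nodes are out of sync,
# which is what NO_CONF_HASH warns about.
print(digest(node_conf) == digest(controller_conf))  # False
```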
Will keep you informed, sorry for the delay.
Regine, I investigated a bit more. It turns out that this bug is a duplicate of bug 8110, where the same issue showed up under a different use case. I will take a look at bug 8110, which seems stalled, and see if I can give it a second review. There's also a commit done recently, but it doesn't explain everything nor fix the issue; it is just a small clarification in the docs:

    commit a64a1d0cd858f13314d21cd28e7f115729adb256
    Author:     Marcin Stolarek <cinek@schedmd.com>
    AuthorDate: Fri Nov 15 18:49:49 2019 +0000
    Commit:     Ben Roberts <ben@schedmd.com>
    CommitDate: Thu Feb 20 11:37:18 2020 -0600

        Docs - squeue nodecount simplify comment making it more general

        %D - Number of nodes is not fully evaluated before the job start. In
        real environment when -N is not specified this may be really complicated
        so it doesn't make sense to give the really smallest possible number
        here for PD job.

        Bug 8113

Is it ok to mark this one as a dup of 8110 and let you follow bug 8110 from now on?

"This specifies you want to run 512 tasks on one
single node (-w). The output of squeue should print what would be the required
nodes if you divide 512/cores_per_node, I am investigating now why you have a
32 because I get 4 in my system."
srun -n 512 -w machine4020 -p partition hostname
The -w option was only there to target an unavailable node for the reproducer...
If I specify
srun -n 512 -w machine[4020-4023] -p partition hostname
where only machine4020 is unavailable, squeue prints the same thing, i.e. 32 nodes,
while it should print 4.
It appears that only the number of sockets is missing from the division,
i.e. it should be 512 / 16 / 8 instead of 512 / 16.
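The difference between the two divisions can be sketched as follows (a minimal illustration with hypothetical names, not Slurm's actual code; it only restates the arithmetic above):

```python
import math

CORES_PER_SOCKET = 16
SOCKETS = 8
TASKS = 512

# Value squeue appears to show: tasks / cores_per_socket only
shown = math.ceil(TASKS / CORES_PER_SOCKET)                 # 512 / 16 = 32

# Expected estimate: tasks / (cores_per_socket * sockets)
expected = math.ceil(TASKS / (CORES_PER_SOCKET * SOCKETS))  # 512 / 128 = 4

print(shown, expected)  # 32 4
```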
" Docs - squeue nodecount simplify comment making it more general
%D - Number of nodes is not fully evaluated before the job start. In
real environment when -N is not specified this may be really complicated
so it doesn't make sense to give the really smallest possible number
here for PD job."
This wrong node count in squeue does not occur for all pending jobs: when the job is PD for the Resources reason, the correct node count appears in squeue.
It seems wrong only for ReqNodeNotAvail.
Why can it be correct for the Resources reason but not for the ReqNodeNotAvail reason?
I don't think we had this behaviour in slurm-18... but I'm not sure, as that machine's nodes were different.
Why doesn't the squeue man page document this behaviour?
"Is it ok to mark this one as a dup of 8110 and let you follow bug 8110 from now
on?"
Bug 8110 has had no update since the end of 2019. Could you revive it?
Thanks
Regine,

There's already a patch in place coming from bug 8110; it was committed recently and will be in the next 20.02. But before I can confirm that it fixes your issue, I need to reproduce it. I have a machine with your socket/core/memory configuration, but your information is not clear to me. In the attached file you write:

    ....
    $ cat /etc/slurm/nodes.conf
    NodeName=machine[4000-6291] Procs=256 CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=2
    ********************************************************************************
    On controller same slurm.conf but nodes are declared without hyperthreading:
    batch$ cat /etc/slurm/nodes.conf
    NodeName=machine[4000-6291] CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=1
    ********************************************************************************
    .....

Does this mean that the Slurm configuration differs between the nodes and the controller? Can you explain why?

Thanks

Hi,

I am timing this out since I haven't received any response. Please mark it as open again when you have more input for me. Thanks for your understanding.
Hello,

One of our users complains about the NODES field shown by squeue for a job pending with ReqNodeNotAvail, even though when the allocation becomes possible the number of allocated nodes is correct. It seems that squeue's NODES field, which is job->num_nodes, is computed only as number_of_cpus_asked / cores_per_socket (per node) and not as number_of_cpus_asked / cores_per_socket / sockets (per node). So the users thought that Slurm requested the wrong number of nodes.

Let's look at this example:

    NodeName=vm[0-3] CoresPerSocket=8 Sockets=8 ThreadsPerCore=1 State=UNKNOWN

    sinfo
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    vm*          up   infinite      3   idle vm[0-2]
    vm*          up   infinite      1   down vm3

    srun -n 192 hostname   (asks for the 3 available nodes)
    vm1
    vm2
    vm2
    vm1
    ...ok

    srun -n 256 hostname   (asks for 4 nodes, one of them unavailable)
    srun: Required node not available (down, drained or reserved)
    srun: job 28 queued and waiting for resources

    squeue
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       28        vm hostname  gaudinr PD       0:00     32 (ReqNodeNotAvail, UnavailableNodes:vm3)

NODES=32 (256/8) is confusing, as users think that they (or Slurm) asked for 32 nodes instead of 4, and might wrongly conclude that this is why the job is pending...

Could this be fixed?

Thanks
Regine
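The arithmetic behind the confusing NODES value in this report can be sketched as follows (a minimal illustration only; the function name is hypothetical and this is not Slurm's actual code path):

```python
import math

def min_nodes(tasks, cores_per_socket, sockets, ignore_sockets=False):
    """Estimate the minimum node count for a given task count.

    With ignore_sockets=True this mimics the confusing squeue value,
    which appears to divide by the cores of a single socket only.
    """
    cores_per_node = cores_per_socket * (1 if ignore_sockets else sockets)
    return math.ceil(tasks / cores_per_node)

# vm[0-3]: CoresPerSocket=8, Sockets=8 -> 64 cores per node
print(min_nodes(256, 8, 8))                       # 4, the expected node count
print(min_nodes(256, 8, 8, ignore_sockets=True))  # 32, what squeue shows
```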