Ticket 9304

Summary: Confusing job->num_nodes / NODES field when job is pending for ReqNodeNotAvail or PartitionConfig reasons
Product: Slurm Reporter: Regine Gaudin <regine.gaudin>
Component: slurmctld    Assignee: Felip Moll <felip.moll>
Status: RESOLVED TIMEDOUT
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8113
https://bugs.schedmd.com/show_bug.cgi?id=8224
https://bugs.schedmd.com/show_bug.cgi?id=8110
Site: CEA
Attachments: requested config info
production machine infos

Description Regine Gaudin 2020-06-30 05:58:35 MDT
Hello

One of our users complains about the NODES field shown by squeue for a job pending with ReqNodeNotAvail; when the allocation eventually happens, the number of nodes allocated is correct.

It seems that squeue's NODES field, which is job->num_nodes, is computed with only
number_of_cpus_requested / cores_per_socket (per node)
and not
number_of_cpus_requested / (cores_per_socket * sockets) (per node)

So the users thought that Slurm asked for the wrong number of nodes.

Let's look at this example:
NodeName=vm[0-3]  CoresPerSocket=8 Sockets=8 ThreadsPerCore=1 State=UNKNOWN 

 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
vm*          up   infinite      3   idle vm[0-2]
vm*          up   infinite      1   down vm3


 srun -n 192 hostname    (asks for the 3 available nodes)
vm1
vm2
vm2
vm1
...ok

srun -n 256 hostname    (asks for 4 nodes, one of which is unavailable)
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources

squeue
   JOBID PARTITION     NAME     USER ST      TIME  NODES NODELIST(REASON)
    28        vm hostname  gaudinr PD       0:00     32 (ReqNodeNotAvail, UnavailableNodes:vm3)


NODES=32 (256/8) is confusing, as users think they (or Slurm) asked for
32 nodes instead of 4, and might wrongly conclude that this is the reason why the job is pending...
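For illustration, the two estimates can be sketched like this (a hypothetical back-of-the-envelope calculation, not the actual slurmctld code):

```python
# Sketch of the two node-count estimates for a pending job
# (illustrative only; not the actual Slurm implementation).

def nodes_reported(ntasks, cores_per_socket):
    # What squeue appears to print: tasks / cores_per_socket
    return -(-ntasks // cores_per_socket)  # ceiling division

def nodes_expected(ntasks, cores_per_socket, sockets):
    # What one would expect: tasks / (cores_per_socket * sockets)
    return -(-ntasks // (cores_per_socket * sockets))

# vm[0-3]: CoresPerSocket=8, Sockets=8 -> 64 cores per node
print(nodes_reported(256, 8))     # 32, matching the confusing squeue output
print(nodes_expected(256, 8, 8))  # 4, the node count actually needed
```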

Could it be fixed?

Thanks

Regine
Comment 1 Jason Booth 2020-06-30 13:59:14 MDT
Would you please attach your slurm.conf, "slurmd -C" from the compute node and also attach the output of "scontrol show nodes"?
Comment 2 Regine Gaudin 2020-07-01 01:31:47 MDT
Created attachment 14853 [details]
requested config info
Comment 3 Regine Gaudin 2020-07-01 01:33:30 MDT
It seems that there is a mismatch between slurmd -C and slurm.conf
Comment 4 Regine Gaudin 2020-07-01 01:34:35 MDT
Hi

Please find in attachment a file with the requested infos.
It seems that
slurm.conf config:
NodeName=vm[0-3]  CoresPerSocket=8 Sockets=8 ThreadsPerCore=1

is interpreted differently by slurmd -C
vm0: NodeName=vm0 CPUs=8 Boards=1 SocketsPerBoard=8 CoresPerSocket=1 ThreadsPerCore=1 

while
scontrol show nodes gives
NodeName=vm0 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUTot=64 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=vm0 NodeHostName=vm0 Version=18.08
   OS=Linux 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017
   RealMemory=15200 AllocMem=0 FreeMem=24467 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=vm
Comment 5 Felip Moll 2020-07-01 04:59:04 MDT
What's the real hardware layout?
Can you show me the output of "lstopo" or "lstopo-no-graphics"?
Comment 6 Regine Gaudin 2020-07-02 04:20:25 MDT
I've used a virtual machine for which I can only request the number of cores per node, not sockets... so my virtual-machine reproducer may not be accurate, as the compute nodes on the two production machines having the problem have 8 cores * 8 sockets * 2 hyperthreads.

You will find in attachment the requested info on the real production machine, plus a reproducer.

Controller address, user name and partition name have been erased or changed for confidentiality...
Comment 7 Regine Gaudin 2020-07-02 04:22:48 MDT
Created attachment 14877 [details]
production machine infos
Comment 8 Regine Gaudin 2020-07-06 01:46:49 MDT
Hi
Given the conf files (including lstopo) of the production machines where the problem appears and the reproducer I've provided, did you succeed in reproducing it?
Thanks
Régine
Comment 9 Felip Moll 2020-07-06 10:25:41 MDT
Hi Regine,

From your attached document... this specifies you want to run 512 tasks on one single node (-w). The output of squeue should print what the required node count would be if you divide 512/cores_per_node. I am investigating why you get 32, because I get 4 on my system. In any case this doesn't seem to affect the allocation at all. The real reason the job is not starting is that you're explicitly requesting 512 tasks on 1 single node. So your theory from the first comment seems right, but I am not seeing it right now; I need a bit more time.

srun -n 512 -w  machine4020 -p partition hostname

srun: Required node not available (down, drained or reserved)
srun: job 412735 queued and waiting for resources

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            412735      partition hostname user PD       0:00     32 (ReqNodeNotAvail, UnavailableNodes:machine4020)
..


Also, you say:

> On controller same slurm.conf but nodes are declared without hyperthreading:
> batch$ cat  /etc/slurm/nodes.conf
> NodeName=machine[4000-6291]  CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=1

You need to have the same nodes.conf on the controller and the nodes. Why do you have them different? You should see NO_CONF_HASH errors in your slurmctld logs. Please fix this and synchronize the nodes.conf.
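For example, a single synchronized nodes.conf, identical on the controller and every compute node, could keep the node-side definition (a sketch assuming the node-side file matches the real hardware; host range and values are taken from the attachment):

```
NodeName=machine[4000-6291] Procs=256 CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=2
```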

Will keep you informed, sorry for the delay.
Comment 10 Felip Moll 2020-07-06 14:33:41 MDT
Regine,

I investigated a bit more. It turns out that this bug is a duplicate of bug 8110, where the same issue showed up under a different use case.

I will take a look at bug 8110 which seems stalled and see if I can give a second review.

There's also a recent commit, but it neither explains everything nor fixes the issue; it is just a small clarification in the docs.

commit a64a1d0cd858f13314d21cd28e7f115729adb256
Author:     Marcin Stolarek <cinek@schedmd.com>
AuthorDate: Fri Nov 15 18:49:49 2019 +0000
Commit:     Ben Roberts <ben@schedmd.com>
CommitDate: Thu Feb 20 11:37:18 2020 -0600

    Docs - squeue nodecount simplify comment making it more general
    
    %D - Number of nodes is not fully evaluated before the job start. In
    real environment when -N is not specified this may be really complicated
    so it doesn't make sense to give the really smallest possible number
    here for PD job.
    
    Bug 8113


Is it ok to mark this one as a dup of 8110 and let you follow bug 8110 from now on?
Comment 11 Regine Gaudin 2020-07-07 02:04:17 MDT
"This specifies you want to run 512 tasks on one
single node (-w). The output of squeue should print what would be the required
nodes if you divide 512/cores_per_node, I am investigating now why you have a
32 because I get 4 in my system."

srun -n 512 -w  machine4020 -p partition hostname
 the -w machine was only there to target an unavailable node for the reproducer...


If I specify
srun -n 512 -w machine[4020-4023] -p partition hostname
where only machine4020 is unavailable, squeue prints the same thing, i.e. 32 nodes,
while it should print 4.
It appears that only the number of sockets is missing from the division:
i.e. it should be 512 / 16 / 8 instead of 512 / 16.
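The same arithmetic with the production layout (CoresPerSocket=16, Sockets=8, so 128 cores per node) can be checked directly; this is just an illustrative sketch of the division, not Slurm code:

```python
# Illustrative arithmetic for the production nodes (hypothetical sketch):
# NodeName=machine[...] CoresPerSocket=16 Sockets=8 -> 128 cores per node.
ntasks = 512
cores_per_socket = 16
sockets = 8

shown = -(-ntasks // cores_per_socket)                 # 512/16: what squeue prints
expected = -(-ntasks // (cores_per_socket * sockets))  # 512/128: what it should print
print(shown, expected)  # 32 4
```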


" Docs - squeue nodecount simplify comment making it more general

    %D - Number of nodes is not fully evaluated before the job start. In
    real environment when -N is not specified this may be really complicated
    so it doesn't make sense to give the really smallest possible number
    here for PD job."

This wrong node count in squeue does not occur for all pending jobs: when the job is PD for the Resources reason, the correct node count appears in squeue.
It only seems wrong for ReqNodeNotAvail.
Why can it be correct for the Resources reason but not for the ReqNodeNotAvail reason?
I don't think we had this behavior in slurm-18... but I'm not sure, as the machine's nodes were different.
Why doesn't the squeue man page describe this behavior?

"Is it ok to mark this one as a dup of 8110 and let you follow bug 8110 from now
on?"
Bug 8110 hasn't had any update since the end of 2019. Could you wake it up?

Thanks
Comment 12 Felip Moll 2020-07-14 11:55:02 MDT
Regine,

There's already a patch in place coming from bug 8110; it has been committed recently and will be in the next 20.02.

But before I can confirm this fixes your issue, I need to reproduce it. I have a machine with your socket/core/memory configuration, but your information is not clear to me:

in the attached file you write:

....
$ cat  /etc/slurm/nodes.conf
NodeName=machine[4000-6291] Procs=256 CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=2

********************************************************************************************************************

On controller
same slurm.conf, but nodes are declared without hyperthreading:
batch$ cat  /etc/slurm/nodes.conf
NodeName=machine[4000-6291]  CoresPerSocket=16 RealMemory=240000 Sockets=8 ThreadsPerCore=1

********************************************************************************************************************
.....



Does it mean that the Slurm configuration is different on the nodes vs the controller? Can you explain a bit better why?

Thanks
Comment 13 Felip Moll 2020-08-13 09:01:21 MDT
Hi

I am timing this out since I haven't received any response.

Please mark it as open again when you have more input for me.

Thanks for your understanding.