Ticket 15254 - srun: error: Memory specification can not be satisfied
Summary: srun: error: Memory specification can not be satisfied
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Oscar Hernández
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-10-20 15:38 MDT by Wayfinder Infrastructure Support
Modified: 2022-10-21 16:51 MDT (History)
2 users

See Also:
Site: wfr
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: NA
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (5.25 KB, text/plain)
2022-10-20 15:38 MDT, Wayfinder Infrastructure Support
Details
gres.conf (1.12 KB, text/plain)
2022-10-20 15:38 MDT, Wayfinder Infrastructure Support
Details
slurmd from node: ng-201-1 (179.98 KB, text/plain)
2022-10-20 15:39 MDT, Wayfinder Infrastructure Support
Details
slurmdbd (60.28 KB, text/plain)
2022-10-20 15:42 MDT, Wayfinder Infrastructure Support
Details
slurmctld (10.91 MB, text/plain)
2022-10-20 15:45 MDT, Wayfinder Infrastructure Support
Details

Description Wayfinder Infrastructure Support 2022-10-20 15:38:04 MDT
Created attachment 27382 [details]
slurm.conf

Hello,
I am unsure why this is suddenly happening, but it seems to affect all our partitions except the "a100" partition.

(base) deric.chau.ctr@nl-202-31:~$ srun -p quadro  --nodelist=ng-201-1 hostname
srun: error: Memory specification can not be satisfied

Attaching config files.
Comment 1 Wayfinder Infrastructure Support 2022-10-20 15:38:34 MDT
Created attachment 27383 [details]
gres.conf
Comment 2 Wayfinder Infrastructure Support 2022-10-20 15:39:39 MDT
Created attachment 27384 [details]
slurmd from node: ng-201-1
Comment 3 Wayfinder Infrastructure Support 2022-10-20 15:42:53 MDT
Created attachment 27385 [details]
slurmdbd
Comment 4 Wayfinder Infrastructure Support 2022-10-20 15:45:32 MDT
Created attachment 27386 [details]
slurmctld
Comment 5 Jason Booth 2022-10-20 16:37:17 MDT
It looks like you do not have that node configured in that partition, or in any partition other than volta.

> ng-201-1


NodeName=ng-202-26  CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:1,gpu:quadrortx8000:1
NodeName=ng-202-32  CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4
NodeName=ng-201-[16,21],ng-202-21  CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:3,gpu:quadrortx8000:3
NodeName=ng-[201,202]-[1,5]  CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4,gpu:v100:4
NodeName=ng-201-41,ng-202-41  CoresPerSocket=56 RealMemory=803922 Sockets=2 ThreadsPerCore=2 Gres=gpu:nvidia_a100-sxm4-40gb:4
NodeName=ng-204-1  Procs=255 Gres=gpu:quadrortx8000:2,gpu:quadrortx8000:2


> PartitionName=quadro ... Nodes=ng-201-21,ng-202-21,ng-204-1

> PartitionName=volta D... Nodes=ng-[201,202]-[1,5],ng-202-32
Comment 6 Wayfinder Infrastructure Support 2022-10-20 16:49:05 MDT
Sorry, that is true: ng-201-1 is only in volta. I meant that nodes in the other partitions fail with the same resource error, while the a100 partition's nodes are working.

Deric
Comment 7 Oscar Hernández 2022-10-21 05:17:49 MDT
Hi Deric,

Was slurm.conf modified recently?

I would expect node definitions to have the memory defined with the "RealMemory" option. However, in the current conf, only nodes "ng-201-41,ng-202-41" have it defined:

>NodeName=ng-202-26  CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:1,gpu:quadrortx8000:1
>NodeName=ng-202-32  CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4
>NodeName=ng-201-[16,21],ng-202-21  CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:3,gpu:quadrortx8000:3
>NodeName=ng-[201,202]-[1,5]  CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4,gpu:v100:4
>NodeName=ng-201-41,ng-202-41  CoresPerSocket=56 RealMemory=803922 Sockets=2 ThreadsPerCore=2 Gres=gpu:nvidia_a100-sxm4-40gb:4
>NodeName=ng-204-1  Procs=255 Gres=gpu:quadrortx8000:2,gpu:quadrortx8000:2

These nodes are part of the only working partition (a100):

>PartitionName=a100 Default=NO MinNodes=1 MaxTime=7-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO OverSubscribe=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerNode=10240 SelectTypeParameters=CR_CORE_MEMORY AllowAccounts=ALL AllowQos=ALL LLN=YES ExclusiveUser=NO OverTimeLimit=0 State=UP Nodes=ng-201-41,ng-202-41

Could you try configuring memory (RealMemory) for the other nodes and restarting the controller?

Kind regards,
Oscar
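For illustration, the suggested fix amounts to adding a RealMemory entry to each node definition line. A minimal sketch for the volta nodes follows; the 359996 figure is illustrative, not a value taken from these nodes:

```
# Illustrative only: replace 359996 with the RealMemory value that
# "slurmd -C" prints on each node.
NodeName=ng-[201,202]-[1,5]  CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 RealMemory=359996 Gres=gpu:v100:4,gpu:v100:4
```

After editing slurm.conf, slurmctld must be restarted for the new node definitions to take effect.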
Comment 8 Wayfinder Infrastructure Support 2022-10-21 11:12:41 MDT
Oscar,
The slurm.conf was not modified recently. I just checked an old slurm.conf from August. These partitions were working a few days ago.

I did set the memory for the volta partition's nodes and it no longer complains about memory, but it still fails with "Unable to allocate resources":
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta  --nodelist=ng-201-1 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available

Deric
Comment 9 Jason Booth 2022-10-21 13:34:22 MDT
Deric, would you try these submissions and let me know if this makes any difference?

srun -p quadro  --nodelist=ng-201-1 hostname --mem=0
srun -p quadro  --nodelist=ng-201-1 hostname --mem=1
Comment 10 Wayfinder Infrastructure Support 2022-10-21 14:53:29 MDT
Jason,

(base) deric.chau.ctr@nl-202-31:~$ srun -p volta  --nodelist=ng-201-1 hostname --mem=0
srun: error: Unable to allocate resources: Requested node configuration is not available
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta  --nodelist=ng-201-1 hostname --mem=1
srun: error: Unable to allocate resources: Requested node configuration is not available
Comment 11 Jason Booth 2022-10-21 15:04:04 MDT
Please move the --mem option up, before hostname:

> srun -p volta  --nodelist=ng-201-1 --mem=0 hostname 
> srun -p volta  --nodelist=ng-201-1 --mem=1 hostname 

Please also provide the output of "scontrol show node ng-201-1", and "sinfo".
Comment 12 Deric Chau 2022-10-21 15:40:44 MDT
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta  --nodelist=ng-201-1 --mem=0 hostname
ng-201-1
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta  --nodelist=ng-201-1 --mem=1 hostname
ng-201-1
(base) deric.chau.ctr@nl-202-31:~$ scontrol show node ng-201-1
NodeName=ng-201-1 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=80 CPULoad=3.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:v100:4,gpu:v100:4
   NodeAddr=ng-201-1 NodeHostName=ng-201-1 Version=18.08
   OS=Linux 4.18.0-18-generic #19~18.04.1-Ubuntu SMP Fri Apr 5 10:22:13 UTC 2019
   RealMemory=1 AllocMem=0 FreeMem=359996 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=volta
   BootTime=2022-08-17T00:33:45 SlurmdStartTime=2022-08-17T00:34:42
   CfgTRES=cpu=80,mem=1M,billing=80
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


(base) deric.chau.ctr@nl-202-31:~$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
volta*        up 7-00:00:00      5   idle ng-201-[1,5],ng-202-[1,5,32]
a100          up 7-00:00:00      2    mix ng-201-41,ng-202-41
quadro-dev    up 7-00:00:00      1   idle ng-202-26
quadro        up 7-00:00:00      3   idle ng-201-21,ng-202-21,ng-204-1
volta-dev     up 7-00:00:00      0    n/a
volta-qa      up 7-00:00:00      0    n/a

Comment 13 Jason Booth 2022-10-21 15:51:49 MDT
Thank you for that output.

What is happening here is that Slurm is configured to track node memory and to use the configured RealMemory value. Although Slurm does see free memory on the node, RealMemory is still at its default of "1":

> RealMemory=1 AllocMem=0 FreeMem=359996


Running slurmd -C on the node will print the detected hardware, showing the RealMemory value that should also be part of your node definition lines.


For example:

> slurmd -C
> NodeName=[NODE_NAME] CPUs=## Boards=# SocketsPerBoard=# CoresPerSocket=# ThreadsPerCore=# RealMemory=359996



This mechanism is enabled by setting "SelectTypeParameters=CR_Core_Memory" globally, or per partition via a "SelectTypeParameters=CR_CORE_MEMORY" entry.

More likely, either this was a recent change or RealMemory was removed from your configuration.

So to make things work, run "slurmd -C" on those nodes, copy the proposed values into your slurm.conf, and restart slurmctld and the slurmds.
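The workflow above can be sketched as follows; the node line and memory figure are illustrative, not verbatim from this cluster's config:

```
# 1. On each affected compute node, print the hardware Slurm detects:
#        slurmd -C
# 2. Carry the reported RealMemory into the matching slurm.conf node line, e.g.:
NodeName=ng-202-32  CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 RealMemory=359996 Gres=gpu:v100:4
# 3. Restart slurmctld on the controller, then slurmd on each compute node.
```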
Comment 14 Deric Chau 2022-10-21 16:49:13 MDT
Jason,
I added RealMemory details for each partition's nodes and it is working again.

I checked our slurm.conf from months ago: it did not have RealMemory set, and "SelectTypeParameters=CR_CORE_MEMORY" was in the config and partitions. Strange, but at least it's working again. Thank you.

Deric

Comment 15 Wayfinder Infrastructure Support 2022-10-21 16:51:15 MDT
Resolved from previous comment.