| Summary: | srun: error: Memory specification can not be satisfied | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Wayfinder Infrastructure Support <infrastructure-support.wayfinder> |
| Component: | Configuration | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | jbooth, oscar.hernandez |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | wfr | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | NA | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, gres.conf, slurmd from node: ng-201-1, slurmdbd, slurmctld | ||
Created attachment 27383 [details]
gres.conf
Created attachment 27384 [details]
slurmd from node: ng-201-1
Created attachment 27385 [details]
slurmdbd
Created attachment 27386 [details]
slurmctld
It looks like you do not have that node as part of the config on that or any other partition other than volta.

> ng-201-1

> NodeName=ng-202-26 CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:1,gpu:quadrortx8000:1
> NodeName=ng-202-32 CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4
> NodeName=ng-201-[16,21],ng-202-21 CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:3,gpu:quadrortx8000:3
> NodeName=ng-[201,202]-[1,5] CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4,gpu:v100:4
> NodeName=ng-201-41,ng-202-41 CoresPerSocket=56 RealMemory=803922 Sockets=2 ThreadsPerCore=2 Gres=gpu:nvidia_a100-sxm4-40gb:4
> NodeName=ng-204-1 Procs=255 Gres=gpu:quadrortx8000:2,gpu:quadrortx8000:2

> PartitionName=quadro ... Nodes=ng-201-21,ng-202-21,ng-204-1
> PartitionName=volta D... Nodes=ng-[201,202]-[1,5],ng-202-32

Sorry, that is true: ng-201-1 is only on volta. I meant that nodes in the other partitions are failing with the same resource error, but the a100 partition nodes are working.

Deric

Hi Deric,

Was slurm.conf modified recently? I would expect node definitions to have the memory defined with the "RealMemory" option. However, in the current conf, just nodes "ng-201-41,ng-202-41" have it defined:

> NodeName=ng-202-26 CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:1,gpu:quadrortx8000:1
> NodeName=ng-202-32 CoresPerSocket=16 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4
> NodeName=ng-201-[16,21],ng-202-21 CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:quadrortx8000:3,gpu:quadrortx8000:3
> NodeName=ng-[201,202]-[1,5] CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 Gres=gpu:v100:4,gpu:v100:4
> NodeName=ng-201-41,ng-202-41 CoresPerSocket=56 RealMemory=803922 Sockets=2 ThreadsPerCore=2 Gres=gpu:nvidia_a100-sxm4-40gb:4
> NodeName=ng-204-1 Procs=255 Gres=gpu:quadrortx8000:2,gpu:quadrortx8000:2

These nodes seem to be part of the only partition working (a100):

> PartitionName=a100 Default=NO MinNodes=1 MaxTime=7-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO OverSubscribe=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerNode=10240 SelectTypeParameters=CR_CORE_MEMORY AllowAccounts=ALL AllowQos=ALL LLN=YES ExclusiveUser=NO OverTimeLimit=0 State=UP Nodes=ng-201-41,ng-202-41

Could you try configuring memory (RealMemory) for the other nodes and restarting the controller?

Kind regards,
Oscar

Oscar,

The slurm.conf was not modified recently; I just checked an old slurm.conf from August. These partitions were working a few days ago. I did set the memory for the volta partition and it does not complain about memory any more, but it still fails with "Unable to allocate resources":

(base) deric.chau.ctr@nl-202-31:~$ srun -p volta --nodelist=ng-201-1 --pty bash
srun: error: Unable to allocate resources: Requested node configuration is not available

Deric

Deric, would you try these submissions and let me know if this makes any difference?

srun -p quadro --nodelist=ng-201-1 hostname --mem=0
srun -p quadro --nodelist=ng-201-1 hostname --mem=1

Jason,

(base) deric.chau.ctr@nl-202-31:~$ srun -p volta --nodelist=ng-201-1 hostname --mem=0
srun: error: Unable to allocate resources: Requested node configuration is not available
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta --nodelist=ng-201-1 hostname --mem=1
srun: error: Unable to allocate resources: Requested node configuration is not available
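As an aside, what Oscar's suggestion ("configuring memory for the other nodes") amounts to in slurm.conf is adding a RealMemory value to the NodeName lines. A minimal sketch for the volta nodes; the 359996 figure is a placeholder and the real value should come from running slurmd -C on the nodes themselves:

NodeName=ng-[201,202]-[1,5] CoresPerSocket=20 Sockets=2 ThreadsPerCore=2 RealMemory=359996 Gres=gpu:v100:4,gpu:v100:4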
Please move the --mem option before hostname:

> srun -p volta --nodelist=ng-201-1 --mem=0 hostname
> srun -p volta --nodelist=ng-201-1 --mem=1 hostname
Please also provide the output of "scontrol show node ng-201-1" and "sinfo".
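A side note on these probes: --mem=0 requests all of a node's configured memory and --mem=1 requests a single megabyte, so the pair shows whether the rejection tracks the node's configured memory rather than its actual usage. The configured value can also be read directly; a quick check along these lines (hypothetical invocation, reusing the node name from this thread):

scontrol show node ng-201-1 | grep -o 'RealMemory=[0-9]*'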
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta --nodelist=ng-201-1 --mem=0 hostname
ng-201-1
(base) deric.chau.ctr@nl-202-31:~$ srun -p volta --nodelist=ng-201-1 --mem=1 hostname
ng-201-1
(base) deric.chau.ctr@nl-202-31:~$ scontrol show node ng-201-1
NodeName=ng-201-1 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=80 CPULoad=3.08
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:v100:4,gpu:v100:4
   NodeAddr=ng-201-1 NodeHostName=ng-201-1 Version=18.08
   OS=Linux 4.18.0-18-generic #19~18.04.1-Ubuntu SMP Fri Apr 5 10:22:13 UTC 2019
   RealMemory=1 AllocMem=0 FreeMem=359996 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=volta
   BootTime=2022-08-17T00:33:45 SlurmdStartTime=2022-08-17T00:34:42
   CfgTRES=cpu=80,mem=1M,billing=80
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
(base) deric.chau.ctr@nl-202-31:~$ sinfo
PARTITION   AVAIL  TIMELIMIT   NODES  STATE  NODELIST
volta*      up     7-00:00:00      5  idle   ng-201-[1,5],ng-202-[1,5,32]
a100        up     7-00:00:00      2  mix    ng-201-41,ng-202-41
quadro-dev  up     7-00:00:00      1  idle   ng-202-26
quadro      up     7-00:00:00      3  idle   ng-201-21,ng-202-21,ng-204-1
volta-dev   up     7-00:00:00      0  n/a
volta-qa    up     7-00:00:00      0  n/a

Thank you for that output.

What is happening here is that you have Slurm configured to track the memory of the node and to use the RealMemory configured. Although Slurm does see free memory, you have the default configured value of "1":

> RealMemory=1 AllocMem=0 FreeMem=359996

Running slurmd -C on the node will show you that you should also have RealMemory as part of your node definition lines. For example:

> slurmd -C
> NodeName=[NODE_NAME] CPUs=## Boards=# SocketsPerBoard=# CoresPerSocket=# ThreadsPerCore=# RealMemory=359996

This mechanism is enabled by setting "SelectTypeParameters=CR_Core_Memory", or the partition config entry "SelectTypeParameters=CR_CORE_MEMORY". It is more likely that either this was a recent change or that RealMemory was removed from your configuration.

So, to make things work, run "slurmd -C" on those nodes, copy the value from the proposed config into your slurm.conf, and restart the slurmctld and slurmd's.

Jason,

I added RealMemory details to each partition now and it is working again. I checked our slurm.conf from months ago: it did not have RealMemory set, and "SelectTypeParameters=CR_CORE_MEMORY" was in the config and partitions. Strange, but at least it's working again. Thank you.

Deric

Resolved from previous comment.
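Condensing Jason's fix into a runnable sketch (assuming systemd-managed daemons; the RealMemory figure echoes his example rather than a measured value):

# 1. On each affected compute node, ask slurmd what hardware it detects:
slurmd -C
# -> NodeName=ng-201-1 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=359996

# 2. Copy the reported RealMemory into the matching NodeName line in slurm.conf.
#    (Sites often set it slightly below the detected value to leave headroom for the OS.)

# 3. Restart the daemons so the new definitions take effect:
systemctl restart slurmctld   # on the controller
systemctl restart slurmd      # on every compute node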
Created attachment 27382 [details]
slurm.conf

Hello,

I am unsure why this is suddenly happening, but it seems to affect all our partitions except the "a100" partition.

(base) deric.chau.ctr@nl-202-31:~$ srun -p quadro --nodelist=ng-201-1 hostname
srun: error: Memory specification can not be satisfied

Attaching config files.
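For anyone hitting the same message, a useful first comparison (hypothetical session, reusing the names from this report) is the partition's default memory grant against the node's configured memory:

scontrol show partition quadro | grep -o 'DefMemPerNode=[0-9]*'
scontrol show node ng-201-1 | grep -o 'RealMemory=[0-9]*'
# If DefMemPerNode exceeds the node's RealMemory (which defaults to 1 when
# unset), every job relying on the default memory is unsatisfiable there.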