| Summary: | srun: error: Unable to allocate resources: Requested node configuration is not available | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Wayfinder Infrastructure Support <infrastructure-support.wayfinder> |
| Component: | Scheduling | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | wfr | Slinky Site: | --- |
| Attachments: | slurm.conf, gres.conf, slurmd2, slurmctld | | |
Created attachment 27241 [details]
gres.conf

Please raise the log level and duplicate the issue, then lower the log level. Compress the logs and attach them to this issue.

> scontrol setdebug debug

Reproduce the error.

> scontrol setdebug info

Attach the logs. Please also attach the compressed slurmd.log from ng-201-21.

Created attachment 27243 [details]
slurmd2
Hi,
I increased the log level, but the file does not look like it changed from the first one.
Deric
Hi Deric,
Thanks for the provided data.
> I increased the log level, but the file does not look like it changed from
> the first one.
This is expected if you were only looking at the slurmd logs: the log level changes Jason suggested only affect slurmctld.log, which is the one we expect to give us more information about what is going on.
Could I ask you to repeat the test, this time also enabling the "SelectType" debug flag? To do so, you should:
(enable logging)

> scontrol setdebug debug
> scontrol setdebugflags +selecttype

Run the failing job.

(then to return to normal logging)

> scontrol setdebug info
> scontrol setdebugflags -selecttype
You will then need to attach the relevant part of the slurmctld.log (the timeframe in which the job was submitted). It would also be great if you could give us the JobID of the failed job, so that we can correctly identify it in the logs (the "sacct" command should give you that information).
Just another couple of questions: was this working before? Does it run successfully if you execute "srun --nodelist=ng-201-21 hostname"?
Kind regards,
Oscar
Oscar,
This was working before.
This was working before. I ran it twice and noticed that sacct reports the "volta" partition. This node should be in the quadro partition.
    (base) deric.chau.ctr@ng-201-21:~$ sacct
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
          149497   hostname      volta                     1     FAILED      1:0
          149498   hostname      volta                     1     FAILED      1:0
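The partition column in that sacct output is the key symptom. As a small offline sketch, the sample lines pasted above can be parsed to pull out the JobID/partition pairs; on a live cluster, `sacct --format=JobID,Partition` (a standard sacct option) would give the same two columns directly:

```shell
# Parse the sacct output pasted in this ticket (default columns:
# JobID JobName Partition Account AllocCPUS State ExitCode; the
# Account column is empty here, so Partition is field 3).
sacct_out='      149497   hostname      volta                     1     FAILED      1:0
      149498   hostname      volta                     1     FAILED      1:0'

# Print JobID and the partition each job was routed to.
printf '%s\n' "$sacct_out" | awk '{print $1, $3}'
# → 149497 volta
# → 149498 volta
```

Both failed jobs were routed to "volta", confirming that the submission never reached the quadro partition.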
Attaching slurmctld
Created attachment 27252 [details]
slurmctld
Hi Deric,

Thanks for the test and the logs. Looking at your configuration and the output you are getting, I would say this is the expected behavior. Have you recently changed the default partition in the cluster?

What is happening is that Slurm is trying to launch the job in the default partition, which is currently "volta":

    PartitionName=volta Default=YES MinNodes=1 ...

But the node you are asking for is part of the "quadro" partition, so Slurm is letting you know that the requested resources (node ng-201-21) are not available in partition "volta". The logs confirm that:

    [2022-10-13T14:45:44.663] No nodes satisfy requirements for JobId=149497 in partition volta
    [2022-10-13T14:45:44.663] _slurm_rpc_allocate_resources: Requested node configuration is not available

This should work:

    srun -p quadro --nodelist=ng-201-21

A partition should always be specified when submitting a job; if it is not, Slurm will try to allocate the job in the default partition (it will not auto-select a partition based on the node requested).

Let me know if that helps and makes sense for your situation.

Kind regards,
Oscar

Resolved. Thank you
Created attachment 27240 [details]
slurm.conf

Hello,

I'm having trouble getting a bash prompt on any of my nodes in the quadro partition.

    srun --nodelist=ng-201-21 --pty bash
    srun: error: Unable to allocate resources: Requested node configuration is not available

    root@nm-203-19:/etc/slurm# scontrol show node ng-201-21
    NodeName=ng-201-21 Arch=x86_64 CoresPerSocket=20
       CPUAlloc=0 CPUTot=80 CPULoad=3.04
       AvailableFeatures=(null)
       ActiveFeatures=(null)
       Gres=gpu:quadrortx8000:3,gpu:quadrortx8000:3
       NodeAddr=ng-201-21 NodeHostName=ng-201-21 Version=18.08
       OS=Linux 4.18.0-18-generic #19~18.04.1-Ubuntu SMP Fri Apr 5 10:22:13 UTC 2019
       RealMemory=772676 AllocMem=0 FreeMem=746675 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=201458 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=quadro
       BootTime=2022-09-30T19:15:11 SlurmdStartTime=2022-09-30T19:16:04
       CfgTRES=cpu=80,mem=772676M,billing=80
       AllocTRES=
       CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
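The `Partitions=quadro` field in the `scontrol show node` output above is the detail that resolves the ticket. As a minimal sketch of how to read a node's partition off before submitting (the `node_info` text below is a shortened copy of the output in this ticket so the snippet runs offline; on the cluster you would capture it with `node_info=$(scontrol show node ng-201-21)` instead):

```shell
# Abridged `scontrol show node ng-201-21` output from this ticket;
# on a live cluster, capture the real thing with:
#   node_info=$(scontrol show node ng-201-21)
node_info='NodeName=ng-201-21 Arch=x86_64 CoresPerSocket=20
   Partitions=quadro
   State=IDLE'

# Split on whitespace and pull the value of the Partitions= field.
partition=$(printf '%s\n' "$node_info" | tr ' ' '\n' | sed -n 's/^Partitions=//p')

# Build the submission command with the partition stated explicitly.
echo "srun -p $partition --nodelist=ng-201-21 hostname"
# → srun -p quadro --nodelist=ng-201-21 hostname
```

Passing `-p`/`--partition` explicitly avoids the mismatch entirely, since Slurm otherwise routes the job to the default partition regardless of which node `--nodelist` names.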