| Summary: | sbatch --partition=smithp-ash --account=smithp-ash --time=10:00:00 --nodes=20 --ntasks=400 /tmp/visit.u0033047.14:26:29 reports sbatch: error: Batch job submission failed: Requested node configuration is not available | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian Haymore <brian.haymore> |
| Component: | Scheduling | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Utah | | |
| Attachments: | My slurm.conf file | | |
Hey Brian,

I'll try to reproduce it with your slurm.conf and get back to you.

Thanks,
Brian

I'm able to reproduce the problem and understand what's happening. Slurm is trying to get the 20 nodes and the 400 tasks from the 24-CPU (12-core) nodes but can't. Requesting the feature filters out the 24-CPU nodes. A couple of workarounds are:

1. Specify the feature for these types of jobs.
2. Specify --ntasks-per-node=20.
3. Change the nodes' weights so that the 40-CPU nodes are looked at first.

The cons_res plugin doesn't have this issue. I will look into possibly changing the select/linear plugin to account for this type of situation. Let me know if you have any questions.

Thanks,
Brian

So I had another question based on your response. You point out that select/cons_res does not suffer from this. If I am not focused on sharing nodes, nor focused on allocating less than a whole node, should I consider moving to select/cons_res? I guess the simpler question here is: "if" nobody really uses select/linear and select/cons_res is much more heavily used, tested, and developed, maybe we should change? I'd rather not change, but I'd also rather pick the path that would give us the best chance of being in the "better" state. Guide me on this please. If linear is well used/developed, just tell me I'm good. :) Thanks.

(In reply to Brian Haymore from comment #3)
> I guess the simpler question here is "if" nobody really uses select/linear
> and select/cons_res is much more heavily used, tested, and developed

Yes on all counts.

> maybe we should change?

Your call. You can certainly use select/cons_res to allocate whole nodes, and it is much more heavily used, tested, and developed...

So I'm playing with this on a test cluster of eight 8-core nodes. The setup I have uses:

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE
```

From reading things and from simple tests, this seems to give me pretty much what I had with linear.
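For context, the core-count arithmetic behind the original failure can be sketched as follows. This is an illustrative model only, not Slurm's actual selection code; the node and task figures are taken from the report, and with CR_ONE_TASK_PER_CORE a node's task capacity is its physical core count, not its hyperthreaded CPU count.

```python
import math

# Illustrative model (not Slurm internals): with CR_ONE_TASK_PER_CORE,
# a node hosts at most one task per physical core, so hyperthreads
# (e.g. CPUs=24 on a 12-core node) add no task capacity.
def request_fits(cores_per_node, nodes, ntasks):
    """Spreading ntasks over exactly `nodes` nodes, does the busiest
    node stay within its physical core count?"""
    return math.ceil(ntasks / nodes) <= cores_per_node

# --nodes=20 --ntasks=400 puts at least 20 tasks on some node.
print(request_fits(12, 20, 400))  # 12-core (c12) nodes: False
print(request_fits(20, 20, 400))  # 20-core (c20) nodes: True
```

With equal weights on both node types, this is why steering the selection (via the c20 feature, --ntasks-per-node, or lower weights on the 40-CPU nodes) makes the request succeed.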
However, from reading about this mode, the default behavior suggests that if I were to ask for a job to run on 6 nodes with an ntasks of 36, it would evenly put 6 processes per node, yet it is not doing that. I'm finding the allocation of cores on nodes to be a bit odd. For example, doing what I said above, -N 6 -n 36, I get this distribution:

```
SLURM_STEP_TASKS_PER_NODE=7(x4),6,2
SLURM_STEP_NODELIST=sp[003-008]
```

Just for reference, my node definitions are below. You will see that all nodes have the same socket, core, and thread numbers:

```
# CHPC General Nodes
NodeName=sp[001-002] Feature=chpc,c8 Weight=1 NodeAddr=sp[001-002] Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
#
# Cardoen Owner Nodes
NodeName=sp[003-004] Feature=cardoen,c8 Weight=1 NodeAddr=sp[003-004] Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
#
# Cuma Owner Nodes
NodeName=sp005 Feature=cuma,c8 Weight=1 NodeAddr=sp005 Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
#
# Orendt Owner Nodes
NodeName=sp[006-008] Feature=orendt,c8 Weight=1 NodeAddr=sp[006-008] Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
```

So why am I getting allocated 7 cores on the first 4 nodes, 6 on the 5th node, and 2 on the 6th node?

There's one more bit that you need. If you want to continue allocating whole nodes to jobs, add "Shared=exclusive" to the partition(s) configuration line. The tasks will then be evenly distributed across those nodes. Without that, your job will get 36 cores (one for each task) distributed in an arbitrary fashion across those 6 nodes.

Thanks! That got it. Though the web docs seem to suggest that this should have been the behavior from the start, when you read through the SelectTypeParameters for cons_res.
Specifically, the wording in the slurm.conf documentation for CR_Pack_Nodes implies that without CR_Pack_Nodes set, job tasks will be evenly distributed, but with it set they will be packed in. There isn't any mention of this on the cons_res-specific documentation page either. Maybe an update to the docs could get into the queue for when time allows? Thanks again for the help!

(In reply to Brian Haymore from comment #7)
> Thanks! That got it
> Maybe an update to the docs could get
> into the queue for when time allows?

Done in v14.11.6 (when released): https://github.com/SchedMD/slurm/commit/a001b4b67aa4986ddf3664f59b455e7604225015
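The two distribution styles discussed in this thread can be sketched with a toy model. This is not Slurm's implementation, just an illustration of the difference between even (block) distribution, as seen once Shared=exclusive is set, and a packed fill, as CR_Pack_Nodes describes; the 6-node/36-task figures come from the example above.

```python
def distribute(ntasks, nodes, cores_per_node, pack=False):
    """Toy model (not Slurm internals) of tasks-per-node layout
    for a whole-node allocation: packed vs. evenly spread."""
    if pack:
        # CR_Pack_Nodes-style: fill each node's cores before moving on.
        counts = []
        for _ in range(nodes):
            take = min(cores_per_node, ntasks)
            counts.append(take)
            ntasks -= take
        return counts
    # Even (block) distribution across all requested nodes.
    base, extra = divmod(ntasks, nodes)
    return [base + 1 if i < extra else base for i in range(nodes)]

print(distribute(36, 6, 8))             # [6, 6, 6, 6, 6, 6]
print(distribute(36, 6, 8, pack=True))  # [8, 8, 8, 8, 4, 0]
```

The arbitrary 7(x4),6,2 layout seen without Shared=exclusive is neither of these; it falls out of per-core selection order rather than any per-node target.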
Created attachment 1825 [details]: My slurm.conf file

BTW, before I get started I should report that, as of April 2nd, we have fully converted all of our clusters to SLURM. :)

We have a report that I've been able to reproduce tonight. This is a cluster with 417 compute nodes. The first 251 nodes are 12-core (24 with HT) nodes (2 sockets, 6 cores per socket, 2 threads per core). Then from node 252 to 415 we have 20-core (40 with HT) nodes (2 sockets, 10 cores per socket, 2 threads per core). Then nodes 416 and 417 are again a copy of the first 251 nodes, with 12 cores (24 with HT). See here from the slurm.conf:

```
# SmithP Owner Nodes
NodeName=ash[001-251] Feature=smithp,c12 Weight=1 NodeAddr=10.242.71.[1-251] Sockets=2 CPUs=24 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24000 TmpDisk=425000
NodeName=ash[416-417] Feature=smithp,c12 Weight=1 NodeAddr=10.242.72.[162-163] Sockets=2 CPUs=24 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24000 TmpDisk=425000
NodeName=ash[252-254] Feature=smithp,c20 Weight=1 NodeAddr=10.242.71.[252-254] Sockets=2 CPUs=40 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=64000 TmpDisk=820000
NodeName=ash[255-415] Feature=smithp,c20 Weight=1 NodeAddr=10.242.72.[1-161] Sockets=2 CPUs=40 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=64000 TmpDisk=820000
```

I also have this set in slurm.conf:

```
SelectTypeParameters=CR_ONE_TASK_PER_CORE
```

So, to walk you through the steps to reproduce this issue, first here is the current state of the cluster:

```
[u0033047@ash1:~]$ sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
ash-guest    up   3-00:00:00     56  alloc  ash[001-055,412]
ash-guest    up   3-00:00:00    361  idle   ash[056-411,413-417]
smithp-ash   up   14-00:00:0     56  alloc  ash[001-055,412]
smithp-ash   up   14-00:00:0    361  idle   ash[056-411,413-417]
```

So we see we have 361 idle nodes, and most of them are the newer 20-core nodes.
So when the user tries to submit with this command, we see the initial error:

```
[u0033047@ash1:~]$ srun -p smithp-ash -A smithp-ash -t 1:00:00 --nodes=20 --ntasks=400 --pty /bin/bash -l
srun: Force Terminated job 27902
srun: error: Unable to allocate resources: Requested node configuration is not available
```

But if I add a '-C c20' to constrain things to just the 20-core nodes that I've tagged with the c20 feature, we get past this and it works:

```
[u0033047@ash1:~]$ srun -p smithp-ash -A smithp-ash -t 1:00:00 --nodes=20 --ntasks=400 -C c20 --pty /bin/bash -l
[u0033047@ash252 ~]$
```

From the failed attempt I see this in /var/log/messages on the slurmctld server:

```
Apr 17 01:25:50 ashrm slurmctld[6985]: _pick_best_nodes: job 27902 never runnable
Apr 17 01:25:50 ashrm slurmctld[6985]: _slurm_rpc_allocate_resources: Requested node configuration is not available
```

Yet for the second attempt, with just the '-C c20' added, we see:

```
Apr 17 01:26:56 ashrm slurmctld[6985]: sched: _slurm_rpc_allocate_resources JobId=27903 NodeList=ash[252-271] usec=3022
```

So it seems that somehow Slurm is getting confused, but the logs don't show much about the source of that confusion when it has to pick between the 12-core and 20-core nodes. Anything else I can provide or explain to help with this? Thanks.
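As an aside, the bracketed NodeList notation in that last log line (ash[252-271]) is Slurm's compressed hostlist form; `scontrol show hostnames` expands it on a real system. A minimal sketch of the expansion for a single bracketed range (an illustration only, not a full hostlist parser) looks like this:

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'ash[252-271]'.
    Sketch handling one bracketed range only; real tooling
    (scontrol show hostnames) handles the full syntax."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]  # plain hostname, no range
    prefix, lo, hi = m.groups()
    width = len(lo)  # preserve zero-padding, e.g. ash001
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

hosts = expand_hostlist("ash[252-271]")
print(len(hosts), hosts[0], hosts[-1])  # 20 ash252 ash271
```

Expanding the allocation above confirms the 20 nodes granted to job 27903 all sit in the c20 range (ash252 and up).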