| Summary: | sbatch --partition=smithp-ash --account=smithp-ash --time=10:00:00 --nodes=20 --ntasks=400 /tmp/visit.u0033047.14:26:29 reports sbatch: error: Batch job submission failed: Requested node configuration is not available | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Brian Haymore <brian.haymore> |
| Component: | Scheduling | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Utah | | |
| Attachments: | My slurm.conf file | | |
Hey Brian,

I'll try to reproduce it with your slurm.conf and get back to you.

Thanks,
Brian

I'm able to reproduce the problem and understand what's happening. Slurm is trying to get the 20 nodes and the 400 tasks from the 24-CPU (12-core) nodes but can't. Requesting the feature filters out the 24-CPU nodes. A couple of workarounds are:

1. Specify the feature for these types of jobs.
2. Specify --ntasks-per-node=20.
3. Change the nodes' weights so that the 40-CPU nodes are looked at first.

The cons_res plugin doesn't have this issue. I will look into possibly changing the select/linear plugin to account for this type of situation. Let me know if you have any questions.

Thanks,
Brian

So I had another question based on your response. You point out that select/cons_res does not suffer from this. If I am not focused on sharing nodes, nor focused on allocating less than a whole node, should I consider moving to select/cons_res? I guess the simpler question here is: "if" nobody really uses select/linear and select/cons_res is much more heavily used, tested, and developed, maybe we should change? I'd rather not change, but I'd also rather pick the path that would give us the best chance of being in the "better" state. Guide me on this please. If linear is well used/developed, just tell me I'm good. :) Thanks.

(In reply to Brian Haymore from comment #3)
> I guess the simpler question here is "if" nobody really uses select/linear
> and select/cons_res is much more heavily used, tested, and developed

Yes on all counts.

> maybe we should change?

Your call. You can certainly use select/cons_res to allocate whole nodes, and it is much more heavily used, tested, and developed...

So I'm playing with this on a test cluster of eight 8-core nodes. The setup I have uses:

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE
```

From reading things and from simple tests, this seems to give me pretty much what I had with linear.
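For context, the core-count arithmetic behind the original failure can be sketched as follows. This is an illustrative model only, not Slurm's actual selection code; the node and task figures are taken from the report, and with CR_ONE_TASK_PER_CORE a node's task capacity is its physical core count, not its hyperthreaded CPU count.

```python
import math

# Illustrative model (not Slurm internals): with CR_ONE_TASK_PER_CORE,
# a node hosts at most one task per physical core, so hyperthreads
# (e.g. CPUs=24 on a 12-core node) add no task capacity.
def request_fits(cores_per_node, nodes, ntasks):
    """Spreading ntasks over exactly `nodes` nodes, does the busiest
    node stay within its physical core count?"""
    return math.ceil(ntasks / nodes) <= cores_per_node

# --nodes=20 --ntasks=400 puts at least 20 tasks on some node.
print(request_fits(12, 20, 400))  # 12-core (c12) nodes: False
print(request_fits(20, 20, 400))  # 20-core (c20) nodes: True
```

With equal weights on both node types, this is why steering the selection (via the c20 feature, --ntasks-per-node, or lower weights on the 40-CPU nodes) makes the request succeed.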
However, from reading about this mode, the default behavior suggests that if I were to ask for a job to run on 6 nodes with an ntasks of 36, it would evenly put 6 processes per node, yet it is not doing that. I'm finding the allocation of cores on nodes to be a bit odd. For example, doing what I said above, -N 6 -n 36, I get this distribution:

```
SLURM_STEP_TASKS_PER_NODE=7(x4),6,2
SLURM_STEP_NODELIST=sp[003-008]
```

Just for reference, my node definitions are below. You will see that all nodes have the same socket, core, and thread numbers:

```
# CHPC General Nodes
NodeName=sp[001-002] Feature=chpc,c8 Weight=1 NodeAddr=sp[001-002] Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
#
# Cardoen Owner Nodes
NodeName=sp[003-004] Feature=cardoen,c8 Weight=1 NodeAddr=sp[003-004] Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
#
# Cuma Owner Nodes
NodeName=sp005 Feature=cuma,c8 Weight=1 NodeAddr=sp005 Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
#
# Orendt Owner Nodes
NodeName=sp[006-008] Feature=orendt,c8 Weight=1 NodeAddr=sp[006-008] Sockets=2 CPUs=16 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24148 TmpDisk=103373
```

So why am I getting allocated 7 cores on the first 4 nodes, 6 on the 5th node, and 2 on the 6th node?

There's one more bit that you need. If you want to continue allocating whole nodes to jobs, add "Shared=exclusive" to the partition(s) configuration line. The tasks will then be evenly distributed across those nodes. Without that, your job will get 36 cores (one for each task) distributed in an arbitrary fashion across those 6 nodes.

Thanks! That got it. Though the web docs seem to suggest that this should have been the behavior from the start, when you read through the SelectTypeParameters for cons_res.
Specifically, the wording in the slurm.conf documentation for CR_Pack_Nodes implies that without CR_Pack_Nodes set, job tasks will be evenly distributed, but with it set they will be packed in. There isn't any mention of this on the cons_res-specific documentation page either. Maybe an update to the docs could get into the queue for when time allows? Thanks again for the help!

(In reply to Brian Haymore from comment #7)
> Thanks! That got it
> Maybe an update to the docs could get
> into the queue for when time allows?

Done in v14.11.6 (when released): https://github.com/SchedMD/slurm/commit/a001b4b67aa4986ddf3664f59b455e7604225015
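The two distribution styles discussed in this thread can be sketched with a toy model. This is not Slurm's implementation, just an illustration of the difference between even (block) distribution, as seen once Shared=exclusive is set, and a packed fill, as CR_Pack_Nodes describes; the 6-node/36-task figures come from the example above.

```python
def distribute(ntasks, nodes, cores_per_node, pack=False):
    """Toy model (not Slurm internals) of tasks-per-node layout
    for a whole-node allocation: packed vs. evenly spread."""
    if pack:
        # CR_Pack_Nodes-style: fill each node's cores before moving on.
        counts = []
        for _ in range(nodes):
            take = min(cores_per_node, ntasks)
            counts.append(take)
            ntasks -= take
        return counts
    # Even (block) distribution across all requested nodes.
    base, extra = divmod(ntasks, nodes)
    return [base + 1 if i < extra else base for i in range(nodes)]

print(distribute(36, 6, 8))             # [6, 6, 6, 6, 6, 6]
print(distribute(36, 6, 8, pack=True))  # [8, 8, 8, 8, 4, 0]
```

The arbitrary 7(x4),6,2 layout seen without Shared=exclusive is neither of these; it falls out of per-core selection order rather than any per-node target.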
Created attachment 1825 [details]: My slurm.conf file

BTW, before I get started I should report that, as of April 2nd, we have fully converted all of our clusters to SLURM. :)

We have a report that I've been able to reproduce tonight. This is a cluster with 417 compute nodes. The first 251 nodes are 12-core (24 with HT) nodes (2 sockets, 6 cores per socket, 2 threads per core). Then from node 252 to 415 we have 20-core (40 with HT) nodes (2 sockets, 10 cores per socket, 2 threads per core). Then nodes 416 and 417 are again a copy of the first 251 nodes, with 12 cores (24 with HT). See here from the slurm.conf:

```
# SmithP Owner Nodes
NodeName=ash[001-251] Feature=smithp,c12 Weight=1 NodeAddr=10.242.71.[1-251] Sockets=2 CPUs=24 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24000 TmpDisk=425000
NodeName=ash[416-417] Feature=smithp,c12 Weight=1 NodeAddr=10.242.72.[162-163] Sockets=2 CPUs=24 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24000 TmpDisk=425000
NodeName=ash[252-254] Feature=smithp,c20 Weight=1 NodeAddr=10.242.71.[252-254] Sockets=2 CPUs=40 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=64000 TmpDisk=820000
NodeName=ash[255-415] Feature=smithp,c20 Weight=1 NodeAddr=10.242.72.[1-161] Sockets=2 CPUs=40 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=64000 TmpDisk=820000
```

I also have this set in slurm.conf:

```
SelectTypeParameters=CR_ONE_TASK_PER_CORE
```

So, to walk you through the steps to reproduce this issue, first here is the current state of the cluster:

```
[u0033047@ash1:~]$ sinfo
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
ash-guest    up   3-00:00:00     56  alloc  ash[001-055,412]
ash-guest    up   3-00:00:00    361  idle   ash[056-411,413-417]
smithp-ash   up   14-00:00:0     56  alloc  ash[001-055,412]
smithp-ash   up   14-00:00:0    361  idle   ash[056-411,413-417]
```

So we see we have 361 idle nodes, and most of them are the newer 20-core nodes.
So when the user tries to submit with this command, we see the initial error:

```
[u0033047@ash1:~]$ srun -p smithp-ash -A smithp-ash -t 1:00:00 --nodes=20 --ntasks=400 --pty /bin/bash -l
srun: Force Terminated job 27902
srun: error: Unable to allocate resources: Requested node configuration is not available
```

But if I add a '-C c20' to constrain things to just the 20-core nodes that I've tagged with the c20 feature, we get past this and it works:

```
[u0033047@ash1:~]$ srun -p smithp-ash -A smithp-ash -t 1:00:00 --nodes=20 --ntasks=400 -C c20 --pty /bin/bash -l
[u0033047@ash252 ~]$
```

From the failed attempt I see this in /var/log/messages on the slurmctld server:

```
Apr 17 01:25:50 ashrm slurmctld[6985]: _pick_best_nodes: job 27902 never runnable
Apr 17 01:25:50 ashrm slurmctld[6985]: _slurm_rpc_allocate_resources: Requested node configuration is not available
```

Yet for the second attempt, with just the '-C c20' added, we see:

```
Apr 17 01:26:56 ashrm slurmctld[6985]: sched: _slurm_rpc_allocate_resources JobId=27903 NodeList=ash[252-271] usec=3022
```

So it seems that somehow Slurm is getting confused, but the logs don't show much about the source of that confusion when it has to pick between the 12-core and 20-core nodes. Anything else I can provide or explain to help with this? Thanks.
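As an aside, the bracketed NodeList notation in that last log line (ash[252-271]) is Slurm's compressed hostlist form; `scontrol show hostnames` expands it on a real system. A minimal sketch of the expansion for a single bracketed range (an illustration only, not a full hostlist parser) looks like this:

```python
import re

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'ash[252-271]'.
    Sketch handling one bracketed range only; real tooling
    (scontrol show hostnames) handles the full syntax."""
    m = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]  # plain hostname, no range
    prefix, lo, hi = m.groups()
    width = len(lo)  # preserve zero-padding, e.g. ash001
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

hosts = expand_hostlist("ash[252-271]")
print(len(hosts), hosts[0], hosts[-1])  # 20 ash252 ash271
```

Expanding the allocation above confirms the 20 nodes granted to job 27903 all sit in the c20 range (ash252 and up).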