Ticket 4232 - Gpu type : Requested node configuration is not available
Summary: Gpu type : Requested node configuration is not available
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.02.7
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-10-07 03:06 MDT by kaydhi555
Modified: 2017-10-07 03:07 MDT (History)
1 user

See Also:
Site: -Other-


Attachments
All logs and confs (21.46 KB, text/plain)
2017-10-07 03:06 MDT, kaydhi555

Description kaydhi555 2017-10-07 03:06:57 MDT
Created attachment 5340 [details]
All logs and confs

Hi, 
I've got a weird behaviour (on Slurm 17.02.7) when specifying the GPU type in slurm.conf and gres.conf: both slurmd and slurmctld can see the GPUs, but when I submit a job requesting GPUs, it fails with the error message:
sbatch --gres=gpu:1 runscripts/show_gpu.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available

The show_gpu.sh script is just an "echo $CUDA_VISIBLE_DEVICES".
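For completeness, a minimal sketch of what that script presumably looks like (reconstructed from the description above; the actual file is in the attachment):

```shell
#!/bin/bash
# Hypothetical reconstruction of runscripts/show_gpu.sh:
# print the GPU indices that Slurm's gres/gpu plugin exported for this job.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```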

Please find attached the logs of slurmd, slurmctld, gres.conf and slurm.conf. 
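Since the problem hinges on how the GPU type is declared, this is the general shape of the configuration involved (illustrative only: node name, device paths, and counts are guesses, and the real files are in the attachment). One thing worth double-checking is that the typed and untyped forms agree between the two files, i.e. a Type=pascal in gres.conf paired with a typed Gres=gpu:pascal:4 in slurm.conf:

```
# gres.conf (illustrative -- actual file attached)
NodeName=titanic-1 Name=gpu Type=pascal File=/dev/nvidia[0-3]

# slurm.conf (illustrative -- actual file attached)
GresTypes=gpu
NodeName=titanic-1 Gres=gpu:pascal:4
```

With the type declared, a job can also request it explicitly, e.g. `sbatch --gres=gpu:pascal:1`; comparing the typed request against the untyped `--gres=gpu:1` can help narrow down whether the per-type counts are being registered as expected.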

The interesting part is in the slurmctld log.
At init:
slurmctld: gres/gpu: state for titanic-1
slurmctld:   gres_cnt found:4 configured:4 avail:4 alloc:0
slurmctld:   gres_bit_alloc:
slurmctld:   gres_used:(null)
slurmctld:   type[0]:pascal
slurmctld:    topo_cpus_bitmap[0]:NULL
slurmctld:    topo_gres_bitmap[0]:0-3
slurmctld:    topo_gres_cnt_alloc[0]:0
slurmctld:    topo_gres_cnt_avail[0]:4
slurmctld:   type[0]:pascal
slurmctld:    type_cnt_alloc[0]:0
slurmctld:    type_cnt_avail[0]:4
....
When submitting a job:
slurmctld: gres: gpu state for job 90
slurmctld:   gres_cnt:1 node_cnt:0 type:(null)
slurmctld: debug2: found 1 usable nodes from config containing titanic-1
slurmctld: debug3: _pick_best_nodes: job 90 idle_nodes 1 share_nodes 1
slurmctld: _pick_best_nodes: job 90 never runnable in partition titanic
slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not available

So slurmctld finds that the node titanic-1 could run it, but _pick_best_nodes is the one rejecting the job?

Thanks for your help.