| Summary: | Gpu type : Requested node configuration is not available | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | kaydhi555 |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | CC: | kaydhi555 |
| Version: | 17.02.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | ||
| Attachments: | All logs and confs | ||
Created attachment 5340 [details]
All logs and confs

Hi,

I'm seeing some weird behaviour (on Slurm 17.02.7) when specifying the GPU type in slurm.conf and gres.conf: both slurmd and slurmctld can see the GPUs, but when I submit a job requesting GPUs, it fails with this error:

```
sbatch --gres=gpu:1 runscripts/show_gpu.sh
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

The show_gpu.sh script is just an `echo $CUDA_VISIBLE_DEVICES`.

Please find attached the logs of slurmd and slurmctld, along with gres.conf and slurm.conf. The interesting part is in the slurmctld log. At init:

```
slurmctld: gres/gpu: state for titanic-1
slurmctld: gres_cnt found:4 configured:4 avail:4 alloc:0
slurmctld: gres_bit_alloc:
slurmctld: gres_used:(null)
slurmctld: type[0]:pascal
slurmctld: topo_cpus_bitmap[0]:NULL
slurmctld: topo_gres_bitmap[0]:0-3
slurmctld: topo_gres_cnt_alloc[0]:0
slurmctld: topo_gres_cnt_avail[0]:4
slurmctld: type[0]:pascal
slurmctld: type_cnt_alloc[0]:0
slurmctld: type_cnt_avail[0]:4
....
```

When submitting a job:

```
slurmctld: gres: gpu state for job 90
slurmctld: gres_cnt:1 node_cnt:0 type:(null)
slurmctld: debug2: found 1 usable nodes from config containing titanic-1
slurmctld: debug3: _pick_best_nodes: job 90 idle_nodes 1 share_nodes 1
slurmctld: _pick_best_nodes: job 90 never runnable in partition titanic
slurmctld: _slurm_rpc_submit_batch_job: Requested node configuration is not available
```

So slurmctld does find that node titanic-1 can run the job, but `_pick_best_nodes` is the one rejecting it?

Thanks for your help.
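For context, a minimal typed-GPU configuration consistent with the log output above (type `pascal`, four GPUs on node `titanic-1`) would look roughly like the sketch below. This is an illustrative reconstruction, not the attached files; the GPU device paths and the trailing node/partition options are assumptions:

```
# slurm.conf (sketch; the real file is in the attachment)
GresTypes=gpu
NodeName=titanic-1 Gres=gpu:pascal:4 State=UNKNOWN
PartitionName=titanic Nodes=titanic-1 Default=YES State=UP

# gres.conf on titanic-1 (sketch; device paths assumed)
Name=gpu Type=pascal File=/dev/nvidia[0-3]
```

With a `Type=` configured, the GPU can also be requested with the type included, e.g. `sbatch --gres=gpu:pascal:1 runscripts/show_gpu.sh`; note that in the failing submission above the request was the untyped `--gres=gpu:1`.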