I have configured slurm.conf and gres.conf with GPU information. The slurmctld.log, with gres debugging on, shows the nodes are seeing and reporting the GPU resources. But when trying to run srun --gres=gpu:1 (or any other combination), I get the error "Requested node configuration is not available":

    [jpr@login002 gputest-cluster]$ srun --gres=gpu hostname
    srun: error: Unable to allocate resources: Requested node configuration is not available
    srun: Force Terminated job 40

The slurmctld.log shows:

    [2017-10-10T13:15:17.379] debug: sched: Running job scheduler
    [2017-10-10T13:15:17.717] gres: gpu state for job 40
    [2017-10-10T13:15:17.717]   gres_cnt:1 node_cnt:0 type:(null)
    [2017-10-10T13:15:17.717] _pick_best_nodes: job 40 never runnable in partition normal
    [2017-10-10T13:15:17.718] _slurm_rpc_allocate_resources: Requested node configuration is not available

This seems to be an exact duplicate of the issue reported in bug #4232: https://bugs.schedmd.com/show_bug.cgi?id=4232
Created attachment 5356 [details] slurm.conf
Created attachment 5357 [details] gres.conf
Interestingly, I ran a similar config on a 16.05.9 test box and it works just fine. I enabled GresTypes=gpu and set one of the nodes to have a Gres=gpu:tty:4 resource in slurm.conf. On the target node, my gres.conf contains:

    Name=gpu Type=tty File=/dev/tty[0-3]

After putting the config in place, I can srun and request the resources as expected:

    jpr@oakmnt:~/projects/slurm$ sinfo -o "%N %G"
    NODELIST GRES
    oakcompute[0-5] (null)
    oakcompute6 gpu:tty:4
    jpr@oakmnt:~/projects/slurm$ srun --gres=gpu ./test-gpu.sh
    0
    jpr@oakmnt:~/projects/slurm$ srun --gres=gpu:tty ./test-gpu.sh
    0
    jpr@oakmnt:~/projects/slurm$ srun --gres=gpu:tty:2 ./test-gpu.sh
    0,1
    jpr@oakmnt:~/projects/slurm$ srun --gres=gpu:tty:4 ./test-gpu.sh
    0,1,2,3

My test-gpu.sh just echoes CUDA_VISIBLE_DEVICES.
We are currently looking into this and will keep you updated. --Isaac
On the initial test cluster, it appears that CPU affinities were causing the problem: removing the CPUs=0 and CPUs=1 entries from the gres.conf lines allowed the GPU resource allocation to succeed. The second test cluster, which works with and without the CPUs entries in its gres.conf, does have a somewhat fuller slurm.conf, with Proctrack set to linuxproc, select/cons_res active, a Slurm accounting DB, and jobacct_gather/linux. None of these settings are listed as required on the https://slurm.schedmd.com/gres.html docs page.
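To make the change described above concrete, here is a hypothetical before/after of the affected gres.conf lines. The device paths and CPU ids are assumptions for illustration (the real files are in the attachments); only the removal of the CPUs= fields comes from this comment:

```
# Before (allocation failed under the basic slurm.conf) -- illustrative:
#   Name=gpu File=/dev/nvidia0 CPUs=0
#   Name=gpu File=/dev/nvidia1 CPUs=1
# After (allocation succeeded once the CPU affinity fields were dropped):
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```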
One further point: it seems that this initial basic slurm.conf that I attached also doesn't support using the Type parameter. The only gres.conf that works for me right now is:

    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1
    Name=gpu File=/dev/nvidia2
    Name=gpu File=/dev/nvidia3

I'm not sure what aspect of the slurm.conf enables the Type parameter and CPUs affinities; both work with the fuller slurm.conf described in the previous comment.
So I individually added each feature missing from my simpler slurm.conf, where the Type parameter for GPUs was failing: slurmdbd accounting, jobacct_gather/linux, then select/cons_res. It was adding cons_res that finally allowed the Type parameter to function:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core

Now the sruns work as expected:

    [jpr@login002 slurm-config]$ srun --gres=gpu ./test-gpu.sh
    0
    [jpr@login002 slurm-config]$ srun --gres=gpu:p100 ./test-gpu.sh
    0
    [jpr@login002 slurm-config]$ srun --gres=gpu:p100:2 ./test-gpu.sh
    0,1
    [jpr@login002 slurm-config]$ srun --gres=gpu:p100:4 ./test-gpu.sh
    0,1,2,3

I don't believe this requirement is documented.
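For anyone hitting the same symptom, a minimal sketch of the slurm.conf additions that made the Type parameter work in my setup. Only the two Select* lines are taken from this comment; the GresTypes line and the node definition (name, GPU count) are assumed context for illustration, not my actual attached file:

```
# The addition that made --gres=gpu:<type> requests succeed:
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# Assumed surrounding context (illustrative placeholders):
GresTypes=gpu
NodeName=c0097 Gres=gpu:p100:4
```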
Hi. This has been fixed in the following commit: https://github.com/SchedMD/slurm/commit/6ceaa49efa5d6, which will be available in the next Slurm tag, 17.02.8. Three of us have tested the patch, and select/linear can now also make use of gres.conf-defined lines including the Type and/or CPUs options; previously only select/cons_res could accept requests with such a configuration. I'm closing the bug as fixed. Please reopen if you encounter further issues. Thanks for reporting.
Hi. Is this problem resolved? We are running 17.11.3-2, and a gres.conf file in the form:

    NodeName=hpc-test06 Name=gpu Type=k20 File=/dev/nvidia0 Cores=0-15
    NodeName=hpc-test06 Name=gpu Type=k20 File=/dev/nvidia1 Cores=0-15

works, but the desired configuration:

    NodeName=hpc-test06 Name=gpu Type=k20 File=/dev/nvidia0 Cores=0-7
    NodeName=hpc-test06 Name=gpu Type=k20 File=/dev/nvidia1 Cores=8-15

does not. The request for a GPU:

    srun --gres=gpu:k20:1 --ntasks=2 --cpus-per-task=8 /bin/bash -c 'echo $CUDA_VISIBLE_DEVICES'

results in the error message:

    srun: error: Unable to allocate resources: Requested node configuration is not available

This appears to be related to the test in the function _job_count_bitmap in src/plugins/select/linear/select_linear.c:

    if (gres_cpus != NO_VAL) {
        gres_cpus *= cpus_per_core;
        if ((gres_cpus < cpu_cnt) ||
            (gres_cpus < job_ptr->details->ntasks_per_node) ||
            ((job_ptr->details->cpus_per_task > 1) &&
             (gres_cpus < job_ptr->details->cpus_per_task))) {
            bit_clear(jobmap, i);
            continue;
        }
    }

specifically the comparison (gres_cpus < cpu_cnt), where gres_cpus is set by:

    gres_cores = gres_plugin_job_test(job_ptr->gres_list,
                                      gres_list, use_total_gres,
                                      NULL, core_start_bit,
                                      core_end_bit, job_ptr->job_id,
                                      node_ptr->name);
    gres_cpus = gres_cores;

For our configs, gres_cpus is set to 16 only if the config parameter is Cores=0-15. Is there some other configuration parameter that I should have set that would have changed this behavior? Thanks
Also, I neglected to mention that we are using:

    SelectType=select/linear
    SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Memory
AJ, would you mind please submitting a new ticket? I don't want to hijack someone else's ticket.
Thanks. I have submitted a new bug: Bug 4827. Avalon