Ticket 4827 - request for single gpu on a dual gpu node fails (related to Bug 4244)
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.3
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
Reported: 2018-02-22 13:46 MST by A J
Modified: 2019-07-03 09:16 MDT
Site: USC

Description A J 2018-02-22 13:46:06 MST
This problem is similar to Bug 4244, which was noted as being fixed.

We are running Slurm 17.11.3-2.

A gres.conf file of the following form works:

  NodeName=hpc-test06  Name=gpu Type=k20 File=/dev/nvidia0 Cores=0-15
  NodeName=hpc-test06  Name=gpu Type=k20 File=/dev/nvidia1 Cores=0-15

But the desired configuration, which binds each GPU to its own set of cores, does not:

  NodeName=hpc-test06  Name=gpu Type=k20 File=/dev/nvidia0 Cores=0-7
  NodeName=hpc-test06  Name=gpu Type=k20 File=/dev/nvidia1 Cores=8-15

We are using the linear select plugin:


  SelectType=select/linear  
  SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Memory


The request for a gpu:

 srun --gres=gpu:k20:1 --ntasks=2 --cpus-per-task=8   /bin/bash -c 'echo $CUDA_VISIBLE_DEVICES'

results in the error message:

  srun: error: Unable to allocate resources: Requested node configuration is not available

This appears to be related to the test:

                if (gres_cpus != NO_VAL) {
                        gres_cpus *= cpus_per_core;
                        if ((gres_cpus < cpu_cnt) ||
                            (gres_cpus < job_ptr->details->ntasks_per_node) ||
                            ((job_ptr->details->cpus_per_task > 1) &&
                             (gres_cpus < job_ptr->details->cpus_per_task))) {
                                bit_clear(jobmap, i);
                                continue;
                        }
                }

 in the function:

  _job_count_bitmap

in the file:

   src/plugins/select/linear/select_linear.c

Specifically, the test:

                        if ((gres_cpus < cpu_cnt)

where gres_cpus is set by:

                gres_cores = gres_plugin_job_test(job_ptr->gres_list,
                                                  gres_list, use_total_gres,
                                                  NULL, core_start_bit,
                                                  core_end_bit, job_ptr->job_id,
                                                  node_ptr->name);
                gres_cpus = gres_cores;

For our configs, gres_cpus is set to 16 only if the gres.conf parameter is Cores=0-15.
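
To make the suspected failure concrete, here is a minimal standalone sketch of the quoted test. It is not Slurm code: node_rejected is a hypothetical helper that copies the condition from _job_count_bitmap, and the input values are assumptions (cpus_per_core of 1, cpu_cnt of 16, ntasks_per_node of 0 because the srun line does not set it, and gres_cpus of 8 when each GPU is limited to 8 cores).

```c
#include <stdbool.h>
#include <stdint.h>

/* Sentinel meaning "no GRES limit"; value assumed to match Slurm's NO_VAL. */
#define NO_VAL 0xfffffffeU

/*
 * Hypothetical helper mirroring the condition quoted from
 * _job_count_bitmap(). Returns true when the node would be
 * cleared from the job's candidate bitmap.
 */
bool node_rejected(uint32_t gres_cpus, uint32_t cpus_per_core,
                   uint32_t cpu_cnt, uint32_t ntasks_per_node,
                   uint32_t cpus_per_task)
{
    if (gres_cpus == NO_VAL)
        return false;               /* no GRES restriction on this node */

    gres_cpus *= cpus_per_core;     /* convert core count to CPU count */

    return (gres_cpus < cpu_cnt) ||
           (gres_cpus < ntasks_per_node) ||
           ((cpus_per_task > 1) && (gres_cpus < cpus_per_task));
}
```

Under these assumptions, node_rejected(16, 1, 16, 0, 8) returns false (the Cores=0-15 case), while node_rejected(8, 1, 16, 0, 8) returns true (the Cores=0-7 case) because 8 < 16 trips the first comparison.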

Is there some other configuration parameter that I should have set that would have changed this behavior?
Comment 1 Jacob Jenson 2018-02-22 13:53:52 MST
AJ,

Do you know if USC has a Slurm support contact? Our system was not able to associate your email address with a Slurm support contract. 

If your site has an existing Slurm support contract, please email jacob@schedmd.com to figure out why your email address is not associated with the contract.

If your site does not have a current Slurm support contract please email sales@schedmd.com to request a quote. Once a Slurm support contract is in place this ticket will be routed to the support team for quick resolution. 

Jacob
Comment 2 Jess 2019-07-01 15:09:45 MDT
Avalon Johnson at USC is the primary contact for Slurm support  :)