| Summary: | Only half the requested CPU cores are available when asking for a single GPU (GRES) resource | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | hpc-ops |
| Component: | slurmstepd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | marshall |
| Version: | 22.05.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ghent | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm config file | | |
Andy, I can't easily reproduce the behavior. Could you please attach the output of `lstopo-no-graphics` and your gres.conf? cheers, Marcin

Could you please take a look at the last comment? cheers, Marcin

Hi, we're in the process of changing the config and will see if this gets fixed. -- Andy

Any update from your side?

Hi, we're trying the upstream/slurm-22.05 branch to see if this works out better, but so far no luck, AFAIK. -- Andy

Is this ticket effectively a duplicate of Bug 15614?

Is there anything else I can help you with in this bug report?

This is OK now; you can close this ticket.
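For context on why the requested `lstopo-no-graphics` output and gres.conf matter here: Slurm associates each GPU with the CPU cores listed in its gres.conf `Cores=` entry, and if that list is expressed in a different core numbering than the one the kernel actually uses (for example, when logical numbering interleaves sockets), a GPU-bound step can end up confined to every other CPU id. The entry below is a purely hypothetical sketch, not the site's actual file:

```
# Hypothetical gres.conf sketch (node name and device path assumed):
# the Cores= range must use the same CPU numbering that
# lstopo-no-graphics reports, or the step's cpuset may land on
# unintended cores.
NodeName=node3303 Name=gpu File=/dev/nvidia0 Cores=0-15
```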
Created attachment 27960 [details]
slurm config file

Hi,

When submitting a job to a GPU cluster, we're not quite understanding why the step_0 cgroup only gets 16 cores instead of the expected 32. This is the job submission command:

```
/usr/bin/salloc --reservation=maintenance2022Q4 --cpus-per-gpu=32 --gres=gpu:1 --job-name=INTERACTIVE --mail-type=NONE --nodes=1 --ntasks-per-node=32 --ntasks=32 --time=3-00:00:00 /usr/bin/srun --chdir=/user/gent/400/vsc40003 --cpu-bind=none --export=USER,HOME,TERM --mem=0 --mpi=none --nodes=1 --ntasks=1 --pty /bin/bash -i -l
salloc: Granted job allocation 40271421
salloc: Waiting for resource configuration
salloc: Nodes node3303.joltik.os are ready for job
```

Which then yields:

```
[vsc40003@node3303 ~]$ nproc
16
```

Looking at the job's info, I see:

```
TRES=cpu=32,mem=262080M,node=1,billing=33,gres/gpu=1
```

Looking at the cgroups, I see:

```
[root@node3303 job_40271421]# cat cpuset.cpus
0-31
```

(and the same for step_extern), but:

```
[root@node3303 step_0]# cat cpuset.cpus
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
```

Somehow this job step only gets the even cores. Is this expected, or do we need to configure something differently? When not asking for any GPUs, we do see that 32 cores are assigned to the job step.

Our cgroup config is:

```
AllowedSwapSpace=0
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```

-- Andy
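The two `cpuset.cpus` values above can be compared with a small helper. This is a hypothetical function (not part of Slurm or the kernel) that counts the CPUs covered by a kernel cpuset list string, confirming that the job cgroup's `0-31` spans 32 CPUs while step_0's even-only list spans 16:

```shell
# Hypothetical helper: count CPUs in a kernel cpuset list string,
# which is a comma-separated mix of single ids ("7") and inclusive
# ranges ("0-31").
count_cpus() {
  local total=0 part lo hi parts
  IFS=',' read -ra parts <<< "$1"
  for part in "${parts[@]}"; do
    if [[ "$part" == *-* ]]; then
      lo=${part%-*}
      hi=${part#*-}
      total=$(( total + hi - lo + 1 ))
    else
      total=$(( total + 1 ))
    fi
  done
  echo "$total"
}

count_cpus "0-31"                                         # -> 32 (job cgroup)
count_cpus "0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30"   # -> 16 (step_0)
```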