Summary: | Request multiple GPUs | ||
---|---|---|---|
Product: | Slurm | Reporter: | John Wang <john.wang> |
Component: | GPU | Assignee: | Tyler Connel <tyler> |
Status: | RESOLVED TIMEDOUT | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | tyler |
Version: | 23.02.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18217 | ||
Site: | Emory-Cloud | | |
Description (John Wang, 2023-11-20 13:46:49 MST)
I tested with the following Slurm settings:

```
#SBATCH --nodes=1
#SBATCH --gpus=6
#SBATCH --mem=360G
#SBATCH --ntasks=6
```

or

```
#SBATCH --nodes=1
#SBATCH --gpus=6
#SBATCH --mem=360G
#SBATCH --ntasks-per-node=6
```

Both fail with "sbatch: error: Batch job submission failed: Requested node configuration is not available". But

```
#SBATCH --nodes=1
#SBATCH --gpus=6
#SBATCH --mem=360G
#SBATCH --ntasks=1
```

or

```
#SBATCH --nodes=1
#SBATCH --gpus=6
#SBATCH --mem=360G
#SBATCH --ntasks-per-node=1
```

works.

Thanks,
John Wang

Comment (Tyler Connel)

Hello @John,

Does it help to specify --gpus-per-task=1? I'm presuming you want 1 GPU per task, so let me know if that's wrong. For example:

```
#SBATCH --nodes=1
#SBATCH --gpus=6
#SBATCH --ntasks-per-node=6
#SBATCH --gpus-per-task=1
#SBATCH --mem=360G
```

Best,
Tyler Connel

Comment (Tyler Connel)

Also, if the prior suggestion doesn't help, please share your gres.conf and slurm.conf so we can try to rule out any abnormality in the configuration of the nodes as well.

Comment (John Wang)

Hi Tyler,

I tested #SBATCH --gpus-per-task=1 in my Slurm script and found that my Python program used the number of GPUs I requested. I have not heard back from the user. Below is the cgroup.conf of Slurm in AWS ParallelCluster. Please review it to see if there is anything I could improve.

```
$ cat cgroup.conf
###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
ConstrainCores=yes
#
# WARNING!!! The slurm_parallelcluster_cgroup.conf file included below can be updated by the pcluster process.
# Please do not edit it.
include slurm_parallelcluster_cgroup.conf

$ cat slurm_parallelcluster_cgroup.conf
# slurm_parallelcluster.conf is managed by the pcluster processes.
# Do not modify.
# Please add user-specific slurm configuration options in cgroup.conf
ConstrainRAMSpace=yes
```

Thanks,
John Wang

Comment (Tyler Connel)

Hello @John,

There was a very similar issue reported pertaining to behavior that deviated from the documentation for the `--ntasks` and `--ntasks-per-node` options. I suspect this issue is related, but somewhat different.
They're nearing resolution, and I want to see what they settle on and whether a similar change might apply to this issue. In the meantime, have you heard back from the user yet as to whether specifying `--gpus-per-task` was helpful?

Best,
Tyler Connel

Comment (Tyler Connel)

Hello John,

Apologies for not replying earlier. Looking back over this issue, I don't see anything awry with your cgroup.conf file, although the include directive suggests other configuration settings could be present. Did you hear back from the user about setting --gpus-per-task in the batch script, perchance?

Best,
Tyler Connel

Comment (Tyler Connel)

Hello John,

This ticket has been idle for some time, so I'll mark it as timed-out and assume that the --gpus-per-task option was able to help you.

Best,
Tyler Connel
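Since the site's gres.conf was never shared on this ticket, the node-side GPU definition could not be ruled out. For reference, a gres.conf entry for nodes with six GPUs might look like the following (the node name and device paths are hypothetical, not taken from this site):

```
# gres.conf (hypothetical example): declare six GPUs on each node so
# that a request like --gpus=6 can be satisfied on a single node.
NodeName=gpu-node[01-04] Name=gpu File=/dev/nvidia[0-5]
```

A mismatch between the GRES declared here and the GPUs a job requests is one common cause of the "Requested node configuration is not available" error John reported.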
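The per-task GPU binding that `--gpus-per-task=1` is meant to produce can be checked from inside a job step. Below is a minimal sketch, assuming Slurm exports `CUDA_VISIBLE_DEVICES` for each task (the usual behavior when GPU GRES is configured); the fallback default here exists only so the snippet also runs outside a job:

```
#!/bin/sh
# Sketch: count the GPUs visible to this task. Inside a Slurm job with
# GPU binding, CUDA_VISIBLE_DEVICES is set per task; outside a job we
# fall back to a single simulated device so the script still runs.
devices="${CUDA_VISIBLE_DEVICES:-0}"

# Count the comma-separated device IDs.
count=$(printf '%s' "$devices" | awk -F',' '{print NF}')
echo "visible GPUs: $count"
```

Run under `srun` with `--gpus-per-task=1`, each of the six tasks should report a single visible GPU, confirming the binding John's Python program observed.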