| Summary: | GPU setup help | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | DRW GridOps <gridadm> |
| Component: | User Commands | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 22.05.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DRW Trading | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf from the relevant cluster | ||
Please try the following:
> $ srun -p lab21pbar --gres=gpu:1 uname -n
> $ srun -p lab21pbar --gres=gpu:A2:1 uname -n
Note the addition of "-p lab21pbar", the partition that the GPU node is configured to run in.
The default partition is "lab21pmain", and as the error states, the request can never be satisfied there because the GPU is not present on any node in that partition.
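For reference, a minimal sketch of the kind of slurm.conf partition layout this describes. The partition and node names are taken from this ticket; the node membership of the default partition is an assumption for illustration:

```
# Hypothetical slurm.conf fragment (node lists assumed, not from the ticket).
# lab21pmain is the default partition and does not contain the GPU node,
# so --gres=gpu requests submitted without -p can never be satisfied.
PartitionName=lab21pmain Nodes=chhq-vulgcmp[003-005] Default=YES State=UP
# lab21pbar contains chhq-sulgcmp006, the only node with the A2 GPU,
# so GPU jobs must target it explicitly with "-p lab21pbar".
PartitionName=lab21pbar Nodes=chhq-sulgcmp006 State=UP
```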
Oh my god. Please close this ticket as 'user is a moron'. Thank you. Resolving.
Created attachment 27995 [details]
slurm.conf from the relevant cluster

Hello, I'm trying to add a single GPU to my test cluster. I've got the card in, CUDA installed, and gotten the GRES to show up in sinfo, but I am unable to actually run jobs against it. I'm not sure what I've got wrong here, and although I can find numerous examples online, none seem to help. Could I get some insight into what I'm doing wrong?

The gist of it is that I have a single R650 compute node (chhq-sulgcmp006) in the cluster with a Tesla A2. There are no other GPUs. I have AutoDetect disabled because we did not compile Slurm with NVML support, but from what I have read it should not be needed.

My gres.conf:

AutoDetect=off
NodeName=chhq-sulgcmp006 Name=gpu Type=A2 File=/dev/nvidia0 COREs=0

Relevant bits of my slurm.conf (will attach the whole thing as well):

GresTypes=disk,io_disk,io_nic,gpu
NodeName=DEFAULT Sockets=2 CoresPerSocket=18 RealMemory=500000 ThreadsPerCore=1
NodeName=chhq-sulgcmp006 TmpDisk=4000 Gres=disk:20000,io_nic:1,io_disk:1,gpu:A2:1

sinfo:

# sinfo -o "%20N %6t %3c %10m %10f %50G"
NODELIST             STATE CPU MEMORY  AVAIL_FEAT GRES
chhq-sulgcmp006,chhq idle  4+  12000+  (null)     disk:20000,io_disk:1,io_nic:1
chhq-vulgcmp[003-005 idle  4   12000   (null)     disk:10000,io_disk:1,io_nic:1

srun attempt:

$ srun --gres=gpu:1 uname -n
srun: error: Unable to allocate resources: Requested node configuration is not available
$ srun --gres=gpu:A2:1 uname -n
srun: error: Unable to allocate resources: Requested node configuration is not available

Help?