| Summary: | Scheduler considers the GPU resource request to be invalid when it should not | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Pascal <pascal.elahi> |
| Component: | Scheduling | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | sam.yates |
| Version: | 22.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14153, https://bugs.schedmd.com/show_bug.cgi?id=17172 | | |
| Site: | Pawsey | | |
| Attachments: | slurm.conf, gres.conf, cli_filter.lua | | |
Description

Pascal, 2023-05-04 20:41:30 MDT

Ben Roberts:

Hi Pascal,

This looks like it might be related to a similar issue we've seen. The ticket shows that you are using 22.05.2. Can I have you confirm that this is correct? I'd also like to have you send a current copy of your slurm.conf and gres.conf files, if possible.

Thanks,
Ben

Created attachment 30194 [details]
slurm.conf

Created attachment 30195 [details]
gres.conf
(In reply to Ben Roberts from comment #1)
> This looks like it might be related to a similar issue we've seen. The
> ticket shows that you are using 22.05.2. Can I have you confirm that this
> is correct? I'd also like to have you send a current copy of your
> slurm.conf and gres.conf files, if possible.

I've attached our slurm.conf and gres.conf, and can confirm that we're running 22.05.2.

Cheers,
Sam

Marcin Stolarek:

Could you please share the cli_filter.lua script? Maybe I'm missing some detail, but I can't reproduce the issue with 22.05.2 and a similar config.
For instance, for the case of 2 GPUs I get:
>[root@slurmctl slurm-sources]# salloc -n 1 -N 1 -p gpu-dev --cpus-per-gpu=8 --gpus=2 srun -n 1 -N 1 --cpu-bind=none --gpu-bind=none scontrol show job -d | grep Nodes=
>salloc: Granted job allocation 10969
> NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> Nodes=test01 CPU_IDs=0-15 Mem=29440 GRES=gpu:2(IDX:4-5)
>salloc: Relinquishing job allocation 10969
>salloc: Job allocation 10969 has been revoked.
>[root@slurmctl slurm-sources]# salloc -V
>slurm 22.05.2
cheers,
Marcin
Created attachment 30329 [details]
cli_filter.lua
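For context, a site cli_filter.lua implements Slurm's Lua CLI-filter hooks, which run inside salloc/sbatch/srun around option parsing and can alter or log requests such as --gpus. The sketch below is only a minimal illustration of that interface (hook names follow Slurm's shipped cli_filter.lua example); it is an assumption for illustration, not the contents of Pawsey's actual attached script:

```lua
-- Hypothetical minimal cli_filter.lua sketch; NOT the attached site script.
-- Slurm invokes these hooks from the CLI tools around option parsing.

function slurm_cli_setup_defaults(options, early_pass)
    -- Apply site-wide defaults before user options are parsed.
    return slurm.SUCCESS
end

function slurm_cli_pre_submit(options, pack_offset)
    -- Inspect parsed options just before submission; for example,
    -- log GPU-related requests to help debug cases like this ticket.
    if options["gpus"] ~= nil then
        slurm.log_info("cli_filter: gpus=%s", tostring(options["gpus"]))
    end
    return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
    -- Runs after the job has been submitted.
    return slurm.SUCCESS
end
```

A filter like this only sees and edits the client-side option table; the scheduling decision Marcin reproduces below is made by slurmctld, which is why sharing the real script matters for ruling the filter in or out.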
Pascal, Sam,

I was able to reproduce the issue on Slurm 22.05.2 and confirmed that it's already fixed in Slurm 22.05.7 (by commit 8e1231028dc). Could you please verify whether the issue is resolved by a Slurm update?

cheers,
Marcin

Any update from your side? In case of no reply I'll go ahead and close the bug as a duplicate of bug 14153.

cheers,
Marcin

Hi Marcin,

We haven't tested a newer version of Slurm, primarily because Setonix currently has an older version "blessed" by Cray. We are looking into installing newer versions decoupled from the Cray "blessed" version, but that is in the future. Feel free to close the ticket.

Cheers,
Pascal

I'm marking the ticket as a duplicate of Bug 14153.

*** This ticket has been marked as a duplicate of ticket 14153 ***