Ticket 9947 - Interactive jobs (--gres=gpu:<n>, n > 0) sometimes fail to start
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 20.02.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
Duplicates: 10103
 
Reported: 2020-10-07 06:26 MDT by karl-heinz.schmidmeier
Modified: 2020-11-16 10:16 MST (History)
Site: KIT
Version Fixed: 20.02.7, 20.11.0


Attachments
Content of slurm.conf, gres.conf and cgroup.conf (3.95 KB, text/plain)
2020-10-07 06:26 MDT, karl-heinz.schmidmeier
Details
slurmctld.log output with the option setdebugflag +SelectType (958.43 KB, text/plain)
2020-10-23 06:38 MDT, karl-heinz.schmidmeier
Details
fix v2 (1.32 KB, patch)
2020-11-10 07:11 MST, Marcin Stolarek
Details | Diff

Description karl-heinz.schmidmeier 2020-10-07 06:26:31 MDT
Created attachment 16150 [details]
Content of slurm.conf, gres.conf and cgroup.conf

Hello,

I successively start interactive jobs with --gres=gpu:<n>, n = {1, 3, 4, 2, 2, 3, 4, 1}. Each running job is ended with "exit" before the next one is submitted; for certain combinations the new allocation fails with an error. However, if no previous job is still in the queue, no errors occur (see the last example, n = {4, 4}).
 
Regards,
Karl-Heinz

Examples

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Granted job allocation 15989
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 tmp]$ exit
exit
salloc: Relinquishing job allocation 15989
____________

GPU:3 failed

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:3
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15989     gpu_4       sh   yc9907 CG       0:12      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 15990 has been revoked.
____________

GPU:4 failed

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15989     gpu_4       sh   yc9907 CG       0:12      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 15991 has been revoked.
___________

GPU:2 works

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15989     gpu_4       sh   yc9907 CG       0:12      1 uccn490
salloc: Pending job allocation 15992
salloc: job 15992 queued and waiting for resources
salloc: job 15992 has been allocated resources
salloc: Granted job allocation 15992
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
___________

GPU:2 works

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15992     gpu_4       sh   yc9907 CG       6:43      1 uccn490
salloc: Pending job allocation 15993
salloc: job 15993 queued and waiting for resources
salloc: job 15993 has been allocated resources
salloc: Granted job allocation 15993
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 tmp]$ exit
exit
salloc: Relinquishing job allocation 15993
____________

GPU:3 failed

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:3
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15993     gpu_4       sh   yc9907 CG       0:46      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 15994 has been revoked.
____________

GPU:4 failed

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15993     gpu_4       sh   yc9907 CG       0:46      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 15995 has been revoked.
___________

GPU:1 works

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             15993     gpu_4       sh   yc9907 CG       0:46      1 uccn490
salloc: Pending job allocation 15996
salloc: job 15996 queued and waiting for resources
salloc: job 15996 has been allocated resources
salloc: Granted job allocation 15996
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 tmp]$
____________________

GPU:4 queue is empty

[yc9907@uccn998 tmp]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Pending job allocation 16001
salloc: job 16001 queued and waiting for resources
salloc: job 16001 has been allocated resources
salloc: Granted job allocation 16001
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 tmp]$ exit
exit
salloc: Relinquishing job allocation 16001
____________________

GPU:4 queue is empty

[yc9907@uccn998 tmp]$ sleep 20;squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Pending job allocation 16002
salloc: job 16002 queued and waiting for resources
salloc: job 16002 has been allocated resources
salloc: Granted job allocation 16002
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 tmp]$ exit
exit
salloc: Relinquishing job allocation 16002
Comment 1 Jason Booth 2020-10-09 08:43:48 MDT
We are looking into this. Thank you for attaching your configuration files. We will let you know if we need any additional information.
Comment 2 Marcin Stolarek 2020-10-13 06:16:53 MDT
Karl-Heinz,

I cannot easily reproduce the issue. Could you please share your slurmctld log from the time when it happens?
Would it be possible to enable the SelectType debug flag before repeating the test? You can do that without a restart:
scontrol setdebugflag +SelectType
-> execute the commands as you did before
scontrol setdebugflag -SelectType

Setting the debug flag has to be done as a privileged user.

cheers,
Marcin
Comment 3 Marcin Stolarek 2020-10-19 10:25:38 MDT
Karl-Heinz,

Could you please take a look at comment 2?

cheers,
Marcin
Comment 4 karl-heinz.schmidmeier 2020-10-20 07:29:01 MDT
Marcin,

I would like to apologize for my late feedback. We are currently in a series of maintenance windows. I will set the flag and get back to you afterwards.

We have switched back to version 20.02.3!

Regards,
Karl-heinz
Comment 5 karl-heinz.schmidmeier 2020-10-23 06:38:00 MDT
Created attachment 16315 [details]
slurmctld.log output with the option setdebugflag +SelectType
Comment 6 karl-heinz.schmidmeier 2020-10-23 06:43:13 MDT
Marcin,

I submitted the jobs like last time.

Regards,
Karl-Heinz

Examples

Fri Oct 23-14:15:50 (40/656)
root@uccn997:/home/kit/scc/yc9907# cat bug9947_setdebugflag_+SelectType
[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Granted job allocation 16735
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 ~]$ hostname
uccn490.localdomain
[yc9907@uccn490 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-661fcf0d-6789-be11-c34c-b7404bffe51a)
[yc9907@uccn490 ~]$ exit
exit
salloc: Relinquishing job allocation 16735
_________________

GPU:3 failed

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:3
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             16735     gpu_4       sh   yc9907 CG       1:49      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 16736 has been revoked.
_________________

GPU:4 failed

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             16735     gpu_4       sh   yc9907 CG       1:49      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
_________________

GPU:2 works

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Granted job allocation 16738
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 ~]$ exit
exit
salloc: Relinquishing job allocation 16738
_________________

GPU:2 works

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             16738     gpu_4       sh   yc9907 CG       0:57      1 uccn490
salloc: Pending job allocation 16739
salloc: job 16739 queued and waiting for resources
salloc: job 16739 has been allocated resources
salloc: Granted job allocation 16739
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-661fcf0d-6789-be11-c34c-b7404bffe51a)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-f59d5aab-e029-7d6e-971e-ebdce05d3321)
[yc9907@uccn490 ~]$ exit
exit
salloc: Relinquishing job allocation 16739
_________________

GPU:3 failed

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:3
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             16739     gpu_4       sh   yc9907 CG       1:10      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 16740 has been revoked.
_________________

GPU:4 failed

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             16739     gpu_4       sh   yc9907 CG       1:10      1 uccn490
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 16741 has been revoked.
_________________

GPU:1 works

[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             16739     gpu_4       sh   yc9907 CG       1:10      1 uccn490
salloc: Pending job allocation 16742
salloc: job 16742 queued and waiting for resources
salloc: job 16742 has been allocated resources
salloc: Granted job allocation 16742
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-661fcf0d-6789-be11-c34c-b7404bffe51a)
[yc9907@uccn490 ~]$ exit
exit
salloc: Relinquishing job allocation 16742
_________________

GPU:4 queue is empty

[yc9907@uccn998 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Pending job allocation 16743
salloc: job 16743 queued and waiting for resources
salloc: job 16743 has been allocated resources
salloc: Granted job allocation 16743
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-661fcf0d-6789-be11-c34c-b7404bffe51a)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-f59d5aab-e029-7d6e-971e-ebdce05d3321)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-b34138f7-3506-7b0a-2820-b2a954e30175)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-2606330b-f1ac-1c18-00c1-60a840c2b87a)
[yc9907@uccn490 ~]$ exit
exit
salloc: Relinquishing job allocation 16743
_________________

GPU:4 queue is empty

[yc9907@uccn998 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[yc9907@uccn998 ~]$ squeue;salloc -p gpu_4 -n 5 -t 10 --gres=gpu:4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
salloc: Pending job allocation 16744
salloc: job 16744 queued and waiting for resources
salloc: job 16744 has been allocated resources
salloc: Granted job allocation 16744
salloc: Waiting for resource configuration
salloc: Nodes uccn490 are ready for job
[yc9907@uccn490 ~]$ exit
exit
salloc: Relinquishing job allocation 16744
salloc: Job allocation 16744 has been revoked.
[yc9907@uccn998 ~]$


Fri Oct 23-14:16:01 (41/657)
Comment 7 Marcin Stolarek 2020-10-28 08:26:24 MDT
Karl-Heinz,

I was able to reproduce the issue and I can see where it is coming from. I have a patch that should fix it. Would you like to apply it locally and verify? The patch has not yet passed our QA and is not scheduled for a release.

The origin of the issue is incorrect handling of DefCpuPerGPU when specified as a default. You can work around it by setting the value from a cli_filter or job_submit plugin instead of specifying it in slurm.conf.
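For illustration, the cli_filter variant could look like the sketch below. This is based on the cli_filter.lua example shipped with Slurm; the option-table keys, and whether unset options read as nil, are assumptions to verify against your Slurm version:

```lua
-- Hypothetical /etc/slurm/cli_filter.lua (requires CliFilterPlugins=lua).
-- Applies a client-side default of --cpus-per-gpu=20 for GPU partitions
-- unless the user passed their own value.
function slurm_cli_pre_submit(options, pack_offset)
        local part = options["partition"]
        -- Only touch jobs aimed at a partition with "gpu" in its name.
        if part ~= nil and string.find(part, "gpu") ~= nil then
                if options["cpus-per-gpu"] == nil then
                        options["cpus-per-gpu"] = "20"
                end
        end
        return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
        return slurm.SUCCESS
end
```

Unlike the job_submit approach, this sets the default on the client side before submission, so slurmctld never sees a job without an explicit cpus-per-gpu value.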

cheers,
Marcin
Comment 12 karl-heinz.schmidmeier 2020-11-11 03:15:26 MST
Dear Mr. Stolarek,

I would like to test the patch on our test cluster. The workaround with the cli_filter or job_submit plugin also interests me very much; I have no experience with either yet, and my knowledge of them is still rudimentary. Do you have an instructive example for this case?

Best regards
Karl-Heinz Schmidmeier
Comment 13 Marcin Stolarek 2020-11-12 03:19:38 MST
Comment on attachment 16578 [details]
fix v2

Karl-Heinz,

I'm switching the patch to public mode; you should be able to download it now.

The solution with the job_submit plugin will require you to enable JobSubmitPlugins=lua and deploy a script like the one below in the same directory as slurm.conf:
#cat /etc/slurm/job_submit.lua
function _find_in_str(str, arg)
        if str ~= nil then
                return string.find(str, arg)
        else
                return false
        end
end

function slurm_job_submit(job_desc, part_list, submit_uid)
        if _find_in_str(job_desc.partition, "gpu") then
                if job_desc.cpus_per_tres == nil then
                        job_desc.cpus_per_tres = "gpu:20"
                        slurm.info("SETTING")
                end
                slurm.info("SETTING2")
        end
        slurm.info("SETTING3")
        return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_ptr, part_list, modify_uid)
        return slurm.SUCCESS
end


The script adds --cpus-per-gpu=20 to every job submitted to a partition with "gpu" in its name, unless the user specified a different --cpus-per-gpu value. A drawback may show up if someone submits a job to multiple partitions: the --cpus-per-gpu=20 will effectively be added for every partition.
I think that it won't be an issue in your configuration (the only GPU nodes are in the gpu_ partition).

Let me know if you have any questions,
Marcin
Comment 14 Marcin Stolarek 2020-11-12 03:20:58 MST
PS. I just noticed I left debugging output in the script example. Sorry for that; you can skip the lines starting with "slurm.info".
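For reference, here is the same job_submit.lua with the debugging lines removed (the logic is unchanged from the example above):

```lua
-- /etc/slurm/job_submit.lua: default --cpus-per-gpu=20 for GPU partitions.
function _find_in_str(str, arg)
        if str ~= nil then
                return string.find(str, arg)
        else
                return false
        end
end

function slurm_job_submit(job_desc, part_list, submit_uid)
        -- Apply the default only when the user did not set --cpus-per-gpu.
        if _find_in_str(job_desc.partition, "gpu") and
           job_desc.cpus_per_tres == nil then
                job_desc.cpus_per_tres = "gpu:20"
        end
        return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_ptr, part_list, modify_uid)
        return slurm.SUCCESS
end
```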
Comment 19 Marcin Stolarek 2020-11-13 00:50:08 MST
Karl-Heinz,

The fix for the reported issue passed QA and has been merged to our public repository[1]. It will be part of the Slurm 20.02.7 release.
I'm marking the ticket as fixed now.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/0b6faf691c6fb5445fdb01c74daf81ecb87e05db
Comment 20 Marcin Stolarek 2020-11-16 10:16:52 MST
*** Ticket 10103 has been marked as a duplicate of this ticket. ***