Ticket 21862

Summary: Is it possible to define a TimeLimit per Gres/GPU?
Product: Slurm Reporter: Motaz Diab <Motaz.Diab>
Component: GPU    Assignee: Ben Roberts <ben>
Status: OPEN    QA Contact: ---
Severity: 4 - Minor Issue    
Priority: ---    
Version: 24.05.4   
Hardware: Linux   
OS: Linux   
Site: MDC Berlin Max Delbrück Center for Molecular Medicine
Attachments: Slurm configuration file
slurmctl.log and slurmd.log (for maxg20)
slurmctl.log
Slurm configuration file

Description Motaz Diab 2025-01-21 18:42:16 MST
Dear,

We have defined this [gpu] partition with a TimeLimit of 2 weeks as follows:

PartitionName=gpu Nodes=max00[2-6],maxg[05,10,18,20,22,24-26] MaxTime=14-0 PriorityTier=5

All of our GPU nodes are in it.

Because some users submit large GPU array jobs that request all of the GPU resources in Slurm, other short GPU jobs are blocked and wait for days until some GPUs become free again.

My question is: is there any option to reserve two GPUs on every GPU node (each node has 8 GPUs in total) with an overriding TimeLimit of, for example, 30 minutes?

Thanks
Best regards,
Motaz
Comment 1 Ben Roberts 2025-01-22 10:53:25 MST
Hi Motaz,

I want to make sure I understand the behavior you want.  It sounds like you want to keep 2 GPUs open on each node for short jobs (30 minutes or less).  This way you don't have to place an artificial limit on the number of jobs that can be run by other users, potentially leaving resources idle.  Is that right?

If so, this sounds like a good case for a floating reservation.  This allows you to create a reservation that always starts X minutes in the future, in your case 30 minutes.  This will prevent jobs that request more than 30 minutes of run time from starting on these resources, but jobs that request less time than that can still be scheduled on them.

Here's a quick example of how this might look:

$ scontrol create reservationname=short_gpu nodecnt=10 TRESPerNode=gres/gpu=2 account=sub1 starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu

$ scontrol show reservations 
ReservationName=short_gpu StartTime=2025-01-22T12:19:02 EndTime=2025-01-23T00:19:02 Duration=12:00:00
   Nodes=node[17-18,25-32] NodeCnt=10 CoreCnt=10 Features=(null) PartitionName=debug Flags=TIME_FLOAT
     NodeName=node17 CoreIDs=0
     NodeName=node18 CoreIDs=0
     NodeName=node25 CoreIDs=0
     NodeName=node26 CoreIDs=0
     NodeName=node27 CoreIDs=0
     NodeName=node28 CoreIDs=0
     NodeName=node29 CoreIDs=0
     NodeName=node30 CoreIDs=0
     NodeName=node31 CoreIDs=0
     NodeName=node32 CoreIDs=0
   TRES=cpu=10,gres/gpu=20
   Users=(null) Groups=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)



You can find more information on the TIME_FLOAT flag here:
https://slurm.schedmd.com/reservations.html#float
https://slurm.schedmd.com/scontrol.html#OPT_TIME_FLOAT


Let me know if you have any questions about this.

Thanks,
Ben
Comment 2 Ben Roberts 2025-02-21 13:19:34 MST
Hi Motaz,

I wanted to see if the information I sent helped.  Let me know if you have any more questions about this or if this ticket is ok to close.

Thanks,
Ben
Comment 3 Motaz Diab 2025-02-22 10:50:45 MST
Dear,

I tried it but unfortunately it did not work as expected.

I created a floating reservation on one node [maxg05], which has 4 GPUs, and reserved 2 of them. I granted only my user [mdiab] access to it.

[root@max-mastr1.mdc-berlin.net:~] (1090) $ scontrol show reservation short_gpu
ReservationName=short_gpu StartTime=2025-02-22T18:36:56 EndTime=2025-02-22T22:36:56 Duration=04:00:00
   Nodes=maxg05 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg05 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=mdiab Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@max-mastr1.mdc-berlin.net:~] (1091) $

I submitted several GPU jobs (one requested 10 min and others requested 5 hours):

[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 0:10:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530107
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530108
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530109
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530110
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530111
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530112
[mdiab@max-login5.mdc-berlin.net:~] $ 

I found that 3 long jobs (5 hours) started on maxg05, so the reservation did not prevent them from using the 2 reserved GPUs (out of 4).

[mdiab@max-login5.mdc-berlin.net:~] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 530112      mdiab       gpu     1     1 PD 2025-02-22T18:43:36                 N/A     0:00    5:00:00 (Resources)
 530111      mdiab       gpu     1     1 PD 2025-02-22T18:43:35                 N/A     0:00    5:00:00 (Resources)
 530108      mdiab       gpu     1     1  R 2025-02-22T18:43:31 2025-02-22T18:43:41     0:02    5:00:00 maxg05
 530109      mdiab       gpu     1     1  R 2025-02-22T18:43:34 2025-02-22T18:43:41     0:02    5:00:00 maxg05
 530110      mdiab       gpu     1     1  R 2025-02-22T18:43:34 2025-02-22T18:43:41     0:02    5:00:00 maxg05
 530107      mdiab       gpu     1     1  R 2025-02-22T18:43:24 2025-02-22T18:43:24     0:19      10:00 maxg05
[mdiab@max-login5.mdc-berlin.net:~] $

So did I do something wrong, or is the concept still unclear to me?

One more question:
How do I create a floating reservation for all users? I see that it's required to specify a user/group or account, so should I list all user accounts in the creation command?

Best regards,
Motaz
Comment 4 Ben Roberts 2025-02-26 10:57:34 MST
Hi Motaz,

I'm sorry, my example was a little unclear.  In your example you're creating a reservation that allows your user to run jobs in the reservation.  The idea behind the floating reservation is that you want to give it an ACL that prevents the majority of the users from being able to access it.  In my example I used an account called 'sub1', but I didn't make it clear that in order for this to work I would need to submit jobs to an account other than 'sub1'.  

You can also modify this to allow the primary users to have access to this reservation.  To continue with the example I sent before, if the principal investigators that I wanted to make sure didn't have to wait too long for the GPUs were users in the 'sub1' account, then I would create the reservation the way I did.  Users who wanted to use the GPUs for up to 30 minutes at a time would be in other accounts and would be able to start short jobs on the GPUs.  If a user in the 'sub1' account came along, their job would qualify for the reservation, so it could be longer than the 30 minute limit imposed on the other jobs, and it wouldn't have to wait longer than 30 minutes to be able to start.  You would also want to add the 'flex' flag to the reservation when you're creating it so that these jobs would be able to qualify for the reservation even when the start time is in the future.
https://slurm.schedmd.com/scontrol.html#OPT_FLEX
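
As a rough sketch of what that might look like (adapting my earlier example; the 'sub1' account, node count, and duration are placeholders to adjust for your site):

$ scontrol create reservationname=short_gpu nodecnt=10 TRESPerNode=gres/gpu=2 account=sub1 starttime=now+30minutes duration=12:00:00 flags=time_float,flex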

I hope this makes more sense.  Let me know if you have any additional questions about this.

Thanks,
Ben
Comment 5 Motaz Diab 2025-02-26 17:11:53 MST
Hello again,

I re-created the same floating reservation, but changed the user specified in the command to one other than my own user [in this example: amardt], as follows:

[root@max-mastr1.mdc-berlin.net:~] (1004) $ scontrol create reservationname=short_gpu Nodes=maxg05 user=amardt TRESPerNode=gres/gpu=2 starttime=now duration=4:00:00 flags=time_float
Reservation created: short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1006) $ scontrol show res short_gpu
ReservationName=short_gpu StartTime=2025-02-27T00:37:28 EndTime=2025-02-27T04:37:28 Duration=04:00:00
   Nodes=maxg05 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg05 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=amardt Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@max-mastr1.mdc-berlin.net:~] (1007) 

I re-submitted a group of short [30 minutes] and long [5 hours] jobs with my user [mdiab] as follows:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534072
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534073
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534074
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534075
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534076
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch short-gpu.sh
Submitted batch job 534077
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch short-gpu.sh
Submitted batch job 534078
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

Again I saw my long jobs land on all 4 GPUs (2 normal and 2 reserved by short_gpu) on maxg05, which is the normal behavior and not our desired one, namely to prevent long jobs from being scheduled onto the 2 GPUs reserved for short jobs.

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 534078      mdiab       gpu     1     1 PD 2025-02-27T00:35:36                 N/A     0:00      10:00 (Priority)
 534077      mdiab       gpu     1     1 PD 2025-02-27T00:35:31                 N/A     0:00      10:00 (Priority)
 534076      mdiab       gpu     1     1 PD 2025-02-27T00:35:27                 N/A     0:00    5:00:00 (Priority)
 534072      mdiab       gpu     1     1  R 2025-02-27T00:35:21 2025-02-27T00:35:51     0:01    5:00:00 maxg05
 534073      mdiab       gpu     1     1  R 2025-02-27T00:35:23 2025-02-27T00:35:51     0:01    5:00:00 maxg05
 534074      mdiab       gpu     1     1  R 2025-02-27T00:35:24 2025-02-27T00:35:51     0:01    5:00:00 maxg05
 534075      mdiab       gpu     1     1  R 2025-02-27T00:35:24 2025-02-27T00:35:51     0:01    5:00:00 maxg05
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

Let me please re-explain our desired scenario:

We have defined a gpu partition with a TimeLimit of 2 weeks, which means any user can use all available GPUs for 2 weeks and block others.

What we would like is to override the gpu partition TimeLimit per GPU resource (for example, on 2 GPUs per node) to 30 minutes.
 
Is it possible to override the partition TimeLimit per resource, or in other words, to have two TimeLimits defined in the same partition?

This change should apply to all users (not to a specific user or group like sub1), so any user's job from any group or account would be checked as follows:

If the user job time < 30 minutes, then
   Go to CUDA IDs 0-1 on any gpu node in the gpu partition
else # Here the time is up to 2 weeks (The partition TimeLimit)
   Go to CUDA IDs >= 2 on any gpu node in the same gpu partition
end

Thanks a lot
Best regards,
Motaz
Comment 6 Ben Roberts 2025-02-28 11:40:42 MST
Hi Motaz,

I see in the command you used to create the reservation that you have 'starttime=now'.  This makes the reservation start immediately and doesn't leave the 30 minute window that is always being pushed out before the reservation "starts".  If you change that to 'starttime=now+30minutes' then I would expect it to behave properly.  
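
Applied to the command from your last comment, that would look roughly like this (a sketch; keep whatever node, user, and duration you actually want):

$ scontrol create reservationname=short_gpu Nodes=maxg05 user=amardt TRESPerNode=gres/gpu=2 starttime=now+30minutes duration=4:00:00 flags=time_float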

Here is an example where I create a test reservation with a start time that is 30 minutes in the future.  I also create this reservation for the root user, to prevent any user on the system from being able to access this reservation.  I have a 'gpu' partition that has 2 nodes in it, so I use this for simplicity in showing the behavior in the testing that follows.

$ scontrol create reservationname=short_gpu nodecnt=2 partition=gpu TRESPerNode=gres/gpu=2 user=root starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu

$ scontrol show reservations 
ReservationName=short_gpu StartTime=2025-02-28T12:42:34 EndTime=2025-03-01T00:42:34 Duration=12:00:00
   Nodes=node[07-08] NodeCnt=2 CoreCnt=2 Features=(null) PartitionName=gpu Flags=TIME_FLOAT
     NodeName=node07 CoreIDs=0
     NodeName=node08 CoreIDs=0
   TRES=cpu=2,gres/gpu=4
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)




With that reservation in place, I submit 3 test jobs to the gpu partition.  Each job requests 2 GPUs.  The first two jobs are able to start, one on each node.  The third job is blocked because the other GPUs on these nodes are reserved and the 1 hour wall time is too long to allow it to run in the open window of time.

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9195

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9196

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9197

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9197       gpu     wrap      ben PD       0:00      1 (Resources)
              9196       gpu     wrap      ben  R       0:05      1 node08
              9195       gpu     wrap      ben  R       0:08      1 node07





Then I submit a job that only requests 20 minutes of wall time so that it will fit in the 30 minute window before the reservation starts.

$ sbatch -n1 -t20:00 --gres=gres/gpu=1 -pgpu --wrap='srun sleep 1200'
Submitted batch job 9198

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9197       gpu     wrap      ben PD       0:00      1 (Resources)
              9198       gpu     wrap      ben  R       0:01      1 node07
              9196       gpu     wrap      ben  R       0:16      1 node08
              9195       gpu     wrap      ben  R       0:19      1 node07





To further clarify the behavior, if I look at the start time of the reservation again, you can see that it has moved to be 30 minutes in the future still. 

$ date; scontrol show reservations | grep StartTime
Fri Feb 28 12:29:56 PM CST 2025
ReservationName=short_gpu StartTime=2025-02-28T12:59:56 EndTime=2025-03-01T00:59:56 Duration=12:00:00



To summarize, if you change the start time of the reservation to be some number of minutes in the future and include the 'time_float' flag, then the reservation should always remain that number of minutes in the future.  This will allow jobs that request fewer than that many minutes of wall time to run on the reserved resources.  This effectively gives you two time limits on the partition: the one defined on the partition itself, and the reservation, which blocks jobs longer than the window of time before it starts but still allows short jobs to run.

I'm reading your last update again and I notice that you are also pointing out that your user is able to run in the reservation that should only allow a different user.  This is strange.  Can I have you send a copy of your slurm.conf to see if there is something in there that isn't configured properly to enforce this?

Thanks,
Ben
Comment 7 Motaz Diab 2025-03-13 13:33:47 MDT
Hello,

Thank you for your last explanation; the concept is now very clear to me, but it is still not working as expected.

I suspect it might be a bug in our current version [24.05.5].

I recreated the reservation to reserve 2 GPUs in the gpu partition on two nodes, maxg2[4-5] (each has 8 GPUs in total), as follows:

[root@max-mastr1.mdc-berlin.net:~] (1022) $ scontrol create reservationname=short_gpu  Nodes=maxg2[4-5] user=root partition=gpu TRESPerNode=gres/gpu=2 starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1023) $ scontrol show res short_gpu
ReservationName=short_gpu StartTime=2025-03-13T20:45:16 EndTime=2025-03-14T08:45:16 Duration=12:00:00
   Nodes=maxg[24-25] NodeCnt=2 CoreCnt=2 Features=(null) PartitionName=gpu Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg24 CoreIDs=0
     NodeName=maxg25 CoreIDs=0
   TRES=cpu=2,gres/gpu=4
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@max-mastr1.mdc-berlin.net:~] (1024) $

I submitted 3 short jobs (10 minutes) and 2 long jobs (5 hours), each requesting 2 GPUs:
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 short-gpu.sh 
Submitted batch job 561660
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 short-gpu.sh 
Submitted batch job 561661
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg24 short-gpu.sh 
Submitted batch job 561662
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 long-gpu.sh 
Submitted batch job 561663
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg24 long-gpu.sh 
Submitted batch job 561664
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561664      mdiab       gpu     1     1 PD 2025-03-13T20:19:29                 N/A     0:00    5:00:00 (Priority)
 561663      mdiab       gpu     1     1 PD 2025-03-13T20:19:25                 N/A     0:00    5:00:00 (Priority)
 561662      mdiab       gpu     1     1 PD 2025-03-13T20:14:48                 N/A     0:00      10:00 (Priority)
 561661      mdiab       gpu     1     1 PD 2025-03-13T20:14:45                 N/A     0:00      10:00 (Priority)
 561660      mdiab       gpu     1     1 PD 2025-03-13T20:14:44                 N/A     0:00      10:00 (Priority)
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

But none of them started. If I submit a job to maxg2[4-5] without requesting GPUs, it starts shortly afterwards regardless of the requested time.

[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -p gpu -w maxg24 -t 1-0 --wrap "sleep 50"
Submitted batch job 561665
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue | grep 561665
 561665      mdiab       gpu     1     1  R 2025-03-13T20:25:16 2025-03-13T20:25:34     0:08 1-00:00:00 maxg24
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ 

Of course if I delete the reservation:

[root@max-mastr1.mdc-berlin.net:~] (1025) $ scontrol delete reservation short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1026) $

=> then all of my pending jobs started immediately too.

[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561660      mdiab       gpu     1     1  R 2025-03-13T20:14:44 2025-03-13T20:27:04     0:07      10:00 maxg25
 561661      mdiab       gpu     1     1  R 2025-03-13T20:14:45 2025-03-13T20:27:04     0:07      10:00 maxg25
 561662      mdiab       gpu     1     1  R 2025-03-13T20:14:48 2025-03-13T20:27:04     0:07      10:00 maxg24
 561663      mdiab       gpu     1     1  R 2025-03-13T20:19:25 2025-03-13T20:27:04     0:07    5:00:00 maxg25
 561664      mdiab       gpu     1     1  R 2025-03-13T20:19:29 2025-03-13T20:27:04     0:07    5:00:00 maxg24
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

I attached our slurm.conf here. 

Thanks
Motaz
Comment 8 Motaz Diab 2025-03-13 13:34:41 MDT
Created attachment 41128 [details]
Slurm configuration file
Comment 9 Ben Roberts 2025-03-13 15:09:26 MDT
I'm glad to hear that it makes sense now, but it's strange that it's not doing what it should.  One thing I notice that could be affecting how these jobs are scheduled is one of your scheduler parameters.  You have 'bf_resolution=600', which means that the chunks of time that get evaluated are in 10 minute increments.  The default for this is 60 second increments.  With a 30 minute window of time before the reservation, it can be hard for the backfill scheduler to resolve that window properly.  I would recommend setting this back down to the default of 60 seconds and seeing if the behavior changes.
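
For reference, a minimal sketch of what that change might look like in slurm.conf (any other SchedulerParameters entries you have would stay on the same comma-separated line), followed by a reconfigure to apply it:

SchedulerParameters=bf_resolution=60
$ scontrol reconfigure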

If you still see the same problem after changing the bf_resolution, then I would like to see some debug logs for the issue.  You can enable debug logging with the following commands:
 scontrol setdebug debug2
 scontrol setdebugflags +backfill


I would like to have you run the same test and send the logs that cover the time of that test.  Please send all the logs for that time rather than grepping out a particular job id.  There are often relevant log entries that don't include the job id as part of the line.  Then you can set the log level back down to normal levels with the following commands:
 scontrol setdebug info
 scontrol setdebugflags -backfill


Thanks,
Ben
Comment 10 Motaz Diab 2025-03-14 08:47:33 MDT
Hello again,

I set bf_resolution back to 60 (default) but it did not help. 

I redid the same example on maxg20 only (it has 8 GPUs in total).

ReservationName=short_gpu StartTime=2025-03-14T16:09:41 EndTime=2025-03-15T04:09:41 Duration=12:00:00
   Nodes=maxg20 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=gpu Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg20 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)


I submitted 4 long jobs and 4 short jobs (each requesting 2 GPUs).
All the short jobs started, but the long jobs did not:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561858      mdiab       gpu     1     1 PD 2025-03-14T15:28:01                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561857      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561856      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561855      mdiab       gpu     1     1 PD 2025-03-14T15:27:59                 N/A     0:00    5:00:00 (Resources)
 561851      mdiab       gpu     1     1  R 2025-03-14T15:27:54 2025-03-14T15:28:00     0:28      10:00 maxg20
 561852      mdiab       gpu     1     1  R 2025-03-14T15:27:54 2025-03-14T15:28:00     0:28      10:00 maxg20
 561853      mdiab       gpu     1     1  R 2025-03-14T15:27:55 2025-03-14T15:28:00     0:28      10:00 maxg20
 561854      mdiab       gpu     1     1  R 2025-03-14T15:27:55 2025-03-14T15:28:00     0:28      10:00 maxg20
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

After all 4 short jobs completed, the long jobs were still pending:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561858      mdiab       gpu     1     1 PD 2025-03-14T15:28:01                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561857      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561856      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561855      mdiab       gpu     1     1 PD 2025-03-14T15:27:59                 N/A     0:00    5:00:00 (Resources)
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

I then submitted another 3 short jobs, but they also did not start and remained pending:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561862      mdiab       gpu     1     1 PD 2025-03-14T15:36:23                 N/A     0:00      10:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561861      mdiab       gpu     1     1 PD 2025-03-14T15:36:22                 N/A     0:00      10:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561860      mdiab       gpu     1     1 PD 2025-03-14T15:36:04                 N/A     0:00      10:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561858      mdiab       gpu     1     1 PD 2025-03-14T15:28:01                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561857      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561856      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561855      mdiab       gpu     1     1 PD 2025-03-14T15:27:59                 N/A     0:00    5:00:00 (Resources)
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ 

I enabled debug logging as you asked and uploaded the slurmctl.log and slurmd.log (for maxg20).
Thanks and best regards,
Motaz
Comment 11 Motaz Diab 2025-03-14 08:47:56 MDT
Created attachment 41141 [details]
slurmctl.log and slurmd.log (for maxg20)
Comment 12 Ben Roberts 2025-03-21 10:28:36 MDT
Hi Motaz,

My apologies that it took me a while to get back to you on this.  It looks like there is something else happening that is preventing these long GPU jobs from starting on that node.  Unfortunately it's not clear from the logs what is keeping the job from starting.  I was looking at job 561858 as an example.  You can see in the logs that the scheduler recognizes that it could use node maxg20, but the job doesn't start on it.

[2025-03-14T15:28:01.358] debug2: found 2 usable nodes from config containing maxg[10,20]
...
[2025-03-14T15:28:01.358] debug2: NodeSet[3] Nodes:maxg[10,20] NodeWeight:10 Flags:0 FeatureBits:0 SchedWeight:2815



That is what the main scheduler shows.  The job is then evaluated by the backfill scheduler, which shows that it tries to schedule the job but doesn't provide details of why it isn't able to start it.

[2025-03-14T15:28:21.853] sched/backfill: _attempt_backfill: BACKFILL: test for JobId=561858 Prio=1000 Partition=gpu Reservation=NONE
[2025-03-14T15:28:21.853] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=561858.
[2025-03-14T15:28:21.853] debug2: sched/backfill: _try_sched: exclude core bitmap: 3644
[2025-03-14T15:28:21.853] debug2: select/cons_tres: select_p_job_test: evaluating JobId=561858
[2025-03-14T15:28:21.853] debug2: select/cons_tres: select_p_job_test: evaluating JobId=561858


Since it's the long jobs that aren't able to start on the node, they shouldn't be directly affected by the floating reservation you created.  Something else is happening that is keeping these jobs from starting.  It's possible that there is a large job that is reserving resources and the longer jobs are long enough that they would interfere with its start time.  I hate to ask you to run this test one more time, but I'd like to have some additional logging collected.  Could you run the following commands to enable two more debug flags:
 scontrol setdebug debug2
 scontrol setdebugflags +backfill,backfillmap,selecttype


Then if you would run a similar test and send the logs from that time period.  Then you can turn the log level back down with this:
 scontrol setdebug info
 scontrol setdebugflags -backfill,backfillmap,selecttype


I would also like to see the output of 'scontrol show job <jobid>' for one of the long jobs that isn't able to run as well as one of the short ones that does run.  Along with that I would like to see the output of 'scontrol show node maxg20' while a short job is running on it.  

Thanks,
Ben
Comment 13 Motaz Diab 2025-03-31 06:54:37 MDT
Hello again,

I installed the latest version of Slurm (24.11.3) on a test cluster which has a single partition and 8 GPU nodes (8 GPU devices on each).

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1025) $ scontrol -V
slurm 24.11.3
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1026) $


I created the reservation for 2 GPUs on a single node (hai002):

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1024) $ scontrol show reservations
ReservationName=short_gpu StartTime=2025-03-31T15:11:18 EndTime=2025-04-01T03:11:18 Duration=12:00:00
   Nodes=hai002 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=standard Flags=TIME_FLOAT
     NodeName=hai002 CoreIDs=0,56
   TRES=cpu=2,gres/gpu=2
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1025) $
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1026) $ sinfo -Nel
Mon Mar 31 14:41:59 2025
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
hai001         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai002         1 standard*    reserved 224    2:56:2 975000        0      1 sapphire none                
hai003         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai004         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai005         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai006         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai007         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai008         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1027) $ 
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1030) $ scontrol show node hai002
NodeName=hai002 Arch=x86_64 CoresPerSocket=56 
   CPUAlloc=0 CPUEfctv=220 CPUTot=224 CPULoad=0.01
   AvailableFeatures=sapphire-rapids
   ActiveFeatures=sapphire-rapids
   Gres=gpu:H100-SXM5:8,gpu_memory:no_consume:80G,gpu_compute_cap:no_consume:90
   NodeAddr=hai002 NodeHostName=hai002 Version=24.11.3
   OS=Linux 5.14.0-427.13.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024 
   RealMemory=975000 AllocMem=0 FreeMem=980413 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=110-111,222-223 
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=standard 
   BootTime=2025-03-24T22:47:17 SlurmdStartTime=2025-03-30T16:48:36
   LastBusyTime=2025-03-31T13:45:18 ResumeAfterTime=None
   CfgTRES=cpu=220,mem=975000M,billing=220,gres/gpu=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   
   ReservationName=short_gpu

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1031) $

I submitted two 1-GPU jobs (long: 1 hour, short: 10 minutes):
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ sbatch -N1 -n1 -c1 -t 0-1 -w hai002 --gres=gpu:1 --wrap "sleep 1000"
Submitted batch job 302
[mdiab-srvadm@hai-login1.haicore.berlin:~] $
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ sbatch -N1 -n1 -c1 -t 10 -w hai002 --gres=gpu:1 --wrap "sleep 1000"
Submitted batch job 303
[mdiab-srvadm@hai-login1.haicore.berlin:~] $

But this time none of them started.
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ squeue 
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
    303 mdiab-srva  standard     1     1 PD 2025-03-31T14:26:16                 N/A     0:00      10:00 (ReqNodeNotAvail, May be reserved for other job)
    302 mdiab-srva  standard     1     1 PD 2025-03-31T14:25:51                 N/A     0:00    1:00:00 (Resources)
[mdiab-srvadm@hai-login1.haicore.berlin:~] $
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ scontrol show job 302
JobId=302 JobName=wrap
   UserId=mdiab-srvadm(961900513) GroupId=mdiab-srvadm(961900513) MCS_label=N/A
   Priority=1 Nice=0 Account=it QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-03-31T14:25:51 EligibleTime=2025-03-31T14:25:51
   AccrueTime=2025-03-31T14:25:51
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-31T14:50:34 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=hai-login1:2540965
   ReqNodeList=hai002 ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=4G,node=1,billing=1,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/fast/home/mdiab-srvadm
   StdErr=/fast/home/mdiab-srvadm/slurm-302.out
   StdIn=/dev/null
   StdOut=/fast/home/mdiab-srvadm/slurm-302.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=1
   

[mdiab-srvadm@hai-login1.haicore.berlin:~] $ scontrol show job 303
JobId=303 JobName=wrap
   UserId=mdiab-srvadm(961900513) GroupId=mdiab-srvadm(961900513) MCS_label=N/A
   Priority=1 Nice=0 Account=it QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_May_be_reserved_for_other_job Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2025-03-31T14:26:16 EligibleTime=2025-03-31T14:26:16
   AccrueTime=2025-03-31T14:26:16
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-31T14:50:34 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=hai-login1:2540965
   ReqNodeList=hai002 ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=4G,node=1,billing=1,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/fast/home/mdiab-srvadm
   StdErr=/fast/home/mdiab-srvadm/slurm-303.out
   StdIn=/dev/null
   StdOut=/fast/home/mdiab-srvadm/slurm-303.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=1
   

[mdiab-srvadm@hai-login1.haicore.berlin:~] $

I have uploaded the slurm.conf and the slurmctld.log (with debug enabled) here.

Note: I tested the same example reserving some CPUs (instead of GPUs) and it works, so it seems that only the GPU reservation is broken.
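
For reference, a rough sketch of the kind of CPU-only floating reservation used for that comparison (the core count and node here are placeholders, not the literal command that was run):

$ scontrol create reservationname=short_cpu Nodes=hai002 user=root partition=standard CoreCnt=2 starttime=now+30minutes duration=12:00:00 flags=time_float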

Thanks,
Motaz
Comment 14 Motaz Diab 2025-03-31 06:55:18 MDT
Created attachment 41319 [details]
slurmctl.log
Comment 15 Motaz Diab 2025-03-31 06:56:02 MDT
Created attachment 41320 [details]
Slurm configuration file
Comment 16 Ben Roberts 2025-04-04 11:21:32 MDT
Thank you for reproducing this on your test cluster and collecting these logs.  I can see that there is something causing the resources to be excluded for this node.  Here is an excerpt showing the scheduler evaluate hai001; it sees the GPUs and cores correctly for that node, but the job doesn't start on that node because it doesn't request it.

[2025-03-31T14:25:52.000] select/cons_tres: _can_job_run_on_node: SELECT_TYPE: 220 CPUs on hai001(state:0), mem 0/975000
[2025-03-31T14:25:52.000] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hai001 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:1-220,220 ThreadsPerCore:2
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE:   AnySocket gpu:8
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE:   Socket[0] Cores:55
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE:   Socket[1] Cores:55



But then when it comes to hai002 it says that the resources are excluded for the node.

[2025-03-31T14:26:31.000] select/cons_tres: _can_use_gres_exc_topo: SELECT_TYPE: can't include!, it is excluded 1 0
[2025-03-31T14:26:31.000] select/cons_tres: _can_job_run_on_node: SELECT_TYPE: Test fail on node hai002: gres_sock_list_create


Can I have you send a copy of your gres.conf for your test system?  I would like to see if there is something there that is keeping it from recognizing that it can use these GPUs/CPUs.  If you have any other .conf files in the same directory as your slurm.conf, it would be good to see all of them.

Thanks,
Ben
Comment 17 Motaz Diab 2025-04-05 04:45:48 MDT
Hello,

Here are the three config files in use: gres.conf, job_container.conf, and cgroup.conf:

[mdiab@cl-hpc02 12:40:25 config]$ cat gres.conf |grep -v "#"
AutoDetect=off
NodeName=hai00[1-8] Name=gpu Type=H100-SXM5 File=/dev/nvidia[0-7] Flags=nvidia_gpu_env
NodeName=hai00[1-8] Name=gpu_memory Count=80G Flags=CountOnly
NodeName=hai00[1-8] Name=gpu_compute_cap Count=90 Flags=CountOnly

[mdiab@cl-hpc02 12:40:35 config]$ cat job_container.conf |grep -v "#"
AutoBasePath=true
BasePath=/tmp/slurm
Dirs=/var/tmp
Shared=true

[mdiab@cl-hpc02 12:40:48 config]$ cat cgroup.conf |grep -v "#"
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

I also have a job_submit.lua which silently routes interactive jobs to the interactive partition.
[mdiab@cl-hpc02 12:41:11 config]$ cat job_submit.lua 

function slurm_job_submit(job_desc, part_list, submit_uid)

   -- Sending all interactive jobs to interactive partition.
   if not job_desc.script then
           job_desc.partition = 'interactive'
   end  

   -- Allow the job to proceed
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

Thanks
Motaz
Comment 18 Ben Roberts 2025-04-09 14:38:51 MDT
Hi Motaz,

I discussed this ticket with a colleague who thinks that this might be related to an issue he has seen involving reservations and typed GPUs.  You have reproduced this on your production system and a test system, and it looks like both of them have the GPUs defined with a type ('tesla-v100-SXM2' in production and 'H100-SXM5' on the test cluster).  Can you try removing that type from the GPU definition on your test cluster to confirm whether this is the same issue?  I'm not suggesting you remove it as the long term fix, but it will allow us to confirm the issue.
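
As a sketch of what I mean for the test cluster (assuming the matching Gres= entries for these nodes in slurm.conf are adjusted to drop the type as well, e.g. Gres=gpu:8 instead of Gres=gpu:H100-SXM5:8), the gres.conf line would become something like:

NodeName=hai00[1-8] Name=gpu File=/dev/nvidia[0-7] Flags=nvidia_gpu_env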

Thanks,
Ben