Ticket 21862

Summary: Is it possible to define a TimeLimit per Gres/GPU?
Product: Slurm Reporter: Motaz Diab <Motaz.Diab>
Component: GPU    Assignee: Ben Roberts <ben>
Status: OPEN    QA Contact: ---
Severity: 4 - Minor Issue    
Priority: ---    
Version: 24.05.4   
Hardware: Linux   
OS: Linux   
Site: MDC Berlin Max Delbrück Center for Molecular Medicine
Attachments: Slurm configuration file
slurmctl.log and slurmd.log (for maxg20)
slurmctl.log
Slurm configuration file

Description Motaz Diab 2025-01-21 18:42:16 MST
Dear,

We have defined this [gpu] partition with a TimeLimit of 2 weeks as follows:

PartitionName=gpu Nodes=max00[2-6],maxg[05,10,18,20,22,24-26] MaxTime=14-0 PriorityTier=5

All of our GPU nodes are in it.

Because some users submit large GPU array jobs that request all of the GPU resources in Slurm, other short GPU jobs are blocked and wait for days until some GPUs become free again.

My question is: is there any option to reserve two GPUs on every GPU node (each node has 8 GPUs in total) with an overriding TimeLimit of, for example, 30 minutes?

Thanks
Best regards,
Motaz
Comment 1 Ben Roberts 2025-01-22 10:53:25 MST
Hi Motaz,

I want to make sure I understand the behavior you want.  It sounds like you want to keep 2 GPUs open on each node for short jobs (30 minutes or less).  This way you don't have to place an artificial limit on the number of jobs that can be run by other users, potentially leaving resources idle.  Is that right?

If so, this sounds like a good case for a floating reservation.  This allows you to create a reservation that always starts X minutes in the future, in your case 30 minutes.  This will prevent jobs that request more than 30 minutes of run time from starting on these resources, but jobs that request less time than that can still be scheduled on them.

Here's a quick example of how this might look:

$ scontrol create reservationname=short_gpu nodecnt=10 TRESPerNode=gres/gpu=2 account=sub1 starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu

$ scontrol show reservations 
ReservationName=short_gpu StartTime=2025-01-22T12:19:02 EndTime=2025-01-23T00:19:02 Duration=12:00:00
   Nodes=node[17-18,25-32] NodeCnt=10 CoreCnt=10 Features=(null) PartitionName=debug Flags=TIME_FLOAT
     NodeName=node17 CoreIDs=0
     NodeName=node18 CoreIDs=0
     NodeName=node25 CoreIDs=0
     NodeName=node26 CoreIDs=0
     NodeName=node27 CoreIDs=0
     NodeName=node28 CoreIDs=0
     NodeName=node29 CoreIDs=0
     NodeName=node30 CoreIDs=0
     NodeName=node31 CoreIDs=0
     NodeName=node32 CoreIDs=0
   TRES=cpu=10,gres/gpu=20
   Users=(null) Groups=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)



You can find more information on the TIME_FLOAT flag here:
https://slurm.schedmd.com/reservations.html#float
https://slurm.schedmd.com/scontrol.html#OPT_TIME_FLOAT


Let me know if you have any questions about this.

Thanks,
Ben
Comment 2 Ben Roberts 2025-02-21 13:19:34 MST
Hi Motaz,

I wanted to see if the information I sent helped.  Let me know if you have any more questions about this or if this ticket is ok to close.

Thanks,
Ben
Comment 3 Motaz Diab 2025-02-22 10:50:45 MST
Dear,

I tried it but unfortunately it did not work as expected.

I created a floating reservation on one node [maxg05], which has 4 GPUs, and reserved 2 of them. I granted only my user [mdiab] access to it.

[root@max-mastr1.mdc-berlin.net:~] (1090) $ scontrol show reservation short_gpu
ReservationName=short_gpu StartTime=2025-02-22T18:36:56 EndTime=2025-02-22T22:36:56 Duration=04:00:00
   Nodes=maxg05 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg05 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=mdiab Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@max-mastr1.mdc-berlin.net:~] (1091) $

I submitted several GPU jobs (one requested 10 min and others requested 5 hours):

[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 0:10:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530107
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530108
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530109
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530110
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530111
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530112
[mdiab@max-login5.mdc-berlin.net:~] $ 

I found that 3 long jobs (5 hours) started on maxg05, so the reservation did not prevent them from using the 2 reserved GPUs (out of 4).

[mdiab@max-login5.mdc-berlin.net:~] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 530112      mdiab       gpu     1     1 PD 2025-02-22T18:43:36                 N/A     0:00    5:00:00 (Resources)
 530111      mdiab       gpu     1     1 PD 2025-02-22T18:43:35                 N/A     0:00    5:00:00 (Resources)
 530108      mdiab       gpu     1     1  R 2025-02-22T18:43:31 2025-02-22T18:43:41     0:02    5:00:00 maxg05
 530109      mdiab       gpu     1     1  R 2025-02-22T18:43:34 2025-02-22T18:43:41     0:02    5:00:00 maxg05
 530110      mdiab       gpu     1     1  R 2025-02-22T18:43:34 2025-02-22T18:43:41     0:02    5:00:00 maxg05
 530107      mdiab       gpu     1     1  R 2025-02-22T18:43:24 2025-02-22T18:43:24     0:19      10:00 maxg05
[mdiab@max-login5.mdc-berlin.net:~] $

So did I do something wrong, or is the concept still unclear to me?

One more question:
How do I create a floating reservation for all users? I see that it's required to specify a user/group or account, so should I list all user accounts in the creation command?

Best regards,
Motaz
Comment 4 Ben Roberts 2025-02-26 10:57:34 MST
Hi Motaz,

I'm sorry, my example was a little unclear.  In your example you're creating a reservation that allows your user to run jobs in the reservation.  The idea behind the floating reservation is that you want to give it an ACL that prevents the majority of the users from being able to access it.  In my example I used an account called 'sub1', but I didn't make it clear that in order for this to work I would need to submit jobs to an account other than 'sub1'.  

You can also modify this to allow the primary users to have access to this reservation.  To continue with the example I sent before, if the principal investigators that I wanted to make sure didn't have to wait too long for the GPUs were users in the 'sub1' account, then I would create the reservation the way I did.  Users who wanted to use the GPUs for up to 30 minutes at a time would be in other accounts and would be able to start short jobs on the GPUs.  If a user in the 'sub1' account came along, their job would qualify for the reservation, so it could be longer than the 30 minute limit imposed on the other jobs, and it wouldn't have to wait longer than 30 minutes to be able to start.  You would also want to add the 'flex' flag to the reservation when you're creating it so that these jobs would be able to qualify for the reservation even when the start time is in the future.
https://slurm.schedmd.com/scontrol.html#OPT_FLEX
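
As a rough sketch of what that might look like (adapting my earlier example; the 'sub1' account, node count, and duration are placeholders to adjust for your site):

$ scontrol create reservationname=short_gpu nodecnt=10 TRESPerNode=gres/gpu=2 account=sub1 starttime=now+30minutes duration=12:00:00 flags=time_float,flex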

I hope this makes more sense.  Let me know if you have any additional questions about this.

Thanks,
Ben
Comment 5 Motaz Diab 2025-02-26 17:11:53 MST
Hello again,

I re-created the same floating reservation, but changed the user specified in the command to one other than my own user [in this example: amardt], as follows:

[root@max-mastr1.mdc-berlin.net:~] (1004) $ scontrol create reservationname=short_gpu Nodes=maxg05 user=amardt TRESPerNode=gres/gpu=2 starttime=now duration=4:00:00 flags=time_float
Reservation created: short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1006) $ scontrol show res short_gpu
ReservationName=short_gpu StartTime=2025-02-27T00:37:28 EndTime=2025-02-27T04:37:28 Duration=04:00:00
   Nodes=maxg05 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg05 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=amardt Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@max-mastr1.mdc-berlin.net:~] (1007) 

I re-submitted a group of short [30 minutes] and long [5 hours] jobs with my user [mdiab] as follows:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534072
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534073
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534074
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534075
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534076
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch short-gpu.sh
Submitted batch job 534077
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch short-gpu.sh
Submitted batch job 534078
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

Again I saw my long jobs land on all 4 GPUs (2 normal and 2 reserved by short_gpu) on maxg05, which is the normal behavior and not our desired one, namely to prevent long jobs from being scheduled onto the 2 GPUs reserved for short jobs.

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 534078      mdiab       gpu     1     1 PD 2025-02-27T00:35:36                 N/A     0:00      10:00 (Priority)
 534077      mdiab       gpu     1     1 PD 2025-02-27T00:35:31                 N/A     0:00      10:00 (Priority)
 534076      mdiab       gpu     1     1 PD 2025-02-27T00:35:27                 N/A     0:00    5:00:00 (Priority)
 534072      mdiab       gpu     1     1  R 2025-02-27T00:35:21 2025-02-27T00:35:51     0:01    5:00:00 maxg05
 534073      mdiab       gpu     1     1  R 2025-02-27T00:35:23 2025-02-27T00:35:51     0:01    5:00:00 maxg05
 534074      mdiab       gpu     1     1  R 2025-02-27T00:35:24 2025-02-27T00:35:51     0:01    5:00:00 maxg05
 534075      mdiab       gpu     1     1  R 2025-02-27T00:35:24 2025-02-27T00:35:51     0:01    5:00:00 maxg05
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

Let me please re-explain our desired scenario:

We have defined a gpu partition with a TimeLimit of 2 weeks, which means any user can use all available GPUs for 2 weeks and block others.

What we would like is to override the gpu partition TimeLimit per GPU resource (for example, on 2 GPUs per node) to 30 minutes.
 
Is it possible to override the partition TimeLimit per resource, or in other words, to have two TimeLimits defined in the same partition?

This change should apply to all users (not to a specific user or group like sub1), so any user's job from any group or account would be checked as follows:

If the user job time < 30 minutes, then
   Go to CUDA IDs 0-1 on any gpu node in the gpu partition
else # Here the time is up to 2 weeks (The partition TimeLimit)
   Go to CUDA IDs >= 2 on any gpu node in the same gpu partition
end

Thanks a lot
Best regards,
Motaz
Comment 6 Ben Roberts 2025-02-28 11:40:42 MST
Hi Motaz,

I see in the command you used to create the reservation that you have 'starttime=now'.  This makes the reservation start immediately and doesn't leave the 30 minute window that is always being pushed out before the reservation "starts".  If you change that to 'starttime=now+30minutes' then I would expect it to behave properly.  
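
Applied to the command from your last comment, that would look roughly like this (a sketch; keep whatever node, user, and duration you actually want):

$ scontrol create reservationname=short_gpu Nodes=maxg05 user=amardt TRESPerNode=gres/gpu=2 starttime=now+30minutes duration=4:00:00 flags=time_float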

Here is an example where I create a test reservation with a start time that is 30 minutes in the future.  I also create this reservation for the root user, to prevent any user on the system from being able to access this reservation.  I have a 'gpu' partition that has 2 nodes in it, so I use this for simplicity in showing the behavior in the testing that follows.

$ scontrol create reservationname=short_gpu nodecnt=2 partition=gpu TRESPerNode=gres/gpu=2 user=root starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu

$ scontrol show reservations 
ReservationName=short_gpu StartTime=2025-02-28T12:42:34 EndTime=2025-03-01T00:42:34 Duration=12:00:00
   Nodes=node[07-08] NodeCnt=2 CoreCnt=2 Features=(null) PartitionName=gpu Flags=TIME_FLOAT
     NodeName=node07 CoreIDs=0
     NodeName=node08 CoreIDs=0
   TRES=cpu=2,gres/gpu=4
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)




With that reservation in place, I submit 3 test jobs to the gpu partition.  Each job requests 2 GPUs.  The first two jobs are able to start, one on each node.  The third job is blocked because the other GPUs on these nodes are reserved and the 1 hour wall time is too long to allow it to run in the open window of time.

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9195

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9196

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9197

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9197       gpu     wrap      ben PD       0:00      1 (Resources)
              9196       gpu     wrap      ben  R       0:05      1 node08
              9195       gpu     wrap      ben  R       0:08      1 node07





Then I submit a job that only requests 20 minutes of wall time so that it will fit in the 30 minute window before the reservation starts.

$ sbatch -n1 -t20:00 --gres=gres/gpu=1 -pgpu --wrap='srun sleep 1200'
Submitted batch job 9198

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9197       gpu     wrap      ben PD       0:00      1 (Resources)
              9198       gpu     wrap      ben  R       0:01      1 node07
              9196       gpu     wrap      ben  R       0:16      1 node08
              9195       gpu     wrap      ben  R       0:19      1 node07





To further clarify the behavior, if I look at the start time of the reservation again, you can see that it has moved to be 30 minutes in the future still. 

$ date; scontrol show reservations | grep StartTime
Fri Feb 28 12:29:56 PM CST 2025
ReservationName=short_gpu StartTime=2025-02-28T12:59:56 EndTime=2025-03-01T00:59:56 Duration=12:00:00



To summarize, if you change the start time of the reservation to be some number of minutes in the future and include the 'time_float' flag, then the reservation should always remain that number of minutes in the future.  This will allow jobs that request fewer than that many minutes of wall time to run on the reserved resources.  This effectively gives you two time limits on the partition: the one defined on the partition itself, and the reservation, which blocks jobs longer than the window of time before it starts but still allows short jobs to run.

I'm reading your last update again and I notice that you are also pointing out that your user is able to run in the reservation that should only allow a different user.  This is strange.  Can I have you send a copy of your slurm.conf to see if there is something in there that isn't configured properly to enforce this?

Thanks,
Ben
Comment 7 Motaz Diab 2025-03-13 13:33:47 MDT
Hello,

Thank you for your last explanation; the concept is now very clear to me, but it is still not working as expected.

I suspect it might be a bug in our current version [24.05.5].

I recreated the reservation to reserve 2 GPUs in the gpu partition on two nodes, maxg2[4-5] (each has 8 GPUs in total), as follows:

[root@max-mastr1.mdc-berlin.net:~] (1022) $ scontrol create reservationname=short_gpu  Nodes=maxg2[4-5] user=root partition=gpu TRESPerNode=gres/gpu=2 starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1023) $ scontrol show res short_gpu
ReservationName=short_gpu StartTime=2025-03-13T20:45:16 EndTime=2025-03-14T08:45:16 Duration=12:00:00
   Nodes=maxg[24-25] NodeCnt=2 CoreCnt=2 Features=(null) PartitionName=gpu Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg24 CoreIDs=0
     NodeName=maxg25 CoreIDs=0
   TRES=cpu=2,gres/gpu=4
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@max-mastr1.mdc-berlin.net:~] (1024) $

I submitted 3 short jobs (10 minutes) and 2 long jobs (5 hours), each requesting 2 GPUs:
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 short-gpu.sh 
Submitted batch job 561660
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 short-gpu.sh 
Submitted batch job 561661
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg24 short-gpu.sh 
Submitted batch job 561662
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 long-gpu.sh 
Submitted batch job 561663
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg24 long-gpu.sh 
Submitted batch job 561664
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561664      mdiab       gpu     1     1 PD 2025-03-13T20:19:29                 N/A     0:00    5:00:00 (Priority)
 561663      mdiab       gpu     1     1 PD 2025-03-13T20:19:25                 N/A     0:00    5:00:00 (Priority)
 561662      mdiab       gpu     1     1 PD 2025-03-13T20:14:48                 N/A     0:00      10:00 (Priority)
 561661      mdiab       gpu     1     1 PD 2025-03-13T20:14:45                 N/A     0:00      10:00 (Priority)
 561660      mdiab       gpu     1     1 PD 2025-03-13T20:14:44                 N/A     0:00      10:00 (Priority)
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

But none of them started. If I submit a job to maxg2[4-5] without requesting GPUs, it starts shortly afterwards regardless of the requested time.

[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -p gpu -w maxg24 -t 1-0 --wrap "sleep 50"
Submitted batch job 561665
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue | grep 561665
 561665      mdiab       gpu     1     1  R 2025-03-13T20:25:16 2025-03-13T20:25:34     0:08 1-00:00:00 maxg24
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ 

Of course if I delete the reservation:

[root@max-mastr1.mdc-berlin.net:~] (1025) $ scontrol delete reservation short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1026) $

=> then all of my pending jobs started immediately too.

[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561660      mdiab       gpu     1     1  R 2025-03-13T20:14:44 2025-03-13T20:27:04     0:07      10:00 maxg25
 561661      mdiab       gpu     1     1  R 2025-03-13T20:14:45 2025-03-13T20:27:04     0:07      10:00 maxg25
 561662      mdiab       gpu     1     1  R 2025-03-13T20:14:48 2025-03-13T20:27:04     0:07      10:00 maxg24
 561663      mdiab       gpu     1     1  R 2025-03-13T20:19:25 2025-03-13T20:27:04     0:07    5:00:00 maxg25
 561664      mdiab       gpu     1     1  R 2025-03-13T20:19:29 2025-03-13T20:27:04     0:07    5:00:00 maxg24
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

I attached our slurm.conf here. 

Thanks
Motaz
Comment 8 Motaz Diab 2025-03-13 13:34:41 MDT
Created attachment 41128 [details]
Slurm configuration file
Comment 9 Ben Roberts 2025-03-13 15:09:26 MDT
I'm glad to hear that it makes sense now, but it's strange that it's not doing what it should.  One thing I notice that could be affecting how these jobs are scheduled is one of your scheduler parameters.  You have 'bf_resolution=600', which means that the chunks of time that get evaluated are in 10 minute increments.  The default for this is 60 second increments.  With a 30 minute window of time before the reservation, it can be hard for the backfill scheduler to resolve that window properly.  I would recommend setting this back down to the default of 60 seconds and seeing if the behavior changes.
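
For reference, a minimal sketch of what that change might look like in slurm.conf (any other SchedulerParameters entries you have would stay on the same comma-separated line), followed by a reconfigure to apply it:

SchedulerParameters=bf_resolution=60
$ scontrol reconfigure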

If you still see the same problem after changing the bf_resolution, then I would like to see some debug logs for the issue.  You can enable debug logging with the following commands:
 scontrol setdebug debug2
 scontrol setdebugflags +backfill


I would like to have you run the same test and send the logs that cover the time of that test.  Please send all the logs for that time rather than grepping out a particular job id.  There are often relevant log entries that don't include the job id as part of the line.  Then you can set the log level back down to normal levels with the following commands:
 scontrol setdebug info
 scontrol setdebugflags -backfill


Thanks,
Ben
Comment 10 Motaz Diab 2025-03-14 08:47:33 MDT
Hello again,

I set bf_resolution back to 60 (default) but it did not help. 

I redid the same example on maxg20 only (it has 8 GPUs in total).

ReservationName=short_gpu StartTime=2025-03-14T16:09:41 EndTime=2025-03-15T04:09:41 Duration=12:00:00
   Nodes=maxg20 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=gpu Flags=SPEC_NODES,TIME_FLOAT
     NodeName=maxg20 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)


I submitted 4 long jobs and 4 short jobs (each requesting 2 GPUs).
All the short jobs started, but the long jobs did not:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561858      mdiab       gpu     1     1 PD 2025-03-14T15:28:01                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561857      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561856      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561855      mdiab       gpu     1     1 PD 2025-03-14T15:27:59                 N/A     0:00    5:00:00 (Resources)
 561851      mdiab       gpu     1     1  R 2025-03-14T15:27:54 2025-03-14T15:28:00     0:28      10:00 maxg20
 561852      mdiab       gpu     1     1  R 2025-03-14T15:27:54 2025-03-14T15:28:00     0:28      10:00 maxg20
 561853      mdiab       gpu     1     1  R 2025-03-14T15:27:55 2025-03-14T15:28:00     0:28      10:00 maxg20
 561854      mdiab       gpu     1     1  R 2025-03-14T15:27:55 2025-03-14T15:28:00     0:28      10:00 maxg20
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

After all 4 short jobs completed, the long jobs were still pending:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561858      mdiab       gpu     1     1 PD 2025-03-14T15:28:01                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561857      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561856      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561855      mdiab       gpu     1     1 PD 2025-03-14T15:27:59                 N/A     0:00    5:00:00 (Resources)
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

I then submitted another 3 short jobs, but they also did not start and remained pending:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
 561862      mdiab       gpu     1     1 PD 2025-03-14T15:36:23                 N/A     0:00      10:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561861      mdiab       gpu     1     1 PD 2025-03-14T15:36:22                 N/A     0:00      10:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561860      mdiab       gpu     1     1 PD 2025-03-14T15:36:04                 N/A     0:00      10:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561858      mdiab       gpu     1     1 PD 2025-03-14T15:28:01                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561857      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561856      mdiab       gpu     1     1 PD 2025-03-14T15:28:00                 N/A     0:00    5:00:00 (ReqNodeNotAvail, UnavailableNodes:maxg20)
 561855      mdiab       gpu     1     1 PD 2025-03-14T15:27:59                 N/A     0:00    5:00:00 (Resources)
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ 

I enabled debug logging as you asked and uploaded the slurmctl.log and slurmd.log (for maxg20).
Thanks and best regards,
Motaz
Comment 11 Motaz Diab 2025-03-14 08:47:56 MDT
Created attachment 41141 [details]
slurmctl.log and slurmd.log (for maxg20)
Comment 12 Ben Roberts 2025-03-21 10:28:36 MDT
Hi Motaz,

My apologies that it took me a while to get back to you on this.  It looks like there is something else happening that is preventing these long GPU jobs from starting on that node.  Unfortunately it's not clear from the logs what is keeping the job from starting.  I was looking at job 561858 as an example.  You can see in the logs that the scheduler recognizes that it could use node maxg20, but the job doesn't start on it.

[2025-03-14T15:28:01.358] debug2: found 2 usable nodes from config containing maxg[10,20]
...
[2025-03-14T15:28:01.358] debug2: NodeSet[3] Nodes:maxg[10,20] NodeWeight:10 Flags:0 FeatureBits:0 SchedWeight:2815



That is what the main scheduler shows.  The job is then evaluated by the backfill scheduler, which shows that it tries to schedule the job but doesn't provide details of why it isn't able to start it.

[2025-03-14T15:28:21.853] sched/backfill: _attempt_backfill: BACKFILL: test for JobId=561858 Prio=1000 Partition=gpu Reservation=NONE
[2025-03-14T15:28:21.853] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=561858.
[2025-03-14T15:28:21.853] debug2: sched/backfill: _try_sched: exclude core bitmap: 3644
[2025-03-14T15:28:21.853] debug2: select/cons_tres: select_p_job_test: evaluating JobId=561858
[2025-03-14T15:28:21.853] debug2: select/cons_tres: select_p_job_test: evaluating JobId=561858


Since it's the long jobs that aren't able to start on the node, they shouldn't be directly affected by the floating reservation you created.  Something else is happening that is keeping these jobs from starting.  It's possible that there is a large job that is reserving resources and the longer jobs are long enough that they would interfere with its start time.  I hate to ask you to run this test one more time, but I'd like to have some additional logging collected.  Could you run the following commands to enable two more debug flags:
 scontrol setdebug debug2
 scontrol setdebugflags +backfill,backfillmap,selecttype


Then if you would run a similar test and send the logs from that time period.  Then you can turn the log level back down with this:
 scontrol setdebug info
 scontrol setdebugflags -backfill,backfillmap,selecttype


I would also like to see the output of 'scontrol show job <jobid>' for one of the long jobs that isn't able to run as well as one of the short ones that does run.  Along with that I would like to see the output of 'scontrol show node maxg20' while a short job is running on it.  

Thanks,
Ben
Comment 13 Motaz Diab 2025-03-31 06:54:37 MDT
Hello again,

I installed the latest version of Slurm (24.11.3) on a test cluster which has a single partition and 8 GPU nodes (8 GPU devices on each).

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1025) $ scontrol -V
slurm 24.11.3
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1026) $


I created the reservation for 2 GPUs on a single node (hai002):

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1024) $ scontrol show reservations
ReservationName=short_gpu StartTime=2025-03-31T15:11:18 EndTime=2025-04-01T03:11:18 Duration=12:00:00
   Nodes=hai002 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=standard Flags=TIME_FLOAT
     NodeName=hai002 CoreIDs=0,56
   TRES=cpu=2,gres/gpu=2
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1025) $
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1026) $ sinfo -Nel
Mon Mar 31 14:41:59 2025
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
hai001         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai002         1 standard*    reserved 224    2:56:2 975000        0      1 sapphire none                
hai003         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai004         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai005         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai006         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai007         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
hai008         1 standard*        idle 224    2:56:2 975000        0      1 sapphire none                
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1027) $ 
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1030) $ scontrol show node hai002
NodeName=hai002 Arch=x86_64 CoresPerSocket=56 
   CPUAlloc=0 CPUEfctv=220 CPUTot=224 CPULoad=0.01
   AvailableFeatures=sapphire-rapids
   ActiveFeatures=sapphire-rapids
   Gres=gpu:H100-SXM5:8,gpu_memory:no_consume:80G,gpu_compute_cap:no_consume:90
   NodeAddr=hai002 NodeHostName=hai002 Version=24.11.3
   OS=Linux 5.14.0-427.13.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024 
   RealMemory=975000 AllocMem=0 FreeMem=980413 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=110-111,222-223 
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=standard 
   BootTime=2025-03-24T22:47:17 SlurmdStartTime=2025-03-30T16:48:36
   LastBusyTime=2025-03-31T13:45:18 ResumeAfterTime=None
   CfgTRES=cpu=220,mem=975000M,billing=220,gres/gpu=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   
   ReservationName=short_gpu

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1031) $

I submitted two 1-GPU jobs (long: 1 hour, short: 10 minutes):
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ sbatch -N1 -n1 -c1 -t 0-1 -w hai002 --gres=gpu:1 --wrap "sleep 1000"
Submitted batch job 302
[mdiab-srvadm@hai-login1.haicore.berlin:~] $
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ sbatch -N1 -n1 -c1 -t 10 -w hai002 --gres=gpu:1 --wrap "sleep 1000"
Submitted batch job 303
[mdiab-srvadm@hai-login1.haicore.berlin:~] $

But this time none of them started.
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ squeue 
  JOBID       USER PARTITION NODES  CPUS ST         SUBMIT_TIME          START_TIME     TIME TIME_LIMIT NODELIST(REASON)
    303 mdiab-srva  standard     1     1 PD 2025-03-31T14:26:16                 N/A     0:00      10:00 (ReqNodeNotAvail, May be reserved for other job)
    302 mdiab-srva  standard     1     1 PD 2025-03-31T14:25:51                 N/A     0:00    1:00:00 (Resources)
[mdiab-srvadm@hai-login1.haicore.berlin:~] $
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ scontrol show job 302
JobId=302 JobName=wrap
   UserId=mdiab-srvadm(961900513) GroupId=mdiab-srvadm(961900513) MCS_label=N/A
   Priority=1 Nice=0 Account=it QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-03-31T14:25:51 EligibleTime=2025-03-31T14:25:51
   AccrueTime=2025-03-31T14:25:51
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-31T14:50:34 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=hai-login1:2540965
   ReqNodeList=hai002 ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=4G,node=1,billing=1,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/fast/home/mdiab-srvadm
   StdErr=/fast/home/mdiab-srvadm/slurm-302.out
   StdIn=/dev/null
   StdOut=/fast/home/mdiab-srvadm/slurm-302.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=1
   

[mdiab-srvadm@hai-login1.haicore.berlin:~] $ scontrol show job 303
JobId=303 JobName=wrap
   UserId=mdiab-srvadm(961900513) GroupId=mdiab-srvadm(961900513) MCS_label=N/A
   Priority=1 Nice=0 Account=it QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_May_be_reserved_for_other_job Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2025-03-31T14:26:16 EligibleTime=2025-03-31T14:26:16
   AccrueTime=2025-03-31T14:26:16
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-31T14:50:34 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=hai-login1:2540965
   ReqNodeList=hai002 ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=4G,node=1,billing=1,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/fast/home/mdiab-srvadm
   StdErr=/fast/home/mdiab-srvadm/slurm-303.out
   StdIn=/dev/null
   StdOut=/fast/home/mdiab-srvadm/slurm-303.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=1
   

[mdiab-srvadm@hai-login1.haicore.berlin:~] $

I have uploaded the slurm.conf and the slurmctld.log (with debug enabled) here.

Note: I tested the same example reserving some CPUs (instead of GPUs) and it works, so it seems that only the GPU reservation is broken.
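
For reference, a rough sketch of the kind of CPU-only floating reservation used for that comparison (the core count and node here are placeholders, not the literal command that was run):

$ scontrol create reservationname=short_cpu Nodes=hai002 user=root partition=standard CoreCnt=2 starttime=now+30minutes duration=12:00:00 flags=time_float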

Thanks,
Motaz
Comment 14 Motaz Diab 2025-03-31 06:55:18 MDT
Created attachment 41319 [details]
slurmctl.log
Comment 15 Motaz Diab 2025-03-31 06:56:02 MDT
Created attachment 41320 [details]
Slurm configuration file
Comment 16 Ben Roberts 2025-04-04 11:21:32 MDT
Thank you for reproducing this on your test cluster and collecting these logs.  I can see that there is something causing the resources to be excluded for this node.  Here is an excerpt showing the scheduler evaluate hai001; it sees the GPUs and cores correctly for that node, but the job doesn't start on that node because it doesn't request it.

[2025-03-31T14:25:52.000] select/cons_tres: _can_job_run_on_node: SELECT_TYPE: 220 CPUs on hai001(state:0), mem 0/975000
[2025-03-31T14:25:52.000] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hai001 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:1-220,220 ThreadsPerCore:2
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE:   AnySocket gpu:8
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE:   Socket[0] Cores:55
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE:   Socket[1] Cores:55



But then when it comes to hai002 it says that the resources are excluded for the node.

[2025-03-31T14:26:31.000] select/cons_tres: _can_use_gres_exc_topo: SELECT_TYPE: can't include!, it is excluded 1 0
[2025-03-31T14:26:31.000] select/cons_tres: _can_job_run_on_node: SELECT_TYPE: Test fail on node hai002: gres_sock_list_create


Can I have you send a copy of your gres.conf for your test system?  I would like to see if there is something there that is keeping it from recognizing that it can use these GPUs/CPUs.  If you have any other .conf files in the same directory as your slurm.conf, it would be good to see all of them.

Thanks,
Ben
Comment 17 Motaz Diab 2025-04-05 04:45:48 MDT
Hello,

Here are the three config files in use: gres.conf, job_container.conf, and cgroup.conf:

[mdiab@cl-hpc02 12:40:25 config]$ cat gres.conf |grep -v "#"
AutoDetect=off
NodeName=hai00[1-8] Name=gpu Type=H100-SXM5 File=/dev/nvidia[0-7] Flags=nvidia_gpu_env
NodeName=hai00[1-8] Name=gpu_memory Count=80G Flags=CountOnly
NodeName=hai00[1-8] Name=gpu_compute_cap Count=90 Flags=CountOnly

[mdiab@cl-hpc02 12:40:35 config]$ cat job_container.conf |grep -v "#"
AutoBasePath=true
BasePath=/tmp/slurm
Dirs=/var/tmp
Shared=true

[mdiab@cl-hpc02 12:40:48 config]$ cat cgroup.conf |grep -v "#"
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

I also have a job_submit.lua which silently routes interactive jobs to the interactive partition.
[mdiab@cl-hpc02 12:41:11 config]$ cat job_submit.lua 

function slurm_job_submit(job_desc, part_list, submit_uid)

   -- Sending all interactive jobs to interactive partition.
   if not job_desc.script then
           job_desc.partition = 'interactive'
   end  

   -- Allow the job to proceed
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

Thanks
Motaz
Comment 18 Ben Roberts 2025-04-09 14:38:51 MDT
Hi Motaz,

I discussed this ticket with a colleague who thinks that this might be related to an issue he has seen involving reservations and typed GPUs.  You have reproduced this on your production system and a test system, and it looks like both of them have the GPUs defined with a type ('tesla-v100-SXM2' in production and 'H100-SXM5' on the test cluster).  Can you try removing that type from the GPU definition on your test cluster to confirm whether this is the same issue?  I'm not suggesting you remove it as the long term fix, but it will allow us to confirm the issue.
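
As a sketch of what I mean for the test cluster (assuming the matching Gres= entries for these nodes in slurm.conf are adjusted to drop the type as well, e.g. Gres=gpu:8 instead of Gres=gpu:H100-SXM5:8), the gres.conf line would become something like:

NodeName=hai00[1-8] Name=gpu File=/dev/nvidia[0-7] Flags=nvidia_gpu_env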

Thanks,
Ben