Dear support,

We have defined the [gpu] partition with a TimeLimit of 2 weeks as follows:

PartitionName=gpu Nodes=max00[2-6],maxg[05,10,18,20,22,24-26] MaxTime=14-0 PriorityTier=5

All our GPU nodes are in it. Because some users submit large GPU array jobs that request all GPU resources in Slurm, other short GPU jobs are blocked and wait for days until some GPUs become free again.

My question is: is there any option to reserve just two GPUs on every GPU node (each node has 8 GPUs in total) with an overridden TimeLimit of, for example, 30 minutes?

Thanks
Best regards,
Motaz
Hi Motaz,

I want to make sure I understand the behavior you want. It sounds like you want to keep 2 GPUs open on each node for short jobs (30 minutes or less). This way you don't have to place an artificial limit on the number of jobs other users can run, which could leave resources idle. Is that right?

If so, this sounds like a good case for a floating reservation. This lets you create a reservation whose start time is always X minutes in the future, in your case 30 minutes. Jobs that request more than 30 minutes of run time cannot start on the reserved resources, but jobs that request less time than that can still be scheduled on them.

Here's a quick example of how this might look:

$ scontrol create reservationname=short_gpu nodecnt=10 TRESPerNode=gres/gpu=2 account=sub1 starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu
$ scontrol show reservations
ReservationName=short_gpu StartTime=2025-01-22T12:19:02 EndTime=2025-01-23T00:19:02 Duration=12:00:00
   Nodes=node[17-18,25-32] NodeCnt=10 CoreCnt=10 Features=(null) PartitionName=debug Flags=TIME_FLOAT
   NodeName=node17 CoreIDs=0
   NodeName=node18 CoreIDs=0
   NodeName=node25 CoreIDs=0
   NodeName=node26 CoreIDs=0
   NodeName=node27 CoreIDs=0
   NodeName=node28 CoreIDs=0
   NodeName=node29 CoreIDs=0
   NodeName=node30 CoreIDs=0
   NodeName=node31 CoreIDs=0
   NodeName=node32 CoreIDs=0
   TRES=cpu=10,gres/gpu=20
   Users=(null) Groups=(null) Accounts=sub1 Licenses=(null) State=INACTIVE BurstBuffer=(null) MaxStartDelay=(null)

You can find more information on the TIME_FLOAT flag here:
https://slurm.schedmd.com/reservations.html#float
https://slurm.schedmd.com/scontrol.html#OPT_TIME_FLOAT

Let me know if you have any questions about this.

Thanks,
Ben
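For reference, a reservation created this way can be inspected, adjusted, and removed later with scontrol. A minimal sketch, reusing the reservation name from the example above (the new duration value is only illustrative):

   # confirm the start time keeps drifting forward thanks to the TIME_FLOAT flag
   scontrol show reservation short_gpu | grep StartTime
   # extend the reservation window
   scontrol update reservationname=short_gpu duration=24:00:00
   # remove it entirely
   scontrol delete reservation short_gpu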
Hi Motaz,

I wanted to see if the information I sent helped. Let me know if you have any more questions about this or if this ticket is ok to close.

Thanks,
Ben
Dear Ben,

I tried it, but unfortunately it did not work as expected.

I created a floating reservation on one node [maxg05], which has 4 GPUs, and reserved 2 of them. I granted only my user [mdiab] access to it.

[root@max-mastr1.mdc-berlin.net:~] (1090) $ scontrol show reservation short_gpu
ReservationName=short_gpu StartTime=2025-02-22T18:36:56 EndTime=2025-02-22T22:36:56 Duration=04:00:00
   Nodes=maxg05 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=SPEC_NODES,TIME_FLOAT
   NodeName=maxg05 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=mdiab Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) MaxStartDelay=(null)
[root@max-mastr1.mdc-berlin.net:~] (1091) $

I submitted several GPU jobs (one requested 10 minutes, the others requested 5 hours):

[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 0:10:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530107
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530108
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530109
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530110
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530111
[mdiab@max-login5.mdc-berlin.net:~] $ sbatch -t 5:0:0 -p gpu --gres=gpu:tesla-v100-SXM2:1 --wrap "sleep 180"
Submitted batch job 530112
[mdiab@max-login5.mdc-berlin.net:~] $

I found 3 long jobs (5 hours) running on maxg05, so the reservation did not prevent them from using the 2 reserved GPUs out of the 4.

[mdiab@max-login5.mdc-berlin.net:~] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME          TIME TIME_LIMIT NODELIST(REASON)
530112 mdiab gpu       1     1    PD 2025-02-22T18:43:36 N/A                 0:00 5:00:00    (Resources)
530111 mdiab gpu       1     1    PD 2025-02-22T18:43:35 N/A                 0:00 5:00:00    (Resources)
530108 mdiab gpu       1     1    R  2025-02-22T18:43:31 2025-02-22T18:43:41 0:02 5:00:00    maxg05
530109 mdiab gpu       1     1    R  2025-02-22T18:43:34 2025-02-22T18:43:41 0:02 5:00:00    maxg05
530110 mdiab gpu       1     1    R  2025-02-22T18:43:34 2025-02-22T18:43:41 0:02 5:00:00    maxg05
530107 mdiab gpu       1     1    R  2025-02-22T18:43:24 2025-02-22T18:43:24 0:19 10:00      maxg05
[mdiab@max-login5.mdc-berlin.net:~] $

So did I do something wrong, or is the concept still unclear to me?

One more question: how do I create a floating reservation for all users? I see that it's required to specify a user/group or account, so should I list all user accounts in the creation command?

Best regards,
Motaz
Hi Motaz,

I'm sorry, my example was a little unclear. In your test you created a reservation that allows your own user to run jobs in the reservation. The idea behind the floating reservation is to give it an ACL that prevents the majority of users from being able to access it. In my example I used an account called 'sub1', but I didn't make it clear that for this to work I would need to submit jobs under an account other than 'sub1'.

You can also use this to give your primary users access to the reservation. To continue with my earlier example: if the principal investigators whom I wanted to make sure never had to wait too long for the GPUs were users in the 'sub1' account, then I would create the reservation the way I did. Users who only want the GPUs for up to 30 minutes at a time would be in other accounts and would be able to start short jobs on the GPUs. If a user in the 'sub1' account came along, their job would qualify for the reservation, so it could be longer than the 30 minute limit imposed on the other jobs and it wouldn't have to wait more than 30 minutes to start.

You would also want to add the 'flex' flag when creating the reservation so that these jobs can qualify for the reservation even while its start time is in the future.

https://slurm.schedmd.com/scontrol.html#OPT_FLEX

I hope this makes more sense. Let me know if you have any additional questions about this.

Thanks,
Ben
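A minimal sketch of what that could look like, reusing the values from the earlier example (the account name, node count, and window length are only placeholders):

   scontrol create reservationname=short_gpu nodecnt=10 TRESPerNode=gres/gpu=2 account=sub1 starttime=now+30minutes duration=12:00:00 flags=time_float,flex

With this in place, jobs submitted under accounts other than 'sub1' can only use the reserved GPUs if they fit inside the 30 minute window, while jobs under 'sub1' qualify for the reservation itself and can therefore run longer on those GPUs.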
Hello again,

I re-created the same floating reservation, but this time I specified a user other than my own [in this example: amardt] as follows:

[root@max-mastr1.mdc-berlin.net:~] (1004) $ scontrol create reservationname=short_gpu Nodes=maxg05 user=amardt TRESPerNode=gres/gpu=2 starttime=now duration=4:00:00 flags=time_float
Reservation created: short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1006) $ scontrol show res short_gpu
ReservationName=short_gpu StartTime=2025-02-27T00:37:28 EndTime=2025-02-27T04:37:28 Duration=04:00:00
   Nodes=maxg05 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=SPEC_NODES,TIME_FLOAT
   NodeName=maxg05 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=amardt Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) MaxStartDelay=(null)
[root@max-mastr1.mdc-berlin.net:~] (1007)

I re-submitted a group of short [30 minutes] and long [5 hours] jobs as my user [mdiab]:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534072
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534073
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534074
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534075
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch long-gpu.sh
Submitted batch job 534076
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch short-gpu.sh
Submitted batch job 534077
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ sbatch short-gpu.sh
Submitted batch job 534078
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

Again my long jobs got all 4 GPUs on maxg05 (the 2 normal ones and the 2 reserved by short_gpu), which is the normal behavior and not the desired one, namely to prevent long jobs from being scheduled onto the 2 GPUs reserved for short jobs.

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME          TIME TIME_LIMIT NODELIST(REASON)
534078 mdiab gpu       1     1    PD 2025-02-27T00:35:36 N/A                 0:00 10:00      (Priority)
534077 mdiab gpu       1     1    PD 2025-02-27T00:35:31 N/A                 0:00 10:00      (Priority)
534076 mdiab gpu       1     1    PD 2025-02-27T00:35:27 N/A                 0:00 5:00:00    (Priority)
534072 mdiab gpu       1     1    R  2025-02-27T00:35:21 2025-02-27T00:35:51 0:01 5:00:00    maxg05
534073 mdiab gpu       1     1    R  2025-02-27T00:35:23 2025-02-27T00:35:51 0:01 5:00:00    maxg05
534074 mdiab gpu       1     1    R  2025-02-27T00:35:24 2025-02-27T00:35:51 0:01 5:00:00    maxg05
534075 mdiab gpu       1     1    R  2025-02-27T00:35:24 2025-02-27T00:35:51 0:01 5:00:00    maxg05
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

Let me re-explain our desired scenario:

We have a gpu partition with a TimeLimit of 2 weeks, which means any user can occupy all available GPUs for up to 2 weeks and block everyone else. What we would like is to override the gpu partition TimeLimit per GPU resource (for example, for 2 GPUs per node) down to 30 minutes.

Is it possible to override the partition TimeLimit per resource, or in other words to have two TimeLimits in the same partition? This should apply to all users (not to a specific user, group, or account like sub1), so any user's job from any group or account would be handled like this:

if the job's time limit < 30 minutes, then
    go to CUDA IDs 0-1 on any GPU node in the gpu partition
else  # here the time limit can be up to 2 weeks (the partition TimeLimit)
    go to CUDA IDs >= 2 on any GPU node in the same gpu partition
end

Thanks a lot
Best regards,
Motaz
Hi Motaz,

I see in the command you used to create the reservation that you have 'starttime=now'. This makes the reservation start immediately and doesn't leave the 30 minute window that is always being pushed out before the reservation "starts". If you change that to 'starttime=now+30minutes' then I would expect it to behave properly.

Here is an example where I create a test reservation with a start time that is 30 minutes in the future. I also create this reservation for the root user, to prevent any regular user on the system from being able to access it. I have a 'gpu' partition with 2 nodes in it, so I use that for simplicity in the testing that follows.

$ scontrol create reservationname=short_gpu nodecnt=2 partition=gpu TRESPerNode=gres/gpu=2 user=root starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu
$ scontrol show reservations
ReservationName=short_gpu StartTime=2025-02-28T12:42:34 EndTime=2025-03-01T00:42:34 Duration=12:00:00
   Nodes=node[07-08] NodeCnt=2 CoreCnt=2 Features=(null) PartitionName=gpu Flags=TIME_FLOAT
   NodeName=node07 CoreIDs=0
   NodeName=node08 CoreIDs=0
   TRES=cpu=2,gres/gpu=4
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) MaxStartDelay=(null)

With that reservation in place, I submit 3 test jobs to the gpu partition, each requesting 2 GPUs. The first two jobs are able to start, one on each node. The third job is blocked because the other GPUs on these nodes are reserved and its 1 hour wall time is too long to fit in the open window of time.

$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9195
$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9196
$ sbatch -n2 -t1:00:00 --gres=gres/gpu=2 -pgpu --wrap='srun sleep 3600'
Submitted batch job 9197
$ squeue
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
   9197 gpu       wrap ben  PD 0:00 1     (Resources)
   9196 gpu       wrap ben  R  0:05 1     node08
   9195 gpu       wrap ben  R  0:08 1     node07

Then I submit a job that only requests 20 minutes of wall time so that it fits in the 30 minute window before the reservation starts.

$ sbatch -n1 -t20:00 --gres=gres/gpu=1 -pgpu --wrap='srun sleep 1200'
Submitted batch job 9198
$ squeue
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
   9197 gpu       wrap ben  PD 0:00 1     (Resources)
   9198 gpu       wrap ben  R  0:01 1     node07
   9196 gpu       wrap ben  R  0:16 1     node08
   9195 gpu       wrap ben  R  0:19 1     node07

To further clarify the behavior, if I look at the start time of the reservation again, you can see that it has moved and is still 30 minutes in the future.

$ date; scontrol show reservations | grep StartTime
Fri Feb 28 12:29:56 PM CST 2025
ReservationName=short_gpu StartTime=2025-02-28T12:59:56 EndTime=2025-03-01T00:59:56 Duration=12:00:00

To summarize: if you set the start time of the reservation to be some number of minutes in the future and include the 'time_float' flag, then the reservation will always stay that number of minutes in the future. Jobs that request fewer than that many minutes of wall time can run on the reserved resources. This effectively gives you two time limits on the partition: the one defined on the partition itself, and the reservation, which blocks jobs longer than the window of time before it starts but allows short jobs to run.
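Applied to the production partition described at the top of this ticket, the same approach might look roughly like the following sketch (the node list is copied from the gpu partition definition; the duration and the 30 minute window are placeholders to adjust as needed):

   scontrol create reservationname=short_gpu nodes=max00[2-6],maxg[05,10,18,20,22,24-26] partition=gpu TRESPerNode=gres/gpu=2 user=root starttime=now+30minutes duration=12:00:00 flags=time_float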
I'm reading your last update again and I notice that you are also pointing out that your user was able to run in a reservation that should only allow a different user. That is strange. Can you send a copy of your slurm.conf so I can check whether something there isn't configured to enforce this properly?

Thanks,
Ben
Hello,

Thank you for your last explanation; the concept is now very clear to me, but it is still not working as expected. I suspect it might be a bug in our current version [24.05.5].

I recreated the reservation to reserve 2 GPUs in the gpu partition on the two nodes maxg2[4-5] (each has 8 GPUs in total) as follows:

[root@max-mastr1.mdc-berlin.net:~] (1022) $ scontrol create reservationname=short_gpu Nodes=maxg2[4-5] user=root partition=gpu TRESPerNode=gres/gpu=2 starttime=now+30minutes duration=12:00:00 flags=time_float
Reservation created: short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1023) $ scontrol show res short_gpu
ReservationName=short_gpu StartTime=2025-03-13T20:45:16 EndTime=2025-03-14T08:45:16 Duration=12:00:00
   Nodes=maxg[24-25] NodeCnt=2 CoreCnt=2 Features=(null) PartitionName=gpu Flags=SPEC_NODES,TIME_FLOAT
   NodeName=maxg24 CoreIDs=0
   NodeName=maxg25 CoreIDs=0
   TRES=cpu=2,gres/gpu=4
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) MaxStartDelay=(null)
[root@max-mastr1.mdc-berlin.net:~] (1024) $

I submitted 3 short jobs (10 minutes) and 2 long jobs (5 hours), each requesting 2 GPUs:

[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 short-gpu.sh
Submitted batch job 561660
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 short-gpu.sh
Submitted batch job 561661
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg24 short-gpu.sh
Submitted batch job 561662
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg25 long-gpu.sh
Submitted batch job 561663
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -w maxg24 long-gpu.sh
Submitted batch job 561664
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME TIME TIME_LIMIT NODELIST(REASON)
561664 mdiab gpu       1     1    PD 2025-03-13T20:19:29 N/A        0:00 5:00:00    (Priority)
561663 mdiab gpu       1     1    PD 2025-03-13T20:19:25 N/A        0:00 5:00:00    (Priority)
561662 mdiab gpu       1     1    PD 2025-03-13T20:14:48 N/A        0:00 10:00      (Priority)
561661 mdiab gpu       1     1    PD 2025-03-13T20:14:45 N/A        0:00 10:00      (Priority)
561660 mdiab gpu       1     1    PD 2025-03-13T20:14:44 N/A        0:00 10:00      (Priority)
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

But none of them started. If I submit a job to maxg2[4-5] without requesting GPUs, it starts shortly regardless of the requested time:

[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ sbatch -p gpu -w maxg24 -t 1-0 --wrap "sleep 50"
Submitted batch job 561665
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue | grep 561665
561665 mdiab gpu 1 1 R 2025-03-13T20:25:16 2025-03-13T20:25:34 0:08 1-00:00:00 maxg24
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

Of course, if I delete the reservation:

[root@max-mastr1.mdc-berlin.net:~] (1025) $ scontrol delete reservation short_gpu
[root@max-mastr1.mdc-berlin.net:~] (1026) $

=> then all of my pending jobs started immediately too.
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME          TIME TIME_LIMIT NODELIST(REASON)
561660 mdiab gpu       1     1    R  2025-03-13T20:14:44 2025-03-13T20:27:04 0:07 10:00      maxg25
561661 mdiab gpu       1     1    R  2025-03-13T20:14:45 2025-03-13T20:27:04 0:07 10:00      maxg25
561662 mdiab gpu       1     1    R  2025-03-13T20:14:48 2025-03-13T20:27:04 0:07 10:00      maxg24
561663 mdiab gpu       1     1    R  2025-03-13T20:19:25 2025-03-13T20:27:04 0:07 5:00:00    maxg25
561664 mdiab gpu       1     1    R  2025-03-13T20:19:29 2025-03-13T20:27:04 0:07 5:00:00    maxg24
[mdiab@max-login1.mdc-berlin.net:~/test-scripts] $

I attached our slurm.conf here.

Thanks
Motaz
Created attachment 41128: Slurm configuration file
I'm glad to hear that it makes sense now, but it's strange that it's not doing what it should.

One thing I notice that could affect how these jobs are scheduled is one of your scheduler parameters. You have 'bf_resolution=600', which means the backfill scheduler evaluates time in 10 minute increments; the default is 60 second increments. With only a 30 minute window of time before the reservation, that coarse resolution can make it hard to resolve the window properly. I would recommend setting this back down to the default of 60 seconds and seeing whether the behavior changes.

If you still see the same problem after changing bf_resolution, then I would like to see some debug logs for the issue. You can enable debug logging with the following commands:

scontrol setdebug debug2
scontrol setdebugflags +backfill

Please run the same test again and send the logs that cover the time of that test. Send all the logs for that window rather than grepping out a particular job id; there are often relevant log entries that don't include the job id on the line.

Afterwards you can set the log level back down to normal with:

scontrol setdebug info
scontrol setdebugflags -backfill

Thanks,
Ben
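A minimal sketch of the slurm.conf change being suggested here, assuming bf_resolution is the only backfill parameter you have set (any other existing SchedulerParameters entries would need to stay on the same comma-separated line):

   # slurm.conf -- either drop bf_resolution entirely to fall back to the 60 second default,
   # or set it explicitly:
   SchedulerParameters=bf_resolution=60

followed by 'scontrol reconfigure' to pick up the change.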
Hello again,

I set bf_resolution back to 60 (the default) but it did not help. I redid the same test on maxg20 only (it has 8 GPUs in total).

ReservationName=short_gpu StartTime=2025-03-14T16:09:41 EndTime=2025-03-15T04:09:41 Duration=12:00:00
   Nodes=maxg20 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=gpu Flags=SPEC_NODES,TIME_FLOAT
   NodeName=maxg20 CoreIDs=0
   TRES=cpu=1,gres/gpu=2
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) MaxStartDelay=(null)

I submitted 4 long jobs and 4 short jobs (each requesting 2 GPUs). All short jobs started, but the long jobs did not:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME          TIME TIME_LIMIT NODELIST(REASON)
561858 mdiab gpu       1     1    PD 2025-03-14T15:28:01 N/A                 0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561857 mdiab gpu       1     1    PD 2025-03-14T15:28:00 N/A                 0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561856 mdiab gpu       1     1    PD 2025-03-14T15:28:00 N/A                 0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561855 mdiab gpu       1     1    PD 2025-03-14T15:27:59 N/A                 0:00 5:00:00    (Resources)
561851 mdiab gpu       1     1    R  2025-03-14T15:27:54 2025-03-14T15:28:00 0:28 10:00      maxg20
561852 mdiab gpu       1     1    R  2025-03-14T15:27:54 2025-03-14T15:28:00 0:28 10:00      maxg20
561853 mdiab gpu       1     1    R  2025-03-14T15:27:55 2025-03-14T15:28:00 0:28 10:00      maxg20
561854 mdiab gpu       1     1    R  2025-03-14T15:27:55 2025-03-14T15:28:00 0:28 10:00      maxg20
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

When all 4 short jobs had completed, the long jobs were still pending:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME TIME TIME_LIMIT NODELIST(REASON)
561858 mdiab gpu       1     1    PD 2025-03-14T15:28:01 N/A        0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561857 mdiab gpu       1     1    PD 2025-03-14T15:28:00 N/A        0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561856 mdiab gpu       1     1    PD 2025-03-14T15:28:00 N/A        0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561855 mdiab gpu       1     1    PD 2025-03-14T15:27:59 N/A        0:00 5:00:00    (Resources)
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

I then submitted another 3 short jobs, but they did not start either and are pending:

[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $ squeue --me
JOBID  USER  PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME TIME TIME_LIMIT NODELIST(REASON)
561862 mdiab gpu       1     1    PD 2025-03-14T15:36:23 N/A        0:00 10:00      (ReqNodeNotAvail, UnavailableNodes:maxg20)
561861 mdiab gpu       1     1    PD 2025-03-14T15:36:22 N/A        0:00 10:00      (ReqNodeNotAvail, UnavailableNodes:maxg20)
561860 mdiab gpu       1     1    PD 2025-03-14T15:36:04 N/A        0:00 10:00      (ReqNodeNotAvail, UnavailableNodes:maxg20)
561858 mdiab gpu       1     1    PD 2025-03-14T15:28:01 N/A        0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561857 mdiab gpu       1     1    PD 2025-03-14T15:28:00 N/A        0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561856 mdiab gpu       1     1    PD 2025-03-14T15:28:00 N/A        0:00 5:00:00    (ReqNodeNotAvail, UnavailableNodes:maxg20)
561855 mdiab gpu       1     1    PD 2025-03-14T15:27:59 N/A        0:00 5:00:00    (Resources)
[mdiab@max-login5.mdc-berlin.net:~/test-scripts] $

I enabled the debug settings you asked for and uploaded the slurmctl.log and slurmd.log (for maxg20).

Thanks and best regards,
Motaz
Created attachment 41141: slurmctl.log and slurmd.log (for maxg20)
Hi Motaz,

My apologies that it took me a while to get back to you on this. It looks like something else is preventing these long GPU jobs from starting on that node, and unfortunately it's not clear from the logs what is keeping them from starting.

I was looking at job 561858 as an example. You can see in the logs that the main scheduler recognizes that it could use node maxg20, but the job doesn't start on it:

[2025-03-14T15:28:01.358] debug2: found 2 usable nodes from config containing maxg[10,20]
...
[2025-03-14T15:28:01.358] debug2: NodeSet[3] Nodes:maxg[10,20] NodeWeight:10 Flags:0 FeatureBits:0 SchedWeight:2815

The job is then evaluated by the backfill scheduler, which shows that it tries to schedule the job but gives no details about why it isn't able to start it:

[2025-03-14T15:28:21.853] sched/backfill: _attempt_backfill: BACKFILL: test for JobId=561858 Prio=1000 Partition=gpu Reservation=NONE
[2025-03-14T15:28:21.853] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=561858.
[2025-03-14T15:28:21.853] debug2: sched/backfill: _try_sched: exclude core bitmap: 3644
[2025-03-14T15:28:21.853] debug2: select/cons_tres: select_p_job_test: evaluating JobId=561858
[2025-03-14T15:28:21.853] debug2: select/cons_tres: select_p_job_test: evaluating JobId=561858

Since it's the long jobs that aren't able to start on the node, they shouldn't be directly affected by the floating reservation you created; something else is keeping them from starting. It's possible there is a large job reserving resources and the long jobs are long enough to interfere with its expected start time.

I hate to ask you to run this test one more time, but I'd like some additional logging collected. Could you run the following commands to enable two more debug flags:

scontrol setdebug debug2
scontrol setdebugflags +backfill,backfillmap,selecttype

Then run a similar test and send the logs from that time period. Afterwards you can turn the log level back down with:

scontrol setdebug info
scontrol setdebugflags -backfill,backfillmap,selecttype

I would also like to see the output of 'scontrol show job <jobid>' for one of the long jobs that isn't able to run and for one of the short ones that does run, along with the output of 'scontrol show node maxg20' while a short job is running on it.

Thanks,
Ben
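One possible way to capture just the relevant window of the controller log, sketched here under the assumption that slurmctld writes to /var/log/slurm/slurmctld.log (adjust the path and the timestamp patterns to match your setup and the actual test times):

   # copy everything logged between the start and the end of the test run
   awk '/^\[2025-03-14T15:2/,/^\[2025-03-14T16:0/' /var/log/slurm/slurmctld.log > slurmctld-test-window.log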
Hello again,

I installed the latest version of Slurm (24.11.3) on a test cluster which has a single partition and 8 GPU nodes (8 GPU devices on each).

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1025) $ scontrol -V
slurm 24.11.3
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1026) $

I created the reservation for 2 GPUs on a single node (hai002):

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1024) $ scontrol show reservations
ReservationName=short_gpu StartTime=2025-03-31T15:11:18 EndTime=2025-04-01T03:11:18 Duration=12:00:00
   Nodes=hai002 NodeCnt=1 CoreCnt=2 Features=(null) PartitionName=standard Flags=TIME_FLOAT
   NodeName=hai002 CoreIDs=0,56
   TRES=cpu=2,gres/gpu=2
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) MaxStartDelay=(null)
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1025) $

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1026) $ sinfo -Nel
Mon Mar 31 14:41:59 2025
NODELIST NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
hai001   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
hai002   1     standard* reserved 224  2:56:2 975000 0        1      sapphire none
hai003   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
hai004   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
hai005   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
hai006   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
hai007   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
hai008   1     standard* idle     224  2:56:2 975000 0        1      sapphire none
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1027) $

[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1030) $ scontrol show node hai002
NodeName=hai002 Arch=x86_64 CoresPerSocket=56
   CPUAlloc=0 CPUEfctv=220 CPUTot=224 CPULoad=0.01
   AvailableFeatures=sapphire-rapids
   ActiveFeatures=sapphire-rapids
   Gres=gpu:H100-SXM5:8,gpu_memory:no_consume:80G,gpu_compute_cap:no_consume:90
   NodeAddr=hai002 NodeHostName=hai002 Version=24.11.3
   OS=Linux 5.14.0-427.13.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024
   RealMemory=975000 AllocMem=0 FreeMem=980413 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=110-111,222-223
   State=IDLE+RESERVED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=standard
   BootTime=2025-03-24T22:47:17 SlurmdStartTime=2025-03-30T16:48:36
   LastBusyTime=2025-03-31T13:45:18 ResumeAfterTime=None
   CfgTRES=cpu=220,mem=975000M,billing=220,gres/gpu=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   ReservationName=short_gpu
[root@hai-mastr1.haicore.berlin:/var/log/slurm] (1031) $

I submitted two 1-GPU jobs (a long one of 1 hour and a short one of 10 minutes):

[mdiab-srvadm@hai-login1.haicore.berlin:~] $ sbatch -N1 -n1 -c1 -t 0-1 -w hai002 --gres=gpu:1 --wrap "sleep 1000"
Submitted batch job 302
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ sbatch -N1 -n1 -c1 -t 10 -w hai002 --gres=gpu:1 --wrap "sleep 1000"
Submitted batch job 303
[mdiab-srvadm@hai-login1.haicore.berlin:~] $

But this time neither of them started.
[mdiab-srvadm@hai-login1.haicore.berlin:~] $ squeue
JOBID USER       PARTITION NODES CPUS ST SUBMIT_TIME         START_TIME TIME TIME_LIMIT NODELIST(REASON)
303   mdiab-srva standard  1     1    PD 2025-03-31T14:26:16 N/A        0:00 10:00      (ReqNodeNotAvail, May be reserved for other job)
302   mdiab-srva standard  1     1    PD 2025-03-31T14:25:51 N/A        0:00 1:00:00    (Resources)
[mdiab-srvadm@hai-login1.haicore.berlin:~] $

[mdiab-srvadm@hai-login1.haicore.berlin:~] $ scontrol show job 302
JobId=302 JobName=wrap
   UserId=mdiab-srvadm(961900513) GroupId=mdiab-srvadm(961900513) MCS_label=N/A
   Priority=1 Nice=0 Account=it QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-03-31T14:25:51 EligibleTime=2025-03-31T14:25:51
   AccrueTime=2025-03-31T14:25:51
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-31T14:50:34 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=hai-login1:2540965
   ReqNodeList=hai002 ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=4G,node=1,billing=1,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/fast/home/mdiab-srvadm
   StdErr=/fast/home/mdiab-srvadm/slurm-302.out
   StdIn=/dev/null
   StdOut=/fast/home/mdiab-srvadm/slurm-302.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=1

[mdiab-srvadm@hai-login1.haicore.berlin:~] $ scontrol show job 303
JobId=303 JobName=wrap
   UserId=mdiab-srvadm(961900513) GroupId=mdiab-srvadm(961900513) MCS_label=N/A
   Priority=1 Nice=0 Account=it QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_May_be_reserved_for_other_job Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2025-03-31T14:26:16 EligibleTime=2025-03-31T14:26:16
   AccrueTime=2025-03-31T14:26:16
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-31T14:50:34 Scheduler=Backfill:*
   Partition=standard AllocNode:Sid=hai-login1:2540965
   ReqNodeList=hai002 ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=4G,node=1,billing=1,gres/gpu=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/fast/home/mdiab-srvadm
   StdErr=/fast/home/mdiab-srvadm/slurm-303.out
   StdIn=/dev/null
   StdOut=/fast/home/mdiab-srvadm/slurm-303.out
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=1
[mdiab-srvadm@hai-login1.haicore.berlin:~] $

I have uploaded the slurm.conf and the slurmctld.log (with debug enabled) here.

Note: I tested the same example reserving CPUs instead of GPUs and it works, so it seems that only GPU reservations are broken.

Thanks,
Motaz
Created attachment 41319: slurmctl.log
Created attachment 41320: Slurm configuration file
Thank you for reproducing this on your test cluster and collecting these logs. I can see that something is causing the resources to be excluded for this node.

Here is an excerpt showing the scheduler evaluating hai001; it sees the GPUs and cores for that node correctly, but the job doesn't start there because it did not request that node:

[2025-03-31T14:25:52.000] select/cons_tres: _can_job_run_on_node: SELECT_TYPE: 220 CPUs on hai001(state:0), mem 0/975000
[2025-03-31T14:25:52.000] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hai001 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:1-220,220 ThreadsPerCore:2
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE: AnySocket gpu:8
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:55
[2025-03-31T14:26:31.000] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:55

But when it comes to hai002, it says that the resources are excluded for the node:

[2025-03-31T14:26:31.000] select/cons_tres: _can_use_gres_exc_topo: SELECT_TYPE: can't include!, it is excluded 1 0
[2025-03-31T14:26:31.000] select/cons_tres: _can_job_run_on_node: SELECT_TYPE: Test fail on node hai002: gres_sock_list_create

Can you send a copy of the gres.conf from your test system? I would like to see whether something there is keeping it from recognizing that it can use these GPUs/CPUs. If you have any other .conf files in the same directory as your slurm.conf, it would be good to see all of them.

Thanks,
Ben
Hello,

Here are the three config files in use: gres.conf, job_container.conf, and cgroup.conf.

[mdiab@cl-hpc02 12:40:25 config]$ cat gres.conf | grep -v "#"
AutoDetect=off
NodeName=hai00[1-8] Name=gpu Type=H100-SXM5 File=/dev/nvidia[0-7] Flags=nvidia_gpu_env
NodeName=hai00[1-8] Name=gpu_memory Count=80G Flags=CountOnly
NodeName=hai00[1-8] Name=gpu_compute_cap Count=90 Flags=CountOnly

[mdiab@cl-hpc02 12:40:35 config]$ cat job_container.conf | grep -v "#"
AutoBasePath=true
BasePath=/tmp/slurm
Dirs=/var/tmp
Shared=true

[mdiab@cl-hpc02 12:40:48 config]$ cat cgroup.conf | grep -v "#"
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

I also have a job_submit.lua, which only sends interactive jobs to the interactive partition silently:

[mdiab@cl-hpc02 12:41:11 config]$ cat job_submit.lua
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Sending all interactive jobs to interactive partition.
    if not job_desc.script then
        job_desc.partition = 'interactive'
    end
    -- Allow the job to proceed
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

Thanks
Motaz
Hi Motaz,

I discussed this ticket with a colleague who thinks this might be related to an issue he has seen with reservations and typed GPUs. You have reproduced this on your production system and on a test system, and both of them define the GPUs with a type (e.g. 'H100-SXM5' on the test cluster). Can you try removing that type from the GPU definition on your test cluster to confirm whether this is the same issue? I'm not suggesting removing the type as the long-term fix, but it will let us confirm the issue.

Thanks,
Ben
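For reference, a minimal sketch of what that test change might look like, based on the gres.conf shown above (this is only illustrative: if the NodeName definitions in your slurm.conf also spell out the type, e.g. Gres=gpu:H100-SXM5:8, they would need a matching edit, the Slurm daemons would likely need to be restarted for a GRES change to take effect, and the test jobs would then request --gres=gpu:1 with no type):

   # gres.conf on the test cluster, with the Type= field removed for this experiment only
   AutoDetect=off
   NodeName=hai00[1-8] Name=gpu File=/dev/nvidia[0-7] Flags=nvidia_gpu_env
   NodeName=hai00[1-8] Name=gpu_memory Count=80G Flags=CountOnly
   NodeName=hai00[1-8] Name=gpu_compute_cap Count=90 Flags=CountOnly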