Ticket 6411 - Unable to correctly handle '--gres gpu:<N>' upon separate SSH / sjoin
Summary: Unable to correctly handle '--gres gpu:<N>' upon separate SSH / sjoin
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other (show other tickets)
Version: 23.11.4
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 6538 (view as ticket list)
Depends on:
Blocks:
 
Reported: 2019-01-27 15:29 MST by Sebastien Varrette
Modified: 2024-12-04 21:40 MST (History)
5 users (show)

See Also:
Site: University of Luxembourg
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: Ubuntu
Machine Name: cluster
CLE Version:
Version Fixed: 19.05.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Generic resource management configuration (973 bytes, text/plain)
2019-01-27 15:29 MST, Sebastien Varrette
Details
cgroup support configuration file (2.48 KB, text/plain)
2019-01-27 15:30 MST, Sebastien Varrette
Details
cgroup allowed devices (1.06 KB, text/plain)
2019-01-27 15:31 MST, Sebastien Varrette
Details

Description Sebastien Varrette 2019-01-27 15:29:41 MST
Created attachment 9017 [details]
Generic resource management configuration

Hi, 

We have recently added 18 new GPU nodes (2 CPU sockets of 14 cores each, 4 Nvidia V100 cards) to our `iris` cluster and aligned both the generic resource management configuration and the general configuration. For example:

```
$> grep -i GRES /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=iris-[169-186] CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=772614 Feature=skylake,volta Gres=gpu:volta:4 State=UNKNOWN

$> cat /etc/slurm/gres.conf
# COMPUTE NODES WITH 4xVOLTA V100 32GB SXM2
NodeName=iris-[169-186] Name=gpu Type=volta File=/dev/nvidia[0-3]
```

Things are running fine except when attempting to join / connect to a running job involving GRES gpu reservations. 

For instance, assuming an interactive job running on a GPU node and reserving at least one of the GPU cards (in the example below: 2), we can see that the number of allocated cards is indeed restricted:

```
(access) $> srun -p gpu -N 1 --ntasks-per-node 2 -c 14 --gres gpu:2 --pty bash -i
(gpunode) (249438 1N/2T/28CN) $> echo $CUDA_VISIBLE_DEVICES
0,1
(gpunode) (249438 1N/2T/28CN) $> nvidia-smi
Sun Jan 27 22:42:52 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0    45W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Now, when trying to access the reserved node from a separate terminal using either `srun --jobid [...]` or a direct SSH, these reservation constraints are no longer effective or even usable.

More specifically: 

* Direct SSH does not restrict GPU access -- could it be a problem with pam_slurm_adopt?

```
(access) $> squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            249438       gpu     bash svarrett  R       0:08      1 iris-183
(access) $> ssh iris-183
( iris-183) $> nvidia-smi        # show access to ALL 4 GPUs ???? 
Sun Jan 27 22:43:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   37C    P0    45W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:1D:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:1E:00.0 Off |                    0 |
| N/A   38C    P0    43W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

* Access with `sjoin $JOBID` (defined below; equivalent to `srun --jobid $JOBID --pty bash -i`) to initiate a job step under an already allocated job with job id $JOBID does not work at all (unlike for jobs reserved without a generic resource):

```
(access) $> srun -v --jobid 249438 --pty bash -i
[...]
srun: remote command    : `bash -i'
srun: jobid 249438: nodes(1):`iris-183', cpu counts: 28(x1)        # no GRES mentioned?

srun: Job 249438 step creation temporarily disabled, retrying      # message appears after some timeout
srun: Job 249438 step creation still disabled, retrying
srun: Job 249438 step creation still disabled, retrying
srun: Job 249438 step creation still disabled, retrying
srun: error: Unable to create step for job 249438: Job/step already completing or completed
```

Could it be linked to the absence of `/dev/nvidia*` within the `cgroup_allowed_devices_file.conf` (we have set `ConstrainDevices=yes` in our cgroup configuration)?
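For reference, the relevant lines of our cgroup support configuration are roughly as follows -- this is only a sketch, the attached configuration file has the exact values:

```
# /etc/slurm/cgroup.conf (sketch -- see attached file for exact contents)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```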

We have also noticed a potential inconsistency that might be related to this problem: the allocated GRES stored in the Slurm database (see `AllocGRES` below vs. `ReqGRES`) is not consistent -- it might relate to Bug #6366 (AllocGRES id recorded as 7696487).
For instance, for the above-mentioned job:

```
$> sacct -j 249438 --format User,JobID,Jobname,partition,state,time,start,elapsed,ReqGRES,AllocGRES
     User        JobID    JobName  Partition      State  Timelimit               Start    Elapsed      ReqGRES    AllocGRES
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------ ------------
svarrette 249438             bash        gpu    RUNNING   02:00:00 2019-01-27T22:41:58   00:07:07        gpu:2    7696487:2
          249438.exte+     extern               RUNNING            2019-01-27T22:41:58   00:07:07        gpu:2    7696487:2
          249438.0           bash               RUNNING            2019-01-27T22:41:58   00:07:07        gpu:2    7696487:2
```

_Note_: we have the following definition for the `sjoin` utility mentioned above:

```
sjoin(){
        if [[ -z $1 ]]; then
                echo "Job ID not given."
        else
                local JOBID=$1 NODE=
                [[ -n $2 ]] && NODE="-w $2"
                # NODE is intentionally left unquoted so "-w <node>" expands to two arguments
                srun --jobid "$JOBID" $NODE --pty bash -i
        fi
}
```
Comment 1 Sebastien Varrette 2019-01-27 15:30:47 MST
Created attachment 9018 [details]
cgroup support configuration file
Comment 2 Sebastien Varrette 2019-01-27 15:31:19 MST
Created attachment 9019 [details]
cgroup allowed devices
Comment 3 Dominik Bartkiewicz 2019-01-29 08:08:30 MST
Hi

pam_slurm_adopt doesn't set CUDA_VISIBLE_DEVICES.
In all cases we recommend enforcement based on cgroups; relying only on environment variables is not safe.
Could you send me content of:

/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<job_id>/devices.list
/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<job_id>/step_0/devices.list
/sys/fs/cgroup/devices/slurm/uid_<uid>/job_<job_id>/step_extern/devices.list

Comment 4 Sebastien Varrette 2019-01-29 12:38:43 MST
Here is the content for the jobs with `--gres gpu:2`:

```
[root@iris-183 ~]# cat /sys/fs/cgroup/devices/slurm/uid_${uid}/job_${job_id}/devices.list
a *:* rwm

[root@iris-183 ~]# cat /sys/fs/cgroup/devices/slurm/uid_${uid}/job_${job_id}/step_0/devices.list
a *:* rwm

[root@iris-183 ~]# cat /sys/fs/cgroup/devices/slurm/uid_${uid}/job_${job_id}/step_extern/devices.list
a *:* rwm
```
Comment 5 Dominik Bartkiewicz 2019-01-30 06:25:44 MST
Hi

Could you check in which cgroup your process is after using ssh to the node with an existing job?

cat /proc/self/cgroup

My tests of cgroup devices and pam_slurm_adopt show that both work fine, even if devices.list contains "a *:* rwm" (after this commit the values inside this file can be incomplete: https://git.sphere.ly/santhosh/kernel_cyanogen_msm8916/commit/ad676077a2ae4af4bb6627486ce19ccce04f1efe).
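One way to verify the effective constraint independently of devices.list is to probe the device nodes directly, since the cgroup-v1 devices controller mediates open(). A quick sketch (device paths taken from your gres.conf):

```shell
# Open (and immediately close) a device node for reading; a GPU denied by
# the devices cgroup fails at open() with EPERM even when devices.list
# still reads "a *:* rwm". (A missing node also reports "open denied".)
check_dev() {
    if { : < "$1"; } 2>/dev/null; then
        echo "$1: open allowed"
    else
        echo "$1: open denied"
    fi
}

for d in /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3; do
    check_dev "$d"
done
```

Run inside the job step and inside the ssh session: the allowed/denied pattern should match the allocation if the constraint is effective.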


Dominik
Comment 6 Sebastien Varrette 2019-01-30 13:43:49 MST
From within a separate SSH connection to the reserved node: 

```
$> cat /proc/self/cgroup
22:hugetlb:/
21:memory:/
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252130/step_extern
18:perf_event:/
17:blkio:/
16:pids:/user.slice
15:cpuacct,cpu:/
14:devices:/
13:cpuset:/slurm/uid_5000/job_252130/step_extern
1:name=systemd:/user.slice/user-5000.slice/session-23452.scope
```

For the parent ssh process itself:

```
# parent pid (third field) extracted from the PID of the current shell (`$$`)
$> cat /proc/$(ps --no-headers -fp $$ | awk '{print $3}')/cgroup
22:hugetlb:/
21:memory:/
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252130/step_extern
18:perf_event:/
17:blkio:/
16:pids:/user.slice
15:cpuacct,cpu:/
14:devices:/
13:cpuset:/slurm/uid_5000/job_252130/step_extern
1:name=systemd:/user.slice/user-5000.slice/session-23452.scope
```

As a comparison, here is the content of /proc/self/cgroup from within the native job:

```
$> cat /proc/self/cgroup
22:hugetlb:/
21:memory:/slurm/uid_5000/job_252130/step_0/task_0
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252130/step_0
18:perf_event:/
17:blkio:/
16:pids:/system.slice/slurmd.service
15:cpuacct,cpu:/slurm/uid_5000/job_252130/step_0/task_0
14:devices:/slurm/uid_5000/job_252130/step_0
13:cpuset:/slurm/uid_5000/job_252130/step_0
1:name=systemd:/system.slice/slurmd.service
```
Comment 7 Dominik Bartkiewicz 2019-01-31 06:28:24 MST
Hi
Could you send me /etc/pam.d/sshd and all included files?
Maybe the problem is somewhere in the pam_adopt config.
Dominik
Comment 8 Sebastien Varrette 2019-01-31 06:38:55 MST
Sure.

```
$> pdsh -g gpu "rpm -qa | grep slurm-pam" | dshbak -c
----------------
iris-[169-186]
----------------
slurm-pam_slurm-17.11.12-1.el7.x86_64
```

```
$> pdsh -g gpu "rpm -ql slurm-pam_slurm" | dshbak -c
----------------
iris-[169-186]
----------------
/lib64/security/pam_slurm.so
/lib64/security/pam_slurm_adopt.so
```

```
$> pdsh -g gpu "cat /etc/pam.d/sshd" | dshbak -c
----------------
iris-[169-186]
----------------
account    required     pam_slurm_adopt.so action_adopt_failure=deny action_generic_failure=deny
account    sufficient   pam_access.so
#%PAM-1.0
auth	   required	pam_sepermit.so
auth       substack     password-auth
auth       include      postlogin
# Used with polkit to reauthorize users in remote sessions
-auth      optional     pam_reauthorize.so prepare
account    required     pam_nologin.so
account    include      password-auth
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    include      password-auth
session    include      postlogin
# Used with polkit to reauthorize users in remote sessions
-session   optional     pam_reauthorize.so prepare

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
```

Is there any specific file you want?
Comment 9 Dominik Bartkiewicz 2019-01-31 06:50:28 MST
Hi

/etc/pam.d/password-auth will be enough
Comment 10 Sebastien Varrette 2019-01-31 08:13:51 MST
```
$> pdsh -g gpu "cat /etc/pam.d/password-auth" | dshbak -c
----------------
iris-[169-186]
----------------
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so nullok try_first_pass
auth        requisite     pam_succeed_if.so uid >= 1000 quiet_success
auth        sufficient    pam_ldap.so use_first_pass
auth        required      pam_deny.so

account     required      pam_unix.so broken_shadow
account     sufficient    pam_localuser.so
account     sufficient    pam_succeed_if.so uid < 1000 quiet
account     [default=bad success=ok user_unknown=ignore] pam_ldap.so
account     required      pam_permit.so

password    requisite     pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type=
password    sufficient    pam_unix.so sha512 shadow nullok try_first_pass use_authtok
password    sufficient    pam_ldap.so use_authtok
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
-session     optional      pam_systemd.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so
session     optional      pam_ldap.so
```
Comment 11 Dominik Bartkiewicz 2019-01-31 10:09:17 MST
Hi

Could you try stopping systemd-logind and check whether, after that, the ssh
process is properly attached to the cgroups?

systemctl stop systemd-logind
systemctl mask systemd-logind


Dominik
Comment 12 Sebastien Varrette 2019-01-31 11:09:35 MST
So, assuming I did it in the correct order: 

1. create a reservation to restrict testing to a single node 

     $> scontrol create res Reservation=slurmbug6411 StartTime=2019-01-31T18:55:00 Duration=1:00:00 Flags=Maint,Ignore_Jobs PartitionName=gpu Accounts=ulhpc Nodes=iris-181

2. make an interactive job with `--gres gpu:2` as before (job id: 252576)

     $> srun --reservation slurmbug6411 -p gpu --qos qos-gpu -N 1 --ntasks-per-node 2 -c 14 --gres gpu:2 --pty bash -i

3. check a separate SSH session for the proper cgroup 

     $> ssh iris-181
     $> cat /proc/$(ps --no-headers -fp $$ | awk '{print $3}')/cgroup
     22:hugetlb:/
     21:memory:/
     20:net_prio,net_cls:/
     19:freezer:/slurm/uid_5000/job_252576/step_extern
     18:perf_event:/
     17:blkio:/
     16:pids:/user.slice
     15:cpuacct,cpu:/
     14:devices:/
     13:cpuset:/slurm/uid_5000/job_252576/step_extern
     1:name=systemd:/user.slice/user-5000.slice/session-24400.scope

4. logout from the separate SSH
5. log in as root to the node and stop `systemd-logind` as indicated 

     $> ssh iris-181
     $> systemctl stop systemd-logind.service
     $> systemctl mask systemd-logind.service
     Created symlink from /etc/systemd/system/systemd-logind.service to /dev/null.

6. log in again as a regular user from the frontend with a separate SSH to the reserved node and check for the proper cgroup.

Step 6 is depicted below and seems ok: 

```
$> ssh iris-181
~> cat /proc/$(ps --no-headers -fp $$ | awk '{print $3}')/cgroup
22:hugetlb:/
21:memory:/slurm/uid_5000/job_252576/step_extern/task_0
20:net_prio,net_cls:/
19:freezer:/slurm/uid_5000/job_252576/step_extern
18:perf_event:/
17:blkio:/
16:pids:/system.slice/sshd.service
15:cpuacct,cpu:/slurm/uid_5000/job_252576/step_extern/task_0
14:devices:/slurm/uid_5000/job_252576/step_extern
13:cpuset:/slurm/uid_5000/job_252576/step_extern
1:name=systemd:/system.slice/sshd.service


~> systemctl status systemd-logind.service
● systemd-logind.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead) since Thu 2019-01-31 18:57:18 CET; 6min ago
 Main PID: 41010 (code=killed, signal=TERM)
   Status: "Processing requests..."
```

The same happens if (while leaving the `systemd-logind` service masked) I repeat the procedure from step 2 after having killed job 252576.
Comment 13 Dominik Bartkiewicz 2019-02-01 03:39:37 MST
Hi

You can also comment out this line in /etc/pam.d/password-auth:
-session     optional      pam_systemd.so

I assume that after this, nvidia-smi should work fine in an ssh session.

I will check if it is possible to have the slurm adopt plugin set the CUDA_VISIBLE_DEVICES environment variable. But as I mentioned before, relying only on environment variables is not safe.


Dominik
Comment 14 Sebastien Varrette 2019-02-04 08:21:28 MST
Dear Dominik, 

We can confirm this seems to solve the separate SSH case (the nvidia-smi command is restricted to the cards sharing the same bus IDs as in the initial reservation). 
Note that the sjoin case remains unsolved. 

However, this fix looks like it could have a huge side effect, judging from the description of the service: https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html 

Are you sure this is recommended? How do other centers deal with this issue?
Comment 15 Dominik Bartkiewicz 2019-02-04 09:03:58 MST
Hi

Yes, this is recommended and required for pam_slurm_adopt to work properly.
Check the documentation:
https://slurm.schedmd.com/pam_slurm_adopt.html

Some functions of logind are handled by Slurm; others, like polkit, are not commonly used on clusters.

Dominik
Comment 16 Sebastien Varrette 2019-02-04 12:59:48 MST
Indeed, thanks for the reference; it looks like we missed these guidelines. 
We will deploy this across the cluster. 

That just leaves the sjoin issue.
Comment 17 Dominik Bartkiewicz 2019-02-04 14:04:41 MST
Hi

By default, a job step tries to allocate all of the generic resources that have been allocated to the job, 
and you already have one step which holds all the GRES.
The next step therefore waits for resources. You can add "--gres=gpu:0" or "--gres=none";
then this step should start, but you won't have access to the GPUs.
If you describe your needs better, maybe I will be able to suggest a solution.
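For example, a variant of the sjoin helper from the report with this workaround applied (a sketch; the function name is hypothetical):

```shell
# Like sjoin, but requests zero GPUs for the new step so it does not wait
# on GRES already held by step 0. The joined shell will NOT see the GPUs.
sjoin_gres(){
        if [[ -z $1 ]]; then
                echo "Job ID not given."
                return 1
        fi
        local JOBID=$1 NODE=
        [[ -n $2 ]] && NODE="-w $2"
        # NODE left unquoted so "-w <node>" expands to two arguments
        srun --jobid "$JOBID" $NODE --gres=gpu:0 --pty bash -i
}
```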

Dominik
Comment 18 Sebastien Varrette 2019-02-04 14:51:44 MST
Traditionally on our site (and hopefully on others), people use `sjoin $jobid` / `srun --jobid [...]` to be able to: 

1. connect to the job with the correct SLURM_* variables set, as in the job itself (they are not set in a regular ssh session)
2. monitor the running job (with `htop`, `nvidia-smi`, `nvtop`, etc.) and possibly perform complementary tests based on the SLURM_* variables
3. (in a very few cases) run an additional job step within the existing job allocation

It is quite important to keep that workflow for GPU nodes as well.
Comment 21 Hyacinthe Cartiaux 2019-02-08 05:09:01 MST
Hi,

I think there's a side effect when disabling pam_systemd.so: it is pam_systemd.so that creates the directory $XDG_RUNTIME_DIR (/run/user/$UID).

After disabling pam_systemd.so, the directory is no longer created, but the variable is still set (I think it is passed from the cluster frontend server), which causes some user applications to fail (unless XDG_RUNTIME_DIR is explicitly unset).

Have you also noticed this issue ? Do you have a proper solution ?

Thank you
Comment 22 Dominik Bartkiewicz 2019-02-11 04:53:08 MST
Hi

You can unset XDG_* using the task prolog (https://slurm.schedmd.com/prolog_epilog.html).
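A minimal sketch of such a task prolog (the script path and the exact variable list are up to you; slurmstepd interprets "unset NAME" lines on the script's stdout and removes those variables from the task environment):

```shell
#!/bin/sh
# TaskProlog script, referenced via TaskProlog= in slurm.conf (path is an
# example). Each "unset NAME" line printed here removes NAME from the
# environment of the task about to start.
echo "unset XDG_RUNTIME_DIR"
echo "unset XDG_SESSION_ID"
```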

Answering the previous questions:
Currently, there is no way to over-allocate GRES; that means you can't create (e.g. via sjoin) a step with access to GPUs that are bound to another step. To keep sjoin working with a GRES job you can add "--gres=none".

Currently, I am working on adding some Slurm environment variables to the ssh session, but I think this will only be available in 19.05.

Dominik
Comment 24 Dominik Bartkiewicz 2019-02-28 03:34:15 MST
Hi

I want to inform you that we are handling the XDG_* problem in a separate bug.

In bug 5920 we are considering splitting the pam_adopt module into two contexts.
This will allow us to set the right cgroups after pam_systemd and avoid overwriting the cgroup.

Dominik
Comment 29 Albert Gil 2019-03-05 11:32:55 MST
*** Ticket 6538 has been marked as a duplicate of this ticket. ***
Comment 31 Dominik Bartkiewicz 2019-03-19 06:49:48 MDT
Hi

I will change this bug to an enhancement.
We are working on adding some environment variables to the ssh session, but this will not be done before 19.05.

Dominik
Comment 33 Sebastien Varrette 2020-03-17 04:47:22 MDT
Any update for the planned 19.06 release ?
Comment 34 Dominik Bartkiewicz 2020-03-17 05:17:13 MDT
Hi

Sorry I didn't inform you earlier.
This commit injects the SLURM_JOB_ID environment variable into adopted processes:
https://github.com/SchedMD/slurm/commit/65fb9dfa10a8763d7

The patch is included in all 19.05 and 20.02 releases.
Currently, we don't plan to add more environment variables in PAM, but SLURM_JOB_ID should be sufficient to properly handle srun after ssh-ing to a node.
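For example, from inside an adopted ssh session the shell can launch a step against its own job without looking up the job id (a sketch, combining this with the "--gres=gpu:0" workaround from comment 17):

```shell
# Inside an ssh session adopted by pam_slurm_adopt (>= 19.05), SLURM_JOB_ID
# is injected into the environment; guard on it before launching a step.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    srun --jobid "$SLURM_JOB_ID" --gres=gpu:0 --pty bash -i
else
    echo "Not inside an adopted Slurm session." >&2
fi
```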

Dominik
Comment 35 Sebastien Varrette 2021-02-08 06:03:01 MST
Several users would like access to at least the nvidia-smi utility when joining their running job. Is there any further way to authorize this?
Comment 36 Valentin Plugaru 2021-02-08 06:03:10 MST
This email address is no longer active. Please use the email address valentin.plugaru@gmail.com for future correspondence.

Your message is not forwarded automatically.

The IT service at the University of Luxembourg
Comment 37 Sebastien Varrette 2021-02-14 15:25:28 MST
More specifically, ssh grants access and visibility to all GPU cards. Is there any way to limit it?
Comment 38 Dominik Bartkiewicz 2021-02-15 05:14:18 MST
Hi

Currently, we have no plan to change the way we handle ssh connections.
pam_slurm_adopt already binds the ssh connection to the extern step's cgroups.
If you use "ConstrainDevices=yes", this should limit access to the GPUs.

Dominik
Comment 39 karansing.g 2024-12-04 21:39:41 MST
We have now configured a high-priority QOS with priority 500, and slurm.conf with the following: 

SchedulerType=sched/backfill
UnkillableStepTimeout=180
PriorityType=priority/basic
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=180

and qos details:
    Name    Preempt PreemptMode   Priority
---------- ---------- ----------- ----------
    normal            gang,suspe+        100
datascien+                cluster        100
highprior+     normal gang,suspe+        500

Current Issue:

When we submit a job under the normal QOS and then a job under the highpriority QOS, the result is time-slicing between the two jobs. However, we need the highpriority job to run to completion first; only then should the normal job resume.
This configuration currently works only for CPU-based jobs.

Requirement: How can we extend this configuration to GPU-based jobs, ensuring the same behavior (high-priority jobs preempt normal jobs, which resume after the high-priority jobs complete)?