| Summary: | interactive srun and gpus | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | GPU | Assignee: | Gavin D. Howard <gavin> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 19.05.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | | |
I apologize that it took me this long to get back to you. I wanted to run a few tests to make sure I would give you the right information.

The short answer to your first question is that `--gpus` and `--gres` target slightly different things. `--gpus=<number>` is the total number of GPUs required *for the job*. Similarly, `--gpus-per-node=<number>` is the number of GPUs required per node *for the job*.

This is *not* what `--gres=gpu:<number>` means. `--gres=gpu:<number>` is not the number of GPUs required for the job, nor the index of the GPU to allocate to the job step; it is the number of GPUs to give to the job allocation (if run outside of a job), or the number of GPUs to give to the job step (if run inside of a job).

By default, job steps are given access to *all* GRES allocated for the job. However, when running `srun --gres=gpu:0`, you are giving the job step none of the GRES. As far as I can tell, a job step is created when `srun` is run outside of a job, and it is this step that consumes all of the GRES. That is also why `srun --gres=gpu:0` works inside the job: all of the GRES is taken, but you are saying that the job step needs no GRES, so Slurm runs it.

This is the reason the job waits forever when just using `srun hostname`: the `srun` call inside of the job is waiting on resources, and unfortunately, it will wait forever, since its parent job step holds them, and the parent job step won't exit.

There is a workaround: using `sbatch`. `sbatch` does not allocate a job step by default, or if it does, it does not allocate the GRES resources to that step. In my tests, the following worked:

> sbatch --gpus=1 --wrap="srun hostname"

as well as:

> sbatch --gpus=1 --wrap="srun --gres=gpu:1 hostname"

as well as:

> sbatch --gpus=1 job.sh

where `job.sh` contained the following:

> #! /bin/bash
>
> srun --gres=gpu:1 hostname

`job.sh` also worked with the same `sbatch` call when it contained the following:

> #! /bin/bash
>
> srun hostname

Does this answer your questions?

Hi, Gavin,

I think this does answer my question. I'll pass the info along to my colleague to see if it satisfies his requirements as well.

This came about because some users prefer to use `srun` to run "interactive" jobs rather than `salloc`. Running `salloc` can be confusing for users: when they invoke an `salloc` session, they remain on the node from which they executed the `salloc` and are NOT placed onto a compute node allocated to them, which is what they expect. Since we're in an environment that charges for compute, there was some concern that `salloc` sessions might be invoked and never used (or used incorrectly) because it wasn't made clear to the user that they'd been allocated resources.

If you have any suggestions on how to make `salloc` easier to work with for these users, it'd be most appreciated. I'll check in with my colleague on the answers you've given.

Thanks, as always!

David

Gavin,

Out of curiosity (and after discussing with my colleague): does `SallocDefaultCommand` have any bearing on `srun`? In other words, is invoking `srun` really just "doing `salloc`" behind the scenes?

David

David,

As far as I know, `SallocDefaultCommand` only applies to `salloc`. The reason is that for `srun`, the user must specify the script/command for the job to run; if they do not, `srun` will fail. `SallocDefaultCommand` is meant for the sysadmin to give a default command to run when the user does not give `salloc` a command. If `SallocDefaultCommand` does not exist, the user's shell is run.

So basically, `salloc` sets up an environment in a job allocation where the user can run one command (though the command can be an interactive shell); when the command exits, the job is considered finished and the job allocation is revoked. This means that, for example, if the user does not specify a command, `salloc` will drop them into a shell that, when the user exits, will end the job.
So unless your users stay connected to sessions for long periods, a missing command on `salloc` should not leave rogue job allocations. (It can still happen, but there is a way around it below.)

However, since users can also give a command when running `salloc`, they can just do that, much like with `srun`. As with `srun`, the command can be a script with calls to `srun` inside to create steps and such. In fact, the command can even be `srun` itself, which will behave as if it's inside a job allocation (which it is). This means you can do something like the following:

> salloc --gpus=2 srun hostname

or even something like this:

> salloc --gpus=2 command.sh

where in `command.sh`, you have two `srun` calls creating steps, each of which takes a GPU.

Now, about rogue job allocations. One thing you can do to stop problems with them is set `SallocDefaultCommand` to something like this:

> SallocDefaultCommand=/bin/false

If you want better error messages, you can write a script like the following:

> #! /bin/sh
>
> printf '\n==================\n' 1>&2
> printf 'ERROR: You must specify a command for salloc on this cluster.\n' 1>&2
> printf '\n==================\n' 1>&2
> exit 111

Then set `SallocDefaultCommand` to that script. This will ensure two things:

1. Users will get an error message they can act on.
2. The command will exit immediately, and the job allocation will be revoked, freeing resources.

Does this answer all of your questions?

Gavin,

Thank you for the walkthrough; it's very helpful. My colleague who reported the issue was hoping that setting something like:

```
SallocDefaultCommand="srun --gres=gpus:0 --pty --preserve-env --mpi=none $SHELL"
```

might alleviate the issue he's seeing with his user, but I wasn't sure (hence the question on potential srun/salloc interplay). I'm working to get a better problem statement so I can understand the situation fully and what it is we're working to solve with Slurm itself.
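For reference, here is a minimal sketch of what `command.sh` in the `salloc --gpus=2 command.sh` example above might contain. The step commands themselves are assumptions for illustration; the point is that each `srun` call creates a job step that takes one of the two allocated GPUs:

```
#! /bin/bash
# Hypothetical command.sh, run as: salloc --gpus=2 command.sh
# Each srun call below creates a job step taking one of the two GPUs.
srun --gpus=1 hostname
srun --gpus=1 hostname
```

The steps above run one after another; if the two steps should share the allocation concurrently, background each `srun` with `&` and finish with `wait`.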
Best,

David

David,

Let me investigate that one. It does look promising.

David,
It works, with four caveats.
First, `--preserve-env` is an option that only applies to `srun` when running it in the job allocation context. When the `srun` in the `SallocDefaultCommand` is run, it is run in the job step context, so that option is ignored and the `SLURM_JOB_NODES` and `SLURM_NTASKS` environment variables are unaffected.
Second, you need to add `--gpus=0` to the `srun` call in `SallocDefaultCommand`.
Third, probably the best way to put that command in `slurm.conf` is the following:
> SallocDefaultCommand="srun --gpus=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL"
Fourth, `srun` calls inside the allocation won't work unless the gres exactly matches. Also, remember that the first `srun` call without `--gpus=0` will take *all* of the available GPUs, so all `srun` calls inside the allocation need to include either `--gpus=0` or `--gpus=<whatever_the_step_actually_needs>`.
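To make that last caveat concrete, here is a hypothetical session sketch (the application command is invented for illustration), assuming the `SallocDefaultCommand` above is in place so the shell's own step holds no GPUs:

```
$ salloc --gpus=2
# SallocDefaultCommand drops us into a shell whose step took no GPUs.
$ srun --gpus=0 hostname      # step needs no GPUs; runs immediately
$ srun --gpus=1 ./my_gpu_app  # hypothetical app; step takes one of the two GPUs
```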
Any other questions related to this?
I am closing this bug since there has been no response. Feel free to reopen if you have any more questions related to running `srun` with GPUs.
Created attachment 13643 [details] slurm.conf

Hello,

When attempting to use `srun` interactively with a GPU, I am observing some interesting behavior.

- I can get resources allocated using `srun`, but when attempting to use `srun` within the resulting session (e.g. `srun hostname`), I find that the command hangs:

```
[drhey@gl-build ~]$ srun --gpus=1 --nodes=1 --cpus-per-task=1 --time=10:00 -A hpcstaff -p gpu --pty /bin/bash
srun: job 5264413 queued and waiting for resources
srun: job 5264413 has been allocated resources
[drhey@gl1003 ~]$ srun hostname
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create step for job 5264413: Job/step already completing or completed
```

- Within the same session, I can get `srun --gpus=gpu:0 hostname` to work:

```
[drhey@gl1003 ~]$ srun --gpus=gpu:0 hostname
gl1003.arc-ts.umich.edu
```

- Or, within the same session, I can use `--gres=gpu` to achieve the same result:

```
[drhey@gl1003 ~]$ srun --gres=gpu:0 hostname
gl1003.arc-ts.umich.edu
[drhey@gl1003 ~]$ exit
```

My questions:

* Why does one have to specify these flags (`--gres=gpu`, `--gpus=gpu`) at all after being assigned the requested resources? Why doesn't `srun hostname` work?
* What is the actual intent of `--gpus=gpu:*`?
* `--gpus=gpu:0` seems to be requesting the first GPU (index 0). However, when I attempt `--gpus=gpu:1` (in a session where I've requested 2 GPUs), it also hangs.
* `man srun` implies that `--gpus` is the number of GPUs you need (akin to `--mem` in my mind), so what does `--gpus=gpu:*` change with regard to functionality?

The first bullet is the most important. The second is more for context and better understanding.

Thanks, as always!

David