Ticket 13753

Summary: multi-gpu job runs even when node lacks sufficient free gpus
Product: Slurm    Reporter: Mark Allen <racsadmin>
Component: slurmctld    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN    QA Contact: Ben Roberts <ben>
Severity: 3 - Medium Impact    
Priority: ---    
Version: 21.08.5   
Hardware: Linux   
OS: Linux   
Site: University of Oregon
Attachments: v1

Description Mark Allen 2022-04-01 16:03:20 MDT
A job with "--gpus=2" will schedule and run on a node with only one free GPU.  The environment of the job is just as if it were run with "--gpus=1".  The behavior is as if slurmd is presented with a request for two GPUs, has only one, and decides "Gosh, I guess I'll just do my best."

We also are seeing a lot of messages like

    error: gres/gpu: job 18123454 dealloc node n110 type k80-12g gres count underflow (0 1)

which suggests that SLURM is somehow losing track of the GPU count (or free count).

We don't think we've started slurmctld/slurmd with inconsistent gres.conf/slurm.conf, though we're not certain.

As a detail, we've not seen a job with "--gpus=1" scheduling on a host with no free GPUs.  That could be happening, too, though.

There are no errors in the logs that seem related.  The only thing that seems plausible is this from a slurmd log

    error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1345/job_18142814/devices.allow: Invalid argument

Worrying, but not obviously related.

There seem to be two problems here.  First, slurmd, when presented with a job that it doesn't have the resources to run, should be throwing an epic tantrum, downing the node and spewing plenty of descriptive error messages.  This should happen regardless of conf file consistency.

Second, slurmctld should not be scheduling jobs on nodes that obviously (per squeue, sinfo, scontrol show) do not have the available resources to run the job.

Thanks,
Mike
Comment 1 Marcin Stolarek 2022-04-05 01:02:53 MDT
>A job with "--gpus=2" will schedule and run on a node with only one free GPU.  
This is for sure neither expected nor easy to reproduce in the general case. Were those single-node jobs? Just to be sure we're on the same page: --gpus is a per-job requested resource, so it can result in a job getting two nodes, each with one GPU. Could that be what happened here?

>error: gres/gpu: job 18123454 dealloc node n110 type k80-12g gres count underflow (0 1)
Could you please share full slurmd/slurmctld logs from that time span? Is it possible that the number of devices configured per node was changed in the relevant period?

>[..]gres.conf/slurm.conf, though we're not certain.
Could you please attach those as well? Are you able to check the job script/options the job was submitted with?

cheers,
Marcin
Comment 2 Mark Allen 2022-04-05 13:54:21 MDT
[adding this here, as email reply didn't work]

Hi Marcin,

Thanks--that was very helpful.  As you guessed, the issue here is that SLURM decided to use two nodes, one task on each, and each task got one GPU.  I cannot fathom why, nor does any of the documentation on the 'sbatch' man page suggest this could happen.

See below for the command used.

As far as I've noticed, we never saw behavior like this when using "--gres=gpu:2".  Obviously the new flags could work differently, but the doc doesn't really suggest that this behavior is possible.

If this is the intended behavior, what is the best way to get our desired behavior?  I suppose we could explicitly add "--nodes=1" or "--ntasks=1", though I would have thought both would be the implicit default.  Or perhaps there's a better way?

Mike


mcolema5@talapas-ln2 /projects/hpcrcf/mcolema5/alphafold 115$ srun --account=hpcrcf --partition=longgpu --cpus-per-task=8 --mem=32G --gpus=2 --pty bash
srun: job 18146045 queued and waiting for resources
srun: job 18146045 has been allocated resources


# sacct -p -j 18146045 --format=submitline
SubmitLine|
srun --account=hpcrcf --partition=longgpu --cpus-per-task=8 --mem=32G --gpus=2 --pty bash|
|
srun --account=hpcrcf --partition=longgpu --cpus-per-task=8 --mem=32G --gpus=2 --pty bash|


# showjob 18146045
JobId= 18146045
Partition= longgpu
  Account= hpcrcf UserId=mcolema5 (Michael Coleman) JobName=bash

NumNodes= 2 NumCPUs=16 NumTasks=2 Tasks/Node=0 CPUs/Task=8
TRES= cpu=16  mem=64G  node=2  billing=32  gres/gpu=2
MinCPUsNode=8 MinMemoryNode=32G

BatchHost= n101 NodeList=n[101-102]

SubmitTime= 2022-04-01T11:59:31
 StartTime= 2022-04-01T11:59:35
   EndTime= 2022-04-01T12:02:48
   RunTime=            00:03:13
 TimeLimit=         14-00:00:00

WorkDir= /gpfs/projects/hpcrcf/mcolema5/alphafold
Command= bash



JobState= COMPLETING Reason=None ExitCode=0:0
Comment 3 Marcin Stolarek 2022-04-06 01:30:50 MDT
> I cannot fathom why, nor does any of the documentation on the 'sbatch' man page suggest this could happen.
I can't agree; per the sbatch manual: "-G, --gpus=[type:]<number> Specify the total number of GPUs required for the job."[1]. If you need two GPUs on one node, -N1 should be added to the job spec.

>As far as I've noticed, we never saw behavior like this when using "--gres=gpu:2".
Correct, --gres has a per-node meaning; per man sbatch, "The specified resources will be allocated to the job on each node."[2]

>I suppose we could explicitly add "--nodes=1" or "--ntasks=1", though I would have thought both would be the implicit default.
If you need a specific number of nodes or tasks, the best approach is to specify them explicitly. The logic behind a bare -G (--gpus) request is that the user has said they need two GPUs to run the job (for instance, to complete it within a given or default time limit) and that their job script can handle a multi-node allocation (think of a pure GPU job with RDMA GPU-to-GPU communication, where you may have X GPUs across Y nodes and the number of nodes doesn't really matter). Slurm prefers to put such a job on one node, but if it can start earlier on two nodes, those resources are allocated. If the number of tasks wasn't specified, the default is one task per node, although remember that sbatch doesn't really launch tasks[3].
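To illustrate the distinction between the GPU flags discussed here (these are sketches of the flag semantics, with account/partition options omitted for brevity):

```shell
# Per-job total: 2 GPUs anywhere, possibly spread across nodes.
srun --gpus=2 --pty bash

# Per-job total, pinned to a single node.
srun --gpus=2 -N1 --pty bash

# Per-node count: 2 GPUs on *each* allocated node.
srun --gres=gpu:2 --pty bash

# Per-task count: one task with both GPUs bound to it.
srun --ntasks=1 --gpus-per-task=2 --pty bash
```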

Are you continuously seeing errors like the one below?
>error: gres/gpu: job 18123454 dealloc node n110 type k80-12g gres count underflow (0 1)
There are some edge cases where this can happen, for instance when the number of GPUs available on a node is reduced, or when a File= option is added to or removed from gres.conf. But if it appears continuously in the log, please share full slurmd/slurmctld logs with us.

cheers,
Marcin 

[1]https://slurm.schedmd.com/sbatch.html#OPT_gpus
[2]https://slurm.schedmd.com/sbatch.html#OPT_gres
[3]https://slurm.schedmd.com/sbatch.html#OPT_ntasks
Comment 4 Mark Allen 2022-04-06 18:32:41 MDT
(In reply to Marcin Stolarek from comment #3)

I'd love to see a bit more on this on the man page.  Even your comment "If you need two GPUs on one node -N1 should be added to the job spec." would probably have saved us quite a few hours.

Playing around a bit, I see that "--gpus=3" (with no other flags) can generate one, two, or three tasks, depending on happenstance.  I don't think most users would easily infer that from the man page.
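For what it's worth, the way I've been checking this (a rough sketch; the --wrap payload is just a placeholder):

```shell
# Submit with only a GPU count and see what the scheduler chose.
jobid=$(sbatch --parsable --gpus=3 --wrap='srun hostname')
scontrol show job "$jobid" | grep -E 'NumNodes|NumTasks'
```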

> remember that sbatch doesn't really launch tasks[3].
No, I didn't know that.  It's a bit annoying that it doesn't immediately say what _does_ launch tasks.  Maybe 'srun'?  But my 'sbatch' script doesn't invoke 'srun'.  I think this all cries out for more documentation.

> Are you continuously seeing errors like the one below?
> >error: gres/gpu: job 18123454 dealloc node n110 type k80-12g gres count underflow (0 1)
We were, but we haven't seen one since Mar 27.  Since we upgraded SLURM and rebooted shortly before that, I'm inclined to hold off for now.  I don't feel confident that our restart was exactly correct (e.g., that all conf files were identical on all hosts).  If we start seeing it again, we'll file another ticket.

Thanks for your help.

Mike
Comment 5 Marcin Stolarek 2022-04-07 07:35:33 MDT
>I'd love to see a bit more on this on the man page.  Even your comment "If you need two GPUs on one node -N1 should be added to the job spec." would probably have saved us quite a few hours.

The other way, as you know, is to use the --gres=gpu: or --gpus-per-node options (the latter is difficult to miss when you read the --gpus documentation, since it's right next to it). Can I ask how you would modify the doc with the knowledge you have now?

cheers,
Marcin
Comment 6 Mark Allen 2022-04-07 14:03:29 MDT
(In reply to Marcin Stolarek from comment #5)
> Can I ask you to how would you modify the doc with the
> knowledge you have now?

As a start, I'd mention in '--gpus' that, for example, "--gpus=3" might be allocated across up to three hosts, and as a consequence the job might have as few as one task and as many as three.  Or put another way, the '--gpus' flag can by itself create multi-task jobs, even if the user has not requested multiple tasks in any other way.

It might also be useful to update this sentence "If -N is not specified, the default behavior is to allocate enough nodes to satisfy the requirements of the -n and -c options." to clarify that '--gpus' can also affect this.

Thanks,
Mike
Comment 7 Marcin Stolarek 2022-04-08 01:55:11 MDT
Created attachment 24325 [details]
v1

Will the patch in the attachment make it more clear for you?

cheers,
Marcin
Comment 8 Mark Allen 2022-04-08 11:32:16 MDT
Being honest, it's a slight improvement, but probably would not have been enough to trigger a lightbulb over my head.

The idea that this is a new way to implicitly create multiple tasks is surprising to me.  If you had asked me how I thought that should happen, I likely would have said: create your tasks, and then use the --gpus-per-task flag if you need each task to have more than one GPU.
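In other words, something like this (a sketch of my mental model, not a tested recipe):

```shell
# Create the tasks explicitly, then attach GPUs to each task.
srun --ntasks=2 --gpus-per-task=1 --pty bash   # 2 tasks, 1 GPU each
srun --ntasks=1 --gpus-per-task=2 --pty bash   # 1 task, 2 GPUs
```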

In any case, feel free to close the ticket.  Our use cases are relatively simple, and I think I understand things well enough.

Thank you,
Mike
Comment 15 Marcin Stolarek 2022-04-14 01:51:43 MDT
Mike,

We decided to merge a slightly different documentation change[1].

I'm closing the case now as information given. Should you have any questions, please don't hesitate to reach out to us.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/c12f54e84f0ad6b0207822ef063df3ea0400357f