Ticket 2342 - --gres=gpu option not working properly
Summary: --gres=gpu option not working properly
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-01-13 05:32 MST by Hadrian
Modified: 2016-01-14 05:52 MST
CC: 1 user

See Also:
Site: Case
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
SLURM Config File (4.51 KB, text/plain)
2016-01-13 06:36 MST, Hadrian

Description Hadrian 2016-01-13 05:32:09 MST
The GPU nodes (Tesla M2090) have two GPUs each. However, when one GPU is already allocated, a second job requesting the remaining GPU will not run on that node, even though the node still has enough processors and a free GPU.

Examples:

# Request a gpu from gpu012:
srun --gres=gpu:1 -c 2 -N 1 -p gpufermi --nodelist=gpu012t --pty /bin/bash
[sxg125@gpu012t ~]$ 

# Now, request another gpu from the same node:
srun --gres=gpu:1 -c 2 -N 1 -p gpufermi --nodelist=gpu012t --pty /bin/bash
srun: job 122118 queued and waiting for resources

The job is left waiting for resources even though the node has them available. I would appreciate your help in resolving this issue.

Thank you,

-Sanjaya
(email: sxg125@case.edu)
Comment 1 Tim Wickberg 2016-01-13 06:16:27 MST
Can you please attach your slurm.conf and gres.conf?
Comment 2 Hadrian 2016-01-13 06:36:15 MST
Created attachment 2607 [details]
SLURM Config File

 cat /usr/local/slurm/gres.conf 
NodeName=quad06t  Name=gpu  File=/dev/nvidia[0-1]
NodeName=quad07t  Name=gpu  File=/dev/nvidia[0-1]
NodeName=gpu008t  Name=gpu  File=/dev/nvidia[0-1]
NodeName=gpu009t,gpu010t,gpu011t,gpu013t,gpu014t,gpu016t  Name=gpu  File=/dev/nvidia[0-1]
NodeName=gpu015t  Name=gpu File=/dev/nvidia0
NodeName=gpu025t,gpu026t,gpu027t,gpu028t,gpu029t,gpu030t  Name=gpu  File=/dev/nvidia[0-1]
NodeName=gpu012t  Name=gpu File=/dev/nvidia0 CPUs=0,1
NodeName=gpu012t  Name=gpu File=/dev/nvidia1 CPUs=2,3
Comment 3 Tim Wickberg 2016-01-13 10:03:42 MST
You're going to need to either set DefMemPerCPU, or provide --mem limits during job submission.

Without either of those, Slurm is allocating all of the memory in the node to the first job, and any successive job (regardless of GRES requests) will be stuck waiting for resources.

If you'd rather not track and allocate node memory, you can set SelectTypeParameters to CR_Core instead to disable memory tracking entirely.
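For illustration, a minimal slurm.conf fragment along those lines (a sketch — the 2048 MB default is an assumed value, not taken from your attached config; tune it for your nodes):

```
# slurm.conf (sketch) -- track memory and give every job a default allocation
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # CPUs and memory are both consumable
DefMemPerCPU=2048                     # MB granted per allocated CPU when --mem is omitted
```

With a per-CPU default in place, the first GPU job no longer implicitly claims all of the node's memory, so a second job can be scheduled onto the remaining GPU.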

A few notes from reading through the config - 

- If you're able to drop the trailing 't' from the nodenames, you'd be able to collapse most of the configuration substantially using ranges. Your batch partition could be simplified to:

PartitionName=batch Nodes=comp[001,002,009-016,125-196] Priority=3 Default=YES

Output of sinfo, squeue, and other commands would also be much more readable.

- Partition Priority may not be doing what you expect. It causes the system to schedule jobs separately on different tiers: if any jobs are pending in a higher-priority partition, they will be scheduled ahead of jobs in a lower-priority partition on nodes common to both, regardless of the multifactor priority values.

We expect to change that setting in 16.05 to make this clearer - I'm assuming you're using it here just to change the multifactor weights.

In 15.08 you can influence that by setting the PriorityWeightTRES value instead - http://slurm.schedmd.com/tres.html . You can also use PriorityWeightTRES to favor jobs based on memory or GPU requests. GPU requests in particular seem to be a useful metric for GPU nodes, as you'd likely favor a job that wants just 1 CPU but 2 GPUs over a job asking for 8 CPUs but no GPUs on that same hardware.
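As a sketch of that last point (the weights below are made-up values, purely illustrative):

```
# slurm.conf (sketch) -- weight GPU requests heavily in multifactor priority
PriorityWeightTRES=CPU=1000,Mem=500,GRES/gpu=4000
```

Relative magnitudes are what matter here: with GRES/gpu weighted well above CPU, a 1-CPU/2-GPU job outranks an 8-CPU/0-GPU job on the same nodes.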
Comment 4 Hadrian 2016-01-14 00:52:58 MST
Thank you for your help.

It is working now. Could you please add me to this service request along with Hadrian Djohari? My details are below:

Sanjaya Gajurel
Email: sxg125@case.edu
Computational Scientist
Case Western Reserve University
Comment 5 Tim Wickberg 2016-01-14 03:30:22 MST
Adding Sanjaya Gajurel to the bug. His Bugzilla account has just been created (he'll receive an email about it); he can go through password recovery if he wants to log in and change anything.

Have you had a chance to look at setting DefMemPerCPU yet? I'd like to verify that's the cause of the problem you're having.

cheers,
- Tim
Comment 6 Hadrian 2016-01-14 05:28:05 MST
Yes, the DefMemPerCPU line was commented out. It is working now.

Thank you,

-Sanjaya
Comment 7 Tim Wickberg 2016-01-14 05:52:16 MST
(In reply to Hadrian from comment #6)
> Yes, the DefMemPerCPU line was commented out. It is working now.
> 
> Thank you,
> 
> -Sanjaya

Glad to hear that fixed it. Marking as resolved now.