| Summary: | --gres=gpu option not working properly | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Hadrian <hxd58> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | sxg125 |
| Version: | 14.11.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Case | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | SLURM Config File | | |
Description

Hadrian 2016-01-13 05:32:09 MST

Can you please attach your slurm.conf and gres.conf?

Created attachment 2607 [details]
SLURM Config File

cat /usr/local/slurm/gres.conf
NodeName=quad06t Name=gpu File=/dev/nvidia[0-1]
NodeName=quad07t Name=gpu File=/dev/nvidia[0-1]
NodeName=gpu008t Name=gpu File=/dev/nvidia[0-1]
NodeName=gpu009t,gpu010t,gpu011t,gpu013t,gpu014t,gpu016t Name=gpu File=/dev/nvidia[0-1]
NodeName=gpu015t Name=gpu File=/dev/nvidia0
NodeName=gpu025t,gpu026t,gpu027t,gpu028t,gpu029t,gpu030t Name=gpu File=/dev/nvidia[0-1]
NodeName=gpu012t Name=gpu File=/dev/nvidia0 CPUs=0,1
NodeName=gpu012t Name=gpu File=/dev/nvidia1 CPUs=2,3
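With a gres.conf like the above, each job still has to state what it needs at submission time. A minimal sketch of such a submission (the script name and memory size are illustrative assumptions, not from this ticket):

sbatch --gres=gpu:1 --mem=8G job.sh

Requesting an explicit --mem limit matters here: without it, and without a DefMemPerCPU default, the first job can be allocated all of the node's memory, blocking later jobs.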
You're going to need to either set DefMemPerCPU, or provide --mem limits during job submission. Without either of those, Slurm is allocating all of the memory in the node to the first job, and any successive job (regardless of GRES requests) will be stuck waiting for resources. If you'd rather not track and allocate node memory, you can change SelectTypeParameters to CR_Core instead to disable it entirely.

A few notes from reading through the config:

- If you're able to drop the trailing 't' from the node names, you'd be able to collapse most of the configuration substantially using ranges. Your batch partition could be simplified to:

PartitionName=batch Nodes=comp[001,002,009-016,125-196] Priority=3 Default=YES

The output of sinfo, squeue, and other command-line tools would also be much more readable.

- Partition Priority may not be doing what you expect. It causes the system to schedule jobs separately on different tiers: if any jobs are in a higher-priority partition, they will schedule ahead of jobs in a lower-priority partition on nodes common to both, regardless of the multifactor priority values. We expect to change that setting in 16.05 to make this clearer. I'm assuming you're using it here just to change the multifactor weights; in 15.08 you can get that effect by setting the PriorityWeightTRES value instead - http://slurm.schedmd.com/tres.html . You can also use PriorityWeightTRES to favor jobs based on memory or GPU requests. GPU requests in particular seem to be a useful metric for GPU nodes, as you'd likely favor a job that wants just 1 CPU but 2 GPUs over a job asking for 8 CPUs but no GPUs on the same hardware.

Thank you for your help. It is working now. Could you please add me to this service request along with Hadrian Djohari? My details are below:

Sanjaya Gajurel
Email: sxg125@case.edu
Computational Scientist
Case Western Reserve University

Adding Sanjaya Gajurel to the bug.
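A minimal slurm.conf sketch of the settings discussed above (the numeric values are illustrative assumptions, not taken from the attached config):

# Default memory per allocated CPU, in MB; jobs that omit --mem
# no longer claim all of a node's memory.
DefMemPerCPU=2048

# 15.08+: weight GPU requests into multifactor priority
# (example weights only).
PriorityWeightTRES=CPU=1000,Mem=500,GRES/gpu=4000

With DefMemPerCPU set, a second GPU job can start on a node as soon as its CPU, GPU, and default memory shares fit, instead of waiting for the first job to release the whole node's memory.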
His account in Bugzilla has just been created (he'll see an email about it); he can go through password recovery if he wants to log in and change anything.

Have you had a chance to look at setting DefMemPerCPU yet? I'd like to verify that's the cause of the problem you're having.

cheers,
- Tim

Yes, the DefMemPerCPU line was commented out. It is working now.

Thank you,

-Sanjaya

(In reply to Hadrian from comment #6)
> Yes, DefMemPerCPU line was commented. It is working now.
>
> Thank you,
>
> -Sanjaya

Glad to hear that fixed it. Marking as resolved now.