Ticket 11061 - distributed configuration question
Summary: distributed configuration question
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.11.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-03-11 09:01 MST by Todd Merritt
Modified: 2021-04-06 08:56 MDT (History)
0 users

See Also:
Site: U of AZ
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Todd Merritt 2021-03-11 09:01:05 MST
Hi there,

I thought I read somewhere recently that Slurm would stop requiring the config file to be present on all of the compute nodes, and that it would instead be pushed from slurmctld at startup. I can't find where I read that, and searching isn't turning it up. Am I remembering correctly? If so, what version did this start in, and is there anywhere I can read more about it?

Thanks!
Comment 1 Ben Roberts 2021-03-11 09:21:23 MST
Hi Todd,

You're right, there is an option to have the configuration pushed out to the nodes rather than requiring that they all have access to the config file.  This was introduced in 20.02 and is activated by adding SlurmctldParameters=enable_configless to slurm.conf.  There is more information about specifics and implementation details in the documentation here:
https://slurm.schedmd.com/configless_slurm.html
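For reference, a minimal setup on the controller side might look like the following (the hostname is just a placeholder); each slurmd then locates the controller either via its --conf-server option or via a DNS SRV record:

```
# slurm.conf on the slurmctld host -- hostname is an example
SlurmctldHost=ctld.example.edu
SlurmctldParameters=enable_configless
```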

Please read through that document and let me know if you have questions about anything.  

Thanks,
Ben
Comment 2 Todd Merritt 2021-03-11 09:23:55 MST
Perfect, thanks!
Comment 3 Todd Merritt 2021-03-11 09:27:48 MST
If I want to go the SRV record route, that would just go under the origin for the domain that the compute nodes are in, correct? 

Thanks,
Todd
Comment 4 Ben Roberts 2021-03-11 09:42:09 MST
Yes, that should allow the nodes to find the record pointing them to the controller.
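For reference, the SRV record described in the configless documentation looks like the following, placed in the zone the compute nodes resolve against (the hostname and TTL here are examples; 6817 is the default slurmctld port):

```
_slurmctld._tcp 3600 IN SRV 10 10 6817 ctld.example.edu
```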
Comment 5 Todd Merritt 2021-03-11 09:43:02 MST
Perfect, thanks!
Comment 6 Todd Merritt 2021-04-06 06:05:33 MDT
Sorry, one more question occurred to me as I was getting ready to implement this for our next maintenance window. The docs list both slurm.conf and gres.conf as supported files. We have nodes with differing GPU counts and thus different gres.conf files. If there's a local gres.conf and no local slurm.conf, will it pull the slurm.conf based on the SRV record and still use the local gres.conf, or will it ignore the local file?

Thanks!
Comment 7 Ben Roberts 2021-04-06 08:41:16 MDT
I would recommend keeping the same gres.conf file on all the nodes to avoid any confusion down the road.  You can define nodes with different counts and types of GPUs in the same gres.conf file.  Here's an example of what it might look like where nodes 01-09 have 4 Tesla GPUs and nodes 10-18 have 6 p100 GPUs.

NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia0
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia1
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia2
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia3

NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia0
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia1
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia2
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia3
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia4
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia5
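
If it helps, gres.conf also accepts device-file ranges, so the same layout can be written more compactly (assuming the device files are contiguous, as above):

```
NodeName=node[01-09] Name=gpu Type=tesla File=/dev/nvidia[0-3]
NodeName=node[10-18] Name=gpu Type=p100 File=/dev/nvidia[0-5]
```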


Is there some requirement on your side preventing you from defining your nodes like this in your gres.conf file?

Thanks,
Ben
Comment 8 Todd Merritt 2021-04-06 08:42:58 MDT
Nope, I think that should work fine.

Thanks!

Comment 9 Ben Roberts 2021-04-06 08:56:14 MDT
Ok, I'm glad that will work for you.  I'll close this ticket again.

Thanks,
Ben