| Summary: | distributed configuration question | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Todd Merritt <tmerritt> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 20.11.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | U of AZ | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Todd Merritt
2021-03-11 09:01:05 MST

Ben Roberts:
Hi Todd,

You're right, there is an option to have the configuration pushed out to the nodes rather than requiring that they all have access to the config file. This was introduced in 20.02 and requires that you add the SlurmctldParameters=enable_configless parameter to activate it. There is more information about specifics and implementation details in the documentation here: https://slurm.schedmd.com/configless_slurm.html

Please read through that document and let me know if you have questions about anything.

Thanks,
Ben

Todd Merritt:
Perfect, thanks! If I want to go the SRV record route, that would just go under the origin for the domain that the compute nodes are in, correct?

Thanks,
Todd

Ben Roberts:
Yes, that should allow the nodes to find the record pointing them to the controller.

Todd Merritt:
Perfect, thanks!

Sorry, one more question occurred to me as I was getting ready to implement this for our next maintenance window. The docs list both slurm.conf and gres.conf as supported files. We have nodes with differing GPU counts and thus different gres.conf files. If there's a local gres.conf and no local slurm.conf, will a node pull slurm.conf based on the SRV record and still use the local gres.conf, or will it ignore the local file?

Thanks!

Ben Roberts:
I would recommend keeping the same gres.conf file on all the nodes to avoid any confusion down the road. You can define all your nodes, with their differing GPU counts and types, in the same gres.conf file. Here's an example of what it might look like where nodes 01-09 have 4 Tesla GPUs and nodes 10-18 have 6 P100 GPUs:
```
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia0
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia1
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia2
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia3
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia0
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia1
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia2
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia3
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia4
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia5
```

Is there some requirement on your side preventing you from defining your nodes like this in your gres.conf file?

Thanks,
Ben

Todd Merritt:
Nope, I think that should work fine. Thanks!

Ben Roberts:
Ok, I'm glad that will work for you. I'll close this ticket again.

Thanks,
Ben
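For reference, a minimal sketch of the configless setup discussed in this ticket. The hostname and domain (ctl, example.edu) are assumptions for illustration; the _slurmctld._tcp SRV record format and the default slurmctld port 6817 come from the configless_slurm.html documentation linked above.

```
# slurm.conf on the controller (assumed hostname: ctl)
SlurmctldHost=ctl
SlurmctldParameters=enable_configless

# SRV record under the compute nodes' domain (assumed zone: example.edu),
# pointing the nodes at the controller (priority 10, weight 0, port 6817):
_slurmctld._tcp 3600 IN SRV 10 0 6817 ctl
```

With the SRV record in place, slurmd started without a local slurm.conf queries DNS for the controller and fetches the supported config files from it.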
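Note that gres.conf entries like the ones in Ben's example are paired with matching Gres= definitions on the node lines in slurm.conf. A sketch of the corresponding slurm.conf fragment, assuming the same node names as above (other node attributes omitted):

```
# slurm.conf fragment matching the gres.conf example above
GresTypes=gpu
NodeName=node[01-09] Gres=gpu:tesla:4
NodeName=node[10-18] Gres=gpu:p100:6
```

Because both files are distributed by the controller in configless mode, keeping the per-node differences expressed via NodeName= lines in a single shared gres.conf, as Ben recommends, avoids any dependence on local copies.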