Hi there, I thought I read somewhere recently that Slurm would stop requiring the config file to be present on all of the compute nodes and that it would instead be pushed from slurmctld at startup. I can't find where I read that and am not seeing anything about it on Google. Am I remembering correctly? If so, what version did this start in, and is there anywhere I can read more about it? Thanks!
Hi Todd,

You're right, there is an option to have the configuration pushed out to the nodes rather than requiring that they all have access to the config file. This was introduced in 20.02 and is activated by adding SlurmctldParameters=enable_configless to your slurm.conf. There is more information about specifics and implementation details in the documentation here: https://slurm.schedmd.com/configless_slurm.html

Please read through that document and let me know if you have questions about anything.

Thanks,
Ben
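For reference, a minimal sketch of the pieces involved, per the configless documentation (hostnames here are placeholders):

```
# slurm.conf on the controller: enable pushing configs to slurmd
SlurmctldParameters=enable_configless

# On each compute node, start slurmd pointing at the controller
# instead of reading a local slurm.conf, e.g.:
#   slurmd --conf-server ctld-host[:port]
# Alternatively, nodes can discover the controller via a DNS SRV
# record, as described in the linked document.
```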
Perfect, thanks!
If I want to go the SRV record route, that would just go under the origin for the domain that the compute nodes are in, correct? Thanks, Todd
Yes, adding the SRV record under the origin of the compute nodes' domain should allow the nodes to find the record pointing them to the controller.
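As a sketch, assuming the compute nodes live in example.edu and the controller host is ctld.example.edu (both hypothetical names), the zone-file entry would look something like the following, where 6817 is the default SlurmctldPort:

```
_slurmctld._tcp 3600 IN SRV 10 0 6817 ctld
```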
Sorry, I had one more question occur as I was getting ready to implement this for our next maintenance window. The docs list both slurm.conf and gres.conf as supported files. We have nodes with differing gpu counts and thus different gres.conf's. If there's a local gres.conf and no local slurm.conf, will it pull the slurm.conf based on the SRV record and still use the local gres.conf or will it ignore the local file? Thanks!
I would recommend keeping the same gres.conf file on all the nodes to avoid any confusion down the road. You can define nodes with differing counts and types of GPUs in the same gres.conf file. Here's an example of what it might look like where nodes 01-09 have 4 Tesla GPUs and nodes 10-18 have 6 P100 GPUs:

NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia0
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia1
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia2
NodeName=node[01-09] Name=gpu Type=tesla Count=1 File=/dev/nvidia3
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia0
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia1
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia2
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia3
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia4
NodeName=node[10-18] Name=gpu Type=p100 Count=1 File=/dev/nvidia5

Is there some requirement on your side preventing you from defining your nodes like this in your gres.conf file?

Thanks,
Ben
Nope, I think that should work fine. Thanks!
Ok, I'm glad that will work for you. I'll close this ticket again. Thanks, Ben