Ticket 12839

Summary: configless support
Product: Slurm
Reporter: Todd Merritt <tmerritt>
Component: User Commands
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 20.11.5
Hardware: Linux
OS: Linux
Site: U of AZ

Description Todd Merritt 2021-11-10 06:41:23 MST
This is probably loosely related to Bug 10334. We are running three clusters in a configless environment with a shared set of gateway nodes from which users submit jobs. As configured now, we have to distribute the slurm.conf and topology.conf files for all three clusters to each of the gateway nodes to keep sbatch, srun, and friends happy, which is somewhat inconvenient. It would be great if the client commands could also honor the DNS SRV records, settable via an environment variable or command-line switch, to retrieve their configuration parameters the way slurmd does.
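For reference, slurmd's configless lookup is driven by a DNS SRV record; a zone-file sketch (the domain, hostname, and TTL below are placeholders) looks like:

```
; DNS zone fragment (hypothetical domain): slurmd resolves this record
; to find the controller's host and port for configless operation.
_slurmctld._tcp 3600 IN SRV 10 0 6817 slurmctl.example.com.
```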
Comment 2 Ben Roberts 2021-11-10 12:50:24 MST
Hi Todd,

I don't know if you've seen the suggestion in our documentation for how you might handle this.  If you have submit hosts or gateway nodes that you use to submit jobs, one suggestion is to run slurmd on those nodes so that it manages the configuration files for the client commands, but leave the login/gateway nodes out of any partition so that no jobs run on them.

For reference this is mentioned in the initial section of the configless documentation:
https://slurm.schedmd.com/configless_slurm.html
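A minimal sketch of that setup, with hypothetical node and partition names: the controller serves its config via `SlurmctldParameters=enable_configless`, the login node runs slurmd, and it is simply left out of every partition definition so no jobs land on it.

```
# slurm.conf fragment (hypothetical names): configless serving plus a
# login node that runs slurmd but belongs to no partition.
SlurmctldParameters=enable_configless
NodeName=login01 CPUs=4 State=UNKNOWN
NodeName=compute[01-10] CPUs=32 State=UNKNOWN
# login01 is deliberately absent from the Nodes= list below.
PartitionName=batch Nodes=compute[01-10] Default=YES MaxTime=INFINITE State=UP
```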

Let me know if this sounds like it would work for you or if you have any questions.

Thanks,
Ben
Comment 3 Todd Merritt 2021-11-10 13:38:38 MST
Thanks Ben, I hadn't seen that. However, I have three separate slurmctlds running on three separate hosts. I'd have to run three separate slurmds and make sure they don't step on each other. That seems more fragile than the configuration that I currently have.
Comment 5 Ben Roberts 2021-11-10 15:26:17 MST
I misunderstood your question initially.  I thought there were unique gateway nodes for each cluster, but I see now that you say the gateway nodes are shared.  You may still be able to get this to do what you need by defining a different SlurmdSpoolDir in the config for each cluster.  The config files will be cached in each SlurmdSpoolDir under the conf-cache/ directory.  It would require some care to make sure the slurmd processes are all able to run concurrently.  There is more information about this in the Field Notes from our most recent Slurm User Group (SLUG) meeting, starting on slide 35:
https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
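One way to keep several per-cluster slurmd instances from stepping on each other is a systemd template unit, one instance per cluster. Everything below (the unit name, controller hostnames, and pid-file paths) is a hypothetical sketch; the PIDFile must agree with the SlurmdPidFile setting each cluster's config hands back.

```
# Hypothetical /etc/systemd/system/slurmd@.service: start one slurmd per
# cluster ("a", "b", "c"), each pointed at its own controller.
[Unit]
Description=Slurm node daemon for cluster %i
After=network-online.target

[Service]
Type=forking
ExecStart=/usr/sbin/slurmd --conf-server ctl-%i.example.com:6817
# Must match SlurmdPidFile in that cluster's slurm.conf:
PIDFile=/run/slurmd-%i.pid

[Install]
WantedBy=multi-user.target
```

With the names above, the three instances would be started as `systemctl enable --now slurmd@a slurmd@b slurmd@c`.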

Can you elaborate on how you are selecting the correct config file now?  We do have the SLURM_CONF environment variable that you can set to define where the client commands will look for the config file (https://slurm.schedmd.com/sbatch.html#OPT_SLURM_CONF).  Is this what you're using?  If part of your concern is about keeping the config files in sync when you make any changes on the controller, have you considered using a network share to have the config file stay in sync?  
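Putting the two pieces together, the SLURM_CONF selection could be wrapped in a small helper that points the client commands at the config each cluster's slurmd has cached; the cluster names and SlurmdSpoolDir paths below are placeholders for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a cluster name to the config that cluster's
# slurmd has cached under its (per-cluster) SlurmdSpoolDir.
conf_for_cluster() {
    case "$1" in
        a) echo /var/spool/slurmd-a/conf-cache/slurm.conf ;;
        b) echo /var/spool/slurmd-b/conf-cache/slurm.conf ;;
        c) echo /var/spool/slurmd-c/conf-cache/slurm.conf ;;
        *) echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# Usage (per invocation, without polluting the login shell):
#   SLURM_CONF="$(conf_for_cluster b)" sbatch job.sh
```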

Thanks,
Ben
Comment 6 Todd Merritt 2021-11-10 16:04:51 MST
Hi Ben,

Yes, today we select the config using the SLURM_CONF environment variable. We manage the Slurm config across the three clusters with Ansible, but for Ansible reasons and the way we set the role up, we can't manage the configuration on the login nodes. NFS shares would be a possibility (it's what we were using on the slurmd nodes in our one cluster before we went configless), but with multiple clusters and multiple login nodes it seems like it would be cumbersome.
Comment 7 Ben Roberts 2021-11-10 16:24:25 MST
You're right, there would be a pain period initially with the configuration of the shares for each cluster, but after the initial setup it should be pretty low maintenance.  I'm afraid I don't see an easier alternative for the situation you're describing.  

Thanks,
Ben
Comment 8 Ben Roberts 2021-12-09 13:15:11 MST
Hi Todd,

I wanted to follow up and see if you were able to configure shares for the different clusters and set up a system to have the environment variable switch between those shares.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 9 Todd Merritt 2021-12-09 13:21:59 MST
Hi Ben,

You can close this. We've decided to preserve our current configuration implementation.
Comment 10 Ben Roberts 2021-12-09 13:26:30 MST
Ok, sounds good.  Let us know if there's anything we can do to help in the future.

Thanks,
Ben