| Summary: | configless support | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Todd Merritt <tmerritt> |
| Component: | User Commands | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 20.11.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | U of AZ | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Todd Merritt
2021-11-10 06:41:23 MST
**Ben Roberts:** Hi Todd, I don't know if you've seen the suggestion in our documentation for how you might handle this. If you have submit hosts or gateway nodes that you use to submit jobs, then one suggestion is to have those nodes run slurmd so that it can manage the configuration files for the client commands, but not add the login/gateway nodes to a partition so they don't have jobs run on them. For reference, this is mentioned in the initial section of the configless documentation: https://slurm.schedmd.com/configless_slurm.html Let me know if this sounds like it would work for you or if you have any questions. Thanks, Ben

**Todd Merritt:** Thanks Ben, I hadn't seen that. However, I have three separate slurmctlds running on three separate hosts. I'd have to run three separate slurmds and make sure they don't step on each other. That seems more fragile than the configuration that I currently have.

**Ben Roberts:** I misunderstood your question initially. I thought there were unique gateway nodes for each cluster, but I see now that you say the gateway nodes are shared. You may still be able to get this to do what you need by defining a different SlurmdSpoolDir in the config for each server. The config files will be in the SlurmdSpoolDir under the /conf-cache/ directory. It would require some care to make sure the slurmd processes are all able to run concurrently. There is more information about this in the Field Notes of our most recent Slurm User's Group, starting on slide 35: https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf

Can you elaborate on how you are selecting the correct config file now? We do have the SLURM_CONF environment variable that you can set to define where the client commands will look for the config file (https://slurm.schedmd.com/sbatch.html#OPT_SLURM_CONF). Is this what you're using? If part of your concern is about keeping the config files in sync when you make any changes on the controller, have you considered using a network share to have the config file stay in sync?
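A concrete sketch of the per-cluster configless setup Ben suggests above, assuming three controllers named ctl-a, ctl-b, and ctl-c (hypothetical hostnames; spool directories are likewise illustrative). Each cluster's slurm.conf on its own controller sets a distinct SlurmdSpoolDir so the slurmd processes sharing a login node don't collide:

```shell
# In cluster A's slurm.conf (on its controller) -- repeated per
# cluster with a different directory:
#   SlurmdSpoolDir=/var/spool/slurmd-a
#
# On the shared login node, start one configless slurmd per cluster,
# each pointed at its own controller:
slurmd --conf-server ctl-a
slurmd --conf-server ctl-b
slurmd --conf-server ctl-c
#
# Each daemon then caches the config it fetches under its own spool
# directory, e.g.:
#   /var/spool/slurmd-a/conf-cache/slurm.conf
```

As Ben notes, getting the daemons to run concurrently takes some care: other per-daemon settings (ports, pid files, log files) would also need to differ between the three configs.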
Thanks, Ben

**Todd Merritt:** Hi Ben, Yes, today we select the config using the SLURM_CONF environment variable. We manage the Slurm config across the three clusters with Ansible, but for Ansible reasons and the way that we set the role up, we can't manage the configuration on the login nodes. NFS shares would be a possibility (it's what we were using on the slurmd nodes in our one cluster before we went configless), but with multiple clusters and multiple login nodes it seems like it would be cumbersome.

**Ben Roberts:** You're right, there would be an initial pain period while configuring the shares for each cluster, but after the initial setup it should be pretty low maintenance. I'm afraid I don't see an easier alternative for the situation you're describing. Thanks, Ben

**Ben Roberts:** Hi Todd, I wanted to follow up and see if you were able to configure shares for the different clusters and set up a system to have the environment variable switch between those shares. Let me know if you still need help with this ticket. Thanks, Ben

**Todd Merritt:** Hi Ben, You can close this. We've decided to preserve our current configuration implementation.

**Ben Roberts:** Ok, sounds good. Let us know if there's anything we can do to help in the future. Thanks, Ben
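The SLURM_CONF-based selection the site kept could be wrapped in a small helper like this (cluster names and paths are illustrative, not taken from the ticket). Client commands such as sbatch, squeue, and sinfo read SLURM_CONF to locate the config file, so no per-cluster slurmd or NFS share is needed on the login node:

```shell
#!/bin/sh
# Map a cluster name to its config file; paths are hypothetical.
slurm_conf_for() {
    echo "/etc/slurm/$1/slurm.conf"
}

# Run a Slurm client command against the chosen cluster, e.g.:
#   run_on_cluster cluster-a sinfo
run_on_cluster() {
    cluster="$1"
    shift
    # Set SLURM_CONF only for this one command's environment.
    SLURM_CONF="$(slurm_conf_for "$cluster")" "$@"
}
```

The per-command environment assignment keeps the login shell's own environment untouched, so users of different clusters can share one session without stepping on each other's SLURM_CONF.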