Summary: | Configless slurmd does not honor changes in the DNS SRV record | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Daniel Armengod <armengod> |
Status: | OPEN --- | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | armengod, stephen |
Version: | 23.11.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
OCF Sites: | --- | Recursion Pharma Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- | ||
Attachments: | slurmd log file |
Description
Ole.H.Nielsen@fysik.dtu.dk
2024-07-22 03:36:13 MDT
Created attachment 37873
slurmd log file
Hello Ole,

Thanks for the detailed writeup. I'm going to replicate this in my environment after carefully reviewing it, but it does look like slurmd does not honor the TTL field in the SRV record.

I'll work on verifying the problem and proposing a patch if confirmed. Just to make sure, this is not causing any pressing issue on your end?

Kind regards,
Daniel

Hi Daniel,

Thanks a lot for your quick response!

(In reply to Daniel Armengod from comment #2)
> Thanks for the detailed writeup. I'm going to replicate this in my
> environment after carefully reviewing it, but it does look like slurmd does
> not honor the TTL field in the SRV record.

Thanks for sharing offhand my view of the issue, let's see what comes up after your analysis.

> I'll work on verifying the problem and proposing a patch if confirmed. Just
> to make sure, this is not causing any pressing issue on your end?

That's correct, we made the workaround of quickly restarting all slurmd's after changing the DNS SRV record a few weeks ago, and this resolved the pressing issue. Other Slurm users may be surprised in the future if they encounter the same issue after updating the DNS SRV record, so it would be good to get a fix or a documentation update.

Best regards,
Ole

I have thought that the desired response of slurmd when it discovers that the TTL of the DNS SRV record has expired, is that it should create a new slurmd process, just like what happens with an "scontrol reconfigure" (from 23.11):

    Instruct all slurmctld and slurmd daemons to re-read the configuration file. This mechanism can be used to modify configuration parameters set in slurm.conf(5) without interrupting running jobs. Starting in 23.11, this command operates by creating new processes for the daemons, then passing control to the new processes when or if they start up successfully. This allows it to gracefully catch configuration problems and keep running with the previous configuration if there is a problem.
Probably you'll agree on this? BTW, have you reproduced the issue in your own environment?

Thanks,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #6)
> I have thought that the desired response of slurmd when it discovers that
> the TTL of the DNS SRV record has expired, is that it should create a new
> slurmd process, just like what happens with an "scontrol reconfigure" (from
> 23.11):

Actually, I meant to say that if the DNS SRV record has *changed*, then slurmd should be restarted. It's only after the TTL has expired that slurmd is going to discover if it has a changed value.

/Ole

Hi Ole,

Just getting a chance to put down in writing what we had the opportunity to discuss in-person at SLUG'24.

Regardless of whether configless slurmd discovers its ctld through --conf-server or DNS SRV, fetching the config is a one-time operation. The config's "freshness" is not associated in any way with the DNS's record TTL field; this is akin to fetching a generic resource over the net.

I'm going to propose a patch to slightly alter the wording in the configless documentation to emphasize the fact that discovery is a one-time operation performed at startup and that the daemon runs normally afterwards.

Kind regards,
Daniel

Hi Daniel,

It was nice to meet you at the SLUG'24 meeting last week. Thanks for taking the time to discuss this case with me. I appreciate the clarification that the DNS SRV record's TTL value isn't used for anything at all by slurmd after starting up - this was unexpected to me, given the normal semantics of DNS TTL values!

(In reply to Daniel Armengod from comment #9)
> Just getting a chance to put down in writing what we had the opportunity to
> discuss in-person at SLUG'24.
>
> Regardless of whether configless slurmd discovers its ctld through
> --conf-server or DNS SRV, fetching the config is a one-time operation.
> The config's "freshness" is not associated in any way with the DNS's record TTL
> field; this is akin to fetching a generic resource over the net.
>
> I'm going to propose a patch to slightly alter the wording in the configless
> documentation to emphasize the fact that discovery is a one-time operation
> performed at startup and that the daemon runs normally afterwards.

It will be good to get a clarification in the configless documentation.

Conclusion: All slurmd's have to be restarted if the DNS SRV record is changed in a configless setup, is this correct?

I have a followup question: If a site doesn't use configless, the slurmd reads slurm.conf from some file at startup and caches the value of SlurmctldHost. What will happen when SlurmctldHost is changed in the slurm.conf file, similar to the present migration scenario? My expectation is that slurmd will ignore the updated slurm.conf file until slurmd is explicitly restarted with "systemctl restart slurmd". I don't know if a "scontrol reconfig" command executed on the new SlurmctldHost will be rejected or ignored by slurmd's?

It would be good to have a documentation of migrating SlurmctldHost which works both with the configless scenario and with a local slurm.conf file.

Thanks,
Ole
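
For readers of this ticket, the difference between the two behaviours discussed above can be illustrated with a small, self-contained Python sketch. This is not Slurm code and makes no claim about how slurmd is implemented: the class names `DiscoverOnce` and `TtlAware`, the host names `old-ctld` and `new-ctld`, and the 3600-second TTL are all hypothetical, and the DNS lookup is replaced by an in-memory stand-in. `DiscoverOnce` models the one-time discovery at startup described by Daniel; `TtlAware` models the re-resolution after TTL expiry that Ole initially expected.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical shape of an SRV answer: (target host, TTL in seconds).
SrvAnswer = Tuple[str, int]


@dataclass
class DiscoverOnce:
    """Resolve the controller at startup only; the TTL is never consulted."""
    lookup: Callable[[], SrvAnswer]
    host: Optional[str] = None

    def controller(self, now: float) -> str:
        if self.host is None:
            # The TTL is returned by the lookup but deliberately ignored.
            self.host, _unused_ttl = self.lookup()
        return self.host


@dataclass
class TtlAware:
    """Re-resolve once the TTL of the last answer has elapsed."""
    lookup: Callable[[], SrvAnswer]
    host: Optional[str] = None
    expires_at: float = 0.0

    def controller(self, now: float) -> str:
        if self.host is None or now >= self.expires_at:
            self.host, ttl = self.lookup()
            self.expires_at = now + ttl
        return self.host


def simulate() -> Tuple[str, str]:
    # In-memory stand-in for the DNS SRV record (made-up values).
    record = {"target": "old-ctld", "ttl": 3600}

    def lookup() -> SrvAnswer:
        return record["target"], record["ttl"]

    once, aware = DiscoverOnce(lookup), TtlAware(lookup)
    once.controller(now=0.0)   # both clients "start up" at t=0
    aware.controller(now=0.0)

    record["target"] = "new-ctld"  # the administrator updates the record

    # Ask again after the TTL has elapsed.
    return once.controller(now=4000.0), aware.controller(now=4000.0)


if __name__ == "__main__":
    print(simulate())
```

In this model the discover-once client keeps pointing at the old controller until it is recreated, which corresponds to the workaround reported above of restarting all slurmd's after changing the DNS SRV record.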