| Summary: | Having to restart slurmctld to detect changes made using sacctmgr | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | Database | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | wfeinstein |
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Greg Wickham
2018-09-12 08:34:19 MDT
Hi Greg,

We have seen this before from other sites where there is a firewall preventing the slurmdbd from contacting the slurmctld. When slurmctld restarts, it initiates the connection in the other direction and gets the updated view, but if there's a firewall in the other direction, changes made through sacctmgr won't be propagated immediately.

Kind regards,
Jason

Hi Jason,

Given that slurmdbd and slurmctld run on the same node and there is no firewall, what else could be the issue?

    # ps ax | grep slurm
     8473 ?  Sl   26:49 /etc/slurm-active/sbin/slurmctld
    20991 ?  Sl  671:07 /etc/slurm-active/sbin/slurmdbd
    # /sbin/iptables -L
    Chain INPUT (policy ACCEPT)
    target prot opt source destination
    Chain FORWARD (policy ACCEPT)
    target prot opt source destination
    Chain OUTPUT (policy ACCEPT)
    target prot opt source destination
    #

Changes to the slurm management database aren't picked up quickly (affecting user additions, QOS changes, account management, ... )

Hi Greg,

You will want to raise the log level of slurmdbd to debug, and of slurmctld to debug2, to see these updates. I would suggest looking at the logs to see if these updates are being sent and received. Please find an example below.

slurmdbd logs

    [2018-09-13T11:08:55.001] debug: sending updates to linux at 127.0.0.1(8817) ver 8192

slurmctld logs

    [2018-09-13T11:08:55.002] debug2: Processing RPC: ACCOUNTING_UPDATE_MSG from uid=1020

It would also be helpful to see your slurm.conf and your slurmdbd.conf (StoragePass removed).

-Jason

Jason,

At LBNL we are seeing the same issue and we do not have a firewall set up. We consistently have to either delete and re-add the user or just restart slurmctld. It would be good to know what the root cause of this issue is and how it is resolved.

Hi Greg,

Were you able to raise the log level as mentioned in the last message and gather some additional debug logging for me to review? If so, please attach this information along with your configuration files (slurm.conf and your slurmdbd.conf).
-Jason

I will be out of the office until Wednesday, 26th September 2018.
For any issues with Ibex please either:
- send a request to the Ibex slack channel #general
(sign up at https://kaust-ibex.slack.com/signup)
- open a ticket by sending an email to ibex@hpc.kaust.edu.sa
Some useful information:
To access Ibex, the frontend nodes are:
ilogin.ibex.kaust.edu.sa (for Intel)
alogin.ibex.kaust.edu.sa (for AMD)
glogin.ibex.kaust.edu.sa (for Intel with GPUs)
For information regarding the unified clusters (tutorial, explanations etc) please refer to the wiki at:
http://hpc.kaust.edu.sa/ibex
-Greg
Hi Greg - Any update on this issue?

Hi Greg,

I am going to close this issue out as timedout for now since I do not have any further information to look at. Please feel free to re-open if you have some additional logging for me to analyze.

Best regards,
-Jason |
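
The log check Jason suggested in this thread (confirm that slurmdbd logs "sending updates" and that slurmctld logs a matching ACCOUNTING_UPDATE_MSG) can be sketched as a small script. This is only an illustration and not a SchedMD tool: the function names and the `diagnose` wording are hypothetical, and the patterns are taken solely from the two example log lines quoted above, so they may need adjusting for other Slurm versions or log formats.

```python
import re

# Patterns derived from the two example log lines quoted in this ticket.
# Assumption: other Slurm versions may word these messages differently.
SENT = re.compile(
    r"^\[(?P<ts>[^\]]+)\] debug:\s+sending updates to "
    r"(?P<cluster>\S+) at (?P<addr>\S+)"
)
RECEIVED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] debug2:\s+Processing RPC: ACCOUNTING_UPDATE_MSG"
)


def find_sent(lines):
    """Return (timestamp, cluster, address) for each update slurmdbd sent."""
    return [
        (m["ts"], m["cluster"], m["addr"])
        for m in map(SENT.match, lines)
        if m
    ]


def find_received(lines):
    """Return the timestamp of each accounting update slurmctld processed."""
    return [m["ts"] for m in map(RECEIVED.match, lines) if m]


def diagnose(dbd_lines, ctld_lines):
    """Summarise which side of the update path left evidence in the logs."""
    sent = find_sent(dbd_lines)
    received = find_received(ctld_lines)
    if not sent:
        return "slurmdbd logged no updates being sent"
    if not received:
        return "slurmdbd sent updates but slurmctld logged none received"
    return "updates were sent and received"
```

To use it, read the slurmdbd and slurmctld log files into lists of lines (with the log levels raised as described above) and pass them to `diagnose`. If slurmdbd never logs a send, the problem is on the slurmdbd side; if it sends but slurmctld never logs the RPC, the update is being lost between the two daemons.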