| Summary: | possible bug with backup controller gives access/permission denied errors | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Robert Yelle <ryelle> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 18.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Oregon | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 19.05.7,20.02.3 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, ATT00001.htm | ||
Hi Rob, would you please attach your slurm.conf from both head nodes? I assume these are the same file; however, having you confirm this would be helpful.
Created attachment 11396 [details]
slurm.conf
Hi Jason,
Our current slurm.conf file (same for both head nodes) is attached.
Thanks,
Rob
Created attachment 11397 [details]
ATT00001.htm
(In reply to Robert Yelle from comment #0)
> When we bring the primary controller down and the backup controller kicks in

How are you switching to the backup controller? Are you using systemd to shut down the slurmctld service?

This is an automated reply; I am out of the office until Sept 18 and will not be able to reply to you immediately. I will get back to you as soon as I am able.

(In reply to Nate Rini from comment #4)
> (In reply to Robert Yelle from comment #0)
> > When we bring the primary controller down and the backup controller kicks in
>
> How are you switching to the backup controller? Are you using systemd to
> shut down the slurmctld service?

Please disregard this request; I have managed to replicate your issue. I will provide updates once a patchset is sent for QA review.

Rob,
We have a patch undergoing QA review currently. An easy workaround for the issue is to call `scontrol reconfigure` after the backup controller has taken control. I will provide updates once the patchset is upstream.
Thanks,
--Nate

Thanks Nate!
Rob

On Oct 24, 2019, at 5:53 PM, bugs@schedmd.com wrote:
> Comment #22 on bug 7649 (https://bugs.schedmd.com/show_bug.cgi?id=7649) from Nate Rini
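For sites that want to script the workaround until they are running a fixed release, a minimal sketch follows. It assumes `scontrol ping` prints one line per controller in the form `Slurmctld(primary) at <host> is UP|DOWN` (the layout I would expect on 18.08 and later; verify it on your version), and it treats "primary DOWN, some backup UP" as a stand-in for "the backup has taken control". The function names and the hosts `head1`/`head2` are hypothetical.

```python
import re
import subprocess

# Assumed `scontrol ping` line format (check against your Slurm version), e.g.
#   Slurmctld(primary) at head1 is DOWN
#   Slurmctld(backup) at head2 is UP
PING_LINE = re.compile(
    r"Slurmctld\((?P<role>\w+)\) at (?P<host>\S+) is (?P<state>UP|DOWN)"
)


def backup_in_control(ping_output: str) -> bool:
    """Return True when the primary is DOWN and at least one backup is UP."""
    states = {m["role"]: m["state"] for m in PING_LINE.finditer(ping_output)}
    primary_down = states.get("primary") == "DOWN"
    backup_up = any(
        state == "UP"
        for role, state in states.items()
        if role.startswith("backup")
    )
    return primary_down and backup_up


def apply_workaround(run=subprocess.run) -> bool:
    """Run `scontrol reconfigure` only if a backup appears to be in control.

    `run` is injectable so the logic can be exercised without a cluster.
    Returns True if the reconfigure command was issued.
    """
    # No check=True here: the exit status of `scontrol ping` is not relied on.
    ping = run(["scontrol", "ping"], capture_output=True, text=True)
    if not backup_in_control(ping.stdout):
        return False
    run(["scontrol", "reconfigure"], check=True)
    return True
```

Calling `apply_workaround()` from the backup head node after a failover would issue the reconfigure only when the ping output matches the assumed pattern; otherwise it does nothing.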
Rob,
This patch is now upstream:
https://github.com/SchedMD/slurm/commit/e3238f97c9b95e31e9835900ecb6d1e072d29327
Please respond if you have any questions or issues.
Thanks,
--Nate
Hello,

We are testing our backup Slurm controller on our second head node. When we bring the primary controller down and the backup controller kicks in, basic commands like "squeue" and "sinfo" work fine, but when we try to access our standard partitions, we get:

"_slurm_rpc_allocate_resources: Access/permission denied"

Access to our "special" partitions, e.g. partitions for preempt and specific condo owners, appears to work fine. It seems the primary difference between the standard and special partitions was the following setting, present on the standard partitions:

DenyAccounts={accountName}

Unfortunately, ALL accounts are denied access to the standard partitions by the backup controller, not just the account listed in "DenyAccounts". When I replaced this setting with

AllowAccounts=All

all partitions were accessible again from the backup controller.

While we no longer have a need at present to deny accounts to certain partitions, this is undesirable behavior for our backup controller if we need to deny accounts from certain partitions again in the future.

Are you able to reproduce this? Let me know if you need other information that would be helpful here.

Thanks,
Rob
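To make the expected behavior concrete, here is a simplified model of how a partition's account lists are meant to gate access. It is only a sketch of the semantics described in this report and, as I understand it, in the slurm.conf documentation (an explicit AllowAccounts list takes precedence over DenyAccounts); it is not Slurm's actual implementation. The function name and the account names `acct_a` and `acct_denied` are hypothetical.

```python
from typing import Optional, Set


def account_allowed(
    account: str,
    allow_accounts: Optional[Set[str]] = None,
    deny_accounts: Optional[Set[str]] = None,
) -> bool:
    """Simplified model of partition account access (not Slurm's code).

    allow_accounts=None stands for the default, AllowAccounts=ALL.
    """
    if allow_accounts is not None:
        # Assumption: when an explicit allow list is set, only it is consulted.
        return account in allow_accounts
    # Otherwise every account is allowed except those explicitly denied.
    return account not in (deny_accounts or set())


# Expected behavior for a partition with DenyAccounts=acct_denied:
deny = {"acct_denied"}
print(account_allowed("acct_a", deny_accounts=deny))       # True
print(account_allowed("acct_denied", deny_accounts=deny))  # False
```

The behavior reported above corresponds to the backup controller answering `False` for every account on such a partition, including ones like `acct_a` that are not in the deny list.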