A user with a correctly set DefaultAccount has jobs rejected with an invalid qos/partition error when sbatch or salloc is invoked without --account=.... If slurmctld is restarted, these submissions use the value defined in DefaultAccount and succeed, as expected.

This appears to be an issue only on a second cluster which shares a database with a primary cluster, with slurm version 20.02.1.

On a newly-installed database and cluster1:
1. Create cluster1, a partition, qos, account, default account, and user
   (qos=normal, account=default, defaultaccount=default).
2. Submit test jobs.
==> all is good on cluster1

On a second cluster (cluster2) which uses the same slurmdbd:
3. Create cluster2, partition, qos, account, default account, and user
   (identical values & configuration as above).
4. sbatch or salloc test jobs (invoked by a root test driver):

   [works] # sbatch --chdir=/tmp --qos=normal --account=default --uid=user --wrap=date
   [fails] # sbatch --chdir=/tmp --qos=normal --uid=user --wrap=date

If the slurmctld, which runs on a separate node, is restarted ("systemctl restart slurmctld"), then the test job submission works.

----

Is there an scontrol command to cause slurmctld to be restarted? (Or a hack such as 'scontrol takeover' without a backup controller, causing the primary to be restarted?) That is, could an appropriately privileged user (=~ an automated test suite) invoke this to cause slurmctld to flush its cache & restart, rather than forcing a human to invoke something like 'ssh slurmctld-node systemctl restart slurmctld'? Automating ssh access in our environment is discouraged by security policy.
----
[root@vxlogin slurm]# sinfo --version
slurm 20.02.1
[root@vxlogin slurm]# uname -a
Linux vxlogin 3.10.0-1062.4.3.el7.x86_64 #1 SMP Wed Nov 13 23:58:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@vxlogin slurm]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
---
This is filed against the slurmctld component, but the problem may be in slurmdbd.
Hi Steven,

> A user with a DefaultAccount set correctly has jobs rejected with invalid
> qos/partition using sbatch or salloc if --account=... is not used for
> submission. If slurmctld is restarted these use the value defined in
> DefaultAccount and succeed, as expected.
>
> This appears to be an issue only on a second cluster which shares a data
> base with a primary cluster and with slurm version 20.02.1.
>
> On a newly-installed data base and cluster1:
> 1. create a cluster1, partition, qos, account, default account and user.
> qos=normal, account=default, defaultaccount=default
> 2. submit test jobs
> ==> all is good on cluster1
>
> On a 2nd cluster which uses the same slurmdbd and cluster2:
> 3. create cluster2, partition, qos, account, default account and user
> (identical values & configuration as above)
> 4. sbatch or salloc test jobs (invoked by root test driver)
> ex. [works] # sbatch --chdir=/tmp --qos=normal --account=default
> --uid=user --wrap=date
>
> ex [fails] # sbatch --chdir=/tmp --qos=normal --uid=user --wrap=date
>
> If the slurmctld is restarted ("systemctl restart slurmctld") where it is
> running on a separate node, then the test job submission works.

Thanks for the detailed explanation. Unfortunately I've not been able to reproduce the issue so far.

Could you detail a bit more the order that you follow to create/add the new cluster in the db, when you start the new slurmctld, and when you create the new account and user of the new cluster in the db? Did you start the new slurmctld with slurmdbd down?

Could you also attach the logs from both slurmctlds and from the shared slurmdbd? And both slurm.conf files and the single slurmdbd.conf?

Once it works for the first time, I guess that you are not able to reproduce it either, right?

> Is there an scontrol command to cause slurmctld to be restarted? (or hack
> such as 'scontrol takeover' without a backup controller causing the primary
> to be restarted?)
Wait a minute, does that mean that you are using multiple SlurmctldHost entries (the old BackupController/Addr)? If this is the case, then there are no "multiple clusters", but only one cluster with high availability.

So far I understood that you have two clusters, meaning two independent slurmctlds with two independent slurm.conf files, both sharing the slurmdbd but each one with its own cluster in the database. Not multiple SlurmctldHost entries in a single slurm.conf. Was I right?

> That is, could this be invoked by an appropriately
> privileged user (=~ automated test suite) to cause slurmctld to flush its
> cache & restart rather than forcing a human to invoke something like 'ssh
> slurmctlr-node systemctl restart slurmctld'?
>
> Automating ssh access in our environment is discouraged by security policy.

We don't have such a command, mainly because slurmctld is designed (and needs) to refresh its cache from the database whenever necessary. I'm not sure if this could become an enhancement.

Regards,
Albert
Created attachment 13779 [details] cluster1 ("vc") slurm.conf
(In reply to Albert Gil from comment #2)
> Could you detail a bit more the order that you follow to create/add the new
> cluster in the db, when do you start the new slurmctld, and when you create
> the new account and user of the new cluster in the db. Did you start the new
> slurmctld with slurmdbd down?

After cluster1 is fully up (slurmdbd included) & has run its own test jobs, the cluster2 nodes are provisioned. The first node is the cluster2 slurmctld scheduler node. It spins up its own slurmctld instance, which is configured to point to cluster1's dbd. That's the point where the 'add cluster cluster2' is invoked, which succeeds. The relevant tables are populated in mysql, and the clustername shown in the clusters table matches what is retrieved from cluster2's slurm.conf. So, no, the slurmdbd was definitely up when the new cluster was created.

After the cluster2 compute nodes are up & running slurmd, the front-end/login node is provisioned. As a late stage in its provisioning, dependent on seeing the socket on the compute nodes, the users and their associations are created. The default account is set as a separate call to sacctmgr. (All of these calls return 0.)

> And could you attach the logs from both slurmctlds and also from the shared
> slurmdbd?
> And both slurm.conf and the one slurmdbd.conf?

[done: note vc = cluster1, vx = cluster2]

> Once it works for the first time, I guess that you are not able to reproduce
> it neither, right?

Yes, once slurmctld has been restarted the problem does not recur.

> Wait a minute, those that mean that you are using multiple SlurmctldHost
> (old BackupController/Addr)?

No. This was just an extrapolation that the takeover logic could be triggered to cause a cache flush, since I didn't see any other mechanism besides restarting cluster2's slurmctld to accomplish that.

> If this is the case, then there is no "multiple clusters", but only one with
> high-availability.
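For concreteness, a hedged sketch of what those provisioning calls presumably look like. The entity names (vx, default, normal, sts) come from elsewhere in this report; the exact sacctmgr invocations are assumptions, not the actual hpc-collab recipe:

```shell
# On the cluster2 sched node, whose slurm.conf points at cluster1's dbd:
sacctmgr -i add cluster vx

# Later, during front-end/login node provisioning, the users and their
# associations are created:
sacctmgr -i add account default cluster=vx qos=normal
sacctmgr -i add user sts account=default cluster=vx

# The default account is set as a separate call:
sacctmgr -i modify user sts set defaultaccount=default
```

(-i answers "yes" to sacctmgr's confirmation prompts, which an unattended provisioning script would need.)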
This was just speculation on my part about possible mechanisms. We do not use a secondary controller, nor are these clusters configured to do so.

> So far I understood that you have two clusters, meaning two independent
> slurmctld with two independent slurm.conf, both sharing the slurmdbd but
> each one with its own cluster on the database.

Yes. [Attaching the slurm.conf & slurmdbd.conf.]

> Not multiple SlurmctldHost in a single slurm.conf.
> Was I right?

You are right. There are not multiple SlurmctldHosts. That was just speculation.

> We don't have such command, mainly because slurmctld is designed (and needs)
> to refresh its cache whenever is necessary from the database.

I was just speculating about existing logic that could be triggered to provoke a slurmctld cache flush without requiring external agency (systemctl restart slurmctld, in this case).
Created attachment 13780 [details] cluster1 ("vc") slurmdbd.conf
Created attachment 13781 [details] cluster2 ("vx") slurm.conf
Created attachment 13782 [details] cluster2 ("vx") slurmdbd.conf
I am reconfiguring the two clusters & will regenerate the logs & attach them as requested. This will take ~1 hour.
Created attachment 13784 [details] cluster1 ("vc") slurmctld.log
Created attachment 13785 [details] cluster2 ("vx") slurmctld.log
Created attachment 13786 [details] common cluster1 ("vc") slurmdbd.log
User|Def Acct|Def WCKey|Admin|Cluster|Account|Partition|Share|Priority|MaxJobs|MaxNodes|MaxCPUs|MaxSubmit|MaxWall|MaxCPUMins|QOS|Def QOS|
sts|default||Administrator|vc|default|compile|1||||||||normal||
sts|default||Administrator|vc|default|login|1||||||||normal||
sts|default||Administrator|vc|default|exclusive|1||||||||normal||
sts|default||Administrator|vc|default|shared|1||||||||normal||
sts|default||Administrator|vx|default|login|1||||||||normal||
sts|default||Administrator|vx|default|exclusive|1||||||||normal||
sts|default||Administrator|vx|default|shared|1||||||||normal||
---
% id sts
uid=24800(sts) gid=24800(sts) groups=24800(sts),1000(vagrant)
---
% tail /var/log/slurm/slurmctld.vxsched.log
[2020-04-14T13:23:57.116] error: User 24800 not found
[2020-04-14T13:23:57.120] _job_create: invalid account or partition for user 24800, account '(null)', and partition 'login'
[2020-04-14T13:23:57.150] _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
---
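The pipe-separated listing above looks like parsable sacctmgr output; presumably something along these lines (an assumption about the exact invocation used):

```shell
# -P produces the pipe-delimited output shown above; WithAssoc adds the
# per-cluster/account/partition association columns to the user listing
sacctmgr -P show user sts WithAssoc
```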
Created attachment 13788 [details] test job submission script
Created attachment 13789 [details]
remediation script

Explicitly setting the DefaultAccount does not seem necessary; causing the slurmctld to be restarted remediates the problem.
Hi Steven,

Sorry for the delay. I'm still not able to reproduce the issue, but I have some clues to follow:

1) About the cluster registration:

> After cluster1 is fully up (slurmdbd included) & has run its own test jobs,
> the cluster2 nodes are provisioned. The first node is the cluster2 slurmctld
> scheduler node. It spins up its own slurmctld instance, which is configured
> to point to the cluster1's dbd. That's the point where the 'add cluster
> cluster2' is invoked, which succeeds.

I would like to know more details about the 'add cluster cluster2/vx' that you mentioned. Do you run a sacctmgr command in a provision/loader/shload.sh file or similar to "sacctmgr add cluster", or is this done automatically when vxsched is started? Or maybe you do both? In which order?

I think that it is just the slurmctld of both clusters that does the registration when started, but from your logs and comments I cannot be certain:

vc:
[2020-04-14T12:03:54.329] Registering slurmctld at port 6817 with slurmdbd
vx:
[2020-04-14T13:00:57.375] Registering slurmctld at port 6817 with slurmdbd

Could you increase the debug level of the slurmdbd to debug3? That would confirm how cluster2 was registered to vcdb.

2) About a couple of unexpected lines in the slurmctld logs:

vc:
[2020-04-14T12:03:54.864] killing old slurmctld[8076]
vx:
[2020-04-14T13:00:57.956] killing old slurmctld[8267]

It looks like on those freshly provisioned nodes there were already slurmctlds running? Do you see any reason why this could be? Maybe systemd is starting them before your provisioning script, or similar?

3) About the --uid:

The failing sbatch commands are run by root and use --uid; could you try to avoid --uid as a test?
The code path of --uid may be different from that of a normal submission, and in newer versions of slurm 20.02.x we made some fixes related to --uid:

https://github.com/SchedMD/slurm/blob/master/NEWS#L82
https://github.com/SchedMD/slurm/blob/master/NEWS#L86

I don't think the issue is related to it, but I would like to rule that out. Could you try to use sudo or something similar instead?

4) Small double-check question: the current workaround is restarting the slurmctld of *only* vx, right? The vc one is not restarted, right? Neither is vcdb, right?

Thanks,
Albert
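To make the suggestion concrete, a sudo-based variant of the failing reproduction command (flags copied from the report; whether it actually avoids the --uid code path difference is exactly what the test would show):

```shell
# original (run by root, fails until slurmctld is restarted):
#   sbatch --chdir=/tmp --qos=normal --uid=user --wrap=date
# variant submitting directly as the user, without --uid:
sudo -u user sbatch --chdir=/tmp --qos=normal --wrap=date
```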
> 4) Small double-check question: the current workaround is restarting the
> slurmctld of *only* vx, right? The vc is not restarted, right? Neither vcdb,
> right?

Yes. The current work-around is to restart the cluster2/vx slurmctld only.

> The failing sbatch commands are run by root and using --uid, could you try
> to avoid --uid as a test?
> Could you increase the debug level of the slurmdbd to debug3?

I will rework the tests to use sudo rather than the '--uid' mechanism, and set the debug level to 'debug3'. This stage of the automated test is a final system verification as a (pseudo-)random set of user job submissions.

> 2) About a couple of non expected lines in the slurmctld logs:
> vc:
> [2020-04-14T12:03:54.864] killing old slurmctld[8076]
> vx:
> [2020-04-14T13:00:57.956] killing old slurmctld[8267]

This is a consequence of the automated construction of the cluster:
The initial sched node starts a munge instance so that sacct commands may be run.
slurmctld is manually started.
QoS and other global db tables are populated.
The preliminary munge and slurmctld processes are stopped.
File system and directory permissions are set to what the systemd service files expect and require.
munge is started using systemd.
slurmctld is started using systemd.

This occurs on the vcsched or vxsched nodes, which do not have slurmd installed. The compute nodes do not have the slurmctld rpms installed.

If you have a well-provisioned (wrt. RAM and disk) machine, you can reproduce this from https://github.com/hpc/hpc-collab. The vc and vx recipes are there. (Feedback appreciated.)
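That construction sequence, as a shell sketch (the daemon names are real; the paths, the pkill-based shutdown, and the example sacctmgr call are assumptions rather than the actual hpc-collab recipe):

```shell
# 1. Preliminary daemons so db-loading commands can run:
munged
slurmctld

# 2. Populate QoS and other global db tables (illustrative call only):
sacctmgr -i add qos normal

# 3. Stop the preliminary instances:
pkill slurmctld
pkill munged

# 4. Set ownership/permissions expected by the systemd unit files:
chown -R slurm:slurm /var/spool/slurmctld /var/log/slurm

# 5. Start the production-like daemons under systemd:
systemctl start munge
systemctl start slurmctld
```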
Hi Steven,

> > The failing sbatch commands are run by root and using --uid, could you try to avoid --uid as a test?
> > Could you increase the debug level of the slurmdbd to debug3?
>
> I will rework the tests to use sudo rather than the '--uid' mechanism, and
> set the debug level to 'debug3'.

Great, thanks!

> > 2) About a couple of non expected lines in the slurmctld logs:
> > vc:
> > [2020-04-14T12:03:54.864] killing old slurmctld[8076]
> > vx:
> > [2020-04-14T13:00:57.956] killing old slurmctld[8267]
>
> This is a consequence of the automated construction of the cluster:
> The initial sched node starts a munge instance so that sacct commands may be
> run.

I guess that you mean the sacctmgr command, right? And I guess that those commands are the addition of (def)accounts, users, QOSes, and also the cluster(s)?

> slurmctld is manually started.

From your other comments and the logs I understand that this means both the vc and vx slurmctlds. Do you know why you need to start them at this stage? Note that most of the sacctmgr commands don't need the slurmctld to be running, only slurmdbd.

> QoS and other global db tables are populated.

I assume that that's through the sacctmgr commands mentioned above.

> The preliminary munge and slurmctld processes are stopped.
> File system and directory permissions are set to what the systemd service
> files expect and require.
> munge is started using systemd.
> slurmctld is started using systemd.

I can see the point of the manual part for munge and slurmdbd (if the sacctmgr commands are not run on the vcdb host), but I'm not certain about the need for the manual slurmctld. Anyway, it shouldn't be the source of the problem. I just mention it in case it helps you simplify the scripts, and also in case of uncontrolled daemons being launched. I'll note it when trying to get a reproducer.

> This occurs on the vcsched or vxsched nodes, which do not have slurmd
> installed. The compute nodes do not have the slurmctld rpms installed.
Ok, that matches what I saw in the logs.

> If you have a well-provisioned (wrt. RAM and disk) you can reproduce this
> from https://github.com/hpc/hpc-collab. The vc and vx recipes are there.
> (Feedback appreciated.)

Great! I'll play with it to see if it helps me reproduce the issue.

Thanks,
Albert
(In reply to Albert Gil from comment #21)
> From your other comments and the logs I understand that boths vc and vx
> slurmctld.
> Do you know why you need to start them at this stage?

The main reason is as an additional validation step. Since each cluster node has automated verification of its capabilities before successor nodes (such as the compute nodes) are built, we use it to validate at each stage.

Feel free to file an issue at https://github.com/hpc/hpc-collab/issues, or send direct messages, or code. Your (and others') feedback would be highly appreciated. As mentioned in that README, part of the motivation for this was to have similar test platforms to generate reproducers of problems, scenarios, alternate configurations, etc.
Hi Steven,

> The main reason is as an additional validation step. Since each cluster node
> has automated verification of its capabilities, before successor nodes, such
> as the compute nodes, we use it to validate at each stage.

Ok, that makes perfect sense! I'm starting to wonder if maybe the issue is actually related to that small detail of starting a new controller while one is already running... at least it's a clue to follow! ;-)

> Feel free to put in an issue at https://github.com/hpc/hpc-collab/issues or
> via direct messages, or code.
>
> Your (and other's) feedback would be highly appreciated. As mentioned in that
> README, part of the motivation for this was to have similar test platforms to
> generate reproducers of problems, scenarios, alternate configurations etc.

Noted. Some months ago we made public a similar tool that we use for a very similar purpose, jfyi:
https://gitlab.com/SchedMD/training/docker-scale-out

Regards,
Albert
(In reply to Albert Gil from comment #23)
> I'm starting wondering if maybe the issue is actually related to that small
> detail of starting a new controller while there is one opened... at least
> it's clue to follow! ;-)

There shouldn't be another controller running in each cluster. We only start them early to validate & test the commands. Then we stop that instance and start them "normally" via systemd, so that we have confidence that these test instances are as production-like as possible. I'll investigate further, though, to prove that my assumption about what is happening matches what should be happening.
What slurmdbd.conf DebugFlags should be set to go with DebugLevel=debug3?
Hi Steven,

Thanks for updating the version information.

> There shouldn't be another controller running in each cluster. We only start
> them early to validate & test the commands. Then we stop that instance and
> start them "normally" via systemd, so that we have confidence that these
> test instances are as production-like as possible. I'll investigate further,
> though, to prove that my assumption about what is happening matches what
> should be happening.

Ok. I think that there are slurmctlds still running once systemd starts the new ones, because of these couple of lines:

vc:
[2020-04-14T12:03:54.864] killing old slurmctld[8076]
vx:
[2020-04-14T13:00:57.956] killing old slurmctld[8267]

These log traces are only printed if an already running slurmctld is found:
https://github.com/SchedMD/slurm/blob/slurm-20.02/src/slurmctld/controller.c#L3007

> What slurmdbd.conf DebugFlags should be set to go with DebugLevel=debug3?

I was only interested in the debug3 level of slurmdbd, but now that you mention DebugFlags, maybe these could help us:

In slurmctld: DebugFlags=Agent
In slurmdbd.conf: DebugFlags=DB_EVENT,DB_ASSOC,DB_QUERY

Thanks,
Albert
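Collected into config fragments, the suggested settings would presumably look like this (only the debug-related lines are shown; note the level and flag parameters live in different files):

```ini
# slurm.conf (vx controller)
DebugFlags=Agent

# slurmdbd.conf (shared dbd)
DebugLevel=debug3
DebugFlags=DB_EVENT,DB_ASSOC,DB_QUERY
```

Both daemons re-read these on restart; slurmctld can also pick up slurm.conf changes via 'scontrol reconfigure'.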
Created attachment 15170 [details] slurmctld log
Created attachment 15171 [details] slurmctld core dump
Created attachment 15172 [details] slurmdbd log
Updated logs & slurmctld with flags set as requested & attached.
The slurmctld core dump (2020-07-25 23:29, https://bugs.schedmd.com/attachment.cgi?id=15171) is probably not relevant. It appears to be from a broken slurm-spank-lua plugin.
*** Ticket 9793 has been marked as a duplicate of this ticket. ***
This does not seem reproducible with the latest release.
Hi Steven,

Sorry for the (too) long delay on this one.

> This does not seem reproducible with the latest release.

I guess that this is good news. You mean 20.11.8, right? Have you done any other updates on the system besides Slurm?

I'm thinking of closing the bug as cannotreproduce. Would that be ok for you too?

Regards,
Albert
As I speak today it is 20.11.7, because due to our configuration we didn't need to rush to 20.11.8. Yes, please feel free to close this.

Thank you,
-Steve Senator

________________________________________
From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Wednesday, June 2, 2021 7:01:44 AM
To: Senator, Steven Terry
Subject: [EXTERNAL] [Bug 8849] slurmctld cache's DefaultAccount incorrectly; only reset with slurmctld restart
(In reply to S Senator from comment #35)
> As I speak today it is 20.11.7,

Thanks for the clarification.

> because due to our configuration we didn't need to rush to 20.11.8.

Good!

> Yes, please feel free to close this.

Ok. I'm marking it as cannotreproduce, not fixed though. Please feel free to reopen it if we run out of luck and the issue is reproduced again.

Thanks Steve!