Ticket 5213 - Account QoS vs User/Account association QoS and DefQoS DB field
Summary: Account QoS vs User/Account association QoS and DefQoS DB field
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 17.11.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Danny Auble
QA Contact:
URL:
Duplicates: 6305 7419 7502
Depends on:
Blocks:
 
Reported: 2018-05-25 11:43 MDT by Ben Matthews
Modified: 2020-02-13 02:40 MST
CC List: 3 users

See Also:
Site: UCAR
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.3 20.02.0-pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
sacctmgr output (1.74 KB, text/plain)
2018-05-25 11:45 MDT, Ben Matthews
Details
Disable association deletion (1.23 KB, patch)
2018-07-18 17:34 MDT, Tim Wickberg
Details | Diff
stand-alone version of the association checker from the slurm expect regression suite (3.25 KB, text/plain)
2019-06-07 15:43 MDT, Ben Matthews
Details
slurmctld log, july 3, 2019 (17.39 MB, application/x-bzip2)
2019-07-05 16:04 MDT, Ben Matthews
Details
Add debug when attempting to update association manager. (1.50 KB, patch)
2019-07-09 09:57 MDT, Danny Auble
Details | Diff
slurm logs for July 24 2019 (73.42 MB, application/x-bzip2)
2019-08-12 13:53 MDT, Ben Matthews
Details
Remove hung connection in the DBD (895 bytes, patch)
2019-08-12 16:44 MDT, Danny Auble
Details | Diff
Remove hung connection in the DBD v2 (850 bytes, patch)
2019-08-12 16:58 MDT, Danny Auble
Details | Diff

Description Ben Matthews 2018-05-25 11:43:14 MDT
So, I have the following:

matthews@cheyenne1:~> sacctmgr show assoc user=nserra
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
geyser_ca+  p07010002     nserra                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular
matthews@cheyenne1:~> sacctmgr show assoc account=p07010002
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
geyser_ca+  p07010002                               1                                                                                                           1-00:00:00                             normal
geyser_ca+  p07010002     nserra                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular


This probably arises from a bug in our accounting sync script, but I also don't think Slurm is handling it correctly. I don't know how this is supposed to work, but what I'd like to happen is for this user ("nserra") to be allowed to run under account P07010002 with QoS "regular" when they don't otherwise specify a QoS. Indeed, the user can run under "regular", and that is apparently their default QoS, but only if they specify the QoS manually. Why is this, and is it indeed not the correct behavior?

[Understanding that it's basically impossible to read in bugzilla, I'll also attach the above as a file in a minute]

matthews@cheyenne1:~> sudo su - nserra
Token_Response:
Resetting modules to system default
nserra@cheyenne1:> salloc -n 1 -t 500 -A P07010002 -p dav
salloc: error: Job submit/allocate failed: Invalid qos specification
nserra@cheyenne1:> salloc -n 1 -t 500 -A P07010002 -p dav --qos=regular
salloc: Pending job allocation 504605
salloc: job 504605 queued and waiting for resources
salloc: job 504605 has been allocated resources
salloc: Granted job allocation 504605
salloc: Waiting for resource configuration
salloc: Nodes geyser16 are ready for job
nserra@cheyenne1:> exit
salloc: Relinquishing job allocation 504605
Comment 1 Ben Matthews 2018-05-25 11:45:46 MDT
Created attachment 6943 [details]
sacctmgr output
Comment 2 Tim Wickberg 2018-05-25 12:00:17 MDT
Also - I assume you've checked this, but SLURM_QOS is not set in your user environments? And you're not modifying anything through a job_submit filter?

Are the various sacctmgr updates actually making it to the slurmctld?

You should be able to find this corresponding association in the output of 'scontrol show assoc' somewhere. If not, then I'm guessing you may have a problem keeping the controller and the database in sync - usually caused by an errant firewall.
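The suggested check is mechanical enough to script. A minimal sketch (not SchedMD code; the field names are taken from the `scontrol show assoc` lines quoted elsewhere in this ticket, and the function names are mine) that parses the controller's flat assoc output and tests whether a given account/user association made it over:

```python
import re

def parse_assoc_records(scontrol_output: str):
    """Parse 'scontrol show assoc' style output into dicts of key=value pairs.
    Only lines describing a full association (starting ClusterName=) are kept."""
    records = []
    for line in scontrol_output.splitlines():
        line = line.strip()
        if not line.startswith("ClusterName="):
            continue
        fields = dict(m.groups() for m in re.finditer(r"(\w+)=(\S*)", line))
        records.append(fields)
    return records

def has_association(scontrol_output: str, account: str, user: str) -> bool:
    """True if the controller's view contains an association for (account, user)."""
    return any(
        r.get("Account") == account
        and r.get("UserName", "").split("(")[0] == user
        for r in parse_assoc_records(scontrol_output)
    )
```

If an association shows up in `sacctmgr show assoc` but this check fails against the controller's output, the controller and database have diverged.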
Comment 3 Ben Matthews 2018-05-25 12:18:59 MDT
(In reply to Tim Wickberg from comment #2)
> Also - I assume you've checked this, but SLURM_QOS is not set in your user
> environments? And you're not modifying anything through a job_submit filter?
> 
Correct. 
nserra@cheyenne1:> env | grep SLURM
SLURMPATH=/ncar/opt/slurm/latest/
nserra@cheyenne1:>

The submit filter has recently been updated to ignore anything with a substring of "SLURM" (among a few other strings), so that shouldn't be in play either.


> Are the various sacctmgr updates actually making it to the slurmctld?
> 
> You should be able to find this corresponding association in the output of
> 'scontrol show assoc' somewhere. If not, then I'm guessing you may have a
> problem keeping the controller and the database in sync - usually caused by
> an errant firewall.

This is the first instance of this we've noticed. I do find the account listed under the "regular" QoS and associated with this user in the "scontrol show assoc" output. I don't find any mentions of Default QoS anywhere in that output. I'd be happy to give you the whole output, but we should probably mark this bug private if you need that. (?)
Comment 4 Ben Matthews 2018-05-25 13:11:07 MDT
Also, I find this in the ctld log a couple times:

2018-05-25T04:17:04-06:00 ysmgt2 slurmctld[11679]: error: This association 10762(account='p07010002', user='nserra', partition='(null)') does not have access to qos normal

I'm also able to confirm that slurmctld is getting updates from slurmdbd (I queued a job as myself with a very long wall time, verified that it wasn't going to run, then increased my limits and it eventually started).
Comment 5 Ben Matthews 2018-05-29 12:28:54 MDT
So, I can create another account that looks similar to this, and this problem doesn't happen - it's something specific to this account. It seems to be happening for any user running under this account. 

The problem survives an "scontrol reconfig". I haven't tried a slurmctld restart yet (and would very much rather not have to). Is there any diagnostic info I can provide that would help track this down?
Comment 6 Ben Matthews 2018-05-29 13:16:26 MDT
Per Tim's off-the-record suggestion, doing a slurmdbd -R run seems to have fixed this for now. We would still like a real fix though - slurmdbd -R takes the better part of a half hour of downtime.
Comment 7 Tim Wickberg 2018-06-27 15:04:31 MDT
I unfortunately don't have enough info to go off here for now.

I know you're stressing the database a lot harder than others. There are a few minor DB fixes coming in 17.11.8 that may address this issue, but I suspect it's something else lingering in the association manager code that we're still looking for.

At the moment, aside from blindly chasing symptoms, I think our best bet is to continue working through lock validation in that code, and ensure that is all operating as desired. Unfortunately that does not give me anything concrete to point to as a fix here.

If it's alright, I'll tag this as infogiven, and move back to pursuing those potential lingering issues elsewhere (on a few internal bugs).

- Tim
Comment 8 Ben Matthews 2018-06-27 15:21:43 MDT
If we can't make progress on this, I guess we can't make progress on this. That said, what would you need?

I do have a dump of the broken DB that I can share (privately) from right before doing slurmdbd -R and right after. I don't know if simply knowing what was wrong (or at least what changed) in the DB is enough of a clue?

If/when we see this again, what would you like us to collect?
Comment 9 Tim Wickberg 2018-06-28 08:30:04 MDT
(In reply to Ben Matthews from comment #8)
> If we can't make progress on this, I guess we can't make progress on this.
> That said, what would you need?

Ideally, a trace from exactly when the corruption occurs, which is some amount of time in advance of the crash.

> I do have a dump of the broken DB that I can share (privately) from right
> before doing slurmdbd -R and right after. I don't know if simply knowing
> what what was wrong (or at least what changed) in the DB is enough of a
> clue? 

It's a decent clue, but generally points at something in the assoc_mgr code having gone awry. Which is what I've suspected for a while - but finding it based on the side-effects is tricky. I will see what we can do to step up our verification efforts of that code - I suspect there is a missed lock leading to a race, which has made this difficult to reproduce.
Comment 10 Ben Matthews 2018-07-17 14:23:18 MDT
FYI, this has occurred again for a different user/project
Comment 11 Ben Matthews 2018-07-17 17:04:40 MDT
and survives having the user deleted and re-added
Comment 12 Ben Matthews 2018-07-18 11:16:09 MDT
This problem also survives having the entire account removed and readded. We'll try slurmdbd -R this afternoon.
Comment 13 Tim Wickberg 2018-07-18 16:26:43 MDT
Is it possible to, at least temporarily, avoid calling 'sacctmgr delete'?

I think that's the most likely source of these errors.

The trace you'd sent me doesn't have any of these calls included, but I'm also guessing that was a recent run that didn't demonstrate the problem. Can you walk back through and see if you can correlate the corruption to when your script would have deleted user accounts?
Comment 14 Tim Wickberg 2018-07-18 17:34:11 MDT
Created attachment 7351 [details]
Disable association deletion

This may be interesting to run with.

If we go with the theory that the association deletion code is leading to the rgt/lft corruption you've seen, this disables it. It's thus falling back to the behavior established for associations that have been in the table for over a day, so the only functional change is that more associations will stick around with deleted=1 set than previously.

Of course, it's possible the real issue is over on the association addition side, in which case this won't change anything.
Comment 15 Tim Wickberg 2018-08-20 19:38:30 MDT
*ping*
Comment 16 Ben Matthews 2018-08-27 14:21:20 MDT
*pong*

This is very intermittent, so I'm not sure what to tell you. I haven't had a chance to try to reproduce this under synthetic conditions, though the theory that our user services team has put forward would seem to agree with your patch - they think it's related to adding/deleting/readding a user within a couple update cycles (say, one day).
Comment 17 Ben Matthews 2019-01-03 11:18:33 MST
This is reproducible in 18.08. Of course, we did forget to pull the disable-deletion patch into our 18.08 build, so this is probably not the most useful tidbit.
Comment 19 Danny Auble 2019-01-14 15:28:14 MST
Ben, just to confirm, you haven't seen this issue since using Tim's no-delete patch?
Comment 20 Ben Matthews 2019-01-14 15:46:20 MST
True, though I'm not sure we were running the patch long enough to have confidence that it fixed anything.
Comment 21 Danny Auble 2019-01-14 16:12:49 MST
I will assume then the problem is related to the delete function.

Based on this assumption, I am led to believe you are deleting associations the same day they are formed, and that this appears to be common.
Comment 22 Ben Matthews 2019-01-14 16:15:14 MST
In at least two cases, including the most recent one, the impacted association was deleted and re-added within a short period of time, probably less than 24 hours. Seems like a good bet, but again, I haven't 100% convinced myself that this is the cause.
Comment 23 Danny Auble 2019-01-14 16:17:29 MST
The code the patch negates is only active if an association is added and then removed within 24 hours.  If re-adding were the problem, no matter how soon after, the patch wouldn't matter.
Comment 24 Danny Auble 2019-01-14 16:28:34 MST
I have been dumping and loading your file from bug 6305 without any issue.  Removing part of it and such.  It doesn't seem to affect anything.  All is working fine.

In our expect tests we have a function called check_acct_associations which tests the tree hierarchy.  So far I haven't been able to trigger the situation you are getting.  The hierarchy is always matching up.

What version of MySQL/Mariadb are you running?
Comment 26 Ben Matthews 2019-01-14 22:04:22 MST
root@slurm:/# mysql --version
mysql  Ver 15.1 Distrib 10.3.9-MariaDB, for Linux (ppc64le) using readline 5.1

This is a source build, configured with:

cmake -DMYSQL_DATADIR=/data -DWITH_SSL="system" -DBUILD_CONFIG=mysql_release ../

The whole server install is dockerized, so if you want a copy of our production environment (give or take some keys), it's not hard to provide. 

Normally, I would prefer not to roll our own like this, but at least at the time we deployed it, the latest and greatest was orders of magnitude faster than the distro-provided build, especially after adding/deleting a bunch of Slurm associations. My understanding is that there was some relatively recent work around preventing fragmentation on disk which is helping a lot. Some (most?) of the problem was rollbacks caused by sacctmgr operations that would harmlessly fail (adding things that already exist, for the most part) in early versions of our sync script, but even the current version benefits. 

I'm not sure if this is reproducible from a dump file or not. While we initially tried to use that approach to accounting, we found it to be less than robust, so currently we add/remove/edit associations one at a time. A recent version of the (admittedly ugly) script that does this (and some sample data) is attached to bug 6305. I'd rather not copy it to this bug, as the sample data probably shouldn't be shared publicly.

Side note: being able to transactionally tell Slurm (live) that I want the database to look like [dump file] and to just kill/cancel (or even ignore) any non-compliant jobs would be fantastic. 

At a high level, our accounting script runs at most every couple of hours. The delete process is to set MaxJobs=0 and MaxSubmit=0 and attempt to delete the association on every run until it actually goes away (i.e. until any queued jobs under that association finish). I suppose it's possible that repeated failed attempts to delete an association could be related? The associations that we see this happen to tend to be actively used. If you look at the .py script attached to bug 6305, remove_users_from_slurm() is the function that does this; csa() just shells out to sacctmgr.
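The drain-then-delete loop described above can be sketched as pure command construction (a sketch only; the exact sacctmgr spellings should be checked against your version, and the function name is mine, not from the attached script):

```python
def drain_then_delete_cmds(cluster: str, account: str, user: str):
    """One sync-run's worth of commands for retiring an association:
    first zero the limits so nothing new starts, then attempt the delete,
    which sacctmgr refuses while queued jobs still reference the
    association.  The caller re-runs both every cycle until the delete
    finally succeeds."""
    where = f"where name={user} cluster={cluster} account={account}"
    return [
        f"sacctmgr -i modify user {where} set MaxJobs=0 MaxSubmit=0",
        f"sacctmgr -i delete user {where}",
    ]
```

The repeated failed deletes this produces are exactly the pattern suspected of interacting badly with the within-24-hours delete path discussed earlier in the ticket.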
Comment 27 Danny Auble 2019-01-15 12:05:38 MST
I am glad you are running a fairly recent version of MariaDB.  That is a breath of fresh air :).

I am sad you couldn't get the load/dump to be robust.  I think the real long term fix here is to get that working for you as it is the best/lightest way to add/remove lots of things.

Yeah, I can see the script attached to bug 6305 causing all sorts of slow down given how much traffic is introduced every time each of those commands are issued.

To process your associations each time (on my box) it takes roughly around 2 seconds with nothing else going on.  So if you are issuing lots of changes it multiplies every time you change anything.  For better or worse when you add/remove an association the entire list is sent over to the slurmctld.  This could probably be improved, but would be fairly complex and might not help much in every situation.

If the delete code is causing issues it should be fixed, the problem is so far I haven't been able to find a reproducer or see any way to make it happen.

Would you mind running a non-connected, unpatched slurmdbd that mirrors the real one and perhaps reports when things go bad?

If so we might be able to figure out what happened.

Is this possible?
Comment 28 Ben Matthews 2019-01-15 12:40:07 MST
Running a second copy of Slurm that sees the same sacctmgr calls and posix users as production shouldn't be a problem, though it wouldn't see the other background job/query traffic that the real install does, which could be a problem if this is a race of some sort. What kind of information would you like out of it that I can't give from our production system? Every sacctmgr call? ...?

If all you want is to see it broken, I can probably pull a broken copy of the db from our backup system. 

I think I'll try to adapt your verification expect code as a health check for our production install. Right now we detect this by waiting for users to complain, which isn't ideal. 


I'd love to be able to use the dump/load facility. The biggest problem is that there's really no way to load a new file that removes associations without first deleting all of the associations, which is a bit hard to do without interrupting users right now. You need a "replace" command. 

dump/load also doesn't currently handle some parameters at all. See, for example, bug 4325. 

Really this is all a symptom of a larger problem with Slurm - many sites have their own accounting system which they'd like to continue to use. To make that happen, we have to either write a plugin to replace slurmdbd or come up with some kind of crazy synchronization process. I've talked to a lot of people about this, and I haven't found anyone with a good solution (but we're far from the only site with the problem). Having some facility to say to Slurm "here's the state I want, don't care what you do with already queued stuff" would be awesome.
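The "here's the state I want" facility asked for above reduces to a set difference between desired and current state. A toy sketch, assuming associations can be flattened to (account, user) pairs (a real version would carry cluster, partition, and limits too):

```python
def plan_replace(current, desired):
    """Diff two association sets for a make-it-so 'replace' load:
    returns (to_add, to_remove).  Anything only in `desired` must be
    added; anything only in `current` must be removed.  What to do with
    jobs queued under a removed association is the policy question the
    ticket discusses."""
    current, desired = set(current), set(desired)
    return sorted(desired - current), sorted(current - desired)
```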
Comment 29 Danny Auble 2019-01-15 15:45:13 MST
(In reply to Ben Matthews from comment #28)
> Running a second copy of Slurm that sees the same sacctmgr calls and posix
> users as production shouldn't be a problem, though it wouldn't see the other
> background job/query traffic that the real install does, which could be a
> problem if this is a race of some sort. What kind of information would you
> like out of it that I can't give from our production system? Every sacctmgr
> call? ...?

I think to catch this, have this in your slurmdbd.conf...

DebugFlags=DB_ASSOC,DB_QOS

> 
> If all you want is to see it broken, I can probably pull a broken copy of
> the db from our backup system.

That might be helpful.  I think all I will need is the cluster_assoc_table
 
> 
> I think I'll try to adapt your verification expect code as a health check
> for our production install. Right now we detect this by waiting for users to
> complain, which isn't ideal. 

Something like this...

more test.exp
source ./globals
source ./globals_accounting

if { ![check_acct_associations] } {
        send_user "\nFAILURE:  Our associations don't line up\n"
        incr exit_code 1
}

send_user "\nSUCCESS\n"

> 
> 
> I'd love to be able to use the dump/load facility. The biggest problem is
> that there's really no way to load a new file that removes associations
> without first deleting all of the associations, which is a bit hard to do
> without interrupting users right now. You need a "replace" command. 

Yeah, I can see what you are saying.  It adds missing things, but doesn't remove anything that isn't in the file.  Seems like a good enhancement request.  That should be possible, but clearly isn't there today.

> 
> dump/load also doesn't currently handle some parameters at all. See, for
> example, bug 4325. 

Yes, QOS was totally forgotten when this was done (perhaps it was done before QOS existed).  But I wouldn't expect that to be a show stopper, as it can be handled with a one-line sacctmgr command up front, since in most cases it only needs to happen once.

> 
> Really this is all a symptom of a larger problem with Slurm - many sites
> have their own accounting system which they'd like to continue to use. To
> make that happen, we have to either write a plugin to replace slurmdbd or
> come up with some kind of crazy synchronization process. I've talked to a
> lot of people about this, and I haven't found anyone with a good solution
> (but we're far from the only site with the problem). Having some facility to
> say to Slurm "here's the state I want, don't care what you do with already
> queued stuff" would be awesome.

In theory, 'load' should let you give it a file of the state you want and have it do the right thing.  While there are others you have talked to about this, you are the first person to be vocal to me about it.  We are usually open to supported enhancement requests of this nature.

I am not sure what you mean about "already queued stuff" I am guessing that means what is already in the database?
Comment 30 Ben Matthews 2019-01-15 16:38:21 MST
(In reply to Danny Auble from comment #29)

> I am not sure what you mean about "already queued stuff" I am guessing that
> means what is already in the database?

stuff mostly == jobs.

For the most part sacctmgr is very insistent that you not do something that makes jobs that are sitting in the queue invalid. Likewise with defaults (defaultqos, defaultaccount, etc). For example, you can't delete an association if there is a queued job that would run under that association. You can't delete an association without first updating the user's default account to be one that is still valid after the delete, etc. This is a pain for our current workflow -- there is a ton of fragile code in our sync script to work around this. It also makes the dump/load workflow basically impossible because you can only drop all the associations when there are no jobs in the queue. 

What I mean is that I don't really care how you handle any of this. If a job runs under an association that no longer exists because it was in the queue when that association was deleted, fine. If that job gets canceled to keep things consistent, also fine. It really doesn't matter. As long as our accounting is consistent within a day or so, everyone is happy here. When I had this conversation with Tim, he seemed surprised about this stance. 

> While there are others you have talked to about this you are the first person to be vocal to me.

Surprising. We can probably do something about that if it would help. I can tell you that the sites that I've talked to about this all have developed some sort of messy, site-specific workaround and have probably given up on complaining. As a relatively new Slurm site, I suppose the pain is still fresh for us.
Comment 32 Danny Auble 2019-01-16 09:08:30 MST
(In reply to Ben Matthews from comment #30)
> (In reply to Danny Auble from comment #29)
> 
> > I am not sure what you mean about "already queued stuff" I am guessing that
> > means what is already in the database?
> 
> stuff mostly == jobs.

Got it.  The statement makes sense now.

> 
> For the most part sacctmgr is very insistent that you not do something that
> makes jobs that are sitting in the queue invalid. Likewise with defaults
> (defaultqos, defaultaccount, etc). For example, you can't delete an
> association if there is a queued job that would run under that association.
> You can't delete an association without first updating the user's default
> account to be one that is still valid after the delete, etc. This is a pain
> for our current workflow -- there is a ton of fragile code in our sync
> script to work around this. It also makes the dump/load workflow basically
> impossible because you can only drop all the associations when there are no
> jobs in the queue. 
> 
> What I mean is that I don't really care how you handle any of this. If a job
> runs under an association that no-longer exists because it was in the queue
> when that association was deleted, fine. If that job gets canceled to keep
> things consistent, also fine. It really doesn't matter. As long as our
> accounting is consistent within a day or so, everyone is happy here. When I
> had this conversation with Tim, he seemed surprised about this stance.

Yeah, I can understand Tim's reaction.  Most people don't want to allow people to run willy-nilly on a cluster as you have implied here :).  We have found through time almost no one wants to have a job canceled randomly ever.  I suppose a flag could be added called 'Stormbreaker' or such where it does what you want by killing/canceling jobs as you remove associations/qos.  But this doesn't exist today and I don't know of any plans outside of a sponsored development that will add it.
 
> 
> > While there are others you have talked to about this you are the first person to be vocal to me.
> 
> Surprising. We can probably do something about that if it would help. I can
> tell you that the sites that I've talked to about this all have developed
> some sort of messy, site-specific workaround and have probably given up on
> complaining. As a relatively new Slurm site, I suppose the pain is still
> fresh for us.

I doubt more noise about it will help that much.  I do know people love reinventing wheels though.  But I am guessing you are correct that most if not all people have made their own way to interface with the database in some way.  I am hoping the end game for this is to get the 'load' to do what you want.  This process will not be covered in this bug though as it isn't related.  I think bug 4325 is more suited for this.  If you would like to sponsor the development to fast track it let me know.
Comment 33 Nate Rini 2019-01-18 11:33:05 MST
*** Ticket 6305 has been marked as a duplicate of this ticket. ***
Comment 34 Danny Auble 2019-01-28 09:22:27 MST
Ben, have you seen any issues with the unpatched backup yet?
Comment 35 Ben Matthews 2019-01-28 09:58:44 MST
Nope. I don't know what triggers this, but we go significant periods of time without seeing it (months) with or without the patch. 

Infrequent, but painful to recover from.
Comment 36 Danny Auble 2019-01-30 11:55:04 MST
I completely understand.  I will be waiting on you to update the bug when it happens again.

Thanks for running the duplicate on this.  It will be very helpful when it happens again.
Comment 37 Danny Auble 2019-02-21 10:42:59 MST
Ben, I am guessing it hasn't happened again yet?
Comment 38 Ben Matthews 2019-02-26 16:41:43 MST
Nope, not so far.
Comment 39 Danny Auble 2019-03-28 13:44:45 MDT
I am guessing it hasn't happened again?  I find it strange it was happening all the time and now hasn't happened since we started monitoring it :).
Comment 40 Ben Matthews 2019-03-28 14:49:45 MDT
As always happens... nope. 

I'm guessing this is made more likely by something our accounting people do; for example, the last time it was occurring a lot was right around the end of an allocation cycle. 


(In reply to Danny Auble from comment #39)
> I am guessing it hasn't happened again?  I find it strange it was happening
> all the time and now hasn't happened since we started monitoring it :).
Comment 41 Danny Auble 2019-04-18 15:43:11 MDT
When is the next allocation cycle?
Comment 42 Ben Matthews 2019-06-07 15:40:54 MDT
Ok, this has happened again:

sacctmgr show assoc user=jtillots
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
geyser_ca+   sssg0001   jtillots                    1                                                                                                           1-00:00:00               ampsrt,economy,prem+   regular

matthews@cheyenne1:~> sudo su - jtillots
Token_Response:
Resetting modules to system default
jtillots@cheyenne1:~> srun -p dav -A SSSG0001 -t 1:00:00 -n 1 -c 1 --pty bash
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Interestingly, the check for this from the regression suite seems to succeed. Basically, I pulled out check_acct_associations() from the test suite and fixed it to be standalone. I'll attach it to the ticket so you can check that I didn't do anything that would break it if you want. 

matthews@cheyenne1:~> expect slurm_check_assoc
Testing Associations

SUCCESS
Comment 43 Ben Matthews 2019-06-07 15:43:08 MDT
Created attachment 10545 [details]
stand-alone version of the association checker from the slurm expect regression suite
Comment 44 Ben Matthews 2019-06-07 15:47:15 MDT
By the way: this is one of our admin's accounts, so I can leave it broken for a short while if there's anything you want to look at in the db/etc.
Comment 46 Ben Matthews 2019-06-07 17:16:42 MDT
Actually, this latest instance might be a different case. It turns out that this user got their username changed (but not the UID), so the slurmctld and pretty much everything else disagree about the name/uid mapping. 

scontrol show assoc |grep UserName=jtillots
UserName=jtillots(4294967294) DefAccount=sssg0001 DefWckey=(null) AdminLevel=Not Set
ClusterName=geyser_caldera Account=sssg0001 UserName=jtillots(4294967294) Partition= ID=21568
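This class of mismatch is easy to scan for: a uid of 4294967294 in the controller's output is Slurm's NO_VAL sentinel (0xfffffffe), meaning the controller could not resolve the name (an assumption worth verifying against your slurm.h). A minimal detector sketch over `scontrol show assoc` output:

```python
import re

NO_VAL = 4294967294  # 0xfffffffe; assumed sentinel for a uid the controller could not resolve

def unresolved_users(scontrol_output: str):
    """Return user names whose uid the controller reported as NO_VAL."""
    bad = set()
    for name, uid in re.findall(r"UserName=([^\s(]+)\((\d+)\)", scontrol_output):
        if int(uid) == NO_VAL:
            bad.add(name)
    return sorted(bad)
```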
Comment 47 Danny Auble 2019-06-11 12:00:18 MDT
OK, so that is different ;).

Yeah you would have to go change that in the database...

sacctmgr update user jtillots set new=$NEWNAME
Comment 48 Ben Matthews 2019-07-05 12:36:52 MDT
Here's another one:

scontrol show assoc | grep cermani
UserName=cermani(31686) DefAccount=p93300301 DefWckey= AdminLevel=None
ClusterName=geyser_caldera Account=p93300301 UserName=cermani(31686) Partition= ID=19806
ClusterName=geyser_caldera Account=p93300601 UserName=cermani(31686) Partition= ID=19807

sacctmgr show assoc user=cermani
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
geyser_ca+   uesm0007    cermani                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular
geyser_ca+  p93300601    cermani                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular
geyser_ca+  p93300301    cermani                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular

 expect slurm_check_assoc
Testing Associations

SUCCESS

Login node:
 id cermani
uid=31686(cermani) gid=1000(ncar) groups=1000(ncar),69315(p93300301),69339(p93300601),73491(uesm0007)

Controller node:
id cermani
uid=31686(cermani) gid=1000(mysql) groups=1000(mysql),69315(p93300301),69339(p93300601),73491(uesm0007),100(users)


cermani@cheyenne1:~> salloc -n 1 -p dav --account uesm0007 -t 1:00:00
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified
cermani@cheyenne1:~> salloc -n 1 -p dav --account p93300301 -t 1:00:00
salloc: Granted job allocation 2874695
salloc: Waiting for resource configuration
^Csalloc: Job allocation 2874695 has been revoked.
cermani@cheyenne1:~>


So, maybe it's not the rgt/lft thing after all?
Comment 49 Danny Auble 2019-07-05 13:49:42 MDT
That one is interesting. Was there any error messages in the slurmdbd log? It would seem that association didn't make it to the slurmctld.

In this case I would expect restarting the slurmctld would fix the problem, but why the new association didn't make it to the slurmctld is fairly strange. I would expect an error message as to why in the dbd log.

Like you said, this doesn't seem like the same problem.
Comment 50 Ben Matthews 2019-07-05 14:59:11 MDT
The only thing in the slurmdbd log since the beginning of the month (user complained today, so I assume this is pretty recent) looks like:

slurmdbd: error: id_assoc 17533 doesn't have any tres

The number changes, but that's the only class of message. I can attach the log if you'd like
Comment 51 Danny Auble 2019-07-05 15:30:34 MDT
Interesting, that error message is sort of benign and wouldn't have anything to do with this.

Could you send me the mysql output of

select creation_time, mod_time, deleted, id_assoc from geyser_caldera_assoc_table where acct='uesm0007' and user='cermani';

I would expect the error to be around the mod_time, but if you aren't seeing that perhaps there is an error in the slurmctld log about that time?
Comment 52 Ben Matthews 2019-07-05 15:42:00 MDT
+---------------+------------+---------+----------+
| creation_time | mod_time   | deleted | id_assoc |
+---------------+------------+---------+----------+
|    1562180759 | 1562195491 |       0 |    21892 |
+---------------+------------+---------+----------+
1 row in set (0.001 sec)

That ID isn't one of the ones that slurmdbd is complaining about -- it doesn't show up in the log at all. 

Still working on splitting the slurmctld log from syslog. It's not small.
Comment 53 Danny Auble 2019-07-05 15:47:18 MDT
Ben, based on reading this whole bug again, I don't think lft/rgt has anything to do with it at the moment.  From all I can tell, messages are being lost on the way to the slurmctld, or the slurmctld is doing the wrong thing with the messages it got.

I wish we had the slurmctld log from the comment 0 failed request without a QOS.  It would have told us what QOS was being requested.  The output of scontrol show assoc would also have shown us what the slurmctld thought the default qos was.

This particular one was created (Wed Jul  3 17:11:31 2019) and then updated 4 hours later (Wed Jul  3 13:05:59 2019).

Could you send me the output of

sacctmgr list txn ids=21892 -p

This would tell us exactly what was going on.

But I am surprised you aren't seeing any error messages.

It is expected the error messages wouldn't have this id in them as they aren't relevant.
Comment 54 Danny Auble 2019-07-05 15:48:46 MDT
(Times reversed in that last comment :))

Created Wed Jul  3 13:05:59 2019
Modified Wed Jul  3 17:11:31 2019
Comment 55 Ben Matthews 2019-07-05 15:53:04 MDT
root@slurm:/# sacctmgr list txn ids=21892 -p
Time|Action|Actor|Where|Info|
2017-06-19T16:53:04|Add Users|root|hunsaker|admin_level=1|

root@slurm:/# id hunsaker
uid=25882(hunsaker) gid=1000(mysql) groups=1000(mysql),1565(dsg),69379(sess0001),100(users)


We do combined syslog for everything, archived for the life of a machine, so I probably do have the logs for the first instance of this (though I recall there was some brief issue with log forwarding). I can look if you think it would be helpful.
Comment 56 Ben Matthews 2019-07-05 16:04:32 MDT
Created attachment 10816 [details]
slurmctld log, july 3, 2019

Nothing is jumping out at me, but I guess I don't know what I'm looking for. Attached the entire slurmctld log from that day.
Comment 57 Danny Auble 2019-07-05 16:16:08 MDT
If you wouldn't mind, 504605 is the job that succeeded, so whenever that job ran right before it we could get the log.

The error message will be something like...

Jul  5 16:07:01.451738 12864 srvcn        0x7fb5007d3700: error: Invalid qos (dasdas)
Jul  5 16:07:01.451808 12864 srvcn        0x7fb5007d3700: _slurm_rpc_allocate_resources: Invalid qos specification 

At least we would know what they tried to request.

The txn history isn't very helpful, other than showing that after 21892 was added the next transaction wasn't done until way later.  It must have been a large operation.

The one strange thing in your log is that I see no ACCOUNTING_UPDATE_MSG messages coming through.  This would mean you are never getting anything from the DBD.

You should see

Processing RPC: ACCOUNTING_UPDATE_MSG from uid=XXXX

every time a message is sent from the database.  But your log has none.  This would go along with my theory that messages are being lost going to your slurmctld.

Would you try to do something like add a new account or something and see if it makes it through?
Comment 58 Ben Matthews 2019-07-05 16:51:53 MDT
(In reply to Danny Auble from comment #57)
> If you wouldn't mind, 504605 is the job that succeeded, so whenever that job
> ran right before it we could get the log.
> 

I don't think I understand what you're asking for here? Do you want output from the actual job? I'm unlikely to be able to get that for you.

> 
> You should see
> 
> Processing RPC: ACCOUNTING_UPDATE_MSG from uid=XXXX
> 

> Would you try to do something like add a new account or something and see if
> it makes it through?

I did this: 

root@slurm:/# sacctmgr add user matthews account=scsg0001 cluster=Geyser_Caldera
 Associations =
  U = matthews  A = scsg0001   C = geyser_cal
 Non Default Settings
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

And I don't see ACCOUNTING_UPDATE_MSG (or anything that seems relevant) in the log -- just scheduling information and job starts/finishes (background noise). That sort of thing typically happens many times per day. There should be 10s or 100s of examples in the log that I provided
Comment 59 Danny Auble 2019-07-05 16:55:33 MDT
(In reply to Ben Matthews from comment #58)
> (In reply to Danny Auble from comment #57)
> > If you wouldn't mind, 504605 is the job that succeeded, so whenever that job
> > ran right before it we could get the log.
> > 
> 
> I don't think I understand what you're asking for here? Do you want output
> from the actual job? I'm unlikely to be able to get that for you.
> 
Just output from the slurmctld log before that job was run. 

> > 
> > You should see
> > 
> > Processing RPC: ACCOUNTING_UPDATE_MSG from uid=XXXX
> > 
> 
> > Would you try to do something like add a new account or something and see if
> > it makes it through?
> 
> I did this: 
> 
> root@slurm:/# sacctmgr add user matthews account=scsg0001
> cluster=Geyser_Caldera
>  Associations =
>   U = matthews  A = scsg0001   C = geyser_cal
>  Non Default Settings
> Would you like to commit changes? (You have 30 seconds to decide)
> (N/y): y
> 
> And I don't see ACCOUNTING_UPDATE_MSG (or anything that seems relevant) in
> the log. -- just scheduling information and job starts/finishes (background
> noise). That sort of thing typically happens many times per day. There
> should be 10s or 100s of examples in the log that I provided

Were you able to run a job with that account afterwards? You would only see the message at debug. From your log it looks like you were running that way.
Comment 60 Ben Matthews 2019-07-05 17:02:07 MDT
> Were you able to run a job with that account afterwards? You would only see
> the message at debug. From your log it looks like you were running that way.

Actually, no. Neat. I also did a modify to associate it with a QoS. 

matthews@cheyenne1:~> salloc --time 6:00:00 --partition dav --ntasks 1  --account scsg0001 --qos=regular
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

matthews@cheyenne1:~> sacctmgr show assoc user=matthews
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
geyser_ca+   sssg0001   matthews                    1                                                                                                           8-00:00:00               ampsrt,economy,prem+   regular
geyser_ca+   scsg0001   matthews                    1                                                                                                           1-00:00:00                            regular
Comment 61 Danny Auble 2019-07-05 17:04:55 MDT
There you go, nothing from your dbd is making it to your slurmctld for some reason.

No errors in the dbd?

Firewall?
Comment 62 Ben Matthews 2019-07-05 17:15:08 MDT
No firewall, they're on the same host (container)
Comment 63 Danny Auble 2019-07-05 17:35:17 MDT
Interesting, but this is your problem. For some reason the dbd can't talk to the slurmctld. Are they both running inside the same container?

Can you run sinfo from inside the dbd container or a dbd like container?
Comment 64 Ben Matthews 2019-07-08 13:51:52 MDT
I don't think that I buy that this is a firewall problem. We still have jobs showing up in sacct, which is a direct query to the dbd, no? 

root@slurm:/# sacct -a -S 2019-07-05 -E 2019-07-06 | wc -l
12808

root@slurm:/# sacct -a -S 2019-07-06 -E 2019-07-07 | wc -l
3285

root@slurm:/# sacct -a -S 2019-07-07 -E 2019-07-08 | wc -l
8643

For running jobs, as far as I can tell squeue/scontrol and sacct all agree.
Comment 65 Danny Auble 2019-07-08 14:06:36 MDT
The connection to the dbd is persistent from the slurmctld, not the other way around.  It is established right when the slurmctld starts up, tells the dbd how to contact it, and is unidirectional.

sacctmgr list clusters

Will give you what clusters are attached to the dbd...

sacctmgr list clusters format=cluster,controlhost,controlport
   Cluster     ControlHost  ControlPort 
---------- --------------- ------------ 
smd-server   192.168.1.253        34113 
 snowflake                            0 

you will notice in this example that the smd-server cluster is attached at 192.168.1.253:34113, but cluster snowflake is not attached.  If it were, the output would be like

   Cluster     ControlHost  ControlPort 
---------- --------------- ------------ 
smd-server   192.168.1.253        34113 
 snowflake     192.168.1.2        34213 

Since you are getting data to the dbd, things appear good in that one direction, but the other way is apparently not possible.  The connection from slurmdbd to slurmctld is only made when updates need to be sent, using the information in that table listed by sacctmgr list clusters.

I am not saying it is a firewall, but something is stopping the messages from going.  In any case, I find it very strange that you don't see any error message in the slurmdbd log.

If you have "debug" level debug you should see messages like

debug: sending updates to...

right before it sends.

Then I would expect an error about why it didn't work and a

error: update cluster: ...

message.

Perhaps seeing if the cluster is attached appropriately and going from there will shed some light on the matter.  Running sinfo/squeue or such from where the dbd is running should be a fantastic debugger, as it should be attaching to the same ip/port.  That is, unless you are doing some magic to make the ctld talk to the dbd on a different ip than what is in your slurm.conf.
Comment 66 Danny Auble 2019-07-08 14:09:06 MDT
sacct uses the same kind of logic as the slurmctld; its connection should behave very similarly, so it wouldn't be a surprise that it is working correctly in this situation.
Comment 67 Ben Matthews 2019-07-08 14:17:46 MDT
There is some magic, but slurmdbd and slurmctld run on the same host (container) on a private network. External things get to them via some transparent NAT (due to the stupid way that docker does networking). 

This does indeed look unhappy:

root@slurm:/# sacctmgr list clusters format=cluster,controlhost,controlport
   Cluster     ControlHost  ControlPort
---------- --------------- ------------
geyser_ca+                            0


root@slurm:/# ps aux | grep slurm | head -n 2
lsfadmin     13  0.0  0.0 614848 97664 pts/0    Sl   Jun25   4:44 /usr/local/sbin/slurmdbd -D -v
lsfadmin     14  0.4  0.1 6558656 261696 pts/0  Sl   Jun25  77:15 /usr/local/sbin/slurmctld -D -f /etc/slurm/slurm.conf -v
root@slurm:/#

root@slurm:/# grep -i slurm\.ssg /etc/slurm/slurm.conf
ControlMachine=slurm.ssg.ucar.edu
AccountingStorageHost=slurm.ssg.ucar.edu

From this container, it's a private network:

root@slurm:/# ping slurm.ssg.ucar.edu
PING slurm.ssg.ucar.edu (192.168.100.21): 56 data bytes
64 bytes from 192.168.100.21: icmp_seq=0 ttl=64 time=0.065 ms
64 bytes from 192.168.100.21: icmp_seq=1 ttl=64 time=0.054 ms

From the rest of the world, these rules on the host do the magic:

-A POSTROUTING -o publicbr -s 192.168.100.21  -j SNAT --to 128.117.181.205
-A PREROUTING -d 128.117.181.205 -j DNAT --to-destination 192.168.100.21

(where 128.117.181.205 maps to slurm.ssg.ucar.edu in public DNS). 

As far as I know, none of this has changed in a long time, but it should still work in the Slurm controller's context even if it worked nowhere else (as far as I can tell, everything can talk to Slurm just fine).


Anyway, what's a good test to see what's failing? strace slurmdbd and try to update an association? I think that would be less invasive than looking at the ctld's state?
Comment 68 Danny Auble 2019-07-08 14:32:46 MDT
OK, now things are starting to make sense.

The fact that the cluster doesn't have a ControlHost|Port makes it clear why no errors are happening.  It just doesn't think it has a connection so it never sends any updates.  That is extremely strange.

Could you restart your dbd with debug level "debug" on?

I would expect a message like

debug:  REQUEST_PERSIST_INIT: CLUSTER:snowflake VERSION:8704 UID:7558 IP:192.168.1.2 CONN:7

to happen almost immediately.

Then after that I would expect sacctmgr list clusters to say it is connected.

Why it isn't connected now is the super strange part, especially since you are getting data for jobs.
Comment 69 Danny Auble 2019-07-08 14:37:33 MDT
Also putting

DebugFlags=DB_ASSOC

in your slurmdbd.conf (not slurm.conf) and reconfig or restart the dbd will give you the exact sql we use to put the registration into the database right after that "REQUEST_PERSIST_INIT" message.
Comment 70 Ben Matthews 2019-07-08 14:54:17 MDT
Since we're still limping along, I have a restart scheduled for Wednesday in case things go badly or I have to restart the slurmctld as well. In the meantime, what can we do to understand what happened? Is there any way short of a restart to force a reconnect? I'm aware that restarting the dbd should be safe, but I've been burned before. 

It seems like if you were running this over an actual network, disconnects would not be impossible, and it should just retry.
Comment 71 Danny Auble 2019-07-08 15:05:01 MDT
I would expect restarting the dbd outside of an upgrade to be very low risk.  Not as low risk as running sinfo, but not too far up from that.  With that said I can understand a conservative approach.

I am very surprised there is a half connect here.  It is clear you are getting information to the database meaning the slurmctld is connected to the dbd, but it is ultra strange how that is happening where the database doesn't have a record of the connection.

There is no way to force a reconnect outside of restarting either the slurmctld or the slurmdbd.

One idea you could try is adding another cluster, having a slurmctld representing that cluster register, and seeing what happens.  In this method you would be exercising the same logic, restarting as many times as you would like, just not disturbing the running system.
Comment 72 Ben Matthews 2019-07-08 15:22:07 MDT
Well, you convinced enough people that we just restarted slurmdbd. A couple things:

This is all I got in the logs:

Jul  8 15:08:17 stargate4 supervisord: slurmdbd slurmdbd: Terminate signal (SIGINT or SIGTERM) received
Jul  8 15:08:17 stargate4 supervisord: slurmdbd slurmdbd: Terminate signal (SIGINT or SIGTERM) received
Jul  8 15:08:23 stargate4 supervisord: slurmdbd slurmdbd: Accounting storage MYSQL plugin loaded
Jul  8 15:08:23 stargate4 supervisord: slurmdbd slurmdbd: Not running as root. Can't drop supplementary groups
Jul  8 15:08:23 stargate4 supervisord: slurmdbd slurmdbd: Accounting storage MYSQL plugin loaded
Jul  8 15:08:23 stargate4 supervisord: slurmdbd slurmdbd: Not running as root. Can't drop supplementary groups
Jul  8 15:08:25 stargate4 supervisord: slurmdbd slurmdbd: slurmdbd version 18.08.3 started
Jul  8 15:08:25 stargate4 supervisord: slurmdbd slurmdbd: slurmdbd version 18.08.3 started


We did have "DebugLevel=4" in slurmdbd.conf; I changed it to "debug2" to match everything else. 

Upon seeing that this didn't make much difference, I set DebugLevelSyslog as well, which got me a couple more messages, but still nothing containing "REQUEST_PERSIST_INIT".

Regardless, I'm now able to add a new association for my user and run a job under it, so that's good. This looks better too:

root@slurm:/# sacctmgr list clusters format=cluster,controlhost,controlport
   Cluster     ControlHost  ControlPort
---------- --------------- ------------
geyser_ca+  192.168.100.21         6817

That said, user cermani/account uesm0007 still doesn't work.

root@slurm:/# scontrol show assoc | grep cermani
UserName=cermani(31686) DefAccount=p93300301 DefWckey= AdminLevel=None
ClusterName=geyser_caldera Account=p93300301 UserName=cermani(31686) Partition= ID=19806
ClusterName=geyser_caldera Account=p93300601 UserName=cermani(31686) Partition= ID=19807

root@slurm:/# sacctmgr show assoc user=cermani
   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
geyser_ca+   uesm0007    cermani                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular
geyser_ca+  p93300601    cermani                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular
geyser_ca+  p93300301    cermani                    1                                                                                                           1-00:00:00               economy,premium,reg+   regular


So, I think we're back where we started? At least we're not worse off.
Comment 73 Danny Auble 2019-07-08 15:29:32 MDT
Well, you are actually way better off than before.

192.168.100.21         6817

in the database means all is working.  Why that wasn't there or how it goes away is really strange.

You don't happen to have anything randomly going and clearing those columns do you?

For future reference, with DebugLevel, 4 = info and 5 = debug...  But just using debug or debug2 or whatever should be sufficient.

Note I rarely if ever tell people to ever run in production higher than debug as debug2 is quite heavy and affects performance greatly (especially in the slurmctld).  It is up to you, but if it isn't hurting things you can leave it though in most cases it shouldn't be needed.

Note with debug you should see every time an sacct connects with a REQUEST_PERSIST_INIT.  If you aren't seeing that I would count that as very strange.
Comment 74 Danny Auble 2019-07-08 15:31:56 MDT
I missed that you couldn't see the missing associations.

Restarting the dbd does not send new info back to the slurmctld, only restarting the slurmctld would.  If you create a new association I would expect it to show up though.

Try adding something new and see if it takes.
Comment 75 Ben Matthews 2019-07-08 15:33:11 MDT
Yes, new things seem to take, but there are still some old ones that haven't. Is there a way to fix that short of restarting slurmctld, which I'd really rather not do?
Comment 76 Danny Auble 2019-07-08 15:39:24 MDT
There is a sneaky way to do this.

Try adding an account and then moving its parent.  This will force the entire association tree to be sent.

sacctmgr add acct crap parent=p93300301
sacctmgr mod acct crap set parent=root
sacctmgr del acct crap

Should do it.

I completely understand the desire to avoid having to restart the controller.
Comment 77 Ben Matthews 2019-07-08 15:40:15 MDT
Oh, and no, nothing is randomly clearing anything in the DB. We do have a periodic backup, but that's hopefully read-only.

That said, we do

sacctmgr create cluster [] 

Several times per day and let it fail. That way the update script can re-build everything on a completely clean slurm install if necessary. Of course it errors out in the average case, but that hasn't been a problem in the past. We do the same for QoS levels and a few other things. Associations I actually diff and only add if necessary for performance reasons.
Comment 78 Danny Auble 2019-07-08 15:44:57 MDT
When you do your backup make sure you have --single-transaction on.  Without it perhaps you are missing a registration message.

I wouldn't expect

sacctmgr create cluster []

To do anything here either.  I have never really tested doing [] as a cluster name; it seems to work just fine (minus the error from the hostlist code), though it shouldn't affect the other clusters in the database.
Comment 79 Ben Matthews 2019-07-08 15:49:01 MDT
Your suggestion from #76 doesn't seem to have changed anything:

root@slurm:/# scontrol show assoc | grep cermani
UserName=cermani(31686) DefAccount=p93300301 DefWckey= AdminLevel=None
ClusterName=geyser_caldera Account=p93300301 UserName=cermani(31686) Partition= ID=19806
ClusterName=geyser_caldera Account=p93300601 UserName=cermani(31686) Partition= ID=19807
root@slurm:/#


And sorry, I guess I was unclear: [] just means I'm being too lazy to type out the full cluster name. The actual call should be:

sacctmgr -i create cluster Geyser_Caldera MaxWall=1-0:0:0
Comment 80 Danny Auble 2019-07-08 16:05:52 MDT
Would you mind sending me any errors from the slurmctld when you did that sequence? I would have expected it to work, but perhaps it only works when the associations exist. You could always remove and then re-add. That was my only real trick to get everything to go over.

Yeah I would expect reading the same cluster to be benign.
Comment 81 Ben Matthews 2019-07-08 16:35:00 MDT
There's a bunch of stuff that looks like:

Jul  8 15:42:26 stargate4 supervisord: slurmctld slurmctld: Warning: Note very large processing time from _slurm_rpc_accounting_update_msg: usec=15596055 began=15:42:10.572
Jul  8 15:42:26 stargate4 supervisord: slurmctld slurmctld: error: assoc_mgr_update gave error: Unspecified error
Jul  8 15:42:26 stargate4 supervisord: slurmctld slurmctld: build_job_queue has run for 2000000 usec, exiting with 100 of 216 jobs tested, 0 job-partition pairs added
Jul  8 15:42:26 stargate4 supervisord: slurmctld slurmctld: Warning: Note very large processing time from schedule: usec=5803511 began=15:42:20.365

or this:
Jul  8 15:44:44 stargate4 supervisord: slurmctld slurmctld: Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=3153910 began=15:44:41.230
Jul  8 15:44:44 stargate4 supervisord: slurmctld slurmctld: Warning: Note very large processing time from dump_all_job_state: usec=5390769 began=15:44:39.006

or this:

Jul  8 15:53:38 stargate4 supervisord: slurmctld slurmctld: error: assoc_mgr_update gave error: Unspecified error
Jul  8 15:53:38 stargate4 supervisord: slurmctld slurmctld: error: assoc_mgr_update gave error: Unspecified error


That's about all I'm seeing that looks like errors. If you want the whole log let me know.
Comment 82 Danny Auble 2019-07-08 16:41:49 MDT
error:
assoc_mgr_update gave error: Unspecified error

is what I was looking for; the others are just FYI-type messages. It does confirm my suspicion: there is no way to short-circuit the adding of missing associations. Outside of restarting, removing and then re-adding is the only way to get them to go in.
Comment 83 Ben Matthews 2019-07-08 19:20:13 MDT
I'll try removing and re-adding. 

Is there a way to identify associations in this state (aside from waiting for user complaints)? What causes this? The error message isn't so great.
Comment 84 Danny Auble 2019-07-09 09:57:27 MDT
Created attachment 10846 [details]
Add debug when attempting to update association manager.

Ben,

I agree, the error message isn't the most helpful.  This patch will go into 19.05 and will give the information you need.  But since restarting would be required to get it in, I don't think it will be much help at the moment, though it will probably help in the future.

As noted before, the missing information appears to have been caused when your cluster table loses the ip and port of the slurmctld.  How that is happening is the current mystery.  It appears this has been the problem from the start, though.

I have confirmed

sacctmgr -i create cluster Geyser_Caldera MaxWall=1-0:0:0

Does what you would expect and does not appear to contribute to this scenario.

When it does happen again, as I am guessing it eventually will, please grab this info from the database...

select creation_time, mod_time, name, control_host, control_port, last_port from cluster_table where name="geyser_caldera";

The creation_time I am expecting will never change; I would expect the mod_time to change when the slurmctld connects or disconnects. If the mod_time doesn't match those times (you can verify in the slurmctld.log/slurmdbd.log), there is something external to Slurm that is modifying things.  If it does match up and we are missing something, the slurmdbd logs should be helpful in this situation, as they should now be giving messages like

debug: cluster geyser_caldera has disconnected

when your slurmctld goes down.

I would expect you to be getting message about REQUEST_PERSIST_INIT all the time now as well.

I don't think you are running backup slurmctld's or dbd's, are you?  If so, perhaps that is messing something up?  I don't think you are, though.
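[Editor's note: the question in comment 83, identifying associations stuck in this state without waiting for user complaints, could roughly be answered by diffing the association IDs the two daemons report. A hedged sketch; the sacctmgr/scontrol invocations in the comments are assumptions about parsable output, and only the generic set-difference helper is exercised here:]

```shell
#!/bin/sh
# Find association IDs the slurmdbd knows about that never reached the
# slurmctld. Anything printed by missing_ids is a candidate stale assoc.

# missing_ids FILE_A FILE_B -> lines present in A but not in B
missing_ids() {
    a=$(mktemp); b=$(mktemp)
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Live usage might look something like (flags/fields assumed):
#   sacctmgr -n -P show assoc format=id                          > dbd_ids
#   scontrol show assoc_mgr | grep -o 'ID=[0-9]*' | cut -d= -f2  > ctld_ids
#   missing_ids dbd_ids ctld_ids
```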
Comment 89 Ben Matthews 2019-07-11 13:08:21 MDT
No, we're not running failover at all. We'd like to implement that, but nobody has had the time. A restart of slurmctld did indeed fix it, so until next time I guess ... Glad it finally seems like we're making a little progress on this.
Comment 90 Danny Auble 2019-07-12 10:20:13 MDT
Comment on attachment 10846 [details]
Add debug when attempting to update association manager.

Ben, I wouldn't mind if you put in a cron script just running

sacctmgr list clusters

and waiting for the host and port to zero out.  Perhaps we can pinpoint the moment when it happens.

I will note that this debug patch is in 19.05.1 commit ad0d3776cf602.  But it will only give a note about what associations were not there on the update, so helpful when you are trying to figure that out, but not much otherwise.

I do think this is the whole problem though.  I have never heard of it happening from other sites and can't readily figure out how it is possible.

I am glad we are making progress on it though.
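[Editor's note: the cron check suggested above can be sketched as a small script. The `sacctmgr -n -P ... format=cluster,controlhost,controlport` invocation is an assumption about parsable output (one `name|host|port` line per cluster); only the line-checking helper is exercised here:]

```shell
#!/bin/sh
# Flag when the DBD's record of a cluster has lost its ControlHost/ControlPort,
# i.e. the slurmctld registration has zeroed out.

# check_line 'name|host|port' -> OK or DISCONNECTED
check_line() {
    host=$(printf '%s' "$1" | cut -d'|' -f2)
    port=$(printf '%s' "$1" | cut -d'|' -f3)
    if [ -z "$host" ] || [ "$port" = "0" ]; then
        echo "DISCONNECTED"
    else
        echo "OK"
    fi
}

# From cron, something like:
#   sacctmgr -n -P list clusters format=cluster,controlhost,controlport |
#   while read -r line; do
#       [ "$(check_line "$line")" = OK ] || logger "slurm cluster detached: $line"
#   done
```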
Comment 91 Danny Auble 2019-07-26 10:22:44 MDT
Ben, I'm guessing you haven't found it happen again yet?
Comment 92 Ben Matthews 2019-08-09 11:48:19 MDT
Actually, it happened again while I was on vacation. I've just restarted the relevant bits today, but I did get the mod_time on the cluster again. I'll see what I can find in the logs based on that and get back to you. 

MariaDB [slurm_acct_db]> select creation_time, mod_time, name, control_host, control_port, last_port from cluster_table where name="geyser_caldera";
+---------------+------------+----------------+--------------+--------------+-----------+
| creation_time | mod_time   | name           | control_host | control_port | last_port |
+---------------+------------+----------------+--------------+--------------+-----------+
|    1497649922 | 1563999481 | geyser_caldera |              |            0 |      6817 |
+---------------+------------+----------------+--------------+--------------+-----------+
1 row in set (0.000 sec)
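[Editor's note: to line the epoch values above up against log timestamps, they can be converted with GNU date, e.g.:]

```shell
#!/bin/sh
# Convert the epoch seconds stored in cluster_table into readable UTC
# timestamps, for matching against slurmctld/slurmdbd log lines.
epoch_to_date() {
    date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}

epoch_to_date 1563999481   # the mod_time above -> 2019-07-24T20:18:01Z
```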
Comment 93 Danny Auble 2019-08-12 10:02:10 MDT
Well, at least we may have something to determine what is happening here.  Let me know what you find.
Comment 94 Ben Matthews 2019-08-12 12:13:19 MDT
Not a ton to go on, but this seems like the interesting bit:

2019-07-24T14:18:00.321587-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_PARTITION_INFO uid=16745
2019-07-24T14:18:00.321678-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_dump_partitions, size=972 usec=51
2019-07-24T14:18:00.329470-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_FED_INFO from uid=16745
2019-07-24T14:18:00.329703-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_get_fed usec=232
2019-07-24T14:18:00.331413-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_PARTITION_INFO uid=16745
2019-07-24T14:18:00.331505-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_dump_partitions, size=972 usec=52
2019-07-24T14:18:01.151616-06:00 stargate4 slurmdbd[14]: debug2: persistent connection is closed
2019-07-24T14:18:01.151736-06:00 stargate4 slurmdbd[14]: debug:  Problem sending response to connection 18(192.168.100.21) uid(210)
2019-07-24T14:18:01.151800-06:00 stargate4 slurmdbd[14]: debug2: Closed connection 18 uid(210)
2019-07-24T14:18:01.151862-06:00 stargate4 slurmctld[15]: Warning: Note very large processing time from dump_all_job_state: usec=23258181 began=14:17:37.893
2019-07-24T14:18:01.152227-06:00 stargate4 supervisord: slurmctld slurmctld: Warning: Note very large processing time from dump_all_job_state: usec=23258181 began=14:17:37.893
2019-07-24T14:18:01.152522-06:00 stargate4 slurmdbd[14]: debug:  cluster geyser_caldera has disconnected
2019-07-24T14:18:01.152661-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 14:18:01,151 DEBG 'slurmctld' stderr output:
2019-07-24T14:18:01.153020-06:00 stargate4 aa0e3af76781[4937]: slurmctld: Warning: Note very large processing time from dump_all_job_state: usec=23258181 began=14:17:37.893
2019-07-24T14:18:02.305942-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_FED_INFO from uid=16745
2019-07-24T14:18:02.306355-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_get_fed usec=447
2019-07-24T14:18:02.307944-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_PARTITION_INFO uid=16745
2019-07-24T14:18:02.308041-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_dump_partitions, size=972 usec=63
2019-07-24T14:18:02.321187-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_FED_INFO from uid=16745
2019-07-24T14:18:02.321484-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_get_fed usec=334
Comment 95 Danny Auble 2019-08-12 13:40:49 MDT
I'll try to reproduce.  It does seem interesting though.  I wonder what was going on on the node at this time.  It is even more interesting that the dbd sees the connection close and somehow the slurmctld is still trying to write to it.

I am guessing this is just a snip of the logs, could you send me the full logs from 5 seconds before and afterwards?
Comment 96 Ben Matthews 2019-08-12 13:53:28 MDT
Created attachment 11198 [details]
slurm logs for July 24 2019

Attaching a whole day of logs from slurm (and things matching the regex "slurm") since that's the easy unit that I have. If you want 5 seconds before, you probably want to start looking at about T14:17:56. 

Note that the system was in this state for a couple of weeks; if you want another day or two of logs, let me know.
Comment 97 Danny Auble 2019-08-12 16:04:36 MDT
OK, there are quite a few things that are concerning in this log.

For one, it looks like you were having disk problems on the node...

2019-07-24T13:38:08.503946-06:00 stargate4 slurmdbd[14]: debug:  REQUEST_PERSIST_INIT: CLUSTER:geyser_caldera VERSION:8448 UID:15968 IP:128.117.183.164 CONN:18
2019-07-24T13:38:08.504228-06:00 stargate4 slurmdbd[14]: debug2: acct_storage_p_get_connection: request new connection 1
2019-07-24T13:38:08.538279-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_FED_INFO from uid=16745
2019-07-24T13:38:08.538651-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_get_fed usec=303
2019-07-24T13:38:08.540257-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_PARTITION_INFO uid=16745
2019-07-24T13:38:08.540660-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_dump_partitions, size=972 usec=60
2019-07-24T13:38:08.546349-06:00 stargate4 slurmdbd[14]: debug2: DBD_GET_JOBS_COND: called
2019-07-24T13:38:08.548229-06:00 stargate4 slurmdbd[14]: error: We should have gotten a result: 'No error' ''
2019-07-24T13:38:08.548509-06:00 stargate4 slurmdbd[14]: error: Problem getting jobs for cluster geyser_caldera
2019-07-24T13:38:08.548786-06:00 stargate4 supervisord: slurmdbd slurmdbd: error: We should have gotten a result: 'No error' ''
2019-07-24T13:38:08.550996-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 13:38:08,548 DEBG 'slurmdbd' stderr output:
2019-07-24T13:38:08.551314-06:00 stargate4 aa0e3af76781[4937]: slurmdbd: error: We should have gotten a result: 'No error' ''
2019-07-24T13:38:08.551938-06:00 stargate4 supervisord: slurmdbd slurmdbd: error: Problem getting jobs for cluster geyser_caldera
2019-07-24T13:38:08.554092-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 13:38:08,548 DEBG 'slurmdbd' stderr output:
2019-07-24T13:38:08.554497-06:00 stargate4 aa0e3af76781[4937]: slurmdbd: error: Problem getting jobs for cluster geyser_caldera
...
2019-07-24T13:50:51.324594-06:00 stargate4 slurmctld[15]: error: Error writing file /data/state//hash.7/job.3057597/environment, No space left on device
2019-07-24T13:50:51.326655-06:00 stargate4 supervisord: slurmctld slurmctld: _slurm_rpc_submit_batch_job: I/O error writing script/environment to file
...
2019-07-24T13:51:28.018072-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: Error writing file /data/state//job_state.new, No space left on device
2019-07-24T13:51:28.018212-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: Error writing file /data/state//node_state.new, No space left on device

And clock problems as well...
2019-07-24T13:51:28.017962-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: The clock of the file system and this computer appear to not be synchronized

But I will ignore those as they appear to only be short lived.

If we are to believe CONN:18 is the connection representing the slurmctld, it happens here (guessing uid 210 represents user slurm):

2019-07-24T13:50:36.802180-06:00 stargate4 slurmdbd[14]: debug:  REQUEST_PERSIST_INIT: CLUSTER:geyser_caldera VERSION:8448 UID:210 IP:192.168.100.21 CONN:18

Then for some reason it fails...

2019-07-24T14:18:01.151616-06:00 stargate4 slurmdbd[14]: debug2: persistent connection is closed
2019-07-24T14:18:01.151736-06:00 stargate4 slurmdbd[14]: debug:  Problem sending response to connection 18(192.168.100.21) uid(210)
2019-07-24T14:18:01.151800-06:00 stargate4 slurmdbd[14]: debug2: Closed connection 18 uid(210)
2019-07-24T14:18:01.152522-06:00 stargate4 slurmdbd[14]: debug:  cluster geyser_caldera has disconnected

This is fine and dandy, and indeed is where the NULL ControlHost comes from.  I can reproduce this portion, but every time the slurmctld tries to send a new message I get a REQUEST_PERSIST_INIT and I get error messages in the slurmctld log.

But what is really strange here is that almost 13 minutes earlier I get...

2019-07-24T14:05:36.902904-06:00 stargate4 slurmdbd[14]: debug:  REQUEST_PERSIST_INIT: CLUSTER:geyser_caldera VERSION:8448 UID:210 IP:192.168.100.21 CONN:20
2019-07-24T14:05:36.903175-06:00 stargate4 slurmdbd[14]: debug2: acct_storage_p_get_connection: request new connection 1
2019-07-24T14:05:36.903580-06:00 stargate4 slurmctld[15]: error: slurmdbd: Getting response to message type 1407
2019-07-24T14:05:36.904170-06:00 stargate4 supervisord: slurmctld slurmctld: error: slurmdbd: Getting response to message type 1407
2019-07-24T14:05:36.904715-06:00 stargate4 slurmctld[15]: debug2: Processing RPC: REQUEST_FED_INFO from uid=16745
2019-07-24T14:05:36.905338-06:00 stargate4 slurmctld[15]: debug2: _slurm_rpc_get_fed usec=384
2019-07-24T14:05:36.905796-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 14:05:36,903 DEBG 'slurmctld' stderr output:
2019-07-24T14:05:36.905860-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: slurmdbd: Getting response to message type 1407

This also appears to be from a (the?) slurmctld, but that is connection 20, which doesn't close in this log file.  So it leads me to believe you somehow have 2 connections open to the database. Note: 1407 is DBD_CLUSTER_TRES, so we are sending that information over.  Indeed we do get a DBD_CLUSTER_TRES message, but that doesn't happen until

2019-07-24T14:19:27.106336-06:00 stargate4 slurmdbd[14]: debug2: DBD_CLUSTER_TRES: called for geyser_caldera(1=2570,2=15765010,3=0,4=57,5=17528,6=0,7=0,8=0)

So it really looks like there are 2 different slurmctlds (on the same node) trying to talk to the slurmdbd, or the one slurmctld has opened 2 different connections, both talking to the dbd.  As strange as this sounds, it would at least make sense on the database side, as the CONN:18 failure would overwrite the CONN:20 entry.  How CONN:20 started is strange, as I see no evidence of an error in the log.
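
That overwrite theory can be sketched in miniature. The code below is purely illustrative (the names and data structures are made up, not Slurm's actual internals): if the dbd keeps one registered connection per cluster, keyed by cluster name alone, then a stale connection closing late clobbers the entry for the newer connection, leaving the cluster looking disconnected:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a per-cluster registration table in the dbd. */
struct cluster_reg {
	char cluster[64];	/* cluster name, e.g. "geyser_caldera" */
	int  conn_id;		/* slot of the registered connection, -1 = none */
};

static struct cluster_reg reg = { "", -1 };

/* A REQUEST_PERSIST_INIT registers (or re-registers) the cluster. */
static void register_conn(const char *cluster, int conn_id)
{
	snprintf(reg.cluster, sizeof(reg.cluster), "%s", cluster);
	reg.conn_id = conn_id;
}

/* Buggy close handler: deregisters by cluster name alone, so a stale
 * connection closing late wipes out the newer connection's entry. */
static void close_conn_buggy(const char *cluster, int conn_id)
{
	(void) conn_id;		/* never compared -- that is the bug */
	if (!strcmp(reg.cluster, cluster))
		reg.conn_id = -1;	/* cluster now looks disconnected */
}
```

Running the sequence from the log against this model (register on CONN:18, re-register on CONN:20, then the stale CONN:18 finally closes) leaves the cluster deregistered even though CONN:20 is still live.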

I do notice that a few seconds before CONN:20 I see messages like

2019-07-24T14:05:29.969600-06:00 stargate4 slurmctld[15]: error: Error writing file /data/state//priority_last_decay_ran.new, No space left on device
2019-07-24T14:05:29.969924-06:00 stargate4 supervisord: slurmctld slurmctld: error: Error writing file /data/state//priority_last_decay_ran.new, No space left on device
2019-07-24T14:05:29.970500-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 14:05:29,962 DEBG 'slurmctld' stderr output:
2019-07-24T14:05:29.970542-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: Error writing file /data/state//priority_last_decay_ran.new, No space left on device

2019-07-24T14:05:33.006639-06:00 stargate4 slurmctld[15]: error: The modification time of /data/state//job_state moved backwards by 909 seconds
2019-07-24T14:05:33.007006-06:00 stargate4 slurmctld[15]: error: The clock of the file system and this computer appear to not be synchronized
2019-07-24T14:05:33.007293-06:00 stargate4 supervisord: slurmctld slurmctld: error: The modification time of /data/state//job_state moved backwards by 909 seconds
2019-07-24T14:05:33.008384-06:00 stargate4 slurmctld[15]: error: Error writing file /data/state//job_state.new, No space left on device
2019-07-24T14:05:33.009586-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 14:05:33,006 DEBG 'slurmctld' stderr output:
2019-07-24T14:05:33.009681-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: The modification time of /data/state//job_state moved backwards by 909 seconds
2019-07-24T14:05:33.009837-06:00 stargate4 supervisord: slurmctld slurmctld: error: The clock of the file system and this computer appear to not be synchronized
2019-07-24T14:05:33.010237-06:00 stargate4 supervisord: slurmctld slurmctld: error: Error writing file /data/state//job_state.new, No space left on device
2019-07-24T14:05:33.010874-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 14:05:33,008 DEBG 'slurmctld' stderr output:
2019-07-24T14:05:33.010948-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: The clock of the file system and this computer appear to not be synchronized
2019-07-24T14:05:33.011030-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: Error writing file /data/state//job_state.new, No space left on device
2019-07-24T14:05:33.011190-06:00 stargate4 slurmctld[15]: error: Error writing file /data/state//node_state.new, No space left on device
2019-07-24T14:05:33.011675-06:00 stargate4 supervisord: slurmctld slurmctld: error: Error writing file /data/state//node_state.new, No space left on device
2019-07-24T14:05:33.012522-06:00 stargate4 aa0e3af76781[4937]: 2019-07-24 14:05:33,011 DEBG 'slurmctld' stderr output:
2019-07-24T14:05:33.012580-06:00 stargate4 aa0e3af76781[4937]: slurmctld: error: Error writing file /data/state//node_state.new, No space left on device

I am wondering if running out of disk space on the device is related to this happening.  Any ideas?
Comment 98 Danny Auble 2019-08-12 16:17:12 MDT
I just reproduced it by hanging the I/O to the slurmctld on the dbd side for a period, then restarting the slurmctld.  It goes and gets a new connection to the database while the old connection is hung.  When the hung connection releases, it gives us the scenario we are seeing now.

I can look at bulletproofing this a bit more, but it seems strange there wasn't a restart of the slurmctld (or logs of it happening) around 2019-07-24T14:05:36.  The log seems incomplete there, but if the disk was full then perhaps that makes sense?
Comment 99 Danny Auble 2019-08-12 16:44:26 MDT
Created attachment 11201 [details]
Remove hung connection in the DBD

Ben, I think this will resolve the issue, whether or not you are experiencing exactly what I am seeing.

What it does is, when a registration message comes in from a slurmctld, check whether we are already talking to that ctld.  If so, we remove that registered connection and proceed with the new connection.  Then, when the old connection becomes unwedged for whatever reason, we close it without removing the registration (the removal is what is happening now).
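
As a rough illustration of that description (hypothetical code, not the actual patch), the two guards are: a new registration evicts any existing connection for the same cluster, and a closing connection only deregisters the cluster if it is still the registered one:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model -- names are invented, not Slurm's internals. */
struct cluster_reg {
	char cluster[64];
	int  conn_id;		/* registered connection, -1 = none */
};

static struct cluster_reg reg = { "", -1 };
static int evicted_conn = -1;	/* records which stale conn got kicked */

/* Guard 1: a new registration evicts a stale one for the same cluster. */
static void register_conn(const char *cluster, int conn_id)
{
	if (!strcmp(reg.cluster, cluster) && reg.conn_id >= 0 &&
	    reg.conn_id != conn_id)
		evicted_conn = reg.conn_id;	/* drop the wedged connection */
	snprintf(reg.cluster, sizeof(reg.cluster), "%s", cluster);
	reg.conn_id = conn_id;
}

/* Guard 2: only deregister if this is still the live registered conn. */
static void close_conn(const char *cluster, int conn_id)
{
	if (!strcmp(reg.cluster, cluster) && reg.conn_id == conn_id)
		reg.conn_id = -1;
}
```

With these guards, replaying the failure sequence (register CONN:18, re-register CONN:20, stale CONN:18 closes) leaves CONN:20 registered instead of wiping the cluster's entry.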

This patch should work on any version of Slurm, I am guessing.

Could you run with this for a bit and see if you see the new error message but no fallout?
Comment 103 Danny Auble 2019-08-12 16:58:21 MDT
Created attachment 11205 [details]
Remove hung connection in the DBD v2

Sorry, Ben, here is the correct patch.
Comment 105 Ben Matthews 2019-08-13 12:51:34 MDT
The out-of-space condition is probably legitimate but specific to this one instance. I'll go back through the logs and see if I can find any other cases, but that was caused by another daemon which shares the same host.

Time drift on the filesystem, however, makes no sense at all. The Slurm state save directory is a local (xfs) filesystem on top of a software (md) RAID volume. If I'm reading the code correctly, you'd only get this message if time moved backward -- I don't see how that could have happened. At least right now, NTPd looks happy. 
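
As an illustration of the kind of check being described here (a made-up helper, not the actual Slurm code), a state-save sanity check might compare the new state file's mtime against the previous one and report how far time "moved backwards", matching the "moved backwards by 909 seconds" message in the log:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: returns the number of seconds the state file's
 * modification time went backwards relative to the previous save, or 0
 * if time advanced normally.  A non-zero result would trigger the
 * "clock of the file system ... not synchronized" style of warning. */
static int mtime_moved_backwards(time_t prev_mtime, time_t new_mtime)
{
	return (new_mtime < prev_mtime) ? (int)(prev_mtime - new_mtime) : 0;
}
```

Under this reading, the warning fires only when the filesystem reports an mtime earlier than the last one seen, which is why Ben finds it hard to explain on a healthy, NTP-synchronized local disk.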

Logs go to a different (NFS) volume, so it's unlikely that they were impacted by any of this. 

If the slurmctld were to die, it should be restarted automatically, so I suppose it's possible that it got restarted, but I'm not quickly finding any core files/evidence of that. I'll look through the log a bit more and let you know if I find anything. 

I'll take a look at the patch; assuming things go smoothly, I think I can get it installed later today.
Comment 106 Danny Auble 2019-08-13 14:06:03 MDT
Yeah, I don't know what was going on then; everything seemed weird on the node based on the log you sent.

The patch should fix the issue though. I'm excited for you to verify.  You will only need to restart the slurmdbd.
Comment 109 Ben Matthews 2019-08-15 14:09:57 MDT
Installing the patch went smoothly. I guess we'll give it a couple months and see how things go.
Comment 112 Danny Auble 2019-08-20 13:38:49 MDT
Ben, we are fairly confident this is the fix for the problem.  It has been added to 19.05 in commit af7b453138c.

Please reopen if you still experience the issue.
Comment 113 Jason Booth 2019-08-22 10:11:27 MDT
*** Ticket 7419 has been marked as a duplicate of this ticket. ***
Comment 122 Alejandro Sanchez 2020-02-13 02:40:26 MST
*** Ticket 7502 has been marked as a duplicate of this ticket. ***