Ticket 8491

Summary: Recent jobs are not visible with sacct
Product: Slurm Reporter: Roger Moye <rmoye>
Component: AccountingAssignee: Gavin D. Howard <gavin>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: albert.gil
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9478
Site: QuantLab Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Mariadb log
MariaDB log with logging level set to 4.

Description Roger Moye 2020-02-11 15:17:13 MST
We want to track the amount of utilization per user so we used "sacctmgr" to add users and accounts into slurmdbd.  Subsequently, if we run "sacct -j <jobid>" we have noticed that a given job can show up in the database 2 hours after it has completed.   If we remove the user from slurmdbd (with sacctmgr) then jobs submitted by that user appear immediately with "sacct -j".

It seems that the accounting data is accurate but lags behind when users and accounts are added to slurmdbd.

Is this expected?  If not, how do we fix this?  Is there some kind of caching going on when we are tracking individual users that causes their jobs not to appear with sacct -j for a while?

-Roger
Comment 2 Gavin D. Howard 2020-02-11 16:06:42 MST
Just a note that support for 18.08 is going to end very soon. (Version 20.02 is due to be released before the end of this month.) I would suggest upgrading to 19.05.5 in the near future.

That said, let's see how I can help in the meantime.

Can you send me your slurm.conf, slurmctld logs, and slurmdbd logs?

Also, please turn on query logging on slurmdbd and send me the result after running sacctmgr -j in both ways; once before the user is removed, and once after the user is removed.
Comment 3 Roger Moye 2020-02-11 20:51:27 MST
I am currently out-of-town but I expect to be back at my desk later this week.  I will recreate the problem and send you all of the logs and config files you asked for.

The only issue with your request is that we have subsequently turned on AccountingStorageEnforce=limits,qos.  So if I remove an account from the database I will not be able to submit a job at all.   I can still reproduce the original issue though and capture the logs.

I will reply soon.

-Roger

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, February 11, 2020 5:07 PM
To: Roger Moye <rmoye@quantlab.com>
Subject: [Bug 8491] Recent jobs are not visible with sacct

Comment # 2<https://bugs.schedmd.com/show_bug.cgi?id=8491#c2> on bug 8491<https://bugs.schedmd.com/show_bug.cgi?id=8491> from Gavin D. Howard<mailto:gavin@schedmd.com>

Just a note that support for 18.08 is going to end very soon. (Version 20.02 is

due to be released before the end of this month.) I would suggest upgrading to

19.05.5 in the near future.



That said, let's see how I can help in the meantime.



Can you send me your slurm.conf, slurmctld logs, and slurmdbd logs?



Also, please turn on query logging on slurmdbd and send me the result after

running sacctmgr -j in both ways; once before the user is removed, and once

after the user is removed.

________________________________
You are receiving this mail because:

  *   You reported the bug.
-----------------------------------------------------------------------------------

The information in this communication and any attachment is confidential and intended solely for the attention and use of the named addressee(s). All information and opinions expressed herein are subject to change without notice. This communication is not to be construed as an offer to sell or the solicitation of an offer to buy any security. Any such offer or solicitation can only be made by means of the delivery of a confidential private offering memorandum (which should be carefully reviewed for a complete description of investment strategies and risks). Any reliance one may place on the accuracy or validity of this information is at their own risk. Past performance is not necessarily indicative of the future results of an investment. All figures are estimated and unaudited unless otherwise noted. If you are not the intended recipient, or a person responsible for delivering this to the intended recipient, you are not authorized to and must not disclose, copy, distribute, or retain this message or any part of it. In this case, please notify the sender immediately at 713-333-5440
Comment 4 Roger Moye 2020-02-14 10:22:26 MST
You asked for us to turn on query logging in slurmdbd.  Which flags do I set to do that?   The docs show several DebugFlags that can be set.  Which ones are relevant in this case?

-Roger
Comment 5 Gavin D. Howard 2020-02-14 11:46:59 MST
In slurmdbd.conf, you can set DebugFlags to this:

> DebugFlags=DB_QUERY
Comment 6 Gavin D. Howard 2020-02-28 16:34:58 MST
Just a reminder to send me your logs.
Comment 7 Roger Moye 2020-03-02 16:00:36 MST
Gavin,

I sent you the logs on 2/19 along with examples of sacct showing the jobs do and do not show up in the output.

The logs were sent via email for privacy reasons.  You confirmed receipt on 2/19 as well.

-Roger
Comment 8 Gavin D. Howard 2020-03-02 16:03:30 MST
Ah, yes, my fault. I remember now. Thank you.
Comment 9 Roger Moye 2020-03-02 16:03:37 MST
I will be out of the office on March 3.  I will be checking email intermittently.
For assistance please use http://qsm.quantlab.com or call the Help Desk at 713.333.3720.

-Roger
Comment 14 Gavin D. Howard 2020-03-03 12:41:59 MST
Roger,

Once you are back in the office, please send me the output from the following command:

> sacctmgr show stats
Comment 16 Gavin D. Howard 2020-03-05 10:57:26 MST
I will need a few more things to solve this issue.

First, can you send me your slurmdbd.conf?

Second, can you send the dmesg/syslog from the "master" node, along with the sql logs?

Third, we noticed some errors connecting to the database in the logs, but they seem to have gone away. Has this issue gone away? If not, can you reenable DebugFlags=DB_QUERY? I do not see anything in the logs related to that.
Comment 17 Roger Moye 2020-03-05 12:52:52 MST
I am going to send the logs to you via email as one of them is too big to upload.

I am not seeing the problem at the moment but I strongly suspect this is due to a restart of mysqld on 3/2.  Also the errors connecting to the database as shown in slurmdbd logs are in alignment with the moment that I restart mysqld.  I have included the latest slurmdbd log.  This is consistent with what I've seen in the past.   A restart of mysqld (or maybe slurmdbd as well) will cause the problem to temporarily disappear.

-Roger
Comment 19 Gavin D. Howard 2020-03-05 16:49:48 MST
Thank you.

The information you are providing me is making me think that the problem may lie with the amount of mysql connections that exist.

The first thing to do is to ensure that mysql is configured properly for Slurm. The information to do so is at https://slurm.schedmd.com/accounting.html#mysql-configuration .

The second thing to do is to see how many connections mysql has versus how many connections it allows. Just in case you don't know how to do it, the query you can give to mysql is:

> show status where variable_name = 'threads_connected';

and to see the max allowable connections is:

> show variables like "max_connections";

You should ensure that you are not pushing the limit because if you are, slurmdbd may be struggling to connect, as the logs seem to indicate, which could explain the lag on the accounting data.

The last thing you can do is to send me the mysql logs; maybe I can figure out why slurmdbd is struggling to connect.
Comment 20 Roger Moye 2020-03-06 10:00:32 MST
Our cluster was built with Bright Cluster Manager.  As for the mysql-configuration link you sent to me, I would assume that the config is okay since it has been online for over a year and we’ve been using sacctmgr and sacct a lot.
I verified the innodb_buffer_pool_size and also verified that Innodb is the default engine.
As for the rest:
MariaDB [(none)]> show status where variable_name = 'threads_connected';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 5     |
+-------------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> show variables like "max_connections";
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 151   |
+-----------------+-------+
1 row in set (0.00 sec)
I suppose there might be times when the threads connected could be much higher.
I will attached the mariadb log to the ticket.
-Roger
-----------------------------------------------------------------------------------

The information in this communication and any attachment is confidential and intended solely for the attention and use of the named addressee(s). All information and opinions expressed herein are subject to change without notice. This communication is not to be construed as an offer to sell or the solicitation of an offer to buy any security. Any such offer or solicitation can only be made by means of the delivery of a confidential private offering memorandum (which should be carefully reviewed for a complete description of investment strategies and risks). Any reliance one may place on the accuracy or validity of this information is at their own risk. Past performance is not necessarily indicative of the future results of an investment. All figures are estimated and unaudited unless otherwise noted. If you are not the intended recipient, or a person responsible for delivering this to the intended recipient, you are not authorized to and must not disclose, copy, distribute, or retain this message or any part of it. In this case, please notify the sender immediately at 713-333-5440
Comment 21 Roger Moye 2020-03-06 10:02:35 MST
Created attachment 13293 [details]
Mariadb log
Comment 22 Gavin D. Howard 2020-03-06 10:17:56 MST
Unfortunately, the mariadb logs were not that useful to us; they don't seem to contain any reasons for denied connections. Can you turn up the log level in mariadb? Also, can you try connecting to mariadb by hand and see what happens in the logs with the increased log level?

As a side note: the "Unknown/unsupported storage engine: InnoDB" errors also need to be fixed, but that is outside the realm of what SchedMD can help with since it is a mysql problem.
Comment 23 Roger Moye 2020-03-06 12:40:47 MST
The "Unknown/unsupported storage engine" error was from December 2019.  I was updating some database tuning parameters per the SchedMD docs which redirected me to the Mysql documentation.  One of those parameters (I do not remember which) was not valid for InnoDB because our version of MySQL was too old.

As for continued troubleshooting, I am about to head out on vacation for a few days so I will need to table this until I get back.

Thanks so much for your help!
-Roger
Comment 24 Gavin D. Howard 2020-03-19 16:12:09 MDT
Just a reminder to send me the material I asked for in Comment 22.
Comment 25 Roger Moye 2020-03-23 09:34:24 MDT
Sorry for the long delay.  I will circle back to this as soon as I can.


From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Thursday, March 19, 2020 5:12 PM
To: Roger Moye <rmoye@quantlab.com>
Subject: [Bug 8491] Recent jobs are not visible with sacct

Comment # 24<https://bugs.schedmd.com/show_bug.cgi?id=8491#c24> on bug 8491<https://bugs.schedmd.com/show_bug.cgi?id=8491> from Gavin D. Howard<mailto:gavin@schedmd.com>

Just a reminder to send me the material I asked for in Comment 22<show_bug.cgi?id=8491#c22>.

________________________________
You are receiving this mail because:

  *   You reported the bug.
-----------------------------------------------------------------------------------

The information in this communication and any attachment is confidential and intended solely for the attention and use of the named addressee(s). All information and opinions expressed herein are subject to change without notice. This communication is not to be construed as an offer to sell or the solicitation of an offer to buy any security. Any such offer or solicitation can only be made by means of the delivery of a confidential private offering memorandum (which should be carefully reviewed for a complete description of investment strategies and risks). Any reliance one may place on the accuracy or validity of this information is at their own risk. Past performance is not necessarily indicative of the future results of an investment. All figures are estimated and unaudited unless otherwise noted. If you are not the intended recipient, or a person responsible for delivering this to the intended recipient, you are not authorized to and must not disclose, copy, distribute, or retain this message or any part of it. In this case, please notify the sender immediately at 713-333-5440
Comment 26 Roger Moye 2020-03-30 15:35:10 MDT
Sorry it has taken so long.

This afternoon I verified that the problem remains.  Then I did the following:

MariaDB [(none)]> show status where variable_name = 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 47    |
+----------------------+-------+
1 row in set (0.00 sec)

We seem to be no where near the limit of 150 connections.  

I increased the log level of MySQL to 4.  Attached is the log.  

The verbose logging started at 16:20 (near the bottom of the log).  At 16:23 I manually connected to MySQL.  You will see a single line that shows the connection.  

-Roger
Comment 27 Roger Moye 2020-03-30 15:35:58 MDT
Created attachment 13530 [details]
MariaDB log with logging level set to 4.
Comment 28 Roger Moye 2020-04-20 14:12:56 MDT
Just wanted to follow up on this and see if anything was noticed in the last set of logs I uploaded.

-Roger
Comment 29 Gavin D. Howard 2020-04-20 16:11:26 MDT
I apologize for not getting back to you, but this past week, I had other, more pressing issues to figure out, including one that took me offline for the greater part of Thursday and today.

Yes, I noticed something in those logs: the mass of aborted connections. I am trying to figure out where they came from, and I have no idea when I will figure that out.
Comment 30 Roger Moye 2020-04-21 07:28:30 MDT
I noticed the aborted connections too.  One of the troubleshooting steps you asked me to perform was to connect to mysql manually from the command line.  I did that and when I terminated that connection gracefully an aborted connection error was produced as well.  

-Roger
Comment 31 Gavin D. Howard 2020-04-21 16:12:11 MDT
The other error that caught my attention is this one:

> 200103 11:38:44 [ERROR] Error in accept: Too many open files in system

Would that have any bearing on what happened?
Comment 32 Roger Moye 2020-04-22 08:12:40 MDT
In my previous response I was just thinking outloud.  I wonder if this is a real error that we need to worry about since I can reproduce it just by exiting mysql.

However, I was reading the mysql docs this morning about aborted connections.  The log I gave you would imply that we've had 1.3 million aborted connections.  These aborted connections can be caused by a ton of things.   One of them being that the client does not call mysql_close() before terminating.  Odd that merely existing from the mysql prompt would produce the error.

I've been reviewing this document:
https://dev.mysql.com/doc/refman/5.6/en/communication-errors.html

Very hard to say if these aborted connections are the culprit or just noise.

-Roger
Comment 33 Gavin D. Howard 2020-04-29 13:54:00 MDT
At this time, I am unable to reproduce this issue. It looks like it might be related to a failure of slurmdbd and slurmctld to communicate, I want to be able to rule out bugs that have been fixed in newer versions, such as in https://github.com/SchedMD/slurm/commit/af7b453138ce2ad4c4a44456c119946f4ee1d96f .

On top of that, 18.08 officially reached its End of Life.

If it is at all possible, please upgrade to 19.05. This will help me establish a baseline and also ensure that the behavior still exists.
Comment 34 Gavin D. Howard 2020-05-13 15:22:11 MDT
Just a reminder to let me know the current status of your upgrading, whether or not you intend to do it.
Comment 35 Roger Moye 2020-05-15 09:29:16 MDT
We do intend to do it, just not sure when.  We have multiple projects in flight right now that will compete with each other for my time.   I think there will be a step in one of our big projects where a slurm upgrade will be required anyway, so I currently think that’s when the upgrade will happen.  I don’t have a timeline yet though.

Roger

From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Wednesday, May 13, 2020 4:22 PM
To: Roger Moye <rmoye@quantlab.com>
Subject: [Bug 8491] Recent jobs are not visible with sacct

Comment # 34<https://bugs.schedmd.com/show_bug.cgi?id=8491#c34> on bug 8491<https://bugs.schedmd.com/show_bug.cgi?id=8491> from Gavin D. Howard<mailto:gavin@schedmd.com>

Just a reminder to let me know the current status of your upgrading, whether or

not you intend to do it.

________________________________
You are receiving this mail because:

  *   You reported the bug.
-----------------------------------------------------------------------------------

The information in this communication and any attachment is confidential and intended solely for the attention and use of the named addressee(s). All information and opinions expressed herein are subject to change without notice. This communication is not to be construed as an offer to sell or the solicitation of an offer to buy any security. Any such offer or solicitation can only be made by means of the delivery of a confidential private offering memorandum (which should be carefully reviewed for a complete description of investment strategies and risks). Any reliance one may place on the accuracy or validity of this information is at their own risk. Past performance is not necessarily indicative of the future results of an investment. All figures are estimated and unaudited unless otherwise noted. If you are not the intended recipient, or a person responsible for delivering this to the intended recipient, you are not authorized to and must not disclose, copy, distribute, or retain this message or any part of it. In this case, please notify the sender immediately at 713-333-5440
Comment 36 Gavin D. Howard 2020-05-15 10:12:34 MDT
That makes sense.

In the meantime, I will attempt to continue solving this problem.
Comment 37 Gavin D. Howard 2020-08-05 12:29:57 MDT
At this point, I think it's best to close this issue and wait until the upgrade to 19.05 or 20.02 has happened. Once it has, feel free to open a new issue.