Since upgrading to 18.08.7 I have two systems with issues where associations are added to the database but slurmctld does not see them. The issue is not persistent: restarting slurmctld resolves it, and it then does not recur for a while. I currently have a system in this state, with newly created accounts not being picked up in the slurmctld associations. Is there anyone available to help troubleshoot this issue? This problem has been happening for a while, and I added the debug statements in assoc_mgr.c as recommended, but I have not been able to find a root cause. The problem has now appeared on a second, recently upgraded cluster, and we would like to have it resolved as soon as possible. Please advise.
Hi Jackie,

I'd like to have you send a couple pieces of information for me. We have seen similar behavior previously where something was happening to clear out information in the database that is needed to talk to slurmctld. Can you get into the database and run this query for me?

select control_host, control_port, name, rpc_version from cluster_table;

Similarly, I'd like to have you run this command and send the output:

sacctmgr show cluster

You mention that the problem goes away when you restart slurmctld. I would be interested to see if the output of the above commands is any different right after a restart. You also say that the problem doesn't happen for a while after a restart. Are there new users added during this time, or is it just that the users who were added before start working and things are ok until you try to add someone new?

If you can also send the slurmctld and slurmdbd logs from the time you tried to add a new user, along with the username I can search for, that would be helpful.

Thanks,
Ben
See below:

On Tue, Aug 13, 2019 at 1:55 PM <bugs@schedmd.com> wrote:
> Can you get into the database and run this query for me?
>
> select control_host, control_port, name, rpc_version from cluster_table;

Two different servers are behaving the same way; I ran the commands on each for you to see:

master.scs -

select control_host, control_port, name, rpc_version from cluster_table;
+--------------+--------------+------------+-------------+
| control_host | control_port | name       | rpc_version |
+--------------+--------------+------------+-------------+
|              |            0 | perceus-00 |        8448 |
+--------------+--------------+------------+-------------+
1 row in set (0.00 sec)

master.brc -

select control_host, control_port, name, rpc_version from cluster_table;
+--------------+--------------+-----------------+-------------+
| control_host | control_port | name            | rpc_version |
+--------------+--------------+-----------------+-------------+
|              |            0 | 0-a0-d1-ec-bc-c |           0 |
| 10.0.0.10    |         6817 | brc             |        8448 |
|              |            0 | master          |          11 |
|              |            0 | master.brc      |          11 |
+--------------+--------------+-----------------+-------------+
4 rows in set (0.00 sec)

> Similarly, I'd like to have you run this command and send the output:
>
> sacctmgr show cluster

master.scs -

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
perceus-00                            0  8448         1                                                                             mako_debug,normal

master.brc -

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
       brc       10.0.0.10         6817  8448         1

> You mention that the problem goes away when you restart slurmctld. I would be
> interested to see if the output of the above commands is any different right
> after a restart.

I have run these commands before and after a restart and have never seen the 'sacctmgr show cluster' output change. The database content I am not sure about.

> You also say that the problem doesn't happen for a while after a restart.

No, it does not, and no logs report any issues. I have increased debug levels before and nothing is reported in slurmctld.log nor slurmdbd.log from the assoc_mgr.c debug lines.

> Are there new users added during this time, or is it just that the users who
> were added before start working and things are ok until you try to add
> someone new?

Some were new, and one was adding an additional association to an existing user.

> If you can also send the slurmctld and slurmdbd logs from the time you tried
> to add a new user, along with the username I can search for, that would be
> helpful.

The logs are a bit tricky because I don't have the specific time these were created. I can give you all of the logs for brc since August 5th, and for scs the past week. Will that work?
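The symptom visible in the query output above can be spotted mechanically. This is a minimal sketch, not part of Slurm itself: the row tuples are transcribed from the outputs above, and the reading that an empty control_host plus a zero control_port means slurmdbd has lost the registration it needs to reach that slurmctld is the working theory in this ticket.

```python
# Rows as returned by:
#   select control_host, control_port, name, rpc_version from cluster_table;
# (values transcribed from the outputs above)
scs_rows = [("", 0, "perceus-00", 8448)]
brc_rows = [
    ("", 0, "0-a0-d1-ec-bc-c", 0),
    ("10.0.0.10", 6817, "brc", 8448),
    ("", 0, "master", 11),
    ("", 0, "master.brc", 11),
]

def missing_control_info(rows):
    """Return cluster names whose registration looks broken: with no
    control_host/control_port recorded, slurmdbd cannot push association
    updates back to that slurmctld."""
    return [name for host, port, name, _rpc in rows if not host or port == 0]

print(missing_control_info(scs_rows))  # perceus-00 has lost its registration
print(missing_control_info(brc_rows))  # stale rows besides 'brc'
```

On master.brc only the 'brc' row carries usable contact information; the other rows look like stale registrations from earlier cluster names.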
Created attachment 11217 [details] Tar file of all slurm logs for August
Created attachment 11218 [details] SCS slurm log files for August tar format
Thank you for that data. It looks like there might be two different things happening on the different systems.

The 'brc' system looks like it has the correct information in the database to be able to communicate. However, when I look at the logs for that system I see the following entries on the 24th of July:

[2019-07-24T15:22:14.782] error: CONN:7 Your user doesn't have privilege to perform this action
[2019-07-24T15:22:14.782] error: CONN:7 Security violation, DBD_ADD_ASSOCS

I tried to see what the slurmctld logs said for that time period, but they didn't go back far enough. If this is the problem on this system, then it looks like something I've seen previously where slurm.conf and slurmdbd.conf have different SlurmUser values specified. Does July 24th sound like a reasonable time for the last attempt to add a user? Can you grep for 'SlurmUser' in both your slurm.conf and slurmdbd.conf to see if they differ?

For the 'scs' system, it looks like it could be the issue I was mentioning earlier, where the database is missing the information it needs to be able to talk to the slurm controller. You said that you never see a control host or port in the 'scontrol show cluster' output on 'scs' after a reboot. How soon after a reboot have you checked that output?

I tried looking through the logs for this system for what I thought might be relevant errors, but didn't see them. Would it be possible for you to add a test user to the system as it is right now, to narrow down a time and username to look at in the logs? Also, if you're able to restart slurmctld, I would like to see how things look when you run 'scontrol show controller' and the database query right after the restart.

Thanks,
Ben
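The SlurmUser comparison suggested above is usually done by eye with grep, but for reference the check can be sketched as follows. The config excerpts here are hypothetical placeholders, not the site's real files; in practice you would read /etc/slurm/slurm.conf and /etc/slurm/slurmdbd.conf from disk.

```python
import re

# Hypothetical excerpts -- substitute the real file contents.
slurm_conf = "ClusterName=brc\nSlurmUser=slurm\n"
slurmdbd_conf = "DbdHost=master\nSlurmUser=slurm\n"

def get_slurm_user(conf_text):
    """Extract the SlurmUser value from slurm-style key=value config text.
    Slurm treats configuration keywords case-insensitively."""
    m = re.search(r"^\s*SlurmUser\s*=\s*(\S+)", conf_text,
                  re.MULTILINE | re.IGNORECASE)
    return m.group(1) if m else None

ctld_user = get_slurm_user(slurm_conf)
dbd_user = get_slurm_user(slurmdbd_conf)
# A mismatch here would be consistent with the DBD_ADD_ASSOCS
# security violations quoted above.
print("match" if ctld_user == dbd_user else
      "MISMATCH: %s vs %s" % (ctld_user, dbd_user))
```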
Created attachment 11220 [details]
slurmscs-out.tar

For brc, during that July time the system was running 17.11, not 18.08, and it was not experiencing the problem then.

scs -
/etc/slurm/slurm.conf:SlurmUser=slurm
/etc/slurm/slurmdbd.conf:SlurmUser=slurm

brc -
/etc/slurm/slurm.conf:SlurmUser=slurm
/etc/slurm/slurmdbd.conf:SlurmUser=slurm

I just created myself an account in account=test; see the attached log file for what happened. After that I restarted slurmctld and ran:

sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
perceus-00       10.0.0.11         6817  8448         1                                                                             mako_debug,normal

which is the same as before the restart.

I had not seen the user privilege errors before; this is the first time I have noticed that. Also, I do not see 'scontrol show controller' as an option in this version of slurm.
Created attachment 11232 [details]
Patch to maintain existing database connection

Hi Jackie,

Thanks for sending that information and the logs. From the slurmdbd logs I can see entries showing new connection requests coming in, which cause the existing connections to be closed. The logs look like this when a new connection is opened:

/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: Opened connection 15 from 10.0.2.2
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:44678 IP:10.0.2.2 CONN:15
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: acct_storage_p_get_connection: request new connection 1
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: Attempting to connect to localhost:3306

And a little while later the connection is closed with logs like this:

/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:02.471] debug2: DBD_FINI: CLOSE:1 COMMIT:0
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:02.471] debug2: persistent connection is closed
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:02.471] debug2: Closed connection 15 uid(44678)

We recently ran into a similar scenario and a patch was created to address it. The patch changes the behavior to close the new connection and leave the established one alone, so that communication can continue between slurmctld and slurmdbd. I'm attaching this patch for you to try out. After it has been applied you would need to restart slurmdbd.

The patch should change the behavior, but I do think it's strange that there are so many new connection requests coming in. In the just under 2 minutes you showed in the logs, there are 28 new connection requests. I assume you're not restarting slurmctld that often. Do you know what might be causing this many new connections to the database? The connections I see are coming from 3 IPs repeatedly:
10.0.2.2
10.0.0.10
10.0.0.11

I'm sorry, I must have been thinking of something else while I was typing 'scontrol show controller'. What I was interested in was the 'scontrol show cluster' output, which looks like it has valid data for the ControlHost and ControlPort right after the restart. This also makes it look like the bug we've encountered before, because the host and port are there right after a restart, and then after things have been running a while the connection information is dropped.

Let me know if you have questions about any of this.

Thanks,
Ben
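The connection churn described above can be tallied straight from the slurmdbd log. A small sketch, using sample lines in the same format as the excerpts above (the timestamps and counts here are illustrative, not taken from the attachment):

```python
import re
from collections import Counter

# Sample slurmdbd debug2 lines in the format shown above.
log_lines = [
    "[2019-08-13T15:50:00.032] debug2: Opened connection 15 from 10.0.2.2",
    "[2019-08-13T15:50:05.110] debug2: Opened connection 16 from 10.0.0.10",
    "[2019-08-13T15:50:12.471] debug2: Opened connection 17 from 10.0.2.2",
    "[2019-08-13T15:51:02.003] debug2: Opened connection 18 from 10.0.0.11",
]

opened = re.compile(r"Opened connection (\d+) from (\S+)")

# Count how many new connections each source IP opened in the window.
per_ip = Counter(m.group(2) for line in log_lines if (m := opened.search(line)))
for ip, n in per_ip.most_common():
    print(ip, n)
```

Run against the real log this would show which of the three IPs is responsible for most of the 28 requests in the two-minute window.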
The process you are possibly seeing is telegraf, which gathers data for grafana's graphs. Do you have any other sites using grafana, or other applications polling the database regularly, that have similar issues? And what else would the fix be?

Thanks
Jackie
Here is the process that we are running to collect the grafana data:

1. We are running sinfo, sacct, and squeue within a telegraf (cron-like) program every 60s to gather data per partition; we currently have 27 partitions on scs and 8 partitions on brc to collect for grafana. The connections are only made through the slurm commands.

Question: Isn't it normal behavior for multiple requests from the commands to come through? Users could have scripts that run these commands, and this should not close the connection between slurmdbd and slurmctld. What changed?

In looking over the logs and the statement regarding the slurmdbd session closing, I am not seeing the same correlation. For example, from slurmdbd.log:

[2019-08-13T09:19:51.663] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:7

There is no corresponding connection close statement for this connection until I restarted slurmctld, after I attempted to add the association and it failed:

[2019-08-13T15:55:39.579] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:9

When did the database lose the connection between slurmctld and slurmdbd? In slurmctld I see the following:

[2019-08-13T15:51:12.014] error: Error connecting, bad data: family = 0, port = 0
[2019-08-13T15:52:24.133] error: Error connecting, bad data: family = 0, port = 0
[2019-08-13T15:53:41.140] error: Error connecting, bad data: family = 0, port = 0
[2019-08-13T15:55:36.942] Terminate signal (SIGINT or SIGTERM) received - restart slurmctld by me

What are these messages I am getting in slurmctld as well?

[2019-08-13T15:55:40.904] error: Can't find parent id 19385 for assoc 20700, this should never happen.
[2019-08-13T15:54:03.775] error: A non superuser 44529 tried to complete batch JobId=20441374

Thanks
Jackie
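The open/close pairing in question can be checked by matching each REQUEST_PERSIST_INIT line to a later "Closed connection" line with the same CONN id. A rough sketch, using sample lines in the formats quoted in this ticket; note that slurmdbd reuses CONN ids over time, so over a long window this pairing is only approximate:

```python
import re

# Sample lines in the formats quoted in this ticket.
lines = [
    "[2019-08-13T09:19:51.663] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:7",
    "[2019-08-13T15:50:00.032] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:44678 IP:10.0.2.2 CONN:15",
    "[2019-08-13T15:50:02.471] debug2: Closed connection 15 uid(44678)",
]

init_re = re.compile(r"REQUEST_PERSIST_INIT:.*CONN:(\d+)")
close_re = re.compile(r"Closed connection (\d+)")

open_conns = {}
for line in lines:
    if m := init_re.search(line):
        open_conns[m.group(1)] = line   # remember the latest open per CONN id
    elif m := close_re.search(line):
        open_conns.pop(m.group(1), None)

# Whatever remains was opened but never closed in this window.
for conn, line in open_conns.items():
    print("no matching close for CONN:%s -> %s" % (conn, line))
```

In this sample, CONN:15 pairs with its close but CONN:7 never does, which mirrors the observation about the morning connection above.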
See responses below:

> Here is the process that we are running to collect the grafana data:
>
> 1. We are running sinfo, sacct, squeue within a telegraf (cron-like)
> program every 60s to gather data per partition; we currently have
> 27 partitions on scs and 8 partitions on brc to collect for grafana.
> The connections are only made through the slurm commands.
>
> Question: Isn't it normal behavior for multiple requests from the commands
> to come through? Users could have scripts that run these commands and
> this should not close the connection between slurmdbd and slurmctld. What
> changed?

If the telegraf program is just running slurm commands, it should not be causing the behavior I'm observing. It is normal for client commands to come through without interrupting the database connection. I can see that there is at least one other site using grafana, and they haven't reported this behavior. Is it possible there are other things that might be connecting to the database directly?

> In looking over the logs and the statement regarding the slurmdbd session
> closing, I am not seeing the same correlation:
>
> [2019-08-13T09:19:51.663] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:7
>
> There is no corresponding connection close statement for this connection until
> I restarted slurmctld, after I attempted to add the association and it failed:
>
> [2019-08-13T15:55:39.579] debug: REQUEST_PERSIST_INIT: CLUSTER:perceus-00 VERSION:8448 UID:497 IP:10.0.0.11 CONN:9

These two log entries look like they're for different connections. The UID and IP are the same between them, but the connection identifier (CONN) is different. During the time I have logs for, I did see other connections being established from 10.0.0.11. I would need more logs for the time in between to be able to comment on this further.

> When did the database lose the connection between slurmctld and slurmdbd?
I would also need more of the log files to be able to see exactly when the initial connection was lost. I don't see that the slurmctld logs you sent cover a time where there was a restart.

> In the slurmctld I see the following:
>
> [2019-08-13T15:51:12.014] error: Error connecting, bad data: family = 0, port = 0
> [2019-08-13T15:52:24.133] error: Error connecting, bad data: family = 0, port = 0
> [2019-08-13T15:53:41.140] error: Error connecting, bad data: family = 0, port = 0
> [2019-08-13T15:55:36.942] Terminate signal (SIGINT or SIGTERM) received - restart slurmctld by me

These error messages are related to the connection information being lost. They come from the slurm_open_stream function, where slurmctld is trying to open the connection to the database. You can see it in this section of the code:
https://github.com/SchedMD/slurm/blob/432ebf2dc67a503c18e770df3f047c14943510b6/src/common/slurm_protocol_socket.c#L497-L501

> What are these messages I am getting in slurmctld as well?
>
> [2019-08-13T15:55:40.904] error: Can't find parent id 19385 for assoc 20700, this should never happen.
>
> [2019-08-13T15:54:03.775] error: A non superuser 44529 tried to complete batch JobId=20441374

It's possible these error messages are also related to this. The first one occurs when slurmctld is trying to find information about an association, which would be missing if a new user was added with sacctmgr but the association wasn't sent correctly to slurmctld. The second message I'm not as sure about. Can you identify the user it's talking about by running 'id 44529' as root to see which user it is?

For your reference, you can read more details about the investigation of the case that is similar to yours in Bug 5213. The most relevant details start showing up around here:
https://bugs.schedmd.com/show_bug.cgi?id=5213#c90

It sounds like you're hesitant to try the patch I sent from that ticket. Can I ask if there's a reason for your hesitancy?

Thanks,
Ben
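For context, the condition behind "Error connecting, bad data" can be rendered in a few lines. This is only a rough Python paraphrase of the sanity check at the linked lines; the real implementation is C in src/common/slurm_protocol_socket.c, and the exact surrounding logic is not reproduced here.

```python
def check_addr(family, port):
    """Rough paraphrase of the sanity check slurm_open_stream performs
    before connecting: an all-zero address means the controller's
    registration was lost, so there is nothing to connect to."""
    if family == 0 or port == 0:
        return "error: Error connecting, bad data: family = %d, port = %d" % (
            family, port)
    return None  # address looks usable; proceed to connect

# A cleared cluster_table row leaves family=0 and port=0:
print(check_addr(0, 0))
print(check_addr(2, 6817))  # e.g. AF_INET (2) with a real port passes
```

This is why the errors line up with the empty control_host/control_port rows seen in the database query earlier in the ticket.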
On Wed, Aug 14, 2019 at 1:17 PM <bugs@schedmd.com> wrote:
> If the telegraf program is just running slurm commands it should not be causing
> the behavior I'm observing/describing. It is normal that client commands come
> through without interrupting the database connection. I can see that there is
> at least one other site using grafana and they haven't reported this behavior.
> Is it possible there are other things that might be connecting to the database
> directly?

Nothing is connecting to the database directly. Only an admin would do it if it's necessary to check things, but that was not done by me at all.

> These two log entries look like they're for different connections. The UID and
> IP are the same between them, but the connection identifier (CONN) is
> different. During the time I have logs for I did see other connections being
> established from 10.0.0.11. I would need more logs for the time in between to
> be able to comment on this further.

The first connection was started earlier that morning; I included it just to show you a connection to slurmdbd from the morning with no disconnect of that session in any of the logs, so I don't see where the connection dropped. The second connection was started after I restarted slurmctld, when I ran the test you asked me to do so we could collect the logs.

> I would also need more of the log files to be able to see when the initial
> connection was lost exactly. I don't see that the slurmctld logs you sent
> cover a time where there was a restart.

All of the logs were sent to you; everything is centrally logged to one host.

> These error messages are related to the connection information being lost. It
> is in the slurm_open_stream function where it's trying to open the connection
> to the database. You can see it in this section of the code:
> https://github.com/SchedMD/slurm/blob/432ebf2dc67a503c18e770df3f047c14943510b6/src/common/slurm_protocol_socket.c#L497-L501

But if you look at the time, it was after I created the account and before I restarted slurmctld. So why did the connection see bad data?

> It's possible these error messages are also related to this. The first one is
> when it's trying to find information about an association, which would be
> missing if a new user was added with sacctmgr but the association wasn't sent
> correctly to slurmctld. The second message I'm not as sure about. Can you
> identify the user it's talking about by running 'id 44529' as root to see which
> user it is?

No, the user it's reporting on is a valid user and has been in the scheduler for a while. I'm not sure what completed batch tasks it is complaining about.
[root@master ~]# sacct -j 20441374 -X -o 'jobid,user,partition' JobID User Partition ------------ --------- ---------- 20441374 gguidi cm1 [root@master ~]# id gguidi uid=44529(gguidi) gid=504(crd) groups=504(crd),606(lawrencium),627(mako),632(californium),40031(pc_exagraph) [root@master ~]# scontrol show assoc | grep gguidi UserName=*gguidi*(44529) DefAccount=pc_exagraph DefWckey= AdminLevel=None ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=cm1 ID=19964 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=es1 ID=19622 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=cf1 ID=18674 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=mako ID=18673 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr_bigmem ID=18672 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr6 ID=18671 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr5 ID=18670 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr4 ID=18669 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr3 ID=18668 ClusterName=perceus-00 Account=pc_exagraph UserName=*gguidi*(44529) Partition=lr2 ID=18667 [root@master ~]# sacctmgr show transaction | grep 18668 2019-04-08T10:49:24 Add Associations root id_assoc=*18668* mod_time=1554745764+ 2019-04-08T10:49:32 Modify Associations root (id_assoc=*18668*) qos=',28,29,', delt+ [root@master ~]# sacctmgr show transaction | grep 18669 2019-04-08T10:49:25 Add Associations root id_assoc=*18669* mod_time=1554745765+ 2019-04-08T10:49:32 Modify Associations root (id_assoc=*18669*) qos=',28,29,', delt+ > > For your reference you can read more details about the investigation of the > case that is similar to yours in Bug 5213 <https://bugs.schedmd.com/show_bug.cgi?id=5213>. 
> The most relevant details start showing up around here: https://bugs.schedmd.com/show_bug.cgi?id=5213#c90
>
> It sounds like you're hesitant to try the patch I sent from that ticket. Can I ask if there's a reason for your hesitancy?

No hesitancy; I just want to make sure we are fixing the correct problem. I am going on vacation tomorrow until Wednesday. I want to deploy it, but I don't want to be called during my vacation time for the same issue.

> Thanks,
> Ben
>
> ------------------------------
> You are receiving this mail because:
> - You reported the bug.
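As a side note on checking this kind of mismatch: the `scontrol show assoc` and `sacctmgr` outputs above can be diffed mechanically to list the association IDs that slurmctld never picked up. A minimal sketch, with sample IDs inlined where a live system would pipe in the real command output (the `sacctmgr -nP list assoc format=id` form in the comment is an assumption about commonly available flags, not something taken from this ticket):

```shell
# IDs the controller knows about (on a live system, roughly:
#   scontrol show assoc | grep -o 'ID=[0-9]*' | cut -d= -f2 | sort)
printf '%s\n' 18667 18668 18669 18670 | sort > ctld_ids.txt

# IDs recorded in the accounting database (roughly:
#   sacctmgr -nP list assoc format=id | sort)
printf '%s\n' 18667 18668 18669 18670 19385 | sort > dbd_ids.txt

# comm -13 prints lines unique to the second file: associations that
# exist in the database but are missing from slurmctld.
comm -13 ctld_ids.txt dbd_ids.txt
# prints: 19385
```

An empty result from the `comm` step would mean the controller and database agree; any ID it prints is an association slurmctld has not loaded.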
Hi Jackie,

It sounds like you may have intended to send more information than made it to me. I got a tar file this morning called slurmscs-out.tar. Inside that tar file there was a single file called slurm-testscs.info:

$ tar tvf slurmscs-out.tar
-rw-r--r-- scoggins/staff 131485 2019-08-13 18:01 slurm-testscs.info

Looking inside that file, I can see that you grepped out all the log entries with a timestamp of "15:5" for today from the slurmctld.log and slurmdbd.log files, then paged through the result with 'more':

[root@master tmp]# grep 2019-08-13T15:5 /var/log/slurm/slurmctld.log /var/log/slurm/slurmdbd.log >> /tmp/slurmlogs.withissue
[root@master tmp]# more /tmp/slurmlogs.withissue

But the log entries that show up only cover the time from 15:50:00 through 15:51:51:

/var/log/slurm/slurmctld.log:[2019-08-13T15:50:00.929] _slurm_rpc_submit_batch_job: JobId=20441555 InitPrio=25064 usec=3065
...
/var/log/slurm/slurmctld.log:[2019-08-13T15:51:33.721] error: A non superuser 44529 tried to complete batch JobId=20441374
/var/log/slurm/slurmdbd.log:[2019-08-13T15:50:00.032] debug2: Opened connection 15 from 10.0.2.2
...
/var/log/slurm/slurmdbd.log:[2019-08-13T15:51:51.372] debug2: Closed connection 10 uid(295)

It looks like your restart of slurmctld was around 15:55:xx, and I don't have logs that cover that time period.

The issue about a non superuser trying to complete a job sounds like a different issue and shouldn't be related to new users not making it from slurmdbd to slurmctld. Are you able to get the output of 'scontrol show job 20441374' still, or is that job no longer on the system? It would probably be best to investigate that issue in a separate ticket so it doesn't get lost in the communication issue.

Thanks,
Ben
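A note on the log extraction above: a single grep on the prefix `2019-08-13T15:5` only matches lines whose timestamp literally begins with that string, so it cannot cover an arbitrary window like 15:50 through 16:00. One way to capture a proper range is a string comparison on the ISO-8601 timestamp, sketched here with synthetic lines standing in for the real log files:

```shell
# Synthetic lines standing in for /var/log/slurm/slurmctld.log.
cat > sample.log <<'EOF'
[2019-08-13T15:49:59.100] debug2: before the window
[2019-08-13T15:51:33.721] error: A non superuser 44529 tried to complete batch JobId=20441374
[2019-08-13T15:55:40.904] error: Can't find parent id 19385 for assoc 20700
[2019-08-13T16:01:00.000] debug2: after the window
EOF

# Keep lines whose bracketed ISO-8601 timestamp falls in [start, end);
# plain string comparison works because the format sorts chronologically.
# substr($1, 2) drops the leading '[' before comparing.
awk -v start='2019-08-13T15:50' -v end='2019-08-13T16:00' \
    'substr($1, 2) >= start && substr($1, 2) < end' sample.log
```

Run against the real slurmctld.log and slurmdbd.log with a window such as 15:50 through 16:00, this would have included the 15:55:xx restart activity that the original grep missed.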
I have applied the patch and will let you know if I see any further problems. Thanks Jackie
Hi Jackie, I'm sorry there was trouble getting the logs for the right period of time. I'm glad to hear you've applied the patch. I was waiting to hear if anything came up today, but it's been a while now and I haven't heard of a problem. I assume you've been testing adding new users today, is that right? If anything does come up after applying the patch I'm happy to look into it. Thanks, Ben
Thanks Ben. I could get the additional logs, but I thought it best to try out the patch first. I'm on vacation until Wednesday next week. Others are adding accounts and will let me know if they see the problem. I'll grab all of the logs if anything happens.

Thanks

Jackie Scoggins
Although I'm out, I see that the Slurm accounts are still not going to slurmctld. I just saw an account created, and none of the group members were added to the controller.

What steps would you like me to take so you can troubleshoot this issue?

Jackie
Created attachment 11260 [details] slurm-7570.tar.bz2

I am seeing RPCs pending and max_rpc_cnt timeouts during the time of the account creations. The documentation reads: "if using this parameter, set max_rpc_cnt high enough that scheduling isn't always disabled, and low enough that the interactive workload can get through in a reasonable period of time. max_rpc_cnt needs to be below 256 (the default RPC thread limit). Running around the middle (150) may give you good results. NOTE: When increasing the amount of time spent in the backfill scheduling cycle, Slurm can be prevented from responding to client requests in a timely manner. To address this you can use max_rpc_cnt to specify a number of queued RPCs before the scheduler stops to respond to these requests."

Could these timeouts be causing our issue, rather than processes opening and closing the database?

[2019-08-15T15:59:49.873] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-08-15T16:12:43.517] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-15T16:20:24.688] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-15T16:33:19.790] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-08-15T16:38:40.814] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

I am going to attach logs from when I saw the association not get added to the controller.
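For reference, the `SchedulerParameters` lines in the logs above show `max_rpc_cnt=0`, i.e. the limit is disabled. If the site chose to set it, the slurm.conf line would look something like the following; 150 is only the midpoint the quoted documentation mentions, not a recommendation for this particular cluster:

```
# slurm.conf (illustrative): the same SchedulerParameters as in the logs
# above, but with max_rpc_cnt raised from 0 (disabled) to the documented
# midpoint so the scheduler pauses to drain queued RPCs.
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=150,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
```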
Hi Jackie, To clarify, it looks like this patch was applied on your scs system. Is that right? One thing I would like to see is whether the connection information is still getting lost after applying the patch. Can I have you connect to the database for the scs cluster again and run this query: select control_host, control_port, name, rpc_version from cluster_table; I would also like to see the output of the following commands on the machine running slurmdbd: ls -l `which slurmdbd` ps aux | grep slurmdbd And on the machine running slurmctld I would like to see the following: sdiag scontrol show cluster scontrol show assoc | grep dongyingwu id 43537 The log messages you asked about related to the max_rpc_cnt could be something to look into, but should just result in a delay in having the user show up. The max_rpc_cnt controls how many requests slurm will allow to queue up before it will force the scheduler to stop to respond to the requests. Without the max_rpc_cnt defined it should just wait to service them once the scheduling cycle is done. I'll look into this a little more with the sdiag output I've asked for. Thanks, Ben
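To make the next occurrence easier to capture, the commands Ben lists above could be wrapped in one small script that dumps everything into a single timestamped report. A sketch only; the script is written and syntax-checked here rather than run, since sdiag/scontrol require a live Slurm installation, and the file name and output path are arbitrary choices:

```shell
# Write an illustrative capture script; it is not executed here because
# the Slurm client commands need a running cluster.
cat > collect-bug7570.sh <<'EOF'
#!/bin/bash
out=/tmp/bug7570-$(date +%Y%m%dT%H%M%S).txt
# Each string is one diagnostic Ben asked for; the [s] trick keeps the
# ps|grep line from matching the grep process itself.
for cmd in \
    'ls -l "$(which slurmdbd)"' \
    'ps aux | grep "[s]lurmdbd"' \
    'sdiag' \
    'scontrol show cluster' \
    'scontrol show assoc | grep dongyingwu' \
    'id 43537'
do
    printf '===== %s =====\n' "$cmd" >> "$out"
    bash -c "$cmd" >> "$out" 2>&1
done
echo "wrote $out"
EOF

# Syntax check only; running it for real belongs on the slurmctld/slurmdbd hosts.
bash -n collect-bug7570.sh && echo "syntax ok"
```

Running it once on the slurmdbd host and once on the slurmctld host right after a missing association is noticed would give one file per host to attach to the ticket.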
After taking a look at the binaries, it looks like slurmdbd was not updated when I rebuilt it on scs. I just remade it and updated all of the files. If I see this issue again, I will try the commands you mentioned above.

Thanks

Jackie
Ok, I'm glad to hear it was just that the patch hadn't actually been applied. I'll wait to hear how it goes. Thanks Jackie. Ben
Hi Jackie, I wanted to follow up and see if you were able to apply the patch successfully and whether it made a difference. Let me know how things look. Thanks, Ben
I did apply the patch. I was told by one of the admins that she saw an account added earlier this week that did not get into slurmctld, but I have not confirmed that. I will have to get more information from her and let you know.

Thanks

Jackie
Hi Jackie, Were you able to get any additional details about a user still not being passed to slurmctld after being added to the database? I'm happy to look into this further if it's still an issue. Thanks, Ben
So far no one has complained. But I did restart the service on Wednesday, so I can't say it was fixed completely. Let's wait until next week and see whether anyone complains that new accounts are not working.

Thanks

Jackie
Hi Jackie, I just wanted to check in to see how things have been looking with the patch applied. I also wanted to let you know that in Bug 5213 a follow-up patch was created for another issue that was found, related to the patch I provided earlier. The fix is for an issue where a slurmdbd connection was created without validating that there was a port listed. This could cause a seg fault under the right circumstances. This issue has been around for a long time, but only became visible with the patch I provided earlier. I would recommend applying this patch as well to avoid the possibility of a crash. https://github.com/SchedMD/slurm/commit/e57b297ea1c4740ff69cbbe18df6abe3c8b50a74 Please let me know if you have questions. Thanks, Ben
Thanks for the update. I have not had to add any new users on lrc this week, and the person who normally does it is out of the office until Monday next week. And I don't see any users from either system missing from either slurmdbd or slurmctld.

Jackie
Hi Jackie, I'm lowering the priority of this ticket while we monitor the effect of the patch. Have you been able to verify that new users are recognized without a restart of slurmctld this past week? Thanks, Ben
Thanks for the update.

On Thu, Sep 5, 2019 at 1:28 PM <bugs@schedmd.com> wrote:

> Ben Roberts <ben@schedmd.com> changed bug 7570 <https://bugs.schedmd.com/show_bug.cgi?id=7570>:
> Severity: 2 - High Impact -> 3 - Medium Impact
Hi Jackie, I just wanted to follow up and see if the patches supplied in this ticket have addressed the issue you were seeing with new users. Let me know if this ticket is ok to close. Thanks, Ben
It's ok to close it. I haven't seen the issue since the patch.

Thanks

Jackie Scoggins
Thanks Jackie, I'm glad that worked. Closing now. Ben